We've seen a few cases where customers bring up a sharded cluster running with authentication and the shard primaries get errors querying the config servers saying that they are unauthenticated. This leaves the cluster unusable. It appears that the mongods aren't even trying to authenticate to the config servers, even though they successfully authenticate to the other nodes in their replica set. The problem seems to be that the ShardingConnectionHook, which also handles authenticating all connections used by sharding, isn't being set on the pool. Restarting the mongods seems to resolve the issue, which further supports my theory that this is a race condition.
Investigation into the code brings us to the following function in d_state.cpp:
    void ShardedConnectionInfo::addHook() {
        static bool done = false;
        if (!done) {
            LOG(1) << "adding sharding hook" << endl;
            pool.addHook(new ShardingConnectionHook(false));
            shardConnectionPool.addHook(new ShardingConnectionHook(true));
            done = true;
        }
    }
This is the code that sets the connection hook on the pools. It is not thread-safe: nothing synchronizes the check of done, so two threads opening connections at the same time can both see it as false and both call addHook on the pools. Since addHook is basically just an add to a std::list, and standard library containers aren't safe for concurrent modification, this could potentially corrupt the linked list of connection hooks. This is my current theory as to how the ShardingConnectionHook can fail to be set.
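To illustrate one way the race described above could be closed, here is a minimal standalone sketch. It is not the actual server code or a proposed patch: ConnectionHook, HookedPool, addHookOnce and hooksAfterConcurrentInit are hypothetical stand-ins invented for this example, and it assumes a C++14 compiler with std::call_once available. The idea is to run the one-time registration under std::call_once and to guard the hook list itself with a mutex.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for ShardingConnectionHook (illustration only).
struct ConnectionHook {
    explicit ConnectionHook(bool shardedConnections) : sharded(shardedConnections) {}
    bool sharded;
};

// Hypothetical stand-in for a connection pool that keeps a list of hooks.
class HookedPool {
public:
    // The mutex serializes list modification, so concurrent addHook()
    // calls cannot corrupt the underlying std::list.
    void addHook(std::unique_ptr<ConnectionHook> hook) {
        std::lock_guard<std::mutex> lk(_mutex);
        _hooks.push_back(std::move(hook));
    }

    std::size_t hookCount() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _hooks.size();
    }

private:
    mutable std::mutex _mutex;
    std::list<std::unique_ptr<ConnectionHook>> _hooks;
};

HookedPool pool;
HookedPool shardConnectionPool;

// std::call_once runs the lambda exactly once. Threads that arrive while it
// is running block until it finishes, so no caller can return before the
// hooks are registered, and no caller registers them a second time.
void addHookOnce() {
    static std::once_flag once;
    std::call_once(once, [] {
        pool.addHook(std::make_unique<ConnectionHook>(false));
        shardConnectionPool.addHook(std::make_unique<ConnectionHook>(true));
    });
}

// Calls addHookOnce() from nThreads threads at once and returns the number
// of hooks that ended up registered on `pool`.
std::size_t hooksAfterConcurrentInit(int nThreads) {
    std::vector<std::thread> threads;
    for (int i = 0; i < nThreads; ++i) {
        threads.emplace_back(addHookOnce);
    }
    for (auto& t : threads) {
        t.join();
    }
    return pool.hookCount();
}
```

In this sketch, calling hooksAfterConcurrentInit(8) should leave exactly one hook on each pool no matter how the threads interleave. Either guard alone covers only part of the problem: the once-flag prevents duplicate or overlapping registration from this function, while the mutex protects the list against any other concurrent caller of addHook.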