Support,
We have encountered an issue where a primary appears to be locked in a state that blocks operations on all databases, including local, and prevents serverStatus from returning. When we retrieved currentOp(), there were no active operations holding a global write lock, only a database read operation targeting a global read lock and a database read lock:
"locks" : { "^" : "r", "^mydb" : "R" }
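For reference, here is a minimal sketch of how lock waiters could be filtered out of currentOp() output. The helper name findLockWaiters is hypothetical, the field names (inprog, waitingForLock, locks, opid, secs_running) are assumed to match the 2.6 currentOp() output, and the sample document uses made-up values rather than data captured from the affected primary.

```javascript
// Sketch: list the operations in a currentOp()-style document that are
// waiting on a lock. Field names are assumed to match 2.6 output.
function findLockWaiters(currentOpResult) {
  var ops = (currentOpResult && currentOpResult.inprog) || [];
  return ops
    .filter(function (op) { return op.waitingForLock === true; })
    .map(function (op) {
      // Keep only the fields that help identify the blocked operation.
      return {
        opid: op.opid,
        op: op.op,
        ns: op.ns,
        secs_running: op.secs_running,
        locks: op.locks
      };
    });
}

// Illustrative sample only (made-up values, hypothetical namespace).
var sample = {
  inprog: [
    { opid: 101, active: true, secs_running: 42, op: "query",
      ns: "mydb.mycoll", waitingForLock: true,
      locks: { "^": "r", "^mydb": "R" } },
    { opid: 102, active: true, secs_running: 1, op: "getmore",
      ns: "local.oplog.rs", waitingForLock: false,
      locks: { "^": "r", "^local": "R" } }
  ]
};

var waiters = findLockWaiters(sample);

// In the mongo shell this would be called as:
//   printjson(findLockWaiters(db.currentOp(true)));
```

In our case the call only helps while currentOp() still responds on the primary.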
To resolve the issue, we must kill the primary and force an election. The secondaries report the following errors shortly after the primary enters this state, but they do not trigger an election.
[rsBackgroundSync] Socket recv() timeout 192.168.1.81:27017
[rsBackgroundSync] SocketException: remote: 192.168.1.81:27017 error: 9001 socket exception [RECV_TIMEOUT] server [192.168.1.81:27017]
[rsBackgroundSync] DBClientCursor::init call() failed
[rsBackgroundSync] replSet sync source problem: 10276 DBClientBase::findN: transport error: server1.example.com:27017 ns: local.oplog.rs query: {}
[rsBackgroundSync] replSet syncing to: server1.example.com:27017
Prior to the issue, the mongod was processing deletes using the Bulk() operations builder. The mongod is part of a sharded cluster with 20+ shards running version 2.6.10 on Linux. The shard key is _id: hashed, and these bulk deletes do not filter on the shard key.
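The deletes follow roughly the pattern sketched below. This is an illustration under assumptions, not our exact code: the helper name queueBulkDeletes is hypothetical, the filters are placeholders, and the sketch uses an unordered bulk operation, which may not match what we run.

```javascript
// Sketch: queue a batch of deletes through the Bulk() operations builder.
// `collection` is a mongo shell collection such as db.mycoll (hypothetical
// name); `filters` is an array of query documents that do not include _id.
function queueBulkDeletes(collection, filters) {
  if (!filters || filters.length === 0) {
    return null; // nothing to delete, so do not execute an empty bulk
  }
  var bulk = collection.initializeUnorderedBulkOp();
  filters.forEach(function (filter) {
    // remove() deletes every document matching the filter.
    bulk.find(filter).remove();
  });
  return bulk.execute();
}

// In the mongo shell this would be called as, for example:
//   queueBulkDeletes(db.mycoll, [ { status: "expired" } ]);
```

Because the filters do not include the hashed _id, these deletes are not targeted at a single shard.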
Can you please point us to any known bugs with the same symptoms as above, or advise on how to find the root cause?
Thank you,
Jason