Type: Bug
Resolution: Duplicate
Priority: Major - P3
Affects Version/s: 2.2.3
Component/s: Replication
Operating System: ALL
The compact command was issued on the primary. The node went into RECOVERING state as expected, but shortly after, while compact was still running, it somehow became SECONDARY. We run the compact command weekly via cron and have never seen this behavior before.
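For reference, this is a hypothetical sketch of what the weekly job does, written as a mongo shell script; the actual cron script is not included in this report. The collection name and the stepDown with secs=600 are taken from the crwlrdb04 log below.

// weekly_compact.js -- hypothetical sketch, not the actual cron script.
// Assumes the shell is connected to the current primary (mongo crwlrdb04:8086).
// The log below shows a stepDown (secs=600) immediately before compact, so the
// sketch assumes the job steps the primary down first and then compacts.
try {
    rs.stepDown(600);                        // relinquish primary for up to 600s
} catch (e) {
    // stepDown closes client sockets, so the shell usually sees a connection error here
    print("stepDown: " + e);
}
var conn = new Mongo("crwlrdb04:8086");      // reconnect after the sockets were closed
var res = conn.getDB("crawlerDb").runCommand({ compact: "itemContents" });
printjson(res);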
Replica set config:
{ "_id": "RS_Crawler_2", "version": 190441, "members": [ { "_id": 0, "host": "crwlrdb04:8086", "priority": 10 }, { "_id": 1, "host": "crwlrdb05:8086" }, { "_id": 2, "host": "crwlrdb06:8086", "arbiterOnly": true }, { "_id": 3, "host": "backup.ec2:18089", "priority": 0, "slaveDelay": 21600, "hidden": true, "buildIndexes": false } ] }
Log of crwlrdb04 (node performing compact):
Sun Apr 7 04:00:11 [conn548282] replSet info stepping down as primary secs=600
Sun Apr 7 04:00:11 [conn548282] replSet relinquishing primary state
Sun Apr 7 04:00:11 [conn548282] replSet SECONDARY
Sun Apr 7 04:00:11 [conn548282] replSet closing client sockets after relinquishing primary
Sun Apr 7 04:00:11 [conn548283] replSet going into maintenance mode (-3 other tasks)
Sun Apr 7 04:00:11 [conn548283] replSet RECOVERING
Sun Apr 7 04:00:11 [conn548282] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.100.0.28:59852]
Sun Apr 7 04:00:11 [conn410325] SocketException handling request, closing client connection: 9001 socket exception [2] server [xxx.xxx.xxx.221:56357]
Sun Apr 7 04:00:11 [conn548283] compact crawlerDb.itemContents begin
Sun Apr 7 04:00:11 [conn548283] paddingFactor:1 paddingBytes:0
Sun Apr 7 04:00:12 [conn548283] compact 37 extents
Sun Apr 7 04:00:12 [conn548283] compact orphan deleted lists
Sun Apr 7 04:00:12 [conn548283] compact dropping indexes
Sun Apr 7 04:00:12 [conn548283] compact begin extent #0 for namespace crawlerDb.itemContents
Sun Apr 7 04:00:12 [conn548283] compact paging in len=0.004096MB
Sun Apr 7 04:00:12 [conn548283] compact copying records
Sun Apr 7 04:00:12 [conn548283] compact finished extent #0 containing 2 documents (0.003576MB) oldPadding: 1.08226 1
Sun Apr 7 04:00:12 [conn548283] compact begin extent #1 for namespace crawlerDb.itemContents
Sun Apr 7 04:00:12 [conn548283] compact paging in len=0.831488MB
Sun Apr 7 04:00:12 [conn548283] compact copying records
Sun Apr 7 04:00:12 [rsHealthPoll] replSet info backup.ec2:18089 is down (or slow to respond): socket exception [SEND_ERROR] for xxx.xxx.xxx.239:18089
Sun Apr 7 04:00:12 [rsHealthPoll] replSet member backup.ec2:18089 is now in state DOWN
Sun Apr 7 04:00:13 [rsHealthPoll] replSet info crwlrdb06:8086 is down (or slow to respond): socket exception [SEND_ERROR] for 10.101.0.72:8086
Sun Apr 7 04:00:13 [rsHealthPoll] replSet member crwlrdb06:8086 is now in state DOWN
Sun Apr 7 04:00:14 [conn430592] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.101.0.71:33255]
Sun Apr 7 04:00:14 [conn548283] compact finished extent #1 containing 43 documents (0.828596MB) oldPadding: 1.00007 1
Sun Apr 7 04:00:14 [conn548283] compact begin extent #2 for namespace crawlerDb.itemContents
Sun Apr 7 04:00:14 [conn548283] compact paging in len=3.32595MB
Sun Apr 7 04:00:14 [rsSync] replSet SECONDARY
Sun Apr 7 04:00:14 [conn548283] compact copying records
Sun Apr 7 04:00:14 [conn548283] compact finished extent #2 containing 151 documents (3.32295MB) oldPadding: 1.0001 1
Sun Apr 7 04:00:14 [conn548283] compact begin extent #3 for namespace crawlerDb.itemContents
Sun Apr 7 04:00:14 [conn548283] compact paging in len=12.7427MB
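Note the negative count in "going into maintenance mode (-3 other tasks)" and the "[rsSync] replSet SECONDARY" at 04:00:14 while compact is still copying extents. A minimal, hypothetical observer snippet like the following (an assumption on our side, not something we ran) can be used from another mongo shell to catch the RECOVERING to SECONDARY transition as it happens:

// Hypothetical observer: polls the member state of crwlrdb04 while compact runs.
// myState values: 1 = PRIMARY, 2 = SECONDARY, 3 = RECOVERING.
var admin = new Mongo("crwlrdb04:8086").getDB("admin");
while (true) {
    var st = admin.runCommand({ replSetGetStatus: 1 });
    print(Date() + " myState=" + st.myState);
    sleep(5000);                             // poll every 5 seconds
}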
Log of crwlrdb05:
Sun Apr 7 04:00:11 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state RECOVERING
Sun Apr 7 04:00:12 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:00:12 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: crwlrdb04:8086
Sun Apr 7 04:00:15 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state SECONDARY
Sun Apr 7 04:00:16 [rsMgr] replSet info electSelf 1
Sun Apr 7 04:00:16 [rsMgr] replSet PRIMARY
Sun Apr 7 04:00:17 [conn913928] query local.oplog.rs query: { ts: { $gte: new Date(5864049901565902853) } } cursorid:3635176764573440955 ntoreturn:0 ntoskip:0 nscanned:2 keyUpdates:0 locks(micros) r:104450 nreturned:1 reslen:217 104ms
Sun Apr 7 04:00:25 [conn913936] no current chunk manager found for this shard, will initialize
Sun Apr 7 04:00:25 [conn913936] setShardVersion - relocking slow: 96
...
Sun Apr 7 04:10:17 [conn914483] replSet info stepping down as primary secs=1
Sun Apr 7 04:10:17 [conn914483] replSet relinquishing primary state
Sun Apr 7 04:10:17 [conn914483] replSet SECONDARY
Sun Apr 7 04:10:17 [conn914483] replSet closing client sockets after relinquishing primary
Sun Apr 7 04:10:18 [rsHealthPoll] replSet info crwlrdb06:8086 is down (or slow to respond): socket exception [SEND_ERROR] for 10.101.0.72:8086
Sun Apr 7 04:10:18 [rsHealthPoll] replSet member crwlrdb06:8086 is now in state DOWN
Sun Apr 7 04:10:18 [rsMgr] replSet info electSelf 1
Sun Apr 7 04:10:18 [rsMgr] replSet couldn't elect self, only received -19999 votes
Sun Apr 7 04:10:19 [conn913928] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.101.0.70:38568]
Sun Apr 7 04:10:20 [rsHealthPoll] replSet member crwlrdb06:8086 is up
Sun Apr 7 04:10:20 [rsHealthPoll] replSet member crwlrdb06:8086 is now in state ARBITER
Sun Apr 7 04:10:20 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:10:20 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.25:8086
Sun Apr 7 04:10:20 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.24:8086
Sun Apr 7 04:10:20 [ReplicaSetMonitorWatcher] trying reconnect to crwlrdb02:8086
Sun Apr 7 04:10:20 [ReplicaSetMonitorWatcher] reconnect crwlrdb02:8086 ok
Sun Apr 7 04:10:20 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.25:8086
Sun Apr 7 04:10:22 [conn914484] replSet info voting yea for crwlrdb04:8086 (0)
Sun Apr 7 04:10:26 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:10:30 [ReplicaSetMonitorWatcher] trying reconnect to crwlrdb01:8086
Sun Apr 7 04:10:30 [ReplicaSetMonitorWatcher] reconnect crwlrdb01:8086 ok
Sun Apr 7 04:10:30 [ReplicaSetMonitorWatcher] Socket say send() errno:9 Bad file descriptor 10.101.0.24:8086
Sun Apr 7 04:10:32 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:10:37 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:10:38 [LockPinger] Socket say send() errno:9 Bad file descriptor 10.101.0.43:8086
Sun Apr 7 04:10:38 [LockPinger] Socket say send() errno:9 Bad file descriptor 10.101.0.44:8086
Sun Apr 7 04:10:38 [LockPinger] Socket say send() errno:9 Bad file descriptor 10.101.0.45:8086
Sun Apr 7 04:10:38 [LockPinger] scoped connection to crwlrcfgdb01:8086,crwlrcfgdb02:8086,crwlrcfgdb03:8086 not being returned to the pool
Sun Apr 7 04:10:38 [LockPinger] warning: distributed lock pinger 'crwlrcfgdb01:8086,crwlrcfgdb02:8086,crwlrcfgdb03:8086/crwlrdb05.prod.dal.us:8086:1361707225:275889730' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: 9001 socket exception [2] server [10.101.0.43:8086] crwlrcfgdb01:8086:{}9001 socket exception [2] server [10.101.0.44:8086] crwlrcfgdb02:8086:{}9001 socket exception [2] server [10.101.0.45:8086] crwlrcfgdb03:8086:{}
Sun Apr 7 04:10:42 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:10:47 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:10:52 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:10:57 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:11:02 [rsMgr] not electing self, backup.ec2:18089 would veto
Sun Apr 7 04:11:07 [rsMgr] not electing self, backup.ec2:18089 would veto
...
Log of crwlrdb06 (arbiter node):
Sun Apr 7 04:00:12 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state RECOVERING
Sun Apr 7 04:00:12 [rsMgr] replSet I don't see a primary and I can't elect myself
Sun Apr 7 04:00:14 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state SECONDARY
Sun Apr 7 04:00:16 [conn890924] replSet info voting yea for crwlrdb05:8086 (1)
Sun Apr 7 04:00:18 [rsHealthPoll] replSet member crwlrdb05:8086 is now in state PRIMARY
Sun Apr 7 04:10:17 [rsMgr] stepping down crwlrdb05:8086 (priority 1), crwlrdb04:8086 is priority 10 and 0 seconds behind
Sun Apr 7 04:10:18 [rsHealthPoll] replSet member crwlrdb05:8086 is now in state SECONDARY
Sun Apr 7 04:10:18 [rsMgr] stepping down crwlrdb05:8086 (priority 1), crwlrdb04:8086 is priority 10 and 0 seconds behind
Sun Apr 7 04:10:18 [rsMgr] stepping down crwlrdb05:8086 failed: { errmsg: "not primary so can't step down", ok: 0.0 }
Sun Apr 7 04:10:22 [conn891051] replSet info voting yea for crwlrdb04:8086 (0)
Sun Apr 7 04:11:23 [rsMgr] replSet I don't see a primary and I can't elect myself
Sun Apr 7 04:12:38 [rsMgr] replSet I don't see a primary and I can't elect myself
Sun Apr 7 04:13:53 [rsMgr] replSet I don't see a primary and I can't elect myself
Sun Apr 7 04:15:08 [rsMgr] replSet I don't see a primary and I can't elect myself
Sun Apr 7 04:16:23 [rsMgr] replSet I don't see a primary and I can't elect myself
...
Sun Apr 7 04:40:10 [rsHealthPoll] replSet member crwlrdb04:8086 is now in state PRIMARY
Mon Apr 8 04:49:15 [rsHealthPoll] replSet member crwlrdb05:8086 is now in state RECOVERING
Why it matters
crwlrdb04, the node under compact, has the highest priority in the replica set. When it wrongly reported its state as SECONDARY (while still running compact), the current primary (crwlrdb05) was stepped down and the other nodes voted for crwlrdb04 to become primary. But since it was busy with compact, it only actually became primary about 40 minutes later. For those 40 minutes the replica set had no primary at all and was non-operational.
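Until this is fixed we are considering holding the node in maintenance mode explicitly around compact. The sketch below is an assumption on our side that we have not verified against this specific bug; replSetMaintenance only works on a node that is already a secondary:

// Possible workaround sketch (unverified assumption), run while connected to the
// already stepped-down node.
var admin = db.getSiblingDB("admin");
admin.runCommand({ replSetMaintenance: true });    // node reports RECOVERING
db.getSiblingDB("crawlerDb").runCommand({ compact: "itemContents" });
admin.runCommand({ replSetMaintenance: false });   // allow return to SECONDARY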
Duplicates:
- SERVER-9338: db.runCommand( { compact: '<collection>' } ) on a primary decrements maintenanceMode (Closed)