- Type: Bug
- Resolution: Cannot Reproduce
- Priority: Major - P3
- None
- Affects Version/s: 3.2.11
- Component/s: Replication
- None
- ALL
The primary member of a 3-node replica set was OOM-killed, and a secondary member was promoted to primary. After the dead member was restarted, it came up but is now stuck in ROLLBACK state with these logs:
mongod.log
2017-01-04T06:08:58.781+0000 I REPL [ReplicationExecutor] syncing from: ip-10-0-17-156:27017
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: -1, timestamp: Jan 4 03:39:41:1d). source's GTE: (term: -1, timestamp: Jan 4 03:39:46:1) hashes: (-8237435499851558070/-4585935198278308689)
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] beginning rollback
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] rollback 0
2017-01-04T06:08:58.796+0000 I REPL [rsBackgroundSync] rollback 1
2017-01-04T06:08:58.798+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2017-01-04T06:08:58.799+0000 I REPL [rsBackgroundSync] rollback our last optime: Jan 4 03:39:41:1d
2017-01-04T06:08:58.799+0000 I REPL [rsBackgroundSync] rollback their last optime: Jan 4 06:08:57:2
2017-01-04T06:08:58.799+0000 I REPL [rsBackgroundSync] rollback diff in end of log times: -8956 seconds
2017-01-04T06:09:28.797+0000 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for ip-10-0-17-156:27017 (10.0.17.156) failed) in destructor (kill)
2017-01-04T06:09:28.797+0000 W REPL [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: ip-10-0-17-156:27017; sleeping 1 min
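The timings in the log above can be checked with a little arithmetic. The sketch below is only an illustration: the function name `seconds_between` is hypothetical, and the year 2017 is prepended to the optimes because the log prints them without one. It reproduces the "diff in end of log times" value and measures how long the member waited between the last rollback line and the socket exception. The gap comes out at roughly 30 seconds, which may be worth comparing against any socket timeout in effect. That reading is an assumption, not something the log states.

```python
from datetime import datetime

# Optimes are printed like "Jan 4 03:39:41"; the year is taken from the log line dates.
OPTIME_FMT = "%Y %b %d %H:%M:%S"
# Format of the timestamp at the start of each mongod.log line.
LOGLINE_FMT = "%Y-%m-%dT%H:%M:%S.%f%z"


def seconds_between(later: str, earlier: str, fmt: str) -> float:
    """Return (later - earlier) in seconds for two timestamps in the same format."""
    delta = datetime.strptime(later, fmt) - datetime.strptime(earlier, fmt)
    return delta.total_seconds()


# "our last optime" minus "their last optime", as in the "rollback diff" line.
end_of_log_diff = seconds_between("2017 Jan 4 03:39:41", "2017 Jan 4 06:08:57", OPTIME_FMT)

# Time between the last "rollback" progress line and the socket exception.
stall = seconds_between(
    "2017-01-04T06:09:28.797+0000",
    "2017-01-04T06:08:58.799+0000",
    LOGLINE_FMT,
)

print(end_of_log_diff)  # matches the -8956 seconds reported in the log
print(stall)            # just under 30 seconds
```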
The error suggests a network issue, which is totally incorrect. The servers can reach each other just fine:
[ec2-user@ip-10-0-33-140 ~]$ nc -v -z ip-10-0-17-156 27017
Connection to ip-10-0-17-156 27017 port [tcp/*] succeeded!
I can even connect the mongo shell to the remote server and run queries without problems. Also, if I delete all the data and do a full resync, the member connects without any issues.
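For reference, the `nc -z` check only confirms that a TCP connection to the port can be opened. It does not exercise a long-running query over that connection. A minimal Python sketch of the same kind of reachability check is shown here. The helper name `tcp_port_open` and the 5-second default timeout are illustrative assumptions, not part of any MongoDB tooling.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be established.

    Like `nc -z`, this only completes the TCP handshake and then closes
    the socket; it sends no application data.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example call, using the host and port from the log lines:
# tcp_port_open("ip-10-0-17-156", 27017)
```

A `True` result here would agree with the `nc` output, but it says nothing about whether a slow remote oplog query completes before a client-side socket timeout expires.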