Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-27573

Member stuck in ROLLBACK after unclean restart

    • Type: Icon: Bug Bug
    • Resolution: Cannot Reproduce
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.2.11
    • Component/s: Replication
    • None
    • ALL
    • Hide

      It does not happen every time.

      1. Kill primary member by OOMKILL
      2. Wait for a secondary member to become primary
      3. Restart killed member

      Show
      It does not happen every time. 1. Kill primary member by OOMKILL 2. Wait for a secondary member to become primary 3. Restart killed member

      The primary member of the 3 node replica set was OOM killed and a secondary member was promoted to primary. Upon restart the dead member, it came up but it's stuck in ROLLBACK state with these logs:

      mongod.log
      2017-01-04T06:08:58.781+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-0-17-156:27017
      2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: -1, timestamp: Jan  4 03:39:41:1d). source's GTE: (term: -1, timestamp: Jan  4 03:39:46:1) hashes: (-8237435499851558070/-4585935198278308689)
      2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] beginning rollback
      2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] rollback 0
      2017-01-04T06:08:58.796+0000 I REPL     [rsBackgroundSync] rollback 1
      2017-01-04T06:08:58.798+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
      2017-01-04T06:08:58.799+0000 I REPL     [rsBackgroundSync] rollback our last optime:   Jan  4 03:39:41:1d
      2017-01-04T06:08:58.799+0000 I REPL     [rsBackgroundSync] rollback their last optime: Jan  4 06:08:57:2
      2017-01-04T06:08:58.799+0000 I REPL     [rsBackgroundSync] rollback diff in end of log times: -8956 seconds
      2017-01-04T06:09:28.797+0000 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for ip-10-0-17-156:27017 (10.0.17.156) failed) in destructor (kill)
      2017-01-04T06:09:28.797+0000 W REPL     [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: ip-10-0-17-156:27017; sleeping 1 min
      

      The error suggests network issue which is totally incorrect. The servers can access each other just fine:

      [ec2-user@ip-10-0-33-140 ~]$ nc -v -z ip-10-0-17-156 27017
      Connection to ip-10-0-17-156 27017 port [tcp/*] succeeded!
      

      I can even connect mongo shell to remote server and run queries fine. Plus, if I delete all data and do a full resync, it's able to connect without any issues.

        1. affected_mongo.log
          259 kB
        2. mms-fullpage.png
          mms-fullpage.png
          2.02 MB
        3. mongod-27017.log.2017-01-04T06-05-48
          1.09 MB
        4. mongod-27017.log.2017-01-04T06-36-02
          2.34 MB
        5. mongod-27017.log.2017-01-04T07-18-26
          3.20 MB
        6. primary_mongo.log
          4.99 MB

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            amanpreet@codigami.com Amanpreet Singh
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: