SERVER-30952 (Core Server)

Initial (re)sync never completes, stuck in a loop

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.4.7
    • Component/s: Replication
    • Operating System: ALL
    • Steps to reproduce:
      • Run a 3-node replica set
      • Run map/reduce jobs that require a temp collection
      • Re-sync one of the nodes; the re-sync fails and starts over

      We have a 3-node replica set running v3.4.7. The primary runs 6 map/reduce jobs every minute or so, and due to circumstances we had to re-sync one of the secondary nodes. However, at one of the last steps of the re-sync we get the following error:

      2017-09-05T17:56:04.348+0000 E REPL     [repl writer worker 5] Error applying operation: OplogOperationUnsupported: Applying renameCollection not supported in initial sync: { ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } } ({ ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } })
      2017-09-05T17:56:04.348+0000 E REPL     [replication-168] Failed to apply batch due to 'OplogOperationUnsupported: error applying batch: Applying renameCollection not supported in initial sync: { ts: Timestamp 1504588941000|592, h: 4948566672906734558, v: 2, op: "c", ns: "graphs.$cmd", o: { renameCollection: "graphs.tmp.agg_out.989", to: "graphs.graphs_temp", stayTemp: false, dropTarget: true } }'
      2017-09-05T17:56:04.348+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host graphs1-mongo3:27017 due to bad connection status; 2 connections to that host remain open
      2017-09-05T17:56:04.348+0000 I REPL     [replication-167] Finished fetching oplog during initial sync: CallbackCanceled: Callback canceled. Last fetched optime and hash: { ts: Timestamp 1504634159000|4672, t: -1 }[-4041766555669726456]
      2017-09-05T17:56:04.348+0000 I REPL     [replication-168] Initial sync attempt finishing up.
      

      After this, MongoDB cleans up the files and starts the re-sync again, so the node is now effectively stuck in a very long loop. This used to work fine with 2.x, where we re-synced quite a few times.

      I'm not sure what to do about this. We can't stop the map/reduce jobs for the duration of the re-sync, as it takes about 8 hours to get to this point.
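
To see which rename aborts each attempt, the mongod log can be scanned for the "Error applying operation" line quoted above. The following is a minimal sketch in Python using only the standard library. The regular expression is derived solely from the log lines in this report, so it is an assumption that other versions format the message the same way. The names `RenameFailure` and `find_rename_failures` are hypothetical helpers, not part of MongoDB or any driver.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class RenameFailure(NamedTuple):
    """One renameCollection oplog entry that aborted an initial sync attempt."""
    timestamp: str  # log timestamp of the error line
    source: str     # namespace being renamed (the temp collection)
    target: str     # namespace it was renamed to


# Assumption: the message layout matches the 3.4.7 log lines quoted in this
# report. Only the "Error applying operation" line is matched, so the
# follow-up "Failed to apply batch" line is not counted a second time.
_PATTERN = re.compile(
    r'^\s*(?P<ts>\S+) E REPL\s+\[[^\]]+\] Error applying operation: '
    r'OplogOperationUnsupported: Applying renameCollection not supported '
    r'in initial sync: .*?renameCollection: "(?P<src>[^"]+)", '
    r'to: "(?P<dst>[^"]+)"'
)


def find_rename_failures(lines: Iterable[str]) -> Iterator[RenameFailure]:
    """Yield one RenameFailure per matching error line in the given log lines."""
    for line in lines:
        match = _PATTERN.search(line)
        if match:
            yield RenameFailure(match.group("ts"),
                                match.group("src"),
                                match.group("dst"))
```

Passing an open log file to `find_rename_failures` yields one entry per failed attempt, which shows whether the same job's temp collection is responsible every time or whether several jobs are involved.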

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            robert@appsignal.com Robert Beekman
            Votes:
            0
            Watchers:
            6
