SERVER-26943 (Core Server)

Non-replacement updates to the config.shards collection can crash the CSRS secondary after rollback

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.4.0-rc4
    • Affects Version/s: 3.4.0-rc2
    • Component/s: Sharding
    • None
    • Fully Compatible
    • ALL
    • No deterministic way to reproduce. Hit through the continuous stepdown suite.
    • Sharding 2016-11-21
    • 0

      The config servers have a special opObserver insert hook to intercept updates from a legacy v3.2 mongos to the config.shards collection and maintain the shard identity.

      This hook always expects a complete shard document to be inserted, which holds on primaries. On a secondary that is recovering from a rollback, however, an update that is followed by a delete may end up being applied after the deletion has already taken effect. The update is then converted to an upsert, which produces an incomplete shard document and triggers an invariant failure.
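
      The mechanism described above can be sketched with a toy model. This is not MongoDB server code: the class NoSuchKey, the functions shard_insert_hook, apply_update, apply_delete and replay, and the host value "example-host:27018" are all hypothetical names chosen for illustration. The only facts taken from this report are that the update is converted to an upsert when the document is missing and that the hook requires the "host" field.

```python
# Toy model only; this is not MongoDB server code and every name is hypothetical.


class NoSuchKey(Exception):
    """Stand-in for the NoSuchKey error reported in this ticket."""


def shard_insert_hook(doc):
    """Model of the insert hook: it expects a complete shard document.
    Treating "host" as the required field is taken from the fassert message."""
    if "host" not in doc:
        raise NoSuchKey('Missing expected field "host"')


def apply_update(shards, op):
    """Apply a $set-style oplog update the way the description says a
    secondary does: if the target is missing, the update becomes an upsert."""
    shard_id = op["o2"]["_id"]
    if shard_id in shards:
        shards[shard_id].update(op["o"]["$set"])
        return
    # Upsert path: only _id plus the $set fields exist, so "host" is absent.
    new_doc = {"_id": shard_id, **op["o"]["$set"]}
    shard_insert_hook(new_doc)
    shards[shard_id] = new_doc


def apply_delete(shards, op):
    """Apply an oplog delete; deleting a missing document is a no-op."""
    shards.pop(op["o"]["_id"], None)


# The two oplog entries from this report, reduced to the fields the model needs.
update_op = {"op": "u", "ns": "config.shards",
             "o2": {"_id": "shard0001"}, "o": {"$set": {"draining": True}}}
delete_op = {"op": "d", "ns": "config.shards", "o": {"_id": "shard0001"}}


def replay(shards):
    """Replay update then delete; return the error message, or None on success."""
    try:
        apply_update(shards, update_op)
        apply_delete(shards, delete_op)
    except NoSuchKey as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # Document still present: the update is a plain update, then it is deleted.
    print(replay({"shard0001": {"_id": "shard0001", "host": "example-host:27018"}}))
    # Deletion already reflected locally: the update turns into an upsert.
    print(replay({}))
```

      In this model the first call prints None and the second prints the "Missing expected field" message, because the upserted document contains only _id and draining.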

      For example, the following sequence:

      c23012| 2016-11-07T19:01:26.092+0000 D ASIO     [NetworkInterfaceASIO-RS-0] Request 286 finished with response: { cursor: { firstBatch: [ { ts: Timestamp 1478545283000|1, t: 4, h: 565510199539623323, v: 2, op: "u", ns: "config.shards", o2: { _id: "shard0001" }, o: { $set: { draining: true } } }, { ts: Timestamp 1478545285000|8, t: 4, h: -4558147567226493446, v: 2, op: "d", ns: "config.shards", o: { _id: "shard0001" } }, ok: 1.0 }
      
      c23012| 2016-11-07T19:01:26.092+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: { ts: Timestamp 1478545283000|1, t: 3 }. source's GTE: { ts: Timestamp 1478545283000|1, t: 4 } hashes: (-6821259113153738378/565510199539623323)
      
      c23012| 2016-11-07T19:01:26.107+0000 D ASIO     [rsBackgroundSync] startCommand: RemoteCommand 298 -- target:ip-10-152-38-201:23013 db:local expDate:2016-11-07T19:01:31.107+0000 cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1478545272000|5 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 }
      
      c23012| 2016-11-07T19:01:26.108+0000 D ASIO     [NetworkInterfaceASIO-RS-0] Request 298 finished with response: { cursor: { firstBatch: [ { ts: Timestamp 1478545283000|1, t: 4, h: 565510199539623323, v: 2, op: "u", ns: "config.shards", o2: { _id: "shard0001" }, o: { $set: { draining: true } } }, { ts: Timestamp 1478545285000|8, t: 4, h: -4558147567226493446, v: 2, op: "d", ns: "config.shards", o: { _id: "shard0001" } }, ok: 1.0 }
      

      Results in this fatal exception:

      c23012| 2016-11-07T19:01:26.109+0000 F REPL     [repl writer worker 15] writer worker caught exception: 4 Missing expected field "host" on: { ts: Timestamp 1478545283000|1, t: 4, h: 565510199539623323, v: 2, op: "u", ns: "config.shards", o2: { _id: "shard0001" }, o: { $set: { draining: true } } }
      c23012| 2016-11-07T19:01:26.109+0000 I -        [repl writer worker 15] Fatal assertion 16359 NoSuchKey: Missing expected field "host" at src/mongo/db/repl/sync_tail.cpp 1054
      c23012| 2016-11-07T19:01:26.109+0000 I -        [repl writer worker 15]
      c23012|
      c23012| ***aborting after fassert() failure
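
      The pattern in the batch above, an operator-style update to config.shards followed by a delete of the same _id, can be spotted with a small scan over oplog entries. The sketch below is a hypothetical diagnostic helper, not part of the server; the function names are invented here. It also reflects one plausible reading of why the title singles out non-replacement updates, which the report does not state explicitly: a replacement-style entry would carry the whole document in o, so an upsert of it would still include host, whereas a $set entry would not.

```python
# Hypothetical diagnostic helper; names are illustrative, not MongoDB APIs.


def is_non_replacement(update_doc):
    """True when the update document uses operators such as $set."""
    return any(key.startswith("$") for key in update_doc)


def update_then_delete_ids(batch, ns="config.shards"):
    """Return the _ids in `ns` that receive an operator-style update and a
    later delete within the same batch, in the order the deletes appear."""
    updated = set()
    flagged = []
    for entry in batch:
        if entry.get("ns") != ns:
            continue
        if entry["op"] == "u" and is_non_replacement(entry["o"]):
            updated.add(entry["o2"]["_id"])
        elif entry["op"] == "d" and entry["o"]["_id"] in updated:
            flagged.append(entry["o"]["_id"])
    return flagged


# The firstBatch from the log, with ts, t, h and v omitted for brevity.
first_batch = [
    {"op": "u", "ns": "config.shards",
     "o2": {"_id": "shard0001"}, "o": {"$set": {"draining": True}}},
    {"op": "d", "ns": "config.shards", "o": {"_id": "shard0001"}},
]

if __name__ == "__main__":
    print(update_then_delete_ids(first_batch))
```

      For the logged batch this flags shard0001. A batch whose update entry is a full replacement document is not flagged by this helper.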
      

            Assignee:
            esha.maharishi@mongodb.com Esha Maharishi (Inactive)
            Reporter:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Votes:
            0
            Watchers:
            5
