Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-71565

Resuming a change stream with a $match filter which excludes the resume point fails

    • Type: Icon: Bug Bug
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 6.0.3
    • Component/s: None
    • Query Execution
    • ALL

      So I just hit this when upgrading from 4.4.18 to 6.0.3. I believe that this is either a regression or an undefined behavior that was fixed but broke backwards compatibility. In my scenario, we are waiting for calls to "robots". Once we receive a call to a robot based on its name (an insert), we latch onto the call document and wait for future updates in a more select change stream. In order to prevent races, we use the resume token from the original change stream. What ends up happening is that the server thinks some resume token no longer exists. What appears to actually be happening is that in 6.0, the server actually increments the cluster time ordinal by one from the token passed in. Depending on the amount of writes to the oplog, there's a chance (especially in testing) that that token does not exist yet, resulting in an error. If a single write happens after the first insert that causes that ordinal incremented cluster time to exist, no error happens, but we miss the next update event which does not appear to make sense (and feels like a bug). Attached are two sets of reproduction scripts.

      Repro 1:

      The first script to execute, watcher.js, waits for a document to be inserted and then starts up a new change stream waiting for the next update to that inserted document using the resume token to not miss any updates. Then execute writer.js which will insert a document and then update it. You should see an error like:

      {
        _id: {
          _data: '82637D61C1000000012B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004'
        },
        operationType: 'insert',
        clusterTime: Timestamp({ t: 1669161409, i: 1 }),
        wallTime: ISODate("2022-11-22T23:56:49.618Z"),
        fullDocument: { _id: ObjectId("637d61c184e895993f2b1c84"), host: 'host1' },
        ns: { db: 'data', coll: 'stuff' },
        documentKey: { _id: ObjectId("637d61c184e895993f2b1c84") }
      }
      MongoServerError: Executor error during getMore :: caused by :: cannot resume stream; the resume token was not found. {_data: "82637D61C1000000022B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004"}
      

      If you inspect the tokens, you'll see that they are off by one in the cluster time ordinal.

      > decodeResumeToken('82637D61C1000000012B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004')
      {
        timestamp: new Timestamp({ t: 1669161409, i: 1 }),
        version: 1,
        tokenType: 128,
        txnOpIndex: 0,
        fromInvalidate: false,
        uuid: new UUID("c2875420-0e80-474c-90e2-eaa344472125"),
        documentKey: { _id: new ObjectId("637d61c184e895993f2b1c84") }
      }
      > decodeResumeToken('82637D61C1000000022B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004')
      {
        timestamp: new Timestamp({ t: 1669161409, i: 2 }),
        version: 1,
        tokenType: 128,
        txnOpIndex: 0,
        fromInvalidate: false,
        uuid: new UUID("c2875420-0e80-474c-90e2-eaa344472125"),
        documentKey: { _id: new ObjectId("637d61c184e895993f2b1c84") }
      }
      

      Repro 2:

      Run watcher.js again but this time run writer2.js which will do an insert of another document before the update. In this case, watcher.js just hangs forever which I'd expect to not happen which again leads me to believe something fishy is happening.

      This also works fine on 5.0.14. Presently, this issue is preventing us from upgrading our Atlas cluster to 6.0.3.

      It's possible this is related to SERVER-67677 since

        1. watcher_workaround1.js
          0.7 kB
        2. watcher.js
          0.6 kB
        3. writer.js
          0.2 kB
        4. writer2.js
          0.2 kB

            Assignee:
            backlog-query-execution [DO NOT USE] Backlog - Query Execution
            Reporter:
            eric@erdaniels.com Eric Daniels
            Votes:
            0 Vote for this issue
            Watchers:
            18 Start watching this issue

              Created:
              Updated: