- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: 6.0.3
- Component/s: Query Execution
- Operating System: ALL
- (copied to CRM)
I hit this when upgrading from 4.4.18 to 6.0.3. I believe this is either a regression or previously undefined behavior whose fix broke backwards compatibility. In our scenario, we wait for calls to "robots". Once we receive a call to a robot based on its name (an insert), we latch onto the call document and wait for future updates in a narrower change stream. To prevent races, we resume that second stream from the resume token of the original change stream. What ends up happening is that the server reports that the resume token no longer exists. What actually appears to happen is that in 6.0 the server increments the cluster time ordinal of the token passed in by one. Depending on the number of writes to the oplog, there is a chance (especially in testing) that a token with that incremented cluster time does not exist yet, resulting in an error. If a single write happens after the first insert, so that the incremented cluster time does exist, there is no error, but we then miss the next update event, which does not appear to make sense (and feels like a bug). Attached are two sets of reproduction scripts.
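For context, here is a minimal sketch of the watcher pattern described above, written against the Node.js driver. The database and collection names ('data'/'stuff') come from the output below; the connection string, pipeline, and structure are illustrative assumptions, not the attached scripts themselves.

// Sketch of the watcher pattern: wait for an insert, then resume a narrower
// change stream from that insert's resume token so no update can be missed.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const coll = client.db('data').collection('stuff');

  // Broad stream: wait for the document (the "call to a robot") to be inserted.
  const broad = coll.watch([{ $match: { operationType: 'insert' } }]);
  const insertEvent = await broad.next();
  console.log(insertEvent);
  await broad.close();

  // Narrower stream: follow only that document, resuming from the insert's
  // resume token to avoid racing with writes that happen in between.
  const narrow = coll.watch(
    [{ $match: { 'documentKey._id': insertEvent.documentKey._id } }],
    { resumeAfter: insertEvent._id }
  );
  // On 6.0.3 this either errors (repro 1) or hangs and misses the update (repro 2).
  console.log(await narrow.next());

  await narrow.close();
  await client.close();
}

main().catch(console.error);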
Repro 1:
The first script to execute, watcher.js, waits for a document to be inserted and then starts a new change stream that waits for the next update to that document, using the resume token so that no updates are missed. Then execute writer.js, which inserts a document and then updates it. You should see an error like:
{
  _id: { _data: '82637D61C1000000012B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004' },
  operationType: 'insert',
  clusterTime: Timestamp({ t: 1669161409, i: 1 }),
  wallTime: ISODate("2022-11-22T23:56:49.618Z"),
  fullDocument: { _id: ObjectId("637d61c184e895993f2b1c84"), host: 'host1' },
  ns: { db: 'data', coll: 'stuff' },
  documentKey: { _id: ObjectId("637d61c184e895993f2b1c84") }
}
MongoServerError: Executor error during getMore :: caused by :: cannot resume stream; the resume token was not found. {_data: "82637D61C1000000022B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004"}
If you inspect the tokens, you'll see that they are off by one in the cluster time ordinal.
> decodeResumeToken('82637D61C1000000012B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004')
{
  timestamp: new Timestamp({ t: 1669161409, i: 1 }),
  version: 1,
  tokenType: 128,
  txnOpIndex: 0,
  fromInvalidate: false,
  uuid: new UUID("c2875420-0e80-474c-90e2-eaa344472125"),
  documentKey: { _id: new ObjectId("637d61c184e895993f2b1c84") }
}
> decodeResumeToken('82637D61C1000000022B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004')
{
  timestamp: new Timestamp({ t: 1669161409, i: 2 }),
  version: 1,
  tokenType: 128,
  txnOpIndex: 0,
  fromInvalidate: false,
  uuid: new UUID("c2875420-0e80-474c-90e2-eaa344472125"),
  documentKey: { _id: new ObjectId("637d61c184e895993f2b1c84") }
}
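For completeness, writer.js would look roughly like the following sketch. The 'host1' value matches the event above; the update payload is a placeholder, not the attached script.

// Sketch of the writer: insert a document, then update it. The watcher's
// resumed stream should deliver the update event but instead fails as above.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const coll = client.db('data').collection('stuff');

  const { insertedId } = await coll.insertOne({ host: 'host1' });
  await coll.updateOne({ _id: insertedId }, { $set: { status: 'updated' } }); // placeholder update

  await client.close();
}

main().catch(console.error);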
Repro 2:
Run watcher.js again, but this time run writer2.js, which inserts another document before performing the update. In this case, watcher.js hangs forever, which I would not expect and which again leads me to believe something fishy is happening.
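A writer2.js along these lines shows the second failure mode: the extra insert makes the incremented cluster time exist, so the resume no longer errors, yet the update event is never delivered. Again a sketch; the second document is a placeholder.

// Sketch of writer2: an unrelated insert lands between the first insert and
// the update, so the watcher's resume succeeds but it never sees the update.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const coll = client.db('data').collection('stuff');

  const { insertedId } = await coll.insertOne({ host: 'host1' });
  await coll.insertOne({ host: 'host2' });                                    // placeholder unrelated write
  await coll.updateOne({ _id: insertedId }, { $set: { status: 'updated' } }); // watcher never receives this

  await client.close();
}

main().catch(console.error);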
Both scenarios work fine on 5.0.14. This issue is currently preventing us from upgrading our Atlas cluster to 6.0.3.
It's possible this is related to SERVER-67677.
Issue Links:
- related to: SERVER-67677 - Resuming a change stream with a resume token for an event that no longer matches returns an error (Backlog)