- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: 6.0.3
- Component/s: Query Execution
- Operating System: ALL
- (copied to CRM)
I hit this when upgrading from 4.4.18 to 6.0.3. I believe this is either a regression or previously undefined behavior whose fix broke backwards compatibility. In our scenario, we wait for calls to "robots". Once we receive a call to a robot based on its name (an insert), we latch onto the call document and wait for future updates in a narrower change stream. To prevent races, we resume that second stream from the resume token of the original change stream. What ends up happening is that the server reports that the resume token no longer exists. What actually appears to happen is that in 6.0 the server increments the cluster time ordinal of the token passed in by one. Depending on the number of writes to the oplog, there is a chance (especially in testing) that a token with that incremented cluster time does not exist yet, resulting in an error. If a single write happens after the first insert, so that the incremented cluster time does exist, there is no error, but we then miss the next update event, which does not appear to make sense (and feels like a bug). Attached are two sets of reproduction scripts.
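For context, here is a minimal sketch of the watcher pattern described above, written against the Node.js driver. The database and collection names ('data'/'stuff') come from the output below; the connection string, pipeline, and structure are illustrative assumptions, not the attached scripts themselves.

// Sketch of the watcher pattern: wait for an insert, then resume a narrower
// change stream from that insert's resume token so no update can be missed.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const coll = client.db('data').collection('stuff');

  // Broad stream: wait for the document (the "call to a robot") to be inserted.
  const broad = coll.watch([{ $match: { operationType: 'insert' } }]);
  const insertEvent = await broad.next();
  console.log(insertEvent);
  await broad.close();

  // Narrower stream: follow only that document, resuming from the insert's
  // resume token to avoid racing with writes that happen in between.
  const narrow = coll.watch(
    [{ $match: { 'documentKey._id': insertEvent.documentKey._id } }],
    { resumeAfter: insertEvent._id }
  );
  // On 6.0.3 this either errors (repro 1) or hangs and misses the update (repro 2).
  console.log(await narrow.next());

  await narrow.close();
  await client.close();
}

main().catch(console.error);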
Repro 1:
The first script to execute, watcher.js, waits for a document to be inserted and then starts a new change stream that waits for the next update to that document, using the resume token so that no updates are missed. Then execute writer.js, which inserts a document and then updates it. You should see an error like:
{
  _id: { _data: '82637D61C1000000012B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004' },
  operationType: 'insert',
  clusterTime: Timestamp({ t: 1669161409, i: 1 }),
  wallTime: ISODate("2022-11-22T23:56:49.618Z"),
  fullDocument: { _id: ObjectId("637d61c184e895993f2b1c84"), host: 'host1' },
  ns: { db: 'data', coll: 'stuff' },
  documentKey: { _id: ObjectId("637d61c184e895993f2b1c84") }
}
MongoServerError: Executor error during getMore :: caused by :: cannot resume stream; the resume token was not found. {_data: "82637D61C1000000022B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004"}
If you inspect the tokens, you'll see that they are off by one in the cluster time ordinal.
> decodeResumeToken('82637D61C1000000012B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004')
{
  timestamp: new Timestamp({ t: 1669161409, i: 1 }),
  version: 1,
  tokenType: 128,
  txnOpIndex: 0,
  fromInvalidate: false,
  uuid: new UUID("c2875420-0e80-474c-90e2-eaa344472125"),
  documentKey: { _id: new ObjectId("637d61c184e895993f2b1c84") }
}
> decodeResumeToken('82637D61C1000000022B022C0100296E5A1004C28754200E80474C90E2EAA34447212546645F69640064637D61C184E895993F2B1C840004')
{
  timestamp: new Timestamp({ t: 1669161409, i: 2 }),
  version: 1,
  tokenType: 128,
  txnOpIndex: 0,
  fromInvalidate: false,
  uuid: new UUID("c2875420-0e80-474c-90e2-eaa344472125"),
  documentKey: { _id: new ObjectId("637d61c184e895993f2b1c84") }
}
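For completeness, writer.js would look roughly like the following sketch. The 'host1' value matches the event above; the update payload is a placeholder, not the attached script.

// Sketch of the writer: insert a document, then update it. The watcher's
// resumed stream should deliver the update event but instead fails as above.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const coll = client.db('data').collection('stuff');

  const { insertedId } = await coll.insertOne({ host: 'host1' });
  await coll.updateOne({ _id: insertedId }, { $set: { status: 'updated' } }); // placeholder update

  await client.close();
}

main().catch(console.error);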
Repro 2:
Run watcher.js again, but this time run writer2.js, which inserts another document before performing the update. In this case, watcher.js hangs forever, which I would not expect and which again leads me to believe something fishy is happening.
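A writer2.js along these lines shows the second failure mode: the extra insert makes the incremented cluster time exist, so the resume no longer errors, yet the update event is never delivered. Again a sketch; the second document is a placeholder.

// Sketch of writer2: an unrelated insert lands between the first insert and
// the update, so the watcher's resume succeeds but it never sees the update.
const { MongoClient } = require('mongodb');

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder URI
  const coll = client.db('data').collection('stuff');

  const { insertedId } = await coll.insertOne({ host: 'host1' });
  await coll.insertOne({ host: 'host2' });                                    // placeholder unrelated write
  await coll.updateOne({ _id: insertedId }, { $set: { status: 'updated' } }); // watcher never receives this

  await client.close();
}

main().catch(console.error);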
Both scenarios work fine on 5.0.14. This issue is currently preventing us from upgrading our Atlas cluster to 6.0.3.
It's possible this is related to SERVER-67677.
Issue Links:
- related to: SERVER-67677 - Resuming a change stream with a resume token for an event that no longer matches returns an error (Backlog)