Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Query Execution
After SERVER-66631, the change_stream_multitenant_sharded_cluster_passthrough suite started failing intermittently, in different test cases from run to run. The failures were related to ChangeStreamHistoryLost.
Evergreen Link here.
To mitigate this issue, a sleep was added.
The current explanation for this problem is as follows:
We are creating the change collection on every primary node explicitly and independently by issuing the enablement command.
Each node's latest oplog timestamp might be different. For example, the latest oplog timestamp for node 1 might be Timestamp(133456788, 1) and for node 2 it could be Timestamp(123456789, 1).
As such, when we create the change collection on these nodes, the corresponding oplog entries would be Timestamp(133456788, 2) on node 1 and Timestamp(123456789, 2) on node 2. These also define the start timestamp of each change collection.
Since the timestamps differ between the two nodes, a getMore with timestamp Timestamp(133456788, 1) on node 2 will cause the change stream to fail.
Since there is no entity (like configSvr in the case of mongoS) that orchestrates the creation process, the differences in the timestamps on different nodes seem to be causing this situation.
Since the differences in timestamps between nodes are small (the test fixture spins up nodes quickly), the sleep gives the periodic no-op writer time to write a few oplog entries and bump up the timestamp. The latest oplog timestamp is then later than the first entry of every change collection, which prevents the failures.
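The timestamp reasoning above can be illustrated with a small model. This is only a sketch in Python, not server code: the `Timestamp` tuple, the `history_lost` rule (a read fails if it asks for a point before the change collection's first entry), and the node names are all assumptions made for illustration. It shows that a timestamp valid on one node can predate the change collection on another, and that a timestamp later than every start is safe everywhere.

```python
from typing import Dict, List, NamedTuple


class Timestamp(NamedTuple):
    """Simplified stand-in for an oplog timestamp: (seconds, increment)."""
    secs: int
    inc: int


def history_lost(read_ts: Timestamp, collection_start: Timestamp) -> bool:
    # Assumed rule for this model: history is lost when the requested
    # timestamp is earlier than the change collection's first entry.
    return read_ts < collection_start


def nodes_losing_history(read_ts: Timestamp,
                         starts: Dict[str, Timestamp]) -> List[str]:
    """Return the nodes on which a read at read_ts would fail in this model."""
    return sorted(n for n, s in starts.items() if history_lost(read_ts, s))


# Latest oplog timestamps from the example in the description.
latest_oplog = {
    "node1": Timestamp(133456788, 1),
    "node2": Timestamp(123456789, 1),
}

# Creating the change collection writes the next oplog entry on each node,
# which becomes that change collection's start timestamp.
starts = {n: Timestamp(ts.secs, ts.inc + 1) for n, ts in latest_oplog.items()}

# A timestamp taken from the node that is behind is too old for the other node.
early_ts = starts["node2"]
failing_before = nodes_losing_history(early_ts, starts)

# After no-op writes advance time past every start, no node loses history.
late_ts = Timestamp(max(s.secs for s in starts.values()) + 10, 1)
failing_after = nodes_losing_history(late_ts, starts)
```

In this model the window of unsafe timestamps is exactly the gap between the earliest and the latest change collection start, which is why advancing the oplog past the latest start removes the failures.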
It should be noted that there is already a ticket to enable change streams in mongoQ (SERVER-68341), and that should solve this problem. This ticket is about investigating the issue further and coming up with a better workaround (one that does not use sleep) for the time being.
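One possible direction for a workaround that avoids a fixed sleep is to poll for the condition the sleep is implicitly waiting on. The helper below is a generic sketch in Python, not the test fixture's API: `wait_until` and its parameters are hypothetical names, and the predicate a real test would pass (for example, "the latest oplog timestamp on every node is later than every change collection's start") is an assumption about what needs to hold.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout_secs: float = 10.0,
               interval_secs: float = 0.1,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate until it returns True or the timeout expires.

    Returns True if the predicate held before the deadline, False otherwise.
    The predicate is always evaluated at least once.
    """
    deadline = clock() + timeout_secs
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_secs)
```

Compared with a fixed sleep, this returns as soon as the condition holds and reports a timeout explicitly instead of silently proceeding.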
is related to:
SERVER-68341 Implement enable/disable command for mongoQ in serverless (Backlog)