Core Server / SERVER-69068

Further investigate random failures in multi-tenant-passthrough test cases

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Execution

      After SERVER-66631, the change_stream_multitenant_sharded_cluster_passthrough suite started failing randomly, with a different test case failing from run to run. The failures were related to ChangeStreamHistoryLost.
      Evergreen Link here.

      To mitigate this issue, a sleep was added.

      The current explanation for this problem is as follows:

      We create the change collection on every primary node explicitly and independently by issuing the enablement command.
      Each node's latest oplog timestamp might be different. For example, the latest oplog timestamp on node 1 might be Timestamp(133456788, 1) while on node 2 it could be Timestamp(123456789, 1).

      As a result, when we create the change collection on these nodes, the corresponding oplog entry is written at Timestamp(133456788, 2) on node 1 and at Timestamp(123456789, 2) on node 2. These timestamps also define the start timestamp of each node's change collection.

      Since the start timestamps differ between the two nodes, a getMore with Timestamp(133456788, 1) on node 2 will cause the change stream to fail.
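
      The explanation above can be sketched as a small model. This is only an illustration under stated assumptions: the rule that a read fails when its timestamp is earlier than a node's first change collection entry is a simplification of the explanation in this ticket, the names Timestamp, history_lost and nodes_losing_history are hypothetical, and the timestamps are made up rather than taken from the failing runs.

```python
from typing import Dict, List, NamedTuple


class Timestamp(NamedTuple):
    """Simplified stand-in for an oplog timestamp: (seconds, increment)."""
    seconds: int
    increment: int


def history_lost(read_ts: Timestamp, first_entry_ts: Timestamp) -> bool:
    # Assumed rule: the read fails if it starts before the node's
    # change collection has its first entry.
    return read_ts < first_entry_ts


def nodes_losing_history(read_ts: Timestamp,
                         first_entries: Dict[str, Timestamp]) -> List[str]:
    """Return the nodes on which a read at read_ts would fail in this model."""
    return sorted(name for name, first in first_entries.items()
                  if history_lost(read_ts, first))


# Hypothetical first change collection entries; each node created its
# collection independently, so the start timestamps differ.
first_entries = {
    "node1": Timestamp(100, 2),
    "node2": Timestamp(105, 2),
}

# A read at node 1's start timestamp is too early for node 2.
early = nodes_losing_history(Timestamp(100, 2), first_entries)

# A read at the latest of all start timestamps is safe on every node.
safe = nodes_losing_history(max(first_entries.values()), first_entries)
```

      In this model, a single timestamp is safe only if it is not earlier than the latest start timestamp across all nodes, which is the condition the sleep is meant to bring about.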

      Because no entity orchestrates the creation process (as configSvr does in the case of mongoS), the differing timestamps on the nodes appear to be what causes this situation.

      Because the timestamp differences between nodes are small (the test fixture spins up nodes quickly), the sleep gives the periodic no-op writer time to write a few oplog entries and advance the timestamp. The latest oplog timestamp is then later than the first entry of every change collection, which prevents the failures.

       

      Note that there is already a ticket to enable change streams in mongoQ (SERVER-68341), which should solve this problem. This ticket is about investigating the issue further and finding a better interim workaround that does not rely on sleep.
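
      One possible shape for a workaround that avoids a fixed sleep is to poll until the condition described above holds, that is, until the latest oplog timestamp is later than the first entry of every change collection. The sketch below is an assumption, not an existing test-fixture API: the helper name and its parameters are hypothetical, and how the latest oplog timestamp and the first-entry timestamps are obtained is left to the caller.

```python
import time
from typing import Callable, Iterable, Tuple

Ts = Tuple[int, int]  # (seconds, increment), compared lexicographically


def wait_until_past_all_first_entries(
    get_latest_oplog_ts: Callable[[], Ts],
    first_entry_timestamps: Iterable[Ts],
    timeout_secs: float = 30.0,
    poll_interval_secs: float = 0.1,
) -> Ts:
    """Poll until the latest oplog timestamp is strictly later than every
    change collection's first entry, then return that timestamp.

    get_latest_oplog_ts is a caller-supplied (hypothetical) callable.
    """
    target = max(first_entry_timestamps)
    deadline = time.monotonic() + timeout_secs
    while True:
        latest = get_latest_oplog_ts()
        if latest > target:
            return latest
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"latest oplog timestamp {latest} did not pass {target}")
        time.sleep(poll_interval_secs)


# Usage with a fake clock standing in for the periodic no-op writer.
ticks = iter([(100, 1), (103, 1), (106, 1)])
result = wait_until_past_all_first_entries(
    lambda: next(ticks),
    [(100, 2), (105, 2)],
    timeout_secs=5.0,
    poll_interval_secs=0.0,
)
```

      Compared with a fixed sleep, this waits only as long as needed and fails loudly on timeout instead of letting a later test case fail.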

            Assignee: [DO NOT USE] Backlog - Query Execution (backlog-query-execution)
            Reporter: Rishab Joshi (Inactive), rishab.joshi@mongodb.com
            Votes: 0
            Watchers: 2
