  1. Core Server
  2. SERVER-94001

Make LogicalSessionCacheRefresh more resilient to errors from a single shard

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Cluster Scalability 2024-09-02, Cluster Scalability 2024-10-14, Cluster Scalability 2024-10-28, Cluster Scalability 2024-11-11

      Currently, the periodic refresh thread runs the following sequence:

      1. Purge the sessions pending refresh that were ended with the endSessionsFromClient command.
      2. Perform the refresh. This updates the ping in the sessions collection for each currently active in-memory session.
      3. Scan the sessions collection and check which sessions still exist. Sessions that no longer exist in the collection are treated as "expired" sessions, and we kill their cursors.
      4. Once this is done, clear the list of "sessions pending to be refreshed".

      However, any assertion that occurs aborts the entire sequence. This means that if a single shard keeps causing step 2 to assert, the list of "sessions pending to be refreshed" is never cleared and can keep accumulating.
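
The accumulation can be illustrated with a small Python simulation. This is only a sketch under assumptions, not the server's implementation: `RefreshSimulator`, `refresh_once`, and `ShardWriteError` are hypothetical names, the sessions collection is modeled as an in-memory set, and integer lsids are mapped to shards in ranges of 10.

```python
class ShardWriteError(Exception):
    """Hypothetical stand-in for an assertion caused by a failing shard."""


class RefreshSimulator:
    """Toy model of the periodic refresh sequence (all names are illustrative)."""

    def __init__(self, failing_shards=()):
        self.pending = set()         # sessions pending to be refreshed
        self.ended = set()           # sessions ended by clients
        self.cursor_sessions = set() # sessions that still own open cursors
        self.collection = set()      # stand-in for the sessions collection
        self.killed = []             # sessions whose cursors were killed
        self.failing_shards = set(failing_shards)

    @staticmethod
    def owner(lsid):
        # Assumed chunk layout: lsids 0..9 on shard0, 10..19 on shard1, ...
        return f"shard{lsid // 10}"

    def refresh_once(self):
        # Step 1: drop pending sessions that clients have already ended.
        to_refresh = self.pending - self.ended
        # Step 2: update the ping of every active session. A single
        # failing shard raises here and aborts everything below.
        for lsid in sorted(to_refresh):
            if self.owner(lsid) in self.failing_shards:
                raise ShardWriteError(f"write to {self.owner(lsid)} failed")
            self.collection.add(lsid)
        # Step 3: sessions missing from the collection count as expired.
        expired = self.cursor_sessions - self.collection
        self.killed.extend(sorted(expired))
        self.cursor_sessions -= expired
        # Step 4: reached only on success, so the list is cleared only then.
        self.pending.clear()
        self.ended.clear()


def run_rounds(sim, rounds=3):
    """Add three new sessions per round and record the pending-list size."""
    sizes = []
    for n in range(rounds):
        sim.pending.update({n, 10 + n, 20 + n})
        try:
            sim.refresh_once()
        except ShardWriteError:
            pass  # the periodic thread simply tries again next cycle
        sizes.append(len(sim.pending))
    return sizes


print(run_rounds(RefreshSimulator()))                           # [0, 0, 0]
print(run_rounds(RefreshSimulator(failing_shards={"shard0"})))  # [3, 6, 9]
```

With every shard healthy, the pending list is emptied each cycle. With only shard0 failing, the list grows every cycle, even though the sessions owned by shard1 and shard2 could have been refreshed.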

      As a concrete example, imagine this setup.
      Sessions collection chunk distribution:
      shard0: lsid: 0->10
      shard1: lsid: 10->20
      shard2: lsid: 20->30

      lsids in memory:
      shard10: 0, 10, 20
      shard11: 1, 11, 21
      shard12: 2, 12, 22

      Note that shard10, shard11, and shard12 will each have to target shard0 when performing the session refresh, since each has an lsid that falls in the chunk shard0 owns. So, if writes to shard0 keep failing, shard10, shard11, and shard12 won't be able to purge expired sessions. Also note that it is not unusual for multiple shards to have the same lsid in memory, because some operations can hit multiple shards. In an extreme case, a broadcast query with lsid: 4 leaves every shard with lsid: 4 in memory. Using the previous example, all shards would then have to target shard0 when performing the logical session cache refresh.
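
The targeting in this example can be worked through in a few lines of Python. This is a sketch for illustration only: `owning_shard` and `refresh_targets` are hypothetical helpers, the integer lsids are taken from the example rather than real session ids, and the last chunk's upper bound (30) is not checked.

```python
from bisect import bisect_right

# Chunk distribution from the example: (inclusive lower bound, owning shard).
CHUNKS = [(0, "shard0"), (10, "shard1"), (20, "shard2")]
LOWER_BOUNDS = [lower for lower, _ in CHUNKS]


def owning_shard(lsid):
    """Return the shard that owns the chunk containing lsid."""
    return CHUNKS[bisect_right(LOWER_BOUNDS, lsid) - 1][1]


def refresh_targets(in_memory):
    """Map each node to the set of shards its refresh has to write to."""
    return {
        node: {owning_shard(lsid) for lsid in lsids}
        for node, lsids in in_memory.items()
    }


in_memory = {
    "shard10": [0, 10, 20],
    "shard11": [1, 11, 21],
    "shard12": [2, 12, 22],
}

targets = refresh_targets(in_memory)
# Nodes whose refresh aborts if writes to shard0 keep failing.
blocked = sorted(node for node, shards in targets.items() if "shard0" in shards)
print(blocked)          # ['shard10', 'shard11', 'shard12']
print(owning_shard(4))  # shard0
```

Every node holds at least one lsid in the 0->10 range, so every node's refresh depends on shard0. The broadcast case is the same effect at larger scale: lsid 4 also routes to shard0, so any node that holds it inherits that dependency.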

            Assignee:
            Unassigned
            Reporter:
            Randolph Tan (randolph@mongodb.com)