Core Server / SERVER-65949

Sharded time series leads to high query rates on the primary shard with CommandOnShardedViewNotSupportedOnMongod errors

    • Component: Query Integration
    • Operating System: ALL

      My setup:

      • Everything is running locally
      • Docker Desktop
      • Mongo 5.0.7
      • 2 shards, with 3 servers in a replica set per shard
      • 1 mongos
      • 1 config replica set with 3 servers
      • Load tester application that pushes a lot of time series data, while simultaneously doing reads, written using the .NET C# driver, v. 2.15.0

      Reproduction steps:

      1. Set up a 2-shard cluster (even in Docker).
      2. Create a sharded time series collection and insert a few records (I added a few million). The destination bucket of each record doesn't matter, i.e. the meta field can vary across different sensor IDs. A minimal sketch of this step appears after this list.
      3. Perform a sustained, heavy read load on the collection.
      4. Observe that the primary shard for the time series view processes a large number of query ops while the other shard processes none. This can be seen in the attached mongostat output, where shard1 on the left processes 1k+ "query" ops while shard2 on the right processes none of these "query" ops.
      5. Observe the following log entry on mongos:
        1. "ctx":"conn1516","msg":"Unable to establish remote cursors","attr":{"error":{"code":169,"codeName":"CommandOnShardedViewNotSupportedOnMongod","errmsg":"Resolved views on sharded collections must be executed by mongos","resolvedView":{"ns":"EndpointDataProtoTests.system.buckets.EndpointData:Endpoints-NormalV8","pipeline":[{"$_internalUnpackBucket":{"timeField":"t","metaField":"m","bucketMaxSpanSeconds":3600,"exclude":[]}}],"collation":{"locale":"simple"}}},"nRemotes":0}}

      The attached image is mongostat output captured with NoSqlBooster 7.1. The left side is a direct connection to the primary of shard1, and the right side a direct connection to the primary of shard2.


      Sharding a time series collection increases write throughput, but reads affect the whole cluster because the primary shard's CPU usage spikes. While the read load is running, the mongos logs contain repeated entries stating that resolved views on sharded collections must be executed by mongos; when I stop the read load, these messages are no longer logged.
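      One way to confirm that the imbalance comes from query routing rather than data placement is to count bucket documents on each shard directly. A minimal sketch, assuming direct connections to each shard primary (the ports and the simplified bucket namespace are assumptions, not taken from the report):

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

class BucketCounts
{
    static void Main()
    {
        // Direct connections to each shard's primary; ports are assumptions.
        foreach (var uri in new[]
        {
            "mongodb://localhost:27018/?directConnection=true", // shard1
            "mongodb://localhost:27020/?directConnection=true"  // shard2
        })
        {
            var buckets = new MongoClient(uri)
                .GetDatabase("EndpointDataProtoTests")
                .GetCollection<BsonDocument>("system.buckets.EndpointData");
            Console.WriteLine($"{uri}: {buckets.EstimatedDocumentCount()} buckets");
        }
    }
}
```

      If both shards hold a comparable number of buckets while mongostat shows query ops on shard1 only, the skew is in how reads on the view are routed, which matches the resolved-view errors above.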

      From my research, this seems related to SERVER-43376 (operations on non-sharded views in sharded clusters incur an extra round trip).

      This is a problem for us, because adding a read load affects the whole cluster's performance. Our workload issues roughly one read for every four writes.

      I found the problem while load testing my sharded time series prototype on Atlas.

        Attachments:
        1. Shard mongo-stat.png (35 kB)
        2. shard2-t2.png (166 kB)
        3. shard2.png (62 kB)
        4. shard1-t2-v2.png (179 kB)
        5. shard1-t2.png (150 kB)
        6. shard1.png (77 kB)
        7. Screenshot 2022-04-26 104816.png (206 kB)
        8. Multi-shard Metrics overview 2.png (204 kB)
        9. Multi-shard Metrics overview 1.png (472 kB)
        10. Mongotop.png (173 kB)
        11. image-2022-06-14-10-20-55-809.png (421 kB)
        12. image-2022-06-09-14-04-39-666.png (102 kB)
        13. image-2022-06-09-13-59-20-641.png (92 kB)
        14. data - attempt2.zip (2.07 MB)
        15. data and loadtester2.zip (5.25 MB)
        16. All shards mongostat.png (271 kB)

            Assignee:
            backlog-query-integration [DO NOT USE] Backlog - Query Integration
            Reporter:
            Marnu vd Merwe (marnu.vdmerwe@iotnxt.com)
            Votes:
            1
            Watchers:
            12
