-
Type: Improvement
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Sharding
-
Query 2020-06-01, Query 2020-06-15
Suppose you have a 2 shards db and a sharded collection with a shard key 'origin'. 2.6 M documents.
a sh.status() gives:
{ "_id" : "test", "primary" : "sh_0", "partitioned" : true, "version" : { "uuid" : UUID("28b86279-c3e8-4432-b325-28136e353a85"), "lastMod" : 1 } } { "_id" : "test", "primary" : "sh_0", "partitioned" : true, "version" : { "uuid" : UUID("28b86279-c3e8-4432-b325-28136e353a85"), "lastMod" : 1 } } test.sh_coll shard key: { "origin" : 1 } unique: false balancing: true chunks: sh_0 5 sh_1 5 { "origin" : { "$minKey" : 1 } } -->> { "origin" : "A000001" } on : sh_1 Timestamp(3, 2) { "origin" : "A000001" } -->> { "origin" : "F000001" } on : sh_1 Timestamp(3, 3) { "origin" : "F000001" } -->> { "origin" : "H050002" } on : sh_1 Timestamp(3, 4) { "origin" : "H050002" } -->> { "origin" : "K000003" } on : sh_1 Timestamp(3, 5) { "origin" : "K000003" } -->> { "origin" : "N" } on : sh_1 Timestamp(3, 6) { "origin" : "N" } -->> { "origin" : "P050000" } on : sh_0 Timestamp(3, 7) { "origin" : "P050000" } -->> { "origin" : "S000001" } on : sh_0 Timestamp(3, 8) { "origin" : "S000001" } -->> { "origin" : "U050002" } on : sh_0 Timestamp(3, 9) { "origin" : "U050002" } -->> { "origin" : "Y045375" } on : sh_0 Timestamp(3, 10) { "origin" : "Y045375" } -->> { "origin" : { "$maxKey" : 1 } } on : sh_0 Timestamp(3, 11)
So, sorted by 'origin', half the records are in one shard the the next half is in the other shard.
Now suppose we want to get all the records, sorted by 'origin' and every 1000 records, make a process that takes 1 seconds... Getting half of the records takes more than 10mn (the default timeout for cursors).
If we don't specify a batch_size, we don't even arrive to 1.182 records.
If we specify a batch size of 1000, we get a "RecordNotFound" error when, after having returned half the records + the records from the first get from the 2nd shard, mongos tries to get more data from the 2nd shard.
That's what we wanted to show up because whatever the batch_size you specify, you won't be able to get all the records. Using a small batch size is often described as the way to solve cursors timeout. In a multi sharded db it is not the case.
Of course, this example has been crafted to reproduce the issue. It is not a real use case but I have real ones where I have 10 shards, process of each record takes more than 1 second and cursorTimeoutMillis: "7200000" (2 hours) and still get CursorNotFound errors.
no_cursor_timeout=True is not an option because it can lead to ghost cursors in the db
- duplicates
-
SERVER-6036 Disable cursor timeout for cursors that belong to a session
- Closed