Mongos instances that never receive a request with the primary read preference do not refresh their chunk routing information after a chunk migration. As a result, a query that includes the shard key can be routed to the wrong shard, and the results come back with data missing.
The only workaround I have come up with so far is to hit every mongos instance with a dummy primary-read-preference query for each sharded collection (or maybe call the refresh command against the mongos) at some regular interval.
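A minimal sketch of that workaround in Python, untested against a real cluster. The function name `touch_with_primary_reads`, the `connect` factory, and the example URIs are all hypothetical. The empty filter is an assumption too: the idea is that a broadcast query with primary read preference reaches every shard the mongos knows about, so any stale shard version gets rejected and the mongos reloads its routing table.

```python
from typing import Callable, Iterable, List, Tuple


def pymongo_connect(uri: str):
    """Open a client against one mongos with primary read preference.

    Imported lazily so the rest of the sketch has no hard dependency.
    """
    from pymongo import MongoClient

    return MongoClient(uri, readPreference="primary")


def touch_with_primary_reads(
    mongos_uris: Iterable[str],
    namespaces: Iterable[str],
    connect: Callable[[str], object] = pymongo_connect,
) -> List[Tuple[str, str]]:
    """Run one throwaway primary read per (mongos, sharded namespace).

    Returns the (uri, namespace) pairs that were queried.
    """
    namespaces = list(namespaces)
    touched = []
    for uri in mongos_uris:
        client = connect(uri)
        try:
            for ns in namespaces:
                db_name, coll_name = ns.split(".", 1)
                # Assumption: an unfiltered find_one is enough to make the
                # shards check the version this mongos sends.
                client[db_name][coll_name].find_one({})
                touched.append((uri, ns))
        finally:
            client.close()
    return touched


# Hypothetical usage, e.g. from cron every few minutes:
# touch_with_primary_reads(
#     ["mongodb://app-host-1:27017", "mongodb://app-host-2:27017"],
#     ["MyDb.myCollection"],
# )
```

Running the `flushRouterConfig` admin command against each mongos on the same schedule may be an alternative to the dummy queries; I have not verified which is cheaper.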
Background info:
I run a single 5-node replica set which spans 3 data centers: 3 nodes in the central "primary" DC and 1 node in each of our 2 regional "secondary" DCs. My application is read-only, runs in all 3 DCs, has high read performance requirements, and has a high tolerance for eventual consistency. As a result, I run with the "nearest" read preference so that my app running in a regional DC will prefer to read from the mongodb secondary running in the same DC, rather than going all the way back to the primary mongodb in the central DC.
We've hit VM RAM capacity issues, and are now attempting to shard in-place into 3 shards, with a mongos instance co-located with each app instance. Everything went smoothly at first: I allowed the balancer to migrate some chunks to the new shards. After a few chunks had moved, I disabled the balancer to verify there were no production errors, and found that objects which had moved were no longer coming back in queries by shard key.
If I make an identical query against the mongos from the shell (which defaults to primary read preference), I see the following in the logs and get correct results:
2017-08-10T17:30:45.750+0000 D QUERY [conn87] Received error status for query query: { guid: "some_guid" } sort: {} projection: {} on attempt 1 of 10: SendStaleConfig: [MyDb.myCollection] shard version not ok: version mismatch detected for MyDb.myCollection ( ns : MyDb.myCollection, received : 118|0||598b5cf1b6ff8d56d195d96f, wanted : 121|1||598b5cf1b6ff8d56d195d96f, send )
Afterwards, my app's queries (using readPref=nearest) correctly return the same results.
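To make the mismatch in the log line above easier to read, here is a small sketch that pulls the two shard versions out of it. The helper name `parse_stale_config` and the regular expression are my own, written against this one log line only. As I understand the format, each version is `major|minor||epoch`, "received" is the version the mongos sent, and "wanted" is the version the shard currently holds.

```python
import re
from typing import Dict, NamedTuple


class ShardVersion(NamedTuple):
    major: int
    minor: int
    epoch: str


# Matches e.g. "received : 118|0||598b5cf1b6ff8d56d195d96f"
_VERSION_RE = re.compile(r"(received|wanted) : (\d+)\|(\d+)\|\|([0-9a-f]{24})")

LOG_LINE = (
    "SendStaleConfig: [MyDb.myCollection] shard version not ok: version "
    "mismatch detected for MyDb.myCollection ( ns : MyDb.myCollection, "
    "received : 118|0||598b5cf1b6ff8d56d195d96f, "
    "wanted : 121|1||598b5cf1b6ff8d56d195d96f, send )"
)


def parse_stale_config(line: str) -> Dict[str, ShardVersion]:
    """Return the 'received' and 'wanted' versions found in a log line."""
    return {
        m.group(1): ShardVersion(int(m.group(2)), int(m.group(3)), m.group(4))
        for m in _VERSION_RE.finditer(line)
    }


versions = parse_stale_config(LOG_LINE)
# The mongos is behind when its (major, minor) is lower than the shard's
# within the same epoch.
is_stale = (
    versions["received"].epoch == versions["wanted"].epoch
    and versions["received"][:2] < versions["wanted"][:2]
)
print(versions["received"], versions["wanted"], is_stale)
```

Here the mongos sent major version 118 while the shard expected 121, which is consistent with the mongos having missed the recent migrations until the primary read forced a refresh.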
Duplicates: SERVER-28948 open up secondaries to checking shardVersion (Closed)