Normal case:
1. The primary starts updating config.cache.chunks and sets the refreshing flag to true on the config.cache.collections document.
2. The primary finishes updating config.cache.chunks and sets the refreshing flag back to false on the config.cache.collections document.
3. The secondary CatalogCache tries to refresh, sees that the refreshing flag is true, and waits for a signal that the flag has become false.
4. Secondary replication intercepts the oplog entry that sets the refreshing flag to false and notifies the CatalogCache to check again.
5. The secondary CatalogCache checks again, sees that the flag is false, and proceeds to read config.cache.chunks (see the sketch after this list).
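The handshake above is essentially a wait/notify loop on the refreshing flag. Below is a minimal, self-contained C++ sketch of that shape; it is illustrative only, and names such as MetadataState, primaryRefresh, and secondaryCatalogCacheRefresh are made up for this sketch, not actual mongod symbols. The primary brackets the config.cache.chunks update with the flag, and the secondary CatalogCache blocks until it observes the flag as false.

```cpp
// Illustrative sketch of the normal-case handshake; not mongod code.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct MetadataState {
    std::mutex mtx;
    std::condition_variable cv;
    bool refreshing = false;  // stands in for the flag on config.cache.collections
};

// Primary side: bracket the config.cache.chunks update with the flag (steps 1-2).
void primaryRefresh(MetadataState& state) {
    {
        std::lock_guard<std::mutex> lk(state.mtx);
        state.refreshing = true;   // step 1: flag set before updating the chunks
    }
    // ... update config.cache.chunks here ...
    {
        std::lock_guard<std::mutex> lk(state.mtx);
        state.refreshing = false;  // step 2: flag cleared when the update is done
    }
    // In mongod, replication of the clearing write is what triggers the
    // notification (step 4); here we simply notify directly.
    state.cv.notify_all();
}

// Secondary side: the CatalogCache waits until it observes the flag as false
// (steps 3-5) before reading config.cache.chunks.
void secondaryCatalogCacheRefresh(MetadataState& state) {
    std::unique_lock<std::mutex> lk(state.mtx);
    state.cv.wait(lk, [&] { return !state.refreshing; });
    std::cout << "flag is false, reading config.cache.chunks\n";
}

int main() {
    MetadataState state;
    std::thread primary(primaryRefresh, std::ref(state));
    std::thread secondary(secondaryCatalogCacheRefresh, std::ref(state));
    primary.join();
    secondary.join();
}
```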
Stuck case:
1. The secondary batch-applies oplog entries up through T10, a batch that includes the entry setting the refreshing flag to true (but not the entry setting it back to false).
2. The secondary CatalogCache tries to refresh, sees that the refreshing flag is true, and waits for a signal that the flag has become false.
3. Secondary replication intercepts the oplog entry at T15 that sets the refreshing flag to false and notifies the CatalogCache to check again.
4. The secondary CatalogCache tries to re-read the refreshing flag, but because the secondary is in the middle of batch application, it reads from the last stable snapshot, which is at T10, and still sees the flag set to true. It therefore goes back to waiting for the next notification.
Once the node ends up in this state, it has to wait for the next refresh to happen, and the snapshot it then reads from must already have the flag set to false in order for it to break out of the loop.
Note: to hit this, multiple refreshes have to happen in a short enough span that the secondary waits for replication on the first refresh performed on the primary and then hits the stuck case above on the next incoming ones (see the sketch after this note).
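To make the missed wakeup concrete, here is another minimal sketch (again illustrative, not mongod code; SnapshottedFlag, the helper names, and the integer timestamps are stand-ins). The notification is driven by the T15 entry that clears the flag, while the re-check reads at the last stable snapshot, T10, and therefore still observes the flag as true.

```cpp
// Illustrative sketch of the stuck case: the wakeup arrives for T15, but the
// re-check reads the flag as of the last stable snapshot (T10).
#include <condition_variable>
#include <iostream>
#include <iterator>
#include <map>
#include <mutex>
#include <thread>

struct SnapshottedFlag {
    std::mutex mtx;
    std::condition_variable cv;
    // Versions of the refreshing flag keyed by timestamp, standing in for the
    // history of the config.cache.collections document.
    std::map<int, bool> history{{10, true}};  // T10: flag set to true
    int lastStableTimestamp = 10;             // batch application has not advanced this yet
    int notifications = 0;
};

// Replication side: applies the T15 entry that clears the flag and notifies
// the CatalogCache, but the stable timestamp still points at T10.
void applyClearingEntryAtT15(SnapshottedFlag& s) {
    std::lock_guard<std::mutex> lk(s.mtx);
    s.history[15] = false;  // the clearing write exists in the oplog (step 3)...
    ++s.notifications;      // ...and the CatalogCache is told to check again
    s.cv.notify_all();
}

// CatalogCache side: wakes up on the notification but re-reads the flag as of
// the last stable snapshot (T10), not as of the entry that woke it up (T15).
bool recheckFlag(SnapshottedFlag& s) {
    std::unique_lock<std::mutex> lk(s.mtx);
    s.cv.wait(lk, [&] { return s.notifications > 0; });
    --s.notifications;
    auto it = s.history.upper_bound(s.lastStableTimestamp);  // first entry after T10
    return std::prev(it)->second;  // value visible at T10: still true (step 4)
}

int main() {
    SnapshottedFlag s;
    std::thread repl(applyClearingEntryAtT15, std::ref(s));
    bool stillRefreshing = recheckFlag(s);
    repl.join();
    std::cout << std::boolalpha << "re-check sees refreshing = " << stillRefreshing
              << " -> goes back to waiting\n";
}
```

In this model the loop only breaks once a notification arrives while the stable snapshot already covers a version of the document with the flag cleared, which matches the observation above about needing the next refresh.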
Original description:
We encountered a very strange issue: all queries against one collection (product.productVariants) began to hang and time out, but only on one secondary node in the cluster.
Every query writes log messages like:
2019-08-09T11:45:01.100+0000 I SHARDING [conn1687013] Failed to refresh metadata for collection product.productVariants :: caused by :: MaxTimeMSExpired: operation exceeded time limit
2019-08-09T11:45:01.100+0000 I COMMAND [conn1687013] Timed out obtaining lock while trying to gather storage statistics for a slow operation
2019-08-09T11:45:01.100+0000 I COMMAND [conn1687013] command product.productVariants command: find { find: "productVariants", filter: { productId: { $in: [ "1464081735096156158-24-1-553-3541579132", "1465973890490715674-11-1-582-2096574603", "1465973890500950185-13-1-582-976357012", "1461932094024184784-166-1-553-3714864457", "1461935063694876366-54-1-553-2708221516", "1461935988919956077-34-1-553-1636835731", "1461937588876550474-147-1-553-178727871", "1461929262265524984-74-1-553-2941111310", "1463810588441994707-236-1-553-151836953", "1463810333622018613-11-1-553-3495339665", "1462962266606256953-204-1-553-3651680234", "1461930397452040780-28-1-553-3620775575" ] } }, readConcern: { level: "local" }, maxTimeMS: 60000, shardVersion: [ Timestamp(4547, 1), ObjectId('5c9d25a355eaa4f13e23434e') ], $readPreference: { mode: "secondaryPreferred", tags: [ { role: "main" } ] }, $clusterTime: { clusterTime: Timestamp(1565351041, 36), signature: { hash: BinData(0, 1ACB7B7FA0F5A4A9E7930DDAC1AFB93C717B5717), keyId: 6695428731097317566 } }, $configServerState: { opTime: { ts: Timestamp(1565351040, 874), t: 3 } }, $db: "product" } numYields:0 ok:0 errMsg:"epoch mismatch detected for product.productVariants, the collection may have been dropped and recreated" errName:StaleConfig errCode:13388 reslen:587 locks:{ Global: { acquireCount: { r: 3 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 729 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 60010ms
Metadata was refreshed immediately after executing the query without readConcern: { level: "local" } directly on the problem mongod instance.
We found the related metadata update in the config replica set logs:
2019-08-09T11:41:09.495+0000 I SH_REFR [ConfigServerCatalogCacheLoader-7172] Refresh for collection product.productVariants from version 4546|8||5c9d25a355eaa4f13e23434e to version 4547|1||5c9d25a355eaa4f13e23434e took 3 ms
2019-08-09T11:41:09.567+0000 I SHARDING [Balancer] distributed lock 'product.productVariants' acquired for 'Migrating chunk(s) in collection product.productVariants', ts : 5d10d3d9b6fda9ab6affa646
After that, we grepped for "Marking collection" in the problem mongod instance's logs:
2019-08-09T11:36:51.393+0000 I SHARDING [repl writer worker 6] Marking collection product.productVariants with collection version: 4546|1||5c9d25a355eaa4f13e23434e, shard version: 4546|1||5c9d25a355eaa4f13e23434e as unsharded
2019-08-09T11:36:51.633+0000 I SHARDING [conn1685706] Marking collection product.productVariants as sharded with collection version: 4546|8||5c9d25a355eaa4f13e23434e, shard version: 4546|8||5c9d25a355eaa4f13e23434e
2019-08-09T11:41:08.422+0000 I SHARDING [repl writer worker 6] Marking collection product.productVariants with collection version: 4546|8||5c9d25a355eaa4f13e23434e, shard version: 4546|8||5c9d25a355eaa4f13e23434e as unsharded
2019-08-09T12:58:17.988+0000 I SHARDING [conn1720807] Marking collection product.productVariants as sharded with collection version: 4550|5||5c9d25a355eaa4f13e23434e, shard version: 4550|5||5c9d25a355eaa4f13e23434e
2019-08-09T13:01:36.409+0000 I SHARDING [repl writer worker 10] Marking collection product.productVariants with collection version: 4551|1||5c9d25a355eaa4f13e23434e, shard version: 4551|1||5c9d25a355eaa4f13e23434e as unsharded
2019-08-09T13:01:36.449+0000 I SHARDING [conn1721230] Marking collection product.productVariants as sharded with collection version: 4552|1||5c9d25a355eaa4f13e23434e, shard version: 4552|1||5c9d25a355eaa4f13e23434e
In our case there was more than an hour between "Marking ... as unsharded" and "Marking ... as sharded".
Issue links:
- is depended on by: SERVER-69813 Get rid of temporary solution for stuck secondaries (Backlog)
- is related to: SERVER-44105 Perform ShardServerCatalogLoader writes in a single transaction (Closed)