Normal case:
1. The primary starts updating config.cache.chunks and sets the refreshing flag to true on the config.cache.collections document.
2. The primary finishes updating config.cache.chunks and sets the refreshing flag back to false on the config.cache.collections document.
3. The secondary CatalogCache tries to refresh, sees that the refreshing flag is true, and waits for a signal that the flag has become false.
4. Secondary replication intercepts the oplog entry that sets the refreshing flag to false and notifies the CatalogCache to check again.
5. The secondary CatalogCache checks again, sees that the flag is false, and proceeds to read config.cache.chunks (see the sketch after this list).
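The handshake above is essentially a wait/notify loop on the refreshing flag. Below is a minimal, self-contained C++ sketch of that shape; it is illustrative only, and names such as MetadataState, primaryRefresh, and secondaryCatalogCacheRefresh are made up for this sketch, not actual mongod symbols. The primary brackets the config.cache.chunks update with the flag, and the secondary CatalogCache blocks until it observes the flag as false.

```cpp
// Illustrative sketch of the normal-case handshake; not mongod code.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

struct MetadataState {
    std::mutex mtx;
    std::condition_variable cv;
    bool refreshing = false;  // stands in for the flag on config.cache.collections
};

// Primary side: bracket the config.cache.chunks update with the flag (steps 1-2).
void primaryRefresh(MetadataState& state) {
    {
        std::lock_guard<std::mutex> lk(state.mtx);
        state.refreshing = true;   // step 1: flag set before updating the chunks
    }
    // ... update config.cache.chunks here ...
    {
        std::lock_guard<std::mutex> lk(state.mtx);
        state.refreshing = false;  // step 2: flag cleared when the update is done
    }
    // In mongod, replication of the clearing write is what triggers the
    // notification (step 4); here we simply notify directly.
    state.cv.notify_all();
}

// Secondary side: the CatalogCache waits until it observes the flag as false
// (steps 3-5) before reading config.cache.chunks.
void secondaryCatalogCacheRefresh(MetadataState& state) {
    std::unique_lock<std::mutex> lk(state.mtx);
    state.cv.wait(lk, [&] { return !state.refreshing; });
    std::cout << "flag is false, reading config.cache.chunks\n";
}

int main() {
    MetadataState state;
    std::thread primary(primaryRefresh, std::ref(state));
    std::thread secondary(secondaryCatalogCacheRefresh, std::ref(state));
    primary.join();
    secondary.join();
}
```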
Stuck case:
1. The secondary batch-applies oplog entries up through T10, a batch that includes the entry setting the refreshing flag to true (but not the entry setting it back to false).
2. The secondary CatalogCache tries to refresh, sees that the refreshing flag is true, and waits for a signal that the flag has become false.
3. Secondary replication intercepts the oplog entry at T15 that sets the refreshing flag to false and notifies the CatalogCache to check again.
4. The secondary CatalogCache tries to re-read the refreshing flag, but because the secondary is in the middle of batch application, it reads from the last stable snapshot, which is at T10, and still sees the flag set to true. It therefore goes back to waiting for the next notification.
Once the node ends up in this state, it has to wait for the next refresh to happen, and the snapshot it then reads from must already have the flag set to false in order for it to break out of the loop.
Note: to hit this, multiple refreshes have to happen in a short enough span that the secondary waits for replication on the first refresh performed on the primary and then hits the stuck case above on the next incoming ones (see the sketch after this note).
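To make the missed wakeup concrete, here is another minimal sketch (again illustrative, not mongod code; SnapshottedFlag, the helper names, and the integer timestamps are stand-ins). The notification is driven by the T15 entry that clears the flag, while the re-check reads at the last stable snapshot, T10, and therefore still observes the flag as true.

```cpp
// Illustrative sketch of the stuck case: the wakeup arrives for T15, but the
// re-check reads the flag as of the last stable snapshot (T10).
#include <condition_variable>
#include <iostream>
#include <iterator>
#include <map>
#include <mutex>
#include <thread>

struct SnapshottedFlag {
    std::mutex mtx;
    std::condition_variable cv;
    // Versions of the refreshing flag keyed by timestamp, standing in for the
    // history of the config.cache.collections document.
    std::map<int, bool> history{{10, true}};  // T10: flag set to true
    int lastStableTimestamp = 10;             // batch application has not advanced this yet
    int notifications = 0;
};

// Replication side: applies the T15 entry that clears the flag and notifies
// the CatalogCache, but the stable timestamp still points at T10.
void applyClearingEntryAtT15(SnapshottedFlag& s) {
    std::lock_guard<std::mutex> lk(s.mtx);
    s.history[15] = false;  // the clearing write exists in the oplog (step 3)...
    ++s.notifications;      // ...and the CatalogCache is told to check again
    s.cv.notify_all();
}

// CatalogCache side: wakes up on the notification but re-reads the flag as of
// the last stable snapshot (T10), not as of the entry that woke it up (T15).
bool recheckFlag(SnapshottedFlag& s) {
    std::unique_lock<std::mutex> lk(s.mtx);
    s.cv.wait(lk, [&] { return s.notifications > 0; });
    --s.notifications;
    auto it = s.history.upper_bound(s.lastStableTimestamp);  // first entry after T10
    return std::prev(it)->second;  // value visible at T10: still true (step 4)
}

int main() {
    SnapshottedFlag s;
    std::thread repl(applyClearingEntryAtT15, std::ref(s));
    bool stillRefreshing = recheckFlag(s);
    repl.join();
    std::cout << std::boolalpha << "re-check sees refreshing = " << stillRefreshing
              << " -> goes back to waiting\n";
}
```

In this model the loop only breaks once a notification arrives while the stable snapshot already covers a version of the document with the flag cleared, which matches the observation above about needing the next refresh.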
Original description:
We encountered a very strange issue: all queries against one collection (product.productVariants) began to hang and time out, but only on one secondary node in the cluster.
Every query writes log messages like:
2019-08-09T11:45:01.100+0000 I SHARDING [conn1687013] Failed to refresh metadata for collection product.productVariants :: caused by :: MaxTimeMSExpired: operation exceeded time limit
2019-08-09T11:45:01.100+0000 I COMMAND [conn1687013] Timed out obtaining lock while trying to gather storage statistics for a slow operation
2019-08-09T11:45:01.100+0000 I COMMAND [conn1687013] command product.productVariants command: find { find: "productVariants", filter: { productId: { $in: [ "1464081735096156158-24-1-553-3541579132", "1465973890490715674-11-1-582-2096574603", "1465973890500950185-13-1-582-976357012", "1461932094024184784-166-1-553-3714864457", "1461935063694876366-54-1-553-2708221516", "1461935988919956077-34-1-553-1636835731", "1461937588876550474-147-1-553-178727871", "1461929262265524984-74-1-553-2941111310", "1463810588441994707-236-1-553-151836953", "1463810333622018613-11-1-553-3495339665", "1462962266606256953-204-1-553-3651680234", "1461930397452040780-28-1-553-3620775575" ] } }, readConcern: { level: "local" }, maxTimeMS: 60000, shardVersion: [ Timestamp(4547, 1), ObjectId('5c9d25a355eaa4f13e23434e') ], $readPreference: { mode: "secondaryPreferred", tags: [ { role: "main" } ] }, $clusterTime: { clusterTime: Timestamp(1565351041, 36), signature: { hash: BinData(0, 1ACB7B7FA0F5A4A9E7930DDAC1AFB93C717B5717), keyId: 6695428731097317566 } }, $configServerState: { opTime: { ts: Timestamp(1565351040, 874), t: 3 } }, $db: "product" } numYields:0 ok:0 errMsg:"epoch mismatch detected for product.productVariants, the collection may have been dropped and recreated" errName:StaleConfig errCode:13388 reslen:587 locks:{ Global: { acquireCount: { r: 3 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 729 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_msg 60010ms
Metadata was refreshed immediately after executing the query without readConcern: { level: "local" } directly on the problem mongod instance.
We found the related metadata update in the config replica set logs:
2019-08-09T11:41:09.495+0000 I SH_REFR [ConfigServerCatalogCacheLoader-7172] Refresh for collection product.productVariants from version 4546|8||5c9d25a355eaa4f13e23434e to version 4547|1||5c9d25a355eaa4f13e23434e took 3 ms
2019-08-09T11:41:09.567+0000 I SHARDING [Balancer] distributed lock 'product.productVariants' acquired for 'Migrating chunk(s) in collection product.productVariants', ts : 5d10d3d9b6fda9ab6affa646
After that, we grepped for "Marking collection" in the problem mongod instance's logs:
2019-08-09T11:36:51.393+0000 I SHARDING [repl writer worker 6] Marking collection product.productVariants with collection version: 4546|1||5c9d25a355eaa4f13e23434e, shard version: 4546|1||5c9d25a355eaa4f13e23434e as unsharded
2019-08-09T11:36:51.633+0000 I SHARDING [conn1685706] Marking collection product.productVariants as sharded with collection version: 4546|8||5c9d25a355eaa4f13e23434e, shard version: 4546|8||5c9d25a355eaa4f13e23434e
2019-08-09T11:41:08.422+0000 I SHARDING [repl writer worker 6] Marking collection product.productVariants with collection version: 4546|8||5c9d25a355eaa4f13e23434e, shard version: 4546|8||5c9d25a355eaa4f13e23434e as unsharded
2019-08-09T12:58:17.988+0000 I SHARDING [conn1720807] Marking collection product.productVariants as sharded with collection version: 4550|5||5c9d25a355eaa4f13e23434e, shard version: 4550|5||5c9d25a355eaa4f13e23434e
2019-08-09T13:01:36.409+0000 I SHARDING [repl writer worker 10] Marking collection product.productVariants with collection version: 4551|1||5c9d25a355eaa4f13e23434e, shard version: 4551|1||5c9d25a355eaa4f13e23434e as unsharded
2019-08-09T13:01:36.449+0000 I SHARDING [conn1721230] Marking collection product.productVariants as sharded with collection version: 4552|1||5c9d25a355eaa4f13e23434e, shard version: 4552|1||5c9d25a355eaa4f13e23434e
In our case there was more than an hour between "Marking ... as unsharded" and "Marking ... as sharded".
Issue links:
- is depended on by: SERVER-69813 Get rid of temporary solution for stuck secondaries (Backlog)
- is related to: SERVER-44105 Perform ShardServerCatalogLoader writes in a single transaction (Closed)