-
Type: Bug
-
Resolution: Done
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
ALL
-
Execution Team 2020-09-21
Drivers use the replSetStepDown command in a few tests. Ideally the replSetStepDown should cause a reliable primary change (even better if it happens fast). Starting near the beginning of May (CDRIVER-3662 with v4.5.0-1128-g7ee768a), drivers have begun to see the replSetStepDown fail frequently on macOS and Windows. The failure looks like this:
{ "ok" : 0, "errmsg" : "Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 5000ms. ...", "code" : 24, "codeName" : "LockTimeout", }
While debugging these failures in the C# driver, we saw the following behavior when using {"replSetStepDown":30,"secondaryCatchUpPeriodSecs":30,"force":false} without replSetFreeze on 4.4.0-rc10-34-gcfd8b75. C# is a little unique in that it runs each replSetStepDown test twice in a row to test both async+sync codepaths. In the attached logs, the first execution of the test succeeds and the second fails because the election takes over 30 seconds to complete.
- At time 18:08.907 the primary receives the replSetStepDownTest request:
{"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn1752","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"replSetStepDownTest","command":{"replSetStepDown":30,"secondaryCatchUpPeriodSecs":30,"force":false,"$db":"admin","lsid":{"id":{"$uuid":"51d0c38e-a100-457c-9249-4e87c426e493"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1593001088,"i":9}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}}},"numYields":0,"reslen":163,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1,"W":1}},"Global":{"acquireCount":{"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":1}}},"storage":{},"protocol":"op_msg","durationMillis":1}}
- Meanwhile the secondary is not eligible to become primary yet (this will be solved by replSetFreeze):
{"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I", "c":"COMMAND", "id":21581, "ctx":"conn2951","msg":"Received replSetStepUp request"} {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I", "c":"ELECTION", "id":4615657, "ctx":"conn2951","msg":"Not starting an election for a replSetStepUp request, since we are not electable","attr":{"reason":"Not standing for election because I am still waiting for stepdown period to end at 2020-06-24T13:18:29.331+01:00 (mask 0x10)"}} {"t":{"$date":"2020-06-24T13:18:08.907+01:00"},"s":"I", "c":"COMMAND", "id":21582, "ctx":"conn2951","msg":"replSetStepUp request failed","attr":{"error":" :: caused by :: CommandFailed: Election failed."}}
- Shortly before the secondary becomes eligible, the ident dropper task begins and doesn't stop until 11 seconds later (from 18:26.157 to 18:37.369). This might be related to
SERVER-48154:{"t":{"$date":"2020-06-24T13:18:26.157+01:00"},"s":"I", "c":"STORAGE", "id":22260, "ctx":"TimestampMonitor","msg":"Removing drop-pending idents with drop timestamps before timestamp","attr":{"timestamp":{"$timestamp":{"t":1593001083,"i":9}}}} {"t":{"$date":"2020-06-24T13:18:26.158+01:00"},"s":"I", "c":"STORAGE", "id":22237, "ctx":"TimestampMonitor","msg":"Completing drop for ident","attr":{"ident":"index-2384-4567462317691918719","namespace":"keyvault.datakeys","dropTimestamp":{"$timestamp":{"t": 1593001044,"i":11}}}} ... A lot more ident drops... {"t":{"$date":"2020-06-24T13:18:37.369+01:00"},"s":"I", "c":"STORAGE", "id":22237, "ctx":"TimestampMonitor","msg":"Completing drop for ident","attr":{"ident":"collection-3663-4567462317691918719","namespace":"Tests06241216.testcollection","dropTimestamp": {"$timestamp":{"t":1593001078,"i":6}}}} ...
- The secondary is unblocked and finally is able to step up:
{"t":{"$date":"2020-06-24T13:18:39.625+01:00"},"s":"I", "c":"REPL", "id":21331, "ctx":"OplogApplier-0","msg":"Transition to primary complete; database writes are now permitted"}
Log are attached: drivers-1290.zip
- is related to
-
SERVER-48154 ident dropper should periodically yield Global IS lock
- Closed