- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: 3.6.15
- Component/s: Replication
- None
- Fully Compatible
- ALL
- v4.2, v4.0, v3.6
- Repl 2019-11-18, Repl 2019-12-02
- 8
The test fails (very rarely) due to a race in how the repl.buffer.count metric is calculated. There's a period when the rsBackgroundSync thread has added oplog entries to the buffer but hasn't yet incremented repl.buffer.count. During this period, the ReplBatcher thread can clear the buffer and decrement repl.buffer.count. Since the count can be decremented before it's incremented, it can be briefly negative. The server_status_metrics.js test doesn't expect this race.
First, the test inserts 1000 docs with w: 2. The secondary's oplog buffer fills and empties, so the metric is incremented by 1000 and then decremented by 1000. The test calls serverStatus on the secondary and checks that repl.buffer.count >= 0. The value is in fact 0, so the assertion passes.
Next, the test updates all 1000 docs with w: 2. When the race is hit, events can proceed in this order:
- the rsBackgroundSync thread in BackgroundSync::_enqueueDocuments buffers 1000 oplog entries; bufferCountGauge is still 0
- the ReplBatcher thread in SyncTail::tryPopAndWaitForMore calls bufferCountGauge.decrement(1) a thousand times; the gauge is now -1000
- the test calls serverStatus; repl.buffer.count is -1000, so the test fails
- the rsBackgroundSync thread in BackgroundSync::_enqueueDocuments calls bufferCountGauge.increment(1000)
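
The interleaving above can be replayed deterministically with a small toy model. This is only an illustrative sketch, not the server's code: the class `OplogBufferModel` and the function `replay_interleaving` are hypothetical names, the gauge is a plain integer standing in for repl.buffer.count, and the two threads are simulated by calling their steps in the order listed rather than by running real threads.

```python
from collections import deque


class OplogBufferModel:
    """Toy model (hypothetical): a buffer plus a separately updated gauge."""

    def __init__(self):
        self.buffer = deque()
        self.gauge = 0  # stands in for repl.buffer.count

    # Producer side (models rsBackgroundSync): buffering and counting
    # are two separate steps, which is where the window opens.
    def push_entries(self, entries):
        self.buffer.extend(entries)

    def increment_gauge(self, n):
        self.gauge += n

    # Consumer side (models ReplBatcher): pop one entry and
    # decrement the gauge by one each time.
    def pop_entry(self):
        entry = self.buffer.popleft()
        self.gauge -= 1
        return entry


def replay_interleaving(n=1000):
    """Run the four steps in the listed order; return the gauge after steps 1, 2-3, 4."""
    model = OplogBufferModel()
    observed = []

    model.push_entries(range(n))   # step 1: entries buffered, gauge untouched
    observed.append(model.gauge)

    while model.buffer:            # step 2: consumer drains the buffer
        model.pop_entry()
    observed.append(model.gauge)   # step 3: what a serverStatus call would see

    model.increment_gauge(n)       # step 4: producer finally counts its entries
    observed.append(model.gauge)
    return observed
```

Calling `replay_interleaving()` returns `[0, -1000, 0]`: the gauge is negative only in the window between the consumer's decrements and the producer's late increment, which matches the transient value the test observes.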