- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- None
- Fully Compatible
- ALL
- Repl 2021-03-08, Repl 2021-03-22
- 25
The primary and the secondary (the former primary) have diverging oplogs. Because the primary catchup timeout is set to 0 seconds, the new primary steps up immediately, without catching up on the former primary's entries, and writes its own no-op start-of-term entry. As a result, the secondary goes into rollback.
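The divergence can be illustrated with a toy model. This is only a sketch, not the server's rollback logic: oplogs are modeled as arrays of `{ts, term}` entries, and the helper names `findCommonPoint` and `entriesToRollBack` as well as the timestamps and terms are made up for illustration.

```javascript
// Toy model (hypothetical names and values): an oplog is an array of
// { ts, term } entries, oldest first.
function sameEntry(a, b) {
    return a.ts === b.ts && a.term === b.term;
}

// Index of the last entry shared by both oplogs, or -1 if there is none.
function findCommonPoint(localOplog, sourceOplog) {
    let i = 0;
    while (i < localOplog.length && i < sourceOplog.length &&
           sameEntry(localOplog[i], sourceOplog[i])) {
        i++;
    }
    return i - 1;
}

// Entries the local node holds after the common point; a non-empty
// result means the local node has to roll back.
function entriesToRollBack(localOplog, sourceOplog) {
    const common = findCommonPoint(localOplog, sourceOplog);
    return localOplog.slice(common + 1);
}

const shared = [{ts: 1, term: 1}];

// 20020 (former primary) wrote its start-of-term no-op in term 2.
const formerPrimary = [...shared, {ts: 2, term: 2}];

// 20021 stepped up without fetching that entry and wrote its own no-op.
const newPrimary = [...shared, {ts: 3, term: 3}];

// Had 20021 fetched 20020's entry first, its no-op would follow it.
const newPrimaryCaughtUp = [...formerPrimary, {ts: 3, term: 3}];

console.log(entriesToRollBack(formerPrimary, newPrimary));         // [ { ts: 2, term: 2 } ]
console.log(entriesToRollBack(formerPrimary, newPrimaryCaughtUp)); // []
```

In the first case the former primary holds an entry its sync source never saw, so it must roll back; in the second case its oplog is a prefix of the new primary's and no rollback is needed.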
There are two possible solutions:
1. (Prevent rollback) Wait until 20020's start-of-term oplog entry has been fetched by all nodes, by pausing here. For example, adding a "sleep(3000);" there prevented rollback. A fixed sleep would not be a stable solution, but we could add some kind of synchronization at that point to ensure that all nodes have fetched 20020's entry. That way, when 20021 steps up, it writes its own start-of-term entry after 20020's entry, and the two oplogs do not diverge.
2. (Work with the possibility of rollback) Change step 9 to use "w: 3", here. Because a "w: 3" write is acknowledged only after all three nodes have replicated it, 20020 and 20022 should have completed rollback after this line if they rolled back, and the three nodes should be in sync.
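As a rough sketch of the synchronization idea in option 1, a fixed sleep can be replaced by polling a condition until it holds. Everything below is a self-contained toy: `waitUntil`, `hasEntry`, `fetchRound`, and the simulated oplogs are hypothetical and do not come from the test in question.

```javascript
// Polls `predicate` up to `maxAttempts` times, calling `onWait` between
// attempts. Returns true as soon as the predicate holds, false otherwise.
function waitUntil(predicate, onWait, maxAttempts = 100) {
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        if (predicate()) {
            return true;
        }
        onWait();
    }
    return false;
}

function hasEntry(oplog, entry) {
    return oplog.some((e) => e.ts === entry.ts && e.term === entry.term);
}

// Simulated state: only 20020 has its own start-of-term entry so far.
const startOfTerm = {ts: 2, term: 2};
const oplogs = {
    20020: [{ts: 1, term: 1}, startOfTerm],
    20021: [{ts: 1, term: 1}],
    20022: [{ts: 1, term: 1}],
};

// Simulated replication round: each lagging node fetches the next
// entry from 20020's oplog.
function fetchRound() {
    const source = oplogs[20020];
    for (const port of [20021, 20022]) {
        const local = oplogs[port];
        if (local.length < source.length) {
            local.push(source[local.length]);
        }
    }
}

const allFetched = () =>
    Object.values(oplogs).every((oplog) => hasEntry(oplog, startOfTerm));

// Only once this returns true would 20021 be stepped up.
const synced = waitUntil(allFetched, fetchRound);
console.log(synced);
```

In an actual jstest, the shell's `assert.soon` or `ReplSetTest`'s `awaitReplication` would likely play the role of `waitUntil`; which of them fits this particular test is an assumption, not something the ticket confirms.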