-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: 3.4.1
-
Component/s: Replication
-
None
-
ALL
-
In a replication scenario with a heavy insert load on the primary node, a secondary is not always able to replicate all inserts in time. The documentation on the RECOVERING state clearly says that in this situation the secondary should transition to RECOVERING and manual intervention is required:
Due to overload, a secondary may fall far enough behind the other members of the replica set such that it may need to resync with the rest of the set. When this happens, the member enters the RECOVERING state and requires manual intervention.
But in reality, the secondary node sometimes attempts a rollback in this situation, which leads to a failed rollback and an incorrect node state.
I believe the problem is in the BackgroundSync::_produce function in src/mongo/db/repl/bgsync.cpp. When the OplogStartMissing status code is returned by _syncSourceResolver, this function correctly transitions the node to the RECOVERING state. But when the same status code is returned by oplogFetcher, the code executes a rollback without going to the RECOVERING state. I think the oplog fetcher can return OplogStartMissing either when a rollback is necessary or when the secondary has fallen far behind the primary. So there should be an additional check to determine whether a rollback is necessary or the secondary should go to the RECOVERING state.
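The additional check proposed above could look roughly like the following. This is only a minimal sketch under stated assumptions, not the actual server code: NextAction, OplogPosition and onOplogStartMissing are hypothetical names that do not exist in bgsync.cpp, and an oplog position is simplified to a single increasing number. The assumption is that the two cases can be told apart by comparing the secondary's last applied entry with the oldest entry still present in the sync source's oplog.

```cpp
#include <cstdint>

// Hypothetical, simplified model. None of these names come from the MongoDB source.
enum class NextAction { kEnterRecovering, kStartRollback };

struct OplogPosition {
    std::uint64_t timestamp;  // simplified: one monotonically increasing value
};

// Decide what to do after the fetcher reports OplogStartMissing.
//   lastApplied:  newest oplog entry this secondary has applied
//   sourceOldest: oldest entry still present in the sync source's oplog
NextAction onOplogStartMissing(OplogPosition lastApplied, OplogPosition sourceOldest) {
    // Our position is older than anything the source still holds, so we have
    // fallen off the back of its oplog. A rollback cannot find a common point
    // there, so the node should go to RECOVERING instead.
    if (lastApplied.timestamp < sourceOldest.timestamp) {
        return NextAction::kEnterRecovering;
    }
    // The source's oplog still covers our position but does not contain our
    // entry, so the histories have diverged and a rollback is the right path.
    return NextAction::kStartRollback;
}
```

With this shape, a secondary at position 5 syncing from a source whose oldest entry is 10 would go to RECOVERING, while a secondary at position 20 against the same source would start a rollback.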
- duplicates
-
SERVER-27403 Consider term and rbid when validating the proposed sync source
- Closed
- is related to
-
SERVER-28068 Do not go into rollback due to falling off the back of your sync source's oplog
- Closed