Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: 2.4.10, 2.6.0
Component/s: Replication
Operating System: ALL
We observed in production a replica set node going into a FATAL state as a result of a failed oplog query against an inaccessible primary node during the rollback 2 FindCommonPoint phase.
There may be other scenarios in which a replication failure results in a FATAL state, but this is the one instance we have observed in production.
FATAL node logs:
[rsBackgroundSync] replSet rollback 2 FindCommonPoint
[rsBackgroundSync] DBClientCursor::init call() failed
[rsBackgroundSync] replSet remote oplog empty or unreadable
[rsBackgroundSync] replSet error fatal, stopping replication
Primary replica set node relinquishing its PRIMARY status:
[rsMgr] replSet relinquishing primary state
[rsMgr] replSet SECONDARY
[rsMgr] replSet closing client sockets after relinquishing primary
(the fatal node tries unsuccessfully to query this node's oplog while the primary is closing client connections)
Health Poll logs on non-FATAL node in same replica set:
[rsHealthPoll] replSet member (fatal node hostname:port) is now in state FATAL
If there is a way to handle this case more gracefully, it might be possible to avoid going into a FATAL state.
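For illustration only, not the server's actual implementation: below is a minimal, self-contained C++ sketch of one possible graceful handling, using a hypothetical findCommonPointWithRetry helper that retries the remote oplog query a bounded number of times before falling back to the current fatal behaviour, so that a transient failure (such as the old primary closing client sockets while stepping down) does not immediately stop replication.

// Illustrative sketch only; not MongoDB server code.
#include <chrono>
#include <functional>
#include <iostream>
#include <optional>
#include <thread>

// Hypothetical result of a remote oplog query during FindCommonPoint.
struct CommonPoint { long long opTime; };

// Retry the common-point query a bounded number of times before giving up,
// so a transient connection failure does not stop replication permanently.
std::optional<CommonPoint> findCommonPointWithRetry(
        const std::function<std::optional<CommonPoint>()>& queryRemoteOplog,
        int maxAttempts = 3,
        std::chrono::milliseconds backoff = std::chrono::milliseconds(500)) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (auto cp = queryRemoteOplog()) {
            return cp;  // common point found; rollback can proceed
        }
        std::cout << "[rsBackgroundSync] remote oplog query failed (attempt "
                  << attempt << " of " << maxAttempts << "), retrying\n";
        std::this_thread::sleep_for(backoff);
    }
    // Only after all retries fail would the node fall back to the current
    // behaviour (FATAL / stop replication), or ideally re-enter RECOVERING
    // and choose a new sync source.
    return std::nullopt;
}

int main() {
    int calls = 0;
    // Simulated remote oplog query: fails twice (primary closing sockets),
    // then succeeds once a sync source is reachable again.
    auto query = [&calls]() -> std::optional<CommonPoint> {
        ++calls;
        if (calls < 3) {
            return std::nullopt;  // simulate "DBClientCursor::init call() failed"
        }
        return CommonPoint{12345};
    };
    if (auto cp = findCommonPointWithRetry(query)) {
        std::cout << "common point at opTime " << cp->opTime << "\n";
    } else {
        std::cout << "replSet error fatal, stopping replication\n";
    }
}

Even a couple of retries, or dropping back to RECOVERING and reselecting a sync source, would cover the short window in which the old primary is closing its client connections during step-down.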
is related to
- SERVER-15089 Thread applier (bgsync) through replication coordinator (Closed)
related to
- SERVER-18035 Data Replicator: Refactor Rollback Code (Closed)
- SERVER-5930 rollback loop should be smarter (Closed)