Type: Bug
Resolution: Duplicate
Priority: Major - P3
Affects Version/s: 3.2.10
Component/s: Replication
Operating System: ALL
(copied to CRM)
Under PV1, when using a PSA (or PSSSA) replset spread across three data centres, the primary node flaps between DC1 and DC2 every 10 seconds during a netsplit between DC1 and DC2. Each data centre receives roughly half the writes (assuming roughly constant write traffic). When the netsplit is resolved, the writes accepted in the data centre that ends up without the primary are rolled back.
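For concreteness, here is a minimal sketch of the assumed topology in the mongo shell (hostnames and ports are illustrative, not taken from this report). Note that the pv1 default of settings.electionTimeoutMillis = 10000 ms matches the 10-second flapping interval described below.

rs.initiate({
  _id: "rs0",
  protocolVersion: 1,
  members: [
    { _id: 0, host: "dc1-node.example:27017" },                      // data-bearing, DC1
    { _id: 1, host: "dc2-node.example:27017" },                      // data-bearing, DC2
    { _id: 2, host: "dc3-arbiter.example:27017", arbiterOnly: true } // arbiter, DC3
  ],
  settings: { electionTimeoutMillis: 10000 } // pv1 default; drives the 10 s cycle
})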
When the netsplit occurs, the following sequence of events happens:
1. The secondary in DC2 is unable to contact a primary for 10 seconds (the pv1 election timeout) and calls an election for a new term.
2. The DC3 arbiter announces the new term to DC1.
3. The DC1 primary steps down.
4. Client connections are dropped.
5. The node in DC2 is elected primary.
6. Clients reconnect and find DC2 is now primary; DC2 starts accepting writes (see the write-concern sketch after this list).
7. 10 seconds later, the node in DC1 still hasn't been able to contact a primary, and the process repeats in the other direction.
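Because every write accepted on the minority side of the split is eventually rolled back, clients that cannot tolerate losing acknowledged writes can use a majority write concern. In a PSA set the arbiter cannot acknowledge writes, so a majority write needs both data-bearing nodes; during the netsplit it times out instead of being acknowledged and later rolled back. A sketch in the mongo shell (collection and document are made up):

db.test.insert(
  { x: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } } // fails with wtimeout during the split rather than silently losing the write
)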
Here is a snippet of logs from the arbiter demonstrating the flapping behaviour:
2016-10-19T22:49:47.655+0000 I REPL [ReplicationExecutor] Member 10.0.0.102:27018 is now in state SECONDARY
2016-10-19T22:49:47.669+0000 I REPL [ReplicationExecutor] Member 10.0.0.101:27017 is now in state PRIMARY
2016-10-19T22:49:57.672+0000 I REPL [ReplicationExecutor] Member 10.0.0.102:27017 is now in state PRIMARY
2016-10-19T22:50:02.672+0000 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to 10.0.0.101:27017
2016-10-19T22:50:02.672+0000 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-10-19T22:50:02.673+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 10.0.0.101:27017
2016-10-19T22:50:02.674+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.0.0.101:27017
2016-10-19T22:50:02.675+0000 I REPL [ReplicationExecutor] Member 10.0.0.101:27017 is now in state SECONDARY
2016-10-19T22:50:12.676+0000 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to 10.0.0.102:27017
2016-10-19T22:50:12.676+0000 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2016-10-19T22:50:12.676+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to 10.0.0.102:27017
2016-10-19T22:50:12.677+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to 10.0.0.102:27017
2016-10-19T22:50:12.677+0000 I REPL [ReplicationExecutor] Member 10.0.0.101:27018 is now in state PRIMARY
2016-10-19T22:50:12.678+0000 I REPL [ReplicationExecutor] Member 10.0.0.102:27017 is now in state SECONDARY
2016-10-19T22:50:22.665+0000 I REPL [ReplicationExecutor] Member 10.0.0.102:27018 is now in state PRIMARY
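The same flapping is also visible from any client by polling isMaster; a quick mongo shell loop (the 1-second interval is arbitrary):

while (true) {
  print(new Date().toISOString() + " primary: " + tojson(db.isMaster().primary));
  sleep(1000); // the reported primary should alternate between DC1 and DC2 roughly every 10 s
}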
N.B. Flapping does not occur with PSS/PV1 or PSA/PV0.
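Given that note, one possible workaround until SERVER-27125 is addressed is to reconfigure the set back to protocol version 0, which is essentially what SERVER-26725 proposes automating. A sketch, run against the current primary:

cfg = rs.conf()
cfg.protocolVersion = 0
rs.reconfig(cfg) // or rs.reconfig(cfg, { force: true }) from a secondary if no primary is reachable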
duplicates:
- SERVER-27125 Arbiters in pv1 should vote no in elections if they can see a healthy primary of equal or greater priority to the candidate (Closed)
related to:
- SERVER-14539 Full consensus arbiter (i.e. uses an oplog) (Backlog)
- SERVER-26728 Add jstest that primary doesn't flap in PSA configuration with partition between the two data bearing nodes (Closed)
- SERVER-26725 Automatically reconfig pv1 replica sets using priorities or arbiters to pv0 (Closed)