Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: Replication
Arbiters with Instant Replay
Introduction
This document proposes an alternative to traditional 3-node clusters (called replica sets) used for achieving high reliability / high availability MongoDB configurations. The idea consists of two parts: have an arbiter with a log of write operations, and have failed nodes return to a majority-confirmed consistent state without refetching documents. The goal is to improve performance and availability while maintaining sufficient reliability guarantees, in particular in single data-center configurations.
MongoDB replica sets consist of a number of nodes, each running a mongod instance that maintains a full copy of the database. A single mongod is elected primary and is the only node that accepts writes. Replica sets with an even number of data-bearing nodes typically use an additional non-data-bearing arbiter node as a tiebreaker during elections for primary.
The secondary nodes replicate all writes by replaying the write operations recorded in the primary's oplog. A database client can attach a write concern to its write operations; a majority write concern requires a write to be recorded by a majority of nodes before it is considered successful, giving full protection against data loss from single-node failures.
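For illustration, issuing such a majority write from a client might look like the following pymongo sketch (the connection string, database, and collection names are placeholders):

```python
# Minimal sketch (pymongo): a write that must be acknowledged by a majority of
# replica-set members before it is reported as successful. The connection
# string, database, and collection names are placeholders.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
orders = client.get_database("shop").get_collection(
    "orders", write_concern=WriteConcern(w="majority", wtimeout=5000)
)

# Returns only after a majority of members have recorded the corresponding
# oplog entry (or raises on write-concern timeout).
orders.insert_one({"sku": "A-100", "qty": 2})
```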
When a node recovers from an unexpected shutdown or hardware failure, it first finds the latest common entry between its oplog and that of another node in the cluster (its sync source). It then proceeds as follows (see the sketch after this list):
- roll back all local (orphaned) changes in the oplog by truncating the oplog and fetching fresh copies of any documents those changes referenced, to restore consistency
- retrieve any new oplog entries from the sync source and apply them locally
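A schematic of this recovery flow, with the oplogs modeled as in-memory lists of (timestamp, operation) pairs; the entry format and helper names are illustrative and do not correspond to actual server internals:

```python
# Schematic of the existing recovery flow. Oplogs are modeled as lists of
# (timestamp, op) pairs ordered by timestamp; all names here are illustrative.

def common_point(local_oplog, source_oplog):
    """Return the timestamp of the latest entry shared by both oplogs."""
    source_ts = {ts for ts, _ in source_oplog}
    shared = [ts for ts, _ in local_oplog if ts in source_ts]
    return max(shared) if shared else None

def recover(local_oplog, source_oplog, refetch_document, apply_op):
    cp = common_point(local_oplog, source_oplog)
    assert cp is not None, "no common point: node must perform a full resync"

    # 1. Roll back orphaned local writes: refetch fresh copies of the documents
    #    those writes touched, then truncate the oplog past the common point.
    for ts, op in local_oplog:
        if ts > cp:
            refetch_document(op["ns"], op["_id"])
    local_oplog[:] = [(ts, op) for ts, op in local_oplog if ts <= cp]

    # 2. Catch up: pull entries newer than the common point from the sync
    #    source and apply them locally.
    for ts, op in source_oplog:
        if ts > cp:
            apply_op(op)
            local_oplog.append((ts, op))
```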
The 3-node replica set
In this configuration, three mongod instances each maintain a full copy of the database, and the set retains full write availability in the degraded state with a single node down.
However, this guarantee comes at the significant cost of tripling the amount of storage required. Additionally, majority writes incur not just the latency of reaching stable storage on the local node, but also the latency of crossing the network to a secondary, committing to stable storage there, and receiving the acknowledgement of the successful commit. (These latencies typically overlap in time, however.)
The 2-node replica set with an arbiter
An alternative to the full 3-node setup is to have two data-bearing nodes and an arbiter. The arbiter serves as a tie-breaking vote when electing a new primary if the existing one fails.
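For reference, initiating such a set from a client might look like the following sketch (hostnames and the set name are placeholders; replSetInitiate is the standard server command):

```python
# Sketch: initiating a replica set with two data-bearing members and one
# arbiter. Hostnames and the set name are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://node1.example.net:27017", directConnection=True)
client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "node1.example.net:27017"},
        {"_id": 1, "host": "node2.example.net:27017"},
        {"_id": 2, "host": "arbiter.example.net:27017", "arbiterOnly": True},
    ],
})
```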
This configuration addresses the storage cost by requiring only two full copies of the data, roughly doubling rather than tripling the footprint of a standalone mongod. However, in the degraded state writes can no longer be acknowledged by a majority; such writes risk being rolled back if the remaining node (now the sole primary) fails.
When a failed node comes back online, it must first catch up with the primary before full reliability is restored. If a node goes offline for 10 minutes for scheduled service, and it takes an additional 5 minutes for the node to catch up, the system has been unavailable for majority writes for 15 minutes, and all writes accepted during that period would be lost if the remaining node failed. Contrast this with non-degraded operation, where the typical window of vulnerability to rollback is measured in seconds or even fractions of seconds.
Introducing Arbiters with Instant Replay
The main issue with the arbiters of the previous section is that they are data-less and therefore do not help establish a quorum for majority writes. This section proposes a new kind of arbiter, the arbiter with instant replay. Rather than maintaining a copy of the entire database, this new arbiter maintains just an oplog, and it begins acknowledging writes. However, the arbiter will not discard entries written after the point at which the set became degraded (that is, after a primary was elected without the full number of votes). When its oplog fills up, it instead stops recording and acknowledging writes.
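A minimal sketch of the intended log behavior, assuming a simple in-memory model (class and method names are illustrative, not actual server code):

```python
# Sketch of the proposed arbiter's log behavior (illustrative). The arbiter
# keeps only a bounded oplog: in normal operation the oldest entries are
# discarded to make room, but once the set becomes degraded the entries
# written after that point are pinned, and the arbiter stops acknowledging
# writes when the log is full of pinned entries.

class InstantReplayArbiterLog:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []           # (timestamp, op) pairs, ordered by timestamp
        self.degraded_since = None  # timestamp at which the set became degraded

    def mark_degraded(self, ts):
        self.degraded_since = ts

    def mark_healthy(self):
        self.degraded_since = None

    def append(self, ts, op):
        """Record an oplog entry; return True iff the write is acknowledged."""
        if len(self.entries) >= self.capacity:
            cutoff = self.degraded_since
            # Reclaim space by dropping the oldest entry, but never drop
            # entries written since the set became degraded.
            removable = [e for e in self.entries if cutoff is None or e[0] < cutoff]
            if not removable:
                return False  # log full of pinned entries: stop acknowledging
            self.entries.remove(min(removable, key=lambda e: e[0]))
        self.entries.append((ts, op))
        return True
```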
In addition, a change is needed for regular nodes acting as primary: at all times, a checkpoint of a recent majority-committed snapshot is kept. This allows a node to perform a local rollback without having to fetch the documents that were locally modified. As a result, a recovering node can use any node as a sync source, including arbiters.
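A sketch of the resulting local rollback, assuming the node retains both its oplog and a document snapshot taken at the last majority-committed timestamp (all names are illustrative):

```python
# Sketch of the proposed local rollback (illustrative). Because the node
# retains a snapshot of its documents as of the last majority-committed
# timestamp, orphaned writes can be undone from that local snapshot instead
# of refetching documents over the network.

def rollback_to_majority_checkpoint(data, oplog, checkpoint_docs, checkpoint_ts):
    """Undo every local write newer than the majority-committed checkpoint."""
    for ts, op in oplog:
        if ts > checkpoint_ts:
            doc_id = op["_id"]
            if doc_id in checkpoint_docs:
                data[doc_id] = checkpoint_docs[doc_id]  # restore the old version
            else:
                data.pop(doc_id, None)                  # document did not exist yet
    oplog[:] = [(ts, op) for ts, op in oplog if ts <= checkpoint_ts]
```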
With the proposed changes, arbiters will be able to confirm majority writes. If one node goes down, any writes accepted by the primary in the degraded state will be safe, even if the primary subsequently goes down as well. As long as a majority of nodes (whether arbiters or regular, full data-bearing nodes) are up, the replica set can recover and accept majority writes.
Finally, in normal operation the arbiter is expected to confirm writes much faster than a data-bearing node, since it performs only sequential writes to a log rather than random I/O, which speeds up majority writes.
Is related to: SERVER-14539 Full consensus arbiter (i.e. uses an oplog) - Backlog