This bug does not normally affect the mongo system we have set up. However, when AWS lost power to one of our EBS volumes, it became very apparent that we could not start any more mongos processes, so our production system came down.
Basics
While it is not easy to make AWS lose power to an EBS volume, this bug is very easy to reproduce using NFS and iptables. We'll have one NFS server and one NFS client. The client will run all mongod and mongos instances. The NFS server will host a single share that the client will use for one of the config servers.
NFS Server Setup
sudo apt-get install nfs-kernel-server
sudo mkdir /srv/nfs/mongo
sudo vi /etc/exports
# /etc/exports
/srv/nfs/mongo <IP of NFS client>/32(rw,sync,no_subtree_check,no_root_squash)
sudo /etc/init.d/nfs-kernel-server restart
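Before moving on, the export can be sanity-checked from the server itself (exportfs ships with nfs-kernel-server):

sudo exportfs -v

/srv/nfs/mongo should be listed against the client's IP with the rw,sync options from /etc/exports.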
NFS Client Setup
sudo apt-get install nfs-common
sudo mkdir -p /nfs/mongo
sudo vi /etc/fstab
<IP of NFS server>:/srv/nfs/mongo /nfs/mongo nfs4 _netdev,auto 0 0
sudo mount /nfs/mongo
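A quick check that the share actually mounted on the client:

mount | grep /nfs/mongo

If nothing is printed, recheck /etc/fstab and the export on the server before continuing.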
Mongo Setup (on same server as NFS Client)
sudo mkdir /db/a1
sudo mkdir /db/a2
sudo mkdir /db/a3
sudo mkdir /db/b1
sudo mkdir /db/b2
sudo mkdir /db/b3
sudo mkdir /db/c1
sudo ln -s /nfs/mongo/c2 /db/c2
sudo mkdir /db/c3
sudo mkdir /var/run/mongo
sudo mkdir /db/logs
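The directory layout above can also be created in one pass. A minimal sketch, using a scratch base directory so it can be tried without root (substitute /db and sudo for the real layout); c2 is deliberately excluded because it must be a symlink onto the NFS share, not a plain directory:

```shell
# Create the shard (a1-a3, b1-b3), config (c1, c3) and log directories.
# c2 is skipped: it is a symlink into the NFS mount, created separately.
BASE=$(mktemp -d)
for d in a1 a2 a3 b1 b2 b3 c1 c3 logs; do
  mkdir -p "$BASE/$d"
done
ls "$BASE"
```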
/usr/bin/mongod --configsvr --smallfiles --fork --port 27050 --dbpath /db/c1 --logpath /db/logs/c1.log --logappend --pidfilepath /var/run/mongo/c1.pid --maxConns 1024
/usr/bin/mongod --configsvr --smallfiles --fork --port 27051 --dbpath /db/c2 --logpath /db/logs/c2.log --logappend --pidfilepath /var/run/mongo/c2.pid --maxConns 1024
/usr/bin/mongod --configsvr --smallfiles --fork --port 27052 --dbpath /db/c3 --logpath /db/logs/c3.log --logappend --pidfilepath /var/run/mongo/c3.pid --maxConns 1024
/usr/bin/mongod --shardsvr --smallfiles --fork --port 27150 --dbpath /db/a1 --logpath /db/logs/a1.log --logappend --pidfilepath /var/run/mongo/a1.pid --maxConns 1024 --replSet a
/usr/bin/mongod --shardsvr --smallfiles --fork --port 27151 --dbpath /db/a2 --logpath /db/logs/a2.log --logappend --pidfilepath /var/run/mongo/a2.pid --maxConns 1024 --replSet a
/usr/bin/mongod --shardsvr --smallfiles --fork --port 27152 --dbpath /db/a3 --logpath /db/logs/a3.log --logappend --pidfilepath /var/run/mongo/a3.pid --maxConns 1024 --replSet a
/usr/bin/mongod --shardsvr --smallfiles --fork --port 27250 --dbpath /db/b1 --logpath /db/logs/b1.log --logappend --pidfilepath /var/run/mongo/b1.pid --maxConns 1024 --replSet b
/usr/bin/mongod --shardsvr --smallfiles --fork --port 27251 --dbpath /db/b2 --logpath /db/logs/b2.log --logappend --pidfilepath /var/run/mongo/b2.pid --maxConns 1024 --replSet b
/usr/bin/mongod --shardsvr --smallfiles --fork --port 27252 --dbpath /db/b3 --logpath /db/logs/b3.log --logappend --pidfilepath /var/run/mongo/b3.pid --maxConns 1024 --replSet b
sleep 10
echo "rs.initiate({_id: 'a', members: [{_id: 0, host: 'localhost:27150', priority: 2},{_id: 1, host: 'localhost:27151', priority: 1},{_id: 2, host: 'localhost:27152', priority: 0}]})" | mongo localhost:27150
echo "rs.initiate({_id: 'b', members: [{_id: 0, host: 'localhost:27250', priority: 2},{_id: 1, host: 'localhost:27251', priority: 1},{_id: 2, host: 'localhost:27252', priority: 0}]})" | mongo localhost:27250
sleep 30
echo "db.runCommand({addshard: 'a/localhost:27150'})" | mongo admin
echo "db.runCommand({addshard: 'b/localhost:27250'})" | mongo admin
In a different terminal (one that can be tied up):
/usr/bin/mongos --configdb localhost:27050,localhost:27051,localhost:27052 --fork --logpath /var/log/mongos.log --logappend --port 27017 --maxConns 1024
Notice that mongos starts normally.
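Once mongos is up and both shards have been added, their registration can be confirmed through the same connection:

echo "db.runCommand({listshards: 1})" | mongo admin

Both a and b should appear in the returned shards array.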
Baseline
Connect, using mongo, to the mongos process. Insert some items. Find some items. Do whatever. Notice it all works as expected.
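For example (the database and collection names here are arbitrary):

echo "db.things.insert({x: 1}); db.things.find()" | mongo localhost:27017/test

Both the insert and the find should succeed through mongos.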
Kill the storage backing one of the mongod config servers. On the NFS server:
sudo iptables -I INPUT -s <IP of NFS client>/32 -j DROP
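To verify the simulated storage failure took effect, any access to the share from the client should now hang; wrapping it in timeout (GNU coreutils) makes that visible:

timeout 5 ls /nfs/mongo; echo "exit: $?"

An exit status of 124 means ls hung until timeout killed it, i.e. the config server's data store is effectively dead.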
Connect, reconnect, and query through the mongos process. Notice it all still works as expected.
Bug Manifestation
Kill the mongos process (Ctrl-C should be fine). After it's down, start it up again using the same command as before.
/usr/bin/mongos --configdb localhost:27050,localhost:27051,localhost:27052 --fork --logpath /var/log/mongos.log --logappend --port 27017 --maxConns 1024
Notice that mongos will hang for a minute, and then die.
Expected Outcome
Even though mongos connected successfully to the config server whose data store is down, it should time out on its operations and treat that config server as a downed server; this should result in a successful start of mongos.
Is related to:
- SERVER-6313: config server timeouts not used in all places (Closed)
- SERVER-5064: mongos can't start when one config server is down, only with keyFile options (Closed)