Type: Bug
Resolution: Duplicate
Priority: Major - P3
Affects Version/s: 2.2.0
Component/s: Sharding
Environment: Linux
We have a sharded cluster with two shards, each a replica set of two machines and no arbiter. Upon shutting down one of these machines (the SECONDARY in the 2nd shard, 'RIGHT' below), the PRIMARY of that set also stepped down because it could no longer see a majority. This seems to have really upset the mongos processes, 4 (out of 10) of which segfaulted; the mongos log excerpt follows the sketch below:
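For context, a minimal mongo shell sketch of the kind of two-member set we run. Hostnames match the log, but the config document is a reconstruction, not our actual config, and ARBITER:27018 is a hypothetical host. With two voting members the majority is floor(2/2) + 1 = 2, so losing either member leaves the survivor unable to remain PRIMARY; an arbiter would make it a 3-voter set whose majority of 2 survives the loss of one data-bearing member.

// Reconstructed config for the 'events' shard: two data-bearing
// members, no arbiter. Both members must be visible for the set to
// keep a PRIMARY.
rs.initiate({
  _id: "events",
  members: [
    { _id: 0, host: "LEFT:27018" },
    { _id: 1, host: "RIGHT:27018" }
  ]
})

// Adding an arbiter (hypothetical host) raises the voter count to 3,
// so the PRIMARY survives the loss of one data-bearing member:
rs.addArb("ARBITER:27018")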
Thu Jan 3 16:37:25 [WriteBackListener-RIGHT:27018] DBClientCursor::init call() failed
Thu Jan 3 16:37:25 [WriteBackListener-RIGHT:27018] WriteBackListener exception : DBClientBase::findN: transport error: RIGHT:27018 ns: admin.$cmd query:
Thu Jan 3 16:37:25 [WriteBackListener-LEFT:27018] GLE is { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497925, connectionId: 104511, err: null, ok: 1.0 }
Thu Jan 3 16:37:25 [WriteBackListener-LEFT:27018] GLE is { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497926, connectionId: 104511, err: null, ok: 1.0 }
Thu Jan 3 16:37:25 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497935, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:27 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:29 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355739864367115, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:30 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355744159334401, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:30 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:31 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6082b44377d91a6ecadd3
Thu Jan 3 16:37:31 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:32 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Thu Jan 3 16:37:34 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:37 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083144377d91a6ecadd4
Thu Jan 3 16:37:37 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:39 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:39 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355782814040073, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:41 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355791403974659, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:42 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355795698941961, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:42 [ReplicaSetMonitorWatcher] trying reconnect to RIGHT:27018
Thu Jan 3 16:37:42 [ReplicaSetMonitorWatcher] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
Thu Jan 3 16:37:43 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083744377d91a6ecadd5
Thu Jan 3 16:37:43 [Balancer] events2 is unavailable
Thu Jan 3 16:37:43 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:43 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355799993909263, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:44 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355804288876560, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:45 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:45 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355808583843853, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811139, connectionId: 104511, err: null, ok: 1.0 }
Thu Jan 3 16:37:46 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811145, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811149, connectionId: 104511, err: null, ok: 1.0 }
Thu Jan 3 16:37:47 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355817173778439, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:48 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355821468745731, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:49 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083d44377d91a6ecadd6
Thu Jan 3 16:37:49 [Balancer] events2 is unavailable
Thu Jan 3 16:37:49 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:49 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355825763713029, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:49 [WriteBackListener-LEFT:27018] DBClientCursor::init call() failed
Thu Jan 3 16:37:49 [WriteBackListener-LEFT:27018] WriteBackListener exception : DBClientBase::findN: transport error: LEFT:27018 ns: admin.$cmd query:
Thu Jan 3 16:37:49 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355825763713039, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680325, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] DBClientCursor::init call() failed
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] WriteBackListener exception : DBClientBase::findN: transport error: LEFT:27018 ns: admin.$cmd query:
Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680332, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680336, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:51 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:51 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:51 [conn64348] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:51 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:51 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:51 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init call() failed
Thu Jan 3 16:37:51 [conn64348] trying reconnect to RIGHT:27018
Thu Jan 3 16:37:51 [conn64348] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
Thu Jan 3 16:37:51 [conn64330] trying reconnect to LEFT:27018
Thu Jan 3 16:37:51 [conn64330] reconnect LEFT:27018 ok
Thu Jan 3 16:37:51 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355834353647627, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:52 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:52 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:52 [conn64348] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:52 [conn64348] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:52 [conn64348] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:52 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:52 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355838648614917, connectionId: 202, err: null, ok: 1.0 }
Thu Jan 3 16:37:52 [conn64330] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:52 [conn64330] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:52 [conn64330] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:52 [conn64330] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:52 [conn64330] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:52 [conn64330] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:52 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:52 [conn64320] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:52 [conn64320] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:52 [conn64320] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:52 [conn64320] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:52 [conn64320] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:52 [conn64320] warning: Weird shift of primary detected
Thu Jan 3 16:37:53 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:53 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:53 [conn64348] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:53 [conn64348] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:53 [conn64348] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:53 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:53 [conn64348] trying reconnect to RIGHT:27018
Thu Jan 3 16:37:53 [conn64348] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
Thu Jan 3 16:37:53 [conn64330] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:53 [conn64330] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:53 [conn64330] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:53 [conn64330] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:53 [conn64330] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:53 [conn64330] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:53 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: LEFT:27018
Thu Jan 3 16:37:53 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: caught exception RIGHT:27018 socket exception [FAILED_STATE] for RIGHT:27018
Thu Jan 3 16:37:53 [conn64320] Socket say send() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:53 [conn64320] DBException in process: socket exception [SEND_ERROR] for 10.0.25.172:27018
Thu Jan 3 16:37:54 [conn64330] got not master for: LEFT:27018
Received signal 11
Backtrace: 0x8386d5 0x7fc340c094c0 0xc61e30
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fc340c094c0]
/usr/bin/mongos(_ZTVN5mongo18DBClientConnectionE+0x10)[0xc61e30]
Duplicates: SERVER-7061 - mongos can use invalid ptr to master conn when setShardVersion fails (Closed)