ISSUE SUMMARY
When cleaning up after a chunk migration, if the shard is a replica set the primary node may for the cleanup to be replicated to secondaries. If during this time the primary becomes non-primary, the node throws an exception and aborts.
The only configurations affected by this issue are sharded clusters where shards are replica sets, the balancer is enabled, and chunk migrations have occurred.
USER IMPACT
Under the conditions described above, the cleanup operation fails with an exception and the node aborts.
WORKAROUNDS
A possible workaround is to enable synchronous migration cleanups:
use config db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: true }});
Note however that this makes migration cleanups slower, so this workaround has a performance impact.
AFFECTED VERSIONS
MongoDB 2.6 production releases up to 2.6.3 are affected by this issue.
FIX VERSION
The fix is included in the 2.6.4 production release.
RESOLUTION DETAILS
Catch the exception and log a message to the log file. The affected node is no longer primary, so the operation should not be restarted.
Original description
One of our our mongod's crashed. Thge last log lines are:
2014-06-16T01:06:07.662+0200 [conn170044] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:39 reslen:3982 3926153ms 2014-06-16T01:06:07.662+0200 [conn170732] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 838061, after repl: 838351, at end: 838351 } 2014-06-16T01:06:07.662+0200 [conn170732] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:45 reslen:3982 857713ms 2014-06-16T01:06:07.664+0200 [conn170015] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 4003515, after repl: 4003805, at end: 4003805 } 2014-06-16T01:06:07.664+0200 [conn170015] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:62 reslen:3982 4063527ms 2014-06-16T01:06:07.679+0200 [conn170894] end connection 172.16.64.36:57234 (50 connections now open) 2014-06-16T01:06:07.680+0200 [conn170839] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:53226] 2014-06-16T01:06:07.680+0200 [conn170008] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:34421] 2014-06-16T01:06:07.680+0200 [conn170072] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:40371] 2014-06-16T01:06:07.680+0200 [conn170120] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:44809] 2014-06-16T01:06:07.680+0200 [conn169952] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45250] 2014-06-16T01:06:07.680+0200 [conn169973] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45466] 2014-06-16T01:06:07.680+0200 [conn170677] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:37881] 2014-06-16T01:06:07.680+0200 [conn170193] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:51427] 2014-06-16T01:06:07.680+0200 [conn169990] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:33256] 2014-06-16T01:06:07.680+0200 [conn170545] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50537] 2014-06-16T01:06:07.680+0200 [conn170836] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:52316] 2014-06-16T01:06:07.680+0200 [conn170578] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50862] 2014-06-16T01:06:07.680+0200 [conn169975] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:59916] 2014-06-16T01:06:07.680+0200 [conn170811] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:49906] 2014-06-16T01:06:07.680+0200 [conn170094] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:42575] 2014-06-16T01:06:07.708+0200 [RangeDeleter] 0x11c0e91 0x1163109 0x11477e6 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91] /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109] /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x11477e6] /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab] /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd] /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830] /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd] /usr/bin/mongod() [0x1205829] /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca] /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d] 2014-06-16T01:06:07.731+0200 [RangeDeleter] SEVERE: terminate() called, printing stack (if implemented for platform): 0x11c0e91 0x11c0118 0x7f9f8995c166 0x7f9f8995c193 0x7f9f8995c28e 0x114788e 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91] /usr/bin/mongod() [0x11c0118] /usr/lib/libstdc++.so.6(+0xcb166) [0x7f9f8995c166] /usr/lib/libstdc++.so.6(+0xcb193) [0x7f9f8995c193] /usr/lib/libstdc++.so.6(+0xcb28e) [0x7f9f8995c28e] /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x18e) [0x114788e] /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab] /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd] /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830] /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd] /usr/bin/mongod() [0x1205829] /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca] /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d]
The crashed mongod is member of a replSet consisting of 3 members. The replSet is one of 5 shards. The cluster is running with authentication.
- duplicates
-
SERVER-14274 awaitReplication in LegacyReplicationCoordinator
- Closed
- is related to
-
SERVER-14551 Runner yield during migration cleanup (removeRange) results in fassert
- Closed
- related to
-
SERVER-14389 segmentation fault in RangeDeleter::canEnqueue_inlock
- Closed