The problem is quite the same as SERVER-22618,but it's already closed, so I create a new issue here.
2017-08-14T10:00:53.015+0800 I - [RangeDeleter] Assertion: 64:waiting for replication timed out 2017-08-14T10:00:53.018+0800 I CONTROL [RangeDeleter] 0x16f6552 0x16a0d28 0x16896c8 0x168977c 0xc8cad4 0x1231e5c 0x122f679 0x2001310 0x2b642e152aa1 0x2b642e45093d ----- BEGIN BACKTRACE ----- {"backtrace":[{"b":"400000","o":"12F6552","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"12A0D28","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"12896C8","s":"_ZN5mongo11msgassertedEiPKc"},{"b":"400000","o":"128977C"},{"b":"400000","o":"88CAD4","s":"_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb"},{"b":"400000","o":"E31E5C","s":"_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs"},{"b":"400000","o":"E2F679","s":"_ZN5mongo12RangeDeleter6doWorkEv"},{"b":"400000","o":"1C01310","s":"execute_native_thread_routine"},{"b":"2B642E14B000","o":"7AA1"},{"b":"2B642E368000","o":"E893D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.13", "gitVersion" : "22ec9e93b40c85fc7cae7d56e7d6a02fd811088c", "compiledModules" : [ "rocks" ], "uname" : { "sysname" : "Linux", "release" : "2.6.32-358.23.2.ali1195.el6.x86_64", "version" : "#1 SMP Wed Oct 29 20:23:37 CST 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF6E0FF000", "elfType" : 3, "buildId" : "64BEF22A9E37EEE1F8FC59AB5AE35209E49D47EA" }, { "b" : "2B642DABA000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "BCA23A35FA002CA6FC960E2794BCA3AE9F5BD371" }, { "b" : "2B642DCC2000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "F9230CF0BAC57ADEF2764E4F59E8556753907D31" }, { "b" : "2B642DEC7000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "C919E2ECEAE285422EC98BF3EB8C089E8DE9F1AA" }, { "b" : "2B642E14B000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "0E51B5AD014B2F88B369C39264F02DE3AFC38A63" }, { "b" : "2B642E368000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "6F269DCAFA81F126A3671240850D8EE9A5AF543A" }, { "b" : "2B642D898000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2946077BEE84E65DD56C7732F2D89F536DBE0966" } ] }} mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x16f6552] mongod(_ZN5mongo10logContextEPKc+0x138) [0x16a0d28] mongod(_ZN5mongo11msgassertedEiPKc+0x88) [0x16896c8] mongod(+0x128977C) [0x168977c] mongod(_ZN5mongo7Helpers11removeRangeEPNS_16OperationContextERKNS_8KeyRangeEbRKNS_19WriteConcernOptionsEPNS0_11RemoveSaverEbb+0x13F4) [0xc8cad4] mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeEPNS_16OperationContextERKNS_16RangeDeleteEntryEPxPSs+0x44C) [0x1231e5c] mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x209) [0x122f679] mongod(execute_native_thread_routine+0x20) [0x2001310] libpthread.so.0(+0x7AA1) [0x2b642e152aa1] libc.so.6(clone+0x6D) [0x2b642e45093d] ----- END BACKTRACE -----
I dive into the source code and find the reason.
In function Helpers::removeRange, it try to ignore the writeconcern timeout error by comparing the error code with ErrorCodes::ExceededTimeLimit.
if (writeConcern.shouldWaitForOtherNodes() && numDeleted > 0) { repl::ReplicationCoordinator::StatusAndDuration replStatus = repl::getGlobalReplicationCoordinator()->awaitReplication( txn, repl::ReplClientInfo::forClient(txn->getClient()).getLastOp(), writeConcern); if (replStatus.status.code() == ErrorCodes::ExceededTimeLimit) { warning(LogComponent::kSharding) << "replication to secondaries for removeRange at " "least 60 seconds behind"; } else { massertStatusOK(replStatus.status); } millisWaitingForReplication += replStatus.duration; }
But in the implement of awaitReplication, it return ErrorCodes::WriteConcernFailed when timeout because of replication lag.
if (writeConcern.wTimeout != WriteConcernOptions::kNoTimeout && elapsed > Milliseconds{writeConcern.wTimeout}) { return StatusAndDuration( Status(ErrorCodes::WriteConcernFailed, "waiting for replication timed out"), elapsed); }
I am trying to create a pull request to fix it, but find the master branch has changed a lot, but both the latest version of 3.2 and 3.4 have this problem.
I am confused why removeRange need assert when awaitReplication failed, maybe we can remove the assert or change the error code of awaitReplication.