I recently ran a --repair on one of my secondaries and got this crash once the oplog had finished being cloned. My hunch is that the cause was insufficient disk space on the volume that holds the oplog (a 50 GB partition for a 30 GB oplog, so just under 2x the space). Still, it would have been nice if this case had been handled gracefully (the oplog wasn't damaged to begin with) instead of failing and leaving the database in a corrupt state.
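For what it's worth, here is a minimal pre-flight sketch of the kind of check I mean; it is my own illustration, not anything mongod does, and the dbpath, connection URI, and the rough 2x free-space factor are all assumptions.

# Hypothetical pre-flight check before running "mongod --repair":
# verify the dbpath partition has roughly 2x the current data size free.
# The URI, dbpath, and 2x safety factor are assumptions for illustration only.
import shutil
from pymongo import MongoClient

DBPATH = "/data"                      # assumed dbpath (holds local/oplog here)
URI = "mongodb://localhost:27017/"    # assumed connection string
SAFETY_FACTOR = 2.0                   # rough "repair needs ~2x" rule of thumb

client = MongoClient(URI)

# Sum dataSize across all databases, since repair rewrites every data file.
total_data = sum(
    client[name].command("dbStats")["dataSize"]
    for name in client.list_database_names()
)

free = shutil.disk_usage(DBPATH).free
needed = total_data * SAFETY_FACTOR

print(f"data: {total_data / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB, "
      f"needed (~{SAFETY_FACTOR}x): {needed / 2**30:.1f} GiB")
if free < needed:
    raise SystemExit("not enough free space on the dbpath partition for --repair")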
Tue Jul 10 14:31:10 [initandlisten] clone local.oplog.rs 31507327
Tue Jul 10 14:31:17 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:25 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:29 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:33 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:31:35 [initandlisten] 32861623 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:31:42 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:10 [initandlisten] clone local.oplog.rs 33154047
Tue Jul 10 14:32:21 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:23 [initandlisten] old journal file will be removed: /data/journal/j._15
Tue Jul 10 14:32:31 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:32:36 [initandlisten] 33745416 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:33:03 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:10 [initandlisten] clone local.oplog.rs 33975423
Tue Jul 10 14:33:24 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:37 [initandlisten] 34269926 objects cloned so far from collection local.oplog.rs
Tue Jul 10 14:33:39 [initandlisten] warning Listener::getElapsedTimeMillis returning 0ms
Tue Jul 10 14:33:43 [initandlisten] build index local.me { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.023 secs
Tue Jul 10 14:33:44 [initandlisten] build index local.replset.minvalid { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0 secs
Tue Jul 10 14:33:44 [initandlisten] build index local.slaves { _id: 1 }
Tue Jul 10 14:33:44 [initandlisten] build index done 1 records 0.001 secs
Tue Jul 10 14:33:49 [initandlisten] Assertion: 14043:delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local"
0x584102 0x8bbb38 0x8bcb49 0xa91ae8 0xa92d9a 0xa94874 0xa951ad 0xa9eeda 0x7f20448c630d 0x501409
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x584102]
 /usr/bin/mongod(_ZN5mongo16_deleteDataFilesEPKc+0x298) [0x8bbb38]
 /usr/bin/mongod(_ZN5mongo14repairDatabaseESsRSsbb+0x7b9) [0x8bcb49]
 /usr/bin/mongod(_ZN5mongo11doDBUpgradeERKSsSsPNS_14DataFileHeaderE+0x68) [0xa91ae8]
 /usr/bin/mongod() [0xa92d9a]
 /usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x404) [0xa94874]
 /usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1d) [0xa951ad]
 /usr/bin/mongod(main+0x9aaa) [0xa9eeda]
 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f20448c630d]
 /usr/bin/mongod(__gxx_personality_v0+0x3f1) [0x501409]
Tue Jul 10 14:33:49 [initandlisten] exception in initAndListen: 14043 delete data files with a directoryperdb caught boost exception: boost::filesystem::remove: Device or resource busy: "/data/local", terminating
After restarting the server following that error, I got the crash below. I'm guessing it's just a byproduct of the corruption left behind by the failed repair, but I figured it could be helpful. After a third restart, the node successfully connected to the replica set primary and began a resync.
Tue Jul 10 14:39:33 [initandlisten] connection accepted from 10.170.90.116:50480 #101
Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-168-160.us-west-1.compute.amazonaws.com:27017
Tue Jul 10 14:39:34 [rsStart] trying to contact ec2-50-18-235-86.us-west-1.compute.amazonaws.com:27017
Tue Jul 10 14:39:34 Invalid access at address: 0xfffffff7
Tue Jul 10 14:39:34 Got signal: 11 (Segmentation fault).
Tue Jul 10 14:39:34 Backtrace:
0xa90999 0xa90f70 0x7fb6904a4060 0x7fb6902314fb 0x7c80cd 0x7c9a47 0x7c9f3d 0xaab3e0 0x7fb69049befc 0x7fb68fa3589d
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa90999]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa90f70]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10060) [0x7fb6904a4060]
 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSsC1ERKSs+0xb) [0x7fb6902314fb]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10loadConfigEv+0x32d) [0x7c80cd]
 /usr/bin/mongod(_ZN5mongo11ReplSetImplC2ERNS_14ReplSetCmdlineE+0x3f7) [0x7c9a47]
 /usr/bin/mongod(_ZN5mongo13startReplSetsEPNS_14ReplSetCmdlineE+0x5d) [0x7c9f3d]
 /usr/bin/mongod(thread_proxy+0x80) [0xaab3e0]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7fb69049befc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb68fa3589d]
Logstream::get called in uninitialized state
Tue Jul 10 14:39:34 ERROR: Client::shutdown not called: rsStart
Is related to: SERVER-3759 filesystem ops may cause termination when no space left on device (Closed)