-
Type: Bug
-
Resolution: Done
-
Priority: Critical - P2
-
Affects Version/s: 2.6.4, 2.8.0-rc0
-
Component/s: Replication, Sharding
-
None
-
Backwards Compatibility: Fully Compatible
-
Operating System: ALL
-
Original title:
rs.stepDown during migration causes fassert in logOp
Original description:
At 3am cron issues an rs.stepDown() as part of weekly compaction. Having read SERVER-14186 I assume the migrate chunk errors are not relevant, but I've left them in the logs in case (a sketch of the scheduled stepDown follows the logs below):

2014-10-04T03:04:57.225+0000 [migrateThread] doc_metadata.metadata caught assertion addKeysToIndex doc_metadata.metadata.$url_1_sourceKey_1_id: ObjectId('519239f8e4b0be945d6b9fe6')
2014-10-04T03:04:57.225+0000 [migrateThread] scoped connection to replica_set1/10.40.131.234:27018,10.40.213.215:27018 not being returned to the pool
2014-10-04T03:04:57.225+0000 [migrateThread] about to log metadata event: { _id: "ip-10-60-18-179-2014-10-04T03:04:57-542f63d9417de929885b7e2e", server: "ip-10-60-18-179", clientAddr: ":27017", time: new Date(1412391897225), what: "moveChunk.to", ns: "doc_metadata.metadata", details: { min: { _id: ObjectId('5191d4f3e4b07927e24302bc') }, max: { _id: ObjectId('51926d79e4b056bb2e3823fc') }, step 1 of 5: 2, step 2 of 5: 5296, note: "aborted" } }
2014-10-04T03:04:57.258+0000 [migrateThread] ERROR: migrate failed: insertDocument :: caused by :: 17280 Btree::insert: key too large to index, failing doc_metadata.metadata.$url_1_sourceKey_1 1163 { : "http://www.shippingonline.cn/Yellowpages/world_list.asp?page=164&user_addr='''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''...", : "www.google.com.search.64.84." }
2014-10-04T03:05:01.748+0000 [conn528628] end connection 10.226.114.59:33504 (272 connections now open)
2014-10-04T03:05:01.748+0000 [initandlisten] connection accepted from 10.226.114.59:33525 #528632 (273 connections now open)
2014-10-04T03:05:01.887+0000 [conn478926] remotely refreshing metadata for doc_metadata.metadata based on current shard version 660|1||53da748318304c3808174ba3, current metadata version is 660|7||53da748318304c3808174ba3
2014-10-04T03:05:01.889+0000 [conn478926] metadata of collection doc_metadata.metadata already up to date (shard version : 660|1||53da748318304c3808174ba3, took 1ms)
2014-10-04T03:05:01.903+0000 [migrateThread] starting receiving-end of migration of chunk { _id: ObjectId('5191d4f3e4b07927e24302bc') } -> { _id: ObjectId('51926d79e4b056bb2e3823fc') } for collection doc_metadata.metadata from replica_set1/10.40.131.234:27018,10.40.213.215:27018 at epoch 53da748318304c3808174ba3
2014-10-04T03:05:01.985+0000 [conn528629] end connection 10.245.78.137:48391 (272 connections now open)
2014-10-04T03:05:01.987+0000 [initandlisten] connection accepted from 10.245.78.137:48402 #528633 (273 connections now open)
2014-10-04T03:05:05.819+0000 [initandlisten] connection accepted from 127.0.0.1:37058 #528634 (274 connections now open)
2014-10-04T03:05:05.824+0000 [conn528634] end connection 127.0.0.1:37058 (273 connections now open)
2014-10-04T03:05:05.900+0000 [initandlisten] connection accepted from 127.0.0.1:37059 #528635 (274 connections now open)
2014-10-04T03:05:05.929+0000 [conn528635] 10.226.114.59:27019 is now closest at 542f63e1:37
2014-10-04T03:05:05.929+0000 [conn528635] replSet info stepping down as primary secs=60
2014-10-04T03:05:05.939+0000 [conn528635] replSet relinquishing primary state
2014-10-04T03:05:05.939+0000 [conn528635] replSet SECONDARY
(lots of)
2014-10-04T03:05:05.940+0000 [conn478632] end connection 10.40.213.215:59486 (273 connections now open)
2014-10-04T03:05:05.951+0000 [migrateThread] replSet error : logOp() but not primary
2014-10-04T03:05:05.951+0000 [migrateThread] doc_metadata.metadata Fatal Assertion 17405
2014-10-04T03:05:06.000+0000 [initandlisten] connection accepted from 10.40.131.234:33182 #528636 (17 connections now open)
2014-10-04T03:05:06.008+0000 [initandlisten] connection accepted from 127.0.0.1:37060 #528637 (18 connections now open)
2014-10-04T03:05:06.067+0000 [conn528635] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:37059]
2014-10-04T03:05:06.081+0000 [initandlisten] connection accepted from 10.226.114.59:33553 #528638 (18 connections now open)
2014-10-04T03:05:06.126+0000 [migrateThread] doc_metadata.metadata 0x11e6111 0x1187e49 0x116a97d 0xe577aa 0xe5000e 0xa4d139 0xfd5594 0xfda178 0xfc5839 0x122aaa9 0x7f72b0715851 0x7f72afab811d
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
 mongod(_ZN5mongo10logContextEPKc+0x159) [0x1187e49]
 mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x116a97d]
 mongod() [0xe577aa]
 mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_PbbPS3_+0xee) [0xe5000e]
 mongod(_ZN5mongo7Helpers11removeRangeERKNS_8KeyRangeEbbPNS0_11RemoveSaverEbb+0xb09) [0xa4d139]
 mongod(_ZN5mongo13MigrateStatus3_goEv+0x1644) [0xfd5594]
 mongod(_ZN5mongo13MigrateStatus2goEv+0x28) [0xfda178]
 mongod(_ZN5mongo13migrateThreadEv+0x59) [0xfc5839]
 mongod() [0x122aaa9]
 /lib64/libpthread.so.0(+0x7851) [0x7f72b0715851]
 /lib64/libc.so.6(clone+0x6d) [0x7f72afab811d]
2014-10-04T03:05:06.126+0000 [migrateThread] ***aborting after fassert() failure
2014-10-04T03:05:06.133+0000 [migrateThread] SEVERE: Got signal: 6 (Aborted). Backtrace:
 0x11e6111 0x11e54ee 0x7f72afa02920 0x7f72afa028a5 0x7f72afa04085 0x116a9ea 0xe577aa 0xe5000e 0xa4d139 0xfd5594 0xfda178 0xfc5839 0x122aaa9 0x7f72b0715851 0x7f72afab811d
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
 mongod() [0x11e54ee]
 /lib64/libc.so.6(+0x32920) [0x7f72afa02920]
 /lib64/libc.so.6(gsignal+0x35) [0x7f72afa028a5]
 /lib64/libc.so.6(abort+0x175) [0x7f72afa04085]
 mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x116a9ea]
 mongod() [0xe577aa]
 mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_PbbPS3_+0xee) [0xe5000e]
 mongod(_ZN5mongo7Helpers11removeRangeERKNS_8KeyRangeEbbPNS0_11RemoveSaverEbb+0xb09) [0xa4d139]
 mongod(_ZN5mongo13MigrateStatus3_goEv+0x1644) [0xfd5594]
 mongod(_ZN5mongo13MigrateStatus2goEv+0x28) [0xfda178]
 mongod(_ZN5mongo13migrateThreadEv+0x59) [0xfc5839]
 mongod() [0x122aaa9]
 /lib64/libpthread.so.0(+0x7851) [0x7f72b0715851]
 /lib64/libc.so.6(clone+0x6d) [0x7f72afab811d]

Other issue: the primary node was running at 200%, repeating the same log messages over and over again, 10x per second:
2014-10-24T13:02:10.289+0000 [rsHealthPoll] warning: Failed to connect to 10.60.18.179:27019, reason: errno:111 Connection refused
2014-10-24T13:02:10.289+0000 [rsHealthPoll] couldn't connect to 10.60.18.179:27019: couldn't connect to server 10.60.18.179:27019 (10.60.18.179) failed, connection attempt failed
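
For reference, a minimal sketch of the kind of scheduled step-down the description mentions, written for the mongo shell. The port, script name, crontab schedule, and follow-up compact command are illustrative assumptions; only the rs.stepDown() call and the 60-second window come from the description and the "replSet info stepping down as primary secs=60" log line above.

// weekly-stepdown.js (hypothetical) - run by cron against the primary, e.g.:
//   0 3 * * 6  mongo --port 27018 /path/to/weekly-stepdown.js
try {
    // Ask the current primary to relinquish its role for 60 seconds
    // so weekly maintenance (e.g. compact) can run against it.
    rs.stepDown(60);
} catch (e) {
    // Older shells raise a network error when the stepped-down node
    // closes its connections; that is expected here.
    print("stepDown returned: " + e);
}
// db.getSiblingDB("doc_metadata").runCommand({ compact: "metadata" }) would follow
// on the stepped-down member (illustrative, not from this ticket).

In the backtrace above this step-down races with the receiving end of the chunk migration: the migrateThread's cleanup path (Helpers::removeRange) calls logOp() after the node has relinquished primary state, which trips Fatal Assertion 17405, matching the missing-primary check described in the linked SERVER-16115.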
- is duplicated by: SERVER-16115 Helpers::removeRange should check if master (Closed)
- is related to: SERVER-14551 Runner yield during migration cleanup (removeRange) results in fassert (Closed)