Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-17164

mongod sometimes crashes when restarting to disable auth

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • 3.0.0-rc8
    • Affects Version/s: 3.0.0-rc6, 3.0.0-rc7
    • Component/s: Replication
    • None
    • Fully Compatible
    • ALL
    • Hide
      1. Set up a sharded cluster with auth enabled with 2 mongoses, 3 config servers, and 2 shards each with 3 members
      2. Add some users, including an admin user
      3. Modify all process config files to have auth disabled
      4. Restart all processes following the partial order specified above in the ticket's Description
      5. Sometimes one of the replica set members will crash

      If you have trouble reproducing the crash, I can help you setup the automation test suite where this crash sometimes occurs.

      Show
      Set up a sharded cluster with auth enabled with 2 mongoses, 3 config servers, and 2 shards each with 3 members Add some users, including an admin user Modify all process config files to have auth disabled Restart all processes following the partial order specified above in the ticket's Description Sometimes one of the replica set members will crash If you have trouble reproducing the crash, I can help you setup the automation test suite where this crash sometimes occurs.

      Sometimes a 3.0.0-rc7 mongod that is part of a replica set which is part of a sharded cluster will crash when it is restarted in order to disable auth.

      I discovered this while running Automation's test suite. Although the mongod that crashes is part of a replica set, the crash does not appear to happen when the replica set is not part of a sharded cluster.

      The crash also happens with 3.0.0-rc6.

      The test that triggers it sets up a sharded cluster with auth enabled. Users are added. Then all processes are restarted with auth disabled. There is a partial order to the stopping and starting. Replica set members can stop and start immediately. Mongoses can stop immediately and can each start only after all other mongoses and config servers are either stopped or have restarted. Config servers can stop only after all mongoses have stopped. Config servers can each start after all other mongoses and config servers are either stopped or have restarted.

      Here is the config file of a replica set member before it was restarted (with auth on):

      net:
        port: 9003
      processManagement:
        fork: "true"
      replication:
        oplogSizeMB: 64
        replSetName: a
      security:
        authorization: enabled
        keyFile: /tmp/mms-automation/test/output/keyfile.keyfile
      storage:
        dbPath: /tmp/mms-automation/test/output/data/process9003
        journal:
          enabled: false
        preallocDataFiles: true
        smallFiles: true
      systemLog:
        destination: file
        path: /tmp/mms-automation/test/logs/run9003
      

      Here is the new config file (with auth off) for the same replica set member:

      net:
        port: 9003
      processManagement:
        fork: "true"
      replication:
        oplogSizeMB: 64
        replSetName: a
      storage:
        dbPath: /tmp/mms-automation/test/output/data/process9003
        journal:
          enabled: false
        preallocDataFiles: true
        smallFiles: true
      systemLog:
        destination: file
        path: /tmp/mms-automation/test/logs/run9003
      

      And here is the log after the process attempts to start with the new config file:

      2015-02-03T12:19:18.945-0500 I CONTROL  [initandlisten] MongoDB starting : pid=85055 port=9003 dbpath=/tmp/mms-automation/test/output/data/process9003 64-bit host=neurofunk.local
      2015-02-03T12:19:18.945-0500 I CONTROL  [initandlisten] db version v3.0.0-rc7
      2015-02-03T12:19:18.945-0500 I CONTROL  [initandlisten] git version: e4c60053b2967e16f765fa25d16aa6d629faa196
      2015-02-03T12:19:18.945-0500 I CONTROL  [initandlisten] build info: Darwin bs-osx108-2 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64 BOOST_LIB_VERSION=1_49
      2015-02-03T12:19:18.945-0500 I CONTROL  [initandlisten] allocator: system
      2015-02-03T12:19:18.945-0500 I CONTROL  [initandlisten] options: { config: "/tmp/mms-automation/test/output/data/process9003/automation-mongod.conf", net: { port: 9003 }, processManagement: { fork: true }, replication: { oplogSizeMB: 64, replSetName: "a" }, storage: { dbPath: "/tmp/mms-automation/test/output/data/process9003", journal: { enabled: false }, mmapv1: { preallocDataFiles: true, smallFiles: true } }, systemLog: { destination: "file", path: "/tmp/mms-automation/test/logs/run9003" } }
      2015-02-03T12:19:18.975-0500 I NETWORK  [initandlisten] waiting for connections on port 9003
      2015-02-03T12:19:18.977-0500 I NETWORK  [initandlisten] connection accepted from 10.4.102.189:52938 #1 (1 connection now open)
      2015-02-03T12:19:18.982-0500 I NETWORK  [initandlisten] connection accepted from 10.4.102.189:52939 #2 (2 connections now open)
      2015-02-03T12:19:19.008-0500 I REPL     [ReplicationExecutor] new replica set config in use: { _id: "a", version: 1, members: [ { _id: 0, host: "neurofunk.local:9001", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "neurofunk.local:9002", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "neurofunk.local:9003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
      2015-02-03T12:19:19.008-0500 I REPL     [ReplicationExecutor] transition to STARTUP2
      2015-02-03T12:19:19.009-0500 I REPL     [ReplicationExecutor] Starting replication applier threads
      2015-02-03T12:19:19.009-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
      2015-02-03T12:19:19.011-0500 I REPL     [ReplicationExecutor] transition to SECONDARY
      2015-02-03T12:19:19.040-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to neurofunk.local:9001; Unauthorized not authorized on admin to execute command { replSetHeartbeat: "a", pv: 1, v: 1, from: "neurofunk.local:9003", fromId: 2, checkEmpty: false }
      2015-02-03T12:19:19.040-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
      2015-02-03T12:19:19.040-0500 I REPL     [ReplicationExecutor] Member neurofunk.local:9002 is now in state RECOVERING
      2015-02-03T12:19:19.040-0500 I REPL     [ReplicationExecutor] Standing for election
      2015-02-03T12:19:19.041-0500 I REPL     [ReplicationExecutor] not electing self, neurofunk.local:9002 would veto with 'I don't think neurofunk.local:9003 is electable because the member is not currently a secondary; member is more than 10 seconds behind the most up-to-date member (mask 0xA)'
      2015-02-03T12:19:19.042-0500 I REPL     [ReplicationExecutor] not electing self, we are not freshest
      2015-02-03T12:19:19.065-0500 I ACCESS   [conn1] Successfully authenticated as principal testuser on admin
      2015-02-03T12:19:19.066-0500 I ACCESS   [conn2] Successfully authenticated as principal testuser on admin
      2015-02-03T12:19:19.450-0500 I NETWORK  [initandlisten] connection accepted from 10.4.102.189:53033 #3 (3 connections now open)
      2015-02-03T12:19:20.669-0500 I NETWORK  [initandlisten] connection accepted from 10.4.102.189:52767 #4 (4 connections now open)
      2015-02-03T12:19:20.713-0500 I NETWORK  [initandlisten] connection accepted from 10.4.102.189:53143 #5 (5 connections now open)
      2015-02-03T12:19:20.717-0500 I ACCESS   [conn5] SCRAM-SHA-1 authentication failed for __system on local ; AuthenticationFailed It is not possible to authenticate as the __system user on servers started without a --keyFile parameter
      2015-02-03T12:19:20.718-0500 I NETWORK  [conn5] end connection 10.4.102.189:53143 (4 connections now open)
      2015-02-03T12:19:20.853-0500 I NETWORK  [initandlisten] connection accepted from 10.4.102.189:53175 #6 (5 connections now open)
      2015-02-03T12:19:20.921-0500 I NETWORK  [initandlisten] connection accepted from 10.4.102.189:53197 #7 (6 connections now open)
      2015-02-03T12:19:21.045-0500 I REPL     [ReplicationExecutor] Member neurofunk.local:9002 is now in state SECONDARY
      2015-02-03T12:19:21.045-0500 I REPL     [ReplicationExecutor] Standing for election
      2015-02-03T12:19:21.045-0500 I REPL     [ReplicationExecutor] Error in heartbeat request to neurofunk.local:9001; Unauthorized not authorized on admin to execute command { replSetHeartbeat: "a", pv: 1, v: 1, from: "neurofunk.local:9003", fromId: 2, checkEmpty: false }
      2015-02-03T12:19:21.046-0500 I REPL     [ReplicationExecutor] transition to SECONDARY
      2015-02-03T12:19:21.046-0500 I REPL     [ReplicationExecutor] transition to PRIMARY
      2015-02-03T12:19:21.047-0500 I REPL     [ReplicationExecutor] replSet possible election tie; sleeping 212ms until 2015-02-03T12:19:21.258-0500
      2015-02-03T12:19:21.047-0500 I -        [ReplicationExecutor] Invariant failure _role == Role::candidate src/mongo/db/repl/topology_coordinator_impl.cpp 1897
      2015-02-03T12:19:21.049-0500 I CONTROL  [ReplicationExecutor] 
       0x1068bb459 0x10686f7c0 0x10685ca93 0x106628f08 0x1065ea2e5 0x106611e84 0x1065f526d 0x1065f80f6 0x1065f2ad7 0x1068eec51 0x7fff9694e2fc 0x7fff9694e279 0x7fff9694c4b1
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"106160000","o":"75B459"},{"b":"106160000","o":"70F7C0"},{"b":"106160000","o":"6FCA93"},{"b":"106160000","o":"4C8F08"},{"b":"106160000","o":"48A2E5"},{"b":"106160000","o":"4B1E84"},{"b":"106160000","o":"49526D"},{"b":"106160000","o":"4980F6"},{"b":"106160000","o":"492AD7"},{"b":"106160000","o":"78EC51"},{"b":"7FFF9694B000","o":"32FC"},{"b":"7FFF9694B000","o":"3279"},{"b":"7FFF9694B000","o":"14B1"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc7", "gitVersion" : "e4c60053b2967e16f765fa25d16aa6d629faa196", "uname" : { "sysname" : "Darwin", "release" : "14.0.0", "version" : "Darwin Kernel Version 14.0.0: Fri Sep 19 00:26:44 PDT 2014; root:xnu-2782.1.97~2/RELEASE_X86_64", "machine" : "x86_64" }, "somap" : [ { "path" : "/tmp/mms-automation/test/versions/mongodb-osx-x86_64-3.0.0-rc7/bin/mongod", "machType" : 2, "b" : "106160000", "buildId" : "0A95B1834CD531D7A030ADBA1D96FB0A" }, { "path" : "/usr/lib/libSystem.B.dylib", "machType" : 6, "b" : "7FFF98912000", "buildId" : "DA954461EC6A3DF085516FC810627627" }, { "path" : "/usr/lib/libc++.1.dylib", "machType" : 6, "b" : "7FFF94BDB000", "buildId" : "1B9530FD989B3174BB1CBDC159501710" }, { "path" : "/usr/lib/system/libcache.dylib", "machType" : 6, "b" : "7FFF946AA000", "buildId" : "45E9A2E799C436B2BEE30C4E11614AD1" }, { "path" : "/usr/lib/system/libcommonCrypto.dylib", "machType" : 6, "b" : "7FFF9B867000", "buildId" : "D381EBC669D831D380845A80A32CB748" }, { "path" : "/usr/lib/system/libcompiler_rt.dylib", "machType" : 6, "b" : "7FFF98914000", "buildId" : "BF8FC133EE103DA69B9092039E28678F" }, { "path" : "/usr/lib/system/libcopyfile.dylib", "machType" : 6, "b" : "7FFF8D368000", "buildId" : "0C68D3A6ACDD3EF3991ACC82C32AB836" }, { "path" : "/usr/lib/system/libcorecrypto.dylib", "machType" : 6, "b" : "7FFF92659000", "buildId" : "E178980139853949B7366B3378873301" }, { "path" : "/usr/lib/system/libdispatch.dylib", "machType" : 6, "b" : "7FFF976D5000", "buildId" : "502CF32B669B3709886208188225E4F0" }, { "path" : "/usr/lib/system/libdyld.dylib", "machType" : 6, "b" : "7FFF91CFA000", "buildId" : "19FAF435C16533749DEFD7BBA7D61DB6" }, { "path" : "/usr/lib/system/libkeymgr.dylib", "machType" : 6, "b" : "7FFF8FBBF000", "buildId" : "77845842DE703CC5BD01C3D14227CED5" }, { "path" : "/usr/lib/system/liblaunch.dylib", "machType" : 6, "b" : "7FFF956C8000", "buildId" : "8A9889248BE735FEBF7D322E90EFE49E" }, { "path" : "/usr/lib/system/libmacho.dylib", "machType" : 6, "b" : "7FFF92DBD000", "buildId" : "126CA2EDDE91308F8881B9DAEC3C63B6" }, { "path" : "/usr/lib/system/libquarantine.dylib", "machType" : 6, "b" : "7FFF912FE000", "buildId" : "DC0416272D92361CBABFA869A5C72293" }, { "path" : "/usr/lib/system/libremovefile.dylib", "machType" : 6, "b" : "7FFF99AD8000", "buildId" : "3485B5F46CE83C628DFD8736ED6E8531" }, { "path" : "/usr/lib/system/libsystem_asl.dylib", "machType" : 6, "b" : "7FFF926D0000", "buildId" : "F153AC5B0542356E88C820A62CA704E2" }, { "path" : "/usr/lib/system/libsystem_blocks.dylib", "machType" : 6, "b" : "7FFF9978E000", "buildId" : "9615D10AFCA73BE4AA1A1B195DACE1A1" }, { "path" : "/usr/lib/system/libsystem_c.dylib", "machType" : 6, "b" : "7FFF9479D000", "buildId" : "C185E86274243210B5286B822577A4B8" }, { "path" : "/usr/lib/system/libsystem_configuration.dylib", "machType" : 6, "b" : "7FFF93738000", "buildId" : "9FBA1CE497D0347EA44393ED94512E92" }, { "path" : "/usr/lib/system/libsystem_coreservices.dylib", "machType" : 6, "b" : "7FFF93907000", "buildId" : "41B7C5785A5331C8A96FC73E030B0938" }, { "path" : "/usr/lib/system/libsystem_coretls.dylib", "machType" : 6, "b" : "7FFF8E6FA000", "buildId" : "EBBF7EF680D83F8F825CB412BD6D22C0" }, { "path" : "/usr/lib/system/libsystem_dnssd.dylib", "machType" : 6, "b" : "7FFF923CA000", "buildId" : "62B70ECAE40D3C63896E7F00EC386DDB" }, { "path" : "/usr/lib/system/libsystem_info.dylib", "machType" : 6, "b" : "7FFF9373B000", "buildId" : "B85A85D585303A93B0C34DEC41F79478" }, { "path" : "/usr/lib/system/libsystem_kernel.dylib", "machType" : 6, "b" : "7FFF95BB1000", "buildId" : "93E0E0A975B63904BB4E4BC7C05F4B6B" }, { "path" : "/usr/lib/system/libsystem_m.dylib", "machType" : 6, "b" : "7FFF978A7000", "buildId" : "1E12AB456D9636D0A226F24D9FB0D9D6" }, { "path" : "/usr/lib/system/libsystem_malloc.dylib", "machType" : 6, "b" : "7FFF99ADA000", "buildId" : "19BCC25757173502A71F95D65AFA861B" }, { "path" : "/usr/lib/system/libsystem_network.dylib", "machType" : 6, "b" : "7FFF8E897000", "buildId" : "C0B2313D47BE38A9BEE62634A4F5E14B" }, { "path" : "/usr/lib/system/libsystem_networkextension.dylib", "machType" : 6, "b" : "7FFF900D0000", "buildId" : "29AB225BD7FB30ED960065D44B9A9442" }, { "path" : "/usr/lib/system/libsystem_notify.dylib", "machType" : 6, "b" : "7FFF9B873000", "buildId" : "61147800F3203DAA850CBADF33855F29" }, { "path" : "/usr/lib/system/libsystem_platform.dylib", "machType" : 6, "b" : "7FFF96A66000", "buildId" : "64E34079D7123D669CE2418624A5C040" }, { "path" : "/usr/lib/system/libsystem_pthread.dylib", "machType" : 6, "b" : "7FFF9694B000", "buildId" : "26B1897F0CD330F3B55A37CB45062D73" }, { "path" : "/usr/lib/system/libsystem_sandbox.dylib", "machType" : 6, "b" : "7FFF8FEEB000", "buildId" : "DB9962EF889831CC9B87E01F8CE74C9D" }, { "path" : "/usr/lib/system/libsystem_secinit.dylib", "machType" : 6, "b" : "7FFF93CB6000", "buildId" : "581DAD0F6B633A48B63B917AF799ABAA" }, { "path" : "/usr/lib/system/libsystem_stats.dylib", "machType" : 6, "b" : "7FFF98D7A000", "buildId" : "1DB0443659743F1686CC5FF5F390339C" }, { "path" : "/usr/lib/system/libsystem_trace.dylib", "machType" : 6, "b" : "7FFF92506000", "buildId" : "A9E6B7D8C3273742AC5486C94218B1DF" }, { "path" : "/usr/lib/system/libunc.dylib", "machType" : 6, "b" : "7FFF9211B000", "buildId" : "5676F7EAC1DF329FB006D2C3022B7D70" }, { "path" : "/usr/lib/system/libunwind.dylib", "machType" : 6, "b" : "7FFF93CB0000", "buildId" : "BE7E51A0B6EA3A549CCA9D88F683A6D6" }, { "path" : "/usr/lib/system/libxpc.dylib", "machType" : 6, "b" : "7FFF9B6C5000", "buildId" : "9437C02EA07B38C891CB299FAA63083D" }, { "path" : "/usr/lib/libobjc.A.dylib", "machType" : 6, "b" : "7FFF8D451000", "buildId" : "3B60CD9074A23A5D9686B0772159792A" }, { "path" : "/usr/lib/libauto.dylib", "machType" : 6, "b" : "7FFF92D76000", "buildId" : "A260789BD4D8316A9490254767B8A5F1" }, { "path" : "/usr/lib/libc++abi.dylib", "machType" : 6, "b" : "7FFF91CCE000", "buildId" : "88A22A0F87C63002BFBAAC0F2808B8B9" }, { "path" : "/usr/lib/libDiagnosticMessagesClient.dylib", "machType" : 6, "b" : "7FFF8E17E000", "buildId" : "2EE8E4365CDC34C599595BA218D507FB" } ] }}
       mongod(_ZN5mongo15printStackTraceERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEE+0x39) [0x1068bb459]
       mongod(_ZN5mongo10logContextEPKc+0x100) [0x10686f7c0]
       mongod(_ZN5mongo15invariantFailedEPKcS1_j+0x183) [0x10685ca93]
       mongod(_ZN5mongo4repl23TopologyCoordinatorImpl19processLoseElectionEv+0xD8) [0x106628f08]
       mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl25_onFreshnessCheckCompleteEv+0x1F5) [0x1065ea2e5]
       mongod(_ZN5mongo4repl19ScatterGatherRunner16_processResponseERKNS0_19ReplicationExecutor25RemoteCommandCallbackDataEPS1_+0xE4) [0x106611e84]
       mongod(_ZN5mongo4replL21remoteCommandFinishedERKNS0_19ReplicationExecutor12CallbackDataERKNSt3__18functionIFvRKNS1_25RemoteCommandCallbackDataEEEERKNS1_20RemoteCommandRequestERKNS_10StatusWithINS1_21RemoteCommandResponseEEE+0x2BD) [0x1065f526d]
       mongod(_ZN5mongo4repl12_GLOBAL__N_112callNoExceptERKNSt3__18functionIFvvEEE+0x16) [0x1065f80f6]
       mongod(_ZN5mongo4repl19ReplicationExecutor3runEv+0x637) [0x1065f2ad7]
       mongod(_ZN5boost12_GLOBAL__N_112thread_proxyEPv+0xB1) [0x1068eec51]
       libsystem_pthread.dylib(_pthread_body+0x83) [0x7fff9694e2fc]
       libsystem_pthread.dylib(_pthread_body+0x0) [0x7fff9694e279]
       libsystem_pthread.dylib(thread_start+0xD) [0x7fff9694c4b1]
      -----  END BACKTRACE  -----
      2015-02-03T12:19:21.049-0500 I -        [ReplicationExecutor] 
      
      ***aborting after invariant() failure
      

            Assignee:
            spencer@mongodb.com Spencer Brody (Inactive)
            Reporter:
            tim.olsen@mongodb.com Timothy Olsen (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: