Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-8074

mongos segfault on replica set member shutdown

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 2.2.0
    • Component/s: Sharding
    • None
    • Linux

      We have a sharded cluster with two shards, each a replica set with two machines. Upon shutting down one of these (SECONDARY in 2nd shard, 'RIGHT' below), the PRIMARY also stepped down due to not seeing a majority (no arbiter). This seems to have really upset the mongos, 4 (out of 10) of which segfaulted:

      Thu Jan 3 16:37:25 [WriteBackListener-RIGHT:27018] DBClientCursor::init call() failed
      Thu Jan 3 16:37:25 [WriteBackListener-RIGHT:27018] WriteBackListener exception : DBClientBase::findN: transport error: RIGHT:27018 ns: admin.$cmd query:

      { writebacklisten: ObjectId('50d0c65444377d91a6e94 862') }

      Thu Jan 3 16:37:25 [WriteBackListener-LEFT:27018] GLE is

      { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497925, connectionId: 104511, err: null, ok: 1.0 }

      Thu Jan 3 16:37:25 [WriteBackListener-LEFT:27018] GLE is

      { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497926, connectionId: 104511, err: null, ok: 1. 0 }

      Thu Jan 3 16:37:25 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497935, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:27 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
      Thu Jan 3 16:37:29 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355739864367115, connectionId: 202, err: null, ok: 1.0
      }Thu Jan 3 16:37:30 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355744159334401, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:30 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
      Thu Jan 3 16:37:31 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6082b44377d91a6ecadd3
      Thu Jan 3 16:37:31 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
      Thu Jan 3 16:37:32 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
      Thu Jan 3 16:37:34 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
      Thu Jan 3 16:37:37 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083144377d91a6ecadd4
      Thu Jan 3 16:37:37 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
      Thu Jan 3 16:37:39 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
      Thu Jan 3 16:37:39 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355782814040073, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:41 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355791403974659, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:42 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355795698941961, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:42 [ReplicaSetMonitorWatcher] trying reconnect to RIGHT:27018
      Thu Jan 3 16:37:42 [ReplicaSetMonitorWatcher] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
      Thu Jan 3 16:37:43 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083744377d91a6ecadd5
      Thu Jan 3 16:37:43 [Balancer] events2 is unavailable
      Thu Jan 3 16:37:43 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
      Thu Jan 3 16:37:43 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355799993909263, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:44 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355804288876560, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:45 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
      Thu Jan 3 16:37:45 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355808583843853, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

      { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811137, connectionId: 104511, err: null, ok: 1.0 }

      Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

      { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811139, connectionId: 104511, err: null, ok: 1. 0 }

      Thu Jan 3 16:37:46 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811145, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

      { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811147, connectionId: 104511, err: null, ok: 1. 0 }

      Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

      { singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811149, connectionId: 104511, err: null, ok: 1. 0 }

      Thu Jan 3 16:37:47 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355817173778439, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:48 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355821468745731, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:49 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083d44377d91a6ecadd6
      Thu Jan 3 16:37:49 [Balancer] events2 is unavailable
      Thu Jan 3 16:37:49 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
      Thu Jan 3 16:37:49 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355825763713029, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:49 [WriteBackListener-LEFT:27018] DBClientCursor::init call() failed
      Thu Jan 3 16:37:49 [WriteBackListener-LEFT:27018] WriteBackListener exception : DBClientBase::findN: transport error: LEFT:27018 ns: admin.$cmd query:

      { writebacklisten: ObjectId('50d0c65444377d91a6e94 862') }

      Thu Jan 3 16:37:49 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355825763713039, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680325, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
      Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] DBClientCursor::init call() failed
      Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] WriteBackListener exception : DBClientBase::findN: transport error: LEFT:27018 ns: admin.$cmd query:

      { writebacklisten: ObjectId('50d0c65444377d91a6e94 862') }

      Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680332, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680336, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:51 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:51 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
      Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init lazy say() failedThu Jan 3 16:37:51 [conn64348] DBClientCursor::init message from say() was empty
      Thu Jan 3 16:37:51 [conn64348] slave no longer has secondary status: LEFT:27018
      Thu Jan 3 16:37:51 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
      Thu Jan 3 16:37:51 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:51 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018] Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init call() failed
      Thu Jan 3 16:37:51 [conn64348] trying reconnect to RIGHT:27018
      Thu Jan 3 16:37:51 [conn64348] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
      Thu Jan 3 16:37:51 [conn64330] trying reconnect to LEFT:27018
      Thu Jan 3 16:37:51 [conn64330] reconnect LEFT:27018 ok
      Thu Jan 3 16:37:51 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355834353647627, connectionId: 202, err: null, ok: 1.0
      Thu Jan 3 16:37:52 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:52 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
      Thu Jan 3 16:37:52 [conn64348] DBClientCursor::init lazy say() failed
      Thu Jan 3 16:37:52 [conn64348] DBClientCursor::init message from say() was empty
      Thu Jan 3 16:37:52 [conn64348] slave no longer has secondary status: LEFT:27018
      Thu Jan 3 16:37:52 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the poolThu Jan 3 16:37:52 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355838648614917, connectionId: 202, err: null, ok: 1.0
      }
      Thu Jan 3 16:37:52 [conn64330] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:52 [conn64330] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
      Thu Jan 3 16:37:52 [conn64330] DBClientCursor::init lazy say() failed
      Thu Jan 3 16:37:52 [conn64330] DBClientCursor::init message from say() was empty
      Thu Jan 3 16:37:52 [conn64330] slave no longer has secondary status: LEFT:27018
      Thu Jan 3 16:37:52 [conn64330] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
      Thu Jan 3 16:37:52 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
      Thu Jan 3 16:37:52 [conn64320] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:52 [conn64320] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
      Thu Jan 3 16:37:52 [conn64320] DBClientCursor::init lazy say() failed
      Thu Jan 3 16:37:52 [conn64320] DBClientCursor::init message from say() was empty
      Thu Jan 3 16:37:52 [conn64320] slave no longer has secondary status: LEFT:27018
      Thu Jan 3 16:37:52 [conn64320] warning: Weird shift of primary detected
      Thu Jan 3 16:37:53 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:53 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
      Thu Jan 3 16:37:53 [conn64348] DBClientCursor::init lazy say() failed
      Thu Jan 3 16:37:53 [conn64348] DBClientCursor::init message from say() was empty
      Thu Jan 3 16:37:53 [conn64348] slave no longer has secondary status: LEFT:27018
      Thu Jan 3 16:37:53 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
      Thu Jan 3 16:37:53 [conn64348] trying reconnect to RIGHT:27018
      Thu Jan 3 16:37:53 [conn64348] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
      Thu Jan 3 16:37:53 [conn64330] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:53 [conn64330] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
      Thu Jan 3 16:37:53 [conn64330] DBClientCursor::init lazy say() failed
      Thu Jan 3 16:37:53 [conn64330] DBClientCursor::init message from say() was empty
      Thu Jan 3 16:37:53 [conn64330] slave no longer has secondary status: LEFT:27018
      Thu Jan 3 16:37:53 [conn64330] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
      Thu Jan 3 16:37:53 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: LEFT:27018

      { setName: "events", ismaster: false, secondary: true, hosts: [ "LEFT:27018", "RIGHT:27018" ], me: "LEFT:270 18", maxBsonObjectSize: 16777216, ok: 1.0 }

      Thu Jan 3 16:37:53 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: caught exception RIGHT:27018 socket exception [FAILED_STATE] for RIGHT:27018
      Thu Jan 3 16:37:53 [conn64320] Socket say send() errno:104 Connection reset by peer 10.0.25.172:27018
      Thu Jan 3 16:37:53 [conn64320] DBException in process: socket exception [SEND_ERROR] for 10.0.25.172:27018
      Thu Jan 3 16:37:54 [conn64330] got not master for: LEFT:27018
      Received signal 11
      Backtrace: 0x8386d5 0x7fc340c094c0 0xc61e30 /usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
      /lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fc340c094c0]
      /usr/bin/mongos(_ZTVN5mongo18DBClientConnectionE+0x10)[0xc61e30]

            Assignee:
            randolph@mongodb.com Randolph Tan
            Reporter:
            parkan Arkadiy Kukarkin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: