Core Server / SERVER-13217

Socket Exception on MapReduce from Removed Shard

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 2.2.6
    • Component/s: MapReduce, Sharding
    • Linux

      We removed a 2-node replica set from a sharded cluster yesterday. The node fully drained, and we then ran the "final" removeShard command, which returned the following:

      mongos> db.runCommand({removeShard : "rsgewrset40"})
      {
      "errmsg" : "exception: can't find shard for: rsgewrset40",
      "code" : 13129,
      "ok" : 0
      }
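
One way to confirm what the "can't find shard" response implies is to look at what the config database still records under that shard name. The following is a minimal sketch, not a verified diagnosis for this issue: `findShardReferences` is a hypothetical helper, and it assumes the standard `config.shards`, `config.chunks`, and `config.databases` collections, where shards are keyed by `_id`, chunks carry a `shard` field, and databases carry a `primary` field.

```javascript
// Sketch only: report what the config database still records for a shard name.
// `configDb` is assumed to be db.getSiblingDB("config") obtained through a mongos.
// findShardReferences is a hypothetical helper name, not a MongoDB built-in.
function findShardReferences(configDb, shardName) {
  var primaries = [];
  // Databases whose primary shard is still set to the removed shard.
  configDb.databases.find({ primary: shardName }).forEach(function (d) {
    primaries.push(d._id);
  });
  return {
    // 0 means the shard document itself is gone from config.shards.
    shardEntries: configDb.shards.count({ _id: shardName }),
    // Chunks still assigned to the shard; should be 0 after a full drain.
    chunks: configDb.chunks.count({ shard: shardName }),
    // Names of databases that still list the shard as primary.
    primaryFor: primaries
  };
}
```

In a mongo shell connected to a mongos, this could be called as `findShardReferences(db.getSiblingDB("config"), "rsgewrset40")`. A non-empty `primaryFor` or a non-zero `chunks` count would indicate metadata that still points at the removed shard; the report does not say whether that is the case here.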

      We then shut down the machines in the replica set and the arbiter for this shard.

      All systems except for our map/reduce jobs are running fine. Our MR job is getting the following exception:

      MongoDB shell version: 2.2.6
      connecting to: REWRWEB1P:27017/crew_feuds_prod
      Fri Mar 14 14:42:04 uncaught exception: map reduce failed:{
      'ok' : 0,
      'errmsg' : 'MR post processing failed: { result: \'rivals.mp3.pcros\', errmsg: \'exception: could not initialize cursor across all shards because : socket exception [CONNECT_ERROR] for rsgewrset40/rsgewrmng79.taketwo.online:27017,r...\', code: 14827, ok: 0.0 }'
      }

      We've restarted all of our mongos processes, flushed the router config (flushRouterConfig), and run connPoolSync.
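
For reference, the refresh steps mentioned above can be sketched as the two admin commands they most likely correspond to, `flushRouterConfig` and `connPoolSync`. This is an assumption about what was run, not a transcript from the report, and `refreshRouter` is a hypothetical wrapper. As the report notes, these steps did not clear the stale shard reference for the map/reduce job.

```javascript
// Sketch only: issue the two refresh commands against one mongos.
// `adminDb` is assumed to be db.getSiblingDB("admin") on that mongos.
// refreshRouter is a hypothetical helper name, not a MongoDB built-in.
function refreshRouter(adminDb) {
  return {
    // Ask the mongos to reload its cached cluster metadata from the config servers.
    flush: adminDb.runCommand({ flushRouterConfig: 1 }),
    // Ask the process to synchronize its connection pool.
    poolSync: adminDb.runCommand({ connPoolSync: 1 })
  };
}
```

In a mongo shell this would be run once per mongos, for example `refreshRouter(db.getSiblingDB("admin"))`, checking that each returned document has `ok` set to 1.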

      We've had to restart the replica set that was drained and just leave it running even though it's not part of the cluster.

      What do we need to do to get the MR job to forget about this node?

            Assignee:
            siyuan.zhou@mongodb.com Siyuan Zhou
            Reporter:
            al.gehrig@rockstarsandiego.com Al Gehrig
            Votes:
            3
            Watchers:
            8
