Core Server / SERVER-13352

Socket exception (SEND_ERROR) even after SERVER-9022 applied

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 2.6.0
    • Affects Version/s: 2.4.9
    • Component/s: Networking, Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      Issue Status as of Jan 09, 2015

      ISSUE SUMMARY
      Certain network settings and/or events may cause the connection pools used by MongoDB to be populated by "bad" or "broken" connections. Common causes included periodic network failures and firewalls silently killing long-running connections, though the actual cause was sometimes impossible to ascertain.

      These connections only reveal themselves to be unusable when they are selected from the pool and data is written to them; until then they appear to be healthy and usable. This is particularly relevant to large sharded clusters, which contain many connection pools (each mongos process and each shard primary has connection pools that can be impacted).

      USER IMPACT
      When a triggering event occurs, some proportion of the idle connections in a connection pool may become unusable but still look healthy. Over time, as the MongoDB process (mongod or mongos) attempts to use these connections from the pool, they may fail, throwing socket exceptions (SEND_ERROR, recv() timeout, etc.). These errors occur sporadically (depending on how many connections were affected and how busy the process is) until the "bad" connections in the pool are exhausted or the process in question is restarted. In practice, this often presents as seemingly random socket exceptions long after the trigger event occurred.
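
      The delayed, sporadic pattern described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: FakeConnection and serve_burst are hypothetical names, the idle pool is modeled as a last-in-first-out stack (an assumption of this sketch, not a statement about MongoDB's pool), and a "broken" connection is just a flag that makes send() fail.

```python
class FakeConnection:
    """Stand-in for a pooled socket; `broken` models a silently dead link."""

    def __init__(self, broken=False):
        self.broken = broken

    def send(self, data):
        # A broken connection looks fine until something is written to it.
        if self.broken:
            raise ConnectionError("SEND_ERROR (simulated)")
        return len(data)


def serve_burst(idle, concurrency):
    """Check out `concurrency` connections at once, use each, return them.

    `idle` is a list used as a stack (assumption: most recently returned
    connection is reused first). Returns the number of send errors seen.
    """
    checked_out = [idle.pop() if idle else FakeConnection()
                   for _ in range(concurrency)]
    errors = 0
    for i, conn in enumerate(checked_out):
        try:
            conn.send(b"ping")
        except ConnectionError:
            errors += 1
            checked_out[i] = FakeConnection()  # replace the dead connection
    idle.extend(checked_out)
    return errors


# Two of four idle connections were broken by some earlier trigger event.
idle = [FakeConnection(), FakeConnection(broken=True),
        FakeConnection(broken=True), FakeConnection()]

quiet_errors = sum(serve_burst(idle, 1) for _ in range(100))  # light load
burst_errors = serve_burst(idle, 4)   # load spike reaches the deep entries
later_errors = serve_burst(idle, 4)   # bad connections are now exhausted
```

      Under light load only the top of the stack is reused, so no errors appear; the first burst that digs deeper into the pool hits both broken connections, after which the pool is clean again.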

      WORKAROUNDS
      If a regularly occurring trigger event is suspected, preventing that event in the first place is the best solution. If that proves elusive, the only definitive solution is to restart the impacted processes once such an event has occurred (or is suspected to have occurred) in order to clear out the problematic pools.

      The releaseConnectionsAfterResponse parameter (added in 2.2.4 and 2.4.2 as part of SERVER-9022) can help alleviate the issue, but does not eliminate it. Additionally, this parameter must be used judiciously and with caution, per the warning given in SERVER-9022.
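
      For reference, a server parameter such as releaseConnectionsAfterResponse is normally changed with the setParameter admin command. The sketch below only builds that command document and hands it to a driver object; the helper names are hypothetical, and `client` is assumed to be a PyMongo-style client exposing `client.admin.command(...)`. Check the warning in SERVER-9022 before enabling the parameter on a real deployment.

```python
def set_parameter_command(name, value):
    """Build a setParameter command document.

    The command name must be the first key; Python dicts preserve
    insertion order, so a plain dict is sufficient here.
    """
    return {"setParameter": 1, name: value}


def enable_release_after_response(client):
    """Send the command to the admin database of the connected process.

    Assumption: `client` behaves like a PyMongo MongoClient, i.e. it has
    an `admin` database object with a `command(document)` method.
    Returns whatever reply the server sends back.
    """
    cmd = set_parameter_command("releaseConnectionsAfterResponse", True)
    return client.admin.command(cmd)
```

      The reply to setParameter typically reports the previous value in a `was` field, which is a convenient way to confirm whether the parameter was already enabled on a given process.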

      AFFECTED VERSIONS
      MongoDB versions prior to 2.6.0 are affected by this issue.

      FIX VERSION
      The fix is included in the 2.6.0 production release.

      RESOLUTION DETAILS
      MongoDB 2.6 comes with new connection pooling code that includes the work done in SERVER-9041 to proactively detect the re-use of broken connections from the pool.
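
      As an illustration of the general idea of checking a pooled connection before reusing it, here is a minimal sketch. It is not the SERVER-9041 implementation: CheckedPool and looks_alive are hypothetical names, and the check shown (a non-blocking peek at the socket) only catches closes the local kernel already knows about, such as a FIN or RST from the peer. A connection dropped silently by a firewall would still pass this check.

```python
import select
import socket
from collections import deque


def looks_alive(sock):
    """Return False if the peer is known to have closed the connection."""
    try:
        readable, _, _ = select.select([sock], [], [], 0)
        if not readable:
            return True  # nothing pending, so no evidence of a problem
        # Readable with zero bytes available means the peer sent EOF.
        return sock.recv(1, socket.MSG_PEEK) != b""
    except OSError:
        return False


class CheckedPool:
    """Idle-connection pool that validates sockets at checkout time."""

    def __init__(self, connect):
        self._connect = connect  # factory for fresh connections
        self._idle = deque()
        self.discarded = 0

    def release(self, sock):
        self._idle.append(sock)

    def acquire(self):
        # Skip over (and close) idle sockets that fail the liveness check.
        while self._idle:
            sock = self._idle.popleft()
            if looks_alive(sock):
                return sock
            sock.close()
            self.discarded += 1
        return self._connect()
```

      With a check like this, a dead connection is dropped at checkout instead of surfacing as a socket exception in the middle of an operation.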

      Original description

      Like some other folks I was encountering the issue described in SERVER-7008 (principally on a cluster with 32 mongos, and 20 mongod forming 10 shards, all running 2.4.9).

      The occurrences were somewhat random but tended to happen in the mornings and early in the week (the latter probably correlated with the weekly compaction that runs on Saturday night).

      The problem would always disappear for 1-2 weeks after a mongos restart.

      After applying SERVER-9022, the problems appeared to have stopped. After ~6 weeks, however, some nodes started to see SEND_ERROR exceptions again. As before, a mongos restart fixed everything.

      I confirmed that all the servers did have the patch applied (was: true)

            Assignee: Ramon Fernandez Marina (ramon.fernandez@mongodb.com)
            Reporter: Alex Piggott (apiggott@ikanow.com)
            Votes: 6
            Watchers: 14
