Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-26722

router blocks and throws ExceededTimeLimit

    • Type: Icon: Bug Bug
    • Resolution: Duplicate
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.2.10
    • Component/s: Querying, Sharding
    • None
    • Fully Compatible
    • ALL
    • Sharding 2016-11-21, Sharding 2016-12-12

      We have updated one of our sharded cluster from v3.0.12 to v3.2.10. Since then our cluster was not operational because many operations got blocked by the the router. The corresponding log message looks like this:

      2016-10-20T11:00:22.902+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to s559:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.918+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.920+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-007.ipx:27017 - ExceededTimeLimit: Operation timed out
      2016-10-20T11:00:22.921+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-024.ipx:27017 - ExceededTimeLimit: Operation timed out
      

      We can reproduce the issue at any time just by executing a findOne through the router several times:

      for(x=0;x<1000;x++){db.offer.find({"_id" : NumberLong("5672494983")}).forEach(function(u){printjson(u)});print(x)}
      

      It blocks after a few findOne's already.
      If we execute the same code on the shard where the document is located then there is no blocking at all.

      We found out that mongodb router v3.0.12 doesn't have this problem. This is why we downgraded all our routers to v3.0.12 even though the rest of the cluster (mongod's) is running v3.2.10.

      Please see attached the log file from the router.
      Please see also 3 monitoring screenshots of the router TCP-sockets. As you can see, tcp_tw (tcp_timeWait) is much higher for v3.2.10 than for v3.0.12.

        1. fr-11_tcpwait.jpg
          fr-11_tcpwait.jpg
          173 kB
        2. fr-11_tcpwaitOnly.jpg
          fr-11_tcpwaitOnly.jpg
          169 kB
        3. offerstore-en-router-03_afterDowngradeToV3.0.10.png
          offerstore-en-router-03_afterDowngradeToV3.0.10.png
          201 kB
        4. offerstore-en-router-03_afterUpgradeToV3.2.10.png
          offerstore-en-router-03_afterUpgradeToV3.2.10.png
          201 kB
        5. offerstore-en-router-03_beforeUpgrade.png
          offerstore-en-router-03_beforeUpgrade.png
          202 kB
        6. offerstore-en-router-03_testWithV3.2.10.png
          offerstore-en-router-03_testWithV3.2.10.png
          232 kB
        7. offerstore-en-router-03.ipx.2016-10-20.log.tgz
          9.46 MB
        8. offerstore-en-router-03.ipx.2016-10-26.log.tgz
          3.74 MB
        9. v3.2.11 - 201.zip
          102.86 MB
        10. v3.2.11 - 202.zip
          16.96 MB
        11. v3.2.11 - 203.zip
          34.03 MB

            Votes:
            2 Vote for this issue
            Watchers:
            17 Start watching this issue

              Created:
              Updated:
              Resolved: