Core Server / SERVER-28629

router blocks and throws ExceededTimeLimit

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.2.12
    • Component/s: Networking, Sharding
    • Fully Compatible
    • ALL

      We can reproduce the issue at any time just by executing a findOne through the router several times:

      for (var x = 0; x < 1000; x++) {
        db.offer.find({ "_id": NumberLong("5672494983") }).forEach(function (u) { printjson(u); });
        print(x);
      }
      

      The loop already blocks after a few iterations.
      If we execute the same code directly on the shard where the document is located, there is no blocking at all.
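
To see which iterations stall, each call can be timed. The following is only a minimal sketch and not part of the original report: `timeIterations`, `runQuery`, and the threshold value are hypothetical names chosen for illustration, and `runQuery` stands in for the `find` shown above.

```javascript
// Hypothetical helper: run a query callback repeatedly and record slow iterations.
// runQuery is an assumed stand-in for the find() from the report;
// thresholdMs is an assumed cutoff, not a value taken from the ticket.
function timeIterations(runQuery, iterations, thresholdMs) {
  var slow = [];
  for (var i = 0; i < iterations; i++) {
    var start = Date.now();
    runQuery(i); // synchronous call, as in the mongo shell
    var elapsedMs = Date.now() - start;
    if (elapsedMs >= thresholdMs) {
      slow.push({ iteration: i, elapsedMs: elapsedMs });
    }
  }
  return slow;
}

// Assumed usage in the mongo shell, once against mongos and once against the shard:
// var slow = timeIterations(function () {
//   db.offer.find({ "_id": NumberLong("5672494983") }).toArray();
// }, 1000, 1000);
// printjson(slow);
```

Comparing the result from the router with the result from the shard would show whether the stalls occur only on the mongos path, as described above.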


      We are using a new sharded cluster running v3.2.12. Our cluster is not operational because many operations get blocked by the router. The corresponding log messages look like this:

      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-066.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
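
To see which shard hosts the router fails to reach most often, the log lines can be tallied per host. This is a minimal sketch under stated assumptions: `countConnectFailures` is a hypothetical helper, and its regular expression is modeled only on the lines quoted above, not on any documented log format.

```javascript
// Hypothetical helper: count "Failed to connect to <host:port>" lines per host.
// The pattern is an assumption derived from the quoted log excerpt.
function countConnectFailures(logText) {
  var pattern = /Failed to connect to (\S+) - /;
  var counts = {};
  logText.split("\n").forEach(function (line) {
    var match = pattern.exec(line);
    if (match) {
      var host = match[1];
      counts[host] = (counts[host] || 0) + 1;
    }
  });
  return counts;
}
```

Applied to the ten lines quoted above, this tally gives three failures each for mongo-067, mongo-070 and mongo-073, and one for mongo-066.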
      

      We observe this behaviour regardless of whether the query uses the shard key. In all cases the queried field is indexed.

      A downgrade of the routers to v3.0.12 is not possible because our config servers are running as a replica set instead of a mirrored set.
      An upgrade of the routers to v3.4.3 is not possible because "Version 3.4 mongos instances cannot connect to earlier versions of mongod instances."
      https://docs.mongodb.com/manual/release-notes/3.4-compatibility/

      Please also see the two monitoring screenshots of the router's TCP sockets. As you can see, tcp_tw (tcp_timeWait) is very high.
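
The TIME_WAIT count shown in the screenshots can also be checked directly on the router host. The sketch below assumes a Linux host, which the report does not state, and reads the socket table format of `/proc/net/tcp`, where state `06` denotes TIME_WAIT. `countTimeWait` is a hypothetical helper, and filtering on remote port 27017 is an assumption based on the port in the log lines.

```javascript
// Hypothetical helper: count TIME_WAIT sockets in the text of /proc/net/tcp (Linux only).
// Columns after splitting on whitespace: sl, local_address, rem_address, st, ...
// Addresses are hex "IP:PORT"; state "06" is TIME_WAIT.
function countTimeWait(procNetTcp, remotePort) {
  var count = 0;
  procNetTcp.split("\n").slice(1).forEach(function (line) { // slice(1) skips the header row
    var fields = line.trim().split(/\s+/);
    if (fields.length < 4 || fields[3] !== "06") {
      return;
    }
    var port = parseInt(fields[2].split(":")[1], 16);
    if (remotePort === undefined || port === remotePort) {
      count++;
    }
  });
  return count;
}

// Assumed usage under Node.js on the router host (IPv4 sockets only;
// IPv6 sockets are listed separately in /proc/net/tcp6):
// var text = require("fs").readFileSync("/proc/net/tcp", "utf8");
// console.log(countTimeWait(text, 27017));
```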

      This ticket is related to SERVER-26722, which was closed as "resolved and fixed in 3.2.12". Since we still have this issue, we've created this new ticket for it.

        1. v3.2.8_tcp_tw.jpg (194 kB)
        2. v3.2.8_latencies.jpg (237 kB)
        3. v3.2.12_tcp_tw.jpg (214 kB)
        4. v3.2.12_latencies.jpg (236 kB)
        5. tcp-tw_v3.0.12.jpg (123 kB)
        6. tcp_timewait_3.2.12vs3.2.8.jpg (174 kB)
        7. fr-11_tcpwaitOnly.jpg (169 kB)
        8. fr-11_tcpwait.jpg (173 kB)
        9. figure_1.png (63 kB)

            Votes: 4
            Watchers: 20
