  1. Core Server
  2. SERVER-28629

router blocks and throws ExceededTimeLimit

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.2.12
    • Component/s: Networking, Sharding
    • Fully Compatible
    • ALL

      We can reproduce the issue at any time just by executing a findOne through the router several times:

      for (var x = 0; x < 1000; x++) {
          db.offer.find({"_id": NumberLong("5672494983")}).forEach(function (u) { printjson(u); });
          print(x);
      }
      

      It already blocks after a few findOne calls.
      If we execute the same code directly on the shard where the document is located, there is no blocking at all.
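
      To compare the router against the shard, it can help to time each iteration instead of only printing the counter. The following is a minimal sketch, not part of the original report: the helper name timeRepeatedQueries and the 1000 ms threshold are assumptions chosen for illustration.

```javascript
// Hypothetical helper: times each call of runQuery and records the slow ones.
// runQuery is any zero-argument function that performs one query.
function timeRepeatedQueries(runQuery, iterations, slowMs) {
  var slow = [];
  var maxMs = 0;
  for (var i = 0; i < iterations; i++) {
    var start = Date.now();
    runQuery();
    var elapsed = Date.now() - start;
    if (elapsed > maxMs) { maxMs = elapsed; }
    // Keep only iterations that took at least slowMs milliseconds.
    if (elapsed >= slowMs) { slow.push({ iteration: i, ms: elapsed }); }
  }
  return { iterations: iterations, maxMs: maxMs, slow: slow };
}

// Usage in the mongo shell, once connected to the router and once to the shard:
// printjson(timeRepeatedQueries(function () {
//   db.offer.find({ "_id": NumberLong("5672494983") }).toArray();
// }, 1000, 1000));
```

      Running the same call against the router and against the shard should make the blocked iterations visible as entries in the slow array.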


      We are using a new sharded cluster running v3.2.12. Our cluster is not operational because many operations get blocked by the router. The corresponding log messages look like this:

      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-067.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-066.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
      2017-04-05T00:57:59.768+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out
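
      To see whether the timeouts concentrate on particular hosts, the log lines can be tallied per target. This is a rough sketch only: the function name countFailedConnects is hypothetical, and the pattern assumes the exact message format shown in the excerpt above.

```javascript
// Hypothetical helper: counts "Failed to connect ... ExceededTimeLimit" lines per host:port.
function countFailedConnects(logText) {
  // Captures the host:port that follows "Failed to connect to".
  var pattern = /Failed to connect to (\S+) - ExceededTimeLimit/;
  var counts = {};
  logText.split("\n").forEach(function (line) {
    var m = pattern.exec(line);
    if (m) {
      counts[m[1]] = (counts[m[1]] || 0) + 1;
    }
  });
  return counts;
}

// Sample input: three lines copied from the excerpt above.
var sample = [
  "2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out",
  "2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-073.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out",
  "2017-04-05T00:57:59.767+0200 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to mongo-070.db00.pro05.eu.idealo.com:27017 - ExceededTimeLimit: Operation timed out"
].join("\n");

var failedByHost = countFailedConnects(sample);
```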
      

      We observe this behaviour regardless of whether the query uses the shard key or not. In all cases the queried field is indexed.

      A downgrade of the routers to v3.0.12 is not possible because our config servers are running as a replica set instead of a mirrored set.
      An upgrade of the routers to v3.4.3 is not possible because "Version 3.4 mongos instances cannot connect to earlier versions of mongod instances."
      https://docs.mongodb.com/manual/release-notes/3.4-compatibility/

      Please also see the two monitoring screenshots of the router's TCP sockets. As you can see, tcp_tw (tcp_timeWait) is very high.
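
      For readers without such monitoring, the TIME_WAIT count can be approximated from netstat output captured on the router host. The sketch below is an assumption-laden illustration: countTcpStates is a hypothetical helper, it expects the six-column Linux `netstat -tan` layout, and the sample addresses are invented.

```javascript
// Hypothetical helper: tallies TCP socket states from Linux `netstat -tan` text.
function countTcpStates(netstatOutput) {
  var counts = {};
  netstatOutput.split("\n").forEach(function (line) {
    var fields = line.trim().split(/\s+/);
    // Data rows start with "tcp" or "tcp6" and carry the state in the sixth column.
    if (fields.length >= 6 && /^tcp/.test(fields[0])) {
      var state = fields[5];
      counts[state] = (counts[state] || 0) + 1;
    }
  });
  return counts;
}

// Invented sample rows, for illustration only.
var netstatSample = [
  "Proto Recv-Q Send-Q Local Address           Foreign Address         State",
  "tcp        0      0 10.0.0.1:51234          10.0.0.2:27017          TIME_WAIT",
  "tcp        0      0 10.0.0.1:51235          10.0.0.2:27017          TIME_WAIT",
  "tcp        0      0 10.0.0.1:51236          10.0.0.3:27017          ESTABLISHED"
].join("\n");

var tcpStates = countTcpStates(netstatSample);
```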

      This ticket is related to SERVER-26722, which has been closed as "resolved and fixed in 3.2.12", but since we still have this issue, we've created this new ticket for it.

        1. figure_1.png
          63 kB
          Ramon Fernandez Marina
        2. fr-11_tcpwait.jpg
          173 kB
          Kay Agahd
        3. fr-11_tcpwaitOnly.jpg
          169 kB
          Kay Agahd
        4. tcp_timewait_3.2.12vs3.2.8.jpg
          174 kB
          Kay Agahd
        5. tcp-tw_v3.0.12.jpg
          123 kB
          Kay Agahd
        6. v3.2.12_latencies.jpg
          236 kB
          Kay Agahd
        7. v3.2.12_tcp_tw.jpg
          214 kB
          Kay Agahd
        8. v3.2.8_latencies.jpg
          237 kB
          Kay Agahd
        9. v3.2.8_tcp_tw.jpg
          194 kB
          Kay Agahd

            Votes: 4
            Watchers: 20
