  Core Server / SERVER-10268

Network socket timeouts with heartbeatTimeoutSecs are different with a mongos vs plain replica set

    • Type: Bug
    • Resolution: Done
    • Priority: Minor - P4
    • Fix Version/s: 3.0.0
    • Affects Version/s: None
    • Component/s: Networking
    • Labels: None
    • Environment:
      Dell XPS13 laptop, Ubuntu 12.04, MongoDB 2.4.5 (10gen repo)
    • Backwards Compatibility: Fully Compatible

      We were experimenting with heartbeatTimeoutSecs with Edouard and noticed that the socket timeout behavior is different when a mongos is part of the cluster vs when there is only a plain replica set (only mongod/arbiter servers).

      Consider the following test:

      • mongod is running on ports 27001, 27002 and 27003 - it's easiest to run them in the foreground in a terminal (no --fork).
      • A mongos is running on port 27000, connected to one config server. (The config server isn't shown in the attached logs, as I didn't use --syslog for it.)
      • Symptoms are the same whether I run some operations or the cluster is completely idle.
      • Attached logs are with -vvvv verbosity.

      Tip: On a Linux system I can use --syslog to get a single log from all processes. Apparently on a Mac, syslog will not distinguish between the sources of the log entries, so it becomes impossible to know which node wrote what.
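      For reference, a cluster like the one described above could be launched roughly as follows. This is only a sketch: the dbpath directories and the config server port (27004) are assumed; the ports, replica set name and --syslog usage match the setup above.

      # Three replica set members, each in its own terminal, in the foreground (no --fork), logging to syslog
      mongod --port 27001 --dbpath /data/rs1 --replSet myrs --syslog
      mongod --port 27002 --dbpath /data/rs2 --replSet myrs --syslog
      mongod --port 27003 --dbpath /data/rs3 --replSet myrs --syslog

      # One config server (no --syslog, hence absent from the attached logs) and a mongos in front of it
      mongod --configsvr --port 27004 --dbpath /data/config
      mongos --port 27000 --configdb hingo-sputnik:27004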

      The test is simply to stop the PRIMARY with Ctrl-Z and measure the failover time until a new PRIMARY is active and writable. In this ticket we focus on the time until the network socket timeout, which looks like this in the logs:

      --> 27001 is PRIMARY. The last log entry from 27001 signifies the time I pressed Ctrl-Z:

      Jul 16 17:05:29 hingo-sputnik mongod.27001[14713]: Tue Jul 16 17:05:29.845 [conn107] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1373976850000|1 } } cursorid:6149432603763897 ntoreturn:0 keyUpdates:0 locks(micros) r:383 nreturned:0 reslen:20 5011ms

      --> 27003 notices that 27001 is not responding:

      Jul 16 17:05:33 hingo-sputnik mongod.27003[13420]: Tue Jul 16 17:05:33.472 [rsHealthPoll] Socket recv() timeout 127.0.1.1:27001

      I have called out these occurrences in the attached logs, so they are easy to find.
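      One quick way to locate these two lines is to grep the syslog (a sketch, assuming the log file is /var/log/syslog and the processes are tagged mongod.27001 etc. as in the attached logs; run it before resuming the suspended node):

      # Last line the suspended primary (27001) managed to log, i.e. the Ctrl-Z moment
      grep 'mongod.27001' /var/log/syslog | tail -n 1

      # First report of the socket timeout from another member
      grep 'Socket recv() timeout' /var/log/syslog | head -n 1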

      The default behavior is that this timeout is 10 seconds (plus some milliseconds).

      Now, we configure heartbeatTimeoutSecs:

      myrs:PRIMARY> config = rs.config()
      {
              "_id" : "myrs",
              "version" : 14,
              "members" : [
                      {
                              "_id" : 0,
                              "host" : "hingo-sputnik:27001"
                      },
                      {
                              "_id" : 1,
                              "host" : "hingo-sputnik:27002"
                      },
                      {
                              "_id" : 2,
                              "host" : "hingo-sputnik:27003"
                      }
              ]
      }
      myrs:PRIMARY> config["settings"] = { heartbeatTimeoutSecs : 2 }
      { "heartbeatTimeoutSecs" : 2 }
      myrs:PRIMARY> config
      {
              "_id" : "myrs",
              "version" : 14,
              "members" : [
                      {
                              "_id" : 0,
                              "host" : "hingo-sputnik:27001"
                      },
                      {
                              "_id" : 1,
                              "host" : "hingo-sputnik:27002"
                      },
                      {
                              "_id" : 2,
                              "host" : "hingo-sputnik:27003"
                      }
              ],
              "settings" : {
                      "heartbeatTimeoutSecs" : 2
              }
      }
      myrs:PRIMARY> rs.reconfig(config)
      { "ok" : 1 }
      // Just to verify it sticks:
      myrs:PRIMARY> rs.config()
      {
              "_id" : "myrs",
              "version" : 15,
              "members" : [
                      {
                              "_id" : 0,
                              "host" : "hingo-sputnik:27001"
                      },
                      {
                              "_id" : 1,
                              "host" : "hingo-sputnik:27002"
                      },
                      {
                              "_id" : 2,
                              "host" : "hingo-sputnik:27003"
                      }
              ],
              "settings" : {
                      "heartbeatTimeoutSecs" : 2
              }
      }
      myrs:PRIMARY>
      

      And then we repeat the test.
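      Scripted, the repeated test amounts to roughly the following sketch (the PID is a placeholder; kill -TSTP is just the non-interactive equivalent of pressing Ctrl-Z in the primary's terminal):

      # Suspend the current primary, same effect as Ctrl-Z
      kill -TSTP <pid_of_primary>

      # Watch for the socket timeout report from the remaining members
      tail -f /var/log/syslog | grep 'Socket recv() timeout'

      # Resume the old primary when done
      kill -CONT <pid_of_primary>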

      Actual result

      We find that with a cluster that includes one mongos and one config server, the socket timeout becomes shorter, as expected (3.5 seconds in the attached log, not exactly 2, but in the right ballpark). If I shut down the mongos and config server, the socket timeout remains at 10+ seconds regardless of what I set heartbeatTimeoutSecs to.

      Expected result

      Expected behavior would be that the socket timeout also becomes shorter when there is no mongos present.

      Workarounds

      Apparently most of our users are using sharding, so they would see the correct behavior (should they ever try to use heartbeatTimeoutSecs). Also, for someone to whom this timeout matters, it is feasible to simply add a mongos process to the cluster as a workaround. Hence I set the priority to Minor.

      More importantly, as a first step we just want to understand why the mongos has any influence on what is happening.

        Attachments:
          1. heartbeatTimeoutSecs-with-mongos.log (401 kB, Henrik Ingo)
          2. heartbeatTimeoutSecs-without-mongos.log (158 kB, Henrik Ingo)

            Assignee: Matt Dannenberg
            Reporter: Henrik Ingo (Inactive)
            Votes: 0
            Watchers: 8
