Core Server / SERVER-17555

The heartbeat attempts connection multiple times (when the other node is down) in a way that is not intended

    • Type: Improvement
    • Resolution: Done
    • Priority: Minor - P4
    • Fix Version/s: 3.0.0
    • Affects Version/s: 2.6.5
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Minor Change

      When the other side of the heartbeat channel goes down, the RS heartbeat attempts connection multiple times in a way that may not be intended, and as a result floods the log with these failures. This may be acceptable when a node is down only briefly, but it is not very intuitive when a longer maintenance task needs to be performed on the replica-set nodes.

      This is evident from the following code instrumented with additional log statements (src/mongo/db/repl/heartbeat.cpp):

      364         bool _requestHeartbeat(HeartbeatInfo& mem, BSONObj& info, int& theirConfigVersion) {
      365             log() << "_requestHeartbeat - entering function";
      366             {
      367                 log() << "_requestHeartbeat - initial scoped connection check " << __FILE__ << ":" << __LINE__;
      368                 ScopedConn conn(h.toString());
      369                 conn.setTimeout(_timeout);
      370                 if (tries++ % threshold == (threshold - 1)) {
      371                     log() << "_requestHeartbeat - reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
      372                     conn.reconnect();
      373                     log() << "_requestHeartbeat - after failed reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
      374                 }
      375             }
      376 
      377             Timer timer;
      378             time_t before = curTimeMicros64() / 1000000;
      379 
      380             bool ok = tryHeartbeat(&info, &theirConfigVersion);
      381 
      382             mem.ping = static_cast<unsigned int>(timer.millis());
      383             time_t totalSecs = mem.ping / 1000;
      384 
      385             // if that didn't work and we have more time, lower timeout and try again
      386             if (!ok && totalSecs < _timeout) {
      387                 log() << "replset info " << h.toString() << " heartbeat failed, retrying" << rsLog;
      388 
      389                 // lower timeout to remaining ping time
      390                 {
      391                     log() << "_requestHeartbeat - getting scoped connection again for setting timeout " << __FILE__ << ":" << __LINE__;
      392                     ScopedConn conn(h.toString());
      393                     conn.setTimeout(_timeout - totalSecs);
      394                 }
      395 
      396                 int checkpoint = timer.millis();
      397                 timer.reset();
      398                 ok = tryHeartbeat(&info, &theirConfigVersion);
      399                 mem.ping = static_cast<unsigned int>(timer.millis());
      400                 totalSecs = (checkpoint + mem.ping)/1000;
      401 
      402                 // set timeout back to default
      403                 {
      404                     log() << "_requestHeartbeat - getting scoped connection again for resetting timeout " << __FILE__ << ":" << __LINE__;
      405                     ScopedConn conn(h.toString());
      406                     conn.setTimeout(_timeout);
      407                 }
      408             }
      409 
      410             // we set this on any response - we don't get this far if
      411             // couldn't connect because exception is thrown
      412             time_t after = mem.lastHeartbeat = before + totalSecs;
      413 
      414             if ( info["time"].isNumber() ) {
      415                 long long t = info["time"].numberLong();
      416                 if( t > after )
      417                     mem.skew = (int) (t - after);
      418                 else if( t < before )
      419                     mem.skew = (int) (t - before); // negative
      420             }
      421             else {
      422                 // it won't be there if remote hasn't initialized yet
      423                 if( info.hasElement("time") )
      424                     warning() << "heatbeat.time isn't a number: " << info << endl;
      425                 mem.skew = INT_MIN;
      426             }
      427 
      428             {
      429                 be state = info["state"];
      430                 if( state.ok() )
      431                     mem.hbstate = MemberState(state.Int());
      432             }
      433 
      434             if (info.hasField("stateDisagreement") && info["stateDisagreement"].trueValue()) {
      435                 log() << "replset info " << h.toString() << " thinks that we are down" << endl;
      436             }
      437 
      438             log() << "_requestHeartbeat - leaving function";
      439             return ok;
      440         }
      

      The following is the log for each heartbeat attempt when the other node in the replica-set goes down:

      2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - entering function
      2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - initial scoped connection check src/mongo/db/repl/heartbeat.cpp:367
      2015-03-11T17:45:16.101-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
      2015-03-11T17:45:16.102-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
      2015-03-11T17:45:16.102-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
      2015-03-11T17:45:16.103-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
      2015-03-11T17:45:16.173-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
      2015-03-11T17:45:16.173-0400 [rsHealthPoll] replset info aks-osx.local:27018 heartbeat failed, retrying
      2015-03-11T17:45:16.173-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for setting timeout src/mongo/db/repl/heartbeat.cpp:391
      2015-03-11T17:45:16.175-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
      2015-03-11T17:45:16.175-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
      2015-03-11T17:45:16.176-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
      2015-03-11T17:45:16.176-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
      2015-03-11T17:45:16.310-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
      2015-03-11T17:45:16.311-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for resetting timeout src/mongo/db/repl/heartbeat.cpp:404
      2015-03-11T17:45:16.312-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
      2015-03-11T17:45:16.312-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
      2015-03-11T17:45:16.313-0400 [rsHealthPoll] _requestHeartbeat - leaving function
      

      Apart from the log flooding (the excerpt above shows seven "Failed to connect" warnings for a single heartbeat cycle), the flow can also lead to unintended behaviour when a connection happens to succeed at one of the intermediate stages in the code.
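
      A possible direction for the improvement, shown only as a minimal, self-contained sketch (the PeerConnection type, its methods, and requestHeartbeatOnce below are hypothetical stand-ins for ScopedConn and the real heartbeat helpers, not server code): open one connection per heartbeat cycle, reuse it for the lowered-timeout retry, and emit a single failure message when the peer is unreachable.

      #include <chrono>
      #include <iostream>
      #include <string>

      // Hypothetical stand-in for ScopedConn: one object per heartbeat cycle.
      struct PeerConnection {
          std::string host;
          bool connected;
          double timeoutSecs;

          explicit PeerConnection(const std::string& h)
              : host(h), connected(false), timeoutSecs(10.0) {}

          // Connect exactly once per cycle; later steps reuse the result instead
          // of constructing a new connection just to change the timeout.
          bool connect() {
              // A real implementation would open a socket here; in this scenario
              // the peer is down, so the attempt fails.
              connected = false;
              return connected;
          }

          void setTimeout(double secs) { timeoutSecs = secs; }
      };

      // Hypothetical stand-in for tryHeartbeat(): fails while the peer is down.
      static bool tryHeartbeat(PeerConnection& conn) { return conn.connected; }

      static bool requestHeartbeatOnce(const std::string& host, double timeoutSecs) {
          PeerConnection conn(host);
          conn.setTimeout(timeoutSecs);

          if (!conn.connect()) {
              // One failure message per cycle instead of one warning for every
              // ScopedConn construction and tryHeartbeat() call.
              std::cout << "replset info " << host << " is down, heartbeat skipped\n";
              return false;
          }

          const std::chrono::steady_clock::time_point start =
              std::chrono::steady_clock::now();
          if (tryHeartbeat(conn))
              return true;

          // Retry on the same connection with whatever time is left, rather than
          // opening a second connection only to lower the timeout.
          const double elapsed = std::chrono::duration<double>(
              std::chrono::steady_clock::now() - start).count();
          if (elapsed < timeoutSecs) {
              conn.setTimeout(timeoutSecs - elapsed);
              return tryHeartbeat(conn);
          }
          return false;
      }

      int main() {
          // A single heartbeat cycle against a down peer: exactly one log line.
          requestHeartbeatOnce("aks-osx.local:27018", 10.0);
          return 0;
      }

      The point of the sketch is only the control-flow shape: one connection attempt, one retry on the same object, and one message per failed cycle, so that the log volume stays proportional to the heartbeat interval during long maintenance windows.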

            Assignee:
            sam.kleinman Sam Kleinman (Inactive)
            Reporter:
            anil.kumar Anil Kumar
            Votes:
            0
            Watchers:
            4
