- Type: Improvement
- Resolution: Done
- Priority: Minor - P4
- Affects Version/s: 2.6.5
- Component/s: Replication
- None
- Minor Change
When the other side of the heartbeat channel goes down, the RS heartbeat attempts to connect multiple times in a way that does not appear to be intended, and it floods the log with the resulting failures. This may be acceptable when a node is down only briefly, but it is not very intuitive when a longer maintenance task needs to be performed on the replica-set nodes.
This is evident from the following code, instrumented with additional log statements (src/mongo/db/repl/heartbeat.cpp):
364 bool _requestHeartbeat(HeartbeatInfo& mem, BSONObj& info, int& theirConfigVersion) {
365     log() << "_requestHeartbeat - entering function";
366     {
367         log() << "_requestHeartbeat - initial scoped connection check " << __FILE__ << ":" << __LINE__;
368         ScopedConn conn(h.toString());
369         conn.setTimeout(_timeout);
370         if (tries++ % threshold == (threshold - 1)) {
371             log() << "_requestHeartbeat - reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
372             conn.reconnect();
373             log() << "_requestHeartbeat - after failed reconnect attempt for scoped connection " << __FILE__ << ":" << __LINE__;
374         }
375     }
376
377     Timer timer;
378     time_t before = curTimeMicros64() / 1000000;
379
380     bool ok = tryHeartbeat(&info, &theirConfigVersion);
381
382     mem.ping = static_cast<unsigned int>(timer.millis());
383     time_t totalSecs = mem.ping / 1000;
384
385     // if that didn't work and we have more time, lower timeout and try again
386     if (!ok && totalSecs < _timeout) {
387         log() << "replset info " << h.toString() << " heartbeat failed, retrying" << rsLog;
388
389         // lower timeout to remaining ping time
390         {
391             log() << "_requestHeartbeat - getting scoped connection again for setting timeout " << __FILE__ << ":" << __LINE__;
392             ScopedConn conn(h.toString());
393             conn.setTimeout(_timeout - totalSecs);
394         }
395
396         int checkpoint = timer.millis();
397         timer.reset();
398         ok = tryHeartbeat(&info, &theirConfigVersion);
399         mem.ping = static_cast<unsigned int>(timer.millis());
400         totalSecs = (checkpoint + mem.ping)/1000;
401
402         // set timeout back to default
403         {
404             log() << "_requestHeartbeat - getting scoped connection again for resetting timeout " << __FILE__ << ":" << __LINE__;
405             ScopedConn conn(h.toString());
406             conn.setTimeout(_timeout);
407         }
408     }
409
410     // we set this on any response - we don't get this far if
411     // couldn't connect because exception is thrown
412     time_t after = mem.lastHeartbeat = before + totalSecs;
413
414     if ( info["time"].isNumber() ) {
415         long long t = info["time"].numberLong();
416         if( t > after )
417             mem.skew = (int) (t - after);
418         else if( t < before )
419             mem.skew = (int) (t - before); // negative
420     }
421     else {
422         // it won't be there if remote hasn't initialized yet
423         if( info.hasElement("time") )
424             warning() << "heatbeat.time isn't a number: " << info << endl;
425         mem.skew = INT_MIN;
426     }
427
428     {
429         be state = info["state"];
430         if( state.ok() )
431             mem.hbstate = MemberState(state.Int());
432     }
433
434     if (info.hasField("stateDisagreement") && info["stateDisagreement"].trueValue()) {
435         log() << "replset info " << h.toString() << " thinks that we are down" << endl;
436     }
437
438     log() << "_requestHeartbeat - leaving function";
439     return ok;
440 }
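The retry flow above acquires the scoped connection in three separate scopes (lines 368, 392, and 405), and each acquisition can trigger its own connection attempt against a downed peer. Below is a minimal, self-contained sketch of how these could be collapsed into a single scope per heartbeat pass; the Conn type is a hypothetical stand-in used only for illustration, not the real ScopedConn.

#include <iostream>
#include <string>

// Hypothetical connection type used only for illustration; constructing it is
// where a (re)connect attempt would happen, and the heartbeat always fails
// here to simulate a peer that is down.
class Conn {
public:
    explicit Conn(const std::string& host) : _host(host) {
        std::cout << "connecting to " << _host << std::endl;
    }
    void setTimeout(double seconds) { _timeoutSecs = seconds; }
    bool tryHeartbeat() const { return false; }
private:
    std::string _host;
    double _timeoutSecs = 0;
};

// Acquire the connection once per heartbeat pass and reuse it for the retry,
// instead of constructing a new scoped connection for each step.
bool requestHeartbeatOnce(const std::string& host, double timeoutSecs) {
    Conn conn(host);                   // single connection acquisition per pass
    conn.setTimeout(timeoutSecs);
    if (conn.tryHeartbeat())
        return true;
    conn.setTimeout(timeoutSecs / 2);  // shorter timeout for the retry
    bool ok = conn.tryHeartbeat();
    conn.setTimeout(timeoutSecs);      // restore the default before the scope ends
    return ok;
}

int main() {
    bool ok = requestHeartbeatOnce("peer.example:27018", 10.0);
    std::cout << "heartbeat " << (ok ? "succeeded" : "failed") << std::endl;
    return 0;
}

With a single acquisition per pass, a downed peer would produce at most one set of connection-failure warnings instead of one set per scope.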
The following is the log produced for a single heartbeat attempt while the other node in the replica set is down:
2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - entering function
2015-03-11T17:45:16.100-0400 [rsHealthPoll] _requestHeartbeat - initial scoped connection check src/mongo/db/repl/heartbeat.cpp:367
2015-03-11T17:45:16.101-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.102-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.102-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.103-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.173-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.173-0400 [rsHealthPoll] replset info aks-osx.local:27018 heartbeat failed, retrying
2015-03-11T17:45:16.173-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for setting timeout src/mongo/db/repl/heartbeat.cpp:391
2015-03-11T17:45:16.175-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.175-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.176-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.176-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.310-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.311-0400 [rsHealthPoll] _requestHeartbeat - getting scoped connection again for resetting timeout src/mongo/db/repl/heartbeat.cpp:404
2015-03-11T17:45:16.312-0400 [rsHealthPoll] warning: Failed to connect to 192.168.56.1:27018, reason: errno:61 Connection refused
2015-03-11T17:45:16.312-0400 [rsHealthPoll] couldn't connect to aks-osx.local:27018: couldn't connect to server aks-osx.local:27018 (192.168.56.1) failed, connection attempt failed
2015-03-11T17:45:16.313-0400 [rsHealthPoll] _requestHeartbeat - leaving function
Apart from the logging, this flow can also lead to unintended behaviour when the connection succeeds at one of the intermediate stages in the code.
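For the log flooding during a longer maintenance window, another possible direction, again only a hedged sketch rather than actual server code, is to back off and rate-limit the reconnect attempts and the warnings they emit across heartbeat cycles. The tryConnect function, thresholds, and delays below are hypothetical placeholders.

#include <algorithm>
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for the heartbeat connection attempt; it always fails
// here to simulate a peer that is down for maintenance.
static bool tryConnect() {
    return false;
}

int main() {
    std::chrono::seconds backoff(1);            // initial delay between reconnect attempts
    const std::chrono::seconds maxBackoff(60);  // cap the delay for long outages
    const int logEvery = 10;                    // log only every Nth consecutive failure
    int consecutiveFailures = 0;

    for (int attempt = 0; attempt < 5; ++attempt) {  // bounded loop so the example terminates
        if (tryConnect()) {
            consecutiveFailures = 0;                 // peer reachable again: reset state
            backoff = std::chrono::seconds(1);
            continue;
        }

        ++consecutiveFailures;
        if (consecutiveFailures == 1 || consecutiveFailures % logEvery == 0) {
            std::cout << "peer unreachable (" << consecutiveFailures
                      << " consecutive failures), next attempt in "
                      << backoff.count() << "s" << std::endl;
        }

        std::this_thread::sleep_for(backoff);
        backoff = std::min<std::chrono::seconds>(backoff * 2, maxBackoff);  // exponential backoff, capped
    }
    return 0;
}

Capping the backoff keeps the delay bounded so a recovered node is noticed reasonably quickly, while the logging threshold replaces the per-attempt warning pairs with an occasional summary line.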