Uploaded image for project: 'C Driver'
  1. C Driver
  2. CDRIVER-2174

_mongoc_cluster_check_interval() should invalidate nodes after detecting a closed socket

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 1.7.0
    • Affects Version/s: 1.6.3
    • Component/s: libmongoc
    • None

      Quoting the SDAM spec on Network error when reading or writing:

      If there is a network timeout on the connection after the handshake completes, the client MUST NOT mark the server Unknown. (A timeout may indicate a slow operation on the server, rather than an unavailable server.) If, however, there is some other network error on the connection after the handshake completes, the client MUST replace the server's description with a default ServerDescription of type Unknown, and fill the ServerDescription's error field with useful information, the same as if an error or timeout occurred before the handshake completed.

      This doesn't jive with the implementation of _mongoc_cluster_check_interval(). Before the logic for socketCheckIntervalMS (i.e. isMaster for sockets last used more than five seconds ago by default), there is a basic check for a closed socket if the socket was last used more than a second ago (CHECK_CLOSED_DURATION_MSEC is defined as 1000):

      if (scanner_node->last_used + (1000 * CHECK_CLOSED_DURATION_MSEC) < now) {
        if (mongoc_stream_check_closed (stream)) {
           mongoc_cluster_disconnect_node (cluster, server_id);
           bson_set_error (error, MONGOC_ERROR_STREAM, MONGOC_ERROR_STREAM_SOCKET,
                           "Stream is closed");
           return false;
        }
      }
      

      I would expect mongoc_topology_invalidate_server() to be called here, as it often is in other places following mongoc_cluster_disconnect_node() on socket errors.


      I attempted to test this using the following PHP script:

      <?php
      
      function ping() {
          $manager = new \MongoDB\Driver\Manager('mongodb://127.0.0.1:27017?serverSelectionTryOnce=false&serverSelectionTimeoutMS=6000');
      
          printf("Start time: %.6f\n", $s = microtime(true));
          try {
              $server = $manager->selectServer(new MongoDB\Driver\ReadPreference(MongoDB\Driver\ReadPreference::RP_PRIMARY));
              echo "Selected primary\n";
              $server->executeCommand('admin', new MongoDB\Driver\Command(['ping' => 1]));
              echo "Ping successful\n";
          } catch (\Exception $e) {
              printf("ERROR: %s\n", $e->getMessage());
          }
          printf("End time: %.6f\n", $e = microtime(true));
          printf("Elapsed time: %.6f\n", $e - $s);
          echo "<hr>\n";
      }
      
      for ($i = 0; $i < 10; $i++) {
          ping();
          // Ensure that we surpass CHECK_CLOSED_DURATION_MSEC
          usleep(1500 * 1000);
      }
      

      If I start this script with a running MongoDB server on localhost, and restart it during the test, we get the following output (abridged to the relevant iterations around the restart):

      Start time: 1496422389.810717
      Selected primary
      Ping successful
      End time: 1496422389.811190
      Elapsed time: 0.000473
      
      Start time: 1496422391.311405
      Selected primary
      ERROR: Stream is closed
      End time: 1496422391.311774
      Elapsed time: 0.000369
      
      Start time: 1496422392.811984
      Selected primary
      ERROR: Failed to send "ismaster" command with database "admin": Failure during socket delivery: Broken pipe (32)
      End time: 1496422392.812394
      Elapsed time: 0.000410
      
      Start time: 1496422394.312595
      Selected primary
      Ping successful
      End time: 1496422399.315092
      Elapsed time: 5.002497
      

      Notice that two exceptions are thrown before we can recover from the socket error. If it's not immediately obvious, recovery occurs during the last iteration, which has a higher elapsed time. Also note that server selection succeeded immediately following the first "Stream is closed" error because the node was never invalidated.

      If we revise the original script to use a smaller usleep() interval to ensure we don't surpass CHECK_CLOSED_DURATION_MSEC, we only encounter one exception:

      Start time: 1496422345.924555
      Selected primary
      Ping successful
      End time: 1496422345.925055
      Elapsed time: 0.000500
      
      Start time: 1496422346.425322
      Selected primary
      ERROR: Failed to read 4 bytes: socket error or timeout
      End time: 1496422346.425683
      Elapsed time: 0.000361
      
      Start time: 1496422346.925905
      Selected primary
      Ping successful
      End time: 1496422351.928534
      Elapsed time: 5.002629
      

      I noted that this CHECK_CLOSED_DURATION_MSEC logic is specific to the single-threaded SDAM implementation, but I was unable to correlate it with any section of the SDAM spec. Do you have a reference for it?

      That said, I believe we can solve this behavior by adding a mongoc_topology_invalidate_server() call after bson_set_error() and before return false in the mongoc_stream_check_closed() condition cited earlier.

            Assignee:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Reporter:
            jmikola@mongodb.com Jeremy Mikola
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: