Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 1.7.0
Affects Version/s: 1.6.3
Component/s: libmongoc
Labels:
None

Confidence Status:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

Quoting the SDAM spec on Network error when reading or writing:

If there is a network timeout on the connection after the handshake completes, the client MUST NOT mark the server Unknown. (A timeout may indicate a slow operation on the server, rather than an unavailable server.) If, however, there is some other network error on the connection after the handshake completes, the client MUST replace the server's description with a default ServerDescription of type Unknown, and fill the ServerDescription's error field with useful information, the same as if an error or timeout occurred before the handshake completed.

This doesn't jive with the implementation of _mongoc_cluster_check_interval(). Before the logic for socketCheckIntervalMS (i.e. isMaster for sockets last used more than five seconds ago by default), there is a basic check for a closed socket if the socket was last used more than a second ago (CHECK_CLOSED_DURATION_MSEC is defined as 1000):

if (scanner_node->last_used + (1000 * CHECK_CLOSED_DURATION_MSEC) < now) {
  if (mongoc_stream_check_closed (stream)) {
     mongoc_cluster_disconnect_node (cluster, server_id);
     bson_set_error (error, MONGOC_ERROR_STREAM, MONGOC_ERROR_STREAM_SOCKET,
                     "Stream is closed");
     return false;
  }
}

I would expect mongoc_topology_invalidate_server() to be called here, as it often is in other places following mongoc_cluster_disconnect_node() on socket errors.

I attempted to test this using the following PHP script:

<?php

function ping() {
    $manager = new \MongoDB\Driver\Manager('mongodb://127.0.0.1:27017?serverSelectionTryOnce=false&serverSelectionTimeoutMS=6000');

    printf("Start time: %.6f\n", $s = microtime(true));
    try {
        $server = $manager->selectServer(new MongoDB\Driver\ReadPreference(MongoDB\Driver\ReadPreference::RP_PRIMARY));
        echo "Selected primary\n";
        $server->executeCommand('admin', new MongoDB\Driver\Command(['ping' => 1]));
        echo "Ping successful\n";
    } catch (\Exception $e) {
        printf("ERROR: %s\n", $e->getMessage());
    }
    printf("End time: %.6f\n", $e = microtime(true));
    printf("Elapsed time: %.6f\n", $e - $s);
    echo "<hr>\n";
}

for ($i = 0; $i < 10; $i++) {
    ping();
    // Ensure that we surpass CHECK_CLOSED_DURATION_MSEC
    usleep(1500 * 1000);
}

If I start this script with a running MongoDB server on localhost, and restart it during the test, we get the following output (abridged to the relevant iterations around the restart):

Start time: 1496422389.810717
Selected primary
Ping successful
End time: 1496422389.811190
Elapsed time: 0.000473

Start time: 1496422391.311405
Selected primary
ERROR: Stream is closed
End time: 1496422391.311774
Elapsed time: 0.000369

Start time: 1496422392.811984
Selected primary
ERROR: Failed to send "ismaster" command with database "admin": Failure during socket delivery: Broken pipe (32)
End time: 1496422392.812394
Elapsed time: 0.000410

Start time: 1496422394.312595
Selected primary
Ping successful
End time: 1496422399.315092
Elapsed time: 5.002497

Notice that two exceptions are thrown before we can recover from the socket error. If it's not immediately obvious, recovery occurs during the last iteration, which has a higher elapsed time. Also note that server selection succeeded immediately following the first "Stream is closed" error because the node was never invalidated.

If we revise the original script to use a smaller usleep() interval to ensure we don't surpass CHECK_CLOSED_DURATION_MSEC, we only encounter one exception:

Start time: 1496422345.924555
Selected primary
Ping successful
End time: 1496422345.925055
Elapsed time: 0.000500

Start time: 1496422346.425322
Selected primary
ERROR: Failed to read 4 bytes: socket error or timeout
End time: 1496422346.425683
Elapsed time: 0.000361

Start time: 1496422346.925905
Selected primary
Ping successful
End time: 1496422351.928534
Elapsed time: 5.002629

I noted that this CHECK_CLOSED_DURATION_MSEC logic is specific to the single-threaded SDAM implementation, but I was unable to correlate it with any section of the SDAM spec. Do you have a reference for it?

That said, I believe we can solve this behavior by adding a mongoc_topology_invalidate_server() call after bson_set_error() and before return false in the mongoc_stream_check_closed() condition cited earlier.

is related to

CDRIVER-2172 Call "ping" on a socket that has been idle for socketCheckIntervalMS

Closed

related to

CDRIVER-2237 Server not invalidated in topology if command fails

Closed

links to

mongodb/mongo-php-driver#597

Assignee:: A. Jesse Jiryu Davis
Reporter:: Jeremy Mikola
Votes:: 0 Vote for this issue
Watchers:: 2 Start watching this issue

Created:: Jun 02 2017 05:10:00 PM UTC
Updated:: Oct 28 2023 11:30:39 AM UTC
Resolved:: Jun 18 2017 01:32:55 AM UTC

Details

Description

Attachments

Issue Links

Activity

People

Dates