The topology scanner should fan out to all servers and check them all concurrently using non-blocking I/O. However, our implementation of the TLS handshake operation blocks, waiting for the initial connection to complete and to reach a certain step in the TLS protocol. This means that high-latency replica set members slow down the topology scanner more than expected.
Test:
cd to the mongo-c-driver root dir and start a mongod:
mongod --sslOnNormalPorts --sslPEMKeyFile tests/x509gen/server.pem --sslCAFile tests/x509gen/ca.pem
Update mongoc_stream_tls_openssl_handshake:
time_t time_ptr;
time_ptr = time(NULL);
printf ("start handshake %s", ctime (&time_ptr));
if (BIO_do_handshake (openssl->bio) == 1) {
   time_ptr = time(NULL);
   printf ("handshake succeeds %s", ctime (&time_ptr));
   if (_mongoc_openssl_check_cert (
          ssl, host, tls->ssl_opts.allow_invalid_hostname)) {
      RETURN (true);
   }
   *events = 0;
   bson_set_error (error,
                   MONGOC_ERROR_STREAM,
                   MONGOC_ERROR_STREAM_SOCKET,
                   "TLS handshake failed: Failed certificate verification");
   RETURN (false);
}
if (BIO_should_retry (openssl->bio)) {
   time_ptr = time(NULL);
   printf ("handshake should retry %s", ctime (&time_ptr));
   *events = BIO_should_read (openssl->bio) ? POLLIN : POLLOUT;
   RETURN (false);
}
Slow down the network, on Linux:
sudo tc qdisc add dev lo root netem delay 300ms
Or on Mac:
sudo pfctl -E
(cat /etc/pf.conf && echo "dummynet-anchor \"foo\"" && echo "anchor \"foo\"") | sudo pfctl -f -
echo "dummynet in quick proto tcp from any to any port 27017 pipe 1" | sudo pfctl -a foo -f -
sudo dnctl pipe 1 config bw 20000bit/s
Ignore the warnings about "No ALTQ support in kernel", etc.
The shell should now connect, slowly:
mongo --ssl --sslPEMKeyFile tests/x509gen/client.pem --sslCAFile tests/x509gen/ca.pem --host localhost
Now recompile and run a test:
export MONGOC_TEST_URI=mongodb://localhost:27017,localhost:27017
export MONGOC_TEST_SSL_PEM_FILE=tests/x509gen/client.pem
export MONGOC_TEST_SSL_CA_FILE=tests/x509gen/ca.pem
./test-libmongoc --no-fork -l /Client/select_server/single
Listing "localhost:27017" twice lets us see whether the topology scanner begins both handshakes concurrently and then both succeed (as expected), or whether it begins and completes one handshake before starting the other (the bug). In fact, the serial behavior is what I see with OpenSSL 1.0.1f on Ubuntu 16.04:
start handshake Thu Dec 15 02:37:12 2016
handshake succeeds Thu Dec 15 02:37:14 2016
start handshake Thu Dec 15 02:37:16 2016
handshake succeeds Thu Dec 15 02:37:17 2016
There are two symptoms of the blocking handshake. First, we see one handshake begin, block for two seconds, then succeed, before the other begins. Second, we expect the function to print "handshake should retry" but it doesn't.
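The ctime() output above has one-second resolution, which is coarse next to a 300 ms netem delay. As a rough sketch only, finer-grained instrumentation could use a monotonic clock. The helper names now_ms and elapsed_ms are hypothetical and not part of the driver, and the sketch assumes a POSIX system that provides clock_gettime:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <time.h>

/* Hypothetical helper (not in libmongoc): milliseconds on the monotonic
 * clock, so successive readings are unaffected by wall-clock changes. */
static long long
now_ms (void)
{
   struct timespec ts;

   clock_gettime (CLOCK_MONOTONIC, &ts);
   return (long long) ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Hypothetical helper: milliseconds elapsed since an earlier now_ms(). */
static long long
elapsed_ms (long long start)
{
   return now_ms () - start;
}

/* Print a label with the time elapsed since `start`, for example
 * log_elapsed ("handshake succeeds", start). */
static void
log_elapsed (const char *label, long long start)
{
   printf ("%s after %lld ms\n", label, elapsed_ms (start));
}
```

Recording a start value before BIO_do_handshake and calling log_elapsed at each printf site would show how long each call actually blocks.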
The handshake still blocks, though for a shorter duration, if we also set this:
export MONGOC_TEST_SSL_WEAK_CERT_VALIDATION=on
I've made some attempts to fix this like so, with no effect:
BIO_set_nbio (openssl->bio, 1);
There is a reference to a bug with this function and BIO_do_handshake from 15 years ago and another from 9 years ago that I do not believe can apply to OpenSSL 1.0.1.
I also tried deleting this line from _mongoc_openssl_ctx_new:
SSL_CTX_set_mode (ctx, SSL_MODE_AUTO_RETRY);
Also no effect on the topology scanner.
Return your network to normalcy, on Linux:
sudo tc qdisc del dev lo root
Or on Mac:
sudo dnctl -f flush
sudo pfctl -f /etc/pf.conf
Update: this is now resolved for the OpenSSL and Windows Secure Channel implementations. The handshake is now parallelized by setting the timeout value to 0 during handshake. Additionally, I've increased throughput with SChannel by increasing the initial receive buffer size, and perhaps fixed a latent bug in the previous code that could cause a deadlock if the buffer grew large enough to receive multiple SSL blocks.
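The retry pattern that lets handshakes overlap can be illustrated without TLS. The following is a sketch, not the driver's code: try_step and run_demo are hypothetical names, and a non-blocking recv on a Unix socket pair stands in for one handshake step. Each step either completes or reports which poll event to wait for and returns immediately, so the caller can start every step before waiting on any of them:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for one handshake step: try to read one byte
 * without blocking. Returns true when the step completes. Otherwise
 * returns false and sets *events to the poll() event to wait for
 * (0 means a hard error or a closed peer). */
static bool
try_step (int fd, int *events)
{
   char byte;
   ssize_t n;

   assert (events != NULL);
   *events = 0;
   n = recv (fd, &byte, 1, 0);
   if (n == 1) {
      return true;
   }
   if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      *events = POLLIN; /* not ready yet: ask the caller to poll */
   }
   return false;
}

/* Starts two steps, waits on both with one poll() call, and returns the
 * number that completed, or -1 if setup failed or a first attempt did
 * not report POLLIN. */
static int
run_demo (void)
{
   int pairs[2][2];
   struct pollfd pfds[2];
   int i, flags, events, pending = 0, completed = 0;

   for (i = 0; i < 2; i++) {
      if (socketpair (AF_UNIX, SOCK_STREAM, 0, pairs[i]) != 0) {
         return -1;
      }
      flags = fcntl (pairs[i][0], F_GETFL, 0);
      fcntl (pairs[i][0], F_SETFL, flags | O_NONBLOCK);
   }

   /* Begin both steps. Neither call waits for its peer. */
   for (i = 0; i < 2; i++) {
      if (!try_step (pairs[i][0], &events) && events == POLLIN) {
         pending++;
      }
      pfds[i].fd = pairs[i][0];
      pfds[i].events = (short) events;
      pfds[i].revents = 0;
   }

   /* Simulate both slow peers replying. */
   for (i = 0; i < 2; i++) {
      if (write (pairs[i][1], "x", 1) != 1) {
         return -1;
      }
   }

   /* One poll() covers both descriptors; retry whichever is readable. */
   if (poll (pfds, 2, 1000) > 0) {
      for (i = 0; i < 2; i++) {
         if ((pfds[i].revents & POLLIN) && try_step (pfds[i].fd, &events)) {
            completed++;
         }
      }
   }

   for (i = 0; i < 2; i++) {
      close (pairs[i][0]);
      close (pairs[i][1]);
   }
   return pending == 2 ? completed : -1;
}
```

Calling run_demo () returns 2 when both steps first report POLLIN and then complete after a single poll. A blocking recv in try_step would instead stall on the first descriptor before the second step could start, which is the serial behavior described in this ticket.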
The Apple Secure Transport implementation still handshakes connections serially, but I'm not fixing this at the moment. Mac OS X is used for development of C Driver applications, not for production.
is depended on by:
- CDRIVER-1409 Test that topology scanner is still async with TLS (Closed)
is related to:
- CDRIVER-2394 /Topology/invalidate_server/ tests are slow with SSL (Closed)
- CDRIVER-2885 Topology scanner's SSL handshake is blocking for secure transport (Backlog)