- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Fully Compatible
- ALL
- v5.0, v4.4
- Service Arch 2021-11-15, Service Arch 2021-11-22, Service Arch 2021-12-13, Service Arch 2022-1-10
- (copied to CRM)
When the oplog fetcher cannot retrieve a batch of oplog documents from its sync source, it will first try to create a new cursor on that sync source before choosing a new one (see the code here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L493).
In doing so, it first attempts to re-establish the connection to the sync source using its existing DBClientConnection. In certain edge cases, however, this connection-reestablishment process can make synchronous/blocking networking calls without correctly configured timeouts. When the remote host that the networking call is trying to reach is completely unresponsive, this can block the oplog fetching thread indefinitely, preventing the selection of a new sync source. As a result, a node can continue to use an unavailable node as its "sync source" until the unavailable node re-contacts the set, allowing the blocking networking call to finish and the oplog fetcher to shut down.
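To illustrate the underlying hazard, here is a minimal POSIX sketch (not MongoDB code; the address and port are hypothetical): a blocking read on a socket with no receive timeout configured will wait indefinitely if the peer stops responding after the connection is established.

```cpp
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(27017);                    // hypothetical sync source port
    inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr);  // hypothetical address

    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0) {
        // No SO_RCVTIMEO has been set on fd, so if the remote host becomes
        // unresponsive after the connection is established, this recv()
        // blocks the calling thread (for us, the oplog fetcher) until the
        // peer answers or the kernel reports an error.
        char buf[512];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        std::printf("recv returned %zd\n", n);
    }
    close(fd);
    return 0;
}
```

Setting a receive timeout on the socket (e.g. via setsockopt with SO_RCVTIMEO) bounds the wait; that is effectively the configuration that is missing in the two cases below.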
I found two cases so far where these calls seem to be made:
In the first case, we perform a blocking SSL handshake without a timeout configured on the socket; if the remote node becomes unavailable between establishing the connection and the SSL handshake, we'll see this issue.
In DBClientConnection::_checkConnection(), between the "Trying to reconnect" log line and the "Reconnect attempt failed" log line, the thread attempting to re-establish the connection to the sync source (node 01) calls DBClientConnection::connect() (see the source code here).
In trying to re-establish the connection, this thread first calls DBClientConnection::connectSocketOnly(), which in turn calls TransportLayerAsio::connect(). This function delegates to the ASIO library to establish the connection, and then creates an ASIOSession to manage it. Because Atlas uses SSL to secure connections, TransportLayerAsio::connect() will then attempt an SSL handshake here, also done through ASIO. However, the SSL handshake is attempted synchronously, and no timeout is set on the socket before the attempt (see the code for ensureSync here), so the handshake attempt will block until it errors or completes. This means that if the remote server (in this case node 01) stops responding to network requests (as it does here), the handshake attempt will block until the remote node comes back up and is able to respond. This explains why node 00 is able to progress when the remote host (node 01) comes back online around 2020-12-11T07:43:05: the FTDC metrics show that a slow SSL op is logged on node 00 at this time, and, as expected, the connection has timed out and we see this error propagated from connectSocketOnly up through DBClientConnection::connect --> DBClientConnection::_checkConnection and eventually DBClientCursor::init into the logs.
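Below is a minimal sketch of this failure mode using standalone Asio with OpenSSL (not the server's vendored transport code; the hostname is a placeholder): the TCP connect succeeds, but the synchronous handshake that follows has no deadline, so it blocks for as long as the peer stays silent.

```cpp
#include <asio.hpp>
#include <asio/ssl.hpp>
#include <iostream>

int main() {
    asio::io_context io;
    asio::ssl::context sslCtx(asio::ssl::context::tls_client);
    asio::ssl::stream<asio::ip::tcp::socket> stream(io, sslCtx);

    asio::ip::tcp::resolver resolver(io);
    auto endpoints = resolver.resolve("example.com", "443");  // placeholder host
    asio::connect(stream.lowest_layer(), endpoints);          // TCP connect succeeds

    // The handshake is synchronous and no timeout has been configured on the
    // underlying socket. If the peer goes silent between the TCP connect and
    // this point, the call blocks the thread until the peer responds or the
    // kernel tears the connection down.
    std::error_code ec;
    stream.handshake(asio::ssl::stream_base::client, ec);
    std::cout << "handshake result: " << ec.message() << std::endl;
    return 0;
}
```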
A second, similar case I've been able to reproduce locally:
When the oplog fetcher creates a new cursor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/db/repl/oplog_fetcher.cpp#L647 , the old one goes out of scope and is destroyed. The (abbreviated) call stack below ensues:
```
recvmsg()
...
asio::read
mongo::transport::TransportLayerAsio::AsioSession::opportunisticRead
mongo::transport::TransportLayerAsio::AsioSession::read
mongo::transport::TransportLayerAsio::AsioSession::sourceMessageImpl
mongo::transport::TransportLayerAsio::AsioSession::sourceMessage
mongo::DBClientConnection::call
mongo::DBClientBase::runCommandWithTarget
mongo::DBClientConnection::runCommandWithTarget
mongo::DBClientBase::runCommand
mongo::DBClientConnection::initWireVersion
mongo::DBClientConnection::connect
mongo::ConnectionString::connect
...
mongo::ScopedDbConnection constructor
mongo::DBClientCursor::kill
mongo::DBClientCursor destructor
```
In this code, the ScopedDbConnection constructor here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/connpool.h#L471 constructs a DBClientConnection with a socket timeout of 0, which translates here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/client/dbclient_connection.cpp#L610-L611 into the DBClientConnection storing the socket timeout as boost::none. When it sets that timeout on the ASIOSession here: https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L204, the ASIOSession stores boost::none as the timeout. This means that when ASIOSession::sourceMessage is called, no timeout is set on the socket before the read is performed (see https://github.com/mongodb/mongo/blob/ad91a93a5a31e175f5cbf8c69561e788bbc55ce1/src/mongo/transport/session_asio.h#L302). This allows the read to block indefinitely until the node comes back online.
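The following illustrative sketch (the function names and bodies are mine, not the server's) condenses the timeout plumbing described above: a socket timeout of 0 is stored as boost::none, and a boost::none timeout means no deadline is armed before the blocking read.

```cpp
#include <boost/optional.hpp>
#include <chrono>
#include <iostream>

using Milliseconds = std::chrono::milliseconds;

// DBClientConnection-style conversion (illustrative): a requested timeout of
// 0 seconds is stored as boost::none, i.e. "never time out".
boost::optional<Milliseconds> toSocketTimeout(double timeoutSecs) {
    if (timeoutSecs == 0.0) {
        return boost::none;
    }
    return Milliseconds(static_cast<long long>(timeoutSecs * 1000));
}

// ASIOSession::sourceMessage-style read path (illustrative): a deadline is
// only armed when a concrete timeout value is present, so boost::none lets
// the subsequent blocking read run without any bound.
void sourceMessage(const boost::optional<Milliseconds>& timeout) {
    if (timeout) {
        std::cout << "arming deadline of " << timeout->count() << "ms before read\n";
    } else {
        std::cout << "no deadline armed; read may block indefinitely\n";
    }
    // ... the (possibly unbounded) blocking read would happen here ...
}

int main() {
    sourceMessage(toSocketTimeout(0.0));   // the ScopedDbConnection path above
    sourceMessage(toSocketTimeout(30.0));  // an explicit 30s timeout, for contrast
    return 0;
}
```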
I've repro'd the issue on 4.4.1, but we should also audit the main branch for this issue – it doesn't look like the relevant code has changed.
Next steps:
- Write a test to reproduce the issue described in this ticket
- Determine the level at which to solve this issue
is related to:
- SERVER-66456 remove ThreadContext (Closed)

related to:
- SERVER-62035 Investigate large delays in `CertGetCertificateChain` on Windows (Backlog)
- SERVER-62665 Ensure `TransportLayerASIO::TimerService` can safely end the session (Closed)
- SERVER-65416 Complete TODO listed in SERVER-30212 (Open)
- SERVER-62199 Use the `TimerService` to ensure `TransportLayer` honors requested timeouts (Backlog)