Type: Bug
Resolution: Unresolved
Priority: Unknown
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None

It appears that there may be a libmongoc bug regarding tracking of maxWireVersion for servers when connected to a sharded cluster. I am seeing this specifically with a single-mongos cluster backed by 3 replica sets, started via mlaunch with:

mlaunch init --replicaset --sharded 3 --setParameter enableTestCommands=1

We have a Swift test which does the following, to test change streams' automatic resume behavior:
1. Create a client (in pooled mode, as Swift clients always are).
2. Use a client from the underlying pool to create a new collection.
3. Open a change stream on the collection.
4. Insert some documents into the collection.
5. Set the following fail point: {"configureFailPoint":"failCommand","mode":{"times":1},"data":{"errorCode":10107,"failCommands":["getMore"],"errorLabels":["ResumableChangeStreamError"]}}
6. Iterate the change stream. The getMore failpoint will be hit and then the change stream should make a single, successful resume attempt.
7. Inspect command monitoring events from the above and see that the aggregate was sent exactly twice.
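The check in step 7 can be sketched roughly as follows. This is a minimal Python illustration, not the actual Swift test: the helper `count_command` and both event lists are hypothetical, and the lists hold only command names rather than real command monitoring events.

```python
def count_command(started_command_names, name):
    """Count how many command-started events carry the given command name."""
    return sum(1 for command in started_command_names if command == name)


# Hypothetical sequences of command names, for illustration only.
# Passing run: the initial aggregate plus one successful resume.
passing_run = ["aggregate", "getMore", "aggregate", "getMore"]
# Failing run: the first resume attempt errors out, so a third aggregate is sent.
failing_run = ["aggregate", "getMore", "aggregate", "aggregate", "getMore"]

for label, run in (("passing", passing_run), ("failing", failing_run)):
    sent = count_command(run, "aggregate")
    verdict = "ok" if sent == 2 else "unexpected"
    print(f"{label}: aggregate sent {sent} time(s), {verdict}")
```

The test expects exactly two aggregates; a count of three corresponds to the extra attempt described below.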

On the latest server (I first saw this on v5.0.0-alpha0-1541-ga8cf4f3 and have observed it on newer versions as well), this test started to fail because an extra aggregate attempt was observed. The first resume attempt consistently fails with a libmongoc error in the domain MONGOC_ERROR_STREAM: "Failed to send \"aggregate\" command with database \"test\": Failed to read 4 bytes: socket error or timeout".

Notably, as part of the resume process, drivers (including libmongoc) attempt to kill the original cursor. The server recently merged SERVER-57457, after which a connection is automatically closed when it receives OP_KILL_CURSORS. With some printf debugging I have determined that, after the getMore fails, OP_KILL_CURSORS is incorrectly being used to kill the cursor in this case. The server therefore closes the connection, and the initial resume attempt fails.

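Specifically, at the point where libmongoc checks the wire version to decide how to kill the cursor, server_stream->sd->max_wire_version is incorrectly 0, so the else block (the OP_KILL_CURSORS path) is hit.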
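As a rough illustration of why a max_wire_version of 0 ends up on the legacy opcode path, here is a simplified Python model of that kind of branch. It is not libmongoc's code: the function name is hypothetical, and the threshold value of 4 (the wire version where the killCursors command is assumed to become available) should be treated as an assumption.

```python
# Assumed threshold: lowest wire version that supports the killCursors command.
# The exact value is an assumption made for this sketch.
WIRE_VERSION_KILLCURSORS_CMD = 4


def choose_kill_cursors_path(max_wire_version):
    """Model the branch: use the command if the server is new enough, else the opcode."""
    if max_wire_version >= WIRE_VERSION_KILLCURSORS_CMD:
        return "killCursors command"
    # Taken when the server description reports an unset or stale wire version.
    return "OP_KILL_CURSORS"


# A description with max_wire_version == 0 falls through to the legacy opcode,
# even though the actual server would accept the killCursors command.
print(choose_kill_cursors_path(0))
# Any value at or above the threshold selects the command path.
print(choose_kill_cursors_path(WIRE_VERSION_KILLCURSORS_CMD))
```

In this model the problem lies in the input to the branch, not in the branch itself: a correctly populated server description would never reach the opcode path on a modern server.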

I first witnessed this with the latest server + libmongoc-1.18.0-alpha2. However, I have now tested as far back as server 4.4.3 + libmongoc 1.16.2 and saw that the OP_KILL_CURSORS branch is also taken on those versions in this particular scenario. It only became a problem now because the earlier server versions still accept OP_KILL_CURSORS without closing the connection.

I'll also note that this seems to be specific to the resume code path: in my printf testing, cleaning up a change stream normally via mongoc_change_stream_destroy does not appear to take the OP_KILL_CURSORS path.

Let me know if you need any more information or help reproducing this.

Slack thread for context: https://mongodb.slack.com/archives/C72LB5RPV/p1626133519311700

is related to: CDRIVER-3653 Connections should use server descriptions from handshake, not monitoring (Closed)

Assignee: Unassigned
Reporter: Kaitlin Mahar
Votes: 0
Watchers: 4

Created: Jul 13 2021 01:23:15 AM UTC
Updated: Apr 15 2022 06:34:15 PM UTC
