- Type: Bug
- Resolution: Works as Designed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- ALL
- Replication 2021-12-13
SERVER-56137 presented a case where an application was sending many findAndModify commands from forked child processes. Each command had different query/update options, but thanks to an old OpenSSL bug some of the commands ended up using the same combination of lsid and txnNumber fields. This led to an interesting, albeit expected, interaction with retryable writes whereby the server would return the result of the first findAndModify operation to use that lsid/txnNumber combination in response to any subsequent findAndModify that reused the combination.
Observing different commands that resemble retry attempts
While investigating that issue, I also looked into what would happen if the subsequent command sharing the same lsid/txnNumber was not a findAndModify. Additionally, what if some commands were bulk writes that had multiple statements for the same txnNumber? This led to some surprising discoveries in how the server crafts its responses for retryable writes. Since I'm familiar with retryable writes (e.g. "at-most-once" semantics, reliance on the oplog), the results below are not completely unexpected; however, they would likely be very confusing for a user.
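To frame the examples, here is a minimal illustrative sketch (not server code) of the bookkeeping assumed in the discussion below: each write statement is identified by its lsid, txnNumber, and position within the command (written below as "1.0", "1.1", etc.), and a statement whose key was already recorded is replayed from the stored result instead of being executed. All names in the sketch are made up for illustration.

# Illustrative model only: how retryable writes key each statement.
executed = {}  # (lsid, txnNumber, stmt_index) -> result recorded for that statement

def apply_statement(lsid, txn_number, stmt_index, run):
    """Execute a statement at most once; on a key match, replay the recorded result."""
    key = (lsid, txn_number, stmt_index)
    if key in executed:
        return executed[key]   # treated as a retry: no write happens, old result returned
    result = run()             # the write actually executes
    executed[key] = result
    return result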
Consider the following sequence of commands/replies observed between a driver (intentionally using fixed lsid/txnNumber combinations) and a MongoDB 4.4 primary:
One insert
insert command: {"insert":"test","documents":[{"AAA":1}],"txnNumber":1,"$db":"test","lsid":{"id":{"$binary":"YXt1nbJBTWmgNPM+Vjug2A==","$type":"04"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":1}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}}} insert reply: {"n":1,"opTime":{"ts":{"$timestamp":{"t":1620079815,"i":3}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":3}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}},"operationTime":{"$timestamp":{"t":1620079815,"i":3}}} Documents in test.test: 0: {"_id":{"$oid":"609074c70ba2e3bde8ab5281"},"AAA":1}
The AAA document is successfully inserted and the response is what we expect (n:1).
Two inserts
insert command: {"insert":"test","documents":[{"BBB":1},{"CCC":1}],"txnNumber":1,"$db":"test","lsid":{"id":{"$binary":"YXt1nbJBTWmgNPM+Vjug2A==","$type":"04"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":3}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}}} insert reply: {"n":2,"opTime":{"ts":{"$timestamp":{"t":1620079815,"i":4}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":4}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}},"operationTime":{"$timestamp":{"t":1620079815,"i":4}}} Documents in test.test: 0: {"_id":{"$oid":"609074c70ba2e3bde8ab5281"},"AAA":1} 1: {"_id":{"$oid":"609074c70ba2e3bde8ab5285"},"CCC":1}
The insert contains two documents, but only CCC is written to the collection. BBB is ultimately a no-op, because it shares the same statement ID as the previously inserted AAA ("1.0"). CCC, being second in the documents array, has statement ID "1.1" and therefore executes. The response (n:2) does not match what actually happened.
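In terms of the sketch above (the callables and the lsid value are illustrative stand-ins for the real writes):

# Statement 1.0 was already recorded by the AAA insert, so BBB is replayed;
# statement 1.1 is new, so CCC actually executes. The reply's n still counts both.
apply_statement("lsid-A", 1, 0, lambda: {"n": 1})  # BBB -> replayed, no write
apply_statement("lsid-A", 1, 1, lambda: {"n": 1})  # CCC -> new, actually inserted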
Three upserts
{"update":"test","updates":[{"q":{"DDD":1},"u":{"$inc":{"DDD":1}},"upsert":true},{"q":{"EEE":1},"u":{"$inc":{"EEE":1}},"upsert":true},{"q":{"FFF":1},"u":{"$inc":{"FFF":1}},"upsert":true}],"txnNumber":1,"$db":"test","lsid":{"id":{"$binary":"YXt1nbJBTWmgNPM+Vjug2A==","$type":"04"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":4}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}}} update reply: {"n":3,"nModified":0,"upserted":[{"index":0,"_id":{"$oid":"609074c70ba2e3bde8ab5281"}},{"index":1,"_id":{"$oid":"609074c70ba2e3bde8ab5285"}},{"index":2,"_id":{"$oid":"609074c70ba2e3bde8ab5288"}}],"opTime":{"ts":{"$timestamp":{"t":1620079815,"i":5}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":5}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}},"operationTime":{"$timestamp":{"t":1620079815,"i":5}}} Documents in test.test: 0: {"_id":{"$oid":"609074c70ba2e3bde8ab5281"},"AAA":1} 1: {"_id":{"$oid":"609074c70ba2e3bde8ab5285"},"CCC":1} 2: {"_id":{"$oid":"609074c70ba2e3bde8ab5288"},"FFF":2}
The update contains three upserts, each of which matches a single document (e.g. DDD:1) and increments the same key used in its query. The first two upserts, for DDD and EEE, end up reusing the previous statement IDs "1.0" and "1.1", respectively, and are no-ops. The third upsert, for FFF, actually executes, and we see a new document added to the collection. The response is misleading, as it suggests that three documents were upserted. The ObjectIds for the previously inserted AAA and CCC documents are reported in the upserted field alongside the one document that was actually upserted (FFF). An application might then infer that DDD and EEE have the same ObjectIds as AAA and CCC.
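Again in terms of the sketch (illustrative values only):

# Statements 1.0 and 1.1 replay the results recorded for the AAA and CCC inserts,
# which is how their ObjectIds end up in this update's upserted array; only
# statement 1.2 (FFF) is new and actually performs an upsert.
apply_statement("lsid-A", 1, 0, lambda: {"n": 1})  # DDD -> replayed AAA record
apply_statement("lsid-A", 1, 1, lambda: {"n": 1})  # EEE -> replayed CCC record
apply_statement("lsid-A", 1, 2, lambda: {"n": 1})  # FFF -> new, actually upserted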
Four updates
{"update":"test","updates":[{"q":{"DDD":1},"u":{"$inc":{"DDD":1}}},{"q":{"CCC":1},"u":{"$inc":{"CCC":1}}},{"q":{"BBB":1},"u":{"$inc":{"BBB":1}}},{"q":{"AAA":1},"u":{"$inc":{"AAA":1}}}],"txnNumber":1,"$db":"test","lsid":{"id":{"$binary":"YXt1nbJBTWmgNPM+Vjug2A==","$type":"04"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":5}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}}} update reply: {"n":4,"nModified":1,"upserted":[{"index":0,"_id":{"$oid":"609074c70ba2e3bde8ab5281"}},{"index":1,"_id":{"$oid":"609074c70ba2e3bde8ab5285"}},{"index":2,"_id":{"$oid":"609074c70ba2e3bde8ab5288"}}],"opTime":{"ts":{"$timestamp":{"t":1620079815,"i":6}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":6}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}},"operationTime":{"$timestamp":{"t":1620079815,"i":6}}} Documents in test.test: 0: {"_id":{"$oid":"609074c70ba2e3bde8ab5281"},"AAA":2} 1: {"_id":{"$oid":"609074c70ba2e3bde8ab5285"},"CCC":1} 2: {"_id":{"$oid":"609074c70ba2e3bde8ab5288"},"FFF":2}
The update contains four statements, each of which matches a single document (DDD through AAA, in that order) and increments the same key used in its query. The first three updates reuse the previous statement IDs ("1.0" through "1.2") and are no-ops. The fourth update matches and increments AAA, which is what we see reflected in the collection. The nModified field in the response is accurate, but the upserted field is misleading. Given that the upsert option was not even used for these updates, an application would not expect any upserts to be reported.
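The reply's counts line up with the sketch, assuming each replayed statement contributes its previously recorded n of 1 (again, illustrative arithmetic only):

# n:4 = three replayed statements (1.0-1.2) + one executed update (1.3);
# nModified:1 comes only from the executed statement, yet upserted is still
# populated from the replayed records even though upsert was not requested here.
replayed = 3 * 1      # statements 1.0, 1.1, 1.2: no-ops, recorded n replayed
executed_now = 1      # statement 1.3: matches and increments AAA
print(replayed + executed_now)  # 4, matching the reply's n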
Six deletes
delete command: {"delete":"test","deletes":[{"q":{"EEE":1},"limit":1},{"q":{"DDD":1},"limit":1},{"q":{"CCC":1},"limit":1},{"q":{"BBB":1},"limit":1},{"q":{"AAA":1},"limit":1},{"q":{"AAA":2},"limit":1}],"txnNumber":1,"$db":"test","lsid":{"id":{"$binary":"YXt1nbJBTWmgNPM+Vjug2A==","$type":"04"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":6}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}}}
delete reply: {"n":5,"opTime":{"ts":{"$timestamp":{"t":1620079815,"i":7}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":7}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}},"operationTime":{"$timestamp":{"t":1620079815,"i":7}}}
Documents in test.test:
0: {"_id":{"$oid":"609074c70ba2e3bde8ab5285"},"CCC":1}
1: {"_id":{"$oid":"609074c70ba2e3bde8ab5288"},"FFF":2}
The delete contains six statements. The first five match a single document (EEE through AAA, in that order) with value 1, and the sixth matches AAA:2. All but the last two statements (AAA:1 and AAA:2) are no-ops. Given the data in the collection, AAA:1 matches nothing, but AAA:2 matches and deletes one document. Here, the result of n:5 suggests that five documents were actually deleted. Apart from the four no-op statements being counted, the final two statements appear to have been counted accurately (only one of the two resulted in a deletion).
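Broken down against the sketch, assuming each replayed statement again contributes its previously recorded n of 1:

# n:5 = four replayed statements (1.0-1.3) + AAA:1 matching nothing + AAA:2 deleted
replayed = 4 * 1   # statements 1.0-1.3: no-ops, no documents actually deleted
aaa_1 = 0          # statement 1.4: AAA:1 no longer matches (AAA was incremented to 2)
aaa_2 = 1          # statement 1.5: AAA:2 matches and is deleted
print(replayed + aaa_1 + aaa_2)  # 5, matching the reply's n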
One findAndModify
findAndModify command: {"findAndModify":"test","query":{"FFF":1},"update":{"$inc":{"FFF":1}},"upsert":true,"new":true,"txnNumber":1,"$db":"test","lsid":{"id":{"$binary":"YXt1nbJBTWmgNPM+Vjug2A==","$type":"04"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":7}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}}}
findAndModify reply: {"lastErrorObject":{"n":1,"updatedExisting":false,"upserted":{"$oid":"609074c70ba2e3bde8ab5281"}},"value":{"_id":{"$oid":"609074c70ba2e3bde8ab5281"},"AAA":1},"ok":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1620079815,"i":7}},"signature":{"hash":{"$binary":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","$type":"00"},"keyId":0}},"operationTime":{"$timestamp":{"t":1620079815,"i":7}}}
Documents in test.test:
0: {"_id":{"$oid":"609074c70ba2e3bde8ab5285"},"CCC":1}
1: {"_id":{"$oid":"609074c70ba2e3bde8ab5288"},"FFF":2}
Finally, we attempt to repeat one of the earlier upserts using findAndModify for an FFF document. Since findAndModify consists of only a single statement, the entire command is a no-op. The result suggests that a document was upserted, and it returns a copy of the AAA:1 document that was created by the first insert, ignoring that the document was later incremented and ultimately deleted. I think it's clear that this result is derived entirely from the original oplog entry corresponding to statement ID "1.0".
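In terms of the sketch, the lone findAndModify statement keys to "1.0" and is replayed from the record of the very first insert:

# The single statement has index 0, so (lsid, 1, 0) hits the entry recorded by the
# original AAA insert; the reply is reconstructed from that stale record even though
# the AAA document was later updated and deleted.
apply_statement("lsid-A", 1, 0, lambda: {"n": 1})  # replayed; no upsert actually happens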
Possible server improvements
As mentioned in SERVER-56137, there are a few possible server improvements that immediately come to mind:
- Raise a server-side error if a command comes in using an lsid and txnNumber combination that was previously executed by a different command entirely. This may not be possible in all cases given that multiple commands can produce the same type of oplog entry, but it may be feasible in some cases. For example, a delete should never produce an insert oplog entry.
- Even in cases where a server-side error cannot be raised, having an oplog entry of one type not count towards a different command type's result would be an improvement (e.g. a previous insert counting towards a delete's n result).
- Include some checksum or hash of the original command (or at least its relevant parts) in the oplog entry, and raise an error if what the server considers a retry attempt fails the checksum match (see the sketch after this list). This would have some overhead (both to calculate and store), but it would allow even multiple operations using the same command to be differentiated in most cases. There are still some edge cases it wouldn't cover: if the child processes in SERVER-56137 were all intentionally issuing the same command (e.g. $inc a field for the same document), we couldn't expect the server to see them as anything other than retry attempts.
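A minimal sketch of that last idea, assuming the "relevant parts" are just the command name, the statement body, and its position; the function name and field choices are illustrative, not a proposed server design:

import hashlib
import json

def statement_fingerprint(command_name, statement, stmt_index):
    """Hash the parts of a statement that should be identical on a genuine retry."""
    payload = json.dumps(
        {"cmd": command_name, "stmt": statement, "idx": stmt_index},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Stored alongside the (lsid, txnNumber, statement index) record at execution time:
original = statement_fingerprint("insert", {"AAA": 1}, 0)

# On a supposed retry, a mismatch means a different command is reusing the same
# lsid/txnNumber, so the server could reject it instead of replaying an old result.
incoming = statement_fingerprint("delete", {"q": {"EEE": 1}, "limit": 1}, 0)
assert incoming != original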
I'll defer to server product/engineering to determine if any of these are worthwhile, but I'll understand if the entire premise of this issue comes across as a total edge case better left as-is.
- is related to CDRIVER-3978: Add entropy for OpenSSL RAND_bytes to avoid duplicate sequences for same PID (Closed)