- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- None
PyMongo's change stream resume logic is broken, which can cause changes to be missed under some specific circumstances.
What is wrong with the resume logic?
PyMongo does not correctly cache the postBatchResumeToken included in the aggregate command response when firstBatch is empty.
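A minimal sketch of the missing caching step, assuming the aggregate reply carries `cursor.firstBatch` and `cursor.postBatchResumeToken`. The helper name `initial_resume_token` is hypothetical and is not PyMongo's internal API; the handling of a non-empty firstBatch is a simplification for a stream started without any resume option.

```python
from typing import Any, Mapping, Optional


def initial_resume_token(aggregate_reply: Mapping[str, Any]) -> Optional[Mapping[str, Any]]:
    """Return the token to cache right after the initial aggregate.

    Hypothetical helper for illustration; not PyMongo's actual code.
    """
    cursor = aggregate_reply["cursor"]
    if not cursor["firstBatch"]:
        # Empty batch: the only resume point available is the
        # postBatchResumeToken (None if the server did not send one).
        return cursor.get("postBatchResumeToken")
    # Non-empty batch: no document has been returned to the application
    # yet, so this sketch leaves the token unset until one is iterated.
    return None
```

With an empty firstBatch the function returns the postBatchResumeToken, which is the value the bug fails to keep.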
When does this become a problem?
When a change stream that was started without resumeAfter, startAfter, or startAtOperationTime has to resume because the getMore that was run immediately after the aggregate returned an empty firstBatch fails. Consider the following sequence of events:
- the driver runs an aggregate command to create a change stream; let's call this instant in time T1
- the aggregate command response returns an empty firstBatch
- the driver tries to iterate the change stream; since the firstBatch was empty, the driver runs a getMore to get more results from the server, and that getMore fails with a resumable error
- the driver tries to resume the change stream; it has no startAfter, resumeAfter, or startAtOperationTime, and it hasn't cached the postBatchResumeToken from the initial aggregate, so the change stream is recreated without any of these options set; let's call this instant in time T2
Due to this bug, applications might miss events that occur between T1 and T2 since the resume does not have an appropriate resume token to use.
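The sequence above can be reproduced with a toy in-memory model. This is only a sketch: `ToyServer` and `observed_after_failed_get_more` are hypothetical names, resume tokens are modelled as plain list positions, and no real server or PyMongo code is involved.

```python
from typing import Dict, List, Optional


class ToyServer:
    """In-memory stand-in for a server; a token is a position in the event log."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def aggregate(self, resume_after: Optional[int] = None) -> Dict[str, object]:
        # Without a token the stream starts "now"; with one it starts at that position.
        start = len(self.events) if resume_after is None else resume_after
        return {"firstBatch": [], "postBatchResumeToken": start}

    def events_after(self, token: int) -> List[str]:
        return self.events[token:]


def observed_after_failed_get_more(cache_initial_token: bool) -> List[str]:
    server = ToyServer()
    reply = server.aggregate()  # T1: empty firstBatch
    cached = reply["postBatchResumeToken"] if cache_initial_token else None
    server.events.append("insert-1")  # change made between T1 and T2
    # The first getMore fails with a resumable error here, so the driver resumes.
    resumed = server.aggregate(resume_after=cached)  # T2
    server.events.append("insert-2")  # change made after T2
    return server.events_after(resumed["postBatchResumeToken"])
```

In this model, `observed_after_failed_get_more(False)` loses the change made between T1 and T2, while `observed_after_failed_get_more(True)` sees both changes.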
Original Description
test_change_stream.TestAllScenarios.test_change_streams_change_streams_Test_consecutive_resume occasionally blocks forever, causing the test suite to time out:
[2020/07/02 04:30:38.875] test_change_streams_change_streams_Executing_a_watch_helper_on_a_Database_results_in_notifications_for_changes_to_all_collections_in_the_specified_database. (test_change_stream.TestAllScenarios) ... ok (0.092s)
[2020/07/02 04:30:38.990] test_change_streams_change_streams_Executing_a_watch_helper_on_a_MongoClient_results_in_notifications_for_changes_to_all_collections_in_all_databases_in_the_cluster. (test_change_stream.TestAllScenarios) ... ok (0.115s)
[2020/07/02 04:59:05.895] Command stopped early: context canceled
[2020/07/02 04:59:05.924] test_change_streams_change_streams_Test_consecutive_resume (test_change_stream.TestAllScenarios) ...
[2020/07/02 04:59:05.924] Running task-timeout commands.
[2020/07/02 04:59:05.924] Running command 'shell.exec' (step 1 of 1)
Seems to be caused by the changes in PYTHON-2143.
- is caused by: PYTHON-2143 Do not repeatedly resume if getMore receives the same error (Closed)