In the retryable reads POC (PYTHON-1674) there are GridFS tests that use command monitoring events to assert the expected behavior. The tests assume that a driver issues a single find command to read all chunks in a file, like this:
    for chunk in chunks.find({"files_id": file_id}, sort=[("n", 1)]):
        # process chunk...
However, PyMongo actually issues a separate find_one for each chunk, similar to this:
    for chunk_number in range(total_chunks):
        chunk = chunks.find_one({"files_id": file_id, "n": chunk_number})
        # process chunk...
(The above is a simplification; the real implementation lives in the GridOut class.)
Now we could change PyMongo's implementation to cache a cursor and reuse it to iterate over all the chunks, which would match the spec tests; a rough sketch follows the quoted text below. This is also how the GridFS spec itself suggests reading the chunks for a file:
Drivers must first retrieve the files collection document for this file. If there is no files collection document, the file either never existed, is in the process of being deleted, or has been corrupted, and the driver MUST raise an error.
Then, implementers retrieve all chunks with files_id equal to id, sorted in ascending order on “n”.
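A minimal sketch of that cursor-caching approach, assuming a GridOut-like reader (ChunkReader and its method names are illustrative, not PyMongo's actual API):

    from pymongo import ASCENDING
    from gridfs.errors import CorruptGridFile

    class ChunkReader:
        """Sketch: iterate a file's chunks with one cached, reusable cursor."""

        def __init__(self, chunks_collection, file_id):
            self._chunks = chunks_collection
            self._file_id = file_id
            self._cursor = None  # created lazily, then reused across reads

        def _get_cursor(self):
            if self._cursor is None:
                self._cursor = self._chunks.find(
                    {"files_id": self._file_id}, sort=[("n", ASCENDING)]
                )
            return self._cursor

        def read_all(self):
            expected_n = 0
            for chunk in self._get_cursor():
                # The spec requires chunks to arrive in order 0, 1, 2, ...
                if chunk["n"] != expected_n:
                    raise CorruptGridFile(
                        "missing chunk: expected n=%d, got n=%d"
                        % (expected_n, chunk["n"])
                    )
                expected_n += 1
                yield chunk["data"]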
My question is: why do we use many find_one calls instead of a single find? Is it to avoid complications arising from cursor errors like CursorNotFound?
Note that since the default chunk size is 255KB, I expect a single find to be much more performant than many find_one calls, because many chunks fit in a single find/getMore response (with the ~16MB reply limit, roughly 60 chunks per batch).
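If cursor errors are indeed the concern, here is a sketch of one possible mitigation (my assumption about the trade-off, not PyMongo's actual behavior): try a single find, and fall back to per-chunk find_one queries if the server has killed the cursor.

    from pymongo.errors import CursorNotFound

    def read_chunks(chunks, file_id, total_chunks):
        """Sketch: prefer one find; if the server has killed the cursor
        (e.g. after a long pause between reads), resume with find_one."""
        n = 0
        try:
            for chunk in chunks.find({"files_id": file_id}, sort=[("n", 1)]):
                yield chunk["data"]
                n += 1
        except CursorNotFound:
            # Resume from the first unread chunk, one query per chunk.
            for chunk_number in range(n, total_chunks):
                chunk = chunks.find_one(
                    {"files_id": file_id, "n": chunk_number}
                )
                yield chunk["data"]

The linked MOTOR-403 (test_survive_cursor_not_found) suggests this CursorNotFound scenario is one that the current find_one-per-chunk approach sidesteps entirely.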
Issue links:
- causes: MOTOR-403 Fix test.test_grid_file.TestGridFile.test_survive_cursor_not_found (Closed)
- is depended on by: PYTHON-1674 Retryable Reads (Closed)
- is related to: PYTHON-4146 Use insert_many to upload GridFS chunks for better performance (Closed)