In the retryable reads POC (PYTHON-1674) there are GridFS tests that use command monitoring events to assert the expected behavior. The tests assume that a driver issues a single find command to read all chunks in a file, like this:
    for chunk in chunks.find({"files_id": file_id}, sort=[("n", 1)]):
        # process chunk...
However, PyMongo actually issues a separate find_one for each chunk, similar to this:
    for chunk_number in range(total_chunks):
        chunk = chunks.find_one({"files_id": file_id, "n": chunk_number})
        # process chunk...
(The above is a simplification; the real implementation lives in the GridOut class.)
Now we could change PyMongo's implementation to cache a cursor and reuse it to iterate over all the chunks, which would match the spec tests; a rough sketch follows the quoted text below. This is also how the GridFS spec itself suggests reading the chunks for a file:
Drivers must first retrieve the files collection document for this file. If there is no files collection document, the file either never existed, is in the process of being deleted, or has been corrupted, and the driver MUST raise an error.
Then, implementers retrieve all chunks with files_id equal to id, sorted in ascending order on “n”.
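A minimal sketch of that cursor-caching approach, assuming a GridOut-like reader (ChunkReader and its method names are illustrative, not PyMongo's actual API):

    from pymongo import ASCENDING
    from gridfs.errors import CorruptGridFile

    class ChunkReader:
        """Sketch: iterate a file's chunks with one cached, reusable cursor."""

        def __init__(self, chunks_collection, file_id):
            self._chunks = chunks_collection
            self._file_id = file_id
            self._cursor = None  # created lazily, then reused across reads

        def _get_cursor(self):
            if self._cursor is None:
                self._cursor = self._chunks.find(
                    {"files_id": self._file_id}, sort=[("n", ASCENDING)]
                )
            return self._cursor

        def read_all(self):
            expected_n = 0
            for chunk in self._get_cursor():
                # The spec requires chunks to arrive in order 0, 1, 2, ...
                if chunk["n"] != expected_n:
                    raise CorruptGridFile(
                        "missing chunk: expected n=%d, got n=%d"
                        % (expected_n, chunk["n"])
                    )
                expected_n += 1
                yield chunk["data"]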
My question is: why do we use many find_one calls instead of a single find? Is it to avoid complications arising from cursor errors like CursorNotFound?
Note that since the default chunk size is 255KB, I expect a single find to be much more performant than many find_one calls, because many chunks fit in a single find/getMore response (with the ~16MB reply limit, roughly 60 chunks per batch).
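If cursor errors are indeed the concern, here is a sketch of one possible mitigation (my assumption about the trade-off, not PyMongo's actual behavior): try a single find, and fall back to per-chunk find_one queries if the server has killed the cursor.

    from pymongo.errors import CursorNotFound

    def read_chunks(chunks, file_id, total_chunks):
        """Sketch: prefer one find; if the server has killed the cursor
        (e.g. after a long pause between reads), resume with find_one."""
        n = 0
        try:
            for chunk in chunks.find({"files_id": file_id}, sort=[("n", 1)]):
                yield chunk["data"]
                n += 1
        except CursorNotFound:
            # Resume from the first unread chunk, one query per chunk.
            for chunk_number in range(n, total_chunks):
                chunk = chunks.find_one(
                    {"files_id": file_id, "n": chunk_number}
                )
                yield chunk["data"]

The linked MOTOR-403 (test_survive_cursor_not_found) suggests this CursorNotFound scenario is one that the current find_one-per-chunk approach sidesteps entirely.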
Issue links:
- causes: MOTOR-403 Fix test.test_grid_file.TestGridFile.test_survive_cursor_not_found (Closed)
- is depended on by: PYTHON-1674 Retryable Reads (Closed)
- is related to: PYTHON-4146 Use insert_many to upload GridFS chunks for better performance (Closed)