Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: repl-shortlist
Assigned Teams: Replication
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

While looking into some performance issues with the new client bulk API in Python, I found that inserting a batch of 100,000 docs with client.bulk_write adds an extra 2.7MB of metadata (27 bytes per operation) just from the duplicated "insert" and "document" fields:

>>> from bson import encode
>>> doc = {}
>>> client_bulk_op = {"insert": -1, "document": doc}
>>> overhead = len(encode(client_bulk_op)) - len(encode(doc))
>>> overhead
27

Another example:

>>> client.bulk_write([InsertOne({}, namespace='test.test') for _ in range(100_000)])
client bulk OP_MSG size: 4900159
ClientBulkWriteResult(...)
>>> client.test.test.bulk_write([InsertOne({}) for _ in range(100_000)])
collection bulk OP_MSG size: 2200118
BulkWriteResult(...)

That's 4.9MB for client.bulk_write() vs 2.2MB for collection.bulk_write().
Is it possible to improve this, or at least minimize the duplication? For example, using short field names saves 12 bytes per operation, or 1.2MB here:

>>> client_bulk_op = {"i": -1, "d": doc}
>>> len(encode(client_bulk_op)) - len(encode(doc))
15
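The per-operation numbers above can be cross-checked without the driver. The following is a minimal sketch that computes BSON sizes by hand from the BSON layout (an int32 length prefix, one type byte plus a NUL-terminated key per element, and a trailing NUL). `bson_size` is a hypothetical helper written for this illustration, not a PyMongo API, and it handles only int32 values and nested documents, which is all these examples need.

```python
def bson_size(doc):
    """Return the encoded size in bytes of a BSON document.

    Hypothetical helper: supports only int32 values and nested
    documents with string keys.
    """
    size = 4 + 1  # int32 length prefix + trailing NUL terminator
    for key, value in doc.items():
        size += 1 + len(key.encode("utf-8")) + 1  # type byte + cstring key
        if isinstance(value, dict):
            size += bson_size(value)  # embedded document
        elif (isinstance(value, int) and not isinstance(value, bool)
              and -2**31 <= value < 2**31):
            size += 4  # int32 payload
        else:
            raise TypeError(f"unsupported value for {key!r}: {value!r}")
    return size


batch = 100_000
doc = {}

# Overhead of wrapping each document in a client bulk op.
long_overhead = bson_size({"insert": -1, "document": doc}) - bson_size(doc)
short_overhead = bson_size({"i": -1, "d": doc}) - bson_size(doc)

# Difference between the two OP_MSG sizes reported in this ticket.
measured_diff = 4900159 - 2200118

print("long field names: ", long_overhead, "bytes/op,", long_overhead * batch, "bytes/batch")
print("short field names:", short_overhead, "bytes/op,", short_overhead * batch, "bytes/batch")
print("saved by short names:", (long_overhead - short_overhead) * batch, "bytes/batch")
print("measured OP_MSG difference:", measured_diff, "bytes")
```

The computed per-batch overhead for the long field names lines up with the measured OP_MSG difference to within a few dozen bytes, which suggests the duplicated field names account for essentially all of the size gap.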

Here's an example of an 18.8% decrease in throughput in client.bulk_write() vs collection.bulk_write():

$ TEST_PATH=specifications/source/benchmarking/data OUTPUT_FILE=result.txt python test/performance/perf_test.py -v TestSmallDocBulkInsert TestSmallDocClientBulkInsert
runTest (__main__.TestSmallDocBulkInsert.runTest) ... Completed TestSmallDocBulkInsert 23.822 MB/s, MEDIAN=0.105s, total time=30.098s, iterations=230
ok
runTest (__main__.TestSmallDocClientBulkInsert.runTest) ... Completed TestSmallDocClientBulkInsert 19.355 MB/s, MEDIAN=0.129s, total time=30.143s, iterations=199
ok
$ python
>>> 1-(19.355/23.822)
0.18751574175132224

When I update the collection.bulk_write() benchmark to append a 2.7MB comment field, TestSmallDocBulkInsert drops from 23.822 MB/s to 21.615 MB/s. So it does seem likely that the extra overhead is a big factor.
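To put a rough number on "a big factor", the throughput figures reported in this ticket can be compared directly. This is only a back-of-the-envelope sketch: it assumes the three benchmark runs are comparable and that padding the collection bulk command with a comment field is a fair stand-in for the extra client bulk metadata.

```python
# Throughput figures (MB/s) copied from the benchmark runs in this ticket.
collection_bulk = 23.822         # TestSmallDocBulkInsert
client_bulk = 19.355             # TestSmallDocClientBulkInsert
collection_bulk_padded = 21.615  # collection bulk plus a 2.7MB comment field


def relative_drop(baseline, measured):
    """Fractional throughput decrease of `measured` relative to `baseline`."""
    return 1 - measured / baseline


total_drop = relative_drop(collection_bulk, client_bulk)
padding_drop = relative_drop(collection_bulk, collection_bulk_padded)

# Fraction of the client bulk slowdown reproduced by payload size alone.
share = padding_drop / total_drop

print(f"client bulk vs collection bulk: {total_drop:.1%} slower")
print(f"padded collection bulk:         {padding_drop:.1%} slower")
print(f"share explained by padding:     {share:.0%}")
```

Under those assumptions, the extra bytes alone reproduce roughly half of the observed slowdown.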

causes
- SERVER-100316 Investigate and implement performance tests on cluster write (Closed)

related to
- PYTHON-3233 Improved Bulk Write API (Development Complete)
- JAVA-5545 Benchmark Collection and Client bulkWrite (Closed)
- DRIVERS-2862 Benchmark Collection and Client bulkWrite (Implementing)
- DRIVERS-2954 Support the new Bulk Command use at the Database and Collection levels. (Backlog)

Assignee: Unassigned
Reporter: Shane Harvey
Participants: Shane Harvey
Votes: 2
Watchers: 15

Created: Aug 22 2024 09:11:11 PM UTC
Updated: Feb 27 2025 11:46:09 PM UTC
