PyMongo inefficiently reads and assembles large messages off the network in network._receive_data_on_socket:
def _receive_data_on_socket(sock, length): msg = b"" while length: try: chunk = sock.recv(length) except (IOError, OSError) as exc: if _errno_from_exception(exc) == errno.EINTR: continue raise if chunk == b"": raise AutoReconnect("connection closed") length -= len(chunk) msg += chunk return msg
The biggest problem is the msg += chunk statement (line 179 in network.py), where each recv'd chunk is appended to the full message with bytes +=. This is relatively efficient only on CPython 2, because str has an optimized version of +=. That optimization is not present on PyPy 2 or Python 3, so performance on Python 3 (and PyPy 2) suffers when assembling large messages. For example, when reading large batches of documents:
$ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
10 loops, best of 3: 428 msec per loop
$ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
10 loops, best of 3: 1.32 sec per loop
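To see the concatenation cost in isolation, separate from the driver and the network, a small standalone comparison along these lines can be used (the chunk size and counts here are arbitrary, chosen only to make the difference visible; this is not part of PyMongo):

from __future__ import print_function

import timeit

CHUNK = b"x" * 1024   # 1 KiB per simulated recv() call
N_CHUNKS = 2048       # ~2 MiB message total


def concat_bytes():
    # Mirrors the current code: grow an immutable bytes object with +=,
    # which on Python 3 copies the accumulated message on every iteration.
    msg = b""
    for _ in range(N_CHUNKS):
        msg += CHUNK
    return msg


def fill_bytearray():
    # Preallocate the final size and copy each chunk in with slice
    # assignment, so each chunk is copied exactly once.
    msg = bytearray(N_CHUNKS * len(CHUNK))
    offset = 0
    for _ in range(N_CHUNKS):
        msg[offset:offset + len(CHUNK)] = CHUNK
        offset += len(CHUNK)
    return bytes(msg)


print("bytes +=  :", min(timeit.repeat(concat_bytes, number=1, repeat=3)))
print("bytearray :", min(timeit.repeat(fill_bytearray, number=1, repeat=3)))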
The slowdown becomes worse when recv'ing many small chunks of data to assemble a large message. One can simulate this by configuring a small TCP recv buffer:
$ sudo sysctl net.inet.tcp.recvspace=16384
net.inet.tcp.recvspace: 131072 -> 16384
$ sudo sysctl net.inet.tcp.autorcvbufmax=262144
net.inet.tcp.autorcvbufmax: 1048576 -> 262144
$ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
10 loops, best of 3: 416 msec per loop
$ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
10 loops, best of 3: 3.48 sec per loop
The fix is to preallocate a bytearray and copy each chunk into it using slice assignment. On Python 3 we can do even better by passing a memoryview of the bytearray to socket.recv_into. A sketch of that approach is shown below, followed by the same benchmark with these improvements.
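A minimal sketch of the Python 3 path, assuming the same surrounding code as the snippet above (EINTR handling is simplified to check exc.errno directly rather than PyMongo's internal _errno_from_exception helper, and the final bytes() copy is only for illustration; the actual patch may differ):

import errno

from pymongo.errors import AutoReconnect


def _receive_data_on_socket(sock, length):
    # Preallocate the full message buffer once instead of growing an
    # immutable bytes object with += for every chunk received.
    buf = bytearray(length)
    view = memoryview(buf)
    received = 0
    while received < length:
        try:
            # recv_into writes directly into the preallocated buffer,
            # so no intermediate bytes object is created per chunk.
            chunk_length = sock.recv_into(view[received:])
        except (IOError, OSError) as exc:
            # Retry when the call is interrupted by a signal.
            if exc.errno == errno.EINTR:
                continue
            raise
        if chunk_length == 0:
            raise AutoReconnect("connection closed")
        received += chunk_length
    # Return an immutable copy for simplicity in this sketch.
    return bytes(buf)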
$ python2.7 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
10 loops, best of 3: 384 msec per loop
$ python3.6 -m timeit -s 'from pymongo import MongoClient;c=MongoClient().test.test' 'for d in c.find_raw_batches(batch_size=1000000):d'
10 loops, best of 3: 194 msec per loop
Much better!
** Original description **
Cursor hangs after reading 101 documents from MongoDB.
Cursor gets stuck for a few seconds after reading 101 documents.
This happens again after reading 5776 and 11451 documents.
My collection has 30,000 documents, each with 90 fields (just attribute:value, only strings).
This problem only occurs in Python, using pymongo.
It does not occur when using the mongo shell.
#!/usr/bin/python3
from pymongo import MongoClient

entriestoprocess = 15000
rowCount = 1
mongo_client = MongoClient()
db = mongo_client.test2
cursor = db.wr.find()
for i in cursor:
    if (rowCount > entriestoprocess):
        print("Finished after {} entries".format(rowCount))
        break
    print('----------------------------------------')
    print("Processing entry: {}".format(rowCount))
    rowCount += 1
Is related to: RUBY-1364 Bad performance reading large documents over SSL (Closed)
Related to: PYTHON-413 MemoryError while retrieving large cursors (Closed)