Tabular datatypes (e.g. pandas.DataFrame, pyarrow.Table) are slow to iterate over and transform into documents that can be inserted into MongoDB. These datatypes are optimized for columnar operations and are fundamentally ill-suited for high-performance conversion to a sequence of documents that can be inserted or upserted into a MongoDB cluster.
Furthermore, when dealing with a large dataset, write performance suffers not only from the inefficiency of transposing the columnar data, but also, and often severely, when documents are not sent to the server in optimally sized batches (e.g. inserting one document at a time is very slow).
We can significantly alleviate the pain of persisting tabular datasets from Python to a MongoDB cluster by writing a C extension that:
- iterates efficiently over the C arrays underlying the columns that make up a table and encodes them directly to BSON
- batches bulk writes automatically and optimally, based on some heuristic, to minimize the number of network round-trips needed to store the dataset
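For illustration, the two steps above can be sketched in pure Python. This is only a rough model of the idea, not the proposed C extension: it transposes plain Python sequences instead of encoding C arrays directly to BSON, and it uses a fixed batch size where the ticket calls for a heuristic. The names iter_documents and iter_batches are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Sequence

Document = Dict[str, Any]


def iter_documents(columns: Dict[str, Sequence[Any]]) -> Iterator[Document]:
    """Transpose column-oriented data into one dict (document) per row.

    Hypothetical helper: a C extension would walk the underlying C arrays
    and emit BSON directly instead of building Python dicts.
    """
    names = list(columns)
    # zip(*columns) yields one tuple of values per row.
    for values in zip(*(columns[name] for name in names)):
        yield dict(zip(names, values))


def iter_batches(
    documents: Iterable[Document], batch_size: int
) -> Iterator[List[Document]]:
    """Group documents into lists of at most batch_size items.

    Assumption: a fixed batch_size stands in for the sizing heuristic
    mentioned in the ticket.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Example column-oriented data (made up for this sketch).
columns = {
    "_id": [1, 2, 3, 4, 5],
    "price": [9.5, 3.0, 7.25, 1.0, 4.5],
}

batches = list(iter_batches(iter_documents(columns), batch_size=2))
for batch in batches:
    # With PyMongo, each batch could be sent in a single round-trip,
    # e.g. collection.insert_many(batch); omitted so the sketch runs
    # without a server.
    print(len(batch), batch[0])
```

Each batch is a list of dicts, so it could be handed to PyMongo's Collection.insert_many, which sends many documents per round-trip rather than one at a time.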
Note: this ticket previously tracked the implementation of write support in BSON-NumPy.
- is duplicated by: INTPYTHON-10 Write support for tabular datatypes (Closed)