Type: Epic
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 0.4.0
Affects Version/s: None
Component/s: None
Labels:
- rp-track

Epic Status:
Done
Epic Name:
Support for Writing Tabular Data

Quarter:
- FY23Q1
Scope Cost Estimate:
2
Cost to Date:
2
Final Cost Estimate:
4
Cost Threshold %:
200
Detailed Project Statuses:

Engineer: Julius Park

2022-04-05: Setting end date to 2022-04-15

Julius has continued to make strong progress. He has added basic support for writing to MongoDB from PyArrow and has been refining the solution while running performance tests. With the most recent changes, the performance hit appears to be negligible.

Engineer: Julius Park

2022-03-22: Setting end date to 2022-04-01

The team has signed off on the design and Julius has begun implementation.

Julius has round-trip testing working and he's also implemented row-specific error messages.

Engineer: Julius Park

2022-03-08: Start date pending design completion

Julius quickly got approval on his scope doc and has just as quickly stubbed out a design draft and suggestions for ticket breakdown. The team is deep into reviewing the doc.

Tabular datatypes (e.g. pandas.DataFrame, pyarrow.Table) are slow to iterate over and transform into documents that can be inserted into MongoDB. These datatypes are optimized for columnar operations and are fundamentally ill-suited for high-performance conversion to a sequence of documents that can be inserted/upserted into a MongoDB cluster.

Furthermore, when dealing with a large dataset, write performance suffers not only from inefficiencies in transposing the columnar data but also from documents not being sent to the server in optimally sized batches (e.g. inserting one document at a time is very slow).

We can significantly alleviate pain associated with the process of persisting tabular datasets from Python to a MongoDB cluster by writing a C-extension that:

- iterates efficiently over the C arrays underlying the columns that make up a table and encodes them directly to BSON
- batches bulk writes automatically and optimally based on some heuristic to minimize the number of network round-trips needed to store the dataset
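
To illustrate the two steps above, here is a minimal pure-Python sketch of transposing columns into documents and grouping them into size-bounded batches. It is not the C-extension this ticket proposes. The names `iter_documents` and `iter_batches`, the `max_docs` and `max_bytes` limits, and the use of JSON length as a stand-in for encoded BSON size are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def iter_documents(columns: Dict[str, List[Any]]) -> Iterator[Dict[str, Any]]:
    """Transpose column-oriented data into row-oriented documents.

    `columns` is a hypothetical stand-in for a table: column name -> values.
    """
    names = list(columns)
    # zip walks all columns in lockstep, producing one row per iteration.
    for row in zip(*(columns[name] for name in names)):
        yield dict(zip(names, row))


def iter_batches(
    docs: Iterable[Dict[str, Any]],
    max_docs: int = 1000,          # assumed limit, not taken from the ticket
    max_bytes: int = 1_000_000,    # assumed limit, not taken from the ticket
) -> Iterator[List[Dict[str, Any]]]:
    """Group documents into batches bounded by count and approximate size."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in docs:
        # Rough size proxy only; a real implementation would measure BSON.
        size = len(json.dumps(doc).encode("utf-8"))
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

Each yielded batch could then be sent in a single round-trip, for example with PyMongo's `Collection.insert_many(batch)`. The per-row Python loop shown here is exactly the overhead the proposed C-extension is meant to avoid.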

Note: previously this ticket tracked implementation of write support in BSON-NumPy.

is duplicated by

INTPYTHON-10 Write support for tabular datatypes

Closed

Assignee: Julius Park (Inactive)

Reporter: Rathi Gnanasekaran

Votes: 0

Watchers: 3

Created: Aug 19 2019 07:26:31 PM UTC

Updated: Jul 19 2024 01:24:27 AM UTC

Resolved: May 05 2022 03:46:07 PM UTC

Start date: 21/Mar/22

End date: 15/Apr/22
