Type: Bug
Resolution: Fixed
Priority: Minor - P4
Fix Version/s: 3.7
Affects Version/s: None
Component/s: BSON
Labels: None


The check for valid UTF-8 bytes in BSON does not catch all invalid sequences.
Take the following example (run with Python 2.7 and PyMongo 3.4):

import bson

# data from the cpe
bad_data = "\xf4\\\x89\x93';"

# encode it with pymongo's bson
m = bson.BSON.encode({'x': bad_data})

# decode it (this should work, right?)
bson.BSON.decode(m)

# It does not:
# InvalidBSON: 'utf8' codec can't decode byte 0xf4 in position 0: invalid continuation byte

And yes, if you ask Python itself to decode the data as UTF-8, it fails too:

bad_data.decode('utf8')
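
On Python 3, where a plain string literal is text rather than bytes, the same check can be reproduced with a bytes literal. This is only a minimal sketch; the helper name `is_valid_utf8` is made up for illustration and is not part of PyMongo.

```python
def is_valid_utf8(data):
    """Return True if Python's own strict UTF-8 decoder accepts `data`."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Same bytes as bad_data above, written as a bytes literal.
bad_data = b"\xf4\\\x89\x93';"

print(is_valid_utf8(bad_data))             # Python rejects these bytes
print(is_valid_utf8(b"\xf4\x8f\xbf\xbf"))  # U+10FFFF, the highest code point
```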

I think the check in the bson module does not validate correctly when the first byte is 244 (0xF4).
According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout, not every byte combination is allowed after a leading 244. Python checks this, but PyMongo's BSON code does not.
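
For reference, the UTF-8 definition (RFC 3629) requires a lead byte of 0xF4 to be followed by one byte in 0x80..0x8F and then two bytes in 0x80..0xBF; anything else is either not a continuation byte or would encode a code point above U+10FFFF. A minimal sketch of that rule follows. The function name is hypothetical, and this is neither PyMongo's nor the C driver's code.

```python
def is_legal_f4_sequence(seq):
    """Check a 4-byte sequence whose lead byte is 0xF4 against RFC 3629."""
    b = bytearray(seq)  # yields ints on both Python 2 and Python 3
    if len(b) != 4 or b[0] != 0xF4:
        return False
    # The second byte is restricted to 0x80..0x8F so the result stays <= U+10FFFF.
    if not 0x80 <= b[1] <= 0x8F:
        return False
    # The third and fourth bytes are ordinary continuation bytes.
    return all(0x80 <= x <= 0xBF for x in b[2:])


print(is_legal_f4_sequence(b"\xf4\\\x89\x93"))     # first four bytes of bad_data
print(is_legal_f4_sequence(b"\xf4\x8f\xbf\xbf"))   # U+10FFFF
print(is_legal_f4_sequence(b"\xf4\x90\x80\x80"))   # would be above U+10FFFF
```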

I attached an example program that tries all 2**32 four-byte combinations (it takes ~60 minutes on an 8-core machine). As the output shows, when the first byte is 244, Python and PyMongo disagree on about 500k sequences as to what is valid.

python test_all.py
.....
range with leading byte 243 is good
with leading byte 244 this amount differs 524288
range with leading byte 245 is good
.......
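
The attached test_all.py is not reproduced in this ticket, so the following is only a sketch of the same idea on a reduced search space: it counts the 4-byte sequences with a given lead byte on which Python's decoder and some other validator disagree. `lax_check` is a made-up validator, not PyMongo's code. It is an assumption, chosen because a check that enforces only the upper bound on the second byte would wrongly accept 128 * 64 * 64 = 524288 sequences, which happens to match the count above.

```python
import itertools


def python_accepts(seq):
    """True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        bytes(seq).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def lax_check(seq):
    """Hypothetical validator (NOT PyMongo's code): only an upper bound on byte 2."""
    return seq[1] <= 0x8F and all(0x80 <= x <= 0xBF for x in seq[2:])


def count_disagreements(lead, validator, second=range(256), rest=range(256)):
    """Count sequences [lead, b2, b3, b4] where the two validators differ."""
    diff = 0
    for b2, b3, b4 in itertools.product(second, rest, rest):
        seq = bytearray([lead, b2, b3, b4])
        if python_accepts(seq) != validator(seq):
            diff += 1
    return diff


# Restrict bytes 3 and 4 to four continuation values each to keep this fast;
# the default ranges walk all 2**24 sequences for the lead byte and are slow.
print(count_disagreements(0xF4, lax_check, rest=range(0x80, 0x84)))  # 128 * 4 * 4
```

With the default ranges the same call covers every sequence for one lead byte, which is the per-byte comparison the output above summarises.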

My tests were done with Python 2.7 and with PyMongo 2.8 and 3.4. However, judging by the history of the validation code, newer versions (3.6.1) and older versions appear to be affected as well.
https://github.com/mongodb/mongo-python-driver/blame/3.6.1/bson/encoding_helpers.c

Out of curiosity, I hacked up a small Python 2 C extension (see attached) that uses the UTF-8 validation from the MongoDB C driver:
https://github.com/mongodb/mongo-c-driver/blob/master/src/libbson/src/bson/bson-utf8.c

This validation agrees with Python in every case. My proposal is therefore to replace the validation code in pymongo/bson with the one from the C driver.

Attachments:
_isutf8.c (4 kB, Mar 15 2018 01:41:06 PM UTC)
test_all.py (1 kB, Mar 15 2018 01:43:33 PM UTC)

Is depended on by: PYTHON-1557 Release PyMongo 3.7 (Closed)

Assignee: Bernie Hackett

Reporter: Stephan Hof

Votes: 0

Watchers: 3

Created: Mar 15 2018 01:52:17 PM UTC

Updated: Oct 29 2023 02:30:43 AM UTC

Resolved: Jun 25 2018 03:49:53 AM UTC
