The check for valid UTF-8 bytes in BSON does not catch all cases.
Take, for example, the following snippet (run with Python 2.7 and PyMongo 3.4):
    import bson

    # data from the cpe
    bad_data = "\xf4\\\x89\x93';"

    # encode it with pymongo's bson
    m = bson.BSON.encode({'x': bad_data})

    # decode it (should work right ?)
    bson.BSON.decode(m)

    # AHHH it doesn't
    # InvalidBSON: 'utf8' codec can't decode byte 0xf4 in position 0: invalid continuation byte
And yes, if you ask Python to decode it as UTF-8, it fails:

    bad_data.decode('utf8')
I think the check in the BSON module does not validate correctly if the first byte is 244 (0xF4).
According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout, not all byte combinations after a leading 244 are valid. Python checks this, but PyMongo's BSON module doesn't.
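To make the restriction concrete: in well-formed UTF-8, a leading byte 0xF4 must be followed by a second byte in the range 0x80..0x8F, since anything higher would encode a code point above U+10FFFF. Below is a minimal sketch of that rule, compared against Python's own decoder. The helper names are made up for illustration and are not part of PyMongo or its validation code.

```python
def is_valid_f4_sequence(seq):
    """Return True if seq is a well-formed 4-byte UTF-8 sequence led by 0xF4.

    Hypothetical helper for illustration only.
    """
    b = bytearray(seq)
    if len(b) != 4 or b[0] != 0xF4:
        return False
    # The second byte is limited to 0x80..0x8F so the code point stays <= U+10FFFF.
    if not 0x80 <= b[1] <= 0x8F:
        return False
    # The third and fourth bytes are ordinary continuation bytes (0x80..0xBF).
    return all(0x80 <= x <= 0xBF for x in b[2:])


def python_accepts(seq):
    """Return True if Python's UTF-8 decoder accepts the byte sequence."""
    try:
        bytes(bytearray(seq)).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # U+10FFFF, the highest valid code point, versus the first sequence past it.
    for seq in (b"\xf4\x8f\xbf\xbf", b"\xf4\x90\x80\x80"):
        print(repr(seq), is_valid_f4_sequence(seq), python_accepts(seq))
```

A validator that only checks that the trailing bytes look like continuation bytes would accept the second sequence, which is the kind of mismatch described above.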
I attached an example program which tries all 2**32 combinations of four bytes (takes ~60 minutes on an 8-core machine). As the output shows, when the first byte is 244, Python and PyMongo disagree about validity for around 500k inputs.
    python test_all.py
    .....
    range with leading byte 243 is good
    with leading byte 244 this amount differs 524288
    range with leading byte 245 is good
    .......
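For readers without the attachment, here is a much smaller sketch of the same kind of comparison. It is not the attached test_all.py: it varies only the second and third bytes (the fourth is fixed), and it compares Python's decoder against `lax_validator`, a hypothetical stand-in that only checks for continuation bytes. To test PyMongo itself, one would pass a function that wraps its BSON decoding instead.

```python
def python_accepts(data):
    """Return True if Python's UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def lax_validator(data):
    """Hypothetical too-permissive check (NOT PyMongo's code): only verifies
    that every byte after the leading byte is a continuation byte."""
    b = bytearray(data)
    return all(0x80 <= x <= 0xBF for x in b[1:])


def count_disagreements(leading, validator, last=0x80):
    """Count (second, third) byte pairs on which validator and Python differ
    for 4-byte inputs of the form: leading, second, third, last."""
    differing = 0
    for second in range(256):
        for third in range(256):
            data = bytes(bytearray([leading, second, third, last]))
            if validator(data) != python_accepts(data):
                differing += 1
    return differing


if __name__ == "__main__":
    for leading in (0xF3, 0xF4):
        print(leading, count_disagreements(leading, lax_validator))
```

With this stand-in, leading byte 0xF3 gives 0 disagreements and 0xF4 gives 3072 (48 out-of-range second bytes times 64 third bytes). These counts come from the reduced search space and are not comparable to the 524288 reported above.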
My tests were done with Python 2.7 and with PyMongo 2.8 and 3.4. However, judging by the history of the validation code, newer versions (3.6.1) and older versions also seem to be affected:
https://github.com/mongodb/mongo-python-driver/blame/3.6.1/bson/encoding_helpers.c
Since I was curious, I hacked up a little Python 2 C extension (see attached) which uses the UTF-8 validation from the MongoDB C driver:
https://github.com/mongodb/mongo-c-driver/blob/master/src/libbson/src/bson/bson-utf8.c
This validation is in sync with Python; I found no differences. So my proposal is to replace the validation code in pymongo/bson with the one from the C driver.
Is depended on by: PYTHON-1557 Release PyMongo 3.7 (Closed)