- Type: Improvement
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: BSON
- None
- Minor Change
Currently the BSON extension behaves differently on MRI and on JRuby when put_string is given a string that is labeled with UTF-8 encoding but is not a valid UTF-8 byte sequence, or a string in another encoding:
- In MRI, it is possible to construct strings labeled as UTF-8 that are not valid UTF-8. The bson C extension rejects these strings.
- In JRuby, it is not possible to construct strings labeled as UTF-8 that are not valid UTF-8.
- In MRI, writing a string that is not labeled as UTF-8 but happens to contain valid UTF-8 sequences treats the string as if it were labeled as UTF-8 and writes it verbatim to the BSON buffer.
- In JRuby, writing a string that is not labeled as UTF-8 first converts it to UTF-8, seemingly treating each byte in the original string as a code point. This silently mutates the input during serialization.
- In MRI, writing a string in an encoding other than UTF-8 fails with an error saying the string is not valid UTF-8, even if the byte sequence is valid in the claimed encoding and the string can be encoded to UTF-8.
- In JRuby, the same input is encoded to UTF-8 and written to the BSON buffer.
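The three kinds of input described above can be constructed in plain Ruby on MRI. The sketch below makes no BSON calls; it only shows how Ruby itself sees each string. The helper `describe_string` and the constant names are hypothetical and used for illustration only.

```ruby
# Hypothetical helper (not part of the bson gem): reports how Ruby sees a string.
def describe_string(str)
  { encoding: str.encoding, valid: str.valid_encoding?, bytes: str.bytes }
end

# 1. Labeled UTF-8, but the bytes are not a valid UTF-8 sequence (possible in MRI).
INVALID_UTF8 = "\xFF\xFE".dup.force_encoding(Encoding::UTF_8)

# 2. Valid UTF-8 bytes, but the string is labeled as binary (mislabeled).
MISLABELED = "caf\u00e9".b

# 3. Valid in its own non-UTF-8 encoding and convertible to UTF-8.
LATIN1 = "caf\u00e9".encode(Encoding::ISO_8859_1)

[INVALID_UTF8, MISLABELED, LATIN1].each do |str|
  puts describe_string(str).inspect
end
```

Only the first string reports `valid: false`. The second and third are valid in their labeled encodings, which is why the difference in how each extension handles them is easy to miss.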
Proposed new behavior:
- Strings that are not already in UTF-8 encoding are first converted to UTF-8. The conversion can fail, in which case Encoding::UndefinedConversionError propagates to the application.
- The string is then checked to contain only valid UTF-8 byte sequences.
- Finally, the string is written to the byte buffer.
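The proposed steps can be sketched in pure Ruby as follows. This is only an illustration of the order of operations, not the extension code: the method name `prepare_string_for_bson` is hypothetical, and raising `EncodingError` for invalid UTF-8 is an assumption, since the ticket does not name the error class used for that case.

```ruby
# Hypothetical pure-Ruby stand-in for the proposed put_string preprocessing.
# The real logic would live in the bson C and Java extensions.
def prepare_string_for_bson(str)
  # Step 1: convert to UTF-8 unless the string is already labeled UTF-8.
  # String#encode may raise Encoding::UndefinedConversionError here.
  utf8 = str.encoding == Encoding::UTF_8 ? str : str.encode(Encoding::UTF_8)

  # Step 2: a UTF-8 label does not guarantee valid bytes, so check explicitly.
  # (The error class is an assumption made for this sketch.)
  raise EncodingError, "String is not valid UTF-8" unless utf8.valid_encoding?

  # Step 3: return the raw bytes that would be written to the byte buffer.
  utf8.b
end

latin1 = "caf\u00e9".encode(Encoding::ISO_8859_1)
puts prepare_string_for_bson(latin1).bytes.inspect
```

Under this sketch, a binary-labeled string containing non-ASCII bytes fails in step 1, while a UTF-8-labeled string with invalid bytes passes step 1 untouched and fails in step 2.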
For MRI, this change means that strings in non-UTF-8 encodings which contain valid data will be serialized after conversion to UTF-8. Applications that give mislabeled strings to the driver (i.e. UTF-8 data in a string not marked as having UTF-8 encoding) will need to set the encoding correctly.
For JRuby, this change means the bson extension will no longer mutate data that is not valid UTF-8. Some data that was previously silently mutated will now be rejected.
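For applications that hold UTF-8 data in a mislabeled string, one way to set the encoding correctly is to relabel the bytes rather than convert them. The following is a minimal sketch; `relabel_as_utf8` is a hypothetical application-side helper, not a driver API.

```ruby
# Hypothetical application-side helper: relabels bytes as UTF-8 without changing them.
def relabel_as_utf8(str)
  # force_encoding only changes the label; dup keeps the caller's string untouched.
  relabeled = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "bytes are not valid UTF-8" unless relabeled.valid_encoding?
  relabeled
end

raw = "caf\u00e9".b             # UTF-8 bytes that arrived labeled as binary
fixed = relabel_as_utf8(raw)
puts fixed.encoding             # the relabeled copy
puts raw.encoding               # the original is unchanged
```

Relabeling with `force_encoding` is appropriate only when the bytes are already UTF-8. For data that is genuinely in another encoding, `String#encode` is the conversion to use.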
- is related to
  RUBY-1977 Document and repair edge cases in ByteBuffer (Closed)