Java Driver allows inserting invalid UTF-8 as string values


    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Minor - P4
    • Affects Version/s: 5.1.2
    • Component/s: Codecs
    • Java Drivers

      Summary

      Inserting a BSON document that has the Java string "\uD83E" (a lone UTF-16 high surrogate) as one of its field values results in the field containing invalid UTF-8 data.

      Please provide the version of the driver. If applicable, please provide the MongoDB server version and topology (standalone, replica set, or sharded cluster).

      Driver version: 5.1.2

      MongoDB server version: 7.0.12

      How to Reproduce

      Run the following code:

       

      package org.example;
      
      import com.mongodb.ConnectionString;
      import com.mongodb.client.MongoClients;
      import org.bson.Document;
      
      public class Main {
        public static void main(String[] args) {
          var connectionString = new ConnectionString(System.getenv("MONGODB_URL"));
          try (var client = MongoClients.create(connectionString)) {
            var database = client.getDatabase(connectionString.getDatabase());
            var collection = database.getCollection("_utf-8-test");
            var document = new Document();
            // U+D83E is a lone high surrogate (half of a UTF-16 pair), so it has no valid UTF-8 encoding.
            document.append("test", "\uD83E");
            collection.insertOne(document);
          }
        }
      } 
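
      Until the value reaches the driver, a caller can detect this kind of string on its own. The following is a minimal sketch, not part of the driver: `Utf8Guard`, `isEncodableAsUtf8`, and `requireWellFormed` are hypothetical names, and the check relies only on the JDK's strict UTF-8 encoder, which refuses unpaired surrogates.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the MongoDB Java driver.
final class Utf8Guard {

    // Returns true only if the whole string can be encoded as well-formed UTF-8,
    // which is the case exactly when it contains no unpaired surrogate.
    static boolean isEncodableAsUtf8(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    // Rejects a malformed string before it is handed to the driver.
    static String requireWellFormed(String s) {
        if (!isEncodableAsUtf8(s)) {
            throw new IllegalArgumentException("string contains an unpaired surrogate");
        }
        return s;
    }

    public static void main(String[] args) {
        // U+D83E alone is a lone high surrogate.
        System.out.println(isEncodableAsUtf8("\uD83E"));       // prints "false"
        // U+D83E followed by U+DD14 is a complete surrogate pair.
        System.out.println(isEncodableAsUtf8("\uD83E\uDD14")); // prints "true"
    }
}
```

      In the reproduction above, the guard could wrap the value, for example `document.append("test", Utf8Guard.requireWellFormed(value))`, so the insert fails on the client instead of storing bytes that other drivers cannot read.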

       

      Additional Background

      Reading the document back with the Rust driver fails with a deserialization error:

       

      Raw document: RawDocumentBuf {
          data: "24000000075f69640066c355aaf91cda3cd766501402746573740004000000eda0be0000",
      }
      Error: Error { kind: BsonDeserialization(DeserializationError { message: "invalid utf-8 sequence of 1 bytes from index 0" }), labels: {}, wire_version: None, source: None } 
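
      The string payload in the raw document above is `eda0be` followed by the terminating `00`. As a sketch of where those bytes can come from, the code below applies the generic three-byte UTF-8 bit pattern to the code unit U+D83E without rejecting surrogates, then shows that a strict decoder refuses the result. This is an assumption about how such bytes arise, not a statement about the driver's actual implementation; `SurrogateBytes`, `encodeThreeByte`, and `isStrictUtf8` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class, not part of any driver.
final class SurrogateBytes {

    // Encodes one UTF-16 code unit in the range U+0800..U+FFFF with the generic
    // three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx, without checking for surrogates.
    static byte[] encodeThreeByte(char c) {
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))
        };
    }

    // Returns true only if the bytes decode as well-formed UTF-8 under a strict decoder.
    static boolean isStrictUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] bytes = encodeThreeByte('\uD83E');
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        System.out.println(hex);                 // prints "eda0be"
        System.out.println(isStrictUtf8(bytes)); // prints "false"
    }
}
```

      The three bytes match the payload in the raw document, and the strict decoder rejects them for the same reason the Rust driver does: code points in the surrogate range U+D800..U+DFFF are not allowed in UTF-8.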

       

            Assignee:
            Valentin Kavalenka
            Reporter:
            Asger Drewsen
            Votes:
            0
            Watchers:
            5
