-
Type: Bug
-
Resolution: Won't Fix
-
Priority: Minor - P4
-
None
-
Affects Version/s: None
-
Component/s: BSON
Summary
It is possible to create a String in Java containing char sequences that, when encoded by the MongoDB Java driver, are stored in a MongoDB string field as invalid UTF-8.
The BSON spec defines the string type as UTF-8, which I think implies valid UTF-8 (but I could be wrong). When invalid UTF-8 is stored in a string field, it becomes impossible to create a text index on that field because MongoDB throws an exception. It is then also difficult to find and fix the affected fields.
This behavior differs from the mongo shell and other drivers, which validate the UTF-8 before persistence and substitute replacement characters such as '?' or U+FFFD to ensure the DB only contains valid UTF-8 strings.
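To make "valid UTF-8" concrete, here is a minimal sketch of a strict check over raw bytes using only the JDK. `Utf8Validator` and `isValidUtf8` are hypothetical names, not part of the MongoDB Java driver, and how the raw bytes of a stored field would be obtained is left out. The sample invalid bytes are illustrative and are not claimed to be what the driver writes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the MongoDB Java driver.
class Utf8Validator {

    // Returns true only if the bytes decode as well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = "Firstname".getBytes(StandardCharsets.UTF_8);
        // A surrogate code point (U+D83D) written in three-byte form, which UTF-8 forbids.
        byte[] loneSurrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0xBD};
        // The first two bytes of a four-byte sequence, cut off in the middle.
        byte[] truncated = {(byte) 0xF0, (byte) 0x9F};

        System.out.println(isValidUtf8(valid));         // true
        System.out.println(isValidUtf8(loneSurrogate)); // false
        System.out.println(isValidUtf8(truncated));     // false
    }
}
```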
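Until the driver validates strings itself, an application could apply the same kind of substitution before inserting. The following is a sketch under that assumption, not driver behavior: `Utf8Sanitizer`, `isEncodable`, and `replaceUnpairedSurrogates` are hypothetical names, and U+FFFD is chosen here as the replacement character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical application-side helper; not part of the MongoDB Java driver.
class Utf8Sanitizer {

    // True if the whole string can be encoded as UTF-8 (no unpaired surrogates).
    static boolean isEncodable(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    // Replaces every unpaired surrogate with U+FFFD and keeps everything else.
    static String replaceUnpairedSurrogates(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            // codePointAt joins a valid surrogate pair into one code point;
            // an unpaired surrogate comes back as itself.
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp);
            if (len == 1 && Character.isSurrogate((char) cp)) {
                sb.append('\uFFFD');
            } else {
                sb.appendCodePoint(cp);
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bad = "Firstname \uD83D"; // ends with a lone high surrogate
        System.out.println(isEncodable(bad));                            // false
        System.out.println(isEncodable(replaceUnpairedSurrogates(bad))); // true
    }
}
```

Calling `replaceUnpairedSurrogates` on each string value before `insertOne` would keep only encodable strings in the collection, at the cost of an extra pass over each string.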
Please provide the version of the driver. If applicable, please provide the MongoDB server version and topology (standalone, replica set, or sharded cluster).
Replicable with Mongo Java Drivers: 4.3.x, 4.4.0
MongoDB Server: 4.4.10 (crashes replica set trying to create text index on the field)
MongoDB Server: 5.0.4 (fails with exception trying to create text index on the field)
How to Reproduce
Create a String in Java from random bytes or by truncating a String that contains emojis. Insert it into a collection, then try to create a text index on that field.
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.bson.Document;

import com.mongodb.MongoClientSettings;
import com.mongodb.MongoCommandException;
import com.mongodb.MongoCredential;
import com.mongodb.ServerAddress;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

public class CreateInvalidUTF8 {
    // args[0] = host, args[1] = user name, args[2] = password
    public static void main(String[] args) {
        String fname = "Firstname";
        // Each emoji is a supplementary character, stored in Java as a surrogate pair (two chars).
        String lname = "😀😀😀😀";

        MongoClient mongoCli = MongoClients.create(MongoClientSettings.builder()
                .applyToClusterSettings(builder -> builder
                        .hosts(List.of(new ServerAddress(args[0]))))
                .credential(MongoCredential.createCredential(args[1], "admin", args[2].toCharArray()))
                .build());
        MongoCollection<Document> coll = mongoCli.getDatabase("test_utf8").getCollection("foo");
        MongoCollection<Document> coll2 = mongoCli.getDatabase("test_utf8").getCollection("bar");

        Document d = new Document();
        // This is a common mistake in Java code that takes user input:
        // charAt(0) returns only the first half of the surrogate pair.
        // But it could also be a String created from any byte array
        // containing random sequences that cannot be encoded.
        d.put("name", fname + " " + lname.charAt(0));
        coll.insertOne(d);

        // Fails with MongoCommandException "text contains invalid UTF-8"
        try {
            coll.createIndex(new Document("name", "text"));
        } catch (MongoCommandException e) {
            e.printStackTrace();
        }

        Document d2 = new Document();
        // String.getBytes(UTF_8) validates UTF-8 and substitutes
        // a valid replacement character, so the UTF-8 is valid.
        d2.put("name", new String((fname + " " + lname.charAt(0)).getBytes(StandardCharsets.UTF_8),
                StandardCharsets.UTF_8));
        coll2.insertOne(d2);

        // Succeeds
        coll2.createIndex(new Document("name", "text"));
    }
}
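The truncation mistake in the reproduction can be avoided on the application side by cutting on code point boundaries instead of char indexes. A minimal sketch, assuming a hypothetical `SafeTruncate.truncate` helper (not part of the driver) and using `\uD83D\uDE00` as a stand-in for any emoji:

```java
// Hypothetical helper for illustration; not part of the MongoDB Java driver.
class SafeTruncate {

    // Keeps at most maxCodePoints code points and never splits a surrogate pair.
    static String truncate(String s, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must be >= 0");
        }
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // offsetByCodePoints converts a code point count into a char index.
        return s.substring(0, s.offsetByCodePoints(0, maxCodePoints));
    }

    public static void main(String[] args) {
        String lname = "\uD83D\uDE00\uD83D\uDE00"; // two emojis, four Java chars

        // charAt(0) yields only the high surrogate of the first emoji.
        System.out.println(Character.isHighSurrogate(lname.charAt(0))); // true

        // Truncating by code point keeps the first emoji intact (two chars).
        System.out.println(truncate(lname, 1).length()); // 2
    }
}
```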
Additional Background
The same test run via the mongo shell succeeds and stores only valid UTF-8 in the DB string fields.
db.baz.insertOne({name: ('Firstname ' + '😀😀😀😀'.substring(0, 1))})
db.baz.createIndex({name: 'text'})
- is duplicated by
-
JAVA-5575 Java Driver allows inserting invalid UTF-8 as string values
- Closed
- is related to
-
SERVER-62871 [4.4] Improve handling of text index creation in the presence of invalid UTF-8
- Closed
-
SERVER-62348 Text index creation fails with error "text contains invalid UTF-8"
- Closed