When upgrading our oldest running production mongo cluster from FCV (feature compatibility version) 4.4 to 5.0, the operation fails.
This cluster has been running continuously since the early days, something like version 2.4 or 2.6. Our other, newer clusters did not have this issue.
```
mongos> db.adminCommand( { setFeatureCompatibilityVersion: "5.0", writeConcern: { w: "majority", wtimeout: 900000 } } )
{
    "ok" : 0,
    "errmsg" : "Failed command { _flushRoutingTableCacheUpdatesWithWriteConcern: \"REDACTED_DB1.REDACTED_COLLECTION1\", syncFromConfig: true, writeConcern: { w: \"majority\", wtimeout: 60000 } } for database 'admin' on shard 'rs_prod1_shard24' :: caused by :: Failed to read persisted collections entry for collection 'REDACTED_DB1.REDACTED_COLLECTION1'. :: caused by :: Failed to read the 'REDACTED_DB1.REDACTED_COLLECTION1' entry locally from config.collections :: caused by :: BSON field 'ShardCollectionType.uuid' is missing but a required field",
    "code" : 40414,
    "codeName" : "Location40414",
    "$clusterTime" : {
        "clusterTime" : Timestamp(1696887685, 3143),
        "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
        }
    },
    "operationTime" : Timestamp(1696887680, 5271)
}
```
We have a 30-shard cluster; each shard is a replica set.
When I inspect the config.collections metadata from one of our mongos nodes, I see that all collections do have a UUID:
```
mongos> config.collections.count({uuid: {$exists: 0}})
0
mongos> config.collections.count({uuid: {$exists: 1}})
79508
```
However, when I go to the primary of some of the shards (output for shards 29 and 30 below), a large number of collections have no UUID on the shard itself, according to the `config.cache.collections` metadata.
```
rs_prod1_shard29:PRIMARY> config.cache.collections.count({uuid: {$exists: 0}})
18337
rs_prod1_shard29:PRIMARY> config.cache.collections.count({uuid: {$exists: 1}})
56378

rs_prod1_shard30:PRIMARY> config.cache.collections.count({uuid: {$exists: 0}})
29756
rs_prod1_shard30:PRIMARY> config.cache.collections.count({uuid: {$exists: 1}})
36651
```
Counting documents for the collections without a UUID on that shard shows that every one of them has 0 documents there. Most of these collections aren't actually empty cluster-wide; they just hold no data on this specific shard.
```javascript
// Run on the rs_prod1_shard30 primary: for every collection cached without a UUID,
// count its documents locally and print any that are non-empty (none were).
var collectionsWithoutUUID = config.cache.collections.find({uuid: {$exists: 0}});
while (collectionsWithoutUUID.hasNext()) {
    var collectionCacheInfo = collectionsWithoutUUID.next();
    var namespace = collectionCacheInfo._id;
    // Split on the first "." only, since collection names can themselves contain dots.
    var databaseName = namespace.substring(0, namespace.indexOf("."));
    var collectionName = namespace.substring(namespace.indexOf(".") + 1);
    var count = db.getSiblingDB(databaseName)[collectionName].count();
    if (count !== 0) {
        print(namespace + " - " + count);
    }
}
```
Even before the FCV upgrade, we had noticed a number of collections that existed on shards where none of their chunks lived.
So you'd have a collection whose chunks (one or many!) all live on a single shard, while other shards still have the collection locally even though they own no chunks for it.
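To make this concrete, here is a rough sketch (run through mongos against the config database, and assuming FCV 4.4-style metadata where `config.chunks` documents still carry an `ns` field) of how one could list the namespaces for which a given shard owns no chunks; any such namespace that nevertheless exists locally on that shard is one of these stray local copies. The shard name is just an example.

```javascript
// Sketch: list namespaces for which the given shard owns zero chunks.
// Any of these that still exist locally on that shard are stray copies.
var shardName = "rs_prod1_shard30";                      // example shard to inspect
var configDB = db.getSiblingDB("config");
configDB.collections.find({dropped: {$ne: true}}).forEach(function(coll) {
    var ns = coll._id;
    // At FCV 4.4 the chunk documents are keyed by namespace ("ns") and carry a "shard" field.
    var chunksOnShard = configDB.chunks.count({ns: ns, shard: shardName});
    if (chunksOnShard === 0) {
        print(ns);                                       // candidate: cross-check locally on the shard
    }
});
```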
These empty/phantom collections had odd properties: they often lacked the index definitions that the shard actually holding the chunks had for the collection, and they often had a mismatched UUID as well.
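As a rough sketch of how one can spot the UUID mismatches (run on a shard primary, legacy shell assumed): compare the UUID the shard has cached in `config.cache.collections` against the UUID of its local copy of the collection, where one exists.

```javascript
// Sketch: on a shard primary, flag namespaces whose cached metadata UUID
// does not match the UUID of the shard's local copy of the collection.
var cacheColl = db.getSiblingDB("config").getCollection("cache.collections");
cacheColl.find({uuid: {$exists: true}}).forEach(function(cached) {
    var ns = cached._id;
    var dbName = ns.substring(0, ns.indexOf("."));       // split on the first "." only
    var collName = ns.substring(ns.indexOf(".") + 1);
    var infos = db.getSiblingDB(dbName).getCollectionInfos({name: collName});
    if (infos.length === 0) {
        return;                                          // no local copy on this shard
    }
    var localUUID = infos[0].info.uuid;
    if (localUUID === undefined || localUUID.toString() !== cached.uuid.toString()) {
        print(ns + " cached uuid=" + cached.uuid + " local uuid=" + localUUID);
    }
});
```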
In some cases that UUID mismatch prevented chunks from being distributed, which in turn slowed down or blocked the balancer: it would (correctly) identify those chunks as needing to move, be unable to move them, and keep retrying on every balancer run.
For a while we've been living with this by fixing the bad collections as the balancer surfaced them: we would find the 0-sized copy of the collection on a shard that owned no chunks for it and drop it locally (only from the shards that owned no documents and no chunks), which would unblock the balancer for that collection.
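For reference, that per-collection remediation boils down to something like the following minimal sketch, run on the primary of a shard that owns no chunks for the namespace (the namespace here is a placeholder, and the drop is deliberately left commented out):

```javascript
// Sketch of the local cleanup for a single phantom collection, run on the primary
// of a shard that owns no chunks for it. The namespace below is a placeholder.
var ns = "someDB.someCollection";
var dbName = ns.substring(0, ns.indexOf("."));
var collName = ns.substring(ns.indexOf(".") + 1);
var localColl = db.getSiblingDB(dbName).getCollection(collName);
if (localColl.count() === 0) {
    // localColl.drop();   // only after double-checking via config.chunks that this shard owns no chunks
    print("would drop local phantom copy of " + ns);
} else {
    print(ns + " still has documents on this shard; not touching it");
}
```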
We were hoping to do a full sweep of all collections on all shards to find and fix these cases, but since the action to take is "drop the collection locally", we recognized this as a risky move and hadn't yet devoted the time and care needed to fully remediate.
In addition to logging this bug, I'd like to ask for advice: what is the best way forward?
We are left in a situation where the target FCV is 5.0 but the actual version is 4.4 on all shards, which prevents us from running some operations cleanly (we had trouble reconfiguring a replica set yesterday, and we're not sure what else will break in this state medium-term).
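For completeness, this is how we check the stuck state on each shard primary; the FCV document reports the old version plus the target it never reached (output shape as we observe it on 5.0 binaries):

```javascript
// Check the feature compatibility version directly on a shard primary.
db.adminCommand({getParameter: 1, featureCompatibilityVersion: 1})
// e.g. { "featureCompatibilityVersion" : { "version" : "4.4", "targetVersion" : "5.0" }, "ok" : 1 }
```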
One idea would be to manually set the UUID fields on all shards using the source of truth (the config server itself), since the UUID does appear to be set there.
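To illustrate that first idea (purely a sketch, untested, and writing by hand into `config.cache.collections` is presumably unsupported, so we would not run this without guidance): export the authoritative ns-to-uuid mapping via mongos, then backfill the missing uuid fields in each shard's cache from it.

```javascript
// Step 1 (via mongos): dump the authoritative ns -> uuid mapping from config.collections.
// In practice this would have to be exported to a file and carried to each shard.
var uuidByNs = {};
db.getSiblingDB("config").collections.find({uuid: {$exists: true}}).forEach(function(c) {
    uuidByNs[c._id] = c.uuid;
});

// Step 2 (on each shard primary): backfill missing uuid fields in the shard's cached
// metadata from that mapping. Untested sketch only -- direct writes to this collection
// are presumably unsupported.
var cacheColl = db.getSiblingDB("config").getCollection("cache.collections");
cacheColl.find({uuid: {$exists: false}}).forEach(function(cached) {
    var authoritativeUUID = uuidByNs[cached._id];
    if (authoritativeUUID !== undefined) {
        cacheColl.updateOne({_id: cached._id}, {$set: {uuid: authoritativeUUID}});
    }
});
```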
Another would be to systematically ensure that these collections are indeed "phantom" or "orphaned" on those shards and drop them all.