Type: Task
Resolution: Gone away
Priority: Major - P3
None
Affects Version/s: None
Component/s: None
Sharding NYC
4
The quick fix I made for SERVER-66224 fixed the tests, but in the general case it's wrong. Max said:
"Brett and I have been debugging an issue and learned that secondaries in the config server replica set attempt to extend the lease of the distributed lock. The writes the secondaries do go through ShardLocal, so they end up failing with NotWritablePrimary - https://github.com/mongodb/mongo/blob/3805148358ae9b82e5f3b9307bd25fbf7a4dd4b5/src/mongo/db/s/dist_lock_catalog_replset.cpp#L206-L215
I haven't been following the ShardLocal / ShardRemote / ShardConfig work, but I would like to make certain we forbid secondaries from contacting the config server primary and extending the lease. Only the primary of the replica set should ever be doing the distributed lock pinging, so fixing that may be the ultimate preferred solution.
dist_lock_catalog_replset.cpp
Status DistLockCatalogImpl::ping(OperationContext* opCtx, StringData processID, Date_t ping) {
    auto request = write_ops::FindAndModifyCommandRequest(_lockPingNS);
    request.setQuery(BSON(LockpingsType::process() << processID));
    request.setUpdate(write_ops::UpdateModification::parseFromClassicUpdate(
        BSON("$set" << BSON(LockpingsType::ping(ping)))));
I'm saying we should ideally prevent secondaries from pinging the distributed lock. Secondaries aren't authoritative.
At minimum, the writes the secondaries attempt today must still happen locally (and thus fail with NotWritablePrimary) if they are going to happen at all.
Yes, normally https://github.com/mongodb/mongo/blob/5dff90ff1e8a672a8716f0c9c936f8f50e56fd0b/src/mongo/db/repl/oplog.cpp#L367 would abort the local storage transaction on the secondary, because config.lockpings is a replicated collection.
oplog.cpp
uasserted(ErrorCodes::NotWritablePrimary, ss);
The specific case I'm worried about is a secondary node in the catalog shard that wants to ping the distributed lock, so it contacts the current primary of the catalog shard. Instead, it should be the exclusive responsibility of the primary of the shards to do that pinging.
Today on the CSRS, a secondary node wants to ping the distributed lock, so it tries to write to config.lockpings locally and gets a NotWritablePrimary error."
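
To make the two cases above concrete, here is a minimal, self-contained C++ sketch of the behavior being discussed. It is only a model under stated assumptions: Role, Target, WriteStatus, ReplicaSet, Node, and the three member functions are hypothetical names invented for illustration, not MongoDB types or APIs, and the lockpings map is a simplified stand-in for the config.lockpings collection. It contrasts an unguarded ping (which fails locally on a secondary but succeeds if forwarded to the primary) with a ping that is skipped entirely unless the node is primary.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-ins for illustration only; none of these are MongoDB types.
enum class WriteStatus { OK, NotWritablePrimary };
enum class Role { Primary, Secondary };
// Loosely models a local write path versus one that targets the remote primary.
enum class Target { Local, RemotePrimary };

struct ReplicaSet {
    // Simplified model of config.lockpings: process ID -> last ping time.
    std::map<std::string, long long> lockpings;
};

struct Node {
    Role role;
    ReplicaSet* rs;

    // A local write to a replicated collection is rejected unless this node is primary.
    WriteStatus writeLocal(const std::string& processID, long long ping) {
        if (role != Role::Primary) {
            return WriteStatus::NotWritablePrimary;
        }
        rs->lockpings[processID] = ping;
        return WriteStatus::OK;
    }

    // Ping with no role check. With a remote target, the write is applied
    // on the primary's behalf, so it lands even when this node is a secondary.
    WriteStatus pingUnguarded(Target target, const std::string& processID, long long ping) {
        if (target == Target::RemotePrimary) {
            rs->lockpings[processID] = ping;
            return WriteStatus::OK;
        }
        return writeLocal(processID, ping);
    }

    // Ping guarded by role: a secondary never attempts the write at all.
    // Returns std::nullopt when the ping is skipped.
    std::optional<WriteStatus> pingIfPrimary(const std::string& processID, long long ping) {
        if (role != Role::Primary) {
            return std::nullopt;
        }
        return writeLocal(processID, ping);
    }
};
```

In this model, a secondary calling pingUnguarded with Target::Local gets NotWritablePrimary and leaves lockpings untouched, which mirrors the CSRS behavior described above. The same call with Target::RemotePrimary updates lockpings, which is the catalog shard case the quote warns about. pingIfPrimary avoids both outcomes by making pinging the primary's exclusive responsibility.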