- Type: Bug
- Resolution: Fixed
- Priority: Critical - P2
- Affects Version/s: 4.2.10
- Component/s: None
- Fully Compatible
- ALL
- v4.4, v4.2, v4.0, v3.6
- Sharding 2020-12-14
- (copied to CRM)
ISSUE DESCRIPTION AND IMPACT
The bug causes a failure of the thread that creates new Hash-based Message Authentication Code (HMAC) signing keys every 90 days.
New keys are generated when the Config Server Replica Set (CSRS) fails over. So, if a failover does not happen on the CSRS for 90 days, operations across the sharded cluster will start to fail and will not succeed again until the CSRS fails over.
DIAGNOSIS AND AFFECTED VERSIONS
MongoDB 4.2.2 to 4.2.11 and 4.4.0 to 4.4.2 are affected. The bug may exist in previous versions but mechanisms other than failover cause the CSRS primary to re-generate the HMAC keys successfully in those versions.
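The affected ranges above can be turned into a quick check. The following is a minimal sketch, not part of the original ticket: the helper names `parseVersion`, `compareVersions`, and `isAffectedVersion` are hypothetical, and the ranges are copied from the sentence above. It only covers the versions confirmed as affected, so a `false` result for an older release does not prove the bug is absent there.

```javascript
// Confirmed affected ranges from this ticket (inclusive bounds).
const AFFECTED_RANGES = [
  { from: [4, 2, 2], to: [4, 2, 11] },
  { from: [4, 4, 0], to: [4, 4, 2] },
];

// Parse "major.minor.patch" into numbers; a suffix such as "-rc0" is dropped.
function parseVersion(versionString) {
  const parts = versionString.split("-")[0].split(".").map(Number);
  if (parts.length !== 3 || parts.some(Number.isNaN)) {
    throw new Error("unexpected version string: " + versionString);
  }
  return parts;
}

// Return -1, 0, or 1 depending on how a orders against b.
function compareVersions(a, b) {
  for (let i = 0; i < 3; i++) {
    if (a[i] !== b[i]) {
      return a[i] < b[i] ? -1 : 1;
    }
  }
  return 0;
}

// True if the version falls inside one of the confirmed affected ranges.
function isAffectedVersion(versionString) {
  const v = parseVersion(versionString);
  return AFFECTED_RANGES.some(
    r => compareVersions(v, r.from) >= 0 && compareVersions(v, r.to) <= 0
  );
}
```

In a mongo shell, the input could come from `db.version()`, for example `isAffectedVersion(db.version())`.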
To check the expiration date of the HMAC signing keys, use a mongo shell to connect to a mongos node or to the CSRS primary, authenticate as a user with admin privileges, and run the following command. The cluster will experience this issue when all the HMAC signing keys have expired.
db.getSiblingDB("admin").system.keys.find().map(k => { return { _id: k._id, purpose: k.purpose, expiresAt: new Date(k.expiresAt.getTime()*1000) }})
To perform this check the database user must have permissions to query the admin.system.keys collection. To grant these permissions, create a new role with the find action on the admin.system.keys collection and grant this role to an admin user with the following commands, replacing ADMIN with the username:
use admin
db.createRole({
  role: "query_keys",
  privileges: [
    { resource: { db: "admin", collection: "system.keys" }, actions: [ "find" ] }
  ],
  roles: []
});
db.grantRolesToUser("ADMIN", ["query_keys"]);
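Since the issue only appears once every signing key has expired, the value that matters is the latest `expiresAt` among the keys. The sketch below summarizes that. It is an illustration, not part of the original ticket: the function name `summarizeKeyExpiry` and the field `expiresAtSeconds` are hypothetical, and it assumes each key's expiry has already been read as seconds since the epoch (the value that `k.expiresAt.getTime()` yields in the expiration check above).

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// keys: array of { expiresAtSeconds: <seconds since epoch> } (hypothetical shape)
// now:  a Date to compare against
function summarizeKeyExpiry(keys, now) {
  if (keys.length === 0) {
    // No keys were returned, so there is nothing to summarize.
    return { latestExpiry: null, allExpired: false, daysRemaining: null };
  }
  // The cluster keeps working until the key that expires last has expired.
  const latestMs = Math.max(...keys.map(k => k.expiresAtSeconds * 1000));
  return {
    latestExpiry: new Date(latestMs),
    allExpired: latestMs <= now.getTime(),
    daysRemaining: (latestMs - now.getTime()) / MS_PER_DAY,
  };
}
```

In a mongo shell, the input array could be built with `db.getSiblingDB("admin").system.keys.find().map(k => ({ expiresAtSeconds: k.expiresAt.getTime() }))` and passed in together with `new Date()`.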
REMEDIATION AND WORKAROUNDS
The fix is included in the 3.6.22, 4.0.22, 4.2.12 and 4.4.3 production releases and later. To prevent the issue before upgrading to a fixed release, step down the CSRS primary to initiate a failover before the 90-day limit is reached.
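The workaround can be expressed as a simple guard that decides whether a step-down is due. This is a minimal sketch under stated assumptions: `needsStepDown` and the `marginDays` safety margin are hypothetical and not part of the original ticket, and `latestExpiry` is assumed to be the latest HMAC key expiration date obtained from the diagnosis check.

```javascript
// latestExpiry: Date at which the last HMAC signing key expires
// now:          current Date
// marginDays:   hypothetical safety margin; step down once this few days remain
function needsStepDown(latestExpiry, now, marginDays) {
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysRemaining = (latestExpiry.getTime() - now.getTime()) / msPerDay;
  return daysRemaining <= marginDays;
}

// In a mongo shell connected to the CSRS primary, the guard could be used as:
//   if (needsStepDown(latestExpiry, new Date(), 14)) { rs.stepDown(); }
```

The margin is a judgment call. It only needs to leave enough time to run the step-down during a convenient maintenance window before the keys expire.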
Original Description
I see that the overflow issue SERVER-48709 is fixed, but the problem still occurs after we upgraded the config server to version 4.2.10: the monitoring-keys-for-HMAC thread does not generate new signing keys. After 90 or 180 days, when the signing keys have expired, mongos cannot connect to the mongod server nodes. We have to restart the config server so that new signing keys are generated when the monitoring-keys-for-HMAC thread starts, after which mongos can connect to the mongod server nodes again.
I think the root cause of SERVER-47553 and SERVER-48709 may be the same, but it has not been found yet. Because this issue may cause unexpected downtime for our service, it is a very serious problem, and we hope it can be fixed ASAP. Thanks!
- is duplicated by
  - SERVER-53337 Mongos hangs and stop responding (Closed)
  - SERVER-53540 DBException handling request, closing client connection: ClientDisconnect: operation was interrupted (Closed)
  - SERVER-57738 sharding cluster, clients cannot connect to mongos successfully (Closed)
- related to
  - SERVER-48709 signing key generator thread on config server not waken up as expected (Closed)