Repeated primary restarts in a secondary-down PSA-style deployment lead to "KeyNotFound: No keys found for HMAC" errors, and all write operations fail continuously.

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.0.6
    • Component/s: None
    • Environment:
          We use MongoDB version 8.0.6 in a Kubernetes environment as a PSSSA deployment. The Mongo Java sync driver version used is 5.3.1.
    • ALL
    • Steps to Reproduce:
      • Start a 5‑member replica set split 3/2 across DC‑A and DC‑B.
      • Kill DC‑B (2 S + 1 A) - cluster becomes P S S D A (2 voting members).
      • Allow Kubernetes to schedule the arbiter in DC‑A - cluster becomes P S S D A (3 votes).
      • Restart the primary > 1 time in quick succession (we used 3 × kubectl rollout restart).
      • Observe all client operations fail with Cache Reader: No keys found for HMAC...


      Problem :

              Reads and writes to the MongoDB primary fail when the primary is restarted a few times in PSDDA and PDA scenarios.

      Issue Description : 

               In MongoDB, our application uses a PSSSA (1 primary, 3 secondaries, 1 arbiter) setup to support high-availability scenarios across two datacenters (DCs). During our HA testing, when one of the datacenters goes down (the DC where 2 secondaries and 1 arbiter are scheduled), the deployment becomes PSDDA: the arbiter is rescheduled into the healthy DC by Kubernetes, and a primary is elected in the PSDDA topology. When this primary is then restarted a few times, all database queries start to fail, and the following error is observed in the MongoDB logs:

       

      NOTE: This issue mainly occurs when the primary is restarted multiple times. In our testing, when the primary was restarted only once, the issue did not occur and the MongoDB server recovered automatically.

      Mongo log :

      "ctx":"monitoring-keys-for-HMAC","msg":"setting timestamp read source","attr":{"readSource":"kMajorityCommitted","provided":"none" 

       

      "ctx":"monitoring-keys-for-HMAC","msg":"setting timestamp read source","attr":{"readSource":"kNoTimestamp","provided":"none"

       

       "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible.

      {"t":{"$date":"2025-04-30T08:40:51.673+00:00"},"s":"D1", "c":"COMMAND",  "id":21966,   "ctx":"conn5","msg":"Assertion while executing command","attr":{"command":"find","db":"testDB","error":"KeyNotFound: No keys found for HMAC that is valid for time: { ts: Timestamp(1746002404, 68) } with id: 7498709687009804289"}}

       

      Our cluster-wide read concern is set to local, yet the logs above indicate that a majority read concern is expected. The MongoDB primary remains in this state until a secondary comes back up (PSSDA or PSSSA). Once PSSDA is reached and a majority read becomes possible, the server self-heals and the issue no longer occurs. The documentation states that enableMajorityReadConcern is set to true by default, so MongoDB's internal HMAC key-cache refresh operations appear to require this higher consistency guarantee to function properly.
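      The arithmetic behind why majority reads stall in this topology can be sketched as follows. This is our own illustrative Java sketch, not MongoDB source code; the class and method names are invented. The key assumption is that the majority commit point can only advance when a majority of voting members that actually bear data have the write, and an arbiter votes but holds no data.

```java
// Hypothetical sketch of the majority arithmetic; names are ours, not MongoDB's.
public class MajoritySketch {
    // Majority of voting members: floor(votes / 2) + 1.
    static int majority(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    // Majority reads need enough *data-bearing* voting members up;
    // an arbiter contributes a vote but cannot acknowledge data.
    static boolean majorityReadable(int votingMembers, int dataBearingUp) {
        return dataBearingUp >= majority(votingMembers);
    }

    public static void main(String[] args) {
        // PSDDA: 5 votes, but only P + S bear data (the arbiter does not).
        System.out.println(majorityReadable(5, 2)); // false: majority reads stall
        // PSSDA: a third data-bearing member returns.
        System.out.println(majorityReadable(5, 3)); // true: the cluster can self-heal
    }
}
```

      This matches the behaviour we observed: the server stays broken in PSDDA and recovers only once a secondary returns (PSSDA).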

       

      We also suspected that the Java driver was not being properly updated with the HMAC keys. Analysing a TCP dump of the queries that were failing in mongo, we observed that the $clusterTime element with HMAC keys is sent to the server, yet the "KeyNotFound" errors still appeared in the mongo logs. We also suspect the server is not updating the driver with the latest $clusterTime, which could likewise be a cause of the issue. In our testing, restarting the application pod (which in turn creates a new MongoClient) fixed the issue. Similarly, when we created a new MongoClient with the driver in the same affected application pod, the error did not occur. We suspected the Mongo Java sync driver version in use (5.3.1) could be at fault; it is the latest driver version currently available, and downgrading to an older version did not resolve the issue either. We could also observe that not all queries contain the $clusterTime field: with the new MongoClient present, the write queries lacked the $clusterTime field.

      The image below is taken from a heap dump of a query that fails with the "No keys found for HMAC" error; this $clusterTime was not present in other queries that do not trigger the issue.

       

      We observed the following two scenarios in which the HMAC keys are read. The first checks the feasibility of a majority read concern (if the client requests it); if that is not possible, there is a fallback to a local read concern.

      https://github.com/mongodb/mongo/blob/f09021a85309e565f6efd2341685d97df187b0f6/src/mongo/db/keys_collection_client_direct.cpp#L117
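      The control flow of this first path can be paraphrased roughly as follows. This is a Java sketch of the fallback shape only; the names and return values are ours, not taken from keys_collection_client_direct.cpp.

```java
import java.util.Optional;

// Rough paraphrase of the direct key-read path: try majority, fall back to local.
public class DirectKeyReadSketch {
    // Majority read succeeds only while a majority snapshot is available.
    static Optional<String> readMajority(boolean majorityAvailable) {
        return majorityAvailable ? Optional.of("keys@majority") : Optional.empty();
    }

    // Local reads remain possible on the surviving PSDDA members.
    static Optional<String> readLocal() {
        return Optional.of("keys@local");
    }

    // Prefer majority; if the snapshot is unavailable, fall back to local.
    static Optional<String> readKeys(boolean majorityAvailable) {
        Optional<String> result = readMajority(majorityAvailable);
        return result.isPresent() ? result : readLocal();
    }

    public static void main(String[] args) {
        // Even with no majority available, this path still returns keys.
        System.out.println(readKeys(false).get()); // keys@local
    }
}
```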

       

      We could observe the following query in the mongo logs, as shown below, where a local read concern is used.

      Log :

      {"t":{"$date":"2025-04-28T12:12:02.195+00:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"monitoring-keys-for-HMAC","msg":"About to run the command","attr":{"db":"admin","client":"","commandArgs":{"find":"system.keys","filter":{"purpose":"HMAC","expiresAt":{"$gt":{"$timestamp":{"t":1745842314,"i":1}}}},"sort":{"expiresAt":1},"readConcern":{},"$readPreference":{"mode":"nearest"},"$db":"admin"}}}

      {"t":{"$date":"2025-04-28T12:12:02.195+00:00"},"s":"D3", "c":"STORAGE",  "id":22414,   "ctx":"monitoring-keys-for-HMAC","msg":"WT begin_transaction","attr":{"snapshotId":58753,"readSource":"kNoTimestamp"}}

       

      In the second scenario, where the HMAC monitoring thread checks whether keys need to be refreshed or updated on expiry, a majority read concern is used and no fallback mechanism is present.

      https://github.com/mongodb/mongo/blob/master/src/mongo/db/keys_collection_manager.cpp#L184

      This could be a likely suspect. Although a retry limit is set, we could observe the error below continuously in the mongo logs until we recovered the cluster by bringing a secondary back up (PSSDA).

      Log :

      {"t":{"$date":"2025-04-30T08:40:51.673+00:00"},"s":"D1", "c":"COMMAND",  "id":21966,   "ctx":"conn5","msg":"Assertion while executing command","attr":{"command":"find","db":"appviewx","error":"KeyNotFound: No keys found for HMAC that is valid for time: { ts: Timestamp(1746002404, 68) } with id: 7498709687009804289"}}
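      The monitoring thread's refresh, by contrast, can be paraphrased as a majority-only retry loop with no local fallback. Again, this is our own hedged Java sketch of the control flow we infer from keys_collection_manager.cpp and the logs above; names and retry counts are invented for illustration.

```java
import java.util.Optional;

// Rough paraphrase of the HMAC key-cache refresh: majority-only, bounded retries.
public class KeyMonitorSketch {
    // A single refresh attempt fails with ReadConcernMajorityNotAvailableYet
    // while the majority snapshot cannot advance.
    static Optional<String> refreshOnce(boolean majorityAvailable) {
        return majorityAvailable ? Optional.of("fresh-keys") : Optional.empty();
    }

    static Optional<String> refreshWithRetries(boolean majorityAvailable, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Optional<String> keys = refreshOnce(majorityAvailable);
            if (keys.isPresent()) return keys;
            // No fallback to a local read here: just retry and eventually give up,
            // leaving the key cache stale so signed $clusterTime cannot be validated.
        }
        return Optional.empty(); // KeyNotFound: No keys found for HMAC follows
    }

    public static void main(String[] args) {
        // With no majority available, every retry fails and the cache stays empty.
        System.out.println(refreshWithRetries(false, 3).isEmpty()); // true
    }
}
```

      If this reading is correct, it would explain why the cluster stays broken until a secondary returns and majority reads become possible again.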

       

        1. image-2025-05-08-15-33-18-973.png (20 kB)
        2. mongo-after-1-restart (2).log (1.25 MB)
        3. mongo-after-multi-restart (2).log (12.53 MB)
        4. rs.conf().pdf (38 kB)

            Assignee: Rajesh Patil
            Reporter: Sakthivel Chendilvelu
            Votes: 1
            Watchers: 3
