Repeated primary restarts in a secondary-down PSA-style deployment lead to "KeyNotFound: No keys found for HMAC" errors, and all write operations fail continuously.

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.0.6
    • Component/s: None
    • Environment:
          We use MongoDB version 8.0.6 in a Kubernetes environment as a PSSSA deployment. The Mongo Java sync driver version used is 5.3.1.
    • ALL
    • Steps to Reproduce:
      • Start a 5‑member replica set split 3/2 across DC‑A and DC‑B.
      • Kill DC‑B (2 S + 1 A) - cluster becomes P S S D A (2 voting members).
      • Allow Kubernetes to schedule the arbiter in DC‑A - cluster becomes P S S D A (3 votes).
      • Restart the primary > 1 time in quick succession (we used 3 × kubectl rollout restart).
      • Observe all client operations fail with Cache Reader: No keys found for HMAC...


      Problem :

              Reads and writes to the MongoDB primary fail when the primary is restarted a few times in PSDDA and PDA scenarios.

      Issue Description : 

               In MongoDB, our application uses a PSSSA (1 primary, 3 secondaries, 1 arbiter) setup to support high-availability scenarios across two datacenters (DCs). During our HA testing, when one of the datacenters goes down (the DC where 2 secondaries and 1 arbiter are scheduled), the deployment becomes PSDDA: the arbiter is rescheduled into the healthy DC by Kubernetes, and a primary is elected in the PSDDA topology. When this primary is then restarted a few times, all database queries start to fail, and the following error is observed in the MongoDB logs:

       

      NOTE: This issue mainly occurs when the primary is restarted multiple times. In our testing, when the primary was restarted only once, the issue did not occur and the MongoDB server recovered automatically.

      Mongo log :

      "ctx":"monitoring-keys-for-HMAC","msg":"setting timestamp read source","attr":{"readSource":"kMajorityCommitted","provided":"none" 

       

      "ctx":"monitoring-keys-for-HMAC","msg":"setting timestamp read source","attr":{"readSource":"kNoTimestamp","provided":"none"

       

       "ctx":"monitoring-keys-for-HMAC","msg":"Failed to refresh key cache","attr":{"error":"ReadConcernMajorityNotAvailableYet: Read concern majority reads are currently not possible.

      {"t":{"$date":"2025-04-30T08:40:51.673+00:00"},"s":"D1", "c":"COMMAND",  "id":21966,   "ctx":"conn5","msg":"Assertion while executing command","attr":{"command":"find","db":"testDB","error":"KeyNotFound: No keys found for HMAC that is valid for time: { ts: Timestamp(1746002404, 68) } with id: 7498709687009804289"}}

       

      Our cluster-wide read concern is set to local, yet the logs above indicate that a majority read concern is expected. The MongoDB primary remains in this state until a secondary comes back up (PSSDA or PSSSA). Once PSSDA is reached and a majority read becomes possible, the server self-heals and the issue no longer occurs. The documentation states that enableMajorityReadConcern is set to true by default, so MongoDB's internal HMAC key-cache refresh operations appear to require this higher consistency guarantee to function properly.
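      The arithmetic behind why majority reads stall in this topology can be sketched as follows. This is our own illustrative Java sketch, not MongoDB source code; the class and method names are invented. The key assumption is that the majority commit point can only advance when a majority of voting members that actually bear data have the write, and an arbiter votes but holds no data.

```java
// Hypothetical sketch of the majority arithmetic; names are ours, not MongoDB's.
public class MajoritySketch {
    // Majority of voting members: floor(votes / 2) + 1.
    static int majority(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    // Majority reads need enough *data-bearing* voting members up;
    // an arbiter contributes a vote but cannot acknowledge data.
    static boolean majorityReadable(int votingMembers, int dataBearingUp) {
        return dataBearingUp >= majority(votingMembers);
    }

    public static void main(String[] args) {
        // PSDDA: 5 votes, but only P + S bear data (the arbiter does not).
        System.out.println(majorityReadable(5, 2)); // false: majority reads stall
        // PSSDA: a third data-bearing member returns.
        System.out.println(majorityReadable(5, 3)); // true: the cluster can self-heal
    }
}
```

      This matches the behaviour we observed: the server stays broken in PSDDA and recovers only once a secondary returns (PSSDA).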

       

      We also suspected that the Java driver was not being properly updated with the HMAC keys. Analysing a TCP dump of the queries that were failing in mongo, we observed that the $clusterTime element with HMAC keys is sent to the server, yet the "KeyNotFound" errors still appeared in the mongo logs. We also suspect the server is not updating the driver with the latest $clusterTime, which could likewise be a cause of the issue. In our testing, restarting the application pod (which in turn creates a new MongoClient) fixed the issue. Similarly, when we created a new MongoClient with the driver in the same affected application pod, the error did not occur. We suspected the Mongo Java sync driver version in use (5.3.1) could be at fault; it is the latest driver version currently available, and downgrading to an older version did not resolve the issue either. We could also observe that not all queries contain the $clusterTime field: with the new MongoClient present, the write queries lacked the $clusterTime field.

      The image below is taken from a heap dump of a query that fails with the "No keys found for HMAC" error; this $clusterTime was not present in other queries that do not trigger the issue.

       

      We observed the following two scenarios in which the HMAC keys are read. The first checks the feasibility of a majority read concern (if the client requests it); if that is not possible, there is a fallback to a local read concern.

      https://github.com/mongodb/mongo/blob/f09021a85309e565f6efd2341685d97df187b0f6/src/mongo/db/keys_collection_client_direct.cpp#L117
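      The control flow of this first path can be paraphrased roughly as follows. This is a Java sketch of the fallback shape only; the names and return values are ours, not taken from keys_collection_client_direct.cpp.

```java
import java.util.Optional;

// Rough paraphrase of the direct key-read path: try majority, fall back to local.
public class DirectKeyReadSketch {
    // Majority read succeeds only while a majority snapshot is available.
    static Optional<String> readMajority(boolean majorityAvailable) {
        return majorityAvailable ? Optional.of("keys@majority") : Optional.empty();
    }

    // Local reads remain possible on the surviving PSDDA members.
    static Optional<String> readLocal() {
        return Optional.of("keys@local");
    }

    // Prefer majority; if the snapshot is unavailable, fall back to local.
    static Optional<String> readKeys(boolean majorityAvailable) {
        Optional<String> result = readMajority(majorityAvailable);
        return result.isPresent() ? result : readLocal();
    }

    public static void main(String[] args) {
        // Even with no majority available, this path still returns keys.
        System.out.println(readKeys(false).get()); // keys@local
    }
}
```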

       

      We could observe the following query in the mongo logs, as shown below, where a local read concern is used.

      Log :

      {"t":{"$date":"2025-04-28T12:12:02.195+00:00"},"s":"D2", "c":"COMMAND",  "id":21965,   "ctx":"monitoring-keys-for-HMAC","msg":"About to run the command","attr":{"db":"admin","client":"","commandArgs":{"find":"system.keys","filter":{"purpose":"HMAC","expiresAt":{"$gt":{"$timestamp":{"t":1745842314,"i":1}}}},"sort":{"expiresAt":1},"readConcern":{},"$readPreference":{"mode":"nearest"},"$db":"admin"}}}

      {"t":{"$date":"2025-04-28T12:12:02.195+00:00"},"s":"D3", "c":"STORAGE",  "id":22414,   "ctx":"monitoring-keys-for-HMAC","msg":"WT begin_transaction","attr":{"snapshotId":58753,"readSource":"kNoTimestamp"}}

       

      In the second scenario, where the HMAC monitoring thread checks whether keys need to be refreshed or updated on expiry, a majority read concern is used and no fallback mechanism is present.

      https://github.com/mongodb/mongo/blob/master/src/mongo/db/keys_collection_manager.cpp#L184

      This could be a likely suspect. Although a retry limit is set, we could observe the error below continuously in the mongo logs until we recovered the cluster by bringing a secondary back up (PSSDA).

      Log :

      {"t":{"$date":"2025-04-30T08:40:51.673+00:00"},"s":"D1", "c":"COMMAND",  "id":21966,   "ctx":"conn5","msg":"Assertion while executing command","attr":{"command":"find","db":"appviewx","error":"KeyNotFound: No keys found for HMAC that is valid for time: { ts: Timestamp(1746002404, 68) } with id: 7498709687009804289"}}
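      The monitoring thread's refresh, by contrast, can be paraphrased as a majority-only retry loop with no local fallback. Again, this is our own hedged Java sketch of the control flow we infer from keys_collection_manager.cpp and the logs above; names and retry counts are invented for illustration.

```java
import java.util.Optional;

// Rough paraphrase of the HMAC key-cache refresh: majority-only, bounded retries.
public class KeyMonitorSketch {
    // A single refresh attempt fails with ReadConcernMajorityNotAvailableYet
    // while the majority snapshot cannot advance.
    static Optional<String> refreshOnce(boolean majorityAvailable) {
        return majorityAvailable ? Optional.of("fresh-keys") : Optional.empty();
    }

    static Optional<String> refreshWithRetries(boolean majorityAvailable, int maxRetries) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Optional<String> keys = refreshOnce(majorityAvailable);
            if (keys.isPresent()) return keys;
            // No fallback to a local read here: just retry and eventually give up,
            // leaving the key cache stale so signed $clusterTime cannot be validated.
        }
        return Optional.empty(); // KeyNotFound: No keys found for HMAC follows
    }

    public static void main(String[] args) {
        // With no majority available, every retry fails and the cache stays empty.
        System.out.println(refreshWithRetries(false, 3).isEmpty()); // true
    }
}
```

      If this reading is correct, it would explain why the cluster stays broken until a secondary returns and majority reads become possible again.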

       

        1. image-2025-05-08-15-33-18-973.png (20 kB)
        2. mongo-after-1-restart (2).log (1.25 MB)
        3. mongo-after-multi-restart (2).log (12.53 MB)
        4. rs.conf().pdf (38 kB)

            Assignee: Rajesh Patil
            Reporter: Sakthivel Chendilvelu
            Votes: 1
            Watchers: 3
