Core Server / SERVER-56756

Primary cannot stepDown when experiencing disk failures

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.0.25
    • Affects Version/s: 4.0.23
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: Linux
    • Steps To Reproduce:
      • Start a 3-node replica-set.
      • Start a single node replica-set for config server.
      • Start a mongos server.
      • Start a workload with multiple client threads (e.g., 50) running a mixture of find/update operations against the primary (through mongos).
      • Freeze the dbpath for the primary (e.g., fsfreeze --freeze /mnt/primary).
      • Ask the primary to step down. (A scripted sketch of these steps is given below.)
    • Sprint: Repl 2021-07-12, Repl 2021-07-26, Repl 2021-08-09, Repl 2021-08-23, Replication 2021-11-15, Replication 2021-11-29, Replication 2021-12-13, Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07
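
      Below is a minimal scripted sketch of the reproduction steps above. It assumes mongod/mongos binaries on the PATH, a dedicated mount (/mnt/primary) backing the primary's dbpath, and illustrative ports, hostnames, and paths; the workload in step 4 is left as a placeholder.

      # Sketch of the reproduction; ports, hostnames, and paths are illustrative.
      # 1. Start a 3-node shard replica set (the first node's dbpath lives on /mnt/primary).
      mongod --shardsvr --replSet shard0 --port 27018 --dbpath /mnt/primary --fork --logpath /mnt/primary/mongod.log
      mongod --shardsvr --replSet shard0 --port 27019 --dbpath /data/sec1 --fork --logpath /data/sec1/mongod.log
      mongod --shardsvr --replSet shard0 --port 27020 --dbpath /data/sec2 --fork --logpath /data/sec2/mongod.log
      mongo --port 27018 --eval 'rs.initiate({_id: "shard0", members: [
          {_id: 0, host: "localhost:27018"},
          {_id: 1, host: "localhost:27019"},
          {_id: 2, host: "localhost:27020"}]})'

      # 2. Start a single-node config server replica set.
      mongod --configsvr --replSet cfg --port 27021 --dbpath /data/cfg --fork --logpath /data/cfg/mongod.log
      mongo --port 27021 --eval 'rs.initiate({_id: "cfg", configsvr: true, members: [{_id: 0, host: "localhost:27021"}]})'

      # 3. Start mongos and add the shard.
      mongos --configdb cfg/localhost:27021 --port 27017 --fork --logpath /data/mongos.log
      mongo --port 27017 --eval 'sh.addShard("shard0/localhost:27018")'

      # 4. Run the mixed find/update workload (e.g., 50 client threads) through mongos,
      #    then freeze the filesystem backing the primary's dbpath.
      fsfreeze --freeze /mnt/primary

      # 5. Ask the primary to step down; this is where the timeout shown below is returned.
      mongo --port 27018 --eval 'rs.stepDown()'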

      Sending a stepdown request to a primary that is experiencing disk failures can result in consistent timeout errors:

      {
              "operationTime" : Timestamp(1620337238, 857),
              "ok" : 0,
              "errmsg" : "Could not acquire the global shared lock before the deadline for stepdown",
              "code" : 262,
              "codeName" : "ExceededTimeLimit",
              "$gleStats" : {
                      "lastOpTime" : Timestamp(0, 0),
                      "electionId" : ObjectId("7fffffff0000000000000001")
              },
              "lastCommittedOpTime" : Timestamp(1620337238, 327),
              "$configServerState" : {
                      "opTime" : {
                              "ts" : Timestamp(1620337306, 1),
                              "t" : NumberLong(1)
                      }
              },
              "$clusterTime" : {
                      "clusterTime" : Timestamp(1620337306, 1),
                      "signature" : {
                              "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                              "keyId" : NumberLong(0)
                      }
              }
      }
      

      The error is returned from here, and the behavior is easy to reproduce; I've tested it on v4.0.23.
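
      For reference, a hedged example of issuing the stepdown as an explicit admin command against the frozen primary; the port is illustrative, and the 60-second stepdown period and 10-second secondary catch-up window mirror the rs.stepDown() defaults:

      mongo --port 27018 --eval 'db.adminCommand({
          replSetStepDown: 60,
          secondaryCatchUpPeriodSecs: 10,
          force: false
      })'
      # With the primary's dbpath frozen, this returns code 262 (ExceededTimeLimit)
      # instead of the node stepping down.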

      Also, I tried attaching GDB to the primary to collect stack traces, but GDB hangs, and I haven't found an alternative yet.

            Assignee: Adi Zaimi (adi.zaimi@mongodb.com)
            Reporter: Amirsaman Memaripour (amirsaman.memaripour@mongodb.com)
            Votes: 0
            Watchers: 12
