  1. Core Server
  2. SERVER-31223

fix race in StepDownTest::OnlyOneStepDownCmdIsAllowedAtATime

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Fully Compatible
    • ALL
    • Repl 2017-10-02, Repl 2017-10-23

      There is a race in this test between the thread started in stepDown_nonBlocking() and the call to ReplicationCoordinator::stepDown():

      https://github.com/mongodb/mongo/blob/7626535bbcc2f90b7815cbf1a8e6d2c0bef732f1/src/mongo/db/repl/replication_coordinator_impl_test.cpp#L2046

      replication_coordinator_impl_test.cpp
      TEST_F(StepDownTest, OnlyOneStepDownCmdIsAllowedAtATime) {
          OpTime optime1(Timestamp(100, 1), 1);
          OpTime optime2(Timestamp(100, 2), 1);
      
          // No secondary is caught up
          auto repl = getReplCoord();
          repl->setMyLastAppliedOpTime(optime2);
          repl->setMyLastDurableOpTime(optime2);
          ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 1, optime1));
          ASSERT_OK(repl->setLastAppliedOptime_forTest(1, 2, optime1));
      
          simulateSuccessfulV1Election();
      
          ASSERT_TRUE(getReplCoord()->getMemberState().primary());
      
          // Step down where the secondary actually has to catch up before the stepDown can succeed.
          // On entering the network, _stepDownContinue should cancel the heartbeats scheduled for
          // T + 2 seconds and send out a new round of heartbeats immediately.
          // This makes it unnecessary to advance the clock after entering the network to process
          // the heartbeat requests.
          auto result = stepDown_nonBlocking(false, Seconds(10), Seconds(60));
      
          // Now while the first stepdown request is waiting for secondaries to catch up, attempt another
          // stepdown request and ensure it fails.
          const auto opCtx = makeOperationContext();
          auto status = getReplCoord()->stepDown(opCtx.get(), false, Seconds(10), Seconds(60));
          ASSERT_EQUALS(ErrorCodes::ConflictingOperationInProgress, status);
      
          // Now ensure that the original stepdown command can still succeed.
          catchUpSecondaries(optime2);
      
          ASSERT_OK(*result.second.get());
          ASSERT_TRUE(repl->getMemberState().secondary());
      }
      

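      If the main test thread calls stepDown() before the background thread has moved the TopologyCoordinator into the attemptingToStepDown state, the call does not fail with ConflictingOperationInProgress as the test expects. Instead, this test will block.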
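One common way to close this kind of race in a test is to make the main thread wait until the background request has actually registered itself before issuing the second request. The sketch below illustrates the idea with a toy coordinator. Every name in it (ToyCoordinator, waitUntilAttempting, catchUp, runOnlyOneStepDownScenario) is hypothetical and is not part of the MongoDB code base; it is a minimal model of the ordering problem under those assumptions, not a proposed patch.

```cpp
#include <cassert>
#include <condition_variable>
#include <future>
#include <mutex>

// Hypothetical stand-in for a coordinator that allows one stepdown at a time.
// None of these names are MongoDB APIs.
enum class Result { kOk, kConflict };

class ToyCoordinator {
public:
    // Fails immediately if another stepdown is already in progress;
    // otherwise blocks until catchUp() is called.
    Result stepDown() {
        std::unique_lock<std::mutex> lk(_mutex);
        if (_attempting) {
            return Result::kConflict;
        }
        _attempting = true;
        _cv.notify_all();  // wake any thread blocked in waitUntilAttempting()
        _cv.wait(lk, [this] { return _caughtUp; });
        _attempting = false;
        return Result::kOk;
    }

    // Test-only hook: returns once some thread has entered stepDown().
    void waitUntilAttempting() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return _attempting; });
    }

    // Simulates the secondaries catching up, which releases the waiting stepdown.
    void catchUp() {
        std::lock_guard<std::mutex> lk(_mutex);
        _caughtUp = true;
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _attempting = false;
    bool _caughtUp = false;
};

// Returns true if the second request was rejected and the first one succeeded.
bool runOnlyOneStepDownScenario() {
    ToyCoordinator coord;

    // First request runs on a background thread, like stepDown_nonBlocking().
    std::future<Result> first =
        std::async(std::launch::async, [&coord] { return coord.stepDown(); });

    // Without this wait, the call below could win the race, become the
    // waiting stepdown itself, and block forever.
    coord.waitUntilAttempting();

    Result second = coord.stepDown();
    coord.catchUp();
    return second == Result::kConflict && first.get() == Result::kOk;
}
```

With the wait in place, the second call always observes the in-progress flag, so the outcome no longer depends on thread scheduling. Removing the call to waitUntilAttempting() reintroduces the hang described above whenever the main thread gets there first.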

            Assignee: Benety Goh (benety.goh@mongodb.com)
            Reporter: Benety Goh (benety.goh@mongodb.com)
            Votes: 0
            Watchers: 1
