-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Replication, Sharding
-
Fully Compatible
-
Sharding 2020-06-15
The ReplicaSetAwareService allows interested classes to receive callbacks on certain events related to the replica set state of the node. These callbacks originate in ReplicationCoordinatorImpl, which call into ReplicationCoordatorExternalStateImpl, which then calls into the relevant function on ReplicaSetAwareServiceRegistry (which then calls the registered ReplicaSetAwareService (RSAS) classes).
SERVER-48406 added onBecomeArbiter() to ReplicaSetAwareService, and SERVER-47914 finished plumbing through the disabling of the LogicalClock/VectorClock on arbiters. In doing so, failures were caused in the arbiter-related db_repl_coordinator_test unittests:
[main] "Running","attr":{"test":"TermAndLastCommittedOpTimeUpdatedFromHeartbeatWhenArbiter","rep":1,"reps":1} ... [main] "Responding with {doc} (elapsed: {millis})","attr":{"doc":{"$replData":{"term":3,"lastOpCommitted":{"ts":{"$timestamp":{"t":10,"i":1}},"t":3},"lastCommittedWall":{"$date":"1970-01-01T00:01:40.000Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":10,"i":1}},"t":3},"configVersion":2,"configTerm":-1,"replicaSetId":{"$oid":"000000000000000000000000"},"syncSourceIndex":1,"isPrimary":true},"ok":1,"state":2,"v":2,"configTerm":-1,"set":"mySet"},"millisMillis":0} [replexec] "Invariant failure","attr":{"expr":"!VectorClock::get(getServiceContext())->isEnabled() || _externalState->getGlobalTimestamp(getServiceContext()) >= opTime.getTimestamp()","file":"src/mongo/db/repl/replication_coordinator_impl.cpp","line":1378} [replexec] "\n\n***aborting after invariant() failure\n\n" [replexec] "{message}","attr":{"message":"Got signal: 6 (Aborted).\n"} [replexec] "BACKTRACE","attr":{"bt":{"backtrace":[{"a":"7FE20830E7BA","b":"7FE205382000","o":"2F8C7BA","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.592","s+":"1EA"},{"a":"7FE20830FC69","b":"7FE205382000","o":"2F8DC69","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7FE20830C3E6","b":"7FE205382000","o":"2F8A3E6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"},{"a":"7FE203C1F7E0","b":"7FE203C10000","o":"F7E0","s":"_L_unlock_16","s+":"2D"},{"a":"7FE2038AE4F5","b":"7FE20387C000","o":"324F5","s":"gsignal","s+":"35"},{"a":"7FE2038AFCD5","b":"7FE20387C000","o":"33CD5","s":"abort","s+":"175"},{"a":"7FE20629EDC9","b":"7FE205382000","o":"F1CDC9","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"},{"a":"7FE20691275D","b":"7FE205382000","o":"159075D","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl34_setMyLastAppliedOpTimeAndWallTimeENS_8WithLockERKNS0_17OpTimeAndWallTimeEbNS0_22ReplicationCoordinator15DataConsistencyE","s+":"ED"},{"a":"7FE2069133F3","b":"7FE205382000","o":"15913F3","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19_advanceCommitPointENS_8WithLockERKNS0_17OpTimeAndWallTimeEb","s+":"A3"},{"a":"7FE20693F8C0","b":"7FE205382000","o":"15BD8C0","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl24_handleHeartbeatResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEi","s+":"1CE0"},{"a":"7FE207E4E46C","b":"7FE205382000","o":"2ACC46C","s":"_ZZN5mongo8executor12TaskExecutor21scheduleRemoteCommandERKNS0_24RemoteCommandRequestImplINS_11HostAndPortEEERKSt8functionIFvRKNS1_25RemoteCommandCallbackArgsEEERKSt10shared_ptrINS_5BatonEEENKUlRKNS1_30RemoteCommandOnAnyCallbackArgsEE_clESM_","s+":"4C"},{"a":"7FE207CF3376","b":"7FE205382000","o":"2971376","s":"_ZN5mongo8executor12_GLOBAL__N_121remoteCommandFinishedERKNS0_12TaskExecutor12CallbackArgsERKSt8functionIFvRKNS2_30RemoteCommandOnAnyCallbackArgsEEERKNS0_24RemoteCommandRequestImplISt6vectorINS_11HostAndPortESaISG_EEEERKNS0_26RemoteCommandOnAnyResponseE","s+":"56"},{"a":"7FE207CF7583","b":"7FE205382000","o":"2975583","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE","s+":"113"},{"a":"7FE207CF7962","b":"7FE205382000","o":"2975962","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS6_13CallbackStateEESaISB_EEERKSt14_List_iteratorISB_ESI_St11unique_lockINS_12latch_detail5LatchEEEUlT_E1_EEDaOSN_EN12SpecificImpl4callEOS1_","s+":"A2"},{"a":"7FE2065ED2AB","b":"7FE205382000","o":"126B2AB","s":"_ZN5mongo8executor14ThreadPoolMock15_consumeOneTaskERSt11unique_lockINS_12latch_detail5LatchEE","s+":"12B"},{"a":"7FE2065ED526","b":"7FE205382000","o":"126B526","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_8executor14ThreadPoolMock7startupEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"146"},{"a":"7FE2084D003F","b":"7FE205382000","o":"314E03F","s":"execute_native_thread_routine","s+":"F"},{"a":"7FE203C17AA1","b":"7FE203C10000","o":"7AA1","s":"start_thread","s+":"D1"},{"a":"7FE203964C4D","b":"7FE20387C000","o":"E8C4D","s":"clone","s+":"6D"}],"processInfo":{"mongodbVersion":"unknown","gitVersion":"none","compiledModules":["unknown"],"uname":{"sysname":"Linux","release":"2.6.32-220.el6.x86_64","version":"#1 SMP Wed Nov 9 08:03:13 EST 2011","machine":"x86_64"},"somap":[{"b":"7FE205382000","elfType":3,"buildId":"9F0FECD3257C33AAF626952C3078E347E0AEEDEA"}]}}} [replexec] "Frame","attr":{"frame":{"a":"7FE20830E7BA","b":"7FE205382000","o":"2F8C7BA","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.592","s+":"1EA"}} [replexec] "Frame","attr":{"frame":{"a":"7FE20830FC69","b":"7FE205382000","o":"2F8DC69","s":"_ZN5mongo15printStackTraceEv","s+":"29"}} [replexec] "Frame","attr":{"frame":{"a":"7FE20830C3E6","b":"7FE205382000","o":"2F8A3E6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}} [replexec] "Frame","attr":{"frame":{"a":"7FE203C1F7E0","b":"7FE203C10000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}} [replexec] "Frame","attr":{"frame":{"a":"7FE2038AE4F5","b":"7FE20387C000","o":"324F5","s":"gsignal","s+":"35"}} [replexec] "Frame","attr":{"frame":{"a":"7FE2038AFCD5","b":"7FE20387C000","o":"33CD5","s":"abort","s+":"175"}} [replexec] "Frame","attr":{"frame":{"a":"7FE20629EDC9","b":"7FE205382000","o":"F1CDC9","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"}} [replexec] "Frame","attr":{"frame":{"a":"7FE20691275D","b":"7FE205382000","o":"159075D","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl34_setMyLastAppliedOpTimeAndWallTimeENS_8WithLockERKNS0_17OpTimeAndWallTimeEbNS0_22ReplicationCoordinator15DataConsistencyE","s+":"ED"}} [replexec] "Frame","attr":{"frame":{"a":"7FE2069133F3","b":"7FE205382000","o":"15913F3","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19_advanceCommitPointENS_8WithLockERKNS0_17OpTimeAndWallTimeEb","s+":"A3"}} [replexec] "Frame","attr":{"frame":{"a":"7FE20693F8C0","b":"7FE205382000","o":"15BD8C0","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl24_handleHeartbeatResponseERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEi","s+":"1CE0"}} [replexec] "Frame","attr":{"frame":{"a":"7FE207E4E46C","b":"7FE205382000","o":"2ACC46C","s":"_ZZN5mongo8executor12TaskExecutor21scheduleRemoteCommandERKNS0_24RemoteCommandRequestImplINS_11HostAndPortEEERKSt8functionIFvRKNS1_25RemoteCommandCallbackArgsEEERKSt10shared_ptrINS_5BatonEEENKUlRKNS1_30RemoteCommandOnAnyCallbackArgsEE_clESM_","s+":"4C"}} [replexec] "Frame","attr":{"frame":{"a":"7FE207CF3376","b":"7FE205382000","o":"2971376","s":"_ZN5mongo8executor12_GLOBAL__N_121remoteCommandFinishedERKNS0_12TaskExecutor12CallbackArgsERKSt8functionIFvRKNS2_30RemoteCommandOnAnyCallbackArgsEEERKNS0_24RemoteCommandRequestImplISt6vectorINS_11HostAndPortESaISG_EEEERKNS0_26RemoteCommandOnAnyResponseE","s+":"56"}} [replexec] "Frame","attr":{"frame":{"a":"7FE207CF7583","b":"7FE205382000","o":"2975583","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE","s+":"113"}} [replexec] "Frame","attr":{"frame":{"a":"7FE207CF7962","b":"7FE205382000","o":"2975962","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS6_13CallbackStateEESaISB_EEERKSt14_List_iteratorISB_ESI_St11unique_lockINS_12latch_detail5LatchEEEUlT_E1_EEDaOSN_EN12SpecificImpl4callEOS1_","s+":"A2"}} [replexec] "Frame","attr":{"frame":{"a":"7FE2065ED2AB","b":"7FE205382000","o":"126B2AB","s":"_ZN5mongo8executor14ThreadPoolMock15_consumeOneTaskERSt11unique_lockINS_12latch_detail5LatchEE","s+":"12B"}} [replexec] "Frame","attr":{"frame":{"a":"7FE2065ED526","b":"7FE205382000","o":"126B526","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_8executor14ThreadPoolMock7startupEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"146"}} [replexec] "Frame","attr":{"frame":{"a":"7FE2084D003F","b":"7FE205382000","o":"314E03F","s":"execute_native_thread_routine","s+":"F"}} [replexec] "Frame","attr":{"frame":{"a":"7FE203C17AA1","b":"7FE203C10000","o":"7AA1","s":"start_thread","s+":"D1"}} [replexec] "Frame","attr":{"frame":{"a":"7FE203964C4D","b":"7FE20387C000","o":"E8C4D","s":"clone","s+":"6D"}}
This was determined to be because those unittests use ReplicationCoordinatorExternalStateMock, which lacks the necessary calls to ReplicaSetAwareServiceRegistry. This meant that the LogicalClock/VectorClock was not being disabled as it should have been, and the test then later failed on this dassert.
As a temporary measure, SERVER-47914 added the call to ReplicaSetAwareServiceRegistry::onBecomeArbiter() inside ReplicationCoordinatorExternalStateMock::onBecomeArbiterHook(), which fixed this problem. However, there is a more general problem — all ReplicaSetAwareService callbacks should still occur even when the ReplicationCoordinator's external state has been mocked out.
It turns out that, for the externalState functions that call the RSAS callbacks, each is called from only one place in ReplicationCoordinatorImpl. So there are two potential ways to properly fix this:
- Make the necessary hook functions in ReplicationCoordinatorExternalStateMock call into ReplicaSetAwareServiceRegistry, the same as ReplicationCoordatorExternalStateImpl does.
- Remove the calls to ReplicaSetAwareServiceRegistry from ReplicationCoordatorExternalStateImpl/Mock, and put them directly in ReplicationCoordinatorImpl.
My preference is #2, because there doesn't seem to be any point in having this abstraction go via externalState, if correctness requires it to be the same in every externalState implementation. william.schultz, do you agree? Or is there some reason to keep it in externalState, or other possibility I've missed?
- related to
-
SERVER-47914 Move clusterTime from LogicalClock to VectorClock
- Closed
-
SERVER-48406 Add onBecomeArbiter to ReplicaSetAwareService
- Closed