- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Service Arch
- Minor Change
- ALL
- v8.0, v7.0, v6.0
- Networking & Obs 2024-06-10, Networking & Obs 2024-06-24
- 200
The following can happen:
1. The mongos sends a command to a config server or shard s0.
2. As part of processing the command, s0 runs a subcommand against remote shard s1.
3. s1 steps down.
4. The command returns InterruptedDueToReplStateChange, which propagates upstream along the path s1 -> s0 -> mongos.
5. The mongos receives InterruptedDueToReplStateChange from s0 and assumes that s0 is the node that failed over.
6. The mongos RSM marks s0 as ReplicaSetWithNoPrimary.
7. Since InterruptedDueToReplStateChange is a retriable error, the mongos will resend the command. To do so, it tries to send a hello command to get an updated view of the topology, but sees that there is already an outstanding request.
8. The command cannot be retried until the outstanding hello on s0 returns, which can take up to 10s (the timeout of a streamable hello command).
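The sequence above can be sketched as a small, deterministic model. This is not the server's code: `MonitoredSet`, `ErrorReply`, `on_error`, and the `origin` field are hypothetical names introduced for illustration, and the only value taken from this ticket is the 10s streamable hello timeout. The model assumes the mongos attributes the error to whichever node it talked to directly and that no new hello is sent while one is outstanding.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# From the ticket: a streamable hello can stay outstanding for up to 10s.
HELLO_TIMEOUT_S = 10.0


@dataclass
class MonitoredSet:
    """Hypothetical, simplified view of what the mongos tracks per replica set."""
    name: str
    has_primary: bool = True
    # Time the in-flight streamable hello was sent, or None if none is in flight.
    hello_sent_at: Optional[float] = None


@dataclass
class ErrorReply:
    code_name: str
    responder: str  # node the mongos talked to directly
    origin: str     # node that actually raised the error (ignored by on_error)


def on_error(sets: Dict[str, MonitoredSet], reply: ErrorReply, now: float) -> float:
    """Apply the error to the topology view and return the worst-case retry delay."""
    # Assumption: the error is attributed to the responder, not the origin.
    target = sets[reply.responder]
    if reply.code_name == "InterruptedDueToReplStateChange":
        target.has_primary = False  # corresponds to ReplicaSetWithNoPrimary
    if target.has_primary or target.hello_sent_at is None:
        return 0.0  # nothing blocks the retry in this model
    # A hello is already outstanding, so the retry waits for it; in the
    # worst case that means waiting until the hello times out.
    deadline = target.hello_sent_at + HELLO_TIMEOUT_S
    return max(0.0, deadline - now)


# s1 stepped down, but the error reaches the mongos through s0.
sets = {
    "s0": MonitoredSet("s0", hello_sent_at=0.0),
    "s1": MonitoredSet("s1", hello_sent_at=0.0),
}
reply = ErrorReply("InterruptedDueToReplStateChange", responder="s0", origin="s1")
delay = on_error(sets, reply, now=0.5)
print(sets["s0"].has_primary, sets["s1"].has_primary, delay)
```

In this model s0 loses its primary in the mongos's view even though s1 is the node that stepped down, and the retry is held back for the remainder of the hello timeout. A fix along these lines would need some way to tell the responder apart from the origin of the error, which is exactly the information the model discards.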