- Type: Bug
- Resolution: Works as Designed
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Service Arch
- ALL
- Workload Scheduling 2024-07-22
- 0
BF-33912's lone BFG at the time of writing appears to have been caused by a replication error on a TSAN variant (which is quite slow), which left a host down for longer than usual (see the comments here for details).
As a result, the client threads received a mix of "Connection reset by peer", "Connection refused", and "HostUnreachable" errors, but only HostUnreachable is treated as a retriable error that does not consume the retry limit.
In suites where we kill or terminate shard processes, network errors should be expected more frequently, and they should be transient.
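
The retry behavior described above can be illustrated with a rough sketch. This is not the test harness's actual code: `CommandError`, `run_with_retries`, `TRANSIENT_NETWORK_ERRORS`, and all parameters are hypothetical names made up for illustration. The sketch assumes that errors outside the transient set still get retried but count against the retry limit, and it adds an assumed deadline so that a host that never comes back cannot cause an endless loop.

```python
import time


class CommandError(Exception):
    """Hypothetical error type carrying the error name seen by a client thread."""

    def __init__(self, name):
        super().__init__(name)
        self.name = name


# Assumption: the ticket only confirms HostUnreachable is handled this way today.
# The other two names are included to show what treating them as transient
# might look like in kill/terminate suites.
TRANSIENT_NETWORK_ERRORS = {
    "HostUnreachable",
    "Connection reset by peer",
    "Connection refused",
}


def run_with_retries(op, retry_limit=3, transient_deadline_secs=60.0,
                     sleep_secs=0.0, clock=time.monotonic):
    """Run op(), retrying on CommandError.

    Transient network errors are retried without consuming the retry limit
    until the deadline passes. Every other error (and any transient error
    after the deadline) consumes one unit of the retry limit.
    """
    attempts_used = 0
    deadline = clock() + transient_deadline_secs
    while True:
        try:
            return op()
        except CommandError as err:
            if err.name in TRANSIENT_NETWORK_ERRORS and clock() < deadline:
                time.sleep(sleep_secs)  # optional pause before the free retry
                continue
            attempts_used += 1
            if attempts_used >= retry_limit:
                raise
```

With this shape, a shard that is down for several consecutive attempts does not fail the client thread as long as it returns before the deadline, while a persistent non-network error still surfaces after `retry_limit` attempts.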