Currently, after the main data transfer for a migration, we block until all of the migration's writes have been replicated to a majority of nodes before entering the critical section. We wait for up to 10 hours, but if the writes still haven't been replicated by then, we continue anyway and enter the critical section. In that case, the migration is very likely to abort shortly afterward: the critical section writes happen, and we wait only 30 seconds for those writes to be replicated. Entering the critical section can block all read and write operations for up to 30 seconds, so we should avoid entering it at all when an abort is so likely.
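
A minimal sketch of the proposed behavior, not the actual server code: if the pre-critical-section wait times out, abort the migration instead of entering the critical section. The names `PreCriticalOutcome` and `waitForMajorityBeforeCriticalSection`, the `isMajorityReplicated` callback, and the polling approach are all hypothetical and only illustrate the control flow.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical result type: what the migration should do once the wait ends.
enum class PreCriticalOutcome { EnterCriticalSection, AbortMigration };

// Hypothetical helper. Polls `isMajorityReplicated` until it returns true or
// `timeout` elapses. On timeout it reports AbortMigration rather than letting
// the caller proceed into the critical section.
PreCriticalOutcome waitForMajorityBeforeCriticalSection(
    const std::function<bool()>& isMajorityReplicated,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds pollInterval) {
    assert(pollInterval.count() > 0);  // a zero interval would busy-spin

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (isMajorityReplicated()) {
            // Writes reached a majority: safe to enter the critical section.
            return PreCriticalOutcome::EnterCriticalSection;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            // Timed out: abort now instead of blocking reads and writes
            // for a migration that would most likely abort anyway.
            return PreCriticalOutcome::AbortMigration;
        }
        std::this_thread::sleep_for(pollInterval);
    }
}
```

With the current behavior, the timeout branch would instead fall through to the critical section; the only change sketched here is turning that branch into an abort.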
- duplicates SERVER-13456 MigrateStatus::_go is unkillable when waiting for replication for critical section (Closed)