Compass / COMPASS-6094

Investigate changes in SERVER-68783: Recipient shard may incorrectly return 0 milliseconds remaining in resharding

    • Type: Investigation
    • Resolution: Done
    • Priority: Major - P3
    • No version
    • Affects Version/s: None
    • Component/s: None
    • None
    • Not Needed

      Original Downstream Change Summary

      Previously, a resharding operation reported an estimated time remaining of 0 in two cases: (1) the operation was very close to finishing, or (2) an estimate could not be computed. Now, a value of 0 is reported only in the former case. The visible effects of this change are as follows:

      1. In the $currentOp output for a resharding operation, the remainingOperationTimeEstimatedSecs field may not be present if an estimate could not be computed. Previously, this field would always be present, and have a value of 0 if the estimate could not be computed.

      2. In the serverStatus output, the shardingStatistics.resharding.coordinatorAllShardsLowestRemainingOperationTimeEstimatedMillis, shardingStatistics.resharding.coordinatorAllShardsHighestRemainingOperationTimeEstimatedMillis, and shardingStatistics.resharding.recipientRemainingOperationTimeEstimatedMillis fields will report a value of -1 if an estimate could not be computed. Previously, these fields would report a value of 0 in this case.
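
      A client that reads these fields has to tell "almost finished" apart from "no estimate available". The following is a minimal sketch, assuming the serverStatus and $currentOp outputs have already been fetched as plain dictionaries. The helper names estimate_from_server_status and estimate_from_current_op are hypothetical and are not part of any MongoDB driver or of Compass.

```python
from typing import Optional

# Field names are taken from the change summary above.
SERVER_STATUS_FIELDS = (
    "coordinatorAllShardsLowestRemainingOperationTimeEstimatedMillis",
    "coordinatorAllShardsHighestRemainingOperationTimeEstimatedMillis",
    "recipientRemainingOperationTimeEstimatedMillis",
)


def estimate_from_server_status(server_status: dict, field: str) -> Optional[int]:
    """Return the estimate in milliseconds, or None if no estimate is available.

    Hypothetical helper: after the change, -1 means "could not be computed"
    and 0 means "very close to finishing".
    """
    resharding = server_status.get("shardingStatistics", {}).get("resharding", {})
    value = resharding.get(field)
    if value is None or value < 0:
        return None
    return value


def estimate_from_current_op(op: dict) -> Optional[int]:
    """Return the estimate in seconds, or None when the field is absent.

    Hypothetical helper: after the change, an absent
    remainingOperationTimeEstimatedSecs field means "no estimate".
    """
    return op.get("remainingOperationTimeEstimatedSecs")
```

      With this shape, a caller can render None as "unknown" instead of showing a misleading 0.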

      Description of Linked Ticket

      In response to a _shardsvrReshardingOperationTime command from the resharding coordinator (used for querying the estimated remaining time in a resharding operation), a recipient shard executes code that calls ReshardingMetrics::getRecipientHighEstimateRemainingTimeMillis to compute the estimate of the remaining time. That function may incorrectly return 0 if the shard has just had a failover and has not yet restored all of its metrics. This can happen because the metrics are restored in two separate steps, at two different places in the code.

      As a result, if a _shardsvrReshardingOperationTime command enters the system at the wrong time, it may observe only partly restored metrics, and the coordinator would be misled into believing that it can begin the critical section.

      This is related to SERVER-67653 but is not the same issue. In that ticket, the coordinator incorrectly treats an omitted remainingMillis field as a remainingMillis of 0. In this ticket, the recipient itself incorrectly returns a remainingMillis of 0.
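
      The difference between the two tickets can be illustrated with a small sketch of a coordinator-side decision. This is not the server's implementation: the function can_begin_critical_section, the threshold parameter, and the use of None to stand for an omitted remainingMillis field are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def can_begin_critical_section(
    recipient_estimates_millis: Iterable[Optional[int]],
    threshold_millis: int,
) -> bool:
    """Hypothetical check: proceed only if every recipient reported an
    estimate and every estimate is at or below the threshold."""
    estimates = list(recipient_estimates_millis)
    if not estimates:
        # No recipient responded, so there is nothing to base a decision on.
        return False
    for remaining in estimates:
        # An omitted estimate (None) must not be treated as 0;
        # doing so is the coordinator-side mistake described in SERVER-67653.
        if remaining is None:
            return False
        if remaining > threshold_millis:
            return False
    return True
```

      Note that a check like this cannot protect against the problem in this ticket: if a recipient wrongly reports 0 from partly restored metrics, the value is indistinguishable from a genuine 0, so the recipient must avoid returning it in the first place.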

            Assignee: Unassigned
            Reporter: backlog-server-pm (Backlog - Core Eng Program Management Team)
            Votes: 0
            Watchers: 2
