Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 6.1.0-rc0, 7.0.0, 8.0.0
Component/s: None
Labels:
None

Assigned Teams:

Query Execution
Operating System:
ALL
Steps To Reproduce:

A jsTest is attached. In summary, to reproduce the behavior:

(1) Spin up a sharded cluster with 'disableResumableRangeDeleter: true' and 'ttlMonitorSleepSecs: 1'. The bug requires orphans or unowned documents due to chunk migration, so we disable the range deleter.

(2) Create a sharded collection, with all chunks on shard0.

(3) Insert at least 'TTLIndexDeleteTargetDocs' documents into the collection with 'ttlField' set to either the current time or some time in the past. These will eventually live in their own chunk on shard1.

(4) Insert one document with 'ttlField' set to the current time, and make sure it can be placed in a separate chunk.

(5) Split the collection into 2 chunks. Move the chunk with 'TTLIndexDeleteTargetDocs' to shard1.

(6) Now, shard0 has 1 owned doc, and 'TTLIndexDeleteTargetDocs' orphan docs.

(7) Create a TTL index on 'ttlField' with expireAfterSeconds: 1.

(8) The single document that is expired, but owned on shard0, never gets deleted by the TTLMonitor.

Sprint:
Execution Team 2024-08-19
Case:
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name:
None
Goal Link:
None

If more than TTLIndexDeleteTargetDocs expired orphan (unowned) documents exist for a given TTL index, more recently expired documents cannot be removed by the TTLMonitor through that index.

Details
The TTLMonitor uses batched deletes by default. The batched delete stage first 'stages' documents in a buffer until _batchTargetMet().
The batch target is met if either the 'targetBatchDocs' are stored in the buffer, or more than 'targetStagedDocBytes' are stored in the buffer. However, documents in the buffer can be orphans.
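The batch-target check described above can be sketched as follows. This is a simplified, hypothetical model and not the server code: the function name `batchTargetMet`, the `{owned, bytes}` document shape, and the exact comparison operators are assumptions made for illustration. The point it shows is that ownership is never consulted when deciding whether the batch is full.

```javascript
// Hypothetical model of the check described for _batchTargetMet().
// Assumption: each buffered doc is {owned: boolean, bytes: number}.
function batchTargetMet(buffer, params) {
  const { targetBatchDocs, targetStagedDocBytes } = params;
  const stagedBytes = buffer.reduce((sum, doc) => sum + doc.bytes, 0);
  // Note: doc.owned is not examined, so orphans count toward the target.
  return buffer.length >= targetBatchDocs || stagedBytes > targetStagedDocBytes;
}

// A buffer made up entirely of orphans still satisfies the batch target.
const orphanOnlyBuffer = [
  { owned: false, bytes: 100 },
  { owned: false, bytes: 100 },
];
const orphanBatchFull = batchTargetMet(orphanOnlyBuffer, {
  targetBatchDocs: 2,
  targetStagedDocBytes: 1024,
});
```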

Once the batch target is met, we try to commit the batch. Since the TTLMonitor doesn't remove orphans, orphan documents are 'skipped' and not issued a delete. If all staged deletes were 'successful' (or skipped), the buffer is cleared.

If the buffer is empty and _passStagingComplete is true, isEOF() is true and the BatchedDeleteStage returns EOF. _passStagingComplete is true if _passTargetMet() is true. _passTargetMet() is true if the total number of documents staged across batches (which can include orphans) exceeds '_batchedDeleteParams->targetPassDocs'. The TTLMonitor sets 'targetPassDocs' to TTLIndexDeleteTargetDocs.

If there are more than TTLIndexDeleteTargetDocs documents that are (1) orphans and (2) expired, the TTLMonitor will repeatedly try to issue the same batch delete with no delete progress. The TTLMonitor can't recover until the orphan documents are cleaned up.
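The stall can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not the BatchedDeleteStage implementation: `runTtlPass`, the `{id, owned}` document shape, and the tiny target values are hypothetical, and the model assumes each pass rescans expired documents from the start of the index (oldest first) and treats the pass target as met once the staged count reaches `targetPassDocs`. The example uses more orphans than the pass target, so a strict "exceeds" comparison would behave the same way.

```javascript
// Hypothetical model of one TTL pass over expired docs in index order.
function runTtlPass(docs, params) {
  const { targetBatchDocs, targetPassDocs } = params;
  const buffer = [];
  const deletedIds = [];
  let passStaged = 0; // counts every staged doc, orphans included

  const commitBatch = () => {
    for (const doc of buffer) {
      // Orphans are skipped, not deleted, but the batch still "succeeds".
      if (doc.owned) deletedIds.push(doc.id);
    }
    buffer.length = 0;
  };

  for (const doc of docs) {
    if (passStaged >= targetPassDocs) break; // pass target met: stop staging
    buffer.push(doc);
    passStaged += 1;
    if (buffer.length >= targetBatchDocs) commitBatch();
  }
  commitBatch(); // flush any partially filled batch

  const survivors = docs.filter((doc) => !deletedIds.includes(doc.id));
  return { deletedIds, survivors };
}

const params = { targetBatchDocs: 2, targetPassDocs: 3 };

// Five expired orphans sort ahead of the single expired owned document.
let shard0 = [];
for (let i = 0; i < 5; i++) shard0.push({ id: "orphan" + i, owned: false });
shard0.push({ id: "owned", owned: true });

// Two consecutive passes make no progress: only orphans are ever staged.
const firstPass = runTtlPass(shard0, params);
const secondPass = runTtlPass(firstPass.survivors, params);

// Once the orphans are cleaned up, the owned document is deleted.
const afterCleanup = runTtlPass(
  secondPass.survivors.filter((doc) => doc.owned),
  params
);
```

In this model, every pass spends its whole `targetPassDocs` budget on the same leading orphans, which is why the owned document is never reached until the orphans are gone.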

~~The issue isn't specific to orphans. It can also manifest when a received chunk has expired documents, but the chunk hasn't been committed to the shard yet.~~

The issue isn't specific to orphans on a donor shard. Expired orphan documents on a recipient shard, which belong to a chunk that has yet to be committed, can also block TTL delete progress.

ttl_stuck_behind_orphans_repro.js
3 kB
Jul 23 2024 11:22:00 PM UTC

is related to

SERVER-56194 Make TTL deletes fair

Closed

Assignee: Unassigned
Reporter: Haley Connelly
Participants: Haley Connelly
Votes: 3
Watchers: 26

Created: Jul 23 2024 11:21:47 PM UTC
Updated: May 05 2025 12:32:39 PM UTC
