The collection scan phase of a hybrid index build can encounter a prepare conflict. To avoid the three-way deadlock described in SERVER-41033, we set the prepare conflict behavior for hybrid index builds to PrepareConflictBehavior::kIgnoreConflictsAllowWrites. Even so, the collection scan phase can still encounter a prepare conflict, and the server then crashes because of the fix added by SERVER-41034. Note that index builds started on a secondary ignore interrupts.
Slightly modifying the repro script from SERVER-44507 to change the number of documents inserted into the collection can make the hybrid index builder hit a prepare conflict during the collection scan phase. Deeper investigation showed that a WT session can lose its config settings (ignore_prepare=force/true). WT then enforces ignore_prepare=false (the default setting) for the hybrid index build collection scan, which leads to a prepare conflict.
Suppose 3 documents were inserted into collection A before the index build started. Then:
1) The index build starts, and the collection scan phase plan executor yields after reading the 2nd document in collection A.
2) A transaction inserts a 4th document into collection A and prepares it.
3) The index build collection scan phase plan executor resumes. As a result, we reacquire locks and open a new WT storage transaction with ignore_prepare=force.
4) The 3rd document is read with the ignore_prepare=force config setting. We then either write it to the external sorter (in-memory buffer) or update the index table (storage layer) for multi-index keys.
5) The index build collection scan now tries to read the 4th document, which is in the prepared state.
Taking a closer look at steps 3-5, we see the following sequence:
- The collection scan's plan executor explicitly begins a WT transaction on state restore. The transaction is opened with ignore_prepare=force. The recovery unit state becomes kActiveNotInUnitOfWork (see the recovery unit state transition diagram).
- Read the 3rd document using cursor.next().
- Start a write unit of work. Since this is the outer WUOW, the recovery unit state becomes kActive.
- Write the 3rd document either to the external sorter memory (bulk) or to the index table (real).
- Commit the write unit of work. This commits the transaction in WT and, as part of __wt_txn_release(), resets the config settings (such as ignore_prepare) associated with that session. The recovery unit state becomes kInactive.
- Read the 4th document using cursor.next() without an explicit begin_transaction() call from the MongoDB layer. WT therefore performs the read in an implicit transaction with the default setting, i.e. ignore_prepare=false. As a result, any read/write call made after the commit in the previous step is no longer guarded against prepare conflict errors.
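The sequence above can be sketched with a small toy model. This is only an illustration under stated assumptions: `ToySession`, `PrepareConflict`, and `scan_remaining` are hypothetical names, not the WiredTiger or MongoDB API, and the model assumes that a prepared insert is simply skipped when prepare conflicts are ignored. The `reopen_after_commit` flag is likewise an assumed mitigation (explicitly beginning a new transaction after each commit), not a description of the actual fix.

```python
class PrepareConflict(Exception):
    """Raised when a read touches a prepared document without ignore_prepare."""


class ToySession:
    """Toy stand-in for a storage session. Hypothetical; not the WiredTiger API."""

    DEFAULT_IGNORE_PREPARE = "false"

    def __init__(self, docs, prepared, pos=0):
        self.docs = docs            # document ids in scan order
        self.prepared = prepared    # ids that are currently in prepared state
        self.pos = pos              # cursor position (index of the next doc to read)
        self.ignore_prepare = self.DEFAULT_IGNORE_PREPARE

    def begin_transaction(self, ignore_prepare):
        self.ignore_prepare = ignore_prepare

    def commit_transaction(self):
        # Models the release step: per-transaction settings fall back to defaults.
        self.ignore_prepare = self.DEFAULT_IGNORE_PREPARE

    def next(self):
        # Without an explicit transaction, the read uses whatever settings remain.
        while self.pos < len(self.docs):
            doc = self.docs[self.pos]
            if doc in self.prepared:
                if self.ignore_prepare == "false":
                    raise PrepareConflict(f"document {doc} is prepared")
                self.pos += 1       # assumption: ignored prepared inserts are skipped
                continue
            self.pos += 1
            return doc
        return None


def scan_remaining(session, reopen_after_commit):
    """Resume the scan after a yield, committing a write after each document."""
    seen = []
    session.begin_transaction("force")          # state restore
    while (doc := session.next()) is not None:
        seen.append(doc)                        # stands in for the index/sorter write
        session.commit_transaction()            # WUOW commit resets the settings
        if reopen_after_commit:
            session.begin_transaction("force")  # assumed mitigation
    return seen


docs, prepared = [1, 2, 3, 4], {4}
print(scan_remaining(ToySession(docs, prepared, pos=2), reopen_after_commit=True))  # prints [3]
try:
    scan_remaining(ToySession(docs, prepared, pos=2), reopen_after_commit=False)
except PrepareConflict as exc:
    print("prepare conflict:", exc)
```

In the toy, the only difference between the two runs is whether an explicit transaction is open when the 4th document is read, which mirrors the last two steps of the sequence.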
Problems caused by this behavior:
1) Concurrency issue: a three-way deadlock like the one in SERVER-41033.
2) Possible data corruption (not sure?).
Thinking about the safety/correctness of the reads and writes performed inside an implicit WT transaction after the write unit of work commits, we see that the implicit transaction is opened with the default isolation level, read-committed, whereas the WT transaction opened by the MongoDB layer uses snapshot isolation. So, is it possible to end up writing the index key for a document (say doc X) twice? Once from the collection scan code path, which reads doc X and inserts its index keys, and once from the side write table drain code path. If so, I think we can hit a problem like the one in SERVER-41529.
3) Do we need to audit our codebase to see whether the same behavior exists elsewhere, i.e. reads/writes outside a transaction (read-committed isolation), and whether it is safe to do so?
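One way to make such an audit mechanical is to have every cursor read check that an explicit storage transaction is active. The sketch below is a toy illustration of that idea only; `ToyRecoveryUnit` and `GuardedCursor` are hypothetical names, and the state strings simply reuse the recovery unit state names mentioned above. It is not how the server implements the check.

```python
class ToyRecoveryUnit:
    """Toy recovery unit tracking only the state names used in this ticket."""

    def __init__(self):
        self.state = "kInactive"

    def begin_transaction(self):
        self.state = "kActiveNotInUnitOfWork"

    def begin_unit_of_work(self):
        self.state = "kActive"

    def commit_unit_of_work(self):
        # Committing the outer unit of work ends the storage transaction.
        self.state = "kInactive"


class GuardedCursor:
    """Cursor wrapper that refuses to read outside an explicit transaction."""

    def __init__(self, recovery_unit, docs):
        self._ru = recovery_unit
        self._it = iter(docs)

    def next(self):
        if self._ru.state == "kInactive":
            raise RuntimeError("cursor read outside an explicit storage transaction")
        return next(self._it, None)


ru = ToyRecoveryUnit()
cursor = GuardedCursor(ru, [3, 4])

ru.begin_transaction()
print(cursor.next())          # prints 3
ru.begin_unit_of_work()
ru.commit_unit_of_work()      # state is now kInactive
try:
    cursor.next()
except RuntimeError as exc:
    print("caught:", exc)
```

With a guard like this, the read of the 4th document in the scenario above would fail loudly at the read site instead of silently running in an implicit transaction with default settings.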
- is depended on by
  - SERVER-44582 Assert that a storage transaction is active for all cursor read operations (Closed)
- is duplicated by
  - SERVER-44581 Index builds must ensure a storage transaction is active to correctly ignore prepare conflicts (Closed)
- is related to
  - SERVER-44590 Logged snapshot IDs should match when opening and closing transactions. (Closed)
  - SERVER-47407 Avoid WriteUnitOfWork in index build collection scan loop (Closed)
- related to
  - SERVER-44507 Hybrid index build is able to commit (acquire stronger mode locks) for a collection that contains prepared documents. (4.2 only) (Closed)
  - SERVER-41033 set ignore_prepare=true throughout any part of index building that happens in runWithoutInterruption (Closed)
  - SERVER-41034 Invariant if we get a prepare conflict inside runWithoutInterruptionExceptAtGlobalShutdown block. (Closed)
  - SERVER-44582 Assert that a storage transaction is active for all cursor read operations (Closed)