The migration logic on the donor shard that performs the initial index scan to gather the documents to clone does not handle plan executor invalidation properly: if the executor is killed while the scan is yielding, the scan ends early and produces a truncated set of documents to clone.
As a result, an index operation that invalidates plan executors while a migration's initial index scan is yielding causes some documents to be skipped by the transfer, and those documents are then deleted from the cluster by the subsequent migration cleanup job.
The following index operations invalidate plan executors and can therefore trigger this issue (example invocations follow the list):
- Dropping an index with the dropIndexes command.
- Aborting an index build with killOp().
- Updating the TTL configuration for an index with the collMod command.
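For reference, a minimal sketch of each of these operations in the mongo shell, assuming a collection test.foo with an index {a: 1}; the currentOp filter and the TTL index {createdAt: 1} below are illustrative assumptions, not taken from this report:

// Drop an index (the shell helper wraps the dropIndexes command).
db.foo.dropIndex({a: 1});

// Abort an in-progress index build by killing its operation. The currentOp
// filter shown here is illustrative; the exact shape of the in-progress entry
// depends on the server version and on how the build was started.
var buildOp = db.currentOp({"query.createIndexes": "foo"}).inprog[0];
if (buildOp) {
    db.killOp(buildOp.opid);
}

// Update the TTL configuration of an index via collMod. The index
// {createdAt: 1} and the new expireAfterSeconds value are hypothetical.
db.runCommand({collMod: "foo",
               index: {keyPattern: {createdAt: 1}, expireAfterSeconds: 3600}});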
This is a regression introduced in version 1.7.2 by 9923c7b6, and affects all versions released since.
The following script will reproduce this issue:
var numDocs = 10000;

// Set up cluster.
var st = new ShardingTest({shards: 2});
var s = st.s0;
var d1 = st.shard1;
var coll = s.getDB("test").foo;
assert.commandWorked(s.adminCommand({enableSharding: coll.getDB().getName()}));
assert.commandWorked(s.adminCommand({shardCollection: coll.getFullName(), key: {_id: "hashed"}}));
for (var i = 0; i < numDocs; i++) {
    coll.insert({_id: i});
}
assert.commandWorked(coll.ensureIndex({a: 1}));

// Check document count.
assert.eq(numDocs, coll.find().itcount());

// Configure server to increase reproducibility.
assert.commandWorked(d1.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 2}));
assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "alwaysOn", data: {namespace: "test.foo", waitForMillis: 100}}));

// Initiate migration and index drop in parallel.
var shell = startParallelShell("sleep(1000); assert.commandWorked(db.foo.dropIndex({a: 1}));", s.port);
assert.commandWorked(s.adminCommand({moveChunk: coll.getFullName(), find: {_id: 0}, to: "shard0000", _waitForDelete: true}));
shell();
assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "off"}));

// Re-check document count.
assert.eq(numDocs, coll.find().itcount());
When run locally with version 3.2.1, the above script fails on the last line with the following:
2016-02-09T11:05:11.076-0500 E QUERY [thread1] Error: [10000] != [7541] are not equal : undefined
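The second number in the assertion is the count of documents remaining in the cluster after the post-migration cleanup. A quick way to confirm that the missing documents are gone from both shards, rather than orphaned on the donor, is to compare the count through mongos with direct per-shard counts; a minimal sketch, reusing the st and coll variables from the script above (st.shard0 is assumed to be the connection to the recipient shard0000, and direct per-shard find() is only a rough check since it would also surface orphaned documents):

// Count through mongos and directly on each shard.
var viaMongos = coll.find().itcount();
var onShard0 = st.shard0.getDB("test").foo.find().itcount();
var onShard1 = st.shard1.getDB("test").foo.find().itcount();
print("mongos: " + viaMongos + ", shard0: " + onShard0 + ", shard1: " + onShard1);
// When the bug is hit, the per-shard counts sum to the truncated total: the
// skipped documents were deleted from the donor by the cleanup job and were
// never cloned to the recipient.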
Issue links:
- is related to SERVER-23425 "Inserts and updates during chunk migration get deleted in 3.0.9, 3.0.10" (Closed)
- related to SERVER-13123 "All callers of PlanExecutor::getNext need to deal with error returns" (Closed)