The migration logic on the donor shard that performs the initial index scan to gather the documents to clone does not handle plan executor invalidation properly: if the executor is killed while the scan is yielding, the scan ends early and produces a truncated set of documents to clone.
As a result, an index operation that invalidates plan executors while a migration's initial index scan is yielding causes some documents to be skipped by the transfer, and those documents are then deleted from the cluster by the subsequent migration cleanup job.
The following index operations invalidate plan executors and can therefore trigger this issue (example invocations follow the list):
- Dropping an index with the dropIndexes command.
- Aborting an index build with killOp().
- Updating the TTL configuration for an index with the collMod command.
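For reference, a minimal sketch of each of these operations in the mongo shell, assuming a collection test.foo with an index {a: 1}; the currentOp filter and the TTL index {createdAt: 1} below are illustrative assumptions, not taken from this report:

// Drop an index (the shell helper wraps the dropIndexes command).
db.foo.dropIndex({a: 1});

// Abort an in-progress index build by killing its operation. The currentOp
// filter shown here is illustrative; the exact shape of the in-progress entry
// depends on the server version and on how the build was started.
var buildOp = db.currentOp({"query.createIndexes": "foo"}).inprog[0];
if (buildOp) {
    db.killOp(buildOp.opid);
}

// Update the TTL configuration of an index via collMod. The index
// {createdAt: 1} and the new expireAfterSeconds value are hypothetical.
db.runCommand({collMod: "foo",
               index: {keyPattern: {createdAt: 1}, expireAfterSeconds: 3600}});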
This is a regression introduced in version 1.7.2 by 9923c7b6, and affects all versions released since.
The following script will reproduce this issue:
var numDocs = 10000;

// Set up cluster.
var st = new ShardingTest({shards: 2});
var s = st.s0;
var d1 = st.shard1;
var coll = s.getDB("test").foo;
assert.commandWorked(s.adminCommand({enableSharding: coll.getDB().getName()}));
assert.commandWorked(s.adminCommand({shardCollection: coll.getFullName(), key: {_id: "hashed"}}));
for (var i = 0; i < numDocs; i++) {
    coll.insert({_id: i});
}
assert.commandWorked(coll.ensureIndex({a: 1}));

// Check document count.
assert.eq(numDocs, coll.find().itcount());

// Configure server to increase reproducibility.
assert.commandWorked(d1.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 2}));
assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "alwaysOn", data: {namespace: "test.foo", waitForMillis: 100}}));

// Initiate migration and index drop in parallel.
var shell = startParallelShell("sleep(1000); assert.commandWorked(db.foo.dropIndex({a: 1}));", s.port);
assert.commandWorked(s.adminCommand({moveChunk: coll.getFullName(), find: {_id: 0}, to: "shard0000", _waitForDelete: true}));
shell();
assert.commandWorked(d1.adminCommand({configureFailPoint: "setYieldAllLocksWait", mode: "off"}));

// Re-check document count.
assert.eq(numDocs, coll.find().itcount());
When run locally with version 3.2.1, the above script fails on the last line with the following:
2016-02-09T11:05:11.076-0500 E QUERY [thread1] Error: [10000] != [7541] are not equal : undefined
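The second number in the assertion is the count of documents remaining in the cluster after the post-migration cleanup. A quick way to confirm that the missing documents are gone from both shards, rather than orphaned on the donor, is to compare the count through mongos with direct per-shard counts; a minimal sketch, reusing the st and coll variables from the script above (st.shard0 is assumed to be the connection to the recipient shard0000, and direct per-shard find() is only a rough check since it would also surface orphaned documents):

// Count through mongos and directly on each shard.
var viaMongos = coll.find().itcount();
var onShard0 = st.shard0.getDB("test").foo.find().itcount();
var onShard1 = st.shard1.getDB("test").foo.find().itcount();
print("mongos: " + viaMongos + ", shard0: " + onShard0 + ", shard1: " + onShard1);
// When the bug is hit, the per-shard counts sum to the truncated total: the
// skipped documents were deleted from the donor by the cleanup job and were
// never cloned to the recipient.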
Issue links:
- is related to SERVER-23425 "Inserts and updates during chunk migration get deleted in 3.0.9, 3.0.10" (Closed)
- related to SERVER-13123 "All callers of PlanExecutor::getNext need to deal with error returns" (Closed)