- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.4.9, 2.5.1, 2.5.5
- Component/s: Sharding
- Fully Compatible
- ALL
ISSUE SUMMARY
A bug in the sharding logic for hashed shard keys affects collections that are sharded on a hashed shard key with the numInitialChunks option specified. Immediately after such a collection is created, some of its chunks cannot be moved with the moveChunk command.
USER IMPACT
This issue can leave data imbalanced and cause problems during balancing in a sharded collection with a hashed shard key.
SOLUTION
Chunk splits now set the correct lower bound in the cached metadata within the shard.
WORKAROUNDS
A restart of mongod on the primary nodes between the shardCollection and moveChunk commands clears out the chunk manager cache and resolves the issue.
AFFECTED VERSIONS
Versions 2.4.0 to 2.4.9 are affected by this bug.
PATCHES
The fix is included in the 2.4.10 production release and the 2.6.0-rc0 release candidate, which will evolve into the 2.6.0 production release.
Original Description
When sharding a collection with a hashed shard key and specifying numInitialChunks, some of the initial chunks cannot be moved immediately afterwards.
jstests are attached.
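For orientation, the scenario can be sketched in JavaScript roughly as follows. This is not the attached jstests: the helper name buildReproCommands, the namespace test.foo, the shard name shard0001, and the chunk count of 16 are all placeholders chosen for illustration. The sketch only builds the admin command documents, so it can be run without a cluster.

```javascript
// Hypothetical helper (not from the attached jstests): returns the admin
// command documents for the scenario described above, in the order they
// would be issued against a mongos.
function buildReproCommands(ns, numInitialChunks, lower, upper, toShard) {
  const dbName = ns.split(".")[0];
  return [
    { enableSharding: dbName },
    // Shard on a hashed key and pre-split into numInitialChunks chunks.
    { shardCollection: ns, key: { x: "hashed" }, numInitialChunks: numInitialChunks },
    // bounds are hashed key values, as they appear in the error messages.
    { moveChunk: ns, bounds: [{ x: lower }, { x: upper }], to: toShard },
  ];
}

// Placeholder values; the bounds are kept as strings here because they
// exceed the range a plain JavaScript number can represent exactly.
const commands = buildReproCommands("test.foo", 16, "0", "1152921504606846974", "shard0001");
console.log(JSON.stringify(commands, null, 2));
```

In the mongo shell, each document would be passed to db.adminCommand(), with the bound values wrapped in NumberLong("...") so that they keep full 64-bit precision.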
In 2.4.9, the characterisation is:
- Only chunks on the last shard are affected.
- All but the final chunk are affected.
- Before a successful chunk move, attempting to move problem chunks gives errors such as:
{ "cause" : { "errmsg" : "exception: ranges differ, requested: { x: 0 } -> { x: 1152921504606846974 } existing: { x: 0 } -> { x: 8070450532247928818 }", "code" : 13587, "ok" : 0 }, "ok" : 0, "errmsg" : "move failed" }
{ "cause" : { "errmsg" : "exception: ranges differ, requested: { x: 1152921504606846974 } -> { x: 2305843009213693948 } existing: { x: 1152921504606846974 } -> { x: MaxKey }", "code" : 13587, "ok" : 0 }, "ok" : 0, "errmsg" : "move failed" }
{ "cause" : { "errmsg" : "exception: ranges differ, requested: { x: 2305843009213693948 } -> { x: 3458764513820540922 } existing: { x: 2305843009213693948 } -> { x: MaxKey }", "code" : 13587, "ok" : 0 }, "ok" : 0, "errmsg" : "move failed" }
...
- After a successful chunk move, attempting to move a problem chunk gives a different error:
{ "ok" : 0, "errmsg" : "no chunk found with those upper and lower bounds" }
In 2.5.1+, the characterisation is:
- All shards are affected.
- All chunks are affected.
- Attempting to move a chunk gives errors such as:
{ "cause" : { "errmsg" : "exception: cannot remove chunk [{ x: 0 }, { x: 1152921504606846974 }), this shard does not contain the chunk and it overlaps [{ x: 0 }, { x: 8070450532247928818 })", "code" : 16855, "ok" : 0 }, "ok" : 0, "errmsg" : "move failed" }
{ "cause" : { "errmsg" : "exception: cannot remove chunk [{ x: 1152921504606846974 }, { x: 2305843009213693948 }), this shard does not contain the chunk and it overlaps [{ x: 0 }, { x: 8070450532247928818 }), [{ x: 1152921504606846974 }, { x: MaxKey })", "code" : 16855, "ok" : 0 }, "ok" : 0, "errmsg" : "move failed" }
{ "cause" : { "errmsg" : "exception: cannot remove chunk [{ x: 2305843009213693948 }, { x: 3458764513820540922 }), this shard does not contain the chunk and it overlaps [{ x: 1152921504606846974 }, { x: MaxKey }), [{ x: 2305843009213693948 }, { x: MaxKey })", "code" : 16855, "ok" : 0 }, "ok" : 0, "errmsg" : "move failed" }
...
The chunks look fine in config.chunks. Restarting the affected shard server between shardCollection and moveChunk allows the chunks to be moved successfully, so this is likely a bug in ChunkManager that causes it to get confused about chunk bounds. Specifically, it looks like the upper bound is not being set properly.
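The "requested" bounds quoted in the error messages above are consistent with the signed 64-bit hash range being divided into evenly sized intervals around zero. The sketch below reproduces those values under two assumptions that this ticket does not state: that numInitialChunks was 16, and that split points are spaced at twice the truncated quotient of the maximum 64-bit integer and the chunk count. The function name initialSplitPoints is hypothetical, and the code is an illustration of the arithmetic rather than the server's implementation.

```javascript
// Largest signed 64-bit integer, the upper end of the hashed key range.
const INT64_MAX = (1n << 63n) - 1n;

// Hypothetical helper: evenly spaced split points, symmetric around zero.
// Assumption: interval = floor(INT64_MAX / numChunks) * 2.
function initialSplitPoints(numChunks) {
  const n = BigInt(numChunks);
  const interval = (INT64_MAX / n) * 2n; // BigInt division truncates
  const splits = [];
  let current = 0n;
  if (n % 2n === 0n) {
    // An even chunk count places a split point exactly at zero.
    splits.push(current);
    current += interval;
  } else {
    // An odd chunk count centres one chunk on zero instead.
    current += interval / 2n;
  }
  for (let i = 0n; i < (n - 1n) / 2n; i++) {
    splits.push(current, -current);
    current += interval;
  }
  // Sort ascending; BigInt values need an explicit comparator.
  return splits.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
}

// Print the non-negative split points for the assumed 16 chunks.
const points = initialSplitPoints(16);
console.log(points.filter((p) => p >= 0n).map(String));
```

Under these assumptions the non-negative split points include 0, 1152921504606846974, 2305843009213693948, 3458764513820540922, and 8070450532247928818, which are the finite bounds that appear in the messages. The "existing" ranges reported by the shard span several of these intervals, which fits the observation that the shard's cached bounds were stale while config.chunks was correct.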