  Core Server / SERVER-12638

Initial sharding with hashed shard key can result in duplicate split points

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.6.1, 2.7.0
    • Affects Version/s: None
    • Component/s: Sharding
    • Environment: Linux

      Issue Status as of April 15, 2014

      ISSUE SUMMARY
      In certain cases, the initial distribution of chunks for a hashed sharded collection across multiple shards can cause mongos to split at the same split point more than once, resulting in corrupted collection metadata on the shard (not visible in the config server). If chunks in this collection are later migrated, the corrupted chunk data can propagate to the config server.
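
      For context, a minimal sketch of the setup that triggers this (mongo shell, run through mongos); the database and collection names are taken from the report below, and the scenario assumes a cluster with at least two shards:

          mongos> sh.enableSharding("database")
          mongos> sh.shardCollection("database.stats_archive_monthly", { a : "hashed" })
          // With multiple shards, mongos pre-splits the hashed key range and distributes
          // the initial chunks; this is the step where duplicate split points can be requested.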

      The resulting empty chunks can be seen via the getShardVersion command with the {fullMetadata : true} option, executed directly against the affected shard's single mongod or replica set primary.
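
      A minimal sketch of this check (mongo shell), using the collection name from the report below; note that it must be run directly against the shard's mongod or replica set primary, not through mongos:

          // Run on the shard itself (single mongod or replica set primary).
          db.adminCommand({ getShardVersion : "database.stats_archive_monthly", fullMetadata : true })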

      USER IMPACT
      This bug can corrupt the config metadata and, in turn, cause existing documents not to be returned by queries.

      WORKAROUNDS
      If the corrupt metadata has not yet propagated to the config servers, the workaround is to step down or restart all shard primaries after sharding the collection on a hashed shard key. This correctly reloads the metadata from the config servers.
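
      A minimal sketch of the step-down variant (mongo shell), run against each shard's primary; the 60-second window is an arbitrary example value:

          // Forcing an election makes the newly elected primary reload chunk
          // metadata from the config servers.
          rs0:PRIMARY> rs.stepDown(60)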

      RESOLUTION
      The fix prevents splitting on existing chunk boundaries, which avoids the duplicate split points.

      AFFECTED VERSIONS
      All recent production releases up to 2.6.0 are affected.

      PATCHES
      The patch is included in the 2.6.1 production release.

      Original description

      In certain cases, the initial distribution of chunks for a hashed sharded collection across multiple shards can create duplicate split points, resulting in invisible, empty chunks with identical "min" and "max" values in the collection metadata. These should not interfere with normal operation, but if chunks in this collection are later migrated, the result may be inconsistent metadata that must be fixed manually.

      These empty chunks can be seen via the getShardVersion command with the "fullMetadata : true" option executed directly against the affected single mongod or replica set primary of the shard. The workaround is to step down or restart the single mongod or primary, which will correctly reload the metadata from the config server.

      Original Description:

      After an unexpected reboot of the application server, I found that mongos started to show errors whenever I ran show collections.

       mongos> show collections;
        Mon Feb  3 22:50:21.680 error: {
          "$err" : "error loading initial database config information :: caused by :: Couldn't load a valid config for database.stats_archive_monthly after 3 attempts. Please try again.",
          "code" : 13282
        } at src/mongo/shell/query.js:128
      

      However, all mongod servers and config servers were healthy, with no issues in their logs.

      First I tried rebooting each server in the cluster, with no success; the error still occurred.

      Then, after a quick look at the mongo source, I found that this error can be caused by overlapping shard key ranges.

      Looking into the shard information for the broken collection, I noticed this:

      database.stats_archive_monthly
              shard key: { "a" : "hashed" }
              chunks:
                  rs1 6
                  rs0 6
              { "a" : { "$minKey" : 1 } } -->> { "a" : NumberLong("-7686143364045646500") } on : rs1 Timestamp(2, 0)
              { "a" : NumberLong("-7686143364045646500") } -->> { "a" : NumberLong("-6148914691236517200") } on : rs1 Timestamp(3, 0)
              { "a" : NumberLong("-6148914691236517200") } -->> { "a" : NumberLong("-4611686018427387900") } on : rs1 Timestamp(4, 0)
              { "a" : NumberLong("-4611686018427387900") } -->> { "a" : NumberLong("-3074457345618258600") } on : rs1 Timestamp(5, 0)
              { "a" : NumberLong("-3074457345618258600") } -->> { "a" : NumberLong("-1537228672809129300") } on : rs1 Timestamp(6, 0)
              { "a" : NumberLong("-1537228672809129300") } -->> { "a" : NumberLong(0) } on : rs1 Timestamp(7, 0)
              { "a" : NumberLong(0) } -->> { "a" : NumberLong("7686143364045646500") } on : rs0 Timestamp(7, 1)
              { "a" : NumberLong("1537228672809129300") } -->> { "a" : NumberLong("3074457345618258600") } on : rs0 Timestamp(1, 9)
              { "a" : NumberLong("3074457345618258600") } -->> { "a" : NumberLong("4611686018427387900") } on : rs0 Timestamp(1, 10)
              { "a" : NumberLong("4611686018427387900") } -->> { "a" : NumberLong("6148914691236517200") } on : rs0 Timestamp(1, 11)
              { "a" : NumberLong("6148914691236517200") } -->> { "a" : NumberLong("7686143364045646500") } on : rs0 Timestamp(1, 12)
              { "a" : NumberLong("7686143364045646500") } -->> { "a" : { "$maxKey" : 1 } } on : rs0 Timestamp(1, 13)
      

      There is a range

      { "a" : NumberLong(0) } -->> { "a" : NumberLong("*7686143364045646500*") } on : rs0 Timestamp(7, 1)
      

      that overlaps all of the shard key ranges from the first replica set.
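
      (For reference, a per-collection chunk breakdown like the one above is what the sharding status helper prints when run through mongos; exact formatting varies by version:)

          mongos> sh.status()            // or: db.printShardingStatus()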

      Some additional statistics: the first replica set contains 73 documents, the second replica set contains 0.

      rs0:PRIMARY> db.stats_archive_monthly.count();
      73
      
      rs1:PRIMARY> db.stats_archive_monthly.count();
      0
      

      The only query that works with this collection is:

       $mongo_db['stats_archive_monthly'].update( {a: account_id, l_id: location_id, t: time.truncate(interval())}, {'$set' => {u: data.to_i}}, upsert: true)
      

      All data on the DB servers is correct. Since this is a staging environment, all documents have

      {"a" : 1}

      so they should all appear on only one shard.
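
      (A minimal sketch of that check, run against the shard's primary; per the description above, distinct should return a single value:)

          rs0:PRIMARY> db.stats_archive_monthly.distinct("a")    // expected: [ 1 ]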

      Somehow, the database is now completely unusable unless it is fully restored.

        1. mongos1-failedShardConfig.log
          214 kB
        2. shard-problem.txt
          36 kB
        3. 20140321_1540.rar
          71 kB
        4. SERVER-12638.js
          2 kB
        5. repro.js
          2 kB
        6. repro24.out.gz
          85 kB

            Assignee:
            Randolph Tan (randolph@mongodb.com)
            Reporter:
            Mikhail Kochegarov (NexoMichael) [X]
            Votes:
            2
            Watchers:
            18
