- Type: Bug
- Resolution: Done
- Priority: Major - P3
- None
- Affects Version/s: 1.8.1
- Component/s: Sharding
- None
- Environment: sharded cluster with three shards, three config servers, connecting through mongos
- ALL
Not sure what's happening here, but mongos seems to be very confused about which databases exist. It threw errors at the application with the message "too many attempts to update config, failing", and at the same time this can be found in the mongos log:
Thu Sep 1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf03:28100]
Thu Sep 1 07:43:43 [conn2] couldn't find database [complete_20110828] in config db
Thu Sep 1 07:43:43 [conn2] put [complete_20110828] on: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04
Thu Sep 1 07:46:10 [LockPinger] dist_lock pinged successfully for: richassembler03.byburt.com:1314862900:1804289383
Thu Sep 1 07:47:59 [mongosMain] dbexit: received signal 15 rc:0 received signal 15
Then it died.
Running "show dbs" in the mongo console while connected to the mongos clearly shows that the database in question exists.
This is not the first problem we've encountered where mongos is confused about which databases exist, and frankly we're getting scared of using sharding because it's so easily corrupted. I haven't found or heard of any way to fix the problem other than to clean the whole cluster and start over.
If you're wondering about the date in the database name, we use an application-side partitioning scheme, mostly because we need to remove old data, but also partly because it's so easy to get a corrupted sharding config, and in such a case we want as little of our active data in that database as possible.
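For readers unfamiliar with this kind of scheme, here is a minimal sketch of date-based database naming and of picking out databases that are old enough to drop. It is not our actual code: the `complete_` prefix comes from the names in the logs, but the one-database-per-day granularity, the `keep_days` retention parameter, and both function names are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta

# Prefix as seen in the log excerpts (e.g. complete_20110828).
PREFIX = "complete_"


def db_name_for(day: date) -> str:
    """Return the database name for a given day (assumes one database per day)."""
    return f"{PREFIX}{day:%Y%m%d}"


def expired_db_names(existing, today: date, keep_days: int):
    """Return the partition databases whose date is older than the retention window.

    `existing` is any iterable of database names, for example the output of
    "show dbs". Names that do not follow the PREFIX + YYYYMMDD pattern are ignored.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in existing:
        if not name.startswith(PREFIX):
            continue
        try:
            day = datetime.strptime(name[len(PREFIX):], "%Y%m%d").date()
        except ValueError:
            continue  # same prefix, but the suffix is not a date
        if day < cutoff:
            expired.append(name)
    return sorted(expired)
```

With a scheme like this, removing old data means dropping a whole database rather than deleting documents, and a corrupted sharding config only affects the partitions that were active at the time.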
This may be related to SERVER-3738, which happened at roughly the same time.
This is some more context from the mongos logs:
Thu Sep 1 07:41:41 [conn7] SyncClusterConnection connecting to [richcollconf01:28100]
Thu Sep 1 07:41:41 [conn7] SyncClusterConnection connecting to [richcollconf02:28100]
Thu Sep 1 07:41:41 [conn3] SyncClusterConnection connecting to [richcollconf01:28100]
Thu Sep 1 07:41:41 [conn7] SyncClusterConnection connecting to [richcollconf03:28100]
Thu Sep 1 07:41:41 [conn3] SyncClusterConnection connecting to [richcollconf02:28100]
Thu Sep 1 07:41:41 [conn3] SyncClusterConnection connecting to [richcollconf03:28100]
Thu Sep 1 07:41:41 [conn8] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:42 [conn7] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:42 [conn5] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:42 [conn2] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn6] warning: splitChunk failed - cmd: { splitChunk: "complete_20110901.exposures", keyPattern: { _id: 1 }, min: { _id: MinKey }, max: { _id: "3LLLLL" }, from: "richcollshard2/richcolldb03.byburt.com:27017,richcolldb
Thu Sep 1 07:41:43 [conn7] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn5] warning: splitChunk failed - cmd: { splitChunk: "complete_20110901.exposures", keyPattern: { _id: 1 }, min: { _id: MinKey }, max: { _id: "3LLLLL" }, from: "richcollshard2/richcolldb03.byburt.com:27017,richcolldb
Thu Sep 1 07:41:43 [conn5] SyncClusterConnection connecting to [richcollconf01:28100]
Thu Sep 1 07:41:43 [conn5] SyncClusterConnection connecting to [richcollconf02:28100]
Thu Sep 1 07:41:43 [conn5] SyncClusterConnection connecting to [richcollconf03:28100]
Thu Sep 1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn6] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn6] SyncClusterConnection connecting to [richcollconf01:28100]
Thu Sep 1 07:41:43 [conn6] SyncClusterConnection connecting to [richcollconf02:28100]
Thu Sep 1 07:41:43 [conn6] SyncClusterConnection connecting to [richcollconf03:28100]
Thu Sep 1 07:41:43 [conn6] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 1
Thu Sep 1 07:41:43 [conn2] autosplitted complete_20110901.exposures shard: ns:complete_20110901.exposures at: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04 lastmod: 9|2 min: { _id: "SSSSSO" } max: { _id: "WEEEE9
Thu Sep 1 07:41:43 [conn6] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn5] autosplitted complete_20110901.exposures shard: ns:complete_20110901.exposures at: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04 lastmod: 11|3 min: { _id: "001F5CLQTKHLAAE4" } max: { _
Thu Sep 1 07:41:43 [conn7] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn2] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn8] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [conn4] ns: complete_20110901.exposures ClusteredCursor::query ShardConnection had to change attempt: 0
Thu Sep 1 07:41:43 [mongosMain] connection accepted from 127.0.0.1:34480 #10
Thu Sep 1 07:41:43 [mongosMain] connection accepted from 127.0.0.1:34481 #11
Thu Sep 1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf01:28100]
Thu Sep 1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf02:28100]
Thu Sep 1 07:41:44 [conn6] SyncClusterConnection connecting to [richcollconf03:28100]
Thu Sep 1 07:43:43 [conn2] couldn't find database [complete_20110828] in config db
Thu Sep 1 07:43:43 [conn2] put [complete_20110828] on: richcollshard2:richcollshard2/richcolldb03.byburt.com:27017,richcolldb04
Thu Sep 1 07:46:10 [LockPinger] dist_lock pinged successfully for: richassembler03.byburt.com:1314862900:1804289383
Thu Sep 1 07:47:59 [mongosMain] dbexit: received signal 15 rc:0 received signal 15
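In case it helps anyone triaging similar logs, below is a rough sketch of how the log text above could be scanned for databases that mongos reported as missing from the config db. It assumes the 1.8-era line layout shown here (weekday, month, day, time, thread in brackets, message) and tolerates entries that have been run together or wrapped. The function names `split_entries` and `missing_databases` are made up for this example and are not part of any MongoDB tool.

```python
import re

# Timestamp layout as it appears in the excerpts, e.g. "Thu Sep 1 07:43:43".
TS = r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z]{2} \d{1,2} \d{2}:\d{2}:\d{2}"
ENTRY = re.compile(rf"({TS}) \[([^\]]+)\] (.*)", re.S)
MISSING = re.compile(r"couldn't find database \[([^\]]+)\] in config db")


def split_entries(text):
    """Split log text into (timestamp, thread, message) tuples.

    Whitespace is normalised first so that entries wrapped across lines or
    run together on one line are handled the same way.
    """
    flat = " ".join(text.split())
    entries = []
    # Split at each space that is immediately followed by "timestamp [thread]".
    for chunk in re.split(rf" (?={TS} \[)", flat):
        match = ENTRY.match(chunk)
        if match:
            entries.append(match.groups())
    return entries


def missing_databases(entries):
    """Return database names from "couldn't find database" messages, in order."""
    found = []
    for _timestamp, _thread, message in entries:
        match = MISSING.search(message)
        if match:
            found.append(match.group(1))
    return found
```

Run against the excerpt above, this would report `complete_20110828`, which is the database that "show dbs" still lists.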