- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.4.3
- Component/s: Sharding
- Environment: Ubuntu 12.04.1 LTS, 3.2.0-32-generic #51-Ubuntu SMP Wed Sep 26 21:33:09 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux, MongoDB 2.4.3
- Operating System: ALL
We started to shard one more of the big collections in our database. The database has 26 collections, some of which are already sharded.
Now every night (UTC) we let the balancer run:
{ "_id" : "balancer", "activeWindow" : { "start" : "18:00", "stop" : "7:00" }, "stopped" : false }
The collection we have now added has around 140 million documents.
"avgObjSize" : 378.40800250149164,
"size" : 52424250472,
What we now see is that the home shard does its cleanup rounds outside the balancer window.
As a result we see a lot of reads and writes on this collection via mongotop.
We profiled the access patterns and think that >80% of the writes are coming from the cleanup job.
Some output from dbtop (web interface) for this collection:
total        Reads        Writes       Queries      GetMores   Inserts   Updates   Removes
2259 84.9%   1987 49.9%   272 34.9%    682 37.9%    5 2.7%     0 0%      40 8.3%   0 0%
2320 84.1%   1479 47.9%   841 36.3%    530 28.9%    3 11.3%    0 0%      6 0.2%    0 0%
In the log file of the server process (primary) we find the following entries:
Tue Aug 27 15:08:25.610 [cleanupOldData-5219670bedeed3fdea9d337b] moveChunk starting delete for: database.CollectionToshard from { targetUid: -5232965359423252304 } -> { targetUid: -5219148617130848963 }
....
Tue Aug 27 15:32:58.264 [cleanupOldData-5219670bedeed3fdea9d337b] Helpers::removeRangeUnlocked time spent waiting for replication: 526999ms
Tue Aug 27 15:32:58.264 [cleanupOldData-5219670bedeed3fdea9d337b] moveChunk deleted 92419 documents for database.CollectionToshard from { targetUid: -5232965359423252304 } -> { targetUid: -5219148617130848963 }
Every cleanup pass deletes around 90k documents in ~24 minutes (per the log above, 92419 documents in roughly 24.5 minutes, i.e. about 60 documents per second, with ~527 seconds of that time spent waiting for replication). This is very slow, and we suffer from periodic bursts of high write IO. During these bursts the mongod service is slow, and reads and some writes queue up (monitored via mongostat).
Is this cleanup job expected to be so aggressive on IO?
Why is this cleanup not done while the balancer runs?
Is there a way to check the status of this cleanup job? (see the sketch below for what we have in mind)
Is there a way to limit the performance impact of the cleanup job?
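For the third question, this is roughly what we have been trying, assuming the cleanupOldData threads are reported by db.currentOp(true) with their thread name in the desc field (we are not certain this holds on 2.4.3):

    // Sketch: scan currentOp (including idle/system operations) for
    // background cleanup threads. Assumes their thread name starts with
    // "cleanupOldData", as seen in the log lines above.
    db.currentOp(true).inprog.forEach(function (op) {
        if (op.desc && op.desc.indexOf("cleanupOldData") === 0) {
            printjson(op);
        }
    });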
Thanks in advance,
Steffen