- Type: Bug
- Resolution: Done
- Priority: Critical - P2
- Affects Version/s: 1.4.0, 1.4.1
- Component/s: None
- None
- Environment: MongoHQ, semi-dedicated environment, 2 replicas and an arbiter (orchid)
Ruby 1.9.2, mongo/bson/bson_ext 1.4.1, Mongoid 2.0.2
- Major Change
The short story is that upgrading to 1.4.0 and then 1.4.1 made our production environment (almost) toast.
- new-relic.png: shows the query performance right after deploying 1.4.0 and then 1.4.1, until the problem went away after downgrading to 1.3.1
- mongohq-conncount.png: shows the number of connections from the Rails app to mongo varying significantly, up to 77; downgrading to 1.3.1 brought it back to a stable 11
- mongostat.png: shows nothing unusual while queries time out from Ruby
- a random sampler (30.times { puts Benchmark.realtime { Mongoid.master.connection.active? }; sleep(1) }), which executes db.runCommand({ ping: 1 }):
0.024342775344848633
0.08080220222473145
2.113878011703491 <-------- not ok
0.023059368133544922
0.03187060356140137
- at the same time we're experiencing timeouts between replicas, but with 1.3.1 they don't affect performance; from the mongodb log:
Fri Oct 21 20:50:01 [ReplSetHealthPollTask] EINTR retry
Fri Oct 21 20:50:01 [ReplSetHealthPollTask] DBClientCursor::init call() failed
Fri Oct 21 20:50:01 [ReplSetHealthPollTask] replSet info arbiter0.orchid.mongohq.com:10001 is down (or slow to respond): DBClientBase::findOne: transport error: arbiter0.orchid.mongohq.com:10001 query:
Fri Oct 21 20:50:05 [ReplSetHealthPollTask] replSet info arbiter0.orchid.mongohq.com:10001 is up
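
For anyone trying to reproduce this, the random sampler above can be written out as a small helper that also flags the slow samples automatically. This is only a sketch: sample_latency and its 1.0 second threshold are hypothetical names and values chosen for illustration, not part of the driver or Mongoid, and the Mongoid call in the comment is the same one used in the report.

```ruby
require 'benchmark'

# Hypothetical helper (not part of mongo or Mongoid): times the given block
# `count` times and marks any sample slower than `threshold` seconds.
# The 1.0 second default is an assumed cutoff, picked only for illustration.
def sample_latency(count, threshold = 1.0, pause = 0)
  Array.new(count) do
    elapsed = Benchmark.realtime { yield }
    sleep(pause) if pause > 0
    { seconds: elapsed, slow: elapsed > threshold }
  end
end

# Against the app from this report it would be called roughly like this:
#   samples = sample_latency(30, 1.0, 1) { Mongoid.master.connection.active? }
#   samples.each { |s| puts "#{s[:seconds]}#{' <-------- not ok' if s[:slow]}" }
```

Passing the ping as a block keeps the helper independent of the driver version, so the same loop can be run unchanged against 1.3.1 and 1.4.1 for comparison.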