- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.0.0-rc2
- Component/s: Performance, Sharding, Stability
- Environment: Ubuntu 10.04 x86_64; 4 shards, each shard a replica set (RS) of two members + arbiter
- Linux
Comment for future reference: this ticket was used primarily to track a writeback listener (WBL) change in 2.0.1. The WBL was not forcing a version reload, which would reset the connection version on the mongod side, even after multiple failed retries. The changes in this ticket fix that.
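As an illustration only, the behavior described in the comment above can be sketched as a bounded retry loop that forces a version reload once a few attempts have failed. This is not the mongos implementation: `StaleVersionError`, `retry_writeback`, `apply_write`, `reload_version`, and the `max_tries` / `force_after` thresholds are all hypothetical names and values chosen for the example.

```python
class StaleVersionError(Exception):
    """Hypothetical error: the shard rejected the write because of a stale version."""


def retry_writeback(apply_write, reload_version, max_tries=5, force_after=2):
    """Retry apply_write() a bounded number of times.

    Once `force_after` attempts have failed, call reload_version(force=True)
    before each further attempt, so the retry does not keep reusing the same
    stale version. Both callables are supplied by the caller (hypothetical).
    """
    for attempt in range(max_tries):
        if attempt >= force_after:
            # Earlier attempts failed: force a reload instead of retrying blindly.
            reload_version(force=True)
        try:
            return apply_write()
        except StaleVersionError:
            continue
    raise RuntimeError("writeback still failing after %d attempts" % max_tries)
```

The loop is bounded on purpose: the symptom reported in this ticket is an indefinite retry, so the sketch gives up with an error after `max_tries` rather than retrying forever.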
This issue may be similar or related to SERVER-4037. Previously, we noted that mongos can enter a state in which it retries a command query indefinitely against some shard members. Prior to 2.0.0-rc2, this resulted in mongos logging many 'writeback failed' messages. That no longer occurs, but we still see behavior that causes the number of command ops/sec on the shards to increase to the 1500-2000 ops/sec range. Note that these command ops are not logged at the shards, but we definitely see them when running mongostat. Bouncing mongos fixes the issue.
On the shards, we can see that mongos has effectively caused a DoS by opening many connections in quick succession:
Fri Oct 21 11:50:49 [initandlisten] connection accepted from mongos:35539 #9663
Fri Oct 21 11:50:50 [initandlisten] connection accepted from mongos:35543 #9664
Fri Oct 21 11:50:51 [initandlisten] connection accepted from mongos:35544 #9665
Fri Oct 21 11:50:51 [initandlisten] connection accepted from mongos:35545 #9666
Fri Oct 21 11:50:53 [initandlisten] connection accepted from mongos:35546 #9667
Fri Oct 21 11:51:01 [initandlisten] connection accepted from mongos:35547 #9668
Fri Oct 21 11:51:02 [initandlisten] connection accepted from mongos:35551 #9669
Fri Oct 21 11:51:02 [initandlisten] connection accepted from mongos:35552 #9670
Fri Oct 21 11:51:03 [initandlisten] connection accepted from mongos:35553 #9671
Fri Oct 21 11:51:03 [initandlisten] connection accepted from mongos:35554 #9672
Fri Oct 21 11:51:05 [initandlisten] connection accepted from mongos:35558 #9673
Fri Oct 21 11:51:12 [initandlisten] connection accepted from mongos:35559 #9674
Fri Oct 21 11:51:13 [initandlisten] connection accepted from mongos:35560 #9675
Fri Oct 21 11:51:13 [initandlisten] connection accepted from mongos:35561 #9676
Fri Oct 21 11:51:14 [initandlisten] connection accepted from mongos:35565 #9677
Fri Oct 21 11:51:14 [initandlisten] connection accepted from mongos:35569 #9678
Fri Oct 21 11:51:16 [initandlisten] connection accepted from mongos:35570 #9679
Fri Oct 21 11:51:18 [initandlisten] connection accepted from mongos:35571 #9680
Fri Oct 21 11:51:18 [initandlisten] connection accepted from mongos:35572 #9681
Fri Oct 21 11:51:18 [initandlisten] connection accepted from mongos:35573 #9682
Fri Oct 21 11:51:19 [initandlisten] connection accepted from mongos:35577 #9683
Fri Oct 21 11:51:19 [initandlisten] connection accepted from mongos:35580 #9684
Fri Oct 21 11:51:19 [initandlisten] connection accepted from mongos:35581 #9685
Fri Oct 21 11:51:21 [initandlisten] connection accepted from mongos:35585 #9686
Fri Oct 21 11:51:22 [initandlisten] connection accepted from mongos:35586 #9687
Fri Oct 21 11:51:22 [initandlisten] connection accepted from mongos:35587 #9688
Fri Oct 21 11:51:29 [initandlisten] connection accepted from mongos:35591 #9689
Fri Oct 21 11:51:30 [initandlisten] connection accepted from mongos:35592 #9690
Fri Oct 21 11:51:31 [initandlisten] connection accepted from mongos:35593 #9691
Fri Oct 21 11:51:32 [initandlisten] connection accepted from mongos:35594 #9692
Fri Oct 21 11:51:34 [initandlisten] connection accepted from mongos:35595 #9693
Fri Oct 21 11:51:37 [initandlisten] connection accepted from mongos:35596 #9694
Fri Oct 21 11:51:38 [initandlisten] connection accepted from mongos:35597 #9695
Fri Oct 21 11:51:38 [initandlisten] connection accepted from mongos:35598 #9696
Fri Oct 21 11:51:40 [initandlisten] connection accepted from mongos:35599 #9697
Fri Oct 21 11:51:40 [initandlisten] connection accepted from mongos:35600 #9698
Fri Oct 21 11:51:41 [initandlisten] connection accepted from mongos:35601 #9699
Fri Oct 21 11:51:42 [initandlisten] connection accepted from mongos:35602 #9700
Fri Oct 21 11:51:42 [initandlisten] connection accepted from mongos:35603 #9701
Fri Oct 21 11:51:42 [initandlisten] connection accepted from mongos:35604 #9702
Fri Oct 21 11:51:43 [initandlisten] connection accepted from mongos:35605 #9703
Fri Oct 21 11:51:44 [initandlisten] connection accepted from mongos:35609 #9704
...
The load on this shard RS member increased to 15 and the member became unresponsive. Bouncing mongos stopped the connections/queries and service returned to normal. During the initial stages of the problem, performance degraded severely, eventually leading to all queries timing out.
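To quantify the connection churn shown in the log excerpt above, a small script along these lines could count accepted connections per source host and minute. It is a rough sketch that assumes the mongod 2.0-style log line format seen in this ticket; `LINE_RE` and `connections_per_minute` are names made up for the example.

```python
import re
from collections import Counter

# Matches lines such as:
#   Fri Oct 21 11:50:49 [initandlisten] connection accepted from mongos:35539 #9663
LINE_RE = re.compile(
    r"^\w{3} \w{3} +\d{1,2} (\d{2}):(\d{2}):\d{2} "
    r"\[initandlisten\] connection accepted from (\S+):\d+ #\d+"
)


def connections_per_minute(lines):
    """Return a Counter mapping (host, 'HH:MM') to the number of accepted connections."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a "connection accepted" line
        hour, minute, host = m.groups()
        counts[(host, hour + ":" + minute)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): python conn_rate.py < mongod.log
    for (host, minute), n in sorted(connections_per_minute(sys.stdin).items()):
        print(host, minute, n)
```

Grouping by minute is an arbitrary choice for the sketch; a shorter window may be more useful when the burst is brief.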
- depends on: SERVER-4240 make 2.0.1 mongos nightly compatible with 2.0.1 mongod (Closed)
- related to: SERVER-3683 Possible for setShardVersion to never be set on mongod after multiple StaleConfigExceptions due to stale/missing mongod metadata (Closed)