- Type: Bug
- Resolution: Done
- Priority: Major - P3
- None
- Affects Version/s: 3.4.14
- Component/s: Networking, Sharding, Stability
- None
- ALL
Our MongoDB cluster consists of 6 servers, each running 1 config server, 1 mongos, and 1 mongod.
At seemingly random intervals, the primary shard, which runs on the 1st server, stops accepting connections.
The whole setup runs in Docker containers on Google Cloud.
We're using the official image.
We run it with the following configuration:
echo never > /sys/kernel/mm/transparent_hugepage/defrag;
echo never > /sys/kernel/mm/transparent_hugepage/enabled;
echo 65535 > /proc/sys/net/core/somaxconn;
mongod \
--configsvr \
--replSet config_rs \
--storageEngine wiredTiger \
--profile=1 \
--slowms=300 \
--port=27019 \
--dbpath /data/db \
--journalCommitInterval 500 \
--setParameter failIndexKeyTooLong=false \
--noIndexBuildRetry
Each of our servers runs the following commands at startup:
sysctl -w vm.swappiness=10
sysctl -w vm.vfs_cache_pressure=50
sysctl -w fs.file-max=100000
sysctl -w net.core.somaxconn=65535
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.ip_forward=1
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w vm.overcommit_memory=1
echo never > /sys/kernel/mm/transparent_hugepage/enabled
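Since `net.core.somaxconn` is scoped to the network namespace, the value a container sees can differ from what was set on the host, so it may be worth confirming the settings from inside the mongod container. Below is a minimal sketch of such a check. The function name `check_sysctl` and the `PROC_SYS` variable are made up for this example, and `PROC_SYS` exists only so the function can be pointed at a scratch directory.

```shell
#!/bin/sh
# Compare an expected sysctl value with what the kernel currently reports.
# PROC_SYS is a hypothetical override used for dry runs; on a real host or
# container it stays at /proc/sys.
PROC_SYS="${PROC_SYS:-/proc/sys}"

check_sysctl() {
    key="$1"
    expected="$2"
    # sysctl keys map to paths: net.core.somaxconn -> net/core/somaxconn
    path="$PROC_SYS/$(printf '%s' "$key" | tr '.' '/')"
    if [ ! -r "$path" ]; then
        echo "MISSING $key"
        return 2
    fi
    actual="$(cat "$path")"
    if [ "$actual" = "$expected" ]; then
        echo "OK $key=$actual"
    else
        echo "MISMATCH $key: expected $expected, got $actual"
        return 1
    fi
}
```

Run inside the container as, for example, `check_sysctl net.core.somaxconn 65535` or `check_sysctl vm.swappiness 10`. A `MISMATCH` line would suggest the startup commands did not take effect in that namespace.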
The mongos log looks like this:
2018-04-15T08:35:46.773+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244876 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:46.774+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.775+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-1-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244878 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:46.775+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244880 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:46.776+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.777+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.777+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.777+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.778+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.781+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244882 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:46.782+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.782+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244884 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:46.784+0000 I - [conn60513] end connection 10.240.8.112:49550 (2205 connections now open)
2018-04-15T08:35:46.796+0000 I - [conn60212] end connection 10.240.1.110:47314 (2204 connections now open)
2018-04-15T08:35:46.801+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.803+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.815+0000 I - [conn60119] end connection 10.240.8.152:43718 (2203 connections now open)
2018-04-15T08:35:46.816+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.819+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244888 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:46.822+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018
2018-04-15T08:35:46.825+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244890 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:46.829+0000 I - [conn60484] end connection 10.240.8.10:59090 (2202 connections now open)
2018-04-15T08:35:46.832+0000 I - [conn60223] end connection 10.240.0.233:55630 (2201 connections now open)
2018-04-15T08:35:46.892+0000 I - [conn60133] end connection 10.240.1.12:42636 (2200 connections now open)
2018-04-15T08:35:46.922+0000 I - [conn60524] end connection 10.240.0.184:53948 (2199 connections now open)
2018-04-15T08:35:46.946+0000 I - [conn60451] end connection 10.240.1.0:51832 (2198 connections now open)
2018-04-15T08:35:46.966+0000 I - [conn60516] end connection 10.240.8.43:43668 (2197 connections now open)
2018-04-15T08:35:46.968+0000 I - [conn60341] end connection 10.240.1.17:43064 (2196 connections now open)
2018-04-15T08:35:47.014+0000 I NETWORK [thread2] connection accepted from 10.240.1.62:36442 #60833 (2196 connections now open)
2018-04-15T08:35:47.016+0000 I NETWORK [conn60833] received client metadata from 10.240.1.62:36442 conn60833: { driver: { name: "nodejs", version: "2.2.34" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.10.0-38-generic" }, platform: "Node.js v8.11.1, LE, mongodb-core: 2.1.18" }
2018-04-15T08:35:47.064+0000 I - [conn60291] end connection 10.240.0.150:58800 (2196 connections now open)
2018-04-15T08:35:47.077+0000 I - [conn60501] end connection 10.240.8.104:35454 (2195 connections now open)
2018-04-15T08:35:47.080+0000 I - [conn59862] end connection 10.240.5.229:60850 (2194 connections now open)
2018-04-15T08:35:47.129+0000 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244892 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }
2018-04-15T08:35:47.206+0000 I - [conn60114] end connection 10.240.5.202:48330 (2193 connections now open)
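To compare a longer stretch of the mongos log across incidents, the failures can be tallied per target rather than read by eye. The following is a rough sketch that assumes the 3.4-style text log format shown here, with one entry per line. The function name `summarize_mongos_log` and the output labels are invented for this example.

```shell
#!/bin/sh
# Count "Failed to connect to <target>" entries per target and report the
# first and last "(N connections now open)" values seen in a mongos log.
summarize_mongos_log() {
    awk '
        /Failed to connect to / {
            # The target is the field right after "connect to".
            for (i = 2; i < NF; i++)
                if ($i == "to" && $(i - 1) == "connect") { failed[$(i + 1)]++; break }
        }
        /connections now open\)/ {
            # The count is the field before "connections", e.g. "(2205".
            for (i = 2; i < NF; i++)
                if ($i == "connections" && $(i + 1) == "now") {
                    n = $(i - 1); sub(/^\(/, "", n)
                    if (first == "") first = n
                    last = n
                    break
                }
        }
        END {
            for (t in failed) printf "failed_connects %s %d\n", t, failed[t]
            if (first != "") printf "open_connections first=%s last=%s\n", first, last
        }
    ' "$@"
}
```

Called as `summarize_mongos_log mongos.log` (the file name is a placeholder). On the excerpt above, split into one entry per line, it should report eight failed connects to database-shard-1:27018 and the client connection count falling from 2205 to 2193.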
I tried connecting to the mongod directly, and that connection timed out as well.
This indicates that the mongod process itself is the problem. It happens only on the 1st mongod server; the other 5 are OK at the moment.
I've checked the number of open files, ulimit, somaxconn...; all looked fine.
I'm attaching an strace of the mongod process, taken from inside the Docker container:
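For reference, checks along these lines can be scripted so they are repeatable during the next incident. This is only a sketch of one possible approach, not necessarily how the values above were collected. It reads Linux `/proc` files, and the function names `fd_usage` and `listen_drops` are made up for this example.

```shell
#!/bin/sh
# Print the number of open file descriptors and the soft "Max open files"
# limit for a PID, using /proc (Linux only).
fd_usage() {
    pid="$1"
    [ -d "/proc/$pid" ] || { echo "no such process: $pid" >&2; return 1; }
    open="$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')"
    # Line format: "Max open files  <soft>  <hard>  files"
    soft="$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")"
    echo "pid=$pid open_fds=$open soft_limit=$soft"
}

# Print the kernel's ListenOverflows and ListenDrops counters. Non-zero,
# growing values would point at a full accept queue on some listening socket.
# The optional file argument exists only so a saved copy can be parsed.
listen_drops() {
    file="${1:-/proc/net/netstat}"
    awk '
        # TcpExt appears twice: a header line of names, then a line of values.
        $1 == "TcpExt:" && !seen { for (i = 2; i <= NF; i++) name[i] = $i; seen = 1; next }
        $1 == "TcpExt:" && seen {
            for (i = 2; i <= NF; i++)
                if (name[i] == "ListenOverflows" || name[i] == "ListenDrops")
                    printf "%s=%s\n", name[i], $i
        }
    ' "$file"
}
```

Both would be run inside the mongod container, for example `fd_usage <mongod pid>` followed by `listen_drops`, because the limits and the network counters are specific to that process and network namespace.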
st st1