Type: Bug
Resolution: Fixed
Priority: Unknown
Affects Version/s: None
Component/s: Connections, Server Selection
Summary
In my scenario, I have an arbiter from which I collect metrics using the Percona MongoDB exporter (https://github.com/percona/mongodb_exporter). The exporter uses the Go driver to establish a direct connection to the arbiter and execute commands like getDiagnosticData and replSetGetStatus every 10 seconds.
The exporter connects to the arbiter as soon as it starts and reuses the same connection throughout its entire life. When the mongod instance and the MongoDB exporter start at the same time (which is typically the case as they reside in the same Kubernetes pod), the connection becomes permanently broken. The following error is returned by the Go driver on every attempt to execute any command:
server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-arbiter:27017, Type: Unknown, Last error: (NotPrimaryOrSecondary) node is not in primary or recovering state }, ] }
Restarting only the MongoDB exporter or disabling the global connection pool (via the --no-mongodb.global-conn-pool flag) fixes the problem.
This is particularly easy to reproduce when the exporter connects to the soon-to-be arbiter instance before the replica set is initiated. In that case, the exporter is initially able to execute commands but starts failing once the replica set is initiated.
MongoDB driver version: v1.10.3 (the issue doesn't reproduce with v1.3.2 used by older versions of the MongoDB exporter)
MongoDB server version: v4.4 or later
Topology: replica set or sharded cluster + arbiter
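For context, the exporter's use of the driver boils down to roughly the following Go sketch: a direct connection opened once at startup and reused, with getDiagnosticData run on it every 10 seconds. This is a minimal approximation for illustration only (the real exporter wraps this in its registry/collector code), not the exporter's actual source:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Direct connection to the arbiter, established once at startup and
	// reused for the whole process lifetime (the --mongodb.global-conn-pool
	// behaviour described above).
	client, err := mongo.Connect(context.Background(),
		options.Client().
			ApplyURI("mongodb://mongo-arbiter:27017/").
			SetDirect(true))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())

	// Run getDiagnosticData on the same client every 10 seconds, mirroring
	// the exporter's scrape loop. Once the node transitions to arbiter state,
	// every attempt fails with the server selection error quoted above.
	for range time.Tick(10 * time.Second) {
		ctx, cancel := context.WithTimeout(context.Background(), 9*time.Second)
		res := client.Database("admin").RunCommand(ctx, bson.D{{Key: "getDiagnosticData", Value: 1}})
		var doc bson.M
		if err := res.Decode(&doc); err != nil {
			fmt.Println("command failed:", err)
		} else {
			fmt.Println("command succeeded")
		}
		cancel()
	}
}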
How to Reproduce
Reproducibility: 100%
1. Create the following prometheus.yml file:
global:
  scrape_interval: 10s
  scrape_timeout: 9s
scrape_configs:
  - job_name: mongo-0
    static_configs:
      - targets:
          - exporter-0:9216
  - job_name: mongo-1
    static_configs:
      - targets:
          - exporter-1:9216
  - job_name: mongo-arbiter
    static_configs:
      - targets:
          - exporter-arbiter:9216
2. Create the following docker-compose.yml file:
version: '3.8'
services:
  mongo-0:
    container_name: mongo-0
    image: mongo:4.4
    networks:
      - mongo
    entrypoint: ["/usr/bin/mongod", "--bind_ip_all", "--replSet", "rs"]
  mongo-1:
    container_name: mongo-1
    image: mongo:4.4
    networks:
      - mongo
    entrypoint: ["/usr/bin/mongod", "--bind_ip_all", "--replSet", "rs"]
  mongo-arbiter:
    container_name: mongo-arbiter
    image: mongo:4.4
    networks:
      - mongo
    entrypoint: ["/usr/bin/mongod", "--bind_ip_all", "--replSet", "rs"]
  exporter-0:
    container_name: exporter-0
    image: percona/mongodb_exporter:0.37.0
    networks:
      - mongo
    command: ["--mongodb.uri=mongodb://mongo-0:27017/", "--mongodb.direct-connect", "--collector.diagnosticdata", "--mongodb.global-conn-pool", "--log.level=debug"]
  exporter-1:
    container_name: exporter-1
    image: percona/mongodb_exporter:0.37.0
    networks:
      - mongo
    command: ["--mongodb.uri=mongodb://mongo-1:27017/", "--mongodb.direct-connect", "--collector.diagnosticdata", "--mongodb.global-conn-pool", "--log.level=debug"]
  exporter-arbiter:
    container_name: exporter-arbiter
    image: percona/mongodb_exporter:0.37.0
    networks:
      - mongo
    command: ["--mongodb.uri=mongodb://mongo-arbiter:27017/", "--mongodb.direct-connect", "--collector.diagnosticdata", "--mongodb.global-conn-pool", "--log.level=debug"]
  victoriametrics:
    container_name: victoriametrics
    image: victoriametrics/victoria-metrics:latest
    networks:
      - mongo
    ports:
      - 8428:8428
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command: ["-promscrape.config", "/etc/prometheus/prometheus.yml"]
networks:
  mongo:
    driver: bridge
3. Start the containers in the background:
docker compose up -d
4. Monitor the exporter-arbiter container logs for 30-40 seconds:
docker logs -f exporter-arbiter
Note that the getDiagnosticData command result is emitted every 10 seconds.
5. Initiate the replica set:
cat <<EOF | docker exec -i mongo-0 mongo
rs.initiate({
  "_id": "rs",
  "members": [
    {"_id": 0, "host": "mongo-0:27017"},
    {"_id": 1, "host": "mongo-1:27017"},
    {"_id": 2, "host": "mongo-arbiter:27017", "arbiterOnly": true}
  ]
});
EOF
6. Monitor the exporter-arbiter container logs for another 30-40 seconds.
docker logs -f exporter-arbiter
Note that the following entries are now emitted instead of the command results:
time="2023-03-30T18:49:31Z" level=warning msg="cannot load topology labels: cannot check if the instance is an arbiter: server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-arbiter:27017, Type: Unknown, Last error: (NotPrimaryOrSecondary) node is not in primary or recovering state }, ] }" time="2023-03-30T18:49:31Z" level=error msg="Registry - Cannot get node type to check if this is a mongos : server selection error: context canceled, current topology: { Type: Single, Servers: [{ Addr: mongo-arbiter:27017, Type: Unknown, Last error: (NotPrimaryOrSecondary) node is not in primary or recovering state }, ] }"
7. Restart the exporter-arbiter container:
docker restart exporter-arbiter
8. Monitor the exporter-arbiter container logs for another 30-40 seconds:
docker logs -f exporter-arbiter
Note that the command results are emitted again.
9. Stop and remove the containers:
docker compose down
Additional Background
Use the --no-mongodb.global-conn-pool flag instead of the --mongodb.global-conn-pool flag in the example above to disable the global connection pool. The issue doesn't reproduce when a new connection is used each time.
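For example, the exporter-arbiter command in the docker-compose.yml above becomes:

command: ["--mongodb.uri=mongodb://mongo-arbiter:27017/", "--mongodb.direct-connect", "--collector.diagnosticdata", "--no-mongodb.global-conn-pool", "--log.level=debug"]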
I forked the v0.11.2 branch of the MongoDB exporter, which uses Go driver v1.3.2 and doesn't have this issue. After upgrading the driver to v1.11.3, the issue started to appear.
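Concretely, the upgrade amounts to bumping the driver dependency in the fork's go.mod, shown here only as an illustration of the before/after versions:

require go.mongodb.org/mongo-driver v1.3.2   // before: commands keep working after rs.initiate
require go.mongodb.org/mongo-driver v1.11.3  // after: the server selection error appears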
is caused by:
- GODRIVER-2829 Requesting an immediate heartbeat check after an error doesn't work for streaming heartbeats (Backlog)
- GODRIVER-2828 SDAM error handling doesn't ignore errors based on topology version correctly (Closed)