FEATURE DESCRIPTION
The Storage Node Watchdog is a new feature in MongoDB designed to detect unresponsive I/O conditions.
VERSIONS
This is an Enterprise-only feature in MongoDB, available in the 3.2.16, 3.4.7, and newer production releases. The watchdog is not available on macOS.
OPERATION
The Storage Node Watchdog is disabled by default:
- It must be enabled at startup as follows:
mongod --setParameter watchdogPeriodSeconds=60
The watchdogPeriodSeconds parameter is an integer number of seconds. It can be either -1 (the default), which disables the watchdog, or a value greater than or equal to 60.
- The watchdog may be paused at runtime by setting watchdogPeriodSeconds to -1 via the setParameter command:
MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : -1})
- The watchdog may be resumed at runtime or its period changed by setting watchdogPeriodSeconds to a value >= 60:
MongoDB Enterprise> db.runCommand({setParameter:1, watchdogPeriodSeconds : 120})
It is an error to set watchdogPeriodSeconds at runtime if the server was not started with a value >= 60.
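The accepted values can be summarised in a small validation sketch. This is plain JavaScript written for illustration, not server code; isValidWatchdogPeriod and canSetAtRuntime are hypothetical names that only restate the rules above.

```javascript
// Hypothetical helpers; they mirror the rules described above and are not part of MongoDB.

// -1 disables the watchdog; any other value must be an integer >= 60.
function isValidWatchdogPeriod(seconds) {
  return Number.isInteger(seconds) && (seconds === -1 || seconds >= 60);
}

// A runtime change is only accepted if the server was started with a period >= 60.
function canSetAtRuntime(startupSeconds, newSeconds) {
  return startupSeconds >= 60 && isValidWatchdogPeriod(newSeconds);
}
```

For example, canSetAtRuntime(60, -1) models pausing a watchdog that was enabled at startup, while canSetAtRuntime(-1, 120) models the error case.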
The watchdog monitors the following directories:
- The --dbpath directory
- The --dbpath/journal directory if the journal is enabled
- The directory of --logpath file
- The directory of --auditPath file
If any of these directories resides on an I/O subsystem that becomes unresponsive, the watchdog detects the condition after sufficient time has passed and terminates mongod, tearing down all of its threads and exiting the process with exit code 61. The maximum time the watchdog can take to detect an unresponsive I/O subsystem is approximately twice watchdogPeriodSeconds.
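The detection bound and the exit code can be expressed as a short sketch, for instance when sizing monitoring alerts or classifying a mongod exit in a supervisor script. The function names are hypothetical, and the factor of two is the approximate upper bound stated above, not an exact guarantee.

```javascript
// Exit code the watchdog uses when it terminates mongod (from the description above).
const WATCHDOG_EXIT_CODE = 61;

// Approximate worst-case time, in seconds, before an unresponsive I/O subsystem is detected.
// Returns null when the watchdog is disabled (-1), since nothing is detected in that case.
function worstCaseDetectionSeconds(periodSeconds) {
  if (periodSeconds === -1) return null;
  return 2 * periodSeconds;
}

// True if a mongod exit code matches the watchdog's exit code.
function isWatchdogExit(exitCode) {
  return exitCode === WATCHDOG_EXIT_CODE;
}
```

With the minimum period of 60 seconds, worstCaseDetectionSeconds(60) gives roughly 120 seconds before termination.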
IMPLEMENTATION DETAILS
The watchdog is implemented as a pair of threads in mongod that monitor the directories MongoDB uses to store data and log files. One thread checks the monitored directories, and a second thread ensures that the first thread never gets stuck. The check thread runs at a fixed 10-second interval.
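The two-thread arrangement can be illustrated with a small simulation. This is a sketch only: plain objects stand in for the threads, the probe callback stands in for real directory I/O, and the assumption that progress is detected by comparing a generation counter is inferred from the serverStatus fields rather than taken from the server source. CheckWorker and Monitor are hypothetical names.

```javascript
// Simulation only; not the mongod implementation.

// Stands in for the check thread: probes each monitored directory in turn.
class CheckWorker {
  constructor(directories, probe) {
    this.directories = directories;
    this.probe = probe;   // probe(dir) returns true if the directory responded
    this.generation = 0;  // incremented once per directory successfully checked
  }

  // One pass over the directories. A real check would block on a hung
  // directory; here the pass simply stops without advancing the counter.
  runOnce() {
    for (const dir of this.directories) {
      if (!this.probe(dir)) return;
      this.generation += 1;
    }
  }
}

// Stands in for the second thread: verifies that the check worker is advancing.
class Monitor {
  constructor(worker) {
    this.worker = worker;
    this.lastSeen = worker.generation;
    this.generation = 0;  // number of progress checks performed
  }

  // Returns true if the worker's counter moved since the previous call.
  checkProgress() {
    this.generation += 1;
    const progressed = this.worker.generation > this.lastSeen;
    this.lastSeen = this.worker.generation;
    return progressed;
  }
}
```

In this model, a checkProgress() call that returns false corresponds to the situation in which the real watchdog would terminate the process.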
DIAGNOSTICS
When enabled, the watchdog logs all changes to watchdogPeriodSeconds at the default log level.
When enabled at startup, the following message will appear in the logs:
CONTROL [initandlisten] Starting Watchdog Monitor
If the watchdog is disabled or its period is changed at runtime, messages like the following will appear in the logs:
CONTROL [initandlisten] WatchdogMonitor disabled
CONTROL [initandlisten] WatchdogMonitor period changed to 120s
At log level 1, the watchdog logs its periodic disk checks:
CONTROL [watchdogCheck] Watchdog test 'checked directory '/data/db/'' took 3ms
If the watchdog was enabled at startup, an additional section named "watchdog" is added to the output of the serverStatus command.
MongoDB Enterprise > db.serverStatus()
...
"watchdog" : {
    "checkGeneration" : NumberLong(2),
    "monitorGeneration" : NumberLong(0),
    "monitorPeriod" : 120
},
...
The meaning of this data is:
- checkGeneration: 64-bit signed integer; indicates the number of directory checks run since startup. It increments once for each directory checked. For example, if dbpath and logpath are specified, then this field is incremented twice every 10 seconds.
- monitorGeneration: 64-bit signed integer; number of times the check thread has been checked for progress.
- monitorPeriod: 64-bit signed integer; the value of the watchdogPeriodSeconds parameter.
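One way to use these counters is to sample the "watchdog" section twice and compare the values. The sketch below assumes the two samples have already been copied into plain JavaScript objects with ordinary numeric fields; summarizeWatchdog is a hypothetical helper, not a shell or driver function.

```javascript
// Hypothetical helper: compares two "watchdog" sections taken some time apart.
// Both arguments are assumed to be plain objects with numeric fields.
function summarizeWatchdog(earlier, later) {
  const checks = later.checkGeneration - earlier.checkGeneration;
  const monitorPasses = later.monitorGeneration - earlier.monitorGeneration;
  return {
    checks,                              // directory checks completed between samples
    monitorPasses,                       // progress checks performed between samples
    checkThreadProgressing: checks > 0,  // false suggests the check thread made no progress
    monitorPeriod: later.monitorPeriod,  // current watchdogPeriodSeconds value
  };
}

// Example with made-up sample values.
const summary = summarizeWatchdog(
  { checkGeneration: 2, monitorGeneration: 0, monitorPeriod: 120 },
  { checkGeneration: 14, monitorGeneration: 1, monitorPeriod: 120 }
);
console.log(summary.checks, summary.checkThreadProgressing);
```

Samples should be spaced further apart than the 10 second check interval, otherwise a difference of zero is expected even on a healthy node.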
TEST METHODOLOGY
We use CharybdeFS, a Linux FUSE file system from ScyllaDB, to create unresponsive I/O conditions and verify that the storage watchdog detects them.
Original description
Implement a storage node watchdog for Linux.
- is duplicated by: SERVER-14139 Disk failure on one node can (eventually) block a whole cluster (Closed)
- is related to: SERVER-30774 Add Storage Node Watchdog to MongoS (Closed)
- related to: SERVER-31457 Mongod stop responding, takes 200 load and don't even switch to secondary (Closed)