On Linux, storage errors, specifically failed writes, can result in unnecessary and (mostly) unsignaled data loss and database corruption. This is due to a combination of two factors:
- The Linux kernel does not retry failed page writes, even on subsequent msyncs, unless the application re-dirties the pages (see below for details).
- After a failed msync mongod clears the journal, logs a slightly obscure message, and continues operation with a now corrupted database.
The net result is that after a storage write error a customer may continue operating for some time with data loss and a corrupted database without being aware of it, until the data in question is next acccessed, at which point they may see an error, typically a bson object size error.
Possible mongod improvements:
- Transient errors: on msync failure mongod could
- not clear the journal until a successful msync
- replay the journal to re-dirty the pages (and restore data possibly lost from mapped memory) which will force the kernel to retry the writes on a subsequent msync
- Permanent errors: on msync failure mongod could
- not clear the journal
- issue a message warning of likely data loss and database corruption
- fassert to shut down, alerting the user to the issue and allowing them to take corrective action sooner rather than later
- The error return from msync does not distinguish between transient and permanent failures, so a hybrid heuristic could be adopted: retry the msync as in the first option above for some (configurable) period of time or number of msync attempts, then treat the error as permanent and fassert as in the second option above.
KERNEL BEHAVIOR ON FAILED WRITES
The following behavior was observed by doing SCSI error injects using systemtap on Linux (CentOS 6.5, 2.6.32-431.el6.centos.plus.x86_64 kernel).
- create and zero a 16 MB file using dd, sync it to disk
- then in parallel do:
- thread 1
- simple standalone C program:
- open and mmap the file
- write some portion of file with non-zero data via the mmap
- msync the file, watch return code
- msync file a second time, watch return code
- then remount fs to make sure kernel buffers are all flushed to disk
- simple standalone C program:
- thread 2 - systemtap script that
- prints SCSI write commands issued against the file
- errors out exactly one of the commands
- thread 1
- then hexdump file to observe content
Observed that
- whether the SCSI error happens during the first msync or before it (page flushed due to memory pressure, triggered by increasing file size from 16 MB to 1 GB) the first msync reports i/o error, meaning we are guaranteed to see an error even if it occurred before the msync
- the second msync however does does not reattempt the failed page write and does not report i/o error
- in fact the failed page write is never reattempted at any time up through the remount
- hexdump after the remount shows the data never made it to disk, confirming the data loss
On the other hand, if the C program is modified to re-dirty the pages before the second msync by rewriting them, the second msync triggers a new attempt to write the previously failed pages, and examining the data for the initially failed pages does make it to disk on the second attempt.
(Similar behavior was also reproduced by using mongod and two fsync db commands, observing msync error log message for first but not second fsync, no reattempt of failed page writes, and observing corrupted db file content by inspection.)
LOGGING
Mongod log message is as follows (can be on different threads depending on how msync is triggered):
[DataFileSync] msync errno:5 Input/output error
Suggest improving, at least in non-recoverable case, to indicate likely data loss and database corruption due to storage errors, and need for corrective action.
Kernel logs messages such as the following (details depend on type of error) to syslog and dmesg:
sd 0:0:2:0: timing out command, waited 1080s
sd 0:0:2:0: SCSI error: return code = 0x06000008
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 48 f2 50 00 00 10 00
end_request: I/O error, dev sdc, sector 6201448
Buffer I/O error on device dm-2, logical block 601268
lost page write due to I/O error on dm-2