WiredTiger compaction can be expensive and sometimes doesn't produce much space savings. (See the opening comment in session_compact.c.) Can we use existing file system features to alleviate this problem?
Many widely used file systems (XFS, ext4, NTFS, APFS) support hole punching. Using this feature, an application can tell the operating system to free the disk space backing a specified range of a file. In principle, this offers a more efficient way to free unused space from WiredTiger files. Rather than re-arranging data within a file so that we can truncate the end of the file, we could punch holes at locations where we have freed data, avoiding the copying overhead (and the subsequent checkpoints) that are part of our current compaction implementation.
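For concreteness, here is a minimal sketch of what punching a hole looks like, assuming Linux and glibc. The helpers punch_hole and range_is_zero are hypothetical names made up for this note; nothing like them exists in WiredTiger today.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes fallocate() and FALLOC_FL_* only with this */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * punch_hole (hypothetical helper): ask the file system to free the space
 * backing [offset, offset + len). Returns 0 on success, an errno value on
 * failure.
 */
static int
punch_hole(int fd, off_t offset, off_t len)
{
#ifdef FALLOC_FL_PUNCH_HOLE
    /* PUNCH_HOLE must be combined with KEEP_SIZE: the file length is unchanged. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len) != 0)
        return errno;
    return 0;
#else
    /* Assumption: without the Linux flag we simply report "not supported". */
    (void)fd;
    (void)offset;
    (void)len;
    return EOPNOTSUPP;
#endif
}

/*
 * range_is_zero (hypothetical helper): return 1 if every byte in
 * [offset, offset + len) reads back as zero, 0 if some byte is non-zero,
 * and -1 on a read error or if the file ends before the range does.
 */
static int
range_is_zero(int fd, off_t offset, size_t len)
{
    unsigned char buf[4096];

    while (len > 0) {
        size_t want = len < sizeof(buf) ? len : sizeof(buf);
        ssize_t got = pread(fd, buf, want, offset);
        if (got <= 0)
            return -1;
        for (ssize_t i = 0; i < got; i++)
            if (buf[i] != 0)
                return 0;
        offset += got;
        len -= (size_t)got;
    }
    return 1;
}
```

After a successful punch the file's length is unchanged and reads of the punched range return zeros; only the backing disk space goes away. A caller should be prepared for EOPNOTSUPP on file systems that don't implement the operation.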
This could be built a few different ways, none of which require major architectural changes:
- Hole punch whenever we place blocks on the available space extent list.
- Have a background task that notices when there is a large disparity between a file's physical size and the amount of data in it. When this happens, the task would walk the available space extent list and punch a hole for every extent on it.
- Add an option to the existing WT_SESSION::compact() interface that would trigger a hole-punching compaction (as in the previous item) rather than using the existing compaction.
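The second and third options both boil down to "decide whether it's worth it, then turn the available list into punch requests." A rough sketch of that logic follows. Everything here is hypothetical: struct extent is a stand-in for an entry on the available space extent list, and worth_punching and plan_punches are names invented for illustration. The one real-world detail it encodes is that, on Linux at least, a punch only frees whole file system blocks (partial blocks are just zeroed), so each extent is shrunk inward to block boundaries.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an entry on the available space extent list. */
struct extent {
    uint64_t off;  /* byte offset in the file */
    uint64_t size; /* length in bytes */
};

/*
 * worth_punching (hypothetical): return 1 if at least pct percent of the
 * file's physical size is not live data, 0 otherwise.
 */
static int
worth_punching(uint64_t physical_size, uint64_t live_bytes, unsigned pct)
{
    if (physical_size == 0 || live_bytes >= physical_size)
        return 0;
    /* Divide first so the comparison cannot overflow; rounds the threshold down. */
    return (physical_size - live_bytes) >= (physical_size / 100) * pct;
}

/*
 * plan_punches (hypothetical): for each available extent, compute the
 * largest block-aligned range inside it and store it in out[] (which must
 * have room for n entries). Extents that don't cover a whole block are
 * skipped. Returns the number of ranges written to out[].
 */
static size_t
plan_punches(const struct extent *avail, size_t n, uint64_t blksz, struct extent *out)
{
    size_t cnt = 0;

    for (size_t i = 0; i < n; i++) {
        /* Round the start up and the end down to a block boundary. */
        uint64_t start = (avail[i].off + blksz - 1) / blksz * blksz;
        uint64_t end = (avail[i].off + avail[i].size) / blksz * blksz;
        if (end > start) {
            out[cnt].off = start;
            out[cnt].size = end - start;
            cnt++;
        }
    }
    return cnt;
}
```

With a 4096-byte block size, an available extent of 10000 bytes at offset 20000 would be planned as a punch of 8192 bytes at offset 20480. The first option (punch on every free) would skip the threshold check and call the punch directly when an extent is added to the list.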
The one big problem I see (from the WT perspective) is interaction with backup. It would take some bigger changes to preserve the space-saving benefits of hole punching across backups. None of our existing backup techniques (full, incremental, block-incremental) would preserve holes that have been punched in source files.
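To illustrate what "preserving holes" would mean at the file-copy level: a plain read/write copy reads zeros out of every hole and writes them to the destination, which allocates space for them. A hole-aware copy would instead have to find the data segments and copy only those. Here is a sketch of the enumeration step, assuming a platform with lseek(SEEK_DATA) and lseek(SEEK_HOLE), as on Linux. The function name count_data_bytes is hypothetical, and this is not a description of how any WiredTiger backup mode works.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes SEEK_DATA and SEEK_HOLE only with this */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * count_data_bytes (hypothetical): add up the bytes of the file at path that
 * the file system reports as data rather than holes. Returns 0 on success
 * (storing the total in *data_bytesp) or an errno value on failure.
 */
static int
count_data_bytes(const char *path, off_t *data_bytesp)
{
    off_t end, total = 0;
    int fd, ret = 0;

    if ((fd = open(path, O_RDONLY)) < 0)
        return errno;
    if ((end = lseek(fd, 0, SEEK_END)) < 0) {
        ret = errno;
        goto done;
    }
#if defined(SEEK_DATA) && defined(SEEK_HOLE)
    for (off_t pos = 0, data, hole; pos < end; pos = hole) {
        if ((data = lseek(fd, pos, SEEK_DATA)) < 0) {
            if (errno != ENXIO) /* ENXIO: nothing but a hole up to end of file */
                ret = errno;
            break;
        }
        if ((hole = lseek(fd, data, SEEK_HOLE)) < 0) {
            ret = errno;
            break;
        }
        /* A hole-aware copy would copy exactly [data, hole) here. */
        total += hole - data;
    }
#else
    /* Assumption: without SEEK_DATA, treat the whole file as data. */
    total = end;
#endif
done:
    close(fd);
    if (ret == 0)
        *data_bytesp = total;
    return ret;
}
```

File systems that don't track holes are allowed to report the whole file as a single data segment, so the result is an upper bound on what a copy would need to transfer, not an exact measure of allocated space.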
Of course, there are other potential downsides to doing this:
- POSIX doesn't define an API for hole punching, so the code would have to be specialized for the underlying operating system. Linux uses fallocate() with FALLOC_FL_PUNCH_HOLE (which must be combined with FALLOC_FL_KEEP_SIZE), macOS uses fcntl(F_PUNCHHOLE), and some systems don't support it at all.
- When we write data to a file location that we previously hole punched, the file system will have to allocate space to the file, incurring a bit of additional overhead. More importantly, we could get back ENOSPC errors in places we currently don't expect them.
- Repeatedly hole punching a file and then filling the holes will result in on-disk file fragmentation. This probably isn't much of an issue, since WiredTiger doesn't guarantee sequential data layout within files. That is, even today a scan of an entire btree doesn't turn into sequential reads of its file.
- Customers might be confused to see file sizes (e.g., via ls -l) that don't match the disk space the files actually use (e.g., via du).
- Hole punching probably isn't used as much as many other file system features, and therefore might have more bugs.
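On the file size confusion: the two numbers customers would see come from different fields of the same stat() result, so a diagnostic could report both side by side. A minimal sketch, assuming st_blocks is counted in 512-byte units (true on Linux and macOS, but not guaranteed by POSIX). The names file_space and print_file_space are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * file_space (hypothetical): report a file's apparent size (what ls -l
 * shows) and its allocated size (what du shows). Returns 0 on success or
 * an errno value on failure.
 */
static int
file_space(const char *path, uint64_t *apparentp, uint64_t *allocatedp)
{
    struct stat sb;

    if (stat(path, &sb) != 0)
        return errno;
    *apparentp = (uint64_t)sb.st_size;
    /* Assumption: st_blocks is in 512-byte units. */
    *allocatedp = (uint64_t)sb.st_blocks * 512;
    return 0;
}

/* print_file_space (hypothetical): print both sizes for one file. */
static void
print_file_space(const char *path)
{
    uint64_t apparent, allocated;
    int rc = file_space(path, &apparent, &allocated);

    if (rc != 0) {
        fprintf(stderr, "%s: stat failed (errno %d)\n", path, rc);
        return;
    }
    printf("%s: apparent %llu bytes, allocated %llu bytes\n", path,
        (unsigned long long)apparent, (unsigned long long)allocated);
}
```

For a file with punched holes the allocated size would be smaller than the apparent size; for a small, fully written file it is usually larger, because space is allocated in whole blocks.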
Overall, I think that the backup issues are probably a deal-breaker here. But having thought about this for a bit, I figured I'd write it up for posterity.