- Type: Improvement
- Resolution: Won't Do
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Performance
- Storage Execution
Right now we checksum each KV pair independently using murmur3, and only after finishing a file do we validate that the checksum matches. There are a few issues with this:
- Murmur3 is a mediocre hash function at this point, both for performance and for error detection.
  - Fix: Use crc32c (from WiredTiger).
- Hashing small chunks of data is slower than hashing big chunks.
  - Fix: We are already producing buffers of data for compression purposes. We should checksum the big buffer, either before or after compression. Checksumming before compression verifies that decompression produces the right result, but checksumming after compression both checksums less data and avoids sending garbage into the decompressor. Since we trust Snappy to decompress correctly when fed good input, I think checksumming after compression makes sense.
- We wait until we finish a whole file before checking the checksums. This (1) wastes work that we could have avoided by aborting earlier, (2) risks sending garbage data to consumers that aren't prepared for it, and (3) assumes we will actually reach the end of the file, which consumers like TopKSorter are unlikely to do.
  - Fix: Check the checksum immediately after reading a chunk from the file (and after decompression, if the checksum was computed prior to compression).
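To make the proposal concrete, here is a minimal sketch of per-chunk checksumming after compression, with validation as soon as each chunk is read. It is illustrative only and is not the server's sorter code: the chunk header layout, `write_chunk`, `read_chunk`, and `ChecksumMismatch` are hypothetical names, `zlib` stands in for Snappy, and the pure-Python `crc32c` stands in for the hardware-accelerated routine that WiredTiger provides.

```python
import io
import struct
import zlib

# CRC-32C (Castagnoli) lookup table, built from the reflected polynomial 0x82F63B78.
_CRC32C_TABLE = []
for _i in range(256):
    _crc = _i
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _CRC32C_TABLE.append(_crc)


def crc32c(data: bytes) -> int:
    """Table-driven CRC-32C; a slow stand-in for a hardware-accelerated routine."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Hypothetical chunk header: compressed length, then CRC-32C of the compressed bytes.
_HEADER = struct.Struct("<II")


class ChecksumMismatch(Exception):
    """Raised when a chunk is truncated or fails checksum validation."""


def write_chunk(out, kv_pairs):
    """Serialize a batch of KV pairs into one buffer, compress it, and
    checksum the compressed bytes (one checksum per chunk, not per pair)."""
    raw = b"".join(
        struct.pack("<II", len(k), len(v)) + k + v for k, v in kv_pairs
    )
    compressed = zlib.compress(raw)  # assumption: zlib as a stand-in for Snappy
    out.write(_HEADER.pack(len(compressed), crc32c(compressed)))
    out.write(compressed)


def read_chunk(inp):
    """Read one chunk and validate it before decompressing, so corrupt bytes
    never reach the decompressor or the consumer. Returns None at end of file."""
    header = inp.read(_HEADER.size)
    if not header:
        return None
    if len(header) < _HEADER.size:
        raise ChecksumMismatch("truncated chunk header")
    length, expected = _HEADER.unpack(header)
    compressed = inp.read(length)
    if len(compressed) != length or crc32c(compressed) != expected:
        raise ChecksumMismatch("chunk failed CRC-32C validation")
    raw = zlib.decompress(compressed)
    pairs, pos = [], 0
    while pos < len(raw):
        klen, vlen = struct.unpack_from("<II", raw, pos)
        pos += 8
        pairs.append((raw[pos:pos + klen], raw[pos + klen:pos + klen + vlen]))
        pos += klen + vlen
    return pairs
```

A consumer that stops early (as TopKSorter might) still gets validated data in this scheme, because every chunk it reads has already passed its own check; nothing depends on reaching the end of the file.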
- is related to: WT-7236 Cache the result of the wiredtiger_crc32c_func() return value (Closed)