-
Type: Improvement
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Storage Engines
-
8
-
2024-02-06 tapioooooooooooooca
My testing shows that we can improve the performance of crc32c on neoverse-n1/graviton2 by making the main loop process 16 bytes at a time rather than 8. This enables using the ldp (load pair) instruction rather than ldr (load register), and it also reduces the per-byte loop overhead. Throughput goes from ~16GB/s to 18.3GB/s, which seems to be about the rate at which a single core can issue loads. I saw no additional advantage when going to 32 bytes at a time.
Additional improvements:
- Stop trying to align before entering the main loop. Doesn't seem to improve perf vs just letting the main loop do unaligned loads.
- Use a cascade of if (remaining & 4)/if (remaining & 2)/if (remaining & 1) blocks to handle the tail in at most 3 crc operations rather than looping a byte at a time. (This becomes at most 4 once a case for bytes & 8 is added after expanding the main loop to 16 bytes.)
- Use the __crc32cX(...) (where X is one of b-yte, h-alfword, w-ord, or d-oubleword) intrinsics from #include <arm_acle.h> rather than inline asm.
Here's the code I found worked well. It's C++ designed to update a member variable _val with the incremental hash, but it should be easy to translate to C with whichever API you want.
void addBytes(const void* start, size_t bytes) {
    auto p = static_cast<const char*>(start);

    // Unfortunately our compiler tries to update _val as it goes rather than keeping it in
    // a register.
    auto reg = _val;

    // Do chunks of 16 bytes at a time. (faster than doing 8)
    while (bytes >= 16) {
        auto a = loadAt<uint64_t>(p);
        auto b = loadAt<uint64_t>(p + 8);
        reg = __crc32cd(reg, a);
        reg = __crc32cd(reg, b);
        p += 16;
        bytes -= 16;
    }

    // Now pick off the tails.
    if (bytes & 8) {
        reg = __crc32cd(reg, loadAt<uint64_t>(p));
        p += 8;
    }
    if (bytes & 4) {
        reg = __crc32cw(reg, loadAt<uint32_t>(p));
        p += 4;
    }
    if (bytes & 2) {
        reg = __crc32ch(reg, loadAt<uint16_t>(p));
        p += 2;
    }
    if (bytes & 1) {
        reg = __crc32cb(reg, loadAt<uint8_t>(p));
    }

    // Finally, put the register back.
    _val = reg;
}
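The snippet above relies on a loadAt helper and a _val member that are not shown. Here is a minimal self-contained sketch of the same loop as a free function, for anyone who wants to try it outside that class. Everything named here is an assumption for illustration: loadAt is guessed to be a memcpy-based unaligned load, crcB/crcH/crcW/crcD are hypothetical wrappers that call the arm_acle.h intrinsics when __ARM_FEATURE_CRC32 is defined and otherwise fall back to a slow bitwise CRC32C so the code can be checked on other machines, and crc32c() applies the conventional all-ones initial value and final inversion, which the ticket does not specify. A little-endian host is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
// Hypothetical thin wrappers over the ACLE intrinsics.
static inline uint32_t crcB(uint32_t c, uint8_t v) { return __crc32cb(c, v); }
static inline uint32_t crcH(uint32_t c, uint16_t v) { return __crc32ch(c, v); }
static inline uint32_t crcW(uint32_t c, uint32_t v) { return __crc32cw(c, v); }
static inline uint32_t crcD(uint32_t c, uint64_t v) { return __crc32cd(c, v); }
#else
// Slow bitwise stand-in (reflected polynomial 0x82F63B78), for checking only.
// Folds the low n bytes of v into crc, least significant byte first.
static inline uint32_t crcBytesLE(uint32_t crc, uint64_t v, int n) {
    for (int i = 0; i < n; ++i) {
        crc ^= static_cast<uint8_t>(v >> (8 * i));
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}
static inline uint32_t crcB(uint32_t c, uint8_t v) { return crcBytesLE(c, v, 1); }
static inline uint32_t crcH(uint32_t c, uint16_t v) { return crcBytesLE(c, v, 2); }
static inline uint32_t crcW(uint32_t c, uint32_t v) { return crcBytesLE(c, v, 4); }
static inline uint32_t crcD(uint32_t c, uint64_t v) { return crcBytesLE(c, v, 8); }
#endif

// Assumed shape of loadAt: an unaligned load expressed through memcpy.
template <typename T>
static inline T loadAt(const char* p) {
    T v;
    std::memcpy(&v, p, sizeof(T));
    return v;
}

// Same structure as addBytes above, with the running value passed in and returned.
uint32_t crc32cUpdate(uint32_t reg, const void* start, size_t bytes) {
    assert(start != nullptr || bytes == 0);
    auto p = static_cast<const char*>(start);

    // Main loop: two 8-byte loads per iteration.
    while (bytes >= 16) {
        auto a = loadAt<uint64_t>(p);
        auto b = loadAt<uint64_t>(p + 8);
        reg = crcD(reg, a);
        reg = crcD(reg, b);
        p += 16;
        bytes -= 16;
    }

    // Tail: each set bit of the remaining count (0..15) costs one crc operation.
    if (bytes & 8) { reg = crcD(reg, loadAt<uint64_t>(p)); p += 8; }
    if (bytes & 4) { reg = crcW(reg, loadAt<uint32_t>(p)); p += 4; }
    if (bytes & 2) { reg = crcH(reg, loadAt<uint16_t>(p)); p += 2; }
    if (bytes & 1) { reg = crcB(reg, loadAt<uint8_t>(p)); }
    return reg;
}

// One-shot helper using the conventional CRC32C init and final inversion (an assumption).
uint32_t crc32c(const void* start, size_t bytes) {
    return ~crc32cUpdate(~0u, start, bytes);
}
```

On AArch64 the intrinsic path needs the CRC extension enabled at compile time (for example -march=armv8-a+crc with GCC or Clang); otherwise the fallback is used and the numbers say nothing about performance.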
is related to: WT-12293 Optimize our crc32 implementation for x86 (Backlog)