Type: Improvement
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT11.3.0, 7.3.0-rc0
Affects Version/s: None
Component/s: None
Labels:
- perf-improvement

Assigned Teams: Storage Engines
Sprint: 2024-02-06 tapioooooooooooooca
Story Points: 8

My testing shows that we can improve the perf of crc32c on neoverse-n1/graviton2 by making the main loop process 16 bytes per iteration rather than 8. This enables using the ldp (load pair) instruction rather than ldr (load register). It also halves the number of loop iterations, so the loop overhead is amortized over more bytes. This gets us from ~16GB/s to 18.3GB/s, which seems to be about the limit at which a single core can issue loads. I saw no additional advantage from going to 32 bytes at a time.

Additional improvements:

- Stop trying to align before entering the main loop. It doesn't seem to improve perf compared with just letting the main loop do unaligned loads.
- Use a cascade of if (remaining & 4) / if (remaining & 2) / if (remaining & 1) blocks to handle the tail in at most 3 crc operations rather than looping a byte at a time. (At most 4 once the main loop is expanded to 16 bytes, which adds a case for remaining & 8.)
- Use the __crc32cX(...) intrinsics (where X is one of b-yte, h-alfword, w-ord, or d-oubleword) from #include <arm_acle.h> rather than inline asm.

Here's the code I found worked well. It's C++ designed to update a member variable _val with the incremental hash, but it should be easy to translate to C with whichever API you want.

    void addBytes(const void* start, size_t bytes) {
        auto p = static_cast<const char*>(start);

        // Unfortunately our compiler tries to update _val as it goes rather than keeping it in
        // a register.
        auto reg = _val;

        // Do chunks of 16 bytes at a time. (faster than doing 8)
        while (bytes >= 16) {
            auto a = loadAt<uint64_t>(p);
            auto b = loadAt<uint64_t>(p + 8);
            reg = __crc32cd(reg, a);
            reg = __crc32cd(reg, b);
            p += 16;
            bytes -= 16;
        }

        // Now pick off the tails.
        if (bytes & 8) {
            reg = __crc32cd(reg, loadAt<uint64_t>(p));
            p += 8;
        }
        if (bytes & 4) {
            reg = __crc32cw(reg, loadAt<uint32_t>(p));
            p += 4;
        }
        if (bytes & 2) {
            reg = __crc32ch(reg, loadAt<uint16_t>(p));
            p += 2;
        }
        if (bytes & 1) {
            reg = __crc32cb(reg, loadAt<uint8_t>(p));
        }

        // Finally, put the register back.
        _val = reg;
    }
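
The snippet above relies on a loadAt helper and an enclosing class that the ticket does not show. The following is a minimal, self-contained sketch of how they might look, not the reporter's actual code. The assumptions are: loadAt is a memcpy-based unaligned load, the host is little-endian, the class name Crc32c and the wrappers crcB/crcH/crcW/crcD are hypothetical, and the state uses the usual CRC32C convention (initial value of all ones, inverted on read). On builds without the ARM CRC32 extension the wrappers fall back to a slow bitwise loop so the structure can still be compiled and checked.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>
// Thin wrappers over the ACLE intrinsics named in the ticket.
inline uint32_t crcD(uint32_t c, uint64_t v) { return __crc32cd(c, v); }
inline uint32_t crcW(uint32_t c, uint32_t v) { return __crc32cw(c, v); }
inline uint32_t crcH(uint32_t c, uint16_t v) { return __crc32ch(c, v); }
inline uint32_t crcB(uint32_t c, uint8_t v) { return __crc32cb(c, v); }
#else
// Assumption: portable stand-in for non-ARM builds. Bitwise CRC32C
// (reflected polynomial 0x82F63B78) over the low nbytes of v, lowest byte first.
inline uint32_t crcSoft(uint32_t crc, uint64_t v, int nbytes) {
    for (int i = 0; i < nbytes; ++i) {
        crc ^= static_cast<uint8_t>(v >> (8 * i));
        for (int k = 0; k < 8; ++k)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}
inline uint32_t crcD(uint32_t c, uint64_t v) { return crcSoft(c, v, 8); }
inline uint32_t crcW(uint32_t c, uint32_t v) { return crcSoft(c, v, 4); }
inline uint32_t crcH(uint32_t c, uint16_t v) { return crcSoft(c, v, 2); }
inline uint32_t crcB(uint32_t c, uint8_t v) { return crcSoft(c, v, 1); }
#endif

// Assumed helper: unaligned load via memcpy, which compilers typically
// lower to a plain load instruction.
template <typename T>
inline T loadAt(const char* p) {
    T v;
    std::memcpy(&v, p, sizeof(T));
    return v;
}

// Hypothetical wrapper class around the addBytes routine from the ticket.
class Crc32c {
public:
    void addBytes(const void* start, size_t bytes) {
        auto p = static_cast<const char*>(start);
        auto reg = _val;  // keep the running value in a local

        // Main loop: 16 bytes per iteration.
        while (bytes >= 16) {
            auto a = loadAt<uint64_t>(p);
            auto b = loadAt<uint64_t>(p + 8);
            reg = crcD(reg, a);
            reg = crcD(reg, b);
            p += 16;
            bytes -= 16;
        }

        // Tail: fewer than 16 bytes remain, handled in at most 4 steps.
        if (bytes & 8) { reg = crcD(reg, loadAt<uint64_t>(p)); p += 8; }
        if (bytes & 4) { reg = crcW(reg, loadAt<uint32_t>(p)); p += 4; }
        if (bytes & 2) { reg = crcH(reg, loadAt<uint16_t>(p)); p += 2; }
        if (bytes & 1) { reg = crcB(reg, loadAt<uint8_t>(p)); }

        _val = reg;
    }

    // Assumed convention: final inversion when reading the checksum.
    uint32_t value() const { return ~_val; }

private:
    uint32_t _val = 0xFFFFFFFFu;  // assumed initial state
};

// Byte-at-a-time reference, useful only for cross-checking the fast path.
inline uint32_t crc32cBytewise(const void* start, size_t bytes) {
    auto p = static_cast<const unsigned char*>(start);
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < bytes; ++i)
        crc = crcB(crc, p[i]);
    return ~crc;
}
```

Under these assumptions, feeding the 9-byte string "123456789" through addBytes and reading value() should give the standard CRC32C check value 0xE3069283, and splitting the input across several addBytes calls should give the same result.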

is related to

WT-12293 Optimize our crc32 implementation for x86

Backlog

Assignee: Chenhao Qu
Reporter: Mathias Stearn
Votes: 0
Watchers: 7

Created: Nov 20 2023 10:05:04 AM UTC
Updated: Mar 04 2024 07:10:50 AM UTC
Resolved: Jan 18 2024 11:51:22 AM UTC
