Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- perf-tiger-v2

Assigned Teams: DevProd Build
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

A few things to consider in the 9.0 timeframe, especially after we bump the toolchain:

In addition to testing clang vs gcc, we should probably test libc++ vs libstdc++. This should probably be a 4-way comparison with both compilers each tested with both stdlibs (easy with clang, but requires some flag fiddling with gcc). At the very least, I know that libc++ has a much better implementation of std::string, and we may even want to add a similar implementation under stdx:: if we stick with libstdc++.
We should strongly consider enabling LIBCXX_ABI_UNSTABLE when building llvm/libc++. It enables a lot of improvements and optimizations that can't be turned on by default because they would break the ABI. But since we embed the stdlib in our release binaries, we are fine taking that ABI break. Of note, this enables an even better std::string and enables shared_ptr and unique_ptr to be passed in registers when passed by value (only on clang unfortunately).
When testing clang, we should mark some important types with [[clang::trivial_abi]], including at least Value, BSONObj, and boost::intrusive_ptr. This will both allow them to be passed and returned in registers and make vectors of them use memcpy for growth on libc++. Marking boost::intrusive_ptr will automatically take care of many types (SharedBuffer, Message, Document, etc) and is required for at least BSONObj and Future. We can either modify our vendored copy of the boost sources (easiest to POC), or we can just copy its small implementation into mongo:: and find-replace all usages.
Our arm64 builds should either use -mcpu=neoverse-n1 (emit code for the underlying cpu core used in graviton2 and the other cloud providers) or at least add +rcpc to our -march= flag to opt in to the RCpc (Release Consistency Processor Consistent) extension and the ldapr instruction (basically the "real" load-acquire, since ldar, which is named load-acquire, actually has sequentially consistent semantics). This will improve performance everywhere that memory_order_acquire is used as well as for accessing thread-safe statics in functions. While not a required part of armv8.2, it appears to be extremely widely supported in actual cpus. While our current toolchain does support these flags, they don't affect codegen until newer versions of gcc and clang.
- For graviton4 builds we should use -mcpu=neoverse-v2
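
To make the arm64 point concrete, here is a minimal sketch of the two code patterns mentioned above: an explicit memory_order_acquire load and a thread-safe function-local static. The Mailbox class and buildLabel function are hypothetical names made up for illustration, not code from our tree. The idea would be to compile something like this with and without +rcpc on a new enough compiler and compare which load instruction is emitted; I have not included expected assembly since it depends on the compiler version.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical single-slot mailbox; the names are illustrative only.
class Mailbox {
public:
    // Writer: fill the payload, then publish it with a release store.
    void publish(int value) {
        _payload = value;
        _ready.store(true, std::memory_order_release);
    }

    // Reader: the acquire load below is the kind of operation a compiler
    // can lower to ldapr rather than ldar once RCpc is enabled.
    bool tryRead(int& out) const {
        if (!_ready.load(std::memory_order_acquire))
            return false;
        out = _payload;
        return true;
    }

private:
    int _payload = 0;
    std::atomic<bool> _ready{false};
};

// Function-local static with a dynamic initializer. Thread-safe
// initialization is typically implemented with a guard variable that is
// checked with an acquire load on every call.
inline const std::string& buildLabel() {
    static const std::string label =
        std::string("toolchain-") + std::to_string(9);
    return label;
}
```

Both patterns are hot in practice because the acquire load sits on the fast path: every tryRead call and every buildLabel call after the first pays for it.
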
Build flags to explore when building the toolchain to see if they make compiles faster on our workstations:
- ~~LLVM_ENABLE_RPMALLOC uses a vendored copy of rpmalloc rather than the system malloc. Default on windows, should see if it is also faster on linux.~~ Turns out this is only supported on windows, but see next line for an alternative
- Use mimalloc on linux -DCMAKE_EXE_LINKER_FLAGS=-Wl,--push-state,path/to/mimalloc/out/release/libmimalloc.a,--pop-state as suggested by the last comment on this. Alternatively, could try jemalloc which is what rustc uses for its binaries, or tcmalloc which we use in mongod. Even though we are more familiar with tcmalloc, clang's workload is probably more similar to rustc's than mongod's, especially since both use LLVM.
  - We should also explore alternative mallocs for gcc.
- LLVM_ENABLE_LTO: I'm surprised we aren't already doing this! In addition to using LTO for the new toolchain, we should also consider enabling it for v4. Might be worth seeing if Full is any faster than Thin. While it is slower to build, if it makes clang even a few percent faster, that seems like a clearly valuable tradeoff, at least once the toolchain is closer to a stable point so you aren't iterating on it frequently.
- LLVM_ENABLE_MODULES: Makes the compile of llvm itself a bit faster, so useful when iterating on the toolchain. Shouldn't impact the end result.
- LLVM_STATIC_LINK_CXX_STDLIB and LLVM_ENABLE_LIBCXX should couple nicely with LIBCXX_ABI_UNSTABLE to use faster stdlib implementations, and possibly enable LTO into the stdlib. We might also need to use -ffat-lto-objects to support non-lto builds linking against libc++.
- Clang supports a multi-stage PGO build where it is possible to train it on a custom code base. It might be worth training it on our code, but A) it only supports cmake projects, so we would need to make a simple CMakeLists.txt that can build a subset of our code, and B) we would probably want to retrain after significant changes to our flags, e.g. changing -std. Maybe just using the default hello world program for PGO is enough?
- BOLT for post-link binary optimizations. Can be coupled with (Thin)LTO and PGO.
- We should build the latest zstd and statically link to it from the toolchain so that we use the latest improvements regardless of which version of zstd is installed on the remote host.
Build options we should consider when compiling mongo code with the new toolchain to improve our build performance (latter two already landed with v4 in ~~SERVER-97735~~):
- Use zstd for debug compression in the assembler and possibly also the linker
- ~~Re-enable split dwarf, and ideally stop using -gdwarf-64 when using split dwarf.~~
- ~~See if we can use -Wl,--gdb-index to have lld build the gdb index rather than having gdb do it.~~
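
For the 4-way compiler/stdlib comparison discussed above, a tiny probe like the following could help confirm that each configuration really picked up the intended stdlib, and shows the std::string layout difference directly. This is only a sketch: compilerName, stdlibName, StringLayout, and probeStringLayout are hypothetical helper names. As far as I know, on typical 64-bit builds libc++ reports a 24-byte string holding 22 chars inline while libstdc++ reports 32 bytes holding 15, but treat those numbers as something to verify rather than a given.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Which compiler built this translation unit. clang also defines
// __GNUC__, so it has to be checked first.
inline const char* compilerName() {
#if defined(__clang__)
    return "clang";
#elif defined(__GNUC__)
    return "gcc";
#else
    return "other";
#endif
}

// Which standard library the headers came from. These macros are defined
// once any stdlib header (here <string>) has been included.
inline const char* stdlibName() {
#if defined(_LIBCPP_VERSION)
    return "libc++";
#elif defined(__GLIBCXX__)
    return "libstdc++";
#else
    return "other";
#endif
}

struct StringLayout {
    std::size_t objectSize;      // sizeof(std::string)
    std::size_t inlineCapacity;  // capacity of an empty string (SSO buffer)
};

inline StringLayout probeStringLayout() {
    return {sizeof(std::string), std::string().capacity()};
}

// One-line summary suitable for logging from each of the four builds.
inline std::string describeBuild() {
    const StringLayout layout = probeStringLayout();
    return std::string(compilerName()) + "+" + stdlibName() +
        " sizeof(string)=" + std::to_string(layout.objectSize) +
        " inline=" + std::to_string(layout.inlineCapacity);
}
```

With clang the stdlib is selected with -stdlib=libc++ or -stdlib=libstdc++, so running describeBuild() in each build is a cheap sanity check before trusting any perf numbers.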

I've put these together since they have some interplay and should probably be considered together. For example, clang's support of trivial_abi and the unstable libc++ ABI may be enough to push it over into being faster than gcc, even if it wouldn't be with our current codebase and build flags.
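
As a rough illustration of the trivial_abi idea, here is a minimal sketch of an intrusive-pointer-like type carrying the attribute. It is not boost::intrusive_ptr and not our code: Counted, IntrusivePtr, readValue, and the EXAMPLE_TRIVIAL_ABI macro are all hypothetical names. The macro expands to nothing on non-clang compilers since the attribute is clang-only.

```cpp
#include <cassert>
#include <utility>

// Apply the attribute only where it is supported.
#if defined(__clang__)
#define EXAMPLE_TRIVIAL_ABI [[clang::trivial_abi]]
#else
#define EXAMPLE_TRIVIAL_ABI
#endif

// Hypothetical ref-counted payload.
struct Counted {
    int refs = 0;
    int value = 0;
};

// Minimal intrusive pointer sketch. It has a non-trivial destructor and
// copy constructor, so without the attribute it is passed by address.
// With trivial_abi on clang, it can be passed and returned in a register.
class EXAMPLE_TRIVIAL_ABI IntrusivePtr {
public:
    IntrusivePtr() = default;
    explicit IntrusivePtr(Counted* p) : _p(p) {
        if (_p) ++_p->refs;
    }
    IntrusivePtr(const IntrusivePtr& o) : _p(o._p) {
        if (_p) ++_p->refs;
    }
    IntrusivePtr(IntrusivePtr&& o) noexcept
        : _p(std::exchange(o._p, nullptr)) {}
    // Copy-and-swap covers both copy and move assignment.
    IntrusivePtr& operator=(IntrusivePtr o) noexcept {
        std::swap(_p, o._p);
        return *this;
    }
    ~IntrusivePtr() {
        if (_p && --_p->refs == 0) delete _p;
    }
    Counted* get() const { return _p; }

private:
    Counted* _p = nullptr;
};

// By-value parameter: this is the call shape that benefits.
inline int readValue(IntrusivePtr p) {
    return p.get()->value;
}
```

One thing to keep in mind when auditing types for this: with trivial_abi the callee destroys by-value arguments, so the order of destruction relative to the rest of the caller's full expression can change. That is fine for a plain ref-count like this, but worth checking for types whose destructors have other side effects.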

related to
- SERVER-89271 Reevaluate debugging/hardening/fortification compiler options (Open)
- SERVER-97735 Support debug-fission/split-dwarf with bazel (Closed)
