  1. Core Server
  2. SERVER-89269

Things to explore when upgrading toolchain

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Build

      A few things to consider in the 9.0 timeframe, especially after we bump the toolchain:

      • In addition to testing clang vs gcc, we should probably test libc++ vs libstdc++. This should probably be a 4-way comparison with both compilers each tested with both stdlibs (easy with clang, but requires some flag fiddling with gcc). At the very least, I know that libc++ has a much better implementation of std::string, and we may even want to stdx:: a similar implementation if we stick to libstdc++.
      • We should strongly consider enabling LIBCXX_ABI_UNSTABLE when building llvm/libc++. It enables a lot of improvements and optimizations that can't be turned on by default because it would be an ABI break. But since we embed the stdlib in our release binaries, we are fine taking that ABI break. Of note, this enables an even better std::string and enables shared_ptr and unique_ptr to be passed in registers when passed by value (only on clang unfortunately).
      • When testing clang, we should mark some important types with [[clang::trivial_abi]], including at least Value, BSONObj, and boost::intrusive_ptr. This will both allow them to be passed and returned in registers and make vectors of them use memcpy for growth on libc++. Marking boost::intrusive_ptr will automatically take care of many types (SharedBuffer, Message, Document, etc) and is required for at least BSONObj and Future. We can either modify our vendored copy of the boost sources (easiest to POC), or we can just copy its small implementation into mongo:: and find-replace all usages.
      • Our arm64 builds should either use -mcpu=neoverse-n1 (emit code for the underlying CPU core used in graviton2 and the other cloud providers) or at least add +rcpc to our -march= flag to opt in to the RCpc (Release Consistency Processor Consistent) extension and the ldapr instruction (basically the "real" load-acquire, since ldar, which is named load-acquire, actually has sequentially consistent semantics). This will improve performance everywhere that memory_order_acquire is used, as well as for accessing thread-safe statics in functions. While not a required part of armv8.2, it appears to be extremely widely supported in actual CPUs. While our current toolchain does support these flags, they don't affect codegen until newer versions of gcc and clang.
      • Build flags to explore when building LLVM to see if they make compiles faster on our workstations:
        • LLVM_ENABLE_RPMALLOC uses a vendored copy of rpmalloc rather than the system malloc. Default on Windows; we should see if it is also faster on Linux.
        • LLVM_ENABLE_LTO: I'm surprised we aren't already doing this! In addition to using LTO for the new toolchain, we should also consider enabling it for v4. Might be worth seeing if Full is any faster than Thin. While it is slower to build, if it makes clang even a few % faster, that seems like a clearly valuable tradeoff, at least once the toolchain is closer to a stable point so you aren't iterating on it frequently.
        • LLVM_ENABLE_MODULES Makes the compile of llvm itself a bit faster, so useful when iterating on toolchain. Shouldn't impact the end result.
        • LLVM_STATIC_LINK_CXX_STDLIB and LLVM_ENABLE_LIBCXX should couple nicely with LIBCXX_ABI_UNSTABLE to use faster stdlib implementations, and possibly enable LTO into the stdlib.
        • Clang supports a multi-stage PGO build where it is possible to train it on a custom code base. It might be worth training it on our code, but A) it only supports cmake projects, so we would need to make a simple CMakeLists.txt that can build a subset of our code, and B) we would probably want to retrain after significant changes to our flags, e.g. changing -std. Maybe just using the default hello world program for PGO is enough?
        • BOLT for post-link binary optimizations. Can be coupled with (Thin)LTO and PGO.
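
      To make the arm64 item above concrete, here is a minimal sketch of the two code patterns it mentions: an explicit memory_order_acquire load and a thread-safe function-local static. The names Config, gConfig, publish, current, and defaultConfig are hypothetical and not from our codebase. The comments describe the codegen the ticket expects (ldar without RCpc, ldapr with +rcpc on a new enough compiler); that is an assumption to confirm by inspecting the generated assembly, not something this snippet proves.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical payload published from one thread and read by others.
struct Config {
    int maxConns;
};

// Hypothetical global slot holding the currently published config.
std::atomic<const Config*> gConfig{nullptr};

// Writer side: release store makes the pointed-to data visible to acquirers.
void publish(const Config* c) {
    assert(c != nullptr);
    gConfig.store(c, std::memory_order_release);
}

// Reader side: an acquire load. On arm64 this is the kind of load that is
// expected to use ldar without RCpc and ldapr when +rcpc is enabled
// (assumption, compiler-version dependent).
const Config* current() {
    return gConfig.load(std::memory_order_acquire);
}

// Thread-safe function-local static: the "already initialized?" guard check
// on the fast path is the other acquire-style access the ticket refers to.
const Config& defaultConfig() {
    static const Config cfg{100};
    return cfg;
}
```

      Compiling a file like this for arm64 with and without +rcpc in -march= and comparing the loads in current() and defaultConfig() would be one quick way to check whether a given toolchain version actually changes codegen.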

      I've put these together since they have some interplay and should probably be considered together. For example, clang's support of trivial_abi and the unstable libc++ ABI may be enough to push it over into being faster than gcc, even if it wouldn't be with our current codebase and build flags.
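
      As a rough illustration of the trivial_abi idea, the sketch below annotates a small intrusive pointer. It is not our code and not boost's implementation: SKETCH_TRIVIAL_ABI, Counted, IntrusivePtr, and readValue are hypothetical names, and the macro expands to nothing on non-clang compilers so the same source still builds with gcc.

```cpp
#include <cassert>
#include <utility>

// Hypothetical macro: apply the attribute only when compiling with clang.
#if defined(__clang__)
#define SKETCH_TRIVIAL_ABI [[clang::trivial_abi]]
#else
#define SKETCH_TRIVIAL_ABI
#endif

// Hypothetical ref-counted payload (non-atomic count, for illustration only).
struct Counted {
    int refs = 0;
    int value = 0;
};

// Minimal intrusive pointer, loosely modeled on the shape of
// boost::intrusive_ptr. With the attribute, clang may pass and return it
// in registers even though it has a non-trivial destructor.
class SKETCH_TRIVIAL_ABI IntrusivePtr {
public:
    IntrusivePtr() = default;
    explicit IntrusivePtr(Counted* p) : _p(p) {
        if (_p) ++_p->refs;
    }
    IntrusivePtr(const IntrusivePtr& o) : _p(o._p) {
        if (_p) ++_p->refs;
    }
    IntrusivePtr(IntrusivePtr&& o) noexcept : _p(std::exchange(o._p, nullptr)) {}
    IntrusivePtr& operator=(IntrusivePtr o) noexcept {
        std::swap(_p, o._p);  // copy-and-swap; old pointee released by o's destructor
        return *this;
    }
    ~IntrusivePtr() {
        if (_p && --_p->refs == 0) delete _p;
    }
    Counted* get() const {
        return _p;
    }

private:
    Counted* _p = nullptr;
};

// By-value parameter: this is the call shape that trivial_abi targets.
int readValue(IntrusivePtr p) {
    return p.get() ? p.get()->value : -1;
}
```

      One thing to watch when experimenting: with trivial_abi the callee destroys by-value parameters, so destruction order can differ from the non-annotated type. Behavior like the reference counts above should be unaffected, but it is worth checking any code that depends on destructor ordering.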

            Assignee:
            zack.winter@mongodb.com Zack Winter
            Reporter:
            mathias@mongodb.com Mathias Stearn
            Votes:
            0
            Watchers:
            4
