# Writes Hurt: Lessons in Cache Design for Optane NVRAM

Alexandra Fedorova<sup>1,2</sup>, Keith Smith<sup>1</sup>, Keith Bostic<sup>1</sup>, Alexander Gorrod<sup>1</sup>, Sue LoVerso<sup>1</sup>, and Michael Cahill<sup>1</sup>

<sup>1</sup>MongoDB <sup>2</sup>University of British Columbia

## **Abstract**

Intel® Optane<sup>TM</sup> DC Persistent Memory resides on the memory bus and approaches DRAM in access latency. One avenue for its adoption is to employ it in place of persistent storage; another is to use it as a cheaper and denser extension of DRAM. In pursuit of the latter goal, we present the design of a volatile Optane NVRAM cache as a component in a storage engine underlying a widely used commercial database. The primary innovation in our design is a new cache admission policy. We discover that on Optane NVRAM, known for its limited write throughput, the presence of writes disproportionately affects the throughput of reads, much more so than on DRAM. Therefore, an admission policy that indiscriminately admits new data (and thus generates writes), severely limits the rate of data retrieval and results in exceedingly poor performance for the cache overall. We design an admission policy that balances the rate of admission with the rate of lookups using dynamically observed characteristics of the workload. Our implementation outperforms OpenCAS (an off-the-shelf Optane-based block cache) in all cases, and Intel Memory Mode in cases where the database size exceeds the available NVRAM. Our cache is decoupled from the rest of the storage engine and uses generic metrics to guide its admission policy; therefore our design can be easily adopted in other systems.

# 1 Introduction

Intel® Optane<sup>TM</sup> DC Persistent Memory is one of the first widely available non-volatile memory (NVRAM) products, released two years prior to the time of this writing. At present the community is still grappling with the question of how to best use it in the storage stack. Although one way of adoption exploits its persistence (e.g., using it in place of another block storage device or turning applications' volatile memory into persistent), another avenue is to use it as a volatile extension to DRAM, a denser and cheaper one at that. Our study explores the second option.

We design and implement *NVCache*: an Optane NVRAM-resident volatile cache for *AnonStorageEngine* [2] – the stor-

age engine underlying a widely used commercial database [1]. At the heart of any cache is an admission policy. An admission policy decides, upon a cache miss, whether the missing block should be *admitted*, i.e., kept in the cache after being retrieved from a lower level of storage. With few exceptions, caches indiscriminately admit data on read misses, differing only in whether they admit it on write misses. We found that such a simplistic policy decreases the throughput of writeheavy workloads up to 80% and read-only workloads by about 20%. Admitting new data into a cache generates writes - as every newly inserted cache block must be written into the cache memory – and limited write throughput is a well known property of Optane NVRAM [36]. What was not previously known was that writes to Optane NVRAM disproportionately affect the throughput of concurrent reads. While writes affect concurrent reads on any storage device, our measurements show that this effect is much larger on Optane NVRAM than on its counterpart DRAM (see §2). An overly eager admission rate will thus limit the rate at which existing data can be retrieved, diminishing the utility of the cache. Admission policy must, therefore, balance between the rate of admitting new data and the rate of accessing existing data. Our main contribution is a new admission policy that embodies this principle.

Although our work is a case study exploring a specific point in a vast design space, our findings apply broadly to similar systems. NVCache is decoupled from the rest of the storage engine and our new admission policy relies only on the rates of data admission, removal and lookup for its decisions, so our design is easy to adopt in other storage engines or standalone caches. While our work addresses the idiosyncrasy of one specific storage technology, we hypothesize that the admission policy we propose will be relevant for any caching device where writes disproportionately impact reads.

The rest of the paper is organized as follows: §2 demonstrates that writes disproportionately affect the throughput of reads on Optane NVRAM. That section also puts our work in the broader context of multi-tier caching systems, and provides relevant background on *AnonStorageEngine*.

§3 presents the basics of NVCache design, which relies on well-known methods, and then unveils the design of the new admission policy, backing its features with experimental data. §4 compares NVCache with off-the-shelf alternatives: Intel Memory Mode [5] and OpenCAS [7], and reports the effect on performance-per-\$ of replacing part of system DRAM with Optane NVRAM. §5 describes related work and §6 summarizes our findings.

# 2 Background and Motivation

# 2.1 Optane memory's Achilles' heel

Optane NVRAM has a superpower: read and write latency for small operations compete with DRAM, reads being only about  $2\times$  slower and writes being roughly the same latency as DRAM<sup>1</sup> (see [36], Fig.2). Read throughput is impressive: sequential reads reach 6GB/s per NVDIMM (see [36], Fig.4(a)), and with a single CPU supporting up to six NVDIMMs, the throughput can climb into double digits.

Optane also has an Achilles' heel: write throughput is sluggish and gets worse with many threads. Figure 1 shows sequential write throughput to Optane NVDIMMs (with one and two DIMMs) and to an Optane SSD P4800X (built with the same memory technology but packaged as an SSD). Writes to Optane memory are barely competitive with the SSD using one thread, but show negative scaling as we use more threads<sup>2</sup>.

This phenomenon is not new and was reported by others (see [36], Fig.4(b)). What was *not* previously shown, and is even more crucial for cache design, is that sluggish writes disproportionately affect the throughput of *reads*. Figure 2 shows the read throughput on Optane NVRAM dropping precipitously in the presence of concurrent writers. Only a single concurrent writer causes read throughput to drop from a solid 12GB/s to a unimpressive 3.4 GB/s (a 72% loss). With eight writer threads, reads proceed at only 0.8 GB/s (a 93% slowdown)<sup>3</sup>. The same experiment on DRAM produces a milder degradation in read throughput, with a loss of only 18% with one concurrent reader and of 35% with eight.

The implication of this finding for cache design on Optane NVRAM is that an admission policy that eagerly accepts new data (and thus generates writes) will disproportionately affect the speed of reads, i.e., cache lookups, severely limiting the



Figure 1: Sequential write throughput to Optane persistent memory using one or two NVDIMMs, and to Optane SSD. Parameters of the experimental system are described in §3.2.1.



Figure 2: Read throughput for Optane NVRAM (two NVDIMMs) and DRAM in with 8 reader threads and with increasing concurrent writers. The read throughput on Optane NVRAM is disproportionately affected. Parameters of the experimental system are described in §3.2.1.

effectiveness of the entire system. An admission policy, must therefore carefully balance the rate of cache admission relative to the rate of lookups.

# 2.2 Multi-tier caching systems

We contribute a new design of a single-tier volatile cache in Optane NVRAM; since this cache co-exists with the DRAM cache in our storage engine (as we explain in §3), it is helpful to discuss it in the broader context of multi-tier caches and tiered memory systems. Here we provide a broad overview of these areas, deferring the comparison with specific projects until §5.

A multi-tier caching system is comprised of multiple storage devices organized as a hierarchy or a pool of caches [13, 15–19,22,23,25,27,28,34,37,38]. Tiers might include DRAM and NVRAM in front of an SSD (as in our system), a SSD in front of an HDD, or any other combination thereof, but with

<sup>&</sup>lt;sup>1</sup>Writes into NVRAM need only to reach the processor's ADR (Asynchronous DRAM refresh domain).

<sup>&</sup>lt;sup>2</sup>Our data is for non-interleaved writes. Interleaved writes will achieve higher throughput (and also negative scaling with more threads, see [36], Fig.4(c)), but interleaving can only be used on NVDIMMs in the same NUMA node, which was not the case on our system configured according to manufacturer recommendations ([6], Table 17). NVRAM access was done via memopy from a mmaped file residing on a DAX file system in NVRAM. This was the fastest method and it produced similar results as the fastest methods discovered by others [36].

<sup>&</sup>lt;sup>3</sup>Our system has 16 cores, so CPU contention is not the issue.

faster, more expensive storage generally in front of slower, less expensive storage. Studies of these systems investigate how to divide the data between the tiers to maximize performance. Broadly speaking, there are two design approaches: cooperative and independent. In a cooperative design the tiers are tightly coupled: one tier may evict data into another, and may inform it about the access patterns observed within its space. In an independent design each tier makes its own decisions about what data to admit and evict. There is also a middle ground, where one tier may take hints about data access characteristics from other tiers, but does not directly accept data or directives about what to cache. Independent caches are easier to design and maintain from software engineering perspective, because they are less coupled with the rest of the system, and for this reason they are easier to port to other systems. Our design falls into the independent category, as we explain in §3.

Multi-tier memory systems can be thought of as a sub-category of multi-tier caches, where one tier is DRAM and another is NVRAM or some other kind of slower memory [11, 12, 21, 24, 29, 31–33, 35]. These systems are typically implemented in the kernel or in a language runtime [12,31,32] and are transparent to applications. The main challenge in building them is deciding which pages must go to the "fast" tier and which ones to the "slow" tier – the same problem that must be addressed in cooperative caches.

Like all caches, multi-tier systems innovate on admission and eviction policies. An admission policy tells the cache when to insert new data; an eviction policy tells it which data to evict when the space becomes scarce. Typical caches always admit data on reads and vary as to whether they admit data on writes: i.e., write-allocate or not. Multi-tier caches may also admit data as it is evicted from another tier. While most caches tune their admission algorithms to maximize the hit rate, our algorithm takes into account the rate of admission for reasons explained in §2.1. So our main contribution is the admission policy that is based on a fundamentally new principle. We believe that our innovation in admission policies will be relevant for any cache storage medium where the presence of writes disproportionately affects the throughput of reads.

# 2.3 AnonStorageEngine

AnonStorageEngine is a persistent transactional key-value store [2]. Internally it uses a B+-tree to organize the data. AnonStorageEngine materializes data in memory (in its DRAM cache) in a different format than it is stored on disk. Data on disk contains efficiently encoded keys and values. The keys in each block are sorted, but not indexed. When AnonStorageEngine reads a block from disk it decodes and indexes it, so that the data can be searched and updated efficiently. Furthermore, on-disk data may be optionally compressed and/or encrypted, and AnonStorageEngine decom-

presses and decrypts it before placing it in DRAM.

The main advantages of this two-pronged approach to data representation is that it provides efficient space utilization for stored data and fast operations for cached data. It is also the reason we adopted the independent design for our NVRAM cache, as we explain in §3.1.

# 3 NVCache: a step-by-step design

We first describe the baseline architecture of *NVCache*, which builds upon well-known techniques. Then we describe the evolution of the new admission policy design, beginning with a naïve architecture and presenting experiments that motivate the next feature.

# 3.1 NVCache basics

As explained in §2.3 AnonStorageEngine uses different formats for data stored persistently on disk and for data materialized in memory. On-disk data is stored in blocks. In-memory data, which lives inside the engine's fixed-sized DRAM cache, is stored in pages. Blocks contain efficiently encoded keys and values. Pages additionally contain indexing and other structures to facilitate fast operations.

NVCache sits underneath the DRAM cache. Naturally we had to make a decision whether to use NVCache for caching pages, blocks or both. *AnonStorageEngine* already has a DRAM cache for pages, so caching pages would amount to extending the existing cache to use both DRAM and NVRAM – a tiered cache similar to the recent one in Facebook's RocksDB [25]. Caching blocks would entail creating a stand-alone block cache that sits between the DRAM cache and the block device. We decided to cache blocks, and not pages, for the following reasons.

AnonStorageEngine's pages are organized in memory as a B+-tree for efficient searching and updating, and pages contain pointers to other pages. If a page were to be manually copied (at application level) from DRAM to NVRAM in a tiered cache, the virtual addresses would change and any pointers would have to be updated accordingly. Updating them is an error-prone process that would require locking or other form of synchronization. AnonStorageEngine is lockfree on the read-path and mostly lock-free on the write path: adding synchronization would substantially compromise a core advantage of its original design.

An alternative to implementing a tiered cache manually would be to use transparent tiered memory implemented in the kernel, such as Nimble [35] or HeMem [29], or to build on top of CacheLib: Facebook's library for building caches that provides support for tiered memory [14]. Kernel-based systems would require adopting an experimental kernel, which was not an option in a production deployment. CacheLib source became open on September 2, 2021 [3]; building upon

it is one alternative we may consider in the future, but according to the authors, CacheLib is not the best option for building a database's internal page cache, and so it could not be used as the substrate for RocksDB's page cache (see [14], Section 6 and discussion in §5). Thus, for our current design we decided to use a stand-alone block cache, as it avoids the aforementioned problems, is simple to integrate in the existing storage engine and can be easily ported to other key-value stores.

NVCache sits next to the *block manager* – the code responsible for reading/writing the data from/to disk (see Fig. 3). **Read path:** If the DRAM cache cannot locate searched-for data, it issues a read to the block manager 1. The block manager checks if the block is present in NVCache 2, accessing it from NVCache if it is 3 and reading it from disk if it is not 4. It then transforms the block into a page, decrypting and decompressing it if needed, and hands it over to the DRAM cache 5. If the block is not present in NVCache, NVCache has the discretion to admit it after the block manager has read it from disk 6. NVCache stores the blocks in the same format as they are stored on disk: compressed/encrypted if those configuration options were chosen. This is a feature, as storing compressed blocks increases NVRAM effective capacity.

Write path: The write path is not symmetrical to the read path, because *AnonStorageEngine* does not modify disk blocks in place. Updates are written into in-memory specific data structures, and then formatted into blocks and written back to disk during a process called *reconciliation*. Reconciliation may occur when the DRAM cache evicts pages or as part of a database checkpoint. Reconciliation always writes a new page 7, which the block manager turns into a new block. When the block manager writes a new block 8, it notifies NVCache 9; NVCache has the discretion to admit it. Obsolete blocks are eventually freed, at which time the block manager instructs NVCache to invalidate cached copies of the freed blocks 10.

Within a broader context of multi-tier caching systems, NVCache adopts an independent design (see §2). This is a natural consequence of our decision to cache blocks, as opposed to pages. The kernel buffer cache also caches blocks, so there is an opportunity for a cooperative design integrated with the kernel: we did not pursue this avenue, because adopting a custom kernel would not be practical in customer deployments. There are off-the-shelf NVRAM caching solutions implemented in the kernel: *device mapper write cache* [4] and OpenCAS [7]. We describe them, evaluate OpenCAS (the more advanced of the two) and present the results in §4.

We experimented in the middle ground between an independent and a co-operative design, where the DRAM cache informs the NVCache on evicting a clean page (so the NVCache

could bump its priority) or informs NVCache about the reason for writing a dirty block (e.g., because of eviction or a checkpoint). Using this information did not improve NVCache effectiveness, and keeping track of it introduced overhead, so we retained a purely independent design. As a result, NVCache communicates with the block manager via a narrow API, allowing its codebase to evolve independently of the rest of the system.

Internally, NVCache is organized as a hash table with a fixed number of buckets. Upon collision, blocks mapping to the same bucket are chained in a linked list. A bucket is protected with a spinlock, but our measurements showed that the rate of collisions and the synchronization overhead were negligible (with 32K buckets for a 180GB NVCache). We use PMDK's [9] allocator (based on jemalloc) to allocate NVRAM on admitting new blocks. NVCache metadata is in DRAM, but PMDK's jemalloc metadata is in the NVRAM. NVCache does not use NVRAM's persistent nature: upon exit it loses cached data. This decision simplified our design substantially, as we do not need to deal with crash consistency. The downside is that we pay the cost of re-warming the cache upon restart, and so we may revise our design in the future.

When NVCache runs out of space it cannot admit new blocks. *Eviction* is needed to purge blocks less likely to be used in order to make space for new ones. We use a simple LFRU eviction policy [26]. During eviction it targets blocks that were not reused within a fixed time window and evicts the least frequently used among those. Tracking of the LRU and LFU blocks is approximated so that there is no need to maintain lists. There is an eviction thread that wakes up once a second and scans the cache for eviction candidates.

# 3.2 NVCache Admission Policy Design

The NVCache admission policy is rooted in experimental data; we therefore present the details of our experimental system and the workloads prior to exploring its design.

#### 3.2.1 Experimental system

**System:** Our system is a Lenovo ThinkSystem SR360 built with two Intel Xeon Gold 5218 processors, each having 16 hyper-threaded cores.

Memory: There are two Optane NVRAM modules, 126GB each, for a total capacity of 252GB. The modules are placed in separate sockets as per manufacturer recommendation. There is 196GB of DRAM; we modulate the amount available for experiments either via software (by creating a large file in ramfs) or hardware (by physically removing DRAM) in cases where the experiments demand this. We used workloads with a variety of database sizes to study conditions when the working set fits into NVRAM and when it exceeds its capacity. With a total 252GB physical capacity we are able to configure NVCache to hold at most 180GB of



Figure 3: Interaction of NVCache with the rest of the storage engine.

data. The metadata overhead of NVCache structures is kept in DRAM (and in any case it is small – a couple of gigabytes), but the PMDK metadata, kept in the NVRAM, takes a substantial amount of space; 180GB NVCache size was the largest that we could use without experiencing out-of-memory errors from the PMDK's allocator.

**Disk:** We use Intel Optane P4800X SSD, built with the same physical media as NVRAM DIMMs, but packaged as an SSD on the PCIe bus. This SSD provides up to 2.5GB/s sequential read bandwidth and up to 2.2GB/s sequential write bandwidth.

#### 3.2.2 Workloads

While for the final evaluation (§4) we used the widely adopted YCSB [10, 20], during the *design* process we used our inhouse benchmarks. The in-house benchmarks are configuration files for a *AnonStorageEngine*-provided workload generator application, specifying parameters such as the number of records in the database, the sizes and distributions of keys and values, the mix of operations (read, update, insert, modify, scan), the number of threads, whether or not logging and transactions are enabled, the size of the DRAM cache, the total running time, etc. The benchmarks are designed to either emulate customer workloads or to stress a particular feature (e.g., checkpoints, eviction). When presenting the throughput for a benchmark we break it down by operation type: for example, if the benchmark *bm* performs a mix of reads and updates, we would report the throughput as *bm.read* and *bm.update*.

The workloads fall into two categories: (1) those that do not stand to benefit from NVCache (e.g., they use small data sets fitting entirely in DRAM, and/or they perform mostly writes) and (2) those that do (large data sets, read-dominant). We initially focus on benchmarks in the first category, in particular those with small data sets. The database pages are cached in the engine's DRAM cache, and its blocks – in the

kernel buffer cache as they are read from disk<sup>4</sup>. So even if the DRAM cache is configured to be much smaller than the dataset size, the OS buffer cache would comfortably fit blocks of small workloads. Since NVRAM caching cannot benefit these workloads, they make for an easy demonstration of the implementation overhead and are excellent workloads for exploring how to minimize it.

#### 3.2.3 Lessons learned

Our design rests on the three lessons that we learned in the process: (1) Bypass NVRAM for small workloads, (2) Throttle the admission rate, and (3) NVRAM cache benefit is limited to read-dominant workloads. Lesson #2 embodies our main contribution; the others, while less novel, were also crucial for building a well-performing cache.

**Lesson #1: Bypass NVRAM for small datasets** Our first and the most simple admission policy, *alloc-read-write*, was always admitting a block to the NVCache when it is read from or written to disk by the block manager. Figure 4 shows the performance *degradation* of running with 16GB DRAM and 180GB NVCache<sup>5</sup> for small-sized benchmarks fitting into DRAM that will not benefit from any additional caching. (Eviction is disabled during these experiments to tease apart the sources of overhead, but we re-introduce it at the end of this section.) We observe that performance penalty under this policy is substantial across the board, reaching 91% for *evict-btree-stress-multi*.

The key observation we made when analysing the causes of the overhead is that it is useless to cache data for small benchmarks that comfortably fit into DRAM – the engine's cache or the OS buffer cache. So our first lesson is to *bypass NVCache for datasets fitting into DRAM*. We call this feature *small-bypass*, and implement it by having the NVCache

<sup>&</sup>lt;sup>4</sup>§3.1 explains the difference between blocks and pages.

<sup>&</sup>lt;sup>5</sup>We ran with larger DRAM sizes too, but reached the same conclusions.



Figure 4: Throughput degradation for workloads that do not stand to benefit from NVRAM caching. Lower numbers are better. Eviction is disabled during these experiments to simplify the analysis of the overhead.



Figure 5: Cached blocks that were outdated and freed. Data corresponds to the experiment in Fig. 4. These are aggregate data for the entire workload, so we do not show the breakdown by operation type.

monitor the aggregate size of all database files used by the workload and abstain from admitting any blocks until the dataset size outgrows the available DRAM. The bar labelled *small-bypass* in Fig. 4 shows the overhead being significantly reduced by this feature.

Small-bypass, in a way, approximates cooperation with the OS buffer cache. NVCache cannot know which blocks the buffer cache holds, but it roughly approximates this information by juxtaposing the workload's data size and the amount of DRAM.

**Lesson #2: Throttle the admission rate** The *small-bypass* feature all but eliminated the overhead for some workloads, but made only a small improvement for others. To show why, Figure 5 presents the number of blocks removed from NVCache because they were outdated and freed by the block manager as a percent of all admitted blocks. We observe that

the benchmarks whose overhead is still substantial after the introduction of *small-bypass* are those overwrite many existing blocks.

When an application generates new data, either by inserting new key-value pairs or updating the old ones, the block manager generates new data blocks. The blocks containing old invalid data are eventually freed by the block manager and are removed from NVCache. Removing a block from NVCache involves freeing its associated memory in NVRAM, and since the PMDK allocator keeps its metadata in NVRAM, freeing a block generates writes into NVRAM. Moreover, removing old blocks creates space for new blocks, and NVCache eagerly admits data in the freed space. That also generates writes. As we showed in §2 writes disproportionately affect the throughput of reads, i.e., of cache lookups.

One could simply disable the cache for write-intensive workloads, but even read-dominant workloads will suffer from the interference of writes if overly eager eviction makes it possible to admit blocks at a high rate. Admitting new blocks generates writes, and writes will interfere with reads.

Consider data in Table 1 for the three read-dominant workloads from Table 2 (this table contains workloads with large working sets, for which caching may be beneficial). Table 1 shows data for experiments with eviction configured to eagerly evict unused blocks and for experiments configured to run without any eviction at all. When the cache is full and new blocks cannot be admitted, eager eviction frees up the space. While admitting recently referenced blocks in favour of those that were less recently accessed should improve the hit rate, this also generates more writes into NVRAM, which may diminish the rate at which we can read cached blocks. Indeed, we see from Table 1 that even though the cache hit rate with eager eviction is higher (as expected), the overall throughput is substantially lower than without any eviction. That is because the amount of cache writes produced with eviction is substantially higher than without it, and the writes slow down the reads.

|           | Eager eviction |          | No eviction |          |
|-----------|----------------|----------|-------------|----------|
| WL        | ops/sec        | hit rate | ops/sec     | hit rate |
| evict-    | 61,699         | 48%      | 162,690     | 44%      |
| btree-    |                |          |             |          |
| large     |                |          |             |          |
| evict-    | 97,491         | 45%      | 134,404     | 36%      |
| btree-    |                |          |             |          |
| scan.read |                |          |             |          |
| medium-   | 62,012         | 48%      | 164,644     | 44%      |
| btree-    |                |          |             |          |
| large     |                |          |             |          |

Table 1: Throughput of read-dominant workloads suffers substantially with aggressive eviction despite it producing a higher cache hit rate. Aggressive eviction generates many writes that hurt the throughput of cache lookups (reads).

The question we then ask is: how to balance the rate of block admission and removal, which generate writes, with the rate of cache lookups, which produce reads? To address it, we introduce the overhead bypass ratio (OBP):

$$OBP = \frac{blocks\_inserted + blocks\_removed}{blocks\_looked\_up}$$

Intuitively, the quantity in the numerator captures the cost of using the cache: the write-generating insertions and removals. The quantity in the denominator captures the benefit: cache lookups. OBP thus expresses the balance between the cost and benefit of using the cache; we experimentally determined that a target ratio of 10% works best, but settings between 5% and 30% were also acceptable. If OBP were to be ported and tuned for different hardware, the thresholds would be adjusted according to the degree to which concurrent writes affect the reads. E.g., on hardware where writes have a smaller effect on the performance of reads, acceptable OBP thresholds would be higher.

NVCache continuously updates OBP and abstains from admitting or evicting cache blocks if OBP exceeds its target (10%). The OBP metric proved remarkably stable across workloads and cache sizes. We also found OBP to work better than a simple no-write-allocate policy or OBP used in conjunction with the no-write-allocate policy. The *small-bypass+OBP* bar in Figure 4 shows that *small-bypass* and OBP completely eliminate the overhead for the benchmarks that do not stand to benefit from caching.

**Lesson #3: Only read-dominant workloads benefit** While the previous sections focused on the overhead and thus experimented with small-sized workloads that do not stand to benefit from NVCache, here we switch to using large-sized workloads, which teach us the third lesson: NVRAM cache benefits only read-dominant workloads. Prior study of



Figure 6: Workloads with large datasets. 32GB DRAM and 180GB NVCache.

a custom NVRAM cache for Facebook's RocksDB came to a similar conclusion [25].

Table 2 shows the large-sized workloads and their characteristics. The rate of operations marked with an asterisk (e.g., insert, update for evict-btree-scan) is kept constant by the workload generator, and so we do not report their throughput, because it is largely insensitive to the system configuration. The data size reported in the second column is the on-disk size of the database reported at the end of the run. The intermediate database size may be much larger at points when many new blocks were written to disk, but the outdated ones were not yet freed. Column six reports the total amount of data written to SSD during the run. This amount is non-zero even for read-only workloads, because it includes the data written to populate the database prior to the measured benchmark run. Although NVCache is enabled during the populate phase, it hardly admits any blocks, because OBP throttles the admission rate during this write-only phase. So when the measured run begins, NVCache is empty; it warms up during the measured run. All benchmarks run for 60 minutes, with the exception of 500m-btree-50r50u, which runs for 120.

Figure 6 presents the throughput of large workloads with 32GB DRAM and 180GB of NVCache. (Data with other memory sizes leads to similar conclusions, so we omit it.)

Read-intensive workloads benefit from NVCache substantially, running over  $3 \times$  faster with the cache than without it (e.g., *evict-btree-scan,read*). But even a small proportion of writes substantially limits performance potential: *evict-btree-stress-multi* performs 20% of write operations, but the performance boost it gets from NVCache is only 12%.

Write-intensive workloads do not benefit from NVCache, nor would they benefit from any sort of block caching, because the churn that they generate, continuously adding and removing blocks, makes most of the cache content outdated. Table 2 shows the NVCache hit ratio and the fraction of removed blocks relative to those inserted. The data tells us two things: (1) workloads that don't benefit from the cache have a very low hit ratio, (2) the low hit ratio is likely because they

| Workload                         | Op mix<br>(threads),<br>data size                      | DRAM cache<br>size | NVCache hit<br>ratio | Removed /<br>inserted ratio | Amount of<br>data written<br>to SSD | Amount of data admitted to cache |
|----------------------------------|--------------------------------------------------------|--------------------|----------------------|-----------------------------|-------------------------------------|----------------------------------|
| 500m-btree-<br>50r50u            | 50% read, 50%<br>update (20),<br>163GB                 | 28GB               | 6%                   | 98%                         | 2190GB                              | 191GB                            |
| chkpt-stress-<br>lg              | 100% update (6), 134GB                                 | 28GB               | 2%                   | 94%                         | 780GB                               | 36GB                             |
| evict-bt-<br>stress-multi-<br>lg | 80% read, 20%<br>update (100),<br>250GB                | 1GB                | 20%                  | 94%                         | 1740GB                              | 424GB                            |
| evict-btree-<br>large            | <b>100% read</b> (16), 120GB                           | 28GB               | 97%                  | 0%                          | 120GB                               | 115GB                            |
| evict-btree-<br>scan             | 95% read, 4%<br>insert*, 1%<br>update* (430),<br>250GB | 28GB               | 97%                  | 47%                         | 400GB                               | 300GB                            |
| medium-<br>btree-large           | <b>100% read</b> (16), 120GB                           | 28GB               | 97%                  | 0%                          | 120GB                               | 115GB                            |
| overflow-<br>130k-lg             | 50% read, 50%<br>update (20),<br>253GB                 | 21GB               | 6%                   | 95%                         | 2000GB                              | 127GB                            |
| update-chkpt-<br>btree-lg        | 90% insert, 5%<br>read, 5%<br>update (5),<br>185GB     | 25GB               | 6%                   | 95%                         | 1720 GB                             | 137GB                            |
| update-delta-<br>mix1-lg         | 100% updates (6), 125GB                                | 20GB               | 2%                   | 98%                         | 2000GB                              | 93GB                             |
| update-grow-<br>stress-lg        | 96% update,<br>4% inserts*<br>(5), 190GB               | 20GB               | 2%                   | 97%                         | 2100GB                              | 98GB                             |

Table 2: Properties of 'large' workloads.

remove most of the blocks they insert. They write terabytes of data throughout the run (Column 6), even though their database size at the end of the run is no larger than a couple hundred gigabytes (Column 2), overwriting most of the data that they generate.

These data suggest that admitting zero blocks for write-dominant workloads would be the most practical strategy, but since the degree of write-intensity is not always known *a priori*, we rely on the OBP feature to limit the damage. As Figure 6 shows, OBP effectively prevents performance overhead for write-dominant workloads, and columns 6 and 7 of Table 2 show that OBP filters the majority of the write traffic to NVRAM.

As we explained earlier, *AnonStorageEngine* does not update existing blocks in place, so a write-dominant workload will most certainly invalidate old blocks. A storage engine that does update data in-place may be less sensitive to the phenomenon described in this section. However, given a lim-

ited write throughput of Optane NVRAM and given that the RocksDB study [25] reached a similar conclusion, we expect the lesson learned here to be broadly applicable.

### 3.2.4 Summary

We presented three lessons in design of caches residing in Optane NVRAM:

- 1. Detect workloads that fit into the OS buffer cache and do not admit their blocks.
- 2. Admitting blocks into Optane NVRAM produces writes, which slow down the reads, i.e., cache lookups. The admission policy must balance the cost of admitting data into the cache against the benefit of using it later.
- 3. Optane NVRAM caches benefit read-dominant workloads. For write-dominant workloads, the admission policy must minimize the number of admitted blocks.

Our admission policy uses the *small-bypass* feature to embody the first lesson, and the OBP feature to embody the second and third.

## 4 Evaluation

We evaluate NVCache using the YCSB benchmarks [10, 20]. We tuned the algorithms and parameters of the NVCache using only our in-house benchmarks (a "training set", to use an analogy from statistical modeling), and performed no additional tuning during this final evaluation phase, using YCSB as the "test set". We ran experiments on the system described in §3.2.1, varying the amount of DRAM and NVRAM. Parameters of the YCSB benchmarks are shown in Table 3<sup>6</sup>. The DRAM cache size was set to half of the available DRAM<sup>7</sup>, but capped at 40GB.

| Workload | Op mix, threads          | Dataset |
|----------|--------------------------|---------|
| YCSB-A   | 50% read, 50% update, 20 | 130GB   |
| YCSB-B   | 50% read, 50% update, 20 | 194GB   |
| YCSB-C   | 100% read, 20            | 259GB   |
| YCSB-D   | 95% read, 5% insert, 100 | 219GB   |
| YCSB-E   | 95% scan, 5% insert, 20  | 210GB   |

Table 3: YCSB characteristics

Our evaluation asks two questions:

- 1. How does NVCache compare to off-the-shelf solutions pursuing similar goals?
- 2. What is the effect of using an NVRAM cache on performance-per-\$?

# 4.1 Comparison with off-the-shelf solutions

#### 4.1.1 Baselines used for comparison

We compare with two solutions that permit using NVRAM as an extensions of DRAM, available in off-the shelf Optane systems: *Intel Memory Mode* (MM) [5] and *Intel Open Cache Acceleration Software* (OpenCAS) [5]. For potential future deployment of NVRAM in the field, it was important for us that these alternatives were available in standard Linux servers and did not require custom unsupported kernels.

Intel Memory Mode is a hardware configuration that presents Optane NVRAM to the rest of the system as regular volatile memory, and uses DRAM transparently as its cache, with data transferred between the two in units of cache

lines. This is an attractive alternative, because it permits using NVRAM as an extension to DRAM without requiring any code changes, and makes it available for all data structures, in user space and kernel alike. In contrast, NVCache makes NVRAM available only for caching database file blocks.

Memory Mode can be enabled only in specific hardware configurations ( [6], Table 17). We were able to successfully configure MM such that each NVDIMM was "paired" with a DRAM DIMM, meaning that it must be placed in the unused slot of the same channel of the same iMC (integrated memory controller) as the NVDIMM. Using additional DRAM DIMMs that were not paired with NVDIMMs produced configuration errors on our system, so we could only use the configuration with two NVDIMMs and two DRAM DIMMs. Our DRAM DIMMs were 16GB in size, so that restricted us to a configuration with 32GB of DRAM. Fortunately, MM could be configured to use all or part of the NVRAM, so we were able to vary the amount of NVRAM in the experiments.

In MM, the amount of total system memory is reported to be the same as the size of the NVRAM dedicated to MM. Since the *AnonStorageEngine*'s DRAM cache is always configured to be half the size of the *physical* DRAM (see the beginning of §4) for equitable comparisons other systems, the kernel buffer cache will dynamically expand to use more plentiful system memory as the NVRAM size grows. So in essence our configuration with MM uses NVRAM in the same way as NVCache does (for caching file blocks), but via an off-the-shelf hardware solution and without any code changes. An alternative would be to increase the size of the engine's DRAM cache as NVRAM grows; exploring this option in-depth was difficult due to space constraints, and thus was deemed outside the scope of the current work.

OpenCAS is an open-source software project supported by Intel that allows using a fast block device as a cache for a slow block device, and it can be configured so that NVRAM acts as a block cache for the SSD – same idea as NVCache. OpenCAS can be configured in several modes [8]: write-back, write-through, write-around, pass-through (disabled) and write-only (allocate blocks only on write). Based on the lessons learned during admission policy design, write-around seemed the most appropriate configuration option: "In write-around mode, the caching software writes data to the flash device if and only if that block already exists in the cache and [...] further optimizes the cache to avoid cache pollution in cases where data is written and not often subsequently re-read." [8]

Alternative baselines not pursued: Other alternatives to compare would be device mapper write cache (*dm-wc*) [4] and First Responder [30] – both OS-level block caches, and tiered memory systems, such as Nimble [35] and HeMem [29]. We considered comparing to *dm-wc* (the source code for First Responder is not available at the time of the writing), but upon analysing its properties we discovered that *dm-wc* admits blocks only on writes and does not throttle the admission

<sup>&</sup>lt;sup>6</sup>We did not include YCSB-F: it is modify-heavy, and modify operations in our storage engine were designed to trade performance for smaller cache footprint and smaller log records. Therefore, the overall throughput in modify operations was very low and insensitive to memory configurations.

<sup>&</sup>lt;sup>7</sup>The engine's cache and the OS buffer cache share the available DRAM, so this setting gives each an equal share.

rate, which runs counter to the lessons learned in our design. For example, dm-wc would admit zero blocks for read-only workloads, depriving of NVRAM caching the very workloads that benefit the most. OpenCAS, in contrast, can be configured with flexible admission policies, superseding dm-wc in that regard.

Nimble [35] and HeMem [29] are tiered memory systems that transparently move application pages between DRAM and NVRAM depending on how the pages are accessed. We did not compare against them, because they both required custom kernels, which would be impractical for us to adopt in the field. Furthermore, HeMem uses the NVRAM tier only for large allocations exceeding 1GB (HeMem specifically targets "big data" systems), so it would not use NVRAM for our engine's pages or blocks, whose size is on the order of a dozen kilobytes.

#### 4.1.2 Results

Figure 7 shows the throughput of the memory mode (MM), OpenCAS and NVCache with 32GB DRAM and 64GB, 128GB and 252GB of NVRAM relative to using no NVRAM at all. We make the following observations:

Observation 1: OpenCAS cache derives no performance benefit from NVRAM. This occurs, we conjecture, because it does not throttle the admission rate. OpenCAS delivers similar or better read hit rate as the NVCache (numbers not shown), but also makes two orders of magnitude more writes to NVRAM. With these observations, our best explanation is that failing to throttle the admission rate to NVRAM is the main reason why OpenCAS fails to perform.

**Observation 2:** Memory mode outperforms or performs comparably to NVCache when NVRAM is ample, as shown in Figure 7(c). The amount of NVRAM available for the experiments in Figure 7(c) is 252GB; given the dataset sizes in Table 3 we observe that they, for the most part, comfortably fit into the NVRAM. System memory is reported to be 252GB in Memory Mode, and as a result the kernel buffer cache has ample capacity to expand into the NVRAM space, providing no competition for the engine's DRAM cache. By contrast, with NVCache the amount of system memory is 32GB, and the kernel buffer cache competes with the engine's DRAM cache, swapping some of its pages to disk. At the same time we observe that for NVCache the marginal utility of additional NVRAM is small after it reaches 128GB. E.g., increasing the available NVRAM from 64GB to 128GB, NVCache hit rate grows by about 20%, but going from 128GB to 252GB, it grows by only another 5%.

On the other hand, we observe that Memory Mode hurts performance of the write-intensive YCSB-A (by about 30%), while NVCache keeps it unchanged.

**Observation 3:** When the dataset size exceeds NVRAM capacity, NVCache provides substantially better performance than Memory Mode. As shown in Fig. 7(a), NVCache outper-

forms the memory mode by between 30% (for YCSB-B) and 169% (YCSB-C). Further, the memory mode hurts YCSB-A's update throughput by about 18% relative to the DRAM-only baseline, while NVCache doesn't. We conclude that a bespoke cache can be superior to Memory Mode when the dataset size substantially exceeds the available NVRAM.

# 4.1.3 Combining Memory Mode and NVCache

We also experimented with configurations where part of the NVRAM is dedicated to MM and the remainder is used in AppDirect mode for NVCache, reasoning that we could size NVCache such that its marginal utility is highest (128GB), and the rest of the NVRAM could be used as MM's system memory for the benefit of other data structures. Unfortunately, we observed orders of magnitude worse throughput than with either MM or NVCache alone, and did not pursue this avenue further.

## 4.2 Performance vs. cost

In this experiment we take a fixed memory budget of 96GB and vary the fraction used by DRAM and NVRAM as shown in Table 48. We perform the experiments in this section using only NVCache, as we are unable to vary the amount of DRAM used in MM (see §4.1.1) and OpenCAS proved to be not competitive.

| NVRAM | DRAM | Relative cost |
|-------|------|---------------|
| 0GB   | 96GB | 1             |
| 16GB  | 80GB | 0.90          |
| 32GB  | 64GB | 0.79          |
| 48GB  | 48GB | 0.69          |
| 64GB  | 32GB | 0.59          |

Table 4: NVRAM and DRAM amounts and the cost of all system memory relative to an all-DRAM setup.

We use the NVRAM/DRAM per-byte cost ratio of 0.38, same as in a recent study with Optane memory [25]. As the amount of NVRAM increases and the amount of DRAM decreases, the total cost of system memory also decreases, as shown in Column 3.

Figure 8(a) shows the performance of YCSB normalized to the 96GB DRAM configuration and divided by the cost ratio in Column 3. In other words, these are performance/\$ numbers relative to the DRAM-only configuration. Positive numbers mean that the performance decreased less than the memory cost. Read-only or read-mostly workloads that benefit from the NVCache (see cache hit ratios in Fig. 8(b)) experience a positive gain, as expected.

<sup>&</sup>lt;sup>8</sup>We do not use the configuration with 16GB DRAM, because a scarce DRAM amount triggered a known kernel bug in the DAX code (at fs/inode.c:530).



Figure 7: YCSB throughput under memory mode, OpenCAS and NVCache relative to 32GB DRAM and zero NVRAM.



Figure 8: YCSB performance per dollar and NVCache hit ratio under increasing NVRAM and decreasing DRAM.

While in most cases performance predictably drops as the amount of DRAM decreases, YCSB-C in configuration with 64GB NVRAM and 32GB DRAM actually performs better than it does with 96GB DRAM – so we decrease the system cost *and* improve performance in absolute terms. This occurs because going beyond 32GB DRAM the utility of additional DRAM (and a larger DRAM cache) is considerably smaller than the loss in performance due to a smaller NVCache.

YCSB-A, whose write intensity makes it unable to benefit from any additional caching, suffers the overall loss in terms of performance/\$, as its performance drops at a steeper rate than the memory cost as we decrease the amount of DRAM.

#### 4.3 Summary

Our evaluation revealed that the memory mode is a competitive off-the-shelf alternative to a custom NVRAM cache when the amount of NVRAM is ample, but when it is scarce a custom cache solution such as NVCache will deliver better performance. OpenCAS is not competitive with either NVCache or the memory mode.

NVRAM is a cost-effective method of reducing memory cost while balancing the impact on performance for read-dominant workloads, where in some cases we can both reduce cost *and* improve performance as DRAM is swapped in favour of NVRAM. For write-intensive workloads, however,

replacing part of DRAM with NVRAM is not a cost-effective option.

## 5 Related Work

The most similar and recent counterpart to our study is a volatile Optane-resident cache for Facebook's RocksDB [25]. That work takes RocksDB's DRAM block cache and turns it into a two-tiered cache of DRAM and NVRAM, making it similar to tiered memory systems. Like other tiered memory systems, it addresses the question of how to split the cached data between the DRAM and the NVRAM tiers. We present a different design, that uses a stand-alone block cache interposed between the DRAM cache and the block devices. Although the RocksDB study also uses Optane NVRAM as the cache media, it does not raise awareness about the detrimental impact of concurrent writes on reads – a new finding we share – and does not factor it into the admission policy.

HeuristicDB [37] is a cooperative block layer cache that uses a fast Optane SSD as a caching tier in front of a slower drive. HeuristicDB admits all blocks read from and written to the block device, except those part of sequential access pattern. While this generous admission policy might work for Optane SSDs, we demonstrated that it is unacceptably costly for Optane NVRAM.

MyNVM is another key-value store based on RocksDB

that uses Optane SSD as the medium for an internal block cache [22]. Similarly to NVCache, MyNVM caters its admission policy to the properties of the Optane device, but pursues different goals: (1) to extend its endurance MyNVM admits only carefully selected keys, and (2) to maximize its bandwidth it accumulates keys into relatively large 128KB blocks before writing them to the device. While we did not focus on improving endurance, write throttling performed in NVCache via the overhead bypass parameter (OBP – see §3.2.3) should increase it. The second goal is potentially applicable to our NVRAM device too, but prior experiments ( [36], Fig. 5) showed that writing into Optane NVRAM blocks larger than those that we already write (e.g., 16KB+) does not improve its bandwidth.

Like MyNVM, Facebook's CacheLib caters to the properties of (flash) SSDs by limiting the admission rate to promote device endurance [14]. The throttling heuristic is rather simple; it is a configurable probability p that determines the overall rate of admission. As far as we know, p is not dynamically adjusted based on the observed rates of writes and lookups, so it could not be used in place of NVCache's OBP.

The work by Arulraj et al. [13] establishes a broad framework for reasoning about multi-tiered caching systems comprised of DRAM, persistent memory and SSDs. The authors propose an algorithm for data placement that dynamically tunes the following probabilities: the probability of bypassing DRAM on reads and writes (data being read/written directly from/to NVRAM) and the probability of bypassing the NVRAM on reads and writes. Bypassing DRAM is not applicable in our engine, because DRAM stores data in a different format than NVRAM, but bypassing NVRAM is the same question we grappled with during the design of our admission policy. Arulraj's work uses simulated annealing to dynamically adapt these probabilities, while we use a dynamically computed OBP threshold. Their evaluation was performed on a simulator, while we used real hardware, which revealed concrete limitations and influenced our design.

Estro et al. explored the relationship of performance and cost and the effects of different cache settings (such as write-through vs. writeback) in multi-tier caching configurations on real hardware [23]. Performing similar analysis would be a natural extension of our work, but can only be done after understanding the idiosyncrasies of cache design using recently adopted memory technology, contributed by our study.

The design of Orthus [34] was driven by an observation similar to ours: a seemingly faster device (Optane SSD, in their case) outperforms a slower device (a flash-based SSD) in general, but lags behind it under high concurrency. Orthus embraces a hybrid design: initially, a faster device acts as a cache for a slower device, admitting all blocks until a desired hit rate is accomplished. Then Orthus switches to a 'tiered mode', where the load is distributed among both devices to maximize the overall throughput. Our OBP feature accomplishes a somewhat similar effect when it begins throttling

the admission rate to NVCache, and as a result more reads are being sent to the storage device over time. In contrast to Orthus, NVCache throttles the admission rate based on the observed cost/benefit metric, and not as a consequence of achieving a certain hit rate. In fact, we observed that it may be beneficial for overall performance to throttle the admission rate at the expense of the reduced hit rate.

Multi-tiered memory systems dealt primarily with the policies for selecting the right tier for a memory page, and (to that end) efficiently tracking page access patterns [11, 12, 21, 24, 29, 31–33, 35]. Our decision to make NVCache independent from the DRAM cache makes these techniques largely complementary. As an alternative design, a tiered memory system could be used in place of NVCache by providing a larger pool of memory into which the engine's DRAM cache could transparently expand. Exploring this alternative was left for future work, since generic tiered memory systems known to us, e.g., Nimble [35] and HeMem [29] required custom kernels that were impractical do adopt in the field. Cache-Lib [3, 14] is a library for development of custom caches that span tiers, but as far as we understand it caches data as items whose raw memory can be traversed by the application, and so it would face the same need to fix pointers described in §3.1 if data structures with pointers to raw memory of other items were moved between tiers. Intel Memory Mode is a tiered memory system implemented in hardware, and we compared it against NVCache in §4.

#### 6 Conclusion

Although it was well known that Optane NVRAM delivers limited write throughput, it was not known that writes disproportionately affect the throughput of reads. We discovered that in the presence of a single writer thread, the throughput of reads drops almost by a factor of  $4\times$ . In contrast, with DRAM used in the same experiment the impact on read throughput was only 18%. This discovery led us to propose a new admission policy for Optane-resident caches. Our policy throttles the rate of writes to the cache (generated by the admission of new data, removal of invalid data and eviction), with the rate of reads, i.e., cache lookups. The metric capturing this principle, the Overhead Bypass Threshold, is generic and can be applied in any cache residing on hardware with similar properties. Our implementation outperforms an off-the-shelf cache from OpenCAS across the board, and the hardware tiered memory system (Intel Memory Mode) in all cases where the dataset size exceeds the amount of NVRAM.

# 7 Availability

The *AnonStorageEngine* source code, including NVCache, is available as open source software. Its location will be posted as soon as anonymity restrictions are lifted.

# References

- [1] AnonDatabaseCompany.anondatabasecompany.url, 2021.
- [2] AnonStorageEngine. anonstorageengine.git.repo, 2021.
- [3] CacheLib, Facebook's open source caching engine for web-scale services. https://engineering.fb.com/2021/09/02/open-source/cachelib, 2021.
- [4] Device Mapper Write Cache. www.kernel.org/doc/html/latest/admin-guide/device-mapper/cache.html, 2021.
- [5] Intel Memory Mode. https://software.intel.com/content/www/us/en/develop/articles/gsg-intro-to-provisioning-pmem.html, 2021.
- [6] Intel® Server Board S2600WF Product Family. Technical Product Specification. https://www.intel.com/content/dam/support/us/en/documents/server-products/server-boards/S2600WF\_TPS.pdf, 2021.
- [7] Open Cache Acceleration Software. https://open-cas.github.io, 2021.
- [8] Open Cache Acceleration Software: Admin Guide. https://open-cas.github.io/guide\_configuring.html, 2021.
- [9] Persistent Memory Development Kit. https://pmem.io/pmdk, 2021.
- [10] Yahoo! Cloud Serving Benchmark, Git Repo. https://github.com/brianfrankcooper/YCSB, 2021.
- [11] Neha Agarwal and Thomas F. Wenisch. Thermostat: Application-Transparent Page Management for Two-Tiered Main Memory. *SIGPLAN Not.*, 52(4):631–644, April 2017.
- [12] Shoaib Akram, Kathryn S. McKinley, Jennifer B. Sartor, and Lieven Eeckhout. Managing hybrid memories by predicting object write intensity. In *Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming*, Programming'18 Companion, page 75–80, New York, NY, USA, 2018. Association for Computing Machinery.
- [13] Joy Arulraj, Andy Pavlo, and Krishna Teja Malladi. Multi-Tier Buffer Management and Storage System Design for Non-Volatile Memory. *CoRR*, abs/1901.10938, 2019.

- [14] Benjamin Berg, Daniel S. Berger, Sara McAllister, Isaac Grosof, Sathya Gunasekar, Jimmy Lu, Michael Uhlar, Jim Carrig, Nathan Beckmann, Mor Harchol-Balter, and Gregory R. Ganger. The CacheLib Caching Engine: Design and Experiences at Scale. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 753–768. USENIX Association, November 2020.
- [15] Muhammad Bilal and Shin-Gak Kang. A cache management scheme for efficient content eviction and replication in cache networks. *IEEE Access*, 5:1692–1701, 2017.
- [16] Mustafa Canim, George A. Mihaila, Bishwaranjan Bhattacharjee, Kenneth A. Ross, and Christian A. Lang. SSD Bufferpool Extensions for Database Systems. *Proceedings of the VLDB Endowment*, 3(2):1435–1446, 2010.
- [17] Yuxia Cheng, Wenzhi Chen, Zonghui Wang, Xinjie Yu, and Yang Xiang. Amc: an adaptive multi-level cache algorithm in hybrid storage systems. *Concurrency and Computation: Practice and Experience*, 27(16):4230–4246, 2015.
- [18] Yuxia Cheng, Yang Xiang, Wenzhi Chen, Houcine Hassan, and Abdulhameed Alelaiwi. Efficient cache resource aggregation using adaptive multi-level exclusive caching policies. *Future Gener. Comput. Syst.*, 86:964–974, 2018.
- [19] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, and Sachin Katti. Dynacache: Dynamic cloud caching. In 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 15), Santa Clara, CA, July 2015. USENIX Association.
- [20] Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. benchmarking cloud serving systems with ycsb.
- [21] Subramanya R. Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram, Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. Data Tiering in Heterogeneous Memory Systems. In *Proceedings of the Eleventh European Conference on Computer Systems*, EuroSys '16, New York, NY, USA, 2016. Association for Computing Machinery.
- [22] Assaf Eisenman, Darryl Gardner, Islam AbdelRahman, Jens Axboe, Siying Dong, Kim Hazelwood, Chris Petersen, Asaf Cidon, and Sachin Katti. Reducing DRAM Footprint with NVM in Facebook. In *Proceedings of the Thirteenth EuroSys Conference*, EuroSys '18, New York, NY, USA, 2018. Association for Computing Machinery.

- [23] Tyler Estro, Pranav Bhandari, Avani Wildani, and Erez Zadok. Desperately seeking ... optimal multi-tier cache configurations. In 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 20). USENIX Association, July 2020.
- [24] Sudarsun Kannan, Ada Gavrilovska, Vishal Gupta, and Karsten Schwan. HeteroOS: OS Design for Heterogeneous Memory Management in Datacenter. In *Pro*ceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, page 521–534, New York, NY, USA, 2017. Association for Computing Machinery.
- [25] Hiwot Tadese Kassa, Jason Akers, Mrinmoy Ghosh, Zhichao Cao, Vaibhav Gogte, and Ronald Dreslinski. Improving performance of flash based key-value stores using storage class memory as a volatile memory extension. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 821–837, 2021.
- [26] Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H. Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. On the Existence of a Spectrum of Policies That Subsumes the Least Recently Used (LRU) and Least Frequently Used (LFU) Policies. In *Proceedings of the 1999 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems*, SIGMETRICS '99, page 134–143, New York, NY, USA, 1999. Association for Computing Machinery.
- [27] Fei Meng, Li Zhou, Xiaosong Ma, Sandeep Uttamchandani, and Deng Liu. vCacheShare: Automated Server Flash Cache Space Management in a Virtualization Environment. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 133–144, Philadelphia, PA, June 2014. USENIX Association.
- [28] Sundaresan Rajasekaran, Shaohua Duan, Wei Zhang, and Timothy Wood. multi-cache: Dynamic, efficient partitioning for multi-tier caches in consolidated vm environments. In 2016 IEEE International Conference on Cloud Engineering (IC2E).
- [29] Amanda Raybuck, Tim Stamler, Wei Zhang, Mattan Erez, and Simon Peter. HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM. In 28th ACM Symposium on Operating Systems Principles, 10 2021.
- [30] Hyunsub Song, Shean Kim, J. Hyun Kim, Ethan JH Park, and Sam H. Noh. First responder: Persistent memory simultaneously as high performance buffer cache and storage. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 839–853, 2021.

- [31] Chenxi Wang, Huimin Cui, Ting Cao, John Zigman, Haris Volos, Onur Mutlu, Fang Lv, Xiaobing Feng, and Guoqing Harry Xu. panthera: Holistic memory management for big data processing over hybrid memories.
- [32] Wei Wei, Dejun Jiang, Sally A. McKee, Jin Xiong, and Mingyu Chen. Exploiting program semantics to place data in hybrid memory. In 2015 International Conference on Parallel Architectures and Compilation, PACT 2015, San Francisco, CA, USA, October 18-21, 2015, pages 163–173. IEEE Computer Society, 2015.
- [33] Kai Wu, Yingchao Huang, and Dong Li. Unimem: Runtime Data Managementon Non-Volatile Memory-Based Heterogeneous Main Memory. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '17, New York, NY, USA, 2017. Association for Computing Machinery.
- [34] Kan Wu, Zhihan Guo, Guanzhou Hu, Kaiwei Tu, Ramnatthan Alagappan, Rathijit Sen, Kwanghyun Park, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. The storage hierarchy is not a hierarchy: Optimizing caching on modern storage devices with orthus. In 19th USENIX Conference on File and Storage Technologies (FAST 21), pages 307–323. USENIX Association, February 2021.
- [35] Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. Nimble Page Management for Tiered Memory Systems. In 2019 Architectural Support for Programming Languages and Operating Systems (ASP-LOS '19), pages 331–345, 04 2019.
- [36] Jian Yang, Juno Kim, Morteza Hoseinzadeh, Joseph Izraelevitz, and Steve Swanson. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. In 18th USENIX Conference on File and Storage Technologies (FAST 20), pages 169–182, Santa Clara, CA, February 2020. USENIX Association.
- [37] Jinfeng Yang, Bingzhe Li, and David J. Lilja. *HeuristicDB: A Hybrid Storage Database System Using a Non-Volatile Memory Block Device*. Association for Computing Machinery, New York, NY, USA, 2021.
- [38] Zhengyu Yang, Morteza Hoseinzadeh, Allen Andrews, Clay Mayers, David Thomas Evans, Rory Thomas Bolt, Janki Bhimani, Ningfang Mi, and Steven Swanson. AutoTiering: Automatic data placement manager in multitier all-flash datacenter. In 36th IEEE International Performance Computing and Communications Conference, IPCCC 2017, San Diego, CA, USA, December 10-12, 2017, pages 1–8. IEEE Computer Society, 2017.