Systematic CXL Memory Characterization and Performance Analysis at Scale

Jinshu Liu Virginia Tech

https://github.com/MoatLab/Melody

1.Introduction

Currently, there is a significant gap in research that explores detailed CXL characteristics and their impact on memory-intensive workloads at scale, in depth, and across the full spectrum of sub-μs latencies.

In particular, how do CXL devices differ in detailed performance characteristics beyond average latency and bandwidth metrics? How (much) does CXL's long (and longer) latency affect CPU efficiency and workload performance? What are the underlying causes, and how do we analyze them?

Existing works focus on coarse-grained analysis and overlook several critical aspects: (i) CXL performance stability (i.e., tail latencies); (ii) CPU tolerance to prolonged CXL latencies across various workloads, and the architectural implications of CXL; and (iii) a systematic approach to dissect workload performance and CPU inefficiency under CXL.

Hence, the paper introduces Melody, a comprehensive framework for detailed CXL performance characterization, which provides:

  1. The first analysis of CXL characteristics beyond average latency and bandwidth across 4 real CXL devices.
  2. An extensive evaluation of CXL’s performance implications across diverse workloads.
  3. A systematic approach for workload performance analysis under CXL.

Contributions (in my view):

  1. Melody, a framework to measure CXL performance.

  2. An in-depth study of CXL tail latencies.

  3. A root-cause analysis approach.

2.Background

How do the CPU backend and the CXL MC process load and store requests?

Request types:

The CPU issues two types of load requests: Demand and Prefetch. Demand loads are memory reads that the CPU requests from the (CXL) MC only when the data is needed for computation. Prefetch reads are predictive reads issued by prefetchers, e.g., "L1PF" and "L2PF" in Figure 2a.

Stores are first queued in the “store buffer.” Each store request triggers a Read-for-ownership (RFO) for cache coherence from CXL/DRAM, followed by a Write upon cache eviction.


MC: Memory requests to the CXL MC are encapsulated in a specific packet format, known as flits, for transmission over CXL/PCIe. Upon arrival, the CXL controller ("CXL Ctrl") parses the request and places it in the request queue. The request scheduler then selects the next request to process based on the scheduling policy and other factors such as thermal management, aiming for low latency, high bandwidth, and reliability. Requests are then passed to the command scheduler, which issues the appropriate low-level DDR commands to the DRAM chips.

3.CXL Device Characterization

3.1 Testbed

Concern:

Workloads:

cloud workloads (in-memory caching and databases such as Redis [13] and VoltDB [21], CloudSuite [1], and Phoronix [12]), graph processing (GAPBS [22], PBBS [19]), data analytics (Spark [30]), ML/AI (GPT-2 [5], MLPerf [14], Llama [9]), SPEC CPU 2017 [18], and PARSEC [24].

3.2 CXL latency stability and its relationship with bandwidth

Term distinction:

Loaded latency: memory access latency when the device is under high utilization.

Idle latency: memory access latency when the system experiences minimal load.

This part implements an MIO microbenchmark that computes average latency by recording one pair of rdtsc timestamps around many pointer-chasing accesses, and uses Intel MLC to validate MIO. It then measures the relationship between tail latency and bandwidth; the results are largely what one would expect.
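
To make the measurement concrete, here is a minimal pointer-chasing sketch in the spirit of such an MIO benchmark; the buffer size, iteration count, and structure are my assumptions, not Melody's actual code. Each dereference is a dependent demand load, and one rdtsc pair around the whole loop yields the average per-access latency; run it under `numactl --membind=<cxl-node>` to target a CXL device.

```c
/* Minimal pointer-chasing latency sketch (an illustrative stand-in, not the
 * actual Melody/MIO code).
 * Build: gcc -O2 -o chase chase.c
 * Run:   numactl --membind=<cxl-node> ./chase */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>                      /* __rdtsc, _mm_lfence */

#define LINES ((64UL << 20) / 64)           /* 64 MiB of 64-byte cache lines */
#define ITERS (10UL * 1000 * 1000)

struct node { struct node *next; char pad[64 - sizeof(struct node *)]; };

int main(void)
{
    struct node *buf;
    if (posix_memalign((void **)&buf, 64, LINES * sizeof(*buf)))
        return 1;

    /* Shuffle an index array and link consecutive entries so the chase
     * forms one random cycle that hardware prefetchers cannot follow. */
    size_t *idx = malloc(LINES * sizeof(*idx));
    for (size_t i = 0; i < LINES; i++) idx[i] = i;
    for (size_t i = LINES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = idx[i]; idx[i] = idx[j]; idx[j] = t;
    }
    for (size_t i = 0; i < LINES; i++)
        buf[idx[i]].next = &buf[idx[(i + 1) % LINES]];

    /* One rdtsc pair around many dependent loads -> average load latency. */
    struct node *p = &buf[idx[0]];
    _mm_lfence();
    uint64_t start = __rdtsc();
    for (size_t i = 0; i < ITERS; i++)
        p = p->next;
    _mm_lfence();
    uint64_t cycles = __rdtsc() - start;

    printf("avg load latency: %.1f cycles (ignore: %p)\n",
           (double)cycles / ITERS, (void *)p);  /* print p to keep the loop live */
    free(idx);
    return 0;
}
```

Cross-checking such a tool against Intel MLC, as the paper does, guards against mistakes like the compiler eliding the loads or the buffer being too small to escape the caches.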


A method for measuring latency under memory pressure: co-locate the pointer-chasing thread with 32 AVX memory-traffic threads, binding them all to the same NUMA node.
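
For reference, a sketch of one such pressure thread is below; the per-thread buffer size and the use of AVX2 loads plus non-temporal stores are my assumptions about what the "AVX memory-traffic threads" look like. Each thread pins itself to a core on the target NUMA node, and memory placement can again be forced with `numactl --membind`.

```c
/* Sketch of one AVX "pressure" thread co-located with the pointer chaser
 * (buffer size and access pattern are assumptions, not Melody's code).
 * Build with: gcc -O2 -mavx2 -pthread ... */
#define _GNU_SOURCE
#include <immintrin.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_BYTES (256UL << 20)              /* 256 MiB per thread */

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Endless sequential AVX reads and non-temporal writes to generate
 * bandwidth pressure on the node the buffer is bound to. */
void *avx_pressure(void *arg)
{
    pin_to_cpu((int)(intptr_t)arg);          /* a core on the target NUMA node */
    char *buf = aligned_alloc(64, BUF_BYTES);
    __m256i v = _mm256_set1_epi64x(1);
    for (;;) {
        for (size_t off = 0; off < BUF_BYTES; off += 64) {
            v = _mm256_add_epi64(v, _mm256_load_si256((__m256i *)(buf + off)));
            _mm256_stream_si256((__m256i *)(buf + off + 32), v);
        }
    }
    return NULL;
}
```

Spawning 32 such threads alongside the chaser on the same node reproduces the loaded-latency setup; varying the thread count sweeps the bandwidth level.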

CXL latency vs. bandwidth under various read/write ratios.

  1. Local DRAM achieves the highest bandwidth under a read-only workload, whereas NUMA and all CXL devices (except CXL-C) achieve minimal bandwidth in read-only scenarios. This is because NUMA and CXL links are bidirectional, allowing them to sustain higher bandwidth under mixed read/write workloads.
  2. CXL devices demonstrate significant variability from device to device and across read/write ratios.

Impact of CPU prefetchers on (tail) latency.

  1. Prefetching does not fully mitigate CXL-induced tail latencies.

Reasoning.

The measurements in this section reveal large gaps in tail latency and related metrics. Such a conclusion is of limited use on its own, but the large performance variability can serve as a challenge and motivation for further performance studies. The gaps stem from:

1. The CXL protocol's transaction- and link-layer implementations, which themselves introduce performance overhead.

2. The implementation of the memory controller itself.

4 Workload Characterization

This section discusses the latency sensitivity of various workloads; similar analyses have appeared in prior work.

5 Spa for CXL Slowdown Analysis

5.2 Challenges and Limitations of State-of-the-Art

Challenges:

  1. Identifying the underlying CPU events/metrics that can correlate to the slowdowns is challenging.

  2. It is even more challenging to establish a precise correlation between workload performance and architecture-level performance metrics.

Why not TMA?

  1. TMA does not provide a differential analysis to interpret pipeline differences resulting from varying backend memory (i.e., CXL vs. local DRAM).
  2. TMA is unable to precisely correlate architecture-level metrics with workload slowdowns.

5.3 Spa: A Bottom-Up Approach


DRAM (Demand Load) Slowdown:

The misses counted here are demand read misses, excluding RFO and prefetch requests.

Store Slowdown:

Incoming store requests queued in the store buffer are dequeued upon completion. Some writes issue RFO requests before execution. If the store buffer fills up, these RFOs hinder load efficiency, causing CPU stalls.

Cache Slowdown:

On SKX, most cache slowdown occurs in L2 due to a significant rise in stall cycles for L1 load misses with CXL. Conversely, on SPR/EMR, LLC experiences the bulk of slowdown, with a notable increase in stall cycles for L2 load misses with CXL.
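
To make the bottom-up, differential idea concrete, here is a minimal sketch of how per-component slowdowns might be derived from stall-cycle counters collected once on local DRAM and once on CXL. This is my illustrative reconstruction of the general shape, not Spa's actual formulas; the counter fields and the normalization by baseline cycles are assumptions.

```c
/* Illustrative differential slowdown attribution (NOT Spa's exact formulas):
 * each component's contribution is modeled as the growth of its stall
 * cycles under CXL, normalized by the baseline (local-DRAM) cycle count. */
#include <stdio.h>

struct run_counters {                 /* one set per backend (DRAM or CXL) */
    double cycles;                    /* total core cycles */
    double stalls_demand_load;        /* stalls on demand-load misses */
    double stalls_store_buffer;       /* stalls with the store buffer full */
    double stalls_cache;              /* stalls on L2/LLC load misses */
};

static double contrib(double cxl, double dram, double dram_cycles)
{
    double d = cxl - dram;
    return d > 0 ? d / dram_cycles : 0.0;    /* fraction of baseline cycles */
}

void report_slowdowns(const struct run_counters *dram,
                      const struct run_counters *cxl)
{
    printf("demand-load slowdown: %.1f%%\n", 100 * contrib(
        cxl->stalls_demand_load, dram->stalls_demand_load, dram->cycles));
    printf("store slowdown:       %.1f%%\n", 100 * contrib(
        cxl->stalls_store_buffer, dram->stalls_store_buffer, dram->cycles));
    printf("cache slowdown:       %.1f%%\n", 100 * contrib(
        cxl->stalls_cache, dram->stalls_cache, dram->cycles));
}
```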

Key finding:

This reduces the L2 prefetcher's coverage of both demand reads and L1 prefetches. L1 prefetches either miss entirely in L2 or, at best, hit on a pending L2 prefetch. Consequently, CXL also hurts the L1 prefetcher's timeliness: loads that would otherwise have hit in the cache, had L1 prefetches been timely, are now delayed. As a result, overall prefetch efficiency suffers and stalls on caches increase.

Because of CXL's long latency, L2 prefetches become less timely: when L1 needs the data, the L2 prefetch has not yet returned, so the access is treated as a miss in L1, goes to L2, and the request is issued again. Load requests that previously would have hit now miss.


Intel provides no counter that directly observes L1PF hits and misses in L2, but the behavior can be observed indirectly through other counters.

Observation: L2PF misses in L3 decrease, L1PF misses in L3 increase, and L2PF hits in L3 remain unchanged. From this, the authors infer that the L2 prefetcher prefetches less effectively and L1 prefetches increase.
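
As a rough illustration of the plumbing such indirect observation requires, below is a minimal raw-counter reader built on perf_event_open(2). The event encoding is left as a placeholder to be filled in from Intel's perfmon event lists; the specific events the authors combine for this inference are not reproduced here.

```c
/* Minimal raw PMU counter reader via perf_event_open(2). The raw event
 * encoding is a placeholder; look it up in Intel's perfmon event lists
 * (e.g., the L2_RQSTS / OFFCORE_RESPONSE families) for your CPU. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long perf_open(struct perf_event_attr *attr)
{
    return syscall(__NR_perf_event_open, attr, 0 /* this process */,
                   -1 /* any CPU */, -1 /* no group */, 0);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0;              /* placeholder: event|umask encoding */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_open(&attr);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the region of interest here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) count = 0;
    printf("event count: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```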

5.5 & 5.6 Workload Slowdown Diversity & Period-based Slowdown Analysis

An approach to convert time-based sampling data into a period-based slowdown analysis.
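
A minimal sketch of that conversion is below; the 10 ms period length and the fields carried per sample are my assumptions, not the paper's parameters.

```c
/* Sketch: bucket time-stamped samples into fixed periods so that per-period
 * slowdowns can later be computed by aligning the CXL run against the
 * local-DRAM run, period by period. */
#include <stddef.h>

#define PERIOD_NS (10ULL * 1000 * 1000)      /* 10 ms periods (assumed) */

struct sample { unsigned long long ts_ns; double stall_cycles, cycles; };
struct period { double stall_cycles, cycles; };

/* Accumulate samples into periods[]; returns the number of periods used. */
size_t bucketize(const struct sample *s, size_t n,
                 struct period *periods, size_t max_periods)
{
    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        size_t p = (size_t)(s[i].ts_ns / PERIOD_NS);
        if (p >= max_periods) break;
        periods[p].stall_cycles += s[i].stall_cycles;
        periods[p].cycles       += s[i].cycles;
        if (p + 1 > used) used = p + 1;
    }
    return used;
}
```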

5.7 Spa Use Cases

Performance tuning. For example, to mitigate the slowdown bursts observed in 605.mcf (Figure 16b), we first identify memory accesses during bursty periods (e.g., exceeding 10%) using binary instrumentation via Intel Pin. Next, we pinpoint the source code responsible for high slowdowns using addr2line. Our analysis reveals that two performance-critical objects, each 2GB in size, are contributing to the slowdown.

The author mentions two cases: one uses Spa for performance tuning, and the other uses the slowdown as a metric for tiering. Both points connect to, and lay the groundwork for, the follow-up paper.

Limitations and opportunities:

  1. To validate the main cause of the cache slowdown, the authors disable all hardware prefetchers (L1 and L2) and measure workload slowdowns: "With prefetchers off, we found virtually no stall cycles on cache." This method does not go very deep. Why does the slowdown drop? That is worth digging into, but it would require some hardware-level exploration.
  2. Regarding prefetching, one could increase the prefetcher depth (i.e., prefetch a few more cycles further ahead) to address this problem directly.
  3. Section 5.7 describes a use case of Spa: profile the slowdowns Spa reports, then place the variables with severe slowdowns on CXL. I find this approach very applicable.
