
Mashing Up CXL And Gen-Z For Shared Disaggregated Memory

1 Share

If you are impatient for not just memory pooling powered by the CXL protocol, but the much more difficult task of memory sharing by servers attached to giant blocks of external memory, you are not alone.

Mashing Up CXL And Gen-Z For Shared Disaggregated Memory was written by Timothy Prickett Morgan at The Next Platform.

Read the whole story
9 days ago
Share this story

483 – SOC2: The screenshots will continue until security improves

1 Comment

483 points, 229 comments

Read the whole story
143 days ago
Great writeup of how to handle SOC2
Share this story


1 Comment


Read the whole story
154 days ago
Nice way to structure a Django app
Share this story

AWS CDK - The Good, The Bad and the Scary

1 Comment

AWS Cloud Development Kit (CDK) has become, in its short history, a very popular infrastructure-as-code tool. It's not too surprising why - it allows engineers to use richer programming languages to define infrastructure, rather than having to use JSON or YAML. With richer languages come better tooling, better abstractions, and more. My old friend and colleague Gregor Hohpe has used the phrase “Infrastructure-as-actual-Code” to distinguish tools like CDK (and Pulumi) from CloudFormation (and Terraform).

I've spent time over the last year working on a couple of projects using CDK, and I've now appreciated firsthand the power that comes from using TypeScript to define a complicated AWS deployment. Overall I think that if you have a fairly small team that owns the application, its infrastructure code, and its production deployment, then CDK is worth strongly considering (I'd use it myself).

However I also still have major concerns with CDK. Some of those are immediately apparent, e.g. the lack of cloud-side support. But many of my concerns relate to longer term operability. I think choosing CDK requires making a trade-off - better efficiency for developers in the short term vs headaches for operations engineers in years to come.

In this article I share my thoughts - both positive and negative - about CDK, for technical decision makers. I hope these ideas will help as you make your own choices about what infrastructure tooling to use.

The Good

Programming Language Superpowers

Besides CDK, the standard way of deploying AWS infrastructure with only AWS tools has been to use CloudFormation (perhaps with SAM helping out a little). The problem with first-party use of CloudFormation is that it requires using YAML or JSON as a source language, and neither of these languages is particularly friendly when it comes to building anything of any complexity.

CDK, on the other hand, allows engineers to use JavaScript, TypeScript, Python, Java, or C# to define their infrastructure (with more languages on the way). Because of this, writing CDK is much more effective than writing CloudFormation, for the following reasons:

  • Modern editors guide engineers using auto-complete, and catch many bugs early using syntax checking.
  • Standard language structures, like iteration and function calls, allow abstractions to be made within the definition of a “stack”.
  • Library code - shared among multiple “stacks” - is also simple, using standard language techniques, allowing further abstractions to be made.
  • Infrastructure-definition code can be unit tested.
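These advantages are easy to see even without CDK installed. The sketch below is a deliberately library-free Python stand-in (real CDK code would use constructs from aws-cdk-lib, and the queue resource here is purely illustrative): a plain function plus a loop generate what would otherwise be three copy-pasted YAML resources, and the result can be unit tested like any other code.

```python
import json

def queue_resource(name: str, retention_days: int) -> dict:
    """Build a CloudFormation-style resource dict for a hypothetical queue."""
    return {
        "Type": "AWS::SQS::Queue",
        "Properties": {
            "QueueName": name,
            "MessageRetentionPeriod": retention_days * 86400,  # days -> seconds
        },
    }

# Iteration replaces copy-pasted YAML: one loop, many resources.
template = {
    "Resources": {
        f"WorkQueue{env.title()}": queue_resource(f"work-{env}", retention_days=4)
        for env in ("dev", "staging", "prod")
    }
}

# Infrastructure definitions built this way can be unit tested.
assert len(template["Resources"]) == 3
assert template["Resources"]["WorkQueueDev"]["Properties"]["MessageRetentionPeriod"] == 345600

print(json.dumps(template, indent=2))
```

The same pattern scales to real CDK stacks: shared library functions produce constructs instead of dicts, and test frameworks assert on the synthesized template.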

Engineer enthusiasm

Unsurprisingly, engineers tend to prefer writing CDK to writing YAML! And so with the choice of CDK come goodwill and a recruiting benefit - both useful for management.

Furthermore, this enthusiasm is apparent in the larger community. It's hard for folks to get particularly excited about CloudFormation, but with CDK we're seeing dedicated conferences, open source contributions, and more - all of which support your own team's work.

The Bad

CDK isn't an AWS service in the normal sense - you can't go to the AWS Web Console and pull up the CDK view. This is because CDK is predominantly a client-side abstraction over CloudFormation.

That means while writing CDK is more effective, deploying and debugging CDK at runtime is a different story.

This is immediately obvious when you go to look at a CDK app in the only place it exists in the AWS Web Console - the CloudFormation console. All of your nicely abstracted resources are suddenly flattened into a vast number of more fundamental AWS types, all with hard-to-read generated names. And when things go wrong (things will go wrong) you still need someone on the team that understands CloudFormation.

The problems with CDK not having a cloud-side component don't stop at debugging though. Because it's a client-side tool it puts many more responsibilities on to a team that AWS would usually handle, for example:

  • The deployment environment must be maintained - for development and production. E.g., the correct versions of CDK and the programming language environment must be maintained for the lifetime of the application. With a CloudFormation / SAM approach, AWS is responsible for providing such an environment.
  • CDK requires a “bootstrap” environment to be set up and maintained for each AWS sub-account. This is a concern that cuts across every application in the account and requires careful thought.
  • If using JavaScript / TypeScript for CDK you will need to upgrade your Node versions fairly frequently due to Node’s (comparatively) rapid support cycle (Node LTS versions are only supported for 2.5 years). This might be particularly concerning if your application code is written using a language that has longer support (e.g. Java), but you’re using TypeScript as an organizational standard for CDK.
  • Unlike a regular AWS service, the CDK team is already choosing to make major version changes to CDK. This can lead to operations headaches when upgrades need to be applied.

The Scary

As a developer I mostly love CDK. I am way faster with it than I am with CloudFormation.

But if I put my CTO hat on, and think about CDK as a strategic choice for an organization, then things get a lot murkier.

Longer-term Operations Requirements

While the exciting time of a software project is when there's a lot of activity happening, the truth is that most industrial software projects that have any amount of success tend to enter a “maintenance mode” longer-term. Deployments are less frequent, and the set of individuals looking after the app changes over time.

What happens to a CDK application in this context? I have several concerns, beyond those I mentioned in “meta-responsibilities” above.

  • How maintainable is the app's CDK code? Because CDK uses a “regular” programming language it opens up the possibility for heavily engineered solutions - especially in the case of org-wide abstractions. What happens to an application-developer's finely tuned CDK library when the organization ops team changes fundamental requirements about how a company is using AWS, e.g. a change of VPC architecture? Is that developer still employed by the company? And if not which ops engineer is going to need to figure out how the code works?
  • Similarly, how many programming languages are maintenance operations engineers going to be expected to know? Traditionally, even if a company has used multiple programming languages for app-dev, from an operations perspective an engineer can typically get by with knowing bash / Python, or PowerShell. But if app-dev teams chose the programming-language-du-jour when building an app, will operations engineers now need to know Python, JavaScript, Java, Go, and whatever other languages CDK enables, in order to support an organization’s portfolio?
  • What happens for applications for which a team doesn't regularly update the CDK environments, e.g. those that go 3 years between deployments? Will they be non-deployable without significant work because of changes to the CDK bootstrap in their account? Will the version of Node they use even still be available? What happens in the case of an urgent need to deploy due to a security issue?

AWS support of CDK over the long term

As I've already described, CDK isn't an “AWS service” in the normal sense - there's no CDK API, and there's no CDK section of your AWS bill. While this may change over time, if it doesn't, that raises the question: to what extent will AWS continue to support CDK? Sure, CDK does have an official marketing page, but SAR has one of those too, and it's been given the cold shoulder over the last couple of years (and SAR at least has a cloud component).

This breaks out in a few ways.

  • How stable will CDK be over time? We've already seen a major upgrade from V1 to V2, and V1 will no longer receive security updates from 2023. Such a rapid sunsetting is unusual in the AWS ecosystem.
  • Will CDK continue to receive significant investment from AWS? Since there's no cloud-side element to CDK we can imagine that it might more-easily get mothballed if it loses favor. That's a problem for something that will need continued updates to maintain support for its underpinnings (e.g. Node LTS updates, again.)
  • On a related matter, will CDK continue to support new AWS services and features in a timely manner?
  • Will the existing problems with CDK (e.g. the lack of runtime support in-cloud) be addressed?

Should you use CDK or not?

Over the last year I’ve heard of a lot of companies switching to CDK, largely I suspect because senior developers want to use it, and are making their voices heard. And since it’s an “officially supported” AWS product there’s a certain expectation that it will be maintained over time to the same extent that EC2 and S3 are.

But because of the fundamental nature of CDK being a client-side tool, and not a cloud-side tool, I think this expectation is false, or at least is yet to be proven. This brings up all the longer-term concerns I’ve mentioned here.

Whether you should use CDK or not, therefore, depends on how the concerns here could impact you over time. For example, if you are using CDK on a small number of applications, where the applications themselves are deployed frequently, and the people on the team handling deployment are the same as those building the app itself, then I think CDK is a good idea. Again, I would use CDK myself for this kind of organizational setup.

However I would likely not use CDK in the following scenarios:

  • Where an application is deployed infrequently / is already “in maintenance mode”
  • Where an application-development team is a separate team from the one responsible for production deployment

I do hope that over time AWS makes CDK the stable, supported, standard tool for infrastructure deployment. But for now, the choice to use CDK is one that favors developers in the short term while, I fear, leaving Ops teams with a sizable headache in the long term unless there is significant ongoing maintenance.

Read the whole story
161 days ago
Interesting and balanced reflection on CDK usage
Share this story

Internode Cache Thrashing: Hunting a NUMA Performance Bug

1 Comment

ARM-based computers continue to make inroads across both personal computing as well as cloud server spaces, from the ARM-based MacBooks you can use during development to the AWS Graviton2-based instances that provide better price-performance than similar Intel x86-based instances. But Amazon isn’t the only cloud provider with ARM-based instances. Oracle Cloud offers the Ampere Altra A1, which scales to 80 cores per CPU and runs at speeds up to 3.3 GHz.

Michał Chojnowski
Michał is a software engineer at ScyllaDB. He has an undergraduate degree from the University of Warsaw, where he became involved with the implementation of Parquet in ScyllaDB.

I discovered a curious problem while porting ScyllaDB — a high-performance, low-latency database that’s designed to extract every possible ounce of performance out of modern infrastructure — to run on the ARM-based Ampere A1 server. We’ll share full performance benchmarking results of running ScyllaDB on such a big beefy server in due time. But first, I want to share the results of my troubleshooting to alert other developers to the kinds of issues you might encounter when releasing code for an ARM-based platform.

Spoiler alert: The issue encountered was not actually due to the ARM-based Ampere Altra platform. So what was the problem?

(If you enjoy hunting down perplexing performance mysteries, register for P99 CONF — a free virtual conference dedicated to all things performance.)

The Problem

When testing ScyllaDB’s performance on Oracle Cloud’s new ARM machines (the bare metal Ampere A1 with 80 ARM cores), I noticed that some test runs were significantly slower than expected.

I repeated the benchmark many times and determined that its behavior was bimodal: either ScyllaDB was running at full, expected throughput (around 45k writes per core per second) or at 40% throughput — never in between. The slow runs were happening about as often as the fast runs, but fast and slow runs were interleaved randomly, without any discernible pattern.

On a good run, shards were able to sustain ~50,000 write operations per second per shard.

On the same server, on a different run of the same benchmark, shards were only attaining ~16,000 write operations per second per shard — less than 40% of the performance of the good runs.

Without reading further, do you already have a hunch as to the answer? Make a mental note, and then let’s see what we discovered.

Hardware Utilization

When looking for any bottleneck, it’s always good to start by checking resource utilization. Resources that are utilized at 100% are bottleneck candidates. If no resource is utilized at 100%, there’s either a client-side problem or a scheduling problem (for instance, a thread is sleeping even though resources are available). In ScyllaDB’s case, the main constraining resources we look at are CPU, RAM, network and disk.

The benchmark that exposed the problem was specifically meant to stress the CPU:

cassandra-stress write duration=1h cl=QUORUM -pop dist=UNIFORM\(1..100000\) -mode native cql3 maxPending=1024 -rate threads=1000 -node

With the dataset this small, all writes should be happening in RAM; none should be flushed to disk. We disabled the commitlog for this test, so the disk should have been silent.

Looking at cassandra-stress’s default schema, we expected network traffic of about 200 bytes per query, which (given 50k ops/core/second) adds up to about 10 MB/core/s. For this benchmark, I happened to be running ScyllaDB on 32 cores (I didn’t have enough client nodes available in this network to use all 160 cores), so that’s a total of 320 MB/s, or about 2.6 Gbit/s. This is nowhere near the advertised throughput of 100 Gbit/s, so we wouldn’t expect the network to be the problem.
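As a sanity check, that estimate in decimal-unit Python (the figures match the approximate numbers quoted above):

```python
# Back-of-envelope check of the network-traffic estimate.
bytes_per_query = 200       # estimate from cassandra-stress's default schema
ops_per_core_s = 50_000     # observed writes per core per second
cores = 32                  # cores used in this benchmark

per_core = bytes_per_query * ops_per_core_s    # 10 MB/core/s
total_bytes_s = per_core * cores               # 320 MB/s
total_gbit_s = total_bytes_s * 8 / 1e9         # 2.56 Gbit/s, far below 100 Gbit/s

print(f"{per_core/1e6:.0f} MB/core/s, {total_bytes_s/1e6:.0f} MB/s, {total_gbit_s:.2f} Gbit/s")
```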

Let’s check the metrics:

Network performance

Disk performance

As expected, the disk is almost silent, the network traffic is within expectations and way below its maximum capacity, and the CPU load is 100%. (Not for all shards, but this is expected. We asked cassandra-stress for uniform load distribution, so slightly faster shards are slightly less utilized. The slowest shard is the bottleneck.)

This clearly seems like a CPU bottleneck. But if we want to double-check that the network is fine, we can just run the benchmark clients on the same machine. I did that, and the throughput remained exactly the same. So let’s move on to diagnosing the CPU.

Basic CPU Stats

There are two possibilities: either a) the slow runs are doing more work per query or b) the CPU is doing the work more slowly. To find out which one is happening, we start by looking at IPC (instructions-per-cycle). A decreased IPC will mean that the CPU is doing the work more slowly.

sudo perf stat -C8 --timeout 10000

We find an IPC of 0.42 for slow runs and 0.98 for fast runs. This is fairly close to the throughput ratio: ~16k ops/core/s for slow runs and ~45k ops/core/s for fast runs.

This is damning evidence that we are facing a low-level CPU bottleneck. Slow runs aren’t doing additional work, but they are worse at utilizing the CPU.

There are a few possible explanations for poor CPU utilization. Most importantly: unpredictable branches, unnecessary data dependencies and cache misses. In this context, only cache misses make sense because, in both cases, the CPU is executing the same code on the same data. Besides, the output of perf stat shows that the slow case had fewer branch misses overall.


Before we do anything more, let’s disable address space layout randomization to simplify investigation and cross-referencing addresses.

echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

Now, before we try to understand the nature of stalls, let’s try to find them. We can use flamegraphs for that.

git clone https://github.com/brendangregg/FlameGraph
git -C FlameGraph remote add adamnovak https://github.com/adamnovak/FlameGraph
git -C FlameGraph fetch adamnovak
git -C FlameGraph cherry-pick 7ff8d4c6b1f7c4165254ad8ae262f82668c0c13b # C++ template display fix

sudo timeout 10 perf record --call-graph=fp -C8 -o $x.data
sudo perf script -i $x.data > $x.perf
FlameGraph/stackcollapse-perf.pl $x.perf > $x.folded
FlameGraph/flamegraph.pl $x.folded > $x.svg

The “good case”

The “bad case”

Two things stand out:

  1. TCP handling takes proportionally more time in the good case. This is probably because the rest of the code is faster. (Let’s ignore this part of the graph for now.)
  2. The bad case has two wide peaks that aren’t apparent at all in the good case: compact_radix_tree::tree::get_at() and database::apply(). Each takes about 10% of the total work. We should investigate them.

Let’s look at an instruction-level breakdown of those samples:

sudo perf annotate -i $x.data

Apparently, for each of the two suspicious functions, 99% of the time is spent in a single load instruction.

Cross-referencing the assembly and the source code, we see that for compact_radix_tree::tree::get_at(), ee9a74: ldr w10, [x20, #8] is the first instruction that loads data from a tree node when the tree is being searched. That seems like a very reasonable bottleneck: walking a tree is exactly where we would expect cache misses to occur.

However, in database::apply, cbbfa8: ldr w9, [x26, #3720] is the instruction that loads the current log level from memory. (It is then compared with log_level::trace in cbbfc4: cmp w9, #0x4). This is not reasonable: The log level should be perfectly cached. Besides, it is only used in a predictable comparison: The CPU should be able to continue executing the program speculatively while it’s waiting for the result of the load. Very weird. Let’s patch this instruction away and see what happens.

readelf --headers /opt/scylladb/libexec/scylla | grep -A1 -m1 .text

echo 'mov w9, #0' | as -o patch.o
objcopy -O binary patch.o patch.bin
sudo dd of=/opt/scylladb/libexec/scylla seek=$((0xcbbfa8 - 0xba6840 + 0x996840)) if=patch.bin bs=1 conv=notrunc
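The `seek=` arithmetic converts a runtime virtual address into a file offset inside the ELF, using the .text load address (0xba6840) and file offset (0x996840) reported by the readelf command above; those constants are specific to this build. A small helper makes the conversion explicit:

```python
def vaddr_to_file_offset(vaddr: int, sect_vaddr: int, sect_offset: int) -> int:
    """Translate a runtime virtual address into a file offset within an ELF,
    given the containing section's load address and file offset
    (both read from `readelf --headers`)."""
    return vaddr - sect_vaddr + sect_offset

# Values from this run's readelf output (ASLR is disabled, so the
# runtime address of the instruction matches the link-time address).
off = vaddr_to_file_offset(0xcbbfa8, 0xba6840, 0x996840)
assert off == 0xaabfa8   # the seek= value passed to dd
print(hex(off))
```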

Instead of loading the log level from memory, we have hardcoded #0 (log_level::error). Let’s try to get another bad run and see what’s different.

database::apply has disappeared from the graph. Yet…

There seems to be a very slight improvement: IPC has improved from 0.42 to 0.43, and throughput from 470k ops/s to 480k ops/s. But that’s nowhere near the 10% that we supposedly removed. The “bottleneck” has just dispersed, not disappeared.

This doesn’t make sense. What’s going on?


Unfortunately, it seems that cycle flamegraphs won’t help us. Let’s take a step back.

We have already mentioned that cache misses are the only reasonable cause. But there are two possibilities: either a) there are more of them, or b) the penalty is greater. A greater number of cache misses could be caused by very unlucky aliasing. Let’s check if that’s the case.

The CPU’s performance monitoring unit (PMU) gives access to cache miss statistics. We can find their names in perf list and read them (e.g., with perf stat --timeout 10000 -e l2d_cache_refill). But if we search only for events that seem relevant, we might miss something. Let’s just dump all of them.

We write a script that extracts a list of all available PMU events on this architecture from ARM’s documentation. We can print their number and pass them to perf stat. We collect all events with

sudo perf stat --timeout 1000000 -C8 ...events... -x\t 2>&1 | sed 's/<not counted>/0/g'

PMUs have a limited number of hardware counters, so perf can’t count all events at once — it has to multiplex them. This means that results will be approximate. This should not be a problem for us, since we are repeating a single workload. However, let’s use a long timeout to minimize the variance, just in case.

perf stat -x\t produces a tab-separated file. We can load the results into a pretty table:

Looking at all relevant events, we see that the good case has more cache misses on all levels. This likely means that the bad case doesn’t have more misses, but the penalty is greater.

The penalty of misses could be caused by increased contention: Maybe cores are competing for access to the main memory more severely in the bad case? Let’s check what happens when we run ScyllaDB on a different number of cores:

Indeed, the bad run IPC is significantly correlated with the number of used cores: It’s 0.42 for 30 cores, 0.26 for 64 cores. When lowering the number of cores, bad run IPC rises and stabilizes at around 1.0 for 10 cores. For less than 10 cores, bad runs are not visible. The good run IPC is close to 1 for any number of cores.

A very important observation is that all bad runs are hard-bottlenecked at around 500k ops/s, which is reached at around 11 cores. Adding more cores beyond that does not improve it; it only decreases IPC. It is clear that cores are heavily contending for something, but only sometimes. Why? No idea.

Let’s return to the table and take a look at all the other events. Maybe we will find something that happens more often in the bad case. That would be a good candidate for a bottleneck.

There are a few such events:

  • CPU_CYCLES, obviously, because we were doing the measurement for the same amount of time in both cases.
  • LDREX_SPEC “exclusive operation speculatively executed” — but since it happens only 1,000 times per second, it can’t possibly be the cause.
  • EXC_UNDEF “number of undefined exceptions taken locally” — I don’t even know what this means, but it doesn’t seem like a reasonable bottleneck.
  • STALL_BACKEND only supports our suspicion that the CPU is bottlenecked on memory somehow.


REMOTE_ACCESS is suspicious. Why do we need to access the other socket at all? ScyllaDB is NUMA aware — its underlying Seastar framework binds the memory for each shard to the CPU socket where the shard is running. And even if it wasn’t doing that, by default Linux allocates memory for new pages on the socket where the page fault came from. Shards should only be causing page faults in their own memory, so there should be no remote socket accesses. Besides, we are running the benchmarks on 32 cores, all of which are on socket 0. Even if shards shared some memory, it would be on the same socket. Perhaps remote accesses happen in kernel space?

Let’s take a look:

sudo perf top -C8 -e r31

Apparently, only 36% of remote accesses are happening in the kernel, but others are from ScyllaDB! How can this be? Maybe a binding went wrong. Let’s check numa_maps, which shows the NUMA stats and policy for all memory mappings in the process:

sudo cat /proc/$(pgrep -x scylla)/numa_maps

Aha! We forgot that shards are sharing some memory: the static memory. .text, .bss, and .data are used by all shards. Normally, we would expect such memory to be read-only or read-mostly since the Seastar architecture eschews shared atomic variables in favor of per-core dedicated memory for writeable variables. But perhaps we violated this principle.

N0=x N1=y means that x pages in the address range are allocated on node 0 and y pages are allocated on node 1. By cross-referencing readelf --headers /opt/scylladb/libexec/scylla we can determine that .text, .rodata and other read-only sections are on node 0, while .data, .bss and other writable sections are on node 1.
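Those per-node counts are easy to tally with a few lines of Python. The sample line below is made up in the numa_maps format, not taken from the real output:

```python
import re

def node_pages(numa_maps_line: str) -> dict:
    """Extract per-NUMA-node page counts (the N<node>=<pages> fields)
    from one line of /proc/<pid>/numa_maps."""
    return {int(n): int(p) for n, p in re.findall(r"N(\d+)=(\d+)", numa_maps_line)}

# A fabricated line in the numa_maps format, standing in for real output:
line = "028c0000 default anon=512 dirty=512 N0=128 N1=384 kernelpagesize_kB=4"
counts = node_pages(line)
assert counts == {0: 128, 1: 384}   # 128 pages on node 0, 384 on node 1
print(counts)
```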

That’s what remote accesses are coming for. Could that be the cause of performance problems?

We can test this by forcing memory to a given NUMA node by running the executable under numactl. Let’s prepend /usr/bin/numactl --membind 1 to /usr/bin/scylla scylla_args…:

sudo systemctl edit --full scylla-server
sudo systemctl restart scylla-server

Oops, we wanted to bind everything to node 1, but some parts of the executable (.text) are still on node 0. That’s because Linux consults the memory policy only when pages are allocated — but .text is already allocated in the page cache. If we want to force .text to node 1 too, we can stop ScyllaDB, drop the page cache and try again.

sudo systemctl stop scylla-server
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo systemctl start scylla-server

Now everything is on node 1.

Let’s try running the benchmark a few times with everything on node 0 and then with everything on node 1. Aaand… that’s it! Every run with data on node 0 is fast and every run with data on node 1 is slow.

We have learned that remote memory accesses are the bottleneck. Now we have to understand why.

If you are wondering why .data and .bss sometimes land on node 0 and sometimes on node 1: This is determined by the core on which ScyllaDB happens to be started. When ScyllaDB is launched, Linux schedules it on an arbitrary core — sometimes on node 0, sometimes on node 1. During startup, .data and .bss are touched, causing a page fault. In accordance with the default policy, they are allocated on the NUMA node, which contains this core. Only later, ScyllaDB launches shard threads and binds them to cores chosen by the user.

Finding the Source of NUMA Problems

To dig further, we want something more granular than numactl, which causes all memory to be allocated on a given node. We have to use mbind() — a Linux call that allows setting NUMA memory policy for an address range. With the MF_MOVE_ALL flag, it also allows moving already-allocated memory between nodes.

Let’s add a way of asking ScyllaDB to call mbind(). We can modify ScyllaDB’s REST API for that. Since we are too lazy to add a new call, let’s just hijack an existing one:

We have hacked mbind() into a random API call. Now we can

curl http://localhost:10000/column_family/metrics/write_latency/0x028b0000,0x10000

to move arbitrary page ranges between nodes.

Using this ability, we discover that only one page matters: 0x28c0000, which contains .data, .got.plt and the beginning of .bss. When this page is on node 1, the run is slow, even if all other pages are on node 0. When it’s on node 0, the run is fast, even if all other pages are on node 1.

Remote accesses to memory only happen after L2 cache misses. There are two possible causes of cache misses: aliasing and invalidation. If they happen because of aliasing, this means ScyllaDB is naturally accessing enough memory that all important lines can’t fit in the cache. That would be rather hard to deal with. Perhaps it would require re-architecting the program to get rid of global variables.

But maybe we are accidentally invalidating a cache line. If that’s the case, we should be able to find it. But mbind() won’t allow us to test areas more granular than a page, so we have to improvise.

If we could manipulate the layout of the executable, we could move the suspicious area by just enough bytes to split it in half with a page boundary. We can then check which half is bad by sending one half to the remote node (together with the surrounding page).

If we repeat this bisection enough times, we will find the problematic cache line.
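The bisection itself is an ordinary binary search. The sketch below simulates it with a stand-in run_is_slow() predicate and a fabricated culprit offset; in reality each probe means relinking with padding so the half-range crosses a page boundary, moving that half to node 1, and re-running the benchmark:

```python
def find_bad_line(lo: int, hi: int, run_is_slow, line: int = 64) -> int:
    """Binary-search the byte range [lo, hi) for the cache line whose
    placement on the remote node makes the benchmark slow.
    run_is_slow(a, b) answers: 'is the run slow when [a, b) is remote?'"""
    while hi - lo > line:
        mid = (lo + hi) // 2
        if run_is_slow(lo, mid):
            hi = mid          # culprit is in the lower half
        else:
            lo = mid          # culprit is in the upper half
    return lo

# Simulated experiment: pretend the culprit line starts at offset 0x380.
culprit = 0x380
slow = lambda a, b: a <= culprit < b
assert find_bad_line(0x0, 0x1000, slow) == culprit
print(hex(find_bad_line(0x0, 0x1000, slow)))
```

With a 4 KiB page and 64-byte lines, this converges in about six probes, which matches the "repeat this 10 times" order of magnitude mentioned below for the larger suspicious area.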

We can move the suspicious area by stuffing some padding before it. .tm_clone_table seems like a good enough place to do that. We can add an array in .tm_clone_table somewhere in ScyllaDB and recompile it. (By the way, note that our hacked-in mbind API writes something to this array to prevent it from being optimized out. If it wasn’t used, the linker would discard it because ScyllaDB is compiled with -fdata-sections).

Let’s try to pad .got.plt to a page boundary to test this hack.

It works: We can manipulate the layout. Now we have to repeat this 10 times to find the culprit.

The Fix

Eventually, we narrow the search to bytes 0x380 to 0x400 of .bss. We can’t go further because .bss is aligned to 32. Let’s use gdb to see how those bytes are used:

sudo gdb -p $(pgrep -x scylla)
(gdb) watch *0x28d0000
(gdb) watch *0x28d0008
(gdb) watch *0x28d0010
(gdb) watch *0x28d0018
(gdb) continue

When we run the benchmark with those watchpoints, we see that only 0x28d0000 is written to. This happens in line 568 of compact-radix-tree.hh.

And what’s under the problematic address?

(gdb) info symbol 0x28d0000

This explains everything.

nil_root is a special, global tree node used as a guard in tree algorithms. However, this trick had an unintended side effect. node_head_ptr is a pointer that automatically updates the backreference in the target of assignment. Whenever it was assigned to nil_root, it wrote something to a shared cache line. This resulted in internode cache thrashing, which is very costly: according to https://www.anandtech.com/show/16315/the-ampere-altra-review/3, about 2,000 cycles per write!

Special-casing nil_root fixes our performance problem.



I later measured that the problematic assignment to nil_root happens about three times per query.

With 3e9 cycles per second, three invalidations per query and 2e3 cycles per invalidation, we can estimate a bottleneck of 3e9/3/2e3 = 500,000 queries per second. This matches the observed result quite closely.
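Spelling out that estimate ties the microarchitectural penalty directly to the observed throughput ceiling:

```python
# Estimate the throughput ceiling imposed by cross-socket cache-line thrashing.
cycles_per_second = 3e9        # ~3 GHz clock
invalidations_per_query = 3    # measured assignments to nil_root per query
cycles_per_invalidation = 2e3  # cross-socket write penalty (Ampere Altra review)

ceiling = cycles_per_second / (invalidations_per_query * cycles_per_invalidation)
assert ceiling == 500_000      # matches the observed ~500k ops/s hard limit
print(f"{ceiling:,.0f} queries/s")
```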

With full knowledge, we can now understand the cycle flamegraph more. It wasn’t lying: The instructions highlighted by perf annotate really had something special about them: They were loading from the thrashed cache line.

(gdb) info address dblog

The tree node load instruction was so slow because it was loading nil_root. The log-level load was so slow because it happened to be on the same cache line.

Even though the log-level load was used only in a perfectly predictable comparison, speculative execution couldn’t hide it because the latency of NUMA accesses is too high for it to handle. 2,000 cycles is more than enough to exhaust all available space in the reorder buffer, which nowadays is a few hundred instructions.

However, the suspicious load instructions weren’t the bottleneck; when we removed one of them, nothing improved. The real culprit was invisible, and its performance penalty was spread over the entire program.

So one important lesson from this case: a single innocent CPU instruction brought down the performance of the entire system by more than 50%, yet it was impossible to detect by sampling. Invalidating a cache line is fast in itself; the penalty shows up elsewhere and can be hard to connect to its cause.

Join Us at P99 CONF

If you enjoyed this digital sleuthing to find a single bug, you will really love the sorts of detective stories that will be shared at P99 CONF, a free virtual conference that will be held Oct. 19-20. P99 CONF is dedicated to engineers who obsess over P99 percentiles and high-performance, low-latency applications. You can register now at p99conf.io.

The post Internode Cache Thrashing: Hunting a NUMA Performance Bug appeared first on The New Stack.

Read the whole story
171 days ago
i love low-level performance debugging stories
Share this story

930 – The Floppotron 3.0

1 Share

930 points, 90 comments

Read the whole story
171 days ago
Share this story