66 stories

483 – SOC2: The screenshots will continue until security improves

1 Comment

483 points, 229 comments

Read the whole story
36 days ago
Great writeup uf how to handle SOC2
Share this story


1 Comment

Page 2

You can’t perform that action at this time.

You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.

Read the whole story
47 days ago
Nice way to structure a Django app
Share this story

AWS CDK - The Good, The Bad and the Scary

1 Comment

AWS Cloud Development Kit (CDK) has become, in its short history, a very popular infrastructure-as-code tool. It's not too surprising why - it allows engineers to use richer programming languages to define infrastructure, rather than having to use JSON or YAML. With richer languages comes better tooling, better abstractions, and more. My old friend and colleague Gregor Hohpe has used the phrase “Infrastructure-as-actual-Code” to delineate between tools like CDK (and Pulumi) vs CloudFormation (and Terraform).

I've spent time over the last year working on a couple of projects using CDK. I've now appreciated first hand the power that comes from using TypeScript to define a complicated AWS deployment. Overall I think if you have a fairly small team that is owning the application, the infrastructure code of an application, and the production deployment of that application, then CDK is worth strongly considering (I'd use it myself).

However I also still have major concerns with CDK. Some of those are immediately apparent, e.g. the lack of cloud-side support. But many of my concerns relate to longer term operability. I think choosing CDK requires making a trade-off - better efficiency for developers in the short term vs headaches for operations engineers in years to come.

In this article I share my thoughts - both positive and negative - about CDK, for technical decision makers. I hope these ideas will help as you make your own choices about what infrastructure tooling to use.

The Good

Programming Language Superpowers

Besides CDK, the standard way of deploying AWS infrastructure, with only AWS tools, has been to use CloudFormation (perhaps with SAM helping out a little). The problem with first-party use of CloudFormation is that it requires using YAML or JSON as a source language, and neither of these languages are particularly friendly when it comes to building anything of any complexity.

CDK, on the other hand, allows engineers to use JavaScript, TypeScript, Python, Java, or C# to define their infrastructure (with more languages on the way). Because of this writing CDK is much more effective than writing CloudFormation, for the following reasons:

  • Modern editors guide engineers using auto-complete, and catch many bugs early using syntax checking.
  • Standard language structures, like iteration and function calls, allow abstractions to be made within the definition of a “stack”.
  • Library code - shared among multiple “stacks” - is also simple, using standard language techniques, allowing further abstractions to be made.
  • Infrastructure-definition code can be unit tested.

Engineer enthusiasm

Unsurprisingly, engineers tend to prefer writing CDK to writing YAML! And so with the choice of CDK comes good will and a recruiting benefit - both useful for management.

Furthermore this enthusiasm is apparent in the larger community. It's hard for folk to get particularly excited about CloudFormation, but with CDK we're seeing dedicated conferences, open source contributions, and more - all of which support your own team's work.

The Bad

CDK isn't an AWS service in the normal sense - you can't go to the AWS Web Console and pull up the CDK view. This is because CDK is predominantly a client-side abstraction over CloudFormation.

That means while writing CDK is more effective, deploying and debugging CDK at runtime is a different story.

This is immediately obvious when you go to look at a CDK app in the only place it exists in the AWS Web Console - the CloudFormation console. All of your nicely abstracted resources are suddenly flattened into a vast number of more fundamental AWS types, all with hard-to-read generated names. And when things go wrong (things will go wrong) you still need someone on the team that understands CloudFormation.

The problems with CDK not having a cloud-side component don't stop at debugging though. Because it's a client-side tool it puts many more responsibilities on to a team that AWS would usually handle, for example:

  • The deployment environment must be maintained - for development and production. E.g. the correct versions of CDK and programming language environment must be maintained for the lifetime of the application. With a CloudFormation / SAM approach AWS are responsible for providing such an environment.
  • CDK requires a “bootstrap” environment to be set up and maintained, for each AWS sub-account. This has application cross-cutting concerns and requires careful thought.
  • If using JavaScript / TypeScript for CDK you will need to upgrade your Node versions fairly frequently due to Node’s (comparatively) rapid support cycle (Node LTS versions are only supported for 2.5 years). This might be particularly concerning if your application code is written using a language that has longer support (e.g. Java), but you’re using TypeScript as an organizational standard for CDK.
  • Unlike a regular AWS service, the CDK team is already choosing to make major version changes to CDK. This can lead to operations headaches when upgrades need to be applied:

The Scary

As a developer I mostly love CDK. I am way faster with it than I am with CloudFormation.

But if I put my CTO hat on, and think about CDK as a strategic choice for an organization, then things get a lot murkier.

Longer-term Operations Requirements

While the exciting time of a software project is when there's a lot of activity happening, the truth is that most industrial software projects that have any amount of success tend to enter a “maintenance mode” longer-term. Deployments are less frequent, and the set of individuals looking after the app changes over time.

What happens to a CDK application in this context? I have several concerns, beyond those I mentioned in “meta-responsibilities” above.

  • How maintainable is the app's CDK code? Because CDK uses a “regular” programming language it opens up the possibility for heavily engineered solutions - especially in the case of org-wide abstractions. What happens to an application-developer's finely tuned CDK library when the organization ops team changes fundamental requirements about how a company is using AWS, e.g. a change of VPC architecture? Is that developer still employed by the company? And if not which ops engineer is going to need to figure out how the code works?
  • Similarly, how many programming languages are maintenance operations engineers going to be expected to know? Traditionally even if a company has used multiple programming languages for app-dev, from an operations perspective an engineer can typically get by with knowing bash / Python, or PowerShell. But if app-dev teams chose the programming-language-de-jour when building an app will operations engineers now need to know Python, JavaScript, Java, Go, and whatever other languages CDK enables, in order to support an organization’s portfolio?
  • What happens for applications for which a team doesn't regularly update the CDK environments, e.g. those that go 3 years between deployments? Will they be non-deployable without significant work because of changes to the CDK bootstrap in their account? Will the version of Node they use even still be available? What happens in the case of an urgent need to deploy due to a security issue?

AWS support of CDK over the long term

As I've already described, CDK isn't an “AWS service” in the normal sense - there's no CDK API, there's no CDK section of your AWS bill. While this may change over time, if it doesn't that begs the question - to what extent will AWS continue to support CDK? Sure, CDK does have an official marketing page, but SAR has one of them too and it's been given the cold shoulder over the last couple of years (and SAR at least has a cloud component.)

This breaks out in a few ways.

  • How stable will CDK be over time? We've already seen a major upgrade from V1 to V2, and V1 will no longer receive security updates from 2023. Such a rapid sunsetting is unusual in the AWS ecosystem.
  • Will CDK continue to receive significant investment from AWS? Since there's no cloud-side element to CDK we can imagine that it might more-easily get mothballed if it loses favor. That's a problem for something that will need continued updates to maintain support for its underpinnings (e.g. Node LTS updates, again.)
  • On a related matter, will CDK continue to support new AWS services and features in a timely manner?
  • Will the existing problems with CDK (e.g. the lack of runtime support in-cloud) be addressed?

Should you use CDK or not?

Over the last year I’ve heard of a lot of companies switching to CDK, largely I suspect because senior developers want to use it, and are making their voices heard. And since it’s an “officially supported” AWS product there’s a certain expectation that it will be maintained over time to the same extent that EC2 and S3 are.

But because of the fundamental nature of CDK being a client-side tool, and not a cloud-side tool, I think this expectation is false, or at least is yet to be proven. This brings up all the longer-term concerns I’ve mentioned here.

Whether you should use CDK or not, therefore, depends on how the concerns here could impact you over time. For example if you are using CDK on a small number of applications, where the applications themselves are deployed frequently, and the people on team handling deployment are the same as those who are building the app itself, then I think CDK is a good idea. Again, I would use CDK myself for this kind of organizational setup.

However I would likely not use CDK in the following scenarios:

  • Where an application is deployed infrequently / is already “in maintenance mode”
  • Where an application-development team is a separate team from the one responsible for production deployment

I do hope that over time AWS makes CDK the stable, supported, standard tool for infrastructure deployment. But for now the choice to use CDK is one that favors developers in the short term, while I fear giving a sizable headache to Ops teams in the long term, without significant ongoing maintenance.

Read the whole story
54 days ago
Interesting and balanced reflection on CDK usage
Share this story

Internode Cache Thrashing: Hunting a NUMA Performance Bug

1 Comment

ARM-based computers continue to make inroads across both personal computing as well as cloud server spaces, from the ARM-based MacBooks you can use during development to the AWS Graviton2-based instances that provide better price-performance than similar Intel x86-based instances. But Amazon isn’t the only cloud provider with ARM-based instances. Oracle Cloud offers the Ampere Altra A1, which scales to 80 cores per CPU and runs at speeds up to 3.3 GHz.

Michał Chojnowski
Michał is a software engineer at ScyllaDB. He has an undergraduate degree from the University of Warsaw, where he became involved with the implementation of Parquet in ScyllaDB.

I discovered a curious problem while porting ScyllaDB — a high-performance, low-latency database that’s designed to extract every possible ounce of performance out of modern infrastructure — to run on the ARM-based Ampere A1 server. We’ll share full performance benchmarking results of running ScyllaDB on such a big beefy server in due time. But first, I want to share the results of my troubleshooting to alert other developers to the kinds of issues you might encounter when releasing code for an ARM-based platform.

Spoiler alert: The issue encountered was not actually due to the ARM-based Ampere Altra platform. So what was the problem?

(If you enjoy hunting down perplexing performance mysteries, register for P99 CONF — a free virtual conference dedicated to all things performance.)

The Problem

When testing ScyllaDB’s performance on Oracle Cloud’s new ARM machines (the bare metal Ampere A1 with 80 ARM cores), I noticed that some test runs were significantly slower than expected.

I repeated the benchmark many times and determined that its behavior was bimodal: either ScyllaDB was running at full, expected throughput (around 45k writes per core per second) or at 40% throughput — never in between. The slow runs were happening about as often as the fast runs, but fast and slow runs were interleaved randomly, without any discernible pattern.

On a good run, shards were able to sustain ~50,000 write operations per second per shard.

On the same server, on a different run of the same benchmark, shards were only attaining ~16,000 write operations per second per shard — less than 40% of the performance of the good runs.

Without reading further, do you already have a hunch as to the answer? Make a mental note, and then let’s see what we discovered.

Hardware Utilization

When looking for any bottleneck, it’s always good to start by checking resource utilization. Resources that are utilized at 100% are bottleneck candidates. If no resource is utilized at 100%, there’s either a client-side problem or a scheduling problem (for instance, a thread is sleeping even though resources are available). In ScyllaDB’s case, the main constraining resources we look at are CPU, RAM, network and disk.

The benchmark that exposed the problem was specifically meant to stress the CPU:

cassandra-stress write duration=1h cl=QUORUM -pop dist=UNIFORM\(1..100000\) -mode native cql3 maxPending=1024 -rate threads=1000 -node

With the dataset this small, all writes should be happening in RAM; none should be flushed to disk. We disabled the commitlog for this test, so the disk should have been silent.

Looking at cassandra-stress’s default schema, we expected network traffic of about 200 bytes per query, which (given 50k ops/core/second) adds up to about 10MiB/core/s. For this benchmark, I happened to be running ScyllaDB on 32 cores (I didn’t have enough client nodes available in this network to use all 160 cores), so that’s a total of 320MiB/s, or about 2.6Gib/s. This is nowhere near the advertised throughput of 100Gib/s, so we wouldn’t expect the network to be the problem.

Let’s check the metrics:

Network performance

Disk performance

As expected, the disk is almost silent, the network traffic is within expectations and way below its maximum capacity, and the CPU load is 100%. (Not for all shards, but this is expected. We asked cassandra-stress for uniform load distribution, so slightly faster shards are slightly less utilized. The slowest shard is the bottleneck.)

This clearly seems like a CPU bottleneck. But if we want to double-check that the network is fine, we can just run the benchmark clients on the same machine. I did that, and the throughput remained exactly the same. So let’s move on to diagnosing the CPU.

Basic CPU Stats

There are two possibilities: either a) the slow runs are doing more work per query or b) the CPU is doing the work more slowly. To find out which one is happening, we start by looking at IPC (instructions-per-cycle). A decreased IPC will mean that the CPU is doing the work more slowly.

sudo perf stat -C8 --timeout 10000

We find an IPC is 0.42 for slow runs and 0.98 for fast runs. This is fairly close to the throughput ratio: ~16k/core/s for slow runs and ~45k/core/s for fast runs.

This is damning evidence that we are facing a low-level CPU bottleneck. Slow runs aren’t doing additional work, but they are worse at utilizing the CPU.

There are a few possible explanations for poor CPU utilization. Most importantly: unpredictable branches, unnecessary data dependencies and cache misses. In this context, only cache misses make sense because, in both cases, the CPU is executing the same code on the same data. Besides, the output of perf stat shows that the slow case had fewer branch misses overall.


Before we do anything more, let’s disable address space layout randomization to simplify investigation and cross-referencing addresses.

echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

Now, before we try to understand the nature of stalls, let’s try to find them. We can use flamegraphs for that.

git clone https://github.com/brendangregg/FlameGraph
git -C FlameGraph remote add adamnovak https://github.com/adamnovak/FlameGraph
git -C FlameGraph fetch adamnovak
git -C FlameGraph cherry-pick 7ff8d4c6b1f7c4165254ad8ae262f82668c0c13b # C++ template display fix

sudo timeout 10 perf record --call-graph=fp -C8 -o $x.data
sudo perf script -i $x.data > $x.perf
FlameGraph/stackcollapse-perf.pl $x.perf > $x.folded
FlameGraph/flamegraph.pl $x.folded > $x.svg

The “good case”

The “bad case”

Two things stand out:

  1. TCP handling takes proportionally more time in the good case. This is probably because the rest of the code is faster. (Let’s ignore this part of the graph for now.)
  2. The bad case has two wide peaks that aren’t apparent at all in the good case: compact_radix_tree::tree::get_at() and database::apply(). Each takes about 10% of the total work. We should investigate them.

Let’s look at an instruction-level breakdown of those samples:

sudo perf annotate -i $x.data

Apparently, for each of the two suspicious functions, 99% of the time is spent in a single load instruction.

Cross-referencing the assembly and the source code, we see that for compact_radix_tree::tree::get_at() that ee9a74: ldr w10, [x20, #8] is the first instruction that loads something from a tree node to memory when the tree is being searched. That seems like a very reasonable bottleneck: Walking a tree is exactly where we would expect cache misses to occur.

However, in database::apply, cbbfa8: ldr w9, [x26, #3720] is the instruction that loads the current log level from memory. (It is then compared with log_level::trace in cbbfc4: cmp w9, #0x4). This is not reasonable: The log level should be perfectly cached. Besides, it is only used in a predictable comparison: The CPU should be able to continue executing the program speculatively while it’s waiting for the result of the load. Very weird. Let’s patch this instruction away and see what happens.

readelf --headers /opt/scylladb/libexec/scylla | grep -A1 -m1 .text

echo 'mov w9, #0' | as -o patch.o
objcopy -O binary patch.o patch.bin
sudo dd of=/opt/scylladb/libexec/scylla seek=$((0xcbbfa8 - 0xba6840 + 0x996840)) if=patch.bin bs=1 conv=notrunc

Instead of loading the log level from memory, we have hardcoded #0 (log_level::error). Let’s try to get another bad run and see what’s different.

database::apply has disappeared from the graph. Yet…

There seems to be a very slight improvement: IPC has improved from 0.42 to 0.43, and throughput from 470k ops/s to 480k ops/s. But that’s nowhere near the 10% that we supposedly removed. The “bottleneck” has just dispersed, not disappeared.

This doesn’t make sense. What’s going on?


Unfortunately, it seems that cycle flamegraphs won’t help us. Let’s take a step back.

We have already mentioned that cache misses are the only reasonable cause. But there are two possibilities: either a) there are more of them, or b) the penalty is greater. A greater amount of cache misses could be caused by very unlucky aliasing. Let’s check if that’s the case.

The CPU’s performance monitoring unit (PMU) gives access to cache miss statistics. We can find their names in perf list and read them (e.g., with 1perf stat --timeout 10000 -e l2d_cache_refill). But if we search only for events that seem relevant, we might miss something. Let’s just dump all of them.

We write a script that extracts a list of all available PMU events on this architecture from ARM’s documentation. We can print their number and pass them to perf stat. We collect all events with

sudo perf stat --timeout 1000000 -C8 ...events... -x\t 2>&1 | sed 's/<not counted>/0/g'

PMUs have a limited number of hardware counters, so perf can’t count all events at once — it has to multiplex them. This means that results will be approximate. This should not be a problem for us, since we are repeating a single workload. However, let’s use a long timeout to minimize the variance, just in case.

perf stat -x\t produces a tab-separated file. We can load the results into a pretty table:

Looking at all relevant events, we see that the good case has more cache misses on all levels. This likely means that the bad case doesn’t have more misses, but the penalty is greater.

The penalty of misses could be caused by increased contention: Maybe cores are competing for access to the main memory more severely in the bad case? Let’s check what happens when we run ScyllaDB on a different number of cores:

Indeed, the bad run IPC is significantly correlated with the number of used cores: It’s 0.42 for 30 cores, 0.26 for 64 cores. When lowering the number of cores, bad run IPC rises and stabilizes at around 1.0 for 10 cores. For less than 10 cores, bad runs are not visible. The good run IPC is close to 1 for any number of cores.

A very important observation is that all bad runs are hard-bottlenecked at around 500k ops/s, which is reached at around 11 cores. Adding more cores beyond that does not improve it; it only decreases IPC. It is clear that cores are heavily contending for something, but only sometimes. Why? No idea.

Let’s return to the table and take a look at all the other events. Maybe we will find something that happens more often in the bad case. That would be a good candidate for a bottleneck.

There are a few such events:

  • CPU_CYCLES, obviously, because we were doing the measurement for the same amount of time in both cases.
  • LDREX_SPEC “exclusive operation speculatively executed” — but since it happens only 1,000 times per second, it can’t possibly be the cause.
  • EXC_UNDEF “number of undefined exceptions taken locally” — I don’t even know what this means, but it doesn’t seem like a reasonable bottleneck.
  • STALL_BACKEND only supports our suspicion that the CPU is bottlenecked on memory somehow.


REMOTE_ACCESS is suspicious. Why do we need to access the other socket at all? ScyllaDB is NUMA aware — its underlying Seastar framework binds the memory for each shard to the CPU socket where the shard is running. And even if it wasn’t doing that, by default Linux allocates memory for new pages on the socket where the page fault came from. Shards should only be causing page faults in their own memory, so there should be no remote socket accesses. Besides, we are running the benchmarks on 32 cores, all of which are on socket 0. Even if shards shared some memory, it would be on the same socket. Perhaps remote accesses happen in kernel space?

Let’s take a look:

sudo perf top -C8 -e r31

Apparently, only 36% of remote accesses are happening in the kernel, but others are from ScyllaDB! How can this be? Maybe a binding went wrong. Let’s check numa_maps, which shows the NUMA stats and policy for all memory mappings in the process:

sudo cat /proc/$(pgrep -x scylla)/numa_maps

Aha! We forgot that shards are sharing some memory: the static memory. .text, .bss, and .data are used by all shards. Normally, we would expect such memory to be read-only or read-mostly since the Seastar architecture eschews shared atomic variables in favor of per-core dedicated memory for writeable variables. But perhaps we violated this principle.

N0=x N1=y means that x pages in the address range are allocated on node 0 and y pages are allocated on node 1. By cross-referencing readelf --headers /opt/scylladb/libexec/scylla we can determine that .text, .rodata and other read-only sections are on node 0, while .data, .bss and other writable sections are on node 1.

That’s what remote accesses are coming for. Could that be the cause of performance problems?

We can test this by forcing memory to a given NUMA node by running the executable under numactl. Let’s prepend /usr/bin/numactl --membind 1 to /usr/bin/scylla scylla_args…:

sudo systemctl edit --full scylla-server
sudo systemctl restart scylla-server

Oops, we wanted to bind everything to node 1, but some parts of the executable (.text) are still on node 0. That’s because Linux consults the memory policy only when pages are allocated — but .text is already allocated in the page cache. If we want to force .text to node 1 too, we can stop ScyllaDB, drop the page cache and try again.

sudo systemctl stop scylla-server
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo systemctl start scylla-server

Now everything is on node 1.

Let’s try running the benchmark a few times with everything on node 0 and then with everything on node 1. Aaand… that’s it! Every run with data on node 0 is fast and every run with data on node 1 is slow.

We have learned that remote memory accesses are the bottleneck. Now we have to understand why.

If you are wondering why .data and .bss sometimes land on node 0 and sometimes on node 1: This is determined by the core on which ScyllaDB happens to be started. When ScyllaDB is launched, Linux schedules it on an arbitrary core — sometimes on node 0, sometimes on node 1. During startup, .data and .bss are touched, causing a page fault. In accordance with the default policy, they are allocated on the NUMA node, which contains this core. Only later, ScyllaDB launches shard threads and binds them to cores chosen by the user.

Finding the Source of NUMA Problems

To dig further, we want something more granular than numactl, which causes all memory to be allocated on a given node. We have to use mbind() — a Linux call that allows setting NUMA memory policy for an address range. With the MF_MOVE_ALL flag, it also allows moving already-allocated memory between nodes.

Let’s add a way of asking ScyllaDB to call mbind(). We can modify ScyllaDB’s REST API for that. Since we are too lazy to add a new call, let’s just hijack an existing one:

We have hacked mbind() into a random API call. Now we can

curl http://localhost:10000/column_family/metrics/write_latency/0x028b0000,0x10000

to move arbitrary page ranges between nodes.

Using this ability, we discover that only one page matters: 0x28c0000, which contains .data, .got.plt and the beginning of .bss. When this page is on node 1, the run is slow, even if all other pages are on node 0. When it’s on node 0, the run is fast, even if all other pages are on node 1.

Remote accesses to memory only happen after L2 cache misses. There are two possible causes of cache misses: aliasing and invalidation. If they happen because of aliasing, this means ScyllaDB is naturally accessing enough memory that all important lines can’t fit in the cache. That would be rather hard to deal with. Perhaps it would require re-architecting the program to get rid of global variables.

But maybe we are accidentally invalidating a cache line. If that’s the case, we should be able to find it. But mbind() won’t allow us to test areas more granular than a page, so we have to improvise.

If we could manipulate the layout of the executable, we could move the suspicious area by just enough bytes to split it in half with a page boundary. We can then check which half is bad by sending one half to the remote node (together with the surrounding page).

If we repeat this bisection enough times, we will find the problematic cache line.

We can move the suspicious area by stuffing some padding before it. .tm_clone_table seems like a good enough place to do that. We can add an array in .tm_clone_table somewhere in ScyllaDB and recompile it. (By the way, note that our hacked-in mbind API writes something to this array to prevent it from being optimized out. If it wasn’t used, the linker would discard it because ScyllaDB is compiled with -fdata-sections).

Let’s try to pad .got.plt to a page boundary to test this hack.

It works: We can manipulate the layout. Now we have to repeat this 10 times to find the culprit.

The Fix

Eventually, we narrow the search to bytes 0x3800x400 of .bss. We can’t go further because .bss is aligned to 32. Let’s use gdb to see how those bytes are used:

sudo gdb -p (pgrep -x scylla)
(gdb) watch *0x28d0000
(gdb) watch *0x28d0008
(gdb) watch *0x28d0010
(gdb) watch *0x28d0018
(gdb) continue

When we run the benchmark with those watchpoints, we see that only 0x28d0000 is written to. This happens in line 568 of compact-radix-tree.hh:

And what’s under the problematic address?

(gdb) info symbol 0x28d0000

This explains everything.

nil_root is a special, global tree node used as a guard in tree algorithms. However, this trick had an unintended side effect. node_head_ptr is a pointer that automatically updates the backreference in the target of assignment. Whenever it was assigned to nil_root, it wrote something to a shared cache line. This resulted in internode cache thrashing, which is very costly: according to https://www.anandtech.com/show/16315/the-ampere-altra-review/3, about 2,000 cycles per write!

Special casing nil_root fixes our performance problem:



I later measured that the problematic assignment to nil_root happens about three times per query.

With 3e9 cycles per second, three invalidations per query and 2e3 cycles per invalidation, we can estimate a bottleneck of 3e9/3/2e3 = 500,000 queries per second. This matches the observed result quite closely.

With full knowledge, we can now understand the cycle flamegraph more. It wasn’t lying: The instructions highlighted by perf annotate really had something special about them: They were loading from the thrashed cache line.

(gdb) info address dblog

The tree node load instruction was so slow because it was loading nil_root. The log-level load was so slow because it happened to be on the same cache line.

Even though the log-level load was used only in a perfectly predictable comparison, speculative execution couldn’t hide it because the latency of NUMA accesses is too high for it to handle. 2,000 cycles is more than enough to exhaust all available space in the reorder buffer, which nowadays is a few hundred instructions.

However, the suspicious load instructions weren’t the bottleneck; when we removed one of them, nothing improved. The real culprit was invisible, and its performance penalty was spread over the entire program.

So one important lesson from this case is this: a single innocent CPU instruction brought down the performance of the entire system by more than 50%, yet it was impossible to detect by sampling. Invalidating cache is fast in itself; the penalty is visible elsewhere and can be hard to connect to the cause.

Join Us at P99 CONF

If you enjoyed this digital sleuthing to find a single bug, you will really love the sorts of detective stories that will be shared at P99 CONF, a free virtual conference that will be held Oct. 19-20. P99 CONF is dedicated to engineers who obsess over P99 percentiles and high-performance, low-latency applications. You can register now at p99conf.io.

The post Internode Cache Thrashing: Hunting a NUMA Performance Bug appeared first on The New Stack.

Read the whole story
64 days ago
i love low-level performance debugging stories
Share this story

930 – The Floppotron 3.0

1 Share

930 points, 90 comments

Read the whole story
64 days ago
Share this story

Handling Concurrency Without Locks

1 Comment

Concurrency is not very intuitive. You need to train your brain to consider what happens when multiple processes execute a certain code block at the same time. There are several issues I often encounter:

  1. Failing to recognize potential concurrency issues: It's not uncommon for both beginner and seasoned developers to completely miss a potential concurrency problem. When this happens, and the concurrency issue end up causing bugs, it's usually very hard to trace and debug.

  2. Dismiss concurrency issues due to low likelihood: If you recognized a potential concurrency issue, at some point you probably thought to yourself "what are the chances if this happening...". It's very tempting to dismiss concurrency issues when the likelihood is low. However, I personally found that concurrency issues tend to creep up at the worst time - when your system is under significant load and you have very little time (and grace) to come up with a solution.

  3. Abusing locks: If you recognized a potential issue and decided to handle it properly, your next step will usually involve some kind of lock. Sometimes locks are necessary, but more often than not they can be avoided, or replaced by more permissive locks.

In this article I present common concurrency challenges and how to overcome them with minimal locking.

Other type of race...
Other type of race...

Table of Contents

A Simple Web Application

A URL shortener provides short URLs that redirects to other URLs. There are several reasons why you would want that:

  1. Include links in space-constrained places: Links in SMS messages, Tweets etc.
  2. Track the number of clicks: Ads, campaigns, newsletter links, promotional emails etc.

To most developers a URL shortener sounds like a straight forward project. This is why it makes a great example to demonstrate common concurrency issues, and how easy they are to miss and get wrong. I'm using Python, Django and PostgreSQL, but the concepts apply to any programming language and RDBMs.

⚙️ Django Project setup

To build your URL shortener start by creating a new Django project:

$ python -m venv venv
$ source venv/bin/activate
$ pip install django
$ django-admin startproject project
$ cd project

This will create a Python virtual environment, install the latest Django version and create a Django project named project.

For the URL shortener, add a new app called shorturl in the project:

$ python manage.py startapp shorturl

Next, register the new app in settings.py:

# settings.py
    # ...

Django uses SQLlite by default. If you want to configure Django to use PostgreSQL instead, install psycopg and create a database:

$ pip install psycopg2
$ createdb -O yourdbuser shorturl

Then edit the following parameters in settings.py:

# settings.py
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "shorturl",
        "USER": "yourdbuser",

Finally, run the initial migrations:

$ python manage.py migrate

You are now ready for the interesting part!

Naive Implementation

A short URL is composed of a short unique identifier that points to some target URL, and a counter to keep track of the number of hits. A Django model for a short URL can look like this:

from django.db import models

class ShortUrl(models.Model):
    key: str = models.CharField(max_length=20, unique=True)
    target_url: str = models.URLField()
    hits: int = models.PositiveIntegerField()

A simple function to create a new short URL can look like this:

import secrets, string

CHARACTERS = string.ascii_letters + string.digits

class ShortUrl(models.Model):

    # ...

    def create(cls, target_url: str) -> ShortUrl:
        key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
        return cls.objects.create(key=key, target_url=target_url, hits=0)

The function accepts a target URL, generates a random key from a list of possible characters and saves a new ShortUrl to the database. You can now use this function to create a new short URL:

>>> shorturl = ShortURL.create('https://hakibenita.com')
>>> vars(shorturl)
{'_state': <django.db.models.base.ModelState at 0x7fd5e05558a0>,
 'id': 1,
 'created_at': datetime.datetime(2022, 4, 29, 8, 2, 18, 615165, tzinfo=datetime.timezone.utc),
 'key': 'c6UFG',
 'target_url': 'https://hakibenita.com',
 'hits': 0}

This seems innocent enough, so what can possibly go wrong?

Handling Possible Collisions

Say your URL shortener becomes a wild success and you have millions of new short URLs created every day. At some point, the function that generates the random key may produce a key that already exist:

>>> shorturl = ShortUrl.create('https://hakibenita.com/tag/django')
IntegrityError: duplicate key value violates unique constraint "shorturl_shorturl_key_uk"
DETAIL:  Key (key)=(c6UFG) already exists.

Notice that the function generated the random key c6UFG which is similar to a short URL you previously created. The key column has a unique constraint defined on it, so you got a database error.

In order to avoid a unique constraint violation, you might try to check the key in advance, like this:

class ShortUrl(models.Model):

    def create(cls, target_url: str) -> ShortUrl:
        while True:
            key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
            if not cls.objects.filter(key=key).exists():
        return cls.objects.create(key=key, target_url=target_url, hits=0)

Instead of creating the short URL immediately, you first check that the random key you generated does not already exist. If you find that is does, you keep iterating on random keys until you find one that doesn't.

Aside from the fact that this function can now potentially go into an infinite loop, there is another problem!

Time-of-check to Time-of-use

The purpose of checking that the key does not exist in advance was to avoid a database error, but is it really impossible to end up with a unique constraint violation this way? Consider the following scenario:

Process 1 Process 2
Check key c6UFG -> ✅
Check key c6UFG -> ✅
Use Key c6UFG
Use key c6UFG -> 💥 Already exists

Let's break it down:

  1. Process #1 generates the random key c6UFG and checks that it does not already exist
  2. Before Process #1 has a chance to write the new shorturl to the database, process #2 generates the same key c6UFG and checks that it does not already exist. At this point, it doesn't!
  3. Process #2 writes the short URL to the database and completes successfully
  4. Process #1 now tries to save a short URL with the same key c6UFG, which it checked in advance, and fails with a unique constraint violation

This is a very common concurrency issue commonly referred to as "time-of-check to time-of-use", or "TOCTOU". The name describes the issue pretty well - the problem is caused when another process changes the data between the time a process checked a value until the time it used it. In this case, Process #2 added a short URL with the same key after Process #1 had checked it, but before it used it.


Whenever you have a problem with multiple processes accessing the same resource at the same time, the most intuitive solutions is a lock. But, what exactly should you lock?

A simple architecture for a web application usually contains a web application process and a database:

Single process
Single process

If that's your setup, you might be able to obtain a lock at the process level and make sure you are the only one accessing a certain resource at a time.

However, a common setup for Django (as well as other web applications) is to run with multiple worker processes:

Multiple worker processes
Multiple worker processes

With multiple worker processes on the same server it's no longer enough to lock a resource within a single process. However, since the two processes are running on the same server, you might be tempted to try and find a solution at the operating system level. But, is it enough?

Multiple servers
Multiple servers

If your system is running on multiple servers, with multiple worker processes on each one, even the OS can't save you. You now might think the lowest common denominator is the application itself. You can spend days trying to come up with an original way to coordinate a lock between the servers, but would that solve your problem?

Multiple applications
Multiple applications

Modern systems can run multiple applications on top of the same database. We for example, do it with Django admin.

At this point it becomes clearer that the lowest common denominator, the resource that all servers, processes and applications share, is the database. If you want to "lock" a resource, you better do it in the database.

Lock in the Database

Now that you know where to lock, there is another challenge. If you were updating an existing row in the database you could have locked that specific row, but this is not the case. You want to create a new row, for a new short URL, so what can you possibly lock?

One option is to lock the entire table:

from django.db import connection, transaction

class ShortUrl(models.Model):

    def create(target_url: str) -> ShortUrl:
        with transaction.atomic(), connection.cursor() as cursor:
            cursor.execute('LOCK TABLE shorturl_shorturl IN EXCLUSIVE MODE;')
            while True:
                key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
                if not cls.objects.filter(key=key).exists():
            return cls.objects.create(key=key, target_url=target_url, hits=0)

To prevent other processes from creating short URLs and potentially causing a unique constraint violation, you locked the entire shorturl table.

First, to obtain a lock in the database you need to operate inside a database transaction:

with transaction.atomic():

In most cases, Django operates in "autocommit" mode, meaning it implicitly opens a transaction for every command, and commits immediately after. In this case you want to execute multiple commands in the same database transaction, so you explicitly control the transaction using the transaction.atomic context.

The Django ORM does not provide functions for locking tables. To lock a table in the database you need to use a cursor and execute raw SQL:

with transaction.atomic(), connection.cursor() as cursor:
    cursor.execute('LOCK TABLE shorturl_shorturl IN EXCLUSIVE MODE;')

After you obtained an exclusive lock on the table, no other transaction can obtain the same lock until you release it. This guarantees that the data cannot change between the time you checked the values and the time you used them.

So, problem solved?

Asking for Forgiveness

Imagine your app makes it to the top page of a very popular news site. In a matter of minutes you start getting thousands of hits, and users are creating thousands of new short URLs. Now imagine your system can only create a single short URL from a single user at a time, and all other users need to wait in line. Is that acceptable?

When you locked the entire table you made sure no other transaction can make changes to the table until you are done. This means adding new short URLs is now safe, but it can also get pretty slow when you have many concurrent requests.

What if you could ditch the lock? What if you could make your function safe without preventing multiple users from creating short URLs at the same time? Have another look at the exception you got in the beginning:

>>> shorturl = ShortUrl.create('https://hakibenita.com/tag/django')
IntegrityError: duplicate key value violates unique constraint "shorturl_shorturl_key_uk"
DETAIL:  Key (key)=(c6UFG) already exists.

When the function attempted to create a short URL with a key that already existed, the database returned an error. This is because the key column has a unique constraint defined on it. So, if the database is already making sure that there are no duplicates, why not rely on it?

With this in mind, you can now change your function to handle the unique constraint violation:

from django.db import IntegrityError

class ShortUrl(models.Model):

    def create(cls, target_url: str) -> ShortUrl:
        while True:
            key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
                return ShortUrl.objects.create(key=key, target_url=target_url, hits=0)
            except IntegrityError:
                # Key exists, try try again!

The function now generates a random key and attempts to create a new short URL. If the command fails with an IntegrityError it means a short URL with the same key already exists. In this case, the function will generate another random key and try again until it succeeds.

The first thing you might notice is that there is no explicit lock, and there is no explicit database transaction. Multiple processes can now safely create short URLs without getting an IntegrityError.

According to this quote from the Python glossary, this approach is also much more "pythonic":

Common Python coding style [ ... ] This clean and fast style is characterized by the presence of many try and except statements.

The term EAFP stands for "Easier to ask for forgiveness than permission", and this is exactly what you did. Instead of checking in advance, you tried to write the row to the database and handled the exception. The opposite of EAFP is LBYL, "Look Before You Leap", which is what you did when you checked the values in advance.

Python as a language encourages "asking or forgiveness" (EAFP), as opposed to other languages such as JavaScript or Java that encourage you to check values in advance, the LBYL style.

Asking for Forgiveness in PostgreSQL

Sometimes it's necessary to generate short URLs as part of another transaction. For example, if you include the short URL in a notification you save to the database, your code can look like this:

>>> with transaction.atomic():
...     shorturl = ShortUrl.create('https://hakibenita.com')
...     notification = Notification.create(to='...', message='...')

To make sure both the notification and the short URL are created together, you execute the code inside a database transaction. This is a perfectly valid use of database transactions - this is exactly what they are for!

If you use PostgreSQL, under some circumstances you might encounter this error:

InternalError: current transaction is aborted, commands ignored until end of transaction block

In PostgreSQL, when there is an error during a transaction, all other commands are aborted until the end of the transaction. Now recall how ShortUrl.create is implemented - you iterate over random keys and attempt to create until you don't get an IntegrityError. This means that if you execute your function inside a transaction and it did trigger an IntegrityError, PostgreSQL will abort the transaction, and you won't be able to proceed with the rest of the code.

To accommodate possible exceptions within transaction in PostgreSQL, you can use another transaction:

from django.db import transaction, IntegrityError

class ShortUrl(models.Model):

    def create(cls, target_url: str) -> ShortUrl:
        while True:
            key = ''.join((secrets.choice(CHARACTERS) for _ in range(5)))
                # In PostgreSQL, when an SQL command fails (usually due to
                # IntegrityError) it is not possible to execute other commands
                # until the end of the atomic block. To be able to retry different
                # keys multiple times after, we execute the command in its own
                # atomic block.
                # https://docs.djangoproject.com/en/4.0/topics/db/transactions/#handling-exceptions-within-postgresql-transactions
                with transaction.atomic():
                    return ShortUrl.objects.create(key=key, target_url=target_url, hits=0)
            except IntegrityError:
                # Key exists, try try again!

The additional transaction introduces very little overhead, it simply restricts the effect a possible exception can have on any outer transactions.

Identifying Race Conditions

Now that you have all of these short URLs in your system, it's time to actually use them. The URL shortener system includes a view that redirects a short URL to its target URL, and increments the counter. The view can look like this:

# views.py
from django.http import HttpRequest, HttpResponse, HttpResponseRedirect, Http404
from django.views.decorators.http import require_http_methods

from .models import ShortUrl

def resolve_short_url(request: HttpRequest, key: str) -> HttpResponse:
        shorturl = ShortUrl.resolve(key)
    except ShortUrl.DoesNotExist:
        raise Http404()
    return HttpResponseRedirect(shorturl.target_url)

The view attempts to "resolve" a key to a ShortURL instance, and if it finds one, redirects to it.

A naive implementation of the function ShortUrl.resolve can look like this:

class ShortUrl(models.Model):

    def resolve(cls, key: str) -> ShortUrl:
        shorturl = cls.objects.get(key=key)
        shorturl.hits += 1
        return shorturl

The function accepts the key as argument and attempts to find a short URL with that key. If the key is not found, .get(key=key) throws a ShortUrl.DoesNotExist exception and the view will return a 404 response. If a short URL is found, the hit counter is incremented and saved, the object is returned to the view and the user is redirected to the target URL.

So, where is the problem?

As you already experienced when you created the short URL, concurrency issues often require special attention. Imagine what will happen if multiple users are trying to resolve the same key at the same time. Consider the following scenario:

Process 1 Process 2 Hits
Select hits -> 0 0
Select hits -> 0 0
Update hits -> 1 1
💥 Update hits -> 1 1

In this scenario, two processes are resolving the same URL at the same time. The short URL is resolved twice, but the hits counter is 1. This is incorrect!

Select for Update

The problem with the naive implementation is that when multiple concurrent processes resolve the same short URL at the same time, the counter can get out of sync. Each of the processes is incrementing the counter based on the value it received when it fetched the row from the database, and that's how the counter gets out of sync.

What if you could lock the row to prevent multiple processes from selecting and updating it at the same time? Consider the following implementation:

from django.db import transaction

class ShortUrl(models.Model):

    def resolve(cls, key: str) -> ShortUrl:
        with transaction.atomic():
            shorturl = (
            shorturl.hits = shorturl.hits + 1
        return shorturl

The function now opens a database transaction, and uses select_for_update to lock the row. If the lock is obtained, other processes cannot obtain the same lock on the row until the transaction is finished. This means the counter can no longer get out of sync, because only a single process can fetch and update it at the same time. But it also means that any concurrent processes must either wait or fail.

Imagine you launch a big campaign to hundreds of thousands of users. To check how effective your campaign is, you use your URL shortener to keep track of how many users clicked the links. Immediately when you send out the campaign it lands in your users' emails and thousands of them click the links. Now imagine each user needs to wait in line until the previous one fetched the row and updated the hit counter in the database. Sounds like it might be a problem...

Increment in the Database

Using select_for_update you solved the problem of the hit counter going out of sync, but your system is now doing a very poor job at the one thing it should be doing very well - redirect short URLs!

The main issue with the previous approach is that you incremented the counter based on what you fetched. With many concurrent processes operating at the same time, it is very much possible that since you fetched the row, the counter was incremented multiple times by other processes and you just don't know about it.

What if instead of incrementing the counter based on what you have stored in memory, you instruct the database to update based on what it currently has stored?

from django.db.models import F

class ShortUrl(models.Model):

    def resolve(cls, key: str) -> ShortUrl:
        shorturl = cls.objects.get(key=key)
        shorturl.hits = F('hits') + 1
        return shorturl

The function now uses an F expression to update the counter relative to what is in the database.

The different seems very mild, so the best way to understand it is to look at the SQL generated by the update commands. The naive approach will execute the following command:

UPDATE shorturl_shorturl
SET hits = 1
WHERE id = 154;

This will update hits to 1, regardless of the current value of hits in the database.

The function using the F expression will execute the following command:

UPDATE shorturl_shorturl
SET hits = hits + 1
WHERE id = 154;

The hit counter is now incremented by one and not set to a fixed value. This is how using an F expression solves the problem without obtaining an explicit lock on the row.

Update and Immediately Return

Using F expressions solved the problem without an exclusive lock, but there are still two minor downsides to this approach:

  1. There are two round trips to the database: the function first access the database to fetch the short URL by key, and then access it again to update the row.

  2. The hit counter on the returned instance is not updated: the function is incrementing the value in the database, but the short URL object it returns stores the hits counter prior to when the hits counter was incremented.

Normally, the F expression solution with two round trips to the database is good enough, no reason to go through the trouble of trying to optimize it. However in this case, the system should be able to accommodate sudden bursts and redirect very quickly, so it might be worth the trouble.

To solve both issues, you can extend the solution beyond Django's ORM built-in capabilities with some SQL magic:

from django.db import transaction

class ShortUrl(models.Model):

    def resolve(cls, key: str) -> ShortUrl:
        short_url = cls.objects.raw('''
            UPDATE shorturl_shorturl
            SET hits = hits + 1
            WHERE key = %s
            RETURNING *
        ''', [key])
        if not short_url:
            raise cls.DoesNotExist()
    return short_url[0]

Let's break it down:

  1. Use RETURNING to get updated results: In PostgreSQL and SQLite you can return the rows affected by an UPDATE statement. The rows returned by the RETURNING clause are the updated rows. Even though you only update the column hits, you can use * to return the entire row with the updated values.

  2. Use .raw to construct a ShortUrl instance: Django ORM allows you to construct ORM objects from raw SQL.

  3. If no rows were affected, raise DoesNotExist: if no rows were affected it means there is no short URL for the provided key. To mimic Django's behavior in this case, raise a DoesNotExist exception, otherwise return the object.

Binding the table name

Under some circumstances you might want to avoid explicitly using the name of the table. Mainly if you have a reuseable model and you cannot be sure what the name of the database table is.

Using string concatenation to set the name of the table is not acceptable as it puts the query at risk of SQL Injection. Using psycopg, the Python driver for PostgreSQL, there is a way to safely bind identifiers such as table names:

from psycopg2.sql import SQL, Identifier

def resolve(cls, key: str) -> ShortUrl:
    short_url = cls.objects.raw(
            UPDATE {}
            SET hits = hits + 1
            WHERE key = %s
            RETURNING *
    if not short_url:
        raise cls.DoesNotExist()
    return short_url[0]

The function now uses the name of the table as parameter.

Check out Preventing SQL Injection Attacks With Python for more about how to safely compose SQL for PostgreSQL in Python using Psycopg2.

Take Away

Throughout this article you used several different approaches to solve very common concurrency issues in two seemingly simple tasks:

  • Create short URL
    • ⛔ #1: Failing to recognize possible collisions
    • ⛔ #2: Time-of-check time-of-use (TOCTOU)
    • ✅ #1: Lock!
    • ✅ #2: Ask for forgiveness
  • Increment hit counter
    • ⛔ #3: Ignore race conditions
    • ✅ #3: Select for update
    • ✅ #4: Increment in the database
    • ✅ #5: Update and immediately return

Some of these approaches were fragile, and broke under even minor loads, and others has their advantages and disadvantages.

The main take away is this:

  1. Keep concurrency in mind
    Concurrency issues are very hard to miss, and even harder to debug. Recognizing concurrency issues requires special attention and awareness. This article covers the awareness part, so now you just need to pay attention.

  2. Don't let probabilities distract you
    It's very tempting to dismiss concurrency issues when the likelihood is low. Hopefully now that you know a few more ways to handle concurrency you are in a better position to resist the temptation.

  3. Avoid locks if possible
    Locks slow things down and cause contention so it's best to avoid them when possible. Now you know how.

For more advanced approaches for managing concurrency in Django, check out How to Manage Concurrency in Django Models.

Read the whole story
65 days ago
Nice intro to database concurrency
Share this story
Next Page of Stories