No need to rush… (About spinlocks)

From the film “Kidnapping, Caucasian Style”.

After a small article about the quirks of working with the cache, I got a few comments in my DMs about how spinlocks work — plus an interview invite. It's nice that purely technical articles are read not only by techies. Back to spinlocks: since the discussion spilled beyond Habr and there's interest, why not write about working with these synchronization primitives. The topic really is interesting, and developers have come up with more than a dozen kinds of spinlocks for different tastes and needs. Again, everything comes with tests and worked examples. @tbl Linus is indeed right — in userspace spinlocks are "evil incarnate", but there are nuances...

My first encounter with a spinlock in real life (not really, but I'd never thought about it that way before) happened in the laundry room of my alma mater's dorm (ITMO) — it was a cross between a lottery and the tactics of a desert gopher scanning for predators. There were 5 washing machines and two dryers, plus a sign-up sheet for time slots. First you put yourself at the end of the queue and write down when you'll come, then you show up at that time (which didn't always work out), but arriving on time you realize all the machines are still (already) stuffed with someone's clothes, and there may be a few people ahead of you, give or take. You can sit and wait while ~~the strange guys from the room across the hall~~ your neighbors pull their stuff out, or you can go back to your room and come down ten minutes later, hoping no "smart" person jumped the queue. On the third attempt a machine is free and your clothes finally go in; now you can go back to your room for half an hour and wait for the wash to finish, or wait right here playing Snake on a brand-new Nokia 3310. Going back to your room might end with your damp clothes dumped in the communal "freshly washed" basket. The same quest awaits you with the dryer — except there are two, and the number of checks is random, because there's no sign-up sheet for dryers. We check the dryers every half hour and voilà... after three hours and a bunch of trips the laundry is clean, and as compensation for the lost time you get someone's yellow sock with a little elephant on it, discovered back in your room. That's how you can feel like a human spinlock with a polling interval of plus-or-minus ten minutes, with the washing machine as the resource.

There are two kinds of waiting on a resource: blocking and non-blocking.

Blocking waiting (mutex, critical section) suspends the thread until the resource is freed at the OS level. This is useful, because the lock is implemented by the OS itself and consumes negligible machine resources. The scheduler can also account for the priority of such waits, or the priorities of the threads using them. Such a lock leaves the CPU free for other tasks — including any other task, even ones that directly hold the lock. However, all of this is expensive in time, mostly because of blocking contending threads at the OS level. Such locks — mutexes / critical sections — should be used when the work takes significantly longer than the cost of suspending the thread, e.g. I/O operations, calls into OS kernel functions, or doing laundry in a university dorm. If there's interest, we can discuss how mutexes work a bit later; for now, spinlocks.

Non-blocking waiting (spinlock) checks some condition in an infinite loop. This is also useful, because it can resume the thread on the very next CPU cycle, it costs the OS nothing and takes minimal effort from the developer — and it's completely opaque to the scheduler, which only sees a bunch of threads doing active work. An even bigger drawback is that, in the worst case, an active thread can prevent the CPU from doing other work. If a thread's task requires lengthy computation, the other threads will spend disproportionately more time waiting than working, all while occupying the CPU like ordinary threads. As a rule, a spinlock is used for short critical sections, e.g. reading or writing a critical data structure.

Developers have come up with quite a few spinlocks — let's see how it all works.

The naive implementation

Let's start with a simple implementation; an atomic variable guarantees the wait (you can actually write code that breaks this assumption and the spinlock won't work, but that's another story). Using an atomic variable makes the read-modify-write of the current lock state indivisible (i.e. reading the variable's state, changing it to another, and writing it back to memory is a single operation during which the thread cannot be interrupted). Without an atomic variable, the constituent read-modify-write steps could be paused, leading to the lock being acquired by several threads at once. In C++ any operation is generally considered non-atomic until explicitly stated otherwise by the compiler and hardware vendor — even a plain assignment of POD types.

Show code

struct naive_spinlock_tas {
  std::atomic<bool> locked{false};

#define do_nothing
  void lock() { for (; locked.exchange(true) ;) do_nothing; }
  void unlock() { locked.store(false); }
};

Benchmark code

void naive_spinlock_inc(naive_spinlock_tas &s, s64 &val) {
   for (int i = 0; i < 100000; i++) {
     s.lock();
     val++;
     s.unlock();
   }
 }

static void naive_spinlock_bench(benchmark::State &s) {
   const u32 bench_threads = s.range(0);

   s64 val = 0;
   std::vector<std::thread> thread_pool;
   thread_pool.reserve(bench_threads);

   naive_spinlock_tas sl;

   for (auto stage : s) {
     for (u32 i = 0u; i < bench_threads; i++) {
       thread_pool.emplace_back([&] { naive_spinlock_inc(sl, val); });
     }

     for (auto &thread : thread_pool)
        thread.join();

     thread_pool.clear();
   }
 }
 BENCHMARK(naive_spinlock_bench)
     ->RangeMultiplier(2)
     ->Range(1, std::thread::hardware_concurrency())
     ->UseRealTime()
     ->Unit(benchmark::kMillisecond);

Results

----------------------------------------------------------------------------
Benchmark                                 Time             CPU    Iterations
----------------------------------------------------------------------------
naive_spinlock_bench/1/real_time        1.49 ms        0.066 ms          471
naive_spinlock_bench/2/real_time        6.31 ms        0.143 ms          109
naive_spinlock_bench/4/real_time        23.7 ms        0.000 ms           28
naive_spinlock_bench/8/real_time        86.4 ms        0.000 ms            9
naive_spinlock_bench/16/real_time        349 ms        0.000 ms            2
naive_spinlock_bench/24/real_time        670 ms        0.000 ms            1

// Just counting twice as much is faster than doing it in two threads
---------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations
---------------------------------------------------------------------------
naive_spinlock_bench/1/real_time       2.46 ms        0.053 ms          294

If we wanted to get some work done and hoped that doubling the thread count would cut the runtime, we see that didn't happen. Instead, the 2-thread test takes 3× longer than the single-threaded one. With a caveat, of course — the test is very synthetic, but it shows a likely trend when working with data that parallelizes poorly. I covered the reasons for this behavior in the previous article: updating an atomic variable flushes and refreshes the cache line in the other cores working with it. With this spinlock implementation, 2 out of 3 cache requests miss.

perf stat

124,334,591      L1-dcache-loads           #   14.438 M/sec
 78,181,819      L1-dcache-load-misses     #   62.88% of all L1-dcache hits

A better spinlock

Show code

struct better_naive_spinlock_ttas {
  std::atomic<bool> locked{false};

#define do_nothing
  void lock() {
     while (true) {
       if (!locked.exchange(true)) return;

       while (locked.load()) do_nothing;
     }
  }
  void unlock() { locked.store(false); }
};

Let's try to get rid of the cache misses and increase the lock's useful work. To do that, instead of constantly trying to change the variable, we'll wait for it to be released. Once it's free we try again, because someone else might lock it first. All we did was replace the atomic write attempt with a read. The nice thing is that read-only cache copies can coexist in several different cores at no extra cost. The core reads its local copy of the atomic variable from its own L1 cache, the lock-release ordering becomes advisory, and the number of cache misses drops.

Results

-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
better_naive_spinlock_bench/1/real_time        1.53 ms        0.066 ms          475
better_naive_spinlock_bench/2/real_time        6.79 ms        0.267 ms          117
better_naive_spinlock_bench/4/real_time        18.9 ms        0.000 ms           36
better_naive_spinlock_bench/8/real_time        44.5 ms        0.000 ms           16
better_naive_spinlock_bench/16/real_time        106 ms        0.000 ms            7
better_naive_spinlock_bench/24/real_time        246 ms        0.000 ms            3

perf stat

1,296,553,727      L1-dcache-loads           #  253.267 M/sec
   80,252,055      L1-dcache-load-misses     #    6.22% of all L1-dcache hits

This also helped the runtime; you don't see it at two threads, but from 4 threads and up we see a significant drop in per-thread runtime. A relatively small code change brought a performance gain — yet there's still room for improvement.

Hurrying is harmful

Looking at the wait implementation, I realize there's no point reading the lock value every cycle — the thread does some work that takes time. Take a notional 24 cores, assume the thread does heavy computation (iterating an array), and don't poll the lock during that time.

Show code

struct waiting_spinlock_ttas {
  std::atomic<bool> locked{false};

#define do_nothing
  void lock() {
     for (;;) {
       if (!locked.exchange(true)) return;

       for (; locked.load();){
         for (volatile s32 i = 0; i < 24 * 100 ; i += 1) do_nothing;
       }
     }
  }
  void unlock() { locked.store(false); }
};

This optimization significantly cuts the thread's runtime — we gave the CPU a chance to do what it's good at: compute... volatile is needed here so the compiler doesn't throw away this loop, which just waits and burns electricity.

Results

-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
waiting_spinlock_ttas_bench/1/real_time        1.50 ms        0.000 ms          477
waiting_spinlock_ttas_bench/2/real_time        3.34 ms        0.224 ms          209
waiting_spinlock_ttas_bench/4/real_time        8.01 ms        0.000 ms           84
waiting_spinlock_ttas_bench/8/real_time        26.5 ms        0.000 ms           26
waiting_spinlock_ttas_bench/16/real_time       73.0 ms        0.000 ms            9
waiting_spinlock_ttas_bench/24/real_time        116 ms        0.000 ms            6

Instead of do_nothing you can substitute the _mm_pause() instruction and drop volatile as no longer needed; then the wait will be even less aggressive, which of course helps performance. The less work we do, the more work we do :)

Results

-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
waiting_spinlock_mmps_bench/1/real_time        1.50 ms        0.000 ms          460
waiting_spinlock_mmps_bench/2/real_time        2.95 ms        0.135 ms          231
waiting_spinlock_mmps_bench/4/real_time        5.59 ms        0.124 ms          126
waiting_spinlock_mmps_bench/8/real_time        11.3 ms        0.504 ms           62
waiting_spinlock_mmps_bench/16/real_time       22.9 ms         1.01 ms           31
waiting_spinlock_mmps_bench/24/real_time       34.9 ms        0.781 ms           20

Understanding how our threads interact with each other is crucial for finding optimization opportunities. But even the current implementation has room to improve. Let's check the cache misses — under one percent, and all we did was remove the aggressive reads from the cache.

perf stat

16,743,277,459      L1-dcache-loads           # 1266.561 M/sec
    86,735,606      L1-dcache-load-misses     #     0.51% of all L1-dcache hits

Compilers are very smart, but they won't do the domain analysis for us, and there are instructions they insert very rarely, like _mm_pause().

Can we do better?

We can, but nuances start to appear — you have to look at the actual algorithm where the spinlock is used. For the specific case of incrementing a variable it helps to spread the lock checks out in time. What happens if two threads try to acquire the lock at the same moment? Both see the lock is held and back off for the same duration, and with high probability they'll come out at the same time and collide again. The more such threads, the lower the chance of normal operation. This can be avoided by picking a random wait time.

Show code

struct random_spinlock_ttas {
  std::atomic<bool> locked{false};
  std::uniform_int_distribution<int> dist;
  std::mt19937 rng;

  random_spinlock_ttas() {
    rng.seed(std::random_device()());
    dist = std::uniform_int_distribution<int>(1024, 2048);
  }

  void lock() {
     for (;;) {
       if (!locked.exchange(true)) return;

       for (; locked.load();){
         s32 backoff_iters = dist(rng);
         for (int i = 0; i < backoff_iters; ++i) _mm_pause();
       }
     }
  }
  void unlock() { locked.store(false); }
};

The benchmark shows we won another 10% of runtime. But benchmarks and tests don't guarantee the same result in a real application. I gave this example to show that knowing the lock's domain points to optimization opportunities for specific cases — but there's no guarantee that using such a lock in general algorithms will give a similar result.

Results

------------------------------------------------------------------------------------
Benchmark                                        Time            CPU   Iterations
------------------------------------------------------------------------------------
random_spinlock_ttas_bench/1/real_time        1.52 ms        0.028 ms          568
random_spinlock_ttas_bench/2/real_time        2.80 ms        0.157 ms          274
random_spinlock_ttas_bench/4/real_time        5.63 ms        0.131 ms          135
random_spinlock_ttas_bench/8/real_time        10.5 ms        0.414 ms           73
random_spinlock_ttas_bench/16/real_time       19.5 ms        0.434 ms           36
random_spinlock_ttas_bench/24/real_time       29.9 ms         1.04 ms           23

Analyzing the cache-miss data tells us why the per-thread runtime dropped. Updating the variable in L1 cache is even less aggressive (0.42% misses), with less overhead and latency, so the CPU does more work. It's not as obvious with a small number of threads, but it becomes increasingly noticeable as their count grows.

perf stat

17,543,234,558      L1-dcache-loads           # 1296.631 M/sec
    74,638,504      L1-dcache-load-misses     #     0.42% of all L1-dcache hits

What's wrong with this spinlock

Chasing the best runtime, you can spoil the overall picture, since the threads are doing some useful work. Attempts to lower a thread's wait time lead to "unintended prioritization" of the threads that got a shorter wait. For example, in the spinlock with a random wait time, priority goes to the threads whose wait is shorter. With a constant wait time, priority goes to the threads that arrived earlier — this effect is called "starvation": some threads get significantly less runtime. In any spinlock based on busy-waiting this effect shows up more or less, and some threads will idle. You can partially fix this by taking the scheduler's job on yourself, but again, these are nuances and you have to understand WHAT we're doing and what behavior we want in the end. To keep threads from migrating, you need to pin a thread to a specific core. If the scheduler is free to pick the core for a thread, then while a spinlock is held on one core several threads can spin one after another, increasing total runtime. When threads are distributed across cores, they get their timeframe roughly in queue order on that core, with less cost for moving context between cores. A downside of this approach is crudely interfering with the scheduler's work (which isn't always bad — colleagues say the PS5 has a terrible scheduler).

In most game engines the audio, render and I/O threads are pinned to specific cores to avoid migration side effects.

Rewriting the benchmark code to set thread affinity:

Show code

 for (auto stage : s) {
   for (u32 i = 0u; i < bench_threads; i++) {
     thread_pool.emplace_back([&] { waiting_spinlock_affn_inc(sl, val); });
     SetThreadAffinityMask(thread_pool.back().native_handle(), DWORD_PTR(1) << i);
   }

   for (auto &thread : thread_pool)
      thread.join();

   thread_pool.clear();
 }

we get the following result: total runtime dropped for a large number of threads. But there the cache-update effects across core caches show up (compare these numbers with waiting_spinlock_mmps_bench), which offset the benefit of spreading threads across cores.

Results

-----------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations
-----------------------------------------------------------------------------------
waiting_spinlock_affn_bench/1/real_time        1.59 ms        0.000 ms          580
waiting_spinlock_affn_bench/2/real_time        2.70 ms        0.000 ms          254
waiting_spinlock_affn_bench/4/real_time        6.18 ms        0.138 ms          113
waiting_spinlock_affn_bench/8/real_time        9.81 ms        0.781 ms           80
waiting_spinlock_affn_bench/16/real_time       19.9 ms        0.000 ms           34
waiting_spinlock_affn_bench/24/real_time       30.7 ms        0.710 ms           22

Not all spinlocks are equally useful

There's a class of tasks where using a busy-wait spinlock is harmful or outright unacceptable (e.g. parallel data-loading tasks), where the next stage depends on the results of the previous one (spread across several threads). To ensure fair access to a resource (e.g. a file system), you need a spinlock that doesn't depend on how fast it locks, but instead provides sequential access to the resource for each thread. The _ReadWriteBarrier() call is needed so the compiler is guaranteed to call unlock() after lock() and doesn't mess things up with optimizations enabled.

Show code

struct ticket_spinlock_ttas {
  std::atomic<u16> waiter{0};
  volatile u16 dish{0};

  void lock() {
     auto my_dish = waiter.fetch_add(1);

#define do_nothing _mm_pause()
     while (dish != my_dish) do_nothing;
  }
  void unlock() {
    _ReadWriteBarrier();
    dish = dish + 1;
  }
};

A ticket-based spinlock lets threads wait for the lock calmly. Overall performance suffers, of course, compared to any other implementation, but every thread gets its timeframe to work. Runtime isn't the only metric to aim for; in many tasks you need a compromise between runtime and overall progress, which is especially apparent in coupled tasks where the start of some depends on the completion of others.

Results

-----------------------------------------------------------------------------
Benchmark                                   Time             CPU   Iterations
-----------------------------------------------------------------------------
ticket_spinlock_bench/1/real_time        1.45 ms        0.024 ms          651
ticket_spinlock_bench/2/real_time        11.9 ms        0.000 ms           67
ticket_spinlock_bench/4/real_time        53.5 ms        0.000 ms           10
ticket_spinlock_bench/8/real_time         220 ms        0.000 ms            3
ticket_spinlock_bench/16/real_time        635 ms        0.000 ms            1
ticket_spinlock_bench/24/real_time        905 ms        0.000 ms            1

The pthread implementation

For comparison with the world of honest OS locks, let's take the standard pthread implementation (Windows) (https://github.com/GerHobbelt/pthread-win32) and run the benchmark. The pthread spinlock is implemented via the Windows CRITICAL_SECTION, so all the joys of overhead — context switches, thread parking and scheduler work — are present in full.

Show code

pthread_spinlock_t sl;
pthread_spin_init(&sl, PTHREAD_PROCESS_PRIVATE);

void pthread_spinlock_inc(pthread_spinlock_t &sl, s64 &val) {
  for (int i = 0; i < 100000; i++) {
    pthread_spin_lock(&sl);
    val++;
    pthread_spin_unlock(&sl);
  }
}

The low time on a single thread I can't explain. Apparently some OS trick when there are no contenders for the critical section. On two or more threads the result is close to the simplest implementation, except there's no CPU load — which I wrote about at the start of the article.

Results

------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations
------------------------------------------------------------------------------
pthread_spinlock_bench/1/real_time        0.70 ms        0.147 ms          826
pthread_spinlock_bench/2/real_time        6.07 ms        0.000 ms          115
pthread_spinlock_bench/4/real_time        22.3 ms        0.504 ms           31
pthread_spinlock_bench/8/real_time         119 ms        0.000 ms            6
pthread_spinlock_bench/16/real_time        341 ms        0.000 ms            2
pthread_spinlock_bench/24/real_time        606 ms        0.000 ms            1

Instead of conclusions

Writing a functionally correct spinlock doesn't take much code, but you need to carefully think through where it will be used to make it perform well and/or with guaranteed latencies. A library implementation (or general implementations) isn't always a good choice for boosting performance. I'm sure that, having studied a specific domain, you'll be able to write your own implementation that best fits the particular situation.

a simple no-wait spinlock — grabs the resource as fast as possible and is free of the prioritization effect;
a spinlock with a constant wait lets the thread do more work without loading the CPU much;
spinlocks with backoff and random waits let you write simple prioritization;
a ticket spinlock distributes the resource fairly among threads.

P.S.

← All articles