Game++. Performance traps

The C++ standard library contains many classes and functions that integrate easily into a project, are safe, and have been tested on many cases. However, you pay for the convenience and the all-you-can-eat nature with performance. In games, if performance isn't put first right away, then by the end of the project you accumulate so much technical debt that it's often easier to throw everything out and start over. Straightforward use of the standard library, in most cases where you need performant and efficient code — and I don't only mean games right now — turns out not to be the best choice.

The examples below conclude a series of articles in which I tried to collect interesting moments of using various data structures used in game development, their extensions and opportunities for improvement.

The article is aimed at readers who aren't C++ gurus or connoisseurs of the language's subtleties, but are generally familiar with the language and its ideas; knowledge of x86 assembly isn't required, I'll be attaching links to quickbench code examples to explain why I give one piece of advice or another.

Sometimes I'll be telling horror stories here, but most of these cases got in the way of software working normally in prod, so I had to treat them with respect.

Thank you, Habr user, for walking this path with me through a series of articles about C++ gamedev — from light acrobatics with strings and containers, to tricky allocators and multithreading. You not only supported my interest in collecting scattered pieces into something whole and self-contained, but also helped turn technical notes into a living dialogue about game development. I hope these articles became not just a good weekend read, but also a reason for your own experiments, refactoring and new ideas. The game isn't over yet!

Game++. String interning
Game++. Cooking vectors
Game++. Dancing with allocators
Game++. Juggling STL algorithms
Game++. run, thread, run...
Game++. Building arcs
Game++. Unpacking containers
Game++. Heap? Less
Game++. Work hard
Game++. Patching patterns
Game++. while (!game(over))
Game++. Bonus. Performance traps <=== You are here

Most slow code appears either out of laziness or out of too much cleverness, and usually it's genuinely clever and fashionable code. But instead of opening the profiler, we just pile on more fashionable code. And powerful hardware relaxes you a lot, and there you go — std::map starts getting used where there should be a simple array. Or we stuff std::function, shared_ptr, variant everywhere, allocate at runtime, copy by value in a lambda, hoping the compiler will "optimize something there" — well, it'll optimize something, of course, but it won't think for us. And then QA complains that the game on the Xbox Series X lags like a calculator. Who knew that the atomic increments of shared_ptr aren't free? Everyone. Everyone knew. In modern software probably everything can be slow, starting from allocations and ending with reading from disk; below there will be a not-very-long list, but for some reason out of all the projects only these cases stuck vividly in my memory, although there were many more, you can't remember them all. And you, dear Habr user, surely have a couple of stories that could add to this list — don't be shy to tell them in the comments. Let's go!

Memory allocation is evil, and there's not enough evil to go around

The first thing you learn when developing games for consoles is minimal allocations per frame, which fundamentally contradicts the committee's guidelines on using standard-library containers (https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rsl-arrays)

SL.con.1: Prefer using STL array or vector instead of a C array
Reason C arrays are less safe, and have no advantages over array and vector. For a fixed-length array, use std::array, which does not degenerate to a pointer when passed to a function and does know its size. Also, like a built-in array, a stack-allocated std::array keeps its elements on the stack. For a variable-length array, use std::vector, which additionally can change its size and handles memory allocation.

A common piece of advice is to prefer std::vector over plain C arrays and static vectors, because it's safer and preserves ownership control of the resource. This is quite reasonable, especially if we assume the vector will grow during operation. However, if there's no need for growth and performance is critical, there are ways to achieve better results.

Usually a vector stores pointers internally, one or more; some pointers (like the end element and the capacity) can be replaced with numbers, but that doesn't change the essence — it's still the same contiguous chunk of memory:

begin — a pointer to the start of the data,
end — a pointer to the position right after the last element (used to compute the current size),
capacity_end — a pointer to the end of the allocated memory block (defines the maximum size without reallocation).

struct vector {
    T* begin;
    T* end;
    T* capacity;
};

This triple of pointers lets std::vector quickly check the size, whether there's room for new elements, iterate, shift, delete, and so on. But you pay for it with indirect access to the elements: the data is stored not inside the object, but at the address that begin points to. This breaks data locality in the cache; the first several accesses to a vector's data (usually 10–15), until the CPU's internal structures adapt to the new pattern, go "cold", they're several times slower than subsequent accesses, so working with small vectors is even more costly than with large ones — there are always cache misses there and the algorithm doesn't have time to "spin up". This effect is known as "cold start" — it's a slightly more general concept, but the essence is the same: small data structures are always more expensive to process.

The "cold-start" effect

When you first access the elements of a vector, the data isn't yet loaded into the CPU cache, so a lot of cache misses happen. After several accesses the data gets loaded into the cache (L1, L2, L3), and subsequent accesses become significantly faster — "cache warm-up".

Especially noticeable with:

The first pass over a large vector
Accessing data after it's been evicted from the cache (cold start)
Working with vectors whose size exceeds the cache size (cache poisoning)

To minimize this effect, techniques are used such as:

Prefetching (preloading data)
Cache-friendly algorithms with good spatial locality
"Warming up" the cache before the main computations

Even if the vector is small, its data is always on the heap, which is significantly more costly in terms of time. The way a vector is initialized matters. For a concrete example, suppose we have a number N, and we want to return a vector containing the squares of the first N numbers:
[0, 1, 4, 9, 16, ...]

A modern vector implementation is wrapped in a large number of checks — for the current size, for going out of bounds of the array, for the need to reallocate memory. The compiler will generate a bunch of extra code (in particular, calls to operator new, memmove and operator delete) and exception handling, so the generated code turns out badly bloated for no reason at all. Even for a simple function that returns some array of elements of a known size — all these checks will be included. (https://godbolt.org/z/vK5W93z8W). This is how ordinary developers write when they want to return some array of values; here a memory reservation in the vector is even done, otherwise it'd be downright sad.

std::vector<int> createVector(int size) {
    std::vector<int> result;
    result.reserve(size);
    for (int ii = 0; ii < size; ++ii) {
        result.push_back(ii * ii);
    }
    return result;
}

Asm (vector<int>)

makeVector(int):
        push    rbp
        push    r15
        push    r14
        push    r13
        push    r12
        push    rbx
        sub     rsp, 24
        xorps   xmm0, xmm0
        movups  xmmword ptr [rdi], xmm0
        mov     qword ptr [rdi + 16], 0
        test    esi, esi
        js      .LBB0_1
        mov     rbp, rdi
        je      .LBB0_27
        mov     r12d, esi
        movsxd  rbx, esi
        lea     rdi, [4*rbx]
        call    operator new(unsigned long)
        mov     r15, rax
        mov     qword ptr [rbp], rax
        mov     qword ptr [rbp + 8], rax
        lea     r9, [rax + 4*rbx]
        mov     qword ptr [rbp + 16], r9
        xor     r14d, r14d
        mov     r8, rax
        mov     rbx, rax
        mov     qword ptr [rsp + 16], rbp
        mov     dword ptr [rsp + 12], r12d
        jmp     .LBB0_6
.LBB0_7:
        mov     dword ptr [rbx], r13d
        add     rbx, 4
        mov     qword ptr [rbp + 8], rbx
        add     r14d, 1
        cmp     r12d, r14d
        je      .LBB0_27
.LBB0_6:
        mov     r13d, r14d
        imul    r13d, r14d
        cmp     rbx, r9
        jne     .LBB0_7
        mov     rdx, rbx
        sub     rdx, r8
        movabs  rax, 9223372036854775804
        cmp     rdx, rax
        je      .LBB0_9
        mov     rax, rdx
        sar     rax, 2
        mov     ecx, 1
        test    rdx, rdx
        je      .LBB0_13
        mov     rcx, rax
.LBB0_13:
        lea     rdx, [rcx + rax]
        movabs  rsi, 2305843009213693951
        mov     r12, rsi
        cmp     rdx, rsi
        ja      .LBB0_15
        mov     r12, rdx
.LBB0_15:
        add     rcx, rax
        movabs  rax, 2305843009213693951
        cmovb   r12, rax
        test    r12, r12
        mov     qword ptr [rsp], r8
        je      .LBB0_16
        mov     rbp, r15
        mov     r15, r9
        lea     rdi, [4*r12]
        call    operator new(unsigned long)
        mov     rbp, rax
        mov     r8, qword ptr [rsp]
        mov     r9, r15
        jmp     .LBB0_19
.LBB0_16:
        xor     ebp, ebp
.LBB0_19:
        mov     rdx, r9
        sub     rdx, r8
        mov     rax, rdx
        sar     rax, 2
        lea     r15, [4*rax]
        add     r15, rbp
        mov     dword ptr [rbp + 4*rax], r13d
        test    rdx, rdx
        jle     .LBB0_21
        mov     rdi, rbp
        mov     rsi, r8
        mov     r13, r9
        call    memmove
        mov     r9, r13
        mov     r8, qword ptr [rsp]
.LBB0_21:
        add     r15, 4
        sub     rbx, r9
        test    rbx, rbx
        jle     .LBB0_23
        mov     rdi, r15
        mov     rsi, r9
        mov     rdx, rbx
        call    memmove
        mov     r8, qword ptr [rsp]
.LBB0_23:
        test    r8, r8
        je      .LBB0_25
        mov     rdi, r8
        call    operator delete(void*)
.LBB0_25:
        add     rbx, r15
        mov     r15, rbp
        mov     rbp, qword ptr [rsp + 16]
        mov     qword ptr [rbp], r15
        mov     qword ptr [rbp + 8], rbx
        lea     r9, [r15 + 4*r12]
        mov     qword ptr [rbp + 16], r9
        mov     r8, r15
        mov     r12d, dword ptr [rsp + 12]
        add     r14d, 1
        cmp     r12d, r14d
        jne     .LBB0_6
.LBB0_27:
        mov     rax, rbp
        add     rsp, 24
        pop     rbx
        pop     r12
        pop     r13
        pop     r14
        pop     r15
        pop     rbp
        ret
.LBB0_9:
        mov     edi, offset .L.str.1
        call    std::__throw_length_error(char const*)
.LBB0_1:
        mov     edi, offset .L.str
        call    std::__throw_length_error(char const*)
        jmp     .LBB0_29
        mov     rdi, rax
        call    _Unwind_Resume
        mov     r15, rbp
.LBB0_29:
        mov     rbx, rax
        test    r15, r15
        je      .LBB0_31
        mov     rdi, r15
        call    operator delete(void*)
.LBB0_31:
        mov     rdi, rbx
        call    _Unwind_Resume

But if you simply return some filled array of elements, such a lightweight array, the compiler will produce completely different code, free of these wrappings. Half of the results returned as vectors in an application don't change, and if they don't change — why use an expensive vector for that?

std::unique_ptr<int[]> createSharedArray(int size) {
    int *result;
    result = new int[size];
    for (int ii = 0; ii < size; ++ii) {
        result[ii] = (ii * ii);
    }
    return std::unique_ptr<int[]>{result};
}

Asm (std::unique_ptr<int[]>)

makeArray(int):
        push    rbp
        push    r14
        push    rbx
        mov     ebp, esi
        movsxd  rbx, esi
        mov     ecx, 4
        mov     rax, rbx
        mul     rcx
        mov     r14, rdi
        mov     rdi, -1
        cmovno  rdi, rax
        call    operator new[](unsigned long)
        test    ebx, ebx
        jle     .LBB1_11
        mov     ecx, ebp
        cmp     ebp, 7
        ja      .LBB1_3
        xor     edx, edx
        jmp     .LBB1_10
.LBB1_3:
        mov     edx, ecx
        and     edx, -8
        lea     rsi, [rdx - 8]
        mov     rbp, rsi
        shr     rbp, 3
        add     rbp, 1
        test    rsi, rsi
        je      .LBB1_4
        mov     rdi, rbp
        and     rdi, -2
        neg     rdi
        movdqa  xmm0, xmmword ptr [rip + .LCPI1_0]
        xor     esi, esi
        movdqa  xmm1, xmmword ptr [rip + .LCPI1_1]
        movdqa  xmm2, xmmword ptr [rip + .LCPI1_2]
        movdqa  xmm3, xmmword ptr [rip + .LCPI1_3]
        movdqa  xmm4, xmmword ptr [rip + .LCPI1_4]
.LBB1_6:
        movdqa  xmm5, xmm0
        paddd   xmm5, xmm1
        movdqa  xmm6, xmm0
        pmuludq xmm6, xmm0
        pshufd  xmm6, xmm6, 232
        pshufd  xmm7, xmm0, 245
        pmuludq xmm7, xmm7
        pshufd  xmm7, xmm7, 232
        punpckldq       xmm6, xmm7
        pshufd  xmm7, xmm5, 245
        pmuludq xmm5, xmm5
        pshufd  xmm5, xmm5, 232
        pmuludq xmm7, xmm7
        pshufd  xmm7, xmm7, 232
        punpckldq       xmm5, xmm7
        movdqu  xmmword ptr [rax + 4*rsi], xmm6
        movdqu  xmmword ptr [rax + 4*rsi + 16], xmm5
        movdqa  xmm5, xmm0
        paddd   xmm5, xmm2
        movdqa  xmm6, xmm0
        paddd   xmm6, xmm3
        pshufd  xmm7, xmm5, 245
        pmuludq xmm5, xmm5
        pshufd  xmm5, xmm5, 232
        pmuludq xmm7, xmm7
        pshufd  xmm7, xmm7, 232
        punpckldq       xmm5, xmm7
        pshufd  xmm7, xmm6, 245
        pmuludq xmm6, xmm6
        pshufd  xmm6, xmm6, 232
        pmuludq xmm7, xmm7
        pshufd  xmm7, xmm7, 232
        punpckldq       xmm6, xmm7
        movdqu  xmmword ptr [rax + 4*rsi + 32], xmm5
        movdqu  xmmword ptr [rax + 4*rsi + 48], xmm6
        add     rsi, 16
        paddd   xmm0, xmm4
        add     rdi, 2
        jne     .LBB1_6
        test    bpl, 1
        je      .LBB1_9
.LBB1_8:
        movdqa  xmm1, xmmword ptr [rip + .LCPI1_1]
        paddd   xmm1, xmm0
        pshufd  xmm2, xmm0, 245
        pmuludq xmm0, xmm0
        pshufd  xmm0, xmm0, 232
        pmuludq xmm2, xmm2
        pshufd  xmm2, xmm2, 232
        punpckldq       xmm0, xmm2
        pshufd  xmm2, xmm1, 245
        pmuludq xmm1, xmm1
        pshufd  xmm1, xmm1, 232
        pmuludq xmm2, xmm2
        pshufd  xmm2, xmm2, 232
        punpckldq       xmm1, xmm2
        movdqu  xmmword ptr [rax + 4*rsi], xmm0
        movdqu  xmmword ptr [rax + 4*rsi + 16], xmm1
.LBB1_9:
        cmp     rdx, rcx
        je      .LBB1_11
.LBB1_10:
        mov     esi, edx
        imul    esi, edx
        mov     dword ptr [rax + 4*rdx], esi
        add     rdx, 1
        cmp     rcx, rdx
        jne     .LBB1_10
.LBB1_11:
        mov     qword ptr [r14], rax
        mov     rax, r14
        pop     rbx
        pop     r14
        pop     rbp
        ret
.LBB1_4:
        movdqa  xmm0, xmmword ptr [rip + .LCPI1_0]
        xor     esi, esi
        test    bpl, 1
        jne     .LBB1_8
        jmp     .LBB1_9

std::unique_ptr

Compiler developers know about this too, and if you use a vector template constructed "to size", then the call will be very similar to the last result, which works without most of the checks. This code will work fast, almost like a raw allocation, if RVO kicks in — the key word here is if. RVO does help in real programs, but it's a fragile mechanism, more on that in the next paragraph.

std::vector<int> createPrealloc(int size) {
    std::vector<int> result(size);
    for (int ii = 0; ii < size; ++ii) {
        result[ii] = ii * ii;
    }
    return result;
}

https://quick-bench.com/q/ETpIgVgy1uwpYkJhn5tQRdegCm4

War story

Back in 2019 I was invited (through a mutual acquaintance) to audit the codebase of one St. Petersburg studio's engine; they were already thinking about moving to Unreal but didn't want to abandon their own either. I won't name it, but you've definitely played one of their hidden-object games. The old team that actually wrote the engine had ridden off into the sunset, and the new team just piled on new functionality under the guidance of an ?ffective manager, without much understanding of how the engine was built under the hood. They didn't think about performance either, not at all, and all over the engine there was work with raw std::string, returning megabyte-sized vectors and creating std::map inside loops. To say the engine was in a bad way is putting it mildly. On the average Android of the time the game pushed 30 fps on an empty scene with a couple dozen objects, but since the scenes were mostly static, nobody really noticed the freezes and fps drops. My buddy and I sat over the profiler for a week and rolled out a report saying it'd be cheaper to bury the stewardess and move to Unreal. That's how improper memory handling buried a rather decent in-house engine.

Treacherous RVO

The simplest way to start managing memory is to allocate it dynamically every time you need it. This approach is considered ideal, encouraged in many programming books and even by the committee. A gray-haired computer-science professor who hasn't debugged a real application in ages lectures students on how you should build classes through strings, vectors and maps — this isn't my invention, I just went to a couple of lectures by an existing "maître" when I had to fix an fps drop caused by new colleagues.

And it really is very easy to use. Need a new instance of an animation when the player lands on a ledge? Allocate memory. Need to play a new sound when a goal is reached? Just allocate more memory.

Chaotic dynamic memory allocation in theory helps keep memory usage to a minimum, because you allocate only what's really needed and no more. But in practice everything turns out much worse, since each allocation suddenly turns out to have an unexpectedly large overhead, which starts to accumulate if the programmers get too relaxed. This is a great way to shoot off both of the compiler's legs, so it does less optimizing of our code.

struct ParticleSystem {
    std::vector<float> particles; // vector here on purpose, 
                                  // it was Particle, float is just for the test

    ParticleSystem(size_t count) : particles(count, 1.f) {}
};

std::optional<ParticleSystem> make_editable_particle_system() {
    ParticleSystem ps(100); // heavy allocation
    return ps;
}

std::optional<ParticleSystem> make_baked_particle_system() {
    const ParticleSystem ps(100); // `const` blocks RVO
    return ps;
}

https://quick-bench.com/q/2ysRdpKJWSklB1mDA1b0EuFysuE

War story

Once upon a time a particle system lived in a game, didn't bother anyone, didn't get in anyone's way. The time came to refactor it and make it more configurable in the effects editor — fire, smoke, sparks, residual trails... We gave designers the ability to tune about a hundred particle parameters, with different dynamics and lifetimes. Everything looked great in the editor — the effect was assembled from a multitude of parameters, and you could change colors, speed, particle types on the fly. The system returned this effect as a ParticleSystem object, which was then passed on into the game.

Designers got bolder and started rolling out more complex effects, but here's the trouble — scenes started freezing on them. We dug into the code, and there was this nonsense with const, which disabled RVO. That is, instead of a single object creation there was creation + a copy. And since the object was large (one of the effects exceeded 10k elements), it really lagged, and at 60 fps there was a clear freeze, and not one of the compilers of the day managed to do a move. Why they made it const and missed this in review is a separate topic.

The heavy callback

Memory allocations are of course harmful, but at least you can detect them with a profiler and remove them, whereas with newfangled things like chains of calls and callbacks everything is much more complicated. Suppose we have a set of units and we want to iterate over them and do some work, for example run different logic for friends and for enemies. The old logic is written through a loop with conditions, and some payload:

struct GameObject {
  uint32_t id;
  float lifetime;
  int visible;

  GameObject() : id(0), lifetime(0.0f), visible(std::rand()) {}
};

bool isVisibleForEnemy(const GameObject &o) noexcept {
  return o.visible && !(o.visible & (o.visible - 1));
}

bool isVisibleForFriends(const GameObject& o) noexcept {
  constexpr std::uint64_t visibleForFriends = 0x8000;
  return o.visible > 0 && (visibleForFriends & o.visible == visibleForFriends);
}

template <typename callback_type_>
void executeRealEntityWork(const GameObject& o, callback_type_ &&callback) noexcept {
  if (o.visible) {
    callback(o);
  }

  for (std::uint64_t condition = 1; condition <= o.lifetime; condition++) {
    callback(o);
  }
}

static void FilterEntitiesLambda(benchmark::State &state) {
  float sum = 0, count = 0;
  for (auto _ : state) {
    sum = 0, count = 0;
    for (const auto &o : entities) {
      if (!isVisibleForEnemy(o) && !isVisibleForFriends(o))
        executeRealEntityWork(o, [&](const GameObject &o) {
          sum += o.lifetime;
          count++;
          benchmark::DoNotOptimize(sum);
        });
    }
  }
}

In a fit of refactoring we decided to rewrite this through std::function, so everything would be by the guidelines, fashionable and pretty. And we get an increase in function runtime and compile time by several times when using std::function as callbacks. Although it simplifies the interface, it also leads to (!possible) heap allocations and drags the "noisy" <functional> — one of the largest headers of the standard library — into the compilation unit.

The same code, side view, but slower

template<class T>
void forEachEntities(T& ent, std::function<void(const GameObject& o)> const &callback) {
  for (const auto &o : ent) callback(o);
}

void filterEntitiesStl(  //
    const GameObject &o,
    std::function<bool(const GameObject &o)> const &predicate,
    std::function<void(const GameObject &o)> const &callback) {
  if (!predicate(o)) callback(o);
}

void executeRealEntityWorkStl(
    const GameObject &o,
    std::function<void(const GameObject &o)> const &callback) {
  executeRealEntityWork(o, callback);
}

static void FilterEntitiesStdFunction(benchmark::State &state) {
  float sum = 0, count = 0;
  for (auto _ : state) {
    sum = 0, count = 0;
    forEachEntities(entities, [&](const GameObject &o) {
      filterEntitiesStl(o, isVisibleForEnemy, [&](const GameObject &o) {
        filterEntitiesStl(o, isVisibleForFriends, [&](const GameObject &o) {
          executeRealEntityWorkStl(o, [&](const GameObject &o) {
            sum += o.lifetime;
            count++;
            benchmark::DoNotOptimize(sum);
          });
        });
      });
    });
  }
}

https://quick-bench.com/q/8WeASFxS-1cNykvdtZR2Hu14mlM

War story

We were once reworking a game AI — nothing special really: behavior tree nodes, a couple dozen conditions and actions, all in place. And then someone clever in quotes (unfortunately, me) suggested: instead of homemade callbacks and template functions, let's build it through std::function. You know, readability, extensibility, "C++ Core Guidelines" and all that. We built it — beautiful! The code is neat, the interfaces tidy, you can pass anonymous lambdas everywhere, from scripts or from code, you can change part of the behavior on the fly. Everything flexible, everything by the guideline. A holiday for a designer, really.

It got much worse when we shipped this to "prod". First the build time grew, not much — about 10% — but we'd moved only 2 NPCs out of a hundred onto the new system. We started looking at CI, found hundreds of <functional> inclusions, 27Kloc, which our careless fix had sneaked into very many places, and then it got worse: periodic in-frame freezes began. And not from rendering, not from loading, but right in the AI logic. What the h...

We started looking — and std::function, it turns out, sometimes allocates on the heap. Especially when it's not just a lambda but a lambda with captured state. And our functions get called several hundred times per frame, and the trend was growing, because new NPCs were being connected. There's your "possible allocations". On the PlayStation these allocations started hitting the allocator so hard that the FPS dropped by half. In the end… it was already too late to throw it all out and bring back the old version; more than a year had been spent on the new AI, so we had to spend a long time hugging the profiler fixing this little thing and writing workarounds.

Sometimes "fashionable" and "beautiful" is a luxury that not everyone can afford at runtime.

Expensive identifiers

Many libraries, including the STL (std::string), Folly::fbstring or boost::string, use a technique called "small string optimization" (SSO, Small String Optimization). This is a trick where short strings are stored right inside the string object, without allocating memory on the heap. Thanks to this, performance improves and memory fragmentation is reduced, especially with frequent operations on short strings (for example, identifiers, names, keys). The SSO implementation depends on the compiler and the standard library. In libstdc++ (GCC 13), std::string usually stores strings up to 15 bytes right in the object (sizeof(std::string) = 32 bytes). In libc++ (Clang 17), the limit is 22 bytes with sizeof(std::string) = 24 bytes. Custom string implementations also usually support up to 32 bytes of local storage, and that's usually enough for most identifiers — until a component system comes into the engine, and long composite strings. And then the string length starts directly affecting the game's performance, and it makes sense to switch to other identifier systems not tied to strings.

https://quick-bench.com/q/J5ZTH1z_E7dE3OSU4ih3S2lB4II

War story

In Sims Mobile, all the in-game logic — from character actions to object behavior and triggers in the house — was built on callbacks and interaction through string identifiers. The original Sims had the same system but on numeric IDs; porting it to Unity didn't work out for various reasons, so a custom one on raw strings was written.

Partly this was done temporarily, for debugging convenience. Each object had dozens of parameters: "fridge_open", "bed_sleep_short", "career_event_barista_1", "interaction_social_highfive" and so on. Designers could throw together reactions just by composing different strings and hanging a handler on them, for example fridge_open_sim_pants if a sim opened the fridge in his proverbial underpants.

In the early builds all this information was stored and passed by copying, to be able to do traces and logs of the work. This simplified debugging, made the tools convenient for designers, but time passed, and the temporary system became the main one, and everyone forgot why it was even made, but in prod too many problems started to surface.

The biggest pain was load time — an ordinary level took 3–4 minutes, and obviously players won't just stare at the screen for that time, even the most devoted sim-lovers. A sims' house with dozens of objects and several active sims, each with its own states and action queues, caused an avalanche of string comparisons. The player sees "a sim got out of bed and went to wash up", while under the hood the code does hundreds of string matchings: which animation? which interaction type? which outfit? All of these are long, composite names, often 130+ characters.

As content filled in, microfreezes appeared even from trivial actions: a sim tries to choose an available toilet — and spends 2–3 ms on it, just iterating over string keys in the routing logic. Seems like nonsense, but when you have 4 sims and 500 active objects on screen, and all of them are actively doing something, the game started to stutter, especially on weak devices.

We had to rework it: instead of strings — integer IDs. Each string involved in the logic was assigned a stable uint32_t at build time, and at runtime only number comparison happened. This removed most of the problems: scenes started loading faster, sims made decisions instantly, and the background simulation (even when the user is in the menu) stopped draining the phone's battery. We made everything nice and convenient for the designers and they barely cried, but that's a separate story.

Moral — no one ever again suggested "let's just pass a string in the parameters".

The raging sine

Standard-library functions such as sin, cos, tan and others are implemented with an emphasis on the maximum possible precision. For this, complex algorithms are used, which significantly slows down execution, so many developers prefer to write their own trigonometry with lower precision.

Absolute precision isn't always critical. It's acceptable to use approximate methods to significantly speed up computations — for example, when precision down to the fourth decimal place is already excessive. One such approximation is expanding the sine function into a Taylor–Maclaurin series and representing the function as a polynomial around some point. If we limit ourselves to the first few terms of the expansion, then the expression for sin(x) will look like this:

sin(x) ≈ x - (x³)/6 + (x⁵)/120

Such approximate sines are used in simulating breathing, the swaying of leaves, floating objects, oscillations on water or simulating wind in puddles. Besides, you can make the sine even simpler on weak devices.

An attempt to do it "right" and by the formula will most likely fail and be more expensive than the original, because the CPU can compute sine in hardware:

std::pow(phase, 3) / 6 + std::pow(phase, 5) / 120

Such an approach sharply reduces the number of floating-point operations, especially if you get rid of the std::pow calls and replace them with manually expanded multiplication. Precision is lost in the process, especially at large argument values, but for small values of x the result comes out close enough to the real one and is computed many times faster. Nobody said the leaves have to describe perfect circles.

double x2 = phase * phase;
double x3 = x2 * phase;
double x5 = x3 * x2;
offset = phase - x3 / 6.0 + x5 / 120.0;

https://quick-bench.com/q/HTCdf5dII1Be1vvb02rv_DGrhjw

War story

In Sims Mobile the usual character animations weren't used — there were too many actions with objects that had to be animated, but the sims had to animate quickly and smoothly while walking around the house, picking up a mug, opening the fridge and other actions. Instead, an animation system based on curves built between locators (points on an object) was written. That is, you could set a point on the mug and a point on the hand, and the sim would smoothly move its hand to the mug, and then just as smoothly to its mouth.

Inside the animation, each of the sims' limbs — legs, arms, head — worked as a physical bone driven by the given curves. These curves were built from sine functions to make the actions look realistic. Every simulation tick the code made hundreds of std::sin calls, per sim, per animated bone. Even on modern ARM chips this was a problem — to say nothing of mid-tier Androids.

As a temporary solution we replaced std::sin with an approximation using a Maclaurin series. Visually the animations stayed the same — smooth and lively, well, at least the producers didn't notice. But the fps grew — by 8–10 FPS in a scene.

A couple of months later a proper implementation with table interpolation and SIMD instructions was brought into the project, but it was exactly that "dirty" Maclaurin series that saved more than one gray hair on the lead programmer's head.

Slow division

Floating-point computations are more resource-intensive, but integer operations can unexpectedly turn out significantly worse in performance. This especially concerns the division (/) and modulo (%) operations, which are considered among the slowest arithmetic operations in the processor. Dividing 32-bit integers usually takes about 10 CPU cycles — that's roughly 2.5 nanoseconds.

If the divisor is known at compile time, the compiler can perform an optimization called strength reduction (https://en.wikipedia.org/wiki/Strength_reduction)

In this case, division is replaced with faster multiply and shift instructions, even for numbers as large as the maximum 32-bit one. This replacement makes division cheaper. For the compiler to treat a value as known at compile time, it's enough to just use const or constexpr for the optimization. But modern compilers can infer immutability themselves when analyzing a variable's lifetime (clang especially likes to do this, which can cause other bugs, but that's a separate story).

There's another interesting trick: dividing integers through floating-point numbers. Since the 64-bit double type can exactly represent any 32-bit signed integer, you can safely convert int to double, perform floating-point division and cast the result back to int. This lets you use the processor's blocks designed for double operations, which perform real-number division a full five times faster than the integer ALU.

https://quick-bench.com/q/HhAWEpdX5L96VlVcqTb7pz18C6A

War story

There won't be a war story — modern CPUs have enough power for us to think less about such tiny things. But sometimes you still have to — during the port of a game (XBlades 2) to the Nintendo Switch we ran into an unexpected problem: this platform had a frankly weak integer-division block in the ALU, and with a large number of integer-division operations per frame we observed an fps drop (it dropped by 1–2 frames, roughly). Analysis of profiler snapshots showed that a lot of time goes precisely into integer-division operations, which worked fine without issues on PC and the big consoles, and which were actively used in the calculation logic of everything in the game, from game mechanics to shader data. Add to that a relatively slow ~1 GHz CPU and you get an unpleasant bonus in the form of an overall slowdown. We didn't fix it everywhere, because it was a million places and many hours of work; we changed the code only in the truly hot functions.

The cost of cache misses

Once I had to work with an engine like this.

class GameObject {
    Transform* transform;       
    AIBeh* ai;             
    PhBody* physics;       
    Animation* animations;   
    // ... many more components ...
};

And this approach was in every class, right down to object positions Point(x, y, z) and the variables x, y, z themselves. Why was it done this way? Well, that's the architect coming/going (a pun). If you thought about separate memory pools for components that would return pointers to pre-allocated memory — think again.

The code was simply written the way it was — each object created everything it needed dynamically; what it didn't need, it didn't create; that's how it was written from the start, apparently the team believed in the "zero cost everything" promise from the committee. Don't believe it — you pay for everything, even for empty pointers.

What was happening with the cache is a separate song. I don't remember the exact numbers anymore, but it was something like this:

L1 cache misses: 70+%
L2 cache misses: 60+%
L3 cache misses: 40+%

That is, half the time the processor just waited for data, or tried to fetch it from RAM. What happens with such data organization can be seen in this benchmark.

Benchmark of the disgrace

enum class AccessPattern { Linear, Random };

struct Position {
  float x, y, z;  
};

struct Health {
  int value;  
};

struct Team {
  int id;  
};

template <AccessPattern pattern>
static void GameObjectAccessCost(benchmark::State &state) {
  const size_t objectCount = static_cast<size_t>(state.range(0));

  struct GameObject {
    Position* pos = new Position();  
    Health *health = new Health();  
    Team* team = new Team();      
  };

  std::vector<GameObject> gameObjects(objectCount);

  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<int> healthDist(50, 200);
  std::uniform_real_distribution<float> posDist(-1000.0f, 1000.0f);

  for (auto &obj : gameObjects) {
    obj.pos->x = posDist(gen);
    obj.pos->y = posDist(gen);
    obj.pos->z = posDist(gen);
    obj.health->value = healthDist(gen);
    obj.team->id = healthDist(gen) % 4;
  }

  std::vector<size_t> accessOrder(objectCount);
  if constexpr (pattern == AccessPattern::Random) {
    std::iota(accessOrder.begin(), accessOrder.end(), 0);
    std::shuffle(accessOrder.begin(), accessOrder.end(), gen);
  } else {
    std::iota(accessOrder.begin(), accessOrder.end(), 0);
  }

  for (auto _ : state) {
    int totalHealth = 0;
    float totalDistance = 0.0f;

    for (size_t idx : accessOrder) {
      const auto &obj = gameObjects[idx];

      benchmark::DoNotOptimize(totalHealth += obj.health->value);
      benchmark::DoNotOptimize(
          totalDistance +=
                               std::sqrt(obj.pos->x * obj.pos->x +
                                         obj.pos->y * obj.pos->y +
                                         obj.pos->z * obj.pos->z));
    }
    benchmark::ClobberMemory();
  }
}

BENCHMARK(GameObjectAccessCost<AccessPattern::Linear>)
    ->MinTime(1)
    ->RangeMultiplier(8)
    ->Range(8 * 1024, 128 * 1024 * 1024)  
    ->Unit(benchmark::kMillisecond)
    ->Name("GameObjectAccess/Linear");

BENCHMARK(GameObjectAccessCost<AccessPattern::Random>)
    ->MinTime(1)
    ->RangeMultiplier(8)
    ->Range(8 * 1024, 128 * 1024 * 1024)
    ->Unit(benchmark::kMillisecond)
    ->Name("GameObjectAccess/Random");

Even with sequential access, all of it bottlenecked on memory work, or it just sat there in unloaded scenes, unable to squeeze out even 30 fps.

Linear/8192/min_time:1.000           0.015 ms        0.015 ms        99556
Linear/32768/min_time:1.000          0.065 ms        0.065 ms        21854
Linear/262144/min_time:1.000         0.761 ms        0.762 ms         1906
Linear/2097152/min_time:1.000         9.66 ms         9.59 ms          145
Linear/16777216/min_time:1.000        77.1 ms         77.3 ms           18
Linear/134217728/min_time:1.000        608 ms          602 ms            2
Random/8192/min_time:1.000           0.018 ms        0.018 ms        74667
Random/32768/min_time:1.000          0.144 ms        0.144 ms         9956
Random/262144/min_time:1.000          2.75 ms         2.76 ms          498
Random/2097152/min_time:1.000         66.5 ms         66.2 ms           21
Random/16777216/min_time:1.000         666 ms          664 ms            2
Random/134217728/min_time:1.000       7715 ms         7688 ms            1

If you make a simple layout instead of one through pointers, it already gets better. Just from the correct data layout, we have better data locality and higher speed.

  struct GameObject {
    float x, y, z;  
    int health;     
    int team;       
  };

Linear/8192/min_time:1.000           0.014 ms        0.014 ms        99556
Linear/32768/min_time:1.000          0.055 ms        0.055 ms        24889
Linear/262144/min_time:1.000         0.446 ms        0.444 ms         3200
Linear/2097152/min_time:1.000         4.21 ms         4.21 ms          345
Linear/16777216/min_time:1.000        34.5 ms         34.3 ms           41
Linear/134217728/min_time:1.000        268 ms          269 ms            5
Random/8192/min_time:1.000           0.014 ms        0.014 ms        99556
Random/32768/min_time:1.000          0.055 ms        0.055 ms        25600
Random/262144/min_time:1.000         0.705 ms        0.703 ms         1867
Random/2097152/min_time:1.000         19.8 ms         19.7 ms           69
Random/16777216/min_time:1.000         221 ms          221 ms            6
Random/134217728/min_time:1.000       1910 ms         1922 ms            1

War story

To my timid remarks, while I was still a mid, that not all was well in the state of Denmark, they simply piled on more tasks. The engine team was confident in its own infallibility and the correctness of the chosen path. For some reason I didn't pass the probation period and went off to work at EA. A year later, in 2015, that studio shut down — not necessarily because of the engine, but it does make you think.

Moral — value the cache, or the cache won't value you.

Hands off std::pair where you need to squeeze out more speed

In reality, any program needs memory — your Captain Obvious. But the operating speed is also affected by how this memory is organized, what structures are used, how they're placed in memory — especially in systems with limited resources.

Arbitrary data types (ADT, Algebraic Data Types) are one of the most desirable and powerful constructs in modern programming languages. They let you create complex types by combining simpler ones — for example, through "or" combinations (sum types, variants) and (product types, structures).

In plain C the only way to model an ADT is manual combination of struct and union, often with an explicit tag to indicate the active field. This is complicated, error-prone and requires careful application. In C++ we can use std::pair, std::tuple — for composite structures, std::variant — for variant values, std::optional — for representing a value that may be absent.

Despite the apparent simplicity and convenience, using these types is not "free" in terms of performance and compile time. This especially concerns std::function — it can lead to hidden heap allocations, as I showed above; <functional>, <tuple>, <variant> are among the "heaviest" headers of the standard library, containing tens of thousands of lines of template code; this code can't be cached and will be recompiled in every compilation unit — yes, only the needed parts, but the file will still be parsed anew.

It's naive to assume that std::pair works just as fast as a simple struct { T1 first; T2 second; }. It all depends on the nuances — how the compiler optimizes templates, whether debug flags are enabled, which standard-library implementation is used (libstdc++, libc++, MSVC STL), how exactly objects are created and copied.

https://quick-bench.com/q/46NOXqH9MFwr0FB0NcoORdClHH8

If you have hundreds of objects, you shouldn't think about things like std::pair — you won't see the difference, except in the profiler. If you have hundreds of thousands of objects that the engine operates on (models, textures, identifiers) — this already spills into seconds, minutes and hours of real time. And even 10% affects the build system.

War story

The St. Petersburg EA office also made SimCity BuildIt, one of the first city-builders for mobile; the engine itself and the resource-packing system were written in C++ — Marmalade (engine) + Marble (build system): textures, animations, level data — everything went through a complex pipeline that had a step for combining the meta-information about each resource into a DB.

The build was written by contractors (not the core team) with active use of modern things — std::pair, std::tuple and std::function for genericity. Each resource-bundle build (especially for iOS) took about 140 minutes, of which around 60+ went into the "metapackers" stage — processes that look simple but are loaded with template structures, like parsing textures, getting sizes, picking parameters for the atlas, adding to the resource system, and so on.

The time grew along with the addition of new objects, and when it crossed two hours someone on the team (not the author of this article, my paws hadn't grown enough back then) went to look at why exactly the metapackers were lagging. They ran the profiler on the build farm and saw a very interesting picture: a sea of time spent generating and copying objects of types std::tuple<std::string, std::function<bool(const ResourceInfo&)>> and std::pair<std::string, std::uint32_t>. All these constructs were passed by value between functions and copied many times — just some kind of factory for copying everything that could possibly be copied.

After getting rid of all this "noisy code", the metapacking step started working several times faster, the bundle build time returned to 60 minutes, and the build itself started torturing RAM less, which was especially rough for Jenkins nodes with a bunch of parallel builds. At the retro they got chewed out for poking into a system that wasn't theirs and now having to support it, and were given a T-shirt with the studio logo for reducing build times. But they could safely have been gifted a new Skoda Yeti — that's roughly how much the company was burning every month on renting capacity for Jenkins. Such fixes rarely make it into presentations or guides — more often into epic fails and retro post-mortems.

Moral — the build farm will endure anything; for optimizing build time you can get a T-shirt and a chewing-out.

The price of virtuality

To use virtual functions effectively, especially in the context of game performance, you need to understand in advance where and how they'll be applied. This may seem like a hard task for developers accustomed to the flexibility of OOP, but with a clear game-object design and a systematic approach (this is rare, but it happens) you can determine where polymorphism is really justified.

There's one important design philosophy you need to accept so that virtual functions don't become a bottleneck: all behavior in the game must be bounded and predictable. This shouldn't be perceived as a restriction of development freedom — these are the natural constraints of frame time. It means that any systems using virtual calls must avoid uncontrolled virtualization of functions at the drop of a hat.

Benchmark code

struct NonVirtual {
    int DoWork(int x) const {
        return x * 2;
    }
};

struct VirtualBase {
    virtual int DoWork(int x) const = 0;
    virtual ~VirtualBase() = default;
};

struct VirtualDerived : VirtualBase {
    int DoWork(int x) const override {
        return x * 2;
    }
};

static void BM_NonVirtualCall(benchmark::State &state) {
    NonVirtual obj;
    int sum = 0;
    for (auto _ : state) {
        sum += obj.DoWork(42);
    }
    benchmark::DoNotOptimize(sum);
}
BENCHMARK(BM_NonVirtualCall);

static void BM_VirtualCall(benchmark::State &state) {
    std::unique_ptr<VirtualBase> obj = std::make_unique<VirtualDerived>();
    int sum = 0;
    for (auto _ : state) {
        sum += obj->DoWork(42);
    }
    benchmark::DoNotOptimize(sum);
}
BENCHMARK(BM_VirtualCall);

static void BM_FunctionCall(benchmark::State &state) {
    std::function<int(int)> func = [] (int x) { return x * 2; };
    int sum = 0;
    for (auto _ : state) {
        sum += func(42);
    }
    benchmark::DoNotOptimize(sum);
}
BENCHMARK(BM_FunctionCall);

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
BM_NonVirtualCall      0.327 ns        0.297 ns   1000000000
BM_VirtualCall          1.75 ns         1.72 ns    373333333
BM_FunctionCall         1.46 ns         1.46 ns    448000000

War story

This story isn't quite about gamedev, not quite about virtual calls, but very close. In the late 2000s I happened to take part in developing a video-surveillance system that grabbed points from certain parts of a frame and tried to build a 3D model of a face passing through a certain area. The algorithm was pulled from OpenCV, and the main structure in it was Matrix4x4, the classic 4x4 matrix of float. It was used in transformations, computations, and in general the whole face description was a set of 4k of these matrices. The recognition ran on some servers, there were plenty of turnstiles, plenty of passing faces, so this structure was used millions of times per second, but everything worked acceptably, recognizing roughly 0.5 faces per second from a single camera, gobbling up a ton of watts.

Time passed, and management came up with the task of reducing the size of the stored points (read: matrices), wrote a spec, threw it to the team, but instead of using profiling tools or template wrappers, the OOP-on-the-brain programmers simply made a base class IMatrix with a virtual function GetDiff(), from which they inherited Matrix4x4. That is, this function itself didn't participate in the computations.

At first glance the idea looked harmless: a virtual call that returned the diff of changes from the base matrix, which was usually 1–2% and made it possible not to store all 4k points. You can probably already imagine what happened when all this was rolled out to prod? The performance of the computational pipeline dropped sharply.

Now they brought in the profiler and proper developers to fix the bug. Adding the virtual function led to the appearance of a vtable and an increase in the size of Matrix4x4 from 64 to 72 bytes. This broke the alignment and destroyed cache locality, the neat offsets in the arrays floated. The matrices no longer lay densely, and the SIMD logic started working worse. The number of cache misses rose sharply, which led to functions that worked with matrices starting to spend more time. The servers on the IA-64 architecture suffered especially badly, where offsets and logic were specially calculated for maximum cache load and register math, and any deviation in alignment was very critical.

In the end, one day everyone got a beating, and that same night the update was rolled back: the virtual functions were removed, Matrix4x4 became a POD type without a vtable again, and the structure size returned to 64 bytes. All the logic was rewritten through a separate wrapper that did all this itself and in a separate thread. Performance returned to its previous level. Why it wasn't done that way from the start is already a separate story.

Moral — never add virtual functions to base math structures, especially those that participate in tight loops, SIMD computations or are stored in raw arrays. Even a single virtual can mess things up badly and seriously slow down the computations.

And on that note, really, game over!

UPD: special thanks to @Serpentine, who edited and fixed errors in the articles.

← All articles