Performance
Notes on what happens below the level of algorithms — CPU caches, data coherence, synchronization primitives and alignment. All with benchmarks and C++ examples.
- Performance
How memory works in game consoles
A history of console memory architecture from the Atari 2600 to the PS5 and Xbox Series: cycle counting and blanking intervals, bus sharing and DMA, the fast-but-slow RDRAM on the N64, the Dreamcast tile renderer, the PS2's three memory islands, SPUs and Local Store on the PS3, PS4 memory unification, and the SSD as a new level of the memory hierarchy.
- Performance
Redundant computations
How speculative execution and branch prediction interact with the memory subsystem: why branchless code isn't always faster, the cost of a misprediction versus memory latency, and benchmarks of branching and branchless binary search on the Jaguar CPU (PS4).
- Performance
Why don't we have L5 caches?
Why a CPU has several cache levels instead of one big one: on-die signal physics, a Christmas-ornament analogy, the history from the i486 to Nehalem, splitting L1I/L1D, coherence protocols, eDRAM and HBM as a "hidden" L4, and why L5 makes no sense yet.
- Performance
Hunting the red fps
Game profiling as a development philosophy: the evolution of FPS expectations; how artist content hurts performance (geometry, textures, rendering, culling); profiling methods (sampling, instrumentation, tracing, event-based profiling, static analysis and replays); and the design of a tracing profiler with a zero-allocation architecture and string handles.
- Performance
The good, the bad, the colored and the fast
Unusual memory allocators from real-world practice: a lifetime-tracking allocator for hunting leaks, a randomizing (chaos) allocator for stress-testing long-running systems, a "colored" allocator for subsystem isolation (Deathloop), and the fast TLSF, with diagrams and a CPU-cycle benchmark.
- Performance
I reject commits that use the heap and ask colleagues to rewrite that logic
Experience building large C++ games without the heap: why dynamic allocations are dangerous on consoles and mobile (fragmentation, malloc-time degradation, a certification rejection) and what to replace them with — fixed-size containers gtl::vector<T,N>, etl/FastDelegate instead of std::function, CRTP and static polymorphism with std::variant, object pools and placement new.
- Performance
How to fit an elephant into a suitcase
The levels of optimization when porting games to low-end hardware: architecture, algorithms and code. A story about an expensive string-escaping check, from a naive loop to a branchless version, a lookup table and SIMD with _mm_movemask_epi8.
- Performance
Spears & bits
Packing bools into bits: std::bitset, std::vector<bool>, a naive byte array and the CryEngine approach, benchmarks with spikes at 32/64/128, branch mispredictions, and a final SSE vectorization with _mm_movemask_epi8.
- Performance
Task-based thinking in game engines
The core task-system patterns in engines: offloading movement, physics, scripts, pathfinding, animation, rendering and I/O to separate threads, grouping tasks, message passing, and a comparison of a simple task, a thread manager and a task manager.
- Performance
A 486 ought to be enough for everyone
The general-purpose technologies that have appeared in CPUs since the 486 and that we use every day: caches and prefetching, the TLB, speculative execution, NUMA, SIMD, out-of-order execution, GPGPU and branch prediction, with console examples and benchmarks.
- Performance
Just don't copy that
Pass-by-value vs const&, hidden allocations in strings and vectors, reserve, and a real-world drop to 10 FPS, all with QuickBench benchmarks.
- Performance
And 20 cores still aren't enough
Eleven C++ optimization techniques from production engines: cache warming, compile-time dispatch, loop unrolling, signed/unsigned and type matching, hotpath isolation, SIMD, prefetch, constexpr and lock-free — with benchmarks and war stories.
- Performance
No need to rush… (About spinlocks)
C++ spinlock implementations, from a naive TAS (test-and-set) lock to a ticket spinlock, with benchmarks, cache misses and the nuances of thread prioritization.
- Performance
Cache pollution? Stock up on tests
How caches, associativity, coherence and data alignment affect performance — with benchmarks.