The magic of shader swizzle in C++

Swizzle is one of the nicest little things in shader languages: you can address any vector by reordering or duplicating its components, for example vec.xxy, vec.zx, vec.rgb. C++ has nothing like it out of the box, but you really want to write engine and game code as tersely as in GLSL. This article shows how we added swizzle to the engine's vector math via unions and templates, with no overhead in memory or in speed.

What's in a swizzle…

The best way to understand how swizzle works is to show it by example:

vec3 vec{1.0, 2.0, 3.0};
vec4 a;
vec3 b;
vec2 c;
float d;

b = vec.xyz;  // b is now (1.0, 2.0, 3.0)
d = vec[2];   // d is now 3.0
a = vec.xxxx; // a is now (1.0, 1.0, 1.0, 1.0)
c = vec.zx;   // c is now (3.0, 1.0)
b = vec.rgb;  // b is now (1.0, 2.0, 3.0)
a = vec.yy;   // error: inconsistent size

There are several ways to solve this: use the code from the glm library, from CxxSwizzle, or write your own implementation. I recommend glm to everyone who is starting to write their own engine or wants to get a feel for game engines. That said, in my opinion its code is very specific and verbose — it's a legacy of the rough '90s (the first code was written in June 1992, that's more than 31 years ago).

You can also look at the implementation here. It implements a lot of extra functions, but a game engine usually already has most of that logic, even if in a different form. It's a good solution, but, again, the code there is very verbose. We, on the other hand, wanted a simple and elegant solution that fits in five lines of code.

The swizzling task isn't very hard in itself, and some form of it often shows up in interview questions, and not only at game companies. Before we get to the code, I want to draw your attention to a few points we kept in mind while building our solution:

Execution speed. The solution must not add any computational cost compared to equivalent scalar operations, must not sacrifice computational efficiency for the sake of swizzling, and where possible should use SSE intrinsics.
Use of non-adjacent vector components, e.g. v0.xxy + v1.xzy. This requires being able to access and operate on a vector's components in a custom order.
Writing into a swizzled vector, with constraints (we later dropped this so as not to bloat the logic; in 99% of cases it isn't used): e.g. v1.yxwz = v0;, on the condition that repeated elements are explicitly forbidden, i.e. each element must be assigned a value exactly once. Something like v1.yxwy = v0; must be rejected, because y would be written twice.
No additional memory overhead: the base type must not change in size. This means vec3 will occupy the space equivalent to storing three elements of its data type, with no extra overhead.
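
The "each element is written exactly once" constraint from the list above can be checked at compile time. Below is a minimal sketch, assuming C++14 or later; the helper name all_unique is hypothetical and is not part of the engine code.

```cpp
#include <cassert>

// Hypothetical helper: true when no index appears twice in the pack.
// Expects at least one index (an empty pack would give a zero-size array).
template <int... Idx>
constexpr bool all_unique() {
    constexpr int idx[] = {Idx...};
    constexpr int n = sizeof...(Idx);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            if (idx[i] == idx[j]) return false;
    return true;
}

// v.yxwz is a valid write target, v.yxwy is not (y would be written twice).
static_assert(all_unique<1, 0, 3, 2>(), "yxwz may be assigned to");
static_assert(!all_unique<1, 0, 3, 1>(), "yxwy must be rejected");

// Runtime use of the same check, just to show it is an ordinary function.
inline void all_unique_demo() {
    assert((all_unique<0, 2>()));   // xz
    assert(!(all_unique<0, 0>()));  // xx
}
```

A static_assert on such a predicate inside the assignment operator would turn a duplicated write mask into a compile error rather than a silent overwrite.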

There are several different ways to achieve the syntax v1.yxwz = v0; without parentheses: macros, the use of unions, and proxy objects. With macros you can hide functions, for example:

#define yxwz _yxwz()
vec4 vecb = veca.yxwz;

The problem with any macros in general is that they make debugging harder, clutter the code, and have usage quirks. You can try to solve this with proxy objects that hold a reference to the source vector, but the solution already looks complicated. On top of that, operations on them may go through a pointer, which lowers overall performance when working with such data. And besides, I just don't like using macros when I can get by with code.
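
To make the cost of reference-holding proxies concrete, here is a rough sketch, assuming a 2-component float vector. The names ref_proxy2 and vec2_ref are hypothetical and exist only for this illustration; they are not what the engine uses.

```cpp
#include <cassert>

// Hypothetical proxy that points back at the vector it belongs to.
template <int X, int Y>
struct ref_proxy2 {
    float* base;                              // one pointer per proxy
    float first()  const { return base[X]; }  // every read goes through it
    float second() const { return base[Y]; }
};

// To get `v.yx` without parentheses the proxy has to be a data member.
struct vec2_ref {
    float data[2];
    ref_proxy2<1, 0> yx{data};
};

// The vector is no longer "just two floats".
static_assert(sizeof(vec2_ref) > 2 * sizeof(float), "proxy member adds storage");

inline void ref_proxy_demo() {
    vec2_ref v{{1.0f, 2.0f}};
    assert(v.yx.first() == 2.0f);
    assert(v.yx.second() == 1.0f);
}
```

Besides the extra storage, a copied vec2_ref would keep pointing at the original's data, so such a type also needs hand-written copy logic. Both problems go away when the proxy carries no pointer at all.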

So after a discussion in the smoking room we settled on a union-based implementation — the pseudocode for such a union looks like this:

union {
    float v[2];
    ??? xx;
    ??? xy;
    ??? yx;
    ??? yy;
    ...
};

So as not to write a lot of wrappers and to fit the new data types into a union, they must be plain old data — that makes it easier to pull off the swizzling trick. All that's left is to build it so that we don't have to write these classes by hand — let the compiler do the work. It will look something like this (pseudocode):

template<typename T, int X, int Y>
struct proxy {
    template<int X2, int Y2>
    proxy& operator = (const proxy<T, X2, Y2> &o) {
        ((T*)this)[X] = ((const T*)&o)[X2];   // `this` is the start of the vector's memory
        ((T*)this)[Y] = ((const T*)&o)[Y2];
        return *this;
    }
};

Here we reference this, i.e. the start of the class, implying that swizzling won't exist as a standalone data type but will only map onto the base class's memory. A naive implementation, it turned out, triggers fits of warnings in almost every compiler and false-positive hits in static analyzers. But the pseudocode above demonstrates the core idea of implementing swizzling operators with two elements: xx, xy, wx, and so on. The template arguments X and Y can be any indices of the vector's elements.

As you may have noticed, the class has no data of its own, and such free handling of the pointer to the class's start caused not only a heap of warnings when compiling the first versions under PS5 clang, but also increased compile time by almost 10%, from a notional five minutes to six. Once the class was given at least a minimal data member, compile time went back to normal.

template<typename T, int X, int Y>
struct proxy {
    byte v;
    ...
};

In the end a local array was added to the class, matching the base type in size. Since all the main operators were already implemented in the vec2/3/4 classes, the swizzle class is left with only the cast-to-vector operator and the basic logic. So here's the first implementation, the way it shipped into the engine:

template<template<typename> class TT, typename T, size_t X, size_t Y>
struct swizzle_vec2{
  	T v[2];
	inline TT<T>& operator=	(const TT<T>& rhs) {
		v[X]				= rhs[0];
		v[Y]				= rhs[1];
		return				*(TT<T>*)this;
	}
	inline auto &operator=	(const swizzle_vec2& rhs) {
		v[X]				= rhs.v[X];
		v[Y]				= rhs.v[Y];
		return				*this;
	}
	inline operator TT<T>	() const { return TT<T>{v[X], v[Y]}; }
};

And copy-paste for each of the vec3/vec4 classes. The engine sticks to the DRY principle, but to road-test the feature and ship quickly we decided to postpone the cleanup, gathering feedback on working with the new functionality.

template<template<typename> class TT, typename T, size_t X, size_t Y, size_t Z>
struct swizzle_vec3{
	T v[3];
	inline TT<T>& operator=	(const TT<T>& rhs) {
		v[X]				= rhs[0];
		v[Y]				= rhs[1];
		v[Z]				= rhs[2];
		return				*(TT<T>*)this;
	}
  ...

	inline operator TT<T>	() const { return TT<T>{v[X], v[Y], v[Z]}; } 	// unpack
};

For swizzling to work transparently in code, each class also needed union structures that provide it:

// v2 swizzles
#define swizzle_v2(TT)      \
swizzle_vec2<TT, T, 0, 0> xx; \
swizzle_vec2<TT, T, 0, 1> xy; \
swizzle_vec2<TT, T, 1, 0> yx; \
swizzle_vec2<TT, T, 1, 1> yy; \
...

template <class T>
struct vec2 {
	union {
		struct { T		x, y; };
		swizzle_v2(vec2)
	};
};
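
The "no additional memory overhead" requirement is easy to verify with static_assert. The sketch below is a simplified stand-in, not the engine's classes: proxy2_layout and vec2_layout are hypothetical names, and the proxy is reduced to its storage.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Stand-in for the swizzle proxy: no state beyond the overlaid array.
template <typename T, std::size_t X, std::size_t Y>
struct proxy2_layout { T v[2]; };

template <typename T>
struct vec2_layout {
    union {
        T v[2];
        proxy2_layout<T, 0, 0> xx;
        proxy2_layout<T, 0, 1> xy;
        proxy2_layout<T, 1, 0> yx;
        proxy2_layout<T, 1, 1> yy;
    };
};

// Every union member shares the same bytes, so the size stays at two elements.
static_assert(sizeof(vec2_layout<float>) == 2 * sizeof(float), "no size overhead");
static_assert(alignof(vec2_layout<float>) == alignof(float), "no alignment change");
static_assert(std::is_trivially_copyable<proxy2_layout<float, 1, 0>>::value,
              "proxy stays plain data");

inline void layout_demo() {
    assert(sizeof(vec2_layout<double>) == 2 * sizeof(double));
}
```

Checks like these are cheap to keep next to the vector classes, so a later change that accidentally grows the type fails at compile time.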

The final version

#include <cstddef>
#include <cstdio>

template<template<typename> class TT, typename T, size_t X, size_t Y>
struct swizzle_vec2 {
	T v[2];
	inline TT<T>& operator=	(const TT<T>& rhs) {
		v[X]				= rhs[0];					// access pack element 0
		v[Y]				= rhs[1];					// access pack element 1
		return				*(TT<T>*)this;
	}
	inline auto &operator=	(const swizzle_vec2& rhs) {
		v[X]				= rhs.v[X];
		v[Y]				= rhs.v[Y];
		return				*this;
	}

	inline operator TT<T>	() const { return TT<T>{v[X], v[Y]}; } 	// unpack
};

#define swizzle_v2(TT)      \
swizzle_vec2<TT, T, 0, 0> xx; \
swizzle_vec2<TT, T, 0, 1> xy; \
swizzle_vec2<TT, T, 1, 0> yx; \
swizzle_vec2<TT, T, 1, 1> yy;

template <class T>
struct vec2 {
	union {
		struct { T		x, y; };
            swizzle_v2(vec2)
	};
};

int main() {
    vec2<float> veca{0, 1};
    printf("veca{%f, %f}\n", veca.x, veca.y);

    vec2<float> vecb = veca.yx;
    printf("vecb{%f, %f}", vecb.x, vecb.y);
    return 0;
}

You can try it here.

The implementation for vec3/vec4 differed only in the big macro that lists the final swizzle structures. That was the price: we wanted to make our colleagues' lives easier, so the library code carries more logic.

#define swizzle_v4(TT)      \
swizzle_vec2<TT, T, 0, 0> xx; \
swizzle_vec2<TT, T, 0, 1> xy; \
swizzle_vec2<TT, T, 0, 2> xz; \
swizzle_vec2<TT, T, 0, 3> xw; \
....
swizzle_vec3<TT, T, 0, 0, 0> xxx; \
swizzle_vec3<TT, T, 0, 0, 1> xxy; \
swizzle_vec3<TT, T, 0, 0, 2> xxz; \
swizzle_vec3<TT, T, 0, 0, 3> xxw; \
swizzle_vec3<TT, T, 0, 1, 0> xyx; \
.....
swizzle_vec4<TT, T, 0, 0, 0, 0> xxxx; \
swizzle_vec4<TT, T, 0, 0, 0, 1> xxxy; \
swizzle_vec4<TT, T, 0, 0, 0, 2> xxxz; \
swizzle_vec4<TT, T, 0, 0, 0, 3> xxxw; \
swizzle_vec4<TT, T, 0, 0, 1, 0> xxyx; \

Copy-paste is a very bad solution in any case, but it let us ship this functionality quickly and gather feedback. The solution worked, but the knowledge that it could be done better with less code, and that the copy-paste could be removed, wouldn't let me sleep peacefully. Later, during a refactor, a generic solution was built and the walls of macros removed.

The cat does not approve of copy-paste!

Mornings are never good

One morning the tests started failing. They began failing at this point:

...
vec3<float> vecc{1, 2, 3};
printf("vecc{%f, %f, %f}\n", vecc.x, vecc.y, vecc.z);

vec3<float> vecd{0, 0, 0};
vecd.xz = vecc.xz;
printf("vecd{%f, %f, %f}\n", vecd.x, vecd.y, vecd.z); // <-- the test fails here
...

output:
	vecc{1.000000, 2.000000, 3.000000}
	vecd{1.000000, 2.000000, 0.000000} -> vecd should be {1.0, 0.0, 3.0}

Blaming the sources led to this change in the code:

template<template<typename> class TT, typename T, size_t X, size_t Y>
struct swizzle_vec2{
	T v[2];
	inline TT<T>& operator=	(const TT<T>& rhs) {
		v[X]				= rhs[0];
		v[Y]				= rhs[1];
		return				*(TT<T>*)this;
	}

	inline auto &operator=	(const swizzle_vec2& rhs) = default;	// <-- the change that broke the tests

    inline operator TT<T>	() const { return TT<T>{v[X], v[Y]}; }
};

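At first glance everything's fine: this code would be correct if swizzle_vec2 were a full-fledged class with its own data, rather than a mapping onto someone else's memory. As it turns out, the default copy operator does exactly what it's meant to: it copies the object's own bytes, here the two floats of v, from one place to another. It knows nothing about X and Y, so for vecd.xz = vecc.xz it copies x and y instead of x and z.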

inline auto &operator=	(const swizzle_vec2& rhs) = default;
->
call    memset@PLT						// vecd.xz = vecc.xz
mov     rax, qword ptr [rbp - 36]			// vecd.xz = vecc.xz
mov     qword ptr [rbp - 48], rax			// vecd.xz = vecc.xz
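
The difference between the two copy operators can be modelled without any unions. The following is a simplified sketch on a plain array of three floats, assuming the proxy overlays two floats as in the article; defaulted_copy and indexed_copy are hypothetical helpers that only imitate the behaviour, they are not the engine's operators.

```cpp
#include <array>
#include <cassert>
#include <cstring>

using float3 = std::array<float, 3>;

// What a defaulted copy of a two-float proxy amounts to: copy the proxy's
// own bytes, i.e. the first two floats, whatever X and Y are.
inline float3 defaulted_copy(float3 dst, const float3& src) {
    std::memcpy(dst.data(), src.data(), 2 * sizeof(float));
    return dst;
}

// What `dst.xz = src.xz` is supposed to do: copy only components X and Y.
template <int X, int Y>
float3 indexed_copy(float3 dst, const float3& src) {
    dst[X] = src[X];
    dst[Y] = src[Y];
    return dst;
}

inline void copy_demo() {
    const float3 vecc{{1.0f, 2.0f, 3.0f}};
    const float3 vecd{{0.0f, 0.0f, 0.0f}};
    const float3 bad = defaulted_copy(vecd, vecc);       // {1, 2, 0}
    const float3 good = indexed_copy<0, 2>(vecd, vecc);  // {1, 0, 3}
    assert(bad[1] == 2.0f && bad[2] == 0.0f);
    assert(good[1] == 0.0f && good[2] == 3.0f);
}
```

The first result matches the failing test output above, the second is what the test expected.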

The bug was fixed quickly, and at the same time we set aside time for a code refactor. In the end all the classes could be reduced to a single template:

template <typename T, int RSIZE, int... INDEXES>
struct swizzle {
	T v[RSIZE];
	static constexpr int indexes[] = {INDEXES...};

    template <int... INDEXES2>
    inline auto &operator=(const swizzle<T, RSIZE, INDEXES2...> &rhs) {
		static_assert(sizeof...(INDEXES) == RSIZE, "error: assigning swizzle of different dimensions");
		constexpr int rindexes[] = {INDEXES2...};
		for(int i = 0; i < sizeof...(INDEXES); ++i) { v[indexes[i]] = rhs.v[rindexes[i]]; }
		return *this;
	}

	inline auto &operator=(const swizzle &rhs) {
		for(int i = 0; i < sizeof...(INDEXES); ++i) { v[indexes[i]]	= rhs.v[indexes[i]]; }
		return *this;
	}
};
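
The loops over the index arrays can also be written as a pack expansion, which leaves nothing to unroll at run time. This is a sketch of that alternative on raw float pointers, assuming C++17 fold expressions; the tag type idx and the function swizzle_assign are hypothetical and do not appear in the engine.

```cpp
#include <cassert>

// Hypothetical index-list tag, used only to carry the indices as a type.
template <int... I> struct idx {};

// Expands to dst[D0] = src[S0], dst[D1] = src[S1], ... at compile time.
template <int... D, int... S>
void swizzle_assign(float* dst, idx<D...>, const float* src, idx<S...>) {
    static_assert(sizeof...(D) == sizeof...(S), "swizzles must have the same length");
    static_assert(sizeof...(D) > 0, "at least one component is required");
    const float tmp[] = {src[S]...};  // read everything first, so dst may alias src
    int i = 0;
    ((dst[D] = tmp[i++]), ...);       // fold over the comma operator, left to right
}

inline void swizzle_assign_demo() {
    float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    swizzle_assign(v, idx<0, 1>{}, v, idx<1, 0>{});  // v.xy = v.yx
    assert(v[0] == 2.0f && v[1] == 1.0f);
}
```

Reading all source components into a temporary first is a deliberate choice here: it keeps v.xy = v.yx correct when both sides refer to the same vector, at the cost of a few extra moves that the optimizer can usually remove.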

And for the concrete vector implementations only the cast to the specific type was left:

template<class T> struct vec2;
template<class T> struct vec3;

template <typename T, int RSIZE, int... SWIZZLES>
struct swizzle2	: public swizzle<T, RSIZE, SWIZZLES...> {
	// v and indexes live in a dependent base class, so they must be reached through this->
	inline auto &operator=	(const vec2<T>& l) {
		static_assert		(sizeof...(SWIZZLES) == 2, "error: assigning swizzle2 not from vec2f");
		this->v[this->indexes[0]] = l.x;
		this->v[this->indexes[1]] = l.y;
		return				*this;
	}
	inline operator vec2<T> () const {
		static_assert(sizeof...(SWIZZLES) > 1, "error: no data that convert to vec2");
		return vec2<T>{this->v[this->indexes[0]], this->v[this->indexes[1]]};
	}
};

template <typename T, int RSIZE, int... SWIZZLES>
struct swizzle3 : public swizzle<T, RSIZE, SWIZZLES...> {
	inline auto &operator=	(const vec3<T>& l) {
		static_assert		(sizeof...(SWIZZLES) == 3, "error: assigning swizzle3 not from vec3f");
		this->v[this->indexes[0]] = l.x;
		this->v[this->indexes[1]] = l.y;
		this->v[this->indexes[2]] = l.z;
		return				*this;
	}
	inline operator vec3<T> () const {
		static_assert(sizeof...(SWIZZLES) > 2, "error: no data that convert to vec3");
		return vec3<T>{this->v[this->indexes[0]], this->v[this->indexes[1]], this->v[this->indexes[2]]};
	}
};

For complete happiness, all that remains is to fold up the wall of proxy structures that's still present in every vec2/3/4 class:

#define _(x) $(x ## xx) $(x ## xy) $(x ## xz) $(x ## yx) $(x ## yy) $(x ## yz) $(x ## zx) $(x ## zy) $(x ## zz)
		_(x) _(y) _(z) _(w)

The final vector-class code then looks like this:

template <class T>
struct vec2 {
	union {
		  struct { T x, y; };
		  #define $(name) swizzle2<T, 2, swizzle_idx__(#name, 0), swizzle_idx__(#name, 1)> name;
          		$(xx) $(xy) $(yx) $(yy)
              #undef $
	};
};

I'll put the example code under a spoiler — it's not very interesting, just a working implementation (godbolt).

The full example

#include <cstdio>

template <typename T, int SIZE, int... INDEXES>
struct swizzle {
	T v[SIZE];
	static constexpr int indexes[] = {INDEXES...};

	template <int RSIZE, int... INDEXES2>
    inline auto &operator=	(const swizzle<T, RSIZE, INDEXES2...>& rhs) {
		static_assert		(SIZE == RSIZE, "error: assigning swizzle of different dimensions");
		constexpr int rindexes[] = {INDEXES2...};

		for(int i = 0; i < SIZE; ++i) {
			v[indexes[i]]	= rhs.v[rindexes[i]];
		}

		return				*this;
	}

	inline auto &operator=	(const swizzle& rhs) {
		for(int i = 0; i < SIZE; ++i) {
			v[indexes[i]]	= rhs.v[indexes[i]];
		}
		return				*this;
	}
};

template<class T> struct vec2;
template<class T> struct vec3;

template <typename T, int SIZE, int... SWIZZLES>
struct swizzle2	: public swizzle<T, SIZE, SWIZZLES...> {
    inline swizzle<T, SIZE, SWIZZLES...> &operator=	(const vec2<T>& l) {
		static_assert		(sizeof...(SWIZZLES) > 1, "error: assigning swizzle2 not from vec2f");
		this->v[this->indexes[0]]	= l.x;
		this->v[this->indexes[1]]	= l.y;
		return				*this;
	}

	inline operator vec2<T> () const {
		static_assert(sizeof...(SWIZZLES) > 1, "error: no data that convert to vec2");
		return vec2<T>{this->v[this->indexes[0]], this->v[this->indexes[1]]};
	}
};

template <typename T, int SIZE, int... SWIZZLES>
struct swizzle3				: public swizzle<T, SIZE, SWIZZLES...> {
	inline swizzle<T, SIZE, SWIZZLES...> &operator=	(const vec3<T>& l) {
		static_assert		(SIZE == 3, "error: assigning swizzle3 not from vec3f");
		this->v[this->indexes[0]] = l.x;
		this->v[this->indexes[1]] = l.y;
		this->v[this->indexes[2]] = l.z;
		return				*this;
	}

	inline operator vec3<T> () const {
		static_assert		(SIZE > 2, "error: no data that convert to vec3");
		return				vec3<T>{this->v[this->indexes[0]], this->v[this->indexes[1]], this->v[this->indexes[2]]};
	}
};

constexpr int inline swizzle_idx__(const char *x, int offset) { switch(*(x+offset)) { case 'x': return 0; case 'y': return 1; case 'z': return 2;  case 'w': return 3; } return -1; }

template <class T>
struct vec2 {
	union {
		  struct { T		x, y; };
		  #define $(name) swizzle2<T, 2, swizzle_idx__(#name, 0), swizzle_idx__(#name, 1)> name;
          		$(xx) $(xy) $(yx) $(yy)
              #undef $
	};
};

template <class T>
struct vec3 {
	union {
		struct { T		x, y, z; };

            #define $(name) swizzle2<T, 2, swizzle_idx__(#name, 0), swizzle_idx__(#name, 1)> name;
          		$(xx) $(xy) $(xz) $(yx) $(yy) $(yz) $(zx) $(zy) $(zz)
              #undef $

            #define $(name) swizzle3<T, 3, swizzle_idx__(#name, 0), swizzle_idx__(#name, 1),  swizzle_idx__(#name, 2)> name;
            #define _(x) $(x ## xx) $(x ## xy) $(x ## xz) $(x ## yx) $(x ## yy) $(x ## yz) $(x ## zx) $(x ## zy) $(x ## zz)
    		    _(x) _(y) _(z)
            #undef  _
            #undef  $
	};
};

int main() {
    vec2<float> veca{0, 1};
    printf("veca{%f, %f}\n", veca.x, veca.y);

    vec2<float> vecb = veca.yx;
    printf("vecb{%f, %f}\n", vecb.x, vecb.y);

    vec3<float> vecc{1, 2, 3};
    printf("vecc{%f, %f, %f}\n", vecc.x, vecc.y, vecc.z);

    vec3<float> vecd{0, 0, 0};
    vecd.xz = vecc.xz;
    printf("vecd{%f, %f, %f}\n", vecd.x, vecd.y, vecd.z);
    return 0;
}
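
The name-to-index mapping done by swizzle_idx__ is a plain constexpr function, so it can be tested with static_assert. The sketch below is a hypothetical variant, swizzle_idx_rgba, that also accepts the colour names from the vec.rgb example at the start of the article; the full example above handles only x, y, z, w. It assumes C++14 for the switch inside a constexpr function.

```cpp
#include <cassert>

// Hypothetical variant of swizzle_idx__ that also accepts colour names.
constexpr int swizzle_idx_rgba(const char* name, int offset) {
    switch (name[offset]) {
        case 'x': case 'r': return 0;
        case 'y': case 'g': return 1;
        case 'z': case 'b': return 2;
        case 'w': case 'a': return 3;
    }
    return -1;  // not a component name
}

static_assert(swizzle_idx_rgba("zx", 0) == 2, "z is component 2");
static_assert(swizzle_idx_rgba("zx", 1) == 0, "x is component 0");
static_assert(swizzle_idx_rgba("rgb", 2) == 2, "b aliases z");
static_assert(swizzle_idx_rgba("q", 0) == -1, "unknown names map to -1");

inline void swizzle_idx_demo() {
    assert(swizzle_idx_rgba("bgra", 0) == 2);
    assert(swizzle_idx_rgba("bgra", 3) == 3);
}
```

With such a function, a member named rgb could reuse the same declaration macro as xyz, since both names resolve to the same index triple.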

Benchmarks

If we look at what our swizzling does in asm, it does exactly what we wrote — it shuffles bytes from one place to another (godbolt).

For the implemented class, the difference between 2/3-swizzling is entirely expected and depends only on the number of operations inside (quick-bench).

Unfortunately, the web version doesn't let you pull in the code from glm and CxxSwizzle, so I had to set it up locally. The volatile here is there so the compiler can't cheat with constant assignment.

float volatile xx;
float volatile x = xx + 1;
float volatile y = xx + 2;
float volatile z = xx + 3;

static void Vec3Swizzling(benchmark::State& state) {
  for (auto _ : state) {
    vec3<float> veca{x, y, z};
    vec3<float> vecb{1, 2, 3};
    vecb.xz = veca.xz;
    benchmark::DoNotOptimize(veca);
    benchmark::DoNotOptimize(vecb);
  }
}
BENCHMARK(Vec3Swizzling);

static void Vec3SwizzlingGlm(benchmark::State& state) {
  for (auto _ : state) {
    glm::vec3<float> vecc{x, y, z};
    glm::vec3<float> vecd{0, 0, 0};
    vecd.xz = vecc.xz;
    benchmark::DoNotOptimize(vecc);
    benchmark::DoNotOptimize(vecd);
  }
}
BENCHMARK(Vec3SwizzlingGlm);

static void Vec3SwizzlingCxxSwizzle(benchmark::State& state) {
  for (auto _ : state) {
    cxx::vec3<float> vecc{x, y, z};
    cxx::vec3<float> vecd{0, 0, 0};
    vecd.xz = vecc.xz;
    benchmark::DoNotOptimize(vecc);
    benchmark::DoNotOptimize(vecd);
  }
}
BENCHMARK(Vec3SwizzlingCxxSwizzle);

By the benchmarks, the current implementation is 1.2x faster than the one in glm and 2x faster than CxxSwizzle. The speedup comes from the absence of the extra checks present in those libs, and from better adaptation to the engine's data structures.

You can speed swizzling up further by using SSE intrinsics for shuffling — then in asm it'll be a single operation altogether. There's the question of how it all lays out in memory, but that'll be the topic of the next refactor.
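
As a rough idea of what the SSE route could look like, here is a sketch of a 4-component read swizzle built on _mm_shuffle_ps. It is not the engine's code: the function swizzle4 is hypothetical, it works on unaligned float[4] buffers, and it falls back to scalar code when SSE is not available.

```cpp
#include <cassert>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SWIZZLE_SKETCH_SSE 1
#endif

// Writes (in[X], in[Y], in[Z], in[W]) to out. `in` and `out` may be the same buffer.
template <int X, int Y, int Z, int W>
void swizzle4(const float* in, float* out) {
    static_assert(X >= 0 && X < 4 && Y >= 0 && Y < 4 &&
                  Z >= 0 && Z < 4 && W >= 0 && W < 4, "index out of range");
#ifdef SWIZZLE_SKETCH_SSE
    const __m128 v = _mm_loadu_ps(in);
    // _MM_SHUFFLE lists lanes from the highest down to the lowest.
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(W, Z, Y, X)));
#else
    const float tmp[4] = {in[X], in[Y], in[Z], in[W]};
    for (int i = 0; i < 4; ++i) out[i] = tmp[i];
#endif
}

inline void swizzle4_demo() {
    const float in[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float out[4];
    swizzle4<3, 2, 1, 0>(in, out);  // wzyx: reverse the components
    assert(out[0] == 4.0f && out[3] == 1.0f);
}
```

The shuffle itself is one instruction, but the loads and stores around it are where the memory-layout question mentioned above comes in: a vec3 of three floats does not fill a 16-byte register, so the storage would have to be padded or loaded with care.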
