A few years ago, while working at Gaijin, I got to take part in porting a couple of the company's games to the Nintendo Switch, which was rapidly conquering new markets at the time. For me it was the first major project on the platform. And since neither the team nor the engine developers were familiar with the platform, the build system, or the Nintendo ecosystem in general, we had to find every rake and carefully step on it. To kick the tires of the new platform, alongside porting the game we wrote an internal middleware layer (dagor engine + nxsdk + jam), and the code grew all sorts of tests, build matrices, benchmarks, stability soak runs and other internal checks. It's worth noting that as of 2018, the Switch SDK itself hadn't implemented some POSIX functions like poll and send/receive, and most of the file I/O functions; we had to write the POSIX shim ourselves. At some point we also got around to writing various benchmarks for the standard-library functions, and we noticed some anomalies in the behavior of a few trigonometric functions across different build configurations. For reference, the SDK uses a trimmed-down variant of musl libc (musl-libc.org); everything is statically linked into one big binary by Nintendo's clang version 9 (2018), which then runs on the console. We didn't have access to the sources of Nintendo's own libc, but you can always look at the disassembly and get a more-or-less clear picture of what's going on.
I'll describe my actions in the present tense — just remember that it was 2018. The benchmark code is utterly childish: we spin sin in a loop on the console (in the SDK all the trigonometry lived in tr1).
Show code
static void nxsdk_sin(benchmark::State& state) {
// Code inside this loop is measured repeatedly
float i = 0;
for( auto _ : state) {
float p = tr1::sin(p);
i += 0.1f;
benchmark::DoNotOptimize(p);
}
}
BENCHMARK(nxsdk_sin);We get the following results (lower time is better).
Benchmark output (-O1)
-----------(lower is better)--------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
nxsdk_sin 8.68 ns 8.58 ns 74666667 <<< https://git.musl-libc.org/cgit/musl/tree/src/math/sin.c
nxsdk_sin_vec 8.35 ns 8.30 ns 75825003 <<< sin(vec4f)
dagor_sin 5.48 ns 5.47 ns 102504001
glibc_sin 8.28 ns 8.08 ns 77623654 <<< https://github.com/lattera/glibc/blob/master/sysdeps/ieee754/dbl-64/s_sin.cSo far so good — the formulas for computing sin in musl are roughly the same as in glibc, and the timings come out roughly equal. But once we turn on -O3 + -math:precise, Nintendo's sine becomes almost a third faster. Either the compiler is too clever, or the benchmark is lying and the moon is in the wrong phase, or something else is going on. Actually, when my colleague first showed me these numbers, that's exactly what I told him — your bench is broken, go re-check it — and forgot about the whole thing. The next day we looked at the test together, because there really is nowhere for that performance to come from (or so I thought at first). As expected, you can see a small reduction in glibc_sin's time thanks to clang's aggressive optimizations.
Benchmark output (-O3 + -math:precise)
-----------(lower is better)--------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
nxsdk_sin 6.03 ns 6.00 ns 87625024
nxsdk_sin_vec 5.82 ns 5.78 ns 90022043
dagor_sin 5.38 ns 5.27 ns 104602002
glibc_sin 8.08 ns 8.05 ns 78214356An even more "different" result came when the test was run with -O3 + -math:fast. This one genuinely surprised me, because at the time nothing was faster than dagor_sin at comparable precision, and all the sines passed the correctness tests equally well — the divergence from std::sin never exceeded 1e-6.
Benchmark output (-O3 + -math:fast)
-----------(lower is better)--------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
nxsdk_sin 4.03 ns 3.99 ns 132307692
nxsdk_sin_vec 4.09 ns 4.08 ns 131403251
dagor_sin 5.13 ns 5.10 ns 110400021
glibc_sin 7.43 ns 7.16 ns 79544722And this is where I had to dig into the disassembly to understand what those geniuses at Nintendo cooked up to hit such impressive numbers in the benchmark.
The -O1 (speed) mode
Let's start with -O1 (speed). I'll post only the important bits, so it's clear what's happening, and I'll try not to dump the whole disassembly — who knows what IP might be in there. The sin function shipped with the NX SDK is an alias for __sin_nx_502 (2018, current SDK 4.0 and 5.0); apparently the function name is generated from those numbers.
Disassembly of __sin_nx_502
__sin_nx_502(double):
sub sp, sp, #32
stp x29, x30, [sp, #16]
add x29, sp, #16
fmov x8, d0
mov w9, #8699
ubfx x8, x8, #32, #31
movk w9, #16361, lsl #16
cmp w8, w9
b.hi .ERIT1_3 // .ERITX_X -> .BBX_X arm
lsr w9, w8, #20
cmp w9, #996
b.hi .ERIT1_5
mov x9, #4066750463515557888
mov x10, #5147614374084476928
fmov d1, x9
fmov d2, x10
fmul d1, d0, d1
fadd d2, d0, d2
cmp w8, #256, lsl #12
fcsel d1, d1, d2, lo
str d1, [sp]
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_3:
lsr w8, w8, #20
cmp w8, #2047
b.lo .ERIT1_6
fsub d0, d0, d0
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_5:
adrp x8, .ERIT1_0
... uninteresting code
add sp, sp, #32
ret
.ERIT1_6:
mov x0, sp
bl __rem_pio2_nx_502(double, double*) // call to __rem_pio2
and w8, w0, #0x3
... uninteresting code
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_10:
ldp d1, d0, [sp]
...
fadd d2, d2, d3
fmov d5, #0.50000000
ldr d3, [x8, :lo12:.ERIT1_5]
...
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_11:
ldp d0, d1, [sp]
bl __cos_nx_502(double, double) // call to __cos
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_12:
ldp d0, d1, [sp]
bl __cos_nx_502(double, double) // call to __cos
fneg d0, d0
ldp x29, x30, [sp, #16]
add sp, sp, #32
retLet's quickly run through the structure of the code — it almost entirely mirrors the standard musl function; you can see the __rem_pio2/__cos call signatures.
Show code (musl sin)
double sin(double x) {
double y[2];
uint32_t ix;
unsigned n;
/* High word of x. */
GET_HIGH_WORD(ix, x);
ix &= 0x7fffffff;
/* |x| ~< pi/4 */
if (ix <= 0x3fe921fb) {
if (ix < 0x3e500000) { /* |x| < 2**-26 */
/* raise inexact if x != 0 and underflow if subnormal*/
FORCE_EVAL(ix < 0x00100000 ? x/0x1p120f : x+0x1p120f);
return x;
}
return __sin(x, 0.0, 0);
}
/* sin(Inf or NaN) is NaN */
if (ix >= 0x7ff00000)
return x - x;
/* argument reduction needed */
n = __rem_pio2(x, y);
switch (n&3) {
case 0: return __sin(y[0], y[1], 1);
case 1: return __cos(y[0], y[1]);
case 2: return -__sin(y[0], y[1], 1);
default:
return -__cos(y[0], y[1]);
}
}The -O3 + -math:precise mode
But no matter how clever the compiler is, it won't give us more than 10–15% here, whatever modes you enable. We dig deeper, turn on -O3 + -math:precise and look at what it generated this time. Now we got an alias to a different function called __sin_dorf_nx_502 — the code is very linear, few branches, just additions and multiplications.
Disassembly of __sin_dorf_nx_502
__sin_dorf_nx_502(float):
push {r4, r5, r6, r7, r8, r9, r10, r11, lr}
add r11, sp, #28
sub sp, sp, #4
ldr r1, .EVRT0_0
mov r4, r0
bl __aeabi_fmul
and r1, r4, #-2147483648
orr r1, r1, #1056964608
bl __aeabi_fadd
bl __aeabi_f2uiz
mov r9, r0
bl __aeabi_ui2f
ldr r8, .EVRT0_1
mov r5, r0
mov r1, r5
ldr r0, [r8, #24]
bl __aeabi_fmul
mov r1, r0
mov r0, r4
bl __aeabi_fsub
mov r4, r0
ldr r0, [r8, #28]
mov r1, r5
bl __aeabi_fmul
mov r1, r0
mov r0, r4
bl __aeabi_fsub
mov r1, r0
mov r7, r0
bl __aeabi_fmul
mov r5, r0
mov r0, r7
mov r1, r5
bl __aeabi_fmul
mov r6, r0
ldr r0, [r8, #8]
ldm r8, {r4, r10}
mov r1, r5
str r0, [sp]
ldr r0, [r8, #12]
bl __aeabi_fmul
ldr r1, [r8, #16]
bl __aeabi_fadd
mov r1, r0
mov r0, r5
bl __aeabi_fmul
ldr r1, [r8, #20]
bl __aeabi_fadd
mov r1, r0
mov r0, r6
bl __aeabi_fmul
mov r1, r0
mov r0, r7
bl __aeabi_fadd
mov r7, r0
mov r0, r4
mov r1, r5
bl __aeabi_fmul
mov r1, r0
mov r0, r10
bl __aeabi_fadd
mov r1, r0
mov r0, r5
bl __aeabi_fmul
mov r1, r0
ldr r0, [sp]
bl __aeabi_fadd
mov r1, r0
mov r0, r5
bl __aeabi_fmul
mov r1, #1065353216
bl __aeabi_fadd
tst r9, #1
moveq r0, r7
tst r9, #2
eorne r0, r0, #-2147483648
sub sp, r11, #28
pop {r4, r5, r6, r7, r8, r9, r10, r11, lr}
bx lr
.EVRT0_0:
.long 1059256694 @ 0x3f22f976
.EVRT0_1:
.long .KI_MergedGlobals
...
.KI_MergedGlobals: // here is the block of polynomial coefficients, somewhere in the data section
.long 3132246419 @ float -0.001360224
.long 1026203702 @ float 0.04165669
.long 3204448223 @ float -0.4999990
.long 3108801646 @ float -1.950727E-4
.long 1007190850 @ float 0.008332075
.long 3190467233 @ float -0.1666665
.long 1070141402 @ float 1.570796 /// this is Pi/2
.long 866263401 @ float 7.549790E-8This looks a lot like approximating sine with some polynomial. If we try to reconstruct the rough logic in pseudocode, it would look like the code below. And that already matches what the benchmark shows.
Pseudocode for __sin_dorf_nx_502
float __sin_dorf_nx_502(float x) {
const float EVRT0_0 = 0.636618972f;
const float EVRT0_1 = 1.0f;
const float KI_MergedGlobals[8] = {
-0.001360224f,
0.04165669f,
-0.4999990f,
-1.950727E-4f,
0.008332075f,
-0.1666665f,
1.570796f, // PI / 2
7.549790E-8f
};
float result;
float temp = EVRT0_0 * x;
temp = temp + 1056964608.0f;
int intTemp = (int)temp;
float floatTemp = (float)intTemp;
float r5 = floatTemp;
floatTemp = r5 * KI_MergedGlobals[0];
float r4 = x - floatTemp;
floatTemp = r5 * KI_MergedGlobals[1];
r4 = r4 - floatTemp;
float r7 = r4 * r4;
floatTemp = r7 * KI_MergedGlobals[2];
floatTemp = floatTemp + KI_MergedGlobals[3];
floatTemp = floatTemp * r7;
floatTemp = floatTemp + KI_MergedGlobals[4];
floatTemp = floatTemp * r7;
floatTemp = floatTemp + KI_MergedGlobals[5];
floatTemp = floatTemp * r7;
floatTemp = floatTemp + EVRT0_1;
if (intTemp & 1) {
floatTemp = floatTemp * r4;
} else {
floatTemp = -floatTemp * r4;
}
result = floatTemp;
return result;
}The -O3 + -math:fast mode
I think you can already guess what happens when we build the sample with -O3 + -math:fast? Right — an alias to yet another function. The final variant, the one used in the release build, the one games were shipped to certification with.
Disassembly of __sin_corf_nx_502
__sin_corf_nx_502(float):
push {r4, r5, r6, r7, r8, r9, r10, r11, lr}
add r11, sp, #28
sub sp, sp, #4
bl __aeabi_f2d
ldr r2, .DUIE0_0
ldr r3, .DUIE0_1
mov r5, r0
mov r6, r1
bl __aeabi_dmul
bl __aeabi_d2iz
mov r10, r0
bl __aeabi_i2d
ldr r3, .DUIE0_2
mov r2, #1610612736
bl __aeabi_dmul
mov r2, r0
mov r3, r1
mov r0, r5
mov r1, r6
bl __aeabi_dadd
bl __aeabi_d2f
ldr r1, .DUIE0_3
mov r7, r0
and r0, r10, #255
ldr r8, [r1]
ldr r0, [r8, r0, lsl #2]
bl __aeabi_f2d
mov r3, #266338304
mov r2, #0
str r0, [sp]
mov r9, r1
orr r3, r3, #-1342177280
bl __aeabi_dmul
mov r5, r0
mov r0, r7
mov r6, r1
bl __aeabi_f2d
mov r7, r0
mov r4, r1
mov r0, r5
mov r1, r6
mov r2, r7
mov r3, r4
bl __aeabi_dmul
mov r5, r0
add r0, r10, #64
mov r6, r1
and r0, r0, #255
ldr r0, [r8, r0, lsl #2]
bl __aeabi_f2d
mov r2, r5
mov r3, r6
bl __aeabi_dadd
mov r2, r7
mov r3, r4
bl __aeabi_dmul
ldr r2, [sp]
mov r3, r9
bl __aeabi_dadd
bl __aeabi_d2f
sub sp, r11, #28
pop {r4, r5, r6, r7, r8, r9, r10, r11, lr}
bx lr
.DUIE0_0:
.long 1682373044 @ 0x6446f9b4
.DUIE0_1:
.long 1078222640 @ 0x40445f30
.DUIE0_2:
.long 3214483963 @ 0xbf9921fb
.DUIE0_3:
.long __sin_corf_duie_nx_502The pseudocode for this assembly section is below, and judging by the logic this is sine computed via a lookup table. The fastest of them all, with precision that doesn't suffer much.
Pseudocode for __sin_corf_nx_502
float __sin_corf_nx_502(float x) {
const double DUIE0_0 = 1.4636916370976118;
const double DUIE0_1 = 3.141592653589793; // PI
const double DUIE0_2 = -0.0009424784718423055;
const float* DUIE0_3 = __sin_corf_duie_nx_502;
double temp = x * DUIE0_0;
temp = temp + DUIE0_1;
int intTemp = static_cast<int>(temp);
float floatTemp = static_cast<float>(intTemp);
float r5 = floatTemp;
floatTemp = r5 * DUIE0_2;
float r4 = x - floatTemp;
double r7 = r4 * r4;
floatTemp = static_cast<float>(r7);
int index = intTemp & 255;
float floatTemp2 = *(DUIE0_3 + index);
double r2 = floatTemp2;
double r3 = static_cast<double>(intTemp >> 8);
double r0 = static_cast<double>(r5);
double r1 = static_cast<double>(r4);
r0 = r0 + r1;
r0 = r0 * r2;
r0 = r0 + r3;
float result = static_cast<float>(r0);
return result;
}If you dig around the internet on the topic of fast sine, you can find, for example, another approximation — Bhaskara I's sine approximation formula — whose error stays within 0.001, which is often good enough for most tasks.
Or take a look at the paper by Lars Nyland (nVidia), one of the authors of CUDA.
For sines, the Dagor engine uses its own implementation that behaves equally well and fast on all platforms; you can find it here. Don't take it as advertising — I left the company long ago, maybe it'll be useful to someone. Warmest regards to my former colleagues.
Conclusions
We got no practical benefit out of this, but it was interesting to dig into the quirks of the SDK. All honor and praise to the Japanese engineers who gave the world this wonderful console. I only have one question: why three?
UPD: a fellow wrote to me in DMs that this is how Nintendo tracked the use of old and pirated SDKs — without an active developer key the first implementation was always called, so a build trying to pass certification helped detect stolen accounts and keys. I can't verify this, so I'll just have to take their word for it :)
← All articles