
Revisiting The Fast Inverse Square Root - Is It Still Useful?

April 20, 2023

This article has some discussion on Hacker News.

In 2005, id Software released the source code for their 1999 game Quake III Arena under the GPL-2 license. In the file code/game/q_math.c, there is a function for calculating the reciprocal square root of a number which at first glance seems to use a very peculiar algorithm:

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}

Many articles have been written about this particular algorithm and it has its own well-written Wikipedia page where it is referred to as the fast inverse square root. The algorithm actually appeared on various forums before the Q3 source code was released. Ryszard of Beyond3D did some investigating in 2004-2005 and eventually tracked down the original author of the algorithm: Greg Walsh at Ardent Computer, who created it more than a decade earlier.


How does it work?

So how does the method work, anyway? It is performed in two steps:

  1. obtain a rough approximation y for the reciprocal square root of our number:

    y  = number;
    i  = * ( long * ) &y;
    i  = 0x5f3759df - ( i >> 1 );
    y  = * ( float * ) &i;
  2. improve the approximation using a single step of the Newton-Raphson (NR) method:

    const float threehalfs = 1.5F;
    x2 = number * 0.5F;
    y  = y * ( threehalfs - ( x2 * y * y ) );

First approximation

The most interesting part is the first one. It uses a seemingly magic number 0x5f3759df and some bit shifting and somehow ends up with the reciprocal square root. The second line stores the 32-bit floating-point number y as a 32-bit integer i by taking a pointer to y, converting it to a long pointer and dereferencing it. So y and i hold two identical 32-bit vectors, but one is interpreted as a floating-point number and the other is interpreted as an integer. Then, the integer is shifted one step to the right, negated, and the constant 0x5f3759df is added. Finally, the resulting value is interpreted as a floating-point number again by dereferencing a float pointer that points to the integer value i.

Here, shifting, negation and addition are performed in the integer domain. How do these operations affect the number in the floating-point domain? In order to understand how this can yield an approximation of the reciprocal square root, we must be familiar with how floating-point numbers are represented in memory. A floating-point number consists of a sign $s \in \{0,1\}$, an exponent $e \in \mathbb{Z}$ and a fractional part $0 \leq f < 1$. The value of the floating-point number is then1

$y = (-1)^s \cdot (1 + f) \cdot 2^e.$

In our case, we can assume that our float is in the IEEE 754 binary32 format2, in which the bits are ordered as shown below.

The most significant bit is the sign bit $S$, followed by 8 bits ($E$) representing the exponent $e$ and the remaining 23 bits ($F$) representing the fractional part $f$. The number is negative when $S=1$. The 8-bit number $E$ is not used directly as the exponent; it has an offset or bias of $2^7-1 = 127$. So $E=0$ means that the exponent is $e=-127$. $F$ is simply a fractional binary number with the binary point before the first digit, such that $f=F\cdot2^{-23}$.

We can write a simple C program interp.c to print both the integer and floating-point interpretations of a given number and also extract the different parts:

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(int argc, char *args[]) {
    /* parse number from args */
    uint32_t i;
    int ret;
    if (argc == 2) {
        ret = sscanf(args[1], "%u", &i);
    } else if (argc == 3 && strcmp(args[1], "-h") == 0) {
        ret = sscanf(args[2], "%x", &i);
    } else if (argc == 3 && strcmp(args[1], "-f") == 0) {
        float y;
        ret = sscanf(args[2], "%f", &y);
        i = *(uint32_t*)&y;
    } else {
        return EXIT_FAILURE;
    }
    if (ret != 1) return EXIT_FAILURE;

    /* print representations */
    printf("hexadecimal: %x\n", i);
    printf("unsigned int: %u\n", i);
    printf("signed int: %d\n", i);
    printf("floating-point: %f\n", *(float*)&i);

    /* print components */
    int S = i >> 31;
    int E = (i >> 23) & ((1 << 8)-1);
    int e = E - 127;
    int F = i & ((1 << 23)-1);
    float f = (float)F / (1 << 23);
    printf("S: %d\n", S);
    printf("E: %d (0x%x) <=> e: %d\n", E, E, e);
    printf("F: %d (0x%x) <=> f: %f\n", F, F, f);

    return EXIT_SUCCESS;
}

We can for example look at the number 0x40b00000:

$ ./interp -h 40b00000
hexadecimal: 40b00000
unsigned int: 1085276160
signed int: 1085276160
floating-point: 5.500000
S: 0
E: 129 (0x81) <=> e: 2
F: 3145728 (0x300000) <=> f: 0.375000
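
Plugging these parts back into the formula above indeed recovers the value:

$(-1)^0 \cdot (1 + 0.375) \cdot 2^2 = 1.375 \cdot 4 = 5.5$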

We can also extract the parts of a floating-point number:

$ ./interp -f -32.1
hexadecimal: c2006666
unsigned int: 3254806118
signed int: -1040161178
floating-point: -32.099998
S: 1
E: 132 (0x84) <=> e: 5
F: 26214 (0x6666) <=> f: 0.003125

Even now that we know how floating-point numbers are represented in memory, it is not entirely obvious how performing operations in the integer domain affects the floating-point domain. At first, we can simply iterate over a range of floating-point numbers and see what integer values we get:

#include <stdio.h>

int main() {
    float x;
    for (x = 0.1; x <= 8.0; x += 0.1) {
        printf("%f\t%d\n", x, *(int*)&x);
    }
}

We can then plot the floating-point values on the x-axis and the integer values on the y-axis with e.g. gnuplot to get a plot like this:

Well, this curve looks quite familiar. We can look further at some of the data points using our previous program:

$ ./interp -f 1.0
hexadecimal: 3f800000
unsigned int: 1065353216
signed int: 1065353216
floating-point: 1.000000
S: 0
E: 127 (0x7f) <=> e: 0
F: 0 (0x0) <=> f: 0.000000

$ ./interp -f 2.0
hexadecimal: 40000000
unsigned int: 1073741824
signed int: 1073741824
floating-point: 2.000000
S: 0
E: 128 (0x80) <=> e: 1
F: 0 (0x0) <=> f: 0.000000

$ ./interp -f 3.0
hexadecimal: 40400000
unsigned int: 1077936128
signed int: 1077936128
floating-point: 3.000000
S: 0
E: 128 (0x80) <=> e: 1
F: 4194304 (0x400000) <=> f: 0.500000

For 1.0 and 2.0 we get $S=0$, $F=0$ and a non-zero biased exponent $E$. If we remove the bias from this number (subtract 127 << 23) and then shift it to the far right, we end up with the exponent $e$, in other words the base-2 logarithm of the floating-point number. However, this only works when $S=0$ and $F=0$, i.e. for positive powers of two. If $S=1$ we have a negative number for which the logarithm is undefined. But if $F\ne 0$ and we shift the exponent to the far right, we simply lose all of that data. We can instead convert it to a floating-point value and divide by $2^{23}$, such that the fractional part scales our resulting value linearly:

(float) (*(int*)&x - (127 << 23)) / (1 << 23)

Then we don’t get exactly the logarithm, but we do get a piecewise-linear approximation that is exact at powers of two and interpolates linearly in between. We can plot the approximation together with the actual logarithmic function:

This means that when we take a floating-point number and interpret it as an integer number, we obtain an approximation of the logarithm of that number, with some offset and scaling. And when we interpret an integer number as a floating-point number, we get the opposite, i.e. the exponential or antilogarithm of our integer value. This basically means that when we perform operations in the integer domain, it is as if we perform operations in the logarithmic domain. For example, if we remember our logarithmic identities, we know that if we take the logarithm of two numbers and add them together, we get the logarithm of their product:
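
To make this a bit more precise: for a positive number ($S=0$), the integer interpretation of the bit pattern is

$I = E \cdot 2^{23} + F = 2^{23}(e + f + 127) \approx 2^{23}\log_2{y} + B, \qquad B = 127 \cdot 2^{23},$

since $\log_2{y} = e + \log_2{(1+f)} \approx e + f$; the approximation is exact at powers of two and off by at most about 0.086 in between.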

$\log{a} + \log{b} = \log{(a \cdot b)}.$

In other words, if we perform addition in the integer domain we get multiplication in the floating-point domain — approximately anyway. We can try this with another simple C program. One thing we need to consider is how our operation affects the exponent bias. When we add two numbers with biased exponents we get double bias:

$E_1 + E_2 = (e_1 + B) + (e_2 + B) = e_1 + e_2 + 2B.$

We want our bias to remain $B$ rather than $2B$, so to counter this we simply subtract $B$ from the result. Our C program that performs floating-point multiplication using integer addition may then look like this:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

const uint32_t B = (127 << 23);

int main(int argc, char *args[]) {
    /* parse factors from args */
    float a, b;
    if (argc == 3) {
        int ret = sscanf(args[1], "%f", &a);
        ret += sscanf(args[2], "%f", &b);
        if (ret != 2) return EXIT_FAILURE;
    } else {
        return EXIT_FAILURE;
    }

    /* perform multiplication (integer addition) */
    uint32_t sum = *(uint32_t*)&a + *(uint32_t*)&b - B;
    float y = *(float*)&sum;

    /* compare with actual */
    float y_actual = a*b;
    float rel_err = (y - y_actual) / y_actual;

    printf("%f =? %f (%.2f%%)\n", y, y_actual, 100*rel_err);
}

Let’s try it out:

$ ./mul 3.14159 8.0
25.132721 =? 25.132721 (0.00%)

$ ./mul 3.14159 0.2389047
0.741016 =? 0.750541 (-1.27%)

$ ./mul -15.0 3.0
-44.000000 =? -45.000000 (-2.22%)

$ ./mul 6.0 3.0
16.000000 =? 18.000000 (-11.11%)

$ ./mul 0.0 10.0
0.000000 =? 0.000000 (inf%)

Most of the time it is not perfectly accurate: the product is exact only if one of the factors is a power of two, and least accurate when both factors lie right between two powers of two.
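
To see where the large error for 6.0 · 3.0 comes from, note that $6 = 1.5 \cdot 2^2$ and $3 = 1.5 \cdot 2^1$, so both have $f = 0.5$. Adding the integer representations (and subtracting $B$) adds the fractions, and the sum $0.5 + 0.5 = 1.0$ carries over into the exponent field:

$(1 + 0) \cdot 2^{2+1+1} = 16 \quad \textrm{instead of} \quad 1.5 \cdot 1.5 \cdot 2^{2+1} = 18.$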

How about the reciprocal square root? The reciprocal square root $\frac{1}{\sqrt{x}}$ is equivalent to $x^{-1/2}$, so we will need another logarithmic identity:

$p\log{x} = \log{x^p}$

This means that if we perform multiplication in the integer domain, we get exponentiation in the floating-point domain. Depending on our exponent $p$ we can obtain several different functions, e.g.:

p     f(x)
2     $x^2$
1/2   $\sqrt{x}$
-1    $\frac{1}{x}$
-1/2  $\frac{1}{\sqrt{x}}$

In order to get a first approximation of the reciprocal square root, we simply need to multiply by -1/2 in the integer domain and adjust for the bias. The bias will then be $-B/2$ and we want it to be $B$, so we simply need to add $3B/2 = \texttt{0x5f400000}$. So, we will multiply by -1/2 by shifting right one step and negating, and then add the bias:

- (i >> 1) + 0x5f400000;
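
With the approximate relation $I \approx 2^{23}\log_2{y} + B$ from earlier, the constant falls out directly. For $z = y^{-1/2}$ we have $\log_2{z} = -\tfrac{1}{2}\log_2{y}$, so

$I_z \approx 2^{23} \cdot \left(-\tfrac{1}{2}\log_2{y}\right) + B = -\tfrac{1}{2}(I_y - B) + B = \tfrac{3}{2}B - \tfrac{1}{2}I_y,$

where $\tfrac{3}{2}B = \tfrac{3}{2} \cdot 127 \cdot 2^{23} = \texttt{0x5f400000}$ and the halving of $I_y$ is done with the right shift.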

Our expression is now identical to the one in the Q3 source code except that the constant differs slightly: they used 0x5f3759df while we currently have 0x5f400000. We can see if it is possible to make improvements by looking at our error. We simply subtract the exact value of the reciprocal square root from our approximate value and plot the result for a certain range of numbers:

The graph repeats horizontally in both directions (only at a different scale), so we only need to look at this part to understand the error for all (normal) floating-point numbers. We can see that the approximate value always overestimates; by subtracting a constant that is around half the maximum error, we can make the error symmetric and thus decrease the average absolute error. Looking at the graph, subtracting something like 0x7a120 might work. Our constant would then be 0x5f385ee0, which is closer to the constant used in Q3. In the integer domain, this subtraction simply centers the error around the x-axis in the above diagram. In the floating-point domain, the error is affected similarly, except when the subtraction borrows from the exponent:

We could potentially try to find an actual optimum for some reasonable objective function, but we will stop here. In the case of the original Q3 constant, it is not really clear how it was chosen, perhaps by trial and error.
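
If we did want to search for a better constant, a rough sketch of such an experiment could look like the program below. The candidate constants and the use of the maximum relative error as the objective are my own choices here, not necessarily how the Q3 constant was derived.

#include <stdio.h>
#include <stdint.h>
#include <math.h>

/* first approximation with a configurable magic constant */
static float rsqrt_appr(float x, uint32_t magic) {
    uint32_t i = *(uint32_t*)&x;
    i = magic - (i >> 1);
    return *(float*)&i;
}

int main(void) {
    /* our naive constant, the adjusted one and the Q3 one */
    const uint32_t magics[] = { 0x5f400000, 0x5f385ee0, 0x5f3759df };

    for (int m = 0; m < 3; m++) {
        double max_err = 0.0;
        /* one period of the error curve is enough, e.g. x in [1, 4) */
        for (float x = 1.0f; x < 4.0f; x += 1e-5f) {
            double exact = 1.0 / sqrt((double)x);
            double rel = fabs(rsqrt_appr(x, magics[m]) - exact) / exact;
            if (rel > max_err) max_err = rel;
        }
        printf("0x%08x: max relative error %.4f%%\n", magics[m], 100 * max_err);
    }
    return 0;
}

As before, it needs to be linked with -lm.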

Improving the approximation

The second part is more conventional. Once a first approximation has been obtained, it can be improved using a method known as Newton-Raphson (NR). If you are unfamiliar with it, Wikipedia has a good article on it. The NR method is used to improve an approximation of a root of an equation. Since we want the reciprocal square root, we need a function $f(y)$ that is zero when $y$ is exactly the reciprocal square root of $x$:

$y = \frac{1}{\sqrt{x}} \ \Leftrightarrow \ \frac{1}{y^2} = x \ \Rightarrow \ f(y) = \frac{1}{y^2} - x = 0$

If we have an approximate value $y_n$, we can get a better approximation $y_{n+1}$ by calculating where the tangent of the function’s graph at $y=y_n$ (i.e. the derivative) intersects $f(y)=0$. That value can be expressed as

$y_{n+1} = y_n - \frac{f(y_n)}{f'(y_n)} = y_n - \frac{{y_n}^{-2} - x}{-2\,{y_n}^{-3}} = y_n \left( \frac{3}{2} - \frac{x}{2} \cdot {y_n}^2 \right)$

which is the exact same expression that is used in the second part of the Q3 function.

How fast is it?

Back in 2003, Chris Lomont wrote an article about his investigations of the algorithm. His testing showed that the algorithm was four times faster than the more straightforward way of simply calling sqrt(x)3 from the standard library and taking the reciprocal.

In 2009, Elan Ruskin made a post, Timing Square Root, where he primarily looked at the square root function but also compared the fast inverse square root algorithm to other methods. On his Intel Core 2, the fast inverse square root was 4 times slower than using rsqrtss, or 30% slower than rsqrtss with a single NR step.

Since then, several new extensions to the x86 instruction set have appeared. I have tried to sum up all the square root instructions currently available:

Set              $\sqrt{x}$                      $\frac{1}{\sqrt{x}}$                            Width
x87 (1980)       fsqrt                           -                                               32
3DNow! (1998)    -                               pfrsqrt                                         128
SSE (1999)       sqrtps, sqrtss                  rsqrtps, rsqrtss                                128
SSE2 (2000)      sqrtpd, sqrtsd                  -                                               128
AVX (2011)       vsqrtps, vsqrtpd, vsqrtps_nr    vrsqrtps, vrsqrtps_nr                           256
AVX-512 (2014)   -                               vrsqrt14pd, vrsqrt14ps, vrsqrt14sd, vrsqrt14ss  512

The fsqrt is quite obsolete by now. The 3DNow! extension has also been deprecated and is no longer supported. All x86-64 processors support at least SSE and SSE2. Most processors support AVX and some support AVX-512, but e.g. GCC currently chooses to not emit any AVX instructions by default.

The p and s are short for “packed” and “scalar”. The packed instructions are vector SIMD instructions while the scalar ones only operate on a single value at a time. With a register width of e.g. 256 bits, a packed instruction can perform $256/32=8$ calculations in parallel. The s or d is short for “single” or “double” precision floating-point. Since we are considering approximations we will be using single-precision floating-point numbers. We may then use either the ps or ss variants.

The fast inverse square root method already had a pretty hard time against the rsqrtss instruction back in 2009. Since then, multiple extensions with specialized SIMD instructions have been implemented in modern x86 processors. Surely, the fast inverse square root has no chance today and its time has passed?

Why don’t we give it a try ourselves right now? We can start by running some tests on my current machine, which has a relatively modern processor: an AMD Zen 3 5950X from late 2020.

Initial testing

We will write a C program that tries to calculate the reciprocal square root using three different methods:

  • exact: simply 1.0 / sqrtf(x), using the sqrtf function from the C standard library,
  • appr: first approximation from the Q3 source as explained above,
  • appr_nr: the full Q3 method with one iteration of Newton-Raphson.

For each method we perform the calculation for each value in a randomized input array and time how long it takes in total. We can use the clock_gettime function from libc (for POSIX systems) to get the time before and after we perform the calculations and calculate the difference. We will then repeat this many times to decrease the random variations. The C program looks like this:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <math.h>

#define N 4096
#define T 1000
#define E9 1000000000

#ifndef CLOCK_REALTIME
#define CLOCK_REALTIME 0
#endif

enum methods { EXACT, APPR, APPR_NR, M };
const char *METHODS[] = { "exact", "appr", "appr_nr" };

static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
static inline float rsqrt_appr(float x) {
    uint32_t i = *(uint32_t*)&x;
    i = -(i >> 1) + 0x5f3759df;
    return *(float*)&i;
}
static inline float rsqrt_nr(float x, float y) { return y * (1.5f - x*0.5f*y*y); }

static inline float rsqrt_appr_nr(float x) {
    float y = rsqrt_appr(x);
    return rsqrt_nr(x, y);
}

int main() {
    srand(time(NULL));

    float y_sum[M] = {0};
    double t[M] = {0};

    for (int trial = 0; trial < T; trial++) {
        struct timespec start, stop;
        float x[N], y[N];
        for (int i = 0; i < N; i++) { x[i] = rand(); }

        clock_gettime(CLOCK_REALTIME, &start);
        for (int i = 0; i < N; i++) { y[i] = rsqrt_exact(x[i]); }
        clock_gettime(CLOCK_REALTIME, &stop);
        for (int i = 0; i < N; i++) { y_sum[EXACT] += y[i]; }
        t[EXACT] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);

        clock_gettime(CLOCK_REALTIME, &start);
        for (int i = 0; i < N; i++) { y[i] = rsqrt_appr(x[i]); }
        clock_gettime(CLOCK_REALTIME, &stop);
        for (int i = 0; i < N; i++) { y_sum[APPR] += y[i]; }
        t[APPR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);

        clock_gettime(CLOCK_REALTIME, &start);
        for (int i = 0; i < N; i++) { y[i] = rsqrt_appr_nr(x[i]); }
        clock_gettime(CLOCK_REALTIME, &stop);
        for (int i = 0; i < N; i++) { y_sum[APPR_NR] += y[i]; }
        t[APPR_NR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);
    }

    printf("rsqrt\tps/op\tratio\terr\n");
    for (int m = 0; m < M; m++) {
        printf("%s\t%.0f\t%.2f\t%.4f\n",
               METHODS[m],
               t[m] * 1000.0f / N / T,
               (double) t[EXACT] / t[m],
               (y_sum[m] - y_sum[EXACT]) / y_sum[EXACT]);
    }

    return 0;
}

At the end of the program we print three things for each method:

  • the average time to calculate a single operation in picoseconds – the lower the better,
  • the ratio of the calculation time compared to the exact method – the higher the faster,
  • the average error between the method and the exact method – just to make sure the calculations are performed correctly.

So, what do we expect? There are dedicated instructions for calculating the reciprocal square root in the x86 instruction set that the compiler should be able to emit. The throughput may then be higher than in the approximate method, where we perform multiple operations.

Let’s go ahead and try it, we’ll compile it using GCC without any optimizations at first, explicitly with -O0. Since we are using math.h for the exact method we will also need to link the math library using -lm:

$ gcc -lm -O0 rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   3330    1.00    0.0000
appr    2020    1.65    0.0193
appr_nr 6115    0.54    -0.0010

This seems reasonable. The error is noticeable for the first approximation but reduced after one iteration of NR. The first approximation is actually faster than the exact method but when done together with a step of NR it is twice as slow. The NR method requires more operations so this seems reasonable.

Alright, but this is only a debug build, let’s try adding optimizations using the -O3 flag. This will enable all optimizations that do not disregard any standards compliance.

$ gcc -lm -O3 rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   1879    1.00    0.0000
appr    72      26.01   0.0193
appr_nr 178     10.54   -0.0010

Hmm, now the approximations are a lot faster than before, but the time of the exact method has only halved, making the approximation with NR more than ten times faster than the exact method. Perhaps the compiler failed to emit the reciprocal square root instructions? Maybe it will improve if we use the -Ofast flag instead, which is described by the gcc(1) man page:

Disregard strict standards compliance. -Ofast enables all -O3 optimizations. It also enables optimizations that are not valid for all standard-compliant programs. It turns on -ffast-math, -fallow-store-data-races and the Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified, and -fno-protect-parens. It turns off -fsemantic-interposition.

Our exact method may no longer be as accurate as before, but it may be faster.

$ gcc -lm -Ofast rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   153     1.00    0.0000
appr    118     1.30    0.0137
appr_nr 179     0.85    -0.0009

And it is indeed faster. The first approximation is still faster, but with a step of NR it is slower than the exact method. The error has decreased slightly for the approximations because we are still comparing against the “exact” method, which now yields different results. Oddly enough, the first approximation has become half as fast. This seems to be a quirk of GCC; Clang does not have this issue, but otherwise it produces similar results:

$ clang -lm -O0 rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   3715    1.00    0.0000
appr    1933    1.92    0.0193
appr_nr 6001    0.62    -0.0010

$ clang -lm -O3 rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   1900    1.00    0.0000
appr    61      31.26   0.0193
appr_nr 143     13.24   -0.0010

$ clang -lm -Ofast rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   148     1.00    0.0000
appr    62      2.40    0.0144
appr_nr 145     1.02    -0.0009

Disassembly

For both compilers, there is quite a large difference between -O3 and -Ofast. We can look at the disassembly to see what is going on. We will need to provide the -g flag to the compiler in order to get debug symbols in the binary that tell us which object code corresponds to which source code. Thereafter we can run objdump -d with the -S flag to see the disassembled instructions next to the source code:

$ gcc -lm -O3 -g rsqrt.c
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
    118e:       66 0f ef db             pxor   %xmm3,%xmm3
    1192:       0f 2e d8                ucomiss %xmm0,%xmm3
    1195:       0f 87 e1 02 00 00       ja     147c <main+0x3bc>
    119b:       f3 0f 51 c0             sqrtss %xmm0,%xmm0
    119f:       f3 0f 10 0d 99 0e 00    movss  0xe99(%rip),%xmm1    # 2040
    11a6:       00
...
    11ab:       f3 0f 5e c8             divss  %xmm0,%xmm1
...
    2040:       00 00 80 3f                                         # 1.0f

In case you are unfamiliar, this is the AT&T syntax for x86-64 assembly. Note that the source operand always comes before the destination operand. The parentheses indicate an address; for example movss 0xe99(%rip),%xmm1 copies the value located 0xe99 bytes ahead of the address in the rip register (instruction pointer, a.k.a. PC) to the xmm1 register. The xmmN registers are 128 bits wide, room for four single-precision floats. However, the ss instructions are for scalar single-precision values, so the operation is only applied to a single floating-point value in the least significant 32 bits.

In the -O3 case we use the scalar sqrtss followed by divss. There is also a compare ucomiss and a jump ja that will set errno to EDOM in case the input is less than -0. We are not using errno at all so we can remove the setting of errno by providing the -fno-math-errno flag:

$ gcc -lm -O3 -g -fno-math-errno rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   479     1.00    0.0000
appr    116     4.13    0.0193
appr_nr 175     2.74    -0.0010
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
    1170:       0f 51 0c 28             sqrtps (%rax,%rbp,1),%xmm1
    1174:       f3 0f 10 1d c4 0e 00    movss  0xec4(%rip),%xmm3    # 2040
    117b:       00
    117c:       48 83 c0 10             add    $0x10,%rax
    1180:       0f c6 db 00             shufps $0x0,%xmm3,%xmm3
    1184:       0f 28 c3                movaps %xmm3,%xmm0
    1187:       0f 5e c1                divps  %xmm1,%xmm0
...
    2040:       00 00 80 3f                                         # 1.0f

This prevents us from having to check every input value individually and thus allows the compiler to use the packed variants of the instructions, performing 4 operations at a time. This improved the performance a lot. However, we still use sqrtps followed by divps. We will also have to enable -funsafe-math-optimizations4 and -ffinite-math-only in order to make GCC emit rsqrtps instead. We then get identical code to when we used -Ofast:

$ gcc -lm -O3 -g -fno-math-errno -funsafe-math-optimizations -ffinite-math-only rsqrt.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   155     1.00    0.0000
appr    120     1.29    0.0137
appr_nr 182     0.85    -0.0009
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
    1170:       0f 52 0c 28             rsqrtps (%rax,%rbp,1),%xmm1
    1174:       0f 28 04 28             movaps (%rax,%rbp,1),%xmm0
    1178:       48 83 c0 10             add    $0x10,%rax
    117c:       0f 59 c1                mulps  %xmm1,%xmm0
    117f:       0f 59 c1                mulps  %xmm1,%xmm0
    1182:       0f 59 0d c7 0e 00 00    mulps  0xec7(%rip),%xmm1    # 2050
    1189:       0f 58 05 b0 0e 00 00    addps  0xeb0(%rip),%xmm0    # 2040
    1190:       0f 59 c1                mulps  %xmm1,%xmm0
...
    2040:       00 00 40 c0                                         # -3.0f
...
    2050:       00 00 00 bf                                         # -0.5f

Now it uses rsqrtps, but it also has several multiplication instructions as well as an addition. Why are these needed? Isn’t the reciprocal square root all we need? We can get a hint from looking at the disassembly of the appr_nr function:

static inline float rsqrt_nr(float x, float y) { return y * (1.5f - x*0.5f*y*y); }
    12f8:       f3 0f 10 1d 80 0d 00    movss  0xd80(%rip),%xmm3    # 2080
    12ff:       00
...
    1304:       0f 59 05 65 0d 00 00    mulps  0xd65(%rip),%xmm0    # 2070
...
    1310:       0f c6 db 00             shufps $0x0,%xmm3,%xmm3
...
    1318:       0f 28 d1                movaps %xmm1,%xmm2
    131b:       0f 59 d1                mulps  %xmm1,%xmm2
    131e:       0f 59 d0                mulps  %xmm0,%xmm2
    1321:       0f 28 c3                movaps %xmm3,%xmm0
    1324:       0f 5c c2                subps  %xmm2,%xmm0
    1327:       0f 59 c1                mulps  %xmm1,%xmm0
...
    2070:       00 00 00 3f                                         # 0.5f
...
    2080:       00 00 c0 3f                                         # 1.5f

The last part looks quite similar, because it is actually doing the same thing: an iteration of Newton-Raphson5. This is hinted at in the man page of gcc(1):

This option enables use of the reciprocal estimate and reciprocal square root estimate instructions with additional Newton-Raphson steps to increase precision instead of doing a divide or square root and divide for floating-point arguments.

The rsqrtps instruction only guarantees a relative error smaller than $1.5\cdot2^{-12}$; the NR iteration reduces it further, just like in the Q3 code.
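
To get a feel for how much one iteration helps: if the estimate has a relative error $\varepsilon$, i.e. $y_n = (1+\varepsilon)/\sqrt{x}$, then substituting it into the NR update gives

$y_{n+1} = \frac{1}{\sqrt{x}}\left(1 - \tfrac{3}{2}\varepsilon^2 - \tfrac{1}{2}\varepsilon^3\right),$

so an initial error of $1.5 \cdot 2^{-12} \approx 3.7 \cdot 10^{-4}$ shrinks to roughly $2 \cdot 10^{-7}$ after a single step.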

If we do not need this extra precision, can we get a speedup by skipping the NR step? We can use built-in compiler intrinsics in order to make the compiler only emit the rsqrtps instruction. The GCC manual has a list of built-in functions for the x86 instruction set. There is a __builtin_ia32_rsqrtps function that will emit the rsqrtps instruction:

v4sf __builtin_ia32_rsqrtps (v4sf);

The manual also has a chapter about how to use these vector instructions with built-in functions. We need to add a typedef for the v4sf type, which contains four floating-point numbers. We will then use an array of $N/4$ of these vectors and simply provide one vector at a time to the built-in function. N is a multiple of four so there are no half-full vectors. We can simply cast our previous float input array to a v4sf pointer. We add the following parts to our previous program (along with an INTR entry in the methods enum and a “rsqrtps” entry in METHODS):

typedef float v4sf __attribute__ ((vector_size(16)));
v4sf rsqrt_intr(v4sf x) { return __builtin_ia32_rsqrtps(x); };


        v4sf *xv = (v4sf*)x, *yv = (v4sf*)y;
        clock_gettime(CLOCK_REALTIME, &start);
        for (int i = 0; i < N/4; i++) { yv[i] = rsqrt_intr(xv[i]); }
        clock_gettime(CLOCK_REALTIME, &stop);
        for (int i = 0; i < N; i++) { y_sum[INTR] += y[i]; }
        t[INTR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);

We can compile it in order to run and disassemble it:

$ gcc -lm -O3 -g rsqrt_vec.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   1895    1.00    0.0000
appr    72      26.39   0.0193
appr_nr 175     10.81   -0.0010
rsqrtps 61      31.00   0.0000
$ objdump -d -S a.out
...
v4sf rsqrt_intr(v4sf x) { return __builtin_ia32_rsqrtps(x); };
    1238:       41 0f 52 04 04          rsqrtps (%r12,%rax,1),%xmm0
...

Now we are down to a single instruction and it is slightly faster than before.

There are also extensions that not all processors support that we can try to use. We can tell the compiler to use any extensions that are available on our processor using -march=native. This may make the binary incompatible with other processors, though.

$ gcc -lm -Ofast -g -march=native rsqrt_vec.c
$ ./a.out
rsqrt   ps/op   ratio   err
exact   78      1.00    0.0000
appr    40      1.96    0.0137
appr_nr 85      0.91    -0.0009
rsqrtps 62      1.25    0.0000

Now the exact method is down to nearly the speed of the first approximation. The intrinsic one is pretty much just as fast as before. The “exact” method got replaced by a 256-bit vrsqrtps and a step of NR:

static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
    11d0:       c5 fc 52 0c 18          vrsqrtps (%rax,%rbx,1),%ymm1
    11d5:       c5 f4 59 04 18          vmulps (%rax,%rbx,1),%ymm1,%ymm0
    11da:       48 83 c0 20             add    $0x20,%rax
    11de:       c4 e2 75 a8 05 79 0e    vfmadd213ps 0xe79(%rip),%ymm1,%ymm0
    11e5:       00 00
    11e7:       c5 f4 59 0d 91 0e 00    vmulps 0xe91(%rip),%ymm1,%ymm1
    11ee:       00
    11ef:       c5 fc 59 c1             vmulps %ymm1,%ymm0,%ymm0

The __builtin_ia32_rsqrtps version now uses a single vrsqrtps and no NR step; however, it still uses only 128-bit registers.

Broad sweep

So, we did some testing on my machine and got some insight into what kind of instructions we can use to calculate the reciprocal square root and how they might perform. We will now run these benchmarks on several machines to get an idea of how well our findings apply in general. Those machines include all the ones that I happen to have convenient SSH access to. All resulting data can be downloaded from here; it also includes results for the inverse and square root functions, separately.

Below is a list of the x86 machines that were tested along with their CPUs and their release date. All the previous tests were run on the computer labeled as “igelkott”.

Hostname    CPU Family   CPU Model             Year  Form factor
jackalope   Core         Intel Celeron 550     2007  i686 laptop
narwhal     Piledriver   AMD FX-6300           2012  x86_64 desktop
silverback  Ivy Bridge   Intel Xeon E5-1410    2014  x86_64 server
bovinae     Kaby Lake    Intel Core i5-8250U   2017  x86_64 laptop
igelkott    Zen 3        AMD Ryzen 5950X       2020  x86_64 desktop
deck        Zen 2        AMD APU 0405          2022  x86_64 mobile

Below is a plot of the performance ratio compared to the exact method, i.e. the time of each method divided by the time of the exact method. A higher ratio means higher performance, anything below 1 is slower than exact and anything above is faster. We use the -Ofast flag here, as it is the fastest option that can be used without sacrificing portability.

The results are quite similar across all of the machines; the methods are approximately ranked by time in the order rsqrtps <= appr < exact <= appr_nr. Using the appr_nr method is either slower than or on par with the exact method, so it has no real benefit in this case.

The “jackalope” machine was not included in the above plot because it had an extremely slow exact method, especially when not using -march=native, as the compiler then resorted to using the antique fsqrt instruction.

Below is a table of the actual timings when using -Ofast; numbers in parentheses use -march=native. Each number is how long a single operation takes in picoseconds.

Machine/Compiler exact appr appr_nr rsqrtps
jackalope-clang 53634 (5363) 1500 (2733) 4971 (3996) N/A
narwhal-gcc 419 (363) 443 (418) 601 (343) 396 (231)
narwhal-clang 389 (796) 340 (321) 445 (859) 349 (388)
silverback-gcc 422 (294) 179 (199) 543 (543) 178 (189)
bovinae-gcc 260 (127) 155 (81) 321 (119) 108 (105)
bovinae-clang 255 (132) 108 (78) 272 (112) 95 (96)
igelkott-gcc 141 (79) 111 (63) 168 (87) 58 (64)
igelkott-clang 152 (76) 63 (40) 149 (70) 61 (62)
deck-gcc 342 (160) 234 (114) 444 (172) 226 (120)
deck-clang 297 (166) 189 (123) 332 (140) 101 (126)

The square root function yields slightly different results:

Oddly enough, the sqrtps built-in function is slower than the exact method, and the appr method without NR is now faster instead. The appr_nr method still offers no advantage; it is instead consistently worse than exact.

Here are the original timings for the square root function as well, with -Ofast. Again, numbers in parentheses use -march=native:

Machine/Compiler exact appr appr_nr sqrtps
jackalope-clang 35197 (5743) 1494 (2738) 19191 (4308) N/A
narwhal-gcc 505 (399) 399 (427) 659 (559) 796 (785)
narwhal-clang 448 (823) 327 (319) 638 (847) 803 (780)
silverback-gcc 625 (297) 271 (190) 958 (728) 1163 (1135)
bovinae-gcc 301 (148) 155 (81) 408 (200) 225 (226)
bovinae-clang 315 (244) 92 (60) 399 (159) 317 (227)
igelkott-gcc 173 (95) 119 (38) 233 (124) 288 (296)
igelkott-clang 168 (143) 63 (48) 234 (104) 170 (283)
deck-gcc 419 (205) 215 (108) 519 (252) 575 (574)
deck-clang 325 (244) 153 (88) 372 (180) 315 (458)

As noted in discussions on HN, some use cases may require determinism, i.e. the guarantee that different machines will produce the exact same result. The rsqrtps instruction is only required to have a relative error smaller than $1.5 \cdot 2^{-12}$, so it can be implemented differently on different machines6.

However, the IEEE 754 standard does specify the result of the square root exactly, and the sqrtps instruction is guaranteed to comply with it. So, the best we may be able to do is to use sqrtps followed by divps — which is exactly what we got when we used -O3 and -fno-math-errno. It wasn’t until we introduced -funsafe-math-optimizations that we lost reproducibility. If reproducibility is a requirement, a fair comparison would use only IEEE-754 compliant floating-point operations:

This tips the scales in favor of the approximations. Using the appr_nr method results in a 2-4 times faster implementation compared to the exact method. narwhal-clang is an outlier here because of a surprisingly slow exact method.
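
For reference, if we wanted to write the reproducible variant explicitly instead of relying on compiler flags, a minimal sketch using the standard SSE intrinsics could look like this (my own formulation, not code from the benchmark):

#include <xmmintrin.h>

/* packed reciprocal square root built only from operations that
   IEEE 754 specifies exactly (square root and division), so the
   result should be bit-for-bit reproducible across machines */
static inline __m128 rsqrt_ieee(__m128 x) {
    return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
}

This is essentially the sqrtps followed by divps sequence that GCC emitted with -O3 and -fno-math-errno.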

Try it yourself

You can try to run the benchmarks on your machine and see if you get similar results. There is a shell script bench/run.sh that will generate and run benchmarks using the bench/bench.c.m4 file. These files can be found in this blog’s repo. Simply run the script with no arguments and it will generate a .tsv file with all results:

$ cd bench
$ sh run.sh
$ grep rsqrt bench.tsv | sort -nk3 | head
rsqrt   appr    40      1.91    0.0139  clang-Ofast-march=native
rsqrt   rsqrtps 56      32.08   0.0000  clang-O3
rsqrt   appr    58      31.08   0.0193  clang-O3
rsqrt   rsqrtps 58      2.48    0.0000  clang-O3-fno-math-errno-funsafe-math-optimizations-ffinite-math-only
rsqrt   rsqrtps 59      2.45    0.0000  gcc-Ofast
rsqrt   rsqrtps 59      2.48    0.0000  clang-Ofast
rsqrt   rsqrtps 59      31.07   0.0000  gcc-O3
rsqrt   rsqrtps 59      7.83    0.0000  gcc-O3-fno-math-errno
rsqrt   appr    60      2.41    0.0144  clang-O3-fno-math-errno-funsafe-math-optimizations-ffinite-math-only
rsqrt   rsqrtps 60      8.09    0.0000  clang-O3-fno-math-errno

Final thoughts

To summarize, using simply 1/sqrtf(x) on modern x86 processors can be both faster and more accurate than the fast inverse square root method from Quake III’s Q_rsqrt function.

However, a key takeaway is that you have to tell the compiler to make it faster. When simply compiling with -O3, the fast inverse square root method is actually considerably faster than the naive implementation. We have to allow the compiler to violate some strict specification requirements in order to make it emit a faster implementation.

As noted in discussions on HN, there are use cases where determinism, i.e. reproducibility between machines, is a requirement. In that case, we may only use floating-point operations that are standardized to be deterministic, which the rsqrtps instruction is not. For such use cases, the fast inverse square root method can be 2-4 times faster than the naive implementation.

If very low accuracy can be tolerated, it is possible to get a slightly faster implementation by skipping the Newton-Raphson step from the fast inverse square root method. Interestingly, the compiler also performs an NR step after using approximate implementations of the inverse square root. This can also be made slightly faster by skipping the NR step — by only emitting the approximate instruction with the help of compiler intrinsics.

In this post, we focused on x86, but how about other instruction sets?7 The fast inverse square root method could perhaps still be useful for processors without dedicated square root instructions.

How do the hardware implementations of the approximate square root instructions typically work? Could an approximate hardware implementation potentially use something similar to the first approximation of the fast inverse square root method?

2023-04-21: correct femtoseconds -> picoseconds, discuss determinism


  1. This only covers normal floating-point values; there are also two zeroes, subnormal numbers and special values such as NaN and inf. The algorithm does not take these into consideration.↩︎︎

  2. float is not necessarily binary32 according to the C standard, but on most architectures, e.g. modern x86, it is. The author of the Q_rsqrt function has made the assumption that binary32 is used.↩︎︎

  3. Lomont wrote (float)(1.0/sqrt(x)); however, sqrt (rather than sqrtf) works on a double rather than a float. This could perhaps have made the naive method slower as it may cause casting to and from double.↩︎︎

  4. This includes -fno-signed-zeros, -fno-trapping-math, -fassociative-math and -freciprocal-math. I tried to set all of these individually but I was not able to get the rsqrtps instruction unless I enabled -funsafe-math-optimizations. It seems to do additional things besides enabling those flags.↩︎︎

  5. The compiler has performed the operations in a slightly different order. They appear to be calculating it as -0.5f*y*(x*y*y+-3.0f).↩︎︎

  6. As per Robert O’Callahan’s investigations, AMD and Intel implementations of rsqrt instructions seem to differ in exactly 22 out of the $2^{32}$ possible input values.↩︎︎

  7. I did some quick tests on an armv7 phone, appr_nr performed similarly to exact. I also attempted running on a MIPS32 machine but I ran into issues with the cross-compiling toolchain. For another time..↩︎︎