```
||||| _ _ _ _
= Z . = | |_ | | |_ __ _ _ _ _ ___| |_
= , = | ' \| | | ' \| ' \ _| ' \/ -_) _|
= o ` = |_||_|_|_|_|_|_|_||_(_)_||_\___|\__|
|||||
```
# Revisiting The Fast Inverse Square Root - Is It Still Useful?
_April 20, 2023_
This article has some discussion on [Hacker News][hn-comments].
[hn-comments]: https://news.ycombinator.com/item?id=35646315
In 2005, id Software released the source code for their 1999 game _Quake III
Arena_ under the GPL-2 license. In the file [code/game/q\_math.c][q_rsqrt],
there is a function for calculating the reciprocal square root of a number
which at first glance seems to use a very peculiar algorithm:
[q_rsqrt]: https://github.com/id-Software/Quake-III-Arena/blob/master/code/game/q_math.c#L552
```c
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
```
Many articles have been written about this particular algorithm and it has its
own well written Wikipedia [page][wikipage] where it is referred to as the
_fast inverse square root_. The algorithm actually appeared on various forums
before the Q3 source code was released. Ryszard of Beyond3D did some
[investigating][rys20061] in 2004-2005 and [eventually][rys20062] tracked down
the original author of the algorithm to be Greg Walsh at Ardent Computer who
created it more than a decade earlier.
[wikipage]: https://en.wikipedia.org/wiki/Fast_inverse_square_root
[rys20061]: https://www.beyond3d.com/content/articles/8/
[rys20062]: https://www.beyond3d.com/content/articles/15
Table of contents
- 1 [How does it work?][]
- 1.1 [First approximation][]
- 1.2 [Improving the approximation][]
- 2 [How fast is it?][]
- 2.1 [Initial testing][]
- 2.2 [Disassembly][]
- 2.3 [Broad sweep][]
- 2.4 [Try it yourself][]
- 3 [Final thoughts][]
## How does it work?
So how does the method work, anyway? It is performed in two steps:
1. obtain a rough approximation `y` for the reciprocal square root of our
`number`:
```c
y = number;
i = * ( long * ) &y;
i = 0x5f3759df - ( i >> 1 );
y = * ( float * ) &i;
```
2. improve the approximation using a single step of the Newton-Raphson (NR) method:
```c
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = y * ( threehalfs - ( x2 * y * y ) );
```
### First approximation
The most interesting part is the first one. It uses a seemingly magic number
`0x5f3759df` and some bit shifting and somehow ends up with the reciprocal
square root. The first line stores the 32-bit floating-point number `y` as a
32-bit integer `i` by taking a pointer to `y`, converting it to a `long`
pointer and dereferencing it. So `y` and `i` hold two identical 32-bit vectors,
but one is interpreted as a floating-point number and the other is interpreted
as an integer number. Then, the integer number is shifted one step to the
right, negated, and the constant `0x5f3759df` is added. Finally, the resulting
value is interpreted as a floating number again by dereferencing a `float`
pointer that points to the integer `i` value.
Here, shifting, negation and addition is performed in the integer domain, how
do these operations affect the number in the floating-point domain? In order to
understand how this can yield an approximation of the reciprocal square root we
must be familiar with how floating point numbers are represented in memory. A
floating-point number consists of a sign $`s \in \{0,1\}`, exponent $`e\in
\mathbb{Z}` and a fractional part $`0 \leq f \lt 1`. The value of the
floating-point number is then[^float-normal]
$$`y = (-1)^s \cdot (1 + f) \cdot 2^e.`
In our case, we can assume that our `float` is in the IEEE 754 [binary32][]
format[^cfloat], the bits are then ordered as shown below.
```
-31-30----23-22--------------------0
| S | E (8b) | F (23b) |
--- -------- -----------------------
```
The most significant bit is the sign bit $`S`, followed by 8 bits ($`E`)
representing the exponent $`e` and the remaining 23 bits ($`F`) representing
the fractional part $`f`. The number is negative when $`S=1`. The 8-bit number
$`E` is not directly used as the exponent, it has an offset or bias of $`2^8-1
= 127`. So $`E=0` means that the exponent is $`e=-127`. $`F` is simply a
fractional binary number with the decimal point before the first digit such
that $`f=F\cdot2^{-23}`.
We can write a simple C program `interp.c` to print both the integer and
floating-point interpretations of a given number and also extract the different
parts:
```c
#include
#include
#include
#include
int main(int argc, char *args[]) {
/* parse number from args */
uint32_t i;
int ret;
if (argc == 2) {
ret = sscanf(args[1], "%u", &i);
} else if (argc == 3 && strcmp(args[1], "-h") == 0) {
ret = sscanf(args[2], "%x", &i);
} else if (argc == 3 && strcmp(args[1], "-f") == 0) {
float y;
ret = sscanf(args[2], "%f", &y);
i = *(uint32_t*)&y;
} else {
return EXIT_FAILURE;
}
if (ret != 1) return EXIT_FAILURE;
/* print representations */
printf("hexadecimal: %x\n", i);
printf("unsigned int: %u\n", i);
printf("signed int: %d\n", i);
printf("floating-point: %f\n", *(float*)&i);
/* print components */
int S = i >> 31;
int E = (i >> 23) & ((1 << 8)-1);
int e = E - 127;
int F = i & ((1 << 23)-1);
float f = (float)F / (1 << 23);
printf("S: %d\n", S);
printf("E: %d (0x%x) <=> e: %d\n", E, E, e);
printf("F: %d (0x%x) <=> f: %f\n", F, F, f);
return EXIT_SUCCESS;
}
```
We can for example look at the number `0x40b00000`:
```cmdline
$ ./interp -h 40b00000
hexadecimal: 40b00000
unsigned int: 1085276160
signed int: 1085276160
floating-point: 5.500000
S: 0
E: 129 (0x81) <=> e: 2
F: 3145728 (0x300000) <=> f: 0.375000
```
We can also extract the parts of a floating-point number:
```cmdline
$ ./interp -f -32.1
hexadecimal: c2006666
unsigned int: 3254806118
signed int: -1040161178
floating-point: -32.099998
S: 1
E: 132 (0x84) <=> e: 5
F: 26214 (0x6666) <=> f: 0.003125
```
Even now when we know how the floating-point numbers are represented in memory,
it is not entirely obvious how performing operations in the integer domain
would affect the floating-point domain. At first we can try to simply iterate
over a range of floating-point number and see what integer values we get:
```c
#include
int main() {
float x;
for (x = 0.1; x <= 8.0; x += 0.1) {
printf("%f\t%d\n", x, *(int*)&x);
}
}
```
We can then plot the floating-point values on the x-axis and the integer values
on the y-axis with e.g. [gnuplot][] to get a plot like this:
[gnuplot]: http://www.gnuplot.info/
```
41000000 +------------------------------------------------------------+
| + + + + + +********** |
| *********** |
40800000 |-+ ********* +-|
| ***** |
| ***** |
40000000 |-+ ***** +-|
| ** |
| ** |
3f800000 |-+ *** +-|
| * |
| ** |
| * |
3f000000 |-+ * +-|
| * |
| * |
3e800000 |-* +-|
| * |
| * + + + + + + + |
3e000000 +------------------------------------------------------------+
0 1 2 3 4 5 6 7 8
floating-point
```
Well, this curve looks quite familiar. We can look further at some of the data
points using our previous program:
```cmdline
$ ./interp -f 1.0
hexadecimal: 3f800000
unsigned int: 1065353216
signed int: 1065353216
floating-point: 1.000000
S: 0
E: 127 (0x7f) <=> e: 0
F: 0 (0x0) <=> f: 0.000000
$ ./interp -f 2.0
hexadecimal: 40000000
unsigned int: 1073741824
signed int: 1073741824
floating-point: 2.000000
S: 0
E: 128 (0x80) <=> e: 1
F: 0 (0x0) <=> f: 0.000000
$ ./interp -f 3.0
hexadecimal: 40400000
unsigned int: 1077936128
signed int: 1077936128
floating-point: 3.000000
S: 0
E: 128 (0x80) <=> e: 1
F: 4194304 (0x400000) <=> f: 0.500000
```
For 1.0 and 2.0 we get $`S=0`, $`F=0` and a non-zero biased exponent $`E`. If
we remove the bias from this number (subtract by `127 << 23`) and then shift it
to the far right we end up with the exponent $`e`, in other words the base 2
logarithm of the floating-point number. However, this only works when $`S=0`
and $`F=0`, i.e. positive integers. If $`S=1` we have a negative number for
which the logarithm is undefined. But if $`F\ne{}0` and we shift the exponent
to the far right we will simply lose all of that data. We can instead convert
it to a floating-point value and divide by $`2^{23}`, such that the fractional
part scales our resulting value linearly:
```c
(float) (*(int*)&x - (127 << 23)) / (1 << 23)
```
Then we don't exactly get the logarithm but we do get a linear approximation
for all non power of two values. We can plot the approximation together with
the actual logarithmic function:
```
3 +----------------------------------------------------------------------+
| + + + + + ###*********** |
| ##************ |
2 |-+ #********** +-|
| #****** |
| #****** |
| ****** |
1 |-+ #*** +-|
| #*** |
| #** |
0 |........**............................................................|
| ** |
| * |
-1 |-+ #* +-|
| * |
| #* |
| * |
-2 |-* +-|
| * log2(x) ####### |
|* + + + + + approximation+******* |
-3 +----------------------------------------------------------------------+
0 1 2 3 4 5 6 7 8
```
This means that when we take a floating-point number and interpret it as an
integer number, we obtain an approximation of the logarithm of that number,
with some offset and scaling. And when we interpret an integer number as a
floating-point number, we get the opposite, i.e. the exponential or
antilogarithm of our integer value. This basically means that when we perform
operations in the integer domain, it is as if we perform operations in the
logarithmic domain. For example, if we remember our [logarithmic
identities][logident], we know that if we take the logarithm of two numbers and
add them together, we get the logarithm of their product:
$$`\log{a} + \log{b} = \log{(a \cdot b)}.`
In other words, if we perform addition in the integer domain we get
multiplication in the floating-point domain --- approximately anyway. We can
try this with another simple C program. One thing we need to consider is how
our operation affects the exponent bias. When we add two numbers with biased
exponents we get double bias:
$$`
\begin{align}
E_1 + E_2 &=& (e_1 + B) + (e_2 + B) \\
&=& e_1 + e_2 + 2B.
\end{align}
`
We want our bias to remain as $`B` rather than $`2B` so in order to counter
this we simply subtract the result by $`B`. Our C program that performs
floating-point multiplication using integer addition may then look like this:
```c
#include
#include
#include
const uint32_t B = (127 << 23);
int main(int argc, char *args[]) {
/* parse factors from args */
float a, b;
if (argc == 3) {
int ret = sscanf(args[1], "%f", &a);
ret += sscanf(args[2], "%f", &b);
if (ret != 2) return EXIT_FAILURE;
} else {
return EXIT_FAILURE;
}
/* perform multiplication (integer addition) */
uint32_t sum = *(uint32_t*)&a + *(uint32_t*)&b - B;
float y = *(float*)∑
/* compare with actual */
float y_actual = a*b;
float rel_err = (y - y_actual) / y_actual;
printf("%f =? %f (%.2f%%)\n", y, y_actual, 100*rel_err);
}
```
Let's try it out:
```cmdline
$ ./mul 3.14159 8.0
25.132721 =? 25.132721 (0.00%)
$ ./mul 3.14159 0.2389047
0.741016 =? 0.750541 (-1.27%)
$ ./mul -15.0 3.0
-44.000000 =? -45.000000 (-2.22%)
$ ./mul 6.0 3.0
16.000000 =? 18.000000 (-11.11%)
$ ./mul 0.0 10.0
0.000000 =? 0.000000 (inf%)
```
Most of the time it is not perfectly accurate, it is correct only if one of the
factors is a power of two, and least accurate when both factors are right
between two powers of 2.
How about the reciprocal square root? The reciprocal square root
$`\frac{1}{\sqrt{x}}` is equivalent to $`x^{-1/2}` so we will need another
logarithmic identity:
$$`p\log{x} = \log{x^p}`
This means that if we perform multiplication in the integer domain, we get
exponentiation in the floating-point domain. Depending on our exponent $`p` we
can obtain several different functions, e.g:
| $`p` | $`f(x)` |
|-------|-----------------------|
| 2 | $`x^2` |
| 1/2 | $`\sqrt{x}` |
| -1 | $`\frac{1}{x}` |
| -1/2 | $`\frac{1}{\sqrt{x}}` |
In order to get a first approximation of the reciprocal square root, we simply
need to multiply by -1/2 in the integer domain and adjust for the bias. The
bias will then be $`-B/2` and we want the bias to be $`B` so we simply need
to add $`3B/2 = \texttt{0x5f400000}`. So, we will multiply by -1/2 by shifting
right one step and negating, and then add the bias:
```c
- (i << 1) + 0x5f400000;
```
This is now identical to the Q3 source code except that the constant value
differs slightly. They used `0x5f3759df` while we currently have `0x5f400000`.
We can see if it is possible to make improvements by looking at our error. We
simply subtract our approximate value for the reciprocal square root by the
exact value and plot the result for a certain range of numbers:
```
f4240 +--------------------------------------------------------------+
| **+ *** + *** + + |
dbba0 |-+ * ** ****** ***** +-|
| * ** ** *** |
c3500 |-+ * * *** *** +-|
| * *** ** |
aae60 |-+ * ** +-|
| * * |
927c0 |-+ * ** +-|
7a120 |-+* ** +-|
| * ** |
61a80 |-+* ** +-|
| * * |
493e0 |-* ** +-|
| * ** |
30d40 |*+ * +-|
|* ** |
186a0 |*+ **-|
|* + + + + + *|
0 +--------------------------------------------------------------+
1 1.5 2 2.5 3 3.5 4
floating-point number
```
The graph repeats horizontally in both directions (only in different scale) so
we only need to look at this part to understand the error for all (normal)
floating-point numbers. We can see that the approximate value is always
overestimating, by simply subtracting a constant that is around half the
maximum error we can make it symmetric and thus decrease the average absolute
error. Looking at the graph, subtracting something like 0x7a120 might work. Our
constant would then be 0x5f385ee0 which is closer to the constant used in Q3.
In the integer domain, our error will simply center the error around the x-axis
in the above diagram. In the floating-point domain, the error is affected
similarly except when our subtraction borrows from the exponent:
```
0.03 +--------------------------------------------------------------+
| **+ *** + + + + |
| * ** ************ |
0.02 |-+ * ** *** *** +-|
| * * *** *** |
| * *** ** |
0.01 |-+ * ** +-|
| * * |
| * ** |
0 |..*..............................................***..........|
| * ** |
| * ** |
| * * |
-0.01 |-* ** +-|
| * ******|
|* |
-0.02 |*+ +-|
|* |
|* + + + + + |
-0.03 +--------------------------------------------------------------+
1 1.5 2 2.5 3 3.5 4
floating-point number
```
We could potentially try to find an actual optimum for some reasonable
[objective function][] but we will stop here. In the case of the original Q3
constant, it is not really clear how it was chosen, perhaps using trial and
error.
[binary32]: https://en.wikipedia.org/wiki/Single-precision_floating-point_format
[repeating fractional part]: https://en.wikipedia.org/wiki/Repeating_decimal
[logident]: https://en.wikipedia.org/wiki/List_of_logarithmic_identities
[objective function]: https://en.wikipedia.org/wiki/Objective_function
### Improving the approximation
The second part is less unconventional. When a first approximation has been
obtained, one can improve it by using a method known as Newton-Raphson (NR). If
you are unfamiliar with it, Wikipedia has a good [article][wiki-nr] on it. The
NR method is used to improve an approximation for the root of an equation.
Since we want the reciprocal square root we need an equation $`f(y)` that is
zero when $`y` is exactly the reciprocal square root of $`x`:
$$`
\begin{align}
y = \frac{1}{\sqrt{x}} \: \Leftrightarrow \: \frac{1}{y^2} = x \\
\Rightarrow f(y) = \frac{1}{y^2} - x = 0
\end{align}
`
If we have an approximate value $`y_n` we can get a better approximation
$`y_{n+1}` by calculating where the tangent of the function's graph at
$`y=y_n` (i.e. the derivative) intersects $`f(y)=0$$`. That value can be
expressed as
$$`
\begin{align}
y_{n+1} &=& y_n - \frac{f(y_n)}{f'(y_n)} \\
&=& y_n \left( \frac{3}{2} - \frac{x}{2} \cdot {y_n}^2 \right)
\end{align}
`
which is the exact same expression that is used in the second part of the Q3
function.
[wiki-nr]: https://en.wikipedia.org/wiki/Newton%27s_method
## How fast is it?
Back in 2003 Chris Lomont wrote an [article][lomont2003] about his
investigations of the algorithm. His testing yielded that the algorithm was
four times faster than using the more straightforward way of simply using
`sqrt(x)`[^sqrt] from the standard library and taking its reciprocal.
In 2009, Elan Ruskin made a post, [Timing Square Root][ruskin2009], where he
primarily looked at the square root function but also compared the fast inverse
square root algorithm to other methods. On his Intel Core 2, the fast inverse
square root was 4 times slower than using `rsqrtss`, or 30% slower than
`rsqrtss` with a single NR step.
[lomont2003]: https://www.lomont.org/papers/2003/InvSqrt.pdf
[ruskin2009]: https://web.archive.org/web/20210208132927/http://assemblyrequired.crashworks.org/timing-square-root/
Since then, there has come several new extensions to the x86 instruction set. I
have tried to sum up all square root instructions currently available:
| Set | $`\sqrt{x}` | $`\frac{1}{\sqrt{x}}` | Width |
|--------------------|-------------------------------------|--------------------------------------------------------|-------|
| [x87][] (1980) | `fsqrt` | | 32 |
| [3DNow!][] (1998) | | `pfrsqrt` | 128 |
| [SSE][] (1999) | `sqrtps`, `sqrtss` | `rsqrtps`, `rsqrtss` | 128 |
| [SSE2][] (2000) | `sqrtpd`, `sqrtsd` | | 128 |
| [AVX][] (2011) | `vsqrtps`, `vsqrtpd`, `vsqrtps_nr`, | `vrsqrtps`, `vrsqrtps_nr` | 256 |
| [AVX-512][] (2014) | | `vrsqrt14pd`, `vrsqrt14ps`, `vrsqrt14sd`, `vrsqrt14ss` | 512 |
[x87]: https://en.wikipedia.org/wiki/8087
[3DNow!]: https://en.wikipedia.org/wiki/3DNow!
[SSE]: https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions
[SSE2]: https://en.wikipedia.org/wiki/SSE2
[AVX]: https://en.wikipedia.org/wiki/Advanced_Vector_Extensions
[AVX-512]: https://en.wikipedia.org/wiki/AVX-512
The `fsqrt` is quite obsolete by now. The 3DNow! extension has also been
deprecated and is no longer supported. All x86-64 processors support at least
SSE and SSE2. Most processors support AVX and some support AVX-512, but e.g.
GCC currently chooses to not emit any AVX instructions by default.
The `p` and `s` is short for "packed" and "scalar". The packed instructions are
vector [SIMD][] instructions while the scalar ones only operate on a single
value at a time. With a register width of e.g. 256 bits, the packed instruction
can perform $`256/32=8` calculations in parallel. The `s` or `d` is short for
"single" or "double" precision floating-point. Since we are considering
approximations we will be using single precision floating-point numbers. We may
then use either the `ps` or `ss` variants.
[SIMD]: https://en.wikipedia.org/wiki/Single_instruction,_multiple_data
The fast inverse square root method had a pretty hard time against the
`rsqrtss` instruction back in 2009 already. And since then, multiple extensions
with specialized SIMD instructions has been implemented in modern x86
processors. Surely, the fast inverse square root has no chance today and its
time has passed?
Why don't we give it a try ourselves right now, we can start by running some
tests on my current machine which has a relatively modern processor: an AMD Zen
3 5950X from late 2020.
### Initial testing
We will write a C program that tries to calculate the reciprocal square root
using three different methods:
- `exact`: simply `1.0 / sqrtf(x)`, using the `sqrtf` function from the C
standard library,
- `appr`: first approximation from the Q3 source as explained above,
- `appr_nr`: the full Q3 method with one iteration of Newton-Raphson.
For each method we perform the calculation for each value in a randomized input
array and time how long it takes in total. We can use the `clock_gettime`
function from libc (for POSIX systems) to get the time before and after we
perform the calculations and calculate the difference. We will then repeat this
many times to decrease the random variations. The C program looks like this:
```c
#include
#include
#include
#include
#include
#define N 4096
#define T 1000
#define E9 1000000000
#ifndef CLOCK_REALTIME
#define CLOCK_REALTIME 0
#endif
enum methods { EXACT, APPR, APPR_NR, M };
const char *METHODS[] = { "exact", "appr", "appr_nr" };
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
static inline float rsqrt_appr(float x) {
uint32_t i = *(uint32_t*)&x;
i = -(i >> 1) + 0x5f3759df;
return *(float*)&i;
}
static inline float rsqrt_nr(float x, float y) { return y * (1.5f - x*0.5f*y*y); }
static inline float rsqrt_appr_nr(float x) {
float y = rsqrt_appr(x);
return rsqrt_nr(x, y);
}
int main() {
srand(time(NULL));
float y_sum[M] = {0};
double t[M] = {0};
for (int trial = 0; trial < T; trial++) {
struct timespec start, stop;
float x[N], y[N];
for (int i = 0; i < N; i++) { x[i] = rand(); }
clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < N; i++) { y[i] = rsqrt_exact(x[i]); }
clock_gettime(CLOCK_REALTIME, &stop);
for (int i = 0; i < N; i++) { y_sum[EXACT] += y[i]; }
t[EXACT] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);
clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < N; i++) { y[i] = rsqrt_appr(x[i]); }
clock_gettime(CLOCK_REALTIME, &stop);
for (int i = 0; i < N; i++) { y_sum[APPR] += y[i]; }
t[APPR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);
clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < N; i++) { y[i] = rsqrt_appr_nr(x[i]); }
clock_gettime(CLOCK_REALTIME, &stop);
for (int i = 0; i < N; i++) { y_sum[APPR_NR] += y[i]; }
t[APPR_NR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);
}
printf("rsqrt\tps/op\tratio\terr\n");
for (int m = 0; m < M; m++) {
printf("%s\t%.0f\t%.2f\t%.4f\n",
METHODS[m],
t[m] * 1000.0f / N / T,
(double) t[EXACT] / t[m],
(y_sum[m] - y_sum[EXACT]) / y_sum[EXACT]);
}
return 0;
}
```
At the end of the program we print three things for each method:
- the average time to calculate a single operation in picoseconds -- the lower
the better,
- the ratio of the calculation time compared to the exact method -- the higher
the faster,
- the average error between the method and the exact method -- just to make
sure the calculations are performed correctly.
So, what do we expect? There are dedicated functions for calculating the
reciprocal square root in the x86 instruction set that the compiler should be
able to emit. The throughput may then be higther than in the approximate method
where we perform multiple operations.
Let's go ahead and try it, we'll compile it using GCC without any optimizations
at first, explicitly with `-O0`. Since we are using `math.h` for the exact
method we will also need to link the math library using `-lm`:
```cmdline
$ gcc -lm -O0 rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 3330 1.00 0.0000
appr 2020 1.65 0.0193
appr_nr 6115 0.54 -0.0010
```
This seems reasonable. The error is noticeable for the first approximation but
reduced after one iteration of NR. The first approximation is actually faster
than the exact method but when done together with a step of NR it is twice as
slow. The NR method requires more operations so this seems reasonable.
Alright, but this is only a debug build, let's try adding optimizations using
the `-O3` flag. This will enable all optimizations that do not disregard any
standards compliance.
```cmdline
$ gcc -lm -O3 rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 1879 1.00 0.0000
appr 72 26.01 0.0193
appr_nr 178 10.54 -0.0010
```
Hmm, now the approximations are actually a lot faster than before but the time
of the exact method has only halved, making the approximation with NR more than
ten times faster than the exact method. Perhaps the compiler failed to emit the
reciprocal square root functions? Maybe it will improve if we use the `-Ofast`
flag instead which is described by the [gcc(1)][] man page:
> Disregard strict standards compliance. -Ofast enables all -O3 optimizations.
> It also enables optimizations that are not valid for all standard- compliant
> programs. It turns on -ffast-math, -fallow-store-data-races and the
> Fortran-specific -fstack-arrays, unless -fmax-stack-var-size is specified,
> and -fno-protect-parens. It turns off -fsemantic-interposition.
[gcc(1)]: https://www.man7.org/linux/man-pages/man1/gcc.1.html
Our exact method may no longer be as accurate as before, but it may be faster.
```cmdline
$ gcc -lm -Ofast rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 153 1.00 0.0000
appr 118 1.30 0.0137
appr_nr 179 0.85 -0.0009
```
And it is indeed faster. The first approximation is still faster, but with a
step of NR it is slower than the exact method. The error has decreased slightly
for the approximations because we are still comparing against the "exact"
method which now yields different results. Oddly enough, the first
approximation has become half as fast. This seems to be a quirk of GCC, as
Clang does not have this issue, otherwise it produces similar results:
```cmdline
$ clang -lm -O0 rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 3715 1.00 0.0000
appr 1933 1.92 0.0193
appr_nr 6001 0.62 -0.0010
$ clang -lm -O3 rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 1900 1.00 0.0000
appr 61 31.26 0.0193
appr_nr 143 13.24 -0.0010
$ clang -lm -Ofast rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 148 1.00 0.0000
appr 62 2.40 0.0144
appr_nr 145 1.02 -0.0009
```
### Disassembly
For both compilers, there is quite a large difference between `-O3` and
`-Ofast`. We can look at the disassembly to see what is going on. We will need
to provide the `-g` flag to the compiler in order to get debug symbols in the
binary that tell us which object code corresponds to which source code.
Thereafter we can run `objdump -d` with the `-S` flag to see the disassembled
instructions next to the source code:
```objdumpc
$ gcc -lm -O3 -g rsqrt.c
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
118e: 66 0f ef db pxor %xmm3,%xmm3
1192: 0f 2e d8 ucomiss %xmm0,%xmm3
1195: 0f 87 e1 02 00 00 ja 147c
119b: f3 0f 51 c0 sqrtss %xmm0,%xmm0
119f: f3 0f 10 0d 99 0e 00 movss 0xe99(%rip),%xmm1 # 2040
11a6: 00
...
11ab: f3 0f 5e c8 divss %xmm0,%xmm1
...
2040: 00 00 80 3f # 1.0f
```
In case you are unfamiliar, this is the AT&T syntax for x86-64 assembly. Note
that the source operand is always before the destination operand. The
parentheses indicate an address, for example `movss 0xecd(%rip),%xmm1` copies
the value located 0xecd bytes ahead of the address in the `rip` register
(instruction pointer, a.k.a. PC) to the `xmm1` register. The `xmmN` registers
are 128 bits wide, or 4 words. However, the `ss` instructions are for scalar
single-precision values, so it will only apply the operation on a single
floating-point value in the least significant 32 bits.
In the `-O3` case we use the scalar `sqrtss` followed by `divss`. There is also
a compare `ucomiss` and a jump `ja` that will set [`errno`][errno] to `EDOM` in
case the input is less than -0. We are not using `errno` at all so we can
remove the setting of `errno` by providing the `-fno-math-errno` flag:
```cmdline
$ gcc -lm -O3 -g -fno-math-errno rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 479 1.00 0.0000
appr 116 4.13 0.0193
appr_nr 175 2.74 -0.0010
```
```objdumpc
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
1170: 0f 51 0c 28 sqrtps (%rax,%rbp,1),%xmm1
1174: f3 0f 10 1d c4 0e 00 movss 0xec4(%rip),%xmm3 # 2040
117b: 00
117c: 48 83 c0 10 add $0x10,%rax
1180: 0f c6 db 00 shufps $0x0,%xmm3,%xmm3
1184: 0f 28 c3 movaps %xmm3,%xmm0
1187: 0f 5e c1 divps %xmm1,%xmm0
...
2040: 00 00 80 3f # 1.0f
```
[errno]: https://en.wikipedia.org/wiki/Errno.h
This prevents us from having to check every input value individually and thus
allows us to use the packed variants of the instructions, performing 4
operations at a time. This improved the performance a lot. However, we still
use [`sqrtps`][sqrtps] followed by [`divps`][divps]. We will have to also
enable `-funsafe-math-optimizations`[^unsafe-math] and `-ffinite-math-only` in
order to make GCC emit [`rsqrtps`][rsqrtps] instead. We then get identical code
to when we used `-Ofast`:
```cmdline
$ gcc -lm -O3 -g -fno-math-errno -funsafe-math-optimizations -ffinite-math-only rsqrt.c
$ ./a.out
rsqrt ps/op ratio err
exact 155 1.00 0.0000
appr 120 1.29 0.0137
appr_nr 182 0.85 -0.0009
```
```objdumpc
$ objdump -d -S a.out
...
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
1170: 0f 52 0c 28 rsqrtps (%rax,%rbp,1),%xmm1
1174: 0f 28 04 28 movaps (%rax,%rbp,1),%xmm0
1178: 48 83 c0 10 add $0x10,%rax
117c: 0f 59 c1 mulps %xmm1,%xmm0
117f: 0f 59 c1 mulps %xmm1,%xmm0
1182: 0f 59 0d c7 0e 00 00 mulps 0xec7(%rip),%xmm1 # 2050
1189: 0f 58 05 b0 0e 00 00 addps 0xeb0(%rip),%xmm0 # 2040
1190: 0f 59 c1 mulps %xmm1,%xmm0
...
2040: 00 00 40 c0 # -3.0f
...
2050: 00 00 00 bf # -0.5f
```
[sqrtps]: https://www.felixcloutier.com/x86/sqrtps
[divps]: https://www.felixcloutier.com/x86/divps
[rsqrtps]: https://www.felixcloutier.com/x86/rsqrtps
Now it uses `rsqrtps`, but it also has several multiplication instructions as
well as an addition. Why are these needed, isn't the reciprocal square root all
we need? We can get a hint from looking at the disassembly of the `appr_nr`
function:
```objdumpc
static inline float rsqrt_nr(float x, float y) { return y * (1.5f - x*0.5f*y*y); }
12f8: f3 0f 10 1d 80 0d 00 movss 0xd80(%rip),%xmm3 # 2080
12ff: 00
...
1304: 0f 59 05 65 0d 00 00 mulps 0xd65(%rip),%xmm0 # 2070
...
1310: 0f c6 db 00 shufps $0x0,%xmm3,%xmm3
...
1318: 0f 28 d1 movaps %xmm1,%xmm2
131b: 0f 59 d1 mulps %xmm1,%xmm2
131e: 0f 59 d0 mulps %xmm0,%xmm2
1321: 0f 28 c3 movaps %xmm3,%xmm0
1324: 0f 5c c2 subps %xmm2,%xmm0
1327: 0f 59 c1 mulps %xmm1,%xmm0
...
2070: 00 00 00 3f # 0.5f
...
2080: 00 00 c0 3f # 1.5f
```
The last part looks quite similar, because it is actually doing the same thing:
an iteration of Newton-Raphson[^alt-nr]. This is hinted in the man page of
[gcc(1)][]:
> This option enables use of the reciprocal estimate and reciprocal square root
> estimate instructions with additional Newton-Raphson steps to increase
> precision instead of doing a divide or square root and divide for
> floating-point arguments.
The `rsqrtps` instruction only guarantees a relative error smaller than
$`1.5\cdot2^{-12}`, the NR iteration reduces it further just like in the Q3
code.
If we do not need this extra precision, can we get a speedup by skipping the NR
step? We can use built-in compiler [intrinsics][] in order to make the compiler
only emit the `rsqrtps` instruction. The GCC manual has a
[list][gcc-x86-builtins] of built-in functions for the x86 instruction set.
There is a `__builtin_ia32_rsqrtps` function that will emit the `rsqrtps`
instruction:
```c
v4sf __builtin_ia32_rsqrtps (v4sf);
```
[intrinsics]: https://en.wikipedia.org/wiki/Intrinsic_function
[gcc-x86-builtins]: https://gcc.gnu.org/onlinedocs/gcc-12.2.0/gcc/x86-Built-in-Functions.html
The manual also has a [chapter][gcc-vector] about how to use these vector
instructions with built-in functions. We need to add a `typedef` for the `v4sf`
type which contains four floating point numbers. We will then use an array of
$`N/4` of these vectors and simply provide one vector at a time to the
built-in function. N is a multiple of four so there are no half full vectors.
We can simply cast our previous `float` input array to a `vfs4` pointer. We
will add these parts to our previous program:
```c
typedef float v4sf __attribute__ ((vector_size(16)));
v4sf rsqrt_intr(v4sf x) { return __builtin_ia32_rsqrtps(x); };
v4sf *xv = (v4sf*)x, *yv = (v4sf*)y;
clock_gettime(CLOCK_REALTIME, &start);
for (int i = 0; i < N/4; i++) { yv[i] = rsqrt_intr(xv[i]); }
clock_gettime(CLOCK_REALTIME, &stop);
for (int i = 0; i < N; i++) { y_sum[INTR] += y[i]; }
t[INTR] += ((stop.tv_sec-start.tv_sec)*E9 + stop.tv_nsec-start.tv_nsec);
```
[gcc-vector]: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
We can compile it in order to run and disassemble it:
```cmdline
$ gcc -lm -O3 -g rsqrt_vec.c
$ ./a.out
rsqrt ps/op ratio err
exact 1895 1.00 0.0000
appr 72 26.39 0.0193
appr_nr 175 10.81 -0.0010
rsqrtps 61 31.00 0.0000
```
```objdumpc
$ objdump -d -S a.out
...
v4sf rsqrt_intr(v4sf x) { return __builtin_ia32_rsqrtps(x); };
1238: 41 0f 52 04 04 rsqrtps (%r12,%rax,1),%xmm0
...
```
Now we are down to a single instruction and it is slightly faster than before.
There are also extensions that not all processors support that we can try to
use. We can tell the compiler to use any extensions that are available on our
processor using `-march=native`. This may make the binary incompatible with
other processors, though.
```cmdline
$ gcc -lm -Ofast -g -march=native rsqrt_vec.c
$ ./a.out
rsqrt ps/op ratio err
exact 78 1.00 0.0000
appr 40 1.96 0.0137
appr_nr 85 0.91 -0.0009
rsqrtps 62 1.25 0.0000
```
Now we are down to almost as good as the first approximation. The intrinsic one
is pretty much just as fast. The "exact" method got replaced by a 256-bit
[`vrsqrtps`][rsqrtps] and a step of NR:
```objdumpc
static inline float rsqrt_exact(float x) { return 1.0f / sqrtf(x); }
11d0: c5 fc 52 0c 18 vrsqrtps (%rax,%rbx,1),%ymm1
11d5: c5 f4 59 04 18 vmulps (%rax,%rbx,1),%ymm1,%ymm0
11da: 48 83 c0 20 add $0x20,%rax
11de: c4 e2 75 a8 05 79 0e vfmadd213ps 0xe79(%rip),%ymm1,%ymm0
11e5: 00 00
11e7: c5 f4 59 0d 91 0e 00 vmulps 0xe91(%rip),%ymm1,%ymm1
11ee: 00
11ef: c5 fc 59 c1 vmulps %ymm1,%ymm0,%ymm0
```
The `__builtin_ia32_rsqrtps` is now using a single `vrsqrtps` and no NR step,
however, it still uses only 128-bit registers.
### Broad sweep
So, we did some testing on my machine and got some insight into what kind of
instructions we can use to calculate the reciprocal square root and how they
might perform. We will now try to run these benchmarks on several machines to
give us an idea how well our findings apply in general. Those machines include
all the ones that I happen to have convenient SSH access to. All resulting data
can be downloaded from [here][bench_res], it also includes results for the
inverse and square root functions, separately.
Below is a list of the x86 machines that were tested along with their CPUs and
their release date. All the previous tests were run on on the computer labeled
as "igelkott".
| Hostname | CPU Family | CPU Model | Year | Form factor |
|------------|----------------|---------------------|------|----------------|
| jackalope | [Core][] | Intel Celeron 550 | 2007 | i686 laptop |
| narwhal | [Piledriver][] | AMD FX-6300 | 2012 | x86_64 desktop |
| silverback | [Ivy Bridge][] | Intel Xeon E5-1410 | 2014 | x86_64 server |
| bovinae | [Kaby Lake][] | Intel Core i5-8250U | 2017 | x86_64 laptop |
| igelkott | [Zen 3][] | AMD Ryzen 5950X | 2020 | x86_64 desktop |
| deck | [Zen 2][] | AMD APU 0405 | 2022 | x86_64 mobile |
[Core]: https://en.wikipedia.org/wiki/Core_(microarchitecture)
[Piledriver]: https://en.wikipedia.org/wiki/Piledriver_(microarchitecture)
[Ivy Bridge]: https://en.wikipedia.org/wiki/Ivy_Bridge_(microarchitecture)
[Kaby Lake]: https://en.wikipedia.org/wiki/Kaby_Lake
[Zen 3]: https://en.wikipedia.org/wiki/Zen_3
[Zen 2]: https://en.wikipedia.org/wiki/Zen_2
Below is a plot of the performance ratio compared to the `exact` method, i.e.
the time of each method divided by the time of the `exact` method. A higher
ratio means higher performance, anything below 1 is slower than `exact` and
anything above is faster. We use the `-Ofast` flag here, as it is the fastest
option that can be used without sacrificing portability.
```
narwhal-gcc ***A*** bovinae-gcc %%%D%%%igelkott-clang ===G===
narwhal-clang ###B### bovinae-clang @@@E@@@ deck-gcc ***H***
silverback-gcc $$$C$$$ igelkott-gcc &&&F&&& deck-clang ###I###
Inverse square root performance ratio, -Ofast
4 +---------------------------------------------------------------------+
| + + + + |
3.5 |-+ +-|
| |
3 |-+ #I +-|
| ##@E |
2.5 |-+ =G= ##@@=G +-|
2 |-+ ==== ==== ##==== +-|
| === $=== ##==& |
1.5 |-+ === ####I## $==== @##=& **H +-|
|----------===########&&&&F&&######--$$====---==##&&-******-----------|-
1 |-+ I####&&&&########B####&&&&######$$=G=##*****########B +-|
| #####&###I#*###### |
0.5 |-+ +-|
| + + + + |
0 +---------------------------------------------------------------------+
exact appr appr\_nr rsqrtps
```
The results are quite similar across all of the machines, the time of the
methods are approximately ranked in the order `rsqrtps` <= `appr` < `exact` <=
`appr_nr`. Using the `appr_nr` method is either slower or the same as the
`exact` method, so it has no real benefit in this case.
The "jackalope" machine was not included in the above plot because it had an
extremely slow `exact` method. Especially when not using `-march=native` as the
compiler then resorted to using the antique `fsqrt` instruction.
Below is a table of the actual timings when using `-Ofast`, numbers in
parenthesis uses `-march=native`. Each number is how long a single operation
takes in picoseconds.
|Machine/Compiler|exact|appr|appr_nr|rsqrtps|
|-------|-----|----|-------|-------|
|jackalope-clang|53634 (5363)|1500 (2733)|4971 (3996)|N/A|
|narwhal-gcc|419 (363)|443 (418)|601 (343)|396 (231)|
|narwhal-clang|389 (796)|340 (321)|445 (859)|349 (388)|
|silverback-gcc|422 (294)|179 (199)|543 (543)|178 (189)|
|bovinae-gcc|260 (127)|155 (81)|321 (119)|108 (105)|
|bovinae-clang|255 (132)|108 (78)|272 (112)|95 (96)|
|igelkott-gcc|141 (79)|111 (63)|168 (87)|58 (64)|
|igelkott-clang|152 (76)|63 (40)|149 (70)|61 (62)|
|deck-gcc|342 (160)|234 (114)|444 (172)|226 (120)|
|deck-clang|297 (166)|189 (123)|332 (140)|101 (126)|
The square root function yields slightly different results:
```
narwhal-gcc ***A*** bovinae-gcc %%%D%%%igelkott-clang ===G===
narwhal-clang ###B### bovinae-clang @@@E@@@ deck-gcc ***H***
silverback-gcc $$$C$$$ igelkott-gcc &&&F&&& deck-clang ###I###
Square root performance ratio, -Ofast
4 +---------------------------------------------------------------------+
| + + + + |
3.5 |-+ E +-|
| @@ @@ |
3 |-+ @@ @@ +-|
| @@ =G= @@ |
2.5 |-+ @@ ===$C$== @@ +-|
2 |-+ @@ ===$$##I#$$=== @ +-|
| @@ ==$####**H*####$== @@ |
1.5 |-+ @@==####**** ***###===@@ +-|
|---------@=####**&&&&&&&&F&&&&--**####==@---------------%%%%D--------|-
1 |-+ I##&&&&&& &&&&&&&*####@ %%%%########I +-|
| &&*#I########********H |
0.5 |-+ $C$$$$$$$$&&&&&&&&F +-|
| + + + + |
0 +---------------------------------------------------------------------+
exact appr appr\_nr sqrtps
```
Oddly enough, the `sqrtps` built-in function is slower than the `exact` method,
and the `appr` without NR is now faster instead. The `appr_nr` method still
offers no advantage, it is instead consistently worse than `exact`.
Here are the original timings for the square root function as well, with
`-Ofast`. Again, numbers in parentheses use `-march=native`:
|Machine/Compiler|exact|appr|appr_nr|sqrtps|
|-------|-----|----|-------|-------|
|jackalope-clang|35197 (5743)|1494 (2738)|19191 (4308)|N/A|
|narwhal-gcc|505 (399)|399 (427)|659 (559)|796 (785)|
|narwhal-clang|448 (823)|327 (319)|638 (847)|803 (780)|
|silverback-gcc|625 (297)|271 (190)|958 (728)|1163 (1135)|
|bovinae-gcc|301 (148)|155 (81)|408 (200)|225 (226)|
|bovinae-clang|315 (244)|92 (60)|399 (159)|317 (227)|
|igelkott-gcc|173 (95)|119 (38)|233 (124)|288 (296)|
|igelkott-clang|168 (143)|63 (48)|234 (104)|170 (283)|
|deck-gcc|419 (205)|215 (108)|519 (252)|575 (574)|
|deck-clang|325 (244)|153 (88)|372 (180)|315 (458)|
[bench_res]: https://git.sr.ht/~nhellman/hllmn/tree/master/item/content/blog/2023-04-20_rsqrt/bench/res
As noted in [discussions][hn-comments] on HN, some usecases may require
determinism, i.e. the guarantee that different machines will produce the exact
same result. The [`rsqrtps`][rsqrtps] instruction is only required to have a
relative error smaller than $`1.5 \cdot 2^{−12}` so this can be implemented
differently on different machines[^rsqrt-diff].
However, the [IEEE 754][ieee-754] specification does have a specification for
the square root function, which the [`sqrtps`][sqrtps] instruction is
guaranteed to comply with. So, the best we may be able to do is to use `sqrtps`
followed by `divps` --- which is exactly what we got when we used `-O3` and
`-fno-math-errno`. It wasn't until we introduced `-funsafe-math-optimzations`
that we lost reproducibility. If reproducibility is a requirement, a fair
comparison would use only IEEE-754 compliant floating-point operations:
```
narwhal-gcc ***A*** bovinae-gcc %%%D%%%igelkott-clang ===G===
narwhal-clang ###B### bovinae-clang @@@E@@@ deck-gcc ***H***
silverback-gcc $$$C$$$ igelkott-gcc &&&F&&& deck-clang ###I###
Inverse square root performance ratio, -O3 -fno-math-errno
12 +---------------------------------------------------------------------+
| + # + + |
10 |-+ # C$ +-|
| # $$ $$ |
| # $$ $$$ |
8 |-+ # $$ =G= $$ +-|
| # $$ == ==== $$$ |
6 |-+ ## $$ === ==== $$$ +-|
| # $$ == ==== $$ |
| # $$ === ##I##### ====$$ |
4 |-+ # $$=== ####&&F&&&**############==== +-|
| # $$== #####&&&&%%D%%%@@&&&&&&******#####I |
2 |-+ #$===#####&&&%%%%% %%%%%%@@@@@@&&***H +-|
|------------$=####&&%%%-----------------------------%%@@@E-----------|---
| I##% + + |
0 +---------------------------------------------------------------------+
exact appr appr\_nr
```
[ieee-754]: https://en.wikipedia.org/wiki/IEEE_754
This tips the scales in favor of the approximations. Using the `appr_nr` method
results in a 2-4 times faster implementation compared to the `exact` method.
`narwhal-clang` is an outlier here because of a surprisingly slow `exact`
method.
### Try it yourself
You can try to run the benchmarks on your machine and see if you get similar
results. There is a shell script `bench/run.sh` that will generate and run
benchmarks using the `bench/bench.c.m4` file. These files can be found in [this
blog's repo][repodir]. Simply run the script with no arguments and it will
generate a `.tsv` file with all results:
```cmdline
$ cd bench
$ sh run.sh
$ grep rsqrt bench.tsv | sort -nk3 | head
rsqrt appr 40 1.91 0.0139 clang-Ofast-march=native
rsqrt rsqrtps 56 32.08 0.0000 clang-O3
rsqrt appr 58 31.08 0.0193 clang-O3
rsqrt rsqrtps 58 2.48 0.0000 clang-O3-fno-math-errno-funsafe-math-optimizations-ffinite-math-only
rsqrt rsqrtps 59 2.45 0.0000 gcc-Ofast
rsqrt rsqrtps 59 2.48 0.0000 clang-Ofast
rsqrt rsqrtps 59 31.07 0.0000 gcc-O3
rsqrt rsqrtps 59 7.83 0.0000 gcc-O3-fno-math-errno
rsqrt appr 60 2.41 0.0144 clang-O3-fno-math-errno-funsafe-math-optimizations-ffinite-math-only
rsqrt rsqrtps 60 8.09 0.0000 clang-O3-fno-math-errno
```
[repodir]: https://git.sr.ht/~nhellman/hllmn/tree/master/item/content/blog/2023-04-20_rsqrt
## Final thoughts
To summarize, using simply `1/sqrtf(x)` on modern x86 processors can be both
faster and more accurate than the _fast inverse square root_ method from Quake
III's `Q_rsqrt` function.
However, a key takeaway is that you have to order the compiler to make it
faster. When simply compiling using `-O3`, the fast inverse square root method
is actually _considerably_ faster than the naive implementation. We have to
allow the compiler to violate some strict specification requirements in order
to make it emit a faster implementation.
As noted by [discussions][hn-comments] on HN, there are usecases where
determinism, i.e. reproducibility between machines, is a requirement. In this
case, we may only use floating-point operations that are standardized to be
deterministic, which the `rsqrtps` instruction is _not_. For such usecases, the
_fast inverse square root_ method can be 2-4 times faster than the naive
implementation.
If very low accuracy can be tolerated, it is possible to get a slightly faster
implementation by skipping the Newton-Raphson step from the fast inverse square
root method. Interestingly, the compiler also performs an NR step after using
approximate implementations of the inverse square root. This can also be made
slightly faster by skipping the NR step --- by only emitting the approximate
instruction with the help of compiler intrinsics.
In this post, we focused on x86, but how about other instructions
sets?[^other-isas] The fast inverse square root method could perhaps still be
useful for processors without dedicated square root instructions.
How are the hardware implementations of approximate square roots typically
implemented? Could an approximate hardware implementation potentially use
something similar to the first approximation of the fast inverse square root
method?
_2023-04-21_: correct femtoseconds -> picoseconds, discuss determinism
[^float-normal]: This only covers all _normal_ floating-point values, there are
also two zeroes, subnormal numbers and special values such as `NaN` and
`inf`. The algorithm does not take these into consideration.
[^cfloat]: `float` is not necessarily binary32 according to the C standard, but
on most architectures, e.g. modern x86, it is. The author of the `Q_rsqrt`
function has made the assumption that binary32 is used.
[^sqrt]: Lomont wrote `(float)(1.0/sqrt(x))`, however `sqrt` (rather than
`sqrtf` works on a `double` rather than a `float`. This could perhaps have
made the naive method slower as this may cause casting to and from
`double`.
[^unsafe-math]: This includes `-fno-signed-zeros`, `-fno-trapping-math`,
`-fassociative-math` and `-freciprocal-math`. I tried to set all of these
individually but I was not able to get the `rsqrtps` instruction unless I
enabled `-funsafe-math-optimizations`. It seems to do additional things
besides enabling those flags.
[^alt-nr]: The compiler has performed the operations in a slightly different
order. They appear to be calculating it as `-0.5f*y*(x*y*y+-3.0f)`.
[^rsqrt-diff]: As per Robert O'Callahan's [investigations][ocallahan2021], AMD
and Intel implementations of rsqrt instructions seem to differ in exactly
22 out of the $`2^{32}` possible input values.
[ocallahan2021]: https://robert.ocallahan.org/2021/09/emulating-amd-rsqrtss-etc-on-intel.html
[^other-isas]: I did some quick tests on an armv7 phone, `appr_nr` performed
similarly to `exact`. I also attempted running on a MIPS32 machine but I
ran into issues with the cross-compiling toolchain. For another time..