# Why Standard C++ Math Functions Are Slow

Performance has always been a high priority for C++, yet there are many examples both in the language and the standard library where compilers produce code that is significantly slower than what a machine is capable of. In this blog post, I’m going to explore one such example from the standard math library.

Suppose we’re tasked with computing the square roots of an array of floating point numbers. We might write a function like this to perform the operation:
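A minimal sketch of such a function follows. The signature (input pointer, element count, output pointer) is an assumption made for illustration; only the name `compute_sqrt1` comes from the text.

```cpp
#include <cmath>

// Writes the square root of each of the n inputs in x to the
// corresponding slot of y. Assumed signature, for illustration only.
void compute_sqrt1(const double* x, int n, double* y) {
  for (int i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}
```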

If we’re using gcc, we can compile the code with the `-O3` and `-march=native` options.

With `-O3`, gcc will optimize the code heavily but will still produce standards-compliant code. The `-march=native` option tells gcc to target the instruction set of the machine it is running on, so the resulting binaries may not be portable even between different x86-64 CPUs.

Now, let’s benchmark the function. We’ll use Google Benchmark to measure how long it takes to compute the square roots of `1,000,000` numbers.
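As a rough, dependency-free stand-in for a Google Benchmark harness, the sketch below times the function with `std::chrono` instead. The helper name `time_compute_sqrt`, the input range, the fixed seed, and the signature of `compute_sqrt1` are all assumptions, and a hand-rolled timer like this is far less careful than a real benchmarking library.

```cpp
#include <chrono>
#include <cmath>
#include <random>
#include <vector>

// Function under test, repeated here so the block is self-contained
// (assumed signature).
void compute_sqrt1(const double* x, int n, double* y) {
  for (int i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}

// Hypothetical helper: returns the average wall-clock time in
// nanoseconds of one call to compute_sqrt1 on n random non-negative
// inputs. Requires n > 0 and repetitions > 0.
double time_compute_sqrt(int n, int repetitions) {
  std::mt19937 rng(0);  // fixed seed so every run sees the same inputs
  std::uniform_real_distribution<double> dist(0.0, 1000.0);
  std::vector<double> x(n), y(n);
  for (double& v : x) v = dist(rng);

  auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < repetitions; ++r) {
    compute_sqrt1(x.data(), n, y.data());
    // Read one result through a volatile so the compiler is less
    // likely to discard the computation.
    volatile double sink = y[n - 1];
    (void)sink;
  }
  auto stop = std::chrono::steady_clock::now();

  std::chrono::duration<double, std::nano> elapsed = stop - start;
  return elapsed.count() / repetitions;
}
```

Calling `time_compute_sqrt(1000000, 100)` would mirror the problem size used here. The absolute numbers depend entirely on the machine and compiler flags.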

Compiling and running our benchmark, we get

Can we do better? Let’s try this version:
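A sketch of the second version, assuming the same signature as before. The body is deliberately the same loop, since the two versions differ only in how they are compiled.

```cpp
#include <cmath>

// Same loop as compute_sqrt1 (assumed signature). The speedup comes
// from the compiler flags, not from any change to the source.
void compute_sqrt2(const double* x, int n, double* y) {
  for (int i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);
  }
}
```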

and compile it with the same options as before plus `-fno-math-errno`.

The only difference between `compute_sqrt1` and `compute_sqrt2` is the extra option `-fno-math-errno` we added when compiling. I’ll explain later what `-fno-math-errno` does; for now, I’ll only point out that the code produced is no longer standards-compliant.

Let’s benchmark `compute_sqrt2`.

Running

we get

Yikes! `compute_sqrt2` is more than 4 times faster than `compute_sqrt1`.

What’s different? Let’s drill down into the assembly to find out. We can produce the assembly by passing gcc the `-S` option, which makes it write assembly (here `sqrt1.s` and `sqrt2.s`) instead of object code.

The result will depend on what architecture you’re using, but looking at `sqrt1.s` on my architecture, we see this section

Let’s break down the first few instructions:

What are instructions 3 and 4 for? Recall that for real numbers, the square root is undefined for negative values. When `std::sqrt` is passed a negative number, the C++ standard requires that it return the special floating point value `NaN` and, on implementations that report math errors through `errno` (as gcc with glibc does by default), that it set `errno` to `EDOM`. That error handling ends up being really expensive.
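The behaviour being paid for can be observed directly. The sketch below is illustrative only: `SqrtResult` and `checked_sqrt` are hypothetical names, and whether `domain_error` comes back `true` for a negative input depends on the compiler, its flags, and the C library.

```cpp
#include <cerrno>
#include <cmath>

// Hypothetical result type for illustration.
struct SqrtResult {
  double value;       // result of std::sqrt (NaN for negative input)
  bool domain_error;  // true if std::sqrt set errno to EDOM
};

// Calls std::sqrt and reports whether it signalled a domain error
// through errno.
SqrtResult checked_sqrt(double x) {
  errno = 0;                  // clear any stale error first
  volatile double input = x;  // discourage compile-time evaluation
  double value = std::sqrt(input);
  return {value, errno == EDOM};
}
```

With error reporting through `errno` in effect, `checked_sqrt(-1.0)` yields a `NaN` value with `domain_error` set. Compiled with `-fno-math-errno`, the flag may stay `false`, because the compiler is then free to skip the check.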

If we look at `sqrt2.s`, we see these instructions for the main loop:

Without the burden of having to do error handling, gcc can produce much faster code. `vsqrtpd` is what’s known as a Single Instruction Multiple Data (SIMD) instruction. It computes the square roots of four double-precision floating point numbers at a time. For computationally expensive functions like sqrt, vectorization helps a lot.

It’s unfortunate that the standard requires such error handling. The error checking is so much slower that many compilers, like Intel’s icc and Apple’s default clang-based compiler, opt out of it by default. Even if we want `std::sqrt` to do error handling, we can’t portably rely on major compilers to provide it.
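If a caller does need to know about invalid inputs, one portable option is to check for them explicitly instead of relying on `errno`. The function below is a hypothetical sketch of that idea (the name `compute_sqrt_checked` and its signature are assumptions), not something taken from the benchmark code.

```cpp
#include <cmath>

// Computes square roots of the n inputs in x into y and returns false
// if any input was negative, without consulting errno.
bool compute_sqrt_checked(const double* x, int n, double* y) {
  bool all_valid = true;
  for (int i = 0; i < n; ++i) {
    y[i] = std::sqrt(x[i]);  // yields NaN for negative input
    if (x[i] < 0.0) {
      all_valid = false;
    }
  }
  return all_valid;
}
```

Because the validity check no longer depends on `errno`, it behaves the same whether or not the code is compiled with `-fno-math-errno`.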

The complete benchmark can be found at rnburn/cmath-bechmark.