Written by me, proof-read by an LLM.
Details at end.
Who among us hasn’t looked at a number and wondered, “How many one bits are in there?” No? Just me then?
Actually, this “population count” operation can be pretty useful in areas like data compression algorithms, cryptography, chess, error correction, and sparse matrix representations. How might one write some simple C to return the number of one bits in an unsigned 64-bit value?
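A minimal sketch of the naive approach (the function name is mine, and the code is valid as both C and C++): walk the bits one at a time and count the ones.

```cpp
#include <stdint.h>

// Naive approach: check each bit in turn, counting the ones.
int popcount(uint64_t value) {
    int count = 0;
    while (value != 0) {
        count += value & 1;  // add 1 if the lowest bit is set
        value >>= 1;         // move on to the next bit
    }
    return count;
}
```

In practice you might instead reach for `__builtin_popcountll` (GCC/Clang) or C++20's `std::popcount` from `<bit>`, which can compile down to a single `popcnt` instruction on CPUs that have one.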
A common theme for helping the compiler optimise is to give it as much information as possible: using the right signedness of types, targeting the right CPU model, keeping loop iterations independent, and, for today’s topic, telling it ahead of time how many loop iterations there are going to be.
Taking the range-based sum example from our earlier post on loops, but using a std::span1, we can explore this ability. Let’s take a look at what happens if we use a dynamically-sized span - we’d expect it to look very similar to the std::vector version:
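Something along these lines (a sketch with names of my own; the original post's code may differ slightly):

```cpp
#include <span>

// Sum over a dynamically-sized span: the element count is only known
// at runtime, via the span's size.
int sum(std::span<const int> values) {
    int total = 0;
    for (int value : values)
        total += value;
    return total;
}

// For contrast: with a fixed extent the count is part of the type, so
// the compiler knows the trip count at compile time (16 is arbitrary).
int sum_fixed(std::span<const int, 16> values) {
    int total = 0;
    for (int value : values)
        total += value;
    return total;
}
```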
Loop optimisations often surprise us. What looks expensive might be fast, and what looks clever might be slow.
Yesterday we saw how the compiler canonicalises loops, so it (usually) doesn’t matter how you write them: they’ll come out the same. What happens if we do something a little more expensive inside the loop?
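Purely as an illustration of the kind of thing I mean by “more expensive” (not necessarily what the post itself uses), imagine a multiply per element instead of a plain add:

```cpp
#include <vector>

// Hypothetical "more expensive" loop body: a multiply per element
// rather than a simple addition (a sum of squares).
int sum_of_squares(const std::vector<int>& values) {
    int total = 0;
    for (int value : values)
        total += value * value;
    return total;
}
```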
Which loop style is “best”? This very question led to the creation of Compiler Explorer! In 2011 I was arguing with my team about whether we could switch all our loops from ordinal or iterator-style to the “new” range-for1. I wrote a small shell script to iteratively show the compiler output as I edited code in vim, and the seed of Compiler Explorer was born.
C and C++ have many ways to phrase loops: for(), while(), do..while(), range-for (for (x : container)), STL algorithms like std::for_each, and now range transformations! Let’s see if loop style actually matters for performance.
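For example, here are a few of those spellings applied to the same summing job (function names are mine):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Index-based for loop.
int sum_index(const std::vector<int>& v) {
    int total = 0;
    for (std::size_t i = 0; i < v.size(); ++i) total += v[i];
    return total;
}

// Iterator-style for loop.
int sum_iterator(const std::vector<int>& v) {
    int total = 0;
    for (auto it = v.begin(); it != v.end(); ++it) total += *it;
    return total;
}

// Range-for.
int sum_range_for(const std::vector<int>& v) {
    int total = 0;
    for (int x : v) total += x;
    return total;
}

// STL algorithm with a lambda.
int sum_for_each(const std::vector<int>& v) {
    int total = 0;
    std::for_each(v.begin(), v.end(), [&](int x) { total += x; });
    return total;
}

// Or skip the explicit loop entirely.
int sum_accumulate(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0);
}
```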
I occasionally give presentations to undergraduates, and one of my favourites is taking the students on a journey of optimising a “binary to decimal” routine1. There are a number of tricks, which I won’t go into here, but the opening question I have is “how do you even turn a number into its ASCII representation?”
If you’ve never stopped to think about it, take a moment now to do so; it can be a fun problem.
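If you want a starting point once you've had a think (mild spoiler ahead), the classic answer is repeated division by ten, which hands you the digits least-significant first:

```cpp
#include <algorithm>
#include <string>

// Classic approach: peel off the least significant digit with % 10,
// turn it into its ASCII character by adding '0', then reverse at the
// end because the digits come out backwards.
std::string to_decimal(unsigned value) {
    if (value == 0) return "0";
    std::string result;
    while (value != 0) {
        result += static_cast<char>('0' + value % 10);
        value /= 10;
    }
    std::reverse(result.begin(), result.end());
    return result;
}
```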
Broadly, computers have the same kind of trouble with integer arithmetic that we do: addition and subtraction are quick, multiplication takes a bit longer, and then division is where things start to get really slow. On an average x86, addition is single-cycle (and many can happen at once), multiplication is three or so cycles, and division can take up to a hundred!1 And usually only a single divide can happen at a time, too.
It’s worth avoiding divides, then, and compilers know this too. Let’s start with the easy cases: you probably know that shifting right effectively divides by powers of two (at least for non-negative values). You’ve likely seen return x >> 9; to divide the int x by 512, right? Let’s see what the compiler does if we use the more obvious return x / 512;:
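For reference, the two versions side by side (function names are mine). For a signed int the shift and the division only agree for non-negative values, which is why the compiler's output for the division typically includes a small fix-up:

```cpp
// Divide by 512 with an explicit shift. For negative x this rounds
// towards negative infinity, which is not the same as C's integer
// division (which rounds towards zero).
int div_by_shift(int x) {
    return x >> 9;
}

// The "obvious" version: the compiler turns this into shifts too, but
// with a little extra work so negative inputs round towards zero.
int div_by_constant(int x) {
    return x / 512;
}
```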
Yesterday we talked about how the compiler handles multiplication with a constant on x86. x86 has some architectural quirks (like lea) that give the compiler quite some latitude for clever optimisations. But do other processors have similar fun quirks?
Today we’re going to look at what code gets generated for the ARM processor. Let’s see how our examples come out:
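To ground that, here's a tiny sketch of the kind of multiply-by-constant function involved (name mine), with a note on what AArch64 compilers typically do with it:

```cpp
// Multiply by 3, compiled for AArch64: rather than a mul instruction,
// compilers typically use the shifted-operand form of add, computing
// x + (x << 1) in a single instruction (add w0, w0, w0, lsl #1).
int mulBy3(int x) {
    return x * 3;
}
```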
So far we’ve covered addition, and subtraction mostly follows suit. There’s an obvious next step: multiplication. Specifically, let’s try multiplying by constants on x861. We’ll try several constants: 2, 3, 4, 16, 25 and 522.
Before you look at the code below, make your predictions for what instructions the compiler will pick, then see if you were right or not. Let’s start with x86:
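The functions under test look something like this (names are mine, and I've only sketched the smaller constants here):

```cpp
// Multiply-by-constant examples for x86. Powers of two become plain
// shifts; for constants like 3 and 25 the compiler can lean on lea,
// which computes base + index*scale in one instruction, so 3*x is a
// single lea (x + x*2) and 25*x can be built as 5*(5*x) with two leas.
int mulBy2(int x)  { return x * 2; }
int mulBy3(int x)  { return x * 3; }
int mulBy4(int x)  { return x * 4; }
int mulBy16(int x) { return x * 16; }
int mulBy25(int x) { return x * 25; }
```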
Matt Godbolt is a C++ developer living in Chicago. He works for Hudson River Trading on super fun but secret things. He is one half of the Two's Complement podcast. Follow him on Mastodon or Bluesky.