Written by me, proof-read by an LLM.
Details at end.
So far we’ve covered addition, and subtraction mostly follows suit. There’s an obvious next step: multiplication. Specifically, let’s try multiplying by constants on x86[1]. We’ll try several constants: 2, 3, 4, 16, 25 and 522.
Before you look at the code below, make your predictions for what instructions the compiler will pick, then see if you were right or not. Let’s start with x86:
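The original post shows the source in an embedded Compiler Explorer pane; it looks something like this sketch (the function names here are my own):

```c
#include <stdint.h>

// One function per constant; each multiplies its argument by a
// fixed value. (Names are hypothetical - the post's pane isn't
// reproduced here.)
int64_t mul2(int64_t x)   { return x * 2; }
int64_t mul3(int64_t x)   { return x * 3; }
int64_t mul4(int64_t x)   { return x * 4; }
int64_t mul16(int64_t x)  { return x * 16; }
int64_t mul25(int64_t x)  { return x * 25; }
int64_t mul522(int64_t x) { return x * 522; }
```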
So - it’s mostly lea again! By this point maybe you feel like the Square Hole Girl: for each of those multiplies, I bet you thought, “it will use the mul instruction, surely?” Or, for the powers of two, a shift, right? As you probably know, generally speaking, adds and shifts are much cheaper than multiplication on CPUs[2].
But again, our friendly, versatile lea instruction lets us add one register to another scaled by 1, 2, 4 or 8. The compiler knows this, and builds a multiply by two by adding rdi to itself, multiply by three as rdi + rdi*2, and four as 0 + rdi*4. The various choices here are down to the specifics of the opcode encoding; the compiler avoids encoding the 32-bit constant 0 where it can, as that adds four bytes of 00 to the machine code. You can toggle the “link to binary” option to see that in the pane above.
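Written out as plain C, the lea-shaped arithmetic is equivalent to this (a sketch of what the generated code computes, not the compiler's actual output):

```c
#include <stdint.h>

// lea can compute base + index*scale, with scale 1, 2, 4 or 8.
// Each function below mirrors one of the lea tricks described
// in the post (asm in the comments is my own gloss).
int64_t times2(int64_t x) { return x + x;     /* lea (rdi,rdi), rax   */ }
int64_t times3(int64_t x) { return x + x * 2; /* lea (rdi,rdi,2), rax */ }
int64_t times4(int64_t x) { return 0 + x * 4; /* lea 0(,rdi,4), rax   */ }
```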
Moving on to the multiply by 16, finally the compiler generates a shift! Here again we see the annoying x86 foible where we need to move the value from the input register edi to the output eax and then shift it, as the x86 doesn’t let us do that in one step.
The multiply by 25 is cute: the compiler computes x * 5 by rdi + rdi*4, and then does it again to get x * 5 * 5 = x * 25. Neat!
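In C, the two-step trick amounts to this (my paraphrase of the generated code):

```c
#include <stdint.h>

// x * 25 computed as (x * 5) * 5, where each "* 5" is one
// lea-style "register plus register scaled by 4".
int64_t times25(int64_t x) {
    int64_t x5 = x + x * 4;  // lea (rdi,rdi,4), rax  => x * 5
    return x5 + x5 * 4;      // lea (rax,rax,4), rax  => x * 25
}
```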
Finally, for 522, the compiler seems to accept defeat and uses an actual multiply. Here it’s using the newer three-operand form of imul, so it can output the result of x * 522 directly into eax.
But! We’re cleverer than the compiler. We know that 522 is 512 + 8 + 2, and each of those powers of two can be computed efficiently with a shift; so why not rephrase as (x << 9) + (x << 3) + (x << 1)? Let’s give that a go!
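The hand-“optimised” version might look like this (a sketch; the post’s exact source isn’t reproduced here):

```c
#include <stdint.h>

// 522 = 512 + 8 + 2, so in theory we can build the multiply
// out of three shifts and two adds.
int64_t times522(int64_t x) {
    return (x << 9) + (x << 3) + (x << 1);
}
```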
Well, shucks. The compiler saved us from ourselves. Just like yesterday, it spotted what we were doing. It knows that adding that set of shifted powers of two is tantamount to a multiply. And - more importantly - it knows that for modern x86, the multiply is faster than shifting and adding. So it has our back.[3]
The message here is clear: compilers know all the shift and add tricks for multiplication, and when they’re appropriate. You don’t need to write return x << 9; // multiply by 512 any more - just write the code for humans to consume and let the optimiser do its magic.
See the video that accompanies this post.
This post is day 4 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.
This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.
Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
[1] We’ll look at ARM’s architectural features for efficient arithmetic tomorrow, including multiplication.
[2] Typically for integers: adds are single-cycle, but multiplies take a few cycles, like 3 or 4. uops has all the gory details for x86 processors. Generally speaking, arithmetic is about as difficult for computers as it is for humans: adding and subtracting is easy, multiplication is harder, and division is the trickiest. We’ll look at division later.
[3] If you crank the options way back using -m32 -march=pentium, you’ll see it does use shifts and adds, as multiplication was more expensive back then. But either way, we should let the compiler decide how best to multiply based on which CPU we’re targeting, and stop trying to micro-optimise this kind of thing ourselves.
Matt Godbolt is a C++ developer living in Chicago. He works for Hudson River Trading on super fun but secret things. He is one half of the Two's Complement podcast. Follow him on Mastodon or Bluesky.