ARM's barrel shifter tricks

Written by me, proof-read by an LLM.
Details at end.

Yesterday we talked about how the compiler handles multiplication with a constant on x86. x86 has some architectural quirks (like lea) that give the compiler quite some latitude for clever optimisations. But do other processors have similar fun quirks?

Today we’re going to look at what code gets generated for the ARM processor. Let’s see how our examples come out:

Here we see ARM’s orthogonal and sensible instruction² set, along with its superpower: the barrel shifter. On ARM, many instructions can include a shift of the second operand. While not always completely “free” on modern processors¹, it’s cheap enough that the compiler can use it to avoid separate shift instructions.

Multiplying by 2 is just a shift:

mul_by_2(int):
  lsl w0, w0, #1    ; w0 = w0 << 1
  ret

Multiplying by 3 adds w0 to itself shifted left by 1, all in a single instruction:

mul_by_3(int):
  add w0, w0, w0, lsl #1  ; w0 = w0 + (w0 << 1)
  ret
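
To make the shifted-operand idea concrete, here is a minimal C sketch of what those two instructions compute. The helper `add_lsl` is a hypothetical model invented for illustration, not anything the compiler emits, and I've used `uint32_t` rather than `int` so that wrap-around matches the 32-bit w registers without running into signed-overflow rules.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "add Wd, Wn, Wm, lsl #shift": the second
   operand is shifted left before the add. shift must be in 0..31. */
static uint32_t add_lsl(uint32_t n, uint32_t m, unsigned shift)
{
    return n + (m << shift);
}

/* lsl w0, w0, #1 */
static uint32_t mul_by_2(uint32_t x) { return x << 1; }

/* add w0, w0, w0, lsl #1 */
static uint32_t mul_by_3(uint32_t x) { return add_lsl(x, x, 1); }

/* Sanity check: the shift forms agree with plain multiplication. */
static void check_shift_forms(void)
{
    for (uint32_t x = 0; x < 1000; ++x) {
        assert(mul_by_2(x) == x * 2u);
        assert(mul_by_3(x) == x * 3u);
    }
}
```

Calling `check_shift_forms()` from a test harness should pass silently; the same identity holds for every 32-bit value because unsigned arithmetic wraps.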

Multiplying by 4 and 16 are also simple shifts, but there’s no shortcut for multiplying by 25 or 522: the compiler has to generate a multiply instruction there.
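
One rough way to see why 25 and 522 miss out: a single shift covers constants of the form 2^n, and a single add-with-shift covers 2^n + 1. The sketch below checks for just those two shapes. The function names are made up for illustration, and this is only the arithmetic, not a description of how any real compiler decides what to emit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when k has exactly one bit set, i.e. k == 2^n. */
static bool is_power_of_two(uint32_t k)
{
    return k != 0 && (k & (k - 1)) == 0;
}

/* Hypothetical helper: can x * k be written as one "lsl" (k == 2^n)
   or one "add x, x, x, lsl #n" (k == 2^n + 1)? */
static bool fits_one_shift_or_shift_add(uint32_t k)
{
    return is_power_of_two(k) || (k > 1 && is_power_of_two(k - 1));
}

/* The constants used in this post. */
static void check_post_constants(void)
{
    assert(fits_one_shift_or_shift_add(2));
    assert(fits_one_shift_or_shift_add(3));
    assert(fits_one_shift_or_shift_add(4));
    assert(fits_one_shift_or_shift_add(16));
    assert(!fits_one_shift_or_shift_add(25));
    assert(!fits_one_shift_or_shift_add(522));
}
```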

It’s also interesting that the multiply instruction can’t take a constant operand, so the constant 25 or 522 has to be loaded into a register with a mov before it can be used in the multiply. ARM has a fixed-size instruction format - every instruction is 4 bytes long³, so there’s limited space to pack all the operands in.

On older 32-bit ARM there’s an even cooler trick to let us multiply by one-less-than-a-power-of-two, using rsb (reverse subtract, dest = op2 - op1). If we pick 32-bit armv7 we get to see this in action:

mul_by_7(int):
  rsb r0, r0, r0, lsl #3    ; r0 = (r0 << 3) - r0
  bx lr

Here in a single instruction it calculates result = (8 * x) - x. Cool stuff, but only on 32-bit ARMs. I guess that’s the price of progress?⁴
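
As a sketch of the arithmetic behind rsb, here is a small C model. The helper `rsb_lsl` is hypothetical and exists only to mirror the instruction's operand order; `uint32_t` stands in for the 32-bit registers so wrap-around behaves the same way.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of "rsb Rd, Rn, Rm, lsl #shift": reverse subtract
   computes (second operand, shifted) minus the first operand.
   shift must be in 0..31. */
static uint32_t rsb_lsl(uint32_t n, uint32_t m, unsigned shift)
{
    return (m << shift) - n;
}

/* rsb r0, r0, r0, lsl #3 */
static uint32_t mul_by_7(uint32_t x) { return rsb_lsl(x, x, 3); }

/* The same shape gives x * (2^n - 1) for any shift n in 1..31. */
static void check_rsb_forms(void)
{
    for (uint32_t x = 0; x < 1000; ++x) {
        assert(mul_by_7(x) == x * 7u);
        for (unsigned n = 1; n < 32; ++n) {
            assert(rsb_lsl(x, x, n) == x * ((1u << n) - 1u));
        }
    }
}
```

The inner loop is the general case: shifting by n and subtracting x once gives multiplication by 2^n - 1, of which 7 is the n = 3 instance.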

Different architectures, different tricks: x86 has lea, ARM has the barrel shifter. The compiler knows them all, so we don’t have to.

See the video that accompanies this post.

This post is day 5 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.


This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.


¹ On modern cores like the Cortex-A76, small left shifts (≤4 bits) on add/sub are essentially free, but larger shifts or right shifts can add latency. Still much better than needing separate instructions!
² After Z80 and 6502, ARM was the next ISA I learned and I spent many of my formative years writing in pure ARM assembly language, so it has a very special place in my heart.
³ I’m ignoring the thumb mode here, which uses 2 byte instructions.
⁴ If I’m missing something obvious, please let me know! I’m still learning 64-bit ARM, so there could be something I’m missing here.
