Inlining - the ultimate optimisation

Written by me, proof-read by an LLM.
Details at end.

Sixteen days in, and I’ve been dancing around what many consider the fundamental compiler optimisation: inlining. Not because it’s complicated - quite the opposite! - but because inlining is less interesting for what it does (copy-paste code), and more interesting for what it enables.

Initially inlining was all about avoiding the expense of the call[1] itself, but nowadays inlining enables many other optimisations to shine.

We’ve already encountered inlining (though I tried to limit it until now): on day 8, to get the size of a vector, we called its .size() method. I completely glossed over the fact that while size() is a method on std::vector, we don’t see a call in the assembly code, just the subtraction and shift.

So, how does inlining enable other optimisations? Using ARMv7[2], let’s convert a string to uppercase. We might have a utility change_case function[3] that turns a single character from upper to lower, or lower to upper; we’ll use it in our code:
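The original listing isn’t reproduced here; a minimal sketch consistent with the surrounding description (and with footnote 5’s note that change_case is static) might look like:

```cpp
#include <cstddef>

// Hypothetical reconstruction, not the post's exact listing.
// Converts a single ASCII character to upper or lower case.
static char change_case(char c, bool upper) {
    if (upper) {
        if (c >= 'a' && c <= 'z') return static_cast<char>(c - 32);
    } else {
        if (c >= 'A' && c <= 'Z') return static_cast<char>(c + 32);
    }
    return c;
}

void make_upper(char *string, std::size_t length) {
    while (length--) {
        *string = change_case(*string, true); // `upper` is always true here
        ++string;
    }
}
```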

The compiler decides[5] to inline change_case into make_upper, and then, seeing that upper is always true[4], it can simplify the whole code to:

.LBB0_1:
  ldrb r2, [r0]         ; read next `c`; c = *string;
  sub r3, r2, #97       ; tmp = c - 'a'
  uxtb r3, r3           ; tmp = tmp & 0xff
  cmp r3, #26           ; check tmp against 26
  sublo r2, r2, #32     ; if lower than 26 then c = c - 32
                        ; c = ((c - 'a') & 0xff) < 26 ? c - 32 : c;
  strb r2, [r0], #1     ; store 'c' back; *string++ = c
  subs r1, r1, #1       ; reduce counter
  bne .LBB0_1           ; loop if not zero

There’s no trace left of the !upper case: having inlined the code, the compiler has a fresh copy it can further modify to take advantage of things it knows are true. It does a neat trick of avoiding a branch to check whether the character is uppercase: if (c - 'a') & 0xff[6] is less than 26, it must be a lowercase character. It then conditionally subtracts 32, which has the effect of turning a into A.
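The same range-check trick can be written directly in C++ (a hand-written sketch of the idea, not the compiler’s output):

```cpp
// If (c - 'a'), viewed as an unsigned byte, is below 26, then c must be
// an ASCII lowercase letter: the subtraction wraps every other character
// around to a value of 26 or more. Subtracting 32 maps 'a'..'z' to 'A'..'Z'.
char to_upper_branchless(char c) {
    unsigned char tmp = static_cast<unsigned char>(c - 'a'); // (c - 'a') & 0xff
    return tmp < 26 ? static_cast<char>(c - 32) : c;
}
```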

Inlining gives the compiler the ability to make local changes: the implementation can be special-cased at the inline site because, by definition, there are no other callers of that copy of the code. The special-casing can include propagating values known to be constants (like the upper bool above), and looking for code paths that are unused[7].

Inlining has some drawbacks though: if it’s overused, the code size of your program can grow quite substantially[8]. The compiler has to make its best guess as to whether to inline a function (and the functions that it calls… and so on), based on heuristics about the code-size increase and whether the perceived benefit is worth it. Ultimately it’s a guess, though.

In rare cases accepting the cost of calling a common routine can be a benefit: if there is an unavoidable branch in the routine that’s globally predictable, sometimes having one shared branch site can be better for the branch predictor. In many cases, though, the reverse is true: if there’s a branch in code that’s inlined many times across the codebase then sometimes the (more local) branch history for the many copies of that branch can yield more predictability. It’s…complex[9].

An important consideration for inlining is the visibility of the definition of the function you’re calling (that is, the body of the function). If the compiler has only seen the declaration of a function (e.g. in the case above just char change_case(char c, bool upper);), then it can’t inline it: there’s nothing to inline! In modern C++ with templates and a lot of code in headers, this usually isn’t a problem, but if you’re trying to minimise build times and interdependency this can be an issue[10].
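The visibility problem can be sketched like this (a hypothetical file layout; everything is shown in one listing only so the sketch compiles):

```cpp
// change_case.h — callers see only this declaration:
char change_case(char c, bool upper);

// make_upper.cpp — with only the declaration in view, every iteration
// must emit a real call to change_case; there is nothing to inline
// (absent link-time optimisation).
void make_upper(char *s, unsigned n) {
    while (n--) {
        *s = change_case(*s, true);
        ++s;
    }
}

// change_case.cpp — in the real scenario, this definition lives in a
// separate translation unit, invisible when compiling make_upper.cpp.
char change_case(char c, bool upper) {
    if (upper) return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 32) : c;
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + 32) : c;
}
```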

Frustratingly, inlining is also one of the most heuristic-driven optimisations, with different compilers making reasonable but different guesses about which functions should be inlined. Adding a single line to a function somewhere can have ripple effects throughout a codebase, changing inlining decisions[11].

All that said: Inlining is the ultimate enabling optimisation. On its own, copying function bodies into call sites might save a few cycles here and there. But give the compiler a fresh copy of code at the call site, and suddenly it can propagate constants, eliminate dead branches, and apply transformations that would be impossible with a shared function body. Who said copy paste was always bad?

See the video that accompanies this post.


This post is day 17 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.

This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.


  1. The call and return itself, coupled with the preserving and restoring of registers required for the calling convention. 

  2. Again, this is for simplicity and to avoid vectorisation which, while super important, is something we’ll get to later. 

  3. A bit contrived, and I’m deliberately rolling my own std::toupper etc to avoid locales, and further function calls. 

  4. This is called constant propagation - when the compiler knows a value is constant, it substitutes it everywhere and simplifies the result. I had planned to do a post on this alone, but somehow I didn’t have space for it! 

  5. I deliberately made the change_case function static so it won’t even appear in the output here as it’s otherwise unused. This also strongly hints to the compiler’s optimiser to inline it. If I made it non-static, then make_upper doesn’t change at all (it’s still inlined), but there’s a big (unused) change_case in the output to confuse things. Give it a go and look at the more complex code generated in change_case.

  6. The & 0xff comes from the “Unsigned eXTend Byte” uxtb instruction. 

  7. Also known as dead code elimination, if the compiler can prove that parts of the code are unreachable it can remove them. 

  8. This can have performance effects: the instruction cache is a limited resource so filling it up with lots of copies of essentially the same code can put extra pressure on the memory and decode subsystem of the CPU. 

  9. That is, it’s a nuanced trade-off that depends on the specific code and runtime patterns. The compiler often doesn’t have that kind of information about your code, and so has to guess. Things like Profile Guided Optimisation can help this, but we won’t be covering that in this series. 

  10. Using link-time optimisation (LTO; sometimes called link-time code generation) allows the compiler to inline across translation units. LTO is a powerful technique and is well supported by modern compilers. I always enable LTO for my release builds. 

  11. Quite famously, a single newline added to a function definition once caused a significant Linux performance regression, though that was more a limitation of how GCC estimates the cost of inline assembly during its inlining analysis. 

Filed under: Coding AoCO2025
Posted at 06:00:00 CST on 17th December 2025.

About Matt Godbolt

Matt Godbolt is a C++ developer living in Chicago. He works for Hudson River Trading on super fun but secret things. He is one half of the Two's Complement podcast. Follow him on Mastodon or Bluesky.