Written by me, proof-read by an LLM.
Details at end.
In one of my talks on assembly, I show a list of the 20 most executed instructions on an average x86 Linux desktop. All the usual culprits are there: mov, add, lea, sub, jmp, call and so on, but the surprise interloper is xor - “eXclusive OR”. In my 6502 hacking days, the presence of an exclusive OR was a sure-fire indicator you’d either found the encryption part of the code, or some kind of sprite routine. It’s surprising, then, that a Linux machine just minding its own business would be executing so many.
That is, until you remember that compilers love to emit a xor when setting a register to zero:
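A minimal stand-in for that kind of code, assuming nothing more than a function that returns zero (the name `zero` is just a placeholder, not taken from the original example):

```c
/* Hypothetical example: the simplest function that needs a zeroed register.
   On x86-64 an int return value travels back to the caller in eax. */
int zero(void) {
    return 0;
}
```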
We know that exclusive-OR-ing anything with itself generates zero, but why does the compiler emit this sequence? Is it just showing off?
In the example above, I’ve compiled with -O2 and enabled Compiler Explorer’s “Compile to binary object” so you can view the machine code that the CPU sees, specifically:
31 c0 xor eax, eax
c3 ret
If you change GCC’s optimisation level down to -O1 you’ll see:
b8 00 00 00 00 mov eax, 0x0
c3 ret
The much clearer, more intention-revealing mov eax, 0 to set the EAX register to zero takes up five bytes, compared to the two of the exclusive OR. By using a slightly more obscure instruction, we save three bytes every time we need to set a register to zero, which is a pretty common operation. Saving bytes makes the program smaller, and makes more efficient use of the instruction cache.
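The arithmetic can be checked with a small sketch that stores the two encodings from the listings above as byte arrays (the array and function names are illustrative only):

```c
#include <stddef.h>

/* Machine-code bytes copied from the two listings above. */
static const unsigned char xor_eax_eax[] = {0x31, 0xc0};
static const unsigned char mov_eax_0[]   = {0xb8, 0x00, 0x00, 0x00, 0x00};

/* Bytes saved each time the xor form replaces the mov form. */
size_t bytes_saved(void) {
    return sizeof mov_eax_0 - sizeof xor_eax_eax;
}
```

Here `bytes_saved()` works out as 5 - 2, the three bytes mentioned above.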
It gets better though! Since this is a very common operation, x86 CPUs spot this “zeroing idiom” early in the pipeline and can specifically optimise around it: the out-of-order tracking system knows that the value of “eax” (or whichever register is being zeroed) does not depend on the previous value of eax, so it can allocate a fresh, dependency-free zero register renamer slot. And, having done that, it removes the operation from the execution queue - that is, the xor takes zero execution cycles![1] It’s essentially optimised out by the CPU!
You may wonder why you see xor eax, eax but never xor rax, rax (the 64-bit version), even when returning a long:
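A sketch of the sort of function in question, assuming an x86-64 Linux target where long is 64 bits (`zero_long` is a placeholder name):

```c
/* Hypothetical example: on x86-64 Linux, long is 64 bits wide and is
   returned in rax, yet the zero can still be produced via eax. */
long zero_long(void) {
    return 0;
}
```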
In this case, even though rax is needed to hold the full 64-bit long result, by writing to eax, we get a nice effect: Unlike other partial register writes, when writing to an e register like eax, the architecture zeros the top 32 bits for free. So xor eax, eax sets all 64 bits to zero.
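The difference between a 32-bit write and a narrower partial write can be modelled in plain C. This is only a model of the register-write rules, not real register access, and the function names are made up for illustration:

```c
#include <stdint.h>

/* Writing a 32-bit register (e.g. eax) zero-extends into the full 64 bits. */
uint64_t write_32(uint64_t old_rax, uint32_t value) {
    (void)old_rax;  /* the old contents do not survive */
    return (uint64_t)value;
}

/* Writing a 16-bit register (e.g. ax) leaves the upper 48 bits untouched. */
uint64_t write_16(uint64_t old_rax, uint16_t value) {
    return (old_rax & ~UINT64_C(0xFFFF)) | value;
}
```

In this model, writing zero through `write_32` clears everything regardless of the old value, while `write_16` keeps whatever was in the upper bits, which is why the narrower forms are no use for zeroing a whole register.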
Interestingly, when zeroing the “extended” numbered registers (like r8), GCC still uses the d (doubleword, i.e. 32-bit) variant:
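One way to provoke a zeroed r8, sketched under the assumption of the System V AMD64 calling convention used on Linux, where the fifth integer argument is passed in r8. The names `callee`, `caller` and `last_fifth` are hypothetical; in a real experiment you would leave `callee` declared but not defined, so the compiler cannot inline the call away:

```c
/* Records what the callee received as its fifth argument. */
long last_fifth = -1;

/* Under the System V AMD64 ABI, e arrives in r8. */
void callee(int a, int b, int c, int d, long e) {
    (void)a; (void)b; (void)c; (void)d;
    last_fifth = e;
}

/* Passing 0 as the fifth argument means r8 has to be zeroed before
   the call (unless the compiler inlines the call). */
void caller(void) {
    callee(1, 2, 3, 4, 0);
}
```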
Note how it’s xor r8d, r8d (the 32-bit variant) even though, with the REX prefix (here 45), it would be the same number of bytes to xor r8, r8 at the full width. It probably makes something easier in the compilers, as clang does this too.
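To see why the byte count is the same either way, it helps to pick the REX prefix apart. A small decoder sketch, assuming the standard REX bit layout of 0100WRXB (the struct and function names are illustrative):

```c
#include <stdint.h>

/* The four flag bits of a REX prefix (layout 0100WRXB). */
struct rex {
    unsigned w;  /* 1 = 64-bit operand size */
    unsigned r;  /* extends the ModRM reg field */
    unsigned x;  /* extends the SIB index field */
    unsigned b;  /* extends the ModRM r/m (or SIB base) field */
};

/* Extract the W, R, X and B bits from a REX prefix byte. */
struct rex decode_rex(uint8_t byte) {
    struct rex out;
    out.w = (byte >> 3) & 1u;
    out.r = (byte >> 2) & 1u;
    out.x = (byte >> 1) & 1u;
    out.b = byte & 1u;
    return out;
}
```

Decoding 0x45 gives W=0 with R and B set, which is what reaches r8d for both operands. The full-width form would use 0x4D instead, which differs only in the W bit, so the prefix is needed in both cases and neither encoding is shorter.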
xor eax, eax saves you code space and execution time! Thanks compilers!
See the video that accompanies this post.
This post is day 1 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.
This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.
Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.
[1] It still has to retire, so some on-chip resources are still allocated to it.
Matt Godbolt is a C++ developer living in Chicago. Follow him on Mastodon or Bluesky.