Pop goes the...population count?

Written by me, proof-read by an LLM.
Details at end.

Who among us hasn’t looked at a number and wondered, “How many one bits are in there?” No? Just me then?

Actually, this “population count” operation can be pretty useful in some cases like data compression algorithms, cryptography, chess, error correction, and sparse matrix representations. How might one write some simple C to return the number of one bits in an unsigned 64 bit value?

One way might be to loop 64 times, checking each bit and adding one if set. Or, equivalently, shifting that bit down and adding it to a running count: sometimes the population count operation is referred to as a “horizontal add” as you’re adding all the 64 bits of the value together, horizontally. There are “divide and conquer” approaches too, see the amazing Stanford Bit Twiddling Hacks page for a big list.

My favourite way is to loop while the value is non-zero, and use a cute trick to “clear the bottom set bit”. The loop count is then the number of set bits. How do you clear the bottom set bit? You and a value with itself decremented!

value       : 11010100
subtract 1  : 11010011
& value     : 11010000

If you try some examples on paper, you’ll see that subtracting one always moves the bottom set bit down by one place, setting all the bits from there down. Everything else is left the same. Then when you and, the bottom set bit is guaranteed to be and-ed with zero, but everything else remains. Great stuff!

All right, let’s see what the compiler makes of this:

The core loop is pretty much what we’d expect, using the lea trick to get value - 1, anding and counting:

.L3:
  lea rax, [rdi-1]          ; rax = value - 1
  add edx, 1                ; ++result
  and rdi, rax              ; value &= value - 1
  jne .L3                   ; ...while (value)

Great stuff, but we can do better. By default gcc and clang both target some kind of “generic” processor which influences which instructions they can use. We’re compiling for Intel here, and gcc’s default is somewhere around Intel’s “nocona” architecture, from 2004. Unless you are running vintage hardware you can probably change it to something better. Let’s pick the super up-to-date “westmere” (from 2010…) using -march=westmere and see what happens1:

Wow! The entire routine has been replaced with a single instruction - popcnt rax, rdi3. When I first saw this optimisation I was blown away: the compiler recognises a relatively complex loop as being functionally equivalent to a single instruction. Both gcc and clang can do this, and within Compiler Explorer you can use the optimisation pipeline viewer in clang to see that clang’s “loop deletion pass” is responsible for this trick:

Screenshot of CE showing the opt pipeline viewer with the loop being replaced with a call to @llvm.ctpop.i64
Compiler Explorer's Opt Pipeline View

Compilers canonicalise code too, so some similar population count code will also be turned into a single instruction, though sadly not all. In this case, it’s probably better to actually use a standard C++ routine to guarantee the right instruction as well as reveal your intention: std::popcount. But even if you don’t, the compiler might just blow your mind with a single instruction anyway.

See the video that accompanies this post.


This post is day 11 of Advent of Compiler Optimisations 2025, a 25-day series exploring how compilers transform our code.

This post was written by a human (Matt Godbolt) and reviewed and proof-read by LLMs and humans.

Support Compiler Explorer on Patreon or GitHub, or by buying CE products in the Compiler Explorer Shop.


  1. Interestingly, if you select other architectures like -march=sandybridge you’ll see the compiler emits a seemingly redundant xor eax, eax before the popcnt eax, edi. This is to work around a performance bug in the CPU2

  2. The register renamer / dependency tracking in the out of order engine incorrectly believes that popcnt has a dependency on the previous value of the destination register. The xor eax, eax breaks this dependency. 

  3. While this post is using Intel, ARM has popcount, and some RISC-V variants have cpop, which do the same thing. 

Filed under: Coding AoCO2025
Posted at 06:00:00 CST on 11th December 2025.

About Matt Godbolt

Matt Godbolt is a C++ developer living in Chicago. He works for Hudson River Trading on super fun but secret things. He is one half of the Two's Complement podcast. Follow him on Mastodon or Bluesky.