Branch prediction - part two — Matt Godbolt’s blog

Branch prediction - part two

I’m in the middle of an investigation of the branch predictor on newer Intel chips. Read the previous article to get some background.

Where I left off I’d just decided to look into static prediction myself. I’ve previously used Agner Fog’s tools to do these kinds of investigations, and so that was my first stop.

Agner’s tools install a kernel module to give user-mode access to the hardware performance monitoring counters inside the Intel chips. Each CPU has four counters that can be used to count one of a number of internal CPU events. Agner’s tools then run micro-benchmarks while counting the various internal things going on inside the processor. This seems perfect for us!

Each CPU microarchitecture has a different set of counters available, so the first stop was to pick a CPU and choose the counters. As a starting point for my investigations I picked my laptop’s CPU – an Arrendale CPU (Core(TM) i5 CPU M 520 @ 2.40GHz). This is a mobile version of the Westmere chip. Later I’d run similar investigations on other CPUs.

Digging out the docs (Intel Architectures Software Developer Manual Volume 3B, part 2, pages 362-393), there’s a bewildering array of useful-sounding counters.

Back to Agner’s tools and my plan for the test started to form – I would write a series of microbenchmarks with:

Forward branches that were not taken
Backward branches that were not taken
Forward branches that were taken
Backward branches that were taken

Within each microbenchmark I’d count the number of instructions and branch mispredictions (and some other useful-sounding counters). The expectation is that on the first run of the benchmark we would see a number of mispredictions, as the Branch Target Buffer (BTB) would miss for every branch. Then the decoder would be the first to spot the branch, and it has a choice as to whether it re-steers the fetcher to the branch destination or not. Either way, once the instruction retires we get to see if the decoder made the right guess or not.

The expectation is if the decoder assumes the branch is not taken, then we should see no mispredictions for forward-not-taken and backward-not-taken, but we should see pretty much a 100% misprediction rate for the taken cases.

If the decoder assumes backwards branches are loops and are likely to be taken, then we’d expect forward-not-taken and backwards-taken to have no mispredicts, and the other two cases would be close to 100% misprediction rate.

After a number of false starts with Agner’s set-up (and noting that some of his counter definitions seem to be incorrect for Westmere/Arrendale), I had some initial results. The test was 1000 back-to-back branches:

Arrendale

Test	Instructions	Mispredictions
Ahead NT	1501707	3
Behind NT	1501707	1011
Ahead Taken	1301707	1003
Behind Taken	900507	2

Pretty compelling evidence for loop-style static prediction here!¹

I ran the same test on an Ivy Bridge (E5-2667 v2) and a Haswell (E5-1650 v3):

Ivy Bridge

Test	Instructions	Mispredictions
Ahead NT	1501707	374
Behind NT	1501707	114
Ahead Taken	1301707	818
Behind Taken	900507	297

Haswell

Test	Instructions	Mispredictions
Ahead NT	1501707	133
Behind NT	1501707	119
Ahead Taken	1301707	1123
Behind Taken	900507	1086

Here the evidence is a little more puzzling. The Ivy seems to weakly predict ahead as not taken, but there’s enough variability to make me wonder if something else is going on. For Haswell it seems no branch is ever predicted as taken. Interesting, and worth more investigation.

The full results – including some notes I made and the other counters – are available in a Google Sheet.

By this point I had resolved to improve upon Agner’s tools which are a little finicky and easy to misuse. To that end I branched his code and started on a little refactoring. That refactored code is available on GitHub as the agner project. The Google Sheet link above links to the git hash of the exact point at which the data was collected, should you wish to try reproducing.

So: job done? Seems the Arrendale does static prediction of loop-like branches, but newer Intels don’t, so case closed?

Not quite.

Agner’s tests run each microbenchmark multiple times, back-to-back. By default 10 iterations are performed. Obviously, once the BPU has learned about the whereabouts of each branch, the misprediction rate drops to essentially zero. This was useful initially in ensuring I was using the right counters: after the first iteration the misprediction rate ought to drop to zero.

But then I thought I should be able to “scramble” the BPU in-between iterations to get a consistent misprediction rate on each iteration. I wrote some code (ScrambleBTB) to flood the CPU with loads of branches in an effort to evict the whole BTB. This I placed between iterations (but outside of the counter regions).

To my surprise it took a lot more effort than I thought to make the routine effective: and I’m still not quite sure what is actually required to trick the BTB properly.

In order to find out what’s going on, I decided to further lift the lid and try and work out the size, arrangement and eviction policies of the BTB in my Arrendale.