Over the last week or so I’ve been investigating the static branch prediction on modern Intel processors. A thread on the excellent Mechnical Sympathy mailing list got me thinking about it: a claim was made that static prediction is still used on Intel processors; and my understanding from Agner Fog’s excellent resources is that newer Intel processors no longer do so.
This has led to quite an odyssey of understanding, which I’m still embroiled in; so forgive the length of this post and the fact that it’s the first in a series…
So what’s branch prediction? And what’s static prediction?
As you may know, modern processors have long pipelines composed of Fetch / Decode / Execute / Retire1 stages. An instruction is fetched, then decoded, then executed, and finally retired. This process is overlapped between adjacent instructions; so while one is retiring, the next is executing, the next is decoding and another is being fetched.
On older CPUs these stages would be sequential2: most instructions would take several cycles before the next even started, so pipelining is a big win. However, branches throw a spanner into the works: when a branch gets to the execution stage it changes what instruction will be executed next. This means anything in the Fetch and Decode stage needs to be thrown away – wasted work!
This is where branch prediction comes in: something ahead of even the Fetch stage makes a guess as to where the branches might be – and in the case of conditional branches, whether they’re going to be taken or not. This stage is the branch predictor (or branch prediction unit, BPU).
The BPU has two jobs then: before the instruction stream has even been decoded it must make a guess whether there are any branches coming up. For unconditional branches, it makes a guess where the branch is going, and steers the fetcher accordingly. For conditional branches, it makes a guess whether the branch is going to be taken too.
On Intel CPUs whose pipelines are far more complex and often have 10+ stages between the fetcher and the execution unit, the BPU’s job is critical for performance. Intel have understandably put a lot of effort into making the BPU very good at its job. Older CPUs (Pentium M era) have been reversed engineered to some extent, but Westmere and beyond haven’t really been looked at3.
Intel processors fetch instructions in blocks of 16 bytes every cycle and each block can contain several branches. As best is understood, a cache-like structure called the Branch Target Buffer (BTB) is used to look up previously-seen branches. Just like a memory cache, the BTB has a number of sets, and each set has a number of ways. Some number of bits of the current program counter are used to pick a set within the BTB, and then another subset of the bits is compared with the tags in each of the ways of the set. Any hits found contain information about potential branches. Because there can be multiple matching branches in a set, some logic may have to pick the “first” encountered branch. For conditional branches, a second system is used to predict the outcome. I won’t go into too many details about this part in this post.
The guesses of the branch destination are sent on, and then in the decode stage, those guesses are checked against the actual decoded instruction stream. If a branch destination is wrong – or if a non-branch was mis-guessed as a branch – it’s noted there, the BTB is corrected, and the fetchers are re-steered to the right destination.
It’s at this point the “static prediction” comes in: If the decoder spots a branch that the BPU hadn’t predicted, it has to re-steer the fetcher. If it’s a conditional, the decoder gets a chance to pick whether it’s predicted taken or not. This guess is made based on static rules instead of any kind of knowledge about that particular branch.
What might such a static prediction algorithm be? The most obvious is “predict not taken” – just assume the fetcher should carry on to the instruction beyond the branch.
Another sensible static prediction algorithm is to assume conditional branches to previous addresses are the branch at the end of a loop and so are more likely to be taken than not. Conditional branches forward are considered not taken. This latter algorithm is what the Pentium Ms and some earlier processors used, according to the documentation.
A final choice is kind of not really a choice: instead use whatever prediction the dynamic “is it taken or not” predictor gives for this branch, even though it may know nothing about this branch.
So, returning to the point in hand: I was reading the Mechanical Sympathy thread and – having spent rather a long time reading about and thinking about this kind of thing – I had an xkcd moment:
And so began a long process of me trying to work out what type of static prediction modern processors use. Over the next few days I’ll blog about how I went about this, and what I’ve found so far.
Next: some results.
This of course is a huge simplification as this post is already getting gargantuan. Modern Intels have probably 10-15 cycles’ worth of cycles in its front end before it even reaches the even-more-complex Out-Of-Order system. If your interest is piqued in this, check out my YouTube video on the subject. ↩
Though even the simple (but awesome) 6502 had a very simple pipeline where the fetch of the next instruction was overlapped with the execution of the previous. ↩
Though a friend has told me there’s some research in a paper I haven’t yet been able to get hold of: “Shah Mohammad Faizur Rahman, Zhe Wang, and Daniel A. Jiménez, “Studying Microarchitectural Structures with Object Code Reordering”, Proceedings of the 2009 Workshop on Binary Instrumentation and Applications (WBIA), December, 2009.” ↩