Thomas has a very specific and important job. He is a signalman at a very busy railroad. A lot of trains come by, and Thomas has to send every train filled with coal to the left and every train filled with metal to the right.
Let's create a small illustration:
Coal   Metal
   \   /
    \ /
     |    . This is Thomas
     |
     O      This is a train filled either with coal or metal
     |
Thomas has to check what is inside the train and, depending on the outcome, flick the lever so that the train goes left when it is filled with coal or right when it is filled with metal. Both checking and flicking the lever cost some time.
Let's say C indicates a train filled with coal and M a train filled with metal. On Monday at 9 o'clock Thomas knows there are 8 trains coming, and they are ordered as follows:
Coal   Metal
   \   /
    \ /
     |    . This is Thomas
     |
     C
     |
     C
     |
     M
     |
     C
     |
     M
     |
     C
     |
     M
     |
     M
The order seems kind of random, so Thomas has to check every train and pull the lever quite often. Because of that, checking and switching the lever took him 10 minutes.
Now in the afternoon the coal typically comes first, because the lunch break is over sooner, and directly afterwards comes the metal.
Coal   Metal
   \   /
    \ /
     |    . This is Thomas
     |
     C
     |
     C
     |
     C
     |
     C
     |
     M
     |
     M
     |
     M
     |
     M
Now Thomas checks the first coal trains and "knows" that further coal trains will be coming. So he doesn't have to flick the lever, nor does he have to check whether the next one is coal. After half of the trains are through he also knows metal will be coming, so he flicks the lever a single time. Now the whole work took him only 2 minutes instead of 10.
Back to our computer. Following the analogy, Thomas is your CPU determining whether a condition is met (does the train carry coal or metal?). If the condition is met, it has to do some work. But what does "condition met" mean for a CPU?
Jumps
Let's have a look at this simple code:
private static void PrintHello(bool shouldPrint)
{
    if (shouldPrint)
    {
        System.Console.Write("Hello");
    }
}
The C# compiler and the JIT will translate this into the following machine instructions. Don't worry, you don't have to understand all of them 😉
PrintHello(Boolean)
L0000: test cl, cl
L0002: je short L000f
L0004: mov ecx, [0x8d822b8]
L000a: call System.Console.Write(System.String)
L000f: ret
- L0000: test cl, cl performs a bitwise AND of cl with itself. The result is thrown away; the instruction only sets the CPU flags, and the zero flag is set exactly when cl is 0, i.e. when shouldPrint == false.
- L0002: je means "jump if equal", which is taken when the zero flag is set. So if shouldPrint == false we jump to the address L000f.
- That means if the condition is not met, we just return; otherwise we call System.Console.Write.
Kind of a goto thing going on here.
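To relate that back to C#, here is a rough, hand-written approximation of what the jitted code does. The method name and the label are made up for illustration; the JIT works with raw addresses such as L000f, not labels.
private static void PrintHelloLowered(bool shouldPrint)
{
    if (!shouldPrint)                  // test cl, cl + je short L000f
        goto End;

    System.Console.Write("Hello");     // mov ecx, ... + call System.Console.Write

End:
    return;                            // ret
}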
What does a jump cost?
It depends 😉
But seriously, that is hard to answer. There are multiple factors which can play a major role:
- processor architecture (x86, ARM, ...)
- Do we have a conditional branch (je for example) or an unconditional jmp?
The cost itself can vary from a single cycle to around 20 cycles. But this is not the only thing that comes into play here: pipelining.
Pipelining
Modern processors use a technique called pipelining: they work on multiple instructions at the same time. While the current instruction is executing, the processor can already fetch and decode later instructions.
BUT: conditional jumps in particular limit this "overlapped execution", because the processor does not know which instructions to fetch next.
Solution: the processor predicts which path to take (i.e. whether or not your condition will evaluate to true). This is called branch prediction.
Well, that is not entirely true, but for the sake of simplicity we live with that for now. Just a small thought experiment: imagine our CPU predicts the wrong branch in code that has side effects, like the following:
private void Do(int a)
{
    if (a > 2)
    {
        a--;
    }
    else
    {
        a++;
    }
}
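To get a feeling for what "predicting the path" can look like, here is a toy sketch of a 2-bit saturating counter, one of the simplest prediction schemes. The class is purely illustrative; real predictors live in hardware, track many branches at once, and are far more sophisticated.
// Toy model of a 2-bit saturating counter branch predictor (illustration only).
public class TwoBitPredictor
{
    // 0 = strongly not taken, 1 = weakly not taken,
    // 2 = weakly taken,       3 = strongly taken
    private int _state = 2;

    // Predict "taken" while the counter sits in one of the two upper states.
    public bool PredictTaken() => _state >= 2;

    // After the branch has actually executed, nudge the counter one step
    // towards the observed outcome, saturating at 0 and 3.
    public void Update(bool actuallyTaken)
    {
        _state = actuallyTaken
            ? System.Math.Min(3, _state + 1)
            : System.Math.Max(0, _state - 1);
    }
}
For a regular pattern like the afternoon schedule (C C C C M M M M) such a counter only mispredicts around the switch-over, much like Thomas; for a random pattern it is wrong far more often.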
Sorted array vs. unsorted array
To see this in action, we can set up the following benchmark. I use BenchmarkDotNet, which is an awesome tool:
using System;
using BenchmarkDotNet.Attributes;

public class Benchi
{
    private const int ArraySize = 65535;

    private int[] unsortedArray = new int[ArraySize];
    private int[] sortedArray = new int[ArraySize];

    public Benchi()
    {
        // Fill both arrays with the same random values between 0 and 255 ...
        var random = new Random();
        for (var i = 0; i < ArraySize; i++)
        {
            unsortedArray[i] = sortedArray[i] = random.Next(256);
        }

        // ... and sort only one of them.
        Array.Sort(sortedArray);
    }

    [Benchmark(Baseline = true)]
    public int Unsorted() => SumOfEverythingAbove128Unsorted(unsortedArray);

    [Benchmark]
    public int Sorted() => SumOfEverythingAbove128Unsorted(sortedArray);

    private static int SumOfEverythingAbove128Unsorted(int[] data)
    {
        var sum = 0;
        for (var i = 0; i < data.Length; i++)
        {
            // This is the branch the CPU tries to predict.
            if (data[i] >= 128)
            {
                sum += data[i];
            }
        }

        return sum;
    }
}
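To actually run it, a minimal entry point like the following will do. This Program class is just a sketch of mine, not part of the benchmark above; it assumes the BenchmarkDotNet NuGet package is referenced and the project is built in Release mode.
using BenchmarkDotNet.Running;

public class Program
{
    public static void Main(string[] args)
    {
        // Discovers and runs all [Benchmark] methods in Benchi.
        BenchmarkRunner.Run<Benchi>();
    }
}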
The results are here:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1288 (21H1/May2021Update)
Intel Core i7-7820HQ CPU 2.90GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET SDK=6.0.100
[Host] : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
DefaultJob : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
| Method | Mean | Error | StdDev | Ratio |
|--------- |----------:|---------:|----------:|------:|
| Unsorted | 371.15 us | 7.401 us | 11.077 us | 1.00 |
| Sorted | 55.44 us | 1.924 us | 5.674 us | 0.16 |
Wow, more than 6 times the speed. But how do we know this is really because of branch prediction and not some other mechanism? Well, there is an awesome extension for BenchmarkDotNet called BenchmarkDotNet.Diagnostics.Windows.
Right now it only works under Windows, and you have to start the benchmark with admin rights to make it work!
We only have to annotate our test class like this:
[HardwareCounters(HardwareCounter.BranchMispredictions)]
public class Benchi
And this adds another column to our benchmark output:
| Method | Mean | Error | StdDev | Ratio | BranchMispredictions/Op |
|--------- |----------:|---------:|----------:|------:|------------------------:|
| Unsorted | 371.15 us | 7.401 us | 11.077 us | 1.00 | 32,958 |
| Sorted | 55.44 us | 1.924 us | 5.674 us | 0.16 | 14 |
Because of the purely random nature of the unsorted array, our CPU misses a lot of its predictions. With the sorted one we have almost no misses, and the execution time is just a fraction of the unsorted one.
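If sorting the data is not an option, another way to take advantage of this knowledge is to remove the data-dependent branch altogether. The following variant is only a sketch of mine (it is not part of the benchmark above) and assumes the values stay in the 0-255 range used here:
private static int SumOfEverythingAbove128Branchless(int[] data)
{
    var sum = 0;
    for (var i = 0; i < data.Length; i++)
    {
        // (data[i] - 128) >> 31 is -1 (all bits set) for values below 128 and 0 otherwise,
        // so the mask keeps data[i] only when it is >= 128, with no conditional jump required.
        var mask = ~((data[i] - 128) >> 31);
        sum += data[i] & mask;
    }

    return sum;
}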
Conclusion
Branch prediction is one of many mechanisms your CPU has in place to optimize the execution of your code. You can use that to your advantage.
If you want to know more about which types of branch prediction there are, here you go.
The repository with the shown example can be found here.
A comprehensive overview over all examples used in that blog can be found here.