CPU-Bound Benchmarks



Proper design and analysis of CPU-bound benchmarks require knowledge of different runtime and hardware “features” that can affect performance:

  • Registers and Stack

    When the JIT compiler keeps intermediate values in registers and when it uses the stack for them.

  • Inlining

     How the JIT compiler can inline your methods and why it’s important.

  • Instruction-Level Parallelism (ILP)

    One of the most important hardware features; it allows processing multiple instructions at the same time inside a single thread.

  • Branch Prediction

    The ability of modern CPUs to predict which branches will be taken in your programs. Correct predictions improve conditions for ILP. This matters in the context of benchmarking because the input data can significantly affect a method’s performance depending on the prediction rate.

  • Arithmetic

    Problems we can get with benchmarks that use arithmetic operations. We will talk about hardware (floating-point numbers and IEEE 754) and runtime (different environments and JIT optimizations) features.

  • Intrinsics

    Cases when a JIT compiler can generate a “smart” native implementation for specific methods and statements.


Registers and Stack


When we have a local variable, the JIT compiler can put it into registers or on the stack. Operations with registers are usually much faster than operations with stack values.
In most cases, when we use a struct value in local variables, the JIT compiler keeps its fields on the stack. In some special cases, the fields can be saved into registers. Such an approach is known as struct promotion or scalar replacement. It’s implemented in RyuJIT, but you can’t manually enable or disable this feature for a particular method.



A promoted struct has to follow some rules (a short sketch of struct layouts that do and do not satisfy them follows the list):
  • It must have only primitive fields.
  • It must not be an argument or a return value that is passed in registers.
  • It can’t be larger than 32 bytes.
  • It can’t have more than 4 fields.

LOCAL VARIABLES

“Introduce a local variable” is a popular refactoring that can improve the readability of your code. However, it can also affect performance. Consider the following BenchmarkDotNet-based benchmarks:

public struct Struct
{
  public Struct(uint? someValue)
  {
    SomeValue = someValue;
  }
  public uint? SomeValue { get; }
}
public class Benchmarks
{
  [Benchmark(Baseline = true)]
  public uint? Foo()
  {
    return new Struct(100).SomeValue;
  }
  [Benchmark] // faster
  public uint? Bar()
  {
    Struct s = new Struct(100);
    return s.SomeValue;
  }
}

Here we have two benchmarks: Foo and Bar. Both methods do the same thing: they create an instance of Struct and return its only field. However, Bar differs from Foo: it saves the struct instance to a local variable instead of using it directly in the return expression. The performed logic is identical in both cases; we have only minor changes at the C# level.

Here is an example of results:
 Method |     Mean |    StdDev | Ratio |
------- |---------:|----------:|------:|
    Foo | 6.597 ns | 0.0433 ns |  1.00 |
    Bar | 4.975 ns | 0.0439 ns |  0.75 |

As you can see, Bar works ~20–30% faster than Foo. How is that possible?
If we look at the IL code, there are some minor differences between these methods. Foo creates Struct via newobj, loads the result into a local variable, and loads the address of this variable. Meanwhile, Bar creates Struct via call (which saves the result to a local variable) and instantly loads the address of this variable. Both implementations are equivalent, but they use different IL instructions, and that is enough to change the decisions RyuJIT makes about whether the struct fields live in registers or on the stack.

TRY-CATCH


Proper exception handling is important if you want to develop stable .NET applications. A lot of developers put try-catch blocks here and there “just in case.” They don’t expect a performance penalty because exceptions are considered rare events. It may seem that if the source code doesn’t throw any exceptions, the try-catch overhead shouldn’t be noticeable. Unfortunately, this is not always true because the JIT compiler can modify the generated native code when a try-catch block is added. Consider the following BenchmarkDotNet-based benchmarks:
public class Benchmarks
{
  private const int N = 93;
  [Benchmark(Baseline = true)]
  public long Fibonacci1()
  {
    long a = 0, b = 0, c = 1;
    for (int i = 1; i < N; i++)
    {
      a = b;
      b = c;
      c = a + b;
    }
    return c;
  }
  [Benchmark]
  public long Fibonacci2() // with try-catch
  {
    long a = 0, b = 0, c = 1;
    try
    {
      for (int i = 1; i < N; i++)
      {
        a = b;
        b = c;
        c = a + b;
      }
    }
    catch {}
    return c;
  }
}

Here is an example of results:
     Method |      Mean |    StdDev | Ratio |
----------- |----------:|----------:|------:|
 Fibonacci1 |  41.07 ns | 0.1446 ns |  1.00 |
 Fibonacci2 | 102.93 ns | 0.3394 ns |  2.50 |

Fibonacci2 works much slower than Fibonacci1 because read/write operations with stack values take much more time than operations with registers. The try-catch block forced RyuJIT to put the c variable on the stack instead of a register, which spoiled the method performance.

NUMBER OF CALLS

The number of calls in a method is an important factor for some JIT compiler heuristics. The overhead of these calls can be small, but it can force the JIT compiler to change the generated native code for other statements in the same method. Consider the following BenchmarkDotNet-based benchmarks:
public class X {}
[LegacyJitX86Job]
public class Benchmarks
{
  private const int N = 100001;
  [Benchmark(Baseline = true)]
  public double Foo()
  {
    double a = 1, b = 1;
    for (int i = 0; i < N; i++)
      a = a + b;
    return a;
  }
  [Benchmark]
  public double Bar()
  {
    double a = 1, b = 1;
    new X(); new X(); new X();
    for (int i = 0; i < N; i++)
      a = a + b;
    return a;
  }
}

Here we have two methods: Foo and Bar. Both methods add one double variable to another one in a loop. However, the Bar method has three additional constructor calls.

Results

Here is an example of results (Windows 10.0.17763.195, Intel Core i7-6700HQ CPU 2.60GHz, .NET Framework 4.7.2, LegacyJIT-x86 v4.7.3260.0):
 Method |     Mean |    StdDev | Ratio |
------- |---------:|----------:|------:|
    Foo | 103.5 us | 0.4686 us |  1.00 |
    Bar | 309.7 us | 1.4324 us |  2.99 |
The described performance phenomenon in the preceding example is valid only for LegacyJIT-x86; you will not observe a performance drop for this case study with other JIT compilers. However, the number of calls in a method can still be used by any JIT compiler as a factor for different optimizations. In general, you shouldn’t try to optimize methods by reducing the number of additional calls: this factor is important only in some specific cases. If adding or removing a call leads to unexpected performance changes (larger than the expected call duration), you should check how these calls affect the generated native code of the whole method.
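
One practical way to perform such a check is BenchmarkDotNet’s disassembly diagnoser, which dumps the generated native code for each benchmark. Here is a minimal sketch (the class name is hypothetical, the bodies are taken from the listing above, and the available diagnoser options and supported platforms vary between BenchmarkDotNet versions):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class X {}

[DisassemblyDiagnoser] // dumps the generated native code for each benchmark
public class CallCountBenchmarks
{
  private const int N = 100001;
  [Benchmark(Baseline = true)]
  public double Foo()
  {
    double a = 1, b = 1;
    for (int i = 0; i < N; i++)
      a = a + b;
    return a;
  }
  [Benchmark]
  public double Bar()
  {
    double a = 1, b = 1;
    new X(); new X(); new X();
    for (int i = 0; i < N; i++)
      a = a + b;
    return a;
  }
}

public class Program
{
  public static void Main() => BenchmarkRunner.Run<CallCountBenchmarks>();
}

Comparing the Foo and Bar listings in the diagnoser output shows whether the double variables are kept in registers or spilled to the stack.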

Inlining

The topic of inlining has already been discussed in this book several times. When the JIT compiler inlines a method, a call to this method is replaced by its body. It’s not easy to decide when inlining should be applied because this optimization has both advantages and disadvantages.
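
As a purely illustrative sketch (hypothetical methods, not taken from any benchmark in this chapter), inlining conceptually turns the first version of Compute into the second one:

public static class InliningSketch
{
  private static int Add(int a, int b) => a + b;

  // Before inlining: Compute performs a call to Add.
  public static int ComputeWithCall(int x) => Add(x, 1);

  // After inlining: the call is replaced by the body of Add.
  public static int ComputeInlined(int x) => x + 1;
}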

Advantages:
  • Eliminated call overhead

    When we call a method, we always have some overhead. For example, we perform a couple of additional instructions (call and ret). Sometimes, we have to save some register values before the call and restore them after the call. Inlining eliminates this overhead. It can be important for hot methods that should be superfast.

  • Opportunity for other optimizations

    Once a method is inlined, other optimizations like constant folding or dead code elimination become possible (see the sketch after this list).

  • Better register allocation

    In some cases, when a method is inlined, the JIT compiler can use the registers better because it shouldn’t pass arguments to the called method.
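
Here is a hedged sketch of the second advantage (the method names are hypothetical): once Square is inlined, the whole expression becomes a compile-time constant that the JIT compiler can fold.

public static class ConstantFoldingSketch
{
  private static int Square(int x) => x * x;

  public static int Nine()
  {
    // Without inlining: a call to Square(3) happens at runtime.
    // With inlining: the body becomes "3 * 3", which the JIT compiler
    // can fold into the constant 9 (no call, no multiplication).
    return Square(3);
  }
}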

Disadvantages:
  • Increasing code size

    On the CPU level, we have an instruction cache, which helps to load the executed code faster. Duplication of the inlined method native code across its usages may hurt the instruction cache performance. This effect is almost invisible on small programs, but it can affect applications with a huge source code base.

  • Preventing further inlining

    Imagine three methods A, B, and C, where A calls B and B calls C. If the JIT compiler inlines B into A, A may become too complicated, which will prevent further inlining of C into A. Meanwhile, the C→B inlining can be more profitable than the B→A inlining.

  • Worse register allocation

    You may think about a method as a scope for the JIT compiler where it tries to use registers as best as possible. Since the number of registers is limited, an inlined method may lead to worse conditions for register usage. In the previous section, we already discussed many cases when we have a performance drop because some variables are placed on the stack instead of registers.

Thus, inlining can be a good optimization or a bad optimization. Usually, the JIT compiler tries to make a decision that is best for performance. However, these decisions are not always obvious, and they may lead to unexpected performance phenomena. Let’s look at some case studies that help us to recognize situations when the knowledge about inlining is important for performance investigations.


CALL OVERHEAD

When we have a hot method, we want to make it as fast as possible. Typically, a call of a simple method takes a few nanoseconds. If the method body itself also takes only a few nanoseconds, the call overhead may double its duration. This overhead can be eliminated with the help of inlining. Unfortunately, it’s not always possible to inline a method. Let’s look at an example that shows what kind of performance drop we can get when it’s impossible to inline a hot method.

Source code

Consider the following BenchmarkDotNet-based benchmarks:
public class Benchmarks
{
  private const int N = 1000;
  private int[] x = new int[N];
  private int[] y = new int[N];
  private int[] z = new int[N];
  [Benchmark(Baseline = true)]
  public void Foo()
  {
    for (int i = 0; i < z.Length; i++)
      z[i] = Sum(x[i], y[i]);
  }
  [Benchmark]
  public void Bar()
  {
    for (int i = 0; i < z.Length; i++)
      z[i] = VirtualSum(x[i], y[i]);
  }
  public int Sum(int a, int b) => a + b;
  public virtual int VirtualSum(int a, int b) => a + b;
}

Here we have three int arrays of the same length: x, y, and z. In the declared benchmarks Foo and Bar, we perform z[i] = x[i] + y[i] for all array elements. Instead of the direct calculations, the sum operation is extracted to a separate method. Foo uses Sum (a nonvirtual method), and Bar uses VirtualSum (a virtual method).

Results

Here is an example of results (macOS 10.14.2, Intel Core i7-4870HQ CPU 2.50GHz, .NET Core 2.1.3, RyuJIT-x64):
 Method |     Mean |    StdDev | Ratio |
------- |---------:|----------:|------:|
    Foo | 1.121 us | 0.0148 us |  1.00 |
    Bar | 2.311 us | 0.0196 us |  2.06 |

As you can see, Bar works two times slower than Foo.

Explanation

VirtualSum can’t be inlined because it’s marked as a virtual method: according to the RyuJIT heuristics in .NET Core 2.1.3, virtual methods aren’t inlined here. Sum can be inlined because there are no factors that prevent inlining. Foo works two times faster than Bar because it uses the inlined version of the sum operation and doesn’t have the Sum call overhead.
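
As a closing note, .NET exposes standard attributes for influencing inlining decisions. Here is a small sketch (the class name is hypothetical); AggressiveInlining is only a hint and does not, by itself, make a virtual call like VirtualSum inlinable, while NoInlining is a hard restriction:

using System.Runtime.CompilerServices;

public class InliningHints
{
  // A hint: asks the JIT compiler to inline this method even if its
  // size heuristics would normally reject it. The JIT compiler may
  // still refuse.
  [MethodImpl(MethodImplOptions.AggressiveInlining)]
  public int Sum(int a, int b) => a + b;

  // A hard restriction: the JIT compiler will not inline this method.
  [MethodImpl(MethodImplOptions.NoInlining)]
  public int SumNoInline(int a, int b) => a + b;
}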