This blog post in a paragraph
Why is this C code 5x faster than this equivalent C# code? On my machine, the C program completes in 32ms versus 150ms for the C# program. 30ms of the difference is due to C# start-up/shutdown overhead. The other 90ms is due to gcc vectorizing the loop, after being allowed to ignore some standards around floating-point calculations. This is a very specific case of numerically-intensive code: it’s not guaranteed that C will always be 5x faster than C#!
The long story
After making C# go fast, I wondered if I could do much better using a non-managed language like C. Rewriting my game in C is a large project, so I went hunting for smaller examples. Before long, I came across speed comparison, which compares many languages performing the same small calculation of pi. It’s a trivially small piece of code which only compares a tiny portion of what each language is capable of, but it’s a good starting point for a learning exercise.
C comes in at 3 times faster than C# in the above comparison. In this blog post, I’ll investigate why, from the perspective of someone with a rather patchy understanding of low-level code and performance (me).
The questions of which language is the fastest, and whether managed languages like C# are slow have been around forever, and many people smarter than me have written about it online. See appendix B for some interesting reads. The generic reasons given are VM/runtime overhead, garbage collection (GC), just-in-time compilation (JIT), array bounds checking and more. I wasn’t satisfied with generic reasons, and wanted to know specifically why two pieces of comparable code differed.
Running the speed comparison, poking around
Speed comparison is relatively straightforward to run on a Linux-like environment. On my machine, C performed even better than reported, averaging 35ms per run, versus 155ms for C#.
The code:
- reads a number `rounds` from a file
- iteratively calculates the value of pi, looping as many times as `rounds` specifies
- prints the value of pi
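For reference, the series being summed is the Leibniz formula for pi:

pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...

so the accumulated sum, multiplied by 4, gives pi.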
I tinkered with the code a bit to get an understanding of where the time was being spent. Eliminating file I/O by hard-coding `rounds` saved a few milliseconds in both C and C#. Setting `rounds` to 1 reduced C’s run time to effectively zero, while C# still took 30ms. I assume this time is start-up/shutdown time of the C# runtime (CLR), which I don’t particularly care about.
Out of curiosity, I quickly checked out ahead-of-time compilation (AOT) for C#. It only had a small impact on overall run time, and start-up time was still around 20ms. I didn’t investigate any further.
So: ignoring file I/O and start-up time, C completes the pi calculation in 30ms, while C# takes 120ms. The code for the calculation is almost identical:
C:
for (unsigned i=2u; i < rounds; ++i) {
    double x = -1.0 + 2.0 * (i & 0x1);
    pi += (x / (2u * i - 1u));
}
C#:
for (int i = 2; i < rounds + 2; i++) {
    x *= -1;
    pi += (x / (2 * i - 1));
}
Changing the signed-ness of variables and the time at which to increment `i` made no difference. The line `double x = -1.0 + 2.0 * (i & 0x1);` produces the same value of `x` as `x *= -1`, but is important for reasons I’ll talk about later. First, let’s take a look at the machine code that is generated by C and C#.
Digging into the machine code
C
To see C disassembly, I used a tool called objdump. I bypassed Earthly and ran `gcc` and `objdump` directly. The C compilation command in the Earthfile is:
gcc leibniz.c -o leibniz -O3 -s -static -flto -march=native \
-mtune=native -fomit-frame-pointer -fno-signed-zeros \
-fno-trapping-math -fassociative-math
I modified this slightly to make the objdump output easier to understand:
- removed `-s` to keep symbol information
- removed `-static`, so that external library code was excluded from the executable
- added `-g` to add debugging information
This didn’t affect the speed of the program.
# compile the program
gcc leibniz.c -o leibniz -O3 -g -flto -march=native \
-mtune=native -fomit-frame-pointer -fno-signed-zeros \
-fno-trapping-math -fassociative-math
# extract the machine code in intel assembly format,
# along with other useful details to help understand
# which part of the code the assembly is for
objdump -drwlCS -Mintel leibniz > leibniz.asm
I’ve put all the assembly together in an appendix.
I’ll cover the `gcc` options in more detail later. The `objdump` options are described here: objdump man page.
C#
Accessing C# disassembly has a much more ‘Windows-y’ feel to it. The most common advice is to use Visual Studio, pause the program at a breakpoint, then access the disassembly via a menu: Debug menu -> Windows -> Disassembly. To make sure you’re looking at optimised code, before running the debugger, you need to disable ‘Suppress JIT optimization on module load’ and ‘Enable Just My Code’ within the menus Tools -> Options -> Debugging -> General.
This process makes sense, as .NET executables are distributed in .NET’s intermediate language format (IL), and IL is only compiled to native code by the JIT as needed. 1 2
However, the assembly shown by Visual Studio was suspiciously long compared to the C program’s assembly. With certain compilation settings (more on this later), the C program completed in a similar time to the C# program, yet the C assembly code was less than half as long as the C# disassembly. It didn’t make sense that they could run at the same speed.
This stack overflow answer says that the JIT behaves differently when run under the debugger. I had a feeling that this was the case even with the Visual Studio option set to allow JIT optimization. I tried using `vsjitdebugger`, spending hours manually stepping through all the C# runtime start-up until eventually reaching my code, only to find that it was still the same long set of instructions. If you ever find yourself needing to use `vsjitdebugger`, I found that I could only get it to work by running Visual Studio as admin, then running `vsjitdebugger` from the Powershell terminal within Visual Studio.
Eventually, I came across this post about viewing code generated by the JIT. It outlines a number of ways to view the disassembly, one being to use BenchmarkDotNet with DisassemblyDiagnoser. I extracted out the pi calculation to a method, put a benchmark around it, and finally got some assembly code that was comparable to the C program’s assembly.
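Roughly, the setup looked like this (a sketch from memory; the class and method names are mine, and `rounds` is hard-coded rather than read from a file):

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[DisassemblyDiagnoser]               // dumps the JIT-generated assembly for each benchmark
public class PiBenchmark
{
    private const int Rounds = 100_000_000;

    [Benchmark]
    public double CalcPi()
    {
        double pi = 1.0;             // the first term of the series
        double x = 1.0;
        for (int i = 2; i < Rounds + 2; i++)
        {
            x *= -1;
            pi += x / (2 * i - 1);
        }
        return pi * 4;               // returning the result stops the JIT eliminating the loop
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<PiBenchmark>();
}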
See the appendix for the C# disassembly.
What does it all mean?
Having the machine code generated from the C and C# code let me do a like for like comparison of each program. I’ve put the assembly code of just the for loops in an appendix at the end of this post.
The first challenge was understanding what all the instructions meant! This quick introduction to x64 assembly (pdf) 3 was a good primer. This x86 reference 4 got me the rest of the way to a basic understanding of what was happening. Rather than decode every instruction and figure out which variables were being stored in which registers, I thought I’d just play around with the C compilation options to get a feel for what changes.
Gcc has many options to control the generated machine code. I’ll summarise the relevant options here. For more details, see gcc’s optimize options 5 and gcc options summary.6
I found that the top contributors to program run time were `-O3 -march=native -fassociative-math -fno-signed-zeros -fno-trapping-math`.
- `-O1`, `-O2`, `-O3` are shorthand for a collection of optimisation settings, `-O3` being the highest setting
- `-march=native`: compile for the host CPU, using any instructions it supports. This is generally only done for numerically intensive code, at the cost of portability. Most widely distributed C programs won’t use this option, as it limits the number of CPUs the program can run on. For more details see this stack overflow question: Why is -march=native used so rarely?
- `-fassociative-math`: reorders floating point operations for efficiency, possibly causing underflow / overflow / NaNs
- `-fno-signed-zeros`: ignore the sign of zeros
- `-fno-trapping-math`: assume that there will be no “division by zero, overflow, underflow, inexact result and invalid operation”
The last three options may cause violations of IEEE or ANSI standards, and are referred to as unsafe math optimizations. I assume this is not necessarily a bad thing, if you know what you’re doing. Indeed, the value of pi calculated with these options varied slightly from the value calculated without them (and by the C# code), but all calculated pi values were still accurate to 7 decimal places.
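As a tiny illustration of why reordering matters (my own example, not from the benchmark), floating-point addition isn’t associative:

using System;

class AssocDemo
{
    static void Main()
    {
        double a = 1e20, b = -1e20, c = 1.0;
        Console.WriteLine((a + b) + c); // 1: a and b cancel first, then c is added
        Console.WriteLine(a + (b + c)); // 0: c is lost when added to b, far below its precision
    }
}

`-fassociative-math` lets gcc pick whichever grouping is most efficient to compute, which is why the summed value of pi can differ slightly.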
Using just the `-O3` option, the C program took 120ms, which is the same as the C# program, ignoring start-up time. `-march=native` didn’t make a big difference unless the unsafe math options were enabled. If you compare the assembly of the ‘unsafe’ and the ‘safe’ options, you can see that the unsafe assembly uses ymm registers (256-bit registers), allowing more calculations to be done per CPU instruction. It also only loops 12.5 million times, versus the 100 million times for the safe assembly. Intuitively, this explains the speedup to me: the unsafe assembly runs 8x fewer loops, but runs about twice as many instructions per loop, resulting in a 4x speedup. I don’t know if that’s the exact reason, but I’m satisfied enough with that answer for now.
Vectorization
What gcc is doing with the unsafe compilation options above is vectorizing the loop. There’s a great explanation of vectorization here: Stack Overflow: What is “vectorization”?. In a sentence, it’s the compiler generating code that executes multiple loop iterations at once, using more efficient CPU instructions.
Gcc was able to automatically vectorize the C code, albeit with the unsafe math options. The C programmer also needed to know that writing `double x = -1.0 + 2.0 * (i & 0x1)` allows the compiler to vectorize, while `x *= -1` doesn’t. I didn’t investigate why, but I assume it’s because the first form determines the value of `x` using only the loop counter, while the second form determines `x` from the value of `x` in the previous iteration.
It’s also possible to write vectorized code yourself, as demonstrated by the leibniz_avx2.cpp code in speed-comparison. On my machine, this beat the auto-vectorized C by a couple of milliseconds, without needing the unsafe math options. Java and C# also support manual vectorization, however the result in speed-comparison is underwhelming: this vectorized Java code performs 3 times worse than its non-vectorized counterpart! It’d be interesting to see if C# fares any better. Microsoft remarks:
SIMD [vectorization] is more likely to remove one bottleneck and expose the next, for example memory throughput. In general the performance benefit of using SIMD varies depending on the specific scenario, and in some cases it can even perform worse than simpler non-SIMD equivalent code.
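Out of curiosity, here’s roughly what manually vectorized C# could look like, using System.Numerics.Vector&lt;double&gt;. This is an unbenchmarked sketch of my own (not the speed-comparison code), and it assumes `rounds` is a multiple of `Vector<double>.Count`:

using System;
using System.Numerics;

class VectorPi
{
    // Unbenchmarked sketch: sums the Leibniz series Vector<double>.Count terms at a time.
    // Assumes rounds is a multiple of Vector<double>.Count (it is for 100 million).
    static double CalcPi(int rounds)
    {
        int width = Vector<double>.Count;                      // 4 on AVX2 hardware
        var denom = new double[width];
        var sign = new double[width];
        for (int lane = 0; lane < width; lane++)
        {
            denom[lane] = 2.0 * (2 + lane) - 1.0;              // 3, 5, 7, 9 for the first terms
            sign[lane] = ((2 + lane) % 2 == 0) ? -1.0 : 1.0;   // term i is negative when i is even
        }

        var d = new Vector<double>(denom);
        var s = new Vector<double>(sign);
        var step = new Vector<double>(2.0 * width);            // each lane's denominator advances by 2*width
        var sum = Vector<double>.Zero;

        for (int i = 2; i < rounds + 2; i += width)
        {
            sum += s / d;                                      // width terms per iteration
            d += step;
        }

        double pi = 1.0;                                       // the first term of the series
        for (int lane = 0; lane < width; lane++)
            pi += sum[lane];
        return pi * 4.0;
    }

    static void Main() => Console.WriteLine(CalcPi(100_000_000));
}

Each lane derives its own sign and denominator from the loop counter, which is the same property that let gcc vectorize the C version.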
The end
So there we have it! For this particular code, C runs 5x faster than C# by relaxing some strict IEEE calculation rules, auto-vectorizing the loop, and having a very low start-up time.
Ignoring start-up time, C# was as fast as standards-compliant C, but vastly outperformed by manually vectorized C++ code. It may be possible to manually vectorize the C# code, but I’ll leave that for a later post.
It’s important to note that the code compared in this post is by no means a fair comparison of the performance of C and C# in general, and there are many more factors to consider when choosing a programming language. Have a look at the links in appendix B for some more perspective.
Appendix A: assembly
Assembly for C and C#. Only the pi calculation loop is shown.
C with unsafe math optimizations (`-O3 -march=native -fassociative-math -fno-signed-zeros -fno-trapping-math`, 30ms):
;address: instruction (bytes) instruction (assembly) # my understanding of what's happening
1130: c5 fd 6f c2 vmovdqa ymm0,ymm2 # ymm0 = ymm2
1134: ff c0 inc eax # eax += 1
1136: c5 ed fe d7 vpaddd ymm2,ymm2,ymm7 # ymm2 += ymm7 (packed 32-bit int add)
113a: c5 fd db ce vpand ymm1,ymm0,ymm6 # ymm1 = ymm0 & ymm6
113e: c5 fd 72 f0 01 vpslld ymm0,ymm0,0x1 # ymm0 <<= 1, ie. ymm0 *= 2
1143: c5 fd fe c5 vpaddd ymm0,ymm0,ymm5 # ymm0 += ymm5 (ymm5 = -1?, see 1117)
1147: c5 7e e6 c9 vcvtdq2pd ymm9,xmm1 # ymm9 = (4 x double) converted from the 4 x i32 in xmm1 ... https://www.felixcloutier.com/x86/cvtdq2pd
114b: c4 e3 7d 39 c9 01 vextracti128 xmm1,ymm1,0x1 # xmm1 = ymm1[255:128]
1151: c5 fe e6 c9 vcvtdq2pd ymm1,xmm1 # ymm1 = (4 x double) converted from the 4 x i32 in xmm1
1155: c5 7e e6 d0 vcvtdq2pd ymm10,xmm0 # ymm10 = (4 x double) converted from the 4 x i32 in xmm0
1159: c4 e3 7d 39 c0 01 vextracti128 xmm0,ymm0,0x1 # xmm0 = ymm0[255:128]
115f: c4 62 e5 98 cc vfmadd132pd ymm9,ymm3,ymm4 # (4x double): ymm9 = ymm9 * ymm4 + ymm3
1164: c5 fe e6 c0 vcvtdq2pd ymm0,xmm0 # ymm0 = (4 x double) converted from the 4 x i32 in xmm0
1168: c4 e2 e5 98 cc vfmadd132pd ymm1,ymm3,ymm4 # (4x double): ymm1 = ymm1 * ymm4 + ymm3
116d: c4 41 35 5e ca vdivpd ymm9,ymm9,ymm10 # ymm9 /= ymm10
1172: c5 f5 5e c0 vdivpd ymm0,ymm1,ymm0 # ymm0 = ymm1/ymm0
1176: c5 b5 58 c0 vaddpd ymm0,ymm9,ymm0 # ymm0 += ymm9
117a: c5 3d 58 c0 vaddpd ymm8,ymm8,ymm0 # ymm8 += ymm0
117e: 3d 20 bc be 00 cmp eax,0xbebc20 # if eax != 12500000 (=100M/8)
1183: 75 ab jne 1130 <main+0x90> # goto 1130
Safe C (`-O3 -march=native`, 120ms):
;address: instruction (bytes) instruction (assembly) # my understanding of what's happening
1120: 89 c1 mov ecx,eax # ecx = eax
1122: c5 e3 2a d2 vcvtsi2sd xmm2,xmm3,edx # xmm2 = (double)edx? https://www.felixcloutier.com/x86/cvtsi2sd
1126: ff c0 inc eax # eax += 1
1128: 83 c2 02 add edx,0x2 # edx += 2
112b: 83 e1 01 and ecx,0x1 # ecx &= 1
112e: c5 e3 2a c9 vcvtsi2sd xmm1,xmm3,ecx # xmm1 = (double)ecx?
1132: c4 e2 d9 99 cd vfmadd132sd xmm1,xmm4,xmm5 # xmm1 = xmm1 * xmm5 + xmm4
1137: c5 f3 5e ca vdivsd xmm1,xmm1,xmm2 # xmm1 /= xmm2
113b: c5 fb 58 c1 vaddsd xmm0,xmm0,xmm1 # xmm0 += xmm1
113f: 3d 02 e1 f5 05 cmp eax,0x5f5e102 # if eax != 100M+2
1144: 75 da jne 1120 <main+0x80> # goto 1120
C# (BenchmarkDotNet):
M01_L00:
; instruction params # my understanding
vmulsd xmm0,xmm0,xmm2 # xmm0 *= xmm2
lea edx,[rax*2-1] # edx = rax*2-1, i.e. 2*i-1 (lea does arithmetic, not a memory load)
vxorps xmm3,xmm3,xmm3 # xmm3 = 0
vcvtsi2sd xmm3,xmm3,edx # xmm3 = (double)edx
vdivsd xmm3,xmm0,xmm3 # xmm3 = xmm0 / xmm3
vaddsd xmm1,xmm3,xmm1 # xmm1 = xmm3 + xmm1
inc eax # eax += 1
cmp eax,5F5E102 # if eax < 100M+2
jl short M01_L00 # goto M01_L00
C# (Visual Studio, not optimised):
;address instruction (assembly) # my understanding of what's happening
00007FFF1835417B vmovsd xmm0,qword ptr [rbp-48h] # xmm0 = *(rbp-48h)
00007FFF18354180 vmulsd xmm0,xmm0,mmword ptr [Program.<<Main>$>g__CalcPi|0_0()+0E0h (07FFF18354210h)] # xmm0 *= *(07FFF18354210h)
00007FFF18354188 vmovsd qword ptr [rbp-48h],xmm0 # *(rbp-48h) = xmm0
00007FFF1835418D vmovsd xmm0,qword ptr [rbp-48h] # xmm0 = *(rbp-48h)
00007FFF18354192 mov eax,dword ptr [rbp-4Ch] # eax = *(rbp-4Ch)
00007FFF18354195 lea eax,[rax*2-1] # eax = rax*2-1 (address arithmetic, no memory access)
00007FFF1835419C vxorps xmm1,xmm1,xmm1 # xmm1 = 0
00007FFF183541A0 vcvtsi2sd xmm1,xmm1,eax # xmm1 = (double)eax? https://www.felixcloutier.com/x86/cvtsi2sd
00007FFF183541A4 vdivsd xmm0,xmm0,xmm1 # xmm0 /= xmm1
00007FFF183541A8 vaddsd xmm0,xmm0,mmword ptr [rbp-40h] # xmm0 += *(rbp-40h)
00007FFF183541AD vmovsd qword ptr [rbp-40h],xmm0 # *(rbp-40h) = xmm0
00007FFF183541B2 mov eax,dword ptr [rbp-4Ch] # eax = *(rbp-4Ch)
00007FFF183541B5 inc eax # eax += 1
00007FFF183541B7 mov dword ptr [rbp-4Ch],eax # *(rbp-4Ch) = eax
00007FFF183541BA mov ecx,dword ptr [rbp-58h] # ecx = *(rbp-58h)
00007FFF183541BD dec ecx # ecx -= 1
00007FFF183541BF mov dword ptr [rbp-58h],ecx # *(rbp-58h) = ecx
00007FFF183541C2 cmp dword ptr [rbp-58h],0 # if *(rbp-58h) > 0
00007FFF183541C6 jg Program.<<Main>$>g__CalcPi|0_0()+0A6h (07FFF183541D6h) # goto 07FFF183541D6
00007FFF183541C8 lea rcx,[rbp-58h] # rcx = rbp-58h (the address, not its contents)
00007FFF183541CC mov edx,33h # edx = 51
00007FFF183541D1 call 00007FFF77D8C9B0 # call mystery function. appears to be JIT related
00007FFF183541D6 cmp dword ptr [rbp-4Ch],5F5E102h # if *(rbp-4Ch) < 100M +2
00007FFF183541DD jl # goto 07FFF1835417B
Appendix B: Interesting reading
- Is C# slower than say C++?
  - detailed generic answer
    - covers language features, VM, GC, benchmarks
    - conclusion: you can almost always write C/C++ that’s faster than C#, but not the other way around
  - specific case answer
    - yes, but it may take a C++ expert significant time to make it run faster than C#
- Jeff Atwood: The bloated world of Managed Code
  - C# cons: slower, uses more memory
  - C# pros: faster development
- Jeff Atwood: On Managed Code Performance
  - C# pro: security: buffer overruns possible in C/C++ can’t happen in C#
  - quake 2 .NET port was initially faster than the C version, but 15% slower after C targeted the host CPU, and 30% slower than the hand-optimised assembly version