

TLDR: I wrote a fuzzer that randomly generates SIMD C++ code, runs it through different compilers/settings, and checks that the output is the same for all. This led to a few miscompilation and crash bugs in Clang/LLVM.

Intro

Compilers are generally very good at what they do, but they aren't magic. Sometimes they get things wrong: generating incorrect code for a valid program. These "miscompilation" bugs can be pretty serious because they reduce trust in the rest of the software stack. How can we find these bugs most effectively?

Tools exist to find these bugs in C compilers, like CSmith and YARPGen. These are "fuzzers": they produce random inputs and test that compilers handle them correctly. More specifically, they attempt to generate random C programs that are deterministic and well-defined. Each program is then fed to multiple compilers, and if the results aren't unanimous, then at least one compiler is wrong. This approach is also called "differential testing," and while it's useful for all sorts of programs, it's often applied to testing compilers.

But this approach has its limits. We need to generate as wide a variety of C programs as possible, but they have to obey the rules. It's a balancing act: injecting enough chaos to test interesting edge cases, while keeping enough order that the result is still deterministic and well-defined. Specifically, we don't just need correct syntax and semantics; we need to avoid "undefined behaviour." Some code will compile, but if it has undefined behaviour, the compiler has much more leeway in what the resulting program does. Because of this balance between chaos and order, there are still plenty of code paths that aren't exercised by the current compiler fuzzers.

Fuzzing Effectively

I have a theory about fuzzing: it doesn't matter how many code paths your fuzzer exercises, it matters how many new and interesting code paths it exercises. If 99% of the code in a compiler has been fuzz tested extensively, then that remaining 1% should be the goal of any new fuzzer you write. Especially if that 1% is potentially bug-prone.

A while ago, I thought about what that might look like for compilers. It ended up being decided for me, because while futzing about with SIMD intrinsics, I came across a miscompilation bug. It isn't a very serious bug, but I lost a few hours to it, thinking I had done something wrong in my own code. Somewhat miffed by this, I decided to write a fuzzer for SIMD, specifically Intel's SSE/AVX intrinsics. The idea is to generate some code that uses a bunch of these intrinsics, without any undefined behaviour, and then test that different compilers and optimization levels all agree on the result.

One of the dangers of a miscompilation bug is that the user will waste a lot of time chasing down a bug in their code that doesn't exist.

For instance, we might generate some code like this:

__m256i do_stuff(const int* iVals) {
	// Load some input
	__m256i I0 = _mm256_loadu_si256((const __m256i*)&iVals[0]);
	__m256i I8 = _mm256_loadu_si256((const __m256i*)&iVals[8]);
	__m256i I16 = _mm256_loadu_si256((const __m256i*)&iVals[16]);
	__m256i I24 = _mm256_loadu_si256((const __m256i*)&iVals[24]);
	// Do some work on the input
	__m256i A = _mm256_maddubs_epi16(I16, I0);
	__m256i B = _mm256_packs_epi16(A, I8);
	__m256i C = _mm256_mpsadbw_epu8(I24, I16, 4);
	__m256i D = _mm256_maddubs_epi16(C, B);
	return D;
}

which reproduces the bug I originally came across (present in LLVM 3.4.1 through 13.0): _mm256_mpsadbw_epu8 was marked as commutative in LLVM's source, even though the underlying operation is not. For this code, LLVM decided to swap the operands of the mpsadbw instruction, I think to optimize memory accesses. This led to a different (incorrect) result.

Writing the fuzzer for SIMD

After reporting this, I set about writing a fuzzer to test this sort of code, since bugs often come in clusters. There are a lot of ways to do this, but I decided to go with the simplest approach I could think of: I wanted to get something running fast so I could test whether the idea actually had potential. What I went with was to generate a branchless dataflow graph of intrinsics with one output and one or more inputs. What that means is:

  1. We start with some number of inputs (at least 1, but could be more)
  2. Several intrinsics are run, using either the inputs, or the results of previous intrinsics
  3. We return one of the results of these intrinsics

You can see this in the above do_stuff function: some inputs, some intrinsics, a return. So now the question is how can we do this automatically, and in a way that provides thorough testing of the compiler?

To create it, I actually generate the code backwards. We pick a random type for the function's return value, then pick a random intrinsic that returns that type. For the inputs to that intrinsic, we pick more random intrinsics, and so on. We can also randomly re-use an existing variable as input to more than one intrinsic, to try to create some interesting data flows.

Once we've generated a decent number of intrinsics, the remaining variables that have not been bound to an intrinsic can instead just read data from an input stream we provide the function. This process guarantees that we generate a syntactically valid program, and one that is correctly typed. It also avoids having large amounts of dead code: that may be desirable if we wanted to test dead-code-elimination, but I guessed it was better to give the optimizer more live code to chew on for each pass.

There is one lingering issue though, which is making sure our program does not generate undefined behavior: I took the bold strategy of just ignoring this. My initial thoughts were: undefined behavior should be pretty rare, and if it did manifest I could just comb through some false positives. This ended up paying off, because I never encountered any differences in semantics due to UB: I suspect the intrinsics just have less UB than normal C++, or that the compiler doesn't take advantage of it.

There are additional wrinkles for testing floating point, so I decided to only test integer-based intrinsics for miscompilations, just to make things easier.

Writing a tester for the generated code

Once we've generated the code, we need to run it through compilers, and run the resulting binaries on the same input. If any of the outputs differ, we've (likely) found a bug.

This is reasonably straightforward: generate a harness that reads input from stdin, runs the function on it, and prints the output. Then diff the outputs of each compiler and verify they're the same.

Why bother reading the input from stdin, instead of just compiling it in? A few reasons. First, this lets us compile the code once and run it on several inputs to get better coverage. Second, if we hard-coded the values, the optimizer could potentially use inlining and constant propagation to reduce all our code to a constant. That could be interesting, but I figured it would be better to keep the input unknown to the optimizer.

What bugs I found

Examples of miscompilation bugs found:

I also found a few crash bugs: while I think these are often worth reporting, I don't find them as interesting. You can see every issue I opened on LLVM here, which includes the above bugs and the crashes.

It is possible that crashes can turn into miscompilations in different situations. Most of the crash bugs I found didn't exhibit this, but they're still generally good to fix.

Conclusions

You may notice that all the bugs I found were in LLVM. This is partly because I ended up focusing on LLVM/Clang since it was easier to build from source on Windows. But I do also think that LLVM tends to be more aggressive in optimizing vectorized code. By contrast, GCC and MSVC seem to run basic optimizations but mostly leave your intrinsics alone.

Some examples can be seen with this Godbolt.

I've seen some discussion that this shouldn't be done: after all, intrinsics are generally a sign that the programmer wants direct control over the generated assembly. Why optimize it further? I can actually think of a few reasons:

  1. Sometimes CPUs gain new instructions that enable faster code. Compilers can use these as needed, without updates to existing source code
  2. Sometimes scalar code is automatically vectorized. This will usually be faster, but may not be fully optimal. Or optimizations may be possible in the vector form that couldn't be done on the scalar code
  3. Sometimes developers (raises hand) use intrinsics without being complete experts. Not all code with intrinsics is intrinsically fast

But miscompilations do give more fuel to the "don't optimize" crowd.

Bugs can be in the front-end (parser and language semantics), middle-end (optimizing the IR), or back-end (generating assembly from the final IR). I found zero in the front-end, a few in the middle-end, and most in the back-end. Mostly I was looking for (and found) optimizer bugs, but LLVM backends tend to do a lot of platform-specific optimizations.

Overall, there's the question of whether these bugs matter. I originally thought they'd be mostly pointless, but apparently one of them was independently discovered by Apple.

If you are interested in seeing more details about the fuzzer, you can follow along at the Github repo. Note that it is very much a hacked-together research project, so I would not expect it to be easy to use on your own.

Finally, I have to give thanks to the LLVM community. Despite having a variety of causes and being in different parts of the code base, they promptly fixed every bug I filed.

Future Work

There are some obvious follow-ups for this:

I'm interested in making this more generic, and creating a framework for quick, targeted fuzzers for compilers. Tools like CSmith have done a very good job in the general case, but there's clearly a use in being able to quickly test more specific areas.

Addendum 1: The World's Worst Linker

I wanted to run the generated code without having to write it out to disk and shell out to it. So I tried to see if there was a way to have the compiler write its binary output to stdout, and directly load it into the fuzzer's address space. It turns out this isn't normally possible, since the linker needs to seek throughout the output file to fix up relocations. However, if you compile without linking, Clang will happily write its object code to stdout: but since it's unlinked, you need to link it yourself, in memory. So I did!

This turned out to be a very bad idea. I would constantly run into crashes and false positives, where I thought I had found a compiler bug but actually I was just harnessing the code incorrectly. I don't regret it though, since I learned a lot about linkers, calling conventions, and other details that don't matter to 99.9% of devs.

Next time: just use a ramdisk and DLLs

Addendum 2: An ARM and a LEG

Long story short: I have since ported the fuzzer to generate NEON code for ARM processors. This found a couple of new bugs, both in the ARM backend and in middle-end optimizations that weren't being stressed by my X86 code.

Originally, I had an even more cursed setup, where I would pipe the compiled blob to my Android phone running a custom app that would load the binary off a socket, execute it, and send the results back to the fuzzer running on my laptop.

This led to me wasting several days chasing down a bug in my ARM ELF loader that only manifested if a load had an offset >= 4096. This only showed up with very large programs, and I struggled to find the issue among all the chaos. I eventually gave up for several months.

But much later, I bought an M2 Mac and got Asahi Linux running on it. With this, I finally sat down and got the NEON fuzzer working properly. I ended up finding a new bug as a result, though overall it hasn't found much else.

Addendum 3: GCC

With some tweaks, and a fix for pointer alignment, I was able to also fuzz X86 intrinsics on GCC. This shook out a couple bugs, such as [14 Regression] Wrong code from combining VPABSB/VPBLENDVB since g:1ede03e2d0437ea9c2f7

It was nice to see the core idea of the fuzzer work in different contexts, as this implies it's actually exploring a deep problem space instead of hitting some shallow edge cases specific to one platform/vendor.