Andrei Terechko coordinated software testing and test infrastructure development for the Pareon dynamic analysis tools.

17 May 2016

Random testing tools have been state of the art in hardware verification for decades, while in software they are only now gaining momentum. Andrei Terechko shares his years of production experience with random testing of software.

In daily practice, software engineers focus on validating the code’s correct operation: the so-called ‘happy path’. Happy-path testing does a good job of achieving high line coverage, that is, the percentage of source code lines exercised by the test. However, many long code paths through multiple source files and libraries, precisely the paths customers trigger in the field, remain uncovered. And what about testing the code that checks input data validity? Even experienced test engineers often underestimate the importance of thoroughly testing error handling and error recovery code.

Various testing tools can create new test cases for your software. Tools based on formal methods, for example, can prove that certain invariants do not hold for your code and can provide a test case that triggers the violation. However, they don’t scale to millions of lines of source code. Other tools merely examine a function’s input and output and generate a trivial test code template that misses the actual functionality. They save typing, but not thinking.

Test Monkey

Random tests can invent complex use cases that exercise long code paths no developer or tester ever dreamed of. Such use cases arise when a new customer evaluates your product without any idea which buttons to press. Random tests can also uncover security vulnerabilities. For instance, Codenomicon used a fuzz testing tool to uncover Heartbleed, the infamous security bug in OpenSSL. Other famous fuzzers are QuickCheck and American Fuzzy Lop.

In general, fuzzing tools feed invalid, unexpected or random input to the system. It begins with the tool generating input to the software under test (SUT) based on many random seeds; the input can also be invalid, and the SUT should not crash on it. The produced output is checked for errors, for example by comparing it to a golden reference for the given input. Upon an error, the fuzz framework prints a test report sufficient to reproduce and debug the problem.
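The loop just described can be sketched in a few lines of Python. This is a minimal illustration, not the Pareon framework itself: `run_sut` and `golden_reference` stand in for the real system under test and its reference model, and `generate_input` is a placeholder for a real input generator.

```python
import random

def generate_input(rng):
    # Produce a random, possibly invalid, byte string as SUT input.
    length = rng.randint(0, 64)
    return bytes(rng.randrange(256) for _ in range(length))

def fuzz(run_sut, golden_reference, iterations=1000, master_seed=42):
    """Drive the SUT with seeded random inputs and collect failing seeds."""
    failures = []
    for i in range(iterations):
        seed = master_seed + i          # saved so a failure can be replayed
        rng = random.Random(seed)
        data = generate_input(rng)
        try:
            actual = run_sut(data)
        except Exception as exc:        # the SUT must never crash on bad input
            failures.append((seed, f"crash: {exc!r}"))
            continue
        if actual != golden_reference(data):   # compare to the golden reference
            failures.append((seed, "output mismatch"))
    return failures
```

Because every input is derived from a recorded seed, any failure in the report can be reproduced exactly by rerunning with that seed.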

Asserts and error handling built into the source code are also very effective in fuzzing. In addition, dynamic program analysis tools will spot buggy operations ignored by the built-in checks, even when the program ‘successfully’ finishes. For example, the Valgrind and Pareon Verify tools detect out-of-bound access, memory leaks and data races.

Figure 1: A generic fuzzing test framework including dynamic program analysis tools. (Source: Vector Fabrics)

Infinite monkey theorem

User: ‘Your tool crashes within one minute.’

Tester: ‘No way; we spent months testing!’

User opens the GUI and clicks frantically around for ten seconds. The tool errors out.

Tester: ‘Hmm… That’s not an intended use of our tool.’

User: ‘But that’s how I use it.’

User interfaces are notoriously hard to test. The sequences of possible GUI events are endless. Many outsourcing companies offer UI and website testing services using relatively cheap labour. They test corner cases such as shrinking a window to its absolute minimum size, entering Cyrillic characters or clicking the same GUI button in rapid succession.

After initially outsourcing such testing to India and Ukraine, we set out to automate these tests with a fuzzer. To easily generate random GUI events without specifying every widget in detail, we settled on Sikuli, an open-source image-based automation tool. Sikuli scripts are Python code acting on images on the screen. The framework can match images even when they appear at different resolutions. The positions of the GUI widgets are not hard-coded either; built-in computer vision algorithms find them automatically. Furthermore, Sikuli provides a simple Python API for mouse gestures and text input, which is sufficient to code very complex UI manipulations with minimal effort.

To test our Pareon product’s GUI, we built a customized test framework using Sikuli. The script starts up the Pareon GUI, brings it into a known initial state, executes random actions and captures screenshots for each event, including highlights of the manipulated screen regions. Whenever the framework detects an error window or other abnormal behaviour, the test stops and the engineer can examine the screenshots and logs or simply rerun the same sequence of events using the saved random seed.
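The key to that replay step is deriving the whole event sequence from a single seed. The sketch below shows the idea in plain Python; the action names are hypothetical, and in the real framework each one would drive Sikuli against the live Pareon GUI.

```python
import random

# Hypothetical elementary GUI actions; stand-ins for Sikuli-driven
# clicks, drags and keystrokes on the actual application.
ACTIONS = ["open_project", "run_analysis", "resize_window",
           "click_toolbar", "enter_text", "switch_tab"]

def random_session(seed, length=20):
    """Reproducibly derive a GUI event sequence from a single seed."""
    rng = random.Random(seed)
    return [rng.choice(ACTIONS) for _ in range(length)]
```

The same seed always yields the same sequence, so a failing session reported with its seed can be replayed step by step on demand.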

In general, image-based testing should be used with care due to some inherent challenges: the navigation of ‘file open’-like dialogues, images hidden by dynamic pop-ups, platform-dependent widgets, and so on. Obviously, we had to program Sikuli to avoid certain actions, such as clicking the application’s close button.

To effectively spot bugs in a huge search space, any fuzzing tool needs to be customized to the problem domain. Just randomly clicking around won’t cut it for a production-quality GUI. So besides mouse gestures, we coded about twenty complex actions. Each complex action involved tens of elementary events, bringing the Pareon GUI (and the rest of the tools) into a totally different state.
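A complex action of this kind is just a scripted chain of elementary events with a few randomized choices inside. The following sketch is illustrative only: the menu and button names are invented, and `emit` stands in for whatever dispatches an elementary event to the GUI.

```python
import random

def complex_action_open_and_analyze(rng, emit):
    """One hypothetical 'complex action': a chain of elementary GUI
    events that moves the tool into a substantially different state."""
    emit("click", "menu_file")
    emit("click", "menu_open_project")
    emit("type", f"project_{rng.randrange(100)}")   # randomized project name
    emit("click", "button_ok")
    for _ in range(rng.randint(1, 5)):              # poke around in the new view
        emit("click", rng.choice(["tab_source", "tab_report", "tab_log"]))
    emit("click", "button_run_analysis")
```

The fuzzer then simply picks one of the roughly twenty such actions at random on each step, steering the search towards deep, realistic states instead of aimless clicking.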

It’s great fun to see a variation of the infinite monkey theorem in action, inventing brilliant test cases at random. This GUI fuzzer spotted about ten severe bugs. More importantly, it saved us lots of money on outsourcing and manual testing.

Figure 2: Image-based Sikuli scripts are Python code operating on image objects. (Source: Vector Fabrics)

Round trip to Korea

Simply put, Pareon operates as an instrumenting C/C++ compiler to detect critical software errors at runtime. The number of possible programs it needs to digest is mind-boggling. How can we invent challenging new inputs for the Pareon compiler beyond handcrafted directed tests and open-source software?

Csmith is an open-source random program generator. We developed a test infrastructure that runs Csmith with a random seed as input and feeds the generated code to the native Clang and Pareon compilers. After running both executables, the infrastructure compares the Pareon run against the native run. If either compilation or execution fails, or if the outputs differ, the test is marked as failed. The failed test case is reported along with the original random seed and detailed logs. We run this test iteratively on Arm and Intel processors with Linux and Android, collecting the results in a Jenkins continuous integration setup.
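The comparison logic of such a differential harness is compact. The sketch below shows its shape only: the real infrastructure shells out to Csmith and the two compiler toolchains, whereas here they are passed in as callables so the example stays self-contained.

```python
def differential_test(seed, generate, run_native, run_pareon):
    """Compare an instrumented run against a native run of the same
    generated program; any divergence is a failure tied to its seed."""
    program = generate(seed)      # in reality: invoke Csmith with this seed
    try:
        native_out = run_native(program)    # native Clang build and run
        pareon_out = run_pareon(program)    # instrumented build and run
    except Exception as exc:                # compilation or execution failed
        return {"seed": seed, "status": "failed", "reason": repr(exc)}
    if native_out != pareon_out:
        return {"seed": seed, "status": "failed", "reason": "output mismatch"}
    return {"seed": seed, "status": "passed"}
```

Reporting the seed alongside the verdict is what makes each failure cheap to reproduce on any of the target platforms.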

Given the enormous search space, the fuzzer must be tuned to focus its tests on potentially vulnerable parts of our code base. Luckily, Csmith is highly parameterizable. To tune it towards random programs that are challenging for dynamic analysis, we configured it to generate functions with many arguments, deep struct nesting, packed structs, bit fields, and so on.

The fuzzer caught approximately forty severe bugs. Csmith found, for example, an obscure error in our compilation of packed structs using bit fields on Android. This potentially saved us a round trip to Korea to fix the bug on site.

Testers will never come up with test cases for all your system’s state transitions. Random testing is a powerful means to cover the last mile, especially when run in concert with a dynamic program analysis tool. Customized random testing is effective for a broad range of software: user interfaces, secure web applications and safety-critical embedded systems, as well as any multithreaded code, which can be stress-tested with randomized inputs and execution schedules.

Edited by Nieke Roos