Static analysis (part 1/4)

23 août 2012 Souad Hadjiat

Software quality is modern software development. In fact, software defects can lead to serious issues, like data loss. Besides, they need to get fixed, which adds to the project cost. That is why code review, different set of eyes examining the source code looking for possible bugs, is becoming a routine procedure in software development.

Along with code review, is static code analysis. Indeed, code review, while an important and necessary step, is not enough. It is tedious and fallible because it depends on humans. Thus, there is a large class of software defects that can be detected in an automated way. Enter the static code analysis, with tools like PMD, CheckStyle and FindBugs. Even if you haven’t used one of those tools, you still have encountered static analysis by simply using your IDE of choice, like Eclipse or IntelliJ. Integrated Development Environments do perform some nice static analysis that can detect important bugs, which are not compiling errors, like dead code or unused variables.
One interesting body of defects that could be detected by static analysis is security flaws. This post, the first of a series of four, is going to present the FindBugs detector I developed that performs an advanced analysis to find input injections, called « Taint Analysis ». The program can find in java bytecode most vulnerabilities related to input representation, including Cross-site scripting, SQL injection, Cookie poisoning, path traversal.
But first, let’s introduce static analysis: how it works, and how it can help developers build a strong system.

Static analysis in theory

Static analysis is the process of analyzing code, either in source of compiled representation, without executing it. Due to its automated nature, such a tool can go through a large number of scenarios and corner cases that are otherwise difficult for testers to cover.
Static analysis mostly follow the compilation process: the code gets parsed and then represented in a model which gets analysed, and finally results are outputted.

Static analysis can be applied to varying scales: from a token or a line, to separate functions and classes, or even a single program and module. The most advanced tools can be run on an entire system involving multiple modules written in different languages. Still, regardless the different scopes, the analysis follows mostly the same structure described on the diagram above.
The first steps of the analysis process, the lexing and parsing, are quite similar between the different tools. Lexical analysis consists of transforming the input from a stream of characters to a stream of higher-level tokens. Next, using a set of rules defined by the language’s grammar, this linear stream of tokens gets transformed into a parse-tree representing the program structure. The latter is usually further transformed into a cleaner and tweaked model.
There are multiple types of analysis, among them:

pattern-based: searches for a given function or a sequence of instructions.
dataflow: enables to follow data across the program. Its main applications are: constant and type propagation, reaching definitions, taint propagation.
interprocedural: gathers information across multiple procedures (typically across the entire program), analyses interaction between functions.

Depending on the type of analysis, the used code representation can be a parse tree, an abstract syntax tree, a control flow graph (for dataflow analysis), along with a symbol table mapping identifiers to their type and declaration site. Some more custom representations could be used to match the analysis. The most important thing is that the representation should contain all the information needed by the tool to perform the analysis.
As an illustration, here is an example of java code:

int i=0;
int j=4;
for (; i<10; i++)
{
   if (i<j) j=i*j;
}
return j;

From this set of instructions can be built an abstract syntax tree:

and a control flow graph:

Static analysis in practice

There are many different static analysis tools, targeting one or multiple languages and performing all kinds of analysis. I will not list all the existing tools, but I can give most categories they belong to.

Type checking

Type checking is an important use of static analysis. It may not get all the attention it deserves, compilers being after all the first ones checking language typing specification. Let’s look at this example:

Object[] objects = new String[3];
objects[0] = new Object();

This code will compile without a warning, but an ArrayStoreException will be thrown at runtime. This kinds of errors can be detected with static analysis.

Style checking

This type of analysis can find all sorts of style errors: wrong indentation, bad naming, duplicate code, … Here is an example of bad indentation that can be confusing:

int x = 0, y = 3;
      if (x == y)
             if (y == 3)
                    x = 3;
      else
             x = 4;

The else instruction is not for the first if as we may think, but the second one. This simple error can lead to implementation mistakes.

Program understanding

This is one of IDEs famous features. For example: find all declarations or uses of a given variable or method. It doesn’t find errors but gives useful information on the program.

Program verification and property checking

This analysis enables to check properties of a given specification, for example division by zero. More generally, program verification can also be done with formal methods, which are used to prove the correctness of algorithms and formal specifications.

Bug finding

Bug finding is about pointing out programming mistakes: bad practice, coding errors, unexpected behaviour.
One interesting example of bugs that static analysis can find is null pointer dereference.

int i=0;
String s = null;
if(i>0){
	s = "positive";
}
if(s.contains("pos")){
	System.out.println(s);
}

This code will compile but at runtime a null pointer exception will be thrown, the String s being null and calling the method contains. Static analysis tools, for instance FindBugs can find this bug and report it.

Security review

Static analysis can be applied to find security flaws in code. With dataflow analysis, it is possible to follow the propagation of input data, and thus detect possible code injection. Besides the classic cross-site scripting, here is an example of path traversal:

public static void main(String args[]){
      File f = new File(args[0]);
      f.open();
      //...
}

This program opens a file with an argument entered in command line. The fact that we use this argument to open a file is just an example, the important fact is that we use directly an input without validation, which constitutes a serious security vulnerability. Dataflow analysis can detect this kind of flaw by finding the source of data inputs, following their propagation until their use in a sensitive instruction (like creating a File object). This is called Taint analysis.

Limits

Static analysis has nevertheless its limits. There are some questions that algorithms can’t answer. In computability theory, finding all possible runtime errors or proving a non trivial property of a program is undecidable. It means that no algorithm can always resolve this problem. This all comes to the halting problem : given a program and an input, decide whether the program will eventually halt or run indefinitely. The halting problem has been proven to be undecidable since the only way of finding for sure whether a program halts is to run it.
The consequences of this fact is that static analysis makes approximations which generates what we call false positives and false negatives. A false positive is a result that indicates a condition is true when it is false (opposite to false negative), for instance the tool will point an error where there is none. False negatives are the most problematic since it misses errors giving a false sense of security. This involves two concepts: soundness and completeness. A tool is sound if it never gets false negatives. This means that it will always find an error when it exists but the counter part is there can be false positives. A tool is complete when it never suffer false positives. It will never issue a bug without a feasible counter example proving it. Bug finding tools tend to be unsound and complete, minimizing false positives results, while property checking tools need to find all property violations which make them more sound. The challenge is to find the perfect compromise between soundness and completeness based on the type of analysis.

Conclusion

Static analysis tools aims to help programmers during the development phase to ensure the safety and quality of their code. If used very early, the cost is minimal in contrast to testing which is done at the very last time, after coding.
In the next post, I will present FindBugs, an open source static analysis tool that can detect bugs in java bytecode.