Ambiguity Detection for Programming Language Grammars

Bas Basten. Ambiguity Detection for Programming Language Grammars. PhD thesis, Universiteit van Amsterdam, December 2011.

Abstract

Context-free grammars are the most suitable and most widely used method for describing the syntax of programming languages. They can be used to generate parsers, which transform a piece of source code into a tree-shaped representation of the code’s syntactic structure. These parse trees can then be used for further processing or analysis of the source text. In this sense, grammars form the basis of many engineering and reverse engineering applications, like compilers, interpreters and tools for software analysis and transformation. Unfortunately, context-free grammars have the undesirable property that they can be ambiguous, which can seriously hamper their applicability. A grammar is ambiguous if at least one sentence in its language has more than one valid parse tree. Since the parse tree of a sentence is often used to infer its semantics, an ambiguous sentence can have multiple meanings. For programming languages this is almost always unintended. Ambiguity can therefore be seen as a grammar bug. A specific category of context-free grammars that is particularly sensitive to ambiguity are character-level grammars, which are used to generate scannerless parsers. Unlike traditional token-based grammars, character-level grammars include the full lexical definition of their language. This has the advantage that a language can be specified in a single formalism, and that no separate lexer or scanner phase is necessary in the parser. However, the absence of a scanner does require some additional lexical disambiguation. Character-level grammars can therefore be annotated with special disambiguation declarations to specify which parse trees to discard in case of ambiguity. Unfortunately, it is very hard to determine whether all ambiguities have been covered. The task of searching for ambiguities in a grammar is very complex and time consuming, and is therefore best automated. Since the invention of context-free grammars, several ambiguity detection methods have been developed to this end. However, the ambiguity problem for context-free grammars is undecidable in general, so the perfect detection method cannot exist. This implies a trade-off between accuracy and termination. Methods that apply exhaustive searching are able to correctly find all ambiguities, but they might never terminate. On the other hand, approximative search techniques do always produce an ambiguity report, but these might contain false positives or false negatives. Nevertheless, the fact that every method has flaws does not mean that ambiguity detection cannot be useful in practice. This thesis investigates ambiguity detection with the aim of checking grammars for programming languages. The challenge is to improve upon the state-of-the-art, by finding accurate enough methods that scale to realistic grammars. First we evaluate existing methods with a set of criteria for practical usability. Then we present various improvements to ambiguity detection in the areas of accuracy, performance and report quality. The main contributions of this thesis are two novel techniques. The first is an ambi- guity detection method that applies both exhaustive and approximative searching, called AMBIDEXTER. The key ingredient of AMBIDEXTER is a grammar filtering technique that can remove harmless production rules from a grammar. A production rule is harmless if it does not contribute to any ambiguity in the grammar. Any found harmless rules can therefore safely be removed. This results in a smaller grammar that still contains the same ambiguities as the original one. However, it can now be searched with exhaustive techniques in less time. The grammar filtering technique is formally proven correct, and experimentally validated. A prototype implementation is applied to a series of programming language grammars, and the performance of exhaustive detection methods are measured before and after filtering. The results show that a small investment in filtering time can substantially reduce the run-time of exhaustive searching, sometimes with several orders of magnitude. After this evaluation on token-based grammars, the grammar filtering technique is extended for use with character-level grammars. The extensions deal with the increased complexity of these grammars, as well as their disambiguation declarations. This enables the detection of productions that are harmless due to disambiguation. The extentions are experimentally validated on another set of programming language grammars from practice, with similar results as before. Measurements show that, even though character-level grammars are more expensive to filter, the investment is still very worthwhile. Exhaustive search times were again reduced substantially. The second main contribution of this thesis is DR. AMBIGUITY, an expert system to help grammar developers to understand and solve found ambiguities. If applied to an ambiguous sentence, DR. AMBIGUITY analyzes the causes of the ambiguity and proposes a number of applicable solutions. A prototype implementation is presented and evaluated with a mature Java grammar. After removing disambiguation declarations from the grammar we analyze sentences that have become ambiguous by this removal. The results show that in all cases the removed filter is proposed by DR. AMBIGUITY as a possible cure for the ambiguity. Concluding, this thesis improves ambiguity detection with two novel methods. The first is the ambiguity detection method AMBIDEXTER that applies grammar filtering to substantially speed up exhaustive searching. The second is the expert system DR. AMBIGUITY that automatically analyzes found ambiguities and proposes applicable cures. The results obtained with both methods show that automatic ambiguity detection is now ready for realistic programming language grammars.