Pathfinder: XQuery Compilation Techniques for Relational Database Targets

Jens Teubner. Pathfinder: XQuery Compilation Techniques for Relational Database Targets. PhD thesis, Technische Universität München, October 2006.

Abstract

Even after at least a decade of work on XML and semi-structured information, such data is still predominantly processed in main memory, which severely limits the size of the XML documents that can be handled.

On the other hand, mature database technologies are readily available to handle vast amounts of data easily. The most efficient ones, relational database management systems (RDBMSs), however, are largely locked in to the processing of very regular, table-shaped data. In this thesis, we will bring the power of relational database technology to the domain of semi-structured data, the world of XML. We will present a purely relational XQuery processor that handles huge amounts of XML data in an efficient and scalable manner.

Our setup is based on a relational tree encoding. The XPath accelerator, also known as pre/post numbering, has since become a widely accepted means to efficiently store XML data in a relational system. Yet, it has turned out that there is additional performance to gain. A close look into the encoding itself and the deliberate choice of relational indexes will give intriguing insights into the efficient access to node-based tree encodings.
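To make the pre/post numbering concrete, here is a minimal sketch in Python. It assumes the usual formulation of the encoding: every node is stored as one row carrying its preorder and postorder rank, and an XPath axis becomes a range predicate over those two columns. The names `Row`, `encode`, `descendants`, and `ancestors` are illustrative and do not come from the thesis.

```python
from typing import NamedTuple

class Row(NamedTuple):
    pre: int    # preorder rank (document order)
    post: int   # postorder rank
    tag: str

def encode(tree):
    """Flatten a nested (tag, [children]) tree into pre/post rows, sorted by pre."""
    rows = []
    next_rank = {"pre": 0, "post": 0}

    def visit(node):
        tag, children = node
        pre = next_rank["pre"]
        next_rank["pre"] += 1
        slot = len(rows)          # reserve the row so the table stays in pre order
        rows.append(None)
        for child in children:
            visit(child)
        # the postorder rank is only known after all children were visited
        rows[slot] = Row(pre, next_rank["post"], tag)
        next_rank["post"] += 1

    visit(tree)
    return rows

def descendants(rows, v):
    # descendant axis: later in preorder, earlier in postorder
    return [r for r in rows if r.pre > v.pre and r.post < v.post]

def ancestors(rows, v):
    # ancestor axis: earlier in preorder, later in postorder
    return [r for r in rows if r.pre < v.pre and r.post > v.post]

# <a><b><c/><d/></b><e><f/></e></a>
doc = ("a", [("b", [("c", []), ("d", [])]), ("e", [("f", [])])])
table = encode(doc)
b, f = table[1], table[5]
print([r.tag for r in descendants(table, b)])  # ['c', 'd']
print([r.tag for r in ancestors(table, f)])    # ['a', 'e']
```

In a relational system the two list comprehensions would correspond to range selections on a `(pre, post)` table, which is why the choice of indexes on these columns matters.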

The backbone of XML query processing, the access to XML document regions in terms of XPath tree navigation, becomes particularly efficient if the database system is equipped with enhanced, tree-aware algorithms. We will devise staircase join, a novel join operator that encapsulates knowledge of the underlying tree encoding to provide efficient support for XPath navigation primitives. The required changes to the DBMS kernel remain remarkably small: staircase join integrates well with existing features of relational systems and requires adding only a single join operator. The impact on performance, however, is significant: we observed speedups of several orders of magnitude on large-scale XML instances.
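The following Python sketch illustrates how such tree awareness can pay off for the descendant axis. It is a simplified illustration, not the operator as implemented in the thesis, and it assumes a document table of `(pre, post)` pairs sorted by `pre`, with `pre` equal to the row index. The function name `staircase_join_desc` is hypothetical. The sketch uses three ideas: pruning context nodes that are covered by another context node, partitioning the document so each region is scanned at most once, and skipping the rest of a partition as soon as the first non-descendant is seen.

```python
def staircase_join_desc(doc, context):
    """Descendant step for a whole context sequence in one pass.

    doc:     list of (pre, post) rows, sorted by pre, with pre == list index
    context: list of (pre, post) rows, sorted by pre
    Returns the descendants in document order and without duplicates.
    """
    # Pruning: a context node that is a descendant of the previously kept
    # context node cannot contribute anything new, so drop it.
    pruned = []
    for pre, post in context:
        if pruned and post < pruned[-1][1]:
            continue
        pruned.append((pre, post))

    result = []
    for i, (cpre, cpost) in enumerate(pruned):
        # Partitioning: this context node is responsible for the rows
        # up to (but excluding) the next remaining context node.
        end = pruned[i + 1][0] if i + 1 < len(pruned) else len(doc)
        for pre in range(cpre + 1, end):
            if doc[pre][1] < cpost:
                result.append(doc[pre])
            else:
                # Skipping: descendants are contiguous in preorder, so the
                # first non-descendant ends this partition's useful part.
                break
    return result

# (pre, post) rows for <a><b><c/><d/></b><e><f/></e></a>
doc = [(0, 5), (1, 2), (2, 0), (3, 1), (4, 4), (5, 3)]
context = [(1, 2), (2, 0), (4, 4)]          # nodes b, c, e
print(staircase_join_desc(doc, context))    # [(2, 0), (3, 1), (5, 3)]
```

Because every document row is inspected at most once and results come out in document order, no separate sorting or duplicate elimination step is needed in this sketch.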

Although existing XQuery processors make use of the resulting XPath performance gain by evaluating XPath navigation steps in the DBMS back-end, they tend to perform other core XQuery operations outside the relational database kernel. Most notably, this affects the FLWOR iteration primitive, XQuery’s node construction facilities, and the dynamic type semantics of XQuery. The loop-lifting technique we present in this work deals with these aspects in a purely relational fashion. The outcome is a loop-lifting compiler that translates arbitrary XQuery expressions into a single relational query plan. Generated plans take particular advantage of the operations that relational databases know how to perform best: relational joins as well as the computation of aggregates.
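As a rough illustration of how iteration can be expressed relationally, the Python sketch below evaluates `for $x in (10, 20, 30) return $x + 1` on tables instead of in a loop over the query. It assumes a table layout with columns `iter`, `pos`, and `item` (which iteration a value belongs to, its position in the sequence, and the value itself); this layout and the helper names `lift_for`, `add_const`, and `back_map` are assumptions made for the example and are not spelled out in this abstract.

```python
def lift_for(seq):
    """Open a new iteration scope for 'for $x in seq'.

    seq: rows (iter, pos, item) of the sequence expression.
    Returns the table for $x (one inner iteration per item) and a map
    table relating each outer iteration to its inner iterations.
    """
    var, mapping = [], []
    for inner, (outer, _pos, item) in enumerate(sorted(seq), start=1):
        var.append((inner, 1, item))     # $x is a single item per inner iteration
        mapping.append((outer, inner))
    return var, mapping

def add_const(table, k):
    """Evaluate 'item + k' for all iterations at once."""
    return [(it, pos, item + k) for it, pos, item in table]

def back_map(body, mapping):
    """Return the loop body's result to the outer scope, renumbering pos."""
    outer_of = {inner: outer for outer, inner in mapping}
    rows = sorted((outer_of[it], it, pos, item) for it, pos, item in body)
    out, count = [], {}
    for outer, _inner, _pos, item in rows:
        count[outer] = count.get(outer, 0) + 1
        out.append((outer, count[outer], item))
    return out

seq = [(1, 1, 10), (1, 2, 20), (1, 3, 30)]   # (10, 20, 30) in iteration 1
var, mapping = lift_for(seq)
result = back_map(add_const(var, 1), mapping)
print(result)  # [(1, 1, 11), (1, 2, 21), (1, 3, 31)]
```

The point of the sketch is that the loop body is evaluated once, for all iterations together, and that returning to the outer scope amounts to a join with the map table followed by a renumbering, which are operations a relational engine handles well.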

The MonetDB/XQuery system to which this thesis has contributed provides the experimental proof of the effectiveness of our approach. MonetDB/XQuery is one of the fastest and most scalable XQuery engines available today and handles queries in the multi-gigabyte range in interactive time. The keys to this performance are the techniques described in this thesis. They form the basis for MonetDB/XQuery’s core component: the XQuery compiler Pathfinder.