Saturday, May 29, 2010

Boost your Spirits

After struggling with Boost.Spirit for a couple days, I've finally gotten my head around the parser component (Qi) enough I thought I might share.

Most of the trouble getting things working is the same trouble one gets from most template-based libraries often found in Boost: since template programming is not based on a solid methodology (such as OO) that one can build intuition upon -- rather on C++ tricks or compile-time side effects, and the errors the compiler issues are nigh unreadable, when something goes wrong it's hard to know where or why.

However Spirit is quite redeemable because frankly I'm not sure how many ad-hoc parsers I wish to write in my life-time. Spirit is a recursive descent parser, where the grammar is specified directly in C++, using C++ operators to build the parser. This means Spirit could be a quick and simple way to add parsing to your projects without a lot of effort. Spirit is limited to LL(k) grammars, but most grammars one encounters are fairly simple, so this fits the bill quite ably.

There is some good documentation and examples on the Boost site, so I won't be writing a tutorial. Instead I'll try to give you advice and help you conceptualize what's going on, so you have an easier time reading the tutorials.

Basically, Spirit (or more accurately the Qi parsing component) has a beautiful design: parsers are built by joining simpler parse objects together, using overloaded C++ operators, to build complicated parsers. So,

char_


is a predefined boost::spirit::terminal (grammar[1]) that recognizes a single character.

char_("a")


recognizes the single character 'a'.

char_("a-z")


recognizes a single lower-case character from a to z.

int_ recognizes integers, hex recognizes hexadecimal integers, double_ recognizes double precision floating point numbers, etc.

Now using the grammar operators, we can recognize the following statement:

{0xbeef}


as 48879 with the following grammar:

char_('{') >> string("0x") >> hex >> char_('}')


If we had a list of such bracketed numbers, we could place a Kleene star in front[2] of the whole expression to recognize the list. The full detail of built-in parsers and grammar operators can be found in the documentation.

Spirit has another interesting idea called a skip parser. Basically this is what's used to help tokenize the input by deciding what is not semantically relevant. Usually the skip parser skips space and comments.

For example, if comments are "//-to-eol" delimited, the following may be your skip grammar:


ascii::space | "//" >> *(char_ - eol) >> eol;


So now we have handled white-space and comments, and recognized our number, how do we read 48879 into our program?

Each parser has an associated "attribute" that, in the case of terminals, is easy to deduce. For char_ it is char, for int_ and hex it is int, in the case of *double_ it is std::vector<double>, and so on.

Wait, how did that vector get in there? Good question. You see, the operator not only joins parsers together in a new modified parser, it joins the attributes together into a new modified type. For example if parser p has attribute a of type A, then parser *p has an attribute of type std::vector<A>

Each operator has its own modifier, and this is where things start to get messy. If you have built up recursively a fairly large grammar, the attribute type of the final grammar can be a mammoth unwieldy beast. This type (whatever huge amalgamation it may have turned out to be) is your output AST. Even with simplifying rules, this is a bit of an unwieldy representation.

The documentation tells you define your own sub-rules and sub-grammars so you can get a handle on the complexity, but my recommendation for non-trivial grammars is to skip this automatic attribute generation entirely, and use semantic actions to manually construct your own custom intermediate representation. The result will be more verbose boiler-plate code, but a cleaner representation and ultimately more readable, understandable code.

Quite simply, a semantic action is a callable object which is executed on successful recognition of a token. The parameter of the call is the attribute type. Since you are attaching functions to parsers at a low level, we don't have a problem with huge aggregate types.

Going back to our original example, given the following grammar:


*('{' >> lit("0x") >> hex >> '}'))


and the following test input:


// my data!
{0xabba}
{0xfeed}
{0xbeef}


if we add a semantic action ("got_it()") as follows:


std::list<int> list;

void got_it (int n) { list.push_back (n); }

*('{' >> lit("0x") >> hex[&got_it] >> '}'))


We have a our own custom made representation built in easy to debug steps. Reading this is simple: "when a hexadecimal number is recognized, call a function which pushes the number onto a list".

[1] Actually terminal doesn't appear to inherit from boost::spirit::qi::grammar at all.
[2] Because it's the C++ deference operator*() overloaded, it goes in front, not behind as is traditional.