Wednesday, February 11, 2009

Compiler Compilers

Long ago I sat a set of exams designed to test the breadth of our computer science knowledge. I failed Compilers. Luckily, they had just changed the rules, and it was acceptable to fail Compilers provided you passed the Databases exam. Fortune smiled upon me, and I passed the Databases exam despite possessing even less knowledge in that field. Or perhaps fortune was actually laughing at me: I stayed away from compiler theory for the rest of my degree to my detriment. It's scandalous I still know so little.

I was especially dispirited by my failure since I had thought I knew compilers well. When I was a child, I felt moved after learning about recursive descent parsers. I had been mystified: how could a program could possibly read an expression like (1+2)*((3+4)/5)? All I could think of was to use a simple loop to iterate over the tokens in the expression and assigning them to local variables before processing them. This led nowhere, because the nesting of parentheses can be arbitrarily deep. Thus I was awestruck by the elegance of recursive descent; after the first reading the technique was indelibly etched in my mind.

Much later, in university, basic compiler theory (grammar, languages, automata, and so on) came easily to me. I implemented a compiler for a toy language which produced the smallest and fastest code in my class. I'd also written the odd recursive descent parser for pet projects.

So I thought I was a compiler expert. But I should have paid attention to my weaknesses. I found LR(1) parsers unintuitive. I leafed through their proofs of correctness and promptly forgot their details, especially since the textbook stated these parsers are rarely coded by hand and instead generated by programs such as yacc or bison. Instead of embracing these powerful tools, I stuck with trusty hand-coded parsers because I understood them completely.

Ultimately this caused me to fail my Compilers exam, which was peppered with yacc questions, and more seriously, to waste copious amounts of time coding and debugging parsers and lexical analyzers. I have at last invested energy in seriously learning flex and bison, and I urge anyone in a similar situation to do the same. All these years, instead of writing recursive descent parsers unless there was a compelling reason not to, my default attitude should have been to use flex and bison unless there was a compelling reason not to.

I wrote a compiler that converts a toy language (essentially a tiny subset of Limbo) to Dalvik bytecode. I also wrote a Dalvik assembler along the way. In both cases, flex and bison were invaluable: gone are the finicky routines I've written and rewritten a million times, like handling an unexpected end of input, and fine-tuning mutually recursive functions and logic to get the desired operator precedence and associativity.

I've put my efforts on hold. I was hoping for a relatively functional and less object-oriented input language, but it seems such a language would be tricky to shoehorn into Dalvik bytecode: function calls are abstracted, and with no control over stack frames I doubt I can implement trampolining.

Also, I greatly underestimated my ignorance of compiler theory. I know nothing about the static single assignment (SSA) form, which appears too new for the textbooks I had. Moreover, I need to refresh my memory on older optimization techniques.

On the other hand, I don't want to discard my code, particularly as it will forever remind me to consider flex/bison before coding yet another parser by hand. Below is the git repository containing my primitive compiler and assembler that turns this:
init() {
print("Hello, World");

on_input(arg : string) {
a: int;
a:= parse_int(arg);
print(int_to_string(a * (a + 1) / 2));
into an Android app.


jserv said...


When attempting to git clone, I encountered the following problem:

Initialized empty Git repository in /tmp/bender/.git/
fatal: not found: did you run git update-server-info on the server?

Could you check? Thanks!

Ben Lynn said...

Oops! Thanks for pointing that out. I need to read my own Git guide again.

I fixed it. (But don't get your hopes up. The code is in a very primitive state.)