Wednesday, February 11, 2009

Compiler Compilers

Long ago I sat a set of exams designed to test the breadth of our computer science knowledge. I failed Compilers. Luckily, they had just changed the rules, and it was acceptable to fail Compilers provided you passed the Databases exam. Fortune smiled upon me, and I passed the Databases exam despite possessing even less knowledge in that field. Or perhaps fortune was actually laughing at me: I stayed away from compiler theory for the rest of my degree to my detriment. It's scandalous I still know so little.

I was especially dispirited by my failure since I had thought I knew compilers well. When I was a child, I felt moved after learning about recursive descent parsers. I had been mystified: how could a program could possibly read an expression like (1+2)*((3+4)/5)? All I could think of was to use a simple loop to iterate over the tokens in the expression and assigning them to local variables before processing them. This led nowhere, because the nesting of parentheses can be arbitrarily deep. Thus I was awestruck by the elegance of recursive descent; after the first reading the technique was indelibly etched in my mind.

Much later, in university, basic compiler theory (grammar, languages, automata, and so on) came easily to me. I implemented a compiler for a toy language which produced the smallest and fastest code in my class. I'd also written the odd recursive descent parser for pet projects.

So I thought I was a compiler expert. But I should have paid attention to my weaknesses. I found LR(1) parsers unintuitive. I leafed through their proofs of correctness and promptly forgot their details, especially since the textbook stated these parsers are rarely coded by hand and instead generated by programs such as yacc or bison. Instead of embracing these powerful tools, I stuck with trusty hand-coded parsers because I understood them completely.

Ultimately this caused me to fail my Compilers exam, which was peppered with yacc questions, and more seriously, to waste copious amounts of time coding and debugging parsers and lexical analyzers. I have at last invested energy in seriously learning flex and bison, and I urge anyone in a similar situation to do the same. All these years, instead of writing recursive descent parsers unless there was a compelling reason not to, my default attitude should have been to use flex and bison unless there was a compelling reason not to.

I wrote a compiler that converts a toy language (essentially a tiny subset of Limbo) to Dalvik bytecode. I also wrote a Dalvik assembler along the way. In both cases, flex and bison were invaluable: gone are the finicky routines I've written and rewritten a million times, like handling an unexpected end of input, and fine-tuning mutually recursive functions and logic to get the desired operator precedence and associativity.

I've put my efforts on hold. I was hoping for a relatively functional and less object-oriented input language, but it seems such a language would be tricky to shoehorn into Dalvik bytecode: function calls are abstracted, and with no control over stack frames I doubt I can implement trampolining.

Also, I greatly underestimated my ignorance of compiler theory. I know nothing about the static single assignment (SSA) form, which appears too new for the textbooks I had. Moreover, I need to refresh my memory on older optimization techniques.

On the other hand, I don't want to discard my code, particularly as it will forever remind me to consider flex/bison before coding yet another parser by hand. Below is the git repository containing my primitive compiler and assembler that turns this:
init() {
print("Hello, World");
}

on_input(arg : string) {
a: int;
a:= parse_int(arg);
print("\n");
print(int_to_string(a * (a + 1) / 2));
}
into an Android app.

http://cs.stanford.edu/~blynn/bender.git

Friday, February 6, 2009

Minimal Dalvik Executables

When I was a kid, the Windows era had yet to arrive. Floor-model PCs in shops ran MS-DOS. They'd often take steps to stop little imps like me messing around with their systems. I recall one where they even deleted the standard DOS utilities. But they had left COMMAND.COM unmodified which is enough for plenty of mischief. One can create executables via:
 $ copy con: hello.com
and then enter machine code by holding Alt while typing a number on the keypad. (It doesn't work on DOSBox by the way.) Some numbers, such as 0, must be avoided, but there are workarounds; the book from which I learned the trick, Ralph Burger's "Computer Viruses: a High-tech Disease", recommended first creating a program that read numbers in ASCII and output the corresponding binary, a sort of proto-assembler, which in turn could be used to create more complex executables.

I created a HELLO.COM program by typing the following, where the numbers are typed on the numberpad with the Alt key held:
 180 9 190 9 1 205 33 205 32 hello$
which is perhaps more recognizable to DOS programmers when converted to hex:
 B409 BA0901 CD21 CD20 hello$
It is the machine code for:
MOV AH, 09
MOV DX, 109h
INT 21h
INT 20h
This instructs DOS to print the '$'-terminated string at address 0x109 and then exit. When .COM files are loaded, DOS places their contents at address 0x100 (locations 0 to 0x100 are populated with miscellaneous information such as the command-line), hence the 0x109.

Thus with 9 bytes plus the size of the message, plus a byte for the terminator, we get a valid DOS program. We can type raw machine code from the keyboard to produce a full program.

Linux on x86 systems is less straightforward. The machine code is similar, but there must also be an ELF header (not strictly true: in some circumstances, the entire machine code can be placed within a truncated ELF header!), making binaries more tedious to construct by hand. However, certain kernel configurations allow different header types, some of them brief enough to be easily memorized and hand-rolled.

Minimalist Android

How about Android? An APK file is a ZIP file containing, among other files, the Dalvik executable in a DEX file. We're interested in minimizing this DEX file, whose full specification is well-documented.

All Android code must lie in some class, and Android applications typically inherit from a library class such as Activity, so we must also record information such as the class name, access rights, and inheritance. This, in turn, requires a strings section so we can store the class names, as well as invoked library functions and function signatures. Other overheads include the header and the map section. We can cheat a little by defining the UI with XML, so it won't contribute to the size of the DEX file.

After compiling to Java class files, most or all the R*.class files can be removed before making a DEX file,since the constants they contain are hardcoded in other class files by javac. For some projects, R$styleable must be retained, though this class appears to have since been removed from the public API.

I chose to write a program that parsed the DEX file from the HelloAndroid example and output only the necessary sections rather than attempt to create a DEX file from scratch. I found I could safely elide the source filenames, debug information and annotations. The result consists of 632 bytes, though we could trivially shrink this further by using a shorter name for the class.
0000000: 6465 780a 3033 3500 2f4f 153b 3623 8747  dex.035./O.;6#.G
0000010: 6d02 4697 5b1e 959d a8b1 2f0f 9c3a a14f m.F.[...../..:.O
0000020: 7802 0000 7000 0000 7856 3412 0000 0000 x...p...xV4.....
0000030: 0000 0000 f001 0000 0a00 0000 7000 0000 ............p...
0000040: 0500 0000 9800 0000 0300 0000 ac00 0000 ................
0000050: 0000 0000 0000 0000 0500 0000 d000 0000 ................
0000060: 0100 0000 f800 0000 6001 0000 1801 0000 ........`.......
0000070: 6201 0000 6a01 0000 6d01 0000 8501 0000 b...j...m.......
0000080: 9a01 0000 bc01 0000 bf01 0000 c301 0000 ................
0000090: c701 0000 d101 0000 0100 0000 0200 0000 ................
00000a0: 0300 0000 0400 0000 0500 0000 0500 0000 ................
00000b0: 0400 0000 0000 0000 0600 0000 0400 0000 ................
00000c0: 5401 0000 0700 0000 0400 0000 5c01 0000 T...........\...
00000d0: 0100 0000 0000 0000 0100 0200 0800 0000 ................
00000e0: 0300 0000 0000 0000 0300 0200 0800 0000 ................
00000f0: 0300 0100 0900 0000 0300 0000 0100 0000 ................
0000100: 0100 0000 0000 0000 ffff ffff 0000 0000 ................
0000110: e101 0000 0000 0000 0100 0100 0100 0000 ................
0000120: 0000 0000 0400 0000 7010 0000 0000 0e00 ........p.......
0000130: 0300 0200 0200 0000 0000 0000 0900 0000 ................
0000140: 6f20 0100 2100 1500 037f 6e20 0400 0100 o ..!.....n ....
0000150: 0e00 0000 0100 0000 0000 0000 0100 0000 ................
0000160: 0200 063c 696e 6974 3e00 0149 0016 4c61 .....I..La
0000170: 6e64 726f 6964 2f61 7070 2f41 6374 6976 ndroid/app/Activ
0000180: 6974 793b 0013 4c61 6e64 726f 6964 2f6f ity;..Landroid/o
0000190: 732f 4275 6e64 6c65 3b00 204c 636f 6d2f s/Bundle;. Lcom/
00001a0: 616e 6472 6f69 642f 6865 6c6c 6f2f 4865 android/hello/He
00001b0: 6c6c 6f41 6e64 726f 6964 3b00 0156 0002 lloAndroid;..V..
00001c0: 5649 0002 564c 0008 6f6e 4372 6561 7465 VI..VL..onCreate
00001d0: 000e 7365 7443 6f6e 7465 6e74 5669 6577 ..setContentView
00001e0: 0000 0001 0102 8180 0498 0203 01b0 0200 ................
00001f0: 0b00 0000 0000 0000 0100 0000 0000 0000 ................
0000200: 0100 0000 0a00 0000 7000 0000 0200 0000 ........p.......
0000210: 0500 0000 9800 0000 0300 0000 0300 0000 ................
0000220: ac00 0000 0500 0000 0500 0000 d000 0000 ................
0000230: 0600 0000 0100 0000 f800 0000 0120 0000 ............. ..
0000240: 0200 0000 1801 0000 0110 0000 0200 0000 ................
0000250: 5401 0000 0220 0000 0a00 0000 6201 0000 T.... ......b...
0000260: 0020 0000 0100 0000 e101 0000 0010 0000 . ..............
0000270: 0100 0000 f001 0000 ........
Only the 4 16-bit words at 0x128 and the 9 16-bit words at 0x140 are actual Dalvik bytecode. They belong to the constructor and the onCreate method respectively.