Thursday, April 7, 2011

List Comprehensions in C

Some introductions to Haskell monads begin with the IO monad. After a lot of study and setting up a few functions, you can write lines of Haskell guaranteed to execute exactly once in the order of appearance, with errors handled appropriately.

Congratulations, you’ve waded through large tracts of theory only to acheive what C programmers take for granted. Monads are undoubtedly a formidable weapon for pure functional programmers battling with side-effects, but what do imperative programmers care?

On the other hand, list comprehensions impress veteran coders of all stripes due to their astounding conciseness, and hence are a more universally appealing showpiece for monads. Consider a simple example:

[(x,y) | x <- [1,2,3], y <- [1,2,3], x /= y]

which is:

[(1,2),(1,3),(2,1),(2,3),(3,1),(3,2)]

Wouldn’t it be great if C could do this? Our goal for this post is to get list comprehensions in C, preferably without learning anything about monads!

A spoonful of syntactic sugar

How do list comprehensions automatically try all combinations of x and y? Does Haskell have a special list comprehension parser that generates loops and stuff?

Kind of. There is a special parser, but it merely converts our list comprehension into:

do x <- [1,2,3]
y <- [1,2,3]
True <- return (x /= y)
return (x,y)

This hardly differs: some argue it is so close to the original that list comprehension should be rarely used. There are still no signs of any list operations, nor loops. We’re no closer to solving the mystery.

On the bright side, it now resembles C enough that we can consider translating it.

Let me C

A first attempt:

BEGIN_DO;
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };
ASSIGN(x, a);
ASSIGN(y, b);
if (*x != *y) printf(" (%d, %d)", *x, *y);
END_DO;

We’ll figure out the macro definitions in a second. We’re taking a few shortcuts here. We’ve only translated the first two <- expressions. We terminate the arrays with 0. And instead of returning a list of tuples, we simply print them out.

If the ASSIGN macros could change the above into two nested for loops then we’d be done:

for(*x = a; *x; x++) {
for(*y = b; *y; y++) {
if (*x != *y) printf(" (%d, %d)", *x, *y);
}
}

Unfortunately, the ASSIGN lines are self-contained; if we translate them to for loops they’ll stay unnested. Instead, we can use computed goto to get the same effect:

#define BEGIN_DO do { \
void *stack[128]; \
int stacki = 0; \
stack[stacki] = &&label_end
#define END_DO goto *stack[stacki]; label_end: ; } while(0)
#define ASSIGN(x, a) \
int *x, *x##next = a; \
stack[++stacki] = &&label_##x; \
label_##x: if (!*(x = x##next)) goto *stack[--stacki]; else x##next = x + 1

It looks horrible, and we needed GNU C extensions, but we’ve hidden loops in the macros, just as Haskell must be doing.

Another spoonful

It turns out Haskell’s do and <- are themselves syntactic sugar. Our example in fact expands again to:

[1,2,3] >>= (\ x -> [1,2,3] >>= (\y -> return (x/=y) >>=
(\r -> case r of True -> return (x,y)
_ -> fail "")))

This is unsweetened Haskell code. Still no loops, but we’ve learned those separate lines of Haskell are really one big line: later lines are function evaluations nested within earlier lines.

Simulating nested for loops in C was on the right track after all. However, we should have used nested functions. The extra flexibility and power should help complete the conversion.

This time, we’ll build a list of elements instead of just printing them. Also, let’s replace the tuple (x,y) by the integer x+y so we can stick to integers and lists of integers; we can find a C equivalent for Haskell lists and tuples another day.

Intuition suggests all our functions should work with lists; then they’ll look alike, which should make macros easy to write. This implies we must occasionally package integers in singleton lists. We wind up with:

int *f0() {
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };

int x;
auto int *f1();
{
int *r = list_0();
for(int *p = a; (x = *p); p++) r = list_cat(r, f1());
return r;
}

int y;
auto int *f2();
int* f1() {
int *r = list_0();
for(int *p = b; (y = *p); p++) r = list_cat(r, f2());
return r;
}

int z;
auto int *f3();
int* f2() {
int *r = list_0();
for(int *p = list_1(x != y); (z = *p); p++) r = list_cat(r, f3());
return r;
}

int* f3() {
if (z) return list_1(x + y);
return list_0();
}
}

where list_0 returns an empty list of integers, list_cat concatenates two lists of integers, and list_1 creates a singleton list containing the given integer. The functions f0 and f1 take on the roles of the two nested for loops from our previous attempt, only now they also concatenate the values they compute.

Because forward-declaring functions is difficult with macros, we introduce an array of function pointers so one function can call the next. Also, since C doesn’t do pattern matching, we’ll cheat with an ad hoc macro. The full program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BEGIN_DO ({ \
int *(*fun[64])(int); \
int funi = 0

#define END_DO fun[0](0); \
})

#define ASSIGN(x, a) int x; \
int *x##fun(int i) { \
int *r = list_0(); \
for(int *p = a; (x = *p); p++) r = list_cat(r, fun[i+1](i+1)); \
return r; \
} \
fun[funi++] = x##fun

#define MATCH(x, a) int* match() { \
if (x) return a; \
return list_0(); \
} \
fun[funi++] = match

int *list_0() {
int *r = malloc(sizeof(*r));
*r = 0;
return r;
}

int list_len(int *a) {
int *x = a;
while(*x) x++;
return x - a;
}

int *list_cat(int *a, int *b) {
int m = list_len(a), n = list_len(b);
int *r = realloc(a, sizeof(*r) * (m + n + 1));
memcpy(r + m, b, sizeof(*r) * n);
r[m+n] = 0;
free(b);
return r;
}

int *list_1(int n) {
int *r = malloc(sizeof(*r) * 2);
*r = n;
r[1] = 0;
return r;
}

int main() {
int *r = BEGIN_DO;
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };
ASSIGN(x, a);
ASSIGN(y, b);
ASSIGN(z, list_1(x != y)); // Memory leak.
MATCH(z, list_1(x + y));
END_DO;

// Should be 3 4 3 5 4 5.
for(int *x = r; *x; x++) printf(" %d", *x);
putchar('\n');
free(r);
exit(0);
}

At last! List comprehensions in C.

Haskell’s list monad

Haskell does not need zany macros. Instead, it has:

instance Monad [] where
return a = [a]
xs >>= f = concat (map f xs)
fail _ = []

Mystery solved. The >>= operator is like our ASSIGN macro: they both compute a function on each element of a list and concatenate all the results. The return function is our list_1 function, and fail is our list_0.

In both languages, we made looping over lists invisible by writing a routine that converts an integer into a list (just put it in a list of size one) and another that runs a function on lists of integers even though the function normally takes integers (just run the function on each integer in the list, then concatenate the results).

We can generalize this idea to hide other kinds of repetitive code. This is precisely what Haskell monads are, and why do notation exists. Roughly speaking, we augment the abilities of existing functions by describing how to convert a regular output into a new and improved type (monadic type) and how to get a function to operate on monadic types when it expects the regular type.

It’s cool that we can get the same effect in C. Nevertheless:

  • Haskell has a few lines of robust, clean code instead of fragile, ugly macros.

  • This design may never occur to a C programmer, while monads are ubiquitous in Haskell.

  • Monads mesh well with pattern matching, which C lacks.

A comprehensions compromise

Our macros were challenging to write because their combined output must be nested. It’s why we needed computed goto and a stack of labels when simulating nested for loops, and why we needed an array of function pointers when we moved to nested functions.

It would be easier if we tolerated nesting:

int* assign(int *x, int *a, int *(*fun)()) {
int *r = list_empty();
for(int *p = a; (*x = *p); p++) r = list_cat(r, fun());
return r;
}

int *lamb() {
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };
int x;
auto int *lamb();
return assign(&x, a, lamb);
int *lamb() {
int y;
auto int *lamb();
return assign(&y, b, lamb);
int *lamb() {
int z;
auto int *lamb();
return assign(&z, list_1(x != y), lamb);
int *lamb() {
return z ? list_1(x + y) : list_empty();
}
}
}
}
int *r = lamb();

The code is verbose, because at several points, one line declares a function, another calls it, and a third defines it. All three would coalesce into a single line if C had lambda expressions. Frustratingly, GNU C supports closely related nested functions but not anonymous functions. Why not? Even standard C++ is getting lambda expressions! However, if we tolerate one macro, we can ease the pain:

#define ASSIGN(x, a) \
int x; \
auto int *lamb(); \
return assign(&x, a, lamb); \
int *lamb()

...

int *lamb() {
int a[] = { 1, 2, 3, 0 };
int b[] = { 10, 20, 30, 0 };
ASSIGN(x, a) {
ASSIGN(y, b) {
ASSIGN(z, list_1(x != y)) {
return z ? list_1(x + y) : list_empty();
}
}
}
}
int *r = lamb();

While messier than Haskell, it is still readable. Moreover, the nested braces may aid seasoned imperative programmers, who might incorrectly think the Haskell do notation represents statements executing one after another. Here, it is unmistakeably clear we are nesting function calls.

2 comments:

John said...

Is (*x = *p); a typo?

Ben Lynn said...

No: when converting the macro ASSIGN to the function assign(), I had to introduce indirection since C only passes by value.

With macros, we can get away with things like "#define FOO(x) x = 10". An equivalent function would have to look like "void foo(int *x) { *x = 10; }", an you'd have to pass it "&x", not plain "x".