Thursday, April 7, 2011

List Comprehensions in C

Some introductions to Haskell monads begin with the IO monad. After a lot of study and setting up a few functions, you can write lines of Haskell guaranteed to execute exactly once in the order of appearance, with errors handled appropriately.

Congratulations, you’ve waded through large tracts of theory only to acheive what C programmers take for granted. Monads are undoubtedly a formidable weapon for pure functional programmers battling with side-effects, but what do imperative programmers care?

On the other hand, list comprehensions impress veteran coders of all stripes due to their astounding conciseness, and hence are a more universally appealing showpiece for monads. Consider a simple example:

[(x,y) | x <- [1,2,3], y <- [1,2,3], x /= y]

which is:

[(1,2),(1,3),(2,1),(2,3),(3,1),(3,2)]

Wouldn’t it be great if C could do this? Our goal for this post is to get list comprehensions in C, preferably without learning anything about monads!

A spoonful of syntactic sugar

How do list comprehensions automatically try all combinations of x and y? Does Haskell have a special list comprehension parser that generates loops and stuff?

Kind of. There is a special parser, but it merely converts our list comprehension into:

do x <- [1,2,3]
y <- [1,2,3]
True <- return (x /= y)
return (x,y)

This hardly differs: some argue it is so close to the original that list comprehension should be rarely used. There are still no signs of any list operations, nor loops. We’re no closer to solving the mystery.

On the bright side, it now resembles C enough that we can consider translating it.

Let me C

A first attempt:

BEGIN_DO;
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };
ASSIGN(x, a);
ASSIGN(y, b);
if (*x != *y) printf(" (%d, %d)", *x, *y);
END_DO;

We’ll figure out the macro definitions in a second. We’re taking a few shortcuts here. We’ve only translated the first two <- expressions. We terminate the arrays with 0. And instead of returning a list of tuples, we simply print them out.

If the ASSIGN macros could change the above into two nested for loops then we’d be done:

for(*x = a; *x; x++) {
for(*y = b; *y; y++) {
if (*x != *y) printf(" (%d, %d)", *x, *y);
}
}

Unfortunately, the ASSIGN lines are self-contained; if we translate them to for loops they’ll stay unnested. Instead, we can use computed goto to get the same effect:

#define BEGIN_DO do { \
void *stack[128]; \
int stacki = 0; \
stack[stacki] = &&label_end
#define END_DO goto *stack[stacki]; label_end: ; } while(0)
#define ASSIGN(x, a) \
int *x, *x##next = a; \
stack[++stacki] = &&label_##x; \
label_##x: if (!*(x = x##next)) goto *stack[--stacki]; else x##next = x + 1

It looks horrible, and we needed GNU C extensions, but we’ve hidden loops in the macros, just as Haskell must be doing.

Another spoonful

It turns out Haskell’s do and <- are themselves syntactic sugar. Our example in fact expands again to:

[1,2,3] >>= (\ x -> [1,2,3] >>= (\y -> return (x/=y) >>=
(\r -> case r of True -> return (x,y)
_ -> fail "")))

This is unsweetened Haskell code. Still no loops, but we’ve learned those separate lines of Haskell are really one big line: later lines are function evaluations nested within earlier lines.

Simulating nested for loops in C was on the right track after all. However, we should have used nested functions. The extra flexibility and power should help complete the conversion.

This time, we’ll build a list of elements instead of just printing them. Also, let’s replace the tuple (x,y) by the integer x+y so we can stick to integers and lists of integers; we can find a C equivalent for Haskell lists and tuples another day.

Intuition suggests all our functions should work with lists; then they’ll look alike, which should make macros easy to write. This implies we must occasionally package integers in singleton lists. We wind up with:

int *f0() {
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };

int x;
auto int *f1();
{
int *r = list_0();
for(int *p = a; (x = *p); p++) r = list_cat(r, f1());
return r;
}

int y;
auto int *f2();
int* f1() {
int *r = list_0();
for(int *p = b; (y = *p); p++) r = list_cat(r, f2());
return r;
}

int z;
auto int *f3();
int* f2() {
int *r = list_0();
for(int *p = list_1(x != y); (z = *p); p++) r = list_cat(r, f3());
return r;
}

int* f3() {
if (z) return list_1(x + y);
return list_0();
}
}

where list_0 returns an empty list of integers, list_cat concatenates two lists of integers, and list_1 creates a singleton list containing the given integer. The functions f0 and f1 take on the roles of the two nested for loops from our previous attempt, only now they also concatenate the values they compute.

Because forward-declaring functions is difficult with macros, we introduce an array of function pointers so one function can call the next. Also, since C doesn’t do pattern matching, we’ll cheat with an ad hoc macro. The full program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BEGIN_DO ({ \
int *(*fun[64])(int); \
int funi = 0

#define END_DO fun[0](0); \
})

#define ASSIGN(x, a) int x; \
int *x##fun(int i) { \
int *r = list_0(); \
for(int *p = a; (x = *p); p++) r = list_cat(r, fun[i+1](i+1)); \
return r; \
} \
fun[funi++] = x##fun

#define MATCH(x, a) int* match() { \
if (x) return a; \
return list_0(); \
} \
fun[funi++] = match

int *list_0() {
int *r = malloc(sizeof(*r));
*r = 0;
return r;
}

int list_len(int *a) {
int *x = a;
while(*x) x++;
return x - a;
}

int *list_cat(int *a, int *b) {
int m = list_len(a), n = list_len(b);
int *r = realloc(a, sizeof(*r) * (m + n + 1));
memcpy(r + m, b, sizeof(*r) * n);
r[m+n] = 0;
free(b);
return r;
}

int *list_1(int n) {
int *r = malloc(sizeof(*r) * 2);
*r = n;
r[1] = 0;
return r;
}

int main() {
int *r = BEGIN_DO;
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };
ASSIGN(x, a);
ASSIGN(y, b);
ASSIGN(z, list_1(x != y)); // Memory leak.
MATCH(z, list_1(x + y));
END_DO;

// Should be 3 4 3 5 4 5.
for(int *x = r; *x; x++) printf(" %d", *x);
putchar('\n');
free(r);
exit(0);
}

At last! List comprehensions in C.

Haskell’s list monad

Haskell does not need zany macros. Instead, it has:

instance Monad [] where
return a = [a]
xs >>= f = concat (map f xs)
fail _ = []

Mystery solved. The >>= operator is like our ASSIGN macro: they both compute a function on each element of a list and concatenate all the results. The return function is our list_1 function, and fail is our list_0.

In both languages, we made looping over lists invisible by writing a routine that converts an integer into a list (just put it in a list of size one) and another that runs a function on lists of integers even though the function normally takes integers (just run the function on each integer in the list, then concatenate the results).

We can generalize this idea to hide other kinds of repetitive code. This is precisely what Haskell monads are, and why do notation exists. Roughly speaking, we augment the abilities of existing functions by describing how to convert a regular output into a new and improved type (monadic type) and how to get a function to operate on monadic types when it expects the regular type.

It’s cool that we can get the same effect in C. Nevertheless:

  • Haskell has a few lines of robust, clean code instead of fragile, ugly macros.

  • This design may never occur to a C programmer, while monads are ubiquitous in Haskell.

  • Monads mesh well with pattern matching, which C lacks.

A comprehensions compromise

Our macros were challenging to write because their combined output must be nested. It’s why we needed computed goto and a stack of labels when simulating nested for loops, and why we needed an array of function pointers when we moved to nested functions.

It would be easier if we tolerated nesting:

int* assign(int *x, int *a, int *(*fun)()) {
int *r = list_empty();
for(int *p = a; (*x = *p); p++) r = list_cat(r, fun());
return r;
}

int *lamb() {
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };
int x;
auto int *lamb();
return assign(&x, a, lamb);
int *lamb() {
int y;
auto int *lamb();
return assign(&y, b, lamb);
int *lamb() {
int z;
auto int *lamb();
return assign(&z, list_1(x != y), lamb);
int *lamb() {
return z ? list_1(x + y) : list_empty();
}
}
}
}
int *r = lamb();

The code is verbose, because at several points, one line declares a function, another calls it, and a third defines it. All three would coalesce into a single line if C had lambda expressions. Frustratingly, GNU C supports closely related nested functions but not anonymous functions. Why not? Even standard C++ is getting lambda expressions! However, if we tolerate one macro, we can ease the pain:

#define ASSIGN(x, a) \
int x; \
auto int *lamb(); \
return assign(&x, a, lamb); \
int *lamb()

...

int *lamb() {
int a[] = { 1, 2, 3, 0 };
int b[] = { 10, 20, 30, 0 };
ASSIGN(x, a) {
ASSIGN(y, b) {
ASSIGN(z, list_1(x != y)) {
return z ? list_1(x + y) : list_empty();
}
}
}
}
int *r = lamb();

While messier than Haskell, it is still readable. Moreover, the nested braces may aid seasoned imperative programmers, who might incorrectly think the Haskell do notation represents statements executing one after another. Here, it is unmistakeably clear we are nesting function calls.

Wednesday, April 6, 2011

Convert line breaks? No!

When I moved my blog to this site, I left the default settings alone. Unfortunately one of them was "Convert line breaks". While harmless at first, it eventually caused headaches when editing HTML for fancier formatting. Turning off the option destroyed all paragraph breaks in older posts, so I had to leave it on for their sake.
I believe Blogger no longer has this problem: on a new test blog, flipping the switch appeared to have no effect on existing posts. However, my blog predates this change.
Googling yields many reports of trouble with this maleficent option. Some suggest a laborious method of disabling it: turn it off, then break paragraphs in old posts by hand!
Such advice was probably written before Blogger gained the export and import features, which are the key to a simpler but scarier solution:
  1. Export the blog, from the "Basic" tab under "Settings".
  2. In the "Edit Posts" tab under "Posting", select all posts.
  3. Delete them all!
  4. import the blog and publish all posts, again from "Basic" tab under "Settings".
Now you can say no to "Convert line breaks" without trashing ancient posts, though when editing new posts, you still have to select "Use <br /> tags" under "Post Options" to get the desired behaviour.
It works because the export format robustly represents line breaks with HTML tags whether or not the evil option was on when the post was written. Take care: after I exported my blog, I got distracted by the recently added "Spam" tab under "Comments", and found a non-spam comment sequestered within. I freed it, but forgot to re-export my blog, so after I deleted all posts and imported my blog, the comment was lost forever.
The export format has also given me ideas for my script that publishes HTML posts via the Blogger Data API, which has a hard time preserving formatted source code. Unfortunately, frequent experiments have temporarily barred my script:
Blog has exceeded rate limit or otherwise requires word verification for new posts

I’ll try again another day.

Permalink Pain

Unfortunately, I belatedly discovered serious manual editing is still required. Permalinks have changed; they now put the day of the month in the URL. Restoring old entries generates the new kind of permalink, invalidating all existing links to old posts.

Sunday, April 3, 2011

Tagline-Oriented Programming

In the land of "tl;dr", the one-liner is king. Short is sweet, and also highly contagious. A paragraph is easy to ignore, while a good tagline embeds itself in a reader’s mind against their will, spreading independently of the accompanying product.

In programming, some one-liners are equally alluring. There is only room for the bare essentials: we must distill the purest form of solutions to problems. Their brevity makes them memorable, and awe-inspiring, as shrinking code can take considerable skill; most recently, I’ve been stunned by a 1433-character chess engine in C.

Having already praised Bash one-liners, this time I’m taking a shallow look at Haskell, and in particular, a trivial function:

emp :: a -> ([b] -> a) -> [b] -> a
emp a f [] = a
emp a f (x:xs) = f (x:xs)

Or since we’re advocating one-liners:

emp a f x = if null x then a else f x

Why? In Haskell, recursion on lists often goes like this:

f []     = empty_list_case
f (x:xs) = recurse_somehow x f(xs)

The emp function let us combine these two lines into one. The fold family of functions is similar: for them, the 2nd argument handles the empty-list case, while the 1st argument is a function that describes how to process the next element. However, they are insufficiently general for our purposes.

Should we go to this trouble just to squeeze the empty-list case into the same line as the inductive step? Yes, if you share my appreciation for one-liners! Also:

  • this is a natural extension of the precedent set by the fold family

  • it’s harder to forget about the base case

  • single lines suit interactive Haskell (if you do type these in, remember to add "let" in front of the definitions).

Examples

Here’s foldl and foldr in one line each:

foldr f z = emp z $ \(x:xs) -> f x (foldr f z xs)
foldl f z = emp z $ \(x:xs) -> foldl f (f z x) xs

And the membership test:

elem x = emp False $ \(y:ys) -> x==y || (x `elem` ys)

All subsequences (though in an order differing from Data.List.subsequences):

sub = emp [[]] $ \(x:xs) -> sub xs ++ (map (x:) (sub xs))

How about all permutations? In C, where we can loop through an array and swap elements in place, an easy but impure recursive solution takes every element in turn, temporarily swaps it to the first position, then calls itself on the remainder of the elements, where the lowest level of recursion prints the whole buffer.

In Haskell, I found it easiest to do something similar by maintaining 2 lists on either side of a selected element. My first attempt came out as:

-- Invoke with: p "" "" "abc"
p a [] [] = [a]
p [] ys [] = []
p a ys [] = map (a++) (p [] [] ys)
p a ys (x:xs) = (if null a then (p [x] ys xs) else []) ++ (p a (x:ys) xs)

It’s unclear how to rewrite this in one line. Perhaps it’s best to break the problem into two sub-problems to get one-liners. The following is based on elegant code found on a mailing list:

f = emp [] $ \(x:xs) -> (x,xs):[(y,x:ys) | (y,ys) <- (f xs)]
perm = emp [[]] $ \xs -> [y:zs | (y,ys) <- (f xs), zs <- (perm ys)]

The helper function f is the FP version of iterating through a list and plucking an element out, and has uses outside this problem.

Integers

A similar definition for integers is also handy:

induct a f x = if 0==x then a else f

For starters, we can define emp with induct:

emp a f = induct a f . length

List all n-letter strings consisting of the letters in "BEN":

f = induct [""] $ \n -> [x:xs | x <- "BEN", xs <- f (n-1)]

Compute an n-bit Gray code:

g = induct [""] $ \n -> (map ('0':)) (g (n-1)) ++ (map ('1':)) (reverse (g (n-1)))

Though shallow, Haskell would gain from default induct and emp functions. (Perhaps there already are? I don’t know Haskell that well.) They bring the length of many interesting function defintions under a mystical threshold that makes them catchy and appealing.