Thursday, April 12, 2012

Sudoku: Cohen vs my reading ability

I was wrong about being wrong in my last post. My original predictions were correct after all.

I had skimmed Bram Cohen’s post too quickly, and when I tried following his approach, my branching heuristic was simply to take the clause with the fewest literals.

What happens if we do this? Initially, there are many clauses consisting of 2 negative literals, and a few consisting of 9 positive literals. Thus most of the time we pick a negative clause. We’ll only ever pick a positive clause if only 2 literals are left. By this stage, it’s no different from a negative clause, because assigning true to one of the variables is the same as assigning false to the other.
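
To put rough numbers on “many” and “few” (a sketch of my own; the helper names are mine, and the totals match the clause-array bound 4*9*9*(1+9*8/2) in the program below):

```c
#include <assert.h>

// 4 constraint types (cell, row, column, region) x 81 positions each
// contribute one 9-literal positive "at least one digit" clause...
int positive_clauses(void) { return 4 * 9 * 9; }
// ...and every positive clause spawns C(9,2) = 36 binary negative
// "at most one digit" clauses.
int negative_clauses(void) { return positive_clauses() * (9 * 8 / 2); }
```

So negative clauses outnumber positive ones 36 to 1, which is why the smallest-clause heuristic almost always lands on a negative clause.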

In other words, apart from unit propagation, each step simply eliminates a particular digit from a particular cell, depending on the order of the clauses. That is, apart from unit propagation, this is no different to brute force!

I failed to see this when I tried writing a SAT solver, and wound up surprised at the results. It was only later I realized I should have been ignoring negative clauses to make Cohen’s program behave like Knuth’s: the positive clauses correspond exactly to the columns in the exact cover problem, and the literals within a positive clause correspond to 1s in a column.

The only difference is performance: manipulating lists or arrays to find which negative clauses affect which positive clauses is much slower than manipulating a few links to find which rows affect which columns.

When I returned to Cohen’s post, I realized he had explicitly mentioned skipping negative clauses during branching.

Below is my C translation of Cohen’s program. It is more efficient than the original because of in-place deletion and undeletion of clauses and literals. (I’ve used “dancing arrays” instead of Python lists.) As expected, this solver is orders of magnitude slower than my dancing links solver, but handily beats Norvig’s program.
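
The in-place deletion trick can be isolated in a few lines (a sketch of mine; del and undel are my names): deleting an element swaps it with the last one and shrinks the count, leaving the victim parked just past the end, so undeletion merely grows the count again.

```c
#include <assert.h>

// Delete a[i] from an array of *n live elements: swap it with the last
// element and shrink the count. The deleted value survives past the end.
int del(int *a, int *n, int i) {
  int v = a[i];
  a[i] = a[--*n];
  a[*n] = v;
  return v;
}

// Undelete: grow the count again. The order of elements may change,
// but only the set of live elements matters.
void undel(int *n) { ++*n; }
```

This mirrors how assign() below removes satisfied clauses from the y array and false literals from clauses, and how the undo loop in solve() restores literals by re-growing clause sizes.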

#include <ctype.h>
#include <limits.h>
#include <stdio.h>

#define F(i,n) for(int i = 0; i < n; i++)

int main() {
  int a[9][9] = {{0}}, x[4*9*9*(1+9*8/2)][10] = {{0}}, n = 0, m;
  // Read a sudoku.
  F(i, 9) F(j, 9) do if (EOF == (m = getchar())) return 1; while(
      isdigit(m) ? a[i][j] = m - '0', 0 : m != '.');
  int enc(int a, int b, int c) {return 9*9*a + 9*b + c + 1;}
  void add(int n, int a, int b, int c) {x[n][++*x[n]] = enc(a, b, c);}
  F(i, 9) F(j, 9) F(k, 9 || (n += 4, 0)) {  // At least one digit per:
    add(n  , k, i, j);  // ...box
    add(n+1, i, k, j);  // ...column
    add(n+2, i, j, k);  // ...row
    add(n+3, i, j/3*3 + k/3, j%3*3 + k%3);  // ...3x3 region.
  }
  for(int i = n-1; i >= 0; i--) F(j, x[i][0]) F(k, j) {
    x[n][1] = -x[i][j+1];  // At most one digit per positive clause.
    x[n][2] = -x[i][k+1];  // (Hence the 9 choose 2 factor above.)
    x[n++][0] = 2;
  }
  int y[n], out[9*9*9];
  F(i, n) y[i] = i;
  int assign(int n, int v) {
    F(i, n) {
      int k = y[i];
      F(j, x[k][0]) {
        if (x[k][j+1] == v) {  // Satisfied clause:
          y[i--] = y[--n];     // Swap with last one
          y[n] = k;            // and decrement array count.
          break;
        } else if (x[k][j+1] == -v) {  // False literal:
          x[k][j+1] = x[k][x[k][0]];   // Swap with last one
          x[k][x[k][0]--] = -v;        // and decrement clause size.
          break;  // Assume literals are unique in a clause.
        }
      }
    }
    return n;
  }
  void solve(int n) {
    int s = INT_MAX, t = 0;
    if (!n) {  // Print solution.
      F(i, m) if ((t = out[i] - 1) >= 0) a[t/9%9][t%9] = t/9/9 + 1;
      F(r, 9) F(c, 9 || 0*puts("")) putchar('0'+a[r][c]);
      return;
    }
    F(i, n) if (x[y[i]][0] < s) {  // Find smallest positive clause.
      if (x[y[i]][0] > 1 && x[y[i]][1] < 0) continue;
      if (!(s = x[y[i]][0])) return;  // Empty clause: no solution.
      t = y[i];
    }
    void try(int v) {
      solve(assign(n, out[m++] = v));
      F(i, n) {  // Undo any clause deletions.
        int k = y[i];
        if (x[k][0] < 9 && x[k][x[k][0]+1] == -v) x[k][0]++;
      }
      m--;
    }
    try(x[t][1]);
    if (s > 1) try(-x[t][1]);
  }
  // Fill in the given digits, and go!
  F(r, 9) F(c, 9) if (a[r][c]) n = assign(n, enc(a[r][c]-1, r, c));
  m = 0, solve(n);  // Reuse m as count of 'out' array.
  return 0;
}

Wednesday, April 11, 2012

Sudoku: Knuth vs Norvig vs Cohen

For reasons I may explain eventually, I’m looking up Sudoku algorithms again. Long ago, I wrote a solver after I read about Knuth applying the “dancing links” algorithm to Sudoku. This ingenious method worked so beautifully that I left the problem alone, save for a failed attempt to solve Sudoku with ZDDs.

Because of dancing links' effectiveness, I had thought there would be little else to say about Sudoku algorithms. How wrong I was! Wikipedia now sports an article dedicated to Sudoku algorithms, and the web is inundated with Sudoku programs in many different languages using many different algorithms.

Peter Norvig and Bram Cohen both wrote articles about Sudoku programs. Neither followed Knuth. Intriguing: are they instead using a method even more ingenious than encoding a Sudoku puzzle as an exact cover problem then deftly manipulating linked lists to solve it?

On closer inspection, it appeared not. Norvig’s code uses the same constraints as Knuth’s to eliminate particular digits from particular squares. However, its search heuristic is inferior, as it only looks at one type of constraint (that each box contains exactly one digit) when deciding where to branch next; Knuth’s code instead implicitly examines all constraints.

Cohen’s code encodes the same constraints as a Boolean satisfiability problem. When branching in the search, it chooses the clause with the fewest remaining literals, so, as with Knuth, all constraints play a part in the decision. Thus:

  1. Norvig’s program should be slowest because its search heuristic only uses one type of constraint.

  2. Cohen’s program should search for solutions somewhat like Knuth’s. Both Cohen and Knuth use all constraints when branching: Knuth picks the column with fewest 1s, while Cohen picks the clause with fewest literals.

  3. Cohen’s program should be slower than Knuth’s. Dancing links follows a few pointers when eliminating competing digits, while the simple SAT solver performs many linear scans for a certain literal. Even if we used something fancier than lists to represent clauses, when eliminating competing digits in a box, row, column, or 3x3 subregion, we’d still visit on the order of (9 choose 2) = 36 clauses as opposed to about 9 links in a sparse matrix.

On the other hand, although slower, Norvig’s and Cohen’s code looked more succinct than my old dancing links solver. Maybe because they’re using a scripting language. Or is my implementation unnecessarily verbose? Or perhaps dancing links is intrinsically more complex?

To compare, I coded all 3 approaches in C; in particular, I rewrote my old dancing links solver. As I suspected, though I compressed my old code, it was still the longest solution.

However, I was surprised when I compared performance. As I predicted, dancing links was fastest. But I had overlooked an important detail; an error lurks in my reasoning above.

Where was I wrong, and why? I’ll answer this another time because I want to show off something else for now.

A one-liner Sudoku solver?

While researching, I stumbled across a 3-line Sudoku solver in Perl, and a similar solver in Python, both of which have similarities with a C program by Aidan Thornton.

Why hadn’t I tried writing a terse solver yet? After all, I appreciate minimalist code the same way normal people appreciate poetry.

Without delay, I wrote a solver using GCC’s nested functions.

#include <ctype.h>
#include <stdio.h>
#define F(n) for(int i=0;i<n;i++)
int main() {
  char s[82] = {0}, c;
  F(81) while(isdigit(c = getchar()) ? s[i] = c, 0 : c != '.');
  void f(char *p) {
    int g(char d, int r, int c) {
      int h(int r, int c) { return s[9*r+c] == d; }
      F(9) if (h(r, i) || h(i, c) || h(r/3*3+i/3, c/3*3+i%3)) return 0;
      return 1;
    }
    if (p == s+81) puts(s); else if (*p>'0') f(p+1);
    else F(9) if (g(i+'1', (p-s)/9, (p-s)%9)) *p = i+'1', f(p+1), *p = 0;
  }
  return f(s), 0;
}

When comparing notes, I was happy to find my approach differed from the others. My inner loop iterates 9 times and checks 3 boxes at once; the others loop over all 81 boxes and check one only when its index meets one of 3 criteria.
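
The difference can be sketched side by side (my code; the names are mine; s is the 81-character grid):

```c
#include <assert.h>

// Both functions ask whether digit d would clash at row r, column c.

// Mine: 9 iterations, each checking a row cell, a column cell, and a
// 3x3-box cell at once.
int clash9(const char *s, char d, int r, int c) {
  for (int i = 0; i < 9; i++)
    if (s[9*r + i] == d || s[9*i + c] == d ||
        s[9*(r/3*3 + i/3) + c/3*3 + i%3] == d) return 1;
  return 0;
}

// The others: 81 iterations, testing whether each index shares a row,
// column, or box with (r, c).
int clash81(const char *s, char d, int r, int c) {
  for (int i = 0; i < 81; i++)
    if ((i/9 == r || i%9 == c || (i/9/3 == r/3 && i%9/3 == c/3)) &&
        s[i] == d) return 1;
  return 0;
}

// Example grid: a lone '5' in the top-left cell; the rest are NULs.
static char grid[82] = "5";
```

Both visit at most 27 candidate cells, but the 81-cell version also pays for 81 index computations per check.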

This suggests my code should be faster. A few puzzles I tried supported this. In particular, my program took around 8.5s on my laptop to solve Wikipedia’s "near worst case" puzzle:

.........
.....3.85
..1.2....
...5.7...
..4...1..
.9.......
5......73
..2.1....
....4...9

while Thornton’s brute force solver took about 2m 45s. Both programs were compiled with the -O2 option. (With -O3, my program took about 5s, while the other program seemed unchanged.)

All the same, I’d prefer a shorter solution. Haskell might be a better medium, due to many space-saving features. Also, I tend to avoid Perl, Python, and friends because I like static typing.

Translating my above code yielded:

module Main where
f x s@('.':y)=[a|z<-['1'..'9'],notElem z(map((x++s)!!)(foldr1(++)[[9*r+i,9*i+c,9*(r`div`3*3+i`div`3)+c`div`3*3+i`mod`3]|i<-[0..8]])),a<-f(x++[z])y]where(r,c)=divMod(length x)9
f x(z:y)=f(x++[z])y
f x[]=[x]

-- The puzzle from the Wikipedia article on Sudoku.
main=print$f[] "53..7....6..195....98....6.8...6...34..8.3..17...2...6.6....28....419..5....8..79"

This wasn’t quite as short as I’d like, due to peculiarities of Haskell. The div and mod operators take 5 characters each, while in most languages these are single non-alphanumeric characters. I spent a couple of lines on the trivial cases of the recursion. More space is taken by folding a list concatenation over the 4 terms for each i in [0..8].

We can eliminate most divs and mods by looping over all 81 boxes as in Thornton’s code, except with two nested loops going up to 9 rather than a single loop iterating 81 times. After a few more edits, we have:

module Main where
f x s@(h:y)=let(r,c)=divMod(length x)9;m#n=m`div`3==n`div`3;e=[0..8]in[a|z<-['1'..'9'],h==z||h=='.'&&notElem z(map((x++s)!!)[i*9+j|i<-e,j<-e,i==r||j==c||i#r&&j#c]),a<-f(x++[z])y]
f x[]=[x]

main=print$f[] "53..7....6..195....98....6.8...6...34..8.3..17...2...6.6....28....419..5....8..79"

It’s still a bit longer than the Python and Perl solutions I saw. It could be my approach fundamentally takes more space. Or maybe Haskell itself is harder to shorten, due to syntax or features like static typing. Or maybe I just need to learn more Haskell.

Thursday, April 7, 2011

List Comprehensions in C

Some introductions to Haskell monads begin with the IO monad. After a lot of study and setting up a few functions, you can write lines of Haskell guaranteed to execute exactly once in the order of appearance, with errors handled appropriately.

Congratulations, you’ve waded through large tracts of theory only to achieve what C programmers take for granted. Monads are undoubtedly a formidable weapon for pure functional programmers battling with side-effects, but what do imperative programmers care?

On the other hand, list comprehensions impress veteran coders of all stripes due to their astounding conciseness, and hence are a more universally appealing showpiece for monads. Consider a simple example:

[(x,y) | x <- [1,2,3], y <- [1,2,3], x /= y]

which is:

[(1,2),(1,3),(2,1),(2,3),(3,1),(3,2)]

Wouldn’t it be great if C could do this? Our goal for this post is to get list comprehensions in C, preferably without learning anything about monads!

A spoonful of syntactic sugar

How do list comprehensions automatically try all combinations of x and y? Does Haskell have a special list comprehension parser that generates loops and stuff?

Kind of. There is a special parser, but it merely converts our list comprehension into:

do x <- [1,2,3]
   y <- [1,2,3]
   True <- return (x /= y)
   return (x,y)

This hardly differs: some argue it is so close to the original that list comprehensions should rarely be used. There are still no signs of any list operations, nor loops. We’re no closer to solving the mystery.

On the bright side, it now resembles C enough that we can consider translating it.

Let me C

A first attempt:

BEGIN_DO;
int a[] = { 1, 2, 3, 0 };
int b[] = { 1, 2, 3, 0 };
ASSIGN(x, a);
ASSIGN(y, b);
if (*x != *y) printf(" (%d, %d)", *x, *y);
END_DO;

We’ll figure out the macro definitions in a second. We’re taking a few shortcuts here. We’ve only translated the first two <- expressions. We terminate the arrays with 0. And instead of returning a list of tuples, we simply print them out.

If the ASSIGN macros could change the above into two nested for loops then we’d be done:

for(x = a; *x; x++) {
  for(y = b; *y; y++) {
    if (*x != *y) printf(" (%d, %d)", *x, *y);
  }
}

Unfortunately, the ASSIGN lines are self-contained; if we translate them to for loops they’ll stay unnested. Instead, we can use computed goto to get the same effect:

#define BEGIN_DO do { \
  void *stack[128]; \
  int stacki = 0; \
  stack[stacki] = &&label_end
#define END_DO goto *stack[stacki]; label_end: ; } while(0)
#define ASSIGN(x, a) \
  int *x, *x##next = a; \
  stack[++stacki] = &&label_##x; \
  label_##x: if (!*(x = x##next)) goto *stack[--stacki]; else x##next = x + 1

It looks horrible, and we needed GNU C extensions, but we’ve hidden loops in the macros, just as Haskell must be doing.
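
Assembled into a complete (GNU C) program, the macros do work as advertised; in this sketch of mine I collect the pairs into a buffer rather than printing, so the result is easy to check:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BEGIN_DO do { \
  void *stack[128]; \
  int stacki = 0; \
  stack[stacki] = &&label_end
#define END_DO goto *stack[stacki]; label_end: ; } while(0)
#define ASSIGN(x, a) \
  int *x, *x##next = a; \
  stack[++stacki] = &&label_##x; \
  label_##x: if (!*(x = x##next)) goto *stack[--stacki]; else x##next = x + 1

char buf[128], *bufp;  // Collects the output pairs (my addition).

void run(void) {
  bufp = buf;
  BEGIN_DO;
  int a[] = { 1, 2, 3, 0 };
  int b[] = { 1, 2, 3, 0 };
  ASSIGN(x, a);
  ASSIGN(y, b);
  if (*x != *y) bufp += sprintf(bufp, " (%d,%d)", *x, *y);
  END_DO;
}
```

run() leaves " (1,2) (1,3) (2,1) (2,3) (3,1) (3,2)" in buf: the same six pairs as the Haskell comprehension.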

Another spoonful

It turns out Haskell’s do and <- are themselves syntactic sugar. Our example in fact expands again to:

[1,2,3] >>= (\x -> [1,2,3] >>= (\y -> return (x/=y) >>=
  (\r -> case r of True -> return (x,y)
                   _    -> fail "")))

This is unsweetened Haskell code. Still no loops, but we’ve learned those separate lines of Haskell are really one big line: later lines are function evaluations nested within earlier lines.

Simulating nested for loops in C was on the right track after all. However, we should have used nested functions. The extra flexibility and power should help complete the conversion.

This time, we’ll build a list of elements instead of just printing them. Also, let’s replace the tuple (x,y) by the integer x+y so we can stick to integers and lists of integers; we can find a C equivalent for Haskell lists and tuples another day.

Intuition suggests all our functions should work with lists; then they’ll look alike, which should make macros easy to write. This implies we must occasionally package integers in singleton lists. We wind up with:

int *f0() {
  int a[] = { 1, 2, 3, 0 };
  int b[] = { 1, 2, 3, 0 };

  int x;
  auto int *f1();
  {
    int *r = list_0();
    for(int *p = a; (x = *p); p++) r = list_cat(r, f1());
    return r;
  }

  int y;
  auto int *f2();
  int *f1() {
    int *r = list_0();
    for(int *p = b; (y = *p); p++) r = list_cat(r, f2());
    return r;
  }

  int z;
  auto int *f3();
  int *f2() {
    int *r = list_0();
    for(int *p = list_1(x != y); (z = *p); p++) r = list_cat(r, f3());
    return r;
  }

  int *f3() {
    if (z) return list_1(x + y);
    return list_0();
  }
}

where list_0 returns an empty list of integers, list_cat concatenates two lists of integers, and list_1 creates a singleton list containing the given integer. The functions f0 and f1 take on the roles of the two nested for loops from our previous attempt, only now they also concatenate the values they compute.

Because forward-declaring functions is difficult with macros, we introduce an array of function pointers so one function can call the next. Also, since C doesn’t do pattern matching, we’ll cheat with an ad hoc macro. The full program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BEGIN_DO ({ \
  int *(*fun[64])(int); \
  int funi = 0

#define END_DO fun[0](0); \
})

#define ASSIGN(x, a) int x; \
  int *x##fun(int i) { \
    int *r = list_0(); \
    for(int *p = a; (x = *p); p++) r = list_cat(r, fun[i+1](i+1)); \
    return r; \
  } \
  fun[funi++] = x##fun

#define MATCH(x, a) int *match() { \
    if (x) return a; \
    return list_0(); \
  } \
  fun[funi++] = match

int *list_0() {
  int *r = malloc(sizeof(*r));
  *r = 0;
  return r;
}

int list_len(int *a) {
  int *x = a;
  while(*x) x++;
  return x - a;
}

int *list_cat(int *a, int *b) {
  int m = list_len(a), n = list_len(b);
  int *r = realloc(a, sizeof(*r) * (m + n + 1));
  memcpy(r + m, b, sizeof(*r) * n);
  r[m+n] = 0;
  free(b);
  return r;
}

int *list_1(int n) {
  int *r = malloc(sizeof(*r) * 2);
  *r = n;
  r[1] = 0;
  return r;
}

int main() {
  int *r = BEGIN_DO;
  int a[] = { 1, 2, 3, 0 };
  int b[] = { 1, 2, 3, 0 };
  ASSIGN(x, a);
  ASSIGN(y, b);
  ASSIGN(z, list_1(x != y)); // Memory leak.
  MATCH(z, list_1(x + y));
  END_DO;

  // Should be 3 4 3 5 4 5.
  for(int *x = r; *x; x++) printf(" %d", *x);
  putchar('\n');
  free(r);
  exit(0);
}

At last! List comprehensions in C.

Haskell’s list monad

Haskell does not need zany macros. Instead, it has:

instance Monad [] where
  return a = [a]
  xs >>= f = concat (map f xs)
  fail _ = []

Mystery solved. The >>= operator is like our ASSIGN macro: they both compute a function on each element of a list and concatenate all the results. The return function is our list_1 function, and fail is our list_0.

In both languages, we made looping over lists invisible by writing a routine that converts an integer into a list (just put it in a list of size one) and another that runs a function on lists of integers even though the function normally takes integers (just run the function on each integer in the list, then concatenate the results).
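
In C terms, that shared routine is a single higher-order function (a sketch of mine: bind and twice are my names; the list helpers are the ones from the program above):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// List helpers as in the program above.
int *list_0() {
  int *r = malloc(sizeof(*r));
  *r = 0;
  return r;
}
int list_len(int *a) {
  int *x = a;
  while(*x) x++;
  return x - a;
}
int *list_cat(int *a, int *b) {
  int m = list_len(a), n = list_len(b);
  int *r = realloc(a, sizeof(*r) * (m + n + 1));
  memcpy(r + m, b, sizeof(*r) * n);
  r[m+n] = 0;
  free(b);
  return r;
}
int *list_1(int n) {
  int *r = malloc(sizeof(*r) * 2);
  *r = n;
  r[1] = 0;
  return r;
}

// The list monad's ">>=" written in C: run f on each element and
// concatenate the results -- concat (map f xs).
int *bind(int *xs, int *(*f)(int)) {
  int *r = list_0();
  for (int *p = xs; *p; p++) r = list_cat(r, f(*p));
  return r;
}

// Example f: "return" each element twice.
int *twice(int n) { return list_cat(list_1(n), list_1(n)); }
```

bind over { 1, 2, 3 } with twice yields { 1, 1, 2, 2, 3, 3 }, just as [1,2,3] >>= (\x -> [x,x]) would in Haskell.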

We can generalize this idea to hide other kinds of repetitive code. This is precisely what Haskell monads are, and why do notation exists. Roughly speaking, we augment the abilities of existing functions by describing how to convert a regular output into a new and improved type (monadic type) and how to get a function to operate on monadic types when it expects the regular type.

It’s cool that we can get the same effect in C. Nevertheless:

  • Haskell has a few lines of robust, clean code instead of fragile, ugly macros.

  • This design may never occur to a C programmer, while monads are ubiquitous in Haskell.

  • Monads mesh well with pattern matching, which C lacks.

A comprehensions compromise

Our macros were challenging to write because their combined output must be nested. It’s why we needed computed goto and a stack of labels when simulating nested for loops, and why we needed an array of function pointers when we moved to nested functions.

It would be easier if we tolerated nesting:

int *assign(int *x, int *a, int *(*fun)()) {
  int *r = list_empty();
  for(int *p = a; (*x = *p); p++) r = list_cat(r, fun());
  return r;
}

int *lamb() {
  int a[] = { 1, 2, 3, 0 };
  int b[] = { 1, 2, 3, 0 };
  int x;
  auto int *lamb();
  return assign(&x, a, lamb);
  int *lamb() {
    int y;
    auto int *lamb();
    return assign(&y, b, lamb);
    int *lamb() {
      int z;
      auto int *lamb();
      return assign(&z, list_1(x != y), lamb);
      int *lamb() {
        return z ? list_1(x + y) : list_empty();
      }
    }
  }
}
int *r = lamb();

The code is verbose, because at several points, one line declares a function, another calls it, and a third defines it. All three would coalesce into a single line if C had lambda expressions. Frustratingly, GNU C supports closely related nested functions but not anonymous functions. Why not? Even standard C++ is getting lambda expressions! However, if we tolerate one macro, we can ease the pain:

#define ASSIGN(x, a) \
  int x; \
  auto int *lamb(); \
  return assign(&x, a, lamb); \
  int *lamb()

...

int *lamb() {
  int a[] = { 1, 2, 3, 0 };
  int b[] = { 10, 20, 30, 0 };
  ASSIGN(x, a) {
    ASSIGN(y, b) {
      ASSIGN(z, list_1(x != y)) {
        return z ? list_1(x + y) : list_empty();
      }
    }
  }
}
int *r = lamb();

While messier than Haskell, it is still readable. Moreover, the nested braces may aid seasoned imperative programmers, who might incorrectly think the Haskell do notation represents statements executing one after another. Here, it is unmistakably clear we are nesting function calls.

Wednesday, April 6, 2011

Convert line breaks? No!

When I moved my blog to this site, I left the default settings alone. Unfortunately one of them was "Convert line breaks". While harmless at first, it eventually caused headaches when editing HTML for fancier formatting. Turning off the option destroyed all paragraph breaks in older posts, so I had to leave it on for their sake.
I believe Blogger no longer has this problem: on a new test blog, flipping the switch appeared to have no effect on existing posts. However, my blog predates this change.
Googling yields many reports of trouble with this maleficent option. Some suggest a laborious method of disabling it: turn it off, then break paragraphs in old posts by hand!
Such advice was probably written before Blogger gained the export and import features, which are the key to a simpler but scarier solution:
  1. Export the blog, from the "Basic" tab under "Settings".
  2. In the "Edit Posts" tab under "Posting", select all posts.
  3. Delete them all!
  4. Import the blog and publish all posts, again from the "Basic" tab under "Settings".
Now you can say no to "Convert line breaks" without trashing ancient posts, though when editing new posts, you still have to select "Use <br /> tags" under "Post Options" to get the desired behaviour.
It works because the export format robustly represents line breaks with HTML tags whether or not the evil option was on when the post was written. Take care: after I exported my blog, I got distracted by the recently added "Spam" tab under "Comments", and found a non-spam comment sequestered within. I freed it, but forgot to re-export my blog, so after I deleted all posts and imported my blog, the comment was lost forever.
The export format has also given me ideas for my script that publishes HTML posts via the Blogger Data API, which has a hard time preserving formatted source code. Unfortunately, frequent experiments have temporarily barred my script:
Blog has exceeded rate limit or otherwise requires word verification for new posts

I’ll try again another day.

Permalink Pain

Unfortunately, I belatedly discovered serious manual editing is still required. Permalinks have changed; they now put the day of the month in the URL. Restoring old entries generates the new kind of permalink, invalidating all existing links to old posts.

Sunday, April 3, 2011

Tagline-Oriented Programming

In the land of "tl;dr", the one-liner is king. Short is sweet, and also highly contagious. A paragraph is easy to ignore, while a good tagline embeds itself in a reader’s mind against their will, spreading independently of the accompanying product.

In programming, some one-liners are equally alluring. There is only room for the bare essentials: we must distill the purest form of solutions to problems. Their brevity makes them memorable, and awe-inspiring, as shrinking code can take considerable skill; most recently, I’ve been stunned by a 1433-character chess engine in C.

Having already praised Bash one-liners, this time I’m taking a shallow look at Haskell, and in particular, a trivial function:

emp :: a -> ([b] -> a) -> [b] -> a
emp a f [] = a
emp a f (x:xs) = f (x:xs)

Or since we’re advocating one-liners:

emp a f x = if null x then a else f x

Why? In Haskell, recursion on lists often goes like this:

f []     = empty_list_case
f (x:xs) = recurse_somehow x (f xs)

The emp function lets us combine these two lines into one. The fold family of functions is similar: for them, the 2nd argument handles the empty-list case, while the 1st argument is a function that describes how to process the next element. However, they are insufficiently general for our purposes.

Should we go to this trouble just to squeeze the empty-list case into the same line as the inductive step? Yes, if you share my appreciation for one-liners! Also:

  • this is a natural extension of the precedent set by the fold family

  • it’s harder to forget about the base case

  • single lines suit interactive Haskell (if you do type these in, remember to add "let" in front of the definitions).

Examples

Here’s foldl and foldr in one line each:

foldr f z = emp z $ \(x:xs) -> f x (foldr f z xs)
foldl f z = emp z $ \(x:xs) -> foldl f (f z x) xs

And the membership test:

elem x = emp False $ \(y:ys) -> x==y || (x `elem` ys)

All subsequences (though in an order differing from Data.List.subsequences):

sub = emp [[]] $ \(x:xs) -> sub xs ++ (map (x:) (sub xs))

How about all permutations? In C, where we can loop through an array and swap elements in place, an easy but impure recursive solution takes every element in turn, temporarily swaps it to the first position, then calls itself on the remainder of the elements, where the lowest level of recursion prints the whole buffer.

In Haskell, I found it easiest to do something similar by maintaining 2 lists on either side of a selected element. My first attempt came out as:

-- Invoke with: p "" "" "abc"
p a [] [] = [a]
p [] ys [] = []
p a ys [] = map (a++) (p [] [] ys)
p a ys (x:xs) = (if null a then (p [x] ys xs) else []) ++ (p a (x:ys) xs)

It’s unclear how to rewrite this in one line. Perhaps it’s best to break the problem into two sub-problems to get one-liners. The following is based on elegant code found on a mailing list:

f = emp [] $ \(x:xs) -> (x,xs):[(y,x:ys) | (y,ys) <- (f xs)]
perm = emp [[]] $ \xs -> [y:zs | (y,ys) <- (f xs), zs <- (perm ys)]

The helper function f is the FP version of iterating through a list and plucking an element out, and has uses outside this problem.

Integers

A similar definition for integers is also handy:

induct a f x = if 0==x then a else f x

For starters, we can define emp with induct:

emp a f x = induct a (const (f x)) (length x)

List all n-letter strings consisting of the letters in "BEN":

f = induct [""] $ \n -> [x:xs | x <- "BEN", xs <- f (n-1)]

Compute an n-bit Gray code:

g = induct [""] $ \n -> (map ('0':)) (g (n-1)) ++ (map ('1':)) (reverse (g (n-1)))

Though my knowledge of it is shallow, I think Haskell would gain from default induct and emp functions. (Perhaps they already exist? I don’t know Haskell that well.) They bring the length of many interesting function definitions under a mystical threshold that makes them catchy and appealing.

Sunday, March 27, 2011

CMU -= OOP, CMU++;

I should have been a comedian. I have excellent timing.

I applied for Computer Science grad school during the irrationally exuberant dot-com boom, when much of my competition ran off chasing easy money. During the subsequent crash I was already a sheltered student. By the time I finished, the economy was up again. My lucky streak continued as I managed to secure a job before the global financial crisis struck.

However, I’m most thankful that I was amongst the last batch of students of my undergraduate CS program who were only taught object-oriented programming in an optional course.

It wasn’t just my alma mater. According to an interview with a CS professor emeritus at NYU, around that time, many universities bowed to pressure to switch to Java and focus less on fundamentals. This produced less capable programmers, or as the article puts it, “today’s Java-savvy college grad is tomorrow’s pizza-delivery man”. Again, my competition was decimated merely because of timing.

Apart from increased job security, I’m grateful I learned less object-oriented nonsense and more CS, as I’m deeply interested in the subject. In fact, this is why I chose to study CS; I’m just fortunate I happen to live in a time and place where coding is a marketable skill.

Now that I can afford to be less selfish, I wish universities would undo this injustice and remove object-oriented programming from introductory CS curriculums, so future students can benefit, just as Carnegie Mellon University is doing. CMU's decision inspired me to post this entry, and update my page denouncing object-oriented programming.

Business

While some applaud the move, others state that learning a language and programming paradigm rarely seen in industry leaves a student ill-prepared.

My own experience suggests otherwise. I have little trouble coding in new programming languages so long as they’re not too esoteric. Granted, the first few lines are painful, as I’ll have to constantly look up keywords in a reference while acclimatizing to error messages and other peculiarities of the development environment. But after a while, I’ll pick up the syntax for arithmetic, loops, conditionals, function calls, and so on, and that’ll be enough to keep me going. I might miss out on idioms, but I can get my point across.

The difficult part is designing the underlying data structures and algorithms. That’s where CS is most magical. Functional programming gives me an edge. It makes recursion less scary, and recursion is often a great approach to a problem: sorting, parsing, anything with trees. Also, experience with lazy evaluation increases my awareness of parallelism.

I believe it also makes my code cleaner, because treating functions as first-class citizens opens my eyes to design choices I might otherwise miss. It’s hard to say: a Java fanatic might be trained to spot perfect places for anonymous inline classes.

In interviews, I let candidates choose the language as I’m more interested in the idea being expressed. Some candidates write code comfortably, but struggle with relatively simple tasks because they can’t find the right algorithm or data structure. They’re like an author with perfect spelling but who only knows boring stories.

Others quickly find a good solution, but their code may have syntax issues here and there. These mistakes hardly bother me because after a week of seeing the same compiler error messages over and over again, their syntax will be as good as mine.

Personal

Although OOP was optional for me, I studied it with relish. I read “Design Patterns” cover to cover shortly after purchasing it. Bertrand Meyer’s words were gospel. Eiffel, a pure OO language, became my favourite. I promoted OOP more fiercely than I attack it now; back then I was even more obstinate. I believed the only path to virtues such as modularity, information hiding, and code reuse was through OOP, partly because I thought OOP folks invented these concepts.

Over the years, events caused me to pause and reexamine what I had been taught. Many lines of code later, I finally discarded OOP and freed myself. I felt cheated, which is why I’m more emotionally attached to this stuff than one might expect.

There’s no harm in learning OOP as long as you avoid my mistake and arm yourself with a healthy dose of skepticism (which you should do all the time anyway, including now).

Monday, February 7, 2011

UTF-8 good, UTF-16 bad

Lately I’ve been programming in Go. I started by writing a few popular Unix tools in Go, before moving on to a lexer that supports structural regular expressions and UTF-8.

Unicode support was easy to add because Go lives and breathes UTF-8. But why did the Go designers favour UTF-8, and not UTF-16? Widespread software such as Windows, Java, and Python use UTF-16 internally, so why not Go?

Further reading revealed the answer. UTF-16 is a historical accident that persists mainly due to inertia. UTF-16 has no practical advantages over UTF-8, and it is worse in some ways. The Go designers knew what they were doing, so they chose UTF-8.

Down with UTF-16

Before, from vague hunches, I had thought that UTF-16 was simpler because each character took exactly 2 bytes (so it handled only a strict subset of Unicode), and that UTF-16 was more compact for certain languages.

The first statement is patently false. UTF-16 is a variable-length encoding, and hence not much simpler than UTF-8. As for the second: consider HTML. The amount of whitespace and tags in a typical document is high enough that UTF-8 is more compact than UTF-16. This generalizes to other formats, such as Go source code.
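
The per-code-point sizes behind that claim can be tabulated in a few lines (my sketch; these are the standard encoding ranges):

```c
#include <assert.h>

// Bytes needed to encode one code point.
int utf8_bytes(unsigned cp) {
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
int utf16_bytes(unsigned cp) {
  return cp < 0x10000 ? 2 : 4;  // One 16-bit unit, or a surrogate pair.
}
```

ASCII-heavy markup is half the size in UTF-8, while CJK text pays one extra byte per character; in a typical HTML page the tags and whitespace dominate, so UTF-8 wins overall.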

So there’s no upside. But there are plenty of downsides to UTF-16:

  1. We must worry about byte ordering because UTF-16 deals with 16 bits at a time. A related drawback is the loss of self-synchronization at the byte level.

  2. We lose backwards compatibility with ASCII.

  3. We lose backwards compatibility with code treating NUL as a string terminator.

  4. We cannot losslessly convert from byte streams.

  5. The encoding is inflexible: hopefully 0x110000 code points should be enough, but if the range were extended (again), UTF-16 would need major surgery. In contrast, by simply allowing up to 6 bytes, UTF-8 can handle up to 2^31 code points.

  6. Ugly encoding: it’s the reason why U+D800 to U+DFFF is reserved (though thankfully UTF-16 is self-synchronizing with 16-bit granularity).

I tired of thinking about the drawbacks of UTF-16, but there are likely more. Thus even if we only worked with a strange set of documents where UTF-16 is more efficient, the savings are too small to justify the complexity.
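
The reserved range in point 6 exists so the two halves of a supplementary code point can be told apart; the mechanism fits in a few lines (a sketch of mine; utf16_encode is my name):

```c
#include <assert.h>

// Encode a code point as UTF-16 code units; returns how many 16-bit
// units were written (1 or 2).
int utf16_encode(unsigned cp, unsigned short out[2]) {
  if (cp < 0x10000) {  // BMP: one unit. U+D800..U+DFFF stay reserved.
    out[0] = (unsigned short)cp;
    return 1;
  }
  cp -= 0x10000;                   // 20 bits remain.
  out[0] = 0xD800 | (cp >> 10);    // High surrogate: top 10 bits.
  out[1] = 0xDC00 | (cp & 0x3FF);  // Low surrogate: bottom 10 bits.
  return 2;
}
```

Decoding reverses the arithmetic; it works only because no BMP character may occupy U+D800 to U+DFFF, which is exactly the ugliness point 6 complains about.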

A living dead standard

How did so many projects settle on such a horrible standard? It’s a sad story. The original Unicode supported exactly 0x10000 code points, making UCS-2 (the precursor to UTF-16) attractive. Each character neatly corresponded to one 16-bit word. Although UTF-8 might still be preferable thanks to byte ordering and ASCII compatibility, UCS-2 is certifiably sane. People built systems based on UCS-2, and rejoiced that the codepage Stone Age was over.

Then an unnatural disaster struck: the Unicode Consortium decided more code points were needed. The supple UTF-8 got away unscathed, but there’s no way to squeeze more than 16 bits into UCS-2. Apart from abandoning UCS-2, the only recourse was to reserve a chunk of values for encoding the higher code points. Thus UTF-16 was born. The hope was that developers could simply tweak masses of existing UCS-2 code for the new standard. Sadly, various UTF-16 bugs suggest that much code remains unaware UTF-16 is a variable-length encoding.

Despite the trauma it would cause, they should have killed UCS-2 when they had the chance. Now we have a grotesque encoding scheme that shambles alongside UTF-8.

Gauss was here

According to Wikipedia, UTF-8 was built on ASCII by Rob Pike and Ken Thompson. ASCII was developed from telegraph codes by a committee. The telegraph codes stemmed from a Western Union code, which evolved from a code by Donald Murray, who modified a 5-bit code invented by Émile Baudot, who based his work on a code designed by Carl Friedrich Gauss and Wilhelm Weber. It’s awe-inspiring that the Prince of Mathematicians was involved. And if he were alive today, I bet he’d disparage UTF-16!