Saturday, June 10, 2017

Solving the JavaScript Problem

The JavaScript Problem is a good problem to have. Against the odds, “write once, run anywhere” is a reality for web browsers because of a language governed by a standards organization. Not so long ago, proprietary technologies such as Flash and Silverlight threatened the openness of the web.

So we should be grateful the JavaScript problem is merely a technical one, namely that JavaScript is poorly designed. Though the language has improved over the years, its numerous flaws are too deeply entrenched to remove. Transpilers help by unlocking access to better languages, but JavaScript was never intended to be an assembly language. It’s only thanks to the heroic efforts of many talented engineers that JavaScript has gone so far.

Ideally, the lingua franca of the web should be low-level, clean, and simple, so we can develop in any language with little overhead.

WebAssembly

Recent versions of several popular browsers have fulfilled our wishes with WebAssembly, also known as wasm. WebAssembly is an open and well-designed standard. At last, works such as Benjamin Pierce’s “Types and Programming Languages” are mainstream enough that WebAssembly has formally specified reduction and typing rules, and even a proof of soundness. In contrast, weakly typed languages such as JavaScript ignore a century or so of mathematical progress.

In WebAssembly, nondeterministic behaviour can only arise from resource exhaustion, external host functions, and the IEEE-754 floating-point standard, which fails to specify the NaN bit pattern in all cases. Recall that in C and many languages built upon it, such as Go and Haskell, signed integer overflow causes undefined behaviour. WebAssembly fixes this by stipulating two’s complement for negative numbers, as competing representations of negative numbers are ultimately responsible for this defect of C. Endianness is similarly settled, though curiously by travelling the road not taken by network byte order: numbers in WebAssembly are encoded in little-endian order.
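Little-endian thinking runs all the way down to the binary format, where integer immediates are encoded in LEB128: seven bits per byte, lowest bits first, with the high bit marking continuation. A minimal sketch of the unsigned variant in Haskell:

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Unsigned LEB128: the little-endian, variable-length integer
-- encoding used throughout the WebAssembly binary format.
-- Each byte carries 7 bits, low bits first; the high bit of a
-- byte is set when more bytes follow.
uleb128 :: Integer -> [Word8]
uleb128 n
  | n < 0x80  = [fromIntegral n]
  | otherwise = (fromIntegral (n .&. 0x7f) .|. 0x80) : uleb128 (n `shiftR` 7)
```

For example, `uleb128 624485` yields `[0xE5, 0x8E, 0x26]`.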

The WebAssembly virtual machine is stack-based. Years ago, I read that register-based virtual machines are faster, but perhaps these results are now obsolete. Browsing briefly, I found newer papers:

It’s the same old story. Register-based virtual machines are still faster after all. It seems WebAssembly prioritizes code size, and trusts browsers will ship with good JIT compilers.
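To make the trade-off concrete: a stack machine’s instructions name no registers, which keeps their encodings small. A toy evaluator in Haskell, with invented opcodes rather than real wasm instructions:

```haskell
-- A toy stack machine in the style of WebAssembly execution:
-- each instruction pops its operands from, and pushes its result
-- onto, a single operand stack.
data Op = Push Int | Add | Mul

eval :: [Op] -> [Int] -> [Int]
eval []             stack           = stack
eval (Push n : ops) stack           = eval ops (n : stack)
eval (Add    : ops) (a : b : stack) = eval ops (b + a : stack)
eval (Mul    : ops) (a : b : stack) = eval ops (b * a : stack)
eval _              _               = error "stack underflow"
```

Here `eval [Push 2, Push 3, Add, Push 4, Mul] []` computes (2 + 3) * 4 and returns `[20]`; a register machine would spend extra bits naming sources and destinations for each operation.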

Toy Compilers

Online demonstrations of WebAssembly compilers are fun to build, and fun to describe: I compiled a Haskell program to JavaScript that, when executed, reads a Haskell-like program and compiles it to WebAssembly, which some JavaScript then loads and executes.

Perhaps it’s easier to invite the reader to:

For the last two programs, I transformed the source into a tree of S and K combinators, so apart from graph reduction, I only had to code two combinators in assembly. The resulting binaries are excruciatingly slow, especially since numbers are Church-encoded, but it all seems to work.
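The reduction step itself is tiny. A sketch in Haskell, as plain tree rewriting rather than true graph reduction (a real implementation shares subterms instead of copying them):

```haskell
-- S/K combinator reduction to weak head normal form:
--   K x y   ~> x
--   S f g x ~> f x (g x)
data Expr = S | K | App Expr Expr deriving (Eq, Show)

whnf :: Expr -> Expr
whnf e = case spine e [] of
  (K, x : _ : rest)     -> whnf (foldl App x rest)
  (S, f : g : x : rest) -> whnf (foldl App (App (App f x) (App g x)) rest)
  _                     -> e  -- head combinator is under-applied
  where
    -- Unwind the application spine, collecting arguments.
    spine (App f x) args = spine f (x : args)
    spine h         args = (h, args)
```

Since the identity combinator is S K K, an expression like `whnf (App (App (App S K) K) K)` reduces to `K`.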

I look forward to a Haskell compiler that produces efficient WebAssembly, though it may have to wait until WebAssembly gains a few more features, such as threads and tail calls.

1 comment:

Fabian Giesen said...

Go does not leave signed overflow undefined: integers are defined to be represented using two's complement, and the language further specifies that "for signed integers, the operations +, -, *, and << may legally overflow and the resulting value exists and is deterministically defined by the signed integer representation, the operation, and its operands."

Two's complement vs. one's complement was historically the reason for C not specifying signed overflow semantics, but the reasons current compilers (for two's complement platforms) do not just nail this down are separate. Some have to do with loop dependence analysis, but the biggest single contributor is simply that C didn't manage the transition to 64-bit gracefully.

Specifically, for backwards compatibility reasons, C "int" on 64-bit platforms is generally 32-bit, not 64-bit. But this is a big source of low-level code generation issues. Consider something like an array access "arr[i + 1]" where i is a 32-bit integer, in some loop that already accesses "arr[i]" elsewhere. Compilers would like to compute the address of "arr[i]" once, and then use a relative addressing mode to access "arr[i + 1]" (or just eliminate i altogether and turn it into pointer arithmetic). This is equivalent if the integer type in question is pointer-sized, but not if it isn't.

The primary application of the "signed overflow is undefined" rule in current C/C++ compilers (for 64-bit platforms) turns out to be silently promoting 32-bit ints to 64-bit ints when they are used as loop counters or feed into addressing operations. (See e.g. this write-up.)

The choice of endianness boils down to all current major web host platforms (PC Windows/Linux, Mac OS X, iOS, Android) using Little Endian. Specifying WASM to use Big Endian would introduce extra friction and conversions in every interface to native code on all these targets.