Monday, January 24, 2011

Geek corner: on blurring the distinction between code and data

A while back I wrote this as a throwaway comment in a discussion on comp.lang.lisp:

IMHO (one of) the hallmark(s) of "real" programming is a general blurring of the distinction between "compile time" and "run time". Compilation is just one kind of optimization. Running that optimization as a batch job makes it easier to apply, but the real challenge is refining the optimization on a continual basis in response to new information, including changes to the operational spec.

Someone sent me an email asking me to expand on that thought, and I promised I would. It took me a lot longer to render that expansion into words than I anticipated, so I thought I'd put it up here in case others might find it useful.

Writing programs typically goes something like this: First, a specification of what the program is supposed to do is written. Then that specification is rendered into code. The code is then (typically) compiled into some kind of an executable image. That image is then delivered to users who run the program and (again, typically) provide it with input, which we call "data".

This distinction between code and data is purely artificial. On a fundamental theoretical level there is no distinction between the two. All "code" can be viewed as "data" that is fed as input into an interpreter or a compiler, and all "data" can be viewed as "code" for a specialized interpreter (or compiler) that comprises the application. From the computer's point of view it's all the same: bits go in, bits come out. Whether those bits are code or data is in the eye of the beholder.
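To make both directions concrete, here is a minimal Python sketch (Python is used only for illustration; it is not the language this post is about). The first half treats source text as data by parsing it into a tree with the standard `ast` module. The second half treats a plain nested list as a program for a tiny interpreter. The names `run`, `OPS`, `program`, and `env` are hypothetical and exist only for this example.

```python
import ast

# Direction 1: "code" viewed as data. The source text is parsed into a tree
# that can be walked and inspected like any other data structure.
source = "price * quantity + shipping"
tree = ast.parse(source, mode="eval")
names = sorted(node.id for node in ast.walk(tree) if isinstance(node, ast.Name))

# Direction 2: "data" viewed as code. A nested list is a program for a tiny
# interpreter (hypothetical, written only for this illustration).
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}

def run(expr, env):
    """Interpret a nested list of the form [op, left, right]."""
    if isinstance(expr, str):            # a variable reference
        return env[expr]
    if isinstance(expr, (int, float)):   # a literal number
        return expr
    op, left, right = expr
    return OPS[op](run(left, env), run(right, env))

program = ["add", ["mul", "price", "quantity"], "shipping"]
env = {"price": 3, "quantity": 4, "shipping": 5}

result_from_data = run(program, env)
result_from_code = eval(compile(tree, "<expr>", "eval"), {}, env)
```

Both results are the same number. Which of the two inputs counts as "the program" depends only on which interpreter you have in mind.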

We choose to make the (artificial) distinction between code and data because doing so has benefits. "Programs" can serve to bridge the often severe impedance mismatch between the mental states of typical users and the underlying reality of computational hardware. They can also restrict what a user can do in order to prevent him or her from getting the machine into undesirable states. And constraining what a program does allows optimizations that make the resulting code run faster.

But making this distinction also has drawbacks. There is, obviously, a fundamental tradeoff between writing "programs" according to certain assumptions and constraints (and hence availing yourself of the benefits of those assumptions and constraints) and the freedom and flexibility to discharge those assumptions and constraints. This is the reason that code "maintenance" is considered an activity in its own right. Code doesn't require "maintenance" the way that mechanical systems do. Code doesn't degrade or wear out or require periodic lubrication. What happens instead is that the users of a program come to the realization that what the program does isn't quite what they wanted. There are bugs, or missing features, or parts that run too slowly or consume too much storage. So now you have to go back and change the code to conform to new assumptions and constraints. Often this is more work than the initial development.

The important point is not that these things happen, but that they happen because of an engineering decision, namely, the strong distinction between code and data, and the correspondingly strong distinction between programmer and user. There is nothing wrong with the decision to make this distinction. There are perfectly sound reasons to make this decision. But it is a decision. And because it is a decision, it can be changed. And it often is changed in small ways. For example, a spreadsheet blurs the distinction a little. Embedding a macro programming language like Visual Basic into, say, a word processor blurs the distinction more. Javascript probably blurs the distinction more than anything nowadays. Anyone with a web browser and a text editor has a Javascript development environment.

The line between code and data is blurrier now than it used to be, but it is still quite distinct nonetheless. There is still a strong division of labor between those who write web browsers and Javascript interpreters and those who write Javascript, and also between those who write Javascript and those who typically use web pages. There are still fairly clear distinctions between "scripting" languages, which tend to be easier to use but slow, and "real programming languages," which tend to be harder to use but faster, though this distinction too is beginning to blur. But the final merging of code and data, coder and user, compile-time and run-time, is still a ways off, and for a very good reason: it's really, really hard to do. That is what I meant by my original quip.
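One small way to picture the quip about compilation being just one kind of optimization is a program that compiles new code for itself when the "spec" changes mid-run. The Python sketch below is only an assumption-laden illustration: `make_power` is a hypothetical helper, and Python's built-in `compile` produces bytecode rather than native code, so this shows the shape of the idea and not its performance.

```python
def make_power(exponent):
    """Generate, compile, and return a function specialized for one exponent.

    Hypothetical helper for illustration: the exponent is "new information"
    that only becomes known while the program is already running.
    """
    body = " * ".join(["x"] * exponent) if exponent > 0 else "1"
    source = f"def power(x):\n    return {body}\n"
    namespace = {}
    # compile() and exec() are Python built-ins; the compilation step happens
    # here, at run time, rather than in a separate batch job.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["power"], source

# The first version of the spec arrives while the program is running...
cube, cube_source = make_power(3)

# ...and later the spec changes, so the program recompiles without restarting.
fourth, fourth_source = make_power(4)

cube_of_two = cube(2)
fourth_of_two = fourth(2)
```

Here the specialized function is rebuilt whenever the exponent changes. The hard part, which this sketch does not attempt, is deciding when re-specializing is worth it and doing so safely in a large running system.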

Whether or not the trend towards blurring the distinction will continue to the point where it disappears entirely is an open question. There are theoretical reasons to believe that a complete blurring might not be possible or even desirable. But the trend is inarguably in that direction.

One of the reasons I like to program in Lisp in general and Common Lisp in particular (and one of the reasons I think CL has had so much staying power) is that it is still the language that most effectively blurs the distinction between code and data, compile-time and run-time. (It doesn't blur the distinction between coder and user because of its abstruse syntax. Like I said, this is a really hard problem.) It's the only language in existence that allows you to change the lexical syntax while a program is running. On top of that, you can get native-code compilers for it. That is a stunning -- and massively under-appreciated -- accomplishment. Suddenly decide you want to use infix notation to write your code? You can do that, and you don't have to stop your already-running program in order to do it. That is mind-blowing. It's incredibly powerful. And, of course, it's dangerous, and if you don't know what you're doing you can get yourself into deep trouble. Solving that part of the puzzle is still an open problem.
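For readers who have never seen syntax change under a running program, here is a loose analogy in Python. It is not Common Lisp's readtable mechanism and it does not change Python's own syntax. It is a toy reader for a made-up prefix language whose dispatch table can be extended while the program runs. Every name in it (`READ_TABLE`, `read_until`, `read_infix`, and so on) is hypothetical.

```python
def tokenize(text):
    """Split source text into tokens, treating brackets as separate tokens."""
    for ch in "(){}":
        text = text.replace(ch, f" {ch} ")
    return text.split()

def atom(token):
    """Turn a token into an int if possible, otherwise keep it as a symbol."""
    try:
        return int(token)
    except ValueError:
        return token

# Mutable table: opening token -> function that reads the rest of the form.
READ_TABLE = {}

def read(tokens):
    token = tokens.pop(0)
    if token in READ_TABLE:
        return READ_TABLE[token](tokens)
    return atom(token)

def read_until(tokens, closer):
    """Read forms until the closing token, which is consumed and discarded."""
    items = []
    while tokens[0] != closer:
        items.append(read(tokens))
    tokens.pop(0)
    return items

# The toy language starts out with prefix syntax only: (op left right).
READ_TABLE["("] = lambda tokens: read_until(tokens, ")")

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

def evaluate(form):
    if not isinstance(form, list):
        return form
    op, left, right = form
    return OPS[op](evaluate(left), evaluate(right))

def parse(text):
    return read(tokenize(text))

before = evaluate(parse("(+ 1 (* 2 3))"))

# While the program is still running, teach the reader infix: {left op right}.
def read_infix(tokens):
    left, op, right = read_until(tokens, "}")
    return [op, left, right]

READ_TABLE["{"] = read_infix

after = evaluate(parse("(+ 1 {2 * 3})"))
```

The second expression uses syntax that did not exist when the first one was read, and nothing was restarted in between. The sketch also hints at the danger: nothing stops a careless entry in the table from breaking every later read.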


John Dougan said...

> Anyone with a web browser and a text editor has a Javascript development environment.

And with Dan Ingalls' Lively Kernel, you don't even need a text editor.

Ron said...

Whoa, Lively Kernel just blew my mind!

Ben Karel said...

Great post!

Arguably the Factor language is another one with runtime-extensible syntax, and, not coincidentally, its creator drew inspiration from SBCL in both language design and compiler optimization.

I actually think code-vs-data is not a precise characterization of what your post is about. The hardware can make a distinction between instructions and data without constraining the semantics of higher-level languages. Availability of convenient data structures for representing ASTs is likely the biggest factor in whether programmers see a distinction between code and data. And again, that's a separate issue from whether the language itself treats its own code as data (consider ML, which is usually seen as a metalanguage for other languages, rather than a metalanguage for itself).

I think the more fundamental distinction -- what your post is really about -- is whether a language makes a closed-world or open-world assumption about the environment its programs run in. Most compiled languages assume a closed world, because it makes optimization easier (or rather, it makes de-optimization a nonissue) and most interpreted languages assume an open world for flexibility. But they are orthogonal: Factor is a compiled language which assumes an open world. Of course open-vs-closed is itself a continuum; Java's classloaders give Java partially open-world semantics.

John Dougan said...

> Whoa, Lively Kernel just blew my mind!

Yeah, Lively rocks pretty hard. And it illustrates my favorite challenge to the people pushing their favorite language at me: Build something with the feel of the Smalltalk, Lively Kernel, or Lisp IDEs, where you are working inside the environment you are modifying -- and I will strongly consider using your language.

There haven't been a lot of takers.