Thursday, February 07, 2008

Z SHRTR BTTR?

People seem to be falling all over themselves to take the Arc challenge.

Central to this flurry of activity is Paul Graham's quality metric, that shorter (in terms of node count) is better. In fact, Paul explicitly says "making programs short is what high level languages are for."

I have already posted an argument against this position, citing APL as an example of a very concise language that is nonetheless not widely considered to be the be-all-and-end-all of high-level languages (except, perhaps, among its adherents). Here I want to examine Paul's assumption on his own terms, and see what happens if you really take his quality metric seriously on its own terms.

Paul defines his quality metric at the end of this essay:

The most meaningful test of the length of a program is not lines or characters but the size of the codetree-- the tree you'd need to represent the source. The Arc example has a codetree of 23 nodes: 15 leaves/tokens + 8 interior nodes.

He is careful to use nodes rather than characters or lines of code in order to exclude obvious absurdities like single-character variable and function names (despite the fact that he has a clear tendency towards parsimony even in this regard). So let's take a look at Arc and see if we can improve it.

One of the distinguishing features of Arc compared to other Lisps is that it has two different binding forms that do the same thing, one for the single-variable case (LET) and another for the multiple-variable case (WITH). This was done because Paul examined a large corpus of Lisp source code and found that the traditional Lisp LET was used most of the time to bind only a single variable, and so it was worthwhile making this a special case.

But it turns out that we can achieve the exact same savings WITHOUT making the single-variable case special. Let's take a traditional Lisp LET form:

(let ((var1 expr1) (var2 expr2) ...) form1 form2 ... formN)

and just eliminate the parentheses:

(let var1 expr1 var2 expr2 ... form1 form2 ... formN)

At first glance this doesn't work because we have no way of knowing where the var-expr pairs stop and the body forms start. But actually this is not true. We CAN parse this if we observe that If there is more than one body form, then the first body form MUST be a combination (i.e. not an atom) in order to be semantically meaningful. So we can tell where the var/expr pairs stop and the body forms begin by looking for the first form in a varN position that is either 1) not an atom or 2) not followed by another form. If you format it properly it doesn't even look half bad:


(let var1 expr1
var2 expr2
...
(form1 ...)
(form2 ...)
)

OR

(let var1 expr1
var2 expr2
...
result)


In order to avoid confusion, I'm going to rename this construct BINDING-BLOCK (BB for short). BB is, by Paul's metric, an unambiguous improvement in the design of ARC because it's the same length (in terms of nodes) when used in the single-variable case, and it's a node shorter when used in the multiple-variable case.

In fact, we can do even better by observing that any symbol in the body of a binding block that is not at the end can be unambiguously considered to be a binding because otherwise it would not be semantically meaningful. So we can simplify the following case:


(bb var1 expr1
(do-some-computation)
(bb var2 expr2
(do-some-more-computation)
....


to:


(bb var1 expr1
(do-some-computation)
var2 expr2
(do-some-more-computation)
....


Again, this is an unambiguous win because every time we avoid a nested binding block we save two nodes.

But notice that something is happening that ought to make us feel just a teensy bit queasy about all these supposed "improvements": our code is starting to look not so much like Lisp. It's getting harder to parse, both for machines and humans. Logical sub-blocks are no longer delimited by parentheses, so it's harder to pick out where they start and end.

Still, it's not completely untenable to argue that binding blocks really are an improvement. (For one thing, they will probably appeal to parenthophobes.) I personally have mixed feelings about them. I kind of like how they can keep the code from creeping off to the right side of the screen, but I also like being able to e.g. select a sub-expression by double-clicking on a close-paren.

Can we do even better? It would seem that with binding-block we have squeezed every possible extraneous node out of the code. There simply are no more parentheses we can get rid of without making the code ambiguous. Well, it turns out that's not true. There is something else we can get rid of to make the code even shorter. You might want to see if you can figure out what it is before reading on.

Here's a hint: Arc already includes a construct [...] which is short for (fn (_) ...).

Why stop at one implied variable? Why not have [...] be short for (fn ((o _) (o _1) (o _2) ...) ...)? Again, this is an unambiguous improvement by Paul's quality metric because it allows us to replace ANY anonymous function with a shorter version using implied variable names, not just anonymous functions with one variable.

And now we should be feeling very queasy indeed. On Paul's quality metric, all those variable names are expensive (every one requires a node) and yet I hope I don't have to convince you that getting rid of them is not an unalloyed good. If you doubt this, consider that we can use the same argument convention for non-anonymous functions as well, and replace the lambda-list with a single number indicating the number of arguments:

(def foo 3 (do-some-computation-on _1 _2 _3))

Again, an unambiguous improvement by Paul's metric.

We can extend this to binding blocks as well, where we replace variable names with numbers as well. In fact, we don't even need to be explicit about the numbers! We can just implicitly bind the result of every form in a binding block to an implicitly named variable! Again, on Paul's quality metric this would be an unambiguous improvement in the language design.

All this is not as absurd as it might appear. If we push this line of reasoning to its logical conclusion we would end up with a language that very closely resembled Forth. Forth is a fine language. It has its fans. It's particularly good at producing very small code for use on very small embedded processors, so if you need to program an 8-bit microcontroller with 4k of RAM it's not a bad choice at all. But God help you if you need to debug a Forth program. If you think Perl is write-only, you ain't seen nothin' yet.

The flaw in the reasoning is obvious: adding nodes to program source has a cost to be sure, but it can also provide benefits in terms of readability, flexibility, maintainability, editability, reusability... and there's no way to make those tradeoffs unambiguously because they are incommensurate quantities. That is why language design is hard.

Now, I don't really mean to rag on Arc. I only want to point out that if Arc succeeds (and I think it could, and I hope it does) it won't be because it made programs as short as they can possibly be. It will be because it struck some balance between brevity and other factors that appeals to a big enough audience to reach critical mass. And I think Paul is smart enough and charismatic enough to gather that audience despite the fact that there might be a discrepancy between Paul's rhetoric and Arc's reality. But IMO the world would be a better place (and Arc would be a better language) if the complex tradeoffs of language design were taken more seriously.

[UPDATE:]

Sami Samhuri asks:

How do you account for the following case?


(let i 0
(x y) (list 1 2)
(prn "i is " i)
(prn "x is " x)
(prn "y is " y))


Well, the easiest way is (bb i 0 x 1 y 2 ...

;-)

Or, less glibly: (bb i 0 L (something-that-returns-a-list) x (1st L) y (2nd L) ...

Of course, if you want to support destructuring-bind directly you need some kind of marker to distinguish trees of variables from forms. In this case, Arc's WITH syntax, which essentially uses a set of parens to be that marker, becomes tenable again (though it doesn't actually support destructuring even though the syntax could easily be extended to support it). Another possibility is to overload the square-bracket syntax, since that also would be a no-op in a binding position. Now things start to get even more complicated to parse, but hey, shorter is better, right?

But what happens if some day you decide that multiple values would be a spiffy feature to add to the language? Now you're back to needing a marker.

Personally, in my BB macro (in Common Lisp) use the :mv and :db keywords as markers for multiple-value-bind and destructuring-bind. So your example would be:


(bb i 0
:db (x y) (list 1 2)
; and just for good measure:
:mv (q s) (round x y)
...


I also use a prefix "$" character to designate dynamic bindings, which tends to send CL purists into conniptions. :-)

And to top it off, I use the :with keyword to fold in (with-X ...) forms as well. The result is some very un-Lispy-looking Lisp, but which shorter-is-better advocates ought to find appealing.

Here's the code for BB in case you're interested:


;;; Binding Block
(defmacro bb (&rest body)
(cond
((null (rst body)) (fst body))
((consp (1st body))
`(progn ,(1st body) (bb ,@(rst body))))
((not (symbolp (1st body)))
(error "~S is not a valid variable name" (1st body)))
((eq (1st body) ':mv)
(if (symbolp (2nd body))
`(let ((,(2nd body) (multiple-value-list ,(3rd body))))
(bb ,@(rrrst body)))
`(multiple-value-bind ,(2nd body) ,(3rd body)
(bb ,@(rrrst body)))))
((eq (1st body) :db)
`(destructuring-bind ,(2nd body) ,(3rd body)
(declare (special ,@(find-specials (2nd body))))
(bb ,@(rrrst body))))
((eq (1st body) :with)
`(,(concatenate-symbol 'with- (2nd body)) ,(3rd body) (bb ,@(rrrst body))))
((keywordp (1st body))
(error "~S is not a valid binding keyword" (1st body)))
(t `(let ((,(1st body) ,(2nd body)))
(declare (special ,@(find-specials (1st body))))
(bb ,@(rrst body))))))


Note that it would be straightforward to extend this to allow users to define their own binding keywords, which would really help things to spin wildly out of control :-)

9 comments:

sjs said...

How do you account for the following case?

(let i 0
(x y) (list 1 2)
(prn "i is " i)
(prn "x is " x)
(prn "y is " y))

Whether the 2nd set of binding forms are part of the body or not is ambiguous.

(Sorry, I don't know how to post code blocks properly)

Ron said...

> How do you account for the following case?

Good question, and too long to answer in a comment. I'll post an update shortly. In the meantime, to format code I use:

<code><pre>...</pre></code>

Except that blogger doesn't allow <code> or <pre> tags in comments. Bummer.

I guess the only thing I can suggest if you want to put code in a comment is put the code somewhere else and post a link.

Unknown said...

"Z SHRTR BTTR?"

In general, of course shorter is better. I have a hard time understanding why it's such a controversial idea.

Remember that the metric isn't number of characters (as the title of the post would imply) but number of nodes in the code tree. Though even acknowledging that, there's still something to be said for concise (in number of characters) names: I think most people would prefer call/cc or ccc to call-with-current-continuation.

Unknown said...

I think maybe what you are looking for is Minimum Description Length: a good language should minimise both the length of the language grammar AND the program text. MDL may be why people like Python and C for being small languages--they have a small grammar even though programs are sometimes longer.

In arc's case, the grammar is defined to be the arc source code. For what it's worth, Paul seems to be trying to minimise arc's grammar as well as its programs. He may be forced to keep it short, at least, if the source remains the only standard and other people constantly refer to it.

Peter Gruenwald has a tutorial on Minimum Description Length [PDF] for people who are interested.

Ron said...

> In general, of course shorter is better. I have a hard time understanding why it's such a controversial idea.

Did you even read the actual post? Do you really believe that arguments should be numbered instead of named?

> I think maybe what you are looking for is Minimum Description Length

That may be what Paul is looking for, but it's certainly not what I'm looking for. Using implicitly named function arguments (for example) shortens both the code and the grammar, but seems to me to be pretty clearly a horrible idea.

Unknown said...

> Do you really believe that arguments should be numbered instead of named?

Of course not. It's certainly true that there's a tradeoff to be made between brevity and readability, but I'd bet you can push pretty hard in the brevity direction before code becomes unreadable.

Your post seems to imply that PG doesn't understand that there's a tradeoff between brevity and readability... I highly doubt that. I don't think he'd go for the examples in your post, even though they do make things shorter.

"Z SHRTR BTTR?"

Again, I just don't understand the hostility to the concept that shorter is generally preferable.

Ron said...

> Your post seems to imply that PG doesn't understand that there's a tradeoff between brevity and readability... I highly doubt that.

Of course Paul understands this. But his rhetoric doesn't reflect that understanding.

> I don't think he'd go for the examples in your post, even though they do make things shorter.

Of course. That is the whole point.

> Again, I just don't understand the hostility to the concept that shorter is generally preferable.

It's not my intention to be hostile, only to (respectfully) disagree. Sometimes the line between disagreement and hostility can be a fine one, and I apologize if I've crossed it. (There's a reason I became an engineer and not a diplomat.)

I do confess to a certain amount of frustration at what seems to me a rare opportunity to make the world a better place that is (let me take a shot at diplomacy here) not being used to its full advantage. Maybe my frustration comes across as hostility. That's not my intent. That's why I've been very careful to say all along that I agree with the vast majority (95%+) of what Paul says.

I sincerely do not believe that shorter is generally preferable. I think brevity is only one of a long list of incommensurate qualities that need to be traded off in language design, which is what makes language design hard. Paul saying "just make it shorter and your done" is, I sincerely believe, oversimplifying the problem to a degree that has the potential to be genuinely harmful given the amount of attention he's getting. That's the only reason I'm spending so much time on this.

Unknown said...

> I sincerely do not believe that shorter is generally preferable.

Really? Suppose I told you I had two versions of a program, functionally equivalent, where program A was half as long as program B. You wouldn't have any opinion at all about which you think might be better?

Of course it could be possible that B is in fact better than A, but I'd bet otherwise. In general I'd say the shorter program is probably better. That certainly fits better with my experience reading and working with other people's code.

Yegge had a post about code size that I thought was very interesting.

Ron said...

> Really? Suppose I told you I had two versions of a program, functionally equivalent, where program A was half as long as program B. You wouldn't have any opinion at all about which you think might be better?

Absent any additional information, no, I wouldn't.

> Of course it could be possible that B is in fact better than A, but I'd bet otherwise.

You would lose that bet. At the risk of repeating myself, it is true that all else being equal shorter is better, but all else is rarely equal. You can take any program and make it shorter (both in terms of characters and in terms of node count) by replacing all the variable names with implicitly named variables. But IMO it would be perverse to call that transformation an improvement.

I can give many other examples. Here's a realistic one. Consider a program that has the following line in it:

const int options = CONFABULATE_DATA || OPTIMIZE_FROBS || IGNORE_OUTLIERS;

I can shorten that program by getting rid of that line and systematically replacing all instances of options with a precomputed numerical value. But I wouldn't want to hire you to work on my code if you called that an improvement.