Sunday, February 03, 2008

The joy of iterators

Since I've been taking so many pot-shots at Arc I thought I'd post one of my own ideas so that people can shoot at me for a change. (Well, it's not really my idea, but it's an idea that I advocate.) I'll motivate this with a multi-step puzzle: first, write a function that can accept either a (linked) list or a vector an print out its elements. Here's one solution in Common Lisp:


(defun print-elements (thing)
(etypecase thing
(list (loop for element in thing do (print element)))
(vector (loop for element across thing do (print element))))


This is not particularly elegant. The amount of repetition between the two clauses in the etypecase is pretty annoying. We need to copy almost the entire LOOP clause just so we can change "in" to "across". How can we make this prettier?

The usual fix for repetitive code is to use a macro, e.g.:


(defmacro my-loop1 (var keyword thing &body body)
`(loop for ,var ,keyword ,thing do ,@body))


But this hardly helps at all (and arguably makes the situation worse):


(defun print-elements (thing)
(etypecase thing
(list (my-loop1 element in thing (print element)))
(vector (my-loop1 element across thing (print element))))


We could do this:


(defmacro my-loop2 (keyword)
`(loop for element ,keyword thing do (print element)))

(defun print-elements (thing)
(etypecase thing
(list (my-loop2 in)
(vector (my-loop2 across))))


But that's not a very good solution either. We're not likely to get much re-use out of my-loop2, the code overall is not much shorter, and it feels horribly contrived.

Here's another attempt, taking advantage of the fact that Common Lisp has a few functions built-in that are generic to sequences (which include lists and vectors):


(defun print-elements (thing)
(loop for i from 0 below (length thing) do (print (elt thing i))))


That looks better, but it's O(n^2) when applied to a list, which is not so good. And in any case it completely falls apart when we suddenly get a new requirement:

Extend print-elements-in-order so that it can accept a hash table, and prints out the hash-table's keyword-value pairs.

Now the first solution is suddenly looking pretty good, because it's the easiest to extend to deal with this new requirement. We just have to add another line to the etypecase:


(defun print-elements (thing)
(etypecase thing
(list (loop for element in thing do (print element)))
(vector (loop for element across thing do (print element)))
(hash-table (loop for key being the hash-keys of thing do (print (list key (gethash key thing)))))))


Now we get yet another requirement:

Extend print-elements so that it can accept a file input stream, and print out the lines of the file.

I'll leave that one as an exercise.

Now, the situation seems more or less under control here, with one minor problem: every time we get a new requirement of this sort we need to change the print-elements function. That's not necessarily a problem, except that if we have more than one programmer working on implementing these requirements we have to be sure that they don't stomp on each other as they edit print-elements (but modern revision-control systems should be able to handle that).

But now we get this requirement:

Extend print-elements so that it takes as an optional argument an integer N and prints the elements N at a time.

Now all our hard work completely falls apart, because the assumption that the elements are to be printed one at a time is woven deeply into the fabric of our code.

Wouldn't it be nice if the LOOP macro just automatically did the Right Thing for us, so that we could just write, e.g.:


(defun print-elements (thing &optional (n 1))
(loop for elements in (n-at-a-time thing n) do (print elements)))


and be done with it? Can we make that happen? Yes, we can. Here's how:


(defconstant +iterend+ (make-symbol "ITERATION_END"))

(defmacro for (var in thing &body body)
(with-gensym itervar
`(let ( (,itervar (iterator ,thing)) )
(loop for ,var = (funcall ,itervar)
until (eq ,var +iterend+)
,@body))))

(defmethod iterator ((l list))
(fn () (if l (pop l) +iterend+)))

(defmethod iterator ((v vector))
(let ( (len (length v)) (cnt 0) )
(fn () (if (< cnt len)
(prog1 (elt v cnt) (incf cnt))
+iterend+))))

(defun print-elements (thing)
(for element in thing do (print element)))


The overall code is much longer than our original solution, but I claim that it is nonetheless a huge win. Why? Because it is so much easier to extend. In order to make print-elements work on new data types all we have to do is define an iterator method on that data type which returns a closure that conforms to the iteration protocol that we have (implicitly for now) defined. So, for example, to handle streams all we have to do is:


(define-method (iterator (s stream)) (fn () (read-char s nil +iterend+)))


We don't have to change any existing code. And in particular, we don't need to redefine print-elements.

Furthermore, we can now do this neat trick:


(defmacro n-of (form n)
`(loop for #.(gensym "I") from 1 to ,n collect ,form))

(defun n-at-a-time (n thing)
(let ( (iter (iterator thing)) )
(fn () (let ((l (n-of (funcall iter) n)))
(if (eq (car l) +iterend+) +iterend+ l)))))


It actually works:

? (for l in (n-at-a-time 2 '(1 2 3 4)) do (print l))

(1 2)
(3 4)
NIL
? (for l in (n-at-a-time 2 #(1 2 3 4)) do (print l))

(1 2)
(3 4)
NIL

Furthermore, n-at-a-time works on ANY data type that has an iterator method defined for it.

There's a fly in the ointment though. Consider this requirement:


Write a function that takes either a list or a vector of integers and destructively increments all of its elements by 1.


We can't do that using the iteration protocol as it currently stands, because it only gives us access to values and not locations. Can we extend the protocol so that we get both? Of course we can, but it's a tricky design issue because there are a couple of different ways to do it. The way I chose was to take advantage of Common Lisp's ability to let functions return multiple values. I extended the FOR macro to accept multiple variables corresponding to those multiple values. By convention, iterators return two values (except for the N-AT-A-TIME meta-iterator): the value, and a key that refers back to the container being iterated over (assuming that makes sense -- some data types, like streams, don't have such keys). This makes the code for the FOR macro and the associated ITERATOR methods quite a bit more complicated:


(defmacro for (var in thing &body body)
(unless (sym= in :in) (warn "expected keyword 'in', got ~A instead" in))
(with-gensym itervar
`(let ( (,itervar (iterator ,thing)) )
,(if (consp var)
`(loop for ,var = (multiple-value-list (funcall ,itervar))
until (eq ,(fst var) +iterend+)
,@body)
`(loop for ,var = (funcall ,itervar)
until (eq ,var +iterend+)
,@body)))))

(define-method (iterator (v vector))
(let ( (len (length v)) (cnt 0) )
(fn () (if (< cnt len)
(multiple-value-prog1 (values (elt v cnt) cnt) (incf cnt))
+iterend+))))

(define-method (iterator (h hash-table))
(let ( (keys (loop for x being the hash-keys of h collect x)) )
(fn ()
(if keys (let ( (k (pop keys)) ) (values k (gethash k h))) +iterend+))))

(defun n-at-a-time (n thing)
(let ( (iter (iterator thing)) )
(fn () (apply 'values (n-of (funcall iter) n)))))


Note that N-AT-A-TIME has actually gotten simpler since we no longer need to check for +iterend+. When the underlying iterator returns +iterend+ it just gets passed through automatically.

All this works in combination with another generic function called REF which is like ELT except that, because it's a generic function, can be extended to work on data types other than lists and vectors. The solution to the last problem is now:


(for (elt key) in thing do (setf (ref thing key) (+ elt 1)))


So we've had to do a lot of typing, but the net result is a huge win in terms of maintainability and flexibility. We can apply this code to any data type for which we can define an iterator and reference method, which means we can swap those types in and out without having to change any code. We can define meta-iterators (like N-AT-A-TIME) to modify how our iterations happen. (By default, the iterator I use for streams iterates over characters, but I have a meta-iterator called LINES that iterates over lines instead. It's easy to define additional meta-iterators that would iterate over paragraphs, or forms, or other semantically meaningful chunks.)

All this code, include the support macros (like with-gensym) is available at http://www.flownet.com/ron/lisp in the file utilities.lisp and dictionary.lisp. (There's lots of other cool stuff in there too, but that will have to wait for another day.)

UPDATE: I mentioned that iterators are not my idea, but didn't provide any references. There are plenty of them, but here are two to get you started if you want to read what people who actually know what they are talking about have to say about this:

http://okmij.org/ftp/Scheme/enumerators-callcc.html

http://okmij.org/ftp/papers/LL3-collections-enumerators.txt

Thanks to sleepingsquirrel for pointing these out.

I also realized after reading these that I probably should have used catch/throw instead of +iterend+ to end iterations.

Finally, it's worth noting that Python has iterators built-in to the language. But if it didn't you couldn't add them yourself like you can in Common Lisp. It can be done in Arc, but it's harder (and IMO less elegant) because ARC does not have generic functions, so you have to essentially re-invent that functionality yourself.

10 comments:

  1. Minor pseudo-typo: In (both of) your two FOR macros, the LOOP clauses don't seem indented as far to the right as they should be.

    ReplyDelete
  2. I love reading about important patterns such as this. This is an excellent article.

    To get the code to run, though, I had to make a few changes, particularly to the utilities.lisp file.

    1. The n-at-a-time function had to be replaced by the one shown in this article.
    2. s/ccl::%string-push-extend/vector-push-extend/
    3. "defsetf subseq" had to be commented out because the cl-user package was locked.
    4. class-direct-superclasses function is not defined, so make-load-form couldn't be evaluated (but I don't think it's needed).
    5. the directory function has two keywords (:directories and :files) that are not defined in sbcl.

    I'm using the latest sbcl.
    Thank you!

    ReplyDelete
  3. Thanks for the feedback. I've not been putting a lot of effort into making my code portable (or even usable by anyone other than me) because it's all kind of a gawdawful mess at the moment (I'm pulling out the less messy bits in these posts) and there doesn't seem to be all that much interest. (Everyone's busy with Arc.) But if enough people ask (and "enough" would probably be three or four at this point) I'd be happy to whip it into shape, get it under proper revision control, and all that.

    ReplyDelete
  4. I don't think it's terribly important to make the utilities code portable. I think most lispers can port the code themselves. I am inquisitive about a couple of things, however.

    I would be interested in what the class-direct-superclasses function looks like (if it runs in ansi cl or uses a popular cl library---not interested if it's purely a LW or Allegro thing).

    I don't understand CL well enough to know why "defsetf subseq" was defined in the utilities.lisp file since it was already defined in cl-user. An explanation would help me learn something new about lisp. (I just stumbled upon defsetf recently. I've been playing with CL off and on for years, but I'm still turning over important rocks.)

    One last request: I think the pattern you explain in this article is important; are there other patterns (besides the well-known) that you've written about, or care to share?
    Thanks

    ReplyDelete
  5. > I would be interested in what the class-direct-superclasses function looks like

    It's part of the MOP (meta-object protocol). In SBCL it's in the SB-MOP package. (APROPOS is your friend :-)

    > I don't understand CL well enough to know why "defsetf subseq" was defined in the utilities.lisp file since it was already defined in cl-user.

    I don't actually remember. It's probably a fix for a CCL (Clozure Common Lisp) bug and ought to have an appropriate #+ flag attached.

    > are there other patterns (besides the well-known) that you've written about, or care to share?

    Stay tuned to my blog :-)

    ReplyDelete
  6. Great article. Did you notice how similar this is to Python's for loop and iteration paradigm? Sweetness.

    ReplyDelete
  7. Yes, of course. This is more or less a straight lift of Python's design. (Python gets a lot of things right.)

    ReplyDelete
  8. Here's a version that doesn't use a sentine. It uses values, with the second value being whether or not the value returned is valid. (Took me a while to realize prog1 didn't return values.)

    I tried my hand at one with conditions, but that was really messy.

    (defmethod iter ((v vector))
    (let ((len (length v))
    (i 0))
    (lambda ()
    (if (< i len)
    (multiple-value-prog1 (values (elt v i) t) (incf i))
    (values nil nil)))))

    (defmacro for (var in thing &body body)
    (declare (ignorable in))
    (let ((itervar (gensym))
    (cont (gensym)))
    `(let ((,itervar (iter ,thing))
    (,cont t))
    (loop for ,var =
    (multiple-value-call
    #'(lambda (i end)
    (setf ,cont end)
    i)
    (funcall ,itervar))
    until (null ,cont)
    ,@body))))

    ReplyDelete
  9. Actually, here's the exception one.

    (define-condition end-of-iteration (condition) ())

    (defmethod iter ((v vector))
    (let ((len (length v)) (i 0))
    (lambda ()
    (if (< i len)
    (prog1 (elt v i) (incf i))
    (error 'end-of-iteration)))))

    (defmacro for (var in thing &body body)
    (declare (ignorable in))
    "Builds on the loop syntax but assumes iteration."
    (let ((itervar (gensym))
    (cont (gensym)))
    `(let ((,itervar (iter ,thing))
    (,cont t))
    (loop for ,var =
    (handler-case (funcall ,itervar)
    (end-of-iteration (e)
    (declare (ignorable e))
    (setf ,cont nil)))
    until (null ,cont)
    ,@body))))

    (for i in #(1 2 3) collect i)

    (How am I supposed to format code?)

    ReplyDelete