Read Thinking Functionally with Haskell Online
Authors: Richard Bird
The comparison test above doesn’t determine what should happen if the two lines have the same length. But it is a consequence of the fact that all layouts flatten to the same string that two first lines with the same length will be the
same
line. Consequently, the first line is fixed and the comparison can pass to the second pair of lines. And so on.
The second property about decreasing shapes can be used to simplify the comparison test slightly because if layout
lx
precedes layout
ly
in the list of layouts, then the first line of
lx
is known to be at least as long as the first line of
ly
. And if the two lines are equally long, then the same statement is true of the second lines. And so on.
Given our shallow embedding of documents, here is a simple implementation of the function
pretty
that finds the best layout:
pretty :: Int -> Doc -> Layout
pretty w = fst . foldr1 choose . map augment
where
augment lx = (lx,shape lx)
choose alx aly
= if better (snd alx) (snd aly) then alx else aly
better [] ks
= True
better js []
= False
better (j:js) (k:ks) | j == k = better js ks
| otherwise = (j <= w)
Each layout is augmented with shape information to guide the choice of layout, which is then determined by a simple search. The test
better
implements the comparison operation described above. Finally, shape information is discarded.
This definition of
pretty
is hopelessly inefficient because every layout is computed and examined. If there are
n
possible choices of whether to have a line break or not, there are 2
n
layouts to be examined and pretty-printing will be very slow indeed. For example,
ghci> putStrLn $ pretty 30 $ para pg
This is a fairly short
paragraph with just twenty-two
words. The problem is that
pretty-printing it takes time,
in fact 31.32 seconds.
(31.32 secs, 17650013284 bytes)
Ouch! What is worse, pretty-printing a longer paragraph will cause GHCi to crash with an ‘out of memory’ message. An exponential time and space algorithm is not acceptable.
What is wanted is an algorithm for
pretty
that can decide on which first line to choose without looking ahead more than
w
characters. The algorithm should also be efficient, taking linear time in the size of the document being pretty-printed. Ideally the running time should be independent of
w
, but a running time that does depend on
w
is acceptable if a faster one means a much more complicated program.
The problem with identifying a document with its list of possible layouts is that useful structure is lost. Rather than bring all the alternatives to the top level as a list, we really want to bury them as deep as possible. For example, consider the following two expressions for a document:
A<0>B<0>D + A<0>B<1>D + A<1>C<0>E + A<1>C<1>E
A(<0>B(<0>D + <1>D) + <1>C(<0>E + <1>E))
As before,
<0>
denotes a single space and
<1>
a single line break. The five letters denote five nonempty texts. Since all four alternatives have to flatten to the same document, we require that
B<0>D = C<0>E
. In the first expression (which is essentially what is given by representing a document by its list of layouts) we have four layouts to compare. In the second expression we can shortcut some of the comparisons. For example, if we know that the common prefix
A
cannot fit in the given width, the first two layouts can be thrown away without further comparisons. Even better, if we choose between alternatives from the innermost to the outermost, we can base the comparison test on just the first lines of layouts. For instance, if we choose the better of
C<0>E
and
C<1>E
first, then that choice is not changed by subsequent choices.
The way to maintain the structure of documents is to represent a document as a tree:
data Doc = Nil
| Line
| Text String
| Nest Int Doc
| Group Doc
| Doc :<>: Doc
Note the use of an infix constructor in the last line. Haskell allows infix operators as constructors, but they have to begin with a colon. They do not have to end with a colon as well, but it seems more attractive if they do. This tree is called an
abstract syntax tree
; each operation of the library is represented by its own constructor. An implementation in terms of abstract syntax trees is known as a
deep embedding
.
We will
not
provide the user with the details of the data type
Doc
, just its name. To explain why not, it is useful to insert a short digression about Haskell data types. In Haskell the effect of a
data
declaration is to introduce a new data type by describing how its values are constructed. Each value is named by an expression built only from the constructors of the data type, in other words a
term
. Moreover, different terms denote different values (provided there are no strictness flags). We can define functions on the data type by pattern matching on the constructors. There is therefore no need to state what the operations on the data type are – we can just define them. Types in which the values are described, but the operations are not, are called
concrete
types.
The situation is exactly the reverse with
abstract
data types. Here the operations are named, but not how the values are constructed, at least not publicly. For example,
Float
is an abstract data type; we are given the names of the primitive arithmetic
and comparison operations, and also a way of displaying floating-point numbers, but it is not stated how such numbers are actually represented. We cannot define functions on these numbers by pattern matching, but only in terms of the given operations. What can and should be stated publicly are intended meanings and the algebraic properties of the operations. However, Haskell provides no means for such descriptions beyond informal comments.
As it stands,
Doc
is a concrete type. But in our understanding of this type, different terms do not denote different values. For instance, we intend each constructor to be a replacement for the corresponding operation. Thus
nil | = Nil |
line | = Line |
text s | = Text s |
nest i x | = Nest i x |
group x | = Group x |
x <> y | = x :<>: y |
We also want to keep the algebraic properties of these operations, so equations such as
(x :<>: y) :<>: z = x :<>: (y :<>: z)
Nest i (Nest j x) = Nest (i+j) x
should hold. But of course they do not. The solution is to use the module structure to hide the constructors of
Doc
from the user and insist only that the laws are ‘observably’ true. For instance we require
layouts ((x :<>: y) :<>: z) = layouts (x :<>: (y :<>: z))
The only way we can observe documents is through
layouts
; from the user’s point of view if two documents produce the same layouts, then they are essentially the same document.
Let’s get back to programming. Here is one definition of
layouts
. It is just the laws of
layouts
that we saw earlier, but now expressed as a proper Haskell definition:
layouts :: Doc -> [Layout] | |
layouts (x :<>: y) | = layouts x <++> layouts y |
layouts Nil | = [""] |
layouts Line | = ["\n"] |
layouts (Text s) | = [s] |
layouts (Nest i x) | = map (nestl i) (layouts x) |
layouts (Group x) | = layouts (flatten x) ++ layouts x |
The function
flatten
is similarly defined by
flatten :: Doc -> Doc | |
flatten (x :<>: y) | = flatten x :<>: flatten y |
flatten Nil | = Nil |
flatten Line | = Text " " |
flatten (Text s) | = Text s |
flatten (Nest i x) | = flatten x |
flatten (Group x) | = flatten x |
With these definitions, our 24 laws are either true by definition, or are observably true in the sense above.
The definition of
layouts
is simple enough, but it is unnecessarily inefficient. There are two separate reasons why this is so. First, consider the function
egotist
defined by
egotist :: Int -> Doc
egotist n | n==0 = nil
| otherwise = egotist (n-1) <> text "me"
The document
egotist n
is a very boring one, and its sole layout consists of a string of
n
repetitions of
me
. By the way, we could have expressed the definition using
Nil
,
(:<>:)
and
Text
but, as we have said, we are not going to make these constructors public. As it stands, the definition of
egotist
could have been made by a user of the library. Anyway, back to the main point, which is that the association of the
(<>)
operations is to the left, and it takes Θ(
n
2
) steps to compute its layout(s). The
(++)
operations pile up to the left. The situation is entirely analogous to the fact that
concat
defined in terms of
foldl
is an order of magnitude less efficient than one defined in terms of
foldr
.
The second source of inefficiency concerns nesting. For example, consider the function
egoist
defined by
egoist :: Int -> Doc
egoist n | n==0 = nil
| otherwise = nest 1 (text "me" <> egoist (n-1))
There are no line breaks in sight, so
egoist n
describes the same boring document as
egotist n
. But although the concatenation associates to the right, it still takes quadratic time to construct the layout. Each nesting operation is carried out by running through the entire document. Try it and see.
The way to solve the first problem is to delay concatenation, representing a concatenated document by a list of its component documents. The way to solve the second problem is to delay nesting, representing a nested document by a pair consisting of an indentation to be applied only when necessary and the document it is to be applied to. Combining both solutions, we represent a document by a list of indentation-document pairs. Specifically, consider the function
toDoc
defined by
toDoc :: [(Int,Doc)] -> Doc
toDoc ids = foldr (:<>:) Nil [Nest i x | (i,x) <- ids]
We can now calculate a definition of a function
layr
such that
layr = layouts . toDoc
and then define a new version of
layouts
based on
layr
. We leave the details as an exercise, but here is the result:
layouts x = layr [(0,x)] | |
layr [] | = [""] |
layr ((i,x :<>: y):ids) | = layr ((i,x):(i,y):ids) |
layr ((i,Nil):ids) | = layr ids |
layr ((i,Line):ids) | = ['\n':replicate i ' ' ++ ls |
| ls <- layr ids] | |
layr ((i,Text s):ids) | = [s ++ ls | ls <- layr ids] |
layr ((i,Nest j x):ids) | = layr ((i+j,x):ids) |
layr ((i,Group x):ids) | = layr ((i,flatten x):ids) ++ |
layr ((i,x):ids) |
This definition takes linear time for each layout. Exactly the same template is used for the function
pretty
, which chooses a single best layout:
pretty w x = best w [(0,x)] | |
where | |
best r [] | = "" |
best r ((i,x :<>: y):ids) | = best r ((i,x):(i,y):ids) |
best r ((i,Nil):ids) | = best r ids |
best r ((i,Line):ids) | = '\n':replicate i ' ' ++ |
best (w-i) ids | |
best r ((i,Text s):ids) | = s ++ best (r-length s) ids |
best r ((i,Nest j x):ids) | = best r ((i+j,x):ids) |
best r ((i,Group x):ids) | = better r |
(best r ((i,flatten x):ids)) | |
(best r ((i,x):ids)) |