Read Structure and Interpretation of Computer Programs Online
Authors: Harold Abelson and Gerald Jay Sussman with Julie Sussman
In the exercises below we will work with a system that uses
Huffman trees to encode and decode messages and generates Huffman
trees according to the algorithm outlined above. We will begin by
discussing how trees are represented.
Leaves of the tree are represented by a list consisting of the
symbol
leaf
, the symbol at the leaf, and the weight:
(define (make-leaf symbol weight)
(list 'leaf symbol weight))
(define (leaf? object)
(eq? (car object) 'leaf))
(define (symbol-leaf x) (cadr x))
(define (weight-leaf x) (caddr x))
A general tree will be a list of a left branch, a right branch, a set
of symbols, and a weight. The set of symbols will be simply a list of
the symbols, rather than some more sophisticated set representation.
When we make a tree by merging two nodes, we obtain the weight of the
tree as the sum of the weights of the nodes, and the set of symbols as
the union of the sets of symbols for the nodes. Since our symbol sets are
represented as lists, we can form the union by using the
append
procedure we defined in section
2.2.1
:
(define (make-code-tree left right)
(list left
right
(append (symbols left) (symbols right))
(+ (weight left) (weight right))))
If we make a tree in this way, we have the following selectors:
(define (left-branch tree) (car tree))
(define (right-branch tree) (cadr tree))
(define (symbols tree)
(if (leaf? tree)
(list (symbol-leaf tree))
(caddr tree)))
(define (weight tree)
(if (leaf? tree)
(weight-leaf tree)
(cadddr tree)))
The procedures
symbols
and
weight
must do something
slightly different depending on whether they are called with a leaf or
a general tree. These are simple examples of
generic
procedures
(procedures that can handle more than one kind of data),
which we will have much more to say about in
sections
2.4
and
2.5
.
The following procedure implements the decoding algorithm.
It takes as arguments a list of zeros and ones, together with
a Huffman tree.
(define (decode bits tree)
(define (decode-1 bits current-branch)
(if (null? bits)
'()
(let ((next-branch
(choose-branch (car bits) current-branch)))
(if (leaf? next-branch)
(cons (symbol-leaf next-branch)
(decode-1 (cdr bits) tree))
(decode-1 (cdr bits) next-branch)))))
(decode-1 bits tree))
(define (choose-branch bit branch)
(cond ((= bit 0) (left-branch branch))
((= bit 1) (right-branch branch))
(else (error "bad bit -- CHOOSE-BRANCH" bit))))
The procedure
decode-1
takes two arguments: the list of remaining bits
and the current position in the tree. It keeps moving
“down” the tree, choosing a left or a right branch according to
whether the next bit in the list is a zero or a one. (This is done
with the procedure
choose-branch
.) When it reaches a leaf, it
returns the symbol at that leaf as the next symbol in the message by
cons
ing it onto the result of decoding
the rest of the message, starting at the root of the tree.
Note the error check in the final clause of
choose-branch
, which
complains if the procedure finds something other than a zero or a one in the
input data.
In our representation of trees, each non-leaf node contains a set of
symbols, which we have represented as a simple list. However, the
tree-generating algorithm discussed above requires that we also work
with sets of leaves and trees, successively merging the two smallest
items. Since we will be required to repeatedly find the smallest item
in a set, it is convenient to use an ordered representation for this
kind of set.
We will represent a set of leaves and trees as a list of elements,
arranged in increasing order of weight. The following
adjoin-set
procedure for constructing sets is similar to the one
described in exercise
2.61
; however, items are compared
by their weights, and the element being added to the set is
never already in it.
(define (adjoin-set x set)
(cond ((null? set) (list x))
((< (weight x) (weight (car set))) (cons x set))
(else (cons (car set)
(adjoin-set x (cdr set))))))
The following procedure takes a list of
symbol-frequency pairs such as
((A 4) (B 2) (C 1) (D 1))
and
constructs an initial ordered set of leaves, ready to be merged
according to the Huffman algorithm:
(define (make-leaf-set pairs)
(if (null? pairs)
'()
(let ((pair (car pairs)))
(adjoin-set (make-leaf (car pair)
; symbol
(cadr pair))
; frequency
(make-leaf-set (cdr pairs))))))
Exercise 2.67.
Define an encoding tree and a sample message:
(define sample-tree
(make-code-tree (make-leaf 'A 4)
(make-code-tree
(make-leaf 'B 2)
(make-code-tree (make-leaf 'D 1)
(make-leaf 'C 1)))))
(define sample-message '(0 1 1 0 0 1 0 1 0 1 1 1 0))
Use the
decode
procedure to decode the
message, and give the result.
Exercise 2.68.
The
encode
procedure takes as arguments a message and a tree and
produces the list of bits that gives the encoded message.
(define (encode message tree)
(if (null? message)
'()
(append (encode-symbol (car message) tree)
(encode (cdr message) tree))))
Encode-symbol
is a procedure, which you must write, that returns
the list of bits that encodes a given symbol according to a given
tree. You should design
encode-symbol
so that it signals an
error if the symbol is not in the tree at all. Test your procedure by
encoding the result you obtained in exercise
2.67
with
the sample tree and seeing whether it is the same as the original
sample message.
Exercise 2.69.
The following procedure takes as its argument a list of
symbol-frequency pairs (where no symbol appears in more than one pair)
and generates a Huffman encoding tree according to the Huffman
algorithm.
(define (generate-huffman-tree pairs)
(successive-merge (make-leaf-set pairs)))
Make-leaf-set
is the procedure given above that transforms the
list of pairs into an ordered set of leaves.
Successive-merge
is the procedure you must write, using
make-code-tree
to
successively merge the smallest-weight elements of the set until there
is only one element left, which is the desired Huffman tree. (This
procedure is slightly tricky, but not really complicated. If you find
yourself designing a complex procedure, then you are almost certainly
doing something wrong. You can take significant advantage of the fact
that we are using an ordered set representation.)
Exercise 2.70.
The following eight-symbol alphabet with associated relative
frequencies was designed to efficiently encode the lyrics of 1950s
rock songs. (Note that the “symbols” of an “alphabet” need not be
individual letters.)
A | 2 | NA | 16 |
BOOM | 1 | SHA | 3 |
GET | 2 | YIP | 9 |
JOB | 2 | WAH | 1 |
Get a job
Sha na na na na na na na na
Get a job
Sha na na na na na na na na
Wah yip yip yip yip yip yip yip yip yip
Sha boom
How many bits are required for the encoding? What is the smallest
number of bits that would be needed to encode this song if we
used a fixed-length code for the eight-symbol alphabet?
Exercise 2.71.
Suppose we have a Huffman tree for an alphabet of
n
symbols, and
that the relative frequencies of the symbols are 1, 2, 4,
...
,
2
n
-1
. Sketch the tree for
n
=5; for
n
=10. In such a tree
(for general
n
) how may bits are required to encode the most
frequent symbol? the least frequent symbol?
Exercise 2.72.
Consider the encoding procedure that you designed in
exercise
2.68
. What is the order of growth in the
number of steps needed to encode a symbol? Be sure to include the
number of steps needed to search the symbol list at each node
encountered. To answer this question in general is difficult.
Consider the special case where the relative frequencies of the
n
symbols are as described in exercise
2.71
, and give
the order of growth (as a function of
n
) of the number of steps
needed to encode the most frequent and least frequent symbols in the
alphabet.
32
Allowing quotation in a language wreaks havoc
with the ability to reason about the language in simple terms, because
it destroys the notion that equals can be substituted for equals. For
example, three is one plus two, but the word “three” is not the
phrase “one plus two.” Quotation is powerful because it gives us a way
to build expressions that manipulate other expressions (as we will see
when we write an interpreter in chapter 4). But allowing statements in
a language that talk about other statements in that language makes it
very difficult to maintain any coherent principle of what “equals can
be substituted for equals” should mean. For example, if we know that
the evening star is the morning star, then from the statement “the
evening star is Venus” we can deduce “the morning star is Venus.”
However, given that “John knows that the evening star is Venus” we
cannot infer that “John knows that the morning star is Venus.”
33
The single quote is different
from the double quote we have
been using to enclose character strings to be printed. Whereas the
single quote can be used to denote lists or symbols, the double quote
is used only with character strings. In this book, the only use for
character strings is as items to be printed.
34
Strictly, our
use of the quotation mark violates the general rule that all compound
expressions in our language should be delimited by parentheses
and look like lists. We
can recover this consistency by introducing a special form
quote
, which serves the same purpose as the quotation mark. Thus, we
would type
(quote a)
instead of
'a
, and we would type
(quote (a b c))
instead of
'(a b c)
. This is precisely how the
interpreter works. The quotation mark is just a single-character
abbreviation for wrapping the next complete expression with
quote
to form
(quote <
expression
>)
. This is important
because it maintains the principle that any expression seen by the
interpreter can be manipulated as a data object. For instance, we
could construct the expression
(car '(a b c))
, which is the same as
(car (quote (a b c)))
,
by evaluating
(list 'car (list 'quote '(a b c)))
.
35
We can consider two symbols to be “the same” if they
consist of the same characters in the same order. Such a definition
skirts a deep issue that we are not yet ready to address: the meaning
of “sameness” in a programming language. We will return to this in
chapter 3 (section
3.1.3
).
36
In practice, programmers
use
equal?
to compare lists that contain numbers as well as
symbols. Numbers are not considered to be symbols. The question
of whether two numerically equal numbers (as tested by
=
) are also
eq?
is highly implementation-dependent. A better definition
of
equal?
(such as the one that comes as a primitive in Scheme)
would also stipulate that if
a
and
b
are
both numbers, then
a
and
b
are
equal?
if they are
numerically equal.
37
If
we want to be more formal, we can specify “consistent with the
interpretations given above” to mean that the operations satisfy a
collection of rules such as these:
38
Halving the size of the problem at each step is the
distinguishing characteristic of
logarithmic growth, as we saw with
the fast-exponentiation algorithm of section
1.2.4
and the half-interval search method of
section
1.3.3
.
39
We are
representing sets in terms of trees, and trees in terms of lists – in
effect, a data abstraction built upon a data abstraction. We can
regard the procedures
entry
,
left-branch
,
right-branch
, and
make-tree
as a way of isolating the
abstraction of a “binary tree” from the particular way we might wish
to represent such a tree in terms of list structure.
40
Examples of such structures include
B-trees
and
red-black trees
. There is a large literature on
data structures devoted to this problem. See Cormen,
Leiserson, and Rivest 1990.
41
Exercises
2.63
-
2.65
are due to Paul Hilfinger.
42
See Hamming 1980
for a discussion of the mathematical properties of Huffman codes.