Read Thinking Functionally with Haskell Online
Authors: Richard Bird
getc = Parser f
where f [] = []
f (c:cs) = [(c,cs)]
This parser returns the first character of the input if there is one. It plays exactly the same role for parsers as
getChar
does for the input–output monad of the previous chapter.
Next, here is a parser for recognising a character that satisfies a given condition:
sat :: (Char -> Bool) -> Parser Char
sat p = do {c <- getc;
if p c then return c
else fail}
where
fail
is defined by
fail = Parser (\s -> [])
The parser
fail
is another basic parser that returns no parses. The parser
sat p
reads a character and, if it satisfies
p
, returns the character as the result. The definition of
sat
can be written more briefly by using a little combinator called
guard
:
sat p = do {c <- getc; guard (p c); return c}
guard :: Parser ()
guard True = return ()
guard False = fail
To see that these two definitions are the same, observe that if
p c
is false, then
guard (p c) >> return c = fail >> return c = fail
Note the use of the law
fail >> p = fail
, whose proof we leave as an exercise. If
p c
is true, then
guard (p c) >> return c
= return () >> return c
= return c
Using
sat
we can define a number of other parsers; for instance
char :: Char -> Parser ()
char x = do {c <- sat (==x); return ()}
string :: String -> Parser ()
string [] = return ()
string (x:xs) = do {char x; string xs; return ()}
lower :: Parser Char
lower = sat isLower
digit :: Parser Int
digit = do {d <- sat isDigit; return (cvt d)}
where cvt d = fromEnum d - fromEnum '0'
The parser
char x
looks for the specific character
x
as the next item in the input string, while
string xs
looks for a specific string; both parsers return
()
if successful. For example,
ghci> apply (string "hell") "hello"
[((),"o")]
The parser
digit
looks for a digit character and returns the corresponding integer if successful. The parser
lower
looks for a lowercase letter, returning such a letter if found.
In order to define more sophisticated parsers we need operations for choosing between alternative parsers and for repeating parsers. One such alternation operator is
(<|>)
, defined by
(<|>) :: Parser a -> Parser a -> Parser a
p <|> q = Parser f
where f s = let ps = apply p s in
if null ps then apply q s
else ps
Thus
p <|> q
returns the same parses as
p
unless
p
fails, in which case the parses of
q
are returned. If both
p
and
q
are deterministic, then so is
p <|> q
. For another choice of
<|>
see the exercises. We claim that
<|>
is associative with
fail
as its identity element, but again we relegate the proof as an exercise.
Here is a parser for recognising a string of lowercase letters:
lowers :: Parser String
lowers = do {c <- lower; cs <- lowers; return (c:cs)}
<|> return ""
To see how this parser works, suppose the input is the string ‘Upper’. In this case the parser on the left of
<|>
fails because ‘U’ is not a lowercase letter. However, the parser on the right succeeds, so
ghci> apply lowers "Upper"
[("","Upper")]
With input string ‘isUpper’, the left-hand parser succeeds, so
ghci> apply lowers "isUpper"
[("is","Upper")]
Use of the choice operator
<|>
requires care. For example, consider a very simple form of arithmetic expression that consists of either a single digit or a digit followed by a plus sign followed by another digit. Here is a possible parser:
wrong :: Parser Int
wrong = digit <|> addition
addition :: Parser Int
addition = do {m <- digit; char '+'; n <- digit;
return (m+n)}
We have
ghci> apply wrong "1+2"
[(1,"+2")]
The parser
digit
succeeds, so
addition
is not invoked. But what we really wanted was to return
[(3,"")]
, absorbing as much of the input as possible. One way to correct
wrong
is to rewrite it in the form
better = addition <|> digit
Then on
1+2
the parser
addition
succeeds, returning the result we want. What is wrong with
better
is that it is inefficient: applied to the input
1
it parses the digit but fails to find a subsequent plus sign, so parser
addition
fails. As a result
digit
is invoked and the input is parsed again from scratch. Not really a problem with a single digit, but the repetition of effort could be costly if we were parsing for a numeral that could contain many digits.
The best solution is to
factor
the parser for digits out of the two component parsers:
best = digit >>= rest
rest m = do {char '+'; n <- digit; return (m+n)}
<|> return m
The argument to
rest
is just an accumulating parameter. We saw essentially the same solution in the chapter on pretty-printing. Factoring parsers to bring out common prefixes is a Good Idea to improve efficiency.
Generalising from the definition of
lowers
, we can define a parser combinator that repeats a parser zero or more times:
many :: Parser a -> Parser [a]
many p = do {x <- p; xs <- many p; return (x:xs)}
<|> none
none = return []
The value
none
is different from
fail
(why?). We can now define
lowers = many lower
In many applications, so-called
white space
(sequences of space, newline and tab characters) can appear between
tokens
(identifiers, numbers, opening and closing parentheses, and so on) just to make the text easier to read. The parser
space
recognises white space:
space :: Parser ()
space = many (sat isSpace) >> return ()
The function
isSpace
is defined in the library
Data.Char
. The function
symbol :: String -> Parser ()
symbol xs = space >> string xs
ignores white space before recognising a given string. More generally we can define
token :: Parser a -> Parser a
token p = space >> p
for ignoring white space before invoking a parser. Note that
token p <|> token q = token (p <|> q)
but the right-hand parser is more efficient as it does not look for white space twice if the first parser fails.
Sometimes we want to repeat a parser one or more times rather than zero or more times. This can be done by a combinator which we will call
some
(it is also called
many1
in some parser libraries):
some :: Parser a -> Parser [a]
some p = do {x <- p; xs <- many p; return (x:xs)}
This definition repeats that of the first parser in the definition of
many
, a fact we can take into account by redefining
many
in terms of
some
:
many :: Parser a -> Parser [a]
many p = optional (some p)
optional :: Parser [a] -> Parser [a]
optional p = p <|> none
The parsers
many
and
some
are now mutually recursive.
Here is a parser for natural numbers, one that allows white space before the number:
natural :: Parser Int
natural = token nat
nat = do {ds <- some digit;
return (foldl1 shiftl ds)}
where shiftl m n = 10*m+n
The subsidiary parser
nat
does not allow white space before the number.
Consider now how to define a parser for an
integer
numeral, which by definition is a nonempty string of digits possibly prefixed by a minus sign. You might think that the parser
int :: Parser Int
int = do {symbol "-"; n <- natural; return (-n)}
<|> natural
does the job, but it is inefficient (see Exercise H) and may or may not be what we want. For example,
ghci> apply int " -34"
[(-34,"")]
ghci> apply int " - 34"
[(-34,"")]
Whereas we are quite happy with white space before a numeral, we may not want any white space to appear between the minus sign and the ensuing digits. If that is the case, then the above parser will not do. It is easy to modify the given definition of
int
to give what we want:
int :: Parser Int
int = do {symbol "-"; n <- nat; return (-n)}
<|> natural
This parser is still inefficient, and a better alternative is to define
int :: Parser Int
int = do {space; f <- minus; n <- nat; return (f n)}
where
minus = (char '-' >> return negate) <|> return id
The parser
minus
returns a function, either
negate
if the first symbol is a minus sign, or the identity function otherwise.
Next, let us parse a list of integers, separated by commas and enclosed in square brackets. White space is allowed before and after commas and brackets though not of course between the digits of the integers. Here is a very short definition:
ints :: Parser [Int]
ints = bracket (manywith (symbol ",") int)
The subsidiary parser
bracket
deals with the brackets:
bracket :: Parser a -> Parser a
bracket p = do {symbol "[";
x <- p;
symbol "]";
return x}
The function
manywith sep p
acts a bit like
many p
but differs in that the instances of
p
are separated by instances of
sep
whose results are ignored. The definition is
manywith :: Parser b -> Parser a -> Parser [a]
manywith q p = optional (somewith q p)
somewith :: Parser b -> Parser a -> Parser [a]
somewith q p = do {x <- p;
xs <- many (q >> p);
return (x:xs)}
For example,
ghci> apply ints "[2, -3, 4]"
[([2,-3,4],"")]
ghci> apply ints "[2, -3, +4]"
[]
ghci> apply ints "[]"
[([],"")]
Integers cannot be preceded by a plus sign, so parsing the second expression fails.
The combinators described so far are sufficiently powerful for translating a structural description of what is required directly into a functional parser. Such a struc
tural description is provided by a
grammar
. We will illustrate some typical grammars by looking at parsers for various kinds of arithmetic expression.
Let us start by building a parser for the type
Expr
, defined by
data Expr
= Con Int | Bin Op Expr Expr
data Op
= Plus | Minus
Here is a grammar for fully parenthesised expressions, expressed in what is known as
Backus-Naur form
, or BNF for short:
expr ::= nat | '(' expr op expr ')'
op ::= '+' | '-'
nat ::= {digit}+
digit ::= '0' | '1' | ... | '9'
This grammar defines four
syntactic categories
. Symbols enclosed in quotes are called
terminal
symbols and describe themselves; these are symbols that actually occur in the text. There are ten possible characters for a digit, and a
nat
is defined as a sequence of one or more digits. The meta-symbol
{-}+
describes a non-zero repetition of a syntactic category. Note that we do not allow an optional minus sign before a sequence of digits, so constants are natural numbers, not arbitrary integers. The grammar states that an expression is either a natural number or else a compound expression consisting of an opening parenthesis, followed by an expression, followed by either a plus or minus sign, followed by another expression, and finally followed by a closing parenthesis. It is implicitly understood in the description that white space is ignored between terminal symbols except between the digits of a number. The grammar translates directly into a parser for expressions:
expr :: Parser Expr