Looks good! Nondeterminism has apparently given us the power to recognize languages that deterministic machines can’t handle. But wait: we saw in Equivalence that nondeterministic machines without a stack are exactly equivalent in power to deterministic ones. Our Ruby NFA simulation behaved like a DFA—moving between a finite number of “simulation states” as it read each character of the input string—which gave us a way to turn any NFA into a DFA that accepts the same strings. So has nondeterminism really given us any extra power, or does our Ruby NPDA simulation just behave like a DPDA? Is there an algorithm for converting any nondeterministic pushdown automaton into a deterministic one?
Well, no, it turns out that there isn’t. The NFA-to-DFA trick only works because we can
use a single DFA state to represent many possible NFA states. To simulate an NFA, we only
need to keep track of what states it could currently be in, then pick a different set of
possible states each time we read an input character, and a DFA can easily do that job if we
give it the right rules.
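To make that concrete, here’s a minimal sketch of the idea in Ruby, using a made-up three-state rule table rather than the NFA classes we built earlier. The point is that the simulation’s entire “current state” is a finite set of possible states, so a DFA with one state per set can do the same job:

require 'set'

# Hypothetical NFA rules: [state, character] => possible next states.
RULES = {
  [1, 'a'] => [1, 2],
  [2, 'b'] => [3],
  [3, 'b'] => [3]
}

def next_states(states, character)
  states.flat_map { |state| RULES.fetch([state, character], []) }.to_set
end

# Reading a character maps one subset of {1, 2, 3} to another subset.
# There are only finitely many subsets, so a DFA with one state per
# subset can follow along exactly.
states = Set[1]
'abb'.each_char { |character| states = next_states(states, character) }
states # => #<Set: {3}>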
But that trick doesn’t work for PDAs: we can’t usefully represent
multiple NPDA configurations as a single DPDA configuration. The
problem, unsurprisingly, is the stack. An NPDA simulation needs to know
all the characters that could currently be on top of the stack, and it
must be able to pop and push several of the simulated stacks
simultaneously. There’s no way to combine all the possible stacks into a
single stack so that a DPDA can still see all the topmost characters and
access every possible stack individually. We didn’t have any difficulty
writing a Ruby program to do all this, but a DPDA just isn’t powerful
enough to handle it.
So unfortunately, our NPDA simulation does not behave like a DPDA, and there isn’t an NPDA-to-DPDA algorithm. The unmarked palindrome problem is an example of a job where an NPDA can do something that a DPDA can’t, so nondeterministic pushdown automata really do have more power than deterministic ones.
Regular Expressions showed how nondeterministic finite automata can be used to implement regular expression matching. Pushdown automata have an important practical application too: they can be used to parse programming languages.
We already saw in Implementing Parsers how to use Treetop to build a parser for part of the Simple language. Treetop parsers use a single parsing expression grammar to describe the complete syntax of the language being parsed, but that’s a relatively modern idea. A more traditional approach is to break the parsing process apart into two separate stages:
1. Read a raw string of characters and turn it into a sequence of tokens. Each token represents an individual building block of program syntax, like “variable name,” “opening bracket,” or “while keyword.” A lexical analyzer uses a language-specific set of rules called a lexical grammar to decide which sequences of characters should produce which tokens. This stage deals with messy character-level details like variable-naming rules, comments, and whitespace, leaving a clean sequence of tokens for the next stage to consume.
2. Read a sequence of tokens and decide whether they represent a valid program according to the syntactic grammar of the language being parsed. If the program is valid, the syntactic analyzer may produce additional information about its structure (e.g., a parse tree).
The lexical analysis stage is usually pretty straightforward. It can be done with regular expressions (and therefore by an NFA), because it involves simply matching a flat sequence of characters against some rules and deciding whether those characters look like a keyword, a variable name, an operator, or whatever else. Here’s some quick-and-dirty Ruby code to chop up a Simple program into tokens:
class LexicalAnalyzer < Struct.new(:string)
  GRAMMAR = [
    { token: 'i', pattern: /if/ },         # if keyword
    { token: 'e', pattern: /else/ },       # else keyword
    { token: 'w', pattern: /while/ },      # while keyword
    { token: 'd', pattern: /do-nothing/ }, # do-nothing keyword
    { token: '(', pattern: /\(/ },         # opening bracket
    { token: ')', pattern: /\)/ },         # closing bracket
    { token: '{', pattern: /\{/ },         # opening curly bracket
    { token: '}', pattern: /\}/ },         # closing curly bracket
    { token: ';', pattern: /;/ },          # semicolon
    { token: '=', pattern: /=/ },          # equals sign
    { token: '+', pattern: /\+/ },         # addition sign
    { token: '*', pattern: /\*/ },         # multiplication sign
    { token: '<', pattern: /</ },          # less-than sign
    { token: 'n', pattern: /[0-9]+/ },     # number
    { token: 'b', pattern: /true|false/ }, # boolean
    { token: 'v', pattern: /[a-z]+/ }      # variable name
  ]

  def analyze
    [].tap do |tokens|
      while more_tokens?
        tokens.push(next_token)
      end
    end
  end

  def more_tokens?
    !string.empty?
  end

  def next_token
    rule, match = rule_matching(string)
    self.string = string_after(match)
    rule[:token]
  end

  def rule_matching(string)
    matches = GRAMMAR.map { |rule| match_at_beginning(rule[:pattern], string) }
    rules_with_matches = GRAMMAR.zip(matches).reject { |rule, match| match.nil? }
    rule_with_longest_match(rules_with_matches)
  end

  def match_at_beginning(pattern, string)
    /\A#{pattern}/.match(string)
  end

  def rule_with_longest_match(rules_with_matches)
    rules_with_matches.max_by { |rule, match| match.to_s.length }
  end

  def string_after(match)
    match.post_match.lstrip
  end
end
This implementation uses single characters as tokens—w means “the while keyword,” + means “the addition sign,” and so on—because soon we’re going to be feeding those tokens to a PDA, and our Ruby PDA simulations expect to read characters as input. Single-character tokens are good enough for a basic demonstration where we don’t need to retain the names of variables or the values of literals. In a real parser, however, we’d want to use a proper data structure to represent tokens so they could communicate more information than just “some unknown variable name” or “some unknown Boolean.”
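For illustration only, such a data structure could be as simple as this hypothetical Token sketch; the PDA simulation below still expects plain characters, so we won’t use it here:

# Hypothetical richer tokens (an illustration, not part of the parser below).
Token = Struct.new(:type, :value)

Token.new(:variable, 'x')     # records which variable was named
Token.new(:number, 42)        # records the literal’s value
Token.new(:keyword, 'while')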
By creating a LexicalAnalyzer instance with a string of Simple code and calling its #analyze method, we can get back an array of tokens showing how the code breaks down into keywords, operators, punctuation, and other pieces of syntax:
>> LexicalAnalyzer.new('y = x * 7').analyze
=> ["v", "=", "v", "*", "n"]
>> LexicalAnalyzer.new('while (x < 5) { x = x * 3 }').analyze
=> ["w", "(", "v", "<", "n", ")", "{", "v", "=", "v", "*", "n", "}"]
>> LexicalAnalyzer.new('if (x < 10) { y = true; x = 0 } else { do-nothing }').analyze
=> ["i", "(", "v", "<", "n", ")", "{", "v", "=", "b", ";", "v", "=", "n", "}", "e",
   "{", "d", "}"]
Choosing the rule with the longest match allows the lexical
analyzer to handle variables whose names would otherwise cause them to
be wrongly identified as keywords:
>> LexicalAnalyzer.new('x = false').analyze
=> ["v", "=", "b"]
>> LexicalAnalyzer.new('x = falsehood').analyze
=> ["v", "=", "v"]
There are other ways of dealing with this problem. One alternative would be to write more restrictive regular expressions in the rules: if the Boolean rule used the pattern /(true|false)(?![a-z])/, then it wouldn’t match the string 'falsehood' in the first place.
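We can check that behavior directly in IRB, assuming the same \A anchoring that #match_at_beginning adds to every pattern:

>> /\A(true|false)(?![a-z])/.match('false')
=> #<MatchData "false" 1:"false">
>> /\A(true|false)(?![a-z])/.match('falsehood')
=> nil

The negative lookahead (?![a-z]) makes the match fail whenever another letter immediately follows the keyword, so 'falsehood' would fall through to the variable-name rule.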
Once we’ve done the easy work of turning a string into tokens, the harder problem is to decide whether those tokens represent a syntactically valid Simple program. We can’t use regular expressions or NFAs to do it—Simple’s syntax allows arbitrary nesting of brackets, and we already know that finite automata aren’t powerful enough to recognize languages like that. It is possible to use a pushdown automaton to recognize valid sequences of tokens, though, so let’s see how to construct one.
First we need a syntactic grammar that describes how tokens may be combined to form programs. Here’s part of a grammar for Simple, based on the structure of the Treetop grammar in Implementing Parsers:

<statement>  ::= <while> | <assign>
<while>      ::= 'w' '(' <expression> ')' '{' <statement> '}'
<assign>     ::= 'v' '=' <expression>
<expression> ::= <less-than>
<less-than>  ::= <multiply> '<' <less-than> | <multiply>
<multiply>   ::= <term> '*' <multiply> | <term>
<term>       ::= 'n' | 'v'
This is called a context-free grammar (CFG).[35] Each rule has a symbol on the lefthand side and one or more sequences of symbols and tokens on the right. For example, the rule <statement> ::= <while> | <assign> means that a Simple statement is either a while loop or an assignment, and <assign> ::= 'v' '=' <expression> means that an assignment statement consists of a variable name followed by an equals sign and an expression.
The CFG is a static description of Simple’s structure, but we can also think of it as a set of rules for generating Simple programs. Starting from the <statement> symbol, we can apply the grammar rules to recursively expand symbols until only tokens remain. Here’s one of many ways to fully expand <statement> according to the rules:
<statement>
→ <assign>
→ 'v' '=' <expression>
→ 'v' '=' <less-than>
→ 'v' '=' <multiply>
→ 'v' '=' <term> '*' <multiply>
→ 'v' '=' 'v' '*' <multiply>
→ 'v' '=' 'v' '*' <term>
→ 'v' '=' 'v' '*' 'n'
This tells us that 'v' '=' 'v' '*' 'n' represents a syntactically valid program, but we want the ability to go in the opposite direction: to recognize valid programs, not generate them. When we get a sequence of tokens out of the lexical analyzer, we’d like to know whether it’s possible to expand the <statement> symbol into those tokens by applying the grammar rules in some order. Fortunately, there’s a way to turn a context-free grammar into a nondeterministic pushdown automaton that can make exactly this decision.
The technique for converting a CFG into a PDA works like
this:
1. Pick a character to represent each symbol from the grammar. In this case, we’ll use the uppercase initial of each symbol—S for <statement>, W for <while>, and so on—to distinguish them from the lowercase characters we’re using as tokens.
2. Use the PDA’s stack to store characters that represent grammar symbols (S, W, A, E, …) and tokens (w, v, =, *, …). When the PDA starts, have it immediately push a symbol onto the stack to represent the structure it’s trying to recognize. We want to recognize Simple statements, so our PDA will begin by pushing S onto the stack:
>> start_rule = PDARule.new(1, nil, 2, '$', ['S', '$'])
=> #<struct PDARule state=1, character=nil, next_state=2, pop_character="$", push_characters=["S", "$"]>
3. Translate the grammar rules into PDA rules that expand symbols on the top of the stack without reading any input. Each grammar rule describes how to expand a single symbol into a sequence of other symbols and tokens, and we can turn that description into a PDA rule that pops a particular symbol’s character off the stack and pushes other characters on:
>> symbol_rules = [
     # <statement> ::= <while> | <assign>
     PDARule.new(2, nil, 2, 'S', ['W']),
     PDARule.new(2, nil, 2, 'S', ['A']),

     # <while> ::= 'w' '(' <expression> ')' '{' <statement> '}'
     PDARule.new(2, nil, 2, 'W', ['w', '(', 'E', ')', '{', 'S', '}']),

     # <assign> ::= 'v' '=' <expression>
     PDARule.new(2, nil, 2, 'A', ['v', '=', 'E']),

     # <expression> ::= <less-than>
     PDARule.new(2, nil, 2, 'E', ['L']),

     # <less-than> ::= <multiply> '<' <less-than> | <multiply>
     PDARule.new(2, nil, 2, 'L', ['M', '<', 'L']),
     PDARule.new(2, nil, 2, 'L', ['M']),

     # <multiply> ::= <term> '*' <multiply> | <term>
     PDARule.new(2, nil, 2, 'M', ['T', '*', 'M']),
     PDARule.new(2, nil, 2, 'M', ['T']),

     # <term> ::= 'n' | 'v'
     PDARule.new(2, nil, 2, 'T', ['n']),
     PDARule.new(2, nil, 2, 'T', ['v'])
   ]
=> [#<struct PDARule …>, #<struct PDARule …>, …]
For example, the rule for assignment statements says that the <assign> symbol can be expanded to the tokens v and = followed by the <expression> symbol, so we have a corresponding PDA rule that spontaneously pops an A off the stack and pushes the characters v=E back on. The <statement> rule says that we can replace the <statement> symbol with a <while> or <assign> symbol; we’ve turned that into one PDA rule that pops an S from the stack and replaces it with a W, and another rule that pops an S and pushes an A.
4. Give every token character a PDA rule that reads that character from the input and pops it off the stack:
>> token_rules = LexicalAnalyzer::GRAMMAR.map do |rule|
     PDARule.new(2, rule[:token], 2, rule[:token], [])
   end
=> [#<struct PDARule …>, #<struct PDARule …>, …]
These token rules work in opposition to the symbol rules. The
symbol rules tend to make the stack larger, sometimes pushing
several characters to replace the one that’s been popped; the token
rules always make the stack smaller, consuming input as they
go.
5. Finally, make a PDA rule that will allow the machine to enter an accept state if the stack becomes empty:
>> stop_rule = PDARule.new(2, nil, 3, '$', ['$'])
=> #<struct PDARule state=2, character=nil, next_state=3, pop_character="$", push_characters=["$"]>
Now we can build a PDA with these rules and feed it a string of tokens to see whether it recognizes them. The rules generated by the Simple grammar are nondeterministic—there’s more than one applicable rule whenever the character S, L, M, or T is topmost on the stack—so it’ll have to be an NPDA:
>> rulebook = NPDARulebook.new([start_rule, stop_rule] + symbol_rules + token_rules)
=> #<struct NPDARulebook rules=[…]>
>> npda_design = NPDADesign.new(1, '$', [3], rulebook)
=> #<struct NPDADesign …>
>> token_string = LexicalAnalyzer.new('while (x < 5) { x = x * 3 }').analyze.join
=> "w(v<n){v=v*n}"
>> npda_design.accepts?(token_string)
=> true
>> npda_design.accepts?(LexicalAnalyzer.new('while (x < 5 x = x * }').analyze.join)
=> false
To show exactly what’s going on, here’s one possible execution of the NPDA when it’s fed the string 'w(v<n){v=v*n}':
State | Accepting? | Stack contents | Remaining input | Action |
1 | no | $ | w(v<n){v=v*n} | push S, go to state 2 |
2 | no | S$ | w(v<n){v=v*n} | pop S, push W |
2 | no | W$ | w(v<n){v=v*n} | pop W, push w(E){S} |
2 | no | w(E){S}$ | w(v<n){v=v*n} | read w, pop w |
2 | no | (E){S}$ | (v<n){v=v*n} | read (, pop ( |
2 | no | E){S}$ | v<n){v=v*n} | pop E, push L |
2 | no | L){S}$ | v<n){v=v*n} | pop L, push M<L |
2 | no | M<L){S}$ | v<n){v=v*n} | pop M, push T |
2 | no | T<L){S}$ | v<n){v=v*n} | pop T, push v |
2 | no | v<L){S}$ | v<n){v=v*n} | read v, pop v |
2 | no | <L){S}$ | <n){v=v*n} | read <, pop < |
2 | no | L){S}$ | n){v=v*n} | pop L, push M |
2 | no | M){S}$ | n){v=v*n} | pop M, push T |
2 | no | T){S}$ | n){v=v*n} | pop T, push n |
2 | no | n){S}$ | n){v=v*n} | read n, pop n |
2 | no | ){S}$ | ){v=v*n} | read ), pop ) |
2 | no | {S}$ | {v=v*n} | read {, pop { |
2 | no | S}$ | v=v*n} | pop S, push A |
2 | no | A}$ | v=v*n} | pop A, push v=E |
2 | no | v=E}$ | v=v*n} | read v, pop v |
2 | no | =E}$ | =v*n} | read =, pop = |
2 | no | E}$ | v*n} | pop E, push L |
2 | no | L}$ | v*n} | pop L, push M |
2 | no | M}$ | v*n} | pop M, push T*M |
2 | no | T*M}$ | v*n} | pop T, push v |
2 | no | v*M}$ | v*n} | read v, pop v |
2 | no | *M}$ | *n} | read *, pop * |
2 | no | M}$ | n} | pop M, push T |
2 | no | T}$ | n} | pop T, push n |
2 | no | n}$ | n} | read n, pop n |
2 | no | }$ | } | read }, pop } |
2 | no | $ | | go to state 3 |
3 | yes | $ | | — |