
Looks good! Nondeterminism has apparently given us the power to recognize
languages that deterministic machines can’t handle.

Nonequivalence

But wait: we saw in Equivalence that nondeterministic machines without a stack
are exactly equivalent in power to deterministic ones. Our Ruby NFA simulation
behaved like a DFA—moving between a finite number of “simulation states” as it
read each character of the input string—which gave us a way to turn any NFA
into a DFA that accepts the same strings. So has nondeterminism really given
us any extra power, or does our Ruby NPDA simulation just behave like a DPDA?
Is there an algorithm for converting any nondeterministic pushdown automaton
into a deterministic one?

Well, no, it turns out that there isn’t. The NFA-to-DFA trick only works because we can
use a single DFA state to represent many possible NFA states. To simulate an NFA, we only
need to keep track of what states it could currently be in, then pick a different set of
possible states each time we read an input character, and a DFA can easily do that job if we
give it the right rules.

But that trick doesn’t work for PDAs: we can’t usefully represent
multiple NPDA configurations as a single DPDA configuration. The
problem, unsurprisingly, is the stack. An NPDA simulation needs to know
all the characters that could currently be on top of the stack, and it
must be able to pop and push several of the simulated stacks
simultaneously. There’s no way to combine all the possible stacks into a
single stack so that a DPDA can still see all the topmost characters and
access every possible stack individually. We didn’t have any difficulty
writing a Ruby program to do all this, but a DPDA just isn’t powerful
enough to handle it.
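
To see what this means in practice, here is a minimal sketch of the
bookkeeping any NPDA simulation has to do. It’s a hypothetical fragment in the
spirit of this chapter’s classes (assuming rules respond to #applies_to? and
#follow the way PDARule does), not code from the book:

# Every simulated configuration carries its own private stack.
Configuration = Struct.new(:state, :stack)

# Step a whole *set* of configurations on one input character. Each
# configuration’s stack evolves independently; there is no way to merge
# them all into the single stack a DPDA gets.
def next_configurations(configurations, character, rules)
  configurations.flat_map { |config|
    applicable = rules.select { |rule| rule.applies_to?(config, character) }
    applicable.map { |rule| rule.follow(config) }
  }.uniq
end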

So unfortunately, our NPDA simulation does not behave like a DPDA, and there
isn’t an NPDA-to-DPDA algorithm. The unmarked palindrome problem is an example
of a job where an NPDA can do something that a DPDA can’t, so nondeterministic
pushdown automata really do have more power than deterministic ones.
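
The crux of that problem is guessing the midpoint: with no marker character to
look for, the machine must decide at every step whether it has reached the
halfway point of the input. Recall the shape of the rules from
Nondeterminism; the following is a reminder sketch of just the
midpoint-guessing free moves, not the complete rulebook:

# State 1 pushes the first half of the input; state 2 pops it back off while
# checking that the second half mirrors it. These free moves guess the
# midpoint: an NPDA explores every guess at once, a DPDA gets only one try.
PDARule.new(1, nil, 2, '$', ['$'])   # guess the input is empty
PDARule.new(1, nil, 2, 'a', ['a'])   # guess we’re halfway, with 'a' on top
PDARule.new(1, nil, 2, 'b', ['b'])   # guess we’re halfway, with 'b' on top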

Parsing with Pushdown Automata

Regular Expressions showed how nondeterministic finite automata can be used to
implement regular expression matching. Pushdown automata have an important
practical application too: they can be used to parse programming languages.

We already saw in Implementing Parsers how to use Treetop to build a parser
for part of the Simple language. Treetop parsers use a single parsing
expression grammar to describe the complete syntax of the language being
parsed, but that’s a relatively modern idea. A more traditional approach is to
break the parsing process apart into two separate stages:

Lexical analysis

Read a raw string of characters and turn it into a sequence of tokens. Each
token represents an individual building block of program syntax, like
“variable name,” “opening bracket,” or “while keyword.” A lexical analyzer
uses a language-specific set of rules called a lexical grammar to decide which
sequences of characters should produce which tokens. This stage deals with
messy character-level details like variable-naming rules, comments, and
whitespace, leaving a clean sequence of tokens for the next stage to consume.

Syntactic analysis

Read a sequence of tokens and decide whether they represent a valid program
according to the syntactic grammar of the language being parsed. If the
program is valid, the syntactic analyzer may produce additional information
about its structure (e.g., a parse tree).

Lexical Analysis

The lexical analysis stage is usually pretty straightforward. It can be done
with regular expressions (and therefore by an NFA), because it involves simply
matching a flat sequence of characters against some rules and deciding whether
those characters look like a keyword, a variable name, an operator, or
whatever else. Here’s some quick-and-dirty Ruby code to chop up a Simple
program into tokens:

class LexicalAnalyzer < Struct.new(:string)
  GRAMMAR = [
    { token: 'i', pattern: /if/         }, # if keyword
    { token: 'e', pattern: /else/       }, # else keyword
    { token: 'w', pattern: /while/      }, # while keyword
    { token: 'd', pattern: /do-nothing/ }, # do-nothing keyword
    { token: '(', pattern: /\(/         }, # opening bracket
    { token: ')', pattern: /\)/         }, # closing bracket
    { token: '{', pattern: /\{/         }, # opening curly bracket
    { token: '}', pattern: /\}/         }, # closing curly bracket
    { token: ';', pattern: /;/          }, # semicolon
    { token: '=', pattern: /=/          }, # equals sign
    { token: '+', pattern: /\+/         }, # addition sign
    { token: '*', pattern: /\*/         }, # multiplication sign
    { token: '<', pattern: /</          }, # less-than sign
    { token: 'n', pattern: /[0-9]+/     }, # number
    { token: 'b', pattern: /true|false/ }, # boolean
    { token: 'v', pattern: /[a-z]+/     }  # variable name
  ]

  def analyze
    [].tap do |tokens|
      while more_tokens?
        tokens.push(next_token)
      end
    end
  end

  def more_tokens?
    !string.empty?
  end

  def next_token
    rule, match = rule_matching(string)
    self.string = string_after(match)
    rule[:token]
  end

  def rule_matching(string)
    matches = GRAMMAR.map { |rule| match_at_beginning(rule[:pattern], string) }
    rules_with_matches = GRAMMAR.zip(matches).reject { |rule, match| match.nil? }
    rule_with_longest_match(rules_with_matches)
  end

  def match_at_beginning(pattern, string)
    /\A#{pattern}/.match(string)
  end

  def rule_with_longest_match(rules_with_matches)
    rules_with_matches.max_by { |rule, match| match.to_s.length }
  end

  def string_after(match)
    match.post_match.lstrip
  end
end
Note

This implementation uses single characters as tokens—w means “the while
keyword,” + means “the addition sign,” and so on—because soon we’re going to
be feeding those tokens to a PDA, and our Ruby PDA simulations expect to read
characters as input.

Single-character tokens are good enough for a basic demonstration where we
don’t need to retain the names of variables or the values of literals. In a
real parser, however, we’d want to use a proper data structure to represent
tokens so they could communicate more information than just “some unknown
variable name” or “some unknown Boolean.”
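
For example, a richer token could carry the matched text alongside its type.
Here is a hypothetical sketch, not something the PDA simulation below can
consume:

# Hypothetical token structure retaining the original characters.
Token = Struct.new(:type, :value)

Token.new(:variable, 'x')    # rather than just 'v'
Token.new(:number, '42')     # rather than just 'n'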

By creating a LexicalAnalyzer instance with a string of Simple code and
calling its #analyze method, we can get back an array of tokens showing how
the code breaks down into keywords, operators, punctuation, and other pieces
of syntax:

>> LexicalAnalyzer.new('y = x * 7').analyze
=> ["v", "=", "v", "*", "n"]
>> LexicalAnalyzer.new('while (x < 5) { x = x * 3 }').analyze
=> ["w", "(", "v", "<", "n", ")", "{", "v", "=", "v", "*", "n", "}"]
>> LexicalAnalyzer.new('if (x < 10) { y = true; x = 0 } else { do-nothing }').analyze
=> ["i", "(", "v", "<", "n", ")", "{", "v", "=", "b", ";", "v", "=", "n", "}", "e",
"{", "d", "}"]
Note

Choosing the rule with the longest match allows the lexical analyzer to handle
variables whose names would otherwise cause them to be wrongly identified as
keywords:

>> LexicalAnalyzer.new('x = false').analyze
=> ["v", "=", "b"]
>> LexicalAnalyzer.new('x = falsehood').analyze
=> ["v", "=", "v"]

There are other ways of dealing with this problem. One alternative would be to
write more restrictive regular expressions in the rules: if the Boolean rule
used the pattern /(true|false)(?![a-z])/, then it wouldn’t match the string
'falsehood' in the first place.
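
A quick irb check (our own, not from the book’s examples) shows the anchored
lookahead doing its job:

>> /\A(true|false)(?![a-z])/.match('falsehood')
=> nil
>> /\A(true|false)(?![a-z])/.match('false')
=> #<MatchData "false" 1:"false">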

Syntactic Analysis

Once we’ve done the easy work of turning a string into tokens, the harder
problem is to decide whether those tokens represent a syntactically valid
Simple program. We can’t use regular expressions or NFAs to do it—Simple’s
syntax allows arbitrary nesting of brackets, and we already know that finite
automata aren’t powerful enough to recognize languages like that. It is
possible to use a pushdown automaton to recognize valid sequences of tokens,
though, so let’s see how to construct one.

First we need a syntactic grammar that describes how tokens may be combined to
form programs. Here’s part of a grammar for Simple, based on the structure of
the Treetop grammar in Implementing Parsers:

<statement>  ::= <while> | <assign>
<while>      ::= 'w' '(' <expression> ')' '{' <statement> '}'
<assign>     ::= 'v' '=' <expression>
<expression> ::= <less-than>
<less-than>  ::= <multiply> '<' <less-than> | <multiply>
<multiply>   ::= <term> '*' <multiply> | <term>
<term>       ::= 'n' | 'v'

This is called a context-free grammar (CFG).[35] Each rule has a symbol on the
lefthand side and one or more sequences of symbols and tokens on the right.
For example, the rule <statement> ::= <while> | <assign> means that a Simple
statement is either a while loop or an assignment, and <assign> ::= 'v' '='
<expression> means that an assignment statement consists of a variable name
followed by an equals sign and an expression.

The CFG is a static description of Simple’s structure, but we can also think
of it as a set of rules for generating Simple programs. Starting from the
<statement> symbol, we can apply the grammar rules to recursively expand
symbols until only tokens remain. Here’s one of many ways to fully expand
<statement> according to the rules:

<statement>
→ <assign>
→ 'v' '=' <expression>
→ 'v' '=' <less-than>
→ 'v' '=' <multiply>
→ 'v' '=' <term> '*' <multiply>
→ 'v' '=' 'v' '*' <multiply>
→ 'v' '=' 'v' '*' <term>
→ 'v' '=' 'v' '*' 'n'

This tells us that 'v' '=' 'v' '*' 'n' represents a syntactically valid
program, but we want the ability to go in the opposite direction: to recognize
valid programs, not generate them. When we get a sequence of tokens out of the
lexical analyzer, we’d like to know whether it’s possible to expand the
<statement> symbol into those tokens by applying the grammar rules in some
order. Fortunately, there’s a way to turn a context-free grammar into a
nondeterministic pushdown automaton that can make exactly this decision.

The technique for converting a CFG into a PDA works like this:

  1. Pick a character to represent each symbol from the grammar. In this
     case, we’ll use the uppercase initial of each symbol—S for <statement>,
     W for <while>, and so on—to distinguish them from the lowercase
     characters we’re using as tokens.

  2. Use the PDA’s stack to store characters that represent grammar symbols
     (S, W, A, E, …) and tokens (w, v, =, *, …). When the PDA starts, have it
     immediately push a symbol onto the stack to represent the structure it’s
     trying to recognize. We want to recognize Simple statements, so our PDA
     will begin by pushing S onto the stack:

     >> start_rule = PDARule.new(1, nil, 2, '$', ['S', '$'])
     => #<struct PDARule state=1, character=nil, next_state=2, pop_character="$", push_characters=["S", "$"]>

  3. Translate the grammar rules into PDA rules that expand symbols on the
     top of the stack without reading any input. Each grammar rule describes
     how to expand a single symbol into a sequence of other symbols and
     tokens, and we can turn that description into a PDA rule that pops a
     particular symbol’s character off the stack and pushes other characters
     on:

     >> symbol_rules = [
          # <statement> ::= <while> | <assign>
          PDARule.new(2, nil, 2, 'S', ['W']),
          PDARule.new(2, nil, 2, 'S', ['A']),

          # <while> ::= 'w' '(' <expression> ')' '{' <statement> '}'
          PDARule.new(2, nil, 2, 'W', ['w', '(', 'E', ')', '{', 'S', '}']),

          # <assign> ::= 'v' '=' <expression>
          PDARule.new(2, nil, 2, 'A', ['v', '=', 'E']),

          # <expression> ::= <less-than>
          PDARule.new(2, nil, 2, 'E', ['L']),

          # <less-than> ::= <multiply> '<' <less-than> | <multiply>
          PDARule.new(2, nil, 2, 'L', ['M', '<', 'L']),
          PDARule.new(2, nil, 2, 'L', ['M']),

          # <multiply> ::= <term> '*' <multiply> | <term>
          PDARule.new(2, nil, 2, 'M', ['T', '*', 'M']),
          PDARule.new(2, nil, 2, 'M', ['T']),

          # <term> ::= 'n' | 'v'
          PDARule.new(2, nil, 2, 'T', ['n']),
          PDARule.new(2, nil, 2, 'T', ['v'])
        ]
     => [#<struct PDARule …>, #<struct PDARule …>, …]

     For example, the rule for assignment statements says that the <assign>
     symbol can be expanded to the tokens v and = followed by the
     <expression> symbol, so we have a corresponding PDA rule that
     spontaneously pops an A off the stack and pushes the characters v=E back
     on. The <statement> rule says that we can replace the <statement> symbol
     with a <while> or <assign> symbol; we’ve turned that into one PDA rule
     that pops an S from the stack and replaces it with a W, and another rule
     that pops an S and pushes an A.

  4. Give every token character a PDA rule that reads that character from the
     input and pops it off the stack:

     >> token_rules = LexicalAnalyzer::GRAMMAR.map do |rule|
          PDARule.new(2, rule[:token], 2, rule[:token], [])
        end
     => [#<struct PDARule …>, #<struct PDARule …>, …]

     These token rules work in opposition to the symbol rules. The symbol
     rules tend to make the stack larger, sometimes pushing several
     characters to replace the one that’s been popped; the token rules always
     make the stack smaller, consuming input as they go.

  5. Finally, make a PDA rule that will allow the machine to enter an accept
     state if the stack becomes empty:

     >> stop_rule = PDARule.new(2, nil, 3, '$', ['$'])
     => #<struct PDARule state=2, character=nil, next_state=3, pop_character="$", push_characters=["$"]>

Now we can build a PDA with these rules and feed it a string of tokens to see
whether it recognizes them. The rules generated by the Simple grammar are
nondeterministic—there’s more than one applicable rule whenever the character
S, L, M, or T is topmost on the stack—so it’ll have to be an NPDA:

>> rulebook = NPDARulebook.new([start_rule, stop_rule] + symbol_rules + token_rules)
=> #<struct NPDARulebook rules=[…]>
>> npda_design = NPDADesign.new(1, '$', [3], rulebook)
=> #<struct NPDADesign …>
>> token_string = LexicalAnalyzer.new('while (x < 5) { x = x * 3 }').analyze.join
=> "w(v<n){v=v*n}"
>> npda_design.accepts?(token_string)
=> true
>> npda_design.accepts?(LexicalAnalyzer.new('while (x < 5 x = x * }').analyze.join)
=> false

To show exactly what’s going on, here’s one possible execution of the NPDA
when it’s fed the string 'w(v<n){v=v*n}':

State  Accepting?  Stack contents  Remaining input  Action
1      no                       $    w(v<n){v=v*n}  push S, go to state 2
2      no                      S$    w(v<n){v=v*n}  pop S, push W
2      no                      W$    w(v<n){v=v*n}  pop W, push w(E){S}
2      no                w(E){S}$    w(v<n){v=v*n}  read w, pop w
2      no                 (E){S}$     (v<n){v=v*n}  read (, pop (
2      no                  E){S}$      v<n){v=v*n}  pop E, push L
2      no                  L){S}$      v<n){v=v*n}  pop L, push M<L
2      no                M<L){S}$      v<n){v=v*n}  pop M, push T
2      no                T<L){S}$      v<n){v=v*n}  pop T, push v
2      no                v<L){S}$      v<n){v=v*n}  read v, pop v
2      no                 <L){S}$       <n){v=v*n}  read <, pop <
2      no                  L){S}$        n){v=v*n}  pop L, push M
2      no                  M){S}$        n){v=v*n}  pop M, push T
2      no                  T){S}$        n){v=v*n}  pop T, push n
2      no                  n){S}$        n){v=v*n}  read n, pop n
2      no                   ){S}$         ){v=v*n}  read ), pop )
2      no                    {S}$          {v=v*n}  read {, pop {
2      no                     S}$           v=v*n}  pop S, push A
2      no                     A}$           v=v*n}  pop A, push v=E
2      no                   v=E}$           v=v*n}  read v, pop v
2      no                    =E}$            =v*n}  read =, pop =
2      no                     E}$             v*n}  pop E, push L
2      no                     L}$             v*n}  pop L, push M
2      no                     M}$             v*n}  pop M, push T*M
2      no                   T*M}$             v*n}  pop T, push v
2      no                   v*M}$             v*n}  read v, pop v
2      no                    *M}$              *n}  read *, pop *
2      no                     M}$               n}  pop M, push T
2      no                     T}$               n}  pop T, push n
2      no                     n}$               n}  read n, pop n
2      no                      }$                }  read }, pop }
2      no                       $                   go to state 3
3      yes                      $
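
Putting the two stages together gives an end-to-end recognizer for Simple
statements. As a final convenience, we can bundle lexical and syntactic
analysis into a single check (our own wrapper around the objects built above,
not code from the book):

# Hypothetical helper combining both stages. A lambda is used so that it
# can capture the npda_design local variable built above.
valid_statement = ->(source) {
  npda_design.accepts?(LexicalAnalyzer.new(source).analyze.join)
}

valid_statement.call('x = 2 * 2')          # => true
valid_statement.call('while (x) { x = 1')  # => false: the '}' is missing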
