Construction Methods of LR Parsers

University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science 5-1-1981 Construction Methods of ...

Author: Donna Lindsey


University of Pennsylvania

ScholarlyCommons Technical Reports (CIS)

Department of Computer & Information Science

5-1-1981

Construction Methods of LR Parsers Karl Max Schimpf University of Pennsylvania

Follow this and additional works at: http://repository.upenn.edu/cis_reports Part of the Electrical and Computer Engineering Commons Recommended Citation Karl Max Schimpf, "Construction Methods of LR Parsers", . May 1981.

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-80-40. This paper is posted at ScholarlyCommons. http://repository.upenn.edu/cis_reports/725 For more information, please contact [email protected].

Construction Methods of LR Parsers

Abstract

This paper presents five different LR parser generators and an error recovery method which is derived directly from the LR parser. The parsers presented include the original LR(1) parser defined by Knuth, the SLR(1) and LALR(1) parsers defined by DeRemer, and the weak and strong compatible LR parsers presented by Pager. All five parsers have been implemented by the author using two programs. Furthermore, the implementation of the SLR(1) parser generator includes an error recovery method and produces an SLR(1) parser with error recovery built in.

Disciplines

Electrical and Computer Engineering

Comments

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-80-40.

UNIVERSITY OF PENNSYLVANIA THE MOORE SCHOOL OF ELECTRICAL ENGINEERING SCHOOL OF ENGINEERING AND APPLIED SCIENCE CONSTRUCTION METHODS OF LR PARSERS Karl Max Schimpf Philadelphia, Pennsylvania May 1981

A thesis presented to the Faculty of Engineering and Applied Science of the University of Pennsylvania in partial fulfillment of the requirements for the degree of Master of Science in Engineering for graduate work in Computer and Information Science.

Aravind Joshi

Abstract

This paper presents five different LR parser generators and an error recovery method which is derived directly from the LR parser. The parsers presented include the original LR(1) parser defined by Knuth, the SLR(1) and LALR(1) parsers defined by DeRemer, and the weak and strong compatible LR parsers presented by Pager. All five parsers have been implemented by the author using two programs. Furthermore, the implementation of the SLR(1) parser generator includes an error recovery method and produces an SLR(1) parser with error recovery built in.

Table of contents

Chapter I: Introduction

Chapter II: The construction of LR(1) parsing tables
II.1 LR(1) grammars
II.1.1 Derivations
II.1.2 Languages generated by context free grammars
II.2 Sentential forms and their viable prefixes
II.3 LR(1) characteristic automata
II.4 Construction of LR(1) parsers

Chapter III: Methods for reducing states in LR(1) parsers
III.1 SLR(1) parsers
III.2 LALR(1) parsers
III.3 Pager's weak compatibility
III.4 Pager's strong compatibility

Chapter IV: An error recovery method for LR parsers

Chapter V: Implementation
V.1 Representation of the parsing tables
V.2 SLR(1) implementation
V.2.1 Input grammar
V.2.2 Running the SLR(1) parser constructor
V.2.3 Interpretation of the output file
V.2.4 Conflict resolution
V.2.5 Size restrictions
V.3 LR(1), LALR(1), weak and strong compatible parser generators
V.3.1 Input grammar
V.3.2 Running the program

Appendix A: Sample PASCAL skeleton for use of the SLR(1) parsing tables

Appendix B: Sample PASCAL skeleton for use of the LR(1), LALR(1), weak and strong compatibility parser generators

References

Chapter I

Introduction

It is a well known fact that, of all deterministic string parsers, the class of LR parsers recognizes the largest class of context free languages [Knu65]. LR parsers are quite powerful and are able to recognize virtually all programming languages in existence today. These parsers were first introduced by Knuth [Knu65] with his original version known as an LR(1) parser. Unfortunately, his method requires extensive resources and hence is impractical to use for parsing any programming language.

Several alternative parsing methods have since been presented which reduce the resource requirements, thus producing more practical LR parsers. Some of these parsers accomplish this result by reducing the class of languages accepted by the parsers. The result is a reduction in the number of parse states built and hence an overall reduction in the resource requirements. The most common forms of this type of LR parser are the SLR(1) and LALR(1) parsers presented by DeRemer [DeR69].

Another form of resource reduction used by LR parsers is a method of performing state minimization on the LR(1) parser. Two of these state minimization methods have been proposed by Pager [Pag77a, Pag77b], called weak and strong compatible LR parsers. In these parsers, he restricts the state reductions to maintain the power of the LR(1) parser, and hence the resultant parser recognizes the same class of languages as the original LR(1) parser.

This paper presents five different LR parser generators and an error recovery method which is derived directly from the LR parser. The parsers presented include the original LR(1) parser defined by Knuth [Knu65], the SLR(1) and LALR(1) parsers defined by DeRemer [DeR69], and the weak and strong compatible LR parsers presented by Pager [Pag77a]. All five parsers have been implemented by the author using two programs. Furthermore, the implementation of the SLR(1) parser generator includes an error recovery method and produces an SLR(1) parser with error recovery built in.

The method of construction of the weak and strong compatible LR parsers, presented by Pager [Pag77a], unfortunately only provides a partial explanation of the algorithms which build these parsers. These algorithms also contain minor inconsistencies and omissions which tend to obscure the basic nature of the algorithms. This paper presents Pager's algorithms in a modified notation which simplifies the comprehension of the code. It also provides a more complete explanation of the algorithms, and includes a few minor algorithms omitted by Pager.

The problem with LR parsers, when used in a compiler, is that they are designed as a syntactic method which only decides if the given input string belongs to the language class accepted by the LR parser. Hence, once the first illegal input symbol is found, the parser stops, reporting failure. However, when a compiler parses a program, it is advantageous to have the compiler report as many additional errors as possible.

In order to improve the LR parser's capabilities for use in a compiler, this paper also presents a purely syntactic error recovery scheme to recognize additional errors. Furthermore, the method has been designed so that it can be directly incorporated into the LR parser. Hence, no additional routines are necessary in order to perform error recovery and parse the rest of the input.

The method used in this paper to handle error recovery is based on the method used by Pennello and DeRemer [P&D79], which has a separate error recovery routine that includes error correction. The control strategy used is to search the remainder of the input, starting from the illegal symbol, and verify that it consists only of "viable fragments" (substrings derivable from its grammar).

The error recovery method presented in this paper has been implemented using the SLR(1) parser as its basis. However, the method is general enough that the same method could easily be applied to any of the other LR parsers presented in this paper.

Chapter two starts by setting up preliminary notation for context free languages and derivations. This notation is used to describe the basic strategy used by LR parsers. The last sections of the chapter cover the actual construction methods which will yield the LR(1) parser as its result.

Chapter three describes how each of the other four implemented parser constructors are built. The SLR(1) and LALR(1) construction methods are presented using the LR(0) characteristic automaton as their basis for construction. Pager's notion of compatibility, the definitions of both weak and strong compatibility, and the algorithms used in conjunction with the construction of these two parsers are also described.

Chapter four discusses the error recovery method and an algorithm which takes in an LR parser and produces an LR parser with error recovery. It also explains how an LR parser is used to parse an input string and decide if the string is derivable from the grammar used to generate the LR parser.

Chapter five concludes the paper by briefly discussing the two programs used for the implementation. One program constructs an SLR(1) parser with error recovery built in. The other program, using our modification of Pager's concept of compatibility, can build either an LR(1), LALR(1), weakly or strongly compatible LR parser.

Chapter II

The construction of the LR(1) parsing tables

This chapter describes how LR(1) parsing tables are created. In order to do this, let me start out by setting up some preliminary notation.

II.1 LR(1) Grammars

A Context-Free Grammar (denoted CFG) G is a quadruple \(G = (N, T, P, S)\) where

T is a finite alphabet of terminal symbols;
N is a finite alphabet of nonterminal symbols;
\((N \cup T)\) is the finite set of grammar symbols;
S is a nonterminal symbol in N, called the start symbol; and
P is a finite set of pairs \((A, \underline{a})\), called productions, such that \(A \in N\) and \(\underline{a} \in (N \cup T)^*\).

A production \((A, \underline{a})\) will be denoted in the form \(A \to \underline{a}\). Also, there is a special start production \(S \to S'\) where \(S' \in N\) and S does not occur in any other production in P. There is also a special symbol \(\$ \in T\), which denotes the end of the string being parsed, and does not appear in any production.

For notational convenience, upper case letters will be used to denote nonterminal symbols, lower case letters to denote terminal symbols, underlined upper case letters to denote grammar symbols, and underlined lower case letters to denote strings of grammar symbols (strings in \((N \cup T)^*\)). The symbol \(\varepsilon\) will be reserved to denote the empty string.

II.1.1 Derivations

Given a CFG \(G = (N, T, P, S)\), let the relation \(\Rightarrow\ : (N \cup T)^* \times (N \cup T)^*\) be defined by the set of pairs

\[ \{\, (\underline{a}B\underline{c},\ \underline{a}\,\underline{b}\,\underline{c}) \mid B \in N;\ \underline{a}, \underline{b}, \underline{c} \in (N \cup T)^*;\ \text{and } B \to \underline{b} \text{ in } P \,\} \]

In other words, given any string in \((N \cup T)^*\) of the form \(\underline{a}B\underline{c}\), with nonterminal symbol B in N and production \(B \to \underline{b}\) in P, we say that the string \(\underline{a}B\underline{c}\) derives the string \(\underline{a}\,\underline{b}\,\underline{c}\) in a one step derivation using \(B \to \underline{b}\). This will be denoted as \(\underline{a}B\underline{c} \Rightarrow \underline{a}\,\underline{b}\,\underline{c}\). Also, let \(\Rightarrow^{+}\) and \(\Rightarrow^{*}\) denote the transitive and transitive reflexive closures of \(\Rightarrow\), respectively.

From the above relation, we can define another relation which implies an ordering of the rewrite steps. Let this new relation \(\Rightarrow_R\ : (N \cup T)^* \times (N \cup T)^*\) be defined as the set

\[ \{\, \underline{a}B\underline{c} \Rightarrow_R \underline{a}\,\underline{b}\,\underline{c} \mid \underline{a}B\underline{c} \Rightarrow \underline{a}\,\underline{b}\,\underline{c} \text{ and } \underline{c} \in T^* \,\} \]

In other words, \(\Rightarrow_R\) is the one step derivation, when the derivation is applied to the rightmost nonterminal occurring in the string \(\underline{a}B\underline{c}\). Let \(\Rightarrow_R^{+}\) and \(\Rightarrow_R^{*}\) denote the transitive and transitive reflexive closures of \(\Rightarrow_R\), respectively.
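The two one-step relations can be made concrete with a small sketch. The grammar used here (S -> A, A -> aAb, A -> empty) is an assumption for illustration, chosen because those productions appear in the reductions of the parsing tables later in this chapter; the function names are hypothetical, and sentential forms are represented as tuples of symbols.

```python
# Illustrative grammar (assumed): S -> A, A -> a A b, A -> (empty string).
PRODUCTIONS = [("S", ("A",)), ("A", ("a", "A", "b")), ("A", ())]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}


def derive_step(form):
    """All forms reachable from `form` by one application of =>."""
    results = set()
    for i, sym in enumerate(form):
        for lhs, rhs in PRODUCTIONS:
            if sym == lhs:
                # Rewrite the nonterminal at position i with the right hand side.
                results.add(form[:i] + rhs + form[i + 1:])
    return results


def rightmost_step(form):
    """All forms reachable by one application of =>_R (rightmost nonterminal only)."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in NONTERMINALS:
            return {form[:i] + rhs + form[i + 1:]
                    for lhs, rhs in PRODUCTIONS if lhs == form[i]}
    return set()  # no nonterminal left: the form is a terminal string
```

For a form with several nonterminals, `derive_step` may rewrite any of them, while `rightmost_step` only rewrites the last one, so everything to its right is a terminal string as the definition requires.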

II.1.2 Language generated by a context-free grammar

Given a CFG \(G = (N, T, P, S)\), the language L(G) generated by G is the set of strings

\[ L(G) = \{\, \underline{a} \mid S \Rightarrow^{*} \underline{a},\ \underline{a} \in T^* \,\} \]

Note: The order in which \(\Rightarrow\) is applied has no effect on the resulting terminal string produced. Hence L(G), the language generated by G, could alternatively be defined as the set

\[ L(G) = \{\, \underline{a} \mid S \Rightarrow_R^{*} \underline{a},\ \underline{a} \in T^* \,\} \]

Using the above definitions, an LR(1) grammar can be loosely defined as follows:

An LR(1) grammar is a CFG G, such that each string \(\underline{a} \in L(G)\) (derived via a rightmost derivation) can be parsed deterministically in a single scan from left to right, having the ability to look ahead one symbol from the point of scanning.

II.2 Sentential forms and their viable prefixes

An LR(1) parser, when scanning the input (of the string to be parsed), is essentially looking for a match with one or more strings that can be derived from the CFG's start symbol. More formally, the LR(1) parser is trying to recognize a sentential form which is an element in the set

\[ \{\, \underline{w} \mid S \Rightarrow^{*} \underline{w} \text{ and } \underline{w} \in (N \cup T)^* \,\} \]

In recognizing a sentential form, the LR(1) parser is really interested in knowing whether it has scanned enough of the input string such that a reduction can be performed, that is, when the sentential form is the string \(\underline{w} = \underline{a}\,\underline{b}\,\underline{c}\) where \(\underline{a}, \underline{b} \in (N \cup T)^*\); \(\underline{c} \in T^*\); and \(B \to \underline{b} \in P\). Knowing this information, a reduction of \(\underline{b}\) to B can be made to get the rightmost derivation string that \(\underline{w}\) came from. This is known as finding the handle. The handle is defined as the pair \((|\underline{a}\,\underline{b}|,\ B \to \underline{b})\) such that

\[ S \Rightarrow_R^{*} \underline{a}B\underline{c} \Rightarrow_R \underline{a}\,\underline{b}\,\underline{c}. \]

The length of the string \(\underline{a}\,\underline{b}\) denotes the position of the handle, which states the position where \(\underline{b}\) can be reduced to B using \(B \to \underline{b}\). The string \(\underline{a}\,\underline{b}\) is called the viable prefix or characteristic string [A&U77].

Using the above definitions, it is fairly easy to characterize what an LR(1) parser does. It scans the input, from left to right, looking for a viable prefix. Once it is found, the viable prefix is reduced with the corresponding production of the viable prefix. Using the string derived from the reduced viable prefix concatenated with the unscanned input, the parser repeats the above process looking for another viable prefix. This continues until either the input has been reduced to the start symbol, or failure occurs by not finding any legal viable prefixes.

II.3 LR(1) Characteristic Automaton

It is a fundamental result that viable prefixes derived from CFG's are regular. Therefore a deterministic finite automaton, called the characteristic automaton, can be built for a CFG to recognize the set of legal viable prefixes. Furthermore, once the characteristic automaton has been built, the LR(1) parser can be directly derived from it.

Let a marked production be of the form \(A \to \underline{a}\,.\,\underline{b}\) where \(A \to \underline{a}\,\underline{b}\) is a production in P, and "." is assumed to be a symbol not in the set of grammar symbols \((N \cup T)\). These marked productions will be used to denote "how much" of the production's right hand side has been scanned. Hence the marked production \(A \to \underline{a}\,.\,\underline{b}\) represents the fact that the LR(1) parser has recognized the string \(\underline{a}\) in the string being scanned, where \(\underline{a}\) is some string that occurred before the string \(\underline{b}\) in the input.

Expanding this to include a set of look-ahead symbols, let an item be defined as the pair \([A \to \underline{a}\,.\,\underline{b},\ LA]\) where \(A \to \underline{a}\,.\,\underline{b}\) is a marked production, and LA is a subset of T denoting the set of all terminal symbols which can follow the production, called the set of lookahead symbols.

Items, essentially, describe two things:

i) What portion of a production's right hand side can occur at the end of the set of viable prefixes being described.

ii) What possible symbols can immediately follow the production's right hand side (and hence what symbols can follow the viable prefix with the given production).

Each state of the characteristic automaton is the set of all items with the same viable prefix. When building a state, there must be a way to ensure that all items, for a given state, are included. For example, if there is an item in the state with the marked production \(A \to \underline{a}\,.\,B\underline{b}\) and \(B \to \underline{c}\) is in P, then there must be an item with the marked production \(B \to .\,\underline{c}\) for that state. The new marked production will have the same viable prefix as the original item. The process of including all such items is called closing the state. However, in order to close a state, it is also necessary to describe how to propagate lookaheads to the added items. To do this, define the function first as follows:

\[ first(\underline{a}) = \{\, a \mid \underline{a} \Rightarrow^{*} a\underline{s},\ a \in T \,\} \]

Using the above definition, the closure of a set of items I (denoted as closure(I)) can be constructed using the rules:

i) Every item in I is also in closure(I).

ii) If the item \([A \to \underline{a}\,.\,B\underline{b},\ LA]\) is in closure(I), \(B \to \underline{c}\) is in P, and \(a \in LA\), then the item \([B \to .\,\underline{c},\ first(\underline{b}a)]\) is in closure(I).
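The closure rules can be sketched in a few lines. This is a minimal illustration, not the author's implementation: it assumes the grammar S -> A, A -> aAb, A -> empty (the productions named in this chapter's parsing tables), and it represents an item as a tuple (lhs, rhs, dot, lookahead) carrying a single lookahead terminal, so the set LA of an item in the text corresponds to all tuples sharing the same marked production.

```python
# Illustrative grammar (assumed): S -> A, A -> a A b, A -> (empty string).
PRODUCTIONS = [("S", ("A",)), ("A", ("a", "A", "b")), ("A", ())]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}
FIRST = {n: set() for n in NONTERMINALS}   # terminals that can start each nonterminal
NULLABLE = set()                           # nonterminals deriving the empty string


def first(symbols):
    """Terminals that can begin a string derived from the symbol sequence."""
    out = set()
    for sym in symbols:
        if sym not in NONTERMINALS:
            return out | {sym}
        out |= FIRST[sym]
        if sym not in NULLABLE:
            return out
    return out


# Fixed-point computation of FIRST and NULLABLE for the nonterminals.
changed = True
while changed:
    changed = False
    for lhs, rhs in PRODUCTIONS:
        before = (len(FIRST[lhs]), lhs in NULLABLE)
        FIRST[lhs] |= first(rhs)
        if all(sym in NULLABLE for sym in rhs):
            NULLABLE.add(lhs)
        changed |= before != (len(FIRST[lhs]), lhs in NULLABLE)


def closure(items):
    """Close a set of (lhs, rhs, dot, lookahead) items under rule ii."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            # Lookaheads for the new items: first of (rest of rhs, then la).
            for t in first(rhs[dot + 1:] + (la,)):
                for plhs, prhs in PRODUCTIONS:
                    new = (plhs, prhs, 0, t)
                    if plhs == rhs[dot] and new not in result:
                        result.add(new)
                        work.append(new)
    return result
```

Closing the single item for the start production with lookahead `$` adds the two A-productions with lookahead `$`; closing the item with the dot after the first `a` of `aAb` adds them with lookahead `b`.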

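example 2.1

Let the CFG G have the set of productions:

\[ S \to A \qquad A \to aAb \qquad A \to \varepsilon \]

where \(S \to A\) is the start production. Then the closure of the item set \(\{[S \to .\,A,\ \{\$\}]\}\) is the set

\[ \{\, [S \to .\,A,\ \{\$\}],\ [A \to .\,aAb,\ \{\$\}],\ [A \to .\,,\ \{\$\}] \,\} \]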

The characteristic automaton of G is built from the set of states constructed above, with the transitions being grammar symbols. The path to a given state will then spell a legal prefix for some sentential form. The algorithm (shown below) starts by setting the initial state to the closure of the start production, then, taking each state just built, determines the transitions from the state as follows:

i) For each grammar symbol \(\underline{X}\) in \((N \cup T)\), if the item \([A \to \underline{a}\,.\,\underline{X}\,\underline{b},\ LA]\) is in the state, there is a unique transition, labeled \(\underline{X}\), to the state containing the item \([A \to \underline{a}\,\underline{X}\,.\,\underline{b},\ LA]\) obtained by shifting the dot across the grammar symbol \(\underline{X}\).

ii) If \([A \to \underline{a}\,.\,,\ LA]\) is in the state, then no transition should be produced for that item.

Algorithm for constructing the characteristic automaton

input: a CFG \(G = (N, T, P, S)\)

output: a set C, of states, and the function GOTO : (set of items) x \((N \cup T)\) -> (set of items), which defines the characteristic automaton.

Method: The two procedures below, initiated by calling ITEMS(G);

procedure ITEMS(G);
begin
    C := { closure({[S -> . S', {$}]}) };
    {where "$" is a unique symbol in T which denotes the end of the string to parse}
    repeat
        for each set of items I in C, and each grammar symbol X
            such that J = GOTO(I,X) is not empty and J is not in C
        do add J to C;
    until no more sets of items can be added to C
end;

function GOTO(I,X);
begin
    let J be the set of items [A -> aX . b, LA] such that [A -> a . Xb, LA] is in I;
    return closure(J);
end;
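The ITEMS and GOTO procedures can be sketched as follows. This is an illustrative sketch under stated assumptions, not the author's program: the grammar S -> A, A -> aAb, A -> empty is assumed, items are tuples (lhs, rhs, dot, lookahead) with one lookahead terminal each, and states are numbered from 0 in discovery order, which need not match the numbering used in the examples of this chapter.

```python
# Illustrative grammar (assumed): S -> A, A -> a A b, A -> (empty string).
PRODUCTIONS = [("S", ("A",)), ("A", ("a", "A", "b")), ("A", ())]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}
FIRST = {n: set() for n in NONTERMINALS}
NULLABLE = set()


def first(symbols):
    """Terminals that can begin a string derived from the symbol sequence."""
    out = set()
    for sym in symbols:
        if sym not in NONTERMINALS:
            return out | {sym}
        out |= FIRST[sym]
        if sym not in NULLABLE:
            return out
    return out


changed = True
while changed:  # fixed point for FIRST and NULLABLE
    changed = False
    for lhs, rhs in PRODUCTIONS:
        before = (len(FIRST[lhs]), lhs in NULLABLE)
        FIRST[lhs] |= first(rhs)
        if all(sym in NULLABLE for sym in rhs):
            NULLABLE.add(lhs)
        changed |= before != (len(FIRST[lhs]), lhs in NULLABLE)


def closure(items):
    """Close a set of (lhs, rhs, dot, lookahead) items."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for t in first(rhs[dot + 1:] + (la,)):
                for plhs, prhs in PRODUCTIONS:
                    new = (plhs, prhs, 0, t)
                    if plhs == rhs[dot] and new not in result:
                        result.add(new)
                        work.append(new)
    return frozenset(result)


def goto(state, symbol):
    """GOTO(I, X): shift the dot across `symbol`, then close."""
    moved = {(l, r, d + 1, a) for (l, r, d, a) in state
             if d < len(r) and r[d] == symbol}
    return closure(moved)


def build_automaton(start_item):
    """ITEMS(G): collect all states reachable from the closed start item."""
    states, edges = [closure({start_item})], {}
    for state in states:  # the list grows while it is being iterated
        for sym in sorted({r[d] for (_, r, d, _) in state if d < len(r)}):
            target = goto(state, sym)
            if target not in states:
                states.append(target)
            edges[states.index(state), sym] = states.index(target)
    return states, edges


STATES, EDGES = build_automaton(("S", ("A",), 0, "$"))
```

Only symbols that actually follow a dot in a state are tried, which is equivalent to skipping the empty GOTO sets in the procedure above.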

Let the core of a state be the set of items in a state of either of the two following forms:

i) \([S \to .\,S',\ \{\$\}]\)

ii) \([A \to \underline{b}\,.\,\underline{c},\ LA]\) where \(\underline{b} \neq \varepsilon\)

It can be shown that by closing the core, the original state can be retrieved. Hence, all examples in this paper will only show the core of each state.

example 2.2

Construction of a characteristic automaton

Let the CFG G be defined by the same set of productions as in example 2.1. Then, the LR(1) characteristic automaton of the grammar G is as follows:

[Figure: the LR(1) characteristic automaton of G]

where the transition arcs are defined by GOTO.

II.4 Construction of LR(1) Parsers

Using the characteristic automaton, the LR(1) parser can be directly generated. Let an LR(1) parser be defined as a quintuple M = (K, action, goto, G, start) where

K is a finite set of parser states;

action : \(K \times T \to \{\text{shift } j \mid j \in K\} \cup \{\text{reduce } p \mid p \in P\} \cup \{\text{error}\}\) defines the parsing action table;

goto : \(K \times N \to K \cup \{\text{error}\}\) defines the parsing goto table;

G is a CFG such that L(G) is the language to recognize; and

start is the initial state.

The set of parser states K contains a special state H, the accept state, which is such that action(H,$) = reduce \(S \to S'\). Also, the action and goto parsing tables are enough to define an LR(1) parser. Using this definition, an LR(1) parser can be constructed using the following algorithm [A&U77, Gal79]:

Algorithm for constructing LR(1) parsing tables

input: The characteristic automaton CG = (C, GOTO) for a CFG G;

output: a parsing table (possibly with conflicts if the grammar G is not LR(1))

method: Let \(C = \{I_1, I_2, \ldots, I_n\}\) be a set of sets of items from the characteristic automaton CG. The states of the parser will be labelled 1, 2, ..., n where state i corresponds to the set of items \(I_i\). State 1 is the initial state.

The parsing actions are:

i) If \([A \to \underline{a}\,.\,a\underline{b},\ LA] \in I_i\) where \(a \in T\) and \(GOTO(I_i, a) = I_j\), then action(i,a) = shift j

ii) If \([A \to \underline{a}\,.\,,\ LA]\) is in \(I_i\), then for each \(a \in LA\), set action(i,a) = reduce \(A \to \underline{a}\)

iii) All entries of action not defined by the above rules are set to error.

The goto transition for state i is constructed using the two rules:

i) If \(GOTO(I_i, A) = I_j\) where A is a nonterminal, then goto(i,A) = j

ii) All other entries of goto, not defined by the first rule, are set to error
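The table-filling rules can be sketched as a function that takes the characteristic automaton as data. Everything concrete here is an assumption for illustration: the item sets are written out by hand for the grammar S -> A, A -> aAb, A -> empty and numbered so that the shift targets agree with the action table of this chapter's example; items are tuples (lhs, rhs, dot, lookahead); and a missing table entry stands for error. The function also records a conflict whenever two rules assign different entries to the same cell.

```python
TERMINALS = {"a", "b", "$"}


def item(lhs, rhs, dot, la):
    """Build an item; `rhs` is a string of one-character symbols."""
    return (lhs, tuple(rhs), dot, la)


# Hand-written LR(1) item sets (assumed grammar: S -> A, A -> aAb, A -> empty).
STATES = {
    1: {item("S", "A", 0, "$"), item("A", "aAb", 0, "$"), item("A", "", 0, "$")},
    2: {item("S", "A", 1, "$")},
    3: {item("A", "aAb", 1, "$"), item("A", "aAb", 0, "b"), item("A", "", 0, "b")},
    4: {item("A", "aAb", 1, "b"), item("A", "aAb", 0, "b"), item("A", "", 0, "b")},
    5: {item("A", "aAb", 2, "$")},
    6: {item("A", "aAb", 2, "b")},
    7: {item("A", "aAb", 3, "$")},
    8: {item("A", "aAb", 3, "b")},
}
GOTO = {(1, "A"): 2, (1, "a"): 3, (3, "A"): 5, (3, "a"): 4,
        (4, "A"): 6, (4, "a"): 4, (5, "b"): 7, (6, "b"): 8}


def build_tables(states, goto_fn, terminals):
    """Return (action, goto, conflicts); absent entries mean error."""
    action, goto, conflicts = {}, {}, []

    def set_action(i, a, entry):
        old = action.setdefault((i, a), entry)
        if old != entry:                      # two different actions for one cell
            conflicts.append((i, a, old, entry))

    for i, items in states.items():
        for lhs, rhs, dot, la in items:
            if dot < len(rhs):
                if rhs[dot] in terminals:     # rule i: shift on a terminal
                    set_action(i, rhs[dot], ("shift", goto_fn[i, rhs[dot]]))
            else:                             # rule ii: reduce on the lookahead
                set_action(i, la, ("reduce", lhs, rhs))
    for (i, sym), j in goto_fn.items():
        if sym not in terminals:              # goto rule i: nonterminal moves
            goto[i, sym] = j
    return action, goto, conflicts


ACTION, GOTO_TABLE, CONFLICTS = build_tables(STATES, GOTO, TERMINALS)
```

Keeping the automaton as plain data makes the three action rules and the two goto rules visible in isolation; in a full generator the `STATES` and `GOTO` inputs would come from the automaton construction instead.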

example 2.3

Let the LR(1) characteristic automaton be defined as in example 2.2. Using the above algorithm, the two parsing tables produced are:

action

        +---------------+---------------+---------------+
        |       a       |       b       |       $       |
+-------+---------------+---------------+---------------+
|   1   | shift 3       | error         | reduce A->ε   |
+-------+---------------+---------------+---------------+
|   2   | error         | error         | reduce S->A   |
+-------+---------------+---------------+---------------+
|   3   | shift 4       | reduce A->ε   | error         |
+-------+---------------+---------------+---------------+
|   4   | shift 4       | reduce A->ε   | error         |
+-------+---------------+---------------+---------------+
|   5   | error         | shift 7       | error         |
+-------+---------------+---------------+---------------+
|   6   | error         | shift 8       | error         |
+-------+---------------+---------------+---------------+
|   7   | error         | error         | reduce A->aAb |
+-------+---------------+---------------+---------------+
|   8   | error         | reduce A->aAb | error         |
+-------+---------------+---------------+---------------+

goto

        +---------------+---------------+
        |       S       |       A       |
+-------+---------------+---------------+
|   1   | error         | 2             |
+-------+---------------+---------------+
|   2   | error         | error         |
+-------+---------------+---------------+
|   3   | error         | 5             |
+-------+---------------+---------------+
|   4   | error         | 6             |
+-------+---------------+---------------+
|   5   | error         | error         |
+-------+---------------+---------------+
|   6   | error         | error         |
+-------+---------------+---------------+
|   7   | error         | error         |
+-------+---------------+---------------+
|   8   | error         | error         |
+-------+---------------+---------------+
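The way such tables drive a parse can be shown with a short stack-based loop. This is a sketch under assumptions: the dictionaries below transcribe the action and goto entries of the example as read from the tables (the grammar is taken to be S -> A, A -> aAb, A -> empty), absent entries mean error, each reduce entry stores only the left hand side and the length of the right hand side, and the function name `parse` is hypothetical.

```python
# (state, terminal) -> ("shift", next_state) or ("reduce", lhs, rhs_length)
ACTION = {
    (1, "a"): ("shift", 3), (1, "$"): ("reduce", "A", 0),
    (2, "$"): ("reduce", "S", 1),
    (3, "a"): ("shift", 4), (3, "b"): ("reduce", "A", 0),
    (4, "a"): ("shift", 4), (4, "b"): ("reduce", "A", 0),
    (5, "b"): ("shift", 7),
    (6, "b"): ("shift", 8),
    (7, "$"): ("reduce", "A", 3),
    (8, "b"): ("reduce", "A", 3),
}
# (state, nonterminal) -> next_state
GOTO = {(1, "A"): 2, (3, "A"): 5, (4, "A"): 6}


def parse(text):
    """Return True if `text` (a string over a, b) is accepted by the tables."""
    stack = [1]                          # stack of parser states, start state 1
    tokens = list(text) + ["$"]          # append the end marker
    pos = 0
    while True:
        entry = ACTION.get((stack[-1], tokens[pos]))
        if entry is None:                # error entry
            return False
        if entry[0] == "shift":
            stack.append(entry[1])
            pos += 1
        else:
            _, lhs, length = entry
            if lhs == "S":               # reduce by the start production: accept
                return True
            del stack[len(stack) - length:]   # pop one state per rhs symbol
            stack.append(GOTO[stack[-1], lhs])
```

The stack holds states only; the viable prefix scanned so far is implied by the path of states, which is why no grammar symbols need to be stored.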

From the above algorithm, one can tell directly when a CFG G is not an LR(1) grammar. This occurs when action is not a function but only a relation, or in other words, whenever there is more than one possible action for some input pair. These multiple entries are known as conflicts. The two types of conflicts that can exist are i) shift/reduce and ii) reduce/reduce conflicts, which are respectively denoted as S/R and R/R.

Chapter III

Methods for reducing states in LR(1) parsers

LR(1) parsers have the nice property that they can be used for parsing most programming languages. Unfortunately, the parsers produced for these grammars, using the method described in the previous chapter, are too large to be considered useful. Hence, several modifications have been proposed which will reduce the size of the parser produced. This chapter discusses four of these methods. Two methods (SLR(1) and LALR(1)) reduce the number of states by reducing the size of the language accepted. The other two methods (proposed by Pager [Pag77a]) use conditions for merging states of an LR(1) parser while maintaining the full power to recognize LR(1) languages.

III.1 SLR(1) parsers

The SLR(1) parsing table construction is quite similar to that of the LR(1). The main difference is that the parser produced is based on a characteristic automaton with no lookahead (i.e. an LR(0) automaton). This simplification reduces, in general, the total number of states created.

To build an SLR(1) parser, redefine an item by removing the lookahead set, leaving just the marked production. Under this definition, the rules to close a set of SLR items I become:

i) Every item in I is also in closure(I);

ii) If the item \(A \to \underline{a}\,.\,B\underline{c}\) is in closure(I), and \(B \to \underline{b} \in P\), then the item \(B \to .\,\underline{b}\) is also in closure(I);

The procedures to build the characteristic automaton are also simplified. These procedures are as follows:

function GOTO(I,X);
begin
    let J be the set of items A -> aX . b such that
        A -> a . Xb is in I and X is a grammar symbol;
    return closure(J);
end;

procedure ITEMS(G);
begin
    C := { closure({S -> . S'}) };
    repeat
        for each set of items I in C, and each grammar symbol X
            such that J = GOTO(I,X) is not empty and J is not in C
        do add J to C;
    until no more sets of items can be added to C
end;
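Without lookaheads the same construction becomes noticeably shorter. The sketch below is illustrative only: it assumes the grammar S -> A, A -> aAb, A -> empty, represents an LR(0) item as a tuple (lhs, rhs, dot), and numbers states from 0 in discovery order, which need not match the numbering used in the examples.

```python
# Illustrative grammar (assumed): S -> A, A -> a A b, A -> (empty string).
PRODUCTIONS = [("S", ("A",)), ("A", ("a", "A", "b")), ("A", ())]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}


def closure(items):
    """Close a set of LR(0) items (lhs, rhs, dot)."""
    result, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for plhs, prhs in PRODUCTIONS:
                new = (plhs, prhs, 0)
                if plhs == rhs[dot] and new not in result:
                    result.add(new)
                    work.append(new)
    return frozenset(result)


def goto(state, symbol):
    """GOTO(I, X): shift the dot across `symbol`, then close."""
    return closure({(l, r, d + 1) for (l, r, d) in state
                    if d < len(r) and r[d] == symbol})


def build_automaton(start_item):
    """ITEMS(G): collect all states reachable from the closed start item."""
    states, edges = [closure({start_item})], {}
    for state in states:  # the list grows while it is being iterated
        for sym in sorted({r[d] for (_, r, d) in state if d < len(r)}):
            target = goto(state, sym)
            if target not in states:
                states.append(target)
            edges[states.index(state), sym] = states.index(target)
    return states, edges


STATES, EDGES = build_automaton(("S", ("A",), 0))
```

For this grammar the LR(0) automaton has five states where the LR(1) automaton has eight, since states that differ only in their lookaheads collapse into one.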

example 3.1

Let a CFG G be defined by the set of productions in example 2.1. Then an LR(0) characteristic automaton is:

[Figure: the LR(0) characteristic automaton of G]

The SLR(1) method does not use a set of lookaheads to decide what reduction to use once a viable prefix has been recognized. Instead, it uses a method to approximate the lookaheads, which in fact guarantees that the set of lookaheads will be included. This is done by the function \(FOLLOW : N \to 2^T\) which computes all symbols which can follow a given nonterminal symbol. However, in order to compute FOLLOW, the terminal symbol $ must be included. Hence for the definition of FOLLOW, it is assumed that there is an additional production of the form \(S'' \to S\$\) where \(S''\) is a nonterminal and does not appear in any production in P. FOLLOW is defined as

\[ FOLLOW(X) = \{\, a \mid Y \Rightarrow^{*} \underline{a}X\underline{b},\ Y \in N,\ a \in first(\underline{b}) \,\} \]

example 3.2

Using the CFG G described in example 2.1, the FOLLOW sets are:

\[ FOLLOW(S) = \{\$\} \qquad FOLLOW(A) = \{b, \$\} \]
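FOLLOW is usually computed as a fixed point rather than directly from its definition. The sketch below is one common way to do this and is not taken from the author's programs; it assumes the grammar S -> A, A -> aAb, A -> empty with start symbol S, and it models the extra production with the end marker by seeding FOLLOW(S) with `$`.

```python
# Illustrative grammar (assumed): S -> A, A -> a A b, A -> (empty string).
PRODUCTIONS = [("S", ("A",)), ("A", ("a", "A", "b")), ("A", ())]
NONTERMINALS = {lhs for lhs, _ in PRODUCTIONS}
START = "S"
FIRST = {n: set() for n in NONTERMINALS}
NULLABLE = set()


def first(symbols):
    """Terminals that can begin a string derived from the symbol sequence."""
    out = set()
    for sym in symbols:
        if sym not in NONTERMINALS:
            return out | {sym}
        out |= FIRST[sym]
        if sym not in NULLABLE:
            return out
    return out


changed = True
while changed:  # fixed point for FIRST and NULLABLE
    changed = False
    for lhs, rhs in PRODUCTIONS:
        before = (len(FIRST[lhs]), lhs in NULLABLE)
        FIRST[lhs] |= first(rhs)
        if all(sym in NULLABLE for sym in rhs):
            NULLABLE.add(lhs)
        changed |= before != (len(FIRST[lhs]), lhs in NULLABLE)

FOLLOW = {n: set() for n in NONTERMINALS}
FOLLOW[START].add("$")  # effect of the assumed extra production S'' -> S $

changed = True
while changed:  # fixed point for FOLLOW
    changed = False
    for lhs, rhs in PRODUCTIONS:
        for i, sym in enumerate(rhs):
            if sym in NONTERMINALS:
                rest = rhs[i + 1:]
                new = first(rest)
                # If the rest can vanish, whatever follows lhs also follows sym.
                if all(s in NULLABLE for s in rest):
                    new |= FOLLOW[lhs]
                if not new <= FOLLOW[sym]:
                    FOLLOW[sym] |= new
                    changed = True
```

For the assumed grammar, `b` enters FOLLOW(A) from the production A -> aAb and `$` enters it from S -> A, because nothing follows A on that right hand side.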

Using t h e c h a r a c t e r i s t i c FOLLOW

the

SLR(1)

parsing

automaton

and

the

function

t a b l e can be created using t h e

following algorithm:

SLR(1) p a r s i n g t a b l e c o n s t r u c t i o n a l g o r i t h m

t h e S L R ( 1 ) c h a r a c t e r i s t i c a u t o m a t o n CG = (C,GOTO)

input:

f o r t h e CFG G.

output:

a parsing table (possibly

with

conflicts

if

not

SLR(1))

method: from

the

L e t C = {I1,

... , I n ) b e

characteristic

automaton

p a r s e r w i l l be l a b e l e d 1,2, to

the

set

of

items

i n i t i a l s t a t e b e s t a t e 1.

t h e s e t of s e t s

*..

.

=i

CG.

of

items

The s t a t e s o f

the

,n w h e r e s t a t e i c o r r e s p o n d s A s w i t h LR(1)

parsers,

let the

The parsing actions are defined as follows:

i) If A

->a . b c

-

GOTO(Ii,b)

.

ii) If A ->

set action(i,b)

6 I

i

I

where b 6 T and then action(i,a)=j

j

is in I i then for = reduce

each

b 6 FOLLOW(A)

A -> 2

iii) all entries not defined by i) or ii)

are

set

to

error

The goto transitions are defined by the following two rules:

i) If GOTO(Ii,A) = Ij then goto(i,A) = j where A ∈ N

ii) all other entries of goto, not defined by i), are set to error

example 3.3

Using the LR(0) characteristic automaton in example 3.1, and the FOLLOW sets in example 3.2, the SLR(1) parser is defined by the following tables:

                         action
          a               b               $
  +---------------+---------------+---------------+
1 | shift 3       | error         | reduce A->ε   |
  +---------------+---------------+---------------+
2 | error         | error         | reduce S->A   |
  +---------------+---------------+---------------+
3 | shift 3       | reduce A->ε   | reduce A->ε   |
  +---------------+---------------+---------------+
4 | error         | shift 5       | error         |
  +---------------+---------------+---------------+
5 | error         | reduce A->aAb | reduce A->aAb |
  +---------------+---------------+---------------+

                 goto
          S               A
  +---------------+---------------+
1 | error         | 2             |
  +---------------+---------------+
2 | error         | error         |
  +---------------+---------------+
3 | error         | 4             |
  +---------------+---------------+
4 | error         | error         |
  +---------------+---------------+
5 | error         | error         |
  +---------------+---------------+

III.2 LALR(1) parsers

A second type of simplification similar to SLR(1) is the LALR(1) parser invented by DeRemer [DeR69]. Many algorithms for computing LALR(1) parsers have since been presented [LLH71,AEH72,A&U77,DeR72,Alp76,Pag77b]. The main difference from SLR(1) is a concise and more accurate method for computing the set of lookaheads than the definition of the function FOLLOW. The same LR(0) characteristic automaton can be used to construct either an LALR(1) or an SLR(1) parser. The LALR(1) lookahead function LA : state x P -> {t ∈ T} is defined as follows:

LA(k, A -> α) = { t | [...] and t ∈ first(c) and the string βα is a prefix for the state k }

example 3.4

Using the CFG G, and the LR(0) characteristic automaton, from example 3.3, the function LA is defined as follows:

The construction of the LALR(1) parser is exactly the same as an SLR(1) except that the action function is computed as follows:

i) If [A -> α . aβ] ∈ Ii where a ∈ T and GOTO(Ii,a) = Ij then action(i,a) = shift j

ii) If [A -> α .] is in Ii then for each a ∈ LA(i, A -> α) set action(i,a) = reduce A -> α

iii) all entries not defined in i) and ii) are set to error

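example 3.5

Using the LR(0) characteristic automaton in example 3.1, and the function LA as defined in example 3.4, the LALR(1) parsing tables are:

                         action
          a               b               $
  +---------------+---------------+---------------+
1 | shift 3       | error         | reduce A->ε   |
  +---------------+---------------+---------------+
2 | error         | error         | reduce S->A   |
  +---------------+---------------+---------------+
3 | shift 3       | reduce A->ε   | error         |
  +---------------+---------------+---------------+
4 | error         | shift 5       | error         |
  +---------------+---------------+---------------+
5 | error         | reduce A->aAb | reduce A->aAb |
  +---------------+---------------+---------------+

                 goto
          S               A
  +---------------+---------------+
1 | error         | 2             |
  +---------------+---------------+
2 | error         | error         |
  +---------------+---------------+
3 | error         | 4             |
  +---------------+---------------+
4 | error         | error         |
  +---------------+---------------+
5 | error         | error         |
  +---------------+---------------+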

The set of languages defined by LR(1), SLR(1), and LALR(1) are known to form a hierarchy as follows:

SLR(1) ⊆ LALR(1) ⊆ LR(1)

III.3 Pager's weak compatibility

In the previous two sections, restrictions on the class of languages were imposed to reduce the number of states in the LR(1) parser. Pager [Pag77a] shows that the number of states may be reduced without affecting the class of languages accepted. The modification introduced by weak compatibility is in the construction of the LR(1) characteristic automaton (see section II.3). In the algorithm for constructing the automaton there is the statement:

for each set of items I in C, and each grammar symbol X such that J = GOTO(I,X) is not empty and J ∉ C do add J to C;

In this statement, if two states are similar in form, they can be represented by a single state, and therefore similar copies of a state can be removed. The criterion for deciding whether two states can be combined is called a compatibility criterion and the action of combining states is called a merge. For the LR(1) construction, two states are compatible if they are similar in form, that is, they contain the same set of items. Pager has found two other forms of compatibility which he calls weak and strong compatibility.

Unfortunately, changing the compatibility criterion from the LR(1) case can cause problems. In particular, when two states satisfy Pager's compatibility criteria, merging the states may necessitate a propagation of lookaheads to states already created, which in turn will modify the merged state which caused the original propagation. However, these problems can be resolved using the following algorithm:

Algorithm for constructing an LR compatible characteristic automaton

input: a CFG G and a compatibility function compatible.

output: a set C, of states, and the function GOTO : (set of items) x (N U T) -> (set of items), which defines the characteristic automaton.

method: the three procedures below, initiated by calling ITEMS'(G);

function GOTO(I,X);
begin
  let J be the set of items [A -> αX . β, LA] s.t. [A -> α . Xβ, LA] is in I;
  return closure(J);
end;

procedure ITEMS'(G);
begin
  C := closure([S -> . S', ...

[...]

{ [A -> α . β, LA1 ∪ LA2] |
    for all items [A -> α . β, LA1] ∈ S1 there exists an item [A -> α . β, LA2] ∈ S2 and
    for all items [A -> α . β, LA2] ∈ S2 there exists an item [A -> α . β, LA1] ∈ S1 }

Then, according to Pager's definition, two states S1 and S2 are weakly compatible if

i) S1 and S2 only have common marked productions in their item part. That is, if item [A -> α . β, LA1] ∈ S1 then there exists an item [A -> α . β, LA2] ∈ S2, and if item [A -> α . β, LA2] ∈ S2 then there exists an item [A -> α . β, LA1] ∈ S1

ii) for each pair of items [A -> α . β, LA1] ∈ S1 and [B -> γ . δ, LA2] ∈ S2, at least one of the following is true:

  a) LA1 ∩ LA2 = ∅

  b) LA1 ∩ LA2 ≠ ∅ and there exists an item [B -> γ . δ, LA1'] ∈ S1 such that LA1 ∩ LA1' ≠ ∅

  c) LA1 ∩ LA2 ≠ ∅ and there exists an item [A -> α . β, LA2'] ∈ S2 such that LA2 ∩ LA2' ≠ ∅
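The weak compatibility test above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Pager's or the report's code: a state is represented as a dictionary from a marked production (any hashable key, here a plain string) to its lookahead set, and the names `weakly_compatible` and `merge` are hypothetical. The `merge` helper unions the lookaheads of matching marked productions, as in the set expression preceding the definition.

```python
from itertools import permutations


def weakly_compatible(s1, s2):
    """Check conditions i) and ii) for two states given as {marked production: lookaheads}."""
    # i) both states must contain exactly the same marked productions
    if set(s1) != set(s2):
        return False
    # ii) every ordered pair of distinct marked productions (A-item, B-item)
    for a, b in permutations(s1, 2):
        la1, la2 = s1[a], s2[b]      # [A..., LA1] in S1 and [B..., LA2] in S2
        if not (la1 & la2):          # a) no common lookahead symbol
            continue
        if la1 & s1[b]:              # b) [B..., LA1'] in S1 already overlaps LA1
            continue
        if la2 & s2[a]:              # c) [A..., LA2'] in S2 already overlaps LA2
            continue
        return False
    return True


def merge(s1, s2):
    """Union the lookahead sets of matching marked productions."""
    return {core: s1[core] | s2[core] for core in s1}


# Hypothetical states, used only to exercise the functions.
P = {"X -> a . AE": {"d"}, "Y -> a . B": {"e"}}
Q = {"X -> a . AE": {"e"}, "Y -> a . B": {"d"}}
R = {"X -> a . AE": {"d", "f"}, "Y -> a . B": {"e"}}
```

With these made-up states, `P` and `Q` are not weakly compatible (merging would give both productions the lookaheads `d` and `e`), while `P` and `R` are, and `merge(P, R)` simply widens the lookahead set of the first item.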

Condition a) states that if there are no items which have a common lookahead symbol, then the merge cannot produce any conflicts, and in particular not produce a R/R conflict. (Note: it is also impossible to introduce S/R conflicts since the states will be merged only if they have common marked productions. Therefore, the result of merging would only produce a S/R conflict if it existed in one of the unmerged states before merging). In condition b) and c) the set of conditions is:

[A -> α . β, LA1], [B -> γ . δ, LA1'] ∈ S1
[A -> α . β, LA2'], [B -> γ . δ, LA2] ∈ S2
LA1 ∩ LA2 ≠ ∅ and either LA1 ∩ LA1' ≠ ∅ or LA2 ∩ LA2' ≠ ∅

Since LA1 ∩ LA2 ≠ ∅, the only possible conflict arising from merging is a R/R conflict between the productions A -> αβ and B -> γδ. However, this can only occur if β =>+R w and δ =>+R w, w ∈ T*, producing a common substate where both productions will be reducible. By condition b), if in addition LA1 ∩ LA1' ≠ ∅, then there must already exist a state with a R/R conflict on some symbol a ∈ LA1 ∩ LA1'. Similarly for condition c). Hence, if the language is indeed LR(1), then it must be the case that β =>+R y; δ =>+R y'; y,y' ∈ T*; and y ≠ y', and therefore conditions a), b) and c) are sufficient to ensure no conflicts will be produced if the language generated by the grammar is indeed LR(1).

For example, let a CFG be defined with the set of productions in figure 3.1. The LR(1) characteristic automaton contains 38 states (shown in part in figure 3.2). Under weak compatibility, states 8 and 12 can not be merged since the items [X->a.AE,{d}] ∈ 12 and [Y->a.B,{d}] ∈ 8 have the common lookahead symbol d. However, for example, states 30 and 33 are in fact weakly compatible. It can be shown that a weak compatible LR(1) parsing table will contain a number of states that is somewhere between that of LALR(1) and LR(1) parsing tables.

figure 3.2

III.4 Strong compatibility

Pager's strong compatibility adds one condition to weak compatibility which guarantees the production of a LALR(1) parser if the language generated by the grammar is LALR(1). Otherwise it will produce an LR(1) parsing table with the number of states greater than the number of states produced by the LALR(1) method but less than the number produced by the LR(1) method. Strong compatibility requires that no two states be merged if they have a common descendant in the LR(1) characteristic automaton which will introduce R/R conflicts when the two states are merged. For example, the grammar presented by figure 3.1 creates (in part) the LR(1) characteristic automaton in figure 3.2. States 8 and 12 are not weakly compatible because the items [X->a.AE,{d}] ∈ 12 and [Y->a.B,{d}] ∈ 8 have a common lookahead symbol "d". If these two states are merged (and hence causing merges of states (20,28), (18,26), (17,25), (16,24), (29,32), (31,34), (30,33), (19,27), (36,38) and (35,37), where each pair are common descendants) the resulting states of the automaton would have no conflicts. Hence these two states, according to Pager's definition, are in fact strongly compatible.

On the other hand, let the grammar be that of figure 3.3 which creates (in part) the LR(1) characteristic automaton in figure 3.4. Merging states 7 and 10 (and hence causing common descendants 14 and 18 to be merged) would result in two R/R conflicts on the symbols "a" and "b" in its descendant state. Hence these states will not be merged under strong compatibility.

figure 3.3

figure 3.4

The way in which two items (from different states) can produce a common state with a R/R conflict is if the two items can derive the same substring. That is, if two states S1 and S2 are to be merged such that there exists two items [A -> α . β, LA1] ∈ S1 and [B -> γ . δ, LA2] ∈ S2 where LA1 ∩ LA2 ≠ ∅; β =>+R w and δ =>+R w, then the two states have common descendants such that a merge will introduce R/R conflicts.

For example, the reason that states 7 and 10 (in figure 3.4) could not be merged is that the items [X->a.B,{d}] ∈ 7 and [Y->a.b,{d}] ∈ 10 have a common lookahead symbol d, and the strings B and b both rewrite to the string b. The search for a common substring between two states, when it is necessary to try all possible combinations of rewrites, involves as much work as building all descendant states. However, it is not necessary to expand all possible combinations of rewrite rules. This fact can be seen by understanding how expansion of the nonterminals is performed in building the characteristic automaton.

That is, when the item [A -> α . Xβ, LA] is closed, where X -> γ in P and d ∈ LA, it will create the item [X -> . γ, first(βd)]. If β =>*R ε, it is clear that the elements in the lookahead set LA will be propagated to the new item. On the other hand, if β does not derive ε, the definition of the function first indicates that no element d ∈ LA is contributed to first(βd). Hence, in this case, the lookaheads defined by first(βd) are independent of LA and do not affect states derived from the new item. Stated differently, the only rewrites that should be performed are those which are applied to nonterminals which occur at the end of marked productions. This restriction on the number of possible derivations to look at is what Pager calls a strong rightmost derivation (denoted =>SR) and is defined as:

αBc =>SR αβc iff
  i) c =>*R ε
  ii) αBc =>R αβc
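The propagation rule described above (the lookaheads in LA reach the new item only when β can derive ε) can be illustrated with a small Python sketch. It is an illustration only: the FIRST table, the set of nullable symbols, and the function names `first_of_string` and `closure_lookaheads` are hypothetical, and FIRST sets are assumed to contain terminals only, with nullability tracked separately.

```python
def first_of_string(symbols, first, nullable):
    """Return (terminals that can begin the string, whether the whole string is nullable)."""
    out = set()
    for sym in symbols:
        out |= first.get(sym, {sym})   # a symbol missing from the table is a terminal
        if sym not in nullable:
            return out, False          # a non-nullable symbol blocks everything after it
    return out, True


def closure_lookaheads(beta, la, first, nullable):
    """Lookahead set of the item [X -> . gamma, first(beta d)] for all d in LA."""
    head, all_nullable = first_of_string(beta, first, nullable)
    # LA is propagated only when beta can derive the empty string
    return head | la if all_nullable else head


# Hypothetical FIRST table: B is not nullable, C is nullable.
FIRST = {"B": {"b"}, "C": {"c"}}
NULLABLE = {"C"}
```

With this made-up table, closing an item whose remainder β is `["B"]` with LA = {d} yields the lookaheads {b}, independent of LA, while a remainder of `["C"]` or the empty remainder lets `d` through.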

Pager has derived a procedure [Pag77a] which checks if two items, having a common lookahead symbol, will produce a shared descendant containing a R/R conflict. The author feels that the algorithm presented by Pager is opaque, as well as slightly incorrect, and that the algorithm in this paper (see page 49) has been corrected and modified to clarify its nature. The algorithm is presented using two co-recursive procedures which try all possible strong rightmost derivations to see if the two given marked productions yield a common descendant state where two different productions will be reduced (since this is the only way that an R/R conflict can be produced). The procedure CHECK looks for trivial cases (i.e. cases where no rewrites are necessary to determine the result) while the procedure nontrivialcheck checks those cases requiring rewrites in order to determine the wanted criteria.

One possibility that procedure CHECK handles is if it is impossible for two items, with or without rewrites, to produce a common descendant. That is, let (1) A -> α . Xf and (2) B -> γ . Yg be two marked productions where X ≠ Y. Assume that these two marked productions can derive a common substring which will produce a R/R conflict. Then it must be the case that Xf =>*R w and Yg =>*R w. Since both f and g do not derive ε, the lookaheads can not propagate through f and g. But then, by the way LR(1) parsers are generated, the string derived from X must be reduced to X before scanning any string derived from f. Hence any string derived from Xf must begin with X. Similarly, any string derived from Yg must begin with Y. Therefore, since X ≠ Y, it is impossible for any items of this form to produce a common substring (and hence a common descendant) which will produce R/R conflicts.

The second trivial check in the procedure CHECK is if the two marked productions immediately indicate a common descendant which will produce R/R conflicts if merged. That is, if the two items are of the form (1) A -> α . βWf and (2) B -> γ . βZg where

  ii) β ∈ (N U T)* and
  iii) W,Z ∈ N and W,Z =>*R ε

It is clear, under the above conditions, that the closure of the items (3) [A -> αβ . Wf, LA1] and (4) [B -> γβ . Zg, LA2] will produce the items (5) [W -> . ε, Q] and (6) [Z -> . ε, Q] where Q = LA1 ∩ LA2. Hence this case will produce a common descendant where conflicts will be produced.

In all other cases, some rewriting is necessary and procedure nontrivialcheck is called to handle these cases. One possibility, that requires rewriting, is when the two marked productions are of the form (1) A -> α . βXf and (2) B -> γ . βYg where

  i) X ∈ N
  iii) Y ∈ (N U T) and Y ≠ X

In this case, X must rewrite to some string derivable from Yg in order to produce a common string (and hence a common descendant). However, this is the same as testing if there exists a production X -> h, where h ≠ ε, such that the items X -> . h and B -> γβ . Yg will share a common descendant which can produce R/R conflicts.

A second possibility handled in nontrivialcheck are items of the form (1) A -> α . βXf and (2) B -> γ . βZg where

  i) X ∈ N
  ii) Z ∈ (N U T) and X ≠ Z
  iii) f =>*R ε
  iv) no production X -> h, where h ≠ ε, exists such that X -> . h and B -> γβ . Zg will have a common descendant

In this case, because of condition iv), any common string derivable from Xf must be of the form Xg while any common string derivable from Zg must be of the form Zg. But X ≠ Z implies that they cannot derive the same string and hence can not have a shared descendant.

The last possibility checked by procedure nontrivialcheck is the case when the marked productions are of the form (1) A -> α . βX and (2) B -> γ . βY where X,Y ∈ N and X ≠ Y. The only way that these two marked productions can derive a common descendant is if X =>*R w and Y =>*R w. However, this is the same as testing if there exist two productions of the form X -> s and Y -> t such that either the marked productions A -> αβ . X and Y -> . t, or X -> . s and B -> γβ . Y, will produce a common descendant which can contain an R/R conflict from merging.

For efficiency, the procedure nontrivialcheck uses a special global function tried : N x (marked productions) -> boolean. Before the top call to procedure CHECK is made, the function is set to false for all possible inputs, and it will return false the first time it is called with any given input. After that, anytime the function is again called with the same set of arguments, it will return true. Therefore, this function will prevent the procedure nontrivialcheck from checking more than once if a given nonterminal will rewrite to match some particular marked item.

Finally, it is assumed that on the top level call of CHECK(A -> α . α', B -> β . β') the following conditions hold:

  i) A -> α . α' ≠ B -> β . β' and
  ii) α' ≠ ε and β' ≠ ε
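The behaviour of the global function tried described above (false on the first call with a given pair of arguments, true on every later call, reset before each top level call of CHECK) can be sketched as a small Python class. The class name `Tried` and its `reset` method are hypothetical, and marked productions are represented here as plain strings for illustration.

```python
class Tried:
    """Remembers which (nonterminal, marked production) pairs were already examined."""

    def __init__(self):
        self._seen = set()

    def reset(self):
        """Set the function to false for all possible inputs."""
        self._seen.clear()

    def __call__(self, nonterminal, marked_production):
        key = (nonterminal, marked_production)
        if key in self._seen:
            return True           # already tried: the caller should skip this rewrite
        self._seen.add(key)
        return False              # first call with these arguments


tried = Tried()
first_call = tried("X", "B -> b . Yg")
second_call = tried("X", "B -> b . Yg")
```

Keeping the memory in a set bounds the recursion between the two procedures: each pair of a nonterminal and a marked production is expanded at most once per top level call.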

Co-recursive procedures to check for shared descendant of two items

procedure check(A -> α . a1a2...an, B -> β . b1b2...bm) : boolean;
{note: ai,bi ∈ (N U T); A,B ∈ N; α,β ∈ (N U T)*}
begin
  s := maximum i s.t. ai ai+1 ... an =>*R ε;
  t := maximum i s.t. bi bi+1 ... bm =>*R ε;
  match := maximal i s.t. ai = bi;
  if ... then check := false
  else if match > max(s,t) then check := true
  else if s > t
    then check := nontrivialcheck(B -> β . b1b2...bm, t, A -> α . a1a2...an, s, match)
    else check := nontrivialcheck(A -> α . a1a2...an, s, B -> β . b1b2...bm, t, match)
end;

procedure nontrivialcheck(A -> α . a1a2...an, s, B -> β . b1b2...bm, t, match) : boolean;
{note: s ≤ t}
begin
  terminate := false;
  repeat
    if (match - (s-1) < 0) or (s = t) then
      nontrivialcheck := false; terminate := true;
    else if (as ∈ N) ... not tried(as, B -> βb1...bs-1 . bs...bm) then
      for each production C -> c ∈ P s.t. C = as, c ≠ ε, and C -> . c ≠ B -> βb1...bs-1 . bs...bm do
        if check(C -> . c, B -> βb1...bs-1 . bs...bm)
          then nontrivialcheck := true;
    else if (s = t) and (match-1 ... s) and bt ∈ N and
            check(B -> βb1...bs-1 . bs...bm, A -> αa1...as-1 . as...an)
      then nontrivialcheck := true;
    terminate := true
    fi;
  until terminate;
end;

Using the above, two states S1 and S2 are strong compatible if

i) If the item [A -> α . β, LA1] ∈ S1 then there exists an item [A -> α . β, LA2] ∈ S2, and if the item [A -> α . β, LA2] ∈ S2 then there exists an item [A -> α . β, LA1] ∈ S1

ii) for each quadruple of items [A -> α . β, LA1], [B -> γ . δ, LA1'] ∈ S1 and [A -> α . β, LA2], [B -> γ . δ, LA2'] ∈ S2 either

  a) weak compatibility between the items holds or

  b) β and δ do not share a descendant.

Chapter IV

An Error-Recovery Method for LR Parsers

In the previous two chapters, five different constructions were discussed, all of which produce LR parsers. The downfall of all LR parsers is that they are designed only to decide if the given input is legal, that is, belongs to the language generated by its grammar. This causes the unfortunate result that when such a parser is used in a compiler, once the first illegal terminal symbol is found, the parse stops with failure. However, it would be more desirable to have the parse report as many additional errors as possible.

Several people have proposed various error recovery schemes for LR parsers [G&R75,D&R77,P&D79,O'H76,Pen77,P&D78]. This chapter will only deal with one such method, which is a modification of the one presented by DeRemer and Pennello [P&D79]. The algorithm presented here differs from theirs in that it is incorporated into the LR parser and does not attempt error correction.

In order to describe error recovery, we first describe how an LR parser works. Let a path be a sequence of states q0q1...qn such that for each state qi, one of the following conditions holds:

i) goto(qi,X) = qi+1 for some X ∈ N

ii) action(qi,a) = shift qi+1 for some a ∈ T.

A path will be denoted as [q0:α]. That is, if α = a1a2...an where ai ∈ (N U T) then the path [q0:α] is the sequence of states q0q1...qn such that either action(qi-1,ai) = shift qi or goto(qi-1,ai) = qi. Also, let the function top : path -> state be defined as the state qn where the result of the path is q0q1...qn. Finally, whenever the path [q:α] begins from the start state (of the LR parser) it will simply be denoted as [α].

The basic control of an LR parser can be defined by the decision function df : path x T -> (path U {reject,accept}) as follows:

i) df([α],b) = [αb] if action(top([α]),b) = shift j for some state j ∈ K.

ii) df([αw],b) = df([αA],b) if action(top([αw]),b) = reduce A -> w, and αw ≠ S when b = $

iii) df([S],$) = accept if action(top([S]),$) = reduce S -> S'

iv) df is defined as reject for all pairs ([α],b) not defined by rules i) through iii)

The algorithm to implement the above decision function is simply as follows:

procedure parse(df,input);
begin
  path := [start:ε];
  repeat
    t := next terminal symbol from input;
    path := df(path, t);
  until (path = accept) or (path = reject);
  print path;
end;

Note that the variable path is implicitly used as a stack which holds the prefix of sentential forms being recognized by the parser.
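The control loop above can be sketched as a table-driven parser in Python. This is a minimal illustration under stated assumptions, not the report's implementation: it hard-codes the LALR(1) tables of example 3.5, assumes the underlying productions are S -> A, A -> aAb and A -> ε, keeps the path as a list of states used as a stack, and treats the reduction S -> A on the end marker as acceptance. The names `ACTION`, `GOTO` and `parse` are hypothetical.

```python
# LALR(1) tables of example 3.5; entries missing from a row mean "error".
# A reduce entry records the left-hand side and the length of the right-hand side.
ACTION = {
    1: {"a": ("shift", 3), "$": ("reduce", "A", 0)},
    2: {"$": ("reduce", "S", 1)},
    3: {"a": ("shift", 3), "b": ("reduce", "A", 0)},
    4: {"b": ("shift", 5)},
    5: {"b": ("reduce", "A", 3), "$": ("reduce", "A", 3)},
}
GOTO = {1: {"A": 2}, 3: {"A": 4}}


def parse(tokens):
    """Return True if the token sequence is accepted, False on reject."""
    path = [1]                          # stack of states, starting in state 1
    symbols = list(tokens) + ["$"]      # append the end-of-string marker
    pos = 0
    while True:
        entry = ACTION.get(path[-1], {}).get(symbols[pos])
        if entry is None:
            return False                # rule iv): reject
        if entry[0] == "shift":         # rule i): extend the path and read on
            path.append(entry[1])
            pos += 1
            continue
        _, lhs, rhs_len = entry         # rule ii): reduce
        if lhs == "S" and symbols[pos] == "$":
            return True                 # rule iii): accept (assumed convention)
        if rhs_len:
            del path[-rhs_len:]         # pop the states of the right-hand side
        target = GOTO.get(path[-1], {}).get(lhs)
        if target is None:
            return False
        path.append(target)
```

With these tables, `parse("aabb")` and `parse("")` return True while `parse("abb")` returns False, stopping as soon as the illegal symbol is read.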

The error recovery strategy describes what to do if the parse of an input results in reject. As can be seen from the previous algorithm, LR parsers have the nice property that they stop reading input immediately after the input string is found to be illegal. The best recovery from such an error would be if the parse could somehow be restarted such that all other errors made in the input could be picked up. Unfortunately, this strategy is really unfeasible since it carries the implicit assumption of knowing what the writer meant when he wrote the string to be parsed.

A much more conservative approach is to only state what remaining substrings of the input are impossible according to the given grammar. That is, if w ∈ T* is a substring of the remaining input after the error and there doesn't exist a rightmost derivation such that S =>*R αwc for some α ∈ (N U T)* and c ∈ T*, then the substring w should be reported as an error.

For example, consider the two pseudo PASCAL productions

<stmt> -> FOR <var> := <exp> TO <exp> DO <stmt>
<stmt> -> WHILE <exp> DO <stmt>

with the erroneous input

FOR X:=1 5 DO BEGIN J:=X; L:=X END;

where the terminal symbol "TO" has accidentally been left out. Using an LR parse, parsing would stop after reading the symbol "5". As one looks for subsequent errors, it is clear that "5" is a valid substring derivable from S. It is also clear that 5 can occur at the following points in the given productions

<stmt> -> FOR <var> := "<exp>" TO <exp> DO <stmt>
<stmt> -> FOR <var> := <exp> TO "<exp>" DO <stmt>
<stmt> -> WHILE "<exp>" DO <stmt>

By expanding the substring to include the next input symbol, the next possible substring to test would be "5 DO". Here, the number of possible positions of this string has been reduced to

<stmt> -> FOR <var> := <exp> TO "<exp> DO" <stmt>
<stmt> -> WHILE "<exp> DO" <stmt>

Continuing this process, it is clear that the substring "5 DO BEGIN J:=X; L:=X END" can correspond to the following positions in the productions:

<stmt> -> FOR <var> := <exp> TO "<exp> DO <stmt>"
<stmt> -> WHILE "<exp> DO <stmt>"

At this point, the semicolon at the end of the parse string implies that a reduction should be performed by one of the above productions. One possibility is to take the string recognized before the reject point, and to either add or delete symbols to produce a match which could decide which reduction to choose. This type of error recovery is in fact the error correction method used by [P&D79]. However, the deterministic one chosen by the author assumes that the substring "5 DO BEGIN J:=X; L:=X END" is the maximal string that could be recognized, and hence will remove it from further consideration. That is, it will restart the parse starting with the semicolon.

The above example in fact characterizes the error recovery method described in this chapter. To state the method more explicitly, let me start by defining an error state as a set of LR parser states, where each error state contains the set of LR parser states that the parse might be in. The restart state is defined as a special error state containing all the LR parser states. The first shift, in error recovery, is a forced shift through the illegal terminal symbol that produced the rejection. This shift can be viewed as a parallel shift, on the error symbol a, from all LR parser states I in the restart state to all states J such that action(I,a) = shift J. It will then try to parse the input where the parse will start, simultaneously, from each of the LR parse states J existing after the forced shift through the illegal symbol. If along the way, any of these parses produce an error, it will be dropped from further consideration for simultaneous parsing.

One possible result of the above process is that all parses will be dropped from the set of simultaneous parses. Under this condition, it is clear that there is no derivation such that S =>*R αwc for the parsed input w. Hence, it is quite legal to assume that the next input symbol can not occur, and report it as an error. Since this is an error, the algorithm will then restart the recovery method on the next input symbol. Note that the first action on any error is a forced shift. This is done to guarantee that the remaining input is parsed. Also, the above error recovery should not continue if the illegal terminal symbol was the end of string marker $.

The second problem is that if the above error recovery process is to be merged into the LR parser, the parallel parses have to be made deterministic. There is no problem with creating a new error state if the result of the action function for a set of states, for all possible inputs, is a shift entry. In this case it is clear that the action is deterministic, since the resulting states can be lumped into a new set of states and hence a new error state. The same is true for the goto function. Therefore, nondeterminism can only occur if the action, for a set of states to be simultaneously parsed, contains either

i) shifts and reductions for the same input symbol

ii) reductions for different productions for the same input symbol (as shown in the previous example)

Unfortunately, neither of these cases seems to be resolvable deterministically. If, in either case, the parse was allowed to continue and the next action was performed, the result would produce two different paths. That is, the above two conditions would result in disjoint sentential prefixes. Such conditions will be called overdefined. However, some decision still has to be made so that the remaining input can be parsed. Again, the conservative approach was taken. Whenever the input string being parsed becomes overdefined, the parser assumes that it is the maximal substring it can recognize, and restarts the whole error recovery process on the next input symbol.

By merging the error-recovery into the LR parser, a new LR parser with error recovery can be built. If an LR parsing table is the tuple M = (K, action, goto, G, start), then let the LR parser with error recovery be defined as the tuple M' = (K, K', action, goto, G, start, init-error) where

K, G, and start are defined as in M,

K' is a set of new states called error recovery states

init-error is a state in K' denoted as the restart state of the error recovery method

goto : (K U K') x N -> K U K' U {error}

action : (K U K') x T -> {shift k | k ∈ K} U {error,overdefined} U {reduce p | p ∈ P}

Furthermore, the init-error state will be defined so that for each b ∈ T, action(init-error,b) = shift j for some state j. Each recovery state is a set of parsing states in K, such that it is the set of states that can occur simultaneously for the input string being parsed.

Using the above definition, LR parsers with error recovery can be built by the following algorithm:

Construction of LR parser with error recovery

input:  LR parsing table M = (K, action, goto, G, start)
output: LR parsing table M' = (K, K', action, goto, G, start, init-error)
method:

begin
  {initialize state init-error}
  set K' to the single state containing the set {s ∈ K} and label it as init-error;
  for each a ∈ T
    let s be the set {j ∈ K | action(i,a) = shift j for all i ∈ init-error};
    if s is a singleton
      then set s' to the element of s
      else if s ∈ K'
        then set s' to that state in K'
        else add s to K' and label the new state as s';
    fi
    set action(init-error,a) = shift s'
  od
  for each X ∈ N
    let s be the set {j ∈ K | goto(t,X) = j for all t ∈ init-error};
    if s is empty
      then set goto(init-error,X) = error
      else
        if s is a singleton
          then set s' to that element of s
          else if s ∈ K'
            then set s' to the state in K' containing s
            else add s to K', and set s' to its label
        fi
        set goto(init-error,X) = s';
    fi
  od
  {build each general error state}
  repeat
    for each state i ∈ K' such that the parsing table for that state is still undefined
      for each a ∈ T
        if there exists two states s1, s2 ∈ i s.t.
             [A -> α . , LA1] ∈ s1 where a ∈ LA1
             [B -> β . γ, LA2] ∈ s2 where first(γ) = a
          then set action(i,a) = overdefined
        else if there exists two states s1, s2 ∈ i s.t.
             [A -> α . , LA1] ∈ s1 and [B -> β . , LA2] ∈ s2
             where a ∈ LA1 ∩ LA2 and A -> α ≠ B -> β
          then set action(i,a) = overdefined
        else if there exists a state s ∈ i s.t.
             [A -> α . , LA] ∈ s where a ∈ LA
          then set action(i,a) = reduce A -> α
        else
          let s be the set {j ∈ K | action(t,a) = shift j for all t ∈ i};
          if s is empty
            then set action(i,a) = error
            else
              if s is a singleton
                then set s' to the element in s
                else if s ∈ K'
                  then set s' to the state in K' containing s
                  else add s to K', setting s' as the label of the added state;
              fi
              set action(i,a) = shift s'
          fi
        fi
      od
      for each X ∈ N
        let s be the set {j ∈ K | goto(t,X) = j for all t ∈ i};
        if s is empty
          then set goto(i,X) = error
          else
            if s is a singleton
              then set s' to the element in s
              else if s ∈ K'
                then set s' to the state in K' containing s
                else add s to K', and set s' to its label
            fi
            set goto(i,X) = s'
        fi
      od
    od
  until no more states can be added to K'
end
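The construction above can be illustrated with a small Python sketch. It is not the author's program: it assumes a finished action table is already available as a dictionary (instead of working from items and lookahead sets), it leaves out the goto function, and the names merged_action and build_recovery_states, as well as the tuple encodings of the entries, are hypothetical. As in the algorithm, init-error only collects shift entries, while every other recovery state is marked overdefined when shifts and reductions, or two different reductions, meet on the same terminal.

```python
from collections import deque

ERROR = ("error",)
OVERDEFINED = ("overdefined",)

def merged_action(action, states, a, shifts_only=False):
    """Combine the action entries of a set of ordinary states for terminal a."""
    entries = {action[(s, a)] for s in states if (s, a) in action}
    shifts = {e[1] for e in entries if e[0] == "shift"}
    reduces = set() if shifts_only else {e for e in entries if e[0] == "reduce"}
    if (shifts and reduces) or len(reduces) > 1:
        return OVERDEFINED                      # shift/reduce or reduce/reduce clash
    if reduces:
        return next(iter(reduces))
    if len(shifts) == 1:
        return ("shift", next(iter(shifts)))    # singleton: shift to the ordinary state
    if shifts:
        return ("shift", frozenset(shifts))     # shift to a recovery state (a set of states)
    return ERROR

def build_recovery_states(action, K, T):
    """Return (init_error, table) where table maps recovery state -> terminal -> action."""
    init_error = frozenset(K)
    table, todo = {}, deque([init_error])
    while todo:
        i = todo.popleft()
        if i in table:
            continue
        table[i] = {}
        for a in T:
            act = merged_action(action, i, a, shifts_only=(i == init_error))
            table[i][a] = act
            if act[0] == "shift" and isinstance(act[1], frozenset):
                todo.append(act[1])             # a new set of simultaneous states
    return init_error, table
```

Representing each recovery state directly as a frozenset of ordinary states makes the test "s ∈ K'" a dictionary lookup, so no separate labelling step is needed in this sketch.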

Using the resulting LR parser with error recovery, basic control can be handled using the decision function

    df' : path × T -> path

as follows:

    i)    df'([q:α],b) = [q:αb]
          when action(top([q:α]),b) = shift j for some j ∈ (K ∪ K')

    ii)   df'([q:αw],b) = df'([q:αA],b)
          when action(top([q:αw]),b) = reduce A -> w, and if αw = S then b ≠ $

    iii)  df'([init-error:w],b) = df'([init-error:A],b)
          when action(top([init-error:w]),b) = reduce A -> w, where w ≠ ε and b ≠ $

    iv)   df'([S],$) = accept

    v)    df'([init-error:S],$) = reject
          if action(top([init-error:S]),$) = accept or overdefined

    vi)   df'([q:α],$) = reject
          when action(top([q:α]),$) = error

    vii)  df'([init-error:α],b) = [init-error:b]
          where b ≠ $, and action(top([init-error:α]),b) = overdefined

    viii) df'([q:α],b) = [init-error:b]
          where b ≠ $ and action(top([q:α]),b) = error

Note that cases vi) or viii) represent that an error has been found in the string being parsed. Hence, any error messages produced are produced at these points.

Finally, an LR parser with error recovery can be implemented simply by calling the procedure parse, using df' as the decision function.

Chapter V

Implementation

This chapter discusses two programs. The first program creates an SLR(1) parser, with error recovery. The second program creates either an LR(1), LALR(1), weakly compatible, or a strongly compatible LR parser. The first section discusses the representation of the parsing tables built by both programs. The second section describes the implementation of the SLR(1) parser constructor and how that system is used, while the third section does the same for the second parser constructor.

V.1 Representation of the parsing tables

The representation of the parsing tables naturally suggests using arrays. For uniformity of both access and values held in the arrays, all terminal symbols, nonterminal symbols, and productions are provided with an internal code of integers by both programs.

For terminal symbols, the codes are defined by the set {i | 0 ≤ i ≤ n where n is the number of distinct terminal symbols occurring in the productions} where 0 is reserved for the special terminal symbol $.

Nonterminal symbols are encoded using the set {i | -m ≤ i ≤ -1 where m is the number of distinct nonterminals occurring in the productions} where the start symbol S will always be given the code -1.

Productions are coded using the set {i | 1 ≤ i ≤ p where p is the number of productions} where the start production, the one defining S', is always given the code 1.
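The coding scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name assign_codes is hypothetical, productions are given as (left hand side, right hand side list) pairs with the start production first, a symbol counts as a nonterminal when it appears on some left hand side, and codes are handed out in order of first appearance.

```python
def assign_codes(productions):
    """Assign integer codes: terminals 0..n ($ is 0), nonterminals -1..-m, productions 1..p."""
    nonterminals = {lhs for lhs, _ in productions}
    term_code = {"$": 0}            # 0 is reserved for the end of string marker
    nonterm_code = {}
    for lhs, rhs in productions:
        for sym in [lhs, *rhs]:
            if sym in nonterminals:
                # first nonterminal seen (the start symbol) gets -1, the next -2, ...
                nonterm_code.setdefault(sym, -(len(nonterm_code) + 1))
            else:
                term_code.setdefault(sym, len(term_code))
    prod_code = {i + 1: p for i, p in enumerate(productions)}
    return term_code, nonterm_code, prod_code
```

Keeping terminals non-negative and nonterminals negative lets a single integer identify any grammar symbol, which is convenient when both kinds of symbol index into the same arrays.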

In representing the action and goto functions, only non-error values are kept internally since the vast majority of the function values are in fact error. The remaining values are saved in groups, one for each state, where states having the same set of non-error values will be represented by a single copy of the group.

For example, the grammar

would produce the following SLR(1) parsing tables:

Action table

[Action table: one row per state and one column per terminal symbol; only fragments of the rows for states 15, 16 and 17 survive.]

where shift j is represented by S j, reduce p is represented by p, overdefined is represented by 0, and error is omitted.

goto table

[goto table: one row per state and one column per nonterminal symbol.]

where goto(i,X) = error has been omitted.

By elimination of the error values, 58.8% of the above tables does not need to be saved. Also, states 1, 2, 8 and 9 in the previous action table all have the same set of values and therefore will be represented by only one group of values.

Each non-error value of the action table will be represented as follows:

    i) action(i,a) = shift j will be represented by the pair (x,j) where x is the code of the terminal symbol a.

    ii) action(i,a) = reduce A -> w will be represented by the pair (x,-p) where x is the code of the terminal symbol a and p is the code of the production A -> w.

    iii) action(i,a) = overdefined will be represented by the pair (x,0) where x is the code of the terminal symbol a.

The non-error values of the goto table, for some state i, will be represented as the pair (x,j) where goto(i,A) = j and x is the code of the nonterminal A.
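The pair representation can be sketched as follows. This is a hedged illustration, not the original code: the function names are hypothetical, the pairs of a state are kept in ordinary ascending tuple order (an assumption made for this sketch), and a binary search stands in for whatever retrieval the programs actually use.

```python
from bisect import bisect_left

def encode_action(code, act):
    """Encode one non-error action entry for the terminal with integer code `code`."""
    kind = act[0]
    if kind == "shift":
        return (code, act[1])        # (x, j)
    if kind == "reduce":
        return (code, -act[1])       # (x, -p)
    if kind == "overdefined":
        return (code, 0)             # (x, 0)
    raise ValueError("error entries are not stored")

def decode_action(pairs, code):
    """Look up a terminal code in one state's sorted pair list; missing means error."""
    k = bisect_left(pairs, (code,))
    if k == len(pairs) or pairs[k][0] != code:
        return ("error",)
    v = pairs[k][1]
    return ("shift", v) if v > 0 else ("reduce", -v) if v < 0 else ("overdefined",)

def group_rows(rows):
    """Share one copy of the pair list among states whose non-error values are identical."""
    groups, index, state_group = [], {}, {}
    for state, pairs in rows.items():
        key = tuple(sorted(pairs))
        if key not in index:
            index[key] = len(groups)
            groups.append(list(key))
        state_group[state] = index[key]
    return groups, state_group
```

Because shift targets and production codes are both positive, the sign of the second component is enough to tell a shift from a reduction, and 0 is free to mark an overdefined entry.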

For efficiency in retrieving the values from the action and goto tables, the integer pairs corresponding to each state are sorted using the relation ≤' where

    (a,b) ≤' (c,d) iff either a

20

represent

    * : reduce E->T
    + : shift 8
    ) : reduce E->T

V.2 SLR(1) implementation

This section describes how to use the SLR(1) parser constructor with error recovery. This implementation has the restriction that no production can be of the form A -> ε. Included in this section is a brief description of the input grammar, how to run the system, and how to interpret the output produced.

V.2.1 Input Grammar

The input for the program is the set of productions defining the CFG which the SLR(1) parsing table is to be constructed from. The input will be parsed in a free style format, that is, no formatting by columns or line boundaries will be used. The end of line character will be treated as a blank character and each symbol on the input file must be separated by one or more blanks.

In general, a terminal symbol is represented by a nonblank string of 15 characters or less, not beginning with "<" (and not one of the metasymbols "->", "!", "$", and "."). In the event that the user may use one of the metasymbols used by the program, or a nonblank string beginning with "<", ...

Nonterminal symbols are represented as character strings of 15 characters or less, enclosed by the symbols "<" and ">". The string, if not the empty string, must begin with a nonblank character but blank characters can appear anywhere else in the string. The program also accepts the string "<>" which represents a nonterminal symbol whose name is the empty string.

Productions are represented by writing them in the form

    A -> w

where A is a nonterminal, w is a sequence of grammar symbols, and "->" is a metasymbol recognized by the program. Each production is separated from the next using the metasymbol "." and after the last production, the metasymbol "$" must appear. The productions can be entered in any order except that the first production, on the input file, must be the start production.

For example, the grammar presented in V.1 could be represented by the following piece of input:

A shorthand notation also exists for productions having the same left hand side (i.e. productions of the form A -> w where A remains constant between the productions). In these cases, the productions can be entered in the form

    A -> w1 ! w2 ! ... ! wn

where there exists the productions A -> w1, A -> w2, ..., A -> wn.

For example, the grammar in section V.1 could have alternatively been written as:

The order in which productions are found in the input file corresponds to the order in which they will be coded internally. In a similar manner, the terminal and nonterminal symbols will be coded in the order corresponding to their first appearance in the set of productions.
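The input conventions of this section can be made concrete with a short Python sketch. It is an assumption-laden illustration rather than the program's own reader: the function name parse_productions is hypothetical, nonterminal names are assumed to contain no blanks, and the only checks made are the ones shown.

```python
def parse_productions(text):
    """Return a list of (lhs, rhs) pairs from blank-separated input ending in '$'."""
    tokens = text.split()            # an end of line counts as a blank
    productions, pos = [], 0
    while tokens[pos] != "$":
        lhs = tokens[pos]
        if tokens[pos + 1] != "->":
            raise SyntaxError(f"'->' expected after {lhs}")
        pos += 2
        rhs = []
        while tokens[pos] != ".":
            if tokens[pos] == "!":   # shorthand: same left hand side, new alternative
                productions.append((lhs, rhs))
                rhs = []
            else:
                rhs.append(tokens[pos])
            pos += 1
        productions.append((lhs, rhs))
        pos += 1                     # skip the '.' separator
    return productions
```

Since the result preserves input order, the position of each pair in the returned list (counting from 1) can serve directly as the internal production code.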

V.2.2 Running the SLR(1) parser constructor

The system can be run on the Vax-11 in the Moore School, by entering the following monitor level procedure call:

    $ @[karl]slrbnf

After invocation, the procedure will ask the user for the files used by the program, and run the program. The first file to be requested is the file containing the set of productions, and is requested with the prompt:

    input :

The second file request is for the output file which will contain all diagnostic and informational messages, and is requested with the prompt:

    output :

The third file request is for the file that the created SLR(1) parsing tables should be saved on, and is requested with the prompt:

    internal representation:

The last two file requests are for temporary files that can be used by the program, and are both requested with the prompt:

    temporary storage unit:

Upon completion of the file requests, the program is run. The program will not produce any output on the user's screen, nor will it ask the user for any further information, unless the SLR(1) parsing table was created and contains conflicts (see section V.2.4 for handling this case).

This paper will not mention how to use the file containing the SLR(1) parsing tables except for a PASCAL program skeleton in appendix a.

V.2.3 Interpretation of the output file

The output can be broken into two major sections where the first section describes how the program parsed the input grammar and the second section prints the SLR(1) parsing tables built. However, the second section will be produced only if there were no errors detected in the first section.

The first page of the output is a copy of the input being parsed, along with any error messages indicating illegal syntax. If there were no syntactic mistakes in the input grammar, then this page will be an exact duplicate of the input file. Otherwise, portions of the input file will be written, and will be interspersed with syntactical errors recognized by the program. For example, the erroneous input:

would produce the following output:

    < S > -> .
    < A > -> a b .
    A ***illegal LHS

In this example, the program is reporting that the production has a terminal symbol on the left hand side of the production.

The next three subsections of the output report the coding scheme of terminals, nonterminals, and productions used by the program. For example, the input:

would produce the following output:

    TERMINAL NODES:
    -------- ------

    NONTERMINAL NODES:
    ----------- ------

    PRODUCTIONS :
    ------------

The program provides additional information with the coding schemes, that is, if the string "*undef*" precedes a nonterminal, then that nonterminal does not occur on the left hand side of any production recognized while parsing the input file.

Below the coding scheme is a diagnostic summary of how well the program did in parsing the given input bnf. If everything is acceptable to the program, it will print the message "successful parse" and attempt to construct the SLR(1) parsing tables. Otherwise, it will give an error summary of why it thought the input was wrong, and abort any further calculations.

Should the input grammar be successfully parsed, the program then attempts to build the SLR(1) parsing tables. To begin with, it computes the first and follow sets for each nonterminal, and prints out these sets. Second, it prints out the sets of SLR(1) items defining the core of each state. For example, the previous input grammar would produce output for the first five states as follows:

    ............................................. STATE : 1
    1) < S > -> .
    ............................................. STATE : 2
    7) < F > -> . ( )
    ............................................. STATE : 3
    6) < F > -> id
    ............................................. STATE : 4
    5) ->
    ............................................. STATE : 5
    3) < E > ->
    4) < T > -> . . +

The last section of the output, for a run, is a readable form of the produced parsing table followed by the size of the array parsetable. Non-error values, of the parsing tables, for each state are listed separately with the action values preceding the goto values. For example, the output produced by the program for the parsing values for the first state would be as follows:

    .............................................. STATE 1
    id   SHIFT TO 3
    (    SHIFT TO 2

V.2.4 Conflict Resolution

Sometimes, when a CFG G is provided as input to the SLR(1) parser constructor, it can not produce an SLR(1) parser for G since G is not an SLR(1) grammar. In such cases, the construction method has produced conflicts in the action table. For example, the grammar in figure 5.1 is an example of a natural grammar for arithmetic expressions with operators + and *. The LR(0) characteristic automaton, for this grammar, and the follow sets are shown in figure 5.2. In the states 9 and 10, there will exist S/R conflicts on symbols + and * if the SLR(1) parser is built from the characteristic automaton. This can also be seen in the output produced by the program for such an input (see figure 5.3).

figure 5.1

figure 5.2

    ............................................... STATE : 9
    ***REDUCE/SHIFT CONFLICT ON SYMBOL +   OLD ENTRY: -2   CONFLICTING ENTRY:
    ***REDUCE/SHIFT CONFLICT ON SYMBOL *   OLD ENTRY: -2   CONFLICTING ENTRY:
    ............................................... STATE : 10
    ***REDUCE/SHIFT CONFLICT ON SYMBOL +   OLD ENTRY: -3   CONFLICTING ENTRY:
    ***REDUCE/SHIFT CONFLICT ON SYMBOL *   OLD ENTRY: -3   CONFLICTING ENTRY:

figure 5.3

It turns out that these conflicts can be resolved in favor of either a shift or a reduce action by knowing the precedence and associativity of these two operators. For example, looking at state 9 and the operator *, the parser is attempting to recognize the sentential form:

    E + E * E

Assuming that * has precedence over +, it is clear that we want to shift on the input symbol * to first recognize the string E * E and reduce it to the string E, producing the new sentential form E + E.
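The reasoning used for state 9 can be written down as a small rule. The sketch below is only an illustration of that reasoning: the function name resolve and the prec and assoc tables are hypothetical, and the program described here does not apply such a rule automatically.

```python
def resolve(reduce_op, lookahead, prec, assoc):
    """Choose 'shift' or 'reduce' for a conflict between reducing by a production
    whose operator is reduce_op and shifting the operator lookahead."""
    if prec[lookahead] > prec[reduce_op]:
        return "shift"       # e.g. E + E with * next: recognize E * E first
    if prec[lookahead] < prec[reduce_op]:
        return "reduce"      # e.g. E * E with + next: finish the product
    # Equal precedence: associativity decides.
    return "reduce" if assoc[reduce_op] == "left" else "shift"
```

With prec = {"+": 1, "*": 2} the call resolve("+", "*", prec, assoc) selects the shift, which is the choice argued for above.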

Should the grammar in the input file produce conflicts, the program will arbitrarily pick one of the action definitions for the symbol causing the conflict in the state and discard all other conflicting entries. This choice is reported to the user as shown in figure 5.3. In each case, "OLD ENTRY: xx" represents the entry chosen by the program while "CONFLICTING ENTRY: yy" states a discarded entry. Hence, in state 9, for the symbol *, the arbitrary choice was to reduce on the production labelled 2 (i.e. E->E+E).

To allow the user to change the arbitrary choice made by the program, the program will also become interactive if any conflicts arise in building the SLR(1) parser. That is, the program will prompt the user with the prompt:

    ENTER STATE TO RESOLVE:

To this response, two choices are available. If the user responds with the number 0, the program will stop so that the user can look at the output file in order to identify all existing conflicts in building the SLR(1) parser. If the user feels that these conflicts can not be resolved, then he is out of luck. Otherwise, the user should rerun the program and when getting the above prompt, he should resolve the conflicts by using the second option.

The second option in responding to the above prompt is to type in the state that the user wants to resolve. After the user completes his answer, the program will print out the core of the state, and will ask the user, for verification, if it is the state he wanted.

The next request by the program is for the user to provide the integer code of the terminal symbol causing the conflict using the prompt:

    ENTER SYMBOL NUMBER TO RESOLVE:

As above, the program will verify the user's response by printing out the terminal's name and asking the user if it is the correct terminal symbol. Again, a "N" response will cause the program to reprompt for a state to resolve while a "Y" response will have the program continue processing the resolution.

The next request, after the symbol request, is for the action function's value for the state and symbol with the prompt:

    ENTER NEW ACTION TO TAKE:

If the value provided by the user is a positive integer (and hence a shift action), the program will print out the core of the state the shift is to. If the value given by the user is negative (and hence a reduce entry), the program will print out the production associated with the label provided by the user. In either case, it will then ask the user if this was what the user wanted and again verify the user's input.

The program will provide one last chance, after the conflict resolution has been specified, to disregard the conflict resolution. A "Y" response by the user will cause the resolution to be processed while a "N" response will disregard the resolution provided by the user. In either case, the program will then request another conflict resolution with the prompt:

    ENTER STATE TO RESOLVE:

At this point, the whole process repeats unless the user responds with a 0. If a 0 is typed in by the user, then no more conflict resolutions will be processed and the program will build the SLR(1) parser. Note that the program will not produce an SLR(1) parser unless at least one conflict has been resolved.

V.2.5 Size Restrictions

This program contains several size restrictions which are as follows:

    i) No more than 100 terminal symbols may be used.

    ii) No more than 200 nonterminal symbols may be used.

    iii) No more than 300 productions may appear in the input.

    iv) No terminal or nonterminal name may exceed 15 characters.

    v) For each production A -> w, w can not be a string, of terminal and nonterminal names, exceeding a length of ten names.

    vi) The number of parse states, created by the program, must not exceed 600.

    vii) The number of SLR(1) items, excluding the items of the form A -> . w, must not exceed 9,999.

    viii) The size of the array parsetable can not exceed the dimensions of 10,000 x 2.

V.3 LR(1), LALR(1), Weak and Strong Compatibility parser generators

This section describes how to use the program which can build either LR(1), LALR(1), weak compatible, or strong compatible parsing tables. Included in this section is a brief description of the input grammar, how to run the program, and how to interpret the output.

V.3.1 Input Grammar

The input for the program is the set of productions defining the CFG from which the parsing tables are to be produced. These productions can be optionally preceded with a list of the terminals and nonterminals, allowing the user to specify the integer codes given to these symbols.

The input will be parsed in a free style format, that is, no formatting by columns or line boundaries will be used. The end of line character will be treated as a blank character and each symbol on the input must be separated by at least one blank.

In general, a terminal symbol is any nonempty string of nonblank characters which does not begin with the character "<". However, it can not be any of the metasymbols (i.e. "->", "|", ".", "#", or "e"). In the event that the user wants to use one of the metasymbols or a string beginning with a "<" ... and includes the name composed by the empty string ("<>").

Productions are represented by writing them in the form

    A -> w

where A is the name of a nonterminal, w is a sequence of terminal and nonterminal names, and "->" is a metasymbol recognized by the program. The symbol "e" has been reserved to represent the empty string so that productions of the form A -> e can be written. Productions are separated from each other using the metasymbol ".", and no symbols should follow the last production.

Productions having the same left hand side, i.e. of the form A -> w1, A -> w2, ..., A -> wn, can be written in the form

    A -> w1 | w2 | ... | wn

where the metasymbol "|" is treated as an "or" symbol.

For example, the grammar

    S -> A
    A -> aAb
    A -> e

could be entered with the input:

Productions, when parsed, will be coded internally using the order in which they appear on the input. The only restriction on the order in which the productions are written is that the start production must appear first.

Unlike the SLR(1) parser constructor, this program allows the user to specify the coding scheme of the nonterminal and terminal symbols. That is, before the start production the user is allowed to provide a list of terminals, optionally followed by a list of nonterminals, followed by the metasymbol "#". It is not necessary that all terminals and nonterminals appear in these lists, and either list may be empty. Elements in these lists will be labeled in the order that they are found (1 for the first terminal, 2 for the second terminal, etc. and -1 for the first nonterminal, -2 for the second nonterminal, etc.). Any remaining terminals, or nonterminals, not specified by these lists will be labelled according to the order of first appearance in the set of productions.

For example, assume using the previous grammar that the user wants the terminal b to be labelled 1 and terminal a to be labelled 2. This could be done by using the input:

The program described by this section in fact has used the SLR(1) parsing tables (produced by running the SLR(1) program described in section V.2) to parse the input for this program. Hence, the description of the input rules can be formally described by the set of rules used in creating the SLR(1) parsing tables, which are as follows:

.

-> -> ! tother prods> -> nonterminal '-> nonterminal -> ! -> nonterminal '-> -> e-rule ! ! I e-rule ! I -> terminal ! nonterminal ! terminal ! nonterminal -> # ! # ! # -> terminal ! terminal -> nonterminal ! nonterminal $

'.

'.

.

'.

.

.

.

.

. . .

V.3.2 Running the program

The program can be run on the Vax-11 in the Moore School by entering the following monitor level procedure call:

    @[karl]runnewbnf

After invocation, the procedure will ask the user for the files used by the program, and then run the program. The first file requested by the procedure is the file containing the set of productions, and is requested with the prompt:

    BNF FILE:

The second file request is for the output file which will contain all diagnostic and informational messages, and is requested with the prompt:

    OUTPUT FILE:

The last request is for the file to save the parsing tables created and is requested with the prompt:

    TABLE:

Upon completion of the file requests, the program is run. After the program finishes reading the input bnf file, the program will request the user to specify what type of parser should be created with the prompt:

    ENTER OPTION
    0 - COMPUTE FIRSTS ONLY
    1 - BUILD LR(1) PARSE TABLE
    2 - BUILD LALR(1) PARSE TABLE
    3 - BUILD WEAK COMPATIBLE LR PARSE TABLE
    4 - BUILD STRONG COMPATIBLE LR PARSE TABLE

Once the user responds, the program will build the corresponding parse table, printing out "BUILDING STATE X" as it tries to build state X. This completes all interaction the program has with the user.

The first page of the output file is a copy of the input being parsed, along with any error messages describing illegal syntax.

For example, the erroneous input:

    < S > -> .
    -> a b .
    A -> e

would produce the following output:

    INPUT PARSE OF PRODUCTIONS:
    --------- -- -----------

    *** 32) PRODUCTION DEFINITION EXPECTED

The above error is stating that on column 32 of the file, at the beginning of the previous input line, the program was expecting to find a production but found something else (i.e. the terminal symbol A).

The next three subsections of the output, after the parse of the input, report the coding scheme of the terminals, nonterminals, and productions used by the program. For example, the input:

    a b # < start symbol > -> < A > .
    < A > -> a < A > b | e

would produce the following output:

    TERMINALS:
    ----------

    NON-TERMINALS :
    --------------
    -1. < start symbol >   *START SYMBOL* *UNIQUE* *NOT USED ON RHS*
    -2.

    PRODUCTIONS :
    ------------

As

can

be

informational provided,

seen

by

messages

the about

above

example,

nonterminal

additional symbols

are

and a r e a s f o l l o w s :

*START SYMBOL* - States that the nonterminal symbol has been recognized as the start symbol.

*UNIQUE* - States that the start symbol does not occur anywhere else in the productions and hence is a valid start symbol.

*NOT UNIQUE* - States that the start symbol occurs in another production besides the start production and hence is an invalid start symbol.

*NOT USED ON RHS* - States that the nonterminal never appears on the right hand side of any production.

*NT NOT REACHABLE* - States that the nonterminal cannot appear in any of the sentential forms and hence need not be part of the input grammar.

*NT REPRESENTS NO TERMINAL STRINGS* - States that there are no terminal strings derivable from the nonterminal.

*NT NOT DEFINED* - States that the nonterminal does not appear on the left hand side of any production recognized from the input file.
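The checks behind these messages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program's actual code: the dictionary encoding of the grammar, the function name `diagnose`, and the convention that nonterminals are written in angle brackets are all hypothetical, and the *UNIQUE* test reflects one reading of the definition above (the start symbol has exactly one production and never appears on a right hand side).

```python
from typing import Dict, List, Set

# Hypothetical grammar encoding (not the report's input format): each
# nonterminal maps to a list of right hand sides; [] is the empty production.
Grammar = Dict[str, List[List[str]]]


def diagnose(grammar: Grammar, start: str) -> Dict[str, List[str]]:
    """Return the informational messages that apply to each nonterminal."""
    # Every symbol that occurs on some right hand side.
    used = {s for rhss in grammar.values() for rhs in rhss for s in rhs}
    # Assumption: nonterminals are spelled <name>; some may lack productions.
    nonterminals = set(grammar) | {s for s in used if s.startswith("<")}

    # Reachable: appears in some sentential form derived from the start symbol.
    reachable, work = {start}, [start]
    while work:
        for rhs in grammar.get(work.pop(), []):
            for sym in rhs:
                if sym in nonterminals and sym not in reachable:
                    reachable.add(sym)
                    work.append(sym)

    # Productive: derives at least one terminal string (fixed-point iteration).
    productive: Set[str] = set()
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            if nt not in productive and any(
                all(s not in nonterminals or s in productive for s in rhs)
                for rhs in rhss
            ):
                productive.add(nt)
                changed = True

    messages: Dict[str, List[str]] = {}
    for nt in sorted(nonterminals):
        notes: List[str] = []
        if nt == start:
            notes.append("*START SYMBOL*")
            unique = nt not in used and len(grammar.get(nt, [])) == 1
            notes.append("*UNIQUE*" if unique else "*NOT UNIQUE*")
        if nt not in used:
            notes.append("*NOT USED ON RHS*")
        if nt not in reachable:
            notes.append("*NT NOT REACHABLE*")
        if nt not in productive:
            notes.append("*NT REPRESENTS NO TERMINAL STRINGS*")
        if nt not in grammar:
            notes.append("*NT NOT DEFINED*")
        messages[nt] = notes
    return messages


# The sample grammar: <start symbol> -> <A>, <A> -> a <A> b | e (empty).
SAMPLE: Grammar = {
    "<start symbol>": [["<A>"]],
    "<A>": [["a", "<A>", "b"], []],
}
```

Calling `diagnose(SAMPLE, "<start symbol>")` attaches the three messages shown in the sample output to the start symbol and none to `<A>`.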

After the coding schemes, the program will print the first set of each nonterminal. Finally, if the user selects to have a parser constructed, the program will construct it and print the appropriate parsing tables. The output of the parser will be printed by states, where each state will contain its core (items) and non-error action and goto values.
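For readers unfamiliar with first sets, the usual fixed-point computation is sketched below in Python. It is a textbook formulation, not necessarily how the program computes or prints them. The grammar encoding, the name `first_sets`, and the use of the empty string `EPSILON` to mark a nonterminal that can derive the empty string are assumptions made for the illustration; symbols without productions are simply treated as terminals.

```python
from typing import Dict, List, Set

EPSILON = ""  # hypothetical marker: the nonterminal can derive the empty string


def first_sets(grammar: Dict[str, List[List[str]]]) -> Dict[str, Set[str]]:
    """Compute FIRST for every nonterminal by iterating to a fixed point."""
    first: Dict[str, Set[str]] = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                before = len(first[nt])
                all_nullable = True  # stays True while every symbol so far can vanish
                for sym in rhs:
                    if sym in grammar:
                        # Nonterminal: contribute its terminals, continue only if nullable.
                        first[nt] |= first[sym] - {EPSILON}
                        if EPSILON not in first[sym]:
                            all_nullable = False
                            break
                    else:
                        # Terminal: it begins every string derived from this point.
                        first[nt].add(sym)
                        all_nullable = False
                        break
                if all_nullable:
                    first[nt].add(EPSILON)
                changed |= len(first[nt]) != before
    return first


# The sample grammar: <start symbol> -> <A>, <A> -> a <A> b | e (empty).
SAMPLE = {
    "<start symbol>": [["<A>"]],
    "<A>": [["a", "<A>", "b"], []],
}
```

For the sample grammar both nonterminals receive the set containing `a` and the empty-string marker, since `<A>` can derive the empty string and `<start symbol>` derives exactly what `<A>` does.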

For example, using the input grammar used above, and if the user chose to build a strong compatible LR parsing table, the parse tables printed would be as follows:

    STRONG COMPATIBLE LR(1) CHARACTERISTIC TABLE

    STATE : 1
    .......................
    1) <start symbol> -> . <A>
       LOOKAHEADS : $EOF$
    ......................
    TABLE ENTRIES:
       $EOF$   REDUCE BY 3
       a       SHIFT TO 3
       <A>     GO TO 2
    .......................

    STATE : 2
    .......................
    1) <start symbol> -> <A> .
       LOOKAHEADS : $EOF$
    ......................
    TABLE ENTRIES:
       $EOF$   REDUCE BY 1
    .......................

    STATE : 3
    .......................
    2) <A> -> a . <A> b
       LOOKAHEADS : $EOF$ b
    ......................
    TABLE ENTRIES:
       a       SHIFT TO 3
       b       REDUCE BY 3
       <A>     GO TO 4
    .......................

    STATE : 4
    .......................
    2) <A> -> a <A> . b
       LOOKAHEADS : $EOF$ b
    ......................
    TABLE ENTRIES:
       b       SHIFT TO 5
    .......................

    STATE : 5
    .......................
    2) <A> -> a <A> b .
       LOOKAHEADS : $EOF$ b
    ......................
    TABLE ENTRIES:
       $EOF$   REDUCE BY 2
       b       REDUCE BY 2
    .......................
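To show how such a listing drives a parse, here is a minimal Python sketch of a table-driven recognizer using the entries of the five states above. It is an illustration only. The dictionary layout and the names `ACTION`, `GOTO`, `PRODUCTIONS`, and `parse` are hypothetical, the goto entries are assumed to be on `<A>`, a reduction by production 1 is treated as acceptance, and no error recovery is attempted.

```python
from typing import Dict, List, Tuple

EOF = "$EOF$"

# Non-error action entries per state: ("shift", next state) or ("reduce", production).
ACTION: Dict[int, Dict[str, Tuple[str, int]]] = {
    1: {EOF: ("reduce", 3), "a": ("shift", 3)},
    2: {EOF: ("reduce", 1)},
    3: {"a": ("shift", 3), "b": ("reduce", 3)},
    4: {"b": ("shift", 5)},
    5: {EOF: ("reduce", 2), "b": ("reduce", 2)},
}

# Goto entries; the symbol <A> is an assumption (it is the only candidate).
GOTO: Dict[int, Dict[str, int]] = {1: {"<A>": 2}, 3: {"<A>": 4}}

# Production number -> (left hand side, length of right hand side).
PRODUCTIONS: Dict[int, Tuple[str, int]] = {
    1: ("<start symbol>", 1),  # <start symbol> -> <A>
    2: ("<A>", 3),             # <A> -> a <A> b
    3: ("<A>", 0),             # <A> -> e (empty)
}


def parse(tokens: List[str]) -> bool:
    """Return True if the token list is accepted, False on the first error entry."""
    stack = [1]                      # path of states, start state on the bottom
    stream = list(tokens) + [EOF]
    pos = 0
    while True:
        entry = ACTION[stack[-1]].get(stream[pos])
        if entry is None:            # no non-error entry for this terminal
            return False
        kind, value = entry
        if kind == "shift":
            stack.append(value)
            pos += 1
            continue
        if value == 1:               # reducing the start production: accept
            return True
        lhs, length = PRODUCTIONS[value]
        if length:                   # guard: stack[-0:] would denote the whole stack
            del stack[-length:]
        stack.append(GOTO[stack[-1]][lhs])
```

With these tables `parse(["a", "a", "b", "b"])` succeeds, while `parse(["a", "b", "b"])` fails in state 2, which has no entry for `b`.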

Appendix A

Sample PASCAL skeleton for use of SLR(1) parsing tables

    program doparse(table, {any other files used by program} );

    const
      numberstates      = x;  {x = number of actual parse states}
      parsetablesize    = y;  {y = actual size of array parsetable}
      numberproductions = z;  {z = actual number of productions}
      errorvalue        = n;  {n = value not in set of labels}

    type
      {the path will be represented as a stack using a linear list}
      parsestack = ^stacknode;
      stacknode = record
                    topstate : integer;
                    next : parsestack
                  end;

    var
      table : file of integer;  {file containing parsing tables}

    function push(stack : parsestack; newstate : integer) : parsestack;
    {returns stack with new state added in front}
    var
      temporary : parsestack;
    begin
      new(temporary);
      with temporary^ do
        begin
          topstate := newstate;
          next := stack
        end;
      push := temporary
    end;

    function pop(stack : parsestack) : parsestack;
    {removes the top element of the stack}
    begin
      pop := stack^.next;
      dispose(stack)
    end;

    function top(stack : parsestack) : integer;
    {returns state on top of stack}
    begin
      top := stack^.topstate
    end;

    function empty(stack : parsestack) : parsestack;
    {returns an empty stack}
    begin
      while stack <> nil do
        stack := pop(stack);
      empty := nil
    end;

    function gettoken : integer;
    {This routine returns the label of the next terminal occurring in the
     input file}
    begin
    end;

    procedure semantics(stack : parsestack; production : integer);
    {does any semantic routines associated with reducing the given production}
    begin
    end;

    procedure errormessages(state, symbol : integer);
    {prints out message corresponding to error value for state and symbol}
    begin
    end;

    function parse : boolean;
    {parses input. returns true if no parsing errors are found in parsing
     the input}
    const
      eoftoken = 0;

    type
      {representation of an entry in parsetable}
      tableentry = record
                     symbol, value : integer
                   end;
      {representation of a reference to a group of entries in parsetable}
      stateentry = record
                     startposition, size : integer
                   end;
      {representation of a production in productionlist}
      productionentry = record
                          lhssymbol, rhslength : integer
                        end;

    var
      parsetable : array [1 .. parsetablesize] of tableentry;
      actionlist, gotolist : array [1 .. numberstates] of stateentry;
      productionlist : array [1 .. numberproductions] of productionentry;
      {other parameters passed with parsing tables}
      topstate,                   {actual number of parse states}
      parsestart,                 {start state}
      errorstart,                 {forced shift state on error recovery}
      errorcontinue,              {init-error state}
      topoftable,                 {actual size of parsetable}
      productioncount : integer;  {actual number of productions}
      {local variables}
      token : integer;            {next terminal from input}
      value : integer;            {next action to take in parsing input}
      stop : boolean;             {true when have parsed whole input}
      parseerror : boolean;       {true if any parsing errors}
      stack : parsestack;         {holds path}

      procedure getparsetable;
      {reads in parsing tables}
      var
        index : integer;

        procedure getin(var invalue : integer);
        {reads in next integer from file table}
        begin
          invalue := table^;
          get(table)
        end;

      begin
        reset(table);
        getin(topstate);
        getin(parsestart);
        getin(errorstart);
        getin(errorcontinue);
        getin(topoftable);
        getin(productioncount);
        for index := 1 to topstate do
          begin
            with actionlist[index] do
              begin
                getin(startposition);
                getin(size)
              end;
            with gotolist[index] do
              begin
                getin(startposition);
                getin(size)
              end
          end;
        for index := 1 to topoftable do
          with parsetable[index] do
            begin
              getin(symbol);
              getin(value)
            end;
        for index := 1 to productioncount do
          with productionlist[index] do
            begin
              getin(rhslength);
              getin(lhssymbol)
            end
      end;

      function clear(stack : parsestack; newbottom : integer) : parsestack;
      {empties stack and puts value on bottom of stack}
      begin
        clear := push(empty(stack), newbottom)
      end;

      function popelements(stack : parsestack; amount : integer) : parsestack;
      {takes the requested amount of states off the stack}
      begin
        if (amount = 0) or (stack = nil)
          then popelements := stack
          else popelements := popelements(pop(stack), pred(amount))
      end;

      function popoffproduction(stack : parsestack; count : integer) : parsestack;
      {takes the requested amount of states off the stack, but if stack
       underflow occurs, it resets the bottom state}
      begin
        stack := popelements(stack, count);
        if stack = nil
          then popoffproduction := push(stack, errorcontinue)
          else popoffproduction := stack
      end;