Syntax Analysis. Parser. Grammars CS2210

Author: Reynold Elliott
Syntax Analysis CS2210 Lecture 4

CS2210 Compiler Design 2004/05

Parser

[Figure: source -> lexical analyzer -> token -> parser -> parse tree -> rest of frontend -> IR. The parser requests each token from the lexical analyzer ("get next token"); a symbol table is shown alongside these phases.]

Parsing = determining whether a string of tokens can be generated by a grammar

Grammars

■ Precise, easy-to-understand description of syntax
■ Context-free grammars -> efficient parsers (automatically!)
■ Help in translation and error detection
  ■ E.g. attribute grammars
■ Easier language evolution
  ■ Can add new constructs systematically


Syntax Errors

■ Many errors are syntactic or exposed by parsing
  ■ E.g. unbalanced ()
■ Error handling goals:
  ■ Report errors quickly & accurately
  ■ Recover quickly (continue parsing after error)
  ■ Little overhead on parse time

Error Recovery

■ Panic mode
  ■ Discard tokens until a synchronization token is found (often ‘;’)
■ Phrase level
  ■ Local correction: replace a token by another and continue
■ Error productions
  ■ Encode commonly expected errors in the grammar
■ Global correction
  ■ Find the closest input string that is in L(G)
  ■ Too costly in practice
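Panic-mode recovery can be sketched in a few lines. The Python fragment below is only an illustration and not part of the original slides; the function name `skip_to_sync` and the choice of `;` as the only synchronization token are assumptions.

```python
def skip_to_sync(tokens, pos, sync_tokens=frozenset({";"})):
    """Panic-mode recovery: discard tokens from pos up to a synchronizing token.

    Returns the index just past the synchronizing token, or len(tokens)
    if no synchronizing token is left in the input.
    """
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        pos += 1
    return min(pos + 1, len(tokens))


# Hypothetical token stream with an error somewhere after index 2.
TOKENS = ["x", "=", "(", "1", ";", "y", "=", "2", ";"]
RESUME_AT = skip_to_sync(TOKENS, 2)
print("resume parsing at token", RESUME_AT, "=", TOKENS[RESUME_AT])
```

A parser would call such a helper from its error routine and then continue with the next statement.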

Context-free Grammars

■ Precise and easy way to specify the syntactic structure of a programming language
■ Efficient recognition methods exist
■ Natural specification of many “recursive” constructs:
  ■ expr -> expr + expr | term


Context-free Grammar Definition

■ Terminals T
  ■ Symbols which form strings of L(G), G a CFG (= tokens in the scanner), e.g. if, else, id
■ Nonterminals N
  ■ Syntactic variables denoting sets of strings of L(G)
  ■ Impose hierarchical structure (e.g., precedence rules)
■ Start symbol S (∈ N)
  ■ Denotes the set of strings of L(G)
■ Productions P
  ■ Rules that determine how strings are formed
  ■ N -> (N|T)*

Example: Expression Grammar

expr -> expr op expr
expr -> (expr)
expr -> - expr
expr -> id
op -> +
op -> -
op -> *
op -> /
op -> ^

■ Terminals: {id, +, -, *, /, ^, (, )}
■ Nonterminals: {expr, op}
■ Start symbol: expr
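One way to make the four components of a grammar concrete is to write the expression grammar down as data. The sketch below assumes a dictionary representation (the names `PRODUCTIONS`, `START`, `terminals`, and `nonterminals` are illustrative choices, not from the slides) and derives the terminal set as every symbol that never appears on a left-hand side, which for this grammar also picks up the two parentheses.

```python
# Each nonterminal maps to its list of right-hand sides;
# each right-hand side is a list of grammar symbols.
PRODUCTIONS = {
    "expr": [["expr", "op", "expr"], ["(", "expr", ")"], ["-", "expr"], ["id"]],
    "op": [["+"], ["-"], ["*"], ["/"], ["^"]],
}
START = "expr"


def nonterminals(productions):
    """Nonterminals are exactly the symbols that have productions."""
    return set(productions)


def terminals(productions):
    """Terminals are the symbols used in right-hand sides without productions."""
    used = {sym for rhss in productions.values() for rhs in rhss for sym in rhs}
    return used - set(productions)


print("N =", sorted(nonterminals(PRODUCTIONS)))
print("T =", sorted(terminals(PRODUCTIONS)))
print("S =", START)
```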

Notational Conventions

■ Terminals
  ■ a, b, c, ..
  ■ +, -, ..
  ■ ‘,’ ‘.’ ‘;’ etc.
  ■ 0..9
  ■ expr or
■ Nonterminals
  ■ A, B, C, ..
  ■ S start symbol (if present) or first nonterminal in production list
■ Terminal strings
  ■ u, v, ..
■ Grammar symbol strings
  ■ α, β
■ Productions
  ■ A -> α


Shorthands & Derivations

E -> E + E | E * E | (E) | - E | id

■ E => -E: “E derives -E”
■ => derives in 1 step
■ =>* derives in n (0..) steps

More Definitions

■ L(G): language generated by G = set of strings derived from S
■ S =>+ w: w is a sentence of G (w is a string of terminals)
■ S =>* α: α is a sentential form of G (the string can contain nonterminals)
■ G and G’ are equivalent :⇔ L(G) = L(G’)
■ A language generated by a grammar (of the form shown) is called a context-free language

Example

G = ({+, -, *, (, ), id}, {E}, E, {E -> E + E, E -> E * E, E -> (E), E -> - E, E -> id})

Sentence: -(id + id)
Derivation: E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)

■ Leftmost derivation, i.e. always replace the leftmost nonterminal
■ Rightmost derivation analogously
■ Left/right sentential form
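The leftmost derivation can be replayed mechanically. The following Python sketch assumes the operand terminal is `id` (the standard form of this example) and numbers the productions for convenience; `leftmost_derivation` and the production indices are illustrative names, not something the slides define.

```python
# Productions of the example grammar, numbered by list position.
PRODUCTIONS = [
    ("E", ["E", "+", "E"]),  # 0
    ("E", ["E", "*", "E"]),  # 1
    ("E", ["(", "E", ")"]),  # 2
    ("E", ["-", "E"]),       # 3
    ("E", ["id"]),           # 4
]
NONTERMINALS = {"E"}


def leftmost_derivation(start, choices):
    """Apply the chosen productions, always rewriting the leftmost nonterminal."""
    form = [start]
    steps = [list(form)]
    for index in choices:
        lhs, rhs = PRODUCTIONS[index]
        # Position of the leftmost nonterminal in the current sentential form.
        pos = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
        if form[pos] != lhs:
            raise ValueError(f"production {index} does not apply to {form[pos]}")
        form[pos:pos + 1] = rhs
        steps.append(list(form))
    return steps


STEPS = leftmost_derivation("E", [3, 2, 0, 4, 4])
print(" => ".join(" ".join(step) for step in STEPS))
```

Every intermediate list in `STEPS` is a left sentential form; the last one contains only terminals and is therefore a sentence.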


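Parse Trees

Parse tree = graphical representation of a derivation ignoring replacement order

E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)

[Figure: parse tree for -(id + id). The root E has children - and E; that E has children (, E, ); the inner E has children E, +, E; each of these two E nodes has the single child id.]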

Ambiguous Grammars

■ >= 2 different parse trees for some sentence ⇔ >= 2 leftmost/rightmost derivations
■ Usually want to have unambiguous grammars
  ■ E.g. want just one evaluation order: id + id * id to be parsed as id + (id * id), not (id + id) * id
■ To keep grammars simple, accept ambiguity and resolve separately (outside of grammar)
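Ambiguity can be observed by counting parse trees. The sketch below is an illustration under assumptions: it hard-codes the grammar E -> E + E | E * E | id (with `id` as the assumed operand terminal) and counts trees by trying each operator as the root; `count_parse_trees` is a made-up helper name.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways to derive tokens[lo:hi] from E.
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        # Each operator strictly inside the span can be the root of a tree.
        for mid in range(lo + 1, hi - 1):
            if tokens[mid] in ("+", "*"):
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["id", "+", "id", "*", "id"]))
```

A count greater than one for any sentence shows that the grammar is ambiguous; the two trees for id + id * id correspond to the two evaluation orders mentioned above.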

Expressive Power

■ CFGs are more powerful than REs
  ■ Can express matching () with CFGs
  ■ Can express most properties desired for programming languages
■ CFGs cannot express:
  ■ Identifiers declared before used: L = {wcw | w is in (a|b)*}
  ■ Parameter checking (#formals = #actuals): L = {a^n b^m c^n d^m | n ≥ 1, m ≥ 1}


Eliminating Ambiguity (1)

Grammar
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other
is ambiguous.

Sentence: if E1 then if E2 then S1 else S2

Derivation 1 (the else belongs to the inner if):
stmt => if expr then stmt
     => if E1 then stmt
     => if E1 then if expr then stmt else stmt
     => if E1 then if E2 then stmt else stmt
     => if E1 then if E2 then S1 else stmt
     => if E1 then if E2 then S1 else S2

Derivation 2 (the else belongs to the outer if):
stmt => if expr then stmt else stmt
     => if E1 then stmt else stmt
     => if E1 then if expr then stmt else stmt
     => if E1 then if E2 then stmt else stmt
     => if E1 then if E2 then S1 else stmt
     => if E1 then if E2 then S1 else S2

Which one do we prefer?

Eliminating Ambiguity (2)

Grammar
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other
is ambiguous.

Sentence: if E1 then if E2 then S1 else S2

Unambiguous grammar (each else is matched with the closest unmatched then):
stmt -> matched_stmt | unmatched_stmt
matched_stmt -> if expr then matched_stmt else matched_stmt | other
unmatched_stmt -> if expr then stmt | if expr then matched_stmt else unmatched_stmt

Left Recursion

■ If for grammar G there is a derivation A =>+ Aα for some string α, then G is left recursive
■ Example:
  S -> Aa | b
  A -> Ac | Sd | ε
  A is immediately left recursive (A -> Ac), and S is left recursive through A: S => Aa => Sda
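Whether a grammar is left recursive can be checked automatically. The Python sketch below is one possible approach, not taken from the slides: it computes which nonterminals can appear leftmost after one or more derivation steps (skipping over nonterminals that can derive ε) and reports those that can reach themselves. The dictionary encoding, with `[]` standing for ε, and all function names are assumptions.

```python
def nullable_nonterminals(grammar):
    """Nonterminals that can derive the empty string ([] encodes ε)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            if lhs not in nullable and any(
                all(sym in nullable for sym in rhs) for rhs in rhss
            ):
                nullable.add(lhs)
                changed = True
    return nullable


def left_recursive_nonterminals(grammar):
    """Nonterminals A for which a derivation A =>+ A alpha exists."""
    nullable = nullable_nonterminals(grammar)
    # starts[A]: nonterminals that can appear leftmost in a string derived from A.
    starts = {lhs: set() for lhs in grammar}
    for lhs, rhss in grammar.items():
        for rhs in rhss:
            for sym in rhs:
                if sym in grammar:
                    starts[lhs].add(sym)
                if sym not in nullable:
                    break  # later symbols cannot become leftmost
    # Transitive closure of the "can start with" relation.
    changed = True
    while changed:
        changed = False
        for lhs in grammar:
            for mid in list(starts[lhs]):
                new = starts[mid] - starts[lhs]
                if new:
                    starts[lhs] |= new
                    changed = True
    return {lhs for lhs in grammar if lhs in starts[lhs]}


GRAMMAR = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
print(sorted(left_recursive_nonterminals(GRAMMAR)))
```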


Parsing

■ = determining whether a string of tokens can be generated by a grammar
■ Two classes based on the order in which the parse tree is constructed:
  ■ Top-down parsing
    ■ Start construction at the root of the parse tree
  ■ Bottom-up parsing
    ■ Start at the leaves and proceed to the root

Recursive Descent Parsing

■ A top-down method based on recursive procedures (typically one for each nonterminal)
  ■ May have to backtrack when the wrong production was picked
■ Predictive parsing = a recursive descent parsing approach that avoids backtracking
  ■ More efficient
  ■ Uses (limited) lookahead to decide what productions to use

Predictive Parser

■ Program with a (parsing) procedure for each nonterminal which
  ■ Decides what production to use (based on lookahead in the input)
  ■ Uses a production by mimicking the right side


Predictive Parser Example

type -> simple | ^id | array [simple] of type
simple -> integer | char | num dotdot num

procedure match(t: token);
begin
  if lookahead = t then lookahead := nexttoken
  else error
end;

procedure type;
begin
  if lookahead is in {integer, char, num} then simple
  else if lookahead = ‘^’ then begin
    match(‘^’); match(id)
  end
  else if lookahead = array then begin
    match(array); match(‘[’); simple; match(‘]’); match(of); type
  end
  else error
end;
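The pseudocode above can be turned into a small runnable program. The following Python version is a sketch under assumptions: tokens are plain strings, `$` is used as an end marker, and the procedure for `simple` (not shown on the slide) is written by analogy with the one for `type`. The names `TypeParser`, `ParseError`, and `accepts` are illustrative.

```python
class ParseError(Exception):
    """Raised when the lookahead token does not fit any production."""


class TypeParser:
    """Predictive parser for
         type   -> simple | ^ id | array [ simple ] of type
         simple -> integer | char | num dotdot num
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" is an assumed end marker
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.lookahead != expected:
            raise ParseError(f"expected {expected!r}, found {self.lookahead!r}")
        self.pos += 1

    def parse_type(self):
        # The lookahead token alone selects the production.
        if self.lookahead in ("integer", "char", "num"):
            self.parse_simple()
        elif self.lookahead == "^":
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.parse_simple()
            self.match("]")
            self.match("of")
            self.parse_type()
        else:
            raise ParseError(f"unexpected token {self.lookahead!r}")

    def parse_simple(self):
        if self.lookahead == "integer":
            self.match("integer")
        elif self.lookahead == "char":
            self.match("char")
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise ParseError(f"unexpected token {self.lookahead!r}")


def accepts(tokens):
    """True if the token list is a complete sentence derived from type."""
    parser = TypeParser(tokens)
    try:
        parser.parse_type()
        parser.match("$")
        return True
    except ParseError:
        return False


print(accepts("array [ num dotdot num ] of integer".split()))
```

Each method mirrors the right-hand side of the chosen production: terminals become `match` calls and nonterminals become calls to the corresponding method.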

Predictive Parsing Obstacles

■ expr -> expr + term
  ■ expr; match(‘+’); term;
  ■ Infinite recursion (left recursion)
■ stmt -> if expr then stmt else stmt | if expr then stmt
  ■ Common prefix
    ■ Can’t predict production
■ Solution
  ■ Eliminate left recursion
  ■ Left factoring

Eliminating Left Recursion (1)

■ Simple case: immediate left recursion
  Replace
    A -> A α | β
  with
    A -> β A’
    A’ -> α A’ | ε
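The replacement rule generalizes directly to several α and β alternatives. The sketch below shows this in Python under assumptions: right-hand sides are lists of symbols, `[]` stands for ε, and the new nonterminal is named by appending a prime, which presumes that name is not already in use. The function name is illustrative.

```python
def eliminate_immediate_left_recursion(lhs, alternatives):
    """Return productions for lhs without immediate left recursion.

    alternatives is a list of right-hand sides (lists of symbols); [] is ε.
    """
    # A -> A alpha alternatives contribute their alpha part.
    alphas = [rhs[1:] for rhs in alternatives if rhs[:1] == [lhs]]
    # All other alternatives are the beta parts.
    betas = [rhs for rhs in alternatives if rhs[:1] != [lhs]]
    if not alphas:
        return {lhs: alternatives}
    new = lhs + "'"  # assumed to be an unused name
    return {
        lhs: [beta + [new] for beta in betas],
        new: [alpha + [new] for alpha in alphas] + [[]],
    }


RESULT = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
for head, rhss in RESULT.items():
    print(head, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhss))
```

For E -> E + T | T this yields E -> T E' and E' -> + T E' | ε, matching the pattern on the slide.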


Eliminating Left Recursion (2)

Order the nonterminals A1 .. An
for i := 1 to n do begin
  for j := 1 to i-1 do begin
    replace each production of the form Ai -> Ajγ
    by the productions Ai -> δ1γ | δ2γ | … | δkγ
    where Aj -> δ1 | δ2 | … | δk are all current Aj productions
  end
  eliminate immediate left recursion among the Ai productions
end

Example Eliminating Left Recursion

S -> Aa | b
A -> Ac | Sd | ε
Order: S, A

■ i=1: S has no immediate left recursion, so nothing changes
■ i=2, j=1: eliminate A -> Sγ
  Replace A -> Sd with A -> Aad | bd, giving A -> Ac | Aad | bd | ε
■ Eliminate immediate left recursion:
  S -> Aa | b
  A -> bdA’ | A’
  A’ -> cA’ | adA’ | ε
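The worked example can be reproduced with a direct transcription of the algorithm. This Python sketch makes a few assumptions: grammars are dictionaries from nonterminals to lists of right-hand sides, `[]` stands for ε, and new nonterminals get a prime appended. In general the algorithm is only guaranteed to work for grammars without cycles and ε-productions; it happens to succeed on this example.

```python
def eliminate_left_recursion(grammar, order):
    """Eliminate left recursion following the given nonterminal order."""
    g = {lhs: [list(rhs) for rhs in rhss] for lhs, rhss in grammar.items()}
    for i, a_i in enumerate(order):
        for a_j in order[:i]:
            expanded = []
            for rhs in g[a_i]:
                if rhs[:1] == [a_j]:
                    # Ai -> Aj γ becomes Ai -> δ γ for every current Aj -> δ.
                    expanded.extend(delta + rhs[1:] for delta in g[a_j])
                else:
                    expanded.append(rhs)
            g[a_i] = expanded
        # Eliminate immediate left recursion among the Ai productions.
        alphas = [rhs[1:] for rhs in g[a_i] if rhs[:1] == [a_i]]
        if alphas:
            betas = [rhs for rhs in g[a_i] if rhs[:1] != [a_i]]
            new = a_i + "'"  # assumed to be an unused name
            g[a_i] = [beta + [new] for beta in betas]
            g[new] = [alpha + [new] for alpha in alphas] + [[]]
    return g


GRAMMAR = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
RESULT = eliminate_left_recursion(GRAMMAR, ["S", "A"])
for head, rhss in RESULT.items():
    print(head, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhss))
```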

Left Factoring

■ Find the longest common prefix and move the differing remainders into a new nonterminal
  ■ stmt -> if expr then stmt stmt’
  ■ stmt’ -> else stmt | ε
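A single left-factoring step can be sketched as follows. The Python code is an illustration under assumptions: right-hand sides are lists of symbols, `[]` stands for ε, the new nonterminal is the old name with a prime appended, and only one factoring step is performed (a full implementation would repeat until no two alternatives share a prefix). The added alternative `other` is borrowed from the earlier if-then-else grammar.

```python
def common_prefix(a, b):
    """Longest common prefix of two symbol lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


def left_factor_once(lhs, alternatives):
    """Factor out the longest prefix shared by at least two alternatives."""
    best = []
    for i, a in enumerate(alternatives):
        for b in alternatives[i + 1:]:
            prefix = common_prefix(a, b)
            if len(prefix) > len(best):
                best = prefix
    if not best:
        return {lhs: alternatives}
    n = len(best)
    new = lhs + "'"  # assumed to be an unused name
    remainders = [rhs[n:] for rhs in alternatives if rhs[:n] == best]
    untouched = [rhs for rhs in alternatives if rhs[:n] != best]
    return {lhs: [best + [new]] + untouched, new: remainders}


FACTORED = left_factor_once(
    "stmt",
    [
        ["if", "expr", "then", "stmt", "else", "stmt"],
        ["if", "expr", "then", "stmt"],
        ["other"],
    ],
)
for head, rhss in FACTORED.items():
    print(head, "->", " | ".join(" ".join(rhs) or "ε" for rhs in rhss))
```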


Transition Diagrams

■ Create an initial and a final state
■ For each production A -> X1X2…Xn create a path from the initial to the final state, with edges labeled X1, X2, …, Xn

[Figure: transition diagram for E; the surviving labels are the states 0, 3, and 6 and the edge labels T, +, and ε.]

Non-recursive Predictive Parsers

■ Avoid recursion for efficiency reasons
■ Typically built automatically by tools

[Figure: model of a table-driven predictive parser. The predictive parsing program reads the input (a + b $), manipulates a stack (X Y Z $), consults the parsing table M, and produces the output.]

M[A,a] gives the production to use, where A is the symbol on top of the stack and a is the current input symbol (or $)

Parsing Algorithm

X: symbol on top of the stack, a: current input symbol

Stack contents and remaining input are called the parser configuration (initially $S on the stack and the complete input string)

1. If X = a = $, halt and announce success
2. If X = a ≠ $, pop X off the stack and advance the input to the next symbol
3. If X is a nonterminal, use M[X,a], which contains a production X -> rhs or error:
   replace X on the stack with rhs or call the error routine, respectively;
   e.g. for X -> UVW replace X with WVU (U on top);
   output the production (or augment the parse tree)
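The three steps translate almost literally into a loop over parser configurations. The sketch below is an assumption-laden illustration: the table is written out by hand for the expression grammar E -> TE’, E’ -> +TE’ | ε, T -> FT’, T’ -> *FT’ | ε, F -> (E) | id that appears in the example later in these slides, missing table entries stand for error, and the names `TABLE` and `ll1_parse` are made up.

```python
# Hand-written parsing table M; a missing key means "error".
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def ll1_parse(tokens, start="E"):
    """Return the list of productions used, or raise SyntaxError."""
    remaining = list(tokens) + ["$"]
    stack = ["$", start]  # the top of the stack is the end of the list
    pos = 0
    output = []
    while True:
        x, a = stack[-1], remaining[pos]
        if x == "$" and a == "$":
            return output  # step 1: success
        if x in NONTERMINALS:
            rhs = TABLE.get((x, a))  # step 3: consult M[X, a]
            if rhs is None:
                raise SyntaxError(f"no entry M[{x}, {a}]")
            stack.pop()
            stack.extend(reversed(rhs))  # first rhs symbol ends up on top
            output.append((x, rhs))
        elif x == a:
            stack.pop()  # step 2: match a terminal
            pos += 1
        else:
            raise SyntaxError(f"expected {x}, found {a}")


for lhs, rhs in ll1_parse(["id", "+", "id", "*", "id"]):
    print(lhs, "->", " ".join(rhs) or "ε")
```

The printed productions form a leftmost derivation of the input, which is what the "output the production" step of the algorithm produces.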


Construction of Parsing Table Helpers (1)

■ First(α) := set of terminals that begin strings derived from α
  ■ First(X) = {X} for terminal X
  ■ If X -> ε is a production, add ε to First(X)
  ■ For X -> Y1…Yk, place a in First(X) if a is in First(Yi) and ε ∈ First(Yj) for j = 1…i-1; if ε ∈ First(Yj) for all j = 1…k, add ε to First(X)
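The three rules can be applied repeatedly until no set grows any more. The following Python sketch does exactly that; the dictionary encoding of the grammar, the use of `[]` for ε-productions, and the function names are assumptions, and the grammar used is the expression grammar from the example later in the slides.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
TERMINALS = {"+", "*", "(", ")", "id"}


def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols, given FIRST of single symbols."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)  # every symbol can vanish, or the string is empty
    return result


def compute_first(grammar, terminals):
    """Iterate the FIRST rules until a fixed point is reached."""
    first = {t: {t} for t in terminals}
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_string(rhs, first)
                if not new <= first[lhs]:
                    first[lhs] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR, TERMINALS)
for nt in GRAMMAR:
    print(f"FIRST({nt}) =", sorted(FIRST[nt]))
```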

Construction of Parsing Table Helpers (2)

■ Follow(A) := set of terminals a that can appear immediately to the right of A in some sentential form, i.e., S =>* αAaβ for some α, β (a can include $)
  ■ Place $ in Follow(S), S start symbol, $ right end marker
  ■ If there is a production A -> αBβ, put everything in First(β) except ε in Follow(B)
  ■ If there is a production A -> αB, or A -> αBβ where ε is in First(β), then everything in Follow(A) is in Follow(B)
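The FOLLOW rules can be iterated to a fixed point in the same way as the FIRST rules. To keep the sketch short, the Python code below takes the FIRST sets as given (they are the ones listed in the example later in the slides) rather than computing them; the encoding and all names are assumptions.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
# FIRST sets, taken as given; terminals map to themselves.
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
}


def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result


def compute_follow(grammar, start, first):
    """Iterate the FOLLOW rules until a fixed point is reached."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")  # rule 1: end marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, rhss in grammar.items():
            for rhs in rhss:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # FOLLOW is only defined for nonterminals
                    beta_first = first_of_string(rhs[i + 1:], first)
                    new = beta_first - {EPS}  # rule 2
                    if EPS in beta_first:
                        new |= follow[lhs]  # rule 3
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "E", FIRST)
for nt in GRAMMAR:
    print(f"FOLLOW({nt}) =", sorted(FOLLOW[nt]))
```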

Construction Algorithm

Input: Grammar G
Output: Parsing table M

For each production A -> α do
  ■ For each terminal a in FIRST(α), add A -> α to M[A, a]
  ■ If ε is in FIRST(α), add A -> α to M[A, b] for each terminal b in FOLLOW(A) ($ counts as a terminal in this step)
Make each undefined entry in M an error entry


Example

E -> TE’
E’ -> +TE’ | ε
T -> FT’
T’ -> *FT’ | ε
F -> (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E’) = {+, ε}
FIRST(T’) = {*, ε}
FOLLOW(E) = FOLLOW(E’) = {), $}
FOLLOW(T) = FOLLOW(T’) = {+, ), $}
FOLLOW(F) = {+, *, ), $}

Parsing table M: rows E, E’, T, T’, F; columns id, +, *, (, ), $
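The table for this example can be filled in by applying the construction algorithm to the FIRST and FOLLOW sets listed above. The Python sketch below does so; the data layout (a dictionary keyed by nonterminal and input symbol, each entry holding a list of right-hand sides, `[]` for ε) and the names are assumptions chosen for illustration.

```python
EPS = "ε"

PRODUCTIONS = [
    ("E", ["T", "E'"]),
    ("E'", ["+", "T", "E'"]), ("E'", []),
    ("T", ["F", "T'"]),
    ("T'", ["*", "F", "T'"]), ("T'", []),
    ("F", ["(", "E", ")"]), ("F", ["id"]),
]
# FIRST and FOLLOW sets as listed in the example; terminals map to themselves.
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
    "+": {"+"}, "*": {"*"}, "(": {"("}, ")": {")"}, "id": {"id"},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result


def build_table(productions, first, follow):
    """Map (nonterminal, input symbol) to the list of applicable right-hand sides."""
    table = {}
    for lhs, rhs in productions:
        rhs_first = first_of_string(rhs, first)
        targets = rhs_first - {EPS}
        if EPS in rhs_first:
            targets |= follow[lhs]
        for a in targets:
            table.setdefault((lhs, a), []).append(rhs)
    return table  # absent keys are the error entries


TABLE = build_table(PRODUCTIONS, FIRST, FOLLOW)
for (nt, a), rhss in sorted(TABLE.items()):
    print(f"M[{nt}, {a}] =", " | ".join(f"{nt} -> {' '.join(r) or EPS}" for r in rhss))
```

Keeping a list per entry makes conflicts visible: an entry with more than one right-hand side would be a multiply defined entry.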

LL(1)

■ A grammar whose parsing table has no multiply defined entries is said to be LL(1)
  ■ First L = left-to-right input scanning
  ■ Second L = leftmost derivation
  ■ (1) = 1 token lookahead
■ Not all grammars can be brought to LL(1) form, i.e., there are languages that do not fall into the LL(1) class
