Syntax Analysis CS2210 Lecture 4
CS2210 Compiler Design 2004/05
Parser
[Figure: source program -> lexical analyzer -> token -> parser -> parse tree -> rest of front end -> IR; the parser asks the lexical analyzer to "get next token"; plus the symbol table]
Parsing = determining whether a string of tokens can be generated by a grammar

Grammars
■ Precise, easy-to-understand description of syntax
■ Context-free grammars -> efficient parsers (automatically!)
■ Help in translation and error detection
  ■ E.g. attribute grammars
■ Easier language evolution
  ■ Can add new constructs systematically

Syntax Errors
■ Many errors are syntactic or exposed by parsing
  ■ E.g. unbalanced ()
■ Error handling goals:
  ■ Report errors quickly & accurately
  ■ Recover quickly (continue parsing after error)
  ■ Little overhead on parse time

Error Recovery
■ Panic mode
  ■ Discard tokens until a synchronization token is found (often ';')
■ Phrase level
  ■ Local correction: replace a token by another and continue
■ Error productions
  ■ Encode commonly expected errors in the grammar
■ Global correction
  ■ Find the closest input string that is in L(G)
  ■ Too costly in practice

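Panic mode can be sketched in a few lines. The toy statement form `id = num ;`, the token spelling, and the helper names below are assumptions made for illustration; they are not part of the lecture.

```python
SYNC_TOKENS = {";"}  # assumed synchronization set


def panic_mode_skip(tokens, pos, sync=SYNC_TOKENS):
    """Discard tokens from pos until a synchronization token is found.

    Returns the index just past that token (or len(tokens) if none is left).
    """
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1
    return min(pos + 1, len(tokens))


def parse_statements(tokens):
    """Toy parser for a sequence of statements of the form: id = num ;"""
    good, errors, pos = 0, [], 0
    while pos < len(tokens):
        if tokens[pos:pos + 4] == ["id", "=", "num", ";"]:
            good += 1
            pos += 4
        else:
            errors.append(pos)                   # report where the error was seen
            pos = panic_mode_skip(tokens, pos)   # resynchronize and continue
    return good, errors


tokens = ["id", "=", "num", ";", "id", "num", ";", "id", "=", "num", ";"]
print(parse_statements(tokens))
```

The second statement is missing its `=`; the parser records one error, skips to the next `;`, and still parses the third statement.
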
Context-free Grammars
■ Precise and easy way to specify the syntactic structure of a programming language
■ Efficient recognition methods exist
■ Natural specification of many "recursive" constructs:
  ■ expr -> expr + expr | term

Context-free Grammar Definition
■ Terminals T
  ■ Symbols which form strings of L(G), G a CFG (= tokens in the scanner), e.g. if, else, id
■ Nonterminals N
  ■ Syntactic variables denoting sets of strings of L(G)
  ■ Impose hierarchical structure (e.g., precedence rules)
■ Start symbol S (∈ N)
  ■ Denotes the set of strings of L(G)
■ Productions P
  ■ Rules that determine how strings are formed
  ■ N -> (N|T)*

Example: Expression Grammar
expr -> expr op expr
expr -> (expr)
expr -> - expr
expr -> id
op -> +
op -> -
op -> *
op -> /
op -> ^
■ Terminals: {id, +, -, *, /, ^, (, )}
■ Nonterminals: {expr, op}
■ Start symbol: expr

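One possible way to hold this grammar as data is a mapping from each nonterminal to its right sides. The dictionary layout and the function names are illustrative assumptions. Computing the terminal set from the productions shows that the parentheses in expr -> (expr) are terminals as well.

```python
# Productions of the expression grammar; each right side is a list of symbols.
PRODUCTIONS = {
    "expr": [["expr", "op", "expr"], ["(", "expr", ")"], ["-", "expr"], ["id"]],
    "op": [["+"], ["-"], ["*"], ["/"], ["^"]],
}
START = "expr"


def nonterminals(productions):
    """Nonterminals are exactly the symbols that have productions."""
    return set(productions)


def terminals(productions):
    """Terminals are all symbols on right sides that are not nonterminals."""
    symbols = {s for rhss in productions.values() for rhs in rhss for s in rhs}
    return symbols - set(productions)


print(sorted(nonterminals(PRODUCTIONS)))
print(sorted(terminals(PRODUCTIONS)))
```
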
Notational Conventions
■ Terminals
  ■ a, b, c, ..
  ■ +, -, ..
  ■ ',', '.', ';' etc.
  ■ 0..9
  ■ expr or
■ Nonterminals
  ■ A, B, C, ..
  ■ S start symbol (if present) or first nonterminal in production list
■ Terminal strings
  ■ u, v, ..
■ Grammar symbol strings
  ■ α, β
■ Productions
  ■ A -> α

Shorthands & Derivations
E -> E + E | E * E | (E) | - E | id
■ E => -E: "E derives -E"
■ => derives in 1 step
■ =>* derives in n (0..) steps

More Definitions
■ L(G): language generated by G = set of strings of terminals derived from S
■ S =>+ w: w is a sentence of G (w a string of terminals)
■ S =>* α: α is a sentential form of G (the string can contain nonterminals)
■ G and G' are equivalent :⇔ L(G) = L(G')
■ A language generated by a grammar (of the form shown) is called a context-free language

Example
G = ({-, +, *, (, ), id}, {E}, E, {E -> E + E, E -> E * E, E -> (E), E -> - E, E -> id})
Sentence: -(id + id)
Derivation: E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
• Leftmost derivation, i.e. always replace the leftmost nonterminal
• Rightmost derivation analogously
• Left/right sentential form

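The leftmost derivation of -(id+id) can be replayed mechanically. In this sketch the grammar encoding and the idea of naming each step by a production index are assumptions chosen for illustration.

```python
# E -> E + E | E * E | (E) | - E | id, indexed 0..4
GRAMMAR = {"E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["-", "E"], ["id"]]}


def leftmost_derivation(start, choices, grammar=GRAMMAR):
    """Apply the productions named by `choices`, always at the leftmost nonterminal."""
    form = [start]
    steps = [form[:]]
    for k in choices:
        # Index of the leftmost nonterminal in the current sentential form
        i = next(idx for idx, sym in enumerate(form) if sym in grammar)
        form = form[:i] + grammar[form[i]][k] + form[i + 1:]
        steps.append(form[:])
    return steps


steps = leftmost_derivation("E", [3, 2, 0, 4, 4])
print(" => ".join("".join(s) for s in steps))
```

Every intermediate list is a left sentential form; the last one contains only terminals and is therefore a sentence.
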
Parse Trees
Parse tree = graphical representation of a derivation ignoring replacement order
E => -E => -(E) => -(E+E) => -(id+E) => -(id+id)
[Figure: parse tree for -(id+id): the root E has children - and E; that E has children (, E, ); the inner E has children E, +, E]

Ambiguous Grammars
■ >= 2 different parse trees for some sentence ⇔ >= 2 leftmost/rightmost derivations
■ Usually want to have unambiguous grammars
  ■ E.g. want just one evaluation order: id + id * id to be parsed as id + (id * id), not (id + id) * id
■ To keep grammars simple, accept ambiguity and resolve separately (outside of grammar)

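Ambiguity can be made concrete by counting parse trees. The sketch below assumes the reduced grammar E -> E + E | E * E | id and a token list as input; the function name is hypothetical.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count the parse trees of `tokens` under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of ways E derives tokens[i:j]
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        for k in range(i + 1, j - 1):  # k = position of the top-level operator
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["id", "+", "id", "*", "id"]))
```

A result greater than 1 for some sentence shows that the grammar is ambiguous; here the two trees correspond to id + (id * id) and (id + id) * id.
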
Expressive Power
■ CFGs are more powerful than REs
  ■ Can express matching () with CFGs
  ■ Can express most properties desired for programming languages
■ CFGs cannot express:
  ■ Identifiers declared before used: L = {wcw | w is in (a|b)*}
  ■ Parameter checking (#formals = #actuals): L = {a^n b^m c^n d^m | n ≥ 1, m ≥ 1}

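Matching parentheses, which no regular expression can describe, follow directly from a recursive grammar. The sketch assumes the grammar S -> ( S ) S | ε, which is one common choice and not taken from the slides.

```python
def balanced(s):
    """Recognize the language of S -> ( S ) S | ε by recursive descent."""

    def parse_s(i):
        # Returns the index just past the S that starts at i, or -1 on error.
        if i < len(s) and s[i] == "(":
            i = parse_s(i + 1)                  # inner S
            if i < 0 or i >= len(s) or s[i] != ")":
                return -1                       # missing closing parenthesis
            return parse_s(i + 1)               # trailing S
        return i                                # S -> ε

    return parse_s(0) == len(s)


for text in ["", "(()())", "(()", ")("]:
    print(repr(text), balanced(text))
```
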
Eliminating Ambiguity (1)
Grammar
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other
is ambiguous:
Sentence: if E1 then if E2 then S1 else S2

Derivation 1 (else belongs to the inner if):
stmt => if expr then stmt
     => if E1 then stmt
     => if E1 then if expr then stmt else stmt
     => if E1 then if E2 then stmt else stmt
     => if E1 then if E2 then S1 else stmt
     => if E1 then if E2 then S1 else S2

Derivation 2 (else belongs to the outer if):
stmt => if expr then stmt else stmt
     => if E1 then stmt else stmt
     => if E1 then if expr then stmt else stmt
     => if E1 then if E2 then stmt else stmt
     => if E1 then if E2 then S1 else stmt
     => if E1 then if E2 then S1 else S2

Which one do we prefer?

Eliminating Ambiguity (2)
Grammar
stmt -> if expr then stmt
      | if expr then stmt else stmt
      | other
is ambiguous:
Sentence: if E1 then if E2 then S1 else S2

Unambiguous grammar (each else matches the closest unmatched then):
stmt -> matched_stmt | unmatched_stmt
matched_stmt -> if expr then matched_stmt else matched_stmt | other
unmatched_stmt -> if expr then stmt | if expr then matched_stmt else unmatched_stmt

Left Recursion
■ If for grammar G there is a derivation A =>+ Aα for some string α, then G is left recursive
Example:
S -> Aa | b
A -> Ac | Sd | ε

Parsing
■ = determining whether a string of tokens can be generated by a grammar
■ Two classes based on the order in which the parse tree is constructed:
  ■ Top-down parsing
    ■ Start construction at root of parse tree
  ■ Bottom-up parsing
    ■ Start at leaves and proceed to root

Recursive Descent Parsing
■ A top-down method based on recursive procedures (typically one for each nonterminal)
  ■ May have to backtrack when the wrong production was picked
■ Predictive parsing = a recursive descent parsing approach that avoids backtracking
  ■ More efficient
  ■ Uses (limited) lookahead to decide what productions to use

Predictive Parser
■ Program with a (parsing) procedure for each nonterminal which
  ■ Decides what production to use (based on lookahead in the input)
  ■ Uses a production by mimicking the right side

Predictive Parser Example
type -> simple | ^id | array [simple] of type
simple -> integer | char | num dotdot num

procedure match(t: token);
begin
  if lookahead = t then
    lookahead := nexttoken
  else error
end;

procedure type;
begin
  if lookahead is in {integer, char, num} then
    simple
  else if lookahead = '^' then begin
    match('^'); match(id)
  end
  else if lookahead = array then begin
    match(array); match('['); simple; match(']'); match(of); type
  end
  else error
end;

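The same parser can be written as a runnable sketch. The `Parser` class, the `"$"` end marker, the token spellings, and the `simple` procedure (which the slide calls but does not show) are assumptions made to get a complete program.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input (assumed)
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.lookahead == t:
            self.pos += 1
        else:
            raise ParseError(f"expected {t!r}, found {self.lookahead!r}")

    def type_(self):
        # type -> simple | ^id | array [simple] of type
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "^":
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise ParseError(f"unexpected {self.lookahead!r}")

    def simple(self):
        # simple -> integer | char | num dotdot num
        if self.lookahead == "integer":
            self.match("integer")
        elif self.lookahead == "char":
            self.match("char")
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise ParseError(f"unexpected {self.lookahead!r}")


def accepts(tokens):
    """True if the whole token list is derivable from type."""
    parser = Parser(tokens)
    try:
        parser.type_()
        parser.match("$")
        return True
    except ParseError:
        return False


print(accepts("array [ num dotdot num ] of integer".split()))
```

Each branch is selected by a single lookahead token, so no backtracking is needed.
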
Predictive Parsing Obstacles
■ expr -> expr + term
  ■ expr; match('+'); term;
  ■ Infinite recursion (left recursion)
■ stmt -> if expr then stmt else stmt | if expr then stmt
  ■ Common prefix
  ■ Can't predict production
■ Solution
  ■ Eliminate left recursion
  ■ Left factoring

Eliminating Left Recursion (1)
■ Simple case: immediate left recursion:
Replace
  A -> Aα | β
with
  A -> βA'
  A' -> αA' | ε

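The rewrite rule for immediate left recursion can be sketched as a small function. Right sides are assumed to be lists of symbols with `[]` standing for ε, the new nonterminal is assumed to be named by appending `'`, and productions of the form A -> A are assumed not to occur.

```python
def eliminate_immediate(a, rhss):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε.

    `rhss` is the list of right sides of A; [] stands for ε.
    """
    alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == a]   # A -> A α
    betas = [rhs for rhs in rhss if not rhs or rhs[0] != a]     # A -> β
    if not alphas:
        return {a: rhss}  # nothing to do
    a2 = a + "'"
    return {
        a: [beta + [a2] for beta in betas],
        a2: [alpha + [a2] for alpha in alphas] + [[]],
    }


# expr -> expr + term | term
print(eliminate_immediate("expr", [["expr", "+", "term"], ["term"]]))
```
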
Eliminating Left Recursion (2)
Order the nonterminals A1 .. An
for i := 1 to n do begin
  for j := 1 to i-1 do begin
    replace each production of the form Ai -> Ajγ
    by the productions Ai -> δ1γ | δ2γ | … | δkγ
    where Aj -> δ1 | δ2 | … | δk are all current Aj productions
  end
  eliminate immediate left recursion among the Ai productions
end

Example: Eliminating Left Recursion
S -> Aa | b
A -> Ac | Sd | ε
Order: S, A
i=1: S has no immediate left recursion, nothing to do
i=2, j=1: eliminate A -> Sγ
  Replace A -> Sd with A -> Aad | bd, giving A -> Ac | Aad | bd | ε
Eliminate immediate left recursion:
  S -> Aa | b
  A -> bdA' | A'
  A' -> cA' | adA' | ε

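The example can be checked with a short sketch of the general algorithm. The grammar encoding (lists of symbols, `[]` for ε) and both function names are illustrative assumptions.

```python
def eliminate_immediate(a, rhss):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε ([] stands for ε)."""
    alphas = [rhs[1:] for rhs in rhss if rhs and rhs[0] == a]
    betas = [rhs for rhs in rhss if not rhs or rhs[0] != a]
    if not alphas:
        return {a: rhss}
    a2 = a + "'"
    return {a: [b + [a2] for b in betas], a2: [al + [a2] for al in alphas] + [[]]}


def eliminate_left_recursion(grammar, order):
    """Remove left recursion, processing the nonterminals in the given order."""
    g = {a: [list(rhs) for rhs in rhss] for a, rhss in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            new_rhss = []
            for rhs in g[ai]:
                if rhs and rhs[0] == aj:
                    # Ai -> Aj γ becomes Ai -> δ γ for every current Aj -> δ
                    new_rhss.extend(delta + rhs[1:] for delta in g[aj])
                else:
                    new_rhss.append(rhs)
            g[ai] = new_rhss
        g.update(eliminate_immediate(ai, g[ai]))
    return g


GRAMMAR = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
for lhs, rhss in eliminate_left_recursion(GRAMMAR, ["S", "A"]).items():
    print(lhs, "->", " | ".join(" ".join(r) or "ε" for r in rhss))
```
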
Left Factoring
■ Find the longest common prefix, keep it once, and move the differing remainders into a new nonterminal
  ■ stmt -> if expr then stmt stmt'
  ■ stmt' -> else stmt | ε

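A single left-factoring step might look as follows. The encoding (lists of symbols, `[]` for ε), the primed name for the new nonterminal, and the function names are assumptions; a full implementation would repeat the step until no two right sides share a prefix.

```python
def common_prefix(a, b):
    """Longest common prefix of two symbol lists."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


def left_factor_once(a, rhss):
    """Factor out the longest prefix shared by at least two right sides of A."""
    best = []
    for i in range(len(rhss)):
        for j in range(i + 1, len(rhss)):
            prefix = common_prefix(rhss[i], rhss[j])
            if len(prefix) > len(best):
                best = prefix
    if not best:
        return {a: rhss}  # no common prefix, nothing to factor
    a2, n = a + "'", len(best)
    factored = [rhs[n:] for rhs in rhss if rhs[:n] == best]  # remainders ([] is ε)
    others = [rhs for rhs in rhss if rhs[:n] != best]
    return {a: [best + [a2]] + others, a2: factored}


STMT = [
    ["if", "expr", "then", "stmt", "else", "stmt"],
    ["if", "expr", "then", "stmt"],
]
print(left_factor_once("stmt", STMT))
```
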
Transition Diagrams
■ Create initial and final state
■ For each production A -> X1X2…Xn create a path from the initial to the final state, with edges labeled X1, X2, …, Xn
[Figure: transition diagram for E; surviving labels: states 0, 3, 6 and edges T, +, ε]

Non-recursive Predictive Parsers
■ Avoid recursion for efficiency reasons
■ Typically built automatically by tools
[Figure: model of a table-driven predictive parser: input (a + b $), stack (X Y Z $), predictive parsing program, parsing table M, output]
■ M[A,a] gives a production; A is the symbol on the stack, a is the input symbol (or $)

Parsing Algorithm
■ X symbol on top of stack, a current input symbol
■ Stack contents and remaining input are called the parser configuration (initially $S on the stack and the complete input string)
1. If X = a = $, halt and announce success
2. If X = a ≠ $, pop X off the stack and advance the input to the next symbol
3. If X is a nonterminal, use M[X,a], which contains a production X -> rhs or error: replace X on the stack with rhs or call the error routine, respectively. E.g. for X -> UVW replace X with WVU (U on top); output the production (or augment the parse tree)

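The three steps translate almost directly into a loop. This sketch hard-codes a parsing table for the grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id. The dictionary layout and the use of Python's built-in `SyntaxError` for the error routine are assumptions.

```python
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

# M[X, a] -> right side of the production to apply ([] stands for ε)
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}


def ll1_parse(tokens, start="E"):
    """Return the productions used, in order, or raise SyntaxError."""
    buffer = list(tokens) + ["$"]
    stack = ["$", start]
    pos, output = 0, []
    while True:
        x, a = stack[-1], buffer[pos]
        if x == a == "$":
            return output                      # step 1: success
        if x not in NONTERMINALS:
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            stack.pop()                        # step 2: match a terminal
            pos += 1
        else:
            rhs = TABLE.get((x, a))            # step 3: consult the table
            if rhs is None:
                raise SyntaxError(f"no entry M[{x}, {a}]")
            stack.pop()
            stack.extend(reversed(rhs))        # leftmost rhs symbol ends on top
            output.append((x, rhs))


for lhs, rhs in ll1_parse(["id", "+", "id", "*", "id"]):
    print(lhs, "->", " ".join(rhs) or "ε")
```

The printed productions form a leftmost derivation of the input.
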
Construction of Parsing Table Helpers (1)
■ First(α) := set of terminals that begin strings derived from α
  ■ First(X) = {X} for terminal X
  ■ If X -> ε is a production, add ε to First(X)
  ■ For X -> Y1…Yk, place a in First(X) if a is in First(Yi) and ε ∈ First(Yj) for j = 1…i-1; if ε ∈ First(Yj) for all j = 1…k, add ε to First(X)

Construction of Parsing Table Helpers (2)
■ Follow(A) := set of terminals a that can appear immediately to the right of A in some sentential form, i.e., S =>* αAaβ for some α, β (a can include $)
  ■ Place $ in Follow(S), S start symbol, $ right end marker
  ■ If there is a production A -> αBβ, put everything in First(β) except ε in Follow(B)
  ■ If there is a production A -> αB, or A -> αBβ where ε is in First(β), then everything in Follow(A) is in Follow(B)

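Both sets can be computed by applying the rules until nothing changes. The sketch below is one possible fixed-point formulation; the grammar encoding (lists of symbols, `[]` for ε) and all function names are assumptions, and the grammar used is E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id.

```python
EPS = "ε"


def first_of_string(symbols, first):
    """First of a string of grammar symbols, given First of each symbol."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)  # all symbols can vanish (or the string is empty)
    return result


def compute_first(grammar, terminals):
    first = {t: {t} for t in terminals}
    first.update({a: set() for a in grammar})
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                new = first_of_string(rhs, first)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


def compute_follow(grammar, first, start):
    follow = {a: set() for a in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for a, rhss in grammar.items():
            for rhs in rhss:
                for i, b in enumerate(rhs):
                    if b not in grammar:
                        continue  # Follow is only defined for nonterminals
                    tail = first_of_string(rhs[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = compute_first(GRAMMAR, {"+", "*", "(", ")", "id"})
FOLLOW = compute_follow(GRAMMAR, FIRST, "E")
print(FIRST["E"], FIRST["E'"], FOLLOW["T"], FOLLOW["F"])
```
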
Construction Algorithm
Input: Grammar G
Output: Parsing table M
For each production A -> α do
  For each terminal a in FIRST(α), add A -> α to M[A,a]
  If ε is in FIRST(α), add A -> α to M[A,b] for each terminal b in FOLLOW(A) ($ counts as a terminal in this step)
Make each undefined entry in M an error entry

Example
E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}

Parsing table M (rows: nonterminals, columns: input symbols):
       id    +    *    (    )    $
E
E'
T
T'
F

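The table for this grammar can be filled in mechanically from the FIRST and FOLLOW sets listed in the example. In this sketch the sets are typed in by hand, and the dictionary layout and function names are assumptions.

```python
EPS = "ε"

GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
# FIRST and FOLLOW of the nonterminals, as listed in the example
FIRST = {"E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
         "E'": {"+", EPS}, "T'": {"*", EPS}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"},
          "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
          "F": {"+", "*", ")", "$"}}


def first_of_string(symbols):
    """FIRST of a right side; FIRST of a terminal is the terminal itself."""
    result = set()
    for sym in symbols:
        f = FIRST.get(sym, {sym})
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)
    return result


def build_table(grammar):
    """Map (nonterminal, input symbol) to the list of applicable right sides."""
    table = {}
    for a, rhss in grammar.items():
        for rhs in rhss:
            f = first_of_string(rhs)
            targets = f - {EPS}
            if EPS in f:
                targets |= FOLLOW[a]
            for t in targets:
                table.setdefault((a, t), []).append(rhs)
    return table


TABLE = build_table(GRAMMAR)
# Missing keys are error entries; more than one right side means a conflict.
IS_LL1 = all(len(entries) == 1 for entries in TABLE.values())
for (a, t), entries in sorted(TABLE.items()):
    print(f"M[{a}, {t}] =", " | ".join(f"{a} -> {' '.join(r) or EPS}" for r in entries))
```

Keeping a list per entry makes multiply defined entries visible, which is the check used to decide whether a grammar is LL(1).
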
LL(1)
■ A grammar whose parsing table has no multiply defined entries is said to be LL(1)
  ■ First L = left to right input scanning
  ■ Second L = leftmost derivation
  ■ (1) = 1 token lookahead
■ Not all grammars can be brought to LL(1) form, i.e., there are languages that do not fall into the LL(1) class