The Parsing Problem (cont.)


Author: Alban Bailey


Goals of the parser, given an input program:
– Find all syntax errors; for each, produce an appropriate diagnostic message, and recover quickly
– Produce the parse tree, or at least a trace of the parse tree, for the program

The Parsing Problem (cont.)
• Two categories of parsers
– Top down: produce the parse tree, beginning at the root
  • Order is that of a leftmost derivation
  • Traces or builds the parse tree in preorder
– Bottom up: produce the parse tree, beginning at the leaves
  • Order is that of the reverse of a rightmost derivation
• Parsers look only one token ahead in the input

The Parsing Problem (cont.)
• Top-down parsers
– Given a sentential form, xAα, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A
• The most common top-down parsing algorithms:
– Recursive descent: a coded implementation
– LL parsers: a table-driven implementation
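To make "recursive descent: a coded implementation" concrete, here is a minimal sketch in C, with one function per nonterminal. It assumes a small arithmetic grammar with '+', '*', parentheses and numbers, and it rewrites the left-recursive rules as loops, because a top-down parser cannot follow a left-recursive rule directly. The names `Parser`, `parse_expr`, `parse_term`, `parse_factor` and `evaluate` are hypothetical and do not come from the slides.

```c
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical parser state: a cursor into the input and an error flag. */
typedef struct {
    const char *p;
    int error;
} Parser;

static double parse_expr(Parser *ps);

/* Skip blanks so that *ps->p is always the one-character lookahead. */
static void skip_blanks(Parser *ps) {
    while (*ps->p == ' ' || *ps->p == '\t') ps->p++;
}

/* factor : '(' expr ')' | NUMBER */
static double parse_factor(Parser *ps) {
    skip_blanks(ps);
    if (*ps->p == '(') {
        ps->p++;
        double v = parse_expr(ps);
        skip_blanks(ps);
        if (*ps->p == ')') ps->p++; else ps->error = 1;
        return v;
    }
    if (isdigit((unsigned char)*ps->p)) {
        char *end;
        double v = strtod(ps->p, &end);
        ps->p = end;
        return v;
    }
    ps->error = 1;              /* lookahead cannot start a factor */
    return 0.0;
}

/* term : factor { '*' factor }   (left recursion rewritten as a loop) */
static double parse_term(Parser *ps) {
    double v = parse_factor(ps);
    for (;;) {
        skip_blanks(ps);
        if (*ps->p != '*') return v;
        ps->p++;
        v *= parse_factor(ps);
    }
}

/* expr : term { '+' term }       (left recursion rewritten as a loop) */
static double parse_expr(Parser *ps) {
    double v = parse_term(ps);
    for (;;) {
        skip_blanks(ps);
        if (*ps->p != '+') return v;
        ps->p++;
        v += parse_term(ps);
    }
}

/* Parse a whole string; returns 0 on success and stores the value in *out. */
int evaluate(const char *text, double *out) {
    Parser ps = { text, 0 };
    *out = parse_expr(&ps);
    skip_blanks(&ps);
    if (*ps.p != '\0') ps.error = 1;   /* trailing input is a syntax error */
    return ps.error ? -1 : 0;
}
```

Each function decides what to do by inspecting only the next input character, which plays the role of the single lookahead token. A caller would use it as `double v; if (evaluate("2 + 3 * 4", &v) == 0) ...`.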


The Parsing Problem (cont.)
• Bottom-up parsers
– Given a right sentential form, α, determine what substring of α is the right-hand side of the rule in the grammar that must be reduced to produce the previous sentential form in the rightmost derivation
– The most common bottom-up parsing algorithms are in the LR family
– YACC is in the LR family

YACC Introduction
• What is YACC?
– A tool that produces a parser for a given grammar.
– YACC (Yet Another Compiler Compiler) is a program that takes an LALR(1) grammar and produces the source code of a syntactic analyzer for the language that grammar generates.

History
• Yacc was originally written by Stephen C. Johnson, 1975.
• Variants:
– lex, yacc (AT&T)
– bison: a yacc replacement (GNU)
– flex: fast lexical analyzer (GNU)
– BSD yacc
– PCLEX, PCYACC (Abraxas Software)


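How YACC Works
(1) Parse: yacc reads the YACC source (foo.y) and writes y.tab.c, together with y.tab.h and y.output when they are requested.
(2) Compile: cc / gcc compiles y.tab.c into a.out.
(3) Run: a.out reads a token stream and produces an abstract syntax tree.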

A YACC File Example
    %{
    #include <stdio.h>
    %}
    %token NAME NUMBER
    %%
    statement: NAME '=' expression
             | expression             { printf("= %d\n", $1); }
             ;
    expression: expression '+' NUMBER { $$ = $1 + $3; }
              | expression '-' NUMBER { $$ = $1 - $3; }
              | NUMBER                { $$ = $1; }
              ;
    %%
    int yyerror(char *s) { fprintf(stderr, "%s\n", s); return 0; }
    int main(void) { yyparse(); return 0; }

YACC File Format
    %{
    C declarations
    %}
    yacc declarations
    %%
    Grammar rules
    %%
    Additional C code
– Comments in /* ... */ may appear in any of the sections.


Definitions Section
    %{
    #include
    #include
    %}
    %token ID NUM      /* ID and NUM are terminals */
    %start expr        /* parsing starts from expr */

Start Symbol
• By default, the start symbol is the first non-terminal specified in the grammar specification section.
• To override it, use the %start declaration:
    %start non-terminal

Rules Section
• The rules section is a grammar
• Example:
    expr : expr '+' term | term;
    term : term '*' factor | factor;
    factor : '(' expr ')' | ID | NUM;


Rules Section
• Normally written like this
• Example:
    expr   : expr '+' term
           | term
           ;
    term   : term '*' factor
           | factor
           ;
    factor : '(' expr ')'
           | ID
           | NUM
           ;

The Position of Rules
    expr   : expr '+' term     { $$ = $1 + $3; }
           | term              { $$ = $1; }
           ;
    term   : term '*' factor   { $$ = $1 * $3; }
           | factor            { $$ = $1; }
           ;
    factor : '(' expr ')'      { $$ = $2; }
           | ID
           | NUM
           ;

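Works with LEX
• YACC calls yylex() to get the next token.
• LEX matches the input against its patterns, e.g. [0-9]+, and reports that the next token is NUM.
• Input: 12 + 26   Token stream: NUM '+' NUM
• LEX and YACC need a way to identify tokens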


Communication between LEX and YACC
• Use enumeration / define
• YACC creates y.tab.h
• LEX includes y.tab.h
• yacc -d gram.y will produce y.tab.h

Communication between LEX and YACC

scanner.l:
    %{
    #include
    #include "y.tab.h"
    %}
    id [_a-zA-Z][_a-zA-Z0-9]*
    %%
    int     { return INT; }
    char    { return CHAR; }
    float   { return FLOAT; }
    {id}    { return ID; }

parser.y:
    %{
    #include
    #include
    %}
    %token CHAR, FLOAT, ID, INT
    %%

yacc -d xxx.y produces y.tab.h:
    #define CHAR 258
    #define FLOAT 259
    #define ID 260
    #define INT 261

YACC
• Rules may be recursive
• Rules may be ambiguous*
• Uses bottom-up shift/reduce parsing
– Get a token
– Push onto stack
– Can it be reduced?
  • yes: Reduce using a rule
  • no: Get another token
• Yacc cannot look ahead more than one token
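The get/push/reduce loop above can be sketched by hand for a tiny grammar. This is only an illustration of the idea: a yacc-generated parser drives the same loop from LALR(1) tables rather than by pattern-matching the top of the stack. The grammar, the single-character token encoding and the function names `reduce` and `shift_reduce` are assumptions made for this sketch.

```c
#include <string.h>

/* Grammar (hypothetical, chosen for illustration):
 *   E : E '+' 'n'     rule 1
 *     | 'n'           rule 2
 * Tokens are single characters; 'n' stands for a NUMBER token.
 */

/* Try one reduction on the top of the stack; returns 1 if a rule applied. */
static int reduce(char *stack, int *top) {
    if (*top >= 3 && stack[*top - 3] == 'E' &&
        stack[*top - 2] == '+' && stack[*top - 1] == 'n') {
        *top -= 2;                 /* pop "E+n", push "E" (rule 1) */
        stack[*top - 1] = 'E';
        return 1;
    }
    if (*top >= 1 && stack[*top - 1] == 'n') {
        stack[*top - 1] = 'E';     /* pop "n", push "E" (rule 2) */
        return 1;
    }
    return 0;
}

/* Returns the number of reductions performed, or -1 on a syntax error. */
int shift_reduce(const char *tokens) {
    char stack[64];
    int top = 0, reductions = 0;
    size_t i = 0, n = strlen(tokens);

    for (;;) {
        if (reduce(stack, &top)) { reductions++; continue; }  /* yes: reduce */
        if (i == n) break;                                    /* no more input */
        if (top == (int)sizeof stack) return -1;              /* stack full */
        stack[top++] = tokens[i++];                           /* no: shift */
    }
    /* Accept only if everything was reduced to the start symbol. */
    return (top == 1 && stack[0] == 'E') ? reductions : -1;
}
```

For the token string "n+n+n" the stack goes n, E, E+, E+n, E, E+, E+n, E, so three reductions are performed and the input is accepted.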


Passing value of token
• Every terminal token (symbol) may represent a value or data type
– May be a numeric quantity in case of a number (42)
– May be a pointer to a string ("Hello, World!")
• When using lex, we put the value into yylval
– In complex situations yylval is a union
• Typical lex code:
    [0-9]+   { yylval = atoi(yytext); return NUM; }

Passing value of token
• Yacc allows symbols to carry values of different types, declared with %union:
    %union {
        double dval;
        int vblno;
        char* strval;
    }

Passing value of token

parser.y:
    %union {
        double dval;
        int vblno;
        char* strval;
    }

yacc -d produces y.tab.h:
    …
    extern YYSTYPE yylval;

Lex file (includes "y.tab.h"):
    [0-9]+      { yylval.vblno = atoi(yytext); return NUM; }
    [A-Za-z]+   { yylval.strval = strdup(yytext); return STRING; }
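To see what the scanner side of this handshake does, here is a hand-written stand-in for the lex-generated code, offered as a sketch only. The names `SemanticValue`, `token_value` and `next_token` are hypothetical substitutes for YYSTYPE, yylval and yylex(), and the token numbers 258 and 259 are assumed rather than taken from a real y.tab.h.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Token codes in the style of y.tab.h; the exact numbers are an assumption. */
#define NUM    258
#define STRING 259

/* Stand-in for the YYSTYPE union that yacc generates from %union. */
typedef union {
    double dval;
    int    vblno;
    char  *strval;
} SemanticValue;

SemanticValue token_value;   /* plays the role of yylval */

/* Hand-written stand-in for yylex(): returns a token code and stores the
 * token's value in token_value. Returns 0 at end of input. */
int next_token(const char **input) {
    const char *p = *input;
    while (*p == ' ' || *p == '\t') p++;          /* ignore white space */

    if (*p == '\0') { *input = p; return 0; }

    if (isdigit((unsigned char)*p)) {             /* [0-9]+ */
        char *end;
        token_value.vblno = (int)strtol(p, &end, 10);
        *input = end;
        return NUM;
    }
    if (isalpha((unsigned char)*p)) {             /* [A-Za-z]+ */
        size_t len = 0;
        while (isalpha((unsigned char)p[len])) len++;
        char *copy = malloc(len + 1);
        if (copy == NULL) return 0;               /* sketch: give up on failure */
        memcpy(copy, p, len);
        copy[len] = '\0';
        token_value.strval = copy;                /* the caller frees this */
        *input = p + len;
        return STRING;
    }
    *input = p + 1;
    return (unsigned char)*p;                     /* literal token such as '+' */
}
```

The return value tells the parser which kind of token was seen, and the union member that was filled in tells it which value goes with that token. That is why both sides must agree on the token codes and on the union layout through a shared header.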


Yacc Example
• Taken from Lex & Yacc
• Example: Simple calculator
    a = 4 + 6
    a
    a=10
    b = 7
    c = a + b
    c
    c = 17
    $

Grammar
    expression ::= expression '+' term
                 | expression '-' term
                 | term
    term       ::= term '*' factor
                 | term '/' factor
                 | factor
    factor     ::= '(' expression ')'
                 | '-' factor
                 | NUMBER
                 | NAME

Symbol Table

parser.h:
    #define NSYMS 20 /* maximum number of symbols */
    struct symtab {
        char *name;
        double value;
    } symtab[NSYMS];
    struct symtab *symlook();

[Figure: the symbol table drawn as an array of entries numbered from 0, each entry holding a name and a value]
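The slides declare symlook() but do not show its body. A minimal sketch, assuming a linear search that inserts a new name into the first free slot, might look like the following. Returning NULL when the table is full is a choice made for this sketch, not something the slides specify.

```c
#include <stdlib.h>
#include <string.h>

#define NSYMS 20                      /* maximum number of symbols */

struct symtab {
    char  *name;
    double value;
} symtab[NSYMS];

/* Find the entry for `s`, creating it in the first free slot if needed.
 * Returns NULL when the table is full or memory runs out (an assumption
 * of this sketch; a real parser might report an error and exit). */
struct symtab *symlook(const char *s) {
    for (struct symtab *sp = symtab; sp < &symtab[NSYMS]; sp++) {
        if (sp->name != NULL && strcmp(sp->name, s) == 0)
            return sp;                /* already present */
        if (sp->name == NULL) {       /* first free slot: insert here */
            size_t len = strlen(s) + 1;
            sp->name = malloc(len);
            if (sp->name == NULL) return NULL;
            memcpy(sp->name, s, len);
            return sp;
        }
    }
    return NULL;                      /* table full */
}
```

Because the array has static storage, every value starts at 0.0, so a name that is read before it is assigned evaluates to zero. The scanner can store the returned pointer in yylval, which lets a parser action such as `$1->value = $3` update the table directly.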


Parser

parser.y:
    %{
    #include "parser.h"
    #include
    %}
    %union {
        double dval;
        struct symtab *symp;
    }
    %token <symp> NAME
    %token <dval> NUMBER
    %type <dval> expression
    %type <dval> term
    %type <dval> factor
    %%

• Terminal NAME and <symp> have the same data type.
• Nonterminal expression and <dval> have the same data type.

Parser (cont'd)

parser.y:
    statement_list: statement '\n'
                  | statement_list statement '\n'
                  ;
    statement: NAME '=' expression   { $1->value = $3; }
             | expression            { printf("= %g\n", $1); }
             ;
    expression: expression '+' term  { $$ = $1 + $3; }
              | expression '-' term  { $$ = $1 - $3; }
              | term
              ;

Parser (cont'd)

parser.y:
    term: term '*' factor   { $$ = $1 * $3; }
        | term '/' factor   { if ($3 == 0.0)
                                  yyerror("divide by zero");
                              else
                                  $$ = $1 / $3;
                            }
        | factor
        ;
    factor: '(' expression ')'   { $$ = $2; }
          | '-' factor           { $$ = -$2; }
          | NUMBER               { $$ = $1; }
          | NAME                 { $$ = $1->value; }
          ;
    %%


Scanner

scanner.l:
    %{
    #include "y.tab.h"
    #include "parser.h"
    #include
    %}
    %%
    ([0-9]+|([0-9]*\.[0-9]+)([eE][-+]?[0-9]+)?) { yylval.dval = atof(yytext); return NUMBER; }
    [ \t]   ;   /* ignore white space */

Scanner (cont'd)

scanner.l:
    [A-Za-z][A-Za-z0-9]*    { /* return symbol pointer */
                              yylval.symp = symlook(yytext);
                              return NAME;
                            }
    "$"                     { return 0; /* end of input */ }
    \n|"="|"+"|"-"|"*"|"/"  return yytext[0];
    %%

Precedence / Association
(1) 1 - 2 - 3
(2) 1 - 2 * 3
1. 1-2-3 = (1-2)-3 or 1-(2-3)?
2. 1-2*3 = 1-(2*3) or (1-2)*3?
Yacc: these ambiguities show up as shift/reduce conflicts. Default is to shift.
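The choice between the two readings matters because each one is a different parse tree with a different value. The sketch below builds the four candidate trees by hand and evaluates them. The `Node` type, the `eval` function and the tree names are hypothetical, introduced only for this illustration.

```c
#include <stddef.h>

/* A tiny expression tree: a leaf holds a number, an inner node an operator. */
typedef struct Node {
    char op;                       /* '-' or '*' for inner nodes, 0 for leaves */
    double value;                  /* used only by leaves */
    const struct Node *left, *right;
} Node;

/* Evaluate a tree bottom-up. */
double eval(const Node *n) {
    if (n->op == 0) return n->value;
    double l = eval(n->left), r = eval(n->right);
    return n->op == '-' ? l - r : l * r;
}

static const Node one   = {0, 1, NULL, NULL};
static const Node two   = {0, 2, NULL, NULL};
static const Node three = {0, 3, NULL, NULL};

/* The two parse trees for 1 - 2 - 3 */
static const Node sub12 = {'-', 0, &one, &two};
static const Node sub23 = {'-', 0, &two, &three};
const Node left_assoc  = {'-', 0, &sub12, &three};   /* (1 - 2) - 3 */
const Node right_assoc = {'-', 0, &one, &sub23};     /* 1 - (2 - 3) */

/* The two parse trees for 1 - 2 * 3 */
static const Node mul23 = {'*', 0, &two, &three};
const Node mul_first = {'-', 0, &one, &mul23};       /* 1 - (2 * 3) */
const Node sub_first = {'*', 0, &sub12, &three};     /* (1 - 2) * 3 */
```

Evaluating the trees gives -4 for (1 - 2) - 3 but 2 for 1 - (2 - 3), and -5 for 1 - (2 * 3) but -3 for (1 - 2) * 3. A grammar that allows both trees for the same input therefore has to say which one the parser should build.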


Precedence / Association
    %right '='
    %left '<' '>' NE LE GE
    %left '+' '-'
    %left '*' '/'       (highest precedence)

Precedence / Association
    %left '+' '-'
    %left '*' '/'
    %nonassoc UMINUS

    expr : expr '+' expr   { $$ = $1 + $3; }
         | expr '-' expr   { $$ = $1 - $3; }
         | expr '*' expr   { $$ = $1 * $3; }
         | expr '/' expr   { if ($3 == 0)
                                 yyerror("divide 0");
                             else
                                 $$ = $1 / $3;
                           }
         | '-' expr %prec UMINUS   { $$ = -$2; }

IF-ELSE Ambiguity
Consider the following rule:
    stmt : IF expr stmt
         | IF expr stmt ELSE stmt
         ……
How about the following statement?
    IF expr IF expr stmt ELSE stmt


IF-ELSE Ambiguity
• It is a shift/reduce conflict.
• Yacc will always choose to shift.
• A solution:
    stmt      : matched
              | unmatched
              ;
    matched   : other_stmt
              | IF expr THEN matched ELSE matched
              ;
    unmatched : IF expr THEN stmt
              | IF expr THEN matched ELSE unmatched
              ;

Shift/Reduce Conflicts
• shift/reduce conflict
– occurs when a grammar is written in such a way that a decision between shifting and reducing cannot be made.
– e.g., the IF-ELSE ambiguity.
• To resolve this conflict, yacc will choose to shift.

Reduce/Reduce Conflicts
• Reduce/reduce conflicts:
    start : expr | stmt ;
    expr  : CONSTANT ;
    stmt  : CONSTANT ;
• Yacc resolves the conflict by reducing using the rule that occurs earlier in the grammar. NOT GOOD!!
• So, modify the grammar to eliminate them.


Error Messages
• Bad error message:
– Syntax error.
– The compiler needs to give the programmer good advice.
• It is better to track the line number in lex:
    void yyerror(char *s) { fprintf(stderr, "line %d: %s\n", yylineno, s); }

Debug Your Parser
1. Use the -t option or define YYDEBUG to 1.
2. Set variable yydebug to 1 when you want to trace parsing status.
3. If you want to trace the semantic values, define your YYPRINT function.

Shift and Reducing: Example
    stmt: stmt ';' stmt
        | NAME '=' exp
    exp: exp '+' exp
       | exp '-' exp
       | NAME
       | NUMBER

stack: (empty)
input: a = 7; b = 3 + a + 2


Recursive Grammar
• Left recursion
    list: item
        | list ',' item
        ;
• Right recursion
    list: item
        | item ',' list
        ;
• LR parser (e.g. yacc) prefers left recursion.
• LL parser prefers right recursion.
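One reason an LR parser prefers left recursion is stack usage. With the left-recursive rule it can reduce after every item, while with the right-recursive rule it has to shift the whole list before the first reduction becomes possible. The sketch below is a simple counting model of that behaviour and not a real parser. The function names are hypothetical, and the model counts one stack slot per grammar symbol.

```c
/* Maximum stack depth when a shift-reduce parser reads a list of n items
 * (n >= 1) written as: item ',' item ',' ... ',' item */

/* Left recursion:  list : item | list ',' item ; */
int max_depth_left(int n) {
    int depth = 0, max = 0;
    for (int k = 0; k < n; k++) {
        if (k > 0) depth++;            /* shift ',' */
        depth++;                       /* shift item */
        if (depth > max) max = depth;
        depth = 1;                     /* reduce to list right away */
    }
    return max;
}

/* Right recursion: list : item | item ',' list ; */
int max_depth_right(int n) {
    int depth = 0, max = 0;
    for (int k = 0; k < n; k++) {
        if (k > 0) depth++;            /* shift ',' */
        depth++;                       /* shift item; nothing can be reduced yet */
        if (depth > max) max = depth;
    }
    /* Only at end of input: reduce the last item to list, then fold
     * "item ',' list" into list once per remaining item. */
    while (depth > 1) depth -= 2;
    return max;
}
```

In this model the left-recursive grammar never needs more than three stack slots, while the right-recursive grammar needs 2n - 1 slots for n items, so a long enough list can overflow a fixed-size parser stack.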

YACC Declaration Summary
`%start' Specify the grammar's start symbol
`%union' Declare the collection of data types that semantic values may have
`%token' Declare a terminal symbol (token type name) with no precedence or associativity specified
`%type' Declare the type of semantic values for a nonterminal symbol
`%right' Declare a terminal symbol (token type name) that is right-associative
`%left' Declare a terminal symbol (token type name) that is left-associative
`%nonassoc' Declare a terminal symbol (token type name) that is nonassociative (using it in a way that would be associative is a syntax error, e.g. x op y op z is a syntax error)
