An example lexer and recursive descent parser

Author: Britney Reeves
An example lexer and recursive descent parser Brett G. Giles January 28, 2002

Contents
1 Introduction
    1.1 Calculator Grammar
2 Lexical Scan and tokens
3 Parsing and code generation
    3.1 Parsing via recursive descent
    3.2 Code Generation
    3.3 Support Routines
    3.4 Includes and global definitions
4 Test data
5 Appendices
    5.1 Compilation and running instructions

1 Introduction

This PDF document is intended to give you a small example of how to write a lexer and recursive descent parser for a language. We do this for a language based on the standard Unix dc. It primarily allows one to create a series of print statements that perform calculations. Identifiers, looping, branching and so forth are not allowed.

1.1 Calculator Grammar

The grammar is defined by the following rules:

    prog        -> stmtlist.
    stmtlist    -> stmt SEMICOLON stmtlist
                 | stmt
    stmt        -> PRINT expr
    expr        -> term moreterms
    term        -> factor morefactors
                 | SUB term
    factor      -> LPAR expr RPAR
                 | NUM
    morefactors -> MULT factor morefactors
                 | DIV factor morefactors
                 | .
    moreterms   -> ADD term moreterms
                 | SUB term moreterms
                 | .

As discussed in the previous section, we only allow programs such as “print 5+4”. While the assignment and minisculus are much more complex, this will serve as an example of how to get started.

2 Lexical Scan and tokens

As in the assignment, we will use lex to scan our tokens. Our tokens are defined as the following:

    PRINT     => "print"
    LPAR      => "("
    RPAR      => ")"
    SEMICOLON => ";"
    NUM       => [0-9]+
    ADD       => "+"
    SUB       => "-"
    MULT      => "*"
    DIV       => "/"

Our first task is to create the lex file, calculator.l, which is composed of three parts. We prefix the definitions with the %pointer declaration to make sure that yytext is a char * rather than an array.

⟨calculator.l 2⟩≡
    %pointer
    ⟨lexdefinitions 3b⟩
    %%
    ⟨lexrules 3a⟩
    %%
    ⟨lexsubroutines 3c⟩
Root chunk (not used in this document).


As the meat of the lexing is in the rules section, we will define that first. Just to show how it is done, we use defined names (BLANK, TAB, NEWLINE and DIGITS). Note that we have quoted characters via the double quote as opposed to the backslash. Only the characters shown quoted need to be quoted. Forward slash, although not documented on the web site, is a special character to lex (it introduces trailing context), so it must be quoted.

⟨lexrules 3a⟩≡
    ({BLANK}|{TAB}|{NEWLINE})+    /* ignore whitespace */ ;
    print       return PRINT;
    {DIGITS}    return NUM;
    "("         return LPAR;
    ")"         return RPAR;
    "+"         return ADD;
    -           return SUB;
    "*"         return MULT;
    "/"         return DIV;
    ;           return SEMICOLON;
    .           return OTHER;
This code is used in chunk 2.

In the definitions section, we must define all the returned values via a series of defines. Normally this would be put in a common include file for the project. In this case, we define it directly as it is so small. Our names are defined at the bottom.

⟨lexdefinitions 3b⟩≡
    %{
    #define OTHER 1
    #define PRINT 257
    #define NUM 258
    #define LPAR 259
    #define RPAR 260
    #define ADD 261
    #define SUB 262
    #define MULT 263
    #define DIV 264
    #define SEMICOLON 265
    #define ERROR 0
    %}
    BLANK " "
    TAB "\t"
    NEWLINE "\n"
    DIGITS [0-9]+
This code is used in chunk 2.
Defines: BLANK, TAB, NEWLINE, DIGITS, never used.

Finally, in this very simple case, we need nothing in the subroutines section.

⟨lexsubroutines 3c⟩≡
This code is used in chunk 2.
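To make the token rules concrete, here is a minimal sketch of a hand-written scanner in C that recognizes the same tokens without lex. It is only an illustration: the names next_token and TOK_END are our own and do not appear in calculator.l, while the token codes are copied from the defines above.

```c
#include <ctype.h>
#include <string.h>

/* Token codes copied from the lex definitions; TOK_END is a hypothetical
   name for the 0 that yylex returns at end of input. */
enum {
    TOK_END = 0, OTHER = 1,
    PRINT = 257, NUM, LPAR, RPAR, ADD, SUB, MULT, DIV, SEMICOLON
};

/* Hypothetical helper: scan one token starting at *src, advance *src past
   it, and store the numeric value of a NUM token in *value. */
int next_token(const char **src, int *value)
{
    const char *p = *src;

    while (*p == ' ' || *p == '\t' || *p == '\n')   /* ignore whitespace */
        p++;

    if (*p == '\0') {                               /* end of input */
        *src = p;
        return TOK_END;
    }

    if (isdigit((unsigned char)*p)) {               /* NUM => [0-9]+ */
        int n = 0;
        while (isdigit((unsigned char)*p))
            n = n * 10 + (*p++ - '0');
        *value = n;
        *src = p;
        return NUM;
    }

    if (strncmp(p, "print", 5) == 0) {              /* PRINT => "print" */
        *src = p + 5;
        return PRINT;
    }

    *src = p + 1;                                   /* single-character tokens */
    switch (*p) {
    case '(': return LPAR;
    case ')': return RPAR;
    case '+': return ADD;
    case '-': return SUB;
    case '*': return MULT;
    case '/': return DIV;
    case ';': return SEMICOLON;
    default:  return OTHER;                         /* the catch-all "." rule */
    }
}
```

Calling next_token repeatedly on "print (12+3)*4;" would give PRINT, LPAR, NUM (12), ADD, NUM (3), RPAR, MULT, NUM (4), SEMICOLON and finally TOK_END.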

3 Parsing and code generation

Our task is now to create a recursive descent parser for the language as defined in section 1. All of our source code will be contained in one file. Typically, if not using literate programming, there would be a number of separate C source files that contain the various routines. Using literate programming it is both more convenient and practical to use a single file. The program consists of five sections as shown below.

⟨calculator.c 4a⟩≡
    ⟨includesanddefines 14b⟩
    ⟨parsingroutines 4c⟩
    ⟨codegeneration 9⟩
    ⟨supportroutines 10a⟩
    ⟨mainroutine 4b⟩
Root chunk (not used in this document).

We show the main routine first, as it is the driver for the others. It creates the syntax tree and then writes the code. If we pass any arguments to the program, we set the debugMe global variable to true. This will cause various printf statements to be executed.

⟨mainroutine 4b⟩≡
    int main(int argc, char ** argv)
    {
        if (argc > 1)
            debugMe = 1;    /* true */
        else
            debugMe = 0;    /* false */
        theSyntaxTree = prog();
        writeCode();
        return 0;
    }
This code is used in chunk 4a.
Defines: main, never used.
Uses debugMe 16a and prog 5a.

3.1 Parsing via recursive descent

In this section we include the code for recognizing the language and creating trees to reflect the structure of our input code. See your text page 144 for further examples. Each routine creates a node with various children based upon its contribution to the syntax of the program. We have a parsing routine for each of the language constructs as given in its definition.

⟨parsingroutines 4c⟩≡
    ⟨prog 5a⟩
    ⟨stmtlist 5b⟩
    ⟨stmt 5c⟩
    ⟨expr 6a⟩
    ⟨term 6b⟩
    ⟨factor 7a⟩
    ⟨morefactors 7b⟩
    ⟨moreterms 8⟩
This code is used in chunk 4a.
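The one-routine-per-rule idea can be tried in isolation with the following sketch. Instead of building a tree it evaluates an expression directly from a string, so it is not the parser of this document, only a simplified relative of it. The names eval_expr, cur, skip_ws and the *_value routines are hypothetical, the tail-recursive rules morefactors and moreterms are written as loops, and a division by zero simply yields 0 as a placeholder.

```c
#include <ctype.h>

/* Hypothetical input cursor; the parser in this document uses currtoken
   and yylex instead. */
static const char *cur;

static void skip_ws(void)
{
    while (*cur == ' ' || *cur == '\t' || *cur == '\n')
        cur++;
}

static int expr_value(void);

/* factor -> LPAR expr RPAR | NUM */
static int factor_value(void)
{
    int v = 0;
    skip_ws();
    if (*cur == '(') {
        cur++;
        v = expr_value();
        skip_ws();
        if (*cur == ')')            /* no error reporting in this sketch */
            cur++;
        return v;
    }
    while (isdigit((unsigned char)*cur))
        v = v * 10 + (*cur++ - '0');
    return v;
}

/* term -> factor morefactors | SUB term */
static int term_value(void)
{
    int v;
    skip_ws();
    if (*cur == '-') {              /* unary negation */
        cur++;
        return -term_value();
    }
    v = factor_value();
    for (;;) {                      /* morefactors, written as a loop */
        skip_ws();
        if (*cur == '*') {
            cur++;
            v *= factor_value();
        } else if (*cur == '/') {
            int d;
            cur++;
            d = factor_value();
            v = (d != 0) ? v / d : 0;   /* placeholder for divide by zero */
        } else {
            return v;
        }
    }
}

/* expr -> term moreterms */
static int expr_value(void)
{
    int v = term_value();
    for (;;) {                      /* moreterms, written as a loop */
        skip_ws();
        if (*cur == '+') {
            cur++;
            v += term_value();
        } else if (*cur == '-') {
            cur++;
            v -= term_value();
        } else {
            return v;
        }
    }
}

/* Entry point: evaluate one expression held in a string. */
int eval_expr(const char *text)
{
    cur = text;
    return expr_value();
}
```

For instance, eval_expr("(5+4)*3+7") yields 34, the first value expected from the test data in section 4.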


Our top level node is prog. This will actually be the top of the final syntax tree, with a stmtlist as its dependant and no siblings. This is created by the makeProgNode routine.

⟨prog 5a⟩≡
    syntree * prog()
    {
        debugMe && printf("prog\n");
        currtoken = getnexttoken();
        return makeProgNode(stmtlist());
    }
This code is used in chunk 4c.
Defines: prog, used in chunks 4b and 15c.
Uses debugMe 16a, getnexttoken 10b, makeProgNode 14a, and stmtlist 5b.

Our list of statements comes next, having been called either from prog or recursively by stmtlist. We only recurse if, after getting a stmt, the current token is a semicolon.

⟨stmtlist 5b⟩≡
    syntree * stmtlist()
    {
        syntree * stmtNode;
        syntree * slistNode;
        slistNode = NULL;
        debugMe && printf("stmtlist\n");
        stmtNode = stmt();
        if (currtoken == SEMICOLON) {
            matchtoken(SEMICOLON);
            slistNode = stmtlist();
        }
        return makeSlistNode(stmtNode, slistNode);
    }
This code is used in chunk 4c.
Defines: stmtlist, used in chunks 5a and 15c.
Uses debugMe 16a, makeSlistNode 13, matchtoken 11a, and stmt 5c.

Each statement must start with “print”, which is followed by an expression. The tree created here is a print node with the expression as a dependant. No siblings are created here, but of course may be via the statement list routine.

⟨stmt 5c⟩≡
    syntree * stmt()
    {
        debugMe && printf("stmt\n");
        matchtoken(PRINT);
        return makePrintNode(expr());
    }
This code is used in chunk 4c.
Defines: stmt, used in chunks 5b and 15c.
Uses debugMe 16a, expr 6a, makePrintNode 12c, and matchtoken 11a.


In expression, the set of terms forms our dependants. That is, an expression will have a term as its dependant, which will have further terms as siblings. A term can be either a number node or an operation node.

⟨expr 6a⟩≡
    syntree * expr()
    {
        syntree * theTerm, * additionalTerms;
        debugMe && printf("expr\n");
        theTerm = term();
        additionalTerms = moreterms();
        return makeExprNode(theTerm, additionalTerms);
    }
This code is used in chunk 4c.
Defines: expr, used in chunks 5c, 7a, and 15c.
Uses debugMe 16a, makeExprNode 12b, moreterms 8, and term 6b.

Note in term below that SUB is unary negation and not subtraction. This is not handled atomically in dc and therefore will need to be handled in code generation. See section 3.2. For a term, the tree returned is either a negate op with the term as its dependant, or it is similar to expression above if we replace terms with factors.

⟨term 6b⟩≡
    syntree * term()
    {
        syntree * theFactor, * additionalFactors;
        debugMe && printf("term\n");
        if (currtoken == SUB) {
            matchtoken(SUB);
            return makeOpNode('n', term(), NULL);
        } else {
            theFactor = factor();
            additionalFactors = morefactors();
            return makeExprNode(theFactor, additionalFactors);
        }
    }
This code is used in chunk 4c.
Defines: term, used in chunks 6a, 8, and 15c.
Uses debugMe 16a, factor 7a, makeExprNode 12b, makeOpNode 12a, matchtoken 11a, and morefactors 7b.


factor has two alternatives: it either starts with a left parenthesis or is a number. The parentheses are discarded and we return the internal expression, or we return a number node.

⟨factor 7a⟩≡
    syntree * factor()
    {
        debugMe && printf("factor\n");
        if (currtoken == LPAR) {
            syntree * ret;
            matchtoken(LPAR);
            ret = expr();
            matchtoken(RPAR);
            return ret;
        } else {
            int val = atoi(yytext);
            matchtoken(NUM);
            return makeNumNode(val);
        }
    }
This code is used in chunk 4c.
Defines: factor, used in chunks 6b, 7b, and 15c.
Uses debugMe 16a, expr 6a, makeNumNode 11c, and matchtoken 11a.

morefactors is where we discover which operation we are actually doing, multiply or divide. The rest is similar to the above routines. Note though that left and right go back and forth here a bit.

⟨morefactors 7b⟩≡
    syntree * morefactors()
    {
        syntree * theFactor, * additionalFactors;
        debugMe && printf("morefactors");
        if (currtoken == MULT) {
            matchtoken(MULT);
            theFactor = factor();
            additionalFactors = morefactors();
            return makeOpNode('*', theFactor, additionalFactors);
        }
        if (currtoken == DIV) {
            matchtoken(DIV);
            theFactor = factor();
            additionalFactors = morefactors();
            return makeOpNode('/', theFactor, additionalFactors);
        } else
            debugMe && printf("(end mf)\n");
        return NULL;
    }
This code is used in chunk 4c.
Defines: morefactors, used in chunks 6b and 15c.
Uses debugMe 16a, factor 7a, makeOpNode 12a, and matchtoken 11a.


moreterms follows the same algorithm as morefactors.

⟨moreterms 8⟩≡
    syntree * moreterms()
    {
        syntree * theTerm, * additionalTerms;
        debugMe && printf("moreterms");
        if (currtoken == ADD) {
            matchtoken(ADD);
            theTerm = term();
            additionalTerms = moreterms();
            return makeOpNode('+', theTerm, additionalTerms);
        } else if (currtoken == SUB) {
            matchtoken(SUB);
            theTerm = term();
            additionalTerms = moreterms();
            return makeOpNode('-', theTerm, additionalTerms);
        } else
            debugMe && printf("(end mt)\n");
        return NULL;
    }
This code is used in chunk 4c.
Defines: moreterms, used in chunks 6a and 15c.
Uses debugMe 16a, makeOpNode 12a, matchtoken 11a, and term 6b.

3.2 Code Generation

In this section we write out a series of stack codes for dc. Our only excitement in this comes in handling negation. For negation, we write an additional zero for the null child, so that the unary minus becomes the subtraction of the term from zero. We traverse the tree as follows: for an operation node we write its dependants (the right-hand operand), then the operation itself, then its siblings (the operations that follow). The left-hand operand has already been written by the enclosing expression node, so the output comes out in postfix order.

⟨codegeneration 9⟩≡
    void printcode(syntree * tr)
    {
        if (tr == NULL) return;
        switch (tr->nodetype) {
        case PROGNODE:
            printcode(tr->dependants);
            printf("echo program end\n");
            break;
        case SLISTNODE:
            printcode(tr->dependants);
            printcode(tr->siblings);
            break;
        case PRINTNODE:
            printcode(tr->dependants);
            printf("PRINT\n");
            printcode(tr->siblings);
            break;
        case EXPRNODE:
            printcode(tr->dependants);
            printcode(tr->siblings);
            break;
        case OPNODE:
            switch (tr->op) {
            case '+': case '*':
                printcode(tr->dependants);
                printf("OP2 %c\n", tr->op);
                printcode(tr->siblings);
                break;
            case '-': case '/':
                printcode(tr->dependants);
                printf("OP2 %c\n", tr->op);
                printcode(tr->siblings);
                break;
            case 'n':
                printf("cPUSH 0\n");
                printcode(tr->dependants);
                printf("OP2 -\n");
                break;
            default:
                printf("Error - invalid op code:'%c'\n", tr->op);
            }
            break;
        case NUMNODE:
            printf("cPUSH %d\n", tr->value);
            printcode(tr->siblings);
            break;
        }
    }

    void writeCode()
    {
        debugMe && printf("Code Gen called\n");
        printcode(theSyntaxTree);
    }
This code is used in chunk 4a.
Defines: printcode, writeCode, never used.
Uses debugMe 16a.
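To see why the emitted order works, it may help to run such code on a small stack machine. The sketch below is an assumption on our part: the source does not describe the machine that consumes the cPUSH, OP2 and PRINT lines, so we assume that cPUSH pushes a constant, OP2 pops two values and pushes the result, and PRINT pops and outputs the top of the stack. The names Instr, Kind and run_stack_code are hypothetical.

```c
#include <stddef.h>

/* Hypothetical encoding of the three kinds of lines printcode emits. */
typedef enum { I_PUSH, I_OP2, I_PRINT } Kind;

typedef struct {
    Kind kind;
    int  value;   /* operand of I_PUSH */
    char op;      /* operator of I_OP2: '+', '-', '*' or '/' */
} Instr;

/* Run the instructions on a small stack. Each I_PRINT pops one value into
   out[]. Returns the number of values printed, or -1 on stack underflow,
   stack overflow, a full output buffer, division by zero or a bad operator. */
int run_stack_code(const Instr *code, size_t n, int *out, size_t outcap)
{
    int stack[64];
    size_t sp = 0, printed = 0, i;

    for (i = 0; i < n; i++) {
        switch (code[i].kind) {
        case I_PUSH:
            if (sp == 64) return -1;
            stack[sp++] = code[i].value;
            break;
        case I_OP2: {
            int a, b;
            if (sp < 2) return -1;
            b = stack[--sp];            /* right operand was pushed last */
            a = stack[--sp];
            switch (code[i].op) {
            case '+': a += b; break;
            case '-': a -= b; break;
            case '*': a *= b; break;
            case '/':
                if (b == 0) return -1;
                a /= b;
                break;
            default:
                return -1;
            }
            stack[sp++] = a;
            break;
        }
        case I_PRINT:
            if (sp < 1 || printed == outcap) return -1;
            out[printed++] = stack[--sp];
            break;
        }
    }
    return (int)printed;
}
```

Under these assumptions the sequence push 0, push 3, subtract, print outputs -3, which is how the negation node is realized with a binary subtraction.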

3.3 Support Routines

We gather the support routines here.

⟨supportroutines 10a⟩≡
    ⟨getnexttoken 10b⟩
    ⟨syntaxerror 10c⟩
    ⟨matchtoken 11a⟩
    ⟨syntaxtree 11b⟩
This code is used in chunk 4a.

Note that getnexttoken is not strictly necessary. It simply returns yylex and could be eliminated.

⟨getnexttoken 10b⟩≡
    int getnexttoken(void)
    {
        return yylex();
    }
This code is used in chunk 10a.
Defines: getnexttoken, used in chunks 5a and 11a.

We must be ready to handle syntax errors (tokens in the wrong order, for example). For this example and the first assignment, it is sufficient to report the offending token and exit.

⟨syntaxerror 10c⟩≡
    void syntaxerror(int tok)
    {
        printf("abandoning at token %d, %s\n", tok, yytext);
        exit(4);
    }
This code is used in chunk 10a.
Defines: syntaxerror, used in chunk 11a.


Whenever we match a particular token, we advance to the next token. If we are trying to match a token and fail, we use syntaxerror to exit.

⟨matchtoken 11a⟩≡
    void matchtoken(int token)
    {
        if (currtoken == token) {
            currtoken = getnexttoken();
            debugMe && printf("Matching %d, new token is %d, '%s'\n",
                              token, currtoken, yytext);
        } else {
            syntaxerror(currtoken);
        }
    }
This code is used in chunk 10a.
Defines: matchtoken, used in chunks 5–8 and 15c.
Uses debugMe 16a, getnexttoken 10b, and syntaxerror 10c.

In keeping with the example in the lab, we define nodes for our various kinds of grammar. We combine terms, factors etc. into operations. This is so we store the semantics rather than just the syntax of the expression. Each of the routines below typically allocates a new node, assigns the proper type to it and fills in values as appropriate.

⟨syntaxtree 11b⟩≡
    ⟨syntaxnumnode 11c⟩
    ⟨syntaxopnode 12a⟩
    ⟨syntaxexprnode 12b⟩
    ⟨syntaxprintnode 12c⟩
    ⟨syntaxslistnode 13⟩
    ⟨syntaxprognode 14a⟩
This code is used in chunk 10a.

A number node has a '\0' for the operation and holds the number in the value field. Both siblings and dependants are null.

⟨syntaxnumnode 11c⟩≡
    syntree * makeNumNode(int value)
    {
        syntree* rval;
        rval = malloc(sizeof(syntree));
        rval->nodetype = NUMNODE;
        rval->value = value;
        rval->op = '\0';
        rval->dependants = rval->siblings = NULL;
        return rval;
    }
This code is used in chunk 11b.
Defines: makeNumNode, used in chunks 7a and 15c.


An operation node will have an arithmetic operation code for the operation. Both siblings and dependants are passed in and may be null.

⟨syntaxopnode 12a⟩≡
    syntree * makeOpNode(char op, syntree* deps, syntree* sibs)
    {
        syntree* rval;
        rval = malloc(sizeof(syntree));
        rval->nodetype = OPNODE;
        rval->value = 0;
        rval->op = op;
        rval->dependants = deps;
        rval->siblings = sibs;
        return rval;
    }
This code is used in chunk 11b.
Defines: makeOpNode, used in chunks 6–8 and 15c.

The expression node has a list of terms as its dependant.

⟨syntaxexprnode 12b⟩≡
    syntree * makeExprNode(syntree* theTerm, syntree* otherterms)
    {
        syntree* rval;
        rval = malloc(sizeof(syntree));
        rval->nodetype = EXPRNODE;
        rval->value = 0;
        rval->op = '\0';
        rval->dependants = theTerm;
        rval->siblings = otherterms;
        return rval;
    }
This code is used in chunk 11b.
Defines: makeExprNode, used in chunks 6 and 15c.

Our only valid statement is a print, hence we have only one routine for making statements.

⟨syntaxprintnode 12c⟩≡
    syntree * makePrintNode(syntree* expression)
    {
        syntree* rval;
        rval = malloc(sizeof(syntree));
        rval->nodetype = PRINTNODE;
        rval->value = 0;
        rval->op = '\0';
        rval->dependants = expression;
        rval->siblings = NULL;
        return rval;
    }
This code is used in chunk 11b.
Defines: makePrintNode, used in chunks 5c and 15c.


Making a statement list is the only slightly complicated routine in this set. If we get an slist parameter of NULL we create the slist node. This will mean that the calling routine (stmtlist) has recursed on itself to the end of the statement list. If both parameters are NULL, we have an empty statement list and the dependants will simply be NULL. In the case that we are passed an existing slist, we simply insert the new statement at the head of the list of statements dependant on the statement list. As an error check, if we are passed a NULL statement, we just pass back the slist.

⟨syntaxslistnode 13⟩≡
    syntree * makeSlistNode(syntree* statement, syntree* slist)
    {
        syntree* rval;
        if (slist == NULL) {
            rval = malloc(sizeof(syntree));
            rval->nodetype = SLISTNODE;
            rval->value = 0;
            rval->op = '\0';
            rval->dependants = statement;
            rval->siblings = NULL;
        } else {
            if (statement != NULL) {
                statement->siblings = slist->dependants;
                slist->dependants = statement;
                rval = slist;
            } else
                rval = slist;
        }
        return rval;
    }
This code is used in chunk 11b.
Defines: makeSlistNode, used in chunks 5b and 15c.
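Because stmtlist recurses before it calls makeSlistNode, the last statement creates the list node and every earlier statement is pushed onto the front, so the finished list is in source order. The sketch below shows this with a pared-down node type; node, make_node, add_statement and build_list are hypothetical stand-ins for syntree, the make routines, makeSlistNode and stmtlist, and build_list assumes at least one label.

```c
#include <stdlib.h>

/* Pared-down stand-in for syntree: a label and the two link fields. */
typedef struct node {
    int label;
    struct node *dependants;
    struct node *siblings;
} node;

node *make_node(int label)
{
    node *n = malloc(sizeof *n);
    if (n == NULL)
        exit(1);
    n->label = label;
    n->dependants = n->siblings = NULL;
    return n;
}

/* Same shape as makeSlistNode: create the list node when there is no list
   yet, otherwise push the statement onto the front of the existing list. */
node *add_statement(node *statement, node *slist)
{
    if (slist == NULL) {
        slist = make_node(0);
        slist->dependants = statement;
    } else if (statement != NULL) {
        statement->siblings = slist->dependants;
        slist->dependants = statement;
    }
    return slist;
}

/* Mimics the recursion in stmtlist over labels[0..n-1], n >= 1:
   the tail of the list is built before the current statement is added. */
node *build_list(const int *labels, int n)
{
    node *stmt = make_node(labels[0]);
    node *rest = (n > 1) ? build_list(labels + 1, n - 1) : NULL;
    return add_statement(stmt, rest);
}
```

Building a list from the labels 10, 20, 30 gives a list node whose dependants chain reads 10, 20, 30 through the siblings links.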


Making the program node is straightforward. The node is the top of the syntax tree: it takes the statement list as its dependant and has no siblings.

⟨syntaxprognode 14a⟩≡
    syntree * makeProgNode(syntree* slist)
    {
        syntree* rval;
        rval = malloc(sizeof(syntree));
        rval->nodetype = PROGNODE;
        rval->value = 0;
        rval->op = '\0';
        rval->dependants = slist;
        rval->siblings = NULL;
        return rval;
    }
This code is used in chunk 11b.
Defines: makeProgNode, used in chunks 5a and 15c.

3.4 Includes and global definitions

We include a couple of standard libraries and then duplicate the defines from the lex file. (See section 2.)

⟨includesanddefines 14b⟩≡
    #include <stdio.h>     /* printf */
    #include <stdlib.h>    /* malloc, exit, atoi */
    #define OTHER 1
    #define PRINT 257
    #define NUM 258
    #define LPAR 259
    #define RPAR 260
    #define ADD 261
    #define SUB 262
    #define MULT 263
    #define DIV 264
    #define SEMICOLON 265
    #define ERROR 0
This definition is continued in chunks 14–16.
This code is used in chunk 4a.
Defines: OTHER, PRINT, NUM, LPAR, RPAR, ADD, SUB, MULT, DIV, SEMICOLON, ERROR, never used.

We follow this with definitions of the interface elements to the lexer.

⟨includesanddefines 14b⟩+≡
    extern char *yytext;
    extern int yylex(void);
This code is used in chunk 4a.
Defines: yytext, yylex, never used.


Then comes our data definition. The idea is to have a binary tree that reflects the structure of our program. See figure for an example. In our case, the structure is very simple and we only have a few node types. They are:

⟨includesanddefines 14b⟩+≡
    #define PROGNODE 1001
    #define SLISTNODE 1002
    #define PRINTNODE 1003
    #define EXPRNODE 1004
    #define OPNODE 1005
    #define NUMNODE 1006
This code is used in chunk 4a.
Defines: PROGNODE, SLISTNODE, PRINTNODE, EXPRNODE, OPNODE, NUMNODE, never used.

The structure itself is as discussed in class.

⟨includesanddefines 14b⟩+≡
    typedef struct stree {
        int nodetype;
        char op;
        int value;
        struct stree * dependants;
        struct stree * siblings;
    } syntree;
    syntree * theSyntaxTree;
This code is used in chunk 4a.
Defines: syntree, theSyntaxTree, nodetype, dependants, siblings, op, value, never used.
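Since every node has exactly two links, any whole-tree operation is a short recursion over dependants and siblings. As an illustration, the sketch below releases a tree and counts its nodes. The routine freeTree is hypothetical and is not part of the original program, which never frees its nodes; the structure is repeated only so that the sketch compiles on its own.

```c
#include <stdlib.h>

/* Same layout as the syntree structure above. */
typedef struct stree {
    int nodetype;
    char op;
    int value;
    struct stree *dependants;
    struct stree *siblings;
} syntree;

/* Hypothetical helper: free every node reachable from tr and return the
   number of nodes released. Both subtrees are visited before the node
   itself is freed, so no freed memory is ever read. */
int freeTree(syntree *tr)
{
    int count;
    if (tr == NULL)
        return 0;
    count = freeTree(tr->dependants) + freeTree(tr->siblings);
    free(tr);
    return count + 1;
}
```

A call such as freeTree(theSyntaxTree) after writeCode would be one place to use it, assuming no node is shared between two parents.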

Now, we can create the forward definitions for all of the subroutines.

⟨includesanddefines 14b⟩+≡
    syntree * prog();
    syntree * stmtlist();
    syntree * stmt();
    syntree * expr();
    syntree * term();
    syntree * factor();
    syntree * morefactors();
    syntree * moreterms();
    void matchtoken(int token);
    int getnexttoken(void);    /* called by prog before its definition */
    int currtoken;
    syntree * makeNumNode(int value);
    syntree * makeOpNode(char op, syntree* deps, syntree* sibs);
    syntree * makeExprNode(syntree* terms, syntree* otherterms);
    syntree * makePrintNode(syntree* expression);
    syntree * makeSlistNode(syntree* statement, syntree* slist);
    syntree * makeProgNode(syntree* slist);
This code is used in chunk 4a.
Uses expr 6a, factor 7a, makeExprNode 12b, makeNumNode 11c, makeOpNode 12a, makePrintNode 12c, makeProgNode 14a, makeSlistNode 13, matchtoken 11a, morefactors 7b, moreterms 8, prog 5a, stmt 5c, stmtlist 5b, and term 6b.


Last, and probably least, we define the global boolean variable (an int in C) that determines whether we are printing debug output.

⟨includesanddefines 14b⟩+≡
    int debugMe;
This code is used in chunk 4a.
Defines: debugMe, used in chunks 4–9 and 11a.

4 Test data

A variety of data to ensure this thing works correctly. dc should print 34, 714, −3, 5, 1, −3, 145. (Each number on a separate line.)

⟨calculator.txt 16b⟩≡
    print (5+4)*3+7;
    print (12+345)*2;
    print -3;
    print 10/2;
    print 5-4;
    print 4+3-10;
    print ((((7-5*2)*(3-1)-5)*(4+3-10)-5)*(42-37)--5)
Root chunk (not used in this document).

5 Appendices

5.1 Compilation and running instructions

To compile, run the following commands.
• noweb calculator.nw
• lex calculator.l
• gcc lex.yy.c calculator.c -lfl -o calculator
Note that in the last command you might use -ll rather than -lfl. To run the program without debugging and sending the output directly to dc:
• ./calculator