Principles of Programming Languages http://www.di.unipi.it/~andrea/Didattica/PLP-15/ Prof. Andrea Corradini, Department of Computer Science, Pisa
Lesson 5 • Lexical analysis: implementing a scanner
The Reason Why Lexical Analysis is a Separate Phase • Simplifies the design of the compiler – LL(1) or LR(1) parsing with 1 token lookahead would not be possible otherwise (multiple characters/tokens to match)
• Provides efficient implementation – Systematic techniques to implement lexical analyzers by hand or automatically from specifications – Stream buffering methods to scan input
• Improves portability – Non-standard symbols and alternate character encodings can be normalized (e.g. UTF-8, trigraphs)
Main goal of lexical analysis: tokenization
[Diagram: the source code (e.g. y := 31 + 28*x) is read by the Lexical analyzer or Scanner, which passes to the Parser a token (the lookahead) together with tokenval (the token attribute).]
Additional tasks of the Lexical Analyzer • Remove comments and useless white spaces / tabs from the source code • Correlate error messages of the parser with source code (e.g. keeping track of line numbers) • Expansion of macros
Interaction of the Lexical Analyzer with the Parser
[Diagram: the Parser issues a "get next token" request; the Lexical Analyzer reads the Source Program and returns the pair (Token, tokenval). Both components report errors and access the Symbol Table.]
Tokens, Patterns, and Lexemes • A token is a pair of a token name and an optional attribute value, e.g. the pair ⟨relop, GE⟩ for the lexeme >=

Token   Attribute-Value
if      –
then    –
else    –
id      pointer to table entry
num     pointer to table entry
relop   LT
relop   LE
relop   EQ
relop   NE
relop   GT
relop   GE
Regular Definitions for tokens • The specification of the patterns for the tokens is provided with regular definitions

letter → [A-Za-z]
digit  → [0-9]
digits → digit+
if     → if
then   → then
else   → else
relop  → < | <= | = | <> | > | >=
id     → letter ( letter | digit )*
num    → digits ( . digits )? ( E (+|-)? digits )?
From Regular Definitions to code • From the regular definitions we first extract a transition diagram, and next the code of the scanner. • In the example the lexemes are recognized either when they are completed, or at the next character. In real situations a longer lookahead might be necessary. • The diagrams guarantee that the longest lexeme is identified.
Coding Regular Definitions in Transition Diagrams
relop → < | <= | = | <> | > | >=
[Diagram: transition diagram for relop, beginning at the start state]
From Individual Transition Diagrams to Code • Easy to convert each Transition Diagram into code • Loop with multiway branch (switch/case) based on the current state to reach the instructions for that state • Each state is a multiway branch based on the next input character
Coding the Transition Diagrams for Relational Operators
[Diagram: transition diagram for the relational operators, beginning at the start state]
Putting the code together

token nexttoken() {
  while (1) {
    switch (state) {
    case 0:
      c = nextchar();
      if (c==blank || c==tab || c==newline) {
        state = 0;
        lexeme_beginning++;
      }
      else if (c=='>') state = 6;   /* character illegible in the source; '>' assumed */
      else state = fail();
      break;
    case 1: ...
    case 9:
      c = nextchar();
      if (isletter(c)) state = 10;
      else state = fail();
      break;
    case 10:
      c = nextchar();
      if (isletter(c)) state = 10;
      else if (isdigit(c)) state = 10;
      else state = 11;
      break;
    ...
The transition diagrams for the various tokens can be tried sequentially: on failure, we re-scan the input trying another diagram.

int fail() {
  forward = token_beginning;
  switch (state) {
  case 0:  start = 9;  break;
  case 9:  start = 12; break;
  case 12: start = 20; break;
  case 20: start = 25; break;
  case 25: recover();  break;
  default: /* error */ ;
  }
  return start;
}
Putting the code together: Alternative solutions • The diagrams can be checked in parallel • The diagrams can be merged into a single one, typically nondeterministic: this is the approach we will study in depth.
Lexical errors • Some errors are beyond the power of the lexical analyzer to recognize: fi (a == f(x)) … • However, it may be able to recognize errors like: d = 2r • Such errors are recognized when no pattern for tokens matches a character sequence
Error recovery • Panic mode: successive characters are ignored until we reach a well-formed token • Delete one character from the remaining input • Insert a missing character into the remaining input • Replace a character by another character • Transpose two adjacent characters • Minimal distance