Principles of Programming Languages

Principles of Programming Languages
http://www.di.unipi.it/~andrea/Didattica/PLP-15/
Prof. Andrea Corradini
Department of Computer Science, Pisa

Lesson 5
• Lexical analysis: implementing a scanner

The Reason Why Lexical Analysis is a Separate Phase
• Simplifies the design of the compiler
  – Otherwise LL(1) or LR(1) parsing with 1 token lookahead would not be possible (multiple characters/tokens to match)
• Provides efficient implementation
  – Systematic techniques to implement lexical analyzers by hand or automatically from specifications
  – Stream buffering methods to scan input
• Improves portability
  – Non-standard symbols and alternate character encodings can be normalized (e.g. UTF8, trigraphs)

Main goal of lexical analysis: tokenization
[Figure: the source code (y := 31 + 28*x) enters the Lexical analyzer or Scanner, which passes to the Parser a token (lookahead) and its tokenval (token attribute)]

Additional tasks of the Lexical Analyzer
• Remove comments and useless white spaces / tabs from the source code
• Correlate error messages of the parser with source code (e.g. keeping track of line numbers)
• Expansion of macros
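The first two tasks can be sketched in C as follows. This is only an illustration under stated assumptions: the whole source is held in a NUL-terminated string, comments have the C form /* ... */ (the slides do not fix a comment syntax), and the names skip_blanks_and_comments and line are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Skip blanks, tabs, newlines and C-style comments starting at src[pos].
 * Returns the index of the first character that belongs to a token
 * (or of the terminating NUL). *line is incremented for every newline
 * skipped, so that later error messages can mention the right line. */
size_t skip_blanks_and_comments(const char *src, size_t pos, int *line)
{
    assert(src != NULL && line != NULL);
    for (;;) {
        char c = src[pos];
        if (c == ' ' || c == '\t' || c == '\r') {
            pos++;
        } else if (c == '\n') {
            (*line)++;
            pos++;
        } else if (c == '/' && src[pos + 1] == '*') {
            pos += 2;                        /* skip the opening delimiter */
            while (src[pos] != '\0' &&
                   !(src[pos] == '*' && src[pos + 1] == '/')) {
                if (src[pos] == '\n')
                    (*line)++;
                pos++;
            }
            if (src[pos] != '\0')
                pos += 2;                    /* skip the closing delimiter */
        } else {
            return pos;                      /* start of the next lexeme */
        }
    }
}
```

An unterminated comment simply consumes the rest of the input here; a real scanner would probably report it as an error.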

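Interaction of the Lexical Analyzer with the Parser
[Figure: the Lexical Analyzer reads the Source Program; the Parser asks it to "Get next token" and receives a (Token, tokenval) pair in return; both components report errors and both access the Symbol Table]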
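The "get next token" interface can be sketched in C on the statement y := 31 + 28*x used earlier. This is a minimal illustration, not the course's scanner: the token names other than id and num, the functions scanner_init and next_token, and the use of the lexeme offset as the attribute of an identifier (in place of a symbol-table pointer) are all assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Token names; only id and num come from the slides, the rest are assumed. */
enum token { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_TIMES, TOK_EOF, TOK_ERROR };

const char *source;   /* whole source text */
const char *input;    /* next character to be scanned */
long tokenval;        /* attribute of the token returned last */

void scanner_init(const char *src)
{
    assert(src != NULL);
    source = input = src;
}

/* Called by the parser whenever it needs the next token. */
enum token next_token(void)
{
    while (*input == ' ' || *input == '\t' || *input == '\n')
        input++;                                 /* white space is not a token */
    if (*input == '\0')
        return TOK_EOF;
    if (isdigit((unsigned char)*input)) {
        char *end;
        tokenval = strtol(input, &end, 10);      /* attribute: the numeric value */
        input = end;
        return TOK_NUM;
    }
    if (isalpha((unsigned char)*input)) {
        tokenval = (long)(input - source);       /* stand-in for a symbol-table pointer */
        while (isalnum((unsigned char)*input))
            input++;
        return TOK_ID;
    }
    if (input[0] == ':' && input[1] == '=') { input += 2; return TOK_ASSIGN; }
    if (*input == '+') { input++; return TOK_PLUS; }
    if (*input == '*') { input++; return TOK_TIMES; }
    input++;                                     /* skip the unknown character */
    return TOK_ERROR;
}
```

A parser would call scanner_init once and then next_token repeatedly, reading tokenval after each TOK_ID or TOK_NUM.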

Tokens, Patterns, and Lexemes
• A token is a pair (token name, attribute value)

Lexemes       Token    Attribute-Value
whitespace    -        -
if            if       -
then          then     -
else          else     -
any id        id       pointer to table entry
any num       num      pointer to table entry
<             relop    LT
<=            relop    LE
=             relop    EQ
<>            relop    NE
>             relop    GT
>=            relop    GE

Regular Definitions for tokens
• The specification of the patterns for the tokens is provided with regular definitions

letter → [A-Za-z]
digit → [0-9]
digits → digit+
if → if
then → then
else → else
relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*
num → digits ( . digits )? ( E ( + | - )? digits )?

From Regular Definitions to code
• From the regular definitions we first extract a transition diagram, and next the code of the scanner.
• In the example the lexemes are recognized either when they are completed, or at the next character. In real situations a longer lookahead might be necessary.
• The diagrams guarantee that the longest lexeme is identified.

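Coding Regular Definitions in Transition Diagrams

relop → < | <= | <> | > | >= | =
[Transition diagram for relop]
  start → 0
  0 --  <  --> 1
  1 --  =  --> 2      return(relop, LE)
  1 --  >  --> 3      return(relop, NE)
  1 -- other --> 4 *  return(relop, LT)
  0 --  =  --> 5      return(relop, EQ)
  0 --  >  --> 6
  6 --  =  --> 7      return(relop, GE)
  6 -- other --> 8 *  return(relop, GT)

id → letter ( letter | digit )*
[Transition diagram for id]
  start → 9
  9 -- letter --> 10
  10 -- letter or digit --> 10
  10 -- other --> 11 *  return(gettoken(), install_id())

(* marks a state where the input must be retracted by one character, because the last character read is not part of the lexeme)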

Coding Regular Definitions in Transition Diagrams (cont.)
Transition diagram for unsigned numbers
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

[Figure 3.16: A transition diagram for unsigned numbers]

We must retract the input one position, and as discussed in Section 3.4.2, we enter what we have found in the symbol table and determine whether we have a keyword or a true identifier.

The transition diagram for token number is shown in Fig. 3.16, and is so far the most complex diagram we have seen. Beginning in state 12, if we see a digit, we go to state 13. In that state, we can read any number of additional digits. However, if we see anything but a digit or a dot, we have seen a number in the form of an integer; 123 is an example. That case is handled by entering state 20, where we return token number and a pointer to a table of constants where the found lexeme is entered. These mechanics are not shown on the diagram but are analogous to the way we handled identifiers.

If we instead see a dot in state 13, then we have an optional fraction. State 14 is entered, and we look for one or more additional digits; state 15 is ...
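The diagram for unsigned numbers can be sketched in C as a loop over a switch on the current state. The text above names only states 12 to 15 and 20; the numbers 16 to 19 and 21 used in the comments follow the usual textbook numbering and are an assumption, as is the function name match_unsigned_number. Instead of returning a token, this sketch returns the length of the lexeme, so the retraction in the starred states appears as "i - 1".

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Runs the transition diagram for
 *   num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
 * on the text starting at s. Returns the length of the recognised lexeme,
 * or 0 if the diagram fails. */
size_t match_unsigned_number(const char *s)
{
    assert(s != NULL);
    int state = 12;
    size_t i = 0;                  /* index of the next character to read */
    for (;;) {
        int c = (unsigned char)s[i++];
        switch (state) {
        case 12: if (isdigit(c)) state = 13; else return 0; break;
        case 13:
            if (isdigit(c)) state = 13;
            else if (c == '.') state = 14;
            else if (c == 'E') state = 16;
            else return i - 1;     /* state 20*: integer, retract one char */
            break;
        case 14: if (isdigit(c)) state = 15; else return 0; break;
        case 15:
            if (isdigit(c)) state = 15;
            else if (c == 'E') state = 16;
            else return i - 1;     /* state 21*: fraction, retract */
            break;
        case 16:
            if (c == '+' || c == '-') state = 17;
            else if (isdigit(c)) state = 18;
            else return 0;
            break;
        case 17: if (isdigit(c)) state = 18; else return 0; break;
        case 18:
            if (isdigit(c)) state = 18;
            else return i - 1;     /* state 19*: exponent, retract */
            break;
        }
    }
}
```

Like the diagram, the sketch fails on an input such as 12.x rather than backing up to accept 12; a scanner that wants that behaviour has to remember the last accepting position.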

From Individual Transition Diagrams to Code
• Easy to convert each Transition Diagram into code
• Loop with multiway branch (switch/case) based on the current state to reach the instructions for that state
• Each state is a multiway branch based on the next input character

Coding the Transition Diagrams for Relational Operators
[Figure: transition diagram for relop, states 0 to 8. From state 0: < leads to state 1, = to state 5 (return(relop, EQ)), > to state 6. From state 1: = leads to state 2 (return(relop, LE)), > to state 3 (return(relop, NE)), other to state 4* (return(relop, LT)). From state 6: = leads to state 7 (return(relop, GE)), other to state 8* (return(relop, GT)).]

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) {   /* repeat character processing
                     until a return or failure occurs */
        switch (state) {
        case 0: c = nextChar();
                if (c == '<') state = 1;
                else if (c == '=') state = 5;
                else if (c == '>') state = 6;
                else fail();   /* lexeme is not a relop */
                break;
        case 1: ...
        ...
        case 8: retract();     /* starred state: give back the last character */
                retToken.attribute = GT;
                return(retToken);
        }
    }
}

Putting the code together
The transition diagrams for the various tokens can be tried sequentially: on failure, we re-scan the input trying another diagram.

token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c==blank || c==tab || c==newline) {
                state = 0;
                lexeme_beginning++;
            }
            else if (c=='<') state = 1;
            else if (c=='=') state = 5;
            else if (c=='>') state = 6;
            else state = fail();
            break;
        case 1:
            …
        case 9:
            c = nextchar();
            if (isletter(c)) state = 10;
            else state = fail();
            break;
        case 10:
            c = nextchar();
            if (isletter(c)) state = 10;
            else if (isdigit(c)) state = 10;
            else state = 11;
            break;
        …

int fail()
{
    forward = lexeme_beginning;      /* re-scan from the start of the lexeme */
    switch (start) {                 /* start state of the diagram that failed */
    case 0: start = 9; break;
    case 9: start = 12; break;
    case 12: start = 20; break;
    case 20: start = 25; break;
    case 25: recover(); break;
    default: /* error */ break;
    }
    return start;
}
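A self-contained sketch of the sequential strategy, restricted to two diagrams (relop starting in state 0 and id starting in state 9), might look as follows in C. Several things are assumptions made for the sake of a runnable example: the input is a NUL-terminated global buffer, the token names END and LEXERROR, the start value -1 meaning that no diagram is left, and the choice to skip one character after a lexical error.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum token { RELOP, ID, END, LEXERROR };   /* END and LEXERROR are assumed names */
enum relop { LT, LE, EQ, NE, GT, GE };     /* attribute values of RELOP */

const char *buf;           /* NUL-terminated input buffer */
size_t lexeme_beginning;   /* index where the current lexeme starts */
size_t forward;            /* index of the next character to read */
int state, start;          /* current state; start state of the current diagram */
int tokenval;              /* attribute of the token returned last */

/* The current diagram failed: rewind the input and select the next diagram. */
int fail(void)
{
    forward = lexeme_beginning;
    switch (start) {
    case 0:  start = 9;  break;    /* relop failed, try id */
    default: start = -1; break;    /* no diagram left (assumed error state) */
    }
    return start;
}

enum token nexttoken(void)
{
    int c;
    assert(buf != NULL);
    lexeme_beginning = forward;
    state = start = 0;
    for (;;) {
        switch (state) {
        case 0:
            c = (unsigned char)buf[forward++];
            if (c == ' ' || c == '\t' || c == '\n') lexeme_beginning++;
            else if (c == '\0') { forward--; return END; }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
        case 1:
            c = buf[forward++];
            if (c == '=') { tokenval = LE; return RELOP; }   /* state 2 */
            if (c == '>') { tokenval = NE; return RELOP; }   /* state 3 */
            forward--; tokenval = LT; return RELOP;          /* state 4*: retract */
        case 5:
            tokenval = EQ; return RELOP;
        case 6:
            c = buf[forward++];
            if (c == '=') { tokenval = GE; return RELOP; }   /* state 7 */
            forward--; tokenval = GT; return RELOP;          /* state 8*: retract */
        case 9:
            c = (unsigned char)buf[forward++];
            state = isalpha(c) ? 10 : fail();
            break;
        case 10:
            c = (unsigned char)buf[forward++];
            if (!isalnum(c)) state = 11;
            break;
        case 11:
            forward--;             /* retract: the last character is not part of the id */
            return ID;
        default:                   /* every diagram failed on this character */
            forward = lexeme_beginning + 1;   /* drop it so scanning can go on */
            return LEXERROR;
        }
    }
}
```

After each call the lexeme is the text between lexeme_beginning and forward. Adding a diagram means adding its states to the switch and one more case to fail().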

Putting the code together: Alternative solutions
• The diagrams can be checked in parallel
• The diagrams can be merged into a single one, typically non-deterministic: this is the approach we will study in depth.

Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize (here fi is a valid lexeme for id):
    fi (a == f(x)) …
• However, it may be able to recognize errors like:
    d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence

Error recovery
• Panic mode: successive characters are ignored until we reach a well-formed token
• Delete one character from the remaining input
• Insert a missing character into the remaining input
• Replace a character by another character
• Transpose two adjacent characters
• Minimal Distance
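Panic mode, the first strategy in the list, can be sketched in C as follows. The sketch assumes a token set made of identifiers, unsigned numbers and relational operators, so that a character can begin a token only if it is a letter, a digit, or one of <, =, >; the function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Assumed token set: identifiers, unsigned numbers and relational operators. */
static int can_start_token(int c)
{
    return isalpha(c) || isdigit(c) || c == '<' || c == '=' || c == '>';
}

/* Panic mode: starting at the offending character src[pos], discard
 * characters until white space, the end of the input, or a character
 * that can begin a well-formed token is found. Returns the position
 * where scanning should resume. */
size_t panic_mode_skip(const char *src, size_t pos)
{
    assert(src != NULL);
    while (src[pos] != '\0' && !isspace((unsigned char)src[pos])
           && !can_start_token((unsigned char)src[pos]))
        pos++;
    return pos;
}
```

The scanner would call panic_mode_skip when no pattern matches at the current position, report the discarded characters, and then resume normal scanning from the returned position.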