Principles of Programming Languages Topic: Formal Languages I

CS 314, LS,LTM: L2, Formal Languages I


Review
• This class will teach you
  – some common principles underlying most languages
  – some new ways of thinking about programs (new paradigms)

• A programming language must be
  – as easy as possible for people to
    • learn
    • read
    • write
  – as easy as possible for a compiler to translate into efficient machine code, or for an interpreter to execute

Defining a Language
• To define a computer language we need to say
  – How do we tell if a file of characters is a legal (grammatical) program in this language? (syntax)
  – How do we define what a program in this language means? (semantics)

• Semantics: several approaches, but in practice we just use English to explain the meaning
• Syntax: defined by a formal grammar

Grammars
S → NP VP
NP → Name | Det Noun
VP → Verb | Verb NP
Name → john | mary
Det → a | the
Det → some | every
Noun → boy | girl
Verb → runs | likes

Parse tree for "the boy likes mary":

S
  NP
    Det
      the
    Noun
      boy
  VP
    Verb
      likes
    NP
      Name
        mary

Grammars
A grammar, G, is a quadruple <T, N, P, S>, where
• T is a set of terminal symbols (e.g., john, mary, a, the, some, every, boy, girl, runs, likes);
• N is a set of nonterminal symbols (e.g., S, NP, VP, Name, Det, Noun, Verb);
• P is a set of productions, or rewrite rules (e.g., Det → a | the);
• S is a special start symbol.

The language of G, L(G), is the set of all terminal sequences that can be produced by applying the rewrite rules, repeatedly, starting with S.
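The definition of L(G) can be tried out directly on the small English grammar, since that grammar has no recursion and therefore a finite language. The following is a minimal sketch, assuming the productions are stored in a dictionary; the dictionary layout and the function name `expand` are illustrative choices, not part of the course material.

```python
from itertools import product

# The grammar from the slides, written as a quadruple (T, N, P, S).
# The dictionary layout and all Python names are illustrative assumptions.
T = {"john", "mary", "a", "the", "some", "every", "boy", "girl", "runs", "likes"}
N = {"S", "NP", "VP", "Name", "Det", "Noun", "Verb"}
P = {
    "S": [["NP", "VP"]],
    "NP": [["Name"], ["Det", "Noun"]],
    "VP": [["Verb"], ["Verb", "NP"]],
    "Name": [["john"], ["mary"]],
    "Det": [["a"], ["the"], ["some"], ["every"]],
    "Noun": [["boy"], ["girl"]],
    "Verb": [["runs"], ["likes"]],
}
S = "S"


def expand(symbol):
    """Return every terminal sequence derivable from symbol (as tuples)."""
    if symbol in T:
        return [(symbol,)]
    results = []
    for rhs in P[symbol]:
        # Combine every choice for each right-hand-side symbol, in order.
        for parts in product(*(expand(s) for s in rhs)):
            results.append(tuple(word for part in parts for word in part))
    return results


# L(G): all terminal sequences derivable from the start symbol.
language = {" ".join(words) for words in expand(S)}
```

Here `"the boy likes mary" in language` is true, while a scrambled sequence such as "boy the likes" is not in the set. The same exhaustive approach would not terminate for a recursive grammar, whose language is infinite.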

Grammars
• For Programming Languages:
  Stmt → Identifier := Digit
  Identifier → Letter | Identifier Letter | Identifier Digit
  Letter → a | b | c | ... | x | y | z
  Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

• Backus-Naur Form (BNF):
  <stmt> ::= <identifier> := <digit>
  <identifier> ::= <letter> | <identifier> <letter> | <identifier> <digit>
  <letter> ::= a | b | c | ... | x | y | z
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Derivation in a Grammar
Q: Can we generate x2:=0 in this grammar (call it G)?

<stmt> → <identifier> := <digit>
       → <identifier> <digit> := <digit>
       → <letter> <digit> := <digit>
       → x <digit> := <digit>
       → x 2 := <digit>
       → x 2 := 0

Yes! This is a leftmost or canonical derivation in G.
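The leftmost derivation of x2:=0 can be replayed mechanically: at each step, the leftmost nonterminal is replaced by the right-hand side of one of its productions. The sketch below assumes sentential forms are lists of symbols; the helper `leftmost_step` is a hypothetical name, and it does not check that the chosen right-hand side is a legal production for the nonterminal it replaces.

```python
# Nonterminals of grammar G, written in angle brackets.
NONTERMINALS = {"<stmt>", "<identifier>", "<letter>", "<digit>"}


def leftmost_step(form, rhs):
    """Replace the leftmost nonterminal in form with the symbols in rhs."""
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("no nonterminal left to rewrite")


# The right-hand sides chosen at each step of the derivation of x2:=0.
choices = [
    ["<identifier>", ":=", "<digit>"],   # <stmt> -> <identifier> := <digit>
    ["<identifier>", "<digit>"],         # <identifier> -> <identifier> <digit>
    ["<letter>"],                        # <identifier> -> <letter>
    ["x"],                               # <letter> -> x
    ["2"],                               # <digit> -> 2
    ["0"],                               # <digit> -> 0
]

form = ["<stmt>"]
steps = [form]
for rhs in choices:
    form = leftmost_step(form, rhs)
    steps.append(form)

# Print one sentential form per line, starting from <stmt>.
for step in steps:
    print(" ".join(step))
```

The last sentential form contains only terminals, so the derivation stops there.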

Parsing in a Grammar
Q: Can we recognize x2:=0 as a terminal sequence in L(G)?

x 2 := 0 → <letter> 2 := 0
         → <identifier> 2 := 0
         → <identifier> <digit> := 0
         → <identifier> := 0
         → <identifier> := <digit>
         → <stmt>

Yes! This is a parse of the sentence x2:=0 in G.

Parse Trees

<stmt>
  <identifier>
    <identifier>
      <letter>
        x
    <digit>
      2
  :=
  <digit>
    0

Each internal node is a nonterminal; its children are drawn from the right-hand side of one of the productions for that nonterminal.

Grammars are not Unique
• Consider a grammar G’:
  <stmt> ::= <identifier> := <digit>
  … ::= … | …
  … ::= … | …
  <letter> ::= a | b | c | ... | x | y | z
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

• The grammar G’ generates the same language as G, but it has different parse trees.

Grammars are not Unique

Parse Tree for G:

<stmt>
  <identifier>
    <identifier>
      <letter>
        x
    <digit>
      2
  :=
  <digit>
    0

Parse Tree for G’:

[Figure: parse tree for x2:=0 under G’]

Types of Grammars
• Context Free Grammars:
  – Every production has a single nonterminal on the left-hand side: A → …
  – Disallowed: X A → X a

• Regular Grammars:
  – Productions take the form A → c, or are all either left-linear (A → B a) or right-linear (A → a B)
  – Disallowed: S → a S b
  – Cannot generate the language { a^n b^n | n = 1, 2, 3, ... }
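A context-free grammar can generate { a^n b^n | n = 1, 2, 3, ... } because a production such as S → a S b adds one a and one b together. The sketch below assumes the grammar S → a S b | a b; the base case S → a b is an addition made here so that the recursion stops, and the function name `matches_S` is hypothetical.

```python
def matches_S(s):
    """Recognize strings generated by S -> a S b | a b."""
    if s == "ab":                                   # S -> a b
        return True
    if len(s) > 2 and s[0] == "a" and s[-1] == "b":
        return matches_S(s[1:-1])                   # S -> a S b
    return False
```

For example, `matches_S("aaabbb")` holds, while `matches_S("aab")` and `matches_S("abab")` do not. The recursion has to remember how many a's are still unmatched, which is exactly what a regular grammar cannot do.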

Types of Grammars
• Context Free Grammars (CFGs) are used to specify the overall structure of a programming language:
  – if/then/else, ...
  – brackets: ( ), { }, begin/end, ...

• Regular Grammars (RGs) are used to specify the structure of tokens:
  – identifiers, numbers, keywords, ...

• Note: Recognizing the languages of CFGs and of RGs requires different computational models (more on this later).
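Token structure of the kind a regular grammar describes is commonly written as a regular expression in practice. The sketch below uses Python's standard `re` module to express the identifier rule of grammar G (a letter followed by any number of letters or digits); the names `IDENTIFIER`, `TOKEN`, `is_identifier`, and `tokenize` are illustrative assumptions.

```python
import re

# Identifier -> Letter | Identifier Letter | Identifier Digit, i.e. one
# lowercase letter followed by any number of lowercase letters or digits.
IDENTIFIER = re.compile(r"[a-z][a-z0-9]*")

# The three kinds of token that appear in a statement of grammar G.
TOKEN = re.compile(r"[a-z][a-z0-9]*|:=|[0-9]")


def is_identifier(text):
    """True if the whole string is an identifier of grammar G."""
    return IDENTIFIER.fullmatch(text) is not None


def tokenize(text):
    """Split a statement into identifier, := and digit tokens."""
    return TOKEN.findall(text)
```

With these definitions, `tokenize("x2:=0")` yields the three tokens x2, := and 0, which a context-free parser can then treat as single symbols.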

Ambiguity
S → NP VP
NP → Name | Det Noun | NP PP
PP → Prep NP
VP → Verb | Verb NP
Name → john | mary
Det → a | the | some | every
Prep → on | with | under | ...
Noun → man | hill | telescope | ...
Verb → saw | runs | likes | ...

S
  NP
    Name
      john
  VP
    Verb
      saw
    NP
      ....

Ambiguity

...
NP
  NP
    NP
      Det
        a
      Noun
        man
    PP
      Prep
        on
      NP
        Det
          a
        Noun
          hill
  PP
    Prep
      with
    NP
      Det
        a
      Noun
        telescope

Ambiguity

...
NP
  NP
    Det
      a
    Noun
      man
  PP
    Prep
      on
    NP
      NP
        Det
          a
        Noun
          hill
      PP
        Prep
          with
        NP
          Det
            a
          Noun
            telescope
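The ambiguity can be checked by counting parse trees instead of drawing them. The sketch below covers only the NP fragment of the grammar and omits the "..." alternatives; it counts trees by trying every way of splitting the word sequence between the two symbols of a production. The names `RULES` and `count_trees` are assumptions made for this example.

```python
from functools import lru_cache

# The NP fragment of the slide's grammar, with the "..." alternatives omitted.
RULES = {
    "NP": [("Det", "Noun"), ("NP", "PP")],
    "PP": [("Prep", "NP")],
    "Det": [("a",)],
    "Noun": [("man",), ("hill",), ("telescope",)],
    "Prep": [("on",), ("with",)],
}


def count_trees(symbol, words):
    """Count the distinct parse trees deriving words from symbol."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def count(sym, i, j):
        if sym not in RULES:                 # terminal: must match one word
            return int(j == i + 1 and words[i] == sym)
        total = 0
        for rhs in RULES[sym]:
            if len(rhs) == 1:                # single symbol: same span
                total += count(rhs[0], i, j)
            else:                            # two symbols: try every split
                left, right = rhs
                for k in range(i + 1, j):
                    total += count(left, i, k) * count(right, k, j)
        return total

    return count(symbol, 0, len(words))
```

For the phrase "a man on a hill with a telescope" the count is 2, matching the two trees above, while "a man on a hill" has a single tree.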

Dangling Else
Here is a simplified grammar for Pascal:
  <stmt> ::= <if-stmt> | <assignment> | ...
  <if-stmt> ::= if <cond> then <stmt> | if <cond> then <stmt> else <stmt>
  <assignment> ::= <letter> := <digit>
  <cond> ::= <letter> = 0
  <letter> ::= a | b | c | ... | x | y | z
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

How are compound “if” statements parsed using this grammar?

if x = 0 then if y = 0 then z := 1 else w := 2

Parse Tree 1

<stmt>
  <if-stmt>
    if
    <cond>
      x = 0
    then
    <stmt>
      <if-stmt>
        if
        <cond>
          y = 0
        then
        <stmt>
          z := 1
        else
        <stmt>
          w := 2

if x = 0 then if y = 0 then z := 1 else w := 2

Parse Tree 2

<stmt>
  <if-stmt>
    if
    <cond>
      x = 0
    then
    <stmt>
      <if-stmt>
        if
        <cond>
          y = 0
        then
        <stmt>
          z := 1
    else
    <stmt>
      w := 2

Q: Which tree is correct?

How to Fix the Dangling Else?
• Algol60: use block structure
    if x = 0 then begin if y = 0 then z := 1 end else w := 2

• Algol68: use statement begin/end markers
    if x = 0 then if y = 0 then z := 1 fi else w := 2 fi

• Pascal: change the grammar of the “if” statement to disallow the second parse tree, i.e., always associate an “else” with the closest “if”.

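How to Fix the Dangling Else?
Here is a revised grammar for Pascal:
  <stmt> ::= <matched> | <unmatched>
  <matched> ::= if <cond> then <matched> else <matched> | <assignment> | ...
  <unmatched> ::= if <cond> then <stmt> | if <cond> then <matched> else <unmatched>
  <assignment> ::= <letter> := <digit>
  <cond> ::= <letter> = 0
  <letter> ::= a | b | c | ... | x | y | z
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9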

if x = 0 then if y = 0 then z := 1 else w := 2

In the new grammar there is only one parse tree!

<stmt>
  <unmatched>
    if
    <cond>
      x = 0
    then
    <stmt>
      <matched>
        if
        <cond>
          y = 0
        then
        <matched>
          z := 1
        else
        <matched>
          w := 2
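In a hand-written parser, the closest-"if" rule is often obtained without rewriting the grammar: after parsing the "then" branch, the parser simply takes an "else" if one comes next. The sketch below illustrates that common technique and is not a transcription of the revised grammar. It assumes whitespace-separated tokens, conditions of the form "variable = 0", and assignments of the form "variable := digit"; the function name and the tuple layout of the result are hypothetical.

```python
def parse_stmt(tokens, pos=0):
    """Parse one statement starting at tokens[pos].

    Returns (tree, next_pos). An "else" is consumed by the innermost
    "if" that is still open, which is the closest-"if" rule.
    """
    if tokens[pos] == "if":
        # The condition is always three tokens here: variable, "=", "0".
        cond = tuple(tokens[pos + 1:pos + 4])
        if tokens[pos + 4] != "then":
            raise SyntaxError("expected 'then'")
        then_part, pos = parse_stmt(tokens, pos + 5)
        if pos < len(tokens) and tokens[pos] == "else":
            else_part, pos = parse_stmt(tokens, pos + 1)
            return ("if", cond, then_part, else_part), pos
        return ("if", cond, then_part), pos
    # Otherwise an assignment: variable, ":=", digit.
    return ("assign", tokens[pos], tokens[pos + 2]), pos + 3


source = "if x = 0 then if y = 0 then z := 1 else w := 2"
tree, end = parse_stmt(source.split())
```

The resulting `tree` has an outer "if" with no else branch, and the "else w := 2" part sits inside the inner "if", as in the single parse tree above.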

Arithmetic Expressions
Here is a grammar for arithmetic expressions:
  <expr> ::= <expr> + <expr> | <expr> - <expr> | <expr> * <expr> | <expr> / <expr> | <letter> | <digit>
  <letter> ::= a | b | c | ... | x | y | z
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Using this grammar, how would we parse: x + 3 * y ?

Two Parse Trees

<expr>
  <expr>
    <letter>
      x
  +
  <expr>
    <expr>
      <digit>
        3
    *
    <expr>
      <letter>
        y

<expr>
  <expr>
    <expr>
      <letter>
        x
    +
    <expr>
      <digit>
        3
  *
  <expr>
    <letter>
      y
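The two trees matter because they lead to different results when the expression is evaluated. The sketch below tries every choice of root operator, which corresponds to every parse tree under the ambiguous grammar, and collects the resulting values. It uses digits in place of the variables x and y so that there is something to compute, and it leaves out division; both are simplifications made for this example, and the name `all_values` is hypothetical.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def all_values(tokens):
    """Evaluate tokens under every parse tree of the ambiguous grammar.

    tokens alternates single digits and operators, e.g. ["2", "+", "3"].
    Each choice of root operator corresponds to applying one production
    of the form <expr> ::= <expr> op <expr> at the top of the tree.
    """
    if len(tokens) == 1:
        return [int(tokens[0])]
    values = []
    for i in range(1, len(tokens), 2):          # positions of operators
        for left in all_values(tokens[:i]):
            for right in all_values(tokens[i + 1:]):
                values.append(OPS[tokens[i]](left, right))
    return values
```

For "2 + 3 * 4" the two trees give 14 and 20, so the grammar alone does not determine what the expression means.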

Precedence
Modify the grammar to add precedence:
  <expr> ::= <expr> + <expr> | <expr> - <expr> | <term>
  <term> ::= <term> * <term> | <term> / <term> | <factor>
  <factor> ::= <letter> | <digit> | ( <expr> )
  <letter> ::= a | b | c | ... | x | y | z
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Using this grammar, how would we parse: x + 3 * y ?
Using this grammar, how would we parse: 7 - 4 - 2 ?

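Only One Parse Tree

[Figure: the parse tree for x + 3 * y, which groups the expression as x + (3 * y)]

But there are two parse trees for the second example:

[Figure: two parse trees for 7 - 4 - 2, one grouping it as (7 - 4) - 2 and the other as 7 - (4 - 2)]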

Associativity
Modify the grammar to add associativity:
  <expr> ::= <expr> + <term> | <expr> - <term> | <term>
  <term> ::= <term> * <factor> | <term> / <factor> | <factor>
  <factor> ::= <letter> | <digit> | ( <expr> )
  <letter> ::= a | b | c | ... | x | y | z
  <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Using this grammar, how would we parse: 7 - 4 - 2 ?

Only One Parse Tree

<expr>
  <expr>
    <expr>
      <term>
        <factor>
          <digit>
            7
    -
    <term>
      <factor>
        <digit>
          4
  -
  <term>
    <factor>
      <digit>
        2

Concrete vs. Abstract Syntax

Concrete Syntax

<expr>
  <expr>
    <term>
      <factor>
        <letter>
          x
  +
  <term>
    <term>
      <factor>
        <digit>
          3
    *
    <factor>
      <letter>
        y

Abstract Syntax

+
  x
  *
    3
    y
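Abstract syntax trees are what most language implementations work with after parsing. As an illustration only, Python's standard `ast` module exposes the abstract syntax tree that Python's own parser builds for the same expression; Python's grammar is of course not the grammar from these slides, but its precedence rules for + and * agree with it.

```python
import ast

# Parse the expression with Python's own parser; mode="eval" means a
# single expression, and .body is the root node of that expression.
tree = ast.parse("x + 3 * y", mode="eval").body

# Show the tree as nested node constructors.
print(ast.dump(tree))
```

The root is a binary operation with an addition operator, whose right operand is itself a binary operation with a multiplication operator. None of the intermediate nodes of the concrete parse tree appear.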

Extended BNF (EBNF)
Write nonterminals as in BNF. (Variant: Write them with initial capital letters, or using a different font.)

Use additional metasymbols as shortcuts:
– {…} means repeat the enclosed text zero or more times
– […] means the enclosed text is optional
– (…) is used for grouping, usually with the alternation symbol, e.g., (… | …)

If { }, [ ], or ( ) are used as terminal symbols in the language being defined, then they must be quoted. (Variant: They must be underlined.)

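Extended BNF (EBNF)
Examples:
  <expr> ::= <term> { ( + | - ) <term> }
  <term> ::= <factor> { ( * | / ) <factor> }
  <factor> ::= <letter> | <digit> | ‘(’ <expr> ‘)’
  <if-stmt> ::= if <cond> then <stmt> [ else <stmt> ]
  <identifier> ::= <letter> { ( <letter> | <digit> ) }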
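EBNF rules translate almost directly into a recursive-descent parser: each nonterminal becomes a function, and each {…} becomes a loop. The sketch below evaluates expressions while parsing them, so the effect of precedence and left associativity is visible in the result. It assumes whitespace-separated tokens and handles digits and parentheses only, since letters have no values to compute with; the function name `evaluate` and its helpers are hypothetical.

```python
def evaluate(text):
    """Evaluate an expression over single digits, + - * / and parentheses."""
    tokens = text.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():                       # <expr> ::= <term> { ( + | - ) <term> }
        value = term()
        while peek() in ("+", "-"):   # the { ... } repetition becomes a loop
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():                       # <term> ::= <factor> { ( * | / ) <factor> }
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():                     # <factor> ::= <digit> | '(' <expr> ')'
        token = advance()
        if token == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdigit() and len(token) == 1:
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return result
```

Because the loop in `expr` folds each new term into the running value from the left, `evaluate("7 - 4 - 2")` computes (7 - 4) - 2, and because `expr` calls `term` for its operands, multiplication binds tighter than addition in `evaluate("2 + 3 * 4")`.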