Regular Expressions. The Picture So Far

Author: Sherman Robbins
CS 172: Computability and Complexity

Regular Expressions

Sanjit A. Seshia EECS, UC Berkeley

Acknowledgments: L.von Ahn, L. Blum, M. Blum

The Picture So Far

[Figure: DFAs and NFAs both recognize exactly the regular languages]

S. A. Seshia


Today’s Lecture

[Figure: regular expressions join DFAs and NFAs as a third description of the regular languages]



Regular Expressions
• Q. What is a regular expression?
• A. It’s a “textual” / “algebraic” representation of a regular language
  – A DFA can be viewed as a “pictorial” / “explicit” representation
• We will prove that regular expressions (regexps) indeed represent regular languages

Regular Expressions: Definition
σ is a regular expression representing {σ} (σ ∈ Σ)
ε is a regular expression representing {ε}
∅ is a regular expression representing ∅
If R1 and R2 are regular expressions representing L1 and L2, then:
(R1R2) represents L1⋅L2
(R1 ∪ R2) represents L1 ∪ L2
(R1)* represents L1*
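The inductive definition can be sketched in Python. This is a minimal illustration, not part of the lecture: the nested-tuple encoding and the tags "sym", "eps", "empty", "cat", "union", "star" are assumptions made here. Because L* is usually infinite, the function only enumerates strings up to a length bound n.

```python
def lang(r, n):
    """Return the set of strings of length <= n in the language represented by r.

    r is a nested tuple (hypothetical encoding):
      ("sym", "0"), ("eps",), ("empty",),
      ("cat", r1, r2), ("union", r1, r2), ("star", r1)
    """
    tag = r[0]
    if tag == "sym":                       # σ represents {σ}
        return {r[1]} if n >= 1 else set()
    if tag == "eps":                       # ε represents {ε}
        return {""}
    if tag == "empty":                     # ∅ represents the empty language
        return set()
    if tag == "union":                     # L1 ∪ L2
        return lang(r[1], n) | lang(r[2], n)
    if tag == "cat":                       # L1 ⋅ L2, truncated to length n
        return {x + y
                for x in lang(r[1], n)
                for y in lang(r[2], n)
                if len(x + y) <= n}
    if tag == "star":                      # L1*, built by repeated concatenation
        base = lang(r[1], n)
        result = {""}
        frontier = {""}
        while frontier:
            frontier = {x + y
                        for x in frontier
                        for y in base
                        if len(x + y) <= n} - result
            result |= frontier
        return result
    raise ValueError(f"unknown tag: {tag}")
```

For example, with R = ("star", ("cat", ("sym", "1"), ("union", ("sym", "0"), ("sym", "1")))), which encodes (1(0 ∪ 1))*, lang(R, 4) is {"", "10", "11", "1010", "1011", "1110", "1111"}.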


Operator Precedence
1. *
2. ⋅ (often left out; a · b = ab)
3. ∪

Example of Precedence

R1*R2 ∪ R3 = ( ( R1* ) R2 ) ∪ R3


What’s the regexp? { w | w has exactly a single 1 }

0*10*
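One way to gain confidence in an answer like this is to test it exhaustively on short strings. The sketch below uses Python's re module, whose syntax for this particular expression happens to coincide with the slide's notation; the helper name binary_strings and the length bound 8 are choices made here for illustration.

```python
import re
from itertools import product

EXACTLY_ONE_1 = re.compile(r"0*10*")

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0..max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

# Strings on which the regexp and the set-builder description disagree.
mismatches = [
    w for w in binary_strings(8)
    if bool(EXACTLY_ONE_1.fullmatch(w)) != (w.count("1") == 1)
]
```

An empty mismatches list is evidence, not a proof, that 0*10* describes the language.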


What language does ∅* represent?

{ε} (the star of any language contains ε, the concatenation of zero strings)


What’s the regexp? { w | w has length ≥ 3 and its 3rd symbol is 0 }

Σ²0Σ*, where Σ = (0 ∪ 1)
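In a practical regex dialect, Σ becomes a character class. The sketch below writes the expression in Python's re syntax, assuming the binary alphabet; the function name third_symbol_is_0 is made up for this example.

```python
import re

# Σ = (0 ∪ 1) is written [01]; Σ² is [01]{2}; Σ* is [01]*
THIRD_IS_0 = re.compile(r"[01]{2}0[01]*")

def third_symbol_is_0(w):
    """True if w has length >= 3 and its 3rd symbol is 0."""
    return THIRD_IS_0.fullmatch(w) is not None
```

Here third_symbol_is_0("110") and third_symbol_is_0("1101") hold, while "10" (too short) and "111" (third symbol is 1) are rejected.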


Some Identities
Let R, S, T be regular expressions
• R ∪ ∅ = ?
• R · ∅ = ?
• Prove: R ( S ∪ T ) = R S ∪ R T (what’s the proof idea?)
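These identities are statements about languages, so they can be spot-checked on small finite languages represented as Python sets. The sample languages below are arbitrary choices made here, and a check on examples is not a proof. The proof idea for distributivity is the same as what the code computes: a string in R(S ∪ T) splits as xy with x in R and y in S or in T.

```python
def concat(A, B):
    """Concatenation of two finite languages given as sets of strings."""
    return {x + y for x in A for y in B}

# Arbitrary sample languages (assumptions for illustration only).
R = {"a", "ab"}
S = {"", "b"}
T = {"ba"}
EMPTY = set()

union_with_empty = R | EMPTY             # R ∪ ∅
concat_with_empty = concat(R, EMPTY)     # R · ∅: no y exists to pair with any x

left = concat(R, S | T)                  # R(S ∪ T)
right = concat(R, S) | concat(R, T)      # RS ∪ RT
```

On these samples, union_with_empty equals R, concat_with_empty is the empty set, and left equals right.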


Some Applications of Regular Expressions
• String matching & searching
  – Utilities like grep, awk, …
  – Search in editors: emacs, …
• Programming Languages
  – Perl
  – Compiler design: lex/yacc
• Computer Security
  – Virus signatures

Virus Signature as String
Chernobyl virus code fragment:
…
pop ecx
jecxz SFModMark
mov esi, ecx
mov eax, 0d601h
pop edx
pop ecx
…
Sequence of words, one for each instruction: i0 i1 i2 i3 i4 i0
[Figure: a chain of states reading i0 i1 i2 i3 i4 i0, ending in a state labeled “virus!”]

Virus Signature as Regexp
Simple obfuscated Chernobyl virus code fragment:
…
nop
pop ecx
nop
jecxz SFModMark
mov esi, ecx
nop
nop
mov eax, 0d601h
pop edx
pop ecx
…
Sequence of words doesn’t work!
[Figure: automaton over the words i0 … i4 with additional nop transitions, ending in a state labeled “virus!”]
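The difference between a fixed word sequence and a regexp signature can be sketched as follows. This is an illustration under stated assumptions, not the lecture's actual signature: instructions are assumed to be already mapped to the words i0 … i4 and nop, a trace is assumed to be a space-separated string of such words, and the nop-tolerant pattern (any number of nop words between consecutive signature words) is one plausible form for the regexp.

```python
import re

# Word sequence for the un-obfuscated fragment (pop ecx appears first and last).
SIGNATURE = ["i0", "i1", "i2", "i3", "i4", "i0"]

# Plain signature: the words must appear back to back.
plain = re.compile(r"\b" + r" ".join(SIGNATURE) + r"\b")

# nop-tolerant signature: allow (nop )* between consecutive words.
tolerant = re.compile(r"\b" + r" (?:nop )*".join(SIGNATURE) + r"\b")

original = "i0 i1 i2 i3 i4 i0"
obfuscated = "nop i0 nop i1 i2 nop nop i3 i4 i0"

plain_hits = (plain.search(original) is not None,
              plain.search(obfuscated) is not None)
tolerant_hits = (tolerant.search(original) is not None,
                 tolerant.search(obfuscated) is not None)
```

The plain pattern finds the original trace but misses the obfuscated one, while the nop-tolerant pattern finds both.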


Equivalence Theorem

A language is regular if and only if some regular expression describes it

Part I (“if part”)
Some regular expression R describes a language ⇒ That language is regular
There exists an NFA N such that R describes L(N)


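Given regular expression R, we show there exists NFA N such that R represents L(N)
Proof Idea: Induction on the length of R
Base Cases (R has length 1):
R = σ [Figure: start state with a σ-transition to an accept state]
R = ε [Figure: a single state that is both the start state and an accept state]
R = ∅ [Figure: a single start state with no accept state]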

Inductive Step: Assume R has length k > 1 and that any regular expression of length < k represents a language that can be recognized by an NFA
What might R look like?
R = R1 ∪ R2
R = R1R2
R = (R1)*
(remember: we have NFAs for R1 and R2)
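The induction translates directly into a recursive construction. The sketch below follows the common Thompson-style wiring (a fresh start and accept state per case, glued together with ε-transitions), which may differ in detail from the pictures used in lecture. The nested-tuple encoding of regexps and all function names are assumptions made for this example.

```python
from itertools import count

_ids = count()  # supplies fresh state names 0, 1, 2, ...

def build(r, eps, delta):
    """Build an NFA fragment for r and return its (start, accept) states.

    r is a nested tuple: ("sym", c), ("eps",), ("empty",),
    ("union", r1, r2), ("cat", r1, r2), ("star", r1).
    eps maps a state to its set of ε-successors;
    delta maps (state, symbol) to a set of successor states.
    """
    s, f = next(_ids), next(_ids)
    tag = r[0]
    if tag == "sym":                        # base case R = σ
        delta.setdefault((s, r[1]), set()).add(f)
    elif tag == "eps":                      # base case R = ε
        eps.setdefault(s, set()).add(f)
    elif tag == "empty":                    # base case R = ∅: no path from s to f
        pass
    elif tag == "union":                    # branch into either sub-NFA
        for sub in r[1:]:
            a, b = build(sub, eps, delta)
            eps.setdefault(s, set()).add(a)
            eps.setdefault(b, set()).add(f)
    elif tag == "cat":                      # run the first NFA, then the second
        a1, b1 = build(r[1], eps, delta)
        a2, b2 = build(r[2], eps, delta)
        eps.setdefault(s, set()).add(a1)
        eps.setdefault(b1, set()).add(a2)
        eps.setdefault(b2, set()).add(f)
    elif tag == "star":                     # skip, or loop through the sub-NFA
        a, b = build(r[1], eps, delta)
        eps.setdefault(s, set()).update({a, f})
        eps.setdefault(b, set()).update({a, f})
    else:
        raise ValueError(f"unknown tag: {tag}")
    return s, f

def closure(states, eps):
    """All states reachable from `states` using only ε-transitions."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for p in eps.get(q, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def accepts(r, w):
    """Simulate the NFA built from r on the input string w."""
    eps, delta = {}, {}
    start, accept = build(r, eps, delta)
    current = closure({start}, eps)
    for ch in w:
        moved = set()
        for q in current:
            moved |= delta.get((q, ch), set())
        current = closure(moved, eps)
    return accept in current
```

With R = ("star", ("cat", ("sym", "1"), ("union", ("sym", "0"), ("sym", "1")))), encoding (1(0 ∪ 1))*, accepts(R, "1110") is True and accepts(R, "101") is False.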


Part I (“if part”)
Some regular expression R describes a language ⇒ That language is regular
There exists an NFA N such that R describes L(N)

DONE!

An Example
Transform (1(0 ∪ 1))* to an NFA

[Figure: the resulting NFA, with ε-transitions and edges labeled 1 and 1,0]

Part II (“only if part”)
A language is regular ⇒ Some regular expression R describes it
Turn DFA into equivalent regular expression

Proof Sketch
1. DFA → Generalized NFA (GNFA)
   • NFA with edges labeled by regexps, 1 start state, and 1 accept state
2. GNFA with k states → GNFA with 2 states
   • k > 2; delete states but maintain equivalence
3. 2-state GNFA → regular expression R
   [Figure: two states joined by a single arrow labeled R]

GNFA Example & Definition
[Figure: a GNFA edge labeled 01*0]
A GNFA is a tuple (Q, Σ, δ, qstart, qaccept)
• Q – set of states
• Σ – finite alphabet (not regexps)
• qstart – initial state (unique, no incoming edges)
  • ε transitions to old start state
• qaccept – accepting state (unique, no outgoing edges)
  • ε transitions from old accept states
• δ : (Q \ {qaccept}) × (Q \ {qstart}) → ℛ
  ℛ – set of all regexps over Σ
Example: Any string matching 01*0 can cause the transition.

Step 1: DFA to GNFA
[Figure: a DFA with edges labeled a, b, and a, b]
What’s the corresponding GNFA?

Step 1: DFA to GNFA
[Figure: the DFA with a new state qstart joined by an ε edge to the old start state, and ε edges from the old accept states to a new state qaccept]
• Add unique and distinct start and accept states
• Edges with multiple labels → regexp labels
• If internal states (q1, q2) don’t have an edge between them, add one labeled with ∅

Step 2: Eliminate states from GNFA
While machine has more than 2 states: pick an internal state, rip it out, and relabel the arrows with regular expressions to account for the missing state
[Figure: a state with an incoming edge labeled 0, a self-loop labeled 1, and an outgoing edge labeled 0 is ripped out and replaced by a single edge labeled 01*0]

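[Figure: GNFA with states q0, q1, q2, q3: an ε edge from q0 to q1, a self-loop a on q1, an edge b from q1 to q2, a self-loop a ∪ b on q2, and an ε edge from q2 to q3]

[Figure: after ripping out q1: an edge a*b from q0 to q2, a self-loop a ∪ b on q2, and an ε edge from q2 to q3]

[Figure: after ripping out q2: a single edge from q0 to q3 labeled (a*b)(a ∪ b)*]

δ(q0,q3) = (a*b)(a ∪ b)*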

Formally: Add qstart and qaccept to create GNFA G
Run CONVERT(G) to eliminate states & get regexp:
If #states = 2
  return the expression on the arrow going from qstart to qaccept
If #states > 2
  select qrip ∈ Q different from qstart and qaccept
  define Q′ = Q – {qrip}
  define δ′ as: δ′(qi,qj) = δ(qi,qrip) δ(qrip,qrip)* δ(qrip,qj) ∪ δ(qi,qj)
  return CONVERT(G′)   /* recursion */
(what does this look like, pictorially?)
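Step 1 and CONVERT can be sketched together in Python. This is a minimal illustration under assumptions made here: regexps are built as strings in Python's re syntax (| stands for ∪), ∅ is represented by None, ε by the empty string, the recursion is written as a loop that rips out one state per iteration, and the names qstart and qaccept are assumed not to clash with DFA state names.

```python
import re

def union(r, s):
    """R ∪ S, using R ∪ ∅ = R."""
    if r is None:
        return s
    if s is None:
        return r
    return f"(?:{r}|{s})"

def concat(*parts):
    """Concatenation, using R·∅ = ∅."""
    if any(p is None for p in parts):
        return None
    return "".join(parts)

def star(r):
    """R*, using ∅* = ε* = ε."""
    if r is None or r == "":
        return ""
    return f"(?:{r})*"

def dfa_to_regex(states, delta, start, accepts):
    """delta maps (state, symbol) to a state. Returns a regex string, or None for ∅."""
    S, A = "qstart", "qaccept"
    g = {}                                   # GNFA labels; a missing key means ∅
    for (q, a), p in delta.items():          # edges with multiple labels -> union
        g[(q, p)] = union(g.get((q, p)), re.escape(a))
    g[(S, start)] = ""                       # ε into the old start state
    for q in accepts:
        g[(q, A)] = ""                       # ε out of each old accept state

    remaining = list(states)
    while remaining:                         # more than 2 states left
        rip = remaining.pop()
        loop = star(g.get((rip, rip)))
        for qi in remaining + [S]:
            for qj in remaining + [A]:
                through = concat(g.get((qi, rip)), loop, g.get((rip, qj)))
                g[(qi, qj)] = union(through, g.get((qi, qj)))
    return g.get((S, A))                     # label on the qstart -> qaccept arrow

# A two-state DFA: q1 loops on a, moves to q2 on b; q2 loops on a and b.
EXAMPLE_DELTA = {("q1", "a"): "q1", ("q1", "b"): "q2",
                 ("q2", "a"): "q2", ("q2", "b"): "q2"}
regex = dfa_to_regex(["q1", "q2"], EXAMPLE_DELTA, "q1", {"q2"})
```

For this DFA the result is equivalent to a*b(a ∪ b)*, up to redundant grouping. The order in which states are ripped out affects the shape of the expression but not the language it describes.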


Prove: CONVERT(G) is equivalent to G
Proof by induction on k (number of states in G)
Base Case: k = 2 (the only arrow goes from qstart to qaccept, and its label is returned)
Inductive Step: Assume claim is true for k-1 states
Prove that G and G′ are equivalent
By the induction hypothesis, G′ is equivalent to CONVERT(G′)
Since CONVERT(G) returns CONVERT(G′), CONVERT(G) is equivalent to G

The Complete Picture

[Figure: DFAs, NFAs, and regular expressions all describe exactly the regular languages]

Which language is regular?
C = { w | w has equal number of 1s and 0s }: NOT REGULAR
D = { w | w has equal number of occurrences of 01 and 10 }: REGULAR!
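One way to see why D is regular: in a binary string, occurrences of 01 and 10 alternate, so the two counts are equal exactly when the string is empty or its first and last symbols agree. That suggests the candidate regexp ε ∪ 0 ∪ 1 ∪ 0Σ*0 ∪ 1Σ*1. The candidate is not taken from the slides; the sketch below checks it against the counting definition on all binary strings up to length 10.

```python
import re
from itertools import product

# Candidate regexp for D in Python syntax: ε ∪ 0 ∪ 1 ∪ 0Σ*0 ∪ 1Σ*1
D_REGEX = re.compile(r"(?:0(?:[01]*0)?|1(?:[01]*1)?)?")

def in_D(w):
    """Membership in D by direct counting of the substrings 01 and 10."""
    return w.count("01") == w.count("10")

# Strings on which the counting definition and the regexp disagree.
disagreements = [
    w
    for n in range(11)
    for w in ("".join(bits) for bits in product("01", repeat=n))
    if in_D(w) != bool(D_REGEX.fullmatch(w))
]
```

An empty disagreements list supports the candidate on short strings. No similar trick works for C, where a recognizer would have to track an unbounded difference between the numbers of 0s and 1s.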


Next Steps • Read Sipser 1.4 in preparation for next lecture
