Introduction to Programming Competitions: String Processing

Dr Vincent Gramoli | Lecturer School of Information Technologies

Outline

›  String basics
›  String matching
›  String processing with dynamic programming
›  Suffix trie/tree
›  Suffix array

String problems

›  String processing problems
   -  Not as common as graph problems or mathematics problems
   -  Common in the bioinformatics research field (e.g., DNA strings)
   -  Strings are usually long, requiring efficient data structures to parse them

String Basics


String basics

›  Given an input of (≤10) lines of (≤30) characters that contains alphabetic characters [A-Za-z], digits [0-9], space, and period ('.'), without trailing space at the end of each line, write a program to:
   -  read this text file, line by line, until a line starts with seven periods (".......")
   -  concatenate the lines into one long string T in which they are separated by a space
   -  output the result

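String basics

    import java.util.Scanner;

    public class Main {
        public static void main(String[] args) throws Exception {
            String str = "";
            Scanner sc = new Scanner(System.in);
            boolean first = true;                      // no separator before the first line
            while (sc.hasNextLine()) {
                String l = sc.nextLine();              // read line by line
                if (l.startsWith(".......")) break;    // stop at the terminating line
                if (!first) str += " ";                // separate lines with a single space
                first = false;
                str += l;                              // concatenate into one long string
            }
            System.out.println(str);
        }
    }

›  Reading the input in Java
›  Reading line by line
›  Identifying a starting token to stop reading
›  Concatenating strings into another one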

String basics

Can you determine the number of vowels and consonants that are in T in O(|T|) time?

›  Convert to char
›  Count occurrences of digits, alpha, vowels
›  Count consonants

    char[] temp = str.toCharArray();
    int i, dgts, alph, vwls, csnts;
    for (i = dgts = alph = vwls = csnts = 0; i < str.length(); i++) {
        temp[i] = Character.toLowerCase(temp[i]);
        dgts += Character.isDigit(temp[i]) ? 1 : 0;
        alph += Character.isLetter(temp[i]) ? 1 : 0;
        vwls += ("aeiou".indexOf(temp[i]) >= 0) ? 1 : 0;   // indexOf returns 0 for 'a', so test >= 0
    }
    csnts = alph - vwls;                                    // consonants are the letters that are not vowels
    System.out.printf("%d %d %d\n", dgts, vwls, csnts);

String basics

How to find the first occurrence of a substring in a string?

›  Given the string str and "INFO5010" the substring to look for

    int hasINFO5010 = (str.indexOf("INFO5010") != -1) ? 1 : 0;

›  No need for a complex algorithm, especially if the string to search for is short

String basics

How to identify the words that appear the most in a text file?

›  Tokenize by splitting with space (' ') and period ('.')
›  Use a TreeMap to store the frequency
›  Count the frequency

    // requires: import java.util.*;
    Vector<String> tokens = new Vector<String>();
    TreeMap<String, Integer> freq = new TreeMap<String, Integer>();
    StringTokenizer st = new StringTokenizer(str, " .");
    while (st.hasMoreTokens()) {
        String p = st.nextToken();
        tokens.add(p);
        if (!freq.containsKey(p)) freq.put(p, 1);      // first time we see this word
        else freq.put(p, freq.get(p) + 1);             // seen before: increment its count
    }
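The counting step above stops once the frequencies are known. A minimal sketch of the remaining step, reporting the word(s) with the highest count, could look as follows. The class name WordFrequency and the methods count and mostFrequent are illustrative names, not part of the original slides, and ties are assumed to be reported in sorted order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class WordFrequency {
    // Counts each token of str, splitting on spaces and periods.
    static TreeMap<String, Integer> count(String str) {
        TreeMap<String, Integer> freq = new TreeMap<>();
        StringTokenizer st = new StringTokenizer(str, " .");
        while (st.hasMoreTokens()) freq.merge(st.nextToken(), 1, Integer::sum);
        return freq;
    }

    // Returns every word whose count equals the maximum count, in sorted order.
    static List<String> mostFrequent(String str) {
        TreeMap<String, Integer> freq = count(str);
        int best = 0;
        for (int c : freq.values()) best = Math.max(best, c);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : freq.entrySet())
            if (e.getValue() == best) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        // "the" occurs three times, every other word fewer times
        System.out.println(mostFrequent("the cat saw the dog. the dog ran."));
    }
}
```

Since the TreeMap keeps its keys sorted, one linear pass for the maximum and one for the matching keys is enough; the total cost is dominated by the O(log k) map updates per token, where k is the number of distinct words.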

Tips

1.  Use StringBuilder rather than StringBuffer: the Java StringBuffer is thread-safe, but this is not needed in programming contests as we only write sequential programs.
2.  Using C-style character arrays will yield faster execution than using the C++ STL string.
3.  Java Strings are immutable, so operations that repeatedly build new Strings can be very slow. Use StringBuilder instead.
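As an illustration of the StringBuilder tip, here is a minimal sketch that joins lines with single spaces without creating a new String on every concatenation. The class name JoinLines and the method join are hypothetical names chosen for this example.

```java
import java.util.Arrays;
import java.util.List;

class JoinLines {
    // Joins the lines with single spaces using one StringBuilder, so each
    // append is amortized O(1) instead of copying the whole string so far.
    static String join(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            if (i > 0) sb.append(' ');      // separator before every line except the first
            sb.append(lines.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(Arrays.asList("I love", "string processing.")));
    }
}
```

With repeated `str += l`, each concatenation copies the accumulated string, which is quadratic in the total length; the StringBuilder version stays linear.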

String Matching


String matching

›  String matching is the problem of finding the starting index of a (sub)string P in a longer string T
›  Example:
   -  T = 'Today is Monday'
   -  P = 'day' ⇒ indices 2 and 12
   -  P = 'ayd' ⇒ -1/NULL
›  For matching problems where we search for relatively small strings:
   -  strstr in C (<string.h>)
   -  find in C++ (<string>)
   -  indexOf in Java (java.lang.String)

String matching

›  Find all occurrences of a substring P of length m in a (long) string T of length n, if any
›  Naïve approach:

    // assumes globals: char T[], P[]; int n = strlen(T), m = strlen(P);
    void naiveMatching() {
        for (int i = 0; i < n; i++) {                // try all potential starting indices
            bool found = true;
            for (int j = 0; j < m && found; j++)     // use boolean flag found
                if (i + j >= n || P[j] != T[i + j])  // if mismatch found
                    found = false;                   // abort this, shift the starting index i by +1
            if (found)                               // if P[0..m-1] == T[i..i+m-1]
                printf("P is found at index %d in T\n", i);
        }
    }

›  This may be close to linear when applied to natural text
›  But it runs in O(nm) in the worst case, e.g.:
   -  T = 'AAAAAAAAAAB' and P = 'AAAAB'.

String matching

Knuth-Morris-Pratt's (KMP) algorithm

›  KMP never re-compares a character in T that has matched a character in P
›  KMP works similarly to the naïve algorithm if the first character of P does not match

         01234567890123456789
    T = "THIRTY THIRTY THREE."
    P = "THIRTY THREE"

›  There are 9 matches from i=j=0..8 and one mismatch at index i=j=9
›  The naïve approach would restart from i=1
›  KMP can resume with P aligned at index 7 of T because 'TH' appears as both a suffix and a prefix of 'THIRTY TH' (i.e., 'TH' is the border of 'THIRTY TH')
›  We can safely skip the starting indices 0..6 but not 7 and 8, as P starts with 'TH'
›  KMP resets j to 2, skipping 9-2 = 7 characters ('THIRTY ') while i remains at index 9

String matching

KMP algorithm (cont'd)

         01234567890123456789
    T = "THIRTY THIRTY THREE."
    P =        "THIRTY THREE"

›  This is a match so P = 'THIRTY THREE' is found at index 7
›  KMP resets j back to 0

String matching

KMP algorithm (cont'd)

›  KMP requires preprocessing P to get the 'back' table b that achieves this speedup

    j      0   1   2   3   4   5   6   7   8   9  10  11  12
    P[j]   T   H   I   R   T   Y  ' '  T   H   R   E   E
    b[j]  -1   0   0   0   0   1   0   0   1   2   0   0   0

›  This means that if a mismatch occurs at j=9 after matching 'THIRTY TH', we know that we have to retry matching P from index j = b[9] = 2
   -  Now KMP assumes that it has matched only the first two characters of P ('TH')

String matching

KMP algorithm (cont'd)

    // assumes globals: char T[], P[]; int b[], n = strlen(T), m = strlen(P);
    void kmpPreprocess() {                           // call this before calling kmpSearch()
        int i = 0, j = -1; b[0] = -1;                // starting values
        while (i < m) {                              // pre-process the pattern string P
            while (j >= 0 && P[i] != P[j]) j = b[j]; // if different, reset j using b
            i++; j++;                                // if same, advance both pointers
            b[i] = j;                                // observe i = 7, 8, 9 with j = 0, 1, 2
        }                                            // in the example of P = "THIRTY THREE" above
    }

    void kmpSearch() {                               // similar to kmpPreprocess(), but on string T
        int i = 0, j = 0;                            // starting values
        while (i < n) {                              // search through string T
            while (j >= 0 && T[i] != P[j]) j = b[j]; // if different, reset j using b
            i++; j++;                                // if same, advance both pointers
            if (j == m) {                            // a match is found when j == m
                printf("P is found at index %d in T\n", i - j);
                j = b[j];                            // prepare j for the next possible match
            }
        }
    }

Complexity is O(n+m)
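For readers working in Java, the same two routines can be sketched without global variables. This is a direct transcription under the assumption that the pattern is non-empty; the class name Kmp and the methods backTable and search are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Kmp {
    // Back table: b[i] is the length of the longest border of p[0..i-1], with b[0] = -1.
    static int[] backTable(String p) {
        int m = p.length();
        int[] b = new int[m + 1];
        int i = 0, j = -1;
        b[0] = -1;
        while (i < m) {
            while (j >= 0 && p.charAt(i) != p.charAt(j)) j = b[j]; // mismatch: fall back
            i++; j++;                                              // match: advance both
            b[i] = j;
        }
        return b;
    }

    // Returns the starting indices of all occurrences of p in t (p assumed non-empty).
    static List<Integer> search(String t, String p) {
        List<Integer> hits = new ArrayList<>();
        int[] b = backTable(p);
        int i = 0, j = 0;
        while (i < t.length()) {
            while (j >= 0 && t.charAt(i) != p.charAt(j)) j = b[j]; // mismatch: fall back
            i++; j++;                                              // match: advance both
            if (j == p.length()) {                                 // full match ends at i - 1
                hits.add(i - j);
                j = b[j];                                          // allow overlapping matches
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(backTable("THIRTY THREE")));
        System.out.println(search("THIRTY THIRTY THREE.", "THIRTY THREE")); // prints [7]
    }
}
```

The index i into the text never moves backwards, which is where the O(n+m) bound comes from: every fallback of j is paid for by an earlier increment.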

String matching in a 2D grid

›  UVa 10010 – Where is Waldorf?

    abcdefghigg
    hebkWaldork    // WALDORF is highlighted
    ftyawAldorm
    ftsimrLqsrc
    byoarbeDeyv    // can you find BAMBI and BETTY?
    klcbqwikOmk
    strebgadhRb    // can you find DAGBERT in this row?
    yuiqlxcnbjF

›  The string matching problem can also be posed in 2D. Given a 2D grid/array of characters (instead of the well-known 1D array of characters), find the occurrence(s) of pattern P in the grid.
›  Depending on the problem requirement, the search can go in 4 or 8 directions, and either the pattern must be found in a straight line or it can bend.

String matching in a 2D grid

›  The solution is usually a recursive backtracking. This is because, unlike the 1D counterpart where we always go to the right, at every coordinate (row, col) of the 2D grid we have more than one choice to explore.
›  To speed up the backtracking process, we usually employ a pruning strategy: once the recursion depth exceeds the length of pattern P, we can immediately prune that recursive branch. This is called depth-limited search.
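A minimal sketch of the straight-line variant is given below. It is not the general recursive backtracking described above: it assumes the pattern must lie on a straight line in one of the 8 directions and that matching ignores case, so a simple loop per direction suffices and the walk never goes deeper than the pattern length. The class name GridSearch, the field GRID, and the 0-based return convention are choices made for this example.

```java
import java.util.Arrays;

class GridSearch {
    // The grid from the slide, one string per row.
    static final String[] GRID = {
        "abcdefghigg", "hebkWaldork", "ftyawAldorm", "ftsimrLqsrc",
        "byoarbeDeyv", "klcbqwikOmk", "strebgadhRb", "yuiqlxcnbjF"
    };

    // The 8 directions as (row, column) steps.
    static final int[] DR = {-1, -1, -1, 0, 0, 1, 1, 1};
    static final int[] DC = {-1, 0, 1, -1, 1, -1, 0, 1};

    // Returns {row, col} (0-based) of the first cell, in row-major order, from which
    // word can be read in a straight line ignoring case, or null if there is none.
    static int[] find(String[] grid, String word) {
        String w = word.toLowerCase();
        for (int r = 0; r < grid.length; r++)
            for (int c = 0; c < grid[r].length(); c++)
                for (int d = 0; d < 8; d++)
                    if (matches(grid, w, r, c, d)) return new int[] {r, c};
        return null;
    }

    // Walks at most w.length() steps from (r, c) in direction d, stopping at the
    // first mismatch or when the walk leaves the grid.
    static boolean matches(String[] grid, String w, int r, int c, int d) {
        for (int k = 0; k < w.length(); k++) {
            int rr = r + k * DR[d], cc = c + k * DC[d];
            if (rr < 0 || rr >= grid.length || cc < 0 || cc >= grid[rr].length()) return false;
            if (Character.toLowerCase(grid[rr].charAt(cc)) != w.charAt(k)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(find(GRID, "WALDORF"))); // prints [1, 4]
    }
}
```

When the pattern is allowed to bend, the loop over k would be replaced by a recursive call that tries every neighbouring cell at each step, with the recursion depth bounded by the pattern length.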

String Processing with DP


String processing with DP

String alignment (or edit distance)

›  Align two strings A and B with the maximum alignment score (or minimum number of edit operations)
›  After aligning A with B, there are a few possibilities between characters A[i] and B[i]:
   1.  Characters A[i] and B[i] match and we do nothing (assume this is worth a '+2' score),
   2.  Characters A[i] and B[i] mismatch and we replace A[i] with B[i] (assume a '-1' score),
   3.  We insert a space in A[i] (also a '-1' score),
   4.  We delete a letter from A[i] (also a '-1' score).

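String processing with DP

String alignment (or edit distance)

›  Example (note that we use a special symbol '_' to denote a space):

    A = 'ACAATCC' -> 'A_CAATCC'
    B = 'AGCATGC' -> 'AGCATGC_'
                      2-22--2-

›  Each '2' marks a match (+2) and each '-' marks a mismatch, insertion, or deletion (-1), so this alignment scores 4*2 + 4*(-1) = 4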

String processing with DP

String alignment (or edit distance)

›  A brute force solution that tries all possible alignments will get TLE (Time Limit Exceeded) even for medium-length strings A and/or B

String processing with DP

String alignment (or edit distance)

›  The solution for this problem is the Needleman-Wunsch DP algorithm
   -  Consider two strings A[1..n] and B[1..m].
   -  We define V(i,j) to be the score of the optimal alignment between prefixes A[1..i] and B[1..j], and score(C1,C2) to be a function that returns the score if character C1 is aligned with character C2.
   -  Base cases:
      -  V(0,0) = 0 (no score for matching two empty strings)
      -  V(i,0) = i x score(A[i], _) (delete substring A[1..i] to make the alignment, i>0)
      -  V(0,j) = j x score(_, B[j]) (align B[1..j] with j inserted spaces to make the alignment, j>0)
   -  Recurrences: for i>0 and j>0,
      -  V(i,j) = max(option1, option2, option3), where
      -  option1 = V(i-1, j-1) + score(A[i], B[j]) (score of match or mismatch)
      -  option2 = V(i-1, j) + score(A[i], _) (delete A[i])
      -  option3 = V(i, j-1) + score(_, B[j]) (insert B[j])

String processing with DP

String alignment (or edit distance)

-  Example: A = 'ACAATCC' (rows) and B = 'AGCATGC' (columns)
-  With only a scoring function of +2 for match and -1 for mismatch/insert/delete

          _   A   G   C   A   T   G   C
      _   0  -1  -2  -3  -4  -5  -6  -7
      A  -1   2   1   0  -1  -2  -3  -4
      C  -2   1   1   3   2   1   0  -1
      A  -3   0   0   2   5   4   3   2
      A  -4  -1  -1   1   4   4   3   2
      T  -5  -2  -2   0   3   6   5   4
      C  -6  -3  -3   0   2   5   5   7
      C  -7  -4  -4  -1   1   4   4   7

-  Initially, only the base cases (first row and first column) are known
-  We fill the values row by row, left to right: V(i,j) relies on V(i-1,j-1), V(i-1,j), V(i,j-1)
-  The maximum alignment score (7) is at the bottom right cell ⇒ O(nm) time
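The table-filling procedure can be sketched in Java as follows, assuming the fixed scoring of the example (+2 for a match, -1 for a mismatch, insertion, or deletion). The class name Alignment and the constants MATCH and PENALTY are illustrative names, and strings are 0-indexed in the code while the table stays 1-indexed.

```java
class Alignment {
    static final int MATCH = 2;     // score for aligning two equal characters
    static final int PENALTY = -1;  // score for a mismatch, an insertion, or a deletion

    // Returns the maximum alignment score V(n, m) of a and b.
    static int score(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] v = new int[n + 1][m + 1];                    // v[0][0] = 0
        for (int i = 1; i <= n; i++) v[i][0] = i * PENALTY;   // delete a[1..i]
        for (int j = 1; j <= m; j++) v[0][j] = j * PENALTY;   // insert b[1..j]
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean same = a.charAt(i - 1) == b.charAt(j - 1);
                int diag = v[i - 1][j - 1] + (same ? MATCH : PENALTY); // match or mismatch
                int del = v[i - 1][j] + PENALTY;                       // delete a[i]
                int ins = v[i][j - 1] + PENALTY;                       // insert b[j]
                v[i][j] = Math.max(diag, Math.max(del, ins));
            }
        }
        return v[n][m];
    }

    public static void main(String[] args) {
        System.out.println(score("ACAATCC", "AGCATGC")); // prints 7
    }
}
```

Only the previous row is needed to fill the current one, so the memory could be reduced to O(m) if just the score, and not the alignment itself, is required.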

String processing with DP

Longest common subsequence (LCS)

›  Given two strings A and B, what is the longest common subsequence between them?
›  Example:

    A = 'ACAATCC'
    B = 'AGCATGC'
    LCS(A,B) = 'ACATC'

›  LCS can be reduced to the string alignment problem seen before, so we can use the same DP algorithm
   -  We set the score for a mismatch to negative infinity (e.g., -1 billion)
   -  We set the score for insertion and deletion to 0 and the score for a match to 1
   -  This makes the Needleman-Wunsch algorithm never consider a mismatch
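The reduction can be sketched by reusing the alignment recurrence with the scores listed above: 1 for a match, 0 for an insertion or deletion, and a very negative value for a mismatch. The class name Lcs and the constant NEG_INF are names chosen for this illustration; the method returns only the length of the LCS, not the subsequence itself.

```java
class Lcs {
    static final int NEG_INF = -1_000_000_000; // mismatch score: never chosen by max

    // Returns the length of the longest common subsequence of a and b.
    static int length(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] v = new int[n + 1][m + 1];      // base cases are all 0 (indels cost 0)
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                boolean same = a.charAt(i - 1) == b.charAt(j - 1);
                int diag = v[i - 1][j - 1] + (same ? 1 : NEG_INF); // match or (forbidden) mismatch
                int del = v[i - 1][j];                             // delete a[i], score 0
                int ins = v[i][j - 1];                             // insert b[j], score 0
                v[i][j] = Math.max(diag, Math.max(del, ins));
            }
        }
        return v[n][m];
    }

    public static void main(String[] args) {
        System.out.println(length("ACAATCC", "AGCATGC")); // prints 5, the length of 'ACATC'
    }
}
```

All table entries stay non-negative because the insertion and deletion options are always available at score 0, so adding NEG_INF to an entry cannot overflow an int.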

String processing with DP

UVa 11151 – Longest palindrome

›  A palindrome is a string that can be read the same way in either direction
›  Some variants are solvable with the DP technique
›  Example:

    'ADAM' -> 'ADA'
    'MADAM' -> 'MADAM'
    'NEVERODDOREVENING' -> 'NEVERODDOREVEN'

›  The DP solution: let len(l,r) be the length of the longest palindrome from string A[l..r]. Complexity is O(n^2)
›  Base cases:
   -  If l = r then len(l,r) = 1 (odd-length palindrome)
   -  If l+1 = r then len(l,r) = 2 if A[l] = A[r], or 1 otherwise (even-length palindrome)
›  Recurrences:
   -  If A[l] = A[r] then len(l,r) = 2 + len(l+1, r-1) (both corner characters are the same)
   -  Else len(l,r) = max(len(l,r-1), len(l+1,r)) (increase left side or decrease right side)
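A bottom-up sketch of this recurrence is shown below, under the assumption that the palindrome is obtained by deleting characters, as in the three examples. The class name LongestPalindrome and the method length are illustrative names.

```java
class LongestPalindrome {
    // Returns len(0, n-1): the length of the longest palindrome obtainable
    // from a by deleting zero or more characters.
    static int length(String a) {
        int n = a.length();
        if (n == 0) return 0;
        int[][] len = new int[n][n];
        for (int l = n - 1; l >= 0; l--) {          // rows bottom-up so len[l+1][*] is ready
            len[l][l] = 1;                          // base case: a single character
            for (int r = l + 1; r < n; r++) {
                if (a.charAt(l) == a.charAt(r)) {
                    // inner part is empty when r == l + 1 (even-length base case)
                    int inner = (l + 1 <= r - 1) ? len[l + 1][r - 1] : 0;
                    len[l][r] = 2 + inner;
                } else {
                    len[l][r] = Math.max(len[l][r - 1], len[l + 1][r]);
                }
            }
        }
        return len[0][n - 1];
    }

    public static void main(String[] args) {
        System.out.println(length("ADAM"));              // prints 3
        System.out.println(length("MADAM"));             // prints 5
        System.out.println(length("NEVERODDOREVENING")); // prints 14
    }
}
```

There are O(n^2) pairs (l, r) and each is filled in constant time, which matches the stated complexity.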

Suffix Trie/Tree


Suffix trie and applications

›  The suffix i (or the ith suffix) of a string is a special case of substring that goes from the ith character of the string up to the last character of the string
›  A suffix trie of a set of strings S is a tree of all possible suffixes of the strings in S
   -  Each edge label represents a character
   -  Each vertex represents a suffix indicated by its path label
›  Example:
   -  S = {CAR, CAT, RAT}
   -  Suffixes = {CAR, AR, R, CAT, AT, T, RAT, AT, T}
   -  After sorting and removing duplicates, we have:
   -  Suffixes = {AR, AT, CAR, CAT, R, RAT, T}

[Figure: suffix trie of S = {CAR, CAT, RAT}]
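A suffix trie for a small set of words can be sketched by inserting every suffix character by character. This is a naive O(n^2)-per-word construction meant only to illustrate the structure, not an efficient one; the class name SuffixTrie and its methods addAllSuffixes and containsSubstring are names invented for this example.

```java
import java.util.TreeMap;

class SuffixTrie {
    // One vertex of the trie: children are indexed by the edge character.
    static class Node {
        TreeMap<Character, Node> next = new TreeMap<>();
    }

    final Node root = new Node();

    // Inserts every suffix of word, one character per edge.
    void addAllSuffixes(String word) {
        for (int i = 0; i < word.length(); i++) {
            Node cur = root;
            for (int k = i; k < word.length(); k++)
                cur = cur.next.computeIfAbsent(word.charAt(k), ch -> new Node());
        }
    }

    // True if p labels a path from the root, i.e. p is a substring of an inserted word.
    boolean containsSubstring(String p) {
        Node cur = root;
        for (int k = 0; k < p.length(); k++) {
            cur = cur.next.get(p.charAt(k));
            if (cur == null) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SuffixTrie trie = new SuffixTrie();
        for (String w : new String[] {"CAR", "CAT", "RAT"}) trie.addAllSuffixes(w);
        System.out.println(trie.containsSubstring("AT")); // prints true
        System.out.println(trie.containsSubstring("CT")); // prints false
    }
}
```

Because every substring is a prefix of some suffix, a query only walks |P| edges from the root, independently of how long the inserted words are.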

From suffix trie to suffix tree

[Figure: the suffix trie and the corresponding suffix tree of T = "GATAGACA$" (indices 0..8); leaves are labelled with suffix indices]

Sub-linear string matching

›  Assuming that the suffix tree of a string T is already built, we can find all occurrences of a pattern string P in T in O(m + q) where
   -  m = |P|
   -  q is the number of occurrences
›  Matching all occurrences means finding all suffixes of which P is a prefix

Sub-linear string matching

›  Find the vertex x of the suffix tree whose path label is P
   -  Traverse the tree choosing edges whose labels are consecutive substrings of P
   -  Stop at vertex x whose path label equals P
   -  The suffix indices stored in the subtree rooted in x indicate all occurrence indices
›  Example:
   -  T = "GATAGACA$"
   -  idx = "012345678"
   -  P = 'A' ⇒ occurrences: 7, 5, 3, 1
   -  P = 'GA' ⇒ occurrences: 4, 0
   -  P = 'Z' ⇒ not found

[Figure: suffix tree of T = "GATAGACA$" with the matched vertex x highlighted]

Finding the longest repeated substring

Longest Repeated Substring (LRS)

›  Finding the longest substring of a string that occurs at least twice
›  Given the suffix tree of T, we can find the LRS in T in O(n) where n = |T|
›  The substring corresponds to the path label of the deepest internal vertex x in the suffix tree of T
   -  As x is an internal vertex of the tree, its path label is a repeated substring
   -  As x is the deepest internal vertex, its path label is the longest repeated substring

Finding the longest repeated substring

LRS (cont'd)

›  Example:
   -  T = "GATAGACA"
   -  The longest repeated substring is 'GA' with path label length = 2
   -  The other repeated substring is 'A' but its path label length = 1

[Figure: suffix tree of T = "GATAGACA$" with the deepest internal vertex x highlighted]
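Building a suffix tree is involved, so the sketch below swaps in a simpler technique that gives the same answer on short strings: sort all suffixes explicitly and take the longest common prefix of two neighbours in sorted order. It is not the O(n) suffix tree method described above and runs in roughly O(n^2 log n) time; the class name LongestRepeatedSubstring and its methods are names chosen for this example.

```java
import java.util.Arrays;

class LongestRepeatedSubstring {
    // Returns a longest substring of t that occurs at least twice ("" if none).
    static String find(String t) {
        int n = t.length();
        String[] suffixes = new String[n];
        for (int i = 0; i < n; i++) suffixes[i] = t.substring(i);
        Arrays.sort(suffixes);                       // neighbours now share the longest prefixes
        String best = "";
        for (int i = 1; i < n; i++) {
            int k = commonPrefix(suffixes[i - 1], suffixes[i]);
            if (k > best.length()) best = suffixes[i].substring(0, k);
        }
        return best;
    }

    // Length of the longest common prefix of a and b.
    static int commonPrefix(String a, String b) {
        int k = 0;
        while (k < a.length() && k < b.length() && a.charAt(k) == b.charAt(k)) k++;
        return k;
    }

    public static void main(String[] args) {
        System.out.println(find("GATAGACA")); // prints GA
    }
}
```

Two sorted neighbours sharing a prefix of length k correspond to an internal vertex of the suffix tree at string depth k, which is why the maximum over neighbours matches the deepest internal vertex.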

Finding the longest common substring

Longest Common Substring (LCS)

Finding the LCS of two or more strings can be solved in linear time with a suffix tree

›  Consider the case with two strings T1 and T2
›  We can build a generalized suffix tree that combines the suffix trees of T1 and T2
›  To differentiate the source of each suffix, we use two different terminating vertex symbols, one for each string
›  Then, we mark internal vertices that have vertices in their subtrees with different terminating symbols
›  The suffixes represented by these marked internal vertices share a common prefix and come from both T1 and T2
›  These marked internal vertices represent the common substrings between T1 and T2
›  As we are interested in the longest one, we report the deepest marked vertex

Finding the longest common substring

LCS (cont'd)

›  Example:
   -  T1 = "GATAGACA", T2 = "CATA"
   -  We combine the suffix trees by using different terminating symbols $ (for T1) and # (for T2)
   -  The vertex with path label "GA" is not marked, as GACA$ and GATAGACA$ are both from T1
   -  The vertices with path labels "A", "ATA", "CA", "TA" have 2 distinct terminating symbols; they are the common substrings
   -  The deepest marked vertex has path label "ATA", the LCS
   -  LCS(T1,T2) = "ATA"

[Figure: generalized suffix tree of T1 = "GATAGACA$" and T2 = "CATA#" with the marked vertices highlighted]

Suffix Array

Suffix Array

›  The efficient implementation of a suffix tree in linear time is complex
›  Suffix arrays are simpler to build
›  We skip the O(n) construction of the suffix tree and describe a simple construction of the suffix array in O(n^2 log n); a faster O(n log n) construction exists but is more complex

Suffix Array

›  A suffix array is an integer array that stores a permutation of the n indices of the sorted suffixes
›  Example: the suffix array is the permutation of sorted suffixes

    idx  suffix          idx  SA[i]  suffix
     0   GATAGACA$        0     8    $
     1   ATAGACA$         1     7    A$
     2   TAGACA$          2     5    ACA$
     3   AGACA$           3     3    AGACA$
     4   GACA$            4     1    ATAGACA$
     5   ACA$             5     6    CA$
     6   CA$              6     4    GACA$
     7   A$               7     0    GATAGACA$
     8   $                8     2    TAGACA$

Suffix Array

Suffix tree and suffix array are closely related

›  An internal vertex of the suffix tree is a range of the suffix array
›  The suffix tree traversal visits leaves in the suffix array order

[Figure: suffix tree of T = "GATAGACA$" next to its suffix array SA = {8, 7, 5, 3, 1, 6, 4, 0, 2}]

Suffix Array

›  Construction of a suffix array

    #include <algorithm>
    #include <cstdio>
    #include <cstring>
    using namespace std;

    #define MAX_N 1010                               // O(n^2 log n) naïve approach:
    char T[MAX_N];                                   // cannot go beyond ~1000 characters
    int SA[MAX_N], i, n;

    bool cmp(int a, int b) { return strcmp(T + a, T + b) < 0; }   // O(n)

    int main() {
        if (!fgets(T, MAX_N, stdin)) return 0;       // read line (gets() is unsafe and no longer in C++)
        T[strcspn(T, "\n")] = '\0';                  // strip the trailing newline
        n = (int)strlen(T);                          // compute its length
        for (i = 0; i < n; i++) SA[i] = i;           // initial SA: {0, 1, ..., n-1}
        sort(SA, SA + n, cmp);                       // sort: O(n log n) * cmp: O(n) = O(n^2 log n)
        for (i = 0; i < n; i++) printf("%2d\t%s\n", SA[i], T + SA[i]);
    }
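The same naive construction can be sketched in Java, together with a pattern search that uses the sorted order: all suffixes starting with P are contiguous in the suffix array, so a binary search finds the start of the range in O(m log n) comparisons. The class name NaiveSuffixArray and the methods build, occurrences, and prefixOf are illustrative names, and the text is assumed to end with a terminator such as '$' that is smaller than every other character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class NaiveSuffixArray {
    // Sorts suffix start indices by comparing the suffixes: O(n^2 log n) overall.
    static int[] build(String t) {
        Integer[] idx = new Integer[t.length()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, (a, b) -> t.substring(a).compareTo(t.substring(b)));
        int[] sa = new int[idx.length];
        for (int i = 0; i < sa.length; i++) sa[i] = idx[i];
        return sa;
    }

    // Returns the sorted starting indices of p in t, using the suffix array sa of t.
    static List<Integer> occurrences(String t, int[] sa, String p) {
        int lo = 0, hi = sa.length;
        while (lo < hi) {                           // first suffix whose |p|-prefix is >= p
            int mid = (lo + hi) >>> 1;
            if (prefixOf(t, sa[mid], p.length()).compareTo(p) < 0) lo = mid + 1;
            else hi = mid;
        }
        List<Integer> hits = new ArrayList<>();
        for (int i = lo; i < sa.length && t.startsWith(p, sa[i]); i++) hits.add(sa[i]);
        Collections.sort(hits);
        return hits;
    }

    // The first len characters of the suffix starting at start (shorter near the end of t).
    static String prefixOf(String t, int start, int len) {
        return t.substring(start, Math.min(t.length(), start + len));
    }

    public static void main(String[] args) {
        String t = "GATAGACA$";
        int[] sa = build(t);
        System.out.println(Arrays.toString(sa));       // prints [8, 7, 5, 3, 1, 6, 4, 0, 2]
        System.out.println(occurrences(t, sa, "A"));   // prints [1, 3, 5, 7]
        System.out.println(occurrences(t, sa, "GA"));  // prints [0, 4]
    }
}
```

The contiguous range found by the search plays the role of the internal vertex x of the suffix tree: every suffix in the range has P as a prefix.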

Conclusion

›  We presented string processing and matching techniques
›  Java String.indexOf and C++ string.find are good enough for small strings; KMP has a higher constant overhead but becomes faster as string size increases; C strstr has similar complexity
›  For even faster string matching, one should use suffix arrays
›  There is a faster but more complex construction of suffix arrays in O(n log n)
›  Various string searching algorithms were not covered: Boyer-Moore's, Rabin-Karp's, Aho-Corasick's…
