Regular Expressions Exercises Part 1

A common task performed during data preparation or data analysis is the manipulation of strings.

Regular expressions are meant to assist in such tasks. A regular expression is a pattern that describes a set of strings. Regular expressions can range from simple patterns (such as finding a single number) to complex ones (such as identifying UK postcodes). R implements a set of “regular expression rules” that are largely shared with other programming languages, and it also supports some variants, such as Perl-like regular expressions. Also, whether a specific pattern is found may sometimes depend on the system locale. These patterns can be applied through several base R functions, such as: grep, grepl, regexpr, gregexpr, sub, gsub and strsplit.

Since this topic includes both learning a set of rules and several different R functions, I’ll split this subject into a three-part series. Answers to the exercises are available here. Although with regex you can get correct results in more than one way, if you have different solutions, feel free to post them.
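As a quick orientation to the base R functions named above, here is a minimal sketch that applies each of them to the same pattern. The vector fruits and the variable names are hypothetical and are not part of the exercises.

```r
# Hypothetical sample data (not part of the exercises)
fruits <- c("apple 12", "banana", "cherry 7")
digit  <- "[0-9]"   # pattern: any single digit

hit_index <- grep(digit, fruits)       # indices of matching elements: 1 3
hit_flag  <- grepl(digit, fruits)      # one logical per element: TRUE FALSE TRUE
first_pos <- regexpr(digit, fruits)    # first match per element (-1 if none): 7 -1 8
all_pos   <- gregexpr(digit, fruits)   # list holding every match position per element
first_sub <- sub(digit, "#", fruits)   # replaces the first match only
every_sub <- gsub(digit, "#", fruits)  # replaces every match
pieces    <- strsplit(fruits, " ")     # list of pieces split on a space

print(first_sub)  # "apple #2" "banana" "cherry #"
print(every_sub)  # "apple ##" "banana" "cherry #"
```

Note that regexpr and gregexpr return positions together with a match.length attribute, while grep and grepl only report whether an element matched.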

Character class

A character class is a list of characters enclosed between square brackets (i.e. [ and ]), which matches any *single* character in that list. For example, [0359abC] means “find a pattern with one of the characters 0, 3, 5, 9, a, b or C”. There are some “shortcuts” that let us match specific ranges of digits or characters:

[0-9] means any digit
[A-Z] means any upper case character
[a-z] means any lower case character

Let’s create a variable called text1 and populate it with the value “The current year is 2016”.

Exercise 1 Create a variable called my_pattern and implement the required pattern for finding any digit in the variable text1. Use function grepl to verify if there is a digit in the string variable.

Exercise 2 Use function gregexpr to find all the positions in text1 where there is a digit. Place the results in a variable called string_position.
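To see how a character class behaves before attempting the exercises, the following sketch applies the classes described in this section to a separate string. The string sample_text and the variable names are hypothetical, and regmatches is used here only to display what was matched.

```r
# Hypothetical example string (deliberately not text1)
sample_text <- "Room B12, floor 3"

# TRUE, because the string contains "3", one of the listed characters
in_list <- grepl("[0359abC]", sample_text)

# Every single character matched by the digit range
digits <- regmatches(sample_text, gregexpr("[0-9]", sample_text))[[1]]

# Every single character matched by the upper case range
# (ranges can be locale-dependent, so no output is asserted here)
uppers <- regmatches(sample_text, gregexpr("[A-Z]", sample_text))[[1]]

print(in_list)  # TRUE
print(digits)   # "1" "2" "3"
```

Each match is one character long, which reflects the rule that a class matches any single character from its list.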

Predefined classes of characters

In many cases, we will look for specific types of characters (for example, any digit, any letter, any whitespace, etc.). For this purpose, there are several predefined classes of characters that save us a lot of typing. Note: The interpretation of some predefined classes depends on the locale. The “standard” interpretation is that of the POSIX locale. Below are some “popular” predefined classes and their meaning:

1. [:alnum:] Alphanumeric characters: [:alpha:] and [:digit:].
2. [:alpha:] Alphabetic characters: [:lower:] and [:upper:] can also be used.
3. [:digit:] Digits: 0 1 2 3 4 5 6 7 8 9.
4. [:blank:] Blank characters: space and tab, and possibly other locale-dependent characters such as non-breaking space.

Exercise 3 Create a variable called my_pattern and implement the required pattern for finding one digit and one uppercase alphanumeric character in variable text1. This time, combine predefined classes in the regex pattern. Use function grepl to verify if the searched pattern exists in the string.

Exercise 4 Use function regexpr to find the position of the first space in text1. Place the results in a variable called first_space and
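One detail worth keeping in mind when using the predefined classes of this section: the brackets are part of the class name, so inside a pattern the class still has to sit within a bracket list, for example [[:digit:]]. The sketch below illustrates this on a hypothetical string; the variable names are assumptions made for the example.

```r
# Hypothetical example string (deliberately not text1)
flight <- "Flight BA2490 departs at 9am"

# A predefined class goes inside an outer pair of square brackets
has_digit <- grepl("[[:digit:]]", flight)

# All upper case letters in the string
uppers <- regmatches(flight, gregexpr("[[:upper:]]", flight))[[1]]

# Several classes can share one bracket list: drop lower case letters and blanks
stripped <- gsub("[[:lower:][:blank:]]", "", flight)

print(has_digit)  # TRUE
print(uppers)     # "F" "B" "A"
print(stripped)   # "FBA24909"
```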

Special single character

The period (“.”) matches any single character.

Exercise 5 Create a pattern that checks whether text1 contains a lowercase character, followed by any character and then by a digit.

Exercise 6 Find the starting position of the above string. Place the results in a variable called string_pos2.
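A short sketch of how the period behaves, using hypothetical strings that are unrelated to the exercises. It also shows one way to match a literal period, by escaping it with a double backslash.

```r
# Hypothetical example string
words <- "cat, cot, c t, ct"

# "c.t" needs exactly one character (any character) between "c" and "t"
three_char <- regmatches(words, gregexpr("c.t", words))[[1]]

# An unescaped period matches any character, so "3.14" also matches "3514"
any_char <- grepl("3.14", "3514")

# An escaped period matches only a literal "."
literal_dot <- grepl("3\\.14", "3514")

print(three_char)   # "cat" "cot" "c t"
print(any_char)     # TRUE
print(literal_dot)  # FALSE
```

The final "ct" is not matched because there is no character between the two letters.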

Special symbols

There are several “special symbols” that assist in the definition of specific patterns. Note that in an R string literal the backslash itself must be escaped, so you need to put an extra backslash in front of these symbols (for example, "\\d" rather than "\d"). The symbol \w matches a ‘word’ character and \W is its negation. Symbols \d, \s, \D and \S denote the digit and space classes and their negations. As you may have noticed, some special symbols have parallel “predefined classes”. (For example, \d is equivalent to [0-9] and to [[:digit:]].)

Exercise 7 Find the following pattern: one space followed by two lowercase letters and one more space. Use a function that returns the starting point of the found string and place its result in string_pos3.
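The sketch below illustrates the special symbols on a hypothetical string, with the backslash doubled as R string literals require. The variable names are assumptions made for the example.

```r
# Hypothetical example string
msg <- "Call 555 now!"

masked   <- gsub("\\d", "#", msg)  # digits replaced
joined   <- gsub("\\s", "_", msg)  # whitespace replaced
only_w   <- gsub("\\W", "", msg)   # everything that is not a word character removed
only_dig <- gsub("\\D", "", msg)   # everything that is not a digit removed

# The R literal "\\d" holds two characters: one backslash and the letter d
literal_length <- nchar("\\d")

# \d and the bracket forms give the same result on this string
same_as_range <- identical(masked, gsub("[0-9]", "#", msg))
same_as_class <- identical(masked, gsub("[[:digit:]]", "#", msg))

print(masked)    # "Call ### now!"
print(joined)    # "Call_555_now!"
print(only_w)    # "Call555now"
print(only_dig)  # "555"
```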

Metacharacters

There are several metacharacters in the “regex syntax”. Here I’ll introduce two popular ones:

The caret ("^") anchors the pattern at the beginning of the string.
The dollar sign ("$") anchors the pattern at the end of the string.

Exercise 8 Using the sub function, replace the pattern found in the previous exercise with the string “ is not ”. Place the resulting string in a variable called text2.
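To illustrate the two metacharacters from this section, here is a small sketch on a hypothetical vector. The data and variable names are assumptions made for the example.

```r
# Hypothetical sample data
items <- c("42 apples", "apples 42", "4 apples 2")

# "^" ties the match to the beginning of each string
starts_with_digit <- grepl("^[0-9]", items)

# "$" ties the match to the end of each string
ends_with_digit <- grepl("[0-9]$", items)

# Only a leading "apples" is replaced; the others are left unchanged
renamed <- sub("^apples", "pears", items)

print(starts_with_digit)  # TRUE FALSE TRUE
print(ends_with_digit)    # FALSE TRUE TRUE
print(renamed)            # "42 apples" "pears 42" "4 apples 2"
```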

Repetition Characters

There are several ways of dealing with the repetition of characters in the “regex syntax”. Here I’ll introduce the “curly brackets” syntax:

{n} The preceding item is matched exactly n times.
{n,} The preceding item is matched n or more times.
{n,m} The preceding item is matched at least n times, but not more than m times.

By default repetition is greedy, so the maximal possible number of repeats is used.

Exercise 9 Find in text2 the following pattern: four digits at the end of the string. Use a function that returns the starting point of the found string and place its result in string_pos4.

Exercise 10 Using the substr function, and according to the position of the string found in the previous exercise, extract the first two digits found at the end of text2.
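The three curly-bracket forms and the greedy default can be compared side by side. The following is a minimal sketch on hypothetical codes; the anchors ^ and $ are included so that the whole string has to fit the pattern.

```r
# Hypothetical sample data
codes <- c("ab1", "ab12", "ab123", "ab1234")

exactly_two  <- grepl("^ab[0-9]{2}$", codes)    # {n}: exactly two digits
two_or_more  <- grepl("^ab[0-9]{2,}$", codes)   # {n,}: two or more digits
two_to_three <- grepl("^ab[0-9]{2,3}$", codes)  # {n,m}: two or three digits

# Greedy matching: without anchors, {2,3} takes three digits when available
greedy <- regmatches("ab1234", regexpr("[0-9]{2,3}", "ab1234"))

print(exactly_two)   # FALSE TRUE FALSE FALSE
print(two_or_more)   # FALSE TRUE TRUE TRUE
print(two_to_three)  # FALSE TRUE TRUE FALSE
print(greedy)        # "123"
```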