Data Wrangling – Regular Expression

Continuing our mini-series on data wrangling techniques, today we look at regular expressions. For example, suppose we have a phone number stored in the form +1 (647) 123-4567 and we want to extract the area code “647”. How would you do that? Regular expressions help identify patterns within a text string: given a string of text, we can extract specific words or patterns.

Basic operations in Regular Expression

Below we list some of the basic operations available when working with Regular Expressions. Think of each one as a description of a pattern we want to match or look for later on. Even though these expressions may look daunting at first, we will shortly review examples so you can better understand what is going on.

  • abc – Matches the first occurrence of this exact string
  • [abc] – Matches the first occurrence of any single character in the set
  • [a-c] – Instead of listing each letter or number, we can give a range; this matches the first occurrence of any single character in that range
  • [^abc] – Conversely, a ^ symbol at the beginning of the set matches any single character not in the set
  • \ – Special characters need extra care. These are: . ^ $ * + ? { } [ ] \ | ( ). To match one of them literally, put a backslash in front of it, for example \( or \\
  • ab*c – Instead of explicitly writing out a repeated pattern, we can use the * symbol to match zero or more repetitions of the preceding element, in this case the letter b
  • ab?c – Similarly, the ? symbol matches zero or one occurrence of the preceding element, again the letter b
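The quantifier and negated-set patterns above can be tried out with a minimal sketch like the one below. The sample strings are hypothetical and chosen only for illustration. Note that * and ? act as quantifiers only outside square brackets; inside a set they are treated as literal characters.

```python
import re

# Hypothetical sample strings, chosen only to illustrate the quantifiers.
samples = ["ac", "abc", "abbbc"]

for text in samples:
    star = re.search(r'ab*c', text)      # 'a', then zero or more 'b', then 'c'
    optional = re.search(r'ab?c', text)  # 'a', then zero or one 'b', then 'c'
    print(text, star is not None, optional is not None)

# A negated set matches the first character that is NOT in the set.
negated = re.search(r'[^abc]', "cabd")
print(negated.group())

# Inside square brackets, * is a literal character, not a quantifier.
literal_star = re.search(r'[ab*c]', "xyz*")
print(literal_star.group())
```

Only "abbbc" fails the ab?c pattern, because it has more than one b between the a and the c.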

There are also predefined abbreviations that may come in handy:

  • \d – Matches any decimal digit; this is equivalent to the class [0-9], so we no longer need to spell out the whole range.
  • \D – Matches any non-digit character; this is equivalent to the class [^0-9].
  • \s – Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
  • \S – Conversely, this matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
  • \w – Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
  • \W – Finally, this matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
  • \n – Line feed
  • \r – Carriage return
  • \t – Tab
  • \v – Vertical tab
  • \f – Page break / form feed
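As a quick illustration of these abbreviations, the sketch below applies a few of them to the phone number used throughout this article. The variable names are only for illustration.

```python
import re

mytext = "+1 (647) 123-4567"

first_digit = re.search(r'\d', mytext)     # first decimal digit
first_space = re.search(r'\s', mytext)     # first whitespace character
first_non_word = re.search(r'\W', mytext)  # first character that is not a letter, digit, or underscore

print(first_digit.group(), first_digit.span())
print(repr(first_space.group()), first_space.span())
print(first_non_word.group(), first_non_word.span())
```

Here \d finds the country code digit 1, \s finds the space right after it, and \W finds the leading + sign.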

Data Wrangling examples of basic Regular Expression operations

To begin with, let’s take an arbitrary phone number from an address book and try to extract meaningful details from it. A typical North American phone number contains a country code, an area code, and finally the local phone number.

Exact String Match

import re
mytext = "+1 (647) 123-4567"
# Matching exact text
matchtext = re.search(r'647', mytext)  # re.search(r'<Regular Expression>', <String to search>)
print(matchtext)
<re.Match object; span=(4, 7), match='647'>

In the case above, the result tells us there is a match of “647”, found at span (4, 7): the match starts at index 4 and ends just before index 7.
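Rather than reading these values off the printed object, we can ask the match object for them directly. The following is a small sketch using the standard group(), start(), end(), and span() methods of a match object.

```python
import re

mytext = "+1 (647) 123-4567"
match = re.search(r'647', mytext)

matched_text = match.group()             # the text that was matched
start, end = match.start(), match.end()  # same values as match.span()

print(matched_text)
print(start, end)
# Slicing the original string with the span gives back the matched text.
print(mytext[start:end])
```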

Any character in a list

Instead of matching an exact string, we can match the first occurrence of any character within a list. If we provide the pattern [647], we get back the position of the first 6, 4, or 7.

# Matching the first occurrence of any item in a list
matchtext = re.search(r'[647]', mytext)
print(matchtext)
<re.Match object; span=(4, 5), match='6'>

Any character in a range, no match found

If the regular expression we are trying to match is not found, re.search returns None.

# Matching the first occurrence within a range
matchtext = re.search(r'[a-z]', mytext)
print(matchtext)
None
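Because a failed search returns None, calling a method such as group() on the result would raise an AttributeError. One way to guard against this is sketched below; first_match is a hypothetical helper written for this example, not part of the re module.

```python
import re

def first_match(pattern, text, default=""):
    """Return the first matched text, or `default` when nothing matches.

    Hypothetical helper for illustration; not part of the re module.
    """
    match = re.search(pattern, text)
    if match is None:
        return default
    return match.group()

mytext = "+1 (647) 123-4567"
print(first_match(r'[a-z]', mytext, default="no match"))
print(first_match(r'[0-9]', mytext))
```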

Substring match in a range, match found

In contrast to the previous example, we now search for any digit. The result shows that the first digit found is 1, at index 1.

# Matching the first occurrence within a range - Example 2
matchtext = re.search(r'[0-9]', mytext)
print(matchtext)
<re.Match object; span=(1, 2), match='1'>
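Note that re.search stops at the first match. If we want every match instead, the standard re.findall function returns them all as a list. The sketch below also uses the + quantifier (one or more repetitions), which is not covered in the list of basic operations above.

```python
import re

mytext = "+1 (647) 123-4567"

# Every single digit, in the order it appears
all_digits = re.findall(r'[0-9]', mytext)
print(all_digits)

# Runs of one or more consecutive digits
digit_runs = re.findall(r'[0-9]+', mytext)
print(digit_runs)
```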

Matching special characters

Next we look at how to find special characters. Recall that in our example phone number the area code is enclosed in brackets. We can match these special characters literally by entering a backslash (‘\’) followed by the character. Notice how the returned result is “(647)”, brackets included, found at span (3, 8).

# Matching any characters enclosed in brackets (special characters)
# Special characters are: . ^ $ * + ? { } [ ] \ | ( )
matchtext = re.search(r'\(.*\)', mytext)
print(matchtext)
<re.Match object; span=(3, 8), match='(647)'>
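One caveat: .* is greedy, meaning it matches as much text as it can. This makes no difference for our phone number, but it does for a string with more than one pair of brackets. The sketch below uses a hypothetical string to show the difference, along with two common alternatives: the non-greedy form .*? and a negated set [^)]*.

```python
import re

# Hypothetical string with two bracketed groups, for illustration only.
text = "(647) or (416)"

greedy = re.search(r'\(.*\)', text)      # runs on to the LAST closing bracket
lazy = re.search(r'\(.*?\)', text)       # stops at the first closing bracket
bounded = re.search(r'\([^)]*\)', text)  # any characters except ')'

print(greedy.group())
print(lazy.group())
print(bounded.group())
```

The greedy pattern returns the whole string, while the other two return only “(647)”.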

Matching entire string

Even though this may not be a common use case, at times you may want to extract an entire string, or in other words, match any number of characters.

# Matching any number of characters (. matches any character except a newline)
matchtext = re.search(r'.*', mytext)
print(matchtext)
<re.Match object; span=(0, 17), match='+1 (647) 123-4567'>

Tokenising with Regular Expressions

Suppose our string contains several pieces of data of interest. For example, we want to extract the country code, area code, and phone number from our text. By wrapping parts of the pattern in parentheses, each part becomes a group that we can retrieve separately, as in the example below.

# Matching tokens in a string
matchtext = re.search(r'\+(.*) \((.*)\) (.*)-(.*)', mytext)
print(matchtext)

# Reproducing tokens matched
for i in range(0,5,1):
    print("At match number {} we have:".format(i),matchtext.group(i))
At match number 0 we have: +1 (647) 123-4567
At match number 1 we have: 1
At match number 2 we have: 647
At match number 3 we have: 123
At match number 4 we have: 4567
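As a variation on the example above, the sketch below tightens the pattern by using \d+ (one or more digits) instead of .* and gives each group a name with the (?P<name>...) syntax. The group names country, area, prefix, and line are assumptions chosen for this example.

```python
import re

mytext = "+1 (647) 123-4567"

# Named groups; the names are hypothetical labels chosen for this example.
pattern = r'\+(?P<country>\d+) \((?P<area>\d+)\) (?P<prefix>\d+)-(?P<line>\d+)'
match = re.search(pattern, mytext)

# All groups at once, as a tuple
print(match.groups())

# Individual groups by name, or everything as a dictionary
print(match.group('area'))
print(match.groupdict())
```

Using \d+ means the pattern only matches when each token really is a number, and match.group('area') gives us the area code “647” from the question at the start of this article.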

Parsing Analytical instrument serial string with Regular Expression

Another common use case for Regular Expressions is parsing the serial responses returned by analytical instruments in a laboratory setting. As a practical example, we show how a Regular Expression can easily parse such a serial string.

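# Example raw response from the instrument (a tab after "S", a newline before the date)
rawdata = "S\t23.45 g\n2020-12-25 09:34"
print(rawdata)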
S	23.45 g
2020-12-25 09:34
# Create our regular expression to map our string
match_rawdata = re.search(r'S\t(.*) (.*)\n(.*)-(.*)-(.*) (.*):(.*)',rawdata)

# Save the matched values into separate variables
weight = match_rawdata.group(1)
unit = match_rawdata.group(2)
date = match_rawdata.group(3) + "-" + match_rawdata.group(4) + "-" + match_rawdata.group(5) \
    + " " + match_rawdata.group(6) + ":" + match_rawdata.group(7)

# Print results
print("Our instrument returned {} with units {} on {}".format(weight, unit, date))
Our instrument returned 23.45 with units g on 2020-12-25 09:34
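For further analysis we usually want numbers and dates rather than strings. The sketch below is one possible variation, assuming the instrument always answers in exactly the format shown above: it uses a stricter pattern, checks for a failed match, and converts the weight to a float and the timestamp to a datetime object. The {4} and {2} quantifiers mean “exactly 4” and “exactly 2” repetitions.

```python
import re
from datetime import datetime

# Same example response as above; the fixed format is an assumption.
rawdata = "S\t23.45 g\n2020-12-25 09:34"

pattern = r'S\t(\d+\.\d+) (\w+)\n(\d{4}-\d{2}-\d{2} \d{2}:\d{2})'
match = re.search(pattern, rawdata)
if match is None:
    raise ValueError("Unexpected instrument response: {!r}".format(rawdata))

weight = float(match.group(1))  # "23.45" becomes the number 23.45
unit = match.group(2)
timestamp = datetime.strptime(match.group(3), "%Y-%m-%d %H:%M")

print(weight, unit, timestamp.isoformat())
```

Raising an error on a failed match makes a malformed response visible immediately instead of failing later with an AttributeError.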

Summary

In summary, today we went through some of the ways in which we can use Regular Expressions to perform data wrangling. First we explored some basic operations, then we looked at a practical example by parsing a serial string. Equipped with these examples, you should be able to parse many kinds of text and extract meaningful details to further fuel your data analysis.

References

Regular Expression HOWTO, Python documentation

About Alan Wong…
Alan is a part time Digital enthusiast and full time innovator who believes in freedom for all via Digital Transformation. 
Part-time artificial intelligence enthusiast, full-time entrepreneur using digital technology to unlock potential and freedom.
