TOC
Introduction
As a machine learning engineer, sometime I need to retrieve information from string or text file. Besides the basic functions of Python build-in string, regular expression is an important tool.
I will summarize some basic knowledge of regular expression and Python re module in this post. We can get more informations from the references listed in the last section.
Regular Expression Basics
Metacharacters
Basics:
Metacharacters | Matching |
---|---|
. | All charactor except a newline |
^ | Start of the string |
$ | End of the string |
\w | Unicode word characters including English, Chinese numbers and underscore etc. |
\s | Unicode whitespace characters including “ \t\n\t\f\v” |
\d | Numbers |
\b | Word boundary |
\W | Any character which is not a word character |
\S | Any characters which is not whitespace, equivalent of [^ \t\n\t\f\v] |
\D | not numbers |
\B | not word bounday |
[^x] | any word which is not ‘x’ |
Repeating things:
Metacharacters | Matching |
---|---|
* | Repeat zero or more times |
+ | Repeat once or more times |
? | Appear zero or once |
{n} | Appear n times |
{n, } | Appear n or more times |
{m,n} | Appear m to n times |
Group and Backreference
Group:
p = re.compile('(a(b)c)d')
m = p.match('abcd')
m.group(0)
# "abcd"
m.group(1)
# "abc"
m.group(2)
# "b"
Backreference:
p = re.compile(r'\b(\w+)\s+\1\b')
p.search('Paris in the the spring').group()
# "the the"
Lookahead Assertion
Refer to Ref[2].
Python re Module
Basic Methods
Matching String
Method | Description |
---|---|
match() | Determine if RE matches at the begining of the string |
search() | Scan through a string, looking for any location where the RE match |
findall() | Find all substrings that matches, return as list |
finditer() | Find all substrings that matches, return as iterator |
Modifying String
Method | Description |
---|---|
split() | Split the string into a list, splitting it whereever the Re matches |
sub() | Find all substring where Re matches, and replace them with a different string |
subn() | Does the same thing as sub(), but returns the new string and the number of replacement |
Common Problems
Raw String Notation
For example, you want to match the string "\section"
, since \
is a metacharactor in regular expression, you should put backslash before \
which resulting in "\\section"
.
However, to express this in Python string literal, both backslash need to be escaped which will resulting in "\\\\section"
. This will definitly leads to duplications of backslash and makes the result string hard to understand
To solve the problem, regular string notation is introduced in Python. Below is the mapping from Python regular string to raw string notation string:
"ab*"
$\rightarrow$r"ab*"
"\\\\section"
$\rightarrow$r"\\section"
"\\w+\\s+\\1"
$\rightarrow$r"\w+\s+\1"
Greedy VS Non-greedy Fasion
Non-greedy qualifiers such as *?
, +?
, ??
, {m, n}?
will match text as little as possible.
Code examples are as follows:
begin = "<"
end = ">"
string = "<hah><ffd>"
pat = re.compile(begin+'(.*?)'+end,re.S)
non_greedy_result = pat.findall(string)
pat = re.compile(begin+'(.*)'+end,re.S)
greedy_result = pat.findall(string)
print(f"Non-greedy result: {non_greedy_result}")
print(f"Greedy result: {greedy_result}")
## Output:
# Non-greedy result: ['hah', 'ffd']
# Greedy result: ['hah><ffd']