Regular expressions

What are regular expressions?

Regular expressions are patterns that describe strings of symbols. For example, we can create an expression that will match any email address, any date, phone number, credit card number, etc.

In Python, we need a module called re to work with regular expressions. This module will help us to search for strings matching a pattern in a text or to check whether a given text exactly matches a given pattern.

Special characters

When using regular expressions, we must remember that some characters have a special meaning. Those are:

dot: . - matches any character except a newline. The pattern .la matches the strings "Ola", "ala" and "Ela",
question mark: ? - matches zero or one occurrence of the preceding character. The pattern Olk?a matches "Ola" and "Olka",
plus: + - matches one or more occurrences of the preceding character. The pattern a+le matches the strings "ale", "aaale", "aaaaale", etc.,
asterisk: * - matches any number of occurrences of the preceding character (including zero). The pattern a*la matches the strings "la", "ala", "aaaaala", etc.,
square brackets: [ ] - match any one of the characters they contain. The pattern [OA]la matches "Ola" and "Ala". We can also specify a range of characters with a hyphen: the pattern [a-z]al matches any lowercase letter followed by the string "al", e.g. "mal", "pal" or "eal",
parentheses: ( ) - group characters in an expression so that a modifier can be applied to the whole group,
braces: { } - specify the number of repetitions. The pattern a(la){1,3} matches all strings starting with "a" followed by one to three repetitions of "la", i.e. "ala", "alala" and "alalala",
caret inside square brackets: ^ - negates the set of characters given in the brackets. The pattern [^OA]la matches "Ela" and "Bla", but not "Ola" or "Ala",
pipe: | - stands for an alternative. The pattern Alice has a (dog|cat) matches both "Alice has a cat" and "Alice has a dog",
caret at the beginning of a pattern: ^ - matches the beginning of the text (or of each line when the re.MULTILINE flag is used). The pattern ^[a-z]+ does not match e.g. "Barbara has a hedgehog", because the string does not start with a lowercase letter,
dollar sign: $ - likewise, matches the end of the text (or of each line with re.MULTILINE).
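
The special characters listed above can be tried out directly in Python. The snippet below is a minimal sketch: fully_matches is a hypothetical helper built on re.fullmatch, and the test strings are illustrative rather than part of the original text.

```python
import re

def fully_matches(pattern, text):
    """Hypothetical helper: True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text) pairs that should match in full
should_match = [
    (r".la", "Ela"),
    (r"Olk?a", "Olka"),
    (r"a+le", "aaale"),
    (r"a*la", "la"),
    (r"[OA]la", "Ala"),
    (r"[a-z]al", "pal"),
    (r"a(la){1,3}", "alalala"),
    (r"[^OA]la", "Bla"),
    (r"Alice has a (dog|cat)", "Alice has a cat"),
]
# (pattern, text) pairs that should not match
should_not_match = [
    (r"a+le", "le"),               # + needs at least one "a"
    (r"a(la){1,3}", "alalalala"),  # four repetitions is one too many
    (r"[^OA]la", "Ola"),           # "O" is excluded by the negated set
]

for pattern, text in should_match:
    print(pattern, repr(text), fully_matches(pattern, text))  # True each time
for pattern, text in should_not_match:
    print(pattern, repr(text), fully_matches(pattern, text))  # False each time

# ^ and $ anchor a search to the start and the end of the text
starts_lower = re.search(r"^[a-z]+", "Barbara has a hedgehog")
ends_with = re.search(r"hedgehog$", "Barbara has a hedgehog")
print(starts_lower)       # None
print(ends_with.group())  # hedgehog
```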

Additionally:

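\d stands for a digit; for ASCII text it is equivalent to [0-9],
\s matches any whitespace character,
\w stands for a word character; for ASCII text it is equivalent to [A-Za-z0-9_] (with ordinary Python 3 strings it also matches letters and digits from other alphabets unless the re.ASCII flag is used),
\D, \S, \W are the negations of the above, e.g. \S stands for any character except whitespace (spaces, tabs, etc.).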
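
As a quick illustration of these shorthand classes, here is a small sketch. The sample strings are made up for the example; the last two lines show the effect of the re.ASCII flag on \w.

```python
import re

text = "Order 66 shipped\tto user_42 on 2024-01-15"

digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)
parts = re.split(r"\s+", "a  b\tc\nd")

print(digits)  # ['66', '42', '2024', '01', '15']
print(words)   # ['Order', '66', 'shipped', 'to', 'user_42', 'on', '2024', '01', '15']
print(parts)   # ['a', 'b', 'c', 'd']

# With str patterns, \w also matches non-ASCII letters unless re.ASCII is given
unicode_words = re.findall(r"\w+", "Kraków")
ascii_words = re.findall(r"\w+", "Kraków", re.ASCII)
print(unicode_words)  # ['Kraków']
print(ascii_words)    # ['Krak', 'w']
```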

If we want a special character to be treated as an ordinary one, we must precede it with a backslash. For example, a pattern that matches the string "john.smith" is [a-z]+\.[a-z]+, where \. stands for a literal dot.
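
The difference between an escaped and an unescaped dot can be checked with a short sketch. The string "johnxsmith" is an invented counterexample, and re.escape is shown as one possible way to escape a literal string automatically.

```python
import re

unescaped = r"[a-z]+.[a-z]+"   # the dot matches any character
escaped = r"[a-z]+\.[a-z]+"    # \. matches only a literal dot

print(re.fullmatch(unescaped, "johnxsmith") is not None)  # True
print(re.fullmatch(escaped, "johnxsmith") is not None)    # False
print(re.fullmatch(escaped, "john.smith") is not None)    # True

# re.escape adds the backslashes for us when the text is not known in advance
literal = re.escape("john.smith")
found = re.search(literal, "mail from john.smith today")
print(literal)        # john\.smith
print(found.group())  # john.smith
```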

regex101

Several online visual tools are available for testing the patterns we have written. One of them is https://regex101.com/. All we need to do is paste the text and provide a pattern, and the tool will highlight which parts of the text match.

Functions of the re module

The basic functions of the module are:

search

This function takes two arguments: the regular expression and the text to search. It returns a Match object describing the first substring that matches the expression and its position, or None if no matching substring is found.

    print(re.search(r"[A-Z]la", "ala Ola Ela"))

    <re.Match object; span=(4, 7), match='Ola'>
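
In practice we rarely print the Match object itself. The sketch below shows how its contents might be read with the group, start, end and span methods, and how the None result can be handled; the second input string is invented for the example.

```python
import re

m = re.search(r"[A-Z]la", "ala Ola Ela")
if m is not None:            # search returns None when nothing matches
    print(m.group())         # Ola
    print(m.start(), m.end())  # 4 7
    print(m.span())          # (4, 7)

missing = re.search(r"[A-Z]la", "ala ela")
print(missing)               # None
```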

match

The function takes the same parameters as search. The difference is that match succeeds only if the expression matches at the very beginning of the text, whereas search looks for a match anywhere in it.

    print(re.match(r"[A-Z]la", "ala Ola Ela"))

    None

fullmatch

The function takes the same parameters and checks whether the entire text matches the expression.

    print(re.fullmatch(r"[A-Z]la", "Ela"))

    <re.Match object; span=(0, 3), match='Ela'>
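
To compare search, match and fullmatch side by side, the following sketch runs all three on the same inputs. The three test strings are chosen for illustration only.

```python
import re

pattern = r"[A-Z]la"
results = {}
for text in ["Ela", "Ela ala", "ala Ela"]:
    results[text] = (
        re.search(pattern, text) is not None,     # match anywhere
        re.match(pattern, text) is not None,      # match at the beginning
        re.fullmatch(pattern, text) is not None,  # match of the whole text
    )
    print(f"{text!r}: search/match/fullmatch = {results[text]}")

# 'Ela': search/match/fullmatch = (True, True, True)
# 'Ela ala': search/match/fullmatch = (True, True, False)
# 'ala Ela': search/match/fullmatch = (True, False, False)
```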

findall

The function returns a list of all non-overlapping substrings of the text that match the pattern.

    print(re.findall(r".la", "Ola ala Ela"))

    ['Ola', 'ala', 'Ela']

finditer

This function works like findall, but returns an iterator that yields a Match object for each successive match.

    # "matches" avoids shadowing the built-in iter function
    matches = re.finditer(r".la", "Ola ala Ela")
    for elem in matches:
        print(elem)

    <re.Match object; span=(0, 3), match='Ola'>
    <re.Match object; span=(4, 7), match='ala'>
    <re.Match object; span=(8, 11), match='Ela'>
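
Because finditer yields Match objects, it is useful when we need the position of each match and not only its text. A minimal sketch:

```python
import re

text = "Ola ala Ela"
# Collect (start, end, matched text) for every match
positions = [(m.start(), m.end(), m.group()) for m in re.finditer(r".la", text)]
print(positions)  # [(0, 3, 'Ola'), (4, 7, 'ala'), (8, 11, 'Ela')]
```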

split

The split function from the re module works similarly to the split method of strings, except that here we can specify a regular expression on which the string is split.

    print(re.split(r",|\.", "apple,pear,grapes,carrot.cabbage,veggies.fruit,yard"))

    ['apple', 'pear', 'grapes', 'carrot', 'cabbage', 'veggies', 'fruit', 'yard']
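
The sketch below contrasts str.split with re.split and shows two optional features of re.split: the maxsplit argument and a capturing group that keeps the separators in the result. The shortened input string is an assumption made for readability.

```python
import re

text = "apple,pear.plum,fig"

plain = text.split(",")                         # str.split: one fixed separator
by_regex = re.split(r"[,.]", text)              # comma or dot
first_only = re.split(r"[,.]", text, maxsplit=1)
with_seps = re.split(r"([,.])", text)           # group keeps the separators

print(plain)       # ['apple', 'pear.plum', 'fig']
print(by_regex)    # ['apple', 'pear', 'plum', 'fig']
print(first_only)  # ['apple', 'pear.plum,fig']
print(with_seps)   # ['apple', ',', 'pear', '.', 'plum', ',', 'fig']
```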

sub

The function replaces every substring matching the regular expression with the given string.

    # [a-z]{3}$ matches three lowercase letters at the end of the text ("cat")
    print(re.sub(r"[a-z]{3}$", "dog", "Alice has a cat"))

    Alice has a dog

subn

The function works like sub, but additionally returns how many substitutions have been made.

    # [a-z]{3}$ matches three lowercase letters at the end of the text ("cat")
    print(re.subn(r"[a-z]{3}$", "dog", "Alice has a cat"))

    ('Alice has a dog', 1)
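
Both sub and subn accept a count argument that limits the number of replacements, and the replacement may be a function that receives each Match object. The sketch below illustrates this; the sample sentence and the double helper are hypothetical.

```python
import re

text = "3 apples, 10 pears and 25 plums"

all_replaced = re.sub(r"\d+", "N", text)
first_replaced = re.sub(r"\d+", "N", text, count=1)

def double(match):
    """Hypothetical helper: replace a number with twice its value."""
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)
doubled_with_count = re.subn(r"\d+", double, text)

print(all_replaced)        # N apples, N pears and N plums
print(first_replaced)      # N apples, 10 pears and 25 plums
print(doubled)             # 6 apples, 20 pears and 50 plums
print(doubled_with_count)  # ('6 apples, 20 pears and 50 plums', 3)
```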

Grouping

Grouping is a useful technique: by marking groups in the pattern with parentheses, we can extract individual parts of the match.

Let's look at some examples:

1.

    import re

    text = "Thomas S. (33), last seen in Krakow"
    # Group 1: first name and initial; group 2: the age in parentheses
    pattern = r"([A-Z]{1}[a-z]+ [A-Z]{1}\.) \((\d+)\)"
    match = re.search(pattern, text)
    print(match)
    print(match.groups())
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))

We have defined two groups in our pattern. The first, ([A-Z]{1}[a-z]+ [A-Z]{1}\.), says that we are looking for a string that begins with an uppercase letter, followed by at least one lowercase letter, then a space, one uppercase letter, and a period. The second group, (\d+), says that we are looking for at least one digit; the escaped parentheses \( and \) around it match the literal parentheses in the text. Let's look at the result of executing the above code:

    <re.Match object; span=(0, 14), match='Thomas S. (33)'>
    ('Thomas S.', '33')
    Thomas S. (33)
    Thomas S.
    33

As we can see, the first print shows a Match object with information about the entire pattern match. The second one shows a tuple whose elements correspond to the individual groups. The subsequent calls show specific groups, but we must remember that the group with index 0 is the entire match.

2.

    text = "Thomas (33) and Eva (24) agreed to go shopping together tomorrow"
    pattern = r"([A-Z]{1}[a-z]+) \((\d+)\)"
    print(re.findall(pattern, text))

In this case, we've defined two groups: the first matches a string of letters starting with a capital letter, the second a string of digits. The result of running the code will be:

    [('Thomas', '33'), ('Eva', '24')]

As we can see, when the pattern contains groups, findall returns a list of tuples: the first element of each tuple is the text matched by the first group, and the second element is the text matched by the second group.
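
The tuples returned by findall can be unpacked directly, and finditer gives access to the position of each group. The following is a minimal sketch, assuming the ages appear as plain numbers in parentheses; the ages dictionary and the name_spans list are illustrative names.

```python
import re

text = "Thomas (33) and Eva (24) agreed to go shopping together tomorrow"
pattern = r"([A-Z][a-z]+) \((\d+)\)"

# Unpack each (name, age) tuple and convert the age to an integer
ages = {name: int(age) for name, age in re.findall(pattern, text)}
print(ages)  # {'Thomas': 33, 'Eva': 24}

# span(1) gives the position of the first group within the text
name_spans = [(m.group(1), m.span(1)) for m in re.finditer(pattern, text)]
print(name_spans)  # [('Thomas', (0, 6)), ('Eva', (16, 19))]
```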