Python regular expressions for beginners: what they are, why you need them, and what they are for


Over the past few years, machine learning, data science, and related industries have made great strides forward. More and more companies and developers are using Python and JavaScript to work with data.



And this is where we need regular expressions. Whether parsing all or portions of text from web pages, analyzing Twitter data, or preparing data for text analysis - regular expressions come to the rescue.



By the way, Alexey Nekrasov, head of the Python department at MTS and program director of the Python department at Skillbox, has added his advice on some of the functions. To make it clear where the translation ends and the comments begin, we will set the latter off as quotes.



Why are regular expressions needed?



They help to quickly solve a variety of tasks when working with data:

  • Check that data matches a required format, such as a phone number or an e-mail address.
  • Split strings into substrings.
  • Search, extract and replace characters.
  • Perform non-trivial operations quickly.


The good news is that the core syntax of regular expressions is largely the same everywhere, so once you understand it you can use it not only in Python but in most other programming languages as well.



When are regular expressions unnecessary? When Python already has a built-in function or string method that does the job, and there are quite a few of them.
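
As a rough illustration of that point, the sketch below handles a few simple tasks with built-in string methods alone. The sample string is chosen only for the example.

```python
# Simple string tasks that need no regular expressions.
text = 'AV Analytics Vidhya AV'  # sample string, chosen for illustration

starts_with_av = text.startswith('AV')   # prefix check
has_word = 'Analytics' in text           # substring test
words = text.split(' ')                  # split on a fixed delimiter
replaced = text.replace('AV', 'av')      # replace a fixed substring

print(starts_with_av, has_word)
print(words)
print(replaced)
```

Regular expressions become worth the extra complexity once the thing you are looking for is a pattern rather than a fixed piece of text.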



What about regular expressions in Python?



Python has a dedicated module, re, for working with regular expressions. Import it, and you can start using them.



import re



The most commonly used functions the module provides are:



  • re.match()
  • re.search()
  • re.findall()
  • re.split()
  • re.sub()
  • re.compile()


Let's take a look at each of them.



re.match(pattern, string)



This function looks for the pattern at the beginning of the string. So if you call match() on the string "AV Analytics Vidhya AV" with the pattern "AV", it succeeds.



import re
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result)
Output:
<re.Match object; span=(0, 2), match='AV'>
      
      





The match was found. To display the matched text, use the group() method. The "r" prefix in front of the pattern marks it as a raw string in Python, so backslashes reach the regex engine unchanged.



result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.group(0))
 
Output:
AV
      
      





Now let's try to find "Analytics" in the same string. This fails: the string begins with "AV", so the function returns None:



result = re.match(r'Analytics', 'AV Analytics Vidhya AV')
print(result)
 
Output:
None
      
      





The start() and end() methods return the start and end positions of the match.



result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.start())
print(result.end())
 
Output:
0
2
      
      





All of these methods are extremely useful when working with strings.



re.search(pattern, string)



This function is similar to match(), except that it does not look only at the beginning of the string. So search() returns a match object if we look for "Analytics".



result = re.search(r'Analytics', 'AV Analytics Vidhya AV')
print(result.group(0))
 
Output:
Analytics
      
      





The search() function scans the entire string but returns only the first match it finds.
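
The difference between the two functions can be checked side by side. The sketch below also includes re.fullmatch(), which this article does not otherwise cover; it succeeds only when the pattern matches the whole string.

```python
import re

text = 'AV Analytics Vidhya AV'

# re.match() only tries the pattern at position 0.
print(re.match(r'Analytics', text))        # None

# re.search() scans forward and stops at the first hit.
found = re.search(r'Analytics', text)
print(found.group(0), found.span())        # Analytics (3, 12)

# re.fullmatch() requires the pattern to cover the entire string.
print(re.fullmatch(r'AV', text))           # None
print(re.fullmatch(r'AV.*AV', text).group(0))
```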



re.findall(pattern, string)



This function returns all matches as a list. findall() is not restricted to the beginning or the end of the string: if you search for "AV", you get every occurrence of "AV" back. It can stand in for both re.search() and re.match() when all you need is the matched text, although it returns a list of strings rather than a match object.



result = re.findall(r'AV', 'AV Analytics Vidhya AV')
print(result)
 
Output:
['AV', 'AV']
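
findall() returns plain strings, so the positions of the matches are lost. If the positions matter, re.finditer() yields match objects instead. The sketch below is one way to use it.

```python
import re

text = 'AV Analytics Vidhya AV'

# Each item produced by finditer() is a match object,
# so start() and end() are available for every occurrence.
positions = [(m.group(0), m.start(), m.end()) for m in re.finditer(r'AV', text)]
print(positions)  # [('AV', 0, 2), ('AV', 20, 22)]
```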
      
      





re.split(pattern, string, [maxsplit=0])



This method splits a string based on a given pattern.



result = re.split(r'y', 'Analytics')
print(result)
 
Output:
['Anal', 'tics']
      
      





In this example, the word "Analytics" is split on the letter "y". split() also accepts a maxsplit argument, which defaults to 0 and means the string is split as many times as possible. If you set it, the string is split at most that many times. Here are some examples:



result = re.split(r'i', 'Analytics Vidhya')
print(result)
 
Output:
['Analyt', 'cs V', 'dhya']
 
result = re.split(r'i', 'Analytics Vidhya', maxsplit=1)
print(result)
 
Output:
['Analyt', 'cs Vidhya']
      
      





Here the maxsplit parameter is set to 1, so the string is split into two parts instead of three.



re.sub(pattern, repl, string)



Finds the pattern in the string and replaces each match with the given substring. If the pattern is not found, the string is returned unchanged.



result = re.sub(r'India', 'the World', 'AV is largest Analytics community of India')
print(result)
 
Output:
AV is largest Analytics community of the World
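
re.sub() has two further features that the example above does not show: the count argument and group references in the replacement string. The sketch below illustrates both on the same sentence.

```python
import re

text = 'AV is largest Analytics community of India'

# count limits how many replacements are made (0, the default, means all).
first_only = re.sub(r'a', 'A', text, count=1)
print(first_only)   # AV is lArgest Analytics community of India

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+) of (\w+)', r'\2 of \1', text)
print(swapped)      # AV is largest Analytics India of community
```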
      
      





re.compile(pattern)



This compiles a regular expression into a pattern object, which can then be used for searching. It saves you from writing out the same expression again and again.



pattern = re.compile('AV')
result = pattern.findall('AV Analytics Vidhya AV')
print(result)
result2 = pattern.findall('AV is largest analytics community of India')
print(result2)
 
Output:
['AV', 'AV']
['AV']
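
re.compile() also accepts flags as an optional second argument. The sketch below uses re.IGNORECASE as an example. The second sample line is a lower-cased variant invented for the illustration.

```python
import re

# Compile once, reuse many times; flags are given at compile time.
pattern = re.compile(r'av', re.IGNORECASE)

lines = [
    'AV Analytics Vidhya AV',
    'av is largest analytics community of India',  # invented lower-case variant
]
counts = [len(pattern.findall(line)) for line in lines]
print(counts)  # [2, 1]
```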
      
      





Up to this point, we have searched for a specific sequence of characters. But what if there is no fixed sequence, and we need to return a set of characters that follows certain rules? This is a common task when extracting information from strings. It is easy to do: you just need to write an expression using special characters. The most common ones are:



  • . Any single character except the newline \n
  • ? 0 or 1 occurrence of the pattern on the left
  • + 1 or more occurrences of the pattern on the left
  • * 0 or more occurrences of the pattern on the left
  • \w Any letter, digit, or underscore (\W is anything else)
  • \d Any digit [0-9] (\D is anything except a digit)
  • \s Any whitespace character (\S is any non-whitespace character)
  • \b Word boundary
  • [..] Any one of the characters in brackets ([^..] is any character except those in brackets)
  • \ Escapes a special character (\. stands for a literal period, \+ for a plus sign)
  • ^ and $ Beginning and end of the string, respectively
  • {n,m} From n to m occurrences ({,m} is from 0 to m)
  • a|b Matches a or b
  • () Groups the expression and captures the matched text
  • \t, \n, \r Tab, newline, and carriage return, respectively


There are more special characters than these. You can find them in the Python 3 regular expression documentation.
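
To get a feel for the characters listed above, here is a small sketch that tries several of them. The sample strings are made up for the illustration.

```python
import re

sample = 'Order 66 shipped on 2021-03-04 to a.b@example.com'  # made-up text

digits = re.findall(r'\d+', sample)                 # runs of digits
capitalised = re.findall(r'\b[A-Z]\w*', sample)     # words starting with a capital
dates = re.findall(r'\d{4}-\d{2}-\d{2}', sample)    # fixed-width date
optional = re.findall(r'colou?r', 'color colour')   # ? makes "u" optional
either = re.findall(r'cat|dog', 'cat dog bird')     # alternation
literal_dot = re.findall(r'a\.b', 'a.b axb')        # escaped dot matches only "."

print(digits)       # ['66', '2021', '03', '04']
print(capitalised)  # ['Order']
print(dates)        # ['2021-03-04']
print(optional)     # ['color', 'colour']
print(either)       # ['cat', 'dog']
print(literal_dot)  # ['a.b']
```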



Some examples of using regular expressions



Example 1. Returning the first word from a string



Let's first try to get each character using (.)



result = re.findall(r'.', 'AV is largest Analytics community of India')
print(result)
 
Output:
['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']
      
      







Now we do the same, but use \w instead of . so that spaces are left out of the result:



result = re.findall(r'\w', 'AV is largest Analytics community of India')
print(result)
 
Output:
['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']
      
      





Now let's do the same for whole words, using * or +:



result = re.findall(r'\w*', 'AV is largest Analytics community of India')
print(result)
 
Output:
['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']
      
      





The result now contains empty strings. The reason is that * means "zero or more characters", so it also matches nothing at all. Using + instead removes them:



result = re.findall(r'\w+', 'AV is largest Analytics community of India')
print(result)
Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']
      
      





Now, let's extract the first word using ^:



result = re.findall(r'^\w+', 'AV is largest Analytics community of India')
print(result)
 
Output:
['AV']
      
      





But if you use $ instead of ^, then we get the last word, not the first:



result = re.findall(r'\w+$', 'AV is largest Analytics community of India')
print(result)
 
Output:
['India']
 
      
      





Example 2. Returning two characters of each word



As above, there are several options. In the first, we use \w\w to extract every pair of consecutive word characters (spaces excluded) from each word:



result = re.findall(r'\w\w', 'AV is largest Analytics community of India')
print(result)
 
Output:
['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']
      
      







Now we extract only the first two characters of each word by using the word boundary \b:



result = re.findall(r'\b\w.', 'AV is largest Analytics community of India')
print(result)
 
Output:
['AV', 'is', 'la', 'An', 'co', 'of', 'In']
      
      





Example 3. Returning domains from a list of email addresses.



In the first step, we return the @ sign and the word characters that follow it:



result = re.findall(r'@\w+', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)
 
Output:
['@gmail', '@test', '@analyticsvidhya', '@rest']
      
      





As a result, the parts ".com", ".in", etc. do not end up in the result. To fix this, you need to change the code:



result = re.findall(r'@\w+\.\w+', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)
 
Output:
['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']
      
      





The second solution to the same problem is to extract only the top-level domain using "()":



result = re.findall(r'@\w+\.(\w+)', 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)
 
Output:
['com', 'in', 'com', 'biz']
      
      





Example 4. Getting a date from a string



To do this, use \d:



result = re.findall(r'\d{2}-\d{2}-\d{4}', 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)
 
Output:
['12-05-2007', '11-11-2011', '12-01-2009']
      
      





To extract only the year, wrap that part of the pattern in parentheses:



result = re.findall(r'\d{2}-\d{2}-(\d{4})', 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)
 
Output:
['2007', '2011', '2009']
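
When a pattern contains several groups, findall() returns one tuple per match. The sketch below uses this to build date objects. It assumes the dates are written day-month-year, which the sample text does not actually state.

```python
import re
from datetime import date

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# With several groups, findall() returns one tuple per match.
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)  # [('12', '05', '2007'), ('11', '11', '2011'), ('12', '01', '2009')]

# Assumption: the order is day-month-year; the source text does not say.
dates = [date(int(y), int(m), int(d)) for d, m, y in parts]
print(dates[0].isoformat())  # 2007-05-12
```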
      
      





Example 5. Extracting words beginning with a vowel



At the first stage, you need to return all words:



result = re.findall(r'\w+', 'AV is largest Analytics community of India')
print(result)
 
Output:
['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']
      
      





Then we keep only those that begin with a vowel, using "[]":

result = re.findall(r'[aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)
 
Output:
['AV', 'is', 'argest', 'Analytics', 'ommunity', 'of', 'India']
      
      





The result contains two truncated words, "argest" and "ommunity". To get rid of them, use \b, which marks a word boundary:

result = re.findall(r'\b[aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)
 
Output:
['AV', 'is', 'Analytics', 'of', 'India']
      
      







Conversely, a ^ inside the square brackets negates the set, so we can look for words that do not begin with a vowel:



result = re.findall(r'\b[^aeiouAEIOU]\w+', 'AV is largest Analytics community of India')
print(result)
 
Output:
[' is', ' largest', ' Analytics', ' community', ' of', ' India']
      
      





Now the matches include a leading space. To exclude them, add the space to the set in square brackets:



result = re.findall(r'\b[^aeiouAEIOU ]\w+', 'AV is largest Analytics community of India')
print(result)
 
Output:
['largest', 'community']
      
      





Example 6. Checking the format of a phone number



In our example, a number must be 10 digits long and start with 8 or 9. To check a list of phone numbers, use:



li = ['9999999999', '999999-999', '99999x9999']

for val in li:
    if re.match(r'[8-9]{1}[0-9]{9}', val) and len(val) == 10:
        print('yes')
    else:
        print('no')

Output:
yes
no
no
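
The length check can also be moved into the regular expression itself. The sketch below is an alternative formulation using re.fullmatch(), which is not part of the original example. The fourth number is added here to show that an 11-digit string is rejected.

```python
import re

PHONE = re.compile(r'[89]\d{9}')  # first digit 8 or 9, then nine more digits

numbers = ['9999999999', '999999-999', '99999x9999', '99999999999']

# fullmatch() succeeds only if the whole string fits the pattern,
# so no separate len() check is needed.
valid = [bool(PHONE.fullmatch(n)) for n in numbers]
print(valid)  # [True, False, False, False]
```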
      
      





Example 7. Splitting a string on several delimiters



Here we have several solutions. Here's the first one:



line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result = re.split(r'[;,\s]', line)
print(result)
 
Output:
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
      
      





Alternatively, the re.sub() function can be used to replace all delimiters with spaces:



line = 'asdf fjdk;afed,fjek,asdf,foo'
result = re.sub(r'[;,\s]', ' ', line)
print(result)
 
Output:
asdf fjdk afed fjek asdf foo
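
Both solutions treat every delimiter character separately. When delimiters can appear next to each other, adding + to the character class collapses each run into one. The input below is a made-up variant of the line above with doubled delimiters.

```python
import re

line = 'asdf  fjdk; afed, fjek,asdf,foo'  # made-up variant with doubled delimiters

single = re.split(r'[;,\s]', line)    # empty strings appear between adjacent delimiters
merged = re.split(r'[;,\s]+', line)   # + treats a run of delimiters as one

print(single)  # ['asdf', '', 'fjdk', '', 'afed', '', 'fjek', 'asdf', 'foo']
print(merged)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```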
      
      





Example 8. Extracting data from an HTML file



In this example, we extract the data enclosed between <td> and </td> in an HTML file, except for the first column, which holds the row number. We assume that the HTML code is stored in a string named html.

Sample file

<tr><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr><td>7</td> <td>Michael</td> <td>Emily</td></tr>

In order to solve this problem, perform the following operation:

result = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)
print(result)
Output:
[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia'), ('Michael', 'Emily')]
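
For a version that runs on its own, the sketch below puts a short HTML fragment into a string first. The markup is hypothetical: it is shaped to fit the pattern (three <td> cells per row, separated by a single space), and the variable name html is chosen for the illustration.

```python
import re

# Hypothetical markup, shaped to fit the pattern used in this example.
html = """
<tr><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
"""

# The first cell is matched but not captured; the next two are captured.
pairs = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)
print(pairs)  # [('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia')]
```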
      
      







Alexey's comment



When writing any regex in the code, adhere to the following rules:



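  • Use re.compile for any regex that is long or complex, and avoid calling re.compile repeatedly on the same regex.
  • Write verbose regular expressions with re.VERBOSE. Pass the re.VERBOSE flag to re.compile and write the regex over several lines, with comments on what each part does.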


Example, in compact form:





pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
re.search(pattern, 'MDLV')
      
      











Verbose form:

pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 Ms
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 Cs),
                        #            or 500-800 (D, followed by 0 to 3 Cs)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 Xs),
                        #        or 50-80 (L, followed by 0 to 3 Xs)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 Is),
                        #        or 5-8 (V, followed by 0 to 3 Is)
    $                   # end of string
    """
re.search(pattern, 'M', re.VERBOSE)
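
The two rules can be combined by compiling the verbose pattern once and reusing the resulting object. The sketch below does that with the same Roman-numeral expression. The name ROMAN and the test strings are chosen for the illustration.

```python
import re

# Compile the verbose pattern once; whitespace and comments inside it are ignored.
ROMAN = re.compile(r"""
    ^                   # beginning of string
    M{0,3}              # thousands
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
    """, re.VERBOSE)

checks = {s: bool(ROMAN.search(s)) for s in ['MDLV', 'MCMXC', 'IIII', 'ABC']}
print(checks)  # {'MDLV': True, 'MCMXC': True, 'IIII': False, 'ABC': False}
```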
      
      





Use named capture groups, (?P<name>...), for all capture groups when there is more than one (and preferably even when there is only one).
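
As a sketch of what named groups look like in practice, the date pattern from Example 4 can be rewritten as follows. The group names day, month, and year are assumptions: the sample data does not say which field is which.

```python
import re

# (?P<name>...) labels each captured part; the names here are assumed.
DATE = re.compile(r'(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})')

m = DATE.search('Amit 34-3456 12-05-2007')
print(m.group('year'))   # 2007
print(m.groupdict())     # {'day': '12', 'month': '05', 'year': '2007'}
```

Named groups keep the code readable when the pattern changes, because adding or reordering groups does not shift the indices that the rest of the code relies on.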

regex101.com is a handy site for debugging and testing a regex.



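When developing a regular expression, keep its execution cost in mind, or you may run into the same kind of problem that Cloudflare did.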


