Regular Expressions (RegEx) in Python

Introduction

A regular expression (RegEx) is a sequence of characters that forms a search pattern. In Python, regular expressions are used to find a pattern in a text string, and they make it possible to find, replace, and reformat data. Python provides the built-in re module for working with regular expressions.

 

Regular expressions are useful in:

1. Data analytics

2. Web scraping 

3. User input validation

4. Email verification

5. Text editors and many more

 

Implementation

Pattern matching and searching were tedious tasks before regular expressions were introduced. Regular expressions have greatly simplified them.

To see how, let’s take an example that verifies whether a phone number is valid.

Without regular expressions, the check can be written by hand like this:

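def isValidNumber(text):
    # '+' + 3-digit country code + space + 10-digit number = 15 characters
    if len(text) != 15:
        return False

    if text[0] != '+':
        return False

    # Characters 1-3 must be the country code digits
    for i in range(1, 4):
        if not text[i].isdecimal():
            return False

    if text[4] != ' ':
        return False

    # Characters 5-14 must be the ten digits of the phone number
    for i in range(5, 15):
        if not text[i].isdecimal():
            return False

    return True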

text1 = '+977 9800000000'

if isValidNumber(text1):
    print('Valid number')

Output

Valid number

We’ve taken a mobile number from Nepal for verification. The country code for Nepal is +977, and the phone number has 10 digits. We assume that the country code and the 10-digit phone number are separated by a single space.

 

Explanation of steps of verification

First, the code checks whether the supplied text (the mobile number to verify, as a string) is exactly 15 characters long. Then it checks that the text begins with “+”, that the three characters after “+” are decimal digits, that the next character is a space, and finally that the remaining ten characters are all decimal digits. If any check fails, the function returns False; otherwise it returns True.

Now, let’s see how phone number verification can be done with a regular expression:

import re

text = '+977 9800000000'
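phReg = re.compile(r'\+\d\d\d \d\d\d\d\d\d\d\d\d\d')

# fullmatch() requires the whole string to match, like the manual check above
if phReg.fullmatch(text) is None:
    print('Invalid number')
else:
    print('Valid number')

Just a few lines of code, and the result is the same. We created a search pattern using the compile() function of the re module. Here, \d represents a decimal digit (0-9), and \+ matches a literal “+” (the backslash is needed because + is a metacharacter). The pattern matches “+” followed by three digits, a space, and ten digits. We use fullmatch() rather than search() so that a string with extra characters around the number is rejected, just as in the manual version.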
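The repeated \d tokens can be shortened with the {m} quantifier covered later in this tutorial. Here is a minimal sketch, assuming the same format of “+”, three digits, a space, and ten digits. The names PHONE_RE and is_valid_number are ours, chosen only for illustration, and fullmatch() is used so that extra characters before or after the number are rejected:

```python
import re

# Hypothetical names for illustration; format: '+', 3 digits, space, 10 digits
PHONE_RE = re.compile(r'\+\d{3} \d{10}')

def is_valid_number(text):
    # fullmatch() succeeds only if the whole string matches the pattern
    return PHONE_RE.fullmatch(text) is not None

print(is_valid_number('+977 9800000000'))       # True
print(is_valid_number('+977 98000000001234'))   # False: too many digits
print(is_valid_number('call +977 9800000000'))  # False: extra leading text
```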

 

RegEx functions

Some of the functions that help us search for a pattern in text are as follows:

findall()

It returns a list containing all the matches.

import re

text = 'my mobile number is 977-980000000.My telephone number is 076-678567'

phReg = re.compile(r'\d\d\d-\d+')
contact_no = phReg.findall(text)

print(contact_no)

Output

['977-980000000', '076-678567']

 

search()

Returns a match object if there is any match in the string

import re

text = 'My mobile number is 977-9800000000 and telephone number is 076-000000'

reg = re.compile(r'\d\d\d-\d+')
contact_no = reg.search(text)

print(contact_no.group())

Output

977-9800000000

Here we got only the mobile number and not the telephone number, because search() returns only the first match. If no match is found, it returns None.
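Because search() returns None when nothing matches, calling .group() on the result directly would raise an AttributeError. A minimal sketch of a safer way to use it (the sample sentences are made up for illustration):

```python
import re

reg = re.compile(r'\d\d\d-\d+')

# Made-up sample sentences: one with a number, one without
samples = ['Call 977-9800000000 today', 'No number here']

results = []
for text in samples:
    mo = reg.search(text)
    if mo is None:
        # search() found nothing, so there is no match object to query
        results.append(None)
    else:
        results.append(mo.group())

print(results)  # ['977-9800000000', None]
```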

 

split()

The split() function returns a list in which the string has been split at each match of the pattern.

import re

text = 'My mobile number is 977-9800000000'

reg = re.split(r'\s', text)

print(reg)

Output

['My', 'mobile', 'number', 'is', '977-9800000000']
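split() becomes more useful than str.split() when the separator varies. The sketch below, using a made-up sample string, splits on commas, semicolons, or whitespace, and also shows the maxsplit argument, which limits how many splits are made:

```python
import re

# Made-up sample with mixed separators
fruits = 'apple, banana;cherry  mango'

# One separator character (comma, semicolon, or whitespace), then any extra whitespace
parts = re.split(r'[,;\s]\s*', fruits)
print(parts)  # ['apple', 'banana', 'cherry', 'mango']

# maxsplit=1 stops after the first split
head_tail = re.split(r'\s', 'My mobile number is 977-9800000000', maxsplit=1)
print(head_tail)  # ['My', 'mobile number is 977-9800000000']
```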

 

sub()

The sub() function replaces every match of the pattern with a replacement string.

import re

text = 'My mobile number is 977-9800000000'

reg = re.sub(r'\s', '-', text)

print(reg)

Output

My-mobile-number-is-977-9800000000

From the above output, we can see spaces were replaced by “-” in text.

We can also pass the count argument to limit the number of replacements.

import re

text = 'My mobile number is 977-9800000000'

reg = re.sub(r'\s', '-', text, count=1)

print(reg)

Output

My-mobile number is 977-9800000000

Only one space is replaced by “-” while other spaces remain the same.
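The replacement string can also refer back to parts of the match. In the sketch below, parentheses capture the two halves of the number, and \1 and \2 in the replacement insert them again in a new format. The target format “+977 …” is our own choice for illustration:

```python
import re

text = 'My mobile number is 977-9800000000'

# Replace every digit with 'X' to mask the number
masked = re.sub(r'\d', 'X', text)
print(masked)  # My mobile number is XXX-XXXXXXXXXX

# \1 and \2 refer to the first and second parenthesized groups
reformatted = re.sub(r'(\d{3})-(\d+)', r'+\1 \2', text)
print(reformatted)  # My mobile number is +977 9800000000
```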

 

Metacharacters

Metacharacters are characters with special meanings. Some of the metacharacters in the regular expression are:

Dot(.)

Matches any single character except a newline

import re

text = 'I am from Nepal'

match = re.findall('Ne..l', text)

print(match)

Output

['Nepal']
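To see the newline exception in action, here is a small sketch with a made-up string that has a line break in the middle. By default the dot refuses to match it, while the re.DOTALL flag lets the dot match a newline too:

```python
import re

# Made-up string with a newline between 'Ne' and 'al'
broken = 'Ne\nal'

default_match = re.findall(r'Ne.al', broken)
print(default_match)  # []

# With re.DOTALL the dot also matches the newline character
dotall_match = re.findall(r'Ne.al', broken, flags=re.DOTALL)
print(dotall_match)  # ['Ne\nal']
```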

 

Caret(^)

Matches the start of the string

import re

text = 'I am from Nepal'

match = re.search('^I', text)

print(match)
print(match.group())

Output

<re.Match object; span=(0, 1), match='I'>
I

Here, we got output as “I” because the string starts with “I”.

Let’s see another example

import re 

text = 'I am from Nepal' 

match = re.search('^am', text) 

print(match)

Output

None

Here, we got None as output because the string doesn’t start with “am”.
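By default, ^ matches only at the very start of the string. For text with several lines, the re.MULTILINE flag makes ^ match at the start of each line as well. A short sketch with a made-up two-line string:

```python
import re

# Made-up two-line text
lines = 'I am from Nepal\nI like Python'

# Without the flag, only the start of the whole string counts
first_only = re.findall(r'^I \w+', lines)
print(first_only)  # ['I am']

# With re.MULTILINE, ^ also matches right after each newline
every_line = re.findall(r'^I \w+', lines, flags=re.MULTILINE)
print(every_line)  # ['I am', 'I like']
```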

 

Dollar($)

Matches the end of the string

import re 

text = 'I am from Nepal' 

match = re.search('Nepal$', text) 

print(match)

Output

<re.Match object; span=(10, 15), match='Nepal'>

Here, the output shows that the string ends with “Nepal”.

Another example

import re 

text = 'I am from Nepal' 

match = re.search('Nep$', text) 

print(match)

Output

None

Here, the None output shows that the string doesn’t end with “Nep”.

 

Question mark(?)

It matches 0 or 1 repetition of the preceding regular expression

import re

text = "A superwoman is there"

reg = re.compile(r'super(wo)?man')
match = reg.search(text)

print(match)
print(match.group())

Output

<re.Match object; span=(2, 12), match='superwoman'>
superwoman

The (wo)? part tells the pattern that the (wo) group can appear zero or one time in the string. Here, the (wo) group appears once in ‘text’, so we got a match. Now consider:

import re

text = "a superman is there"

reg = re.compile(r'super(wo)?man')
match = reg.search(text)

print(match)
print(match.group())

Output

<re.Match object; span=(2, 10), match='superman'>
superman

Here, the (wo) group is completely absent. ‘?’ allows zero or one repetition, which is why we got the match. One more example:

import re

text = "there is a superwowowoman"

reg = re.compile(r'super(wo)?man')
mo = reg.search(text)

print(mo)

Output

None

Since the (wo) group appears three times, there is no match, so we got None as output.

 

Asterisk(*)

Matches 0 or more repetitions of the preceding regular expression

import re 

reg = re.compile('super(wo)*man') 
mo = reg.search('There is a superman') 

print(mo) 
print(mo.group())

Output

<re.Match object; span=(11, 19), match='superman'>
superman

The (wo)* part tells the pattern that the (wo) group can appear zero or more times. The (wo) group is absent here, and we still got the match.

import re

reg = re.compile(r'super(wo)*man')
mo = reg.search('There is a superwowoman')

print(mo)
print(mo.group())

Output

<re.Match object; span=(11, 23), match='superwowoman'>
superwowoman

Since (wo) is repeated twice, we got the match.

 

Plus(+)

Matches 1 or more repetitions of the preceding regular expression

import re 

reg = re.compile(r'super(wo)+man') 
mo = reg.search('There is a superman') 

print(mo)

Output

None

The (wo)+ part tells the pattern that the (wo) group must appear one or more times in the string. Since the string has no (wo) group, we didn’t get a match.

import re 

reg = re.compile(r'super(wo)+man') 
mo = reg.search('There is a superwowowoman') 

print(mo)

Output

<re.Match object; span=(11, 25), match='superwowowoman'>

The (wo) group is repeated three times, so we got a match.

 

Curly braces ({m})

Specifies that exactly m copies of the previous regular expression should be matched

import re 

reg = re.compile(r'super(wo){3}man') 
mo = reg.search('There is a superwowowoman') 

print(mo)

Output

<re.Match object; span=(11, 25), match='superwowowoman'>

The (wo) group is repeated exactly 3 times in the string, so we got a match.

import re 

reg = re.compile(r'super(wo){2}man') 
mo = reg.search('There is a superwowowoman') 

print(mo)

Output

None

The (wo){2} part tells the pattern that the (wo) group must be repeated exactly 2 times, but the string has 3 repetitions, so no match was found.

 

Curly braces ({m,n})

Matches from m to n repetitions of the preceding regular expression. Note that there must be no space after the comma inside the braces.

import re 

reg = re.compile(r'(wo){2,4}') 
mo = reg.search('wowowowowowo') 

print(mo)

Output

<re.Match object; span=(0, 8), match='wowowowo'>

Here, we got a match that has four (wo) groups. The pattern could have matched two, three, or four (wo) groups, but it went for four. This is called greedy matching: by default, a quantifier in Python matches as many repetitions as possible.

 

For non-greedy matching, add a ? after the quantifier:

import re 

reg = re.compile(r'(wo){2,4}?') 
mo = reg.search('wowowowowo') 

print(mo)

Output

<re.Match object; span=(0, 4), match='wowo'>

Here, we got a match having only two (wo) groups.
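The difference between greedy and non-greedy matching matters most with + and *. As a rough illustration with a made-up snippet of HTML-like text, the greedy pattern swallows everything from the first “<” to the last “>”, while the non-greedy one stops at the nearest “>”:

```python
import re

# Made-up HTML-like sample
markup = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ grabs as much as possible before the final '>'
greedy = re.findall(r'<.+>', markup)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? grabs as little as possible before the next '>'
lazy = re.findall(r'<.+?>', markup)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']
```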

 

Pipe(|)

Matches either the expression on its left or the expression on its right

import re

text = 'Nepal is a beautiful country and people in this country speaks Nepali'

reg = re.compile(r'Nep(al|ali)')
mo = reg.search(text)

print(mo.group())

Output

Nepal

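The regular expression matches the (al) alternative to give “Nepal” as output. Alternatives are tried from left to right, and search() stops at the first match, which here is the word “Nepal” at the start of the text.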
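The order of the alternatives matters. In the sketch below, using a made-up sentence that contains only “Nepali”, the pattern with (al|ali) still stops at “Nepal”, because “al” is tried first and succeeds. Putting the longer alternative first, or using an optional “i”, gives the full word:

```python
import re

# Made-up sentence containing only the word 'Nepali'
sentence = 'People here speak Nepali'

short_first = re.search(r'Nep(al|ali)', sentence).group()
print(short_first)  # Nepal

long_first = re.search(r'Nep(ali|al)', sentence).group()
print(long_first)  # Nepali

# Equivalent idea with an optional trailing 'i'
optional_i = re.search(r'Nepali?', sentence).group()
print(optional_i)  # Nepali
```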

 

Square bracket ([ ])

It is used to define the set of characters

import re

text = 'my name is Arnold and i\'m 108'

match = re.findall(r'[0-9]', text)

print(match)

Output

['1', '0', '8']

Here, we defined a character set that matches any single digit from 0 to 9. Also:

import re

text = 'my name is Arnold and i\'m 108'

match = re.findall(r'[^0-9]', text) 

print(match)

Output

['m', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'A', 'r', 'n', 'o', 'l', 'd', ' ', 'a', 'n', 'd', ' ', 'i', "'", 'm', ' ']

By placing ^ at the start of the set, as above, we match any character that is not in the set defined in the square brackets.
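Sets are not limited to digits. A few more sketches on the same sentence: a range of uppercase letters, an explicit list of characters, and a set combined with + to grab the whole number at once:

```python
import re

text = "my name is Arnold and i'm 108"

# A range of uppercase letters
capitals = re.findall(r'[A-Z]', text)
print(capitals)  # ['A']

# An explicit list of characters (lowercase vowels only)
vowels = re.findall(r'[aeiou]', text)
print(vowels)  # ['a', 'e', 'i', 'o', 'a', 'i']

# A set followed by + matches a run of digits as one item
numbers = re.findall(r'[0-9]+', text)
print(numbers)  # ['108']
```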

 

Character classes

Character classes are shorthand that make regular expressions shorter. For example, [0-9] matches any digit from 0 to 9, and \d matches the same digits in a shorter form (for Unicode strings, \d also matches digits from other scripts).

\d – any digit from 0 to 9

\D – any character that is not a digit

\w – any letter, digit, or underscore

\W – any character that is not a letter, digit, or underscore

\s – any whitespace character, such as a space, tab, or newline

\S – any character that is not whitespace
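A quick way to get a feel for these classes is to run each one over the same string. The sample sentence below is made up for illustration and contains an underscore and a tab on purpose:

```python
import re

# Made-up sample containing an underscore and a tab
sample = 'Room_7 costs 1500 rupees\tper night'

digits = re.findall(r'\d+', sample)
print(digits)  # ['7', '1500']

words = re.findall(r'\w+', sample)
print(words)  # ['Room_7', 'costs', '1500', 'rupees', 'per', 'night']

spaces = re.findall(r'\s', sample)
print(spaces)  # [' ', ' ', ' ', '\t', ' ']

non_digits = re.findall(r'\D+', '076-000000')
print(non_digits)  # ['-']
```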

Now, using these character classes, let’s extract the phone numbers from a text in the example below:

import re

text = 'My mobile number is 977-9867000009. My another mobile number is 977-9800000000. Also you can try my telephone: 076-000000'

reg = re.compile(r'\d\d\d-\d+')
mo = reg.findall(text)

print(mo)

Output

['977-9867000009', '977-9800000000', '076-000000']

First, note that the mobile number has a pattern of 3 digits followed by a dash (-) followed by 10 digits, and the telephone number has a pattern of 3 digits followed by a dash (-) followed by 6 digits.

Then a regex object is created using the compile() function. Its pattern looks for three digits (\d\d\d), then a dash (-), then \d+. Here, \d+ matches one or more digits, as explained in the previous section.

Let’s say we have the text “Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09”. If we want to pull out each name and birth date, we can do it easily with a regular expression.

import re

text = 'Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09'

reg = re.compile(r'\w+\s\d{4}-\d{2}-\d{2}')
mo = reg.findall(text)

print(mo)

Output

['Ramesh 2056-08-07', 'Suresh 2054-06-09']

Here, we created a pattern that looks for one or more word characters (\w+) followed by a whitespace character (\s), followed by four digits for the year (\d{4}), a dash (-), two digits for the month (\d{2}), another dash (-), and two digits for the day (\d{2}).

\d{4} and \d{2} match exactly 4 digits and exactly 2 digits respectively.
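If we want the name and the parts of the date as separate values rather than one string, we can wrap each part of the pattern in parentheses. When a pattern contains groups, findall() returns a list of tuples, one tuple per match. A minimal sketch on the same text:

```python
import re

text = 'Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09'

# Each pair of parentheses captures one field: name, year, month, day
reg = re.compile(r'(\w+)\s(\d{4})-(\d{2})-(\d{2})')
records = reg.findall(text)

print(records)
# [('Ramesh', '2056', '08', '07'), ('Suresh', '2054', '06', '09')]

for name, year, month, day in records:
    print(name, 'was born in year', year, 'month', month, 'day', day)
```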

 

Strong password detection with regular expression

Let’s assume that a strong password is one that is at least 8 characters long and has at least one uppercase letter, at least one lowercase letter, and at least one digit.

import re 

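def isStrongPassword(text):
    # Must be at least 8 characters long
    if len(text) < 8:
        return False

    # Must contain at least one uppercase letter
    if re.search(r'[A-Z]', text) is None:
        return False

    # Must contain at least one lowercase letter
    if re.search(r'[a-z]', text) is None:
        return False

    # Must contain at least one digit
    if re.search(r'[0-9]', text) is None:
        return False

    return True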

pw = input('Enter password : ')

if isStrongPassword(pw): 
    print('Password is strong') 
else: 
    print('Password is not strong')

Output

Enter password : AbGc6 

Password is not strong 

 
Enter password : abcdefgh1234 

Password is not strong 


Enter password : ABCdejfkdmdksisjsmmzmzmz 

Password is not strong 


Enter password : Abcedfghxyz1290 

Password is strong
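The same rules can also be packed into a single pattern using lookahead assertions, written (?=...), a feature not covered in this tutorial. A lookahead checks that something appears ahead without consuming any characters. The sketch below is one possible way to do it, not the only one; the names STRONG_RE and is_strong_password are ours:

```python
import re

# Hypothetical names for illustration.
# Each (?=.*X) lookahead requires at least one X somewhere in the string;
# .{8,} then requires at least 8 characters in total.
# re.DOTALL lets '.' match any character, including a newline.
STRONG_RE = re.compile(r'(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9]).{8,}', re.DOTALL)

def is_strong_password(text):
    return STRONG_RE.fullmatch(text) is not None

for pw in ['AbGc6', 'abcdefgh1234', 'ABCdejfkdmdksisjsmmzmzmz', 'Abcedfghxyz1290']:
    print(pw, is_strong_password(pw))
```

The step-by-step version above is easier to read and to extend with new rules, so the single pattern is mainly useful when a validator accepts only one regular expression.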

 

Conclusion

Regular expressions have simplified the searching and matching tasks that come up in everyday programming. Without them, matching patterns and searching for strings in text files would be tedious.

With regular expressions, these tasks take little effort and save time. The re module provides functions, metacharacters, and character classes that make matching and searching simple, so regular expressions are well worth learning.

If you have any queries regarding the tutorial, please leave a comment below and I will get back to you as soon as possible. Mastering regular expressions will put you well ahead as a programmer.

Reference

automatetheboringstuff.com/2e/chapter7/

Happy Learning 🙂
