Introduction
A regular expression (RegEx) is a sequence of characters that defines a search pattern. In Python, regular expressions are used to find a pattern in a text string, and they also make it possible to replace and reformat the matched data. Python provides the built-in re module for working with regular expressions.
Common applications of regular expressions:
1. Data analytics
2. Web scraping
3. User input validation
4. Text editors and many more
Implementation
Before regular expressions, pattern matching and searching had to be coded by hand, character by character, which was tedious and error-prone. Regular expressions simplify this considerably.
To see how, let’s take an example that checks whether a phone number is valid.
Without regular expressions, the check could be written like this:
def isValidNumber(text):
    # '+' + 3-digit country code + space + 10 digits = 15 characters
    if len(text) != 15:
        return False
    if text[0] != '+':
        return False
    # Country code: characters 1 to 3 must be digits
    for i in range(1, 4):
        if not text[i].isdecimal():
            return False
    if text[4] != ' ':
        return False
    # Phone number: characters 5 to 14 must be digits
    for i in range(5, 15):
        if not text[i].isdecimal():
            return False
    return True

text1 = '+977 9800000000'
if isValidNumber(text1):
    print('Valid number')
Output
Valid number
We’ve used a mobile number from Nepal for verification. The country code for Nepal is +977, and the phone number itself has 10 digits. We assume that the country code and the 10-digit phone number are separated by a single space.
Explanation of steps of verification
First, the code checks whether the supplied text (the mobile number to verify, as a string) is exactly 15 characters long; if not, it returns False. Then it checks whether the text begins with “+”; if not, it returns False. Next, it checks that the three characters after “+” are decimal digits; if not, it returns False. Then it checks that the fifth character is a space; if not, it returns False. Finally, it checks that the remaining ten characters are all decimal digits; if not, it returns False. If every check passes, it returns True.
Now, let’s see how phone number verification can be done with a regular expression:
import re

text = '+977 9800000000'
phReg = re.compile(r'\+\d\d\d \d\d\d\d\d\d\d\d\d\d')
if phReg.search(text) is None:
    print('Invalid number')
else:
    print('Valid number')
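Just a few lines of code, and the result is the same. We’ve created a search pattern using the compile() function of the re module. Here, \d represents a decimal digit (0-9), and \+ is an escaped plus sign, because an unescaped + has a special meaning in a pattern. The pattern looks for “+” followed by three digits, a space, and ten digits. Note that search() finds the pattern anywhere in the text, so unlike the manual version it does not reject text that has extra characters before or after the number.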
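A stricter check is possible with re.fullmatch(), which succeeds only when the whole string matches the pattern. The sketch below is one way to write it. The names PHONE_RE and is_valid_number are illustrative and not part of the original example, and [0-9]{3} is simply a shorter way to ask for three digits.

```python
import re

# Hypothetical helper: '+', 3 digits, one space, 10 digits, and nothing else.
PHONE_RE = re.compile(r'\+[0-9]{3} [0-9]{10}')

def is_valid_number(text):
    # fullmatch() returns None unless the entire string matches the pattern
    return PHONE_RE.fullmatch(text) is not None

print(is_valid_number('+977 9800000000'))           # True
print(is_valid_number('call +977 9800000000 now'))  # False
```

With this version, surrounding text or extra digits make the number invalid, which is closer to what the manual length check does.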
RegEx function
Some of the functions that help us search for a pattern in text are as follows:
findall()
It returns a list containing all the matches
import re

text = 'my mobile number is 977-980000000.My telephone number is 076-678567'
phReg = re.compile(r'\d\d\d-\d+')
contact_no = phReg.findall(text)
print(contact_no)
Output
['977-980000000', '076-678567']
search()
Returns a match object if there is any match in the string
import re

text = 'My mobile number is 977-9800000000 and telephone number is 076-000000'
reg = re.compile(r'\d\d\d-\d+')
contact_no = reg.search(text)
print(contact_no.group())
Output
977-9800000000
Here we got only the mobile number and not the telephone number. This is because search() returns only the first match. If no match is found, it returns None.
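Because search() can return None, calling group() on the result directly raises an AttributeError when nothing matches. A minimal sketch of a guard follows; the function name first_contact is made up for this illustration.

```python
import re

reg = re.compile(r'\d\d\d-\d+')

def first_contact(text):
    # search() returns None when nothing matches, so check before calling group()
    match = reg.search(text)
    if match is None:
        return None
    return match.group()

print(first_contact('Call 076-000000 today'))  # 076-000000
print(first_contact('No number here'))         # None
```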
split()
This split() function returns a list where the string has been split at each match
import re

text = 'My mobile number is 977-9800000000'
# Use a raw string so that \s reaches the regex engine unchanged
reg = re.split(r'\s', text)
print(reg)
Output
['My', 'mobile', 'number', 'is', '977-9800000000']
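split() is not limited to a single delimiter. The sketch below, which uses a made-up sample string, splits on commas or semicolons and also shows the maxsplit argument, which caps the number of splits.

```python
import re

text = 'apple, banana;cherry ,  mango'

# Split on a comma or semicolon, ignoring any whitespace around it
parts = re.split(r'\s*[,;]\s*', text)
print(parts)  # ['apple', 'banana', 'cherry', 'mango']

# maxsplit limits the number of splits; the rest of the string stays intact
first_two = re.split(r'\s', 'My mobile number is 977-9800000000', maxsplit=2)
print(first_two)  # ['My', 'mobile', 'number is 977-9800000000']
```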
sub()
The sub() function replaces every match of a pattern with a replacement string.
import re

text = 'My mobile number is 977-9800000000'
reg = re.sub(r'\s', '-', text)
print(reg)
Output
My-mobile-number-is-977-9800000000
From the above output, we can see spaces were replaced by “-” in text.
We can also pass the count parameter to limit the number of replacements.
import re

text = 'My mobile number is 977-9800000000'
# count=1 replaces only the first match
reg = re.sub(r'\s', '-', text, count=1)
print(reg)
Output
My-mobile number is 977-9800000000
Only one space is replaced by “-” while other spaces remain the same.
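The replacement string can also reuse parts of the match. In the sketch below, parentheses in the pattern capture the two parts of the number, and \1 and \2 in the replacement refer back to them. The target format '+977 9800000000' is chosen only for illustration.

```python
import re

text = 'My mobile number is 977-9800000000'

# (\d\d\d) and (\d+) are capturing groups; \1 and \2 refer back to them
formatted = re.sub(r'(\d\d\d)-(\d+)', r'+\1 \2', text)
print(formatted)  # My mobile number is +977 9800000000
```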
Metacharacters
Metacharacters are characters with special meanings. Some of the metacharacters in the regular expression are:
Dot(.)
Matches any single character except a newline
import re

text = 'I am from Nepal'
match = re.findall('Ne..l', text)
print(match)
Output
['Nepal']
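The newline exception can be lifted with the re.DOTALL flag. A short sketch, using a made-up string that has a newline where the dot has to match:

```python
import re

text = 'Ne\nal'  # a newline sits where the dot would have to match

default_match = re.search('Ne.al', text)
dotall_match = re.search('Ne.al', text, re.DOTALL)

print(default_match)        # None
print(dotall_match.span())  # (0, 5)
```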
Caret(^)
Matches the start of the string
import re

text = 'I am from Nepal'
match = re.search('^I', text)
print(match)
print(match.group())
Output
<re.Match object; span=(0, 1), match='I'>
I
Here, we got output as “I” because the string starts with “I”.
Let’s see another example
import re

text = 'I am from Nepal'
match = re.search('^am', text)
print(match)
Output
None
Here, we got None as output because the string doesn’t start with “am”.
Dollar($)
Matches the end of the string
import re

text = 'I am from Nepal'
match = re.search('Nepal$', text)
print(match)
Output
<re.Match object; span=(10, 15), match='Nepal'>
Here, the output shows that the string ends with “Nepal”.
Another example
import re

text = 'I am from Nepal'
match = re.search('Nep$', text)
print(match)
Output
None
Here, the None output shows that the string doesn’t end with “Nep”.
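By default, ^ and $ refer to the start and end of the whole string. With the re.MULTILINE flag they also match at the start and end of each line. The two-line sample text below is made up for this sketch.

```python
import re

text = 'I am from Nepal\nI like Python'

# Default: ^ matches only at the very start of the string
starts_default = re.findall(r'^I', text)
# re.MULTILINE: ^ also matches right after each newline
starts_multiline = re.findall(r'^I', text, re.MULTILINE)
# With re.MULTILINE, $ also matches just before each newline
line_endings = re.findall(r'\w+$', text, re.MULTILINE)

print(starts_default)    # ['I']
print(starts_multiline)  # ['I', 'I']
print(line_endings)      # ['Nepal', 'Python']
```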
Question mark(?)
Matches zero or one repetition of the preceding regular expression
import re

text = "A superwoman is there"
reg = re.compile(r'super(wo)?man')
match = reg.search(text)
print(match)
print(match.group())
Output
<re.Match object; span=(2, 12), match='superwoman'>
superwoman
The (wo)? part says that the (wo) group may appear zero times or once in the string. Here, the (wo) group appears once in the text, so we got a match. Similarly:
import re

text = "a superman is there"
reg = re.compile(r'super(wo)?man')
match = reg.search(text)
print(match)
print(match.group())
Output
<re.Match object; span=(2, 10), match='superman'>
superman
Here, the (wo) group is completely absent. Since “?” allows zero or one repetition, we still got the match. With more repetitions, the result is different:
import re

text = "there is a superwowowoman"
reg = re.compile(r'super(wo)?man')
mo = reg.search(text)
print(mo)
Output
None
Since the (wo) group appears three times, which is more than “?” allows, there is no match and we got None as output.
Asterisk(*)
Matches zero or more repetitions of the preceding regular expression
import re

reg = re.compile(r'super(wo)*man')
mo = reg.search('There is a superman')
print(mo)
print(mo.group())
Output
<re.Match object; span=(11, 19), match='superman'>
superman
The (wo)* part says that the (wo) group can appear zero or more times. The (wo) group is absent here, and we still got the match.
import re

reg = re.compile(r'super(wo)*man')
mo = reg.search('There is a superwowoman')
print(mo)
print(mo.group())
Output
<re.Match object; span=(11, 23), match='superwowoman'>
superwowoman
Since (wo)* allows any number of repetitions, we got the match with (wo) repeated twice.
Plus(+)
Matches one or more repetitions of the preceding regular expression
import re

reg = re.compile(r'super(wo)+man')
mo = reg.search('There is a superman')
print(mo)
Output
None
The (wo)+ part says that the (wo) group must appear one or more times in the string. Since the string has no (wo) group, we didn’t get a match.
import re

reg = re.compile(r'super(wo)+man')
mo = reg.search('There is a superwowowoman')
print(mo)
Output
<re.Match object; span=(11, 25), match='superwowowoman'>
The (wo) group is repeated three times, so we got a match.
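To compare the three quantifiers side by side, the sketch below runs each pattern against the same small word list. The list and the results dictionary are made up for this comparison, and re.fullmatch() is used so that each word must match as a whole.

```python
import re

words = ['superman', 'superwoman', 'superwowoman']
patterns = {
    '?': r'super(wo)?man',  # zero or one (wo)
    '*': r'super(wo)*man',  # zero or more (wo)
    '+': r'super(wo)+man',  # one or more (wo)
}

results = {}
for symbol, pattern in patterns.items():
    # Keep the words that the pattern matches in full
    results[symbol] = [w for w in words if re.fullmatch(pattern, w)]
    print(symbol, results[symbol])
```

Only “*” accepts all three words: “?” rejects the doubled (wo), and “+” rejects the word without any (wo).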
Curly braces ({m})
Specifies that exactly m copies of the preceding regular expression should be matched
import re

reg = re.compile(r'super(wo){3}man')
mo = reg.search('There is a superwowowoman')
print(mo)
Output
<re.Match object; span=(11, 25), match='superwowowoman'>
The (wo) group is repeated exactly 3 times in the string, so we got the match.
import re

reg = re.compile(r'super(wo){2}man')
mo = reg.search('There is a superwowowoman')
print(mo)
Output
None
The (wo){2} part requires exactly 2 repetitions of the (wo) group between “super” and “man”, but the string has 3, so no match could be found.
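{m,n}
Matches from m to n repetitions of the preceding regular expression. Write it without a space after the comma; with a space, as in (wo){2, 4}, Python does not treat the braces as a repetition count.
import re

reg = re.compile(r'(wo){2,4}')
mo = reg.search('wowowowowowo')
print(mo)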
Output
<re.Match object; span=(0, 8), match='wowowowo'>
Here, we got a match with four (wo) groups. The pattern could have matched two, three, or four (wo) groups, but it took four. This is called greedy matching: by default, a quantifier in Python matches as many repetitions as possible.
For non-greedy matching, add “?” after the quantifier:
import re

reg = re.compile(r'(wo){2,4}?')
mo = reg.search('wowowowowo')
print(mo)
Output
<re.Match object; span=(0, 4), match='wowo'>
Here, we got a match with only two (wo) groups.
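The difference between greedy and non-greedy matching is easiest to see with “.+”. The sketch below uses a made-up string with tag-like markers; it is only an illustration of the two modes, not a recommended way to parse markup.

```python
import re

text = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string
greedy = re.findall(r'<.+>', text)
# Non-greedy: .+? stops at the first '>' it can
lazy = re.findall(r'<.+?>', text)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```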
Pipe(|)
Matches either the expression on its left or the one on its right
import re

text = 'Nepal is a beautiful country and people in this country speak Nepali'
reg = re.compile(r'Nep(al|ali)')
mo = reg.search(text)
print(mo.group())
Output
Nepal
The regular expression matches the (al) alternative at the very first word of the text, giving “Nepal” as output.
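The order of the alternatives matters, because they are tried from left to right and the first one that succeeds is used. A small sketch on the single word 'Nepali' (the variable names are illustrative):

```python
import re

# 'al' is tried first and succeeds, so the longer 'ali' is never tried
short_first = re.search(r'Nep(al|ali)', 'Nepali').group()
# Listing the longer alternative first lets it win when it fits
long_first = re.search(r'Nep(ali|al)', 'Nepali').group()

print(short_first)  # Nepal
print(long_first)   # Nepali
```

When one alternative is a prefix of another, putting the longer one first is usually what is wanted.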
Square bracket ([ ])
It is used to define a set of characters
import re

text = 'my name is Arnold and i\'m 108'
match = re.findall(r'[0-9]', text)
print(match)
Output
['1', '0', '8']
Here, the set [0-9] matches any single digit from 0 to 9. A set can also be negated:
import re

text = 'my name is Arnold and i\'m 108'
match = re.findall(r'[^0-9]', text)
print(match)
Output
['m', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'A', 'r', 'n', 'o', 'l', 'd', ' ', 'a', 'n', 'd', ' ', 'i', "'", 'm', ' ']
With ^ as the first character inside the square brackets, the set matches any character that is not listed in it.
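A set can hold several ranges, and it can be combined with a quantifier. The sketch below reuses the same sentence to pull out runs of ASCII letters.

```python
import re

text = "my name is Arnold and i'm 108"

# [A-Za-z] is any ASCII letter; + asks for one or more in a row
words = re.findall(r'[A-Za-z]+', text)
print(words)  # ['my', 'name', 'is', 'Arnold', 'and', 'i', 'm']
```

The apostrophe is not in the set, so "i'm" is split into 'i' and 'm'.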
Character classes
Character classes are used to shorten regular expressions. For example, [0-9] matches any digit from 0 to 9, and \d does the same in shorter form:
\d – any digit from 0 to 9
\D – any character except a digit
\w – any letter, digit, or underscore
\W – any character that is not a letter, digit, or underscore
\s – any space, tab, or newline character
\S – any character other than a space, tab, or newline
For plain ASCII text these descriptions are exact. On Unicode strings, \d, \w, and \s also match non-ASCII digits, letters, and whitespace unless the re.ASCII flag is used.
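To see what each class picks out, the sketch below applies all six to one short made-up string and collects the results in a dictionary named found (an illustrative name).

```python
import re

sample = 'a1 b_2'

found = {}
for cls in (r'\d', r'\D', r'\w', r'\W', r'\s', r'\S'):
    # findall() returns every single character the class matches
    found[cls] = re.findall(cls, sample)
    print(cls, found[cls])
```

Each uppercase class returns exactly the characters that its lowercase counterpart leaves out.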
Now, let’s use these character classes to extract the phone numbers from a text:
import re

text = 'My mobile number is 977-9867000009. My other mobile number is 977-9800000000. Also you can try my telephone: 076-000000'
reg = re.compile(r'\d\d\d-\d+')
mo = reg.findall(text)
print(mo)
Output
['977-9867000009', '977-9800000000', '076-000000']
First, note that a mobile number has the pattern of 3 digits followed by a dash (-) followed by 10 digits, while a telephone number has 3 digits followed by a dash (-) followed by 6 digits.
Then a regex object is created using the compile() function. It searches for three digits, each denoted by \d, then a dash (-), and then \d+. Here, \d+ looks for one or more digits, as explained in the previous section.
Let’s say we have the text “Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09”. If we want to extract each name and birth date, we can do it easily using a regular expression.
import re

text = 'Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09'
reg = re.compile(r'\w+\s\d{4}-\d{2}-\d{2}')
mo = reg.findall(text)
print(mo)
Output
['Ramesh 2056-08-07', 'Suresh 2054-06-09']
Here, we created a pattern that looks for one or more word characters (\w+) followed by a whitespace character (\s), four digits for the year (\d{4}), a dash (-), two digits for the month (\d{2}), another dash (-), and two digits for the day (\d{2}).
\d{4} and \d{2} look for digits repeated exactly 4 times and exactly 2 times respectively.
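If the name and the parts of the date are needed separately, each part of the pattern can be wrapped in parentheses. When a pattern contains groups, findall() returns one tuple per match instead of the whole matched text. A minimal sketch (the variable name records is illustrative):

```python
import re

text = 'Birth date of Ramesh 2056-08-07 and birth date of Suresh 2054-06-09'
reg = re.compile(r'(\w+)\s(\d{4})-(\d{2})-(\d{2})')

# With groups in the pattern, findall() returns one tuple per match
records = reg.findall(text)
for name, year, month, day in records:
    print(name, year, month, day)
```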
Strong password detection with regular expression
Let’s assume that a strong password is one that is at least 8 characters long and has at least one uppercase letter, at least one lowercase letter, and at least one digit.
import re

def isStrongPassword(text):
    if len(text) < 8:
        return False
    # At least one uppercase letter
    if re.search(r'[A-Z]', text) is None:
        return False
    # At least one lowercase letter
    if re.search(r'[a-z]', text) is None:
        return False
    # At least one digit
    if re.search(r'[0-9]', text) is None:
        return False
    return True

pw = input('Enter password : ')
if isStrongPassword(pw):
    print('Password is strong')
else:
    print('Password is not strong')
Output
Enter password : AbGc6
Password is not strong
Enter password : abcdefgh1234
Password is not strong
Enter password : ABCdejfkdmdksisjsmmzmzmz
Password is not strong
Enter password : Abcedfghxyz1290
Password is strong
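The same rules can also be folded into a single pattern using lookaheads, which check a condition without consuming any text. This is an alternative sketch, not the approach used above; the names STRONG_RE and is_strong_password are illustrative, and the pattern assumes the password contains no newline, since “.” does not match one by default.

```python
import re

# Each (?=...) is a lookahead: it checks a condition without consuming text.
# .{8,} then requires at least 8 characters in total.
STRONG_RE = re.compile(r'(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9]).{8,}')

def is_strong_password(text):
    # fullmatch() makes sure the whole password is covered by the pattern
    return STRONG_RE.fullmatch(text) is not None

for pw in ['AbGc6', 'abcdefgh1234', 'ABCdejfkdmdksisjsmmzmzmz', 'Abcedfghxyz1290']:
    print(pw, is_strong_password(pw))
```

The step-by-step version is easier to read and to extend with new rules; the single pattern is more compact.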
Conclusion
Regular expressions simplify the searching and matching tasks that come up in everyday programming. Without them, matching patterns and searching for strings in text files takes much more hand-written code.
With regular expressions, these tasks take little effort and save time. The re module provides functions, metacharacters, and character classes that make matching and searching simple, so regular expressions are well worth learning.
If you have any questions about the tutorial, please leave a comment below and I will reply as soon as I can. Mastering regular expressions will make many text-processing tasks much quicker to write.
Reference
automatetheboringstuff.com/2e/chapter7/
Happy Learning 🙂