Understanding Regex in NLP Using Python

Introduction

Natural Language Processing (NLP) is a branch of artificial intelligence that helps computers understand how human beings communicate, whether through speech or through written text.

In this article, we will look at what regex is and how it is used in the Python programming language.

What is RegEx

Regex is short for regular expression. It is a tool for finding patterns within text. It offers a powerful rule-based approach in which you write patterns to extract useful information for a given NLP task. Common use cases include validating user-generated input, such as passwords, and searching through large bodies of text, such as documents.

NLP problems are solved using either a heuristic, rule-based approach or machine learning. Understanding regex therefore gives you a stronger foundation for learning NLP.

Installation of necessary tools

Python must be installed, since we will be using the Python programming language. Here is a link to Python configuration. You can use any editor of your choice, such as PyCharm, VS Code, Sublime Text, or Jupyter Notebook. In our case, we will be using VS Code.

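The examples in this article use the re module, which is part of Python's standard library, so it needs no separate installation. If you also want the third-party regex package, which is a separate module from re, you can install it from your terminal using the pip package manager.

pip install regex

The re module provides regular expression matching operations. It includes all the basic functions required to perform regex operations, as you will see shortly.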

Regex in action

We use '\d' when the characters we want to find are digits. The re.findall() function scans the string from left to right and returns all non-overlapping matches of the pattern in the given string. The pattern is the regular expression that you want to match.

1) Using regex findall() function to find numbers in the string

We will start by importing the re module, then see how the function operates. For better understanding, below is a code snippet.

import re

text = '''
To contact me, this is my phone number
0711456987. Feel free to do so.
'''

# Raw string so the backslash reaches the regex engine unchanged
pattern = r'\d{10}'

matches = re.findall(pattern, text)
print(matches)

The code above finds the phone number, which is a series of digits. The text variable is a string that contains a message mentioning a phone number. The pattern \d{10} matches any sequence of 10 consecutive digits (0-9), on the assumption that the phone numbers in this example are written as 10 digits. The re.findall() function finds all occurrences of the pattern in the text and returns them as a list of strings.

Output:

['0711456987']

Finally, the code displays the output of the re.findall function, which is a list containing a single string that matches the pattern.
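One caveat is worth sketching here: \d{10} also matches the first ten digits of a longer run of digits. The snippet below is a minimal sketch, assuming you only want standalone ten-digit numbers. The sample text and the variable names loose and strict are made up for illustration.

```python
import re

# Hypothetical sample text: one 10-digit number and one longer 13-digit ID.
sample = "Call 0711456987 or quote order 1234567890123."

# Plain pattern: also picks up the first 10 digits of the longer ID.
loose = re.findall(r"\d{10}", sample)

# Lookarounds reject a match that has a digit directly before or after it.
strict = re.findall(r"(?<!\d)\d{10}(?!\d)", sample)

print(loose)   # ['0711456987', '1234567890']
print(strict)  # ['0711456987']
```

Whether the stricter form is needed depends on your data; for text that only ever contains phone numbers, the plain pattern may be enough.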

2) Advanced phone number extractor

The regex above only works for a plain phone number made of ten consecutive digits. We will now work through a more complex example to illustrate more of what regular expressions can do.

import re

text = '''
To contact me, this is my phone number (132)-798-1134 you can also
reach me through my alternative number 0711456987. Feel free to do so.
'''

# Raw string avoids invalid-escape warnings for \( and \d
pattern = r'\(\d{3}\)-\d{3}-\d{4}|\d{10}'

matches = re.findall(pattern, text)
print(matches)

The regex \(\d{3}\)-\d{3}-\d{4}|\d{10} matches either a US-style phone number such as (132)-798-1134 or a plain 10-digit phone number. We can break down the regex as follows.

  • \(\d{3}\) matches a sequence of 3 digits surrounded by literal parentheses.

  • - matches a literal dash character.

  • \d{3} matches another sequence of 3 digits.

  • - matches another literal dash character.

  • \d{4} matches a final sequence of 4 digits.

  • | is the pipe symbol and represents a logical OR operation. It allows matching either of the two patterns on either side of the pipe.

  • \d{10} matches a sequence of 10 digits without any dashes or parentheses.

Running the second snippet prints a list of two strings.

Output:

['(132)-798-1134', '0711456987']
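The breakdown above can also be written into the pattern itself. The following is a minimal sketch using the re.VERBOSE flag, which lets you spread a pattern over several lines and comment each part. The names phone_pattern and sample are illustrative and not part of the original example.

```python
import re

# Same pattern as above, written with re.VERBOSE so that whitespace and
# comments inside the pattern are ignored by the regex engine.
phone_pattern = re.compile(r"""
    \(\d{3}\)   # three digits inside literal parentheses
    -\d{3}      # a dash, then three digits
    -\d{4}      # a dash, then four digits
    |           # OR
    \d{10}      # ten consecutive digits
""", re.VERBOSE)

sample = "Reach me on (132)-798-1134 or 0711456987."
found = phone_pattern.findall(sample)
print(found)  # ['(132)-798-1134', '0711456987']
```

Compiling the pattern once with re.compile() is also convenient when the same regex is applied to many strings.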

3) Note subtitle extractor

We will learn how to extract the subtitles from a set of notes, where each subtitle follows a "Note N - " prefix.

import re

text = '''
Note 1 - Introduction
It offers a powerful rule based approach where you extract patterns from
your text to extract useful information for a given NLP task. Common use cases involve, validation of user generated input such as passwords and searching through a large body of texts such as in documents.

Note 2 - Getting started with regex
You only need to get ready and the rest will be fun.
'''

# Raw string so \d and \n are interpreted by the regex engine
pattern = r'Note \d - ([^\n]*)'
matches = re.findall(pattern, text)
print(matches)

In this case, the regex Note \d - ([^\n]*) breaks down as follows:

  • Note matches the literal string "Note", which is followed by a space.

  • \d matches a single digit.

  • - matches the literal string " - ".

  • ([^\n]*) is a capture group that matches zero or more characters that are not a newline (\n), that is, the rest of the line.

  • The parentheses mean that re.findall() returns only the captured text, not the whole match, as an item in the results list.

Finally, the result that is printed out is as follows.

Output:

['Introduction', 'Getting started with regex']
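When a pattern contains more than one capture group, re.findall() returns a list of tuples instead of a list of strings. The sketch below assumes you also want the note number; the notes text and the pairs variable are made up for illustration, and \d+ is used so that note numbers with more than one digit are also accepted.

```python
import re

# Hypothetical notes text in the same layout as the example above.
notes = """
Note 1 - Introduction
Some body text.
Note 2 - Getting started with regex
More body text.
"""

# Two capture groups: the note number and the subtitle.
pairs = re.findall(r"Note (\d+) - ([^\n]*)", notes)
print(pairs)  # [('1', 'Introduction'), ('2', 'Getting started with regex')]
```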

4) Date extractor

This example uses regex to extract several different date formats from a text. In this section we will use a different function, re.finditer().

import re

text = '''
Below are different dates formats in text:
- The date is 2023-02-12.
- The date is 02/12/2023.
- The date is 02.12.2023.
- The date is 02-12-2023.
- The date is 12-02-2023.
- The date is 12/02/2023.
'''



patterns = [
    r'\d{4}-\d{2}-\d{2}',    # year first, dashes
    r'\d{2}/\d{2}/\d{4}',    # year last, slashes
    r'\d{2}\.\d{2}\.\d{4}',  # year last, dots
    r'\d{2}-\d{2}-\d{4}',    # year last, dashes
]

for pattern in patterns:
    for match in re.finditer(pattern, text):
        print(match.group(0))

The patterns list holds one regex per date format, written in the same way as the earlier examples. A pattern such as \d{2}/\d{2}/\d{4} checks only the shape of the date, so it covers both day-first and month-first dates and does not need to be listed twice. These patterns are then used in a loop to find all matches in the text using the re.finditer() function.

For each iteration of the loop, the current pattern is passed as an argument to re.finditer(), which returns an iterator over all matches of the pattern in the text. The inner loop then iterates over the matches, and for each match, the group() method is called to get the matched text. Because the outer loop runs over the patterns, the output is grouped by pattern rather than by position in the text. It is printed as follows.

Output:

2023-02-12
02/12/2023
12/02/2023
02.12.2023
02-12-2023
12-02-2023
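Each item produced by re.finditer() is a match object, so it carries more than the matched text. The following is a minimal sketch, assuming you want the dates in the order they appear together with their positions. It joins two of the formats into one pattern with the | operator; the names sample, date_pattern and found are illustrative.

```python
import re

# Hypothetical sample sentence with two date formats.
sample = "Due 2023-02-12, revised 02/12/2023."

# One pattern with alternation covers both formats in a single pass,
# so matches come back in the order they appear in the text.
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}")

# match.start() gives the index where each match begins.
found = [(m.group(0), m.start()) for m in date_pattern.finditer(sample)]
print(found)  # [('2023-02-12', 4), ('02/12/2023', 24)]
```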

5) Finding positive sentiments in a text

We will see how the re.search() function operates. Below is the example code, followed by an explanation.

import re

def positive_sentiment(text):
    pattern = r"(?i)\b(lovely|dope|legit|good|great|excellent|pleasant|awesome|amazing|fantastic|love|like)\b"
    match = re.search(pattern, text)
    if match:
        return match.group()
    else:
        return "Positive sentiment is not found"

text = "It's just dope learning AI."
result = positive_sentiment(text)
print(result)

In this example, the regular expression is used to search for positive sentiment words in the given text. The (?i) flag makes the expression case-insensitive. The pattern matches a word boundary \b, followed by one of the listed positive words, followed by another word boundary. The user-defined function positive_sentiment takes a single argument, text, searches for the pattern using re.search(), and returns the first match if one is found. If there is no match, it returns the message "Positive sentiment is not found".

Together, this regular expression matches a complete word that is one of the listed positive sentiments, regardless of case. The word boundaries ensure that only complete words are matched, avoiding partial matches within words.

Output:

dope
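Since re.search() stops at the first match, a text with several positive words yields only one of them. The sketch below is one possible variation that collects every match with findall(). The helper all_positive_words, the POSITIVE constant and the shortened word list are hypothetical and not part of the original function.

```python
import re

# Shortened, illustrative word list; re.IGNORECASE plays the role of (?i).
POSITIVE = re.compile(r"\b(lovely|dope|good|great|awesome|love|like)\b", re.IGNORECASE)

def all_positive_words(text):
    # findall returns the text of the single capture group for each match.
    return [word.lower() for word in POSITIVE.findall(text)]

words = all_positive_words("Great course, I love it. The likes of it are rare.")
print(words)  # ['great', 'love']
```

Note that "likes" is not returned: the trailing \b requires the word to end right after "like".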

Conclusion

Regex is a key tool in NLP. Common applications of regex include:

  • Data cleaning, where it is used to clean up messy data.

  • Validation, where you check that a given string matches a certain pattern, such as a valid email address or phone number.

  • Extraction, where specific pieces of information, such as dates, numbers, and URLs, are pulled from a larger body of text.

  • Web crawling and data scraping to extract information from websites and other text-based sources for data mining and data analysis.
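Two of the uses listed above, data cleaning and validation, rely on functions not shown earlier in this article. Here is a minimal sketch using re.sub() and re.fullmatch(); the sample string and the helper is_ten_digit_phone are made up for illustration, and the ten-digit rule is an assumption carried over from the phone number examples.

```python
import re

# Data cleaning: collapse runs of whitespace into a single space.
messy = "Regex   is \t handy\n for   cleanup. "
clean = re.sub(r"\s+", " ", messy).strip()
print(clean)  # Regex is handy for cleanup.

# Validation: re.fullmatch succeeds only if the whole string fits the pattern.
def is_ten_digit_phone(value):
    return re.fullmatch(r"\d{10}", value) is not None

print(is_ten_digit_phone("0711456987"))   # True
print(is_ten_digit_phone("0711-456987"))  # False
```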

We have covered several regex functions and patterns that you can now apply in a variety of ways. With the examples above, you are well equipped to tackle a range of text-processing problems.