Introduction
NLTK (Natural Language Toolkit) is the most popular and widely used Python library for Natural Language Processing (NLP), also known as text mining. NLP is a branch of Artificial Intelligence (AI) that focuses on teaching computers how to extract meaning from human language data.
Due to the rapid growth of the Internet, huge amounts of data (in the form of text, audio, images, and video) are generated every day. To derive insights from this data, we first have to preprocess it before feeding it to a machine learning model.
Apart from NLTK, there are other Python packages that can be used for NLP, such as spaCy, Gensim, Polyglot, TextBlob, and Pattern.
Installation of NLTK
To install the NLTK package, you have to run the following command in your terminal:
$ pip install nltk
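After installation, you can verify that the package is available by importing it and checking its version in a Python shell. This is just a quick sanity check; the version number shown below is an example and may differ on your machine:

>>> import nltk
>>> print(nltk.__version__)
3.6.2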
The steps for text preprocessing are:
- Convert text into lowercase
- Tokenizing
- Removing Noise
- Stemming
Here is the sample text for preprocessing:
Charles Babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate.
1. Convert text into lowercase
This is one of the important steps in Natural Language Processing. In order to treat two variants of a word, such as “nltk” and “NLTK”, as the same token, we first have to convert all text to lowercase.
We can simply use the built-in string method lower() provided by Python to convert text to lowercase.
text = "Charles Babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate." text = text.lower() print(text)
Output:
'charles babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate.'
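As a side note, if your text contains non-English characters, Python's built-in str.casefold() performs a more aggressive lowercasing than lower(), which can matter for caseless comparisons. A minimal sketch:

text = "Straße"
print(text.lower())     # straße  - lower() keeps the German sharp s
print(text.casefold())  # strasse - casefold() normalizes it for comparison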
2. Tokenizing
Tokenization is the process of splitting text into smaller units, either words or sentences, which helps in analyzing the sequence of words in the text.
For sentence-level tokenization, we can use the sent_tokenize function provided by NLTK:
from nltk.tokenize import sent_tokenize

text = "Python is an interpreted high-level general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant indentation."
sentence_tokenize = sent_tokenize(text)
print(sentence_tokenize)
Output:
['Python is an interpreted high-level general-purpose programming language.', "Python's design philosophy emphasizes code readability with its notable use of significant indentation."]
Note: If an error occurs during the execution of this program, you should first download the punkt tokenizer model from NLTK:
>>> import nltk
>>> nltk.download('punkt')
[nltk_data] Downloading package punkt to /home/shiv/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Since the sample text mentioned above contains only one sentence, sentence tokenization is not very useful here. Instead, we can apply word-level tokenization to that text:
from nltk.tokenize import word_tokenize

text = 'charles babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate.'
tokens = word_tokenize(text)
print(tokens)
Output:
['charles', 'babbage', ',', 'who', 'was', 'born', 'in', '1791', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate', '.']
Notice that the text has been split into word and punctuation tokens, and the result is returned as a Python list.
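If you want only word characters, with no punctuation tokens in the first place, NLTK's RegexpTokenizer offers an alternative. A minimal sketch, where the pattern \w+ keeps runs of alphanumeric characters only:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize('charles babbage, who was born in 1791.')
print(tokens)
# ['charles', 'babbage', 'who', 'was', 'born', 'in', '1791']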
3. Removing Noise
Noise removal is the process of stripping out characters and words that carry no meaning for text analysis; in NLP, these irrelevant elements are called noise. The most common kinds of noise are numbers, punctuation, stop words, and extra whitespace.
Removing Numbers
tokens = ['charles', 'babbage', ',', 'who', 'was', 'born', 'in', '1791', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate', '.']

# Removing numbers
remove_numbers = [token for token in tokens if not token.isdigit()]
print(remove_numbers)
Output:
['charles', 'babbage', ',', 'who', 'was', 'born', 'in', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate', '.']
Here the number ‘1791’ has been successfully removed from the token list.
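Note that isdigit() only catches tokens made up entirely of digits; values like “3.14” or “1,000” would survive. If you need to drop those too, a regex filter is one option. A sketch, not part of the original pipeline:

import re

tokens = ['born', 'in', '1791', 'pi', '3.14', 'population', '1,000']
# drop any token made up only of digits, dots, and commas
remove_numbers = [t for t in tokens if not re.fullmatch(r'[\d.,]+', t)]
print(remove_numbers)
# ['born', 'in', 'pi', 'population']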
Removing Punctuation
import string

tokens = ['charles', 'babbage', ',', 'who', 'was', 'born', 'in', ',', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate', '.']
remove_punctuations = [token for token in tokens if token not in string.punctuation]
print(remove_punctuations)
Output:
['charles', 'babbage', 'who', 'was', 'born', 'in', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate']
Punctuation tokens like ‘,’ and ‘.’ are removed by checking each token against string.punctuation, a string constant that contains all ASCII punctuation characters.
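This filter only drops tokens that consist of a single punctuation character. If punctuation can be embedded inside tokens, for example when you split on whitespace instead of using word_tokenize, str.translate can strip it out. A minimal sketch:

import string

# translation table that maps every punctuation character to None
table = str.maketrans('', '', string.punctuation)
words = 'charles babbage, (1791) calculate.'.split()
stripped = [w.translate(table) for w in words]
print(stripped)
# ['charles', 'babbage', '1791', 'calculate']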
Removing Stop words
Words like “the”, “and”, “in”, “is”, and “or” provide little information during text analysis, so we remove them to reduce the size of the data and the space needed to process a particular text.
from nltk.corpus import stopwords

tokens = ['charles', 'babbage', 'who', 'was', 'born', 'in', 'is', 'regarded', 'as', 'the', 'father', 'of', 'computing', 'because', 'of', 'his', 'research', 'into', 'machines', 'that', 'could', 'calculate']
lang_stopwords = stopwords.words("english")
remove_stopwords = [token for token in tokens if token not in lang_stopwords]
print(remove_stopwords)
Output:
['charles', 'babbage', 'born', 'regarded', 'father', 'computing', 'research', 'machines', 'could', 'calculate']
Note: If this is the first time you are running a program that uses “stopwords” in NLTK, you have to download the stopwords corpus first:
>>> import nltk
>>> nltk.download('stopwords')
[nltk_data] Downloading package stopwords to /home/shiv/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
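The stop word list returned by stopwords.words() is a plain Python list, so you can extend it with domain-specific words of your own. A small sketch, where the extra words are made up for illustration:

from nltk.corpus import stopwords

lang_stopwords = stopwords.words("english")
# add custom words that count as noise in our domain (hypothetical examples)
lang_stopwords.extend(['regarded', 'could'])

tokens = ['charles', 'babbage', 'born', 'regarded', 'could', 'calculate']
filtered = [t for t in tokens if t not in lang_stopwords]
print(filtered)
# ['charles', 'babbage', 'born', 'calculate']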
4. Stemming
Stemming is the process of reducing a word to its root form. During text analysis, an NLP algorithm should often treat related words such as “caring”, “cares”, and “careful” as the same word.
To do that, we have to reduce those words to their common root, i.e. “care”. We can easily perform stemming using the NLTK library:
from nltk import SnowballStemmer

lang = "english"
stemmer = SnowballStemmer(lang)
tokens = ['charles', 'babbage', 'born', 'regarded', 'father', 'computing', 'research', 'machines', 'could', 'calculate']
stemming_tokens = [stemmer.stem(token) for token in tokens]
print("Original tokens", tokens, sep='\n')
print('---------------------------')
print("Stemming tokens", stemming_tokens, sep='\n')
Output:
Original tokens
['charles', 'babbage', 'born', 'regarded', 'father', 'computing', 'research', 'machines', 'could', 'calculate']
---------------------------
Stemming tokens
['charl', 'babbag', 'born', 'regard', 'father', 'comput', 'research', 'machin', 'could', 'calcul']
Here the word “regarded” is stemmed to “regard”, “computing” to “comput”, “machines” to “machin”, and so on.
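As the output shows, stems like “charl” and “comput” are not valid dictionary words. If your application needs real words, lemmatization is an alternative to stemming. A minimal sketch using NLTK's WordNetLemmatizer, which requires running nltk.download('wordnet') the first time:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('machines'))            # machine
print(lemmatizer.lemmatize('computing', pos='v'))  # compute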
Here is the summarized code from all the above steps:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import SnowballStemmer
import string

"""Python program to preprocess text using the NLTK library."""

def text_preprocessing(text):
    """Take raw text as input and return the preprocessed text."""
    # convert text to lowercase
    text = text.lower()

    # word tokenizing
    tokens = word_tokenize(text)

    # removing noise: numbers, punctuation, and stop words
    lang_stopwords = stopwords.words("english")
    tokens = [token for token in tokens
              if not token.isdigit()
              and token not in string.punctuation
              and token not in lang_stopwords]

    # stemming tokens
    stemmer = SnowballStemmer('english')
    tokens = [stemmer.stem(token) for token in tokens]

    # join tokens to form the preprocessed string
    preprocessed_text = " ".join(tokens)
    return preprocessed_text

# sample text
text = "Charles Babbage, who was born in 1791, is regarded as the father of computing because of his research into machines that could calculate."
print("The preprocessed text of sample text is:", text_preprocessing(text), sep='\n')
Output:
The preprocessed text of sample text is:
charl babbag born regard father comput research machin could calcul
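Once wrapped in a function like this, the same preprocessing can be applied to a whole collection of documents, for example with a list comprehension. A usage sketch, assuming the text_preprocessing function defined above:

documents = [
    "Charles Babbage designed the Analytical Engine.",
    "Ada Lovelace wrote the first published algorithm.",
]
preprocessed = [text_preprocessing(doc) for doc in documents]
print(preprocessed)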
Conclusion
In this blog post, we successfully preprocessed a sample text using the Python NLTK library.
We saw how raw text can be converted into a cleaner, more meaningful form so that algorithms can extract insights from it more easily. Text preprocessing is one of the essential steps in almost every NLP project.
If you have any problems feel free to drop comments down below.
Happy Coding:-)