How to Extract Text From PDF in Python

PDF (Portable Document Format) is a file format developed by Adobe in the 1990s. Today it is widely used for read-only documents.

In Python, many packages on PyPI can extract text from PDF files, such as pdfplumber, pdfminer, PyPDF2, slate, pdfquery, xpdf, and textract.

In this article, we will use the PyPDF2 module for the following tasks:

1) Extracting text

2) Copying pages

3) Rotating pages

4) Encrypting pdf

Installation

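pip install "PyPDF2<3.0"

Note: The examples below use the older PyPDF2 names (PdfFileReader, PdfFileWriter, getPage(), extractText(), and so on). PyPDF2 3.0 removed these names in favour of PdfReader, PdfWriter and related replacements, so install a version below 3.0 to run the code as written.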

 

1) Extracting text

We can extract text from a specific page or from every page.

Note: PyPDF2 does not extract images, charts, or media files. It only extracts text and returns it as a Python string.

Extracting a specific page

# import module PyPDF2
import PyPDF2

# put 'example.pdf' in working directory
# and open it in read binary mode
pdfFileObj = open('example.pdf', 'rb')

# call and store PdfFileReader
# object in pdfReader
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# to print the total number of pages in pdf
# print(pdfReader.numPages)

# get specific page of pdf by passing
# number since it stores pages in list
# to access first page pass 0
pageObj = pdfReader.getPage(0)

# extract the text of the page
# with the extractText() method
texts = pageObj.extractText()

# print the extracted text
print(texts)

# close the file when done
pdfFileObj.close()
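Because getPage() takes a zero-based index, it can help to check a human-friendly page number before using it. The following is a minimal sketch of that idea, not part of PyPDF2 itself: extract_page_text and read_page_from_file are hypothetical helper names, and the code only assumes the numPages attribute and the getPage() and extractText() methods used above.

```python
def extract_page_text(reader, page_number):
    """Return the text of a 1-based page number from a PdfFileReader-style object.

    `reader` is assumed to expose `numPages` and `getPage(index)`,
    as PdfFileReader does in the example above.
    """
    if not 1 <= page_number <= reader.numPages:
        raise ValueError(
            f"page_number must be between 1 and {reader.numPages}, got {page_number}"
        )
    # getPage() is zero-based, so page 1 lives at index 0
    return reader.getPage(page_number - 1).extractText()


def read_page_from_file(path, page_number):
    """Open `path`, extract one page and close the file again."""
    import PyPDF2  # imported here so extract_page_text has no hard dependency

    # the with-block closes the file even if extraction raises an error
    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return extract_page_text(reader, page_number)
```

With 'example.pdf' in the working directory, print(read_page_from_file('example.pdf', 1)) would print the text of the first page.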

 

Extracting all pages

import PyPDF2

pdffile = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdffile)
num_pages = pdfReader.numPages
count = 0

# the while loop reads each page in turn
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    texts = pageObj.extractText()
    print('Page number:', count)
    print(texts)

pdffile.close()
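If we want to keep the text instead of printing it page by page, the same loop can collect the pages into a list. This is only a sketch under the same assumptions as the code above (a reader with numPages and getPage()); extract_all_text and format_pages are hypothetical helper names.

```python
def extract_all_text(reader):
    """Return a list with the extracted text of every page, in page order."""
    pages = []
    for index in range(reader.numPages):
        pages.append(reader.getPage(index).extractText())
    return pages


def format_pages(page_texts):
    """Join page texts into one string, labelling each page with its 1-based number."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"Page number: {number}\n{text}")
    # a blank line separates one page from the next
    return "\n\n".join(sections)
```

Calling format_pages(extract_all_text(pdfReader)) gives a single string that can be printed or written to a text file.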

 

2) Copying pages

Here, we copy the pages of two PDF files named ‘example.pdf’ and ‘example2.pdf’ and merge them into a newly created file named ‘example3.pdf’.

import PyPDF2

# open two pdfs
pdf1File = open('example.pdf', 'rb')
pdf2File = open('example2.pdf', 'rb')

# read first pdf
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
# read second pdf
pdf2Reader = PyPDF2.PdfFileReader(pdf2File)
# for writing in new pdf file
pdfWriter = PyPDF2.PdfFileWriter()

for pageNum in range(pdf1Reader.numPages):
    pageObj = pdf1Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

for pageNum in range(pdf2Reader.numPages):
    pageObj = pdf2Reader.getPage(pageNum)
    pdfWriter.addPage(pageObj)

# create new pdf 'example3.pdf' 
pdfOutputFile = open('example3.pdf', 'wb')

pdfWriter.write(pdfOutputFile)
pdfOutputFile.close()
pdf1File.close()
pdf2File.close()

Now we can see the new PDF ‘example3.pdf’ in the working directory.

Note: addPage() always appends to the end of the PdfFileWriter object, so add the pages in the order they should appear in the output.
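Since addPage() appends pages one after another, a simple way to place the pages of one file in the middle of another is to work out the final order first and then add the pages in that order. The sketch below illustrates this; plan_insertion and add_planned_pages are hypothetical helpers, and the 'base' and 'insert' labels are names chosen for this example only.

```python
def plan_insertion(base_count, insert_count, position):
    """Plan the output order when one PDF's pages go inside another's.

    Returns a list of (source, page_index) tuples, where source is
    'base' or 'insert'. `position` is how many base pages come first.
    """
    if not 0 <= position <= base_count:
        raise ValueError("position must be between 0 and base_count")
    before = [('base', i) for i in range(position)]
    middle = [('insert', i) for i in range(insert_count)]
    after = [('base', i) for i in range(position, base_count)]
    return before + middle + after


def add_planned_pages(writer, readers, plan):
    """Append pages to `writer` in the planned order.

    `readers` maps each source name to a PdfFileReader-style object,
    and `writer` is a PdfFileWriter-style object with addPage().
    """
    for source, index in plan:
        writer.addPage(readers[source].getPage(index))
```

With the objects from the example above, add_planned_pages(pdfWriter, {'base': pdf1Reader, 'insert': pdf2Reader}, plan_insertion(pdf1Reader.numPages, pdf2Reader.numPages, 1)) would put every page of ‘example2.pdf’ right after the first page of ‘example.pdf’.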

 

3) Rotating pages

PyPDF2 comes with two methods for rotating pdf pages.

rotateCounterClockwise(): Rotates a page counter-clockwise by increments of 90 degrees.

rotateClockwise(): Rotates a page clockwise by increments of 90 degrees.

import PyPDF2

pdfFile = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFile)

# rotating first page
# of 'example.pdf' only
page = pdfReader.getPage(0)

# rotating clockwise by 90
page.rotateClockwise(90)

# rotating counter-clockwise by 270
# page.rotateCounterClockwise(270)

# creating object 'pdfWriter'
# to add rotated page
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(page)

# create new pdf
pdfOutputFile = open('rotated-example.pdf', 'wb')
pdfWriter.write(pdfOutputFile)
pdfOutputFile.close()
pdfFile.close()
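The commented-out line in the example suggests that rotating counter-clockwise by 270 degrees gives the same result as rotating clockwise by 90. The small sketch below works through that arithmetic; net_clockwise_rotation is a hypothetical helper written for this article, not a PyPDF2 function.

```python
def net_clockwise_rotation(*steps):
    """Combine rotation steps into one clockwise angle in the range 0 to 359.

    Each step is a (direction, degrees) tuple where direction is
    'cw' or 'ccw' and degrees is a multiple of 90.
    """
    total = 0
    for direction, degrees in steps:
        if degrees % 90 != 0:
            raise ValueError("rotation must be a multiple of 90 degrees")
        if direction == 'cw':
            total += degrees
        elif direction == 'ccw':
            total -= degrees
        else:
            raise ValueError("direction must be 'cw' or 'ccw'")
    # Python's % returns a non-negative result for a positive modulus,
    # so a negative total such as -270 becomes 90
    return total % 360


print(net_clockwise_rotation(('cw', 90)))    # 90
print(net_clockwise_rotation(('ccw', 270)))  # 90
```

Both calls print 90, which is why either line in the example above produces the same rotated page.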

 

4) Encrypting pdf

To keep a PDF file from being opened by just anyone, PyPDF2 lets us encrypt it with a password.

import PyPDF2

pdfFileObj = open('example.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pdfWriter = PyPDF2.PdfFileWriter()

for pageNum in range(pdfReader.numPages):
    pdfWriter.addPage(pdfReader.getPage(pageNum))

# set 'abc' as the password
pdfWriter.encrypt('abc')
resultPdf = open('encrypted-example.pdf', 'wb')
pdfWriter.write(resultPdf)
resultPdf.close()
pdfFileObj.close()

A new PDF file named ‘encrypted-example.pdf’ now appears in the working directory. Because we set its password to “abc”, we have to enter that password whenever we open the file.
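Reading the encrypted file back from Python also needs the password. The sketch below shows one way to do this with the reader's isEncrypted attribute and decrypt() method from the older PyPDF2 API; open_with_password is a hypothetical helper name, and the function accepts any reader-like object that offers those two members.

```python
def open_with_password(reader, password):
    """Unlock a PdfFileReader-style object if it is encrypted.

    Returns True when the document is readable afterwards.
    """
    if not reader.isEncrypted:
        # nothing to unlock
        return True
    # decrypt() returns 0 when the password does not match
    return reader.decrypt(password) != 0
```

For example, after pdfReader = PyPDF2.PdfFileReader(open('encrypted-example.pdf', 'rb')), a call to open_with_password(pdfReader, 'abc') should return True, and pages can then be read with getPage() as before.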

 

References

https://automatetheboringstuff.com/chapter13/

Happy Learning 🙂
