PDF (Portable Document Format) is a file format developed by Adobe in the 1990s. Today it is hugely popular for read-only documents.
Python has many packages on PyPI for extracting text from PDFs, such as pdfplumber, pdfminer, PyPDF2, slate, pdfquery, xpdf, and textract.
In this article we will use the PyPDF2 module for the following tasks:
1) Extracting text
2) Copying pages
3) Rotating pages
4) Encrypting pdf
Installation
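pip install PyPDF2
Note: The examples below use the PyPDF2 1.x names such as PdfFileReader, getPage() and extractText(). PyPDF2 3.0 removed these names, so to run the examples unchanged, install an earlier release, for example: pip install "PyPDF2<3"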
1) Extracting text
We can extract text from a specific page or from every page of a PDF.
Note: extractText() does not extract images, charts, or other media. It only extracts text and returns it as a Python string.
Extracting specific page
    # import the PyPDF2 module
    import PyPDF2

    # put 'example.pdf' in the working directory
    # and open it in read-binary mode
    pdfFileObj = open('example.pdf', 'rb')

    # create a PdfFileReader object and store it in pdfReader
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    # to print the total number of pages in the pdf
    # print(pdfReader.numPages)

    # pages are stored like a list, so get a specific page
    # by passing its index; pass 0 for the first page
    pageObj = pdfReader.getPage(0)

    # extract the text of the page object
    # with the extractText() method
    texts = pageObj.extractText()

    # print the extracted text
    print(texts)

    # close the file once we are done with it
    pdfFileObj.close()
Extracting all pages
    import PyPDF2

    pdffile = open('example.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdffile)
    num_pages = pdfReader.numPages
    count = 0

    # the while loop reads each page in turn
    while count < num_pages:
        pageObj = pdfReader.getPage(count)
        count += 1
        texts = pageObj.extractText()
        print('Page number:', count)
        print(texts)

    pdffile.close()
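Since getPage() counts pages from 0 while readers usually think of page 1 as the first page, a small helper can translate a human-style selection into the indices that getPage() expects. The following is only a sketch: parse_page_selection is a hypothetical helper written for this article, not part of PyPDF2, and it uses nothing beyond plain Python.

```python
def parse_page_selection(selection, num_pages):
    """Convert a 1-based selection such as "1-3,5" into zero-based page indices.

    Hypothetical helper for illustration; it is not part of PyPDF2.
    """
    indices = []
    for part in selection.split(','):
        part = part.strip()
        if not part:
            continue
        if '-' in part:
            first, last = (int(n) for n in part.split('-', 1))
        else:
            first = last = int(part)
        if not 1 <= first <= last <= num_pages:
            raise ValueError(f"page range {part!r} is outside 1..{num_pages}")
        # getPage() counts from 0, so shift each page number down by one
        indices.extend(range(first - 1, last))
    return indices


print(parse_page_selection("1-3,5", num_pages=6))  # [0, 1, 2, 4]

# Possible use with the reader from the examples above:
# for index in parse_page_selection("1-3,5", pdfReader.numPages):
#     print(pdfReader.getPage(index).extractText())
```

Passing pdfReader.numPages as num_pages lets the helper reject page numbers that do not exist before getPage() is ever called.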
2) Copying pages
Here, we copy the pages of two PDF files named ‘example.pdf’ and ‘example2.pdf’ and merge them into a newly created file named ‘example3.pdf’.
    import PyPDF2

    # open two pdfs
    pdf1File = open('example.pdf', 'rb')
    pdf2File = open('example2.pdf', 'rb')

    # read first pdf
    pdf1Reader = PyPDF2.PdfFileReader(pdf1File)

    # read second pdf
    pdf2Reader = PyPDF2.PdfFileReader(pdf2File)

    # for writing in new pdf file
    pdfWriter = PyPDF2.PdfFileWriter()

    for pageNum in range(pdf1Reader.numPages):
        pageObj = pdf1Reader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

    for pageNum in range(pdf2Reader.numPages):
        pageObj = pdf2Reader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

    # create new pdf 'example3.pdf'
    pdfOutputFile = open('example3.pdf', 'wb')
    pdfWriter.write(pdfOutputFile)

    pdfOutputFile.close()
    pdf1File.close()
    pdf2File.close()
Now we can see the new pdf ‘example3.pdf’ in the working directory.
Note: addPage() always appends a page to the end of the PdfFileWriter object, so pages must be added in the order in which they should appear in the output file.
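Because addPage() only adds pages at the end, one way to place the pages of a second file in the middle of the first is to work out the complete page order up front and then add the pages in that order. The sketch below shows only the planning step in plain Python; plan_page_order and the 'base' and 'insert' labels are hypothetical names chosen for this illustration.

```python
def plan_page_order(base_count, insert_count, position):
    """Return (source, page_index) pairs with the inserted pages placed
    before zero-based page `position` of the base document.

    Hypothetical helper for illustration; it is not part of PyPDF2.
    """
    if not 0 <= position <= base_count:
        raise ValueError("position must be between 0 and base_count")
    before = [('base', i) for i in range(position)]
    middle = [('insert', i) for i in range(insert_count)]
    after = [('base', i) for i in range(position, base_count)]
    return before + middle + after


order = plan_page_order(base_count=3, insert_count=2, position=1)
print(order)
# [('base', 0), ('insert', 0), ('insert', 1), ('base', 1), ('base', 2)]

# Possible use with the readers and writer from the example above:
# readers = {'base': pdf1Reader, 'insert': pdf2Reader}
# for source, index in order:
#     pdfWriter.addPage(readers[source].getPage(index))
```

Keeping the plan as a simple list makes it easy to print and check the order before anything is written to disk.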
3) Rotating pages
PyPDF2 comes with two methods for rotating pdf pages.
rotateCounterClockwise(): Rotates a page counter-clockwise by increments of 90 degrees.
rotateClockwise(): Rotates a page clockwise by increments of 90 degrees.
    import PyPDF2

    pdfFile = open('example.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFile)

    # rotating first page
    # of 'example.pdf' only
    page = pdfReader.getPage(0)

    # rotating clockwise by 90
    page.rotateClockwise(90)

    # rotating counter-clockwise by 270
    # page.rotateCounterClockwise(270)

    # creating object 'pdfWriter'
    # to add rotated page
    pdfWriter = PyPDF2.PdfFileWriter()
    pdfWriter.addPage(page)

    # create new pdf
    pdfOutputFile = open('rotated-example.pdf', 'wb')
    pdfWriter.write(pdfOutputFile)

    pdfOutputFile.close()
    pdfFile.close()
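The commented-out line in the example hints that rotating clockwise by 90 degrees and rotating counter-clockwise by 270 degrees give the same result. The arithmetic behind that can be sketched in plain Python. Here net_rotation is a hypothetical helper, and the sign convention (positive for clockwise, negative for counter-clockwise) is an assumption made for this illustration rather than anything defined by PyPDF2.

```python
def net_rotation(*steps):
    """Combine rotation steps into one clockwise angle: 0, 90, 180 or 270.

    Assumed convention: positive steps are clockwise, negative steps are
    counter-clockwise. Hypothetical helper; it is not part of PyPDF2.
    """
    total = 0
    for step in steps:
        if step % 90 != 0:
            raise ValueError(f"{step} is not a multiple of 90 degrees")
        total += step
    # wrap the sum into a single turn
    return total % 360


print(net_rotation(90))            # 90
print(net_rotation(-270))          # 90, same as one clockwise quarter turn
print(net_rotation(90, 90, -270))  # 270
```

The resulting angle could then be passed to a single rotateClockwise() call instead of chaining several rotations.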
4) Encrypting pdf
To keep a PDF file from being opened by just anyone, PyPDF2 lets us encrypt it with a password.
    import PyPDF2

    pdfFileObj = open('example.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pdfWriter = PyPDF2.PdfFileWriter()

    # copy every page into the writer
    for pageNum in range(pdfReader.numPages):
        pdfWriter.addPage(pdfReader.getPage(pageNum))

    # set the password of the new pdf
    pdfWriter.encrypt('abc')

    # write the encrypted pdf
    resultPdf = open('encrypted-example.pdf', 'wb')
    pdfWriter.write(resultPdf)

    resultPdf.close()
    pdfFileObj.close()
Now a new PDF file named ‘encrypted-example.pdf’ appears in the working directory. Because we set its password to “abc”, we have to enter that password whenever we open the file.
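The password can also be checked from Python by reading the encrypted file back. The sketch below assumes the same PyPDF2 1.x names used throughout this article (PdfFileReader, isEncrypted, decrypt(), getPage(), extractText()). In that version decrypt() returns 0 when the password fails, 1 when it matches the user password and 2 when it matches the owner password. The helper names describe_decrypt_result and first_page_text are hypothetical.

```python
# Meaning of the integer returned by decrypt() in PyPDF2 1.x
DECRYPT_RESULTS = {
    0: 'wrong password',
    1: 'user password accepted',
    2: 'owner password accepted',
}


def describe_decrypt_result(code):
    """Translate a decrypt() return value into a readable message."""
    return DECRYPT_RESULTS.get(code, 'unknown result')


def first_page_text(path, password):
    """Return the text of the first page of a possibly encrypted PDF.

    Hypothetical helper; assumes the PyPDF2 1.x API used in this article.
    """
    import PyPDF2  # imported here so the helper above works without PyPDF2

    with open(path, 'rb') as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        if reader.isEncrypted:
            result = reader.decrypt(password)
            if result == 0:
                raise ValueError(describe_decrypt_result(result))
        return reader.getPage(0).extractText()


# Possible use with the file created above:
# print(first_page_text('encrypted-example.pdf', 'abc'))
```

Calling first_page_text() with a wrong password raises a ValueError instead of returning unreadable content.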
References
https://automatetheboringstuff.com/chapter13/
Happy Learning 🙂