Data is present in all areas of the modern digital world, and it takes many different forms.
One of the most common formats for data is PDF. Invoices, reports, and other forms are frequently stored in Portable Document Format (PDF) files by businesses and institutions.
It can be laborious and time-consuming to extract data from PDF files. Fortunately, for easy data extraction from PDF files, Python provides a variety of libraries.
This tutorial will explain how to extract data from PDF files using Python. You'll learn how to install the necessary libraries and I'll provide examples of how to do so.
There are several Python libraries you can use to read and extract data from PDF files. These include PDFMiner, PyPDF2, PDFQuery and PyMuPDF. Here, we will use PDFQuery to read and extract data from multiple PDF files.
How to Use PDFQuery
PDFQuery is a Python library that provides an easy way to extract data from PDF files by using CSS-like selectors to locate elements in the document.
It reads a PDF file as an object, converts the PDF object to an XML file, and accesses the desired information by its specific location inside of the PDF document.
Let's consider a short example to see how it works.
from pdfquery import PDFQuery pdf = PDFQuery('example.pdf') pdf.load() # Use CSS-like selectors to locate the elements text_elements = pdf.pq('LTTextLineHorizontal') # Extract the text from the elements text = [t.text for t in text_elements] print(text)
In this code, we first create a PDFQuery object by passing the filename of the PDF file we want to extract data from. We then load the document into the object by calling the
Next, we use CSS-like selectors to locate the text elements in the PDF document. The
pq() method is used to locate the elements, which returns a PyQuery object that represents the selected elements.
Finally, we extract the text from the elements by accessing the
text attribute of each element and we store the extracted text in a list called
Let's consider another method we can use to read PDF files, extract some data elements, and create a structured dataset using PDFQuery. We will follow the following steps:
- Package installation.
- Import the libraries.
- Read and convert the PDF files.
- Access and extract the Data.
First, we need to install PDFQuery and also install Pandas for some analysis and data presentation.
pip install pdfquery pip install pandas
Import the libraries
import pandas as pd import pdfquery
We import the two libraries to be be able to use them in our project.
Read and convert the PDF files
#read the PDF pdf = pdfquery.PDFQuery('customers.pdf') pdf.load() #convert the pdf to XML pdf.tree.write('customers.xml', pretty_print = True) pdf
We will read the pdf file into our project as an element object and load it. Convert the pdf object into an Extensible Markup Language (XML) file. This file contains the data and the metadata of a given PDF page.
The XML defines a set of rules for encoding PDF in a format that is readable by humans and machines. Looking at the XML file using a text editor, we can see where the data we want to extract is.
Access and extract the Data
We can get the information we are trying to extract inside the
LTTextBoxHorizontal tag, and we can see the metadata associated with it.
The values inside the text box, [68.0, 231.57, 101.990, 234.893] in the XML fragment refers to
Left, Bottom, Right, Top coordinates of the text box. You can think of this as the boundaries around the data we want to extract.
Let’s access and extract the customer name using the coordinates of the text box.
# access the data using coordinates customer_name = pdf.pq('LTTextLineHorizontal:in_bbox("68.0, 231.57, 101.990, 234.893")').text() print(customer_name) #output: Brandon James
And that's it, we are done!
Note: Sometimes the data we want to extract is not in the exact same location in every file which can cause issues. Fortunately, PDFQuery can also query tags that contain a given string.
Data extraction from PDF files is a crucial task because these files are frequently used for document storage and sharing.
Python's PDFQuery is a potent tool for extracting data from PDF files. Anyone looking to extract data from PDF files will find PDFQuery to be a great option thanks to its simple syntax and comprehensive documentation. It is also open-source and can be modified to suit specific use cases.
Let's connect on Twitter and on LinkedIn. You can also subscribe to my YouTube channel.