PDF Text Extraction in Python

PyPDF2 and Tabula Examples

I've been experimenting with forecasting approaches for a while to understand the process and the methods. One of them is Prophet, a procedure developed by Facebook that is easy to use from both R and Python.

Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects [1]. For this reason, I wanted to include public holidays while working with Prophet. The data set I used covers the years 2017-2021, and while looking for the holidays in that period I found a PDF file listing the public holidays for 2014-2022 [2].

I had been wanting to try a few packages for scraping text from PDF files for a while, so this was a good chance to do so.

The relevant PDF file has 2 pages and 9 tables. The image below shows that each table consists of a header, 2 columns, and 16 rows. Date information is on the left of the tables.

[Image: Python - PDF Text Scraping. Public holidays by year, 2014 - 2022 (original title: "2014 - 2022 Yıllara Göre Resmi Tatil Günleri")]

Text Scraping Process with Python

It is effortless to read, split, merge, crop, and transform the pages of PDF files with Python. However, some packages may not provide very successful results depending on how complex the content of the PDF file is.

Python has many popular packages for reading and editing PDF files, such as PDFMiner, PyPDF2, Tabula-py, Slate, PDFQuery, and xpdf [3]. One of these, PDFMiner, is now developed as the pdfminer.six fork. Among the options, PyPDF2 [4] and Tabula-py [5] stand out for their popularity, so I decided to try both.

PyPDF2

PyPDF2, a utility for reading and writing PDF files with Python, is a fork of pyPdf developed since 2011.

Reading starts by initializing a PdfFileReader object. Its stream parameter accepts a file object, a file-like object, or a string with the path to a PDF file [6].

Let's move on to our sample process right away.

#!pip install "PyPDF2<3"  # the PdfFileReader API used below was removed in PyPDF2 3.0

import urllib.request, io, PyPDF2
import pandas as pd

url = "https://fenbil.aku.edu.tr/FENBILENS/takvim/2014-2022-1RESMI.pdf"
remoteFile = urllib.request.urlopen(url)
pdfReader = PyPDF2.PdfFileReader(io.BytesIO(remoteFile.read()))

Now we can use methods and properties of pdfReader. pdfReader.numPages will give us the number of pages of the PDF file.

We can get the texts separately from each of the pages. For this, we can use the number of pages for the loop. For example, we can get the texts on the first page with pdfReader.getPage(0).extractText().

Let's write the text to a file so that we can use it in the next steps.

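# Open the file once in write mode so that re-running the script does not append duplicates
with open('pdf.txt', 'w', encoding='utf-8') as file:
  for i in range(pdfReader.numPages):
    file.write(pdfReader.getPage(i).extractText())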
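A note on versions: PyPDF2 3.0 removed the camelCase names used above (PdfFileReader, numPages, getPage, extractText) in favor of PdfReader, reader.pages, and page.extract_text(). Below is a minimal sketch of the same per-page extraction written against the newer naming. The helper name extract_text_per_page is my own, and the stand-in reader exists only so that the sketch runs without a PDF file; it assumes nothing beyond an object with a pages sequence whose items have an extract_text() method.

```python
from types import SimpleNamespace


def extract_text_per_page(reader):
    """Return the text of every page of a PdfReader-like object.

    `reader` only needs a `pages` sequence whose items offer `extract_text()`,
    which is the interface newer PyPDF2 releases expose.
    """
    # `or ""` guards against a page that yields no text at all.
    return [page.extract_text() or "" for page in reader.pages]


# Stand-in reader (hypothetical data) so that the sketch runs on its own.
fake_reader = SimpleNamespace(pages=[
    SimpleNamespace(extract_text=lambda: "first page text"),
    SimpleNamespace(extract_text=lambda: None),
    SimpleNamespace(extract_text=lambda: "last page text"),
])

pages = extract_text_per_page(fake_reader)
text = "\n".join(pages)  # one string that can be written to pdf.txt
```

With PyPDF2 3.x installed, passing PyPDF2.PdfReader(io.BytesIO(remoteFile.read())) instead of the stand-in should give the same text as the loop above.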

As seen in the text file, there are many spaces and shifts in the table areas.

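# Recent pandas versions reject "\n" as a delimiter, so the lines are read directly
with open('pdf.txt', encoding='utf-8') as file:
  df = pd.DataFrame({'date': [line.strip() for line in file if line.strip()]})
df.head()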

Honestly, it didn't turn out the way I wanted. Due to the complications between columns and some data not being scraped, I decided to try another package.

date
0   01 Ocak 2016 Cuma
1   23 Nisan 2016 Cumartesi
2   27 Temmuz 2014 Pazar
3   Arefe
4   04 Temmuz 2016 Pazartesi
5   28 Temmuz 2014 Pazartesi
6   17 Temmuz 2015 Cuma
7   18 Temmuz 2015 Cumartesi
8   19 Temmuz 2015 Pazar
9   03 Ekim 2014 Cuma
10  Arefe
...

You can find the relevant code snippet as a single piece on Python-PyPDF2.py.
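Whichever package produces them, date strings such as "01 Ocak 2016 Cuma" in the output above still have to be turned into real dates, and rows such as "Arefe" have to be skipped. Here is a minimal sketch of that step using only the standard library. The names TURKISH_MONTHS, DATE_PATTERN, and parse_turkish_date are my own and not part of PyPDF2 or the original code, and the sketch assumes each date is written as day, Turkish month name, and four-digit year.

```python
import re
from datetime import date

# Turkish month names mapped to month numbers.
TURKISH_MONTHS = {
    "Ocak": 1, "Şubat": 2, "Mart": 3, "Nisan": 4,
    "Mayıs": 5, "Haziran": 6, "Temmuz": 7, "Ağustos": 8,
    "Eylül": 9, "Ekim": 10, "Kasım": 11, "Aralık": 12,
}

# Day, month name, four-digit year; a trailing weekday name is ignored.
DATE_PATTERN = re.compile(r"(\d{1,2})\s+(\S+)\s+(\d{4})")


def parse_turkish_date(text):
    """Return a date for text like '01 Ocak 2016 Cuma', or None if none is found."""
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    day, month_name, year = match.groups()
    month = TURKISH_MONTHS.get(month_name)
    if month is None:
        return None
    return date(int(year), month, int(day))


# Sample rows taken from the output above.
lines = ["01 Ocak 2016 Cuma", "Arefe", "27 Temmuz 2014 Pazar"]
parsed = [d for d in map(parse_turkish_date, lines) if d is not None]
print(parsed)
```

Returning None for unmatched rows keeps the filtering explicit, so rows that carry no date can be counted or inspected instead of being dropped silently.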

Tabula-py

Tabula-py requires Java 8+ and Python 3.6+. It can export tables to formats such as CSV, TSV, and JSON, return them as pandas DataFrames, and accept a URL in place of a file path [7].

Let's jump right into our example.

I doubt that my loop is the most appropriate solution. If I find a better way to handle a nested list of 3 tables, each with shape (16, 3), I will update the code.
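One possible alternative to a hand-written nested loop is itertools.chain from the standard library. The sketch below is only an illustration under assumptions: it treats each table as a plain list of rows rather than a pandas DataFrame, the function name flatten_tables is hypothetical, and the cell values are placeholders that merely reproduce the 3 x (16, 3) shape described above.

```python
from itertools import chain


def flatten_tables(tables):
    """Flatten a list of tables (lists of rows, each a list of cells) into one list."""
    rows = chain.from_iterable(tables)      # every row of every table
    return list(chain.from_iterable(rows))  # every cell of every row


# Placeholder data with the same shape: 3 tables of 16 rows x 3 columns.
tables = [
    [[f"table{t}-row{r}-col{c}" for c in range(3)] for r in range(16)]
    for t in range(3)
]

cells = flatten_tables(tables)
print(len(cells))  # 144
```

The 144 cells match the number of holiday records in the final result, since 3 x 16 x 3 = 144.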

My complete code for this step is available as a single file, Python-Tabula.py. Its output is:

0     2014-01-01
1     2014-04-23
2     2014-05-01
3     2014-05-19
4     2014-07-27
         ...    
139   2022-07-11
140   2022-07-12
141   2022-08-30
142   2022-10-28
143   2022-10-29
Name: Date, Length: 144, dtype: datetime64[ns]

As a result of this process, we have correctly created a list of the holidays with 144 records, which can now be passed to Prophet.

from prophet import Prophet  # the package was named fbprophet before version 1.0

m = Prophet(holidays=holidays)
forecast = m.fit(df).predict(future)
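Prophet expects holidays to be a DataFrame with at least a holiday column (a label) and a ds column (the date). Below is a minimal sketch of how the extracted dates could be shaped into such rows with the standard library alone. The function name to_holiday_records and the label "public_holiday" are hypothetical, and the three dates are simply the first rows of the output above.

```python
from datetime import date


def to_holiday_records(dates, name="public_holiday"):
    """Build one {'holiday', 'ds'} row per unique date, in chronological order."""
    return [{"holiday": name, "ds": d} for d in sorted(set(dates))]


# First three dates from the extracted list shown earlier.
holiday_dates = [date(2014, 1, 1), date(2014, 4, 23), date(2014, 5, 1)]
records = to_holiday_records(holiday_dates)

for row in records:
    print(row["holiday"], row["ds"].isoformat())
```

Wrapping the result in pd.DataFrame(records) should give a frame that can be passed as holidays.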

In the next article, I will try to share my notes about Prophet.