Read PDFs

data cleaning, strings | 18 March 2024

We are often handed data as PDFs. Wouldn’t it be nice if we could pull the tables out and analyze them in Python? This post is specifically about reading tables from PDFs.

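Note: install the library first with pip install tabula-py. It is a wrapper around tabula-java, so a Java runtime needs to be installed as well.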

Read the whole document

read all.py

import tabula

# Specify the path to your PDF file
pdf_path = r"C:/Users/sache/Downloads/Citi-2023-Annual-Report.pdf"
# Read tables from the PDF
tables = tabula.read_pdf(pdf_path, pages='all')

# Iterate through extracted tables
for table in tables:
    print(table)

But often we only need one specific table on one page. For example, the document we are reading here is Citigroup's 2023 annual report, which is nearly 400 pages long.

Read one table into pandas

read all.py

import tabula

# Specify the path to your PDF file
pdf_path = r"C:/Users/sache/Downloads/Citi-2023-Annual-Report.pdf"
# Read only the tables found on page 17
tables = tabula.read_pdf(pdf_path, pages='17')

# read_pdf returns a list of pandas DataFrames; keep the first one
df = tables[0]
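
A page can hold more than one table, so the first entry in the list is not always the one we want. One way to choose, sketched below, is to search each extracted DataFrame for a keyword that appears in the target table. The helper name find_table and the sample frames are made up for illustration; in practice the list would come from tabula.read_pdf.

```python
import pandas as pd


def find_table(tables, keyword):
    """Return the first DataFrame whose headers or cells contain keyword.

    Hypothetical helper, not part of tabula-py. The match is
    case-insensitive; returns None if no table matches.
    """
    needle = keyword.lower()
    for table in tables:
        # Check the column headers first
        if any(needle in str(col).lower() for col in table.columns):
            return table
        # Then check every cell, converted to text
        cells = table.to_numpy().ravel()
        if any(needle in str(cell).lower() for cell in cells):
            return table
    return None


# Made-up stand-in for the list that tabula.read_pdf returns for one page
tables = [
    pd.DataFrame({"Unnamed: 0": ["Note 1", "Note 2"]}),
    pd.DataFrame({"Item": ["Total revenues", "Net income"], "2023": ["100", "10"]}),
]

df = find_table(tables, "net income")
print(df)
```

A keyword search like this is only a heuristic. If two tables on the page share the keyword, the first match wins, so it is worth printing the result once to confirm it is the right table.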

Once the data is in Python, we will probably need to clean it up, as the PDF table may not have been read correctly.
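
What that cleanup looks like depends on the table, but a few steps come up often: dropping rows and columns that came through empty, trimming stray whitespace, and turning text like 1,234 or (567) into numbers. The sketch below runs on a small made-up DataFrame that mimics raw extracted output. The column names, the values, and the to_number helper are assumptions for illustration, not data from the Citi report.

```python
import pandas as pd


def to_number(text):
    """Convert strings such as '1,234' or '(567)' to float.

    Hypothetical helper. Parentheses are treated as negatives,
    a common convention in financial statements. Anything that
    cannot be parsed becomes NaN.
    """
    if pd.isna(text):
        return float("nan")
    cleaned = str(text).strip().replace(",", "").replace("$", "")
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    if negative:
        cleaned = cleaned[1:-1]
    try:
        value = float(cleaned)
    except ValueError:
        return float("nan")
    return -value if negative else value


# Made-up stand-in for a table as it might come out of a PDF
raw = pd.DataFrame(
    {
        "Item": [" Revenue ", None, "Expenses", "Net"],
        "Unnamed: 1": [None, None, None, None],
        "2023": ["1,234", None, "(567)", "667"],
    }
)

clean = (
    raw.dropna(axis=1, how="all")   # drop columns with no data
       .dropna(axis=0, how="all")   # drop rows with no data
       .reset_index(drop=True)
)
clean["Item"] = clean["Item"].str.strip()     # trim stray whitespace
clean["2023"] = clean["2023"].map(to_number)  # text to numbers

print(clean)
```

Unparseable cells are turned into NaN here instead of raising an error, which makes them easy to find afterwards with clean.isna().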