Extracting Data from PDFs documents using tabulapdf in R

Part 1

R
PDF
Data Extraction
Author
Published

June 9, 2024

Introduction

In this blog post, we will explore how to extract data from PDF files using the tabulapdf package in R. PDF is a common file format for sharing data, but extracting structured data from PDFs can be challenging. The tabulapdf package provides an easy way to extract tables from PDFs and convert them into a usable format.

Importance of Extracting Data from PDFs

Extracting tables from PDF files is crucial for several reasons:

  1. Data Accessibility: Many important documents and reports are shared in PDF format. Extracting data from these documents makes it accessible for analysis and further use.
  2. Automated Data Processing: Manual data entry from PDFs is time-consuming and prone to errors. Automated extraction ensures accuracy and efficiency.
  3. Data Integration: Extracting data from PDFs allows for the integration of information from various sources, enhancing the comprehensiveness of data analysis.
  4. Historical Data Analysis: Many historical documents are available only in PDF format. Extracting data from these PDFs allows for the analysis of trends over time.
  5. Enhanced Decision Making: Having access to data in an analyzable format helps in making informed decisions based on comprehensive data analysis.

So extracting tables from PDFs is a critical skill that enables access to important data, ensures accuracy, and saves time in data processing.We can automate the process and make the data extraction workflow more efficient. This not only improves productivity but also ensures that the data extracted is accurate and reliable, providing a solid foundation for any subsequent analysis or decision-making processes.

Given these reasons, it’s evident why efficient tools for extracting data from PDFs are invaluable. So lets learn how to extract data from PDFs using the tabulapdf package in R.

Required Libraries

I have tried tabulizer to install but not successful. I have come across tabulapdf package and find it very handy. To begin, we’ll need to install and load the necessary libraries. Here is the list of packages we’ll be using:

  • rJava
  • tabulapdf
  • pdftools
  • tidyverse

Let’s load these libraries in R.

Code
Sys.setenv(JAVA_HOME = "C:/Program Files/Java/jdk-22") ## One check is for one's system if required for running `rJava`
library(rJava)
library(tabulapdf)
library(pdftools)
library(tidyverse)

Extracting Tables from a PDF

The extract_tables function extracts tables from the specified page of the PDF file. In this case, we extracted the table from the first page of the sample PDF file.

Code
f2 <- "https://raw.githubusercontent.com/ropensci/tabulapdf/main/inst/examples/mtcars.pdf"

extract_tables(f2, pages = 1) 
Table 1: Extracted Table from the PDF
[[1]]
# A tibble: 5 × 12
  model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4 Hornet 4 Dr…  21.4     6   258   110  3.08  3.21  19.4     1     0     3     1
5 Hornet Spor…  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2

Table 1 indicates that the table has been successfully extracted from the PDF file. We can now proceed to convert this table into a data frame for further analysis.

Now we will extract a table from a specific page of another PDF file. This PDF is a monthly report on Consumer Price Index from the Pakistan Bureau of Statistics.

Code
# URL of the monthly review PDF
f1 <- "https://www.pbs.gov.pk/sites/default/files/price_statistics/cpi/may/Monthly%20Review%20May%2C%202024.pdf"
Figure 1: General Inflation (%)

One can see from Figure 1 that the table we want to extract is on the third page of the PDF. Let’s extract the table from the third page of the PDF file.

Code
# Extract and view the table from the third page of the PDF
extract_tables(f1, pages = 3)[[1]] |> head()
Table 2: Extracted Table from the PDF
# A tibble: 6 × 7
  `Table 1.a` ...2      ...3      General Inflation (%)(Base…¹ ...5  ...6  ...7 
  <chr>       <chr>     <chr>     <chr>                        <chr> <chr> <chr>
1 <NA>        <NA>      <NA>      CPI                          <NA>  <NA>  <NA> 
2 <NA>        <NA>      General   Food                         Non-… SPI** WPI  
3 PERIOD      National  Urban     Rural Urban Rural            Urba… <NA>  <NA> 
4 <NA>        YoY MoM   YoY MoM   YoY MoM YoY MoM YoY MoM      YoY … YoY … YoY …
5 Aug-22      27.3 2.4  26.2 2.6  28.8 2.2 28.8 1.6 30.2 1.2   24.7… 34.0… 41.2…
6 Sep-22      23.2 -1.2 21.2 -2.1 26.1 0.2 30.8 5.2 32.7 5.7   15.2… 28.6… 38.9…
# ℹ abbreviated name: ¹​`General Inflation (%)(Base 2015-16)`

Table 2 indicates that the table has been successfully extracted from the PDF file. We can now proceed to convert this table into a data frame for further analysis.

Converting Extracted Table to Data Frame

After extracting the table, we need to convert it into a data frame and clean the data for further analysis.

Code
table <- extract_tables(f1, pages = 3)[[1]]

# Convert the extracted table to a data frame
table_df <- as.data.frame(table)

# Skip the first four rows
table_filtered <- table_df %>% slice(-(1:4))

# Combine the rows into a single text string
table_text <- apply(table_filtered, 1, paste, collapse = " ")

# Split the text into columns
table_separated <- str_split_fixed(table_text, "\\s+", n = 19)  # Adjust 'n' based on the number of columns

# Convert the separated text into a data frame
table_separated_df <- as.data.frame(table_separated)

Renaming Columns

We will now rename the columns to make the data more readable and meaningful.

Code
# Rename the columns of the data frame
colnames(table_separated_df) <- c(
  "Period", 
  "CPI_general_yoy", "CPI_general_mom", 
  "CPI_general_yoy_urban", "CPI_general_mom_urban", 
  "CPI_general_yoy_rural", "CPI_general_mom_rural", 
  "CPI_food_yoy_urban", "CPI_food_mom_urban", 
  "CPI_food_yoy_rural", "CPI_food_mom_rural", 
  "CPI_non_food_yoy_urban", "CPI_non_food_mom_urban", 
  "CPI_non_food_yoy_rural", "CPI_non_food_mom_rural", 
  "SPI_yoy", "SPI_mom", 
  "WPI_yoy", "WPI_mom"
)

Cleaned Data Frame

Code
table_separated_df |> kableExtra::kable()
Period CPI_general_yoy CPI_general_mom CPI_general_yoy_urban CPI_general_mom_urban CPI_general_yoy_rural CPI_general_mom_rural CPI_food_yoy_urban CPI_food_mom_urban CPI_food_yoy_rural CPI_food_mom_rural CPI_non_food_yoy_urban CPI_non_food_mom_urban CPI_non_food_yoy_rural CPI_non_food_mom_rural SPI_yoy SPI_mom WPI_yoy WPI_mom
Aug-22 27.3 2.4 26.2 2.6 28.8 2.2 28.8 1.6 30.2 1.2 24.7 3.3 27.5 3.1 34.0 5.2 41.2 3.1
Sep-22 23.2 -1.2 21.2 -2.1 26.1 0.2 30.8 5.2 32.7 5.7 15.2 -6.6 20.1 -4.9 28.6 -1.4 38.9 1.4
Oct-22 26.6 4.7 24.6 4.5 29.5 5.0 34.7 4.5 37.2 6.1 18.2 4.5 22.4 3.9 24.0 -1.5 32.6 -0.5
Nov-22 23.8 0.8 21.6 0.4 27.2 1.3 29.7 0.0 33.5 0.5 16.4 0.6 21.4 2.1 27.1 6.1 27.7 0.0
Dec-22 24.5 0.5 21.6 0.3 28.8 0.7 32.7 0.0 37.9 0.1 14.8 0.6 20.7 1.4 27.8 0.2 27.1 -0.7
Jan-23 27.6 2.9 24.4 2.4 32.3 3.6 39.0 3.9 45.2 5.7 15.6 1.2 20.9 1.5 30.5 1.3 28.5 1.8
Feb-23 31.5 4.3 28.8 4.5 35.6 4.0 41.9 4.3 47.0 3.9 20.8 4.7 25.3 4.1 33.6 3.7 36.4 8.2
Mar-23 35.4 3.7 33.0 3.9 38.9 3.5 47.1 5.6 50.2 4.5 24.1 2.7 28.5 2.4 40.4 5.8 37.5 4.7
May-23 36.4 2.4 33.5 2.0 40.7 3.0 46.8 3.4 52.2 4.2 24.9 1.0 29.9 1.6 42.1 2.7 33.4 0.1
May-23 38.0 1.6 35.1 1.5 42.2 1.7 48.1 1.9 52.4 1.4 26.6 1.2 32.5 2.0 43.0 1.3 32.8 1.0
Jun-23 29.4 -0.3 27.3 0.1 32.4 -0.8 40.8 0.1 41.5 -1.6 18.7 0.1 23.8 0.1 34.9 0.2 22.4 -0.3
Jul-23 28.3 3.5 26.3 3.6 31.3 3.3 40.2 3.7 41.3 3.5 17.3 3.5 22.0 3.0 29.3 2.8 23.1 2.5
Aug-23 27.4 1.7 25.0 1.6 30.9 1.9 38.8 0.6 40.6 0.8 16.3 2.4 22.0 3.1 27.9 4.1 24.3 4.2
Sep-23 31.4 2.0 29.7 1.7 33.9 2.5 33.9 1.5 35.4 1.8 26.8 1.8 32.3 3.2 32.0 1.7 26.4 3.1
Oct-23 26.8 1.0 25.5 1.1 28.7 0.9 28.9 0.6 28.6 0.7 23.1 1.4 28.8 1.1 34.2 0.2 24.6 -1.9
Nov-23 29.2 2.7 30.4 4.3 27.5 0.4 29.8 0.8 29.2 1.0 30.9 7.0 25.9 -0.2 30.6 3.3 26.4 1.4
Dec-23 29.7 0.8 30.9 0.7 27.9 1.0 28.8 -0.7 29.3 0.1 32.4 1.8 26.4 1.8 35.3 3.8 27.3 0.0
Jan-24 28.3 1.8 30.2 1.8 25.7 1.9 27.4 2.8 25.1 2.3 32.3 1.1 26.3 1.4 36.2 2.0 27.0 1.5
Feb-24 23.1 0.0 24.9 0.2 20.5 -0.3 20.2 -1.5 19.0 -1.1 28.2 1.5 22.1 0.6 30.4 -0.8 18.7 1.1
Mar-24 20.7 1.7 21.9 1.4 19.0 2.1 16.6 2.4 17.1 2.8 25.8 0.7 21.0 1.4 25.9 2.1 14.8 1.3
Apr-24 17.3 -0.4 19.4 -0.1 14.5 -0.9 11.3 -1.3 9.5 -2.6 25.6 0.8 20.0 0.8 21.6 -0.7 13.9 -0.7
May-24 11.8 -3.2 14.3 -2.8 8.2 -3.9 2.2 -6.3 -0.1 -7.4 23.6 -0.4 17.2 -0.4 15.3 -4.0 9.9 -2.5

Saving the Data to a CSV File and Excel File

Finally, we can save the cleaned data to a CSV file for future use.

Code
# Save the table to a CSV file
write.csv(table_separated_df, "table_separated_df.csv", row.names = FALSE)
# Save the table to an Excel file
writexl::write_xlsx(table_separated_df, "table_separated_df.xlsx")

Conclusion

In this post, we demonstrated how to extract tables from PDF files using the tabulapdf package in R. We walked through the steps of loading necessary libraries, extracting tables from PDFs, converting them to data frames, cleaning the data, and finally saving it to a CSV file. This process can be very useful for extracting and analyzing data from PDF reports and documents.