Rajendra Gupta
View examples

Import data from PDF files using R Scripts SQL Server

December 8, 2020 by

In this article, we will read and import data from a PDF file using the R scripts SQL Server.

Introduction

In today’s digital world, data is available in many formats such as Excel, CSV, PDF, HTML, JSON, XML, TXT. We use SQL Server Integration Services for data imports and exports in SQL Server. It can read data from various data sources, databases, transform them in the required format.

You cannot import or export data in PDF format from the SSIS package directly. The PDF is a popular format for all useful documents such as agreements, contracts, policy documents, research. Generally, we use Microsoft word for preparing the documents and later converts into PDF formats.

In the article, Importing data from a PDF file in Power BI Desktop, we explored the Power BI Desktop feature for the data import from a PDF file.

SQL Server Machine Learning Language provides various functionality in the SQL Server. You can directly run your external R script from the SQL Server console. In this 4th article on SQL Server R script, we will use the R functionality to read and import data from a portal file format (PDF) file.

Environment details

You need the SQL Server environment as specified below for this article.

  • Version: SQL Server 2019
  • Installed Feature: R Machine learning Language
  • SQL Server Launchpad and Database engine service should be in running state
  • SQL Server Management Studio

Read data from PDF files using R Scripts SQL Server

We can use the external libraries to use the required functions in the R Scripts SQL Server. For this article, we use pdftools external library.

Launch the administrative R console from the C:\Program Files\Microsoft SQL Server\MSSQL15.INST1\R_SERVICES\bin and install the PDFTools library using the below SQL Server R Script.

install.packages(“pdftools”)

It downloads the PDFTools and installs for your R scripts SQL Server.

Install Packages

Once installed, you can use sp_execute_external_script stored procedure and print the pdftools package version.

You get the package version, it shows that you can use PDFTools for your scripts.

sp_execute_external_script stored procedure

For this article, I download the sample PDF file from the URL. The File content is as below.

Sample PDF

Import the library in your R script session. If we want to check the PDF files available in our current R directory, you can use the list.files() function in R Scripts SQL Server and filter the results for PDF files.

>  library(pdftools)
> files <- list.files(pattern = “pdf$”)

As shown below, It returns the PDF file name. In my case, the file name is MyData.pdf

Read PDF Files

Now, to read the PDF file content, we use the function pdf.text() in the SQL Server R script.

data <- pdf_text(“MyData.pdf”)
data

In the output, you get file contents. You can compare the output with your original PDF document to verify the contents.

If you compare this with the original PDF document, you can easily see that all of the information is available even if it is not ready to be analyzed. What do you think is the next step needed to make our data more useful?

Read data from PDF file

In the above image, the R Scripts SQL Server displays PDF text without any formatting. It does not consider any space or line break between the sentences. We might want to display all characters in a single line. We can use the strsplit() function, and it breaks down the extracted text in multiple lines using the new line character(\n).

> data<- pdf_text(“MyData.pdf”)
> Splitdata<-strsplit(data,”\n”)
> Splitdata

Look at the following screenshot. Here, we get the extracted characters in separate lines similar to the original PDF document.

Split data

Let’s convert the script into the SQL Server R script format using the sp_execute_external_scripts stored procedure.

In the below script, we use the following arguments:

  • @language: Specify the machine learning language R
  • @Script: In the script section, we specify the R script that we want to execute from SQL Server
  • In the Grid format, the PDF text appears in a single block.

    Split data in SQL Server R Script

    In the above screenshot, we do not get any column heading. To display the heading, we can specify it using the WITH RESULT SET in R Scripts SQL Server and define the column header name with its data type.

    Use column heading

    If you display result in the text ( Result to text – CTRL+T) in SSMS, it gives you formatted data because it reads whole data in a text format ( 1 row affected).

    result in the text

    Similarly, you can modify the R Scripts SQL Server with the strsplit() function and split the extracted text into multiple lines.

    Now, we have 17 lines in the SSMS grid result as per our original document.

    SSMS grid result

    Read data from a PDF file and Insert data into SQL Server table

    Till now, we have read data directly from the PDF file using the SQL Server R script. Most of the time, we want to import into SQL tables as well.

    For this purpose, create a SQL table and define the data type as Varchar(). You should use the appropriate data length in the varchar(). For the demonstration purpose, I specified it as varchar(1000).

    Once you run the below T-SQL with the SQL Server R Script, it does not display the extracted text into the SSMS output. It inserts data into newly created [ExtractedPDFData].

    Insert data into SQL table

    You can query the SQL table, and it shows you the extracted data from the pDF file using SQL Server R Script.

    Query SQL table

    Text mining using Machine learning language R Scripts SQL Server for the PDF data

    We can do text mining using the imported data from a PDF file using the SQL Server R script. Text mining means doing data analysis on input data.

    You need to install the tm external package for this demonstration.

    install.packages(“tm”)
    library(“tm”)

    Here, we do not cover text mining in detail, but this article gives you an overview of implementing it for your dataset. The Corpus is a collection of text documents so that we can apply text mining.

    data <- pdf_text(“MyData.pdf”)
    extracteddata <- Corpus(VectorSource(text))

    Now, on the extracted data, we apply the following transformations:

    • We convert the texts into lower case. It makes the words look similar. For example, the PDF and pdf will be the same words for analysis purposes

      extracteddata  <- tm_map(extracteddata , content_transformer(tolower))

    • The below command removes the common words for the English language from the extracted PDF data

      extracteddata  <- tm_map(extracteddata , removeWords, stopwords(“english”))

    • Remove any punctuation marks from the extracted text

      extracteddata  <- tm_map(extracteddata , removePunctuation)

    • Remove the numbers from the extracted text

      extracteddata <- tm_map(extracteddata, removeNumbers)

    Now, we want to prepare the term-document matrix for the top 3 used words. It displays the words and their frequency in the extracted data.

    In the below script, we use the TermDocumentMatrix to construct a term-document matrix. We further sort the data and sort them in descending order of their frequency. Further, it displays the top 2 records using the head() function.

    extracteddata_matrix <- TermDocumentMatrix(extracteddata)
    m <- as.matrix(extracteddata_matrix)
    v <- sort(rowSums(m),decreasing=TRUE)
    d <- data.frame(word = names(v),freq=v)
    head(d, 3)

    Let’s convert it to SQL Server R script and then execute the t-SQL. In the below R script, we embed the code into the @script argument.

    As per the below script output, we have the following word frequency.

    Text mining using the Machine learning language R Scripts

    • PDF – 7 times
    • Files – 6 times

    To verify the same, open the PDF file and search for the keywords PDF and Files. As shown below, it matches the frequency of the words received in the SQL Server R script output.

    View examples

    Conclusion

    In this article, we explored reading data from a PDF document using the SQL Server R script. Further, we saw an example of text mining on the extracted data. You can further explore more useful functions for text mining using the R service documentation.

    Rajendra Gupta
Machine learning, R

About Rajendra Gupta

As an MCSA certified and Microsoft Certified Trainer in Gurgaon, India, with 13 years of experience, Rajendra works for a variety of large companies focusing on performance optimization, monitoring, high availability, and disaster recovery strategies and implementation. He is the author of hundreds of authoritative articles on SQL Server, Azure, MySQL, Linux, Power BI, Performance tuning, AWS/Amazon RDS, Git, and related technologies that have been viewed by over 10m readers to date. He is the creator of one of the biggest free online collections of articles on a single topic, with his 50-part series on SQL Server Always On Availability Groups. Based on his contribution to the SQL Server community, he has been recognized with various awards including the prestigious “Best author of the year" continuously in 2020 and 2021 at SQLShack. Raj is always interested in new challenges so if you need consulting help on any subject covered in his writings, he can be reached at rajendra.gupta16@gmail.com View all posts by Rajendra Gupta

2,626 Views