Rajendra Gupta

Importing data from a PDF file in Power BI Desktop

March 1, 2019 by

Power BI Desktop is a useful reporting and analytical tool to represent data in various formats. These presentations help us to quickly understand information and circulate it to stakeholders in a visual fashion.

We can connect Power BI Desktop to disparate data sources including multiple relational and non-relation databases, Cloud, SharePoint, and Microsoft Exchange etc. In the real world, we use various file formats to represent our documents.

In Power BI, we can get data from below file formats

  • Excel
  • Text\CSV
  • XML
  • JSON
  • SharePoint Folder

PowerBI document file types

Although popular as an export format, you cannot easily modify a PDF document, and it can contain the text, images, tables, and charts in PDF document.

In Power BI Desktop, we cannot get data from PDF documents directly. Also, we do not have any option to use PDF as a data source. But suppose we want to extract a table from the PDF file and prepare visualizations on it. We can import data from PDF with following methods.

  • Convert the PDF file to a Microsoft Word or HTML document and copy the table in an appropriate data source such as Microsoft Excel. It is a lengthy process and if we have to do the same process repeatedly, it quickly becomes cumbersome
  • We can also use programming languages such as R to extract the required data from a PDF file. It requires extensive technical knowledge of a machine language such as R to work with these PDF files. If the format of the PDF files changes or we need to import a table from another PDF file, we require a change in the programming code. So this again creates organizational friction to contact developers and ask for the code

In both approaches, we require extra overhead to prepare PDF reports in Power BI. In this article, we will show how import data from a PDF file easily and without any coding.

Power BI PDF Connector

The Power BI Desktop contains many preview features. These features are not enabled by default. We can use the PDF Connector preview feature in Power BI to use a PDF file as a data source.

Launch the Power BI and go to the File menu. In the File menu, go to Options and settings and click on Options.

OPening PowerBI options and settings

It opens various Power BI configuration options. We can configure various options such as Data load, Power BI Query, R scripting, security, Report settings. We need to click on the Preview features menu option to use the PDF connector in Power BI Desktop data source.

Configuring PowerBI options and settings

It opens the various preview features in the Power BI Desktop.

Note: This preview list is updated regularly. Some of the preview features might move to general availability. We will also get new features with the new release of Power BI Desktop.

In the preview feature list, enable Get Data from PDF files. We can also read the documentation from the Learn more. It opens a web page for Microsoft Docs. You should have an active internet connection to go through these documentations.

Click on the checkbox in front of Get data from PDF files and click Ok. We need to restart Power BI to enable the preview feature. Close Power BI Desktop and re-open it.

Configuring PowerBI to improt PDF files

Once we relaunch Power BI Desktop, go to Get Data->More again from the menu bar. In the following screenshot, you can get a new option under ‘File’. We get the option ‘PDF (Beta) to use as a data source.

Specifying PowerBI PDF datasource

Click on PDF (Beta) connector and connect. We again get a warning message that this PDF connector is still under development. Since it is still a preview version, it might have a few bugs that will be resolved before the general availability of this feature.

PowerBI preview connector

Click on Continue. If we do not want this message, then you can put a check on Don’t warn me again for this connector.

In this article, we are going to import a sample PDF file that contains the Monthly sales analysis. We can see the text, images along with data table in this PDF file. We need to import table from the following PDF file. We can prepare Power BI Reports using this data.

PDF to import with PowerBI

Once you click on Connect and provide path of this PDF file from a local directory.

Opening the PDF to import with PowerBI

It connects to PDF file and opens a Navigator. It lists out the table from PDF and a list of all pages. If we select the page, it shows the complete page content in the Navigator preview window.

PowerBI PDF import navigator

We need to get data from this table, therefore, put a check in front of the table, and it shows the table on the right-hand side page.

PoweBI PDF import preview

You can see a table from our PDF file. In this preview, we need to make a few changes to display the table contents in a proper format. We can make the changes as well in data imported from the PDF file. Click on Edit.

It opens a Power Query Editor. Power Query Editor helps to make the changes in table format, add or remove any column, row, split column etc.

Making changes to the imported table when importing a PDF in PowerBI

In this Power Query Editor, you can see that the fourth row contains the column names. We do not want to have the top three rows in our table. Therefore, we need to remove these three rows from the top.

Removing rows from a PDF import with PowerBI

We need to open the list of options for Remove Rows. In the options list, click on Remove Top Rows. It opens a pop-up box to specify the number of rows from the top we want to remove. We want to remove the top three rows from the top, therefore, enter the value and click OK.

Removing top rows from a PDF import with PowerBI

It removes the number of rows specified from the top (in our case three rows) as shown in below image.

Preview after rows of PDF import have been removed in PowerBI

Now the top row in this table contains the column list. Currently, we do not have any column name defined for the table. We want the top row to represent the column names, therefore, click on the Use First Row as Headers as shown in the image above. It uses the top row as a header of the table. Once we clicked on the Use First Rows as Headers, we can see the column name in our data table.

Use first row as header when importing a PDF with PowerBI

We might do further filtering in our data. We do not want any NULL values to be there in a table for the location column. Click on the location column, remove the checkbox from the NULL value, and click OK.

Filtering PDF import data with PowerBI

It removes Null values from the table, and you can see data in the tabular format now. You can match this data with the PDF file to have validation.

Validating PDF data import with PowerBI

We need to click on the Close & Apply from the menu bar to save the changes. It makes all required changes and we can see table fields in the FIELDS section. It identifies the data type accurately, and we can use this data for the reporting purpose easily.

Saving PDF import project changes in PowerBI

We can choose the required fields from this data set and create a Power BI Visualization.

Creating a PowerBI visualization from the PDF import

We can view data as well from the ‘Data’ tab on the left-hand side.

Viewing the imported PDF data in PowerBI

Conclusion

In this article, we explored the quick and easy way to get the data from a PDF data source. It is an exciting enhancement to Power BI Desktop functionality.

Table of contents

Power BI Desktop Interactive chord diagrams
Power BI Desktop Pulse Charts
How to create a Word Cloud generator in Power BI Desktop
Power BI desktop Mekko Charts
Hexbin Scatterplot in Power BI Desktop
Candlestick chart for stock data analysis in Power BI Desktop
Enlighten World Flag Slicer in Power BI Desktop
Flow Map Chart in Power BI Desktop
Ask a Question feature in Power BI Desktop
Power BI Desktop Q&A data interaction examples
Power BI Desktop and Python; like Peanut Butter and Chocolate
Power BI Desktop Pie Chart Tree
Importing data from a PDF file in Power BI Desktop

Rajendra Gupta

Rajendra Gupta

Rajendra has 8+ years of experience in database administration having a passion for database performance optimization, monitoring, and high availability and disaster recovery technologies, learning new things, new features.

While working as a Senior consultant DBA for big customers and having certified with MCSA SQL 2012, he likes to share knowledge on various blogs.
He can be reached at rajendra.gupta16@gmail.com

View all posts by Rajendra Gupta
Rajendra Gupta
2,055 Views