One of the most common issues we have is getting data from a PDF into Excel. When you look at the PDF it looks like it already was in Excel or should be Excel friendly. So note in the picture a PDF document that looks exactly like a table that would be useful in Excel.
But if you try use the PDF tools and copy the data, you get a mess, which you have to spend hours cleaning up. Note below that all the nicely column based data of the PDF, becomes a heap of data in a single column in Excel.
Of course you can build formula to convert this into something useful (see our Data Cleanup Course for ways to do this). But there is a much easier way to import data from a PDF into Excel.
Below we show you the benefits of using a special product to get all information from PDF to Excel. Although we only focus on the Excel portion, the tool can extract PDF information into any one of the common Microsoft Office packages.
So if you spend more than 5 minutes getting data out of a PDF and into Excel, have a look at all the features below. And if your PDF conversation have any of the following issues, you definitely must keep reading:
- Merged Cells (columns get confused)
- Wrapped Text (rows need to be grouped)
- Scanned in documents with numbers (you just want something in Excel so you don’t have to type from scratch)
- Multiple PDF files that need to be batch processed.
The standard conversion is great, but we think it is the custom conversion options that will make all the difference.
Standard Conversion- exporting PDF tables to Excel
Let’s say that you have a PDF that looks like this. Note that it looks like a table in Excel with some complexities (some cells seem to have been merged).
In order to use Able2Extract, you simply follow these steps:
You open the PDF in Able2Extract
Click the All button to select the entire PDF
Choose the Excel option (you can see that there are many other options but we are slightly biased towards Excel) and click the Automatic Convert button.
Once you tell it where to save it, it will generate an Excel file with the PDF data imported into Excel looking something like this. As you can see, it has put the PDF columns into columns in Excel so that you can actually work with it. It even brings the colours through!
So if you have a nice simple PDF then you can do this. But what we are really excited about is all the custom options.
The main reason for this is that, when Excel files are saved to PDF, it often contains things like merged cells and wrapped text. These can cause problems for other PDF to Excel converters.
The Custom Conversion from PDF to Excel
Even in the standard conversion shown above, there is still some work to do. As a result of left alignment for words and right alignment for numbers, the headings are sometimes split into their own columns.
This is where you can control it in the custom setting. So if you follow all the same procedures as above, when you get to the conversion type, you can choose the Custom option.
Notice below you can specify/ control how the PDF converts to Excel based on the tables, columns, rows and even control the headers and footers. Also very useful is the ability to preview what it will look like in Excel before the conversion process.
Fixing Columns Issues in PDF’s (merged cells, character alignment)
So in the PDF example we are using, we can foresee the following issues. Note below that the green lines are showing where Able2Extract wants to split the columns.
The first red arrow shows the result of merged cells. This heading is going over multiple columns, but as a result of the information below being in their own columns, the heading is going to be cut, and you will have part of the heading in one cell and another part of the heading in another cell (which is not fun to correct).
Another issue is the alignment. The second arrow shows what looks like a single column, but the text headings are left aligned and the numbers are right aligned, almost, but not quiet, creating separate columns. As can be seen with the green lines, this is going to split it.
No problem for Able2Extract. We can erase the column lines selected by choosing the ‘Erase Column Line’ option.
By clicking on the relevant lines we can remove them from the conversion process. So as shown below, by looking at the document and adjusting the green lines for issues with merged cells and character alignment, the preview shows all the information in the correct columns.
So less clean up for you!
Fixing Issues with rows on PDF conversion (wrap text type issues)
What about PDF’s where it looks like wrap text has been used in cells. Below, this questionnaire has blocks of data that we want to include in a single cell.
However, as noted below, a conversion by default will treat each new line as a new row.
There are a number of options to get around this.
Firstly, you can note that if you click on row settings (1 in the image below), you get some extra options. A very useful option is (2) which allows you to use any horizontal lines in the document to specify the rows that should be together in a cell. This is very useful if the PDF was created in word tables where you can see the horizontal line. It also may guide you in how to create PDF’s from Excel. Perhaps you can ask the person creating the pdf to include the grid lines so that it can help the conversion process.
In this case we don’t have lines, so we are going to use (3) manual row editing.
Much like when we changed the columns, we have chosen to Erase Row Lines. As shown below, if we group the data, Excel will have the imported data from the pdf in a single cell which is easier to work with.
What about scanned PDF’s
Thankfully, most PDF’s are generated straight from a computer system, so it is perfectly straight and easier to get into Excel. But what about scanned documents. Using the inbuilt OCR (Optical Character Recognition), Able2Extract can convert even these pdf’s.
Below a scan with lots of colours and other markings. Also note that the page has been scanned slightly skew.
This is where OCR comes into play. In this case we need to help the software out a bit. By clicking the Area button, I have specified exactly where the table is.
When you run the Custom convert, we can specify which columns we want and in this case I have asked it to identify the rows by using the horizontal rows in the document. As shown below it has done a good job of using the document’s horizontal rows.
The end result is something like this. Given all the extra colours and lines drawn onto the document we think this is very impressive.
One way we found to make the conversion even better is to use some imaging software and to ‘straighten’ the scan so that it is straighter. This way it is easier to line up the columns and rows and the conversion is better.
Can you process multiple pdf’s e.g. monthly pdf statements?
Have a look at one of the options in the toolbar. You can batch a number of files to run at the same time and specify where it must export to. Very useful if you want to build some consolidation templates or use something like PowerPivot to import the data into a database.
Excel is a magnificent tool (we are biased, but remember that Excel can be used to replace all the other Office products. You can’t say that about Word!).
But for some reason, even though it imports data from all sorts of other files, it still struggles with PDF’s.
If you regularly need to get PDF’s into Excel you should rather invest in Able2Extract. The time savings and improved accuracy will pay for itself on the first conversion!