Convert PDF to Text

Posted: February 15th, 2009 | Author: Matt | Filed under: Web Programming

While PDF files can be hard to manipulate, there are plenty of PDF tools out there that can be used to perform the functions you need.

If you have ever needed an easy way to convert numerous pages of a PDF to text, then you know it’s not something the average person knows how to do. The average person knows that selecting and copying the text in Adobe Reader or Adobe Acrobat will work. But what about preserving formatting, or processing thousands of PDF pages or documents? People often ask me if they can convert a PDF to a Microsoft Word document, and my response is always “There is no easy way to do that”. By that I mean that I don’t have the tools and, frankly, don’t want to spend the time figuring out which tools out there are up-to-date and accurate.

However, on multiple occasions I have had to extract data from PDF files that are 1000+ pages long. These PDF files have a standard layout, so I just have to get the text into a readable format for my PHP scripts (because I like PHP). If you have never heard of it, pdftotext is a free, open source command-line tool that comes with Ubuntu and various other Linux distributions. In my opinion it does a good job of getting all the text. Sometimes, however, the text on one page does not come out in the same order as the text on other pages, even when the pages share the same layout.
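To make the pdftotext step concrete, here is a minimal sketch of calling it from a script and splitting the result into pages. It is written in Python rather than PHP purely for illustration, and it assumes pdftotext is installed and on your PATH. The function names (build_pdftotext_command, split_pages, pdf_to_pages) are my own, not part of any library. The -layout, -f and -l options are standard pdftotext flags, and by default pdftotext puts a form feed character between pages.

```python
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None, keep_layout=True):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [pdf_path, "-"]  # "-" sends the text to stdout instead of a file
    return cmd


def split_pages(text):
    """Split pdftotext output into pages on the form feed separator."""
    pages = text.split("\f")
    # The output normally ends with a form feed, which leaves an empty last item.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def pdf_to_pages(pdf_path, **options):
    """Run pdftotext (assumed to be installed) and return a list of page strings."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, **options),
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return split_pages(result.stdout)
```

Calling pdf_to_pages("report.pdf", first_page=1, last_page=10) on a hypothetical report.pdf would return one string per page, which makes it easier to spot the pages where the text ordering differs from the rest.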

There are other tools out there, some of which must be licensed, that may do a more accurate job than pdftotext, but the best tool I’ve found for the job is Gmail. That’s right, Gmail! If you ever receive a PDF as an email attachment, Gmail will give you the option to view it as HTML. Yes, I know HTML is not plain text. But if you copy the rendered text straight out of your browser, then it might as well be. And if you copy the HTML source instead, you’ve got something even better to work with: each text item in the document is wrapped in markup, which you can match on when filtering out the content you need with regular expressions.
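As a rough illustration of filtering that kind of markup with a regular expression, here is a small Python sketch. The sample HTML is made up: I am assuming each text item sits alone inside a div, span or p element with some attributes, which may not match what Gmail actually produces, so the pattern would need adjusting to the real markup. The names TEXT_ITEM and extract_text_items are hypothetical.

```python
import html
import re

# Matches one element that contains only text, e.g. <div ...>text</div>.
# The tag names are an assumption about the markup, not Gmail's documented output.
TEXT_ITEM = re.compile(r"<(div|span|p)\b([^>]*)>([^<]+)</\1>", re.IGNORECASE)


def extract_text_items(markup):
    """Return (attributes, text) pairs for every text-only element found."""
    items = []
    for match in TEXT_ITEM.finditer(markup):
        attrs = match.group(2).strip()
        text = html.unescape(match.group(3)).strip()  # decode &amp; and friends
        if text:
            items.append((attrs, text))
    return items


# Made-up sample standing in for the HTML view of a PDF page.
sample = (
    '<div style="left:40px;top:100px">Invoice #1001</div>'
    '<div style="left:40px;top:120px">Total: $250.00</div>'
)

for attrs, text in extract_text_items(sample):
    print(attrs, "|", text)
```

Keeping the attributes next to each piece of text is the point: if the markup carries position or styling information, you can filter on it to pick out just the fields you need.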



