How to convert Microsoft Word, Excel and PDF files to HTML or text in PHP

I need to analyze and display documents attached to emails, and that means converting from common formats like .doc, .xls and .pdf to either plain text or HTML. Thankfully there are several command-line tools on Linux that do a pretty good job, and then all you need is a bit of PHP duct tape to build your own online document converter, a poor man’s version of Zamzar. Here’s an example running on my test server, with the source code available here. To use it, select an example Word, Excel or PDF document, choose whether you want pretty HTML or processable text, and click Convert File.

If you want to get it running on your own system, here are the directions for Red Hat Fedora Linux, though with some tweaking of the installation steps it should work on most Unices.

First install the tools by running these commands:

yum install w3m
    This gets you the text-based w3m web browser, useful for converting HTML to text.
yum install wv
    The wvWare package that can convert MS Word .doc files.
yum install xlhtml
    xlhtml converts Excel files to HTML.
yum install poppler-utils
    Handles PDF files.
yum install ghostscript
    Needed for high-quality rendering of PDF files.
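
Before wiring anything into a web page, it can help to confirm that the tools actually ended up on the path. The snippet below is a minimal sketch, not part of the original source code: `missing_tools` is a hypothetical helper name, and the binary names (`w3m`, `wvWare`, `xlhtml`, `pdftohtml`, `gs`) are what I'd expect these packages to provide, so check them against your own system.

```php
<?php
// Hypothetical helper (not from the original code): returns the names of
// any command-line tools that cannot be found on the current PATH.
function missing_tools(array $tools): array
{
    $missing = [];
    foreach ($tools as $tool) {
        // `command -v` is a POSIX shell builtin that prints the path of a
        // command, or prints nothing if the command is not found.
        $found = shell_exec('command -v ' . escapeshellarg($tool) . ' 2>/dev/null');
        if (trim((string) $found) === '') {
            $missing[] = $tool;
        }
    }
    return $missing;
}

// Binary names assumed to come from the packages installed above.
$required = ['w3m', 'wvWare', 'xlhtml', 'pdftohtml', 'gs'];
```

Calling `missing_tools($required)` should give back an empty array once everything is installed, and otherwise the list of names that still need attention.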

Once they’re in place, you should just be able to copy over the two PHP files to a folder on your server and get the example running. The rendering isn’t perfect. In particular, the PDF handling has been very problematic: I had to disable all image rendering, and it defaults to a horrid grey background. This might be an issue with using poppler rather than xpdf, so if pretty PDFs are important you might want to experiment with that instead. I’ve also seen some glitches with the spreadsheet rendering, but overall I’ve been very impressed with the results from wvWare and xlhtml. I was also hoping to handle PowerPoint .ppt files, but xlhtml fails with an ‘xlhtml: cole – OLE2 object not found’ error which I haven’t had a chance to debug yet.
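
To give a feel for how little PHP duct tape is involved, here is a rough sketch of the core idea: pick a tool based on the file extension, have it write HTML to standard output, and pipe that through w3m when plain text is wanted. This is an illustration rather than the actual source code linked above. The function name `build_convert_command` is made up, and the exact command lines (calling `wvWare` directly, and the `-stdout -noframes -i` flags for `pdftohtml`) are my assumptions about a reasonable setup, so check them against the man pages on your system.

```php
<?php
// Hypothetical helper (not the original source): builds the shell command
// that converts a document to HTML or plain text on standard output.
function build_convert_command(string $inputPath, string $format = 'html'): string
{
    // One command template per supported extension. Each is assumed to
    // write HTML to stdout; -i asks pdftohtml to skip images.
    $toHtml = [
        'doc' => 'wvWare %s',
        'xls' => 'xlhtml %s',
        'pdf' => 'pdftohtml -stdout -noframes -i %s',
    ];

    $extension = strtolower(pathinfo($inputPath, PATHINFO_EXTENSION));
    if (!isset($toHtml[$extension])) {
        throw new InvalidArgumentException("Unsupported file type: $extension");
    }
    if ($format !== 'html' && $format !== 'text') {
        throw new InvalidArgumentException("Unknown output format: $format");
    }

    // escapeshellarg() keeps an odd or hostile file name from being
    // interpreted as shell syntax.
    $command = sprintf($toHtml[$extension], escapeshellarg($inputPath));

    // For text output, let w3m render the HTML from stdin and dump it.
    if ($format === 'text') {
        $command .= ' | w3m -dump -T text/html';
    }

    return $command;
}
```

Something like `echo shell_exec(build_convert_command('/tmp/upload.xls', 'text'));` would then run the pipeline and print the result. Keeping the command construction in one function makes it easy to swap tools, for example to try xpdf in place of poppler for PDFs.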

Thanks to Phillip Hollenback for his original article on using some of these tools within a mail program. He had some great tips on how to wrestle them into a pipeline.
