I need to analyze and display documents attached to emails, which means converting common formats like .doc, .xls and .pdf to either plain text or HTML. Thankfully there are several command-line tools on Linux that do a pretty good job, so all you need is a bit of PHP duct tape to build your own online document converter, a poor man’s version of Zamzar. Here’s an example running on my test server, with the source code available here. To use it, select an example Word, Excel or PDF document, choose whether you want pretty HTML or processable text, and click Convert File.
If you want to get it running on your own system, here are the directions for Red Hat Fedora Linux, though with some tweaking of the installation steps it should work on most Unices.
First install the tools by running the following commands (each is followed by a note on what it provides):
yum install w3m – This gets you the text-based w3m web browser, useful for converting HTML to text
yum install wv – The wvWare package that can convert MS Word .doc files
yum install xlhtml – xlhtml converts Excel files to HTML
yum install poppler-utils – Handles PDF files, via tools such as pdftohtml and pdftotext
yum install ghostscript – needed for high-quality rendering of PDF files
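Before wiring anything up, it can help to confirm the tools above actually landed on your PATH. Here is a minimal sketch in Python (not part of the PHP source) that does that. The binary names are my assumption about what each package installs (wvHtml from wv, pdftohtml from poppler-utils, gs from ghostscript), and missing_tools is a hypothetical helper name.

```python
import shutil

# Assumed mapping of binary name -> yum package that provides it.
REQUIRED_TOOLS = {
    "w3m": "w3m",
    "wvHtml": "wv",
    "xlhtml": "xlhtml",
    "pdftohtml": "poppler-utils",
    "gs": "ghostscript",
}


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return {binary: package} for every binary that is not found on PATH."""
    return {
        binary: package
        for binary, package in tools.items()
        if which(binary) is None
    }


if __name__ == "__main__":
    # Print an install hint for anything that is missing.
    for binary, package in missing_tools().items():
        print(f"{binary} not found; try: yum install {package}")
```

The which parameter is only there so the lookup can be swapped out when testing; in normal use you would call missing_tools() with no arguments.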
Once they’re in place, you should be able to copy the two PHP files to a folder on your server and get the example running. The rendering isn’t perfect. PDF handling in particular has been very problematic: I had to disable all image rendering, and it defaults to a horrid grey background. This might be an issue with using poppler rather than xpdf, so if pretty PDFs are important you might want to experiment with xpdf instead. I’ve also seen some glitches with the spreadsheet rendering, but overall I’ve been very impressed with the results from wvWare and xlhtml. I was also hoping to handle PowerPoint .ppt files, but xlhtml fails with an ‘xlhtml: cole – OLE2 object not found’ error, which I haven’t had a chance to debug yet.
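To give a feel for how little glue is involved, here is a rough sketch of the dispatch logic in Python. This is not the PHP source, just an illustration under my own assumptions: the function names, the out.html and out.txt file names and the working-directory layout are all hypothetical, and the flags are the ones I’d reach for (pdftohtml -noframes -i to get a single page with images ignored, w3m -dump -T text/html to flatten HTML to text), not necessarily what the example uses.

```python
import os
import subprocess

HTML_NAME = "out.html"  # hypothetical intermediate file name
TEXT_NAME = "out.txt"   # hypothetical final text file name


def conversion_steps(src, workdir, fmt="html"):
    """Return a list of (argv, stdout_path) steps to run in order.

    stdout_path is None when the tool writes its own output file,
    otherwise it is the file that should capture the tool's stdout.
    """
    ext = os.path.splitext(src)[1].lower()
    html_path = os.path.join(workdir, HTML_NAME)

    if ext == ".doc":
        # wvHtml writes the named HTML file into the target directory.
        steps = [(["wvHtml", "--targetdir=" + workdir, src, HTML_NAME], None)]
    elif ext == ".xls":
        # xlhtml prints HTML on stdout, so capture it to a file.
        steps = [(["xlhtml", src], html_path)]
    elif ext == ".pdf":
        # -noframes: one HTML file; -i: ignore images.
        steps = [(["pdftohtml", "-noframes", "-i", src, html_path], None)]
    else:
        raise ValueError("unsupported file type: " + ext)

    if fmt == "text":
        # Flatten the intermediate HTML to plain text with w3m.
        text_path = os.path.join(workdir, TEXT_NAME)
        steps.append((["w3m", "-dump", "-T", "text/html", html_path], text_path))
    return steps


def run_steps(steps):
    """Run each step, redirecting stdout to a file where requested."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, check=True, stdout=out)
```

Building the commands as argument lists rather than shell strings sidesteps quoting problems with odd attachment file names. Anything other than .doc, .xls or .pdf, including .ppt, raises a ValueError here.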