How to convert Microsoft Word, Excel and PDF files to HTML or text in PHP

Metamorphosis
Photo by Liyu15

I need to analyze and display documents attached to emails, and that means converting from common formats like .doc, .xls and .pdf to either plain text or HTML. Thankfully there are several different command-line tools on Linux that do a pretty good job, and then all you need is a bit of PHP duct tape to build your own online document converter, a poor man’s version of Zamzar. Here’s an example running on my test server, with the source code available here. To use it, select an example Word, Excel or PDF document, choose whether you want pretty HTML or processable text, and click Convert File.

If you want to get it running on your own system, here are the directions for Red Hat Fedora Linux, though with some tweaking of the installation steps it should work on most Unices.

First install the tools by running the following commands:

yum install w3m
This gets you the text-based w3m web browser, useful for converting HTML to text.

yum install wv
The wvWare package that can convert MS Word .doc files.

yum install xlhtml
xlhtml converts Excel files to HTML.

yum install poppler-utils
Handles PDF files.

yum install ghostscript
Needed for high-quality rendering of PDF files.

Once they’re in place, you should just be able to copy over the two PHP files to a folder on your server and get the example running. The rendering isn’t perfect. The PDF handling in particular has been very problematic: I had to disable all image rendering, and it defaults to a horrid grey background. This might be an issue with using poppler rather than xpdf, so if pretty PDFs are important you might want to experiment with that instead. I’ve also seen some glitches with the spreadsheet rendering, but overall I’ve been very impressed with the results from wvWare and xlhtml. I was also hoping to handle PowerPoint .ppt files, but xlhtml fails with a ‘xlhtml: cole – OLE2 object not found’ error which I haven’t had a chance to debug yet.
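To give a feel for the kind of glue involved, here’s a minimal sketch of how the tools can be chained together. It’s written in Python rather than taken from the PHP files mentioned above, and the table, function names and command-line flags are all my assumptions based on how these tools are usually invoked, so check the man pages for the versions you have installed.

```python
import shlex

# Hypothetical mapping from file extension to a command that writes HTML to
# stdout. The flags are assumptions, not taken from the original PHP source:
# "-i" asks pdftohtml to ignore images, "-noframes" asks for a single page.
HTML_COMMANDS = {
    "doc": ["wvWare"],
    "xls": ["xlhtml"],
    "pdf": ["pdftohtml", "-stdout", "-noframes", "-i"],
}

# w3m can read HTML on stdin and dump it as plain text.
TEXT_FILTER = ["w3m", "-dump", "-T", "text/html"]


def build_pipeline(path, want_text=False):
    """Return a shell pipeline string that converts `path` to HTML or text."""
    extension = path.rsplit(".", 1)[-1].lower()
    if extension not in HTML_COMMANDS:
        raise ValueError("unsupported file type: " + extension)
    stages = [HTML_COMMANDS[extension] + [path]]
    if want_text:
        # Plain text is produced by piping the HTML through w3m.
        stages.append(TEXT_FILTER)
    # Quote every argument so uploaded file names with spaces or shell
    # metacharacters cannot break out of the command.
    return " | ".join(
        " ".join(shlex.quote(arg) for arg in stage) for stage in stages
    )


if __name__ == "__main__":
    print(build_pipeline("attachment.xls", want_text=True))
```

The quoting step matters for an online converter, since the file names come from strangers. The resulting string could be handed to whatever shell-execution call your language provides.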

Thanks to Phillip Hollenback for his original article on using some of these tools within a mail program. He had some great tips on how to wrestle them into a pipeline.

Park rangers hate you

Goatwarning
Photo by GordMcKenna

I know a lot of park rangers, and they have a really tough job. The pay’s rotten, they spend most of their time picking trash out of pit toilets, acting as cops when people bring their problems with them to the park (most commonly domestic violence and DUI at the campgrounds), and they are at the mercy of a giant bureaucracy. Things would be a lot easier if there weren’t all these damn park visitors getting in the way.

Anyone who’s dedicated to preserving the outdoors would find it a lot easier if it wasn’t for the pesky general public. They pick flowers, drop litter, make noise, cut trails and generally damage the environment. It would be so much simpler to take care of the wilderness if people would stay out.

This means that increasing the number of park visitors is pushed way down the priority list. In fact, while the idea might be paid lip service, any measures that might help usually conflict with other things considered more important. Charging an entrance fee has to be a big psychological barrier, but it’s pretty popular with rangers because it means that they have an excuse to ask for a receipt from anyone causing trouble, and either search or eject them if they didn’t keep it. Publicizing or improving back-country campgrounds to encourage visitors means a lot more maintenance and enforcement work.

Our parks systems have ended up working like a monopoly, where customers are a hindrance, not a priority. Individual rangers are dedicated to encouraging everyone to share their love of the outdoors, but all of their incentives push them away from acting to pull in more people. Environmental organizations are so focused on preservation, they fight against even low-impact recreation. The 1997 Merced flood halved the number of camping spots in Yosemite, but there’s still a battle to build any replacements at all.

This is all part of a slow crisis, where park attendance across the country is dropping overall, and particularly in California, despite yearly fluctuations. This matters because parks all require government money, which means they need popular support. Why should people pay for something they’re never likely to use? During the California budget crisis, the governor planned to close many state parks like Topanga. That was hardly mentioned on the news, and though it was prevented for now, you can bet the lack of a public outcry will affect politicians’ calculations in the future.

What can we do about all this? I think there has to be a grass-roots effort to let people know what’s available, reignite their interest and boost attendance. I’m trying to do my bit by documenting local camping, most of which is not covered by the agencies’ websites. I’m also trying to be a voice for more low-impact recreation at organizations I’m involved in like the Santa Monica Mountains Trails Council.

What does the cloud mean for email?

Nimbus
Photo by David AG Wilson

There are two big reasons email hasn’t been evolving like the web; the data’s a lot harder to get hold of and it’s really hard to crunch it once you have it. The web relies on the cloud to solve the second part, and I’m convinced that email will need something similar to move forward.

Almost all the exciting tools in the email world are client plugins, because that’s the easiest and most secure place to grab the data. The big drawback is that client CPU cycles and disk space are scarce resources. You can get a 1 terabyte disk for less than $200, but any client application that used more than a few hundred megabytes would be considered ill-behaved, even though the monetary cost of that space is currently just a few cents. This is because you can’t rely on that space being available: it may be an older machine, a laptop, full of other data files, or a million other reasons that make relying on large client disk usage unpopular with users. The same holds true for CPU cycles, since anything that slows down Outlook or increases the risk of a crash will be shunned.

Cloud computing makes it possible to take advantage of cheap storage to improve the user experience. For example, take a heavy email user and assume that an average message contains 10,000 characters, she gets 1,000 a day, and there’s about 2,000 days of email in her account. That’s around 20 GB of storage, or $4.00 worth at $200 per terabyte. Imagine creating a Google-style index of every word in that email, so she can instantly search it all. Even if that quadrupled the storage size to 80 GB, that’s still only $16 of storage, with a massive user benefit.
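To make that back-of-the-envelope sum explicit, here’s a minimal Python sketch. It assumes one byte per character and decimal gigabytes, and takes its price from the $200 terabyte disk mentioned earlier. The constant and function names are just for illustration.

```python
# Assumptions: 1 character = 1 byte, 1 GB = 10**9 bytes, and storage priced
# at $200 per terabyte (the disk price quoted earlier in the post).
CHARS_PER_MESSAGE = 10_000
MESSAGES_PER_DAY = 1_000
DAYS_OF_EMAIL = 2_000
DOLLARS_PER_GB = 200 / 1_000


def storage_cost(index_multiplier=1):
    """Return (gigabytes, dollars) for the mailbox, scaled by any index overhead."""
    total_bytes = (
        CHARS_PER_MESSAGE * MESSAGES_PER_DAY * DAYS_OF_EMAIL * index_multiplier
    )
    gigabytes = total_bytes / 10**9
    return gigabytes, gigabytes * DOLLARS_PER_GB


if __name__ == "__main__":
    for multiplier in (1, 4):
        gigabytes, dollars = storage_cost(multiplier)
        print("x%d: %.0f GB, about $%.2f" % (multiplier, gigabytes, dollars))
```

With these inputs the raw mail comes to 20 GB (about $4), and a full-text index that quadruples the footprint comes to 80 GB (about $16).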

So, if that’s all true, what’s stopping a flood of startups taking advantage of this? In the consumer world Xoopit is doing great work, but they’re having to ask people for their Gmail passwords since there’s no other way of grabbing the data. Without a more official API, it’s a pretty scary proposition to build a business around. On the enterprise side, there’s an almost complete lack of overlap between the people who know how to interface with Exchange, and those who want to do crazy new startups.

What’s the answer then? That’s what I’m working on, so watch this space.

Camping on Anacapa Island

Lizladder

Me, Liz, and two friends spent Saturday night camping on Anacapa Island, just off the LA coast. It’s a small place, and I was pretty nervous we’d be bored out of our minds. When our friends phoned to book their boat trip with Island Packers, the lady taking their order was incredulous that they wanted to camp there; "Have you been before? People don’t go back there twice." There’s less than a mile of trails over the whole island, no trees or shrubs, just thousands of seagulls nesting, with all the noise and mess you’d expect.

Seagulls

The boat ride across took less than an hour, with the local dolphins putting on a great show. As we got closer, we could see the sheer cliffs that completely surround the island, and the famous rock arch on the tip.

Archrock

The trails may be short, but getting onto the island from the boat is a workout, with 150 stairs from the dock straight up the cliff. There’s no water at the campground, so our rucksacks were heavy as we packed all we needed up the iron stairwell.

Stairs

After a quick talk from the ranger, we headed the half-mile to the campground. There were no assigned spots, though you do need to make reservations ahead of time. There’s only 7 spots in the center of the island. Be careful, there’s a couple of places with numbered posts on the way that look like camping spots, but we saw rangers move two different groups on from them while we were there. Here’s the maps and descriptions of the different spots. We took number 7, without realizing it was designed for a single tent. We fit both of our 2-man tents in, but it was very cosy.

Camptext
Campmap

Campground

There isn’t much privacy at the campground, since there’s no shelter, but spots 6 and 7 were a little bit tucked away. They were also close to the cliff-top overlooking a view down the whole island, and into the sea below.

Clifftop

Once we’d got our tents set up and settled in, the moment of truth arrived and we had to find something to do. The loop around the clifftops kept leading to some amazing overlooks, so we spent several hours staring down through the crystal-clear ocean, watching the sea lions doing loops around the scuba divers exploring the reefs and kelp forests. We were all jealous of the divers, and I could see why Jacques Cousteau considered the Channel Islands the best temperate diving in the world. We wanted to take a swim too, but with sheer cliffs all around, it seemed impossible. Luckily Liz came to the rescue, and figured out we could take a dip off the dock. You can see her sliding into the water at the top of the post. You’re allowed to swim here, but you’ll need to be very careful since there’s not much margin for error. The conditions were perfect for the four of us, with 64 degree water, flat as a lake with amazing visibility.

Anacapaswim

After the swim, we were ready for bed. The ranger had warned us that the gulls and foghorn would keep us awake, but we slept well until the dawn chorus kicked in. Despite all my misgivings, it turned out to be a great trip for all of us. The best word for the place is ‘wild’, you know you’re on the edge of the world, but there’s so much life all around you. I wouldn’t want to live on Anacapa, but it’s a great place for an adventure.

Debug mysql hangs

Hanging
Photo by BigGolf

As I’m dealing with larger and larger sets of data, I’m hitting situations where a mysql operation starts to take many seconds, and may lock the table it’s working with for reads and writes. That means not only does the task that’s waiting for the result get stuck, other parts of the system can end up wedged too.

If mysql does stop responding, fire up the command-line client, and then run

SHOW PROCESSLIST;

This will output a table describing every job that the database server is working on, or has in its queue. The Info column gives a summary of what the command is, and the State column describes whether it’s being processed, or if it’s waiting for another job to finish first. What you’ll usually see with these hangs is one row showing a high number of seconds in the Time column, and then a queue of other commands waiting for that one to finish. So what can you do to fix that?

The first thing is to stop the server processing the job if it’s going to take a crazy amount of time to complete. If you look at the process list, note down the Id number in the first column of the one you want to stop, for example 25991, and type

KILL 25991;

There’s no guarantee that bad things won’t happen to your data if you do stop a job halfway, so use this with caution! If you want to understand why a query ended up taking a long time, that deserves a long post to itself, but the best place to start is by running DESCRIBE <your query> (a synonym for EXPLAIN in MySQL), for example

DESCRIBE SELECT * FROM messages;

That will give you a rundown of how many rows it will look at, and whether it will be able to use any indices or keys to speed up the operation.
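If you end up doing the SHOW PROCESSLIST and KILL dance often, it can be scripted. Here’s a minimal Python sketch that assumes you’ve already fetched the process list as a list of dictionaries keyed by the standard column names (Id, Command, Time, State, Info). The function names and the 30-second threshold are my own inventions, and it only builds the KILL statements rather than running them, so you can look before you leap.

```python
def find_long_running(rows, max_seconds=30):
    """Return process list rows that have been busy longer than max_seconds.

    `rows` is assumed to be a list of dicts, one per SHOW PROCESSLIST row.
    Idle connections (Command == "Sleep") are skipped, since their Time
    value only measures how long they have been idle.
    """
    busy = [
        row for row in rows
        if row["Command"] != "Sleep" and row["Time"] > max_seconds
    ]
    # Longest-running first: that is usually the job everything else waits on.
    return sorted(busy, key=lambda row: row["Time"], reverse=True)


def kill_statements(rows, max_seconds=30):
    """Build (but do not execute) a KILL statement for each long-running job."""
    return [
        "KILL %d;" % int(row["Id"])
        for row in find_long_running(rows, max_seconds)
    ]


if __name__ == "__main__":
    # Made-up rows for illustration only.
    sample = [
        {"Id": 25991, "Command": "Query", "Time": 412, "State": "Sending data",
         "Info": "SELECT * FROM messages"},
        {"Id": 25995, "Command": "Query", "Time": 380, "State": "Locked",
         "Info": "UPDATE messages SET seen = 1"},
        {"Id": 26001, "Command": "Sleep", "Time": 900, "State": "", "Info": None},
    ]
    for statement in kill_statements(sample, max_seconds=60):
        print(statement)
```

Killing the oldest job first is usually enough to unblock the queue behind it, so in practice you’d review the list and run only the first statement. The same caution about stopping a job halfway applies.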

Can you label emails with Semantic Hacker?

Warningsticker
Photo by Sirkullay

I’m interested in ways of automatically categorizing emails, so I’ve been experimenting with some of the recently launched semantic analysis services. Earlier I set up an OpenCalais demo, and next I tried out Semantic Hacker. Luckily they already have an online demonstration page, which made it a lot easier. To get a rough idea of how it worked, I tried two different pieces of text. The first was a news story about deflecting asteroids that came as part of the OpenCalais test suite, and for the second I took the text from a recent post I did on Independence Day. I chose these because the news piece covered a lot of concrete names, places and organizations, and my post was about a more abstract topic, and I wanted to understand how similar emails would be handled.

For the news story, Semantic Hacker does a great job of picking out the main topic, with 5 extremely relevant suggested categories.
Semantichackershot2

OpenCalais by contrast picked out a lot of organizations and places, but didn’t really try to summarize the overall meaning of the document.

Both systems did a lousy job with the 4th of July post. Semantic Hacker suggested the Boy Scouts as the top category, followed by the Knights of Columbus, which I can only guess came up because I mention patriotism a lot. The first couple of related Wikipedia articles were reasonably relevant at least.
Semantichackershot1

OpenCalais picked up some of the places I mentioned, like Juneau, though it assumed LA was Louisiana! It didn’t get any of the more abstract concepts, apart from the holiday name itself.

So far, what I’m seeing is confirming my instinct that general semantic analysis and categorization is AI-complete, but that some of these tools might be useful for limited applications, like pulling out locations, organizations and technical terms from emails.  My next experiments are going to be focused on statistical methods of pulling out interesting words and phrases.

You should attend Defrag

Lighttree
Photo by J Philipson

The first Defrag conference was truly illuminating for me. It gave me a vast number of ideas from people I might never have run across otherwise, and some of them are fueling the work I’m doing today. This year’s Defrag is fast approaching, and so’s the deadline for the early-bird discount, (Update- Eric just sent me a code for an extra $100 off: ‘pete1’) so here’s why I think you should go:

Inspiration. Last year I sat with JP Rangaswami at lunch, heard him describe his open inbox policy, which led me to take a fresh look at the privacy challenges of sharing email. Andrew McAfee gave a compelling example of a building firm that saved half a million dollars thanks to one employee sharing knowledge through a blog post, which reawakened my interest in stopping waste by opening up all the silo-ed information in large companies. There were many other moments like this for me. I don’t know what they’ll be this year, but I know when you get this many smart people from different worlds talking, there will be plenty of them.

Learning. Karen Schneider taught me the basics of what librarians have learnt over the last 2000 years about categorizing information in her talk. Matt Hurst showed a formal way of thinking about the process of creating a visualization. Michael Barrett of Paypal warned that if we were dealing with valuable information, we needed to be thinking about security right from the start. I’m applying all of these to my own work, and I wouldn’t have known about them without hearing the talks.

Contacts. The whole world of trying to do something useful with ambient information is incredibly young and fragmented. This is the only meeting of the tribe, and I talked to more people who were looking at the same problems as me in 2 days than I had in the previous year. If you’re trying to do anything innovative with visualizations, data mining or other information tools, you’ll find people interested in helping you make progress.

Vibe. It’s actually fun! Everybody is here because they’ve made a decision to come, this isn’t a convention where everybody in an industry attends by default. That means packed sessions, people who are excited to be there, lots of smiling faces and a welcoming attitude that makes it easy to talk to strangers. It’s also kept pretty small, which means you’ll feel at home very quickly. I came out last year with a real energy boost, thanks to the atmosphere.

Participation. This year I’m jazzed that I’ll be on stage as a panel member, but nobody is just sitting back and passively absorbing information. The open discussions forced everyone to get engaged, and gave me some conversations I never would have had otherwise. I’m looking forward to what Eric has planned for this year, it sounds like he’s been getting inspiration from Seth Godin’s ideas on getting people involved.

How to report bugs

Ladybird
Photo by Nutmeg66

One of Apple’s secret weapons is its fantastic bug-tracking process. There are whole systems and departments devoted to crash reports (yes, we do read all those comments, including the swearing in obscure languages) and to external and internal bug reports. What really made the whole thing work was the quality of the descriptions, thanks to the training we all received. It’s important because a well-described bug will be given the right priority, go to the right engineer, be understood quickly, and can be tested easily to be sure it really is fixed.

If you’re looking for an effective way of improving your own software, it’s hard to beat filing good bugs. Here’s what you need:

Title: Make it short but specific and descriptive. "Crash when closing save dialog" is better than "Save error".

Summary: This should be two or three sentences that cover the information that the person who has to assign the bug needs to know. Usually there’s someone non-technical or semi-technical who works out which engineer should look at it. A good summary will give them the information they need to get it to the person who can fix it first time.

Reproduction Steps: Probably the hardest part to get right is describing what someone has to do to see the problem on their machine. If possible you should try to recreate the problem yourself, noting the steps you take as you do it. If it doesn’t happen again, then that’s important information for the report too, and you should try to describe what you remember doing before the first occurrence.
If you do have luck getting it to happen again, note down in numbered, explicit steps exactly what it takes, eg:

  1. Open up the application
  2. Go to the File main menu, then choose Save
  3. Click on the close icon

It’s tempting to put something like "Try to save, and then close the dialog" for a process that seems as simple as this, but I guarantee that the recipient will use Command+S instead of the menu command, or the keystroke for closing a window, or not use a fresh start of the application, or will have some other variation that happens to avoid the crash.
For really tricky problems, sometimes even doing a screen capture of yourself reproducing the bug can be invaluable, in case there’s something subtle about your actions that triggers the issue. One of the hardest I hit turned out to only occur when the sub-windows of the application were arranged in a certain pattern! Even a saved file that prebakes a lot of the steps can save a lot of time.

Results: In this case it’s pretty obvious, but a lot of bugs may take some domain knowledge to understand what the expected result is, and it’s also helpful to spell out exactly what you’re seeing. Screenshots can be your friend here, it’s often easier to show the bad results than describe them in words.

Regression: If you’ve tried other versions of the application or service, or run it on other operating systems, the results can be an important clue to the engineer about where in the code it’s going wrong.

Notes: Anything else you think is useful should be in here, such as links to similar bugs or your contact information. It’s good to keep this at the end so that the final engineer assigned to the problem can get some in-depth information, but it’s easy for the people routing the bug through the system to get a clear overview from just the first few sections.

As a lazy programmer, I use other people’s code whenever possible. That means I spend a lot of time filing bugs myself, so if you want to see me eating my own dog food, here’s an example of one I filed against OpenCalais:



PHP demo rendering glitch in Firefox, Safari Javascript error

Summary:
Running the PHP demo in Firefox on OS X draws an extra frame over part
of the results. The results page suffers a Javascript error and only
displays an error message in Safari.

Reproduction steps:

  1. Download CalaisPHPDemo_08May29.zip
  2. Unzip onto a folder on your server
  3. Copy JSON.php from src/pear/ to src/public
  4. Navigate to src/public/CalaisPHPDemo.html in Firefox 2.0.15 or Safari Version 3.1.1 on OS X 10.5.3 (I have a version online at http://funhousepicture.com/calaisdemo_original/src/public/CalaisPHPDemo…. )
  5. Copy and paste the text from the example file test/text1.txt (asteroid news story) into the main text box
  6. Leave the format pulldown on Document Viewer Style
  7. Click on the Show Results button

Results:
On Firefox the document text shows up, but there’s a pair of scroll
bars partially obscuring the top portion. I’ve uploaded a screenshot as
http://funhousepicture.com/calais_firefox_result.png
On Safari, the result page is just the logo and a message stating ‘Unsupported Document’. The screenshot is http://funhousepicture.com/calais_safari_result.png
I’d expect to see the results page rendered as it does in Internet Explorer.

Regression:
I was able to run the demo with no problems on Internet Explorer 7 on
Windows Vista. I don’t see the scroll-bar issue on Firefox 2.0.14 on
Vista either.

Notes:
By poking around with Firebug, I determined that the bogus scrollbars
came from the CalaisJSONInfo element with its style set to ‘visibility:
hidden;’, which still affects layout, whereas ‘display:none;’ makes it
truly vanish. This may or may not be the correct fix depending on your
intent, but it does remove the rendering glitch.
The Safari error was a bit more involved. The immediate cause was an
exception in the initHighlight() Javascript, but it was unclear why
that was happening. After some debugging with Drosera I found a couple
of places where the code didn’t sit well with Safari’s JS host that
caused errors, notably a use of insertAdjacentText() which is
unsupported in webkit, and a null check that for some reason caused an
error. After working around those I was able to see the result document
successfully.


How to debug Javascript in Safari

Monkeys
Photo by smthpal

While I was working on my OpenCalais demo, I found that the original code didn’t work in Safari. The project contains a lot of client-side Javascript, so my guess was that it was choking because of differences in the WebKit implementation of JS, since it worked in both Internet Explorer and Firefox, albeit with some rendering glitches in the latter. I was filled with dread, since last time I had to do any major Javascript debugging in Safari, I’d had trouble even displaying the error messages. Thankfully things have got a whole lot better in the last couple of years!

The first thing you’ll need to do is enable the debug menu in Safari. To do this, go to Safari->Preferences in the main menu, and choose the Advanced tab. In there, enable Show Develop menu in menu bar:

Safariscreenshot1

Now you’ll see a new option, Develop, appear on the top menu. Choose Show error console, and you’ll see a window appear that displays any Javascript errors. There are also some other handy tools like the Web Inspector, which gives you a very Firebug-like way of exploring a page’s source dynamically.

With the Console selected, you should see details of any Javascript problems that came up.

Safariscreenshot2

Safariscreenshot3

Click on the arrow icon to the right of the message, and you’ll be taken to the exact source line in the script where the problem occurred. This is a very straightforward interface to track down a lot of common problems, much better integrated than Microsoft’s Script Debugger. In my case it wasn’t enough though: the error happened in the middle of some very complex code, and it seemed like the result of a logic error that happened much earlier. That meant I needed a debugger that I could use to step through the code.

Happily, I discovered Drosera. This is a fully-featured debugger that’s part of the WebKit project. In recent versions of the open source project it’s been integrated into the Web Inspector, but for shipping versions of Safari 3 you can still download it as a separate application.

Once you’ve downloaded it, you need to run the following line on the terminal and restart Safari:

defaults write com.apple.Safari WebKitScriptDebuggerEnabled -bool true

Then, run the Drosera application, and select Safari from the attachment window:

Safariscreenshot4

Now just load the page you want to debug. Whenever there’s an exception or an error the debugger will pause Safari and let you inspect the script and all its variables. To see the value of any variable, open up the Console section of the debugger and type the name into the bottom pane. If you want a breakpoint, just select the script file in the debugger’s side pane and click on the number just to the left of the line you want to stop at.

For the OpenCalais problem, it turned out that an exception thrown and caught early in the script was causing the later problem. Drosera paused on the exception automatically the first time I loaded the page, and after a little bit of inspection I was able to figure out that it was using a function that wasn’t supported in Safari. I’m glad I took the time to download the debugger. I could have spent hours trying to figure that out from inspecting the code otherwise.