Want the average frequencies of 13 million words?

Last year, Google released a list of how frequently single words and multi-word combinations appeared, based on an analysis of over a trillion words on public web pages. It covers over 13 million individual words, plus the frequencies of combinations of up to 5 words. It’s available on 6 DVDs for just $180 from the Linguistic Data Consortium at the University of Pennsylvania.

If, like me, you use statistical analysis to pick out unusual words or phrases from documents, this is a godsend. It should be a great baseline to compare a document’s text against, eliminating the common phrases and leaving just the distinctive parts. I’m hoping to at least use it as an uber-stop-word file. The main downside is the restrictive license, which forbids "commercially exploiting" the data. It shouldn’t be rocket science to reproduce similar data by crawling the web if that becomes an issue, so I’ll work within those limits for now.
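To make the baseline idea concrete, here’s a minimal sketch of the comparison I have in mind: score each word in a document by how much more often it appears there than in a background corpus. The baseline counts below are invented for illustration; in practice they would come from a frequency list like Google’s.

```javascript
// Rank a document's words by how over-represented they are versus a
// background corpus. High ratios mark the document's distinctive terms.
function distinctiveWords(docText, baselineCounts, baselineTotal) {
  const words = docText.toLowerCase().match(/[a-z']+/g) || [];
  const counts = {};
  for (const w of words) counts[w] = (counts[w] || 0) + 1;

  const scored = [];
  for (const w in counts) {
    const docRate = counts[w] / words.length;
    // Words absent from the baseline get a small floor count of 0.5,
    // so unseen words score high rather than dividing by zero.
    const baseRate = (baselineCounts[w] || 0.5) / baselineTotal;
    scored.push({ word: w, score: docRate / baseRate });
  }
  // Highest over-representation first.
  return scored.sort((a, b) => b.score - a.score);
}
```

The same scores could drive an uber-stop-word filter by dropping everything below a threshold instead of ranking.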

The LDC has a great collection of other raw data sets too. It’s worth checking out their English Gigaword archive of millions of news stories if you want some more baseline data. Thanks to Ionut at Google Operating System for leading me to the article in the official Google Research blog covering this release.

Inboxer – An easy way to spy on your employees’ emails?

I first ran across Inboxer through their excellent Enron email exploration site. They offer a server appliance that sits inside a company’s firewall, analyzes all internal email, and provides a GUI for exploring the messages. Sophisticated tools highlight the types of email management tends to be interested in, such as objectionable content, recruitment-related messages, or correspondence with external contacts. You can also set up alerts and triggers when particular conditions are met, such as unauthorized employees emailing messages that appear to contain contracts to external addresses. You can experiment with their UI through the Enron site; it seems to be well laid out, and simple enough for non-technical people to use.


They offer graphs of important statistics over time.


There’s a set of pre-packaged searches for things management are commonly concerned about. You can drag and drop any of them onto the main pane, and you’ll get a view of all the relevant emails.

They’ve done a great job technically with Inboxer; it seems like a well-rounded service. I’m a bit disturbed that this is what the market is demanding, though. Despite it being pretty clear from a legal standpoint that the company has no duty of privacy, most people don’t treat their work emails as public documents. Some of the searches, such as those for recruitment terms, are clearly aimed at catching employees doing something they don’t want management to know about, but that isn’t aimed at harming the company. I worry that it would be incredibly tempting to use this as a technical fix for a management problem. Instead of keeping employees from job-hunting by keeping them happy, just try to punish anyone who makes the mistake of using the company system in their search.

I believe the Inboxer team has done their homework; they’ve clearly tried a lot of different tools, and this is the one that seems most successful. There are a lot of legitimate uses, especially in regulated industries and government organizations, where liability issues require some email controls. I just wish a less command-and-control, top-down approach was more popular. If Inboxer also offered a client-side version, I’d much rather work for a company that used that instead. It could make clear which emails would be flagged and looked at before they were sent, and help employees understand how public their work emails really are.

Roger Matus, the CEO of Inboxer, has collected a lot of useful email and messaging news in his blog, Death by Email. I’d recommend a visit if you’re interested in their work.

How to handle file dragging in a Firefox web app


One of the things I miss most when moving from a desktop app to the web is the ability to drag and drop documents between programs. The default file open dialog within a form is definitely not an adequate substitute. The best you can manage with a plain web app is dragging elements within the same page.

To add the full functionality to a web application, you need to install some client-side code. In Firefox, the easiest way to do this is with an extension, though a signed JAR file containing the script is also a possibility. I haven’t tried to do it in IE yet, so that will have to wait for another post.

Here’s an example extension, with full source code and a test page demonstrating how to use it. To try it out:

  • Install draganddrop.xpi in Firefox
  • Load testpage.html
  • Try dragging some files onto the different text areas on the page

You should see an alert pop up with the file path and the element’s text when you do this. The extension adds a new event type to Firefox: "PeteDragDropEvent". When a file is dragged onto a web page, it sets the ‘dragdropfilepath’ attribute on the element underneath the mouse, and then fires the event on that element. If the element has previously called addEventListener for that event, its handler function will be called, and the script can do what it needs to.
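Here’s a sketch of what the page-side script looks like, using the event and attribute names described above; the element id and the helper function are hypothetical, not part of the extension.

```javascript
// Build a description of a drop, reading the attribute the extension
// sets on the element under the mouse before it fires the event.
function describeDrop(element) {
  var path = element.getAttribute("dragdropfilepath");
  return "Dropped " + path + " on: " + element.textContent;
}

// Only register the listener when running in a page with a DOM.
if (typeof document !== "undefined") {
  var dropTarget = document.getElementById("droptarget"); // hypothetical id
  if (dropTarget) {
    dropTarget.addEventListener("PeteDragDropEvent", function (event) {
      alert(describeDrop(event.target));
    }, false);
  }
}
```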

The main drawback is that you only get access to the local file path for the dragged object, and there’s not much an external web script can do with that. I’ll cover the options you have to do something interesting, like uploading a file to a server, in a future post.

This page was invaluable when I was developing the extension; it has a great discussion, with examples, of the mechanics of Firefox’s drag and drop events. One thing to watch out for if you reuse this extension for your own projects is that you don’t want to enable dragging-and-dropping for all pages. That would be a potential security problem if malicious sites lured users into dragging onto them. Instead you should do some URL white-listing to make sure only trusted locations are allowed, being careful to properly parse the address so that spoofing with @, etc, won’t fool the test.
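A sketch of the kind of white-list check I mean, with a made-up trusted host. Pulling the host out by hand, and taking only the part after any ‘@’, guards against spoofed URLs like "http://trusted.example.com@evil.com/", where everything before the ‘@’ is userinfo rather than the host.

```javascript
// Return true only when the URL's real host is on the trusted list.
function isTrustedPage(url, trustedHosts) {
  var match = url.match(/^https?:\/\/([^\/]+)/i);
  if (!match) return false;
  // Anything before '@' in the authority is userinfo, not the host.
  var host = match[1].split("@").pop().toLowerCase();
  // Strip an explicit port, e.g. "trusted.example.com:8080".
  host = host.split(":")[0];
  return trustedHosts.indexOf(host) !== -1;
}
```

An extension would run this against the page’s location before wiring up the drop event.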

What I learnt from following walls


When I was 16, I got a copy of The New Hacker’s Dictionary, aka The Jargon File. An entry that stuck in my head was for the AI term Wall Follower. Harvey Wallbanger was an entry in an early AI contest where the contestants had to solve a maze. The other robots all had sophisticated algorithms with thousands of lines of code; all Harvey did was keep moving forward and turning so that his finger was always on the left wall. Of course, he beat them all.
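For fun, Harvey’s whole strategy fits in a few lines: at each step, try to turn left relative to your heading, else go straight, else right, else turn around. The grid maze here is a made-up example, not anything from the contest.

```javascript
// Left-hand-rule wall following on a grid maze of strings,
// where '#' is a wall and ' ' is open floor.
function wallFollow(maze, start, goal, maxSteps) {
  // Headings in clockwise order: up, right, down, left.
  var dirs = [[-1, 0], [0, 1], [1, 0], [0, -1]];
  var row = start[0], col = start[1];
  var heading = 1; // start facing right
  for (var step = 0; step < maxSteps; step++) {
    if (row === goal[0] && col === goal[1]) return true;
    // Prefer left of the current heading, then straight, right, back.
    var turns = [3, 0, 1, 2];
    for (var i = 0; i < turns.length; i++) {
      var d = (heading + turns[i]) % 4;
      var r = row + dirs[d][0], c = col + dirs[d][1];
      if (maze[r] && maze[r][c] === " ") {
        heading = d; row = r; col = c;
        break;
      }
    }
  }
  return false;
}
```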

Whenever I fall too deeply in love with the technology I’m building, I try to remember Harvey. Often a little Brute Force and Cunning will produce better results than something more intellectually challenging.

I was thinking of that when I read this paper on email categorization using statistics. The authors are clearly off-the-charts smart, and they present some promising techniques, but it feels like their goal is unrealistic. Nobody will accept their incoming email being unreliably placed into folders, even if it’s right 90% of the time. I think it’s much more interesting to use the same techniques to present information to the user, applying a set of approximate, content-based tags that aid email searching and browsing. They’re trying to build something like Yahoo’s web directory; I’d much rather have an imperfect but useful and scalable service like Google’s web search for email.
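To show the difference in spirit, here’s a minimal sketch of the approximate-tagging idea: rather than filing a message into exactly one folder, attach every tag whose keywords appear in it, with a rough score the UI can sort by. The tag definitions are invented for illustration, and a real system would use the paper’s statistical models rather than keyword lists.

```javascript
// Attach zero or more scored tags to a message, instead of forcing
// a single folder. Score is the fraction of a tag's keywords found.
function approximateTags(text, tagKeywords) {
  var lower = text.toLowerCase();
  var tags = [];
  for (var tag in tagKeywords) {
    var keywords = tagKeywords[tag];
    var hits = 0;
    for (var i = 0; i < keywords.length; i++) {
      if (lower.indexOf(keywords[i]) !== -1) hits++;
    }
    if (hits > 0) tags.push({ tag: tag, score: hits / keywords.length });
  }
  return tags.sort(function (a, b) { return b.score - a.score; });
}
```

Because the tags only rank and filter rather than move messages, a wrong guess costs the user almost nothing.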

How to access the Enron data painlessly


Yesterday I gave an overview of the Enron email corpus, but since then I’ve discovered a lot more resources. A whole academic ecosystem has grown up around it, and it’s led me to some really interesting research projects. Even better, the raw data has been put up online in several easy-to-use formats.

The best place to start is William Cohen’s page, which has a direct download link for the text of the messages in a tar archive, as well as a brief history of the data and links to some of the projects using it. Another great resource is a MySQL database containing a cleaned-up version of the complete set, which could be very handy for a web-based demo.

Berkeley has done a lot of interesting work using the emails. Enronic is an email graph viewer, similar in concept to Outlook Graph but with a lot of interesting search and timeline features. Jeffrey Heer has produced a lot of other visualization work too, including several toolkits and some compelling work on collaborating through visualization, like the sense.us demographic viewer and annotator.

Equally interesting was this paper on automatically categorizing emails based on their content, comparing some of the popular techniques with the categorization reflected in the email folders that the recipients had used to organize them. Ron Bekkerman has some other interesting papers too, like this one on building a social network from a user’s mailbox, and then expanding it by locating the member’s home pages on the web.

Which corporation generously donated all their emails to the public domain?

One of the challenges of trying to build a tool that does something useful with a corporation’s emails is finding a good data set to experiment on. No company is going to give a random developer access to all of their internal emails. That’s where Enron comes to the rescue. The Federal Energy Regulatory Commission released over 16,000 emails to the public as part of its investigation into the 2001 energy crisis.

They’re theoretically available online, but through a database interface that seems designed to make access difficult, and that throws up server errors whenever I try to use it. Luckily, they do promise to send you full copies of their .pst databases through the postal system if you pay a fee. If only there were some kind of global electronic network you could use to transmit files… Once I receive the data, I’ll check the license and try to make it available online myself if I can.

I first became aware of this data through Trampoline Systems’ Enron Explorer, which demonstrates their email analysis using this data set. Since then, I’ve also run across a paper analyzing human response times to emails that builds on this information.

The secret to showing time in tag clouds…


… is animation! I haven’t seen this used on any commercial sites, but Moritz Stefaner has a Flash example of an animated cloud from his thesis. You should check out his other work there too; it includes some really innovative ways of displaying tags over time, like this graph showing tag usage:


His thesis title is "Visual tools for the socio-semantic web", and he really delivers: nine completely new ways of displaying the data, most of them time-based. Even better, he has interactive and animated examples online for almost all of them. Somebody needs to hire him to develop these further.

Moritz has his own discussion of the motivations and problems of animated tag clouds. For my purposes, I want to give people a way to spot changes in the importance of email topics over time. Static tag clouds are great for showing the relative importance of a large number of keywords at a glance, and animation is a way of bringing the rise and decline of topics to life in an easy-to-absorb way. Intuitively, a tag cloud of words from email subject lines would show ‘tax’ suddenly growing larger in the US in April. On a more subtle level, you could track product names in customer support emails, and get an idea of which were taking the most resources over time. Pulling the same information from the data arranged as a line graph is a lot harder.

There are some practical problems with animating tag clouds. Static clouds are traditionally arranged with words abutting each other, which means that when one changes size, it shifts the position of all the words after it, a very distracting effect. One way to avoid this is to accept some overlap between words as they change size, but that makes the result visually more cluttered and harder to read. You can also increase the average separation between terms, which cuts down the overlap but results in a much sparser cloud.

I’m interested in trying out some other arrangement approaches. For example, I’m fond of the OS X dock animation model, where large icons do squeeze out their neighbors, but in a very unobtrusive way. I’m also hopeful there are non-Flash ways to do this with just JavaScript.
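The JavaScript-only version could be as simple as interpolating each tag’s weight between two time snapshots and mapping the weight to a font size each frame. The weights, tag names, and pixel range below are arbitrary example values.

```javascript
// Interpolate tag weights between two snapshots (t runs 0..1) and map
// each weight onto a font-size range in pixels.
function interpolateSizes(fromWeights, toWeights, t, minPx, maxPx) {
  var sizes = {};
  for (var tag in toWeights) {
    var from = fromWeights[tag] || 0; // tags new at this step grow from 0
    var weight = from + (toWeights[tag] - from) * t;
    sizes[tag] = Math.round(minPx + (maxPx - minPx) * weight);
  }
  return sizes;
}

// In a page, a timer would step t each frame and apply the result as
// element.style.fontSize = sizes[tag] + "px" for each tag's span.
```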