Photo by Irish Typepad
Dr John Wang [Update: sorry, wrong John Wang!] has just started a new site called EnronData.org, dedicated to developing and refining the Enron email dataset. It’s off to a cracking start, offering all the Enron emails as 148 PST files, one for each ‘custodian’ (informally, each mail user). I did my own PST conversion, but primarily so that I had a large dataset to load onto an Exchange server and test Mailana against. John’s version is much closer to the original source data, and so will be more of a real-world test for applications.
I’m really pleased John has put this together; it will be a boon to anyone doing heavy-duty email data mining. I can’t wait to see what else the project produces.