I'm frantically coding for an upcoming launch, so apologies if you're waiting for an email reply. I'm looking forward to showing the world what we're working on though, and to posting the lessons I've learned about using Cassandra and Hadoop in production.
TinkerPop – The equivalent of the LAMP stack for graph data processing, pulling together the best open-source tools to help build a turnkey pipeline. The developers' avatars on the right of the page scare me though.
Gridded Population of the World – I've been looking for something like this for a long time. It's a breakdown of the surface of the earth as a grid, with an estimate of how many people live in each square. Why is this useful? Almost every geographic density map you produce will be dominated by the places people actually live, with the signal you really cared about drowned out by the fact that most people live packed into a few urban areas. This data set could be used to correct for that, at least as a first approximation. For example, you could tell whether a particular area is sending more visitors to your site than its raw population would lead you to expect.
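Here's a rough sketch of the kind of correction I have in mind. It assumes you've already binned both your visits and the population estimates into the same grid cells; the function names, the cell keys, and the numbers are all made up for illustration, and none of this reflects how the real data set is packaged.

```python
def visits_per_capita(visits, population, min_population=1.0):
    """Divide visits by population for each grid cell.

    visits and population are dicts keyed by a grid cell id (here a
    hypothetical (lat, lon) tuple). Cells with fewer than min_population
    people are skipped, since dividing by a near-zero count is meaningless.
    """
    rates = {}
    for cell, count in visits.items():
        people = population.get(cell, 0.0)
        if people < min_population:
            continue
        rates[cell] = count / people
    return rates


def over_represented(visits, population, factor=2.0):
    """Return cells whose per-capita rate is at least factor times the overall rate."""
    rates = visits_per_capita(visits, population)
    total_people = sum(population[cell] for cell in rates)
    if total_people == 0:
        return {}
    total_visits = sum(visits[cell] for cell in rates)
    baseline = total_visits / total_people
    if baseline == 0:
        return {}
    ratios = {cell: rate / baseline for cell, rate in rates.items()}
    return {cell: ratio for cell, ratio in ratios.items() if ratio >= factor}


# Invented example numbers: a big city, a small town, and an empty ocean cell.
example_visits = {(37, -122): 900, (40, -105): 50, (0, 0): 3}
example_population = {(37, -122): 900000.0, (40, -105): 10000.0, (0, 0): 0.0}

if __name__ == "__main__":
    print(over_represented(example_visits, example_population))
```

On a raw map the big city would swamp everything, but once you divide by population it's the small town that stands out. The cutoff of twice the overall rate is arbitrary, and a real analysis would want some handling of noise in sparsely populated cells.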
Data Alchemists – I like Ben's idea, resigned as I am to the dominance of the term data scientist.
ISPs are hijacking search queries – This is a hard-to-explain but important story. ISPs are using a third-party service to capture and redirect their users' searches. The redirection is painful, but handing over everything their users are searching for to a random bunch of marketing companies with no disclosure is a really bad idea.