Word2Vec – Given a large amount of training text, this project figures out words show up together in sentences most often, and then constructs a small vector representation that captures those relationships for each word. It turns out that simple arithmetic works on these vectors in an intuitive way, so that vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’). It’s not just elegantly neat, this looks like it will be very useful for clustering, and any other application where you need a sensible representation of a word as a number.
Charles Bukowski, William Burroughs, and the Computer – How two different writers handled onrushing technology, including Bukowski’s poem on the “16 bit Intel 8088 Chip”. This led me down a Wikipedia rabbit hole, since I’d always assumed the 8088 was fully 8-bit, but the truth proved a lot more interesting.
Sane data updates are harder than you think – Tales from the trenches in data management. Adrian Holovaty’s series on crawling and updating data is the first time I’ve seen a lot of techniques that are common amongst practitioners actually laid out clearly.
Randomness != Uniqueness – Creating good identifiers for data is hard, especially once you’re in a distributed environment. There are also tradeoffs between creating IDs that are decomposable, since that makes internal debugging and management much easier, but also reveals information to external folks, often more than you might expect.
Lessons from a year’s worth of hiring data – A useful antidote to the folklore prevalent in recruiting, this small study throws up a lot of intriguing possibilities. The correlation between spelling and grammar mistakes and candidates who didn’t make it through the interview process was especially interesting, considering how retailers like Zappo’s use Mechanical Turk to fix stylistic errors in user reviews to boost sales.