Five short links

Photo by Nugun

Brainstorm – Psychedelic raytraced graphics packed onto a display that can only show a tiny set of characters and colors. A beautiful hack.

YASIV – An interactive visualization of the connections between books on Amazon. I never found a good way to expose this sort of force-directed network graph within a usable product, but I remain fascinated by them. They're a powerful way of communicating relationships between large numbers of objects.

Mining of Massive Datasets – Rich, detailed, and practical, this is an invaluable overview of the techniques that you can apply to big collections of unstructured data to produce useful information, and is freely available as a PDF. I'm looking forward to learning a lot from this book; I just wish I could pay them for it without ordering a hardback copy.

LoremPixel – Simple but handy service that auto-generates placeholder images for your design prototypes, with easy control over the size and category.

Map of the Drug War – Chilling and information-rich, this visualization of Mexico's violence shows how bitter the drug war has become.

Big Data keeps getting bigger in Boulder


A couple of years ago I started what became the Big Data meetup for the Boulder/Denver area, together with Jacob Rideout. The first few months were tough. Despite the tight-knit tech community in the area, not many people were using or interested in technologies like Hadoop and NoSQL, so we averaged around eight or nine people. After I left Colorado the event really started to pick up steam, as you can see in the graph above.

I like to think it wasn't my absence that fueled the growth, but the groundswell of interest in everything under the Big Data umbrella. Boulder is an exciting place to be working on technology, and I'm not at all surprised to see so much work being done with emerging data tools. There seem to be a lot of new (and old!) companies following in the footsteps of local pioneers like Gnip, Next Big Sound and Return Path, and they're looking for people to hire. If you're an aspiring data geek who wants to work on interesting projects, I highly recommend popping along to the next one!

What I’ve learned from a thousand blog posts

Photo by En Tsai

This is my thousandth post here, and for me that's an important milestone. Years ago I read a study about abandoned blogs. They mentioned that most died after one or two posts, but one had reached a thousand before it went quiet. In the rough early days, that gave me a goal. If I was going to abandon my blog, I would at least beat that post record goddamnit!

Now that I'm here, what have I learned during the last six years?

Blogging is cathartic

I started blogging out of frustration. I felt trapped in my job, with nobody to talk to about the amazing possibilities I could see in the technology world. I was surprised to find that just typing out my thoughts was a big help, even when nobody responded. The process of organizing my thoughts made my internal struggles make more sense, and left me feeling more at peace because it gave me a clearer view of the problems I was dealing with.

I still turn to my blog when I'm frustrated, and the funny thing is the posts that come out of that are often the most popular! Failure sucks, but instructs.

Having an audience is addictive

When I first started I would sit in Typepad's dashboard refreshing constantly in the hope of seeing a visitor. There were some days where nobody at all came to the blog. Now, on an average day between five hundred and a thousand people make it here. A fair number of those are for old posts (there are evidently still damned souls who have to write BHOs for Internet Explorer) but the long slog to build an audience makes me happy to see every single one of them. Knowing that people are actually paying attention to what I write is a heady drug, even at my modest scale.

I get to indulge my innate urge to pontificate without having to inflict it on my loved ones, and even get validation from comments and responses from people I admire. Behind every writer's cool exterior there's something pathetically vulnerable that craves attention and drives them on. Having experienced the thrill of seeing tens of thousands of people reading and discussing something I wrote, I can't imagine how anyone could avoid getting hooked.

Writing fast is a bloody useful skill

When I first started blogging, I set aside thirty minutes each weekday morning to write a post. I forced myself to publish whatever I had at the end of the half hour. This initially led to some awful blog posts, but luckily at that point I had no readers (see above). Over time I found it became easier to create something worthwhile with a tight deadline, and writing at speed was the most important skill blogging taught me. I could produce coherent writing before I blogged but it would take me three or four times as long.

Being able to organize my thoughts and type out an argument or explanation within a few minutes has allowed me to do things I'd never have time for otherwise. Creating documentation, replying to user emails, convincing colleagues, or pitching investors: it's amazing how much of my day is spent writing, and it all goes a lot more quickly thanks to my blogging practice.

Blogging is irrational

When people tell me they're thinking of writing a blog, I try to discourage them. By any sane measure the hours I've spent on this haven't had a great return on investment. The thing is, I can't help it! I need this outlet, and you should only be writing a blog if you've got the same screws loose as me, if you feel compelled.

I've enjoyed the last six years of blogging more than I can say, and as a final note I'd like to thank all of you for joining me in this long conversation. I've learned so much from everyone, and made some lifelong friends. I'm so grateful I had the chance to make so many wonderful connections, and I hope you'll join me for another thousand blog posts!

Five short links

Photo by Tanaka Juuyoh

strcpycat – How hard can it be to write a function to copy one string to another? This exploration shows how tough it is to create an algorithm that's truly generic. Seemingly-harmless design choices like returning the length of the source string will kill you when you're copying small chunks from a massive string.

Is there life on Venus? – We believe our own eyes, even when we shouldn't. This is a cautionary tale of a respected Russian astronomer who started to see life forms in the image-processing artifacts of old space missions. I used to generate 8×8 sprites for my 80's game programming by cycling through random blocks of pixels until something caught my eye, so I'm aware of how powerful pareidolia can be.

How to digitally sign a PDF – I can't believe I never knew you could do this. I've wasted so much time printing out and rescanning documents over the last few years of startups! In Lion it's so easy: you can just write your signature on a piece of white paper and hold it up to the camera, and after that just position it in any PDF you've loaded in Preview.

IMDB data set – Emphatically not free and open, but at least available. I'm intrigued by the Kevin Bacon possibilities here.

Computer Scientists and Google+: Something Interesting is Happening – As you may have noticed, I'm optimistic about Google+'s prospects. It comes down to my personal experiences: I'm discovering a lot of content that I just don't see on Facebook or Twitter, and it looks like I'm not the only one.

Five short links

Photo by Miuenski

Fundamental Oracle flaw revealed – This is a fascinating piece of detective work on a bug, but also a cautionary tale of how even the most conservative assumptions can be proved wrong as data processing speeds and volumes grow.

Extracting structured data from Common Crawl – Shows exactly why I'm so excited by the potential of Common Crawl. Even just a list of all the hCard records from five billion web pages is going to be an amazing research resource, and I already have plans for doing fun things with the street addresses.

Travel itineraries with long-exposure photos – I love the way the students used analog techniques to produce high-tech-looking visualizations.

Social Graph and Needlebase are dead – Google's API for publishing unified public profile information to developers never really caught on, but it's a shame to see it vanish. Needlebase's shutdown is less surprising, since it always seemed likely to be more useful internally to Google, but I'm still sorry to see it lost to the outside world. It was a great tool.

The Apple logo in Unicode – It's great to see how convoluted and political something as seemingly simple as defining an international character set can be.

Jetpac now supports Google+

Photo by Eva Ho

Google+ has become very popular with photographers and hosts some amazing pictures, so I've been keen to help people discover the awesome travel ones through Jetpac. It took some head-scratching (the API is still in its very early stages) but you can now sign up using your Google account! Log in, and we'll give you inspiring photos from people you follow for wherever you're dreaming of traveling. It's been awesome discovering the wonderful content friends like Eva are putting out there, pictures I'd never have known about otherwise. I bet you'll find some treasures too!

Big Data war stories

Photo by Mark Kelley

If you're in the Bay Area on February 8th, I highly recommend joining me at the Silicon Valley Big Data group's war stories event. It's being put on by some good friends from places like Kosmix (now Walmart Labs) and other folks who've been fighting in the Big Data trenches. The goal is to demystify the field, and show how any engineer can learn the techniques you need to create value from massive data sets. It's not just for Stanford PhDs any more! I hope that after hearing the stories and talking with the panelists, you'll feel confident you can dive in and start hacking.

Five short links

Photo by Daryl Mitchell

DynamoDB – I can't tell you how excited I am to see Amazon's new hosted NoSQL service. I so wanted SimpleDB to be usable, but it was too hard to work with thanks to the way it required clients to deal with implementation details like servers and sharding. I ended up using S3 for some projects, but if I was starting Jetpac today I'd seriously consider going with Dynamo rather than self-hosted Cassandra. The main drawback for our requirements is that we couldn't run Pig scripts across the data, but I'd imagine analytics will come.

Unsupervised Decomposition of a Document into Authorial Components – Uses machine-learning techniques to figure out the authorship of books from the Bible, in a way that matches the results of traditional biblical scholars. I love applications of computer science to the humanities, and I think there's a lot of ground for cross-fertilization.

The state of NoSQL in 2012 – An insightful look at the past and future of the new wave of database systems, from someone who's been in the trenches working with them for years.

Auto scaling in the Amazon cloud – Describes how Netflix keeps up with demand by automatically creating and destroying instances based on usage. The CPU utilization graphs show how effective they are at managing the process, but they also hint at the work that's required to build AMIs for all services that can be spun up automatically.

Google Plus Scraper – There's no API yet for most of the interesting bits of Google+'s content, but a lot of it is available publicly to web crawlers, and the information is delivered as a giant JSON array embedded within the page, so it's surprisingly easy to decode. The biggest pain is the space-saving variant of JSON that Google use, which leaves out explicit null values in arrays, so you get ['a',,'b',,,'c'] instead of ['a',null,'b',null,null,'c'].
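To make the null-skipping quirk concrete, here's a rough sketch of one way to patch the gaps back in so a standard JSON parser will accept the result. This isn't taken from the scraper itself. The function name fill_sparse_arrays is my own invention, and it assumes strings are double-quoted as in standard JSON and that gaps only need filling when they're followed by a comma.

```python
import json


def fill_sparse_arrays(text):
    """Insert explicit nulls wherever an array element has been left out.

    Hypothetical helper: assumes double-quoted strings and only fills a
    gap that is followed by a comma (trailing commas are left alone).
    """
    out = []
    in_string = False  # currently inside a double-quoted string?
    escaped = False    # was the previous string character a backslash?
    last = ""          # last non-whitespace character seen outside a string
    for ch in text:
        if in_string:
            # Copy string contents untouched, so commas inside text survive.
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                last = '"'
            continue
        if ch == '"':
            in_string = True
        elif ch == "," and last in ("[", ","):
            # A comma directly after '[' or another comma marks a missing value.
            out.append("null")
        out.append(ch)
        if not ch.isspace():
            last = ch
    return "".join(out)


if __name__ == "__main__":
    raw = '["a",,"b",,,"c"]'
    print(json.loads(fill_sparse_arrays(raw)))
```

Walking the characters by hand rather than using a regular expression keeps commas inside strings safe, which matters when the payload is full of free-form post text.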

Give me your Tumblr URL and I’ll give you a globe


One of the toughest challenges with Jetpac is persuading people to sign up with Facebook. It has a lot to offer if you make it over that hurdle, but some of the most thoughtful people I know never will, for reasons I respect and understand.

The nice thing about using unstructured text as our location information is that we can go anywhere there are good photo captions. We started on Facebook just because there's so much data available (the average user has had over 200,000 photos shared with them by their friends) but now I'm finally getting a chance to address other sources. First up is Tumblr, so if you have your own or are a fan of one that mentions locations in any of the posts, you should be able to get a quick visualization of it, no login required, by going to:

Here's an example of what you'll get:

We take all your photos, and build an HTML5/WebGL globe to help you explore them by place. Type in the URL, get a globe, it's as simple as that!

Just for fun I pointed the service at The Economist's Tumblr, to see how it coped with posts that definitely weren't travel photos. It didn't turn out half bad!

Five short links

Photo by Phillip Chapman-Bell

Zipscribble map of Italy – Patterns leap out of post code data when you connect adjacent codes with lines, and color them according to the most significant digits. Shows how useful pictures can be when we need to make sense of complex data. 

The Gentleman Hacker's 1903 lulz – Innovators never enjoy pesky outsiders pointing out flaws with their technology. What's interesting is that I couldn't discover any practical exploits against early radio signals, despite how obvious the flaws were in the wake of these demonstrations. I did discover there's a world of Morse code software I never imagined existed though!

ARM instruction guide – Back in 1990, I learnt ARM assembler as my second language, after Basic, and though I haven't used it since then, I enjoyed this companion to a workshop on hacking the processor's security model because it's actually a concise, useful guide to its important features.

A revolution in mathematics? – An approachable look at the underappreciated changes in maths at the end of the 19th century. The short version is that the process of creating proofs became highly formalized, which sounds dry as dust but actually opened up a world of new possibilities. It's worth quoting at length:

"Well-optimized modern definitions have unexpected advantages. They give access to material that is not (as far as we know) reflected in the physical world. A really “good” definition often has logical consequences that are unanticipated or counterintuitive. A great deal of modern mathematics is built on these unexpected bonuses, but they would have been rejected in the old, more scientific approach. Finally, modern definitions are more accessible to new users. Intuitions can be developed by working directly with definitions, and this is faster and more reliable than trying to contrive a link to physical experience…rank and-file mathematicians can use the new methods confidently and effectively, while success with older methods was mostly limited to the elite"

Stepping away from "common sense" and relying on the logical outcomes of an abstract system that doesn't provide intuitive reasons sounds a lot like where we're headed in the data world. The question "Why does that work?" will often come up when you're making choices based on A/B testing, and often the honest answer is "I don't know, but it does!".

Spark – A framework implementing a higher-level approach to writing distributed algorithms, with a more readable statement of the problem than standard MapReduce produces.