A Wisconsin century

We’re visiting Liz’s mom in her hometown of Hayward, WI this week and thought we’d take a day off from work and try to bike over a hundred miles! We’ve both been doing a lot more road-riding this summer in Boulder and with the breathing boost from going to near sea-level we thought this would be a great chance to manage our first ‘century’. It turned out to still be quite a workout even though the terrain looks flat when you’re driving, with the turn-around of the out-and-back almost 1,000 feet lower than the start and lots of hills in between. We made it though, riding 110 miles all the way to Lake Superior and back.

It wasn’t exactly a route custom-built for biking; it ran along the edge of highways the whole way, but they weren’t too busy and in most places there was a wide, clear shoulder. The drivers that passed us were courteous and we felt pretty safe, especially compared to biking in Los Angeles. Highlights included the tiny store advertising “Minnows, Movie Rentals, Tanning and Turkey Registration” in the misleadingly named Grand View and a John Deere dealership. I have to confess I was more excited by the rival Kubota dealership down the road, thanks to fond memories of the Kubotas we drive on Santa Cruz Island when we’re helping out the rangers.

Anyway, I have no idea if anyone else will ever want to ride this route, but the map’s up above and we had a wonderful, exhausting time. Who knows, maybe we can give Hayward a biking rival to the Birkie?

How to fetch URLs in parallel using Python

Photo by FunkyMonkey

Here's my class to run web requests in parallel in Python

One of the most common patterns in my work is fetching large numbers of web resources, whether it's crawling sites directly or through REST APIs. Often the majority of the execution time is spent waiting for the HTTP request to make the round trip across the internet and return with the results. The obvious way to speed things up is to change from a synchronous 'call-wait-process' model where each call has to wait for the previous one to finish, to an asynchronous one where multiple calls can be in flight at once.
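As a rough illustration of that difference, here's a minimal sketch using the asyncio module from newer versions of Python's standard library rather than pycurl. The `fake_fetch` function, its delay and the example.com URLs are all made up for the demonstration; it just sleeps to stand in for a network round trip, so the only thing being measured is the waiting.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    # Stand-in for an HTTP round trip: just waits, then returns a fake body.
    await asyncio.sleep(delay)
    return f"body of {url}"


async def fetch_sequentially(urls):
    # 'call-wait-process': each call waits for the previous one to finish.
    return [await fake_fetch(url) for url in urls]


async def fetch_concurrently(urls):
    # All calls are in flight at once; results come back in input order.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed(coro):
    # Run a coroutine to completion and report how long it took.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


urls = [f"http://example.com/{i}" for i in range(5)]
sequential_results, sequential_time = timed(fetch_sequentially(urls))
concurrent_results, concurrent_time = timed(fetch_concurrently(urls))
print(f"sequential: {sequential_time:.2f}s, concurrent: {concurrent_time:.2f}s")
```

With five simulated fetches, the sequential version should take roughly five times the single-request delay, while the concurrent one takes about as long as one request, because the waits overlap.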

Unfortunately that's hard to do in most scripting languages, despite being a common idiom in JavaScript thanks to Ajax. Threads are too heavy-weight in both resources and programming complexity, since we don't actually need any user-side code to run in parallel, just the wait on the network. In most languages the raw functions you need to build this are available through libcurl, but its curl_multi interface is nightmarishly obscure.

In PHP I ended up writing my own ParallelCurl class that provides a simple interface on top of that curl_multi complexity, letting you specify how many fetches to run in parallel and then just feed it URLs and callback functions. Recently though I've been moving over to Python for longer-lived offline processing jobs, and since I couldn't find an equivalent to ParallelCurl I ported my PHP code over.
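To give a feel for that kind of interface (a cap on parallel fetches, then URLs and callbacks fed in), here's a small sketch. It is not the ParallelCurl code or its Python port: the `ParallelFetcher` class, its method names and the `fake_fetch` stand-in are all hypothetical, and it uses asyncio from the standard library instead of pycurl so it can run without a network.

```python
import asyncio


class ParallelFetcher:
    """Hypothetical sketch, not the ParallelCurl port described in the post."""

    def __init__(self, max_in_flight, fetch):
        self.max_in_flight = max_in_flight  # cap on simultaneous fetches
        self.fetch = fetch                  # async function: url -> content
        self.pending = []                   # (url, callback) pairs queued so far

    def add(self, url, callback):
        # Queue a URL; callback(url, content) runs once its fetch completes.
        self.pending.append((url, callback))

    async def _run_one(self, semaphore, url, callback):
        # The semaphore lets at most max_in_flight fetches proceed at once.
        async with semaphore:
            content = await self.fetch(url)
        callback(url, content)

    async def _run_all(self):
        semaphore = asyncio.Semaphore(self.max_in_flight)
        jobs = [self._run_one(semaphore, url, cb) for url, cb in self.pending]
        self.pending = []
        await asyncio.gather(*jobs)

    def finish_all(self):
        # Block until every queued fetch has completed and called back.
        asyncio.run(self._run_all())


# Usage with a simulated fetch, so the sketch runs without a network.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url):
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)  # stand-in for the network round trip
    stats["in_flight"] -= 1
    return f"body of {url}"


results = {}


def on_done(url, content):
    results[url] = content


fetcher = ParallelFetcher(max_in_flight=3, fetch=fake_fetch)
for i in range(10):
    fetcher.add(f"http://example.com/{i}", on_done)
fetcher.finish_all()
```

The `stats` dictionary is only there to confirm that the cap holds; in real use the `fetch` argument would be whatever function actually performs the HTTP request.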

This is the result. You'll need to easy_install pycurl to use it. I'm a Python newbie, so I'm sure there's ugliness in the code, but I'm really excited that one of the big barriers to migrating more of my code is now gone. Let me know how you get on with it.

Five short links

Photo by Cayusa

Large Scale Social Media Analysis with Hadoop – A great introduction to the power of MapReduce on massive social network data, courtesy of Jake Hofman

Junk Conferences – A guide to spotting bogus conferences and journals, aimed at the scientific world but equally applicable to startup scammers – via Felix Salmon

Honoring Cadavers – Growing up in a family of nurses I heard plenty of tales of doctors' lack of human empathy, so I love the idea of connecting them with the people behind the corpses they start with. I'd like to see every technical profession forced to deal meaningfully with the people at the receiving end of their work. At Apple, most Pro Apps engineers would volunteer to spend a shift demo-ing our software to users at the NAB show every year, and I got an amazing amount of insight and motivation from that experience.

Gary, Indiana's unbroken spirit – It seems like Gary is in even more dire straits than Detroit. I spent five years in Dundee just after the jute mills had closed, and it felt a lot like this. The European approach is to try to retain the residents and bring jobs to them, but during my time in Scotland that seemed to result in a middle class almost entirely employed by the government, and very few private companies.

The world is full of interesting things – A Google Labs roundup of over a hundred cool sites on the internet, with Fan Page Analytics tucked away on page 86.

Why you should blog

Photo by Anne Bowerman

Yesterday I talked about why startups should blog, and was pretty negative. I stand by everything I wrote there – Brad's recent post about intrinsic motivation captures why I don't believe counting up the external benefits will be enough to get you blogging, as does Bukowski's poem So you want to be a writer?

I'm a strong believer in the power of counting my blessings though, so here are some of the things that blogging has given me:

The people

Many work days the only person I'll talk to face-to-face is Liz, which is a big change from my previous career spent sitting at a desk surrounded by my team. Blogging is a way for me to hold the sort of water-cooler conversations I miss, and build relationships with interesting people. I'm excited every time I see a comment, and amazed when I blog about an event or get-together and people actually show up. It's led to many off-line conversations that give me whole new perspectives on the problems I'm dealing with.

The ideas

The best way to understand something completely is to explain it to someone else. Blogging forces me to think through my ideas in a disciplined way and expose them to criticism from some very smart people. Just the practice of spending an hour a day truly thinking hard and in depth about an issue has helped me immensely.

The communication skills

I spent a decade building my engineering chops, but at the end of it I was still constantly frustrated by my inability to explain my ideas and persuade people. Writing hundreds of blog posts has left me with a much stronger ability to get across my thoughts in a pithy and effective way, both in writing and general conversation. It helps that I can often cast my mind back to old posts for arguments and evidence, but the sheer mental workout of writing accurate articles quickly helps me think on my feet.

The publicity

Having a thousand people willing to spend their time checking out what I'm up to is incredibly powerful. That's usually enough to get meaningful feedback about what's working and what's failing, and to start a viral spread if it's really a winner. I'm also hopeful you'll all be a good pool of potential customers once I roll out something that actually generates revenue!

The credibility

It's not who you know, it's who knows you. Having someone in the room who's actually heard of your work changes the whole tenor of a meeting. I don't hear from the folks who read my blog and think I'm a clown of course, but some of my strongest and most rewarding business relationships have come about thanks to this blog.

Why should startups blog?

Photo by Anne Bowerman

A friend just asked me a simple but important question – why should a startup have a blog? What are the practical, concrete benefits?

The short answer is, you probably shouldn't have one! It takes a lot of time and mental energy to keep an active blog going, and if you spent that time on external consulting instead and channeled the revenue into AdWords you'd almost certainly get more traffic for your effort. If you don't have a burning desire to say something, you'll get dispirited by the initial lack of interest and give up.

It's a bit like choosing to do a startup instead of sticking with a salary job – if you stare at the cold probabilities it's an irrational choice. You hear about the successes, but not about the hundreds of other blogs that get ignored despite great content.

So why am I blogging? For the same reason I'm building my startup: it's an itch I have to scratch. It's had some great personal benefits, from improving my ability to communicate complex ideas, to making friends I'd never have discovered, but those came slowly. In the first year I had posts which literally nobody read, and it took several months of daily posts before I got my first RSS subscriber! The only things that kept me going were my pig-headed stubbornness and the pleasure of looking at a completed post.

After over 800 posts and five years of blogging, I now actually have an audience: 1,600 RSS subscribers and 30,000 web visitors a month. This brings all sorts of bonuses. I love people trusting me enough to invest their time checking out my projects, and meeting readers in person and having conversations that flow from my articles. The constant practice has made me a competent enough writer to pop up as a guest author on some of my favorite sites, which is a thrill and also generates some nice publicity for my startup work. I'm incredibly glad I'm in this position, but it took an obsession bordering on the insane to get here.

So don't look for hard-nosed practical reasons to blog. Unless there's something that you just have to shout from the rooftops it won't make sense.

Fighting the drug war with statistics

I only ran across Diego Valle's blog because he sent me a bug report for OpenHeatMap, but reading through his posts I'm amazed he's not famous for his work. He's calmly telling his story through statistics, but the results are powerful, shocking, and surprising. An epic example is his Statistical Analysis and Visualization of the Drug War in Mexico. With the help of just Benford's law and data sets to compare he's able to demonstrate how the police are systematically hiding over a thousand murders a year in a single state, and that's just in one small part of the article.
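For anyone who hasn't met Benford's law, it says that in many naturally occurring sets of numbers the leading digit d shows up with probability log10(1 + 1/d), so about 30% of values start with a 1. Here's a minimal sketch of the basic comparison. It is not Diego's analysis or his data; the function names are my own, and the powers of two are just a convenient stand-in for a data set known to follow the law.

```python
import math
from collections import Counter


def benford_expected(digit):
    # Benford's law: expected share of values whose leading digit is `digit`.
    return math.log10(1 + 1 / digit)


def leading_digit(n):
    # First decimal digit of a non-zero integer.
    return int(str(abs(n))[0])


def leading_digit_shares(values):
    # Observed share of each leading digit 1-9, ignoring zeros.
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


# Illustrative data only: powers of two, not real crime statistics.
values = [2 ** k for k in range(1, 201)]
observed = leading_digit_shares(values)

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}, expected {benford_expected(d):.3f}")

# Largest gap between observed and expected shares across the nine digits.
max_gap = max(abs(observed[d] - benford_expected(d)) for d in range(1, 10))
```

A data set whose observed shares stray a long way from the expected ones is a hint that the numbers may not have arisen naturally, which is the kind of signal that makes the law useful for spotting doctored counts.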

There's been a lot of discussion about data journalism but few examples have been more than diverting eye candy. Diego shows us how it should be done. If you're like me you'll have read other stories about the drug war down south, and all that sticks in your mind is photos of bodies and the gargantuan number of deaths. With these posts he's given me a fresh and appalling perspective on the details of what that actually means, from a criminal justice system that can't even record the murders effectively, through the way military interventions correlate with increased death rates, to the impact on Mexico of the US ending its assault weapon ban. Equally important, he also shows how some of my preconceptions are wrong: Mexico continues to have a far lower homicide rate than Brazil for example, and the violence tends to be highly concentrated in areas the cartels are disputing.

Diego's work is important because he's using hard evidence to tell the truth about a vitally important story, and the data hints at how we can get out of this mess. We need people like him working in the tradition of John Snow, applying their analytical skills to illuminate life-or-death problems instead of just Twitter trends.

Why we lie on stage (and what we can do about it)

Photo by Tim Johnson

Eric Ries has an impassioned plea to entrepreneurs to stop lying on stage and he's spot on: this is a massive problem for first-time founders. Mistaken impressions of how startups worked helped me make terrible decisions and misjudge progress. Almost every startup history I thought I knew turned out to be completely cooked, once I talked privately to people who'd actually been there.

I don't think 'stop lying' is going to be a very effective solution though. If you read diaries by almost anyone living through great historical events, whether they're presidents or foot-soldiers, it's just one damn thing after another. The over-arching narrative gets pulled together from that raw material, first by journalists and later by historians. Most of what actually happened gets left out in the telling, because it doesn't fit into the pattern of cause-and-effect that stories require. Really good historians will rescue some of this material by building new and more nuanced descriptions of events, but there's always so much going on and so little space to tell it that key information inevitably gets lost in the compression process.

We're hard-wired to respond to coherent stories, so this sort of lying by omission is never going to go away. The Lean Startup movement itself is built around the story that any company can reach success by mechanically applying some simple techniques to their product development. If you look at the details of their writings this isn't actually what Eric or Steve say; there's a lot more complexity and space for more traditional approaches, but that basic story is what sticks in people's minds. There are endless debates about how much of a cargo cult the movement is, because that over-simplified version has taken hold of both sides.

The story is so popular because it's an antidote to the traditional 'auteur theory' approach to explaining the success of a startup: pick a charismatic individual and ascribe everything to their brilliant strokes of genius. Customer development's disparagement of visionaries and exaltation of hard, repetitive work is an appealingly puritanical backlash against this classically Romantic picture.

Since we can never truly transfer the totality of our experiences to anyone else, we need to take our role as storytellers seriously. If we're going to reach people, we have to build more truthful stories that win out over the bogus ones. Part of that is encouraging blogging by entrepreneurs like Tim Bull. Seeing a startup's story unfold before your eyes in real time makes it both a lot closer to the truth and more compelling. I'm depressed that I don't know more startup founders with active blogs. I worry we're all too concerned about projecting a confident image and afraid to display how imperfect and accidental the real path of most startups is.

Another step is to seriously think about how to incorporate failures into our 'act'. I've usually managed to get a laugh by titling my recent data-processing talks 'How to get sued by Facebook'. I start out my introductions to new business contacts by talking about the 'fruitful failures' of the past two years, all the thousand ways not to build a light-bulb I've discovered and how much that's taught me.

The fundamental thing to recognise though is that we need that Romantic narrative of startup success, or no one would ever persist in trying to do something as crazy as building a company from scratch. As Tom Evslin says, "nothing great has ever been accomplished without irrational exuberance". There's plenty of great raw material that's both exciting and true in every startup's story, so let's learn to be better storytellers and spin that into a gripping tale.

Information wants to be paid

Photo by Swamibu

I want to pay for API access. That probably sounds nuts coming from a starving entrepreneur, but I don’t want to be treated as a charity case by the services I rely on.

Part of my job at Apple was third-party developer support, and even before the iPhone made it so high-profile, the company was brutally self-interested in its relationship with outside developers. With that in my background I was wary of the siren call of becoming a third-party developer when I entered the web world. In the short term the distribution advantages are hard to resist, but the service provider always has a gun to your head. Look at the arc of both the Facebook and Twitter ecosystems. Is FbFund even still alive after all the restrictions on apps within Facebook? Tweetie’s acquisition was a big win for the team, but also made it clear that Twitter is happy to expand at the expense of other external developers. You can counter with Zynga, but I’d argue that their history shows you need to grow large enough to change the power relationship, and even then you’re at high risk of being cut off.

Fundamentally the problem is that the relationship between developers and API providers is all take and no give. The big guys have no incentive to keep their APIs open and stable. They love the free R&D, but as soon as something looks like it might make money, the temptation to bring it in-house is irresistible.

What I’m looking for in a relationship is reciprocity. The oldest and most successful API on the web is search-engine crawling. This works because providers have a strong incentive to allow Google to index their sites – in return for handing over their content, Google sends them visitors. In the real world, it’s not normal for a business relationship to work through bartering like this. In most cases if company A makes money and depends on company B, some of A’s revenue ends up in B’s pocket.

I want to know where I stand relative to the business model of any company I depend on. If API access and the third-party ecosystem make them money, then I feel a lot more comfortable that I’ll retain access over the long term. If it’s a drain on their resources, then I’ll assume they’re doing it for free R&D and may yank the plug at any point. It doesn’t stop me experimenting, but I’d never build a business that relied on them.

So, I’m basically stuck with Salesforce as my only option, until I can persuade Twitter or Facebook to take my cash!

“On the one hand information wants to be expensive, because it’s so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.”

Thanks to Jud and Rob for contributing a lot to my thoughts around this.

The Boulder/Denver Hadoop meetup is tomorrow!

Photo by Steve Jurvetson

Tomorrow night (Wednesday October 6th) we'll be holding our monthly Hadoop meetup, this time in the Gnip offices in downtown Boulder. There's always a great mix of folks, including Return Path's Jacob at the helm. This month's theme is sorting out each other's HBase issues, but it's always very informal and we're never quite sure where the discussion will end up until the beer is flowing! If you're using or thinking about using Hadoop, or just have an interest in big data processing, come and geek out with us.

Do you want a map of your blog?

http://www.openheatmap.com/view.html?map=HawthornsUncontrollablePilaf

One of my goals with OpenHeatMap is building a better interface to access and explore the firehose of data we’re bombarded with. I’ve always wanted a way to navigate content by location and so I’m experimenting with mapping blogs and other media sites. Above is a visualization I built of a few hundred posts from the Sunsurfer Tumblr blog. You can see at a glance the locations that the site covers, and then by mousing over you’ll see a selection of photos from that place.

I’m looking at turning this into a self-serve interface where anyone can enter a URL and receive a visualization of that whole site, but I need testers. If you have a blog or site that features a lot of locations, whether it’s pictures from around the world, news stories or anything you’re keen to see visualized like this, drop an email to pete@mailana.com and I’ll get on the case!