What rules should govern robots?


Image by Andre D'Macedo

Shion Deysarker of 80legs recently laid out his thoughts on the rules that should govern web crawling. The status quo is a free-for-all, with robots.txt providing only bare-bones guidelines for crawlers to follow. Traditionally this hasn't mattered, because only a few corporations could afford the infrastructure required to crawl on a large scale. These big players have reputations and business relationships to lose, so any conflicts not covered by the minimalist system of rules can be amicably resolved through gentlemen's agreements.

These days any punk with a thousand bucks can build a crawler capable of scanning hundreds of millions of pages. Startups like mine have no cosy business relationships to restrain them, so when we're doing something entirely new we're left scratching our heads about how it fits into the old rules. There are several popular approaches:


None of the major sites I've looked at and talked to have any defense or even monitoring of crawlers, so as long as you stay below denial-of-service levels they'll probably never even notice your crawling. They rely on the legal force of robots.txt to squash you like a bug if you publicize your work, but there's clearly a black market developing where shady marketers will happily buy data, no questions asked, much like the trade in email lists for spammers.

An extension of this approach is crawling while logged in to a site, getting access to non-public information. This WSJ article is a great illustration of how damaging that can be, and Shion is right to single it out as unacceptable. I'd actually go further and say that any new rules should build on and emphasize the authority of robots.txt. It has accumulated a strong set of legal precedents to give it force, and it's an interface webmasters understand.
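As a concrete starting point, Python's standard library already knows how to answer the question robots.txt poses. The rules and crawler name below are made up for illustration; against a live site you'd fetch the real file instead:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; for a real site you'd call
# rp.set_url("http://example.com/robots.txt") followed by rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

rp.can_fetch("MyCrawler", "http://example.com/public/page")   # True
rp.can_fetch("MyCrawler", "http://example.com/private/data")  # False
```

Checking every URL against the parsed rules costs almost nothing, which makes ignoring robots.txt a choice rather than a technical limitation.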

Everything not forbidden is permitted

If your gathering obeys robots.txt, then the resulting data is yours to do with as you see fit. You can analyze it to reveal information that the sources thought they'd concealed, publish derivative works, or even publish the underlying data itself if it isn't copyrightable. This was my naive understanding of the landscape when I first began crawling, since it makes perfect logical sense. What's missing is the fact that all of those actions I list above, while morally defensible, really piss website owners off. That matters because the guys with the interesting data also have lots of money and lawyers, and whatever the legal merits of the situation they can tie you up in knots longer than you can keep paying your lawyer.

Hands off my data!

To the folks running large sites, robots.txt is there to control what shows up in Google. The idea that it's opening up their data to all comers would strike them as bizarre. They let Google crawl them so they'll get search traffic; why would they want random companies copying the information they've worked so hard to accumulate?

It's this sense of ownership that's the biggest obstacle to the growth of independent crawler startups. Shion mentions the server and bandwidth costs, but since most crawlers only pull the HTML without any images or other large files, these are negligible. What really freaks site owners out is the loss of control.

Over the next few years, 'wildcatter' crawlers like mine will become far more common. As site owners become more aware of us, they'll be looking for ways to control how their data is used. Unless we think of a better alternative, they'll do what Facebook did and switch to a whitelist containing a handful of the big search engines, since they're the only significant drivers of traffic. This would be a tragedy for innovation, since it would block startups off from massive areas of the Internet and give the existing players in search a huge structural advantage.

To prevent this, we need to figure out a simple way of giving site owners more control without blocking innovative startups. Robots.txt needs to communicate the owner's intent more clearly, with new directives similar to 'no-archive' that lay out acceptable usage in much more detail. I'm not the only one to realize this, and I'm hopeful we'll have a more detailed proposal ironed out soon.

At the same time, sites need to take stock of what information they are exposing to the outside world, since the 'scofflaw' crawlers will continue happily ignoring robots.txt. Any security audit should include a breakdown of exactly what they're handing over to scofflaw crawlers – I bet they'd be unpleasantly surprised!

A Wisconsin century

We’re visiting Liz’s mom in her hometown of Hayward, WI this week and thought we’d take a day off from work and try to bike over a hundred miles! We’ve both been doing a lot more road-riding this summer in Boulder and with the breathing boost from going to near sea-level we thought this would be a great chance to manage our first ‘century’. It turned out to still be quite a workout even though the terrain looks flat when you’re driving, with the turn-around of the out-and-back almost 1,000 feet lower than the start and lots of hills in between. We made it though, riding 110 miles all the way to Lake Superior and back.

It wasn’t exactly a route custom-built for biking; it was all along the edge of highways, but they weren’t too busy and in most places there was a wide, clear shoulder. The drivers that passed us were courteous and we felt pretty safe, especially compared to biking in Los Angeles. Highlights included the tiny store advertising “Minnows, Movie Rentals, Tanning and Turkey Registration” in the misleadingly named Grand View and a John Deere dealership. I have to confess I was more excited by the rival Kubota dealership down the road, thanks to fond memories of the Kubotas we drive on Santa Cruz Island when we’re helping out the rangers.

Anyway, I have no idea if anyone else will ever want to ride this route, but the map’s up above and we had a wonderful, exhausting time. Who knows, maybe we can give Hayward a biking rival to the Birkie?

How to fetch URLs in parallel using Python

Photo by FunkyMonkey

Here's my class to run web requests in parallel in Python

One of the most common patterns in my work is fetching large numbers of web resources, whether it's crawling sites directly or through REST APIs. Often the majority of the execution time is spent waiting for the HTTP request to make the round trip across the internet and return with the results. The obvious way to speed things up is to change from a synchronous 'call-wait-process' model where each call has to wait for the previous one to finish, to an asynchronous one where multiple calls can be in flight at once.

Unfortunately that's hard to do in most scripting languages, despite being a common idiom in Javascript thanks to Ajax. Threads are too heavy-weight in both resources and in programming complexity since we don't actually need any user-side code to run in parallel, just the wait on the network. In most languages the raw functions you need to build this are available through libcurl, but its multi_curl interface is nightmarishly obscure.

In PHP I ended up writing my own ParallelCurl class that provides a simple interface on top of that multi_curl complexity, letting you specify how many fetches to run in parallel and then just feed it URLs and callback functions. Recently though I've been moving to Python for longer-lived offline processing jobs, and since I couldn't find an equivalent to ParallelCurl I ported my PHP code over.

This is the result. You'll need to easy_install pycurl to use it and I'm a Python newbie so I'm sure there's ugliness in the code, but I'm really excited that one of the big barriers to migrating more of my code is now gone. Let me know how you get on with it.
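Since the class itself lives in a separate download, here's a rough standard-library-only sketch of the same interface. The class and method names are hypothetical, not the real ParallelCurl API, and it uses a thread pool purely to overlap the network waits, which is simpler than pycurl's multi interface though heavier at the scales the post above worries about:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

class ParallelFetch:
    """Sketch of a ParallelCurl-style interface: cap the number of
    simultaneous fetches, then feed in URLs and callback functions."""

    def __init__(self, max_parallel=10):
        self._pool = ThreadPoolExecutor(max_workers=max_parallel)
        self._pending = []

    def start_request(self, url, on_result):
        # Kick off the fetch immediately; at most max_parallel run at once.
        self._pending.append((url, on_result, self._pool.submit(self._fetch, url)))

    @staticmethod
    def _fetch(url):
        return urlopen(url).read()

    def finish_all(self):
        # Block until every in-flight request completes, firing each callback.
        for url, on_result, future in self._pending:
            on_result(url, future.result())
        self._pending = []
        self._pool.shutdown()
```

Usage mirrors the PHP version: create the fetcher, call start_request for each URL with a callback that processes the returned page, then finish_all to drain the queue.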

Five short links

Photo by Cayusa

Large Scale Social Media Analysis with Hadoop – A great introduction to the power of MapReduce on massive social network data, courtesy of Jake Hofman

Junk Conferences – A guide to spotting bogus conferences and journals, aimed at the scientific world but equally applicable to startup scammers – via Felix Salmon

Honoring Cadavers – Growing up in a family of nurses I heard plenty of tales of doctors' lack of human empathy, so I love the idea of connecting them with the people behind the corpses they start with. I'd like to see every technical profession forced to deal meaningfully with the people at the receiving end of their work. At Apple, most Pro Apps engineers would volunteer to spend a shift demo-ing our software to users at the NAB show every year, and I got an amazing amount of insight and motivation from that experience.

Gary, Indiana's unbroken spirit – It seems like Gary is in even more dire straits than Detroit. I spent five years in Dundee just after the jute mills had closed, and it felt a lot like this. The European approach is to try to retain the residents and bring jobs to them, but during my time in Scotland that seemed to result in a middle class almost entirely employed by the government, and very few private companies.

The world is full of interesting things – A Google labs roundup of over a hundred cool sites on the internet, with Fan Page Analytics tucked away on page 86.

Why you should blog

Photo by Anne Bowerman

Yesterday I talked about why startups should blog, and was pretty negative. I stand by everything I wrote there – Brad's recent post about intrinsic motivation captures why I don't believe counting up the external benefits will be enough to get you blogging, as does Bukowski's poem So you want to be a writer?

I'm a strong believer in the power of counting my blessings though, so here's some of the things that blogging has given me:

The people

Many work days the only person I'll talk to face-to-face is Liz, which is a big change from my previous career spent sitting at a desk surrounded by my team. Blogging is a way for me to hold the sort of water-cooler conversations I miss, and build relationships with interesting people. I'm excited every time I see a comment, and amazed when I blog about an event or get-together and people actually show up. It's led to many off-line conversations that give me whole new perspectives on the problems I'm dealing with.

The ideas

The best way to understand something completely is to explain it to someone else. Blogging forces me to think through my ideas in a disciplined way and expose them to criticism from some very smart people. Just the practice of spending an hour a day truly thinking hard and in depth about an issue has helped me immensely.

The communication skills

I spent a decade building my engineering chops, but at the end of it I was still constantly frustrated by my inability to explain my ideas and persuade people. Writing hundreds of blog posts has left me with a much stronger ability to get across my thoughts in a pithy and effective way, both in writing and general conversation. It helps that I can often cast my mind back to old posts for arguments and evidence, but the sheer mental workout of writing accurate articles quickly helps me think on my feet.

The publicity

Having a thousand people willing to spend their time checking out what I'm up to is incredibly powerful. That's usually enough to get meaningful feedback about what's working and what's failing, and to start a viral spread if it's really a winner. I'm also hopeful you'll all be a good pool of potential customers once I roll out something that actually generates revenue!

The credibility

It's not who you know, it's who knows you. Having someone in the room who's actually heard of your work changes the whole tenor of a meeting. I don't hear from the folks who read my blog and think I'm a clown of course, but some of my strongest and most rewarding business relationships have come about thanks to this blog.

Why should startups blog?

Photo by Anne Bowerman

A friend just asked me a simple but important question – why should a startup have a blog? What are the practical, concrete benefits?

The short answer is, you probably shouldn't have one! It takes a lot of time and mental energy to keep an active blog going, and if you spent that time on external consulting instead and channeled the revenue into AdWords you'd almost certainly get more traffic for your effort. If you don't have a burning desire to say something, you'll get dispirited by the initial lack of interest and give up.

It's a bit like choosing to do a startup instead of sticking with a salary job – if you stare at the cold probabilities it's an irrational choice. You hear about the successes, but not about the hundreds of other blogs that get ignored despite great content.

So why am I blogging? For the same reason I'm building my startup, it's an itch I have to scratch. It's had some great personal benefits, from improving my ability to communicate complex ideas, to making friends I'd never have discovered, but those came slowly. In the first year I had posts which literally nobody read, and it took several months of daily posts before I got my first RSS subscriber! The only things that kept me going were my pig-headed stubbornness and the pleasure of looking at a completed post.

After over 800 posts and five years of blogging, I now actually have an audience; 1,600 RSS subscribers and 30,000 web visitors a month. This brings all sorts of bonuses. I love people trusting me enough to invest their time checking out my projects, and meeting readers in person and having conversations that flow from my articles. The constant practice has made me a competent enough writer to pop up as a guest author on some of my favorite sites, which is both a thrill and generates some nice publicity for my startup work. I'm incredibly glad I'm in this position, but it took an obsession bordering on the insane to get here.

So don't look for hard-nosed practical reasons to blog. Unless there's something that you just have to shout from the rooftops it won't make sense.

Fighting the drug war with statistics


I only ran across Diego Valle's blog because he sent me a bug report for OpenHeatMap, but reading through his posts I'm amazed he's not famous for his work. He's calmly telling his story through statistics, but the results are powerful, shocking, and surprising. An epic example is his Statistical Analysis and Visualization of the Drug War in Mexico. With the help of just Benford's law and data sets to compare he's able to demonstrate how the police are systematically hiding over a thousand murders a year in a single state, and that's just in one small part of the article.

There's been a lot of discussion about data journalism but few examples have been more than diverting eye candy. Diego shows us how it should be done. If you're like me you'll have read other stories about the drug war down south, and all that sticks in your mind is photos of bodies and the gargantuan number of deaths. With these posts he's given me a fresh and appalling perspective on the details of what that actually means, from a criminal justice system that can't even record the murders effectively, to how military interventions correlate with increased death rates, to the impact on Mexico of the US ending its assault weapon ban. Equally important, he also shows how some of my preconceptions are wrong; Mexico continues to have a far lower homicide rate than Brazil for example, and the violence tends to be highly concentrated in areas the cartels are disputing.

Diego's work is important because he's using hard evidence to tell the truth about a vitally important story, and the data hints at how we can get out of this mess. We need people like him working in the tradition of John Snow, applying their analytical skills to illuminate life-or-death problems instead of just Twitter trends.

Why we lie on stage (and what we can do about it)

Photo by Tim Johnson

Eric Ries has an impassioned plea to entrepreneurs to stop lying on stage and he's spot on, this is a massive problem for first-time founders. Mistaken impressions of how startups worked helped me make terrible decisions and misjudge progress. Almost every startup history I thought I knew turned out to be completely cooked, once I talked privately to people who'd actually been there.

I don't think 'stop lying' is going to be a very effective solution though. If you read diaries by almost anyone living through great historical events, whether they're presidents or foot-soldiers, it's just one damn thing after another. The over-arching narrative gets pulled together from that raw material, first by journalists and later by historians. Most of what actually happened gets left out in the telling, because it doesn't fit into the pattern of cause-and-effect that stories require. Really good historians will rescue some of this material by building new and more nuanced descriptions of events, but there's always so much going on and so little space to tell it that inevitably key information gets lost in the compression process.

We're hard-wired to respond to coherent stories, so this sort of lying by omission is never going to go away. The Lean Startup movement itself is built around the story that any company can reach success by mechanically applying some simple techniques to their product development. If you look at the details of their writings this isn't actually what Eric or Steve say, there's a lot more complexity and space for more traditional approaches, but that basic story is what sticks in people's minds. There are endless debates about how much of a cargo cult the movement is, because that over-simplified version has taken hold of both sides.

The story is so popular because it's an antidote to the traditional 'auteur theory' approach to explaining the success of a startup; pick a charismatic individual and ascribe everything to their brilliant strokes of genius. Customer development's disparagement of visionaries and exaltation of hard, repetitive work is an appealingly puritanical backlash against this classically Romantic picture.

Since we can never truly transfer the totality of our experiences to anyone else, we need to take our role as storytellers seriously. If we're going to reach people, we have to build more truthful stories that win out over the bogus ones. Part of that is encouraging blogging by entrepreneurs like Tim Bull. Seeing it unfold before your eyes in realtime makes it both a lot closer to the truth and more compelling. I'm depressed that I don't know more startup founders with active blogs, I worry we're all too concerned about projecting a confident image and afraid to display how imperfect and accidental the real path of most startups is.

Another step is to seriously think about how to incorporate failures into our 'act'. I've usually managed to get a laugh by titling my recent data-processing talks 'How to get sued by Facebook'. I start out my introductions to new business contacts by talking about the 'fruitful failures' of the past two years, all the thousand ways not to build a light-bulb I've discovered and how much that's taught me.

The fundamental thing to recognise though is that we need that Romantic narrative of startup success, or no one would ever persist in trying to do something as crazy as building a company from scratch. As Tom Evslin says, "nothing great has ever been accomplished without irrational exuberance". There's plenty of great raw material that's both exciting and true in every startup's story, so let's learn to be better storytellers and spin that into a gripping tale.

Information wants to be paid

Photo by Swamibu

I want to pay for API access. That probably sounds nuts coming from a starving entrepreneur, but I don’t want to be treated as a charity case by the services I rely on.

Part of my job at Apple was third-party developer support, and even before the iPhone made it so high-profile, the company was brutally self-interested in its relationship with outside developers. With that in my background I was wary of the siren call of becoming a third-party developer when I entered the web world. In the short-term the distribution advantages are hard to resist, but the service provider always has a gun to your head. Look at the arc of both the Facebook and Twitter ecosystems. Is FbFund even still alive after all the restrictions on apps within Facebook? Tweetie’s acquisition was a big win for the team, but also made it clear that Twitter is happy to expand at the expense of other external developers. You can counter with Zynga, but I’d argue that their history shows you need to grow large enough to change the power relationship, and even then you’re at a high risk of being cut off.

Fundamentally the problem is that the relationship between developers and API providers is all take and no give. The big guys have no incentive to keep their APIs open and stable. They love the free R&D, but as soon as something looks like it might make money, the temptation to bring it in-house is irresistible.

What I’m looking for in a relationship is reciprocity. The oldest and most successful API on the web is search-engine crawling. This works because providers have a strong incentive to allow Google to index their sites – in return for handing over their content, Google sends them visitors. In the real world, though, it’s unusual for a business relationship to run on barter like this. In most cases if company A makes money and depends on company B, some of A’s revenue ends up in B’s pocket.

I want to know where I stand relative to the business model of any company I depend on. If API access and the third-party ecosystem makes them money, then I feel a lot more comfortable that I’ll retain access over the long term. If it’s a drain on their resources, then I’ll assume they’re doing it for free R&D and may yank the plug at any point. It doesn’t stop me experimenting, but I’d never build a business that relied on them.

So, I’m basically stuck with Salesforce as my only option, until I can persuade Twitter or Facebook to take my cash!

“On the one hand information wants to be expensive, because it’s so valuable. The right information in the right place just changes your life. On the other hand, information wants to be free, because the cost of getting it out is getting lower and lower all the time. So you have these two fighting against each other.” – Stewart Brand

Thanks to Jud and Rob for contributing a lot to my thoughts around this

The Boulder/Denver Hadoop meetup is tomorrow!

Photo by Steve Jurvetson

Tomorrow night (Wednesday October 6th) we'll be holding our monthly Hadoop meetup, this time in the Gnip offices in downtown Boulder. There's always a great mix of folks, including Return Path's Jacob at the helm. This month's theme is sorting out each other's HBase issues, but it's always very informal and we're never quite sure where the discussion will end up until the beer is flowing! If you're using or thinking about using Hadoop, or just have an interest in big data processing, come and geek out with us.