Shion Deysarker of 80legs recently laid out his thoughts on the rules that should govern web crawling. The status quo is a free-for-all with robots.txt providing only bare-bones guidelines for crawlers to follow. Traditionally this hasn't mattered, because only a few corporations could afford the infrastructure required to crawl on a large scale. These big players have reputations and business relationships to lose, so any conflicts not covered by the minimalist system of rules can be amicably resolved through gentleman's agreements.
These days any punk with a thousand bucks can build a crawler capable of scanning hundreds of millions of pages. Startups like mine have no cosy business relationships to restrain them, so when we're doing something entirely new we're left scratching our heads about how it fits into the old rules. There are several popular approaches:
None of the major sites I've looked at and talked to have any defense against crawlers, or even any monitoring of them, so as long as you stay below denial-of-service levels they'll probably never even notice your crawling. They rely on the legal force of robots.txt to squash you like a bug if you publicize your work, but there's clearly a black market developing where shady marketers will happily buy data, no questions asked, much like the trade in email lists for spammers.
An extension of this approach is crawling while logged in to a site, getting access to non-public information. This WSJ article is a great illustration of how damaging that can be, and Shion is right to single it out as unacceptable. I'd actually go further and say that any new rules should build on and emphasize the authority of robots.txt. It has accumulated a strong set of legal precedents to give it force, and it's an interface webmasters understand.
Everything not forbidden is permitted
If your gathering obeys robots.txt, then the resulting data is yours to do with as you see fit. You can analyze it to reveal information that the sources thought they'd concealed, publish derivative works, or even the underlying data if it isn't copyrightable. This was my naive understanding of the landscape when I first began crawling, since it makes perfect logical sense. What's missing is the fact that all of those actions I list above, while morally defensible, really piss website owners off. That matters because the guys with the interesting data also have lots of money and lawyers, and whatever the legal merits of the situation they can tie you up in knots longer than you can keep paying your lawyer.
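For anyone wondering what "obeys robots.txt" looks like in practice, here's a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the bot name and the URLs are all made up for illustration, and a real crawler would download the file from the site rather than hard-code it. `Crawl-delay` is a non-standard extension that this parser happens to understand; not every site or crawler honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

# Hypothetical crawler name, not a real bot.
AGENT = "ExampleStartupBot"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a parser we can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask before every fetch; skip anything the site has disallowed.
public_ok = parser.can_fetch(AGENT, "https://example.com/articles/1")
private_ok = parser.can_fetch(AGENT, "https://example.com/private/profile")

# Seconds to wait between requests, or None if the site didn't say.
delay = parser.crawl_delay(AGENT)

print(public_ok, private_ok, delay)
```

Note that this only answers "may I fetch this URL?". It says nothing about what you're allowed to do with the page afterwards, which is exactly the gap the rest of this post is about.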
Hands off my data!
To the folks running large sites, robots.txt is there to control what shows up in Google. The idea that it's opening up their data to all comers would strike them as bizarre. They let Google crawl them so they'll get search traffic; why would they want random companies copying the information they've worked so hard to accumulate?
It's this sense of ownership that's the biggest obstacle to the growth of independent crawler startups. Shion mentions the server and bandwidth costs, but since most crawlers only pull the HTML without any images or other large files, these are negligible. What really freaks site owners out is the loss of control.
Over the next few years, 'wildcatter' crawlers like mine will become far more common. As site owners become more aware of us, they'll be looking for ways to control how their data is used. Unless we think of a better alternative, they'll do what Facebook did and switch to a whitelist containing a handful of the big search engines, since they're the only significant drivers of traffic. This would be a tragedy for innovation, since it would block startups off from massive areas of the Internet and give the existing players in search a huge structural advantage.
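To make the whitelist scenario concrete, here's a rough sketch of what such a robots.txt could look like and how it plays out for a newcomer. The file below is my own illustration, not a copy of Facebook's or anyone else's, and `ExampleStartupBot` is a made-up name. An empty `Disallow:` line means "nothing is disallowed" for that user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical whitelist-style robots.txt: named search engines get in,
# everyone else is shut out of the whole site.
WHITELIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

User-agent: *
Disallow: /
"""

whitelist = RobotFileParser()
whitelist.parse(WHITELIST_ROBOTS_TXT.splitlines())

URL = "https://example.com/some/page"  # placeholder URL

# Record which crawlers are allowed to fetch the same page.
access = {
    agent: whitelist.can_fetch(agent, URL)
    for agent in ("Googlebot", "Bingbot", "ExampleStartupBot")
}

for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The big engines sail through while the startup is blocked from every URL, and there's no directive it can point to in order to argue its use is harmless.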
To prevent this, we need to figure out a simple way of giving more control to site owners that won't block innovative startups. Robots.txt needs to communicate the owner's intent more clearly, with new directives similar to 'no-archive' that lay out acceptable usage in much more detail. I'm not the only one to realize this, and I'm hopeful we'll have a more detailed proposal ironed out soon.
At the same time, sites need to take stock of what information they are exposing to the outside world, since the 'scofflaw' crawlers will continue happily ignoring robots.txt. Any security audit should include a breakdown of exactly what they're handing over to scofflaw crawlers – I bet they'd be unpleasantly surprised!
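As a starting point for that kind of audit, here's a small sketch that tallies what each user agent has pulled from a site, based on web server access logs. It assumes the logs are in the common "combined" format; the sample lines, IP addresses and bot names are invented for the example. Keep in mind that a scofflaw crawler can fake its user agent, so treat this as a first look rather than a complete picture.

```python
import re
from collections import defaultdict

# Combined Log Format:
# ip ident user [time] "METHOD path proto" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)

# Invented sample log lines; a real audit would read the server's log files.
SAMPLE_LOG = [
    '203.0.113.7 - - [10/Oct/2010:13:55:36 +0000] "GET /profiles/1 HTTP/1.1" 200 2326 "-" "HypotheticalBot/0.1"',
    '203.0.113.7 - - [10/Oct/2010:13:55:37 +0000] "GET /profiles/2 HTTP/1.1" 200 2411 "-" "HypotheticalBot/0.1"',
    '198.51.100.4 - - [10/Oct/2010:13:56:02 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    'this line is malformed and gets skipped',
]


def summarize_by_agent(lines):
    """Count requests and collect the distinct paths seen per user agent."""
    summary = defaultdict(lambda: {"requests": 0, "paths": set()})
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # ignore lines that don't fit the expected format
        entry = summary[match["agent"]]
        entry["requests"] += 1
        entry["paths"].add(match["path"])
    return dict(summary)


summary = summarize_by_agent(SAMPLE_LOG)
for agent, stats in summary.items():
    print(agent, stats["requests"], sorted(stats["paths"]))
```

Sorting the result by request count, then looking at which paths the unfamiliar agents are hitting, gives a rough picture of what's being handed over.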