Some closure on my collision with Facebook

In response to my last post, Bret Taylor, the CTO of Facebook, announced that they will be altering their robots.txt to whitelist particular crawlers rather than trying to enforce their new terms of service. This makes me very happy, because it makes it much less likely that other companies will try to impose restrictions this way, leaving crawlers free to obey robots.txt without fear of litigation.
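
For anyone wondering what whitelisting particular crawlers actually looks like, here's a rough sketch using Python's standard robots.txt parser. The file contents and crawler names below are hypothetical, just to illustrate the mechanism, not Facebook's actual rules:

    from urllib.robotparser import RobotFileParser

    # A hypothetical whitelist-style robots.txt: named crawlers are allowed in,
    # and everything else is disallowed by default.
    ROBOTS_TXT = """
    User-agent: Googlebot
    Disallow:

    User-agent: *
    Disallow: /
    """

    rp = RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    # A well-behaved crawler checks before fetching any page.
    print(rp.can_fetch("Googlebot", "/some/public/page"))       # True: on the whitelist
    print(rp.can_fetch("SomeStartupBot", "/some/public/page"))  # False: everyone else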

On a personal level, I'm hoping this helps me put the whole episode behind me too. I truly didn't set out to take on Facebook; I was just trying to build my product and rather naively stumbled into a minefield. The last few months of only dealing with their lawyers were very frustrating, and led me to conclude that there was a nefarious plan behind their attempt to impose a terms-of-service agreement beyond robots.txt. I believe Bret when he says it was just a lapse of judgment, and that admission should help me move on. I've always prided myself on being on the straight-and-narrow, and what rankled most was Facebook's legal team treating me like a shady spammer.

Thanks for everyone's support as I've been working through this, all you blog readers and commenters, and especially Liz for putting up with me pulling my hair out and banging my head on the desk as I tried to deal with it all!

Now to get back to building stuff, since I don't want to just be known as that guy who got sued by Facebook…

Facebook employee responds on robots.txt controversy

I received a comment from Blake Ross on the legal changes Facebook have recently made, and I wanted to highlight that response. I recommend you look at the document itself, along with both of our commentaries on it, and make up your own mind.

-------------------

Hey Pete,

I work for Facebook, but this comment should not be construed as an official company statement.

Your interpretation of this document isn't correct and frankly doesn't make much sense. If our goal were to make it difficult for startups to succeed using Facebook data, we wouldn't have launched an open API that provides access to all of our data; we wouldn't have launched the fbFund to fund startups that are built on top of this API; and we wouldn't host an annual developer conference to help startups use this API. The very future of our platform is predicated on the notion that we can help other companies improve their products by leveraging the social graph.

This crawling document exists because we've had problems where shady companies would try to scrape user information in aggregate and use it for malicious purposes. For instance, these companies would scrape page by page from http://www.facebook.com/family/ and then try to resell these bulk lists.

Blake Ross

-------------------

Hi Blake,
Thanks for taking the time to comment. I might be willing to give Facebook the benefit of the doubt on this if it weren't for the $14,000 legal bill I just paid.

The fact is that you've made all the information on the profile pages public, complete with micro-format data to help crawlers. There are some simple technical fixes you could make to solve the specific problem you mention, starting with amending robots.txt and changing your ID formats:
http://petewarden.typepad.com/searchbrowser/2010/…
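
To make the ID-format point concrete, here's a rough sketch in Python. The URLs and function names are made up purely for illustration, not Facebook's actual scheme: sequential numeric IDs let a scraper walk through every page just by counting, while long random tokens leave nothing to enumerate.

    import secrets

    # Sequential IDs: a scraper can enumerate every profile page by counting upwards.
    def sequential_profile_url(numeric_id):
        return "http://www.example.com/profile.php?id=%d" % numeric_id

    # Opaque IDs: each page gets a long random token, so there's no sequence to walk.
    def opaque_profile_url():
        token = secrets.token_urlsafe(16)  # roughly 128 bits of randomness
        return "http://www.example.com/profile/%s" % token

    print([sequential_profile_url(i) for i in range(1, 4)])  # trivially guessable
    print(opaque_profile_url())                              # not guessable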

You've chosen to leave all that information out in the open so you can benefit from the search traffic, and instead try to change the established rules of the web so you can selectively sue anyone you decide is a threat.

I'm really pretty bummed about this because I've been a long-time fan of Facebook; you can see me raving about your XHProf work here:
http://petewarden.typepad.com/searchbrowser/2009/…

The sad fact is, your leadership has decided to change the open rules that have allowed the web to be such an interesting and innovative place for the past decade.

Facebook has always been a closed system, where developers are expected to live in a culture of asking permission before doing anything and existing at the whim of the company's management. The web I love is an open world where you are free to innovate as long as you stick to the mutually agreed rules. This is a land grab by Facebook: they've moved into the open web for the commercial benefits they'll reap, but they want to change the rules so they can retain absolute control.

Flying to Santa Rosa Island with the CIA

Santarosatail

Don't worry, Facebook haven't had me spirited away to Gitmo, but I did just take a trip with Channel Islands Aviation. They kindly donated a free flight to seven of us who'd previously volunteered with the NPS on the Channel Islands. Neither Liz nor I had been able to visit Santa Rosa Island before, so we were thrilled to have the chance to explore a whole new wilderness, even just for a day.

Santarosaplane

We'd looked into camping there before, but there are very few scheduled boat trips to the outer islands, so we'd never been able to find enough time to take the four days off we'd need. I didn't know that the CIA planes had just started to offer regular excursions too. It was quite an adventure; I'd never been in an eight-seater before, and the landing on Santa Rosa's dirt strip seemed daunting, especially in the strong winds the island is known for. Our pilot, Mark Oberman, made it look easy though, with a gentle touch-down.

Santarosacoast

Once we landed, we had the afternoon to explore the beautiful coast and hills, guided by Carolyn Greene, an interpretive volunteer for the park. It's amazing how much of the work of the NPS is taken on by volunteers; it makes me want to retire too, so I can really get stuck into some of their projects!

Santa Rosa is the second largest island in the chain, at 85 square miles only a little smaller than Santa Cruz, and until very recently it was used for cattle ranching. There's still a population of deer and elk that were introduced for hunting, all of which have left the native vegetation pretty beleaguered. There are lots of signs of recovery though, and the rare Torrey Pines seem to be thriving, with numbers up from fewer than a thousand to over four thousand in just a few years.

Santarosapinecone

I'm always amazed how the Channel Islands draw people in; so many of those working there fell in love with their beauty and arranged their whole lives around them. Luluis, one of the rangers, actually grew up on Santa Rosa when it was a cattle ranch. Our pilot has been flying to them for 35 years, and was married in Santa Cruz Island's chapel, with the old owner Harry Stanton as a witness. Brent, another ranger, used to work the boats that ferried tourists out there before he got his current job. I guess you can add Liz and me to that list, since we've made trips all the way from Colorado twice in the last six months to visit the islands.

Santarosabeach

Finally we headed back to the plane and civilization, along the sort of beautiful beach that makes me wonder what California was like before all the development.

If you're interested in visiting Santa Rosa Island, you could make a day trip either by boat with Island Packers or flying with CIA, or head out for some camping. I don't imagine the camping would be easy though: the winds can be strong, and the whole habitat is still recovering from the cattle, so you don't find much shelter. There's a developed campground with water, or, if you're really hardy, you can sometimes backpack over to the south beach. I can't imagine anywhere more remote that's only a few dozen miles from Los Angeles; it would be an amazing experience despite the hardships.

Facebook changes the rules for the public web

A friend recently sent me this link to a new legal document Facebook have added to their site:

http://www.facebook.com/apps/site_scraping_tos_terms.php

It's the first time I've seen Facebook formally lay out how they think the world should treat the web pages they've made public and indicated, through their robots.txt, that they allow to be crawled. What it says is what they told me when they threatened to sue me a few months ago: anyone who crawls the web must obtain prior written permission from every site.

Why should you care? It's their attempt to have their cake and eat it too. They want to make as much information as possible about their members public so that they can get traffic from search engines and drive brands to prioritize their Facebook pages, but they know they only have users trapped as long as their data is hard to transfer out of the service to any potential competitor.

So, stuck between these two incompatible goals, they've reached for the lawyers. They could change their robots.txt to disallow crawling, or remove the pages they've made public, but that would remove their valuable search traffic. There's a lot of legal backing to the rules in robots.txt, but you'll need deeper pockets than mine to contest Facebook's new interpretation.

What it means in practice is that large established companies are able to crawl (though always with the threat of legal action hanging over them), but smaller, newer startups will be attacked by Facebook's lawyers as soon as they look threatening. Google definitely fall foul of the new rules (caching web pages, using data for advertising purposes), so I'd be interested to know whether they've signed up. I know these changes would make it impossible for a company like Google to get started today, since they'd have to contact each and every website before they crawled it and respond to demands like "an accounting of all uses of data collected through Automated Data Collection within ten (10) days of your receipt of Facebook’s request for such an accounting". Avoiding that sort of mess was exactly why the industry agreed on robots.txt as a standard.

To be completely clear, I understand that Facebook need to protect their users' privacy. This does nothing to help that: anyone malicious is free to gather and analyze all the information they have made public about people, since Facebook has left it all completely in the open with no technical safeguards. What this does is give Facebook a legal stick to beat anyone legitimate who tries to openly use the data they've made available in a way they decide they don't like.

Five short links

Fivedogs
Photo by Xanboozled

The Git Parable – After reading this story by Tom Preston-Werner of GitHub, the way git works actually makes sense. It’s still a little maddening that my common workflow is more involved than it was on svn, but I understand more about how that flexibility can also be very powerful. via Elben

MacMail Power User Tips – TA from Gist has a great article on how to use the Apple desktop mail program effectively. I’m exclusively on gmail these days, and I constantly miss this sort of advanced feature from Outlook, but the convenience of webmail is hard to beat.

Mapping conference connections – This is an area I’ve been fascinated by, and I actually spent some time with the guys at the late lamented Eventvue trying to create a compelling product around the same idea. I’d love to have a social map of how I’m connected to conference attendees before I go, but it’s been surprisingly tough to turn that into a business. via Eutiquio

Mapping the world’s photos – This is a couple of years old, but it’s still amazing to see the details that get highlighted when you map the locations of 35 million photos from Flickr. via Michael

Poly9 Globe – A Flash component that lets you render an interactive 3D globe and overlay information on it. Very nicely done.

An end to the loneliness of the open-source coder?

Lonewolf
Photo by Ucumari

I’ve been publishing free software since I was 15, back in the days of ‘public domain’ floppy disks and magazine listings. I still get lovely emails from people using my open-source visual effects plugins, and I’m still so amazed by the magic of what computers can do that I can’t help but keep sharing code around the areas that fascinate me.

What’s been strange though is how solitary my open-source coding work has been. In my commercial work I usually end up being the guy who talks to everyone on the team and knows how all the pieces fit together. Partly that’s because all the juiciest bugs are in the gaps between the modules, but I’m also the sort of person who loves to learn other people’s code. By contrast, even my popular free projects have never been shared endeavors. I haven’t even found anyone willing to take on the task of porting my plugins to new versions of After Effects, now that I no longer work in that world.

I don’t think I’m alone in this; looking around, there’s a massive long tail of open-source modules that are being ignored by potential users and contributors, either because they don’t know about them or because the barriers to getting involved are too high. Today something happened that gives me hope that things are changing.

I recently began publishing all of my new code on GitHub. I tried it out because somebody nagged me to in the blog comments, and I stayed with it because the website interface was so straightforward and friendly. It’s what SourceForge would have been if they had any UI skills. I’m still at the stage with git where I’m typing in commands from tutorials without really knowing what I’m doing, and occasionally cursing the new mental model it forces on me, but I’m able to get simple tasks done.

What I realized this afternoon is that its familiar interface hides a deeper infrastructure, something that has the potential to change the way open-source works. I received a pull request for ParallelCurl!

So what? Well, first it was great to know people are using the project enough to want to make changes, but more importantly it made me realize how much GitHub has lowered the barriers to contributing to an open-source project. In the old days I remember biting my nails to the quick as I reviewed, patched, tested and documented small fixes to large codebases. Writing to the mailing list to get your patch accepted was an art in itself, and on both ends it was a painful amount of work to get changes from new contributors. It was fraught with social problems too; the process had the potential to be confrontational and unpleasant if the patch was rejected.

GitHub changes all that. SoftwareElves was able to create his own branch of the code, make some changes and then just drop me a notification that he had a new version I should consider rolling in. Reviewing and accepting his changes was simple, but if I’d been unresponsive or hadn’t liked them, his branch of the code would still have been a first-class citizen and there would have been no awkwardness involved. Both git and GitHub have collaboration baked in, which sounds obvious for a version control system, but I realize now it has been lacking from every other service I’ve used. GitHub is a social network for code.
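
For anyone who hasn’t tried it, the whole exchange boils down to a handful of git commands. This is just a sketch of the general fork-and-pull workflow, with placeholder repository URLs, branch names and commit messages rather than the real ones:

    # Contributor's side: fork the project on GitHub, then
    git clone git@github.com:contributor/ParallelCurl.git
    cd ParallelCurl
    git checkout -b my-fix        # work on a branch of the fork
    # ...edit the code...
    git commit -am "Fix handling of failed requests"
    git push origin my-fix
    # ...then click "pull request" on GitHub to notify the maintainer.

    # Maintainer's side: review the branch and merge it in.
    git remote add contributor git@github.com:contributor/ParallelCurl.git
    git fetch contributor
    git merge contributor/my-fix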

There’s still going to be some tumbleweed blowing through the long tail of open-source projects, but GitHub is a massive step forward. I’m eagerly anticipating lots more people pointing out my mistakes; the world of open-source will be a lot more productive with that sort of collaboration.

Five short links

Grasschains
Photo by Peter Kurdulija

The Visualization Trap – The authors argue that visualizations are dangerous because they’re too persuasive, using accident reconstructions as an example where the computer-generated animation makes viewers more likely to take a strong position on the cause than witnesses to the actual event. I think that production values are a big part of this. We’re unconsciously impressed by the amount of money that someone spends on a presentation. It’s like peacock feathers: if they can expend that many resources on their argument, they must have a lot of confidence in it. That’s why commercials cost millions, and visualizations are just another high-cost way of telling stories, with the same unfair persuasive advantage as any other expensive medium.

Statistical Intensity Map Creator – A neat little (commercial) Flash map for displaying US state data.

Modest Maps – An awesome open-source project making it easy to include tile-based zoomable maps in either Flash or Python on the server side. One of the authors is Michal Migurski of Stamen, who produce some amazing visualizations.

Extracting Place Semantics from Flickr Tags – Users are generating massive amounts of data by tagging photos with known locations. Can we use that information to build a rich database of information on places?

The Buzzer – A spooky Russian radio station that’s been broadcasting an enigmatic signal for decades. Some claim it’s just for atmospheric research, but is it actually a “dead man’s switch” for a nuclear apocalypse?