There's information about my Facebook data set scattered around multiple news articles, as well as posts in this blog, but here's the full story of how it all came down.
I'm a software engineer. My last job was at Apple, but for the last two years I've been working on my own startup called Mailana. The name comes from 'Mail Analysis', and my goal has been to use the data sitting around in all our inboxes to help us in our day-to-day lives. I spent the first year trying (and failing) to get a toe-hold in the enterprise market. Last year I moved to Boulder to go through the Techstars startup program, where I met Antony Brydon, the former CEO of Visible Path. He described the immense difficulties they'd faced with the enterprise market, which persuaded me to re-focus on the consumer side.
I'd already applied the same technology to Twitter to produce graphs showing who people talked to, and how their friends were clustered into groups. I set out to build that into a fully-fledged service, analyzing people's Twitter, Facebook and webmail communications to understand and help maintain their social networks. It offered features like identifying your inner circle so you could read a stream of just their updates, reminding you when you were falling out of touch with people you'd previously talked to a lot, and giving you information about people you'd just met.
It was the last feature that led me to crawl Facebook. When I meet someone for the first time, I'll often Google their name to find their Twitter and LinkedIn accounts, and maybe Facebook too if it's a social contact rather than business. I wanted to automate that Googling process, so for every new person I started communicating with, I could easily follow or friend them on LinkedIn, Twitter and Facebook. My first thought was to use one of the search engine APIs, but I quickly discovered that they only offer very limited results compared to their web interfaces.
I scratched my head a bit and thought "well, how hard can it be to build my own search engine?". As it turned out, it was very easy. I checked Facebook's robots.txt and found that they welcome the web crawlers that search engines use to gather their data, so I wrote my own in PHP (very similar to this Google Profile crawler I open-sourced) and left it running for about 6 months. Initially all I wanted to gather was people's names and locations so I could search on those to find public profiles. When I talked to a few other startups, it turned out they needed the same sort of service, so I started looking into either exposing a search API or sharing that sort of 'phone book for the internet' information with them.
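For anyone who hasn't looked at how a crawler consults robots.txt, here's a minimal sketch of the check in Python using the standard library's `urllib.robotparser`. This is not my PHP crawler. The rules in `SAMPLE_ROBOTS_TXT`, the `ExampleCrawler` user agent and the example.com URLs are all made up for illustration, and they are not Facebook's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/people/jane",
        "https://example.com/private/notes",
    ):
        verdict = "allowed" if is_allowed(SAMPLE_ROBOTS_TXT, "ExampleCrawler", url) else "blocked"
        print(url, verdict)
```

A real crawler would fetch the live file (for instance with the parser's `set_url()` and `read()` methods) and run this check before every request. As the rest of this story shows, passing the check says nothing about whether the site owner considers the crawl legally permitted.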
I noticed Facebook were offering some other interesting information too, like which pages people were fans of and links to a few of their friends. I was curious what sort of patterns would emerge if I analyzed these relationships, so as a side project I set up fanpageanalytics.com to allow people to explore the data. I was getting more people asking about the data I was using, so before that went live I emailed Dave Morin at Facebook to give him a heads-up and check it was all kosher. We'd chatted a little previously, but I didn't get a reply, and he left the company a month later so my email probably got lost in the chaos.
I had commercial hopes for fanpageanalytics, since I felt there was demand for a compete.com for Facebook pages, but I was also just fascinated by how much the data could tell us about ourselves. Out of pure curiosity I created an interactive map showing how different countries, US states and cities were connected to each other and released it. Crickets chirped, tumbleweed blew past and nobody even replied to or retweeted my announcement. Only 5 or 6 people a day were visiting the site.
That weekend I was avoiding my real work but stuck for ideas on a blog post, and I'd been meaning to check out how good the online Photoshop competitors were. I'd also been chatting to Eric Kirby, a local marketing wizard, who had been explaining how effective catchy labels were for communicating complex polling data, e.g. 'soccer moms'. With that in mind, I took a screenshot of my city analysis, grabbed SumoPaint and started sketching in the patterns I'd noticed. After drawing those in, I spent a few more minutes coming up with silly names for the different areas and wrote up some commentary on them. I was a bit embarrassed by the shallowness of my analysis, and I was keen to see what professional researchers could do with the same information, so I added a postscript offering them an anonymized version of my source data. Once the post was done, I submitted it to news.ycombinator.com as I often do, then went back to coding and forgot about it.
On Sunday around 25,000 people read the article, via YCombinator and Reddit. After that a whole bunch of mainstream news sites picked it up, and over 150,000 people visited it on Monday. On Tuesday I was hanging out with my friends at Gnip trying to make sense of it all when my cell phone rang. It was Facebook's attorney.
He was with the head of their security team, who I knew slightly because I'd reported several security holes to Facebook over the years. The attorney said that they were just about to sue me into oblivion, but in light of my previous good relationship with their security team, they'd give me one chance to stop the process. They asked for and received a verbal assurance from me that I wouldn't publish the data, and sent me a letter to sign confirming that. Their contention was that robots.txt had no legal force and they could sue anyone for accessing their site even if they scrupulously obeyed the instructions it contained. In their view, the only legal way to access any web site with a crawler was to obtain prior written permission.
Obviously this isn't the way the web has worked for the last 16 years since robots.txt was introduced, but my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me. With that in mind, I spent the next few weeks negotiating a final agreement with their attorney. They were quite accommodating on the details, such as allowing my blog post to remain up, and initially I was hopeful that they were interested in a supervised release of the data set with privacy safeguards. Unfortunately it became clear towards the end that they wanted the whole set destroyed. That meant I had to persuade the other startups I'd shared samples with to remove their copies, but finally in mid-March I was able to sign the final agreement.
I'm just glad that the whole process is over. I'm bummed that Facebook are taking a legal position that would cripple the web if it was adopted (how many people would Google need to hire to write letters to every single website they crawled?), and a bit frustrated that people don't understand that the data I was planning to release is already in the hands of lots of commercial marketing firms, but mostly I'm just looking forward to leaving the massive distraction of a legal threat behind and getting on with building my startup. I really appreciate everyone's support. Stay tuned for my next project!
You seem to have gone through quite a lot. Anyway, I wish you luck with the coming days.
Your story makes me amazingly sad. This is information that people have stated they want public. Heck, Facebook has made many controversial changes to encourage and/or trick its users into making more information public. I wish Google had stepped up to the plate and offered to defend you, since their whole business model depends on being able to analyze the content of web sites without prior permission from the site owners.
Hey, if you’re interested in some legal precedent you may want to check out EF Cultural Travel BV v. Explorica, Inc., 274 F.3d 577 (1st Cir. 2001) and EF Cultural Travel BV v. Zefer Corporation, 318 F.3d 58 (1st Cir. 2003), both available from Google Scholar. Essentially the former said that use of a scraper was wrongful because the CEO had worked at the other company and was using proprietary information gained from the scrapes. On the other hand, the latter was okay because they were only using what was publicly available. The key was that the TOS said nothing about not using scrapers. I would think that a robots.txt would be equivalent to a TOS, but I don’t know what other terms you agreed to by using Facebook’s API. Ultimately, your lawyer was right. Even if you did nothing wrong, they have a lot more money and could have sued you into oblivion.
webscraping is bypassing the API altogether. it’s basically what your browser does, or what a search engine does. Google doesn’t use the facebook API to scrape facebook; they just scrape it.
So you’re data-mining, for profit, and the big boys refuse to play with you ?
Allow me to whip out the world’s smallest violin.
“Obviously this isn’t the way the web has worked for the last 16 years since robots.txt was introduced, but my lawyer advised me that it had never been tested in court, and the legal costs alone of being a test case would bankrupt me.”
-This is the way the internet works. Unless someone with a heavy cash hand wants to stop you. You did nothing wrong, up until you signed that release.
It’s an unfortunate story. It is a shame that it only comes down to who has more money. Ethics and legality have nothing to do with it. This is the American legal system, aka the aristocracy.
Did you even bother to contact the EFF or another tech law firm that would potentially be willing to represent you pro bono?
These big corporate bully stories always depress me. I am a bit annoyed at the commenters that suggest that Pete shouldn’t have signed the release or should have found some lawyer to champion his cause. He has no obligation to take on a corporate giant at the expense of his wellbeing. He did what he had to do. If any of you feel so strongly about this, then go collect the data and take on Farcebook yourselves.
Oh and Pete, thanks for the awesome map, I hope it gets your start up the positive attention you deserve.
-Mike a refugee from the border of Mormonia and the Nomadic West living in Socialistan.
You should have contacted the EFF as soon as this happened. They bullied you in an outrageous way stretching their version of how the web should work beyond credulity. Unfortunately you’ve now signed some sort of agreement with them to stop a legal activity under threat which I’m sure they constructed to weaken your position.
There’s good legal advice on these comments but the bottom line is that defending the case will bankrupt him.
That’s it. All of it: Facebook have more money, so much so that they can exhaust Pete’s resources. That’s all the legal merit a case needs in your jurisdiction, and that’s all the justice there is.
The only hope is that Facebook try suing someone like Google, who have even deeper pockets; they could win or lose and it’s actually a good thing if Pete backs down, rather than lose a meritorious but under-resourced defence which, in being lost, sets a precedent that can’t be overturned for *any* amount of money.
Disclaimers: I’m not a lawyer. I like my opinions, and you might too: but you’re a fool if you choose a course of action on a layman’s opinions instead of seeking advice from a qualified, experienced and accredited legal professional.
I don’t get it?
Why are you backing down from this? Sounds like a largely easy case for any lawyer. I think you need a new one, your one got paid off or is an absolute pussy.
Nice work though, keep it up, next time just release anon based on your own open sourced work that doesn’t piss anyone off, then you can use your data freely. Freenet or something would be a good place.
I agree with the previous commenters. This sounds like something the EFF would take on.
Court cases lately have addressed the question of whether website Terms of Service are enforceable, and found that they are.
Facebook’s ToS (now called “Rights and Responsibilities”, but still linked from the bottom of the page as “Terms” – see http://www.facebook.com/#!/terms.php?ref=pf ) says that they don’t allow scraping of their site without their permission. See item 3, Safety, point 2: “You will not collect users’ content or information, or otherwise access Facebook, using automated means (such as harvesting bots, robots, spiders, or scrapers) without our permission.”
One such case was by airlines against consumer-facing frequent flyer program information aggregation and optimization services. The airlines’ websites prohibited accessing the data contained therein except through the interfaces specifically offered or allowed by the airlines. (They want your eyeballs. So does Facebook, with a vengeance). The optimization service violated that term, regardless that it did so with the full authorization of the airlines’ customers who very much wanted the optimization service. The court found in favor of the airlines, and the optimization service shut down.
As a result, I doubt that you have a legal basis to challenge them. I would be surprised if the EFF wants to try to bust that kind of already adjudicated contract term, no matter how much it does limit some valuable types of innovation.
I am quite disappointed that Facebook didn’t want to bring you in to work with them on the value that your research and technology provides (though of course they’d want it for its commercial ability to further increase eyeballs on and data shared in Facebook itself), as I’m sure they see the value in it. Perhaps they already have their own such project in the works, far enough along that they don’t feel the need to buy additional talent.
Regardless of the legal whys and wherefores, of course I’m disappointed for you, and hope you’ll find an outlet for the value you obviously do bring!
lol you took down all posts that pointed out your error. Way to go champ.
besides the facebook violence, i found your projects very interesting, keep up!
Something to be aware of:
This could be very relevant to your issue?
Very interesting. You wouldn’t want to be the first but I bet the court would rule in your favor because the whole industry would have to change otherwise, including Facebook. Still what an experience. Maybe it’s an idea to pursue your research but with Facebook backing and funding. They were obviously scared by your idea which means you’re on to something
You may also have to consider the idea that Facebook was only willing to share the data publicly with those who paid for it.
‘the legal costs alone of being a test case would bankrupt me’
When you get told that by your lawyer, you just know you don’t live in a justice state. So depressing.
There’s some merit in being Judgement Proof? Why do I feel there are hundreds of programmers who openly scrape Facebook, and turn around and sell the information to the highest American bidder? What bothers me about Facebook is they alienate the very users who build the pyramid!
I bet you were one of the first Facebook Fans? When I think of Facebook; I picture a fat, spoiled kid. “But Daddy, I want the golden egg! I don’t care what it costs. Give them 7 billion dollars! Kill my chicken–I’m hungry! I hope there isn’t a God?” Ugh!
Thanks for sharing this post. If Facebook’s claims would actually be upheld in court, I imagine it would be a disaster! Every website owner could potentially add a similar ToS and sue every search engine out there just because the search engine didn’t obtain prior approval to crawling their website…
Besides court fees, I’m curious what amount they can really even sue you for, since you haven’t really caused any damage to the company or stolen anything, besides using a standard means of access that they “prohibit”…
I just opened up their robots.txt file, and it says:
# Notice: Crawling Facebook is prohibited unless you have express written
# permission. See: http://www.facebook.com/apps/site_scraping_tos_terms.php
Suing FB for being a two-faced actor could win. How many people have gotten the same treatment? I think if there are a lot, more than 50, who will cooperate, I can take it on, win, and get them money, plus stop FB’s two-faced internet act.
I think this is easy to view as the big evil corporation coming after the little guy again. But I think we should also consider the underlying principles. If a company invests a massive amount of money, time and energy into building a vast informational structure, it would make sense that they wouldn’t want any random guy procuring all of that for their own commercial purposes, or simply disseminating it, which can dilute and devalue their product by increasing the supply and decreasing the demand at any one given point.
Facebook users may also not want their info paraded around the world for all to see, and they may not want to serve as a cog in the machine you are so brilliantly creating. They may be even less likely to use FB or put their information on there if they know what it will be used for. That can result in very real financial damage to the company. When you financially damage a company you are also hurting jobs, and real people’s lives are actually affected.
We also want society’s brilliant innovators and investors to put a lot into massive projects like FB, which they will be less interested in and less likely to do if someone can devalue, dilute and procure it for their own purposes. This is why we have laws around intellectual property.
So sure, we can easily say FB was being a little dickish or hypocritical, but let’s not be childish and only look at one side of the coin.