With the launch of Buzz, millions of people have created Google public profiles. These contain detailed personal information, including your name, a portrait, your location, your job title and employer, where you were born, where you've lived since, links to any blogs and other sites associated with you, and some public Buzz comments. All of this information is public by default. Google also micro-formats the page to make it easier for crawlers to understand, allows crawling in robots.txt, and even provides a directory listing to help robots find all the profiles (which is actually their recommended way to build a firehose of Buzz messages).
This sort of information is obviously a gold mine for researchers interested in things like migration and employment patterns, but I've been treading very carefully since this is people's personal information. I've spent the last week emailing people I know at Google, posting on the public Buzz API list, and even contacting the various government privacy agencies who've been in touch, but I've had no replies from anyone.
Since it's now clear that a number of other people are using this technique, I'm open-sourcing my implementation as BuzzProfileCrawl. As you can tell from looking at the code, this is not rocket science: it just runs some simple regular expressions on each page as it's crawled.
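To give a feel for how little is involved, here is a minimal sketch of the regular-expression step in Python. This is not the BuzzProfileCrawl code itself. The sample markup, the hCard-style class names (`fn`, `title`, `org`, `locality`), and the helper `extract_fields` are all assumptions made up for illustration, and it runs against an inline string rather than fetching any real page.

```python
import re

# Hypothetical sample page. The markup and class names are assumptions
# loosely modeled on hCard-style microformats, not Google's actual HTML.
SAMPLE_PAGE = """
<div class="vcard">
  <span class="fn">Jane Example</span>
  <span class="title">Engineer</span>
  <span class="org">Example Corp</span>
  <span class="locality">Boulder, CO</span>
</div>
"""

# One simple pattern per field: capture the text inside a span
# that carries exactly that class name.
FIELD_PATTERNS = {
    field: re.compile(r'<span class="%s">([^<]*)</span>' % re.escape(field))
    for field in ("fn", "title", "org", "locality")
}


def extract_fields(html):
    """Return a dict of field name -> first matching text; missing fields are skipped."""
    result = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(html)
        if match:
            result[field] = match.group(1).strip()
    return result


if __name__ == "__main__":
    print(extract_fields(SAMPLE_PAGE))
```

Regular expressions are brittle against real-world HTML, so a sketch like this only works because the markup is assumed to be regular and machine-friendly, which is exactly the point about micro-formatted pages.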
We need to have a debate on how much of this information we want exposed, and on how to balance innovation against privacy, but the first step is making it clear how much is already out there. There's a tremendous mismatch between what's technologically possible and ordinary people's expectations. I hope this example helps spark an actual debate, rather than the current indifference.
Great article. Interesting that one doesn’t read more about it in other places.
Thanks again Pete!
Parker