Facebook data destruction

I'm sorry to say that I won't be releasing the Facebook data I'd hoped to share with the research community. In fact I've destroyed my own copies of the information, under threat of a lawsuit from Facebook.

As you can imagine I'm not very happy about this, especially since nobody ever alleged that my data gathering was outside the rules the web has operated by since crawlers existed. I followed their robots.txt directions, and was even helped by microformatting in the public profile pages. Literally hundreds of commercial search engines have followed the same path and have the same data. You can even pull identical information from Google's cache if you don't want to hit Facebook's servers. So why did I destroy the data? This area has never been litigated and I don't have enough money to be a test case.

Despite the bad taste left in my mouth by the legal pressure, I actually have some sympathy for Facebook. I put them on the spot by planning to release data they weren't aware was available. I know from my time at Apple that reaching for the lawyers is a tempting first option when there's a nasty surprise like that. If I had to do it all over again, I'd try harder not to catch them off-guard.

So what's the good news? From my conversations with technical folks at Facebook, there seems to be a real commitment to figuring out safeguards around the widespread availability of this data. They have a lot of interest in helping researchers find ways of doing worthwhile work without exposing private information.

To the many researchers I've disappointed: there's a whole world of similar data available from other sources. By downloading my Google Profile crawling code you can build your own data set, and it's easy enough to build something similar for Twitter. I'm already in the middle of some new research based on public Buzz information, so this won't stop my work, and I still plan to share my source data with the research community in the future.

Flexible access to Gmail, Yahoo and Hotmail in PHP

Photo by Poppalina

I've been a heavy user of the php-imap extension, but last year its limitations drove me to reimplement the protocol in native PHP. Because it's a compiled extension implementing a thin wrapper on top of the standard UW IMAP library, any changes meant working in C code and propagating them through several API layers. Two problems in particular forced me to resort to my own implementation:

– Yahoo uses a strange variant of IMAP on their basic accounts, where you need to send an extra undocumented command before you can log in.

– Accessing message headers only gives you the first recipient of an email; getting any more requires a full message download. This is a design decision by the library, not a limitation of the protocol, and it severely slowed down access to the information Mailana needed.
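
To give a flavour of what reimplementing the protocol in native PHP involves, here's a bare-bones sketch of the raw-socket approach. It skips the response parsing, error handling and provider quirks that make up the bulk of my production code, so treat it as an illustration rather than a usable client:

<?php
// Bare-bones sketch: speak IMAP to Gmail over a raw TLS socket.
// Response parsing and error handling are stripped right down;
// my production code does considerably more than this.
$connection = fsockopen('ssl://imap.gmail.com', 993, $errno, $errstr, 30);
if ($connection === false) {
    die("Connection failed: $errstr\n");
}
echo fgets($connection); // the server greeting

// Every IMAP command gets a unique tag so responses can be matched up
function send_command($connection, $tag, $command) {
    fwrite($connection, "$tag $command\r\n");
    while ($line = fgets($connection)) {
        echo $line;
        if (strpos($line, "$tag ") === 0) { // the tagged line ends the response
            break;
        }
    }
}

send_command($connection, 'a1', 'LOGIN youremail@gmail.com yourpasswordhere');
send_command($connection, 'a2', 'LIST "" "*"'); // list the available mailboxes
send_command($connection, 'a3', 'LOGOUT');
fclose($connection);
?>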

I've now open-sourced my code as HandmadeIMAP. I chose the name to reflect the hand-crafted and somewhat idiosyncratic nature of the project. It's all pulled from production code, so it isn't particularly pretty, only implements the parts I need, and focuses on supporting Gmail, Yahoo and Hotmail. On the plus side, it's been used pretty heavily and works well for my purposes, so hopefully it will prove useful to you too.

I'm also hoping to use it as a testing ground for Gmail's new OAuth extension to IMAP, which makes it possible to grant mailbox access without handing over your password. Because new commands can easily be inserted, it should be possible to follow the new specification once the correct tokens have been received; I'll let you know how that progresses.
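
As a rough illustration of why that should be straightforward with this approach: once you've built and base64-encoded the OAuth-signed request string that Google's XOAUTH specification describes, authenticating is just another tagged command on the socket, using the same helper from the sketch above. The exact contents of that string are defined by the spec, so treat the details below as an untested assumption:

// Untested sketch: $signedRequest is the OAuth-signed string built per
// Google's XOAUTH spec; sending it is just another tagged command.
$xoauthString = base64_encode($signedRequest);
send_command($connection, 'a1', 'AUTHENTICATE XOAUTH '.$xoauthString);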

To give it a try for yourself, just download the code and run the test script, e.g.:

php handmadeimaptest.php -u youremail@gmail.com -p yourpasswordhere -a list -d

Thinking by Coding

Photo by Sidereal

David Mandell took me under his wing at Techstars last summer, and we ended up spending a lot of time together. One day towards the end of the program he told me "Pete, you think by coding". That phrase stuck with me because it felt true, and summed up a lot of both my strengths and weaknesses.

I choose my projects by finding general areas that have interesting possibilities for fun and profit, but then almost immediately start hacking on code to figure out the details of the technical landscape. I find it intensely frustrating to sit and discuss the theory of what we should be building before we have a good idea of what's technically possible. Just as no battle plan survives contact with the enemy, so no feature list on a technically innovative product survives the realities of the platform's limitations.

I've never been a straight-A student, and I'm not a deep abstract thinker, but I have spent decades obsessed with solving engineering problems. I think at this point that's probably affected the wiring of my brain, or at least given me a good mental toolkit to apply to that sort of puzzle. I find myself putting almost any issue in programming terms if I want to get a handle on it. For example, the path to sorting out my finances suddenly became clear when I realized I could treat it as a code optimization problem.

It also helps explain why I'm so comfortable open-sourcing my code. For me, the value I get out of creating any code is the deep understanding I have to achieve to reach a solution. The model remains in my head; the actual lines of code are comparatively easy to recreate once I've built the original version. If I give away the code, anyone else still needs to make a lot of effort to reach the level of understanding where they can confidently make changes, so the odds are good they'll turn to me and my mental model. My open-source projects act as an unfakeable signal that I really understand these domains.

This approach has served me well in the past, but it can be my Achilles heel. Most internet businesses these days don't require technical innovation. They're much more likely to die because no potential customers are interested than because they couldn't get the engineering working. Market risk used to be an alien concept to me, but the dawning realization that I kept building solutions to non-existent problems drove me into the arms of Customer Development.

I feel like I've now been punched in the face often enough by that issue that I've (mostly) learned to duck, but I'm also aware that I'm still driven by techno-lust. I love working in areas where the fundamental engineering rules are still being figured out, where every day I have the chance to experiment and discover something nobody else has known before. I can already feel most of the investors I've met shaking their heads sadly. I'm well aware that starting a revenue-generating business is almost impossible anyway, let alone while exposing yourself to the risk and pain of relying on bleeding-edge technology. The trouble is, that's what my startup dream has always been. I want to do something genuinely fresh and new with the technology, and hope to be smart and fast enough to build a business before the engineering barriers to entry crumble.

I don't know if I'd recommend my approach to anyone else, but it seems baked into who I am. Oddly enough, I didn't fully understand that until I wrote this post. Perhaps there's some thinking that I can only do by writing too?

How to gather the Google Profiles data set

Photo by Max Xx

With the launch of Buzz, millions of people have created Google public profiles. These contain detailed personal information, including your name, a portrait, location, job title and employer, where you were born, where you've lived since, links to any blogs and other sites associated with you, and some public Buzz comments. All of this information is public by default, and Google micro-formats the pages to make them easier for crawlers to understand, allows crawling in robots.txt, and even provides a directory listing to help robots find all the profiles (which is actually their recommended way to build a firehose of Buzz messages).

This sort of information is obviously a gold-mine for researchers interested in things like migration and employment patterns, but I've been treading very carefully since this is people's personal information. I've spent the last week emailing people I know at Google, posting on the public Buzz API list, and even contacting the various government privacy agencies who've been in touch, but I've had no replies from anyone.

Since it's now clear that there are a bunch of other people using this technique, I'm open-sourcing my implementation as BuzzProfileCrawl. As you can tell from looking at the code, this isn't rocket science; it just runs some simple regular expressions on each page as it's crawled.
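
To make "simple regular expressions" concrete, here's a stripped-down sketch of the general approach rather than the actual BuzzProfileCrawl code. The profile URL is a made-up placeholder, and the patterns just target the standard hCard-style microformat classes, so don't rely on them matching the real markup exactly:

<?php
// Stripped-down sketch of regex-based profile scraping.
// The URL is a placeholder and the patterns are illustrative; the real
// BuzzProfileCrawl code targets the actual microformatted markup.
function fetch_page($url) {
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($curl);
    curl_close($curl);
    return $html;
}

$profileUrl = 'http://www.google.com/profiles/some.user'; // hypothetical profile
$html = fetch_page($profileUrl);

$result = array();
if (preg_match('/class="fn"[^>]*>([^<]+)</', $html, $matches)) {
    $result['name'] = $matches[1]; // hCard 'fn' class marks the full name
}
if (preg_match('/class="locality"[^>]*>([^<]+)</', $html, $matches)) {
    $result['location'] = $matches[1]; // 'locality' marks the town or city
}
print_r($result);
?>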

We need to have a debate on how much of this information we want exposed and on how to balance innovation against privacy, but the first step is making it clear how much is already out there. There's a tremendous mismatch between what's technologically possible and ordinary people's expectations. I hope this example helps spark an actual debate, rather than the current indifference.

How I removed spyware from 1,000 miles away with Crossloop

Picture by Ocular Invasion

Way back in '06 one of my first blog posts was a review of Crossloop, a free and awesomely user-friendly remote desktop application for Windows. Ever since then I've made sure to install it on any Windows machine I might ever have to provide support for, and today it saved my bacon yet again.

A few years ago, we bought a new laptop for Liz's mom. She's pretty computer-savvy, but since she was used to Outlook Express and Word we didn't want to switch her over to OS X, so it was an XP machine. I did the standard things to secure it: made certain automatic updates were running, bought McAfee, and made Firefox the default browser. It doesn't look like that's enough any more, since yesterday a trojan slipped through and she was bombarded with bogus anti-spyware popups whenever she did anything on the machine. She knew something wasn't kosher and gave us a call to find out what she should do.

The description made my heart sink. In the past I'd ended up spending 12 hours straight getting a stubborn piece of spyware off Liz's old laptop, and her mom lives over 1,000 miles away in Wisconsin. Since my Windows knowledge is way out-of-date I put a call out to Twitter for software suggestions, and got the usual high quality of advice. The top pick was Spybot Search and Destroy, with 'nuke the machine and reinstall' a strong second! I tend to do the latter for my personal machines, since even OS X gets pretty unpredictable if you keep doing incremental updates over multiple OS revisions, but I didn't relish doing that remotely and then getting all the software she needs set up again.

This afternoon I bit the bullet, got on a phone call to Wisconsin and started on the process. The first step was getting the remote desktop sharing working. It took about 15 minutes to figure out that the old version of Crossloop on her machine wouldn't allow a connection to my newer one, but once that was clear I talked her through downloading the latest from the website, and we were up and running. Incidentally, one of the killer features of Crossloop is the complete lack of configuration: all she had to do was read off a 12-digit number and I was able to connect and take control.

Next, I set out to squash the spyware. I downloaded Spybot, did a little bit of head-scratching over the options, and started the scan. It was pretty slow, taking about 30 minutes to complete. Once it finished, I clicked on the fix problems button, and things got confusing. The Spybot registry watcher kept asking for confirmation about registry changes the Spybot scanner was making, and since there were several hundred this rapidly became a problem. I turned off the registry watcher, and the scanner then claimed to have fixed the issues it had uncovered. Unfortunately the spyware popup windows kept appearing, so I made sure the definitions were updated and ran another scan. After another 30 minutes it detected a different set of problems and fixed them, but still didn't squash the spyware.

At that point I did the research I should have done at the start, figured out this particular malware was named XP Internet Security 2010, and found a good blog post explaining how to remove it manually. I created and ran the suggested .reg file, and then downloaded the free version of Malwarebytes Anti-Malware. It took about 8 minutes to run a quick scan, and then it successfully removed the spyware!

After doing a little dance of joy, I looked through the settings to see if there was anything else I could do to protect the machine in the future. With McAfee, auto-updates and now Spybot's running protection, the only other recommendation I could think of was manually running Anti-Malware's scan every week.

As depressing as the spyware problem is (and yes, we'll be getting her a Mac next time), I'm amazed by the quality and workmanship of the free solutions out there. For all the black hats who waste our time and try to steal our money, there are dedicated folks like the Crossloop, Spybot and Malwarebytes teams offering free tools to help us fight back. Thanks to them all; I guess it's time to show my appreciation in the most sincere way, by upgrading to the paid versions!

How to prevent emails revealing your location

Photo by Upeslases

Today I received an email from a person who announced they wished to be anonymous, and didn't want to reveal which organization they worked for. They used Hotmail and a pseudonym to avoid revealing their identity, and asked some detailed questions. That left me very curious to know who I was replying to, so I checked the message headers, and they contained the IP address of the computer they were on. Running whois on that IP gave me the company they worked for, since they were apparently logged in from a work machine.

I'm not going to go into details on exactly how to do this sort of detective work; instead I want to focus on how to prevent information about your location from leaking into your email headers. The main culprits are the headers that show the IP address of the original machine the email came from. Here's an example that came from someone logged into Yahoo through a browser:

Received: from [76.95.184.187] by web50009.mail.re2.yahoo.com

And here's someone who emailed from Hotmail's website:

X-Originating-IP: [76.95.184.187]

If you use a desktop program like Outlook or Apple Mail with any account, the IP address of your machine is almost always included in a header that looks like the Yahoo example.
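
If you want to check whether your own messages leak an address, send yourself an email, view the raw headers, and look for the two patterns above. Here's a rough sketch that does that scan in PHP; the regexes only cover the Yahoo- and Hotmail-style headers shown here, so other providers may need extra patterns:

<?php
// Rough sketch: scan a file of raw email headers for originating IP addresses.
// Only covers the two header styles shown above; other providers may differ.
function find_originating_ips($rawHeaders) {
    $ips = array();
    if (preg_match_all('/^Received: from \[(\d+\.\d+\.\d+\.\d+)\]/mi', $rawHeaders, $matches)) {
        $ips = array_merge($ips, $matches[1]);
    }
    if (preg_match_all('/^X-Originating-IP: \[(\d+\.\d+\.\d+\.\d+)\]/mi', $rawHeaders, $matches)) {
        $ips = array_merge($ips, $matches[1]);
    }
    return array_unique($ips);
}

$headers = file_get_contents('message_headers.txt'); // paste your raw headers into this file
foreach (find_originating_ips($headers) as $ip) {
    // a reverse DNS lookup is often enough to spot your ISP or employer
    echo $ip.' => '.gethostbyaddr($ip)."\n";
}
?>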

Why should you care? That IP address will pinpoint your organization if you're within a company, or your ISP and a rough location if you're using broadband from home. If you're working on a side-project you want to keep separate from your employer, and they get hold of your sent emails, that header is proof that you were using work equipment on your idea and potentially gives them ownership when your startup becomes the next Google. And if your email with a doctor's note has an IP address in Cancun, you may have some questions to answer! (I actually ran across this flaw when I was looking at matching email contacts with other accounts, using geolocation on the IP address to figure out if it was the John Smith in Denver or LA, but I decided that was too creepy.)

What should you do? The simplest fix is to use Gmail. As far as I can tell they're the one mainstream provider that doesn't include the IP address in the headers. The Chinese hacking incidents show they're not a panacea for all your security problems, but they definitely seem to have got this right. There are a lot of other, more complex techniques that could safeguard your privacy, but if I were recommending something to a family member, I'd go with Google. You do need to be careful to log into the website interface when you want to send an anonymous email, though, since desktop programs tend to add the IP address anyway.

How to run visitor analytics on your Facebook fan page

Via Rocketboom

I built FanPageAnalytics because I was looking for a way to understand the audience of a fan page, and the traditional analytics solutions relied on Javascript that couldn't run within the Facebook environment. Happily, it looks like there are some new solutions emerging that use alternative methods to handle visitor tracking. My friend Eric Kirby pointed these two out to me today:

Google Analytics for Facebook Fan Pages

This is a great explanation of how to use the image fallback path that Google Analytics provides for situations where Javascript isn't available. I've come to prefer Clicky for free web analytics, since it gives you instant feedback rather than Google's delayed results, and avoids the 'space shuttle control panel' UI of Google's offering. I'll look into whether the same technique can be adapted for Clicky.
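
For context on how the image fallback works: instead of the Javascript tracker, you emit a plain image tag pointing at Google's __utm.gif endpoint, with the tracking data packed into the query string. The parameter names below are my reading of the no-Javascript tracking snippets and may be incomplete, so treat this as a sketch rather than a drop-in tag:

<?php
// Sketch of the image-fallback idea: build a __utm.gif URL server-side and
// emit it as a plain <img> tag, since Javascript can't run inside the page.
// The parameter names are my reading of the technique and may be incomplete.
$account  = 'UA-XXXXXX-X';            // your Google Analytics account id
$hostname = 'apps.facebook.com';      // hostname to report the hit against (assumption)
$page     = '/yourfanpage';           // virtual page path to report

$params = array(
    'utmwv' => '4.4sh',               // tracker version string used by the no-JS snippets
    'utmn'  => rand(0, 0x7fffffff),   // random id to stop the image being cached
    'utmhn' => $hostname,
    'utmp'  => $page,
    'utmac' => $account,
);
$trackingUrl = 'http://www.google-analytics.com/__utm.gif?'.http_build_query($params);

echo '<img src="'.htmlspecialchars($trackingUrl).'" width="1" height="1" alt="" />';
?>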

WebTrends Analytics for Facebook

Firmly aimed at the high-end market, WebTrends makes intriguing promises about the level of data they can collect, despite the lack of scripting and the caching of images. I'm very curious as to how they manage it; I may need to look at a page using their analytics if I can locate one. They're offering a webinar on March 3rd if you want to see a live demo.

I haven't used either of these myself yet, so I'd love to hear about your experiences with them, or any other alternatives you'd recommend.