Why you should visit Santa Cruz Island

Cavernpoint

Liz and I just got back from a four-day trip to Santa Cruz Island, helping to maintain the hiking trails. We drove all the way from Colorado to California for the opportunity, and we've been so many times we've lost count. On the drive, I was thinking about what keeps us coming back, and why I recommend it to anyone who loves the outdoors.

Dolphins

The trip over

To get to Santa Cruz Island you need to take an Island Packers boat from Ventura harbor. The trip only takes about an hour, but it packs in an amazing range of sea life. Just yesterday a humpback whale leapt out of the water and did a 180-degree twist only 200 feet from the boat; there are always hundreds of dolphins, and I've even had orcas approach close enough to bite me.

Morningview

Solitude on LA's doorstep

That's the view I wake up to every morning on the island. With no permanent inhabitants or cell reception, the only vehicles a few ranger trucks, and a hundred square miles to lose yourself in, Santa Cruz is heaven for anyone looking to get away from it all. Even better, you can be there in just a couple of hours from the center of LA, whether you want a quick day trip or a longer camp out.

There's no commercial presence there at all, no food stands, not even a soda machine, so you'll need to be prepared for a trip back to the 19th century, but it's worth it for the tranquility.

Islandfox

Watch a world recover

When we first started visiting almost a decade ago, the sheep had only just been removed and there were still wild pigs roaming everywhere. Ecologically it was a mess: the sheep had devoured almost all the native vegetation, leaving nothing but brown grass to cover the hills in the summer; the pigs were digging up the dirt in search of roots and causing the hillsides to erode; and with no predators the mice were everywhere. Now the pigs have all been eradicated, the golden eagles that preyed on them have been deported and bald eagles reintroduced, and the fennel thickets that choked the trails are gone. The difference over just a few years has been astonishing, with clumps of buckwheat, coreopsis and the unique island oaks popping up over previously bare hillsides. Even better, the indigenous island foxes have gone from endangered to campground pests in record time, with numbers up from a few hundred to 1,200 in just two years now that the golden eagles are no longer picking them off.

You'll never get another chance to see a whole National Park turn itself from a barren wasteland into a natural garden packed with plants and animals you'll find nowhere else. Get out there now while it's still in progress, and I guarantee you'll be amazed at the changes as you keep coming back.

Railing

Experience a dark history

I don't know if the island attracts crazy people, or if it turns normally sane people a little nuts, but you'll be surprised at how many of the people you meet there have ended up with a borderline obsession with the place. I'm one to talk, driving 1,200 miles to visit, but its recorded history is a long succession of feuds, disputes and dreams of private empires. The one man who built a successful ranching venture on the island left behind a family that squabbled for over a century, with lawsuits ricocheting back and forth for so long that most of the island was eventually sold off to pay the legal bills, and the final parcel was taken over after a dawn helicopter raid by a SWAT team in 1997! Long before that there's evidence of over 13,000 years of Chumash habitation, possibly the earliest in the Americas, before the population was taken to the mainland for easier control. There's so much archaeology that it's hard to walk anywhere without seeing evidence of a midden or worked chert fragments.

You'll need to be a big donor or volunteer with the Nature Conservancy before you can visit the main ranch situated on their land (their acquisition of that property was more fallout of legal feuding; the previous owner was determined to avoid being forced to sell to the NPS), but you can explore the smaller stations such as Scorpion and Smugglers, with century-old groves of olive and cypress trees to shelter under. There's also a new visitor center at Scorpion, with some amazing work by Exhibitology giving a fascinating look into the island's past.

I haven't even touched on the breathtaking hikes, secluded campgrounds like del Norte, or diving so spectacular that Jacques Cousteau considered it the best in the temperate world. If you need to refresh your soul (and are willing to risk developing a lifetime obsession), visit Santa Cruz Island.

Facebook data destruction

I'm sorry to say that I won't be releasing the Facebook data I'd hoped to share with the research community. In fact I've destroyed my own copies of the information, under threat of a lawsuit from Facebook.

As you can imagine I'm not very happy about this, especially since nobody ever alleged that my data gathering was outside the rules the web has operated by since crawlers existed. I followed their robots.txt directions, and was even helped by microformatting in the public profile pages. Literally hundreds of commercial search engines have followed the same path and have the same data. You can even pull identical information from Google's cache if you don't want to hit Facebook's servers. So why am I destroying the data? This area has never been litigated and I don't have enough money to be a test case.

Despite the bad taste left in my mouth by the legal pressure, I actually have some sympathy for Facebook. I put them on the spot by planning to release data they weren't aware was available. I know from my time at Apple that reaching for the lawyers is a tempting first option when there's a nasty surprise like that. If I had to do it all over again, I'd try harder not to catch them off-guard.

So what's the good news? From my conversations with technical folks at Facebook, there seems to be a real commitment to figuring out safeguards around the widespread availability of this data. They have a lot of interest in helping researchers find ways of doing worthwhile work without exposing private information.

To the many researchers I've disappointed, there's a whole world of similar data available from other sources too. By downloading the Google Profile crawling code you can build your own data set, and it's easy enough to build something similar for Twitter. I'm already in the middle of some new research based on public Buzz information, so this won't be stopping my work, and I still plan to share my source data with the research community in the future.

Flexible access to Gmail, Yahoo and Hotmail in PHP

Knittedmask
Photo by Poppalina

I've been a heavy user of the php-imap extension, but last year its limitations drove me to reimplement the protocol in native PHP. Because it's a compiled extension implementing a thin wrapper on top of the standard UW IMAP library, any changes meant working in C code and propagating them through several API layers. A couple of problems in particular forced me to resort to my own implementation:

– Yahoo uses a strange variant of IMAP on their basic accounts, where you need to send an extra undocumented command before you can log in.

– Accessing message headers only gives you the first recipient of an email, and getting any more requires a full message download. This is a design decision by the library, not a limitation of the protocol, and it severely slowed down access to the information Mailana needed.

I've now open-sourced my code as HandmadeIMAP, a name I chose to reflect the hand-crafted and somewhat idiosyncratic nature of the project. It's all pulled from production code, so while it isn't particularly pretty, it only implements the parts I need and focuses on supporting Gmail, Yahoo and Hotmail. On the plus side, it's been used pretty heavily and works well for my purposes, so hopefully it will prove useful to you too.
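If you're curious what driving IMAP by hand from PHP looks like, here's a stripped-down sketch of the general approach. This isn't HandmadeIMAP's actual API, just an illustration, and the server, account and password are placeholders: open a TLS socket, write tagged commands, and read lines back until the matching tag shows up.

<?php
// Minimal sketch of speaking IMAP directly from PHP over a TLS socket.
// Not HandmadeIMAP's API; the server, account and password are placeholders.

$host = 'ssl://imap.gmail.com'; // requires the OpenSSL extension
$port = 993;

$connection = fsockopen($host, $port, $errno, $errstr, 30);
if (!$connection) {
    die("Connection failed: $errstr ($errno)\n");
}

// Send a tagged command and read responses until the tagged completion line.
function send_command($connection, $tag, $command) {
    fwrite($connection, "$tag $command\r\n");
    $result = '';
    while (!feof($connection)) {
        $line = fgets($connection, 4096);
        $result .= $line;
        if (strpos($line, $tag.' ') === 0) {
            break; // e.g. "a1 OK LOGIN completed"
        }
    }
    return $result;
}

echo fgets($connection, 4096); // the server greeting

// A Yahoo-style variant would need its extra undocumented command sent here,
// before the LOGIN - exactly the kind of change that's painful through the
// compiled extension and trivial with a hand-rolled client.
echo send_command($connection, 'a1', 'LOGIN youremail@gmail.com yourpasswordhere');
echo send_command($connection, 'a2', 'LIST "" "*"');
echo send_command($connection, 'a3', 'LOGOUT');

fclose($connection);

Because every request funnels through one code path like this, pulling full recipient lists or any other header is just a matter of issuing the right FETCH command rather than working around the library's choices.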

I'm also hoping to use it as a testing ground for Gmail's new OAuth extension to IMAP, which makes it possible to grant mailbox access without handing over your password. Because new commands can easily be inserted, it should be possible to follow the new specification once the correct tokens have been received; I'll let you know how that progresses.

To give it a try for yourself, just download the code and run the test script, e.g.:

php handmadeimaptest.php -u youremail@gmail.com -p yourpasswordhere -a list -d

Thinking by Coding

Thinker
Photo by Sidereal

David Mandell took me under his wing at Techstars last summer, and we ended up spending a lot of time together. One day towards the end of the program he told me "Pete, you think by coding". That phrase stuck with me because it felt true, and summed up a lot of both my strengths and weaknesses.

I choose my projects by finding general areas that have interesting possibilities for fun and profit, but then almost immediately start hacking on code to figure out the details of the technical landscape. I find it intensely frustrating to sit and discuss the theory of what we should be building before we have a good idea of what's technically possible. Just as no battle plan survives contact with the enemy, so no feature list on a technically innovative product survives the realities of the platform's limitations.

I've never been a straight-A student, and I'm not a deep abstract thinker, but I have spent decades obsessed with solving engineering problems. I think at this point that's probably affected the wiring of my brain, or at least given me a good mental toolkit to apply to those sort of puzzles. I find myself putting almost any issue in programming terms if I want to get a handle on it. For example, the path to sorting out my finances suddenly became clear when I realized I could treat it as a code optimization problem.

It also helps explain why I'm so comfortable open-sourcing my code. For me, the value I get out of creating any code is the deep understanding I have to achieve to reach a solution. The model remains in my head, the actual lines of code are comparatively easy to recreate once I've built the original version. If I give away the code, anyone else still needs to make a lot of effort to reach the level of understanding where they can confidently make changes, so the odds are good they'll turn to me and my mental model. My open-source projects act as an unfakeable signal that I really understand these domains.

This approach has served me well in the past, but it can be my Achilles heel. Most internet businesses these days don't require technical innovation. They're much more likely to die because no potential customers are interested than because they couldn't get the engineering working. Market risk used to be an alien concept to me, but the dawning realization that I kept building solutions to non-existent problems drove me into the arms of Customer Development.

I feel like I've now been punched in the face often enough by that issue that I've (mostly) learned to duck, but I'm also aware that I'm still driven by techno-lust. I love working in areas where the fundamental engineering rules are still being figured out, where every day I have the chance to experiment and discover something nobody else had known before. I can already feel most of the investors I've met shaking their heads sadly. I'm well aware that starting a revenue-generating business is almost impossible anyway, let alone exposing yourself to the risk and pain of relying on bleeding-edge technology. The trouble is, that's what my startup dream has always been. I want to do something genuinely fresh and new with the technology, and hope to be smart and fast enough to build a business before the engineering barriers to entry crumble.

I don't know if I'd recommend my approach to anyone else, but it seems baked in to who I am. Oddly enough, I didn't fully understand that until I wrote this post. Perhaps there's some thinking that I can only do by writing too?

How to gather the Google Profiles data set

Beebody
Photo by Max Xx

With the launch of Buzz, millions of people have created Google public profiles. These contain detailed personal information, including name, a portrait, location, job title and employer, where you were born, where you've lived since, links to any blogs and other sites associated with you, and some public Buzz comments. All of this information is public by default, and Google micro-formats the page to make it easier for crawlers to understand, allows crawling in robots.txt, and even provides a directory listing to help robots find all the profiles (which is actually their recommended way to build a firehose of Buzz messages).

This sort of information is obviously a gold-mine for researchers interested in things like migration and employment patterns, but I've been treading very carefully since this is people's personal information. I've spent the last week emailing people I know at Google, posting on the public Buzz API list, even contacting the various government privacy agencies who've been in touch, but with no replies from anyone.

Since it's now clear that there's a bunch of other people using this technique, I'm open-sourcing my implementation as BuzzProfileCrawl. As you can tell from looking at the code, this is not rocket science, just some simple regular expressions run on each page as it's crawled.
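To make 'simple regular expressions' concrete, here's a rough sketch of that style of crawling. It isn't the actual BuzzProfileCrawl code; the profile URL is hypothetical, and the class names are standard hCard microformat ones that may not match Google's markup exactly.

<?php
// Rough sketch of the regex-on-each-page approach (not the actual
// BuzzProfileCrawl code). The profile URL and the markup patterns below
// are assumptions for illustration only.

$profileUrl = 'http://www.google.com/profiles/some.user'; // hypothetical profile
$html = file_get_contents($profileUrl);
if ($html === false) {
    die("Couldn't fetch $profileUrl\n");
}

$result = array('url' => $profileUrl);

// Microformatted pages label fields with predictable class names, so simple
// expressions like these pull out the interesting values.
if (preg_match('/class="fn"[^>]*>([^<]+)</', $html, $matches)) {
    $result['name'] = trim($matches[1]);
}
if (preg_match('/class="locality"[^>]*>([^<]+)</', $html, $matches)) {
    $result['location'] = trim($matches[1]);
}

print json_encode($result)."\n";

Point it at a list of profile URLs pulled from the directory listing, add patterns for whichever other fields you care about, and you have the core of a crawler.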

We need to have a debate on how much of this information we want exposed, and on how to balance innovation against privacy, but the first step is making it clear how much is already out there. There's a tremendous mismatch between what's technologically possible and ordinary people's expectations. I hope this example helps spark an actual debate on this, rather than the current indifference.

How I removed spyware from 1,000 miles away with Crossloop

Eyespy
Picture by Ocular Invasion

Way back in '06 one of my first blog posts was a review of Crossloop, a free and awesomely user-friendly remote desktop application for Windows. Ever since then I've made sure to install it on any Windows machine I might ever have to provide support for, and today it saved my bacon yet again.

A few years ago, we bought a new laptop for Liz's mom. She's pretty computer-savvy, but since she was used to Outlook Express and Word we didn't want to switch her over to OS X, so it was an XP machine. I did the standard things to secure it: made certain automatic updates were running, bought McAfee, and made Firefox the default browser. It doesn't look like that's enough any more, since yesterday a trojan slipped through and she was bombarded with bogus anti-spyware popups whenever she did anything on the machine. She knew something wasn't kosher and gave us a call to find out what she should do.

The description made my heart sink. In the past I'd ended up spending 12 hours straight getting a stubborn piece of spyware off Liz's old laptop, and her mom lives over 1000 miles away in Wisconsin. Since my Windows knowledge is way out-of-date I put a call out to Twitter for software suggestions, and got the usual high quality of advice. The top pick was Spybot Search and Destroy, with 'nuke the machine and reinstall' a strong second! I tend to do the latter for my personal machines, since even OS X gets pretty unpredictable if you keep doing incremental updates over multiple OS revisions, but I didn't relish doing that remotely and getting the software she needs re-setup as well.

This afternoon I bit the bullet, got on a phone call to Wisconsin and started on the process. The first step was getting the remote desktop sharing working. It took about 15 minutes to figure out that the old version of Crossloop on her machine wouldn't allow a connection to my newer one, but once that was clear I talked her through downloading the latest from the website, and we were up-and-running. Incidentally one of the killer features of Crossloop is the complete lack of configuration, all she had to do was read off a 12 digit number and I was able to connect and take control.

Next, I set out to squash the spyware. I downloaded Spybot, did a little bit of head-scratching over the options, and started the scan. It was pretty slow, taking about 30 minutes to complete. Once it finished, I clicked on the 'fix problems' button, and things got confusing. The Spybot registry watcher kept asking for confirmation about registry changes the Spybot scanner was making, and since there were several hundred of them this rapidly became a problem. I turned off the registry watcher, and it claimed to have fixed the issues it had uncovered. Unfortunately the spyware popup windows still kept appearing, so I made sure the definitions were updated and ran another scan. After another 30-minute pass, it detected a different set of problems and fixed them, but still didn't squash the spyware.

At that point I did the research I should have done at the start, figured out this particular malware was named XP Internet Security 2010, and found a good blog post explaining how to remove it manually. I created and ran the suggested .reg file, and then downloaded the free version of Malwarebytes Anti-Malware. It took about 8 minutes to run a quick scan, and then it successfully removed the spyware!

After doing a little dance of joy, I looked through the settings to see if there was anything else I could do to protect the machine in the future. With McAfee, auto-updates and now Spybot's running protection, the only other recommendation I could think of was manually running Anti-Malware's scan every week.

As depressing as the spyware problem is (and yes, we'll be getting her a Mac next time), I'm amazed by the quality and workmanship of the free solutions out there. For all the black hats who waste our time and try to steal our money, there's dedicated folks like the Crossloop, Spybot and Malwarebytes teams offering free tools to help us fight back. Thanks to them all, I guess it's time to show my appreciation in the most sincere way, by upgrading to the paid versions!

How to prevent emails revealing your location

Wrestlingmask
Photo by Upeslases

Today I received an email from someone who announced they wished to be anonymous, and didn't want to reveal which organization they worked for. They used Hotmail and a pseudonym to avoid revealing their identity, and asked some detailed questions. That left me very curious to know who I was replying to, so I checked the message headers, and they contained the IP address of the computer they were on. Running whois on that IP gave me the company they worked for, since they were apparently logged in from a work machine.

I'm not going to go into details on exactly how to do this sort of detective work; instead I want to focus on how to prevent information about your location from leaking into your email headers. The main culprits are headers that show the IP address of the original machine the email came from. Here's an example that came from someone logged into Yahoo through a browser:

Received: from [76.95.184.187] by web50009.mail.re2.yahoo.com

And here's someone who emailed from Hotmail's website:

X-Originating-IP: [76.95.184.187]

If you use a desktop program like Outlook or Apple Mail with any account, the IP address of your machine is almost always included in a header that looks like the Yahoo example.
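If you're curious what your own provider is revealing, here's a minimal sketch that pulls those headers out of a saved raw message. The file name is just a placeholder for a message you've exported from your own sent mail.

<?php
// Scan a saved raw email for originating IP addresses in its headers.
// 'message.eml' is a placeholder for a message exported from your own
// sent mail, so you can see what your provider is adding.

$raw = file_get_contents('message.eml');
if ($raw === false) {
    die("Couldn't read message.eml\n");
}

// The header block ends at the first blank line.
$parts = preg_split('/\r?\n\r?\n/', $raw, 2);
$headers = $parts[0];

$found = array();

// Hotmail-style explicit header.
if (preg_match('/^X-Originating-IP:\s*\[?([0-9.]+)\]?/mi', $headers, $matches)) {
    $found['X-Originating-IP'] = $matches[1];
}

// Yahoo-webmail and desktop-client style "Received: from [1.2.3.4] by ..." lines.
if (preg_match_all('/^Received:\s*from\s+\[([0-9.]+)\]/mi', $headers, $matches)) {
    $found['Received'] = $matches[1];
}

print_r($found);

Run it against a few messages sent from your different accounts and you'll quickly see which ones are leaking.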

Why should you care? That IP address will pinpoint your organization if you're within a company, or your ISP and a rough location if you're using broadband from home. If you're working on a side-project you want to keep separate from your employer, and they get hold of your sent emails, that header is proof that you were using work equipment on your idea and potentially gives them ownership when your startup becomes the next Google. And if your email with a doctor's note has an IP address in Cancun, you may have some questions to answer! (I actually ran across this flaw when I was looking at matching email contacts with other accounts, using geolocation on the IP address to figure out if it was the John Smith in Denver or LA, but I decided that was too creepy)

What should you do? The simplest fix is to use Gmail. As far as I can tell they're the one mainstream provider that doesn't include the IP address in the headers. The Chinese hacking incidents show they're not a panacea for all your security problems, but they definitely seem to have got this right. There's a lot of other more complex techniques that could safeguard your privacy, but if I was recommending something to a family member, I'd go with Google. You do need to be careful that you log into the website interface when you want to send an anonymous email though, since desktop programs tend to add the IP address anyway.

How to run visitor analytics on your Facebook fan page

Becomeafan
Via Rocketboom

I built FanPageAnalytics because I was looking for a way to understand the audience of a fan page, and the traditional analytics solutions relied on Javascript that couldn't run within the Facebook environment. Happily it looks like there's some new solutions emerging that use alternative methods to handle visitor tracking. My friend Eric Kirby pointed these two out to me today:

Google Analytics for Facebook Fan Pages

This is a great explanation of how to use the image fallback path that Google Analytics provides for situations where Javascript isn't available. I've come to prefer Clicky for free web analytics, since it gives you instant feedback rather than Google's delayed results, and avoids the 'space shuttle control panel' UI of Google's offering. I'll look into whether the same technique can be adapted for Clicky.
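For the curious, the basic shape of the image fallback is simple: build the __utm.gif tracking-image URL yourself and embed it as a plain img tag, which survives environments that strip Javascript. In the sketch below the parameter names are my recollection of Google's documented no-Javascript fallback, so treat them as assumptions and check the current docs; the account id, hostname and page path are placeholders.

<?php
// Sketch of image-based tracking: request Google Analytics' __utm.gif
// directly instead of relying on the Javascript tracker. Parameter names
// are assumptions based on the documented no-Javascript fallback.

$account  = 'UA-XXXXXX-X';        // placeholder GA account id
$hostname = 'apps.facebook.com';  // placeholder host for the fan page
$page     = '/my-fan-page';       // placeholder page path to report

$params = array(
    'utmwv' => '4.3',                // tracker version
    'utmn'  => rand(0, 0x7fffffff),  // random number to defeat caching
    'utmhn' => $hostname,
    'utmp'  => $page,
    'utmac' => $account,
);

$trackingUrl = 'http://www.google-analytics.com/__utm.gif?'.http_build_query($params);

// A plain image tag is all the page needs to carry.
echo '<img src="'.htmlspecialchars($trackingUrl).'" width="1" height="1" alt="" />';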

WebTrends Analytics for Facebook

Firmly aimed at the high-end market, WebTrends makes intriguing promises about the level of data they can collect, despite the lack of scripting and the caching of images. I'm very curious as to how they manage it; I may need to look at a page using their analytics if I can locate one. They're offering a webinar on March 3rd if you want to see a live demo.

I haven't used either of these myself yet, so I'd love to hear about your experiences with them, or any other alternatives you'd recommend.

MapReduce for Idiots: The Musical

I've just uploaded an audio slideshow of the talk I gave at Gnip last week, covering why MapReduce really isn't scary and why you should be looking into it for your own problems. I'll be refining this talk and going into more technical detail in further posts, but my first goal was to convince people that they shouldn't be put off by the fog of mystery that surrounds the process. It's become incredibly easy and cheap over the last few years, and I regret not using it earlier. There's also some code to accompany the talk here.
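To give a taste of how little code is involved, here's the classic word-count example written as a PHP mapper and reducer in the style Hadoop Streaming expects (any program that reads stdin and writes tab-separated key/value pairs will do). This isn't the code that accompanies the talk, just the smallest thing that shows the shape of a job.

<?php
// wordcount.php - MapReduce "hello world" as a Hadoop Streaming-style script.
// Local test: cat input.txt | php wordcount.php map | sort | php wordcount.php reduce

$mode = isset($argv[1]) ? $argv[1] : 'map';

if ($mode === 'map') {
    // Mapper: emit "word<TAB>1" for every word on every input line.
    while (($line = fgets(STDIN)) !== false) {
        $words = preg_split('/\W+/', strtolower($line), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            print $word."\t1\n";
        }
    }
} else {
    // Reducer: input arrives sorted by key, so just total up each run of words.
    $currentWord = null;
    $count = 0;
    while (($line = fgets(STDIN)) !== false) {
        list($word, $value) = explode("\t", rtrim($line, "\n"));
        if ($word !== $currentWord) {
            if ($currentWord !== null) {
                print $currentWord."\t".$count."\n";
            }
            $currentWord = $word;
            $count = 0;
        }
        $count += (int)$value;
    }
    if ($currentWord !== null) {
        print $currentWord."\t".$count."\n";
    }
}

The mapper and reducer know nothing about distribution; the framework handles splitting the input, shuffling the intermediate pairs and collecting the results, which is why each piece stays this small.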

OK, so I don't actually sing, but you should be thankful for that; just ask the folks who invited me over for a karaoke New Year's party. They'd all come from the Philippines, and I never knew karaoke was so big over there; they were amazing performers. They kept hoping that maybe I'd be able to carry The Beatles, being British and all. It wasn't pretty.

Robots and Stalking 2.0

Purplerobot
Photo by Peyri

Web crawlers used to be restricted to big companies, purely because the cost was so prohibitive. Now anyone with a few thousand dollars and an Amazon Web Services account can crawl and analyze hundreds of millions of pages. The old honor system that worked when it was just Google, Yahoo and Microsoft with access to that data won't cut it.

My approach has been to avoid dealing with companies that seem spammy or scammy, and to try to work out in the open. I'm no saint; I'd like to make a living from the insights I can gather from public profiles, but I'm also very aware that most people don't know how much they're exposing.

The problem is that information that was made available to help search engines can also be fed into sophisticated analysis pipelines to produce much deeper and potentially more invasive data sets. What does that mean in practice? Just using public profile data, you could use gender-guessing and portrait images to produce HotOrNot 2.0. Getting even creepier, it would be possible to match up interests, locations and even friends in common using that same data to produce a great tool for stalkers and perverts. Intuitively that all seems very wrong, but since it's technically straightforward I'm certain somebody out there is already working on it.

So how can we respond to this new world?

Expand robots.txt

If there's personal information on a page, make it clear that there are privacy implications for handling it, and lay out some rules. I don't know exactly what this would look like, but a noanalyze or indexonly meta tag that worked like noarchive might be a good start: a polite request that the crawler only use the information for serving direct user searches. Like all robots.txt directives it wouldn't be enforceable, but it would give clear guidance and give networks a stick to beat violators with.

Look backwards

We've lived with our names and addresses in public phone books for over a century, despite the potential for abuse by time-traveling robot hitmen. We mitigated the risks by adopting some simple tricks that might also work in the internet world. How about using just an initial for your first name to limit gender identification? There was also a very clear process for 'hide me': going ex-directory was standardized and easy to understand, not complex and constantly changing like the space shuttle control panel that most privacy dialogs now resemble.

Obfuscate sensitive information

Keeping email addresses as images is an old trick, but if there's information you'd like to show to humans on public profiles without having it stored by robots, why not use the same technique for that too? It's far from perfect, but it makes grabbing that data much tougher and slower. You can also use Javascript to make it harder for a crawler to pull the information while still leaving it as text.
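As a sketch of the idea, here's a little PHP script (assuming the GD extension) that looks a string up by an opaque id and serves it as a PNG, so the text itself never appears in the page markup. The id, the lookup array and the script name are all placeholders.

<?php
// Minimal sketch of serving text as an image (assumes the GD extension).
// The text is looked up by an opaque id so it never appears in the page
// markup; the $secrets array is a stand-in for a real database lookup.

$secrets = array('a1b2' => 'someone@example.com');

$id = isset($_GET['id']) ? $_GET['id'] : '';
if (!isset($secrets[$id])) {
    header('HTTP/1.0 404 Not Found');
    exit;
}
$text = $secrets[$id];

$font   = 5; // largest built-in GD font
$width  = imagefontwidth($font) * strlen($text) + 10;
$height = imagefontheight($font) + 10;

$image = imagecreatetruecolor($width, $height);
$background = imagecolorallocate($image, 255, 255, 255);
$foreground = imagecolorallocate($image, 30, 30, 30);

imagefilledrectangle($image, 0, 0, $width - 1, $height - 1, $background);
imagestring($image, $font, 5, 5, $text, $foreground);

header('Content-Type: image/png');
imagepng($image);
imagedestroy($image);

On the profile page you'd then embed it with an ordinary img tag pointing at the script, something like <img src="email_image.php?id=a1b2" alt="">, where email_image.php is whatever you name it.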

Keep changing links and ids

There's no reason that the id for a public profile has to have anything in common with the actual user id, or that the portrait image URL can't be a redirect that changes every two weeks. Keeping the public and private worlds unconnected makes it much harder to subvert the privacy constraints.
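Here's a sketch of one way to do that rotation: derive a public token from the internal user id plus the current two-week period, and have a small redirect script map tokens back to the real resource. The secret, the user list and the image path are all placeholders.

<?php
// Sketch of the "keep changing links" idea: the public URL for a portrait is
// an opaque token that rotates every two weeks, and a redirect script maps
// it back to the real image. Internal ids never appear in public markup.
// The secret, the user list and the image path below are placeholders.

$secret = 'replace-with-a-long-random-server-side-secret';

// The token a given user shows to the world right now.
function public_token($userId, $secret) {
    $period = floor(time() / (14 * 24 * 60 * 60)); // rolls over every two weeks
    return hash_hmac('sha256', $userId.':'.$period, $secret);
}

// redirect.php?t=<token> : rebuild the current token for each known user and
// redirect if one matches. A real system would keep a token-to-user table
// refreshed each period instead of looping over everyone like this.
$userIds = array(42, 43, 44); // placeholder for a database query
$requested = isset($_GET['t']) ? $_GET['t'] : '';

foreach ($userIds as $userId) {
    if ($requested !== '' && public_token($userId, $secret) === $requested) {
        header('Location: /portraits/'.$userId.'.jpg'); // stand-in for the real image location
        exit;
    }
}

header('HTTP/1.0 404 Not Found');

Once the period rolls over, every old token quietly stops working, so any mapping a crawler has built between a public URL and a person goes stale on its own.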