MapReduce for Idiots: The Musical

I've just uploaded an audio slideshow of the talk I gave at Gnip last week, covering why MapReduce really isn't scary and why you should be looking into it for your problems. I'll be refining this talk and going into more technical detail in further posts, but my first goal was to convince people that they shouldn't be put off by the fog of mystery that surrounds the process. It's become incredibly easy and cheap over the last few years, and I regret not using it earlier. There's also some code to accompany the talk here.

OK, so I don't actually sing, but you should be thankful for that. Just ask the folks who invited me over for a karaoke New Year's party. They'd all come from the Philippines, and I never knew karaoke was so big over there; they were amazing performers. They kept hoping that maybe I'd be able to carry The Beatles, being British and all. It wasn't pretty.

Robots and Stalking 2.0

Photo by Peyri

Web crawlers used to be restricted to big companies, purely because the cost was so prohibitive. Now anyone with a few thousand dollars and an Amazon Web Services account can crawl and analyze hundreds of millions of pages. The old honor system that worked when it was just Google, Yahoo and Microsoft with access to that data won't cut it.

My approach has been to avoid dealing with companies that seem spammy or scammy, and to try to work out in the open. I'm no saint; I'd like to make a living from the insights I can gather from public profiles, but I'm also very aware that most people don't know how much they're exposing.

The problem is that information that was made available to help search engines can also be fed into sophisticated analysis pipelines to produce much deeper and potentially more invasive data sets. What does that mean in practice? Just using public profile data, you could use gender-guessing and portrait images to produce HotOrNot 2.0. Getting even creepier, it would be possible to match up interests, locations and even friends in common using that same data to produce a great tool for stalkers and perverts. Intuitively that all seems very wrong, but since it's technically straightforward I'm certain somebody out there is already working on it.

So how can we respond to this new world?

Expand robots.txt

If there's personal information on a page, make it clear that there are privacy implications for handling it, and lay out some rules. I don't know exactly what this would look like, but a noanalyze or indexonly meta tag that worked like noarchive might be a good start. It would be a polite request that the crawler only use the information for serving direct user searches. Like all robots.txt directives it's not enforceable, but it would give clear guidance and give networks a stick to beat violators with.
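To make the idea concrete, here is a rough sketch of how a well-behaved crawler might check for such a tag before feeding a page into an analysis pipeline. The noanalyze and indexonly directives are only the proposal from the paragraph above, not part of any existing standard, and the function names robots_directives() and may_analyze() are made up for illustration.

```php
<?php
// Collect the directives from every <meta name="robots" content="..."> tag.
// 'noanalyze' and 'indexonly' are hypothetical directives proposed in the
// post; only 'noarchive' and friends exist today.
function robots_directives(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is often malformed, so silence the parser warnings.
    @$doc->loadHTML($html);

    $directives = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) !== 'robots') {
            continue;
        }
        foreach (explode(',', $meta->getAttribute('content')) as $token) {
            $token = strtolower(trim($token));
            if ($token !== '') {
                $directives[] = $token;
            }
        }
    }
    return array_values(array_unique($directives));
}

// A polite crawler would skip deeper analysis when either directive is set.
function may_analyze(string $html): bool
{
    $directives = robots_directives($html);
    return !in_array('noanalyze', $directives, true)
        && !in_array('indexonly', $directives, true);
}
```

A crawler could call may_analyze() on each fetched page and still index it for search when the answer is false. Nothing here enforces anything, which matches the honor-system nature of the proposal.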

Look backwards

We've lived with our names and addresses in public phone books for over a century, despite the potential for abuse by time-traveling robot hitmen. We mitigated the risks by adopting some simple tricks that might also work in the internet world. How about just an initial for your first name to limit gender identification? There was also a very clear process for 'hide me', going ex-directory, that was standardized and easy to understand, not complex and constantly changing like the space shuttle control panel that most privacy dialogs now resemble.

Obfuscate sensitive information

Keeping email addresses as images is an old trick, but if there's information you'd like to show to humans on public profiles but not have stored by robots, why not use the same technique for that too? It's far from perfect, but it makes grabbing that data much tougher and slower. You can also use JavaScript to make it harder for a crawler to pull the information, while still leaving it as text for the reader.
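As a minimal sketch of the JavaScript variant, the PHP helper below emits an empty placeholder plus a small script that fills in the text in the browser, so the raw value never appears in the HTML source. The function name obfuscate_for_crawlers() is hypothetical, and the sketch assumes plain ASCII text, since the browser's atob() does not decode multi-byte characters correctly on its own.

```php
<?php
// Hypothetical helper: returns HTML that shows $text to a browser running
// JavaScript, but keeps the plain value out of the page source.
// Assumes $text is ASCII; base64 output is safe inside a JS string literal.
function obfuscate_for_crawlers(string $text): string
{
    $encoded = base64_encode($text);
    // Random element id so several obfuscated fields can share a page.
    $id = 'obf-' . bin2hex(random_bytes(4));

    return '<span id="' . $id . '"></span>'
        . '<script>document.getElementById("' . $id . '").textContent = '
        . 'atob("' . $encoded . '");</script>';
}
```

A crawler that executes scripts, or simply decodes base64, will still get the text, so this only raises the cost of harvesting rather than preventing it.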

Keep changing links and ids

There's no reason that the id for a public profile has to have anything in common with the actual user id, or that the portrait image URL can't be a redirect that changes every two weeks. Keeping the public and private worlds unconnected makes it much harder to subvert the privacy constraints.
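One way to get a rotating public id is to derive it from the real user id, a server-side secret and the current two-week period. The sketch below is an assumption about how that could look, not a description of any particular network: public_profile_token() and its parameters are invented for the example.

```php
<?php
// Hypothetical helper: derive a public profile token that changes every
// $periodSeconds (default 14 days = 1,209,600 seconds) and reveals nothing
// about the underlying user id without the secret.
function public_profile_token(
    int $userId,
    string $secret,
    int $timestamp,
    int $periodSeconds = 1209600
): string {
    // All timestamps inside the same two-week window share one period number.
    $period = intdiv($timestamp, $periodSeconds);
    $mac = hash_hmac('sha256', $userId . ':' . $period, $secret);
    // Shorten the hex digest to keep URLs manageable.
    return substr($mac, 0, 20);
}
```

Calling public_profile_token($userId, $secret, time()) gives the token for the current period. Because the HMAC is one-way, the server would also need a lookup table from current tokens back to users in order to serve the redirect.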

VCs I’ll be avoiding


I've been getting weekly emails from YoungStartup.com. The funny thing is, they're coming through on an account I only use for DNS registration purposes, and I certainly didn't sign up for them. I normally shrug off this kind of thing, but then I actually looked at the details of this one.

$175 to get in the door for entrepreneurs?! If your idea's any good and you have a little bit of hustle, you'll be able to get a meeting with any VC who might be interested. If neither of those is true, then VC meetings won't help you anyway. Whilst not nearly as bad as the ridiculous fees for presenting to some angel forums, I don't believe this is a sensible use of your precious startup funds, and it definitely doesn't leave me with a good impression of the VCs involved.

Social network data and research

Image by Yankee in Canada

One of my favorite parts of my own Facebook research has been discovering existing work in this area that I didn't know about. Here are some of the most interesting papers and resources:

Inference of Profile Elements of Individuals Using Publicly Available Social Web Data

Using Rapleaf's massive data store of publicly-available social network data, Piotr Kozikowski wrote his master's thesis on inferring attributes like gender, location and age from other known information about a person.

http://current.cs.ucsb.edu/facebook/

Contains details on the EuroSys '09 academic data set containing both connections and interactions for Facebook.

Real-world separation effects in an online social network

A paper on how geography influences social networks, using public friendship data from 30,000 users of a German social network.

http://randomwalker.info/social-networks/Notes_on_data_acquisition.html

Arvind's got a few notes about the LiveJournal, Twitter and Flickr data they're using. It sounds like Mislove has been willing to share LiveJournal network data with other academics in the past.

http://overstated.net/

Cameron Marlow is the head of Facebook's data mining team, and covers their internal research on his blog.

Finally, it's in a different area, but one of the scariest datasets I've run across is the Enron collection of 500,000 emails released as part of the investigation. I was a heavy user of this for developing my email services, but I'm still amazed it's out there!

How to run CURL fetches in parallel in PHP

Photo by Incurable Hippie

PHP is my workhorse language, for a lot of reasons I'll need to blog about soon. One problem I keep running into though is how serial it is, especially when it comes to making CURL calls to access web APIs. FindByEmail is a great example; I'm calling over a dozen different APIs, and I have to wait for each one to finish before I can call the next. It can easily take 20 seconds to run the script, where almost all the time is spent idly waiting for a single API call to complete.

There is a solution for this sort of problem: curl_multi_exec() lets you fire off multiple CURL requests at once. Unfortunately the interface is awful. It feels like I might as well be writing in C, which is unsurprising since it's a thin layer over the underlying C library. Typically I'm going through some inputs and fetching URLs as I go, so to speed up that sort of task I wrote a much simpler interface on top of the curl_multi engine. ParallelCurl lets you just specify a URL and a callback function, and handles all the mechanics of running your requests simultaneously.

To give you an idea of the possible speedup, the test script makes 100 calls to Google's search API, takes nearly two minutes without parallelization, and runs in 11 seconds with 20 requests running at once. Of course you need to be careful not to overwhelm your target server!

The code's up on github, and using it is as simple as calling:

$parallelcurl->startRequest('http://example.com', 'on_request_done', array('something'));
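For anyone curious what that call saves you from, here is a stripped-down sketch of the raw curl_multi loop. It is not the ParallelCurl source: fetch_all() and its callback signature are made up for this example, and unlike ParallelCurl it starts every request at once rather than capping how many run simultaneously.

```php
<?php
// Hypothetical helper: fetch every URL in parallel, then call
// $onDone($content, $url, $info) once per URL when all have finished.
function fetch_all(array $urls, callable $onDone, int $timeoutSeconds = 10): void
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        // Return the body as a string instead of printing it.
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeoutSeconds);
        curl_multi_add_handle($multi, $ch);
        $handles[] = [$ch, $url];
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        // Wait for socket activity; some platforms return -1 immediately,
        // so sleep briefly in that case to avoid a busy loop.
        if ($running > 0 && curl_multi_select($multi, 1.0) === -1) {
            usleep(1000);
        }
    } while ($running > 0 && $status === CURLM_OK);

    foreach ($handles as [$ch, $url]) {
        $onDone(curl_multi_getcontent($ch), $url, curl_getinfo($ch));
        curl_multi_remove_handle($multi, $ch);
    }
}
```

With a long list of URLs you would want to add handles in batches, which is the bookkeeping a wrapper like ParallelCurl takes off your hands, and which also keeps you from overwhelming the target server.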

What’s wrong with confidence?

Photo by Leo Reynolds

I come from Britain, a country swimming in self-deprecation, self-doubt and cheerful pessimism, so it's been tough for me to learn the confidence I need to get things done. Often the confidence has to come from a gut instinct before there's any real evidence; as Tom Evslin says, nothing great has ever been accomplished without irrational exuberance!

What makes it hard is that you need confidence to take on seemingly impossible tasks, but most of the time your initial confidence is misplaced and they really are impossible. That doesn't mean you give up; it just means that as you learn more about the facts you have to change your approach or even your goals. I've spent a lot of time figuring out tools I can use to help me decide.

Most of them boil down to actively hunting down evidence. What I love about Customer Development is that you collect meaningful user feedback early and often. Even with only a few users, the Net Promoter Score and other surveys I've run have astonished me, and forced radical rethinks. I'm also a disciple of Bob Sutton's work. He's spent his entire career on a crusade for evidence-based decisions, and I love his phrase "Strong opinions, weakly held". If you don't have any confidence or certainty, you'll never persuade anyone to act or really test your own ideas, but if you drink too much of your own kool-aid, confirmation bias makes you ignore important evidence against your theory.

So, where am I going with this? This is in the front of my mind because I'm sitting banging my head against the table over the recent developments in the global warming debate. It's obvious from reading the leaked emails that the IPCC and the University of East Anglia have their share of hubris, but I'm driven crazy by how evidence doesn't even seem to feature in the skeptics' arguments. There's an objective reality; either man-made global warming is occurring or it isn't. This isn't a philosophical or political debate (though the question of the right response is). We have a whole process for determining physical questions like this, called science. It's flawed and uncertain, and the ClimateGate scientists were perverting the process by trying to keep critics away from their data, but it's built on a base of evidence. If you want to dispute the conclusions, you need to dispute that evidence, otherwise it's all just about who shouts the loudest.

Since I'm old enough to remember the coming Ice Age I've got a natural inclination towards skepticism, so I've spent some time going back to the basic papers underlying the consensus. Nobody I'm aware of is disputing the observed rise in CO2 levels, now the highest in 2 million years, and up about 30% in a century. I then dug out a paper that tries to quantify how much heat CO2 and other components in the atmosphere trap. It shows that CO2 is responsible for retaining a significant amount of heat, and again I haven't been able to find any evidence that this is in dispute. Volcanic eruptions are a natural experiment demonstrating that changing the atmosphere's composition can change the climate.

Take these conclusions as facts, and you've got significantly rising levels of a gas known to contribute to warming. Even if you throw out everything else related to temperature measurements and climate models, it's hard to escape concluding that there's a significant possibility that we're affecting the climate. There are legitimate arguments around exactly what will happen and what we should do, but if you're going to assert that global warming is a myth, you have to tackle the evidence directly. I would love to read reasoned arguments based on solid evidence about why I'm wrong, but the few I find tend to evaporate into mirages upon closer inspection.

Confidence is vital for getting things done, but it has to be a spur to test your theories, not a lazy substitute for gathering evidence. I'm very aware of this trap because it's one I keep falling into! I look back at the last two years of building Mailana, and I've lost months chasing lost causes because I didn't see what was right in front of my nose. The skeptics' attacks on AGW remind me so much of those wild goose chases, driven by a firm belief with no foundation. Please, throw evidence at me, but I'm tired of confident assertions with nothing to back them up.

What's wrong with confidence? It turns into a vice when it stops you wrestling with reality.

Micronations

Photo by Misterbisson

I've long been fascinated by micronations. It's the geek obsession with taking fundamental building blocks and trying to reinvent them. Following their histories also reminds me how disastrous applying that engineering instinct to social problems can be!

The first of the modern micronations was Sealand, and it's still up and running. Started by a 60's pirate radio broadcaster on a WWII sea fort, it's gone through everything from firearms arrests to kidnapping and helicopter assaults. It even has a government in exile!

The Republic of Indian Stream was an accidental state on the US/Canadian border in the 1800s. An ambiguous border treaty led to both countries trying to tax the inhabitants of the 300 square mile area, and in protest they declared their independence. The situation lingered on until the inhabitants invaded Canada to free one of their neighbors, who had been imprisoned over a hardware store debt. At that point the ambassadors decided an international incident over money owed to a shopkeeper was too embarrassing, and negotiated a border agreement ending the republic's short life.

The Territory of Poyais never actually existed, but Gregor MacGregor still managed to raise hundreds of thousands of pounds in the 1820s to settle it. The unlucky colonists who made it to the supposed tropical paradise found themselves stranded in hostile jungle. One committed suicide; the rest were eventually rescued by a chance encounter with a Honduran ship after their original vessel was swept away in a storm. When the survivors made it back to London and spread news of the scam, MacGregor fled to Paris to try it all over again!

Probably the only place out of these I'd actually want to live isn't a real micronation, since it doesn't claim to be its own country, but Spiral Island deserves a mention. It's a floating island built on a raft made of 250,000 recycled plastic bottles, complete with a two-storey house, beaches and even palm trees. Unfortunately the first version succumbed to a hurricane, but Richie Sowa, the artist behind it, rebuilt it with even more amenities.