What is MapReduce good for?


I’m working on a video series for O’Reilly that aims to demystify Hadoop and MapReduce, explaining how mere mortals can analyze massive data sets. I’m recording the first drafts of my segments now, and I’m finding it incredibly tough (on average it’s taking me about two hours to prepare and record a two-minute piece), but it’s also been a great way to round out my own knowledge.

I need to get feedback to help me improve the final versions, so I’ll be posting some excerpts to get ideas. Above I’ve included the introductory segment, where I try to cover the fundamental strength of the MapReduce approach. I’ll be cutting down on the wild-eyed look for the final shoot, but I’ll be interested to hear your thoughts on the content.

Why do I do what I do?

Crowdblur
Photo by David Sim

Linda Stone asked me that question at SciFoo, and I couldn't give her an answer. To me, everything I've been passionate about is just different sides of the same project, but I've never been able to communicate it effectively in words. The experiments I build are attempts to capture parts of the vision I see in my head. The only way I can explain it is to demonstrate.

I was always a spacey kid, off in my own world. My mother claims I never made a sound until I was one year old and that I was always sleeping, and all of my early memories are of solitary thinking or dreaming. It came as a shock to discover that I was sharing the world with other people, and that sense of wonder never quite wore off. I still get a shiver up my spine every time I think about the fact that there are 6 billion other people on this planet right now. Think about that for a second. Try to imagine a thousand people standing in a crowd. Now picture a crowd a hundred times bigger. The world is made up of sixty thousand of those massive crowds.

What keeps me up at night is that I know every one of those people has a story I'd love to sit down and hear. Human lives are like fractals: there's so much depth that I never get tired of learning about people's journeys, but I'm only ever going to glimpse the tiniest fraction of a sliver of all those stories.

Think about all of the people who pass each other in the street every day and never talk. There must be so many pairs who'd fall madly in love, write a symphony together, find the cure for cancer if they collaborated, or just become lifelong friends, but they never meet. I can only imagine how many wasted lives could be salvaged just by making the right connections.

Everything I've been driven to do, from weaving live footage of concert-goers into my visuals, to automatic expert location, to mapping social networks, has come from those urges to connect people and hear their stories. It feels like a victory against entropy to be able to say 'You guys should talk' and watch something beautiful grow.

Can you picture what the social network of the planet looks like? Imagine the richness and beauty of the web of relationships between six billion people. I can picture it in my mind's eye, glimpses at least, an amazing view of each person's ties to thousands of others, constantly evolving and changing. Every individual's network is a snapshot of their life's story, and the sum of them is the story of the world.

Walt Disney said "We don't make movies to make money, we make money to make more movies." I claim to be running a business, but whenever I've been offered a lucrative opportunity that would pull me away from these goals, I've found myself turning it down. This hasn't been great for my dwindling savings, and it's driven the people who try to help me crazy, because I've never been able to get across to them what I'm shooting for and why I'd refuse slam-dunk chances to make money.

It's all worth it though, when I create something that communicates part of what I'm seeing, whether it's the invisible web of relationships on Twitter, or the structure of a whole country's friendships. I'm convinced there's an immense amount of value in making connections, and that current 'social' software is just scratching the surface. Over the next few decades we're going to see amazing new ways of sharing our stories, and I want to be part of building those tools.

Five short links

Sunglassesgirl

WriteLikeMe – A superb project by James Hughes and friends at Dartmouth. It takes a sample of your writing and works out how you place relative to some of the greats. It’s especially cool because it misuses Google Maps to display the network graphs, which is a hack I can appreciate.

The war on attention poverty – A comprehensive rundown of the challenges of measuring authority on Twitter. It brings home that we’re still in the early days of figuring out how to identify and block the spammers on social networks.

Fovia – These guys are doing some spectacular 3D volume rendering in software. I’ve been out of that world for a long time, but it’s breathtaking to see what can be done with today’s CPUs, and it looks like they’ll be eating the traditional $100k+ workstation manufacturers’ lunch pretty soon.

Eureqa – Billed as a ‘robot scientist’, this does an impressive job of taking raw input data and trying to derive equations that describe the behavior. It’s intriguing to see how scientists adapt to these sorts of intellectual tools. Will they turn to packages like these to quickly test their ideas in the same way we turn to Google to answer questions?

Trapcode – I knew and admired Peder’s work on classic visual effects plugins like Shine back when I was in the industry, but it’s inspirational to see how far he’s gone since then. Just check out some of these videos using his effects:

http://vimeo.com/10786322

http://vimeo.com/11547544


(and of course, I recommend running them in Apple Motion for best results!)

Abolish birthright citizenship!

There's a groundswell of support for the proposal that children shouldn't get US citizenship merely because they're born here. I'm in complete agreement: it's downright un-American to just get handed citizenship on a plate without having to earn it. We rightly have a suspicion of those who inherit their wealth, and all of the same problems accompany being admitted as a member of this great nation without having to work for it.

I have a modest proposal: At 18 every young adult in the world goes through our immigration process to earn their green card, regardless of where they were born or live. We all worry about the quality of our education system – can you imagine the extra hours our kids will put in when they know they're competing against the best that China and Japan can offer? Competition is the American way – it's unfortunate that we'll end up trucking some of our teenagers across the border to Canada or Mexico, but that will be a great encouragement to the others.

I look forward to seeing this adopted as a bi-partisan measure, since the unfairness of inheriting citizenship seems to be such an important principle for so many.

Harness the power of being an idiot

Dunce
Photo by K Macice

One of my favorite moments at SciFoo was Carole Goble coming up to me after my talk. As we chatted, she mentioned she was at Manchester University and I suddenly recognized her – she'd taught me databases for a semester. After reminiscing, she mentioned she was no longer teaching undergrads and with my usual tact I replied "That's probably for the best"! I truly meant it was a lucky escape for her, not the students, but it brought back memories of my own abysmal academic career.

I achieved an overall average of 18% in the exams at the end of my first year of university. Thankfully Manchester allowed me to come back at the end of the summer to have another try, and after studying I managed to scrape together enough improvement to continue into the final two years.

I was a terrible academic and came within a whisker of dropping out. Why did I fail then, and how have those same traits helped me later in life?

Try it and see – a bias towards action

I hated sitting in a room writing down what somebody else was telling me; I wanted to test everything for myself. I learn by trying to build something, and there's no other way I can discover the devils in the details. Unfortunately that's an incredibly inefficient way to gain knowledge. I basically wander around stepping on every rake in the grass, while the A Students memorize someone else's route and carefully pick their way across the lawn without incident. My only saving graces are that every now and again I discover a better path, and faced with a completely new lawn I have an instinct for where the rakes are.

This obsession with learning-by-building must have been pretty frustrating for the managers of a clueless junior programmer, but my successes have all come when I've just gone ahead and done something instead of studying it. It's the only way to discover something new and unexpected, and even the failures build judgment.

Wildly unrealistic romantic visions

I never knew any college students growing up, but watching TV gave me a pretty firm grasp of the way it worked. I'd spend a few years punting along a river, have the occasional witty exchange with my tutor and regularly reveal an astonishing discovery to rapturous applause. The grim reality of being lost among thousands of anonymous students was a shock. With that dream dead, I married at 19 with rose-tinted visions of connubial bliss, which led to five stormy years and a divorce.

I was a complete idiot. I saw the world as I wanted it to be, not as it actually was. In the technology world 'visionary' is a compliment, but in real life people who see non-existent things usually wind up in a mental ward.

The thing is, every project I'm proud of started off as a crazy dream, from using MIDI controllers for film editing software to mapping hundreds of millions of Facebook users. The difference now is that they're seasoned with a connection to reality. What's so valuable about Customer Development is that it gives me tools to take a wild idea, and get enough feedback to hone it into something realistic, without losing the heart of what makes it so interesting.

Spinning plates

At SciFoo Eva Amsen talked about her interviews with scientist/musicians, asking them why they didn't focus on just the science or the music. Their reply was "Oh, that would be boring". Linda Stone gave her own angle from her research on how Nobel laureates played when they were kids – their 'work' was an extension of their childhood play patterns. I'm in no danger of getting a Nobel Prize, but I've always had multiple plates spinning, because I'm interested in so many areas and can't bear to give any up.

In college this meant staying up all night playing MUDs, stacking shelves at Kwik-Save evenings and weekends to support myself, building a version of Pacman for X Windows, and spending my summer in a treehouse in Alaska avoiding the bears. This was not a recipe for academic achievement to say the least, but it was exactly that sort of goofing off that got me a job at Apple.

I was grinding away at a game industry job, doing well but very unchallenged, so I started spending nights and weekends projecting visuals for live music at clubs and concerts. I couldn't find software that implemented the effects I needed, so I wrote my own and released them as open-source. A lot of the VJs who used them had day jobs in the TV industry, and they kept bugging me to port them to a piece of software I'd never used, "After Effects". I gave in eventually, put the AE versions on my site, and suddenly had a deluge of email, with lots of people asking how to pay for them! Apparently they were used to spending thousands of dollars for effects collections like mine!

I left my game job and started my own company, but before I'd got more than a few weeks into it I was approached by Apple, who offered to buy out my technology, sponsor me for a green card, and set me to work building the same sort of thing for their products.

Every new direction I've taken has started out like that, one of a hundred things I'm curious about.

It's not a bug, it's a feature

I have a lot of demons driving me. What's different between now and my college years is that I've arranged my life to channel them productively. All the traits I've talked about would doom me to failure if I hadn't fought my way to a niche where they're strengths, not weaknesses.

Don't change your bad habits – turn them into assets instead.

A Mad Engineer at SciFoo

Madengineer
From Cowbirds in Love

I’m not quite sure how an engineer got invited to SciFoo. It was all very mysterious, but I’m planning on having fun! I am passionate about the possibilities for answering some of the important questions using the new sources of data we’re all creating on the web, so I’ll be evangelizing cheap web crawling, analysis using Hadoop and of course visualization. After some last-minute idiocy on my part over the hotel, I’m now all set for the flight to San Jose tomorrow, and my first visit to the Google campus.

Talking of visualization, I was really curious to see where the other attendees would be coming from, so I built a special-edition OpenHeatMap to display the locations of the attendees mentioned on Kaitlin Thaney’s Twitter list:

Scifoomap

Here’s the full interactive version. You can also now do the same for any Twitter list in the main app by typing in list:user/list-name in the main Twitter Followers search box.

Discrimination and sticking your hand up

Raiseyourhand
Photo by Cpt. Obvious

This article by Stubbornella on women in technology was a great summary of what I've seen in my programming career. I've seldom witnessed overt discrimination, and there are no girlie calendars in the break-room, but we're terrible at assessing and promoting coders according to how effective and efficient they are. Instead there's a tendency to reward the qualities that Nicole lists under 'Cowboy-coders', even when these people have a negative impact on our ability to ship product.

Why does this matter? It's insanely hard to build great software, it's really tough to find strong engineers, and a system that discourages a large proportion of entrants with the potential to grow into those engineers is broken.

The point that Nicole mentions but doesn't go into depth on is that this isn't just about women. Some of the most technically brilliant male coders I've worked with are still quietly working away with zero recognition from either their own companies or the outside world. I've always loved talking about my work, and that's driven me to overcome my innate shyness and learn to perform in public, and generally be pretty vocal with my opinions in discussions. This is vital in a profession infested with arrogant jerks, and sometimes I wonder if I've had to become one too. I try hard to "act like I'm right, but listen like I'm wrong", but that's a very fine line.

We tend to take this as a law of nature, just how people get ahead in the programming world, but it's a sign of both poor management and a very ineffective culture. By contrast, my partner Liz is a trained actuary. Her job involves a decade of taking exams and some very deep math, and almost everyone in her old department was a woman. You don't have to be a pushy jerk to get noticed as an actuary; there's actually a system for actively finding and rewarding talent, and as if by magic, there are a lot more women in the profession.

Our problem is that we're fixated on a romantic image of programming, one full of individual rock-stars producing astonishing products thanks to all-night coding sessions. That approach doesn't work, at least not consistently or reliably. The reality of building great software is that it's a long process that takes a lot of teamwork. Our immature reverence for aggressive self-promoters leads to poor outcomes, and even if you don't give a fig about discrimination, that's still a massive problem.

So how do we fix it? I'm actually still a bit ambivalent about Google's scholarship specifically for women, though I can see why it makes sense in a lot of ways. I'd much rather see companies and individuals in leadership positions actively seeking out strong engineers to speak at and attend conferences, rather than waiting for people to come to their attention, since waiting rewards loud-mouth folks like me. We need to find more people who are quietly doing great work, without waiting for them to stick up their hand.

Visualizing the war deaths in Afghanistan

Open map

Niraj Chokshi took the WikiLeaks data from Afghanistan and filtered it to produce some maps. It's always tough to build visualizations on a deadline, and so there were some issues with the initial graphs, but they presented the data in a useful and interesting way. Niraj made the underlying data he'd used available as a spreadsheet, so with almost no changes I was able to upload it into OpenHeatMap to produce some different views.

It was pretty sobering to be handling data covering hundreds of people's deaths, and I'm honestly not sure what story the data is telling us. Just looking at the location and magnitude of the enemy dead in 2004 compared to 2009 shows how much the battlefield has changed though, from a handful of hotspots along the Pakistan border to a dense ring around the whole country.

Want a map of your Twitter followers?

Twitterheatmap

I've always wanted to know where the people who follow me on Twitter are from, both out of curiosity and so I can connect with them as I travel around the country. To find out, I built a tool using OpenHeatMap to visualize your followers by location. It only shows followers who have been active recently, but I've had great fun discovering connections to Guangzhou, Prague and even the exotic, enigmatic country of Canada, little known to westerners.

There are actually three different views you can use to explore Twitter as a map. You can put in your own or somebody else's handle to see their active followers, you can visualize the updates from the people you follow, or you can do a search on a keyword and see where in the world people are talking about that topic.

It's still just a prototype, but it feels like a step towards the interface we need to make sense of the flood of location data that's flowing all around us. I look forward to hearing your ideas on improving it, and since the component is completely open-source, feel free to build your own to show me how it should really be done!

OpenHeatMap for journalists

I’ve long admired The Guardian’s innovative approach to opening up their data, so I was very excited to see their technology editor, Charles Arthur, using OpenHeatMap on a recent story. I was happily surprised that he was able to set up his map without ever contacting me. I’ve always intended it to be self-serve but that can be very hard to achieve with a completely new product.

Since I’m really keen to see the stories other reporters can tell with OpenHeatMap, I’ve created a four-minute video guide aimed at journalists that walks you through exactly what you need to do to build your own maps. If you have some information about places, I’ve made it drop-dead simple to create a map that tells your story, so please check the guide out and pass it along to any other folks who might be interested.