Does Facebook’s purchase of Instagram make sense?

Picture by Oridusartic

I've spent the last year obsessed with social photo sharing as I've been building out Jetpac, so while I can't pretend I was expecting it, Facebook's acquisition of Instagram made sense to me. Here's why:

Facebook is a photo sharing site with a social network attached

The extent to which photos have always driven the network's growth astonished me. Unlike games or even status updates, sharing pictures was an existing social behavior that the recipients understood and welcomed, giving friends and relatives of users a strong incentive to sign up themselves. Nothing else has this kind of pull; it's the bedrock of everything else they do. They currently host 140 billion photos and are adding 10 billion a month, and that archive is a crucial engine of engagement.

Instagram has cracked the creative app problem

Instagram's real value is in their experience building a creative app that everybody can use. Nobody else has built an interface that's clear enough to be approachable and yet can produce results that people appreciate. It may sound simple, but it's deceptively hard to replicate from the outside. People like the filtered images because they express a creative act by the taker, something they've put thought and time into, but for a wide audience of creators to join in, the process actually has to be a lot easier and quicker than it appears. Not only is this balancing act hard to reverse-engineer, it was also helped along by an aura of exclusivity in the early days that's near-impossible for an established company to replicate. The flip-side of large companies being able to get easy press is that nobody gets credit for telling their friends about a cool new service they've launched.

Instagram was on the verge of going mainstream

The company had clearly proven their service had wide appeal, and showed all the signs of entering a period of rapid expansion. Even for behemoths like Facebook it gets very expensive to acquire a startup once it truly blows up, so with the cautionary tale of Yahoo's failure to buy Google in its early days in mind, this was a last chance of sorts for an acquisition. Instagram users' interaction with photos is very different from anything that Facebook offers, so if it did become widely popular there would be a real threat of it siphoning off users.

Instagram is the first natively-mobile app

When Somini Sengupta asked me about this story for her New York Times post, I felt like I was repeating conventional wisdom, but I realized it's not something everybody has absorbed. The shift to mobile has profoundly shaped the way we approach Jetpac, with a laser focus on our iPad app, because there's a deep change in user behavior that established web companies are struggling to adapt to. Facebook is keenly aware of how important mobile is, but they're facing a classic innovator's dilemma where their core web business will suffer if they really prioritize phones and tablets. Bringing in the pioneers of mobile-only applications can't hurt as they wrestle with the changes they know they need to make.

Facebook's own valuation gives them a strong war chest for moves like this, so in their position I can see why they made the purchase. The key is understanding how central photo sharing is to their business, and how much they believe in mobile.

Five short links

Picture by Pink Ponk

Why Open Science failed after the Gulf oil spill – The description of this researcher's interactions with the media rang very true. They took his reports and "eliminated a lot of the caveats and limits that Asper placed on his own results".

Sigma.js – An interactive network graph library, with support for both live force-directed layouts, and importing more complex structures from the desktop Gephi application. It has some very stylish visual defaults too.

Accumulo – I'd missed this Apache database project until now, but I'm interested in their take on the BigTable concept, especially their focus on security controls. Intriguing that it came out of the NSA too.

Visualizing live event broadcast delay – Working backwards from website traffic at different locations to figure out the broadcast delay for a TV commercial.

Online Hex Editor – Does exactly what it says on the tin. I don't know why I'm still amazed by how effective web apps can be, but it's striking how few barriers there are to replacing desktop programs.

Where am I, who am I?


"Queequeg was a native of Rokovoko, an island far away to the West and South. It is not down in any map; true places never are."

Where am I right now? Depending on who I'm talking to, I'm in SoMa, San Francisco, South Park, the City, or the Bay Area. What neighborhood is my apartment in? Craigslist had it down as Castro when it was listed. Long-time locals often describe it as Duboce Triangle, but people less concerned with fine differences lump it into the Lower Haight, since I'm only two blocks from Haight Street.

When I first started working with geographic data, I imagined this was a problem to be solved. There had to be a way to cut through the confusion and find a true definition, a clear answer to the question of "Where am I?".

What I've come to realize over the last few years is that geography is a folksonomy. Sure, there's political boundaries, but the only ones that people pay much attention to are states and countries. City limits don't have much effect on people's descriptions of where they live. Just take a look at this map of Los Angeles' official boundaries:

Map of the Los Angeles city limits

There's clearly little correlation between the legal city boundaries and how people describe the place that they live. You could argue that Los Angeles County is the correct region to use, but then people way out in the desert by Littlerock would be included!

The arbitrary and human nature of places is even more pronounced with neighborhoods. As I showed above, there's a surprising amount of consensus on the names of the neighborhoods, but almost none on their boundaries.

Why do I care about all this? It's crucial for data processing to recognize that if you force what the user puts in the 'Location' box into a standardized form, you're losing information. For example, knowing how somebody naturally describes where they are is going to be a lot more useful for grouping them together than a street address or latitude/longitude coordinates. If I choose the Lower Haight label, I'm more likely to be a hippy or a punk; if I pick the Castro, I want to identify with the gay image; and if I go for the Mission, I'm associating myself with hipsters.
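
To make that concrete, here's a tiny Ruby sketch of what I mean; the field names and normalization are invented for illustration, but the point is that grouping people by the label they chose keeps information that geocoding everything to coordinates would throw away.

# A minimal sketch: cluster people by the neighborhood label they typed in,
# rather than collapsing everything to coordinates. Field names are made up.
users = [
  { name: 'Ada',  location_text: 'Lower Haight' },
  { name: 'Bob',  location_text: 'lower haight!' },
  { name: 'Cleo', location_text: 'The Castro' },
]

# Light normalization so near-identical spellings cluster, while still
# preserving the label the user actually chose.
def location_key(text)
  text.downcase.gsub(/[^a-z ]/, '').strip
end

groups = users.group_by { |user| location_key(user[:location_text]) }
groups.each do |label, people|
  puts "#{label}: #{people.map { |person| person[:name] }.join(', ')}"
end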

I'm glad Twitter has stuck with their free-form text fields, and I hope Facebook will become more flexible. Don't throw this data away, treasure it! It makes it a lot harder for machines to deal with the content that people produce, but unless you're shipping packages or targeting ICBMs, the payoff of richer knowledge of your users is worth it.

Find amazing travel photos from your friends with Jetpac on the iPad

Jetpac iPad app screenshot

I'm interrupting my usual stream of geek consciousness to bring you a message from our sponsors. I'm very pleased to announce that the Jetpac iPad app is now available! Some of your friends are taking astonishing travel pictures that you've never seen. Get the app and we'll give you the very best of the two hundred thousand photos your friends have shared on Facebook.

Ratings are very important to help other people discover the app, so if you do enjoy it, please consider taking a few seconds to rate us too.

There are a lot of data stories from this release, and I'll be writing about them over the next few weeks, in between new features and bug fixes for the next update!

Five short links

Photo by Hobvias Sudoneighm

HTTP cookies, or how not to design protocols – Browser protocols feel a lot more like Windows than Unix in their design and evolution. The lack of clear principles means we'll face the same endless-but-just-about-manageable cascade of bugs that afflicted Microsoft's OS.

Nilometer – Predictive analytics from a hole in the ground. The level of the Nile was such a strong sign of the strength of the harvest months later that Cairo's biggest festival was cancelled and replaced with prayers and fasting if it didn't measure up. The 1,400 years of time series data from this instrument spawned some fascinating research in modern times too.

Earth Station: The afterlife of technology at the end of the world – What happens when the future becomes the past? An abandoned satellite tracking station vital for the moon landing, and the trailer park that now surrounds it.

Designing user experiences for imperfect data – Thinking about the UI from the start is vital to building effective data algorithms, and often turns impossible problems into solvable ones, as Matthew demonstrates.

Spatial isn't special – There's a life-cycle to every technology niche. As demand first emerges, the few developers who can serve it can make a handsome living, but gradually the knowledge and tools diffuse to a wider world, and the specialty becomes a skill that can be acquired rather than an expert you need to hire. This is a very good thing for the wider world: what were hard and expensive problems become cheap and easy to solve. But it's worth remembering that when the money's too good, it won't last forever.

Programming and prior experience

I wanted to highlight a comment to my previous post about unpaid work, since I think it deserves to be more prominent:

——————————–

I'm a female who majored in computer science but then did not use my degree after graduating (I do editing work now). While I was great with things like red-black trees and k-maps, I would have trouble sometimes with implementations because it was assumed going into the field that you already had a background in it. I did not, beyond a general knowledge of computers. 

I was uncomfortable asking about unix commands (just use "man"! – but how do I interpret it?) or admitting I wasn't sure how to get my compiler running. If you hadn't been coding since middle school, you were behind. I picked up enough to graduate with honors, but still never felt like I knew "enough" to be qualified to work as a "true" programmer. 

I like editing better anyway, so I'm not unhappy with my career, but that environment can't be encouraging for any but a certain subset of people, privileged and pushed to start programming early.

———————————

Five short links

Photo by Elliott Brown

Hollow visions, bullshit, lies, and leadership vs management – "the best creative work depends on getting the little things right", "organizations need both poets and plumbers". So much to absorb here, but it all chimes with my experiences at Apple. Steve Jobs may have been a visionary, but he also knew his business inside and out, and obsessed over details.

Exceptions in C with longjmp and setjmp – When I was learning C, I loved how it felt like complete mastery was within reach; the language was contained and logical enough to compile mentally once you had enough experience. longjmp() and setjmp() were two parts I never quite understood until now, so it was fascinating to explore them here.

Web Data Commons – Structured data extracted from 1.5 billion pages. To give you an idea of the economics behind big data, the whole job only cost around $600 in processing.

Greg's Cable Map – A lovely tool for exploring the globe-spanning physical infrastructure that's knitting our world together.

Hammer.js – "Can't touch this!" – A cross-platform JavaScript library for advanced gestures on touch devices like tablets and phones. Even just building a basic swipe gesture from tap events is a pain, so this is much needed.

Twelve steps to running your Ruby code across five billion web pages

Photo by Andrew Ferguson

Common Crawl is one of those projects where I rant and rave about how world-changing it will be, and often all I get in response is a quizzical look. It's an actively updated and programmatically accessible archive of public web pages, with over five billion crawled so far. So what, you say? This is going to be the foundation of a whole family of applications that have never been possible outside of the largest corporations. It's mega-scale web-crawling for the masses, and it will enable startups and hackers to innovate around ideas like a dictionary built from the web, reverse-engineering postal codes, or any other application that can benefit from huge amounts of real-world content.

Rather than grabbing each of you by the lapels individually and ranting, I thought it would be more productive to give you a simple example of how you can run your own code across the archived pages. It's currently released as an Amazon Public Data Set, which means you don't pay for access from Amazon's own servers, so I'll show you how to run a job against it on their Elastic MapReduce service.

I'm grateful to Ben Nagy for the original Ruby code I'm basing this on. I've made minimal changes to his original code, and built a step-by-step guide describing exactly how to run it. If you're interested in the Java equivalent, I recommend this alternative five-minute guide.

1 – Fetch the example code from github

You'll need git to get the example source code. If you don't already have it, there's a good guide to installing it here:

http://help.github.com/mac-set-up-git/

From a terminal prompt, you'll need to run the following command to pull it from my github project:

git clone git://github.com/petewarden/common_crawl_types.git

2 – Add your Amazon keys

If you don't already have an Amazon account, go to this page and sign up:

https://aws-portal.amazon.com/gp/aws/developer/registration/index.html

Your keys should be accessible here:

https://aws-portal.amazon.com/gp/aws/securityCredentials

To access the data set, you need to supply your public and secret keys. Open up extension_map.rb in your editor, and just below the CHANGEME comment add your own keys (it's currently around line 61).
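
I won't reproduce Ben's actual code here, but the section you're editing will look roughly like the snippet below. The constant names are only illustrative, so match whatever extension_map.rb actually defines around the CHANGEME comment.

# CHANGEME - replace these with your own AWS credentials.
# (Illustrative only: the real constant names in extension_map.rb may differ.)
AWS_ACCESS_KEY_ID     = 'your-public-key-goes-here'
AWS_SECRET_ACCESS_KEY = 'your-secret-key-goes-here'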

3 – Sign in to the AWS web console

To control the Amazon web services you'll use to run the code, you need to be signed in on this page:

http://console.aws.amazon.com

4 – Create four buckets on S3


Buckets are a bit like top-level folders in Amazon's S3 storage system. They need globally unique names that don't clash with any other Amazon user's buckets, so when you see me using com.petewarden as a prefix, replace it with something else unique, like your own domain name. Click on the S3 tab at the top of the page, then click the Create Bucket button at the top of the left pane, and enter com.petewarden.commoncrawl01input for the first bucket. Repeat with the following three other buckets:

com.petewarden.commoncrawl01output

com.petewarden.commoncrawl01scripts

com.petewarden.commoncrawl01logging

The last part of their names is meant to indicate what they'll be used for. 'scripts' will hold the source code for your job, 'input' the files that are fed into the code, 'output' will hold the results of the job, and 'logging' will have any error messages it generates.

5 – Upload files to your buckets


Select your 'scripts' bucket in the left-hand pane, and click the Upload button in the center pane. Select extension_map.rb, extension_reduce.rb, and setup.sh from the folder on your local machine where you cloned the git project. Click Start Upload, and it should only take a few seconds. Do the same steps for the 'input' bucket and the example_input.txt file.
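
If you'd rather script steps 4 and 5 than click through the console, the same thing can be done from Ruby. The sketch below uses the aws-sdk-s3 gem (which post-dates this walkthrough) and a placeholder bucket prefix, so treat it as an optional alternative and substitute your own names.

# Optional: create the four buckets and upload the files from Ruby instead of
# the S3 web console. Needs `gem install aws-sdk-s3` and AWS credentials set
# up in your environment. The 'com.example' prefix is a placeholder.
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')
prefix = 'com.example.commoncrawl01'

# Create the input, output, scripts, and logging buckets.
%w[input output scripts logging].each do |suffix|
  s3.create_bucket(bucket: "#{prefix}#{suffix}")
end

# Upload the job scripts and the example input file.
%w[extension_map.rb extension_reduce.rb setup.sh].each do |file|
  s3.put_object(bucket: "#{prefix}scripts", key: file, body: File.read(file))
end
s3.put_object(bucket: "#{prefix}input", key: 'example_input.txt',
              body: File.read('example_input.txt'))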

6 – Create the Elastic MapReduce job

The EMR service actually creates a Hadoop cluster for you and runs your code on it, but the details are mostly hidden behind their user interface. Click on the Elastic MapReduce tab at the top, and then the Create New Job Flow button to get started.

7 – Describe the job


The Job Flow Name is only used for display purposes, so I normally put something that will remind me of what I'm doing, with an informal version number at the end. Leave the Create a Job Flow radio button on Run your own application, but choose Streaming from the drop-down menu.

8 – Tell it where your code and data are


This is probably the trickiest stage of the job setup. You need to put in the S3 URL (the bucket name prefixed with s3://) for the inputs and outputs of your job. Input Location should be the root folder of the bucket where you put the example_input.txt file, in my case 's3://com.petewarden.commoncrawl01input'. Note that this one is a folder, not a single file, and it will read whichever files are in that bucket below that location.

The Output Location is also going to be a folder, but the job itself will create it, so it mustn't already exist (you'll get an error if it does). This even applies to the root folder of the bucket, so you need to append a folder suffix that doesn't exist yet. In this example I'm using 's3://com.petewarden.commoncrawl01output/01/'.

The Mapper and Reducer fields should point at the source code files you uploaded to your 'scripts' bucket: 's3://com.petewarden.commoncrawl01scripts/extension_map.rb' and 's3://com.petewarden.commoncrawl01scripts/extension_reduce.rb'. You can leave the Extra Args field blank, and click Continue.

9 – Choose how many machines you'll run on


The defaults on this screen should be fine, with m1.small instance types everywhere, two instances in the core group, and zero in the task group. Once you get more advanced, you can experiment with different types and larger numbers, but I've kept the inputs to this example very small, so it should only take twenty minutes on the default three-machine cluster, which will cost you less than 30 cents. Click Continue.

10 – Set up logging


Hadoop can be a hard beast to debug, so I always ask Elastic MapReduce to write out copies of the log files to a bucket so I can use them to figure out what went wrong. On this screen, leave everything else at the defaults but put the location of your 'logging' bucket for the Amazon S3 Log Path, in this case 's3://com.petewarden.commoncrawl01logging'. A new folder with a unique name will be created for every job you run, so you can specify the root of your bucket. Click Continue.

11 – Specify a boot script


The default virtual machine images Amazon supplies are a bit old, so we need to run a script when we start each machine to install missing software. We do this by selecting the Configure your Bootstrap Actions button, choosing Custom Action for the Action Type, and then putting in the location of the setup.sh file we uploaded, e.g. 's3://com.petewarden.commoncrawl01scripts/setup.sh'. After you've done that, click Continue.

12 – Run your job


The last screen shows the settings you chose, so take a quick look to spot any typos, and then click Create Job Flow. The main screen should now contain a new job with the status 'Starting' next to it. After a couple of minutes that should change to 'Bootstrapping', which takes around ten minutes, and then to running the job itself, which only takes two or three minutes.

Debugging all the possible errors is beyond the scope of this post, but a good start is poking around the contents of the logging bucket, and looking at any description the web UI gives you.


Once the job has successfully run, you should see a few files beginning with 'part-' inside the folder you specified in the output bucket. If you open one of these up, you'll see the results of the job.


This job is just a 'Hello World' program for walking the Common Crawl data set in Ruby: it simply counts the frequency of MIME types and URL suffixes, and I've only pointed it at a small subset of the data. What's important is that it gives you a starting point to write your own Ruby algorithms to analyse the wealth of information that's buried in this archive. Take a look at the last few lines of extension_map.rb to see where you can add your own code, and edit example_input.txt to add more of the data set once you're ready to sink your teeth in.
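
If you've never written a Hadoop Streaming job before, the contract is simple: the mapper reads records from standard input and prints tab-separated key/value lines, and the reducer reads those lines back and aggregates them by key. Here's a stripped-down sketch of that pattern in Ruby. It is not Ben's actual code, which also handles pulling the crawl's archive files down from S3, but it shows the shape you'll be extending; the file names are hypothetical.

# my_map.rb - a hypothetical mapper: emit a count of 1 for each URL suffix.
# (The real extension_map.rb also fetches and parses the crawl data from S3.)
STDIN.each_line do |line|
  url = line.strip
  next if url.empty?
  suffix = File.extname(url)
  suffix = '(none)' if suffix.empty?
  puts "#{suffix}\t1"
end

# my_reduce.rb - a hypothetical reducer: sum the counts for each key.
# Hadoop Streaming delivers the mapper output sorted, so lines that share a
# key arrive together, but using a hash keeps the sketch simple and robust.
counts = Hash.new(0)
STDIN.each_line do |line|
  key, value = line.chomp.split("\t")
  counts[key] += value.to_i
end
counts.each { |key, total| puts "#{key}\t#{total}" }

Once that pattern is comfortable, modifying the mapper to extract whatever you care about from each page, and the reducer to aggregate it, is mostly a matter of changing what gets emitted as the key and value.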

Big thanks again to Ben Nagy for putting the code together, and if you're interested in understanding Hadoop and Elastic MapReduce in more detail, I created a video training session that might be helpful. I can't wait to see all the applications that come out of the Common Crawl data set, so get coding!

Unpaid work, sexism, and racism

 

Photo by Wayan Vota

You may have been wondering why I haven't been blogging for over a week. I've got the generic excuse of being busy, but truthfully it's because I've had a draft of this post staring back at me for most of that time. God knows I'm not normally one to shy away from controversy, but I also know how tough it is to talk about racism and sexism without generating more heat than light. After two more head-slapping examples of our problem appeared just in the last few days, though, I couldn't hold off any longer. I'm not a good person to talk about explicit discrimination in the tech industry (for that I'd turn to somebody like Kristina Chodorow), but I have been struck by one of the more subtle ways we discourage a lot of potential engineers from joining the profession.

I don't get paid for most of the things I spend my time on. I blog, write open source code, and speak at conferences for free, my books provide beer money, and I've only been able to pay myself a small salary for the last few months, after four years of working on startups. This isn't a plea for sympathy; I love doing what I do and see it all as a great investment in the future. I saved up money during my time at Apple precisely so I'd have the luxury of doing all these things.

I was thinking about this when I read Rebecca Murphey's post about the Fluent conference. Her complaints were mostly about things that seemed intrinsic to commercial conferences to me, but I was struck by her observation that not covering speakers' expenses hurts diversity.

I think it goes beyond conferences though (and I've actually found O'Reilly to be far better at paying contributors than most organizers, and they work very hard on discrimination problems). The media industry relies on unpaid internships as a gateway to journalism careers, which excludes a lot of people. Our tech community chooses its high-flyers from people who have enough money and confidence to spend significant amounts of time on unpaid work. Isn't this likely to exclude a lot of people too?

And yes, we do have a diversity problem. I'm not wringing my hands about this out of a vague concern for 'political correctness'; I'm deeply frustrated that I have so much trouble hiring good engineers. I look around at careers that require similar skills, like actuarial work, and they include a lot more women and minorities. I desperately need more good people on my team, and the statistics tell me that as a community we're failing to attract or keep a lot of the potential candidates.

We're a meritocracy. Writing, speaking, or coding for free helps talented people get noticed, and it's hard to picture our industry functioning without that process at its heart. We have to think hard about how we can preserve the aspects we need, but open up the system to the people we're missing right now. Maybe that means setting up scholarships, having a norm that internships should all be paid, setting aside time for training as part of the job, or even doing a better job of reaching larval engineers earlier in their education? Is part of it just talking about the career path more explicitly, so that people understand how crucial spending your weekends coding on open source, etc., can be for your career?

I don't know exactly what to do, but when I look around at yet another room packed with white guys in black t-shirts, I know we're screwing up.

Five short links

Photo by Bitzi

Geotagging poses security risks – An impressively level-headed look at how the quiet embedding of locations within photos can cause security issues, especially for the service members it's aimed at.

I can't stop looking at tiny homes – I was so happy to discover I'm not the only one obsessed with houses the size of dog kennels. If you're a fellow sufferer, avoid this site at all costs.

From CMS to DMS – Are we moving into an era of Data Management Systems that play the same interface role for our data that CMSes do for our content?

Drug data reveals sneaky side effects – Drew Breunig pointed me at this example of how bulk data is more than the sum of its parts. By combining a large number of adverse reaction reports, researchers discovered many new side effects caused by mixing drugs.

Gisgraphy – An intriguing open-source LGPL project that offers geocoding services based on OpenStreetMap and Geonames information. I look forward to checking this out and having a play.