Is Michigan more beautiful than Italy?

Photo by Rachel Kramer

I can now officially pronounce Michigan the fifth most beautiful place in the world!

With the launch of Jetpac, my big data science job is identifying the photos you'll find most inspiring. I've been exploring the 50 million captions you've shared with us so far, trying to identify patterns, and it really is the most fun part of my day! There are so many surprises hidden in the data, but one of the biggest came when I calculated the places where people were most likely to use the word 'beautiful' in their captions (there's a sketch of the calculation after the list):

#1 Sedona, 10.8x

#2 Cabo San Lucas, 9.7x

#3 Lake Victoria, 6.6x

#4 Amarillo, TX, 6.2x

#5 Michigan's Upper Peninsula, 6.0x

#6 Algarve, Portugal, 5.9x

#7 Montevideo, Uruguay, 5.9x

#8 Bath, UK, 5.9x

#9 Florence, Italy, 5.8x

#10 Hood River Valley, Oregon, 5.2x
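The 'x' figure is just how much more often 'beautiful' appears in a place's captions than in captions overall. Here's a minimal Python sketch of that calculation, using a few made-up captions since the real data set obviously isn't included:

from collections import defaultdict

# Made-up (place, caption) pairs standing in for the real photo data.
captions = [
    ('Sedona', 'What a beautiful sunset'),
    ('Sedona', 'Hiking the red rocks'),
    ('Florence', 'Beautiful view of the Duomo'),
    ('Florence', 'Waiting in line at the Uffizi'),
]

total_count, total_hits = 0, 0
place_counts = defaultdict(int)
place_hits = defaultdict(int)
for place, caption in captions:
    hit = 1 if 'beautiful' in caption.lower() else 0
    total_count += 1
    total_hits += hit
    place_counts[place] += 1
    place_hits[place] += hit

# A place's multiplier is its rate of 'beautiful' captions divided by
# the rate across all captions everywhere.
baseline = float(total_hits) / total_count
for place in place_counts:
    rate = float(place_hits[place]) / place_counts[place]
    print('%s: %.1fx' % (place, rate / baseline))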

A lot of those made perfect sense (who doesn't love Sedona or Lake Victoria?), but I had to triple-check my calculations when Michigan showed up! How did the world center of trashed building photography end up in fifth place, above Florence?!

As I looked through my friends' photos on Facebook and the public ones on Flickr, it all started to make a lot more sense. The area around Lake Michigan, and the Upper Peninsula in particular, is full of stunning scenes, with storms, massive skies and cliffs producing amazing shots.

Photo by Kevin Dooley

The numbers don't lie! Now I just need to see if I can get a free vacation from the Michigan Tourist Commission; I'm itching to see the place for myself after being sucked into all the photos I've had to review. I've discovered somewhere new in the world that I'm dying to visit, even though I'd never have thought of going there in a million years.

If you're as interested in playing with the data as I am, I've also put up a tool where you can find the distribution of any word that shows up a lot in our 50 million photos, including 'beautiful'. I'd love to hear what patterns you find.

About that awesome video

No, not my one from yesterday, the product demo we have on the front page of Jetpac! I wasn't sure we should be spending time and effort on something that seemed non-critical, but it turned out to be incredibly useful in explaining what we're doing and persuading users to try us out. I spent years working on high-end video software, so I tend to be very critical of production work, but the producer Mike Kaney did an amazing job, even on very tricky elements like the reflections. What's even more impressive is that he achieved all this on a startup budget! If you're interested in getting something made for one of your own projects, check out his Rockbridge production company and tell him we sent you.

I have to mention the star performance by Jetpac's very own mad marketing genius Stephanie Southerland. Despite no background in acting, she's apparently a natural performer, even on top of a building in her swimwear on a freezing-cold November day.

How to persuade users to sign up with Facebook Connect

Photo by Poppy Thomas-Hill

A friend told me that she was encountering a lot of people who liked the idea of her new service, but were put off by the idea of using Facebook to connect. She asked me how we tackled this issue for Jetpac, so I thought I'd put together a quick summary of what we've found to help her, and anyone else who's struggling.

In my experience there are three kinds of potential users, who all require different approaches:

Facebook Negatives

Especially in the tech world, there's a small but real group of people who either don't have Facebook accounts at all, or who use them minimally. I'm not a heavy Facebook user myself and understand a lot of their reasons, so I just try to make it clear that we plan to support other services like Flickr and Instagram in the future, and leave them in peace.

App Enthusiasts

At the other end of the spectrum, there's a good number of people who don't have any concerns about adding new applications. This population is slowly shrinking, thanks to the feedback from friends who are annoyed when they get spammed, but they're still out there. That means if you can't persuade anyone to sign up for your service, then you're doing really badly and should triple-check your messaging.

Persuadable Skeptics

The biggest group are those who may be willing to connect, but want some reassurance before they do. Here are the things we've found help convince them:

Clear message – The name, tagline and copy on the website and in any ads need to make it very clear why Facebook is needed. People are very wary of signing up if they don't understand why your site needs access to their social network. A friend even suggested putting in an extra dialog before the external permissions page, spelling out exactly what the benefits of connecting are, and why it's necessary for your service, which we hope to try out soon.

Minimal permissions – Cut down the number of different permissions you're asking for to the absolute minimum. As an example, we originally asked for feed-posting permissions just so we could easily support an in-app way for users to comment on their friends' photos, but experience with spammy applications that abuse that power made many of our early users refuse. We reworked the feature to use a Facebook widget for commenting instead, so we didn't have to ask for posting rights, and our conversion rate went way up.

High production values – It sounds superficial, but a professional design for your site is essential. People are looking for any clues about your trustworthiness, and the fact that you've put a lot of effort into the look of your site reassures them that you aren't a scam. Sometimes judging a book by its cover is a useful heuristic.

Personal touch – High production values don't mean adopting a distant, corporate voice; that's guaranteed to put people off. Most likely you're a small team of enthusiasts like us, so use that as a strength and put yourselves front and center. Seeing that the team is proud of what they've built and willing to stand behind it makes a world of difference to wavering users. It's also a helpful culture-building tool internally: if the team knows they're putting their own reputations on the line, they'll be extra careful about protecting user information.

The data geekery behind Jetpac

My new startup has just gone public, and I wanted to talk a bit about the data geekery behind the consumer experience, and some of the technical and privacy challenges, so I threw together a quick video cast. You can get more information on the project over on the company blog, and by following us on Twitter. I'm pretty excited to finally be able to talk about what I've been working on for the last six months!

Don't forget to check out my new visualization too: an interactive word map of 50 million photo captions.

Five short links

Photo by Michael Donovan

Place graphs are the new social graphs – Fascinating work by Matt Biddulph, looking for geographic analogies (for example, tell me the neighborhood in New York that's most similar to Noe Valley in San Francisco).

Yet another government portal to ignore – Though they're a massive step forward conceptually, most government open data efforts are crippled by terrible usability. I still find myself digging through the FTP server for the US Census, after failing to navigate their web interfaces.

Angels of the Right – There have been a lot of attempts to produce graphs showing networks of influence, but this is by far the most approachable and informative I've seen. It's actually useful for discovering things! Even better, Skye's helped package the code behind it into an open-source framework called nodeviz.

Data Illumination – An intriguing new data blog that's just started. There's not much content yet, but I like what's there so far, and the more readers and commenters it attracts, the more likely it is to keep growing.

Shuttle Radar Topography Mission – Free elevation data for the entire world, with samples as close as 30m in the US and 90m for the rest of the world.


Communists in Space, and now on the Kindle

Picture by Joseph Morris

When I was eight years old, I found a book in my brother's room about nuclear war. In it was a map showing the likely British targets of a Soviet nuclear strike as circles. I grew up in East Anglia, surrounded by American air bases, so everywhere for miles around was such a solid mass that you couldn't even see the individual circles. This so terrified me that I made excuses for years to avoid going into the nearby city of Cambridge; I had such a vivid picture in my head of roasting alive as the air caught fire.

A two-week school visit to Russia just before the fall of the USSR gave me a glimpse of the grim and tawdry reality of the Soviet system (brown fruit juice, anyone?), but the idea of communists as terrifying bogeymen has never really left me. I've had a strange fascination, an impulse to understand how people ended up in such a twisted state, that's led me to read up on the early Soviet era, especially Stalin's particularly demonic rule. As I've got older I've also tried to understand what drove well-intentioned people to support terrible actions, and the humanistic resistance of others like George Orwell.

That all left me a prime audience for Ken MacLeod's Fall Revolution series. I first came across The Star Fraction by accident, but was immediately captured by a very British near future, inhabited by people I recognized. Trotskyite militants battle the Animal Liberation Front, a quasi-Richard-Dawkins summons familiars to attack enemies from his Seastead, and a combined UN/US 'peacekeeping force' has suffered the ultimate mission creep and runs the world from its space weapons platforms. Running through the book is a Communist conspiracy theory that blows the tired Templar myths out of the water, because it's based on historical templates that actually happened. Communists truly ran effective underground organizations for decades and overthrew governments, so for someone with MacLeod's knowledge of the movements (here's his take on Orwell in context) there's rich material to choose from.

In case this sounds too stuffy, it's fundamentally an adventure story with pleasant echoes of Neuromancer; it's not heavy reading. The only thing that has surprised me is how little attention it ever received; people seem far more focused on later books like his Cosmonaut Keep series. The Star Fraction was one of those novels that stuck in my head, and since my paper copy is still in storage in the UK, I've been hoping for an ebook version so I could justify buying it again. When I saw Ken announce that one of his more recent books had just been released electronically, I went back to search for a copy of The Star Fraction and finally found one for the Kindle, bundled with The Stone Canal as Fractions: The First Half of the Fall Revolution. I'm now a few chapters in and it's every bit as good as I remember, popping with wild ideas and a refreshingly different angle on the world.

Since I didn't see the news appear on Ken's blog, and he didn't know about it when I hassled him on Twitter a few months ago, consider this a public service announcement: The Star Fraction is available as an ebook! If you find the idea of Communist Conspiracies in Space at all intriguing, buy it now; you won't be sorry.

How to enter a data contest – machine learning for newbies like me

Photo by John Carleton

I've not had much experience with machine learning; most of my work has been a struggle just to get data sets that are large enough to be interesting! That's a big reason why I turned to the Kaggle community when I needed a good prediction algorithm for my current project. I wasn't completely off the hook though; I still needed to create an example of our current approach, limited as it is, to serve as a benchmark for the teams. While I was at it, it seemed worthwhile to open up the code too, so I've created a new Github project:

https://github.com/petewarden/MLloWorld

It actually produces very poor results, but does demonstrate the basics of how to pull in the data and apply one of scikit-learn's great collection of algorithms. If you get the itch there's lots of room for improvement, and the contest has another two weeks to run!

Installing scikits-learn

Before you can run the Python scripts, you'll need to install the scikits-learn machine-learning framework. Here are the instructions.
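For most setups, the install itself boils down to a single command, assuming you already have Python and pip; note that the package has been published as both scikits.learn and scikit-learn depending on the version, so check which name your release uses:

pip install scikit-learn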

It's also worth checking out the tutorial and their other guides; they've written some great documentation.

Getting the code

To pull the latest copy of this code and enter the directory run these commands:

git clone git://github.com/petewarden/MLloWorld.git

cd MLloWorld/

Creating a model

Before you can predict unknown values, you need to train up the algorithm with example data. I've packaged a set of 40,000 items as a CSV file, with each column representing an attribute of the original photo albums. You'll need to run these through the training script to build a model that can be used for prediction. Here's the command:

python train.py training_data.csv storedmodel

That may take ten or twenty minutes to run, but at the end you should have a file called storedmodel in the current directory.
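If you're wondering what a script like train.py has to do, here's a stripped-down sketch of the general shape. This isn't the repo's exact code; the choice of classifier and the assumption that the last CSV column holds an integer target label are mine:

import csv
import pickle
import sys

from sklearn.ensemble import RandomForestClassifier

input_path, model_path = sys.argv[1], sys.argv[2]
features, targets = [], []
# Assumes a headerless CSV with the target label in the last column.
with open(input_path) as input_file:
    for row in csv.reader(input_file):
        features.append([float(value) for value in row[:-1]])
        targets.append(int(row[-1]))

# Fit a straightforward classifier on the whole training set.
model = RandomForestClassifier()
model.fit(features, targets)

# Serialize the fitted model so the prediction script can reuse it.
with open(model_path, 'wb') as model_file:
    pickle.dump(model, model_file)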

Predicting results

Now that you have a model built, you can take the test set of data and predict their values:

python predict.py test_data.csv storedmodel > results.csv

This will also take a few minutes, but at the end you'll have a CSV file containing a list of the album ids and a prediction for each one. It's in the right format to submit to Kaggle, and if you look for the 'Full scikit-learn example' in the benchmarks at the bottom of the leaderboard, you'll see how this simple approach scored:

http://www.kaggle.com/c/PhotoQualityPrediction/Leaderboard

As you can see, it's not that great! If you modify the code and think you've improved its predictions, you can create a team and submit your new results to find out how well you've done. There's already stiff competition from the current teams of course!

http://www.kaggle.com/c/PhotoQualityPrediction/Submissions
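If you want to peek inside before you start hacking, the prediction side is even simpler. Here's a sketch in the same spirit as the training one, again not the repo's exact code, with the column layout as an assumption:

import csv
import pickle
import sys

input_path, model_path = sys.argv[1], sys.argv[2]
with open(model_path, 'rb') as model_file:
    model = pickle.load(model_file)

# Assumes the first column is the album id and the rest are features,
# and writes one 'id,prediction' line per row to standard output.
with open(input_path) as input_file:
    for row in csv.reader(input_file):
        album_id = row[0]
        features = [float(value) for value in row[1:]]
        prediction = model.predict([features])[0]
        print('%s,%s' % (album_id, prediction))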

Notes on the internal data format

The trickiest part for me was getting the data into a format that scikit-learn's functions could understand. Because the CSV stores which words occurred for an album, the full row vector for each one could be thousands of entries long, most of them zero. To speed up the training and save on memory, I used scipy's sparse matrix class, coo_matrix, to store the results. You can see the sort of unpacking I do in the expand_to_vectors() function in mlloutils.py.
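To give a flavor of what that unpacking involves, here's a simplified sketch rather than the actual expand_to_vectors() code. Each album contributes non-zero entries only for the word ids it contains, and coo_matrix assembles the result without ever building the dense rows:

from scipy.sparse import coo_matrix

# Made-up input: for each album, the ids of the words that appeared in it.
albums = [[2, 5], [0], [1, 2, 7]]
vocabulary_size = 8

rows, cols, values = [], [], []
for album_index, word_ids in enumerate(albums):
    for word_id in word_ids:
        rows.append(album_index)  # matrix row: which album
        cols.append(word_id)      # matrix column: which word
        values.append(1.0)        # presence flag for that word

# Build the sparse matrix directly from the coordinate lists.
matrix = coo_matrix((values, (rows, cols)),
                    shape=(len(albums), vocabulary_size))
print(matrix.toarray())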

[Update – Big thanks to Olivier Grisel who vastly improved the results by fixing some errors in the CSV reader and picking a more accurate and much faster classifier. I've integrated his changes, and now see a score of 0.44, which still puts it at the bottom of the leaderboard but is at least respectable!]

Why your startup should use data competitions

Photo by Brett Jordan

When I first came across Kaggle last year I loved the idea. They run Netflix-style data competitions as a service, and by linking real-world problems with the researchers who can solve them, everybody wins! The only thing that surprised me was that I didn't see startups creating contests, it seemed like a perfect way to bring in some amazingly talented helpers on a tight budget. It stayed on my mind, and when I hit a tough prediction problem with my current company, I knew I wanted to give Kaggle a try.

Happily I've got to know Anthony, Jeremy and the team quite well since I moved to San Francisco, so they were extremely helpful when I turned to them (and they cut me a great deal at starving-startup rates!). I first reached out on Wednesday, and we had the competition live on Saturday morning:

Photo Quality Prediction

With a prize pool of $5,000, we've attracted over thirty teams in less than two days! The results are already very promising, and there are still three weeks left.

If you're a startup, have a look at the amount of intelligent and enthusiastic help we're getting, and think about the problems you face in your business. You want to be focused on your product, so why not get help from experts on the machine-learning side? I bet they'll do a much better job than you have time for, and it won't break the bank. Kaggle's help has meant it's taken very little of my time as well, freeing me up to work on our core technology.

In my next post I'll walk you through exactly what it took to set up the competition through Kaggle's site, but go check it out now, and picture what you could do with dozens of machine-learning ninjas on your team.

Why we need an open-source geocoding alternative to Google

Photo by Marc Levin

You can't use Google's geocoding for anything but map display! I've always been surprised by how many services rely on the Google Maps API for general address-to-coordinate translation, despite it being prohibited unless you're displaying the results on one of their maps. Google have provided some fantastic resources for geo developers, and they've moved the whole field forward, but we can't rely on them for everything. The recent changes to their terms of service have alerted a few people to this long-standing issue, so here are the alternatives I've discovered over the years, and why I think you should look into open-source solutions.

Yahoo

The easiest change for an application developer is to use one of Yahoo's excellent geocoding APIs, either Placefinder for street addresses, or Placemaker for more unstructured names of places like towns, provinces or countries. There are no restrictions on how you use the data, and you get 50,000 requests a day. It has good coverage worldwide (though I recently noticed an issue with Finland).

The biggest downside is that they clearly have an uncertain future. Yahoo hasn't managed to monetize their awesome developer APIs, and most of the engineers involved in setting them up have left. It's nerve-wracking to build your application on an API that could disappear at any point!

Schuyler Erle

Schuyler is a one-man open-source-geocoding machine! He wrote the original Perl module for taking US Census data and looking up addresses, and also created an updated Ruby version for the Geocommons folks. I've found it works impressively well on US addresses. The biggest drawbacks are the requirement that you download and import many gigabytes of US census data before you can set it up on your own machine, and a lack of international coverage.

Nominatim

OpenStreetMap has created the Nominatim project for converting addresses into coordinates using its open-source collection of mapping information. Unfortunately it's way too logical for its own good, expecting to receive addresses that are strictly hierarchical. For example, it can't understand "40 Meadow Lane, Over, Cambridge CB24 5NF, United Kingdom", you have to mangle it to something unnatural like "40 Meadow Lane, Over, Cambridgeshire, England" before it starts to parse it, and even then it picks the wrong one as the first result. It also generally doesn't know where numbers fall on particular streets, since it relies on landmark points like pubs with numbers attached, and these are generally very sparse.
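You can poke at this behavior yourself through the public Nominatim endpoint, whose search API takes a free-form q parameter and can return JSON. Here's a quick Python 2 sketch; treat the exact response fields as an assumption and check the Nominatim docs if they've changed:

import json
import urllib

address = '40 Meadow Lane, Over, Cambridgeshire, England'
params = urllib.urlencode({'q': address, 'format': 'json'})
url = 'http://nominatim.openstreetmap.org/search?' + params
# Each result should include a display name and latitude/longitude strings.
for result in json.load(urllib.urlopen(url)):
    print('%s (%s, %s)' % (result['display_name'], result['lat'], result['lon']))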

Data Science Toolkit

Since I couldn't find anything that met my needs, I decided to take a shot at pulling together a lot of the existing resources into a more convenient package. I took Schuyler's work on the TIGER/Line data for US addresses, and used some of the Nominatim backend code with a more flexible front-end to handle a wider range of postal addresses. I then rolled up a couple of virtual machine packages so you don't have to do the messy data importing yourself; you can grab it as an Amazon AMI or a VMware image. You can also get started using the main datasciencetoolkit.org site through the API, but I wouldn't recommend it for heavy use since it's just a single machine.
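As a taste of the interface, here's the same sort of Python 2 sketch against the street2coordinates endpoint on the public server; the response layout shown here is from memory, so double-check it against the documentation before relying on it:

import json
import urllib

# street2coordinates returns a JSON dictionary keyed by the address you sent.
address = '2543 Graystone Place, Simi Valley, CA 93065'
url = ('http://www.datasciencetoolkit.org/street2coordinates/' +
       urllib.quote(address))
response = json.load(urllib.urlopen(url))
info = response[address]
print('%s, %s' % (info['latitude'], info['longitude']))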

Its main limitation is that it only handles US and UK addresses. The UK lookups are all done through OpenStreetMap data, so it should be possible to extend it worldwide given enough work; I just haven't been able to devote enough time to that. I'd love to see someone extend the current code though, or improve a different project like Nominatim, or even start a whole new one. There's already enough data out there to build a truly open API for geocoding, so let's make it happen!

Analytics adventures with Pig and Cassandra

Photo by Jes

After a lot of head-scratching, I ended up choosing Cassandra as the main data store for my latest project. I needed a system that could handle a loading process throwing hundreds of thousands of items a minute at it, denormalized across multiple indexes, whilst simultaneously serving up results to a web application, and so far it has performed magnificently. 

Unfortunately it has ended up being a victim of its own success. Now that we have tens of millions of pieces of user-generated content, we want to ask the data questions. That's not so easy with NoSQL, so here's some notes on the solution I ended up building.

Hadoopable

The only way to run code across data held on a large cluster of machines is to execute the processing on a similarly large cluster, preferably on machines that already have local copies of the data. That means using Hadoop, and since I'm using DataStax EC2 machine images for my Cassandra servers anyway, I started by trying to add some Brisk AMIs to my existing cluster. This is a Hadoop distribution, pre-configured to integrate with Cassandra, designed for exactly what I was hoping to do. Unfortunately I struggled to figure out the right startup parameters to get the newly-created machines talking to my existing cluster, despite some excellent help from Joaquin. I took a break from that, and discovered that hand-building the basic Cassandra components I needed wasn't too hard on my existing Hadoop machines, so I continued with more of a home-brew setup.

Pig or Hive?

In order to run Hadoop jobs on Cassandra data, you need a fast way to pull it out from your tables into a supported coding environment. The only two supported languages for this in the current Cassandra releases are Pig and Hive. Pig is a procedural data-transformation language, whereas Hive looks a lot like SQL. I think a lot more procedurally and I needed something that could handle some tough unpacking and formatting tasks, so I went with Pig.

Squeal!

I don't know if my experiences would have been any better with Hive, but I found I was walking a fairly lonely path using Pig in conjunction with Cassandra. Jeremy Hanna was a life-saver, and Brandon Williams put in a lot of hard work to get me up and running, but my initial encounter involved a couple of days of me tearing my hair out. I was trying to make some sense out of the results I was seeing on the latest stable release of Cassandra, but they left me baffled. It turned out that the recent introduction of types into the Cassandra adapter had broken all the existing example code, and left the schema reported for the data very different from the actual structure that was returned. Happily I was able to monkey up a messy patch, which Brandon then fixed properly, but it definitely made me realize how far out on the bleeding edge I was. That's a place I usually try to avoid for mission-critical projects!

The right tool for the job

With that overcome, I was able to move forward, but not as speedily as I had hoped. Pig's great strength is that it's a domain-specific language for data processing. It's a big bag of useful operations, with no particular grand design for the language. It reminds me of PHP or R, and I don't mean those comparisons as an insult; I have a fondness for this sort of language. When you're working inside their domain they're extremely productive: you almost never need to install extra dependencies, and everything's at your fingertips. Sadly, I found I was operating a bit outside of the mainstream for Pig.

Pig wrestling

As an example, I store a large array of records per user. In theory, I could store each item as a new row in Cassandra, with a secondary index for the user key, but in practice the performance and storage overhead of that approach rules it out. A CSV string is a simple but effective way of holding the data, so when I'm running a Pig script, I need to decode that string back into records. There is a smart CSV loader in the latest release, but it only works when you're reading in files, not on strings you've already loaded. To make the job easier, I reloaded all my data in a format that's a lot easier to parse (stripping out any line terminators or separator characters inside quoted strings, for example, as shown in the sketch below) and then set out to do the job using Pig's built-in primitives.
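To make that packing step concrete, here's a hypothetical Python sketch of the kind of sanitizing I mean. Once the separator characters can never appear inside a field, unpacking needs nothing smarter than split():

# Made-up records, as they might be packed into one Cassandra column value.
records = [
    ['photo123', 'A day at the beach', '42'],
    ['photo456', 'Storm over the lake', '17'],
]

def sanitize(field):
    # Remove anything we plan to use as a separator, so a plain
    # split() is guaranteed to recover the original structure.
    for separator in ('\n', '\r', ',', '|'):
        field = field.replace(separator, ' ')
    return field

packed = '|'.join(','.join(sanitize(f) for f in record)
                  for record in records)

# Unpacking is just two splits, with no quoting rules to worry about.
unpacked = [row.split(',') for row in packed.split('|')]
print(unpacked)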

I thought I could just use STRSPLIT to break up my strings into individual rows, but it turns out that it only returns a tuple. This is bad because turning a tuple into separate records for further processing is pretty involved at best. What you really need is what Pig calls a bag, an unordered set of records that can be easily turned into a proper stream of records. TOKENIZE is almost identical to STRSPLIT and returns a bag, so I thought I was in luck. Unfortunately it doesn't take a parameter allowing you to specify what characters to split the string on. Undaunted, I thought I should contribute something back to the code base and create a patch. I dusted off my Java neurons, and created the changes I needed. That let me finally parse the CSV files I was dealing with, but I was stuck as I tried to figure out how to clean up the code to offer it as a patch.

The problem is that my new TOKENIZE takes an optional second parameter to specify the custom characters to split on. I needed to keep the existing behavior for a single-parameter call to avoid breaking old scripts. That's easy enough to handle in the execution code, where I just check the number of arguments passed in, but there's also a separate function that exposes the signature. That doesn't support variable numbers of arguments, as a known limitation. The suggested workaround is to remove the signature definition entirely, but since that's presumably there for a reason, it didn't seem a sensible approach. In the end I was stymied, but since I could move forward with a slightly custom branch of Pig, I reluctantly abandoned the patch.

Happily ever after

I don't want to sound too negative about my experiences; now that I've got the basics set up, I'm able to write scripts very quickly and answer all sorts of questions. It's also amazing to think about the power all this free software unleashes, and the generosity of the community who helped me out as a newbie. If you're considering a similar project though, I would either budget for more research time than you might expect, or track down a native guide, somebody who has already done something similar in a production environment.

[Update – Jon Coveney added a great comment explaining more about how to solve the TOKENIZE issue I hit, so I'm including it below, and I'll do another post when I give it a try]

Hey Pete, thanks for writing about your experience. It's a goal of mine to find a project to use Cassandra for, and I'm sure I'll be walking through many similar problems. 

Just wanted to note something about the TOKENIZE piece. Pig can be a bit weird about variable arguments, but in this case, it shouldn't be too bad. You have two options. 

1 is to have an optional constructor which takes one parameter. You could then do 
DEFINE mytokenize TOKENIZE('*'); 

now you could just use mytokenize normally. Implementing this is as difficult as implementing the constructor. 

Another option is that you can have pig take 1 or 2 arguments. The limitation you pointed to doesn't actually apply here…that limitation is that if you have a function that takes a variable number of arguments _that also takes arguments of different types_ (this part is key) THEN you can't do it. [1] 

In your case, tokenize always takes a string, and an optional string delimiter. I personally would go the constructor route, but either is fine 🙂 I usually write UDF's with an initialize method instead of doing it all constructors anyway, so you would only have to check the number of arguments once. 

Happy hacking 
Jon 

[1] To explain where this applies, think of SQL's coalesce function. This is a function that can take both arguments of different types, and varying numbers of them. So you could do coalesce(1,2,3,4,5) or coalesce('hey','you','get','out') or whatever. Pig does not allow you to do this. With the getArgToFuncMapping, you can map to a function that takes a fixed number of arguments. Or you can have a varying number of arguments, but it can't be sensitive to the type of the input…while you can still then implement something like coalesce, it's going to be slow.