Five short links

fiveinches

Photo by Sean Lamb

Nine ways to break your systems code using volatile – Even seemingly-simple constructs in low-level languages can have tremendous subtleties. I love reading explorations like these, hoping I never need to use the knowledge in production, but feeling like a little more of my in-game map has been filled in. At some point pedantic becomes sublime.

Bayes Rule and the paradox of pre-registration of RCTs – There’s a movement to declare what hypotheses you’re going to test before you start your research, to avoid the classic cherry-picking problem. Donald does a great job explaining why it should make a big difference in how much we trust study results, even though it feels counter-intuitive, and again a bit pedantic. I guess 2014 is becoming the year I try to bring pedantry back?

Are towns stuck in the wrong places? – It’s not often you can perform a 2,000 year-long natural experiment, but this look at the performance of British and French towns after the Romans is intriguing, and very relevant as we consider how to respond to struggling cities like Detroit.

The Chinese wheelbarrow – The eastern version of the wheelbarrow could carry far larger loads than the European approach, thanks to the design of a central wheel that allowed most of the weight to be taken by the vehicle, instead of the operator. This article makes a convincing case that the Europeans lost out heavily by sticking to their wheel-forward design that left half the lifting on the driver, but it also left me wanting to dig into the contrarian angle, and see what good reasons there might be for the adherence to tradition.

Empathy is a core engineering value – We have a tremendous amount of power as engineers, and time and again I’ve seen decisions that save a few hours for a single developer cost man-years of time and frustration for thousands of other people. Keith does a great job of showing why “it’s essential to at least attempt to understand the plight of users”. You can’t always gold-plate your software and deal with every problem in the depth you’d like, but putting yourself in the end-user’s shoes will help prioritize where to best spend your limited time.

What does Jetpac measure?

appscreenshot

Jetpac is building a modern version of Yelp, using Big Data rather than user reviews. People take a billion photos every day, and many of these are shared publicly on social networks. We’re analyzing these pictures to build better descriptions of bars, restaurants, hotels, and other venues around the world.

When you see a label like “Hipsters” in the app, you probably wonder where it comes from. The short answer is that we’re spotting places that have a lot of mustaches! There’s a lot going on under the hood to reach that conclusion, and we’ve had fun building some pretty unusual algorithms, so I’ll be geeking out a bit about how we do it.

One thing to bear in mind when digging into this is that we’re in the engineering business, not research, so our goal is to build tools that meet our needs, rather than trying to perform basic science. While I’ve included the results of our internal testing, nothing here has gone through rigorous peer review, so use our conclusions with care. The ultimate proof is in the app, which I’m damn proud of, so please download it and see for yourself!

Image-based measurements

The most important information we pull out is from the image pixels. These tell us a lot about the places and people who are in the photos, especially since we have hundreds or thousands of pictures for most locations.

One very important difference between what we’re doing with Big Data and traditional computer vision applications is that we can tolerate a lot more noise in our recognition tests. We’re trying to analyze the properties of one object (a bar for example) based on hundreds of pictures taken there. That means we can afford to have some errors in whether we think an individual photo is a match, as long as the errors are random enough to cancel themselves out over those sorts of sample sizes. For example, we only spot 18% of actual mustaches, and we mistakenly think 1.5% of the clean-shaven people we see have facial hair. This would be useless for making a decision on an individual photo, but it’s very effective at categorizing a population of people.

Imagine one bar that has a hundred photos of people, and in reality none of them have mustaches. We’ll likely see one or two mistakenly tagged as having mustaches, giving a mustache rating of 0.01 or 0.02. Now picture another bar where 25 of the hundred people have mustaches. We’ll spot four or five of those mustaches, along with probably one mistakenly-tagged bare face, giving a mustache rating of 0.05 or 0.06.
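If you’d rather see that arithmetic as code, here’s a tiny Python snippet that reproduces it. The 18% and 1.5% rates are the ones quoted above, and the two venues are the imaginary bars from this example.

# Expected "mustache rating" for a venue, given a noisy detector.
# The rates are the ones quoted above; the venues are hypothetical.
DETECTION_RATE = 0.18        # fraction of real mustaches we spot
FALSE_POSITIVE_RATE = 0.015  # fraction of clean-shaven faces we mis-tag

def expected_rating(num_photos, num_with_mustaches):
    """Expected fraction of a venue's photos that get tagged as mustached."""
    hits = num_with_mustaches * DETECTION_RATE
    false_alarms = (num_photos - num_with_mustaches) * FALSE_POSITIVE_RATE
    return (hits + false_alarms) / float(num_photos)

print(expected_rating(100, 0))   # ~0.015 - the clean-shaven bar
print(expected_rating(100, 25))  # ~0.056 - the mustachioed bar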

“All models are wrong, but some are useful.” – George E.P. Box

This might sound like a cheat, and it is, but a useful one! Completely-accurate computer vision is an AI-complete problem, but just like in language translation, the combination of heuristics and large numbers of samples offers an effective alternative to the traditional semantic approach.

We have put together some highly-accurate individual tests too, such as food and sky detection, which are good enough to use for creating slideshows, an application where false positives are a lot more jarring as part of the experience. The key point, though, is that we’re able to draw on a much wider range of algorithms than traditional object recognition approaches. Because we’re focused on using them for data, any algorithm with a decent correlation to the underlying property helps, even if it would be too noisy to use for returning search results.

Testing

Internally, we use a library of several thousand images that we’ve manually labeled with the attributes we care about as a development set to help us build our algorithms, and then a different set of a thousand or so to validate our results. All of the numbers are based on that validation set, and I’ve included grids of one hundred random images to demonstrate the results visually.

We’re interested in how well our algorithms correlate with the underlying property they’re trying to measure, so we’ve been using the Matthews Correlation Coefficient (MCC) to evaluate how well they’re performing. I considered using precision and recall, but those ignore all the negative results that are correctly rejected; that’s the right approach for evaluating search results you’re presenting to users, but it isn’t as useful as a correlation measurement for a binary classifier. The full testing result numbers are up as a Google spreadsheet, but I’ll be quoting the MCC in the text as our main metric.
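For anyone who hasn’t bumped into the MCC before, it’s just a correlation computed from the four cells of the confusion matrix. Here’s a minimal Python sketch; the counts in the example call are made up, not our actual test numbers.

import math

def matthews_correlation(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.
    Returns 0.0 when a marginal is empty, following the usual convention."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

# Made-up counts for an imaginary thousand-image validation run.
print(matthews_correlation(tp=45, fp=15, tn=920, fn=20))  # ~0.70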

Mustaches

We first find likely faces in a photo using image recognition, and then we analyze the upper lip to determine if there’s a mustache or other facial hair there, with an MCC of 0.29. The false positives tend to be cases where there’s strong overhead lighting, giving a dark shadow underneath people’s noses. We use the prevalence of mustaches to estimate how many hipsters inhabit a venue.
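The production code isn’t something I can share, but the general shape of the approach looks roughly like the Python/OpenCV sketch below: find a face, crop the band between the nose and the mouth, and check how dark it is relative to the rest of the face. The Haar cascade and the darkness heuristic here are stand-ins for our actual detector, so treat the thresholds as placeholders.

import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def has_mustache(image_path, darkness_threshold=0.25):
    """Very rough stand-in: dark pixels in the upper-lip band of a face."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
        # Upper-lip band: roughly the strip between the nose and the mouth.
        lip = gray[y + int(0.62 * h):y + int(0.75 * h),
                   x + int(0.25 * w):x + int(0.75 * w)]
        if lip.size == 0:
            continue
        face_brightness = np.mean(gray[y:y + h, x:x + w])
        dark_fraction = np.mean(lip < 0.6 * face_brightness)
        if dark_fraction > darkness_threshold:
            return True
    return False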

Results on photos with mustaches

.

Results on people without mustaches

Smiles

Once we’ve found faces, we run pattern recognition to look for mouths that appear to be smiling. We’re looking for toothy smiles, rather than more subtle grins, and the test gives us an MCC of 0.41. The measurement we get out is actually the number of pixels in the image we detect as being part of a smile, so large smiles have more weight than smaller ones. We use the number of smiles to estimate how good a time people are having at a place.
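OpenCV happens to ship a Haar cascade trained on toothy smiles, so a rough stand-in for this measurement might look like the sketch below. The pixel-area weighting mirrors the description above, but the cascade and its parameters are just illustrative, not what we run in production.

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
smile_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")

def smile_pixel_area(image_path):
    """Total pixel area of detected smiles, so big grins count for more."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    total_area = 0
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
        # Only search the lower half of each face for a mouth.
        lower_face = gray[y + h // 2:y + h, x:x + w]
        for (sx, sy, sw, sh) in smile_cascade.detectMultiScale(
                lower_face, scaleFactor=1.7, minNeighbors=20):
            total_area += sw * sh
    return total_area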

People with smiles

Results on people with smiles

.

Results on people without smiles

Results on people without smiles

Lipstick

We look for an area of bright-red color in the lower half of any faces we detect. We have an MCC of 0.36, with some of the false positives caused by people with naturally red lips. The amount of lipstick found is used to calculate how dressy and glamorous a bar or club is.
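As a sketch of the idea in OpenCV terms: find a face, take its lower half, and count the bright, saturated red pixels. The hue and saturation bounds below are guesses for illustration, not our tuned values.

import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lipstick_score(image_path):
    """Largest fraction of bright-red pixels in the lower half of any face."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    scores = [0.0]
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
        lower = hsv[y + h // 2:y + h, x:x + w]
        # Red wraps around both ends of OpenCV's 0-179 hue range.
        reds = cv2.inRange(lower, (0, 120, 120), (8, 255, 255)) | \
               cv2.inRange(lower, (172, 120, 120), (179, 255, 255))
        scores.append(np.count_nonzero(reds) / float(reds.size))
    return max(scores)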

Results on people with lipstick

.

Results on people without lipstick

Plates

We run an algorithm that looks for plates or cups taking up most of the photo. It’s fairly picky, with a precision of 0.78, but a recall of just 0.15, and an MCC of 0.32. If a lot of people are taking photos of their meals or coffee, we assume that there’s something remarkable about what’s being served, and that it’s popular with foodies.
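I won’t pretend the sketch below is our detector, but one plausible way to approximate “a plate or cup taking up most of the photo” is to look for a single large circle dominating the frame, for example with a Hough transform. Everything here is illustrative.

import cv2

def looks_like_a_plate(image_path, min_frame_fraction=0.5):
    """Rough stand-in: is there one big circle covering most of the photo?"""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 5)
    height, width = gray.shape
    shorter_side = min(height, width)
    circles = cv2.HoughCircles(
        gray, cv2.HOUGH_GRADIENT, dp=2, minDist=shorter_side,
        param1=100, param2=60,
        minRadius=int(shorter_side * min_frame_fraction / 2),
        maxRadius=shorter_side // 2)
    return circles is not None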

Results on photos with large plates or cups

.

Results on photos without large plates or cups

Exposed chest

We look about half a head’s height below a detected face, and see how large a contiguous area of skin-colored pixels is exposed. This will detect bare chests and low-cut dresses, with the value we get out corresponding to how much skin is exposed. We use this measurement to estimate how risqué a bar or nightclub is. Pink sweaters and other items at chest height can easily cause false positives, so it requires quite a large sample size to have much confidence in the results.
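Here’s a rough sketch of that geometry: drop down from each detected face, crop a chest-sized box, and measure how much of it is skin-toned, using the same kind of hue mask described in the Skin section below. The ratios and color bounds are illustrative guesses, not our production values.

import cv2
import numpy as np

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def chest_skin_fraction(image_path):
    """Largest fraction of skin-toned pixels in a box half a head below a face."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    fractions = [0.0]
    for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
        top = min(y + int(1.5 * h), img.shape[0])  # half a head below the chin
        bottom = min(top + h, img.shape[0])
        chest = hsv[top:bottom, max(x - w // 2, 0):x + w + w // 2]
        if chest.size == 0:
            continue
        skin = cv2.inRange(chest, (0, 40, 60), (25, 180, 255))  # guessed bounds
        fractions.append(np.count_nonzero(skin) / float(skin.size))
    return max(fractions)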

Skin

The skin detection algorithm is simple, looking at which pixels are within a particular ‘flesh-colored’ hue range. This simplicity does make it prone to identifying things like beige walls and parchment menus as skin too, unfortunately.
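In code, that heuristic is little more than a hue-range mask, something like the snippet below, though the exact bounds we use are tuned rather than these illustrative ones.

import cv2
import numpy as np

def skin_fraction(image_path):
    """Fraction of pixels whose color falls in a rough 'flesh-colored' band."""
    hsv = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 40, 60), (25, 180, 255))  # illustrative bounds
    return np.count_nonzero(mask) / float(mask.size)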

Sky

We scan through the top of the photo, looking for areas that have the color and texture of blue sky. This misses photos taken on overcast days or at sunset, and can be confused by blue ceilings, but it has proven effective at judging how scenic a place is, and whether a bar has an outdoor area. Our tests show we get an MCC of 0.84.
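A hedged sketch of the same idea: take the top strip of the image and ask whether it’s both blue and smooth. The hue band and the smoothness test below are stand-ins for whatever the production code actually does.

import cv2
import numpy as np

def has_blue_sky(image_path, blue_fraction_threshold=0.3):
    """Rough stand-in: smooth, blue pixels in the top quarter of the photo."""
    img = cv2.imread(image_path)
    top = img[:img.shape[0] // 4]
    hsv = cv2.cvtColor(top, cv2.COLOR_BGR2HSV)
    blue = cv2.inRange(hsv, (95, 60, 120), (130, 255, 255))
    blue_fraction = np.count_nonzero(blue) / float(blue.size)
    # Sky should be fairly smooth, so reject busy, textured regions.
    gray = cv2.cvtColor(top, cv2.COLOR_BGR2GRAY)
    is_smooth = cv2.Laplacian(gray, cv2.CV_64F).var() < 200
    return is_smooth and blue_fraction > blue_fraction_threshold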

Results on photos with blue sky

.

Results on photos without blue skies

Colorfulness

This measures what proportion of the hue circle is present in the image. It looks at how many different colors are present, rather than how saturated any single color is, trying to find images with a rich range of colors rather than just garish ones. Since this is a continuous quality, we don’t have true and false positive statistics, but here are the results of applying it to one hundred random images, with the most colorful images at the top left, and moving left to right down the image in descending order:

Colorfulness

Photos arranged by colorfulness, starting top-left and moving right and down in descending order
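One straightforward way to measure ‘proportion of the hue circle present’ is to histogram the hues of reasonably-saturated pixels and count how many bins are occupied. The thresholds in this sketch are guesses, but the shape of the calculation is the interesting part.

import cv2
import numpy as np

def colorfulness(image_path, bins=36, min_pixels_per_bin=50):
    """Fraction of hue bins containing a meaningful number of saturated pixels."""
    hsv = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2HSV)
    hues = hsv[:, :, 0][hsv[:, :, 1] > 60]  # ignore washed-out pixels
    histogram, _ = np.histogram(hues, bins=bins, range=(0, 180))
    return np.count_nonzero(histogram > min_pixels_per_bin) / float(bins)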

Quality

Bad photos aren’t much fun to look through, so we try to spot common mistakes like bad lighting, composition, focusing, or exposure. This is aimed at getting rid of the worst photos, rather than surfacing the very best, since it’s a lot harder to algorithmically tell the difference between a good photo and a great photo than it is to distinguish decent pictures from terrible ones. We get an MCC of 0.76.
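Our real checks are a grab-bag of heuristics, but two of the classic quick tests for ‘terrible’ are blur (variance of the Laplacian) and exposure (how much of the histogram is crushed into the extremes). Here’s a sketch along those lines; the thresholds are illustrative, not ours.

import cv2
import numpy as np

def looks_terrible(image_path, blur_threshold=50.0, clipped_threshold=0.4):
    """Rough stand-in for the worst-photo filter: very blurry, or with most
    pixels crushed into the darkest or brightest ends of the histogram."""
    gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)
    too_blurry = cv2.Laplacian(gray, cv2.CV_64F).var() < blur_threshold
    clipped_fraction = np.mean((gray < 10) | (gray > 245))
    return too_blurry or clipped_fraction > clipped_threshold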

Results on decent-quality photos

.

Results on low-quality images

Results on lower-quality images

Name-based labels

Some of the information about a venue is gleaned from looking at the names of the people who take photos there. https://petewarden.com/2013/06/10/how-does-name-analysis-work/ has a full run-down of how this works, along with source code.

Asian

We look at how many of the photographers at this venue have typically-Asian family names to figure out which places are particularly popular with Asian people. It’s based on statistics from the US Census, so it’s much less accurate outside America.

Hispanic

Very like the Asian label, this is applied when a lot of the photographers at a venue have a surname that’s prevalent in the Hispanic world.

Men/Women

This measurement uses the first name of the photographer to guess at whether they’re male or female. There are some ambiguous names like Francis, but overall it’s an effective technique.
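The linked post has the real code, but at heart all three of these labels are frequency lookups against name tables. Here’s a toy sketch of the shape of it; the table contents are tiny made-up samples standing in for the Census data.

# Tiny made-up samples standing in for the real US Census surname tables
# and first-name lists used by the name-analysis code linked above.
ASIAN_SURNAME_RATES = {"nguyen": 0.96, "kim": 0.94, "smith": 0.0}
HISPANIC_SURNAME_RATES = {"garcia": 0.92, "hernandez": 0.94, "smith": 0.0}
FEMALE_FIRST_NAMES = {"maria", "susan"}
MALE_FIRST_NAMES = {"peter", "james"}

def name_label_scores(photographer_names):
    """Average the per-name probabilities across a venue's photographers."""
    first_names = [name.split()[0].lower() for name in photographer_names]
    surnames = [name.split()[-1].lower() for name in photographer_names]
    count = float(len(photographer_names))
    return {
        "asian": sum(ASIAN_SURNAME_RATES.get(s, 0.0) for s in surnames) / count,
        "hispanic": sum(HISPANIC_SURNAME_RATES.get(s, 0.0) for s in surnames) / count,
        "women": sum(f in FEMALE_FIRST_NAMES for f in first_names) / count,
        "men": sum(f in MALE_FIRST_NAMES for f in first_names) / count,
    }

print(name_label_scores(["Maria Garcia", "Peter Kim", "Susan Nguyen"]))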

Behavior-based labels

Some of our venue descriptions come from the other places that their visitors have also been to. These behavior-based labels are applied when the number of people who have been to particular types of places is several times higher than average.
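In code terms that’s just a ratio test against the global baseline. Here’s a minimal sketch; the category names and the ‘several times’ multiplier are placeholders.

def behavior_labels(venue_visit_rates, global_visit_rates, multiplier=3.0):
    """Label a venue when its visitors hit a category of place at several
    times the average rate. Rates are the fraction of a venue's photographers
    who have also shared photos from that category of place."""
    labels = []
    for category, venue_rate in venue_visit_rates.items():
        baseline = global_visit_rates.get(category, 0.0)
        if baseline > 0 and venue_rate >= multiplier * baseline:
            labels.append(category)
    return labels

# Hypothetical example: 20% of this venue's photographers have also been to
# a ski resort, against a 4% global average, so the ski label applies.
print(behavior_labels({"ski_resorts": 0.20}, {"ski_resorts": 0.04}))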

Locals

People who have taken photographs at least 30 days apart in the same city as this venue.

Tourists

Applied to photographers who have a longer history of taking photos elsewhere, but were taking photos in the same city as this place for less than 30 days.
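Both of these labels boil down to a date-span check on each photographer’s history in the city. A minimal sketch, with hypothetical inputs:

from datetime import date

def classify_visitor(photo_dates_in_city, has_longer_history_elsewhere):
    """Locals have photos in this city at least 30 days apart; tourists have a
    shorter span here plus a longer photo history somewhere else."""
    span_in_days = (max(photo_dates_in_city) - min(photo_dates_in_city)).days
    if span_in_days >= 30:
        return "local"
    if has_longer_history_elsewhere:
        return "tourist"
    return "unknown"

print(classify_visitor([date(2013, 6, 1), date(2013, 9, 15)], False))  # local
print(classify_visitor([date(2013, 6, 1), date(2013, 6, 4)], True))    # tourist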

Business Travelers

If someone we’ve identified as a tourist in this city has taken photos at a conference or office, we’ll guess they’re business travelers.

LGBT

This is applied to places where lots of photographers have also taken photos at a gay bar on at least two different occasions.

Ski Bums

Applied to places popular with people who’ve also been to ski resorts.

Gym Bunnies

People who’ve also taken pictures at gyms.

Kung Fu Fighters

Photographers who’ve been to martial arts studios.

Skaters

People who’ve taken photos at skate parks.

Sun Worshippers

Photographers who’ve documented beaches.

Dog People

Places popular with people who’ve also been to dog runs.

Pet Lovers

If the visitors to this venue have also taken pictures at pet stores or animal shelters, we’ll flag it as popular with pet lovers.

Surfers

People who have shared photos from beaches popular with surfers.

Winos

Oenophiles who have been to vineyards or wine stores.

Sports Fans

Anyone who has taken a photo at a sports stadium on multiple occasions.

Musos

People who’ve been to live music performance spaces like concert halls or music festivals.

Strip Club Denizens

Places where an unusual number of visitors have also shared photos from a strip club.

Students

Anyone who has taken a picture on a college campus.

Parents

Places popular with people who’ve shared photos from schools, day-care centers, toy stores, or theme parks on multiple occasions.

Stoners

Venues where a lot of the photographers have also been to smoke shops.

Jet Setters

People who’ve taken photos from airport buildings on three or more different occasions.

Intellectuals

Venues where the attendees have also been to art galleries, museums, or small theatres.

Startup Folks

Anyone who’s taken a photograph at a tech startup office, or a co-working space.

Outdoorsmen

People who’ve taken hiking, camping, or hunting pictures.

Finding out more

If you’ve made it this far, you’re pretty dedicated, and I salute you! I find this whole area fascinating, and I love chatting with anyone else who’s as interested in it too. I’m @petewarden on Twitter, and my email is pete@jetpac.com. I’d love to hear from you, whether it’s brickbats, bouquets, or just comments!

Five short links

pentagonalwell

Photo by Harshvardhan Dhawan

The county problem in the West – This is a brilliant example of why you need to understand some GIS basics to sensibly use even the most basic geographic statistics. The large size and arbitrary boundaries of western US counties mean that the default view of historical settlement is muddled, and only by switching to alternate spatial partitions can you understand what was actually going on.

The cost of satisfaction – Patients who are satisfied with their doctors are more likely to die than the malcontents! This appears to be a real effect, judging by the statistics, and I wonder if it’s because picky patients are more likely to push for more information and second opinions? Whatever the cause, it’s a good reminder that even the most obvious metrics might not match up with the goal you’re trying to achieve.

Don’t mix threads and fork – The complexities of getting threads to play nicely with fork() are mind-boggling, and in practice seem insurmountable.

Free GIS data – An impressive list of free-as-in-beer geographic data. As the page recommends, do look into the licensing terms for any you want to use. You might be surprised at the requirements for something like the Open Database License if you use OpenStreetMap files for example.

Signals from the void – A blend of the inspiring vision of picturing the black hole at the center of our galaxy with the mundane grind of performing research from bleak mountain-tops, at the mercy of the weather and unreliable equipment. This story rings very true, especially around the chaos behind the project and the personalities of the people who are attracted to the quest.

Five short links

fivev

Photo by Jed Sullivan

Sunlight intensity based global positioning system – It turns out you can geo-locate underwater sensors to within a few kilometers just by measuring the sunset and sunrise times. It’s a beautifully cheap way to figure out where fixed-position outdoor sensors are, since taking light measurements a few hundred times across a day is simple to implement, and doesn’t take much power or computation.

NSNotificationCenter with blocks considered harmful – Managing the lifetime of memory allocations is incredibly hard, and this is a cautionary tale in how nasty it can get.

Blackhash – Do you really trust your security audit company with your hashed password files? An interesting approach that allows them to do some limited testing, without handing over the data itself.

Geo-located Twitter as the proxy for global migration patterns – Understanding how the world is connected by analyzing people who tweet from multiple countries.

Analyzing the Iranian Embassy bombing in Beirut from photos – The format of the slideshow is a bit hard to navigate, but it’s worth stepping through. Felim explains how he used a combination of ground and satellite photos to verify that a suspect video was actually taken at the right time and place.

Five short links

keyfive

Photo by egazelle

How to program unreliable chips – It’s been a vitally useful simplification to pretend that computer calculations are 100% reliable, but as our data volumes grow and chips shrink, we’ll need to start planning for errors.

Abigail’s regex to test for prime numbers – A thing of terrifying beauty.

Disguise detection – Using cheap IR detectors to check that the face you’re detecting in visible light isn’t a latex mask.

Mapping “For Whom the Bell Tolls” – A thoughtful look at how locations occur in the novel, with the visualization in its proper place as another tool in the analysis, rather than being seen as the final product. It’s worth following some of the links too; you’ll find gems like this analysis of the baggage and future of GIS.

Know thy Java object memory layout – All abstractions leak around the edges, and I love catching glimpses of the machinery that’s whirring away inside black boxes. The complexity of the accumulating layers of software archaeology we’re building on top of is staggering.

Five short links

fivefences

Photo by Christophe Kummer

Why manual memory management can be worse for performance than garbage collection – I spent over a decade coding in C and C++, and these are true words. Instead of GC pauses you’ll have ‘deallocation hiccups’ whenever a big object destructor or scope change occurs, and reference counting is an intrusive performance hog that leads to horrific constructs like loops that repeatedly Release() a COM object until the reference count is zero. This allocation meditation from someone switching to C from Python is worth a read too; it captures the near-obsession with malloc that C programmers have to develop.

Proper handling of SIGINT/SIGQUIT – Have you ever wondered what’s going on when you press Control-C in the terminal? This article is a great case study on how a seemingly simple requirement spirals into tough-to-get-right complexity when you have to integrate it into a wider system.

Use multiple CPU cores with your Linux commands – How to use GNU Parallel to speed up your grep, awk, and sed-ing.

Thieves pose as truckers to steal huge cargo loads – The interesting part is that criminals are doing intensive web research to build themselves convincing false identities from publicly-available information. Open data has its downsides.

Accidental aRt – When R attacks!

Five short links

hatsofftocanada

Photo by Morgan

Starivore Extraterrestrials? – Are some of the strange binary star systems we’re discovering actually evidence of a strange form of life? Almost certainly not, but it’s worth reading just for the sheer audacious imagination of the idea.

Want to API enable your COBOL applications? – Over the years I’ve developed tremendous respect for the depth of subtle requirements that have been baked into legacy applications through countless undocumented changes. When I was younger my first instinct was always to rewrite them, but after discovering by painful experience that the complexity of the old software almost always reflected the poorly-articulated complexity of the users’ needs, I learned to love shims like these. By wrapping modern web APIs in a layer that looks like the file system that COBOL programs understand, you can keep the knowledge embedded in them.

Bad Attitude – The motivational-speaker framing will drive you crazy, but there’s truth and real research buried in this analysis of how your attitude affects you at work. It might explain why I often struggled in corporate jobs where I didn’t have as much of a personal connection to the bigger goals, but “true believers” in a company’s mission are rewarded over skeptics, regardless of talent.

Getting real about distributed system reliability – We’ve spent man-months dealing with Cassandra setup and maintenance at Jetpac. It’s a massive investment for a small startup, and I struggled to avoid it for exactly the reasons Jay brings up. The real cost is the amount of time it takes to keep things running reliably, and if DynamoDB had been available when we started it would have been my technology of choice. I even considered using my S3-as-a-database approach to keep the maintenance time minimal!

JSON parser as a single Perl regex – Terrifyingly cool. Coolly terrifying.

Five short links

fivepipes

Photo by Stefano

Seven command-line tools for data science – 90% of data science is loading the damn stuff, and this is a great set of basic utilities for a lot of the formats you’ll have to deal with.

Classifying digits with deep-belief networks – A very readable guide to the new new thing in machine learning!

Our logo looks like underpants – British people are weird.

Busting the King’s Gambit, this time for sure – I don’t know chess, but the state of the art of computerized analysis is amazing.

Sloane’s Gap – A numerical investigation into strange properties of a large collection of number series. I learned that 11630 is the first uninteresting number, and there are 350 interesting sequences that contain 1729, so it’s even more exciting than Ramanujan thought!

 

Geocode the world with the new Data Science Toolkit

watercolorworld

Picture by Nicholas Raymond

I’ve published a new version of the Data Science Toolkit, which includes David Blackman’s awesome TwoFishes city-level geocoder. The biggest improvement is that the Google-style geocoder, largely based on data from the Geonames project, now handles millions of places around the world in hundreds of languages:

http://www.datasciencetoolkit.org/maps/api/geocode/json?sensor=false&address=القاهرة

{
  "status": "OK",
  "results": [
    {
      "types": [
        "locality",
        "political"
      ],
      "address_components": [
        {
          "types": [
            "locality",
            "political"
          ],
          "short_name": "Cairo",
          "long_name": "Cairo, EG"
        },
        {
          "types": [
            "country",
            "political"
          ],
          "short_name": "EG",
          "long_name": "Egypt"
        }
      ],
      "geometry": {
        "viewport": {
          "southwest": {
            "lng": 31.1625480652,
            "lat": 29.9635601044
          },
          "northeast": {
            "lng": 31.3563537598,
            "lat": 30.1480960846
          }
        },
        "location": {
          "lng": 31.24967,
          "lat": 30.06263
        },
        "location_type": "APPROXIMATE"
      }
    }
  ]
}
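If you want to script against it, the endpoint behaves like the Google geocoder it mimics, so a few lines of Python are enough to pull out coordinates. This sketch uses the requests library, but any HTTP client will do.

import requests

def geocode(address):
    """Look up an address with the Data Science Toolkit's geocoder."""
    response = requests.get(
        "http://www.datasciencetoolkit.org/maps/api/geocode/json",
        params={"sensor": "false", "address": address})
    response.raise_for_status()
    results = response.json()["results"]
    if not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]

print(geocode(u"القاهرة"))  # roughly (30.06263, 31.24967)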

You can also access the TwoFishes API directly, which offers a lot of very powerful features like breaking down search queries into their where and what parts, so you can do something useful with “Pizza New York”.

For the first time I’ve made AMIs available in all the EC2 regions worldwide, and you can download or torrent the Vagrant version. Have fun, and let me know about any issues or improvements you’d like to see!

Five short links

chain

Photo by Racineur

Software development without estimates, specs, or other lies – The secret to being a good coder is understanding the business problem you’re being paid to solve. I know, you just want to code, but that skill’s getting democratized to death. Your real value manifests when you hunt down a ton of messy and contradictory needs and figure out a solution that works for the most important ones.

The mystery of San Francisco English – Did you know San Franciscans used to sound like Brooklynites?

On Chomsky and the two cultures of statistical learning – Peter Norvig dismantles Chomsky’s dismissal of statistical models. Bring popcorn.

Five lies your world map told you – Country borders aren’t nearly as well-defined as you might think. Some great examples in here, including the Bir Tawil Trapezoid that neither Sudan nor Egypt want to claim.

If it doesn’t work on mobile, it doesn’t work – Must-read data on understanding mobile, written by Brian Boyer from the trenches of web development. You’ll learn why almost no mobile use is by people on the move, that people prefer reading on phones to desktops or even tablets, and that American hourly usage patterns are very similar to British ones, but without a tea-time spike at 4pm!