How does name analysis work?


Over the last few months, I've been doing a lot more work with name analysis, and I've made some of the tools I use available as open-source software. Name analysis takes a list of names, and outputs guesses for the gender, age, and ethnicity of each person. This makes it incredibly useful for answering questions about the demographics of people in public data sets. Fundamentally though, the outputs are still guesses, and end-users need to understand how reliable the results are, so I want to talk about the strengths and weaknesses of this approach.

The short answer is that it can never work any better than a human looking at somebody else's name and guessing their age, gender, and race. If you saw Mildred Hermann on a list of names, I bet you'd picture an older white woman, whereas Juan Hernandez brings to mind a Hispanic man, with no obvious age. It should be obvious that this is not always reliable for individuals (I bet there are some young Mildreds out there), but as the sample size grows, the errors tend to cancel each other out.

The algorithms themselves work by looking at data that's been released by the US Census Bureau and the Social Security Administration. These data sets list the popularity of 90,000 first names by gender and year of birth, and 150,000 family names by ethnicity. I then use these frequencies as the basis for all of the estimates. Crucially, all the guesses depend on how strong a correlation there is between a particular name and a person's characteristics, which varies for each property. I'll give some rough estimates of how strong these relationships are below, and at the end I link to papers with more rigorous quantitative evaluations.
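To make the mechanics concrete, here's a minimal sketch of how a frequency table like this turns into a guess. The table values below are invented placeholders, not the real Census figures, which give the share of each surname's bearers in each ethnic group:

```python
# Sketch of frequency-based name analysis. The percentages are made up
# for illustration; the real Census data covers ~150,000 family names.
SURNAME_ETHNICITY = {
    "hernandez": {"hispanic": 0.94, "white": 0.04, "black": 0.01, "asian": 0.01},
    "hermann":   {"white": 0.92, "hispanic": 0.03, "black": 0.03, "asian": 0.02},
}

def guess_ethnicity(surname):
    """Return the most likely ethnicity and its probability, or None."""
    freqs = SURNAME_ETHNICITY.get(surname.lower())
    if freqs is None:
        return None  # too rare to appear in the data
    best = max(freqs, key=freqs.get)
    return best, freqs[best]

def population_estimate(surnames):
    """Sum per-name probabilities across a list of names, so that
    individual errors tend to cancel out as the sample grows."""
    totals = {}
    for name in surnames:
        for group, p in SURNAME_ETHNICITY.get(name.lower(), {}).items():
            totals[group] = totals.get(group, 0.0) + p
    return totals
```

The per-person guess is just the most probable row, but the useful output is the aggregate: summing the full probability distributions over a big list gives an expected demographic breakdown that's far more trustworthy than any single guess.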

If you are going to use this approach in your own work, the first thing to watch out for is that any correlations are only relevant for people in the US. Names may be associated with very different traits in other countries, and our racial categories especially are social constructs and so don't map internationally.

Gender is the most reliable signal that we can glean from names. There are some cross-over first names with a mixture of genders, like Francis, and some that are too unique to have data on, but overall the estimate of how many men and women are present in a list of names has proved highly accurate. It helps that there are some regular patterns to augment the sampled data, like names ending with an 'a' being associated with women.
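As a sketch of how that works, here the counts are invented stand-ins for the real SSA baby-name tallies, with the 'a'-ending pattern used as a weak fallback for names that aren't in the data:

```python
# Invented counts standing in for the SSA tables, which cover ~90,000
# first names by gender and year of birth.
NAME_COUNTS = {
    "mildred": {"f": 9800, "m": 40},
    "juan":    {"f": 60, "m": 11200},
    "francis": {"f": 4100, "m": 5600},  # a cross-over name
}

def guess_gender(first_name):
    """Return ('f' or 'm', confidence between 0.5 and 1.0), or None."""
    counts = NAME_COUNTS.get(first_name.lower())
    if counts:
        total = counts["f"] + counts["m"]
        gender = "f" if counts["f"] >= counts["m"] else "m"
        return gender, counts[gender] / total
    if first_name.lower().endswith("a"):
        return "f", 0.6  # weak pattern-based fallback for unseen names
    return None  # too unique to have data on
```

Note how a cross-over name like Francis comes back with a confidence barely above a coin flip, which is exactly the information you need when you're aggregating guesses over a whole list.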

Asian and Hispanic family names tend to be fairly unique to those communities, so an occurrence is a strong signal that the person is a member of that ethnicity. There are some confounding factors though, especially with Spanish-derived names in the Philippines. There are certain names, especially those from Germany and Nordic countries, that strongly indicate that the owner is of European descent, but many surnames are multi-racial. There are some associations between African-Americans and certain names like Jackson or Smalls, but these are also shared by a lot of people from other ethnic groups. These ambiguities make non-Hispanic and non-Asian measures more indicators than strong metrics, and they won't tell you much until you get into the high hundreds for your sample size.

Age has the weakest correlation with names. There are actually some strong patterns by time of birth, with certain names widely recognized as old-fashioned or trendy, but those tend to be swamped by class and ethnicity-based differences in the popularity of names. I do calculate the most popular year for every name I know about, and compensate for life expectancy using actuarial tables, but it's hard to use that to derive a likely age for a population of people unless they're equally distributed geographically and socially. There tends to be a trickle-down effect where names first become popular amongst higher-income parents, and then spread throughout society over time. That means if you have a group of higher-class people, their first names will have become most widely popular decades after they were born, and so they'll tend to appear a lot younger than they actually are. Similar problems exist with different ethnic groups, so overall treat the calculated age with a lot of caution, even with large sample sizes.
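Here's a rough sketch of that calculation. The birth-year counts are invented, and the survival function is a crude stand-in for a real actuarial table, but it shows how the life-expectancy compensation works:

```python
# Invented birth-year distributions; the real data comes from the SSA's
# per-year baby-name counts.
BIRTH_COUNTS = {
    "mildred": {1915: 8000, 1925: 9000, 1950: 1500, 1985: 200},
    "madison": {1985: 500, 1995: 9000, 2005: 12000},
}

def survival_probability(age):
    """Crude stand-in for an actuarial table: the odds that someone
    of this age is still alive."""
    if age < 60:
        return 0.95
    if age < 80:
        return 0.7
    return 0.25

def expected_age(first_name, current_year=2013):
    """Survival-weighted average age for a name, or None if unknown."""
    counts = BIRTH_COUNTS.get(first_name.lower())
    if not counts:
        return None
    weighted = norm = 0.0
    for year, births in counts.items():
        age = current_year - year
        alive = births * survival_probability(age)  # actuarial discount
        weighted += alive * age
        norm += alive
    return weighted / norm
```

Even with the discounting, this expected age inherits all the class and ethnicity skews described above, which is why it deserves more caution than the gender or ethnicity estimates.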

You should treat the results of name analysis cautiously – as provisional evidence, not as definitive proof. It's powerful because it helps in cases where no other information is available, but because those cases are often highly-charged and controversial, I'd urge everyone to see it as the start of the process of investigation not the end.

I've relied heavily on the existing academic work for my analysis, so I highly recommend checking out some of these papers if you do want to work with this technique. As an engineer, I'm also working without the benefit of peer review, so suggestions on improvements or corrections would be very welcome at pete@petewarden.com.

Use of Geocoding and Surname Analysis to Estimate Race and Ethnicity – A very readable survey of the use of surname analysis for ethnicity estimation in health statistics.

Estimating Age, Gender, and Identity using First Name Priors – A neat combination of image-processing techniques and first name data to improve the estimates of people's ages and genders in snapshots.

Are Emily and Greg More Employable than Lakisha and Jamal? – Worrying proof that humans rely on innate name analysis to discriminate against minorities.

First names and crime: Does unpopularity spell trouble? – An analysis that shows uncommon names are associated with lower-class parents, and so correlate with juvenile delinquency and other ills connected to low socioeconomic status.

Surnames and a theory of social mobility – A recent classic of a paper that uses uncommon surnames to track the effects of social mobility across many generations, in many different societies and time periods.

OnoMap – A project by University College London to correlate surnames worldwide with ethnicities. Commercially-licensed, but it looks like you may be able to get good terms for academic usage.

Text2People – My open-source implementation of name analysis.

Fixing OpenCV’s Java bindings on gcc systems


I just spent quite a few hours tracking down a subtle problem with OpenCV's new Java bindings on gcc platforms, like my Ubuntu servers. The short story is that the default for linked symbols was recently changed to hidden on gcc systems, and the Java native interfaces weren't updated to override that default, so any Java programs using native OpenCV functions would mysteriously fail with an UnsatisfiedLinkError. Here's my workaround:

--- a/cmake/OpenCVCompilerOptions.cmake
+++ b/cmake/OpenCVCompilerOptions.cmake
@@ -252,8 +252,8 @@ set(OPENCV_EXTRA_EXE_LINKER_FLAGS_DEBUG "${OPENCV_EXTRA_EXE_LINKER_FLAGS_DEBUG

# set default visibility to hidden
if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_OPENCV_GCC_VERSION_NUM GREATER 399)
- add_extra_compiler_option(-fvisibility=hidden)
- add_extra_compiler_option(-fvisibility-inlines-hidden)
+# add_extra_compiler_option(-fvisibility=hidden)
+# add_extra_compiler_option(-fvisibility-inlines-hidden)
endif()

The tricky part of tracking this down was that nm didn't show the .hidden attribute, so the library symbols appeared fine; it was only when I switched to objdump, after exhausting everything else I could think of, that the problem became clear.

Anyway, I wanted to leave some Google breadcrumbs for anyone else who hits this! I've filed a bug with the OpenCV folks, hopefully it will be fixed soon.

Five short links


External framework problems in Go – Handling dependencies well is extremely hard, and can lead to insane yak-shaving expeditions like this when things go wrong. It's like an avalanche – changing versions on one library can impact several others, so you have to update or downgrade those too, and suddenly you're facing an ever-increasing amount of work just to get back to where you were!

D-wave comparison with classical computers – I don't know enough about quantum computing problems to comment on the details of the argument, but it's awesome to see such a deep technical dive as an instant blog post, rather than having to wait months for a paper.

Blogging is dead, but have we fixed anything? – "I find my blogging here to be too useful to me to stop doing it" sums up why I'm still working in a now-archaic medium!

What statistics should do about big data – "[Statisticians] want an elegant theory that we can then apply to specific problems if they happen to come up." That's been exactly my experience, and why I've never encountered statisticians as I've followed my curiosity to new data problems. The article this is in response to assumes that 'funding agencies' have driven the CS takeover of data processing, but, despite a lot of the founders having roots in academia, almost all the innovations I've seen have been incubated in commercial environments.

The hidden sexism in CS departments – A portrait of managerial cluelessness when dealing with a nasty incident. Even if each occurrence is comparatively minor, it's the steady drip-drip of unwelcoming behavior that drives non-stereotypical geeks out of our world.

Five short links


Max Headroom and the strange world of pseudo-CGI – I've always been fascinated by cargo cult analog tributes to technology. Maybe my early exposure to Max gave me the bug?

Reidentification as basic science – Arvind does a fantastic job of explaining why the research he does is so important. I love learning more about people from data, and most of the interesting insights come from interrogating it in unusual ways and finding unexpected connections, which is what his work is all about.

A 21cm radio telescope for the cost-conscious – Beautiful geekery. Who doesn't want to map the Milky Way's radio emissions using nothing more sophisticated than a $20 USB TV dongle?

How Google Code worked – An eminently-practical guide to implementing a regular-expression search engine, from the author of the late-lamented Google Code. It even comes with working source code!

3D lightning – Calculating the three-dimensional path of a lightning bolt from two simultaneous pictures taken from different spots.

Five short links


CLAVIN – A very promising open source geotagging project that analyzes unstructured text and identifies geographic entities. It has some very neat tricks up its sleeve to disambiguate common names like 'Springfield' based on the context.

The Sokal Hoax: At whom are we laughing? – Post-modernism makes an easy target for hard scientists, but this is a good reminder that some of the giants of physics made even more meaningless pronouncements about fields they knew nothing about.

Name-cleaver – A scrumptious little project from Sunlight Labs that handles a lot of the messy data cleanup work around people and organization names.

altmetrics: a manifesto – On the topic of scientists being silly, the way we measure academic output is antiquated beyond belief, so it was great to see this from my friend Cameron Neylon. We can do way better than citations.

Improving the security of your SSH private key files – This is what happens when hackers (in the old-school sense) get interested in a topic. Martin's curiosity about how SSH works led him to find out some sub-par default settings that make a passphrase on your keys a lot less effective than you might think. I didn't know about those particular problems, but I've always followed my Apple habits and kept my keys on an encrypted DMG.

Five short links


The Cartography of Bullshit – A righteous rant against a piece of pop-sociology digging into just how flimsy the underlying statistics are. It hits home because numbers I've mined have ended up in similar columns – a White Power group even used some of my research to 'prove' Mexicans were conquering Texas based on the numbers of Juans versus Johns! Take all studies on controversial subjects like race with a massive pinch of salt.

Welcome, recent graduates – Advice I wished I'd had when I looked for my first post-college job. 

Sublime DataConverter – We've ended up using CSV for lists of objects where the property names remain constant and JSON for messier data structures and as a programming model post-transport. We've homebrewed a limited set of routines to automatically scan headers or walk all objects and extract all possible properties so we can automatically convert between the two representations, but this project is a much more general approach to the same problem.

The Split-Apply-Combine Strategy for Data Analysis – A technical but enlightening read from Hadley Wickham, covering ways of applying the same algorithms across many different representations of data.

Nightmare after nightmare: Students trying to replicate work – Remember what I said about taking studies with a pinch of salt? Even with help from the original authors, PhD students had incredible trouble reproducing the results of published papers. This isn't just a problem for social science; all science is a messy business, and we need to keep our skepticism intact. That isn't a free pass to ignore evolution and climate change though!
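The header-scanning and property-walking conversions mentioned in the DataConverter link above can be sketched in a few lines. This is a generic illustration of the technique, not DataConverter's actual code:

```python
import csv
import io

def csv_to_objects(csv_text):
    """Scan the header row and read each CSV row as a dict."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def objects_to_csv(objects):
    """Walk all objects to collect the union of their keys, so ragged
    JSON-style records still produce a complete set of CSV columns."""
    columns = []
    for obj in objects:
        for key in obj:
            if key not in columns:  # preserve first-seen order
                columns.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(objects)
    return out.getvalue()
```

The `restval=""` fills in blanks for objects that are missing a property, which is what lets the messier JSON-style records round-trip through the flat CSV representation.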

No more heatmaps that are just population maps!

I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes, and a couple of major new features, and you can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places actually have an unusually high occurrence of X, rather than just having more people.
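The normalization itself is simple once you have the density numbers. Here's a sketch where the population figures are hard-coded placeholders for what a coordinates2statistics lookup would give you per grid cell:

```python
def per_capita_rates(counts, population):
    """Both arguments map a cell id to a value. Returns the rate per
    1,000 residents, skipping empty cells to avoid division by zero."""
    rates = {}
    for cell, count in counts.items():
        people = population.get(cell, 0)
        if people > 0:
            rates[cell] = 1000.0 * count / people
    return rates

# Placeholder numbers: a dense city cell with many raw occurrences can
# still score lower than a sparse rural cell with a genuine hotspot.
raw = {"city": 5000, "rural": 40}
pop = {"city": 2000000, "rural": 3000}
```

Dividing by population is what turns "a heatmap of where people live" into "a heatmap of where X is unusually common", which is the whole point of shipping the density data with the toolkit.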

I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.

text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.

I've expanded language support, with a new Ruby gem that you can get via 'gem install dstk' (which includes unit testing), and an R Package adding the two new APIs to Ryan Elmore's original, available as RDSTK. The Python and Javascript clients have been updated to the latest APIs too.

There's also an official .ova version for people using VMware, up at http://static.datasciencetoolkit.org/dstk_0.50.ova

What's still to be done?

The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.

The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.

Unit testing has shown that text2sentences isn't working at all!

Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.

pete@jetpac.com

Five short links


The Declassification Engine – "Saving history from official secrecy". A fascinating concept that shows how the firehose of cheap distributed computing power fundamentally changes what privacy and secrecy mean. We can probably reconstruct a lot of information that people think they've hidden in these documents, but what are the rules?

A 63-bit floating point type for 64-bit OCaml – I've never used the language, but I adore the bit-fiddling that goes into floating-point representations, and this is a lovely hack on top of them.

Local geocoder – A lovely minimal reverse geocoder that's self-contained, including data. I've been excited to see a blossoming of open geocoding solutions, Nominatim has improved in leaps and bounds, PostGIS now has some strong capabilities, and I've been having fun with the Data Science Toolkit of course!

How to say nothing in 500 words – Ancient advice about writing that's still useful. "Call a fool a fool"!

Olympians Festival – I've been getting a lot out of the local TheaterPub nights in San Francisco, so I'm excited to make it to this twelve-night festival with a whopping 36 new plays in November! I'm also a sucker for the Greek myths, ever since I grew up with Tony 'Blackadder' Robinson's retelling of the Iliad as a kid.

Five short links


Assuming everybody else sucked – If an industry is behaving in an apparently irrational way, try to figure out the internal logic that's driving that behavior. You'll be much more effective at breaking the rules if you understand what they are first.

Storing and publishing sensor data – Now we're scattering sensors around like confetti, we're generating ever-growing mounds of time-series data, so here's a good overview of where you can shove it.

100,000 Stars – This WebGL exploration of the universe is so good I feel like it should have been plastered all over the internet already, but maybe I've been living under a rock?

Mapping the product manifold – I started off in image processing, carried what I'd learned to unstructured text, and now I'm fascinated to see techniques flowing back the other way. We're going to be doing crazily effective recognition of images, language, and every other kind of noisy signal within a few years.

What happened to the crypto dream? – A clear-eyed examination of where the crypto dream of the 90's ended up – "the demand for technologies that will upset that power balance is quite low".

We’re all starting to track ourselves


We’re releasing a massive and growing amount of information about who we are, where we go, and when. There are hundreds of millions of public checkins already out there, and millions more are being created every day. People think of Foursquare as the leading source, but actually Instagram, Facebook, Twitter, Flickr, and Google Plus all produce incredible numbers of geo-located checkins, some with many, many more than Foursquare.

This is going to cause big changes in our world. We’ve already taught our computers what we buy and read, now we’re telling them where we spend our real-world lives. Just our presence at a location at a particular time becomes powerful data when it’s combined with all the other people doing the same thing. We’re instrumenting our movements at a very detailed level, and sending them out into the ether. Even more amazingly, we’re adding high-resolution photos and detailed comments to the checkins.

It’s hard to overstate how effective this data can be at solving intractable problems. Economists, sociologists, and epidemiologists would kill to have detailed pictures of the lives we lead at this kind of scale. There will be applications we haven’t even thought of too, connecting us with people we should be talking to, introducing us to new experiences, all sorts of feedback that will change how we live.

It’s a scary new world to contemplate too of course, which is why I keep blogging about what I’m up to. Recently I’ve been working with my team at Jetpac analyzing billions of photos from all sorts of social sources, to help both tourists and locals figure out where to go and what to do. I want to share an internal tool we use to explore the data, a map interface to the checkins that people have shared publicly. If you want to get a concrete feel for how our world’s changing, check it out:

https://www.jetpac.com/map

It’s still an experimental tool so apologies for any bugs, but I hope you find this glimpse of the mountain of public data we’re all creating as fascinating as I do! You can find all of the individual photos and other checkins out there on the public web, but seeing them accumulated together in one place still blows my mind.