Five short links


Photo by Leo Reynolds

San Francisco aerial views from 1938 – A startlingly high-resolution zoomable map of the city from 75 years ago. The detail is amazing, as are the sand dunes across what’s now the Sunset neighborhood. It’s an incredible resource for anyone interested in our local history.

One weird kernel trick – Ad spam for ML researchers – “the secret to learning any problem with just ten training samples”!

Encoding messages in public SSH keys – Pimp your public RSA key with a custom message!

Why we need a compiler with a long-term support release – Dependencies are the devil, and compilers are at the root of most dependency trees. I can’t even imagine how much wrangling it takes to keep a big software platform compiling across dozens of architectures, but this heartfelt plea from the trenches gives me a flavor of the suffering involved. In a few decades we’ll be stuck searching for complete VMs to run old software, since most of the frameworks they depend on will have rotted off the internet.

Archaeology and Github – Liberating data from hard-to-OCR tables at the end of papers into a simple shared environment. I don’t know if Github is the right space for this, but as with wikis, programmers are early adopters of collaboration tools, so it’s a good place to start.

 

Five short links


Photo by Art.Crazed

Flush and reload cache attack – An attack on a common encryption program, using only cache manipulation to figure out exactly which code paths are being executed, and so what the secret inputs are. There are so many obscure ways for information to leak out from any complex system.

How to build effective heatmaps – There’s no objective way to build a heatmap. Every way of turning discrete points into gradients on a map emphasizes different features of the data, and this discussion has a great overview of the most common techniques.

Overly-honest social science – Worth quoting in depth: “Good social scientific practice should be about acknowledging the weaknesses of the methods used: not to reward sloppiness, but to highlight what really goes on; to reassure novice researchers that real world research often is messy; or to improve on current methods and find ways to do things better.  Better in this case does not necessarily mean less subjective, but it does mean more transparent, and usually more rigorous.  The publication of mistakes is a necessary part of this process.”

Visualizing historical map data – I’ve been very impressed with CartoDB, and this is a great example of how powerful it can be. The author uses uncommon names to track New Yorkers’ movements between the 1839 and 1854 censuses.

America’s artificial heartland – The prose is purpler than an eggplant, and a bit hard to swallow, but there’s something thought-provoking about this analysis. He paints a world where we industrialize farming beyond recognition, and then compensate with a lot of expensive theater at the places where we buy our food.

 

How to easily resize and cache images for the mobile web


Photo by Trugiaz

Mobile web pages crash when they run out of memory. Apple devices are particularly painful – they’ll die hard without even giving you a chance to record the crash! Even if they don’t crash, performance will be sluggish if you’re pushing memory limits. Large images are the usual cause, but since we display a lot of photos from social networks, I don’t have much control over their size outside of the default dimensions the different services supply. What I really needed was a zero-maintenance way to pull arbitrary sizes of thumbnails from external images, very fast. Happily, I found a way!

The short version is that I set up a server running the excellent ImageProxy open-source project, and then I placed a CloudFront CDN in front of it to cache the results. The ImageProxy server listens for requests containing a source image location as a parameter, along with a set of requested operations also baked into the URL. The code forks a call to ImageMagick’s command-line tools under the hood, which isn’t the most elegant solution, but does provide a lot of flexibility. With this sort of execution model, I do recommend running the server on a separate box from your other machines, for both performance and security reasons.

There are a few wrinkles to getting the code running, so if you’re on EC2, I put together a public AMI you can use on US-West1 as ami-3c85ad79. Otherwise, here are the steps I took, and the gotchas I hit:

– I started with a clean Ubuntu 12.04 install, since this is the modern distribution I’m most comfortable with.

– ImageProxy requires Ruby 1.9.x, and installing it instead of the default 1.8.7 using apt-get proved surprisingly hard! I eventually found these instructions, but you might be better off using RVM.

– I mostly followed the ImageProxy EC2 setup recipe, but one missing piece was that the default AllowEncodedSlashes setting caused Apache to return 404 errors on some requests, so I had to make a fiddly change to my configuration, sketched below.
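
For anyone hitting the same 404s, the change is along these lines. This is an illustrative fragment rather than my exact configuration (the hostname is made up, and the rest of the vhost is omitted); I believe NoDecode is the value you want so the escaped source URL reaches the application intact, but double-check against ImageProxy’s own setup notes.

    # Illustrative Apache virtual host fragment, not my exact configuration.
    # The CloudFront-compatible URL format embeds the escaped source URL in the
    # request path, and Apache's default AllowEncodedSlashes setting (Off)
    # rejects the %2F sequences with a 404.
    <VirtualHost *:80>
      ServerName imageproxy.example.com   # hypothetical hostname
      AllowEncodedSlashes NoDecode
      # ... the rest of the ImageProxy setup goes here ...
    </VirtualHost>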

I now had an image proxy server up and running! To check if yours worked, go to the /selftest page on your own server. Pay close attention to the “Resize (CloudFront-compatible URL format)” image, since this is the one that was broken by Apache’s default AllowEncodedSlashes configuration, and is the one you’ll need for the next CDN step.

I’m an Amazon junkie, so when I needed a flexible and fast caching solution, CloudFront was my first stop. I’d never used a CDN before, so I was pleasantly surprised by how easy it was. I clicked on “Create New Distribution” in the console, specified the URL of the ImageProxy server I’d set up as the origin, and within a few minutes I had a new CloudFront domain to use. It handles distributing its copies of the cached files to locations near each requesting user, and if a URL’s not yet cached, it will pull it from the ImageProxy server the first time it’s requested, and then offer super-fast access for subsequent calls. It all just worked!

Now, my code targeting mobile devices is able to automagically transform image URLs into their resized equivalents, e.g. from http://eahanson.s3.amazonaws.com/imageproxy/sample.png to http://imageproxy.heroku.com/convert/resize/100x100/source/http%3A%2F%2Feahanson.s3.amazonaws.com%2Fimageproxy%2Fsample.png. There is a slight delay the first time a URL is accessed while the resizing is performed, but as long as you have a finite set of images you’re loading, this should go away pretty quickly as the results get cached by CloudFront.
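
If you want to do the same transformation in your own code, it’s just a matter of escaping the source address and splicing it into the proxy path. Here’s a minimal Python sketch using the example URL and 100x100 geometry above; the proxy host is a placeholder, and in practice you’d point it at your CloudFront domain rather than at the proxy directly.

    from urllib.parse import quote

    # Placeholder: in production this would be your CloudFront distribution's
    # domain, so requests hit the cache instead of the proxy server itself.
    PROXY_HOST = "http://imageproxy.heroku.com"

    def thumbnail_url(source_url, width, height):
        # Escape everything, including slashes, so the whole source URL can be
        # baked into the path (this is why AllowEncodedSlashes matters above).
        escaped = quote(source_url, safe="")
        return "%s/convert/resize/%dx%d/source/%s" % (PROXY_HOST, width, height, escaped)

    print(thumbnail_url("http://eahanson.s3.amazonaws.com/imageproxy/sample.png", 100, 100))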

The only other issue to think about is limiting access to your server, since by default any other site could potentially start leeching off your proxy image processing. There are plenty of other ways of doing the same thing, so I wouldn’t stay awake at night worrying about it, but ImageProxy does allow you to require a signature, and to restrict the source URLs to particular domains, both of which would help prevent that sort of usage.
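
To sketch how the signature idea works in general terms (this is the generic HMAC technique with made-up names, not ImageProxy’s exact scheme; check its README for the real options), your application signs each source URL with a shared secret, and the proxy rejects requests whose signature doesn’t match:

    import hashlib
    import hmac

    SHARED_SECRET = b"replace-with-a-long-random-value"  # hypothetical secret

    def sign_source(source_url):
        # Computed by your application and appended to the proxy request.
        return hmac.new(SHARED_SECRET, source_url.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    def request_allowed(source_url, signature):
        # The proxy recomputes the signature, so other sites can't borrow your
        # server to resize arbitrary images.
        return hmac.compare_digest(sign_source(source_url), signature)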

Big thanks to Erik Hanson for building ImageProxy. I’d been putting off doing this kind of caching for months, but his project made it far simpler than I’d hoped. It’s working great, improving performance and preventing crashes, so if you’re struggling with unwieldy images, I recommend you take a look too!

[Update – I forgot to mention that the ultimate easy solution is a paid service like embed.ly or imgix! After looking into them, I decided the overhead of depending on yet another third-party service outweighed the convenience of avoiding setup hassles. It wasn’t an easy choice though, and I could see myself switching to one of them in the future if the server maintenance becomes a pain – I know that’s what happened with my friend Xavier at Storify!]

Five short links


Photo by Dezz

Hillbilly Tracking of Low Earth Orbit Satellites – A member of the Southern Appalachian Space Agency gives a rundown of his custom satellite tracking hardware and software, all built from off-the-shelf components. Pure hacking in its highest form, done for the hell of it.

Introduction to data archaeology – I’m fascinated by traditional archaeology, which has taught me a lot about uncovering information from seemingly intractable sources. Tomasz picks a nice concrete problem that shows how decoding an unknown file format from a few examples might not be as hard as you think!

Adversarial Stylometry – If you know your writing will be analyzed by experts in stylometry, can you obfuscate your own style to throw them off, or even mimic someone else’s to frame them?

The man behind the Dickens and Dostoevsky hoax – Though I struggle with the credibility problems of data science, I’m constantly reminded that no field is immune. For years biographers reproduced the story of a meeting that never happened between the British and Russian novelists. The hoax’s author was driven by paranoia about his rejection by academia, and set out to prove that bad work under someone else’s name would be accepted by journals that rejected his regular submissions.

Dedupe – A luscious little Python library for intelligently de-duplicating entities in data. This task normally takes up more time than anything in the loading stage of data processing pipelines, and the loading stage itself always seems to be the most work, so this is a big deal for my projects.

Why you should never trust a data scientist


Photo by Jesse Means

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization, which drew lines between each city and its ten most-connected cities, had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. What was worse, I couldn’t even get my data into the hands of reputable sociologists, thanks to concerns from Facebook.

I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form, as databases turn over and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there are no external checks in the system to keep them that way. The best you can hope for is blog and Twitter feedback, but without access to the data, or even a full paper on the techniques, you can’t dig very deeply.

Why are data scientists getting all the attention? I blame the real scientists! There’s a mass of fascinating information buried in all the data we’re collecting on ourselves, and traditional scientists have been painfully slow to take advantage of it. There are all sorts of barriers: the proprietary nature of the source data, a lack of familiarity with methods that can handle information at this scale, and the cultural distance between the academic and startup worlds. None of these should be insurmountable though. There’s great work being done with confidential IRS and US Census data, so the protocols exist to both do real science and preserve secrecy. I’ve seen the size of the fiber-optic bundles at CERN, so physicists at least know how to deal with crazy data rates. Most of the big startups had their roots in universities, so the cultural gap should be bridgeable.

What am I doing about it? I love efforts by teams like OpenPaths to break data out from proprietary silos so actual scientists can use them, and I do what I can to help any I run across. I popularize techniques that are common at startups, but lesser-known in academia. I’m excited when I see folks like Cameron Marlow at Facebook collaborating with academics to produce peer-reviewed research. I keep banging the drum about how shifty and feckless we data scientists really are, in the hope of damping down the starry-eyed credulity that greets our every pronouncement.

What should you do? If you’re a social scientist, don’t let us run away with all the publicity; jump in and figure out how to work with all these new sources. If you’re in a startup, figure out if you have data that tells a story, and see if there are any academics you can reach. If you’re a reader, heckle data scientists when we make sage pronouncements, and keep us honest!

Five short links


Photo by Trevor Pritchard

Network dynamic temporal visualization – Skye Bender-deMoll’s blog is the perfect example of why I had to find an alternative RSS service when Google Reader shut down. He only posts once every few months, but I’d hate to miss any of them! Here he covers a delicious little R tool that makes visualizing complex networks over time easy.

PressureNet live API – We’re all carrying little networked laboratories in our pockets. You see a photo. I see millions of light-sensor readings at an exact coordinate on the earth’s surface with a time resolution down to the millisecond. The future is combining all these signals into new ways of understanding the world, like this real-time stream of atmospheric measurements.

Inferring the origin locations of tweets – You can guess where a Twitter message came from with a surprising degree of accuracy, just based on the unstructured text of the tweet, and the user’s profile.

What every web developer should know about URL encoding – Nobody gets URL encoding right, not even me! Most of the time your application won’t need to handle the obscurer cases (path parameters anyone?) but it’s good to know the corners you’re cutting.

Global Name Data – Adam Hyland alerted me to the awesome OpenGenderTracking project. Not only does this have data on names from the US and the UK, but it includes the scripts to download them from the source sites. Hurrah for reproducibility! I’ve also just been alerted to a Dutch source of first name popularity data.

 

Switching to WordPress from Typepad.com


Photo by Dhaval Shah

I’ve been on the domain petewarden.typepad.com since I started blogging in 2006. At the time I knew that the sensible thing to do was to set up a custom domain that pointed to the typepad.com one, but since I didn’t expect I’d keep it up, it didn’t seem that important. A thousand posts later, I’ve had plenty of time to regret that shortcut!

Plenty of friends have tried to stage interventions, especially those with any design sense. I would half-heartedly mutter “If it ain’t broke, don’t fix it” and plead lack of time, but the truth is my blog was broken. I post because I want people to read what I write, and the aging design and structure drove readers away. When Matt Mullenweg called me out on it, I knew I had to upgrade.

The obvious choice was WordPress.com. Much as I love open source, I hate dealing with upgrades, dependencies, and spam, and I’m happy to pay a few dollars a month not to worry about any of those. Here’s what I had to do:

Purchased a premium WordPress domain at https://petewarden.wordpress.com. I wanted paid hosting, just like I had at Typepad, partly for the extra features (custom domains in particular) but also because I want to pay for anything that’s this important to me.

Bought the Elemin design template created by Automattic. I wanted something clean and not too common, and unlike a lot of other themes, this minimal design isn’t one I’ve seen in too many places.

Exported my old posts as a text file, and then uploaded them to my new WordPress blog. The biggest downside to this is that none of the images are transferred, so I’ll need to keep the old typepad.com blog up until I can figure out some scripted way to move those too. The good news is that all of the posts seem to have transferred without any problems.

Set up a custom domain in the WordPress settings to point to petewarden.com. This involved changing my nameservers through my DNS provider, and then using a custom settings language to duplicate the MX records. This was the scariest part, since I’m using Google Apps for my email and I really didn’t want to mess their settings up. The only hitch I hit was duplicating the www CNAME: I couldn’t get an entry working until I realized that it was handled automatically, so I just had to leave it out! With that all set up, I made petewarden.com my primary domain, so that petewarden.wordpress.com links would redirect to it.
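
If you’re nervous about the same step, the mail records you’re duplicating look something like this in ordinary zone-file notation. WordPress.com’s settings box uses its own similar syntax, so treat this as a sketch and copy the exact hostnames and priorities from your current DNS rather than from here.

    ; Illustrative only - use the hostnames and priorities from your existing DNS
    @  IN  MX   1  ASPMX.L.GOOGLE.COM.
    @  IN  MX   5  ALT1.ASPMX.L.GOOGLE.COM.
    @  IN  MX   5  ALT2.ASPMX.L.GOOGLE.COM.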

Updated my Feedburner settings to point to the WordPress RSS feed. I was worried about this step too, but other than duplicating the last ten posts when I checked in my RSS reader, it just seemed to work.

Added an RSS redirect in the head section of my old typepad blog. This is a Javascript hack so none of the Google Juice from any popular posts will transfer to the new domain (and it may even ding search rankings), but at least readers should now see the modern version. Unfortunately there’s no proper 30x redirect support in Typepad, so this is the best I can do.

Matt kindly offered to help me with the transfer, but so far I haven’t hit anything I can’t figure out for myself. Using a modern CMS after wrestling with Typepad’s neglected interface for years is like slipping into a warm bath. I’m so glad I took the leap – thanks to everyone who pushed me!

Hacks for hospital caregiving

Photo by Adam Fagen

I've just emerged from the worst two weeks of my life. My partner suffered a major stroke, ended up in intensive care, and went through brain surgery. Thankfully she's back at home now and on the road to a full recovery. We had no warning or time to prepare, so I had to learn a lot about how to be an effective caregiver on the fly. I discovered I had several friends who'd been in similar situations, so thanks to their advice and some improvisation here are some hacks that worked for me.

Be there

My partner was unconscious or suffering from memory problems most of the time, so I ended up having to make decisions on her behalf. Doctors, nurses and assistants all work to their own schedules, so sitting by the bedside for as long as you can manage is essential to getting information and making requests. Figure out when the most important times to be there are. At Stanford's intensive care unit the nurses work twelve-hour shifts, and hand over to the next nurse at 7pm and 7am every day. Make sure you're there for that handover; it's where you'll hear the previous nurse's assessment of the patient's condition, and where you can get support from the outgoing nurse for any requests you have for the new one to carry out. The nurses don't often see patients more than once, so you need to provide the long-term context.

The next most important time is when the doctors do their rounds. This can be unpredictable, but for Stanford's neurosurgery team it was often around 5:30am. This may be your one chance to see senior physicians involved in your loved one's care that day. They will page a doctor for you at any time if you request it, but this will usually be the most junior member of the team who's unlikely to be much help beyond very simple questions. During rounds you can catch a full set of the doctors involved, hear what they're thinking about the case, give them information, and ask them questions.

Even if there are no memory problems, the patient is going to be groggy or drugged a lot of the time, and having someone else as an informed advocate is always going to be useful, and the only way to be informed is to put in as much time as you can.

Be nice to the nurses

Nurses are the absolute rulers of their ward. I should know this myself after growing up in a nursing family, but an old friend reminded me just after I'd entered the hospital how much they control access to your loved one. They also carry a lot of weight with the doctors, who know they see the patients for a lot longer than they do, and often have many years more experience. Behaving politely can actually be very hard when you're watching someone you love in intensive care, but it pays off. I was able to spend far more than the official visiting hours with my partner, mostly because the nurses knew I was no trouble, and would actually make their jobs easier by doing mundane things for her, and reassuring her when she had memory problems. This doesn't mean you should be a pushover though. If the nurses know that you have the time to very politely keep insisting that something your loved one needs happens, and will be there to track that it does, you'll be able to get what your partner needs.

Track the drugs

The most harrowing part of the experience was seeing my loved one in pain. Because neurosurgeons need to track their patients' cognitive state closely to spot problems, they limit pain relief to a small number of drugs that don't cause too much drowsiness. I knew this was necessary, but it left my partner with very little margin before she was hit with attacks of severe pain. At first I trusted the staff to deal with it, but it quickly became clear that something was going wrong with her care. I discovered she'd had a ten-hour overnight gap in the Vicodin medication that dealt with her pain, and she spent the subsequent morning dealing with traumatic pain that was hard to bring back under control. That got me digging into her drugs, and with the help of a friendly nurse, we discovered that she was being given individual Tylenols, and the Vicodin also contained Tylenol, so she would max out on the 4,000mg daily limit of its active ingredient acetaminophen and be unable to take anything until the twenty-four hour window had passed. This was crazy because the Tylenol did exactly nothing to help the pain, but it was preventing her from taking the Vicodin that did have an effect.

Once I knew what was going on I was able to get her switched to Norco, which contains the same strong painkiller as Vicodin but with less Tylenol. There were other misadventures along the same lines, though none that caused so much pain, so I made sure I kept a notebook of all the drugs she was taking, the frequency, any daily limits, and the times she had taken them last, so I could manually track everything and spot any gaps before they happened. Computerization meant that the nurses no longer did this sort of manual tracking, which is generally great, but also meant they were always taken by surprise when she hit something like the Tylenol daily limit, since the computer rejection would be the first they knew of it.

Start a mailing list

When someone you love falls ill, all of their family and friends will be desperate for news. When I first came up for air, I phoned everyone I could think of to let them know, and then made sure I had their email address. I would be kicked out of the ward for a couple of hours each night, so I used that time to send out a mail containing a progress report. At first I used a manual CC list, but after a few days a friend set up a private Google Group that made managing the increasingly long list of recipients a lot easier. The process of writing a daily update helped me, because it forced me to collect my thoughts and try to make sense of what had happened that day, which was a big help in making decisions. It also allowed me to put out requests for assistance, for things like researching the pros and cons of an operation, looking after our pets, or finding accommodation for family and friends from out-of-town. My goal was to focus as much of my time as possible on looking after my partner. Having a simple way to reach a lot of people at once and being able to delegate easily saved me a lot of time, which helped me give better care.

Minimize baggage

A lot of well-meaning visitors would bring care packages, but these were a problem. During our eleven day stay, we moved wards eight times. Because my partner was in intensive care or close observation the whole time, there were only a couple of small drawers for storage, and very little space around the bed. I was sleeping in a chair by her bedside or in the waiting room, so I didn't have a hotel room to stash stuff. I was also recovering from knee surgery myself, so I couldn't carry very much!

I learned to explain the situation to visitors, and be pretty forthright in asking them to take items back. She didn't need clothing and the hospital supplied basic toiletries, so the key items were our phones, some British tea bags and honey, and one comforting blanket knitted by a friend's mother. Possessions are a hindrance in that sort of setting: the nurses hate having to weave around bags of stuff to take vital signs, there's little storage, and moving them all becomes a royal pain. Figure out what you'll actually use every day, and ask friends to take everything else away. You can always get them to bring something back if you really do need it, but cutting down our baggage was a big weight off my mind.

Sort out the decision-making

My partner was lucid enough early in the process to nominate me as her primary decision-maker when she was incapacitated, even though we're not married. As it happened, all of the treatment decisions were very black-and-white so I never really had to exercise much discretion, but if the worst had happened I would have been forced to guess at what she wanted. I knew the general outlines from the years we've spent together, but after this experience we'll both be filling out 'living wills' to make things a lot more explicit. We're under forty, so we didn't expect to be dealing with this so soon, but life is uncertain. The hospital recommended Five Wishes, which is $5 to complete online, and has legal force in most US states. Even if you don't fill out the form, just talking together about what you want is going to be incredibly useful.

Ask for help

I'm normally a pretty independent person, but my partner and I needed a large team behind us to help her get well. The staff at Stanford, her family, and her friends were all there for us, and gave us a tremendous amount of assistance. It wasn't easy to reach out and ask for simple help like deliveries of clothes and toiletries, but the people around you are looking for ways they can do something useful, and it actually makes them feel better. Take advantage of their offers; it will help you focus on your caregiving.

Thanks again to everyone who helped through this process, especially the surprising number of friends who've been through something similar and whose advice helped me with the hacks above.

How does name analysis work?

Photo by Glenda Sims

Over the last few months, I've been doing a lot more work with name analysis, and I've made some of the tools I use available as open-source software. Name analysis takes a list of names, and outputs guesses for the gender, age, and ethnicity of each person. This makes it incredibly useful for answering questions about the demographics of people in public data sets. Fundamentally though, the outputs are still guesses, and end-users need to understand how reliable the results are, so I want to talk about the strengths and weaknesses of this approach.

The short answer is that it can never work any better than a human looking at somebody else's name and guessing their age, gender, and race. If you saw Mildred Hermann on a list of names, I bet you'd picture an older white woman, whereas Juan Hernandez brings to mind an Hispanic man, with no obvious age. It should be obvious that this is not always reliable for individuals (I bet there are some young Mildreds out there) but as the sample size grows, the errors tend to cancel each other out.

The algorithms themselves work by looking at data that's been released by the US Census and the Social Security Administration. These data sets list the popularity of 90,000 first names by gender and year of birth, and 150,000 family names by ethnicity. I then use these frequencies as the basis for all of the estimates. Crucially, all the guesses depend on how strong a correlation there is between a particular name and a person's characteristics, which varies for each property. I'll give some estimates of how strong these relationships are below, and link to some papers with more rigorous quantitative evaluations at the end.
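
To make the frequency approach concrete, here's a minimal Python sketch of the gender half of the calculation. The per-name numbers are invented placeholders rather than the real Census/SSA values, and the actual Text2People tables are far larger, but the aggregation is the same: sum each name's probability and report the expected split, which is why individual mistakes tend to wash out as the sample grows.

    # Hypothetical per-first-name frequencies: the fraction of people with each
    # name who are recorded as female. These are placeholder values for
    # illustration, not the real Census/SSA statistics.
    FEMALE_FRACTION = {
        "mildred": 0.99,
        "juan": 0.01,
        "francis": 0.55,  # a genuine cross-over name
    }

    def estimate_gender_split(first_names):
        # Sum each name's probability rather than forcing a hard guess, so
        # individual misclassifications tend to cancel out as the list grows.
        female = male = unknown = 0.0
        for name in first_names:
            fraction = FEMALE_FRACTION.get(name.lower())
            if fraction is None:
                unknown += 1  # too rare to have data on
            else:
                female += fraction
                male += 1 - fraction
        return female, male, unknown

    print(estimate_gender_split(["Mildred", "Juan", "Francis", "Xanthippe"]))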

If you are going to use this approach in your own work, the first thing to watch out for is that any correlations are only relevant for people in the US. Names may be associated with very different traits in other countries, and our racial categories especially are social constructs and so don't map internationally.

Gender is the most reliable signal that we can glean from names. There are some cross-over first names with a mixture of genders, like Francis, and some that are too rare to have data on, but overall the estimate of how many men and women are present in a list of names has proved highly accurate. It helps that there are some regular patterns to augment the sampled data, like names ending with an 'a' being associated with women.

Asian and Hispanic family names tend to be fairly unique to those communities, so an occurrence is a strong signal that the person is a member of that ethnicity. There are some confounding factors though, especially with Spanish-derived names in the Philippines. There are certain names, especially those from Germany and Nordic countries, that strongly indicate that the owner is of European descent, but many surnames are multi-racial. There are some associations between African-Americans and certain names like Jackson or Smalls, but these are also shared by a lot of people from other ethnic groups. These ambiguities make non-Hispanic and non-Asian measures more indicators than strong metrics, and they won't tell you much until you get into the high hundreds for your sample size.

Age has the weakest correlation with names. There are actually some strong patterns by time of birth, with certain names widely recognized as old-fashioned or trendy, but those tend to be swamped by class and ethnicity-based differences in the popularity of names. I do calculate the most popular year for every name I know about, and compensate for life expectancy using actuarial tables, but it's hard to use that to derive a likely age for a population of people unless they're equally distributed geographically and socially. There tends to be a trickle-down effect where names first become popular amongst higher-income parents, and then spread throughout society over time. That means if you have a group of higher-class people, their first names will have become most widely popular decades after they were born, and so they'll tend to appear a lot younger than they actually are. Similar problems exist with different ethnic groups, so overall treat the calculated age with a lot of caution, even with large sample sizes.
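
Here's a rough Python sketch of one way to do that kind of compensation, with invented numbers standing in for the real popularity and actuarial tables (the actual Text2People code may differ in the details). It weights each birth-year cohort's popularity by the chance someone born then is still alive before picking the most likely year; the class and ethnicity effects described above are exactly what a simple calculation like this can't capture.

    CURRENT_YEAR = 2013

    # Placeholder data for a single name: fraction of babies given the name in
    # each birth year, and the probability of surviving from that year until
    # now. Real implementations use the SSA name tables and actuarial tables.
    NAME_POPULARITY = {1920: 0.008, 1950: 0.004, 1980: 0.001}
    SURVIVAL_PROBABILITY = {1920: 0.05, 1950: 0.75, 1980: 0.98}

    def likely_age(popularity, survival, current_year=CURRENT_YEAR):
        # Weight each birth year's popularity by the chance someone born then
        # is still alive, then take the peak of the weighted distribution.
        weights = {year: popularity[year] * survival[year] for year in popularity}
        peak_year = max(weights, key=weights.get)
        return current_year - peak_year

    print(likely_age(NAME_POPULARITY, SURVIVAL_PROBABILITY))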

You should treat the results of name analysis cautiously – as provisional evidence, not as definitive proof. It's powerful because it helps in cases where no other information is available, but because those cases are often highly-charged and controversial, I'd urge everyone to see it as the start of the process of investigation not the end.

I've relied heavily on the existing academic work for my analysis, so I highly recommend checking out some of these papers if you do want to work with this technique. As an engineer, I'm also working without the benefit of peer review, so suggestions on improvements or corrections would be very welcome at pete@petewarden.com.

Use of Geocoding and Surname Analysis to Estimate Race and Ethnicity – A very readable survey of the use of surname analysis for ethnicity estimation in health statistics.

Estimating Age, Gender, and Identity using First Name Priors – A neat combination of image-processing techniques and first name data to improve the estimates of people's ages and genders in snapshots.

Are Emily and Greg More Employable than Lakisha and Jamal? – Worrying proof that humans rely on innate name analysis to discriminate against minorities.

First names and crime: Does unpopularity spell trouble? – An analysis that shows uncommon names are associated with lower-class parents, and so correlate with juvenile delinquency and other ills connected to low socioeconomic status.

Surnames and a theory of social mobility – A recent classic of a paper that uses uncommon surnames to track the effects of social mobility across many generations, in many different societies and time periods.

OnoMap – A project by University College London to correlate surnames worldwide with ethnicities. Commercially-licensed, but it looks like you may be able to get good terms for academic usage.

Text2People – My open-source implementation of name analysis.