Five short links


Photo by Dezz

Hillbilly Tracking of Low Earth Orbit Satellites – A member of the Southern Appalachian Space Agency gives a rundown of his custom satellite tracking hardware and software, all built from off-the-shelf components. Pure hacking in its highest form, done for the hell of it.

Introduction to data archaeology – I’m fascinated by traditional archaeology, which has taught me a lot about uncovering information from seemingly intractable sources. Tomasz picks a nice concrete problem that shows how decoding an unknown file format from a few examples might not be as hard as you think!

Adversarial Stylometry – If you know your writing will be analyzed by experts in stylometry, can you obfuscate your own style to throw them off, or even mimic someone else’s to frame them?

The man behind the Dickens and Dostoevsky hoax – Though I struggle with the credibility problems of data science, I’m constantly reminded that no field is immune. For years biographers reproduced the story of a meeting that never happened between the British and Russian novelists. The hoax’s author was driven by paranoia about his rejection by academia, and set out to prove that bad work under someone else’s name would be accepted by journals that rejected his regular submissions.

Dedupe – A luscious little Python library for intelligently de-duplicating entities in data. This task normally takes up more time than anything in the loading stage of data processing pipelines, and the loading stage itself always seems to be the most work, so this is a big deal for my projects.
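To make the task concrete, here’s a toy sketch of the brute-force version in plain Python, using difflib from the standard library. The records and the threshold are invented for illustration, and this naive pairwise scan is exactly what Dedupe’s blocking and active learning improve on:

from difflib import SequenceMatcher
from itertools import combinations

# Naive pairwise de-duplication: compare every record against every other,
# and flag the pairs whose strings look similar enough. Dedupe replaces this
# O(n^2) scan with blocking plus learned field-by-field similarity weights.
records = ["Jon's Pizza", "Jons Pizza Inc.", "Mildred Hermann", "Mildred Herman"]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

likely_duplicates = [
    (a, b)
    for a, b in combinations(records, 2)
    if similarity(a, b) > 0.7  # threshold picked by eyeball, not learning
]
print(likely_duplicates)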

Why you should never trust a data scientist


Photo by Jesse Means

The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization of drawing lines between the top ten links for each city had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than John in Texan border towns, and so the country was on the verge of being swamped by Hispanics. What was worse, I couldn’t even get my data into the hands of reputable sociologists, thanks to concerns from Facebook.

I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form, as databases turn over and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there are no external checks in the system to keep them that way. The best you can hope for is blog and Twitter feedback, but without access to the data, or even a full paper on the techniques, you can’t dig very deeply.

Why are data scientists getting all the attention? I blame the real scientists! There’s a mass of fascinating information buried in all the data we’re collecting on ourselves, and traditional scientists have been painfully slow to take advantage of it. There are all sorts of barriers, ranging from the proprietary nature of the source data to unfamiliarity with methods that can handle the information at scale, and a cultural distance between the academic and startup worlds. None of these should be insurmountable though. There’s great work being done with confidential IRS and US Census data, so the protocols exist to both do real science and preserve secrecy. I’ve seen the size of the fiber-optic bundles at CERN, so physicists at least know how to deal with crazy data rates. Most of the big startups had their roots in universities, so the cultural gap should be bridgeable.

What am I doing about it? I love efforts by teams like OpenPaths to break data out from proprietary silos so actual scientists can use them, and I do what I can to help any I run across. I popularize techniques that are common at startups, but lesser-known in academia. I’m excited when I see folks like Cameron Marlow at Facebook collaborating with academics to produce peer-reviewed research. I keep banging the drum about how shifty and feckless we data scientists really are, in the hope of damping down the starry-eyed credulity that greets our every pronouncement.

What should you do? If you’re a social scientist, don’t let us run away with all the publicity; jump in and figure out how to work with all these new sources. If you’re in a startup, figure out if you have data that tells a story, and see if there are any academics you can reach. If you’re a reader, heckle data scientists when we make a sage pronouncement, and keep us honest!

Five short links


Photo by Trevor Pritchard

Network dynamic temporal visualization – Skye Bender-deMoll’s blog is the perfect example of why I had to find an alternative RSS service when Google Reader shut down. He only posts once every few months, but I’d hate to miss any of them! Here he covers a delicious little R tool that makes visualizing complex networks over time easy.

PressureNet live API – We’re all carrying little networked laboratories in our pockets. You see a photo. I see millions of light-sensor readings at an exact coordinate on the earth’s surface with a time resolution down to the millisecond. The future is combining all these signals into new ways of understanding the world, like this real-time stream of atmospheric measurements.

Inferring the origin locations of tweets – You can guess where a Twitter message came from with a surprising degree of accuracy, just based on the unstructured text of the tweet, and the user’s profile.
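The paper uses proper statistical models over big corpora, but the kernel of the idea fits in a few lines: score each candidate city by how well the tweet’s words match the language seen there before. Here’s a toy Python version with invented training vocabularies, standing in for the real method:

import math
from collections import Counter

# Toy text-based location inference: each city gets a score based on how
# often the tweet's words appeared in (tiny, made-up) training text from
# that city, with add-one smoothing so unseen words don't zero a city out.
city_words = {
    "Houston": Counter("rodeo astros humidity bayou".split()),
    "Boston": Counter("wicked sox chowder commuter".split()),
}

def guess_city(tweet):
    scores = {}
    for city, words in city_words.items():
        total = sum(words.values())
        vocab = len(words)
        scores[city] = sum(
            math.log((words[w] + 1) / (total + vocab))
            for w in tweet.lower().split()
        )
    return max(scores, key=scores.get)

print(guess_city("no rodeo today too much humidity"))  # Houston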

What every web developer should know about URL encoding – Nobody gets URL encoding right, not even me! Most of the time your application won’t need to handle the obscurer cases (path parameters anyone?) but it’s good to know the corners you’re cutting.
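As a tiny illustration of those corners, here’s how Python’s standard urllib.parse handles the same string differently depending on which part of a URL you’re building:

from urllib.parse import quote, quote_plus, unquote

segment = "a b/c&d"

# Path encoding: '/' is a legal separator, so quote() leaves it alone.
print(quote(segment))          # a%20b/c%26d
# Query-string encoding: spaces become '+' and '/' must be escaped too.
print(quote_plus(segment))     # a+b%2Fc%26d
# Decoding isn't symmetric either: unquote() leaves '+' as a literal plus.
print(unquote("a+b%2Fc%26d"))  # a+b/c&d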

Global Name Data – Adam Hyland alerted me to the awesome OpenGenderTracking project. Not only does this have data on names from the US and the UK, but it includes the scripts to download them from the source sites. Hurrah for reproducibility! I’ve also just been alerted to a Dutch source of first name popularity data.


Switching to WordPress from Typepad.com


Photo by Dhaval Shah

I’ve been on the domain petewarden.typepad.com since I started blogging in 2006. At the time I knew that the sensible thing to do was to set up a custom domain that pointed to the typepad.com one, but since I didn’t expect I’d keep it up, it didn’t seem that important. A thousand posts later, I’ve had plenty of time to regret that shortcut!

Plenty of friends have tried to stage interventions, especially those with any design sense. I would half-heartedly mutter “If it ain’t broke, don’t fix it” and plead lack of time, but the truth is my blog was broken. I post because I want people to read what I write, and the aging design and structure drove readers away. When Matt Mullenweg called me out on it, I knew I had to upgrade.

The obvious choice was WordPress.com. Much as I love open source, I hate dealing with upgrades, dependencies, and spam, and I’m happy to pay a few dollars a month not to worry about any of those. Here’s what I had to do:

Purchased a premium WordPress domain at https://petewarden.wordpress.com. I wanted paid hosting, just like I had at Typepad, partly for the extra features (custom domains in particular) but also because I want to pay for anything that’s this important to me.

Bought the Elemin design template created by Automattic. I wanted something clean, and not too common, and I haven’t seen this minimal design in too many places, unlike a lot of other themes.

Exported my old posts as a text file, and then uploaded them to my new WordPress blog. The biggest downside to this is that none of the images are transferred, so I’ll need to keep the old typepad.com blog up until I can figure out some scripted way to move those too. The good news is that all of the posts seem to have transferred without any problems.

Set up a custom domain in the WordPress settings to point to petewarden.com. This involved changing my nameservers through my DNS provider, and then using a custom settings language to duplicate the MX records. This was the scariest part, since I’m using Google Apps for my email and I really didn’t want to mess their settings up. The only hitch was duplicating the www CNAME: I couldn’t get an entry working until I realized that it was handled automatically, so I just had to leave it out! With that all set up, I made petewarden.com my primary domain, so that petewarden.wordpress.com links would redirect to it.
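One check that would have calmed my nerves during the switch: querying the MX records directly to confirm they survived the move. A minimal sketch, assuming the third-party dnspython package (nothing WordPress-specific about it):

import dns.resolver  # third-party dnspython package

# Print the MX records the resolver currently sees for the domain, so you
# can compare them before and after changing nameservers.
for record in dns.resolver.resolve("petewarden.com", "MX"):
    print(record.preference, record.exchange)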

Updated my Feedburner settings to point to the WordPress RSS feed. I was worried about this step too, but other than duplicating the last ten posts when I checked in my RSS reader, it just seemed to work.

Added an RSS redirect in the head section of my old typepad blog. This is a Javascript hack so none of the Google Juice from any popular posts will transfer to the new domain (and it may even ding search rankings), but at least readers should now see the modern version. Unfortunately there’s no proper 30x redirect support in Typepad, so this is the best I can do.

Matt kindly offered to help me with the transfer, but so far I haven’t hit anything I can’t figure out for myself. Using a modern CMS after wrestling with Typepad’s neglected interface for years is like slipping into a warm bath. I’m so glad I took the leap – thanks to everyone who pushed me!

Hacks for hospital caregiving

Photo by Adam Fagen

I've just emerged from the worst two weeks of my life. My partner suffered a major stroke, ended up in intensive care, and went through brain surgery. Thankfully she's back at home now and on the road to a full recovery. We had no warning or time to prepare, so I had to learn a lot about how to be an effective caregiver on the fly. I discovered I had several friends who'd been in similar situations, so thanks to their advice and some improvisation here are some hacks that worked for me.

Be there

My partner was unconscious or suffering from memory problems most of the time, so I ended up having to make decisions on her behalf. Doctors, nurses and assistants all work to their own schedules, so sitting by the bedside for as long as you can manage is essential to getting information and making requests. Figure out when the most important times to be there are. At Stanford's intensive care unit the nurses work twelve-hour shifts, and hand over to the next nurse at 7pm and 7am every day. Make sure you're there for that handover; it's where you'll hear the previous nurse's assessment of the patient's condition, and be able to get support from the old nurse for any requests you have for the new one to carry out. The nurses don't often see patients more than once, so you need to provide the long-term context.

The next most important time is when the doctors do their rounds. This can be unpredictable, but for Stanford's neurosurgery team it was often around 5:30am. This may be your one chance to see senior physicians involved in your loved one's care that day. They will page a doctor for you at any time if you request it, but this will usually be the most junior member of the team who's unlikely to be much help beyond very simple questions. During rounds you can catch a full set of the doctors involved, hear what they're thinking about the case, give them information, and ask them questions.

Even if there are no memory problems, the patient is going to be groggy or drugged a lot of the time, so having someone else as an informed advocate is always going to be useful. The only way to be informed is to put in as much time as you can.

Be nice to the nurses

Nurses are the absolute rulers of their ward. I should know this myself after growing up in a nursing family, but an old friend reminded me just after I'd entered the hospital how much they control access to your loved one. They also carry a lot of weight with the doctors, who know the nurses see the patients for far longer than they do, and often have many years more experience. Behaving politely can actually be very hard when you're watching someone you love in intensive care, but it pays off. I was able to spend far more than the official visiting hours with my partner, mostly because the nurses knew I was no trouble, and would actually make their jobs easier by doing mundane things for her, and reassuring her when she had memory problems. This doesn't mean you should be a pushover though. If the nurses know that you have the time to very politely keep insisting that something your loved one needs happens, and will be there to track that it does, you'll be able to get what your loved one needs.

Track the drugs

The most harrowing part of the experience was seeing my loved one in pain. Because neurosurgeons need to track their patients' cognitive state closely to spot problems, they limit pain relief to a small number of drugs that don't cause too much drowsiness. I knew this was necessary, but it left my partner with very little margin before she was hit with attacks of severe pain. At first I trusted the staff to deal with it, but it quickly became clear that something was going wrong with her care. I discovered she'd had a ten-hour overnight gap in the Vicodin medication that dealt with her pain, and she spent the subsequent morning dealing with traumatic pain that was hard to bring back under control. That got me digging into her drugs, and with the help of a friendly nurse, we discovered that she was being given individual Tylenols, and the Vicodin also contained Tylenol, so she would max out on the 4,000mg daily limit of its active ingredient, acetaminophen, and be unable to take anything until the twenty-four hour window had passed. This was crazy because the Tylenol did exactly nothing to help the pain, but it was preventing her from taking the Vicodin that did have an effect.

Once I knew what was going on I was able to get her switched to Norco, which contains the same strong painkiller as Vicodin but with less Tylenol. There were other misadventures along the same lines, though none that caused so much pain, so I made sure I kept a notebook of all the drugs she was taking, the frequency, any daily limits, and the times she had taken them last, so I could manually track everything and spot any gaps before they happened. Computerization meant that the nurses no longer did this sort of manual tracking, which is generally great, but it also meant they were always taken by surprise when she hit something like the Tylenol daily limit, since the computer rejection would be the first they knew of it.

Start a mailing list

When someone you love falls ill, all of their family and friends will be desperate for news. When I first came up for air, I phoned everyone I could think of to let them know, and then made sure I had their email address. I would be kicked out of the ward for a couple of hours each night, so I used that time to send out a mail containing a progress report. At first I used a manual CC list, but after a few days a friend set up a private Google Group that made managing the increasingly long list of recipients a lot easier. The process of writing a daily update helped me, because it forced me to collect my thoughts and try to make sense of what had happened that day, which was a big help in making decisions. It also allowed me to put out requests for assistance, for things like researching the pros and cons of an operation, looking after our pets, or finding accommodation for family and friends from out-of-town. My goal was to focus as much of my time as possible on looking after my partner. Having a simple way to reach a lot of people at once and being able to delegate easily saved me a lot of time, which helped me give better care.

Minimize baggage

A lot of well-meaning visitors would bring care packages, but these were a problem. During our eleven-day stay, we moved wards eight times. Because my partner was in intensive care or close observation the whole time, there were only a couple of small drawers for storage, and very little space around the bed. I was sleeping in a chair by her bedside or in the waiting room, so I didn't have a hotel room to stash stuff. I was also recovering from knee surgery myself, so I couldn't carry very much!

I learned to explain the situation to visitors, and be pretty forthright in asking them to take items back. She didn't need clothing and the hospital supplied basic toiletries, so the key items were our phones, some British tea bags and honey, and one comforting blanket knitted by a friend's mother. Possessions are a hindrance in that sort of setting: the nurses hate having to weave around bags of stuff to take vital signs, there's little storage, and moving them all becomes a royal pain. Figure out what you'll actually use every day, and ask friends to take everything else away. You can always get them to bring something back if you really do need it, but cutting down our baggage was a big weight off my mind.

Sort out the decision-making

My partner was lucid enough early in the process to nominate me as her primary decision-maker when she was incapacitated, even though we're not married. As it happened, all of the treatment decisions were very black-and-white so I never really had to exercise much discretion, but if the worst had happened I would have been forced to guess at what she wanted. I knew the general outlines from the years we've spent together, but after this experience we'll both be filling out 'living wills' to make things a lot more explicit. We're under forty, so we didn't expect to be dealing with this so soon, but life is uncertain. The hospital recommended Five Wishes, which is $5 to complete online, and has legal force in most US states. Even if you don't fill out the form, just talking together about what you want is going to be incredibly useful.

Ask for help

I'm normally a pretty independent person, but my partner and I needed a large team behind us to help her get well. The staff at Stanford, her family, and her friends were all there for us, and gave us a tremendous amount of assistance. It wasn't easy to reach out and ask for simple help like deliveries of clothes and toiletries, but the people around you are looking for ways they can do something useful; it actually makes them feel better. Take advantage of their offers; it will help you focus on your caregiving.

Thanks again to everyone who helped through this process, especially the surprising number of friends who've been through something similar and whose advice helped me with the hacks above.

How does name analysis work?

Photo by Glenda Sims

Over the last few months, I've been doing a lot more work with name analysis, and I've made some of the tools I use available as open-source software. Name analysis takes a list of names, and outputs guesses for the gender, age, and ethnicity of each person. This makes it incredibly useful for answering questions about the demographics of people in public data sets. Fundamentally though, the outputs are still guesses, and end-users need to understand how reliable the results are, so I want to talk about the strengths and weaknesses of this approach.

The short answer is that it can never work any better than a human looking at somebody else's name and guessing their age, gender, and race. If you saw Mildred Hermann on a list of names, I bet you'd picture an older white woman, whereas Juan Hernandez brings to mind a Hispanic man, with no obvious age. It should be obvious that this is not always reliable for individuals (I bet there are some young Mildreds out there), but as the sample size grows, the errors tend to cancel each other out.

The algorithms themselves work by looking at data that's been released by the US Census and the Social Security Administration. These data sets list the popularity of 90,000 first names by gender and year of birth, and 150,000 family names by ethnicity. I then use these frequencies as the basis for all of the estimates. Crucially, all the guesses depend on how strong a correlation there is between a particular name and a person's characteristics, which varies for each property. I'll give some rough estimates of how strong these relationships are, and link to papers with more rigorous quantitative evaluations at the end.
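Here's the heart of the approach as a sketch: tally name frequencies by gender, then turn a lookup into a guess with a confidence attached. The miniature rows below are invented stand-ins for the real SSA files:

from collections import defaultdict

# Hypothetical miniature rows standing in for the real SSA data:
# (first name, gender, count of births).
rows = [
    ("Mildred", "F", 9800), ("Mildred", "M", 40),
    ("Juan", "M", 12000), ("Juan", "F", 30),
]

counts = defaultdict(lambda: {"F": 0, "M": 0})
for name, gender, count in rows:
    counts[name][gender] += count

def guess_gender(name):
    c = counts.get(name)
    if c is None:
        return None, 0.0  # too rare to have data on
    total = c["F"] + c["M"]
    best = max(c, key=c.get)
    return best, c[best] / total  # the guess, plus its confidence

print(guess_gender("Mildred"))  # ('F', 0.995...)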

If you are going to use this approach in your own work, the first thing to watch out for is that any correlations are only relevant for people in the US. Names may be associated with very different traits in other countries, and our racial categories especially are social constructs and so don't map internationally.

Gender is the most reliable signal that we can glean from names. There are some cross-over first names with a mixture of genders, like Francis, and some that are too rare to have data on, but overall the estimate of how many men and women are present in a list of names has proved highly accurate. It helps that there are some regular patterns to augment the sampled data, like names ending with an 'a' being associated with women.

Asian and Hispanic family names tend to be largely confined to those communities, so an occurrence is a strong signal that the person is a member of that ethnicity. There are some confounding factors though, especially with Spanish-derived names in the Philippines. There are certain names, especially those from Germany and Nordic countries, that strongly indicate that the owner is of European descent, but many surnames are multi-racial. There are some associations between African-Americans and certain names like Jackson or Smalls, but these are also shared by a lot of people from other ethnic groups. These ambiguities make the non-Hispanic and non-Asian measures rough indicators rather than strong metrics, and they won't tell you much until your sample size gets into the high hundreds.
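The surname side is a lookup in the same spirit. Here's a sketch of loading the public Census surname file into a table of ethnicity percentages; the column names follow the public release as I remember it, so treat them as assumptions, and note that the real file suppresses small counts:

import csv

# Build a surname -> ethnicity-percentage lookup from a Census-style CSV
# with columns like name, pctwhite, pctblack, pctapi, pcthispanic.
def load_surnames(path):
    table = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            table[row["name"].upper()] = {
                "white": row["pctwhite"],
                "black": row["pctblack"],
                "asian_pacific": row["pctapi"],
                "hispanic": row["pcthispanic"],
            }
    return table

surnames = load_surnames("census_surnames.csv")
print(surnames.get("HERNANDEZ"))  # overwhelmingly Hispanic percentages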

Age has the weakest correlation with names. There are actually some strong patterns by time of birth, with certain names widely recognized as old-fashioned or trendy, but those tend to be swamped by class and ethnicity-based differences in the popularity of names. I do calculate the most popular year for every name I know about, and compensate for life expectancy using actuarial tables, but it's hard to use that to derive a likely age for a population of people unless they're equally distributed geographically and socially. There tends to be a trickle-down effect where names first become popular amongst higher-income parents, and then spread throughout society over time. That means if you have a group of higher-class people, their first names will have become most widely popular decades after they were born, and so they'll tend to appear a lot younger than they actually are. Similar problems exist with different ethnic groups, so overall treat the calculated age with a lot of caution, even with large sample sizes.
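To see why the actuarial correction matters, here's a toy version of the birth-year estimate: weight each year's births by the chance that someone born then is still alive. Every number below is an invented placeholder, not real SSA or actuarial data:

# Invented counts of births by year for one name, plus made-up survival
# probabilities standing in for real actuarial tables.
births = {"Mildred": {1918: 18000, 1950: 2000, 1990: 150}}
alive_prob = {1918: 0.01, 1950: 0.55, 1990: 0.99}

def likely_birth_year(name):
    weighted = {
        year: count * alive_prob[year]
        for year, count in births[name].items()
    }
    return max(weighted, key=weighted.get)

# The raw peak year is 1918, but weighting by survival moves it to 1950.
print(likely_birth_year("Mildred"))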

You should treat the results of name analysis cautiously – as provisional evidence, not as definitive proof. It's powerful because it helps in cases where no other information is available, but because those cases are often highly-charged and controversial, I'd urge everyone to see it as the start of the process of investigation not the end.

I've relied heavily on the existing academic work for my analysis, so I highly recommend checking out some of these papers if you do want to work with this technique. As an engineer, I'm also working without the benefit of peer review, so suggestions on improvements or corrections would be very welcome at pete@petewarden.com.

Use of Geocoding and Surname Analysis to Estimate Race and Ethnicity – A very readable survey of the use of surname analysis for ethnicity estimation in health statistics.

Estimating Age, Gender, and Identity using First Name Priors – A neat combination of image-processing techniques and first name data to improve the estimates of people's ages and genders in snapshots.

Are Emily and Greg More Employable than Lakisha and Jamal? – Worrying proof that humans rely on innate name analysis to discriminate against minorities.

First names and crime: Does unpopularity spell trouble? – An analysis showing that uncommon names are associated with lower-class parents, and so correlate with juvenile delinquency and other ills connected to low socioeconomic status.

Surnames and a theory of social mobility – A recent classic of a paper that uses uncommon surnames to track the effects of social mobility across many generations, in many different societies and time periods.

OnoMap – A project by University College London to correlate surnames worldwide with ethnicities. Commercially-licensed, but it looks like you may be able to get good terms for academic usage.

Text2People – My open-source implementation of name analysis.

Fixing OpenCV’s Java bindings on gcc systems

Photo by Julian Schroeder

I just spent quite a few hours tracking down a subtle problem with OpenCV's new Java bindings on gcc platforms, like my Ubuntu servers. The short story is that the default visibility for linked symbols was recently changed to hidden on gcc systems, and the Java native interfaces weren't updated to override that default, so any Java programs using native OpenCV functions would mysteriously fail with an UnsatisfiedLinkError. Here's my workaround:

--- a/cmake/OpenCVCompilerOptions.cmake
+++ b/cmake/OpenCVCompilerOptions.cmake
@@ -252,8 +252,8 @@ set(OPENCV_EXTRA_EXE_LINKER_FLAGS_DEBUG "${OPENCV_EXTRA_EXE_LINKER_FLAGS_DEBUG

# set default visibility to hidden
if(CMAKE_COMPILER_IS_GNUCXX AND CMAKE_OPENCV_GCC_VERSION_NUM GREATER 399)
- add_extra_compiler_option(-fvisibility=hidden)
- add_extra_compiler_option(-fvisibility-inlines-hidden)
+# add_extra_compiler_option(-fvisibility=hidden)
+# add_extra_compiler_option(-fvisibility-inlines-hidden)
endif()

The tricky part of tracking this down was that nm didn't show the .hidden attribute, so the library symbols appeared fine. It was only when I switched to objdump, after exhausting everything else I could think of, that the problem became clear.

Anyway, I wanted to leave some Google breadcrumbs for anyone else who hits this! I've filed a bug with the OpenCV folks, hopefully it will be fixed soon.

Five short links

Photo by Leo Reynolds

External framework problems in Go – Handling dependencies well is extremely hard, and can lead to insane yak-shaving expeditions like this when things go wrong. It's like an avalanche – changing versions on one library can impact several others, so you have to update or downgrade those too, and suddenly you're facing an ever-increasing amount of work just to get back to where you were!

D-wave comparison with classical computers – I don't know enough about quantum computing problems to comment on the details of the argument, but it's awesome to see such a deep technical dive as an instant blog post, rather than having to wait months for a paper.

Blogging is dead, but have we fixed anything? – "I find my blogging here to be too useful to me to stop doing it" sums up why I'm still working in a now-archaic medium!

What statistics should do about big data – "[Statisticians] want an elegant theory that we can then apply to specific problems if they happen to come up." That's been exactly my experience, and why I've never encountered statisticians as I've followed my curiosity to new data problems. The article this is in response to contains the assumption that 'funding agencies' have driven the CS takeover of data processing, but, despite a lot of the founders having roots in academia, almost all the innovations I've seen have been incubated in commercial environments.

The hidden sexism in CS departments – A portrait of managerial cluelessness when dealing with a nasty incident. Even if each occurrence is comparatively minor, it's the steady drip-drip of unwelcoming behavior that drives non-stereotypical geeks out of our world.

Five short links

Photo by ahojohnebrause

Max Headroom and the strange world of pseudo-CGI – I've always been fascinated by cargo cult analog tributes to technology. Maybe my early exposure to Max gave me the bug?

Reidentification as basic science – Arvind does a fantastic job of explaining why the research he does is so important. I love learning more about people from data, and most of the interesting insights come from interrogating it in unusual ways and finding unexpected connections, which is what his work is all about.

A 21cm radio telescope for the cost-conscious – Beautiful geekery. Who doesn't want to map the Milky Way's radio emissions using nothing more sophisticated than a $20 USB TV dongle?

How Google Code Search worked – An eminently-practical guide to implementing a regular-expression search engine, from the author of the late-lamented Google Code Search. It even comes with working source code!
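The core trick is small enough to sketch: index every trigram in every document, intersect the posting lists for the trigrams of a query's literal substring, and only run the real regex on the survivors. A toy Python version of the idea (my own illustration, not the article's code):

import re
from collections import defaultdict

docs = {
    1: "regular expressions are fun",
    2: "trigram indexes narrow the search",
}

# Index every document by its 3-character substrings.
index = defaultdict(set)
for doc_id, text in docs.items():
    for i in range(len(text) - 2):
        index[text[i:i + 3]].add(doc_id)

def search(literal, pattern):
    # Intersect posting lists for the literal's trigrams...
    candidates = set(docs)
    for i in range(len(literal) - 2):
        candidates &= index[literal[i:i + 3]]
    # ...then confirm the survivors with the actual regex.
    return [d for d in candidates if re.search(pattern, docs[d])]

print(search("trigram", r"trigram index(es)?"))  # [2]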

3D lightning – Calculating the three-dimensional path of a lightning bolt from two simultaneous pictures taken from different spots.