Five short links


Photo by Etienne Girardet

TwoFishes – A kick-ass geocoder for everything above the street level, by David Blackman, based on GeoNames and other public data. I’m hoping to roll this into the next distribution of the DSTK; it does a brilliant job handling a lot of tricky problems like native-language Unicode.

NIST Randomness Beacon – Provides a new set of random numbers every sixty seconds, along with a checksum linking them to the previous values, and an archive of all the random values from earlier time intervals. I have no idea what to do with this, but it feels like such an interesting primitive for applications that need verifiable timestamps.
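The chaining idea is simple enough to sketch. Here’s a toy version in Python – the record fields and hashing layout are my own invention for illustration, not the Beacon’s actual output format:

```python
import hashlib
import os

def make_record(random_value: bytes, prev_digest: bytes) -> dict:
    """A beacon-style record: fresh randomness plus a hash link to the previous record."""
    digest = hashlib.sha256(prev_digest + random_value).hexdigest()
    return {"value": random_value.hex(), "prev": prev_digest.hex(), "digest": digest}

def verify_chain(records: list) -> bool:
    """Recompute each link; tampering with any value breaks every later digest."""
    for prev, cur in zip(records, records[1:]):
        expected = hashlib.sha256(
            bytes.fromhex(cur["prev"]) + bytes.fromhex(cur["value"])
        ).hexdigest()
        if cur["prev"] != prev["digest"] or cur["digest"] != expected:
            return False
    return True

# Build a short chain, starting from an all-zeroes genesis digest.
records = [make_record(os.urandom(32), b"\x00" * 32)]
for _ in range(4):
    records.append(make_record(os.urandom(32), bytes.fromhex(records[-1]["digest"])))
```

Because each digest covers the previous one, you can’t quietly rewrite an old value without republishing every record after it – which is what makes it useful as a timestamping primitive.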

Understanding DMA Malware – After hiding code on hard drive controllers, and then in a CPU’s microcode, here’s an example of writing a keylogger that runs entirely through the direct-memory access controllers that most systems support, with demos on both Linux and Windows. As our devices become saturated with computation, there are so many places for malicious code to hide.

What are these giant concrete arrows across the American landscape? – We used to need a chain of massive earthworks across the continent to help planes navigate!

Mapping and its discontents – I’m excited to see Berkeley focusing on the power of maps as stories rather than treating them as a technical subject. I hope I can make this symposium; I’d love to hear from folks like Rebecca Solnit, author of the wonderful Infinite City atlas of SF.

Why OpenHeatMap is banned from Github


Photo by Sinead Fenton

OpenHeatMap is an abject failure commercially, but I’ve kept it running because around 40,000 people a year use it to create simple visualizations. It’s free and aimed at non-technical folks, so a lot of them are from non-profits, schools, local activist groups, and other causes that I enjoy helping.

I spend a few hours a week answering emails, and occasionally have to do some server maintenance, but it’s generally a fairly light labor-of-love. The source code has always been up on Github, but a few months ago it was taken down after a copyright complaint. Unfortunately Github don’t seem to have any process for me to contest this decision, and after a few inconclusive exchanges in which I supplied further information to their support team, they’ve stopped replying to my messages.

It’s not really hurting me personally; I’m still able to keep up my maintenance on the project from a local copy. But I’ve had to field puzzled emails from people who want to fork or learn from the project, including one team with an intriguing idea for a suite of visualization tools aimed at non-profits. I don’t like disappointing these folks, so I’m putting this together as an explanation of why I’ve not been able to help them.

The actual complaint isn’t completely nuts, but what has left me sad is how Github handled it. About five years ago somebody emailed me a bug report, attaching a few spreadsheets of addresses and names without any description of what they were. This is a pretty typical use case for the website; I’ll often see what look like short fragments of a phone book as test files. I fixed the bug and added the file to my unit tests.

A few months ago, years after I’d originally created the unit test, I received an email from a CTO at a consulting firm, angry that what appeared to be a list of his staff was available if you dug around enough on Github, though it was unlabeled. It appeared one of his employees had sent it to me as a test case. I felt a bit embarrassed – looking back, I’d be more careful about what I used as my unit test inputs, at least scrubbing them more rigorously so that only address data remains (though a lot of the tests are about identifying what is address data in a soup of other columns).

I went ahead and removed the offending data from my repository and its history, and checked the result in. I also started the process of removing the unit test directory entirely from the public project, though that was a longer task. Unfortunately the complainant found another copy of the file he hadn’t spotted before, and rather than contacting me, got in touch with Github and persuaded them to disable the project.

I was pretty disappointed, but assumed I could get the project back online with the unit tests removed entirely. Unfortunately they don’t seem to have any kind of process for resolving problems like these, and the only point of contact I’ve had is through their main email. I tried reaching out on Twitter too, but without any luck. Right now the project’s stuck in limbo, apparently permanently banned, and I’m not sure how to get it online again. I’m a long-time fan of Github, and there’s no other provider that offers such a good environment for sharing source code, so I’m just sad that they don’t seem set up to handle this sort of problem, and I hope this doesn’t affect other projects going forward.

[Update – I’ve had an email back from Github, it sounds like my previous mails may have gone astray, and they’re on the case.]

[Update 2 – OpenHeatMap is now back online, thanks to everyone for their support!]

How many people read my posts?


Photo by Dustin Jamison

A friend just asked “How popular are the links on your Five Short Links posts compared with your ‘regular’ posts?” This seemed like a good chance to share some data, so here’s a rundown of who reads what, and what drives me to write the posts.

I get 25,000 unique visitors in a typical month, and around 35,000 views. I also have around 4,000 RSS subscribers according to Feedburner, but I don’t know how many of those are actively reading my blog. My biggest traffic sources are search engines, then Twitter, Hacker News, and Reddit. I’ve found I write three kinds of posts, and their traffic patterns are very different.

Show and Tell

When I’ve got a topic I want to tell the world about, I’ll put together a post with some examples and a bit of background on why I think it’s important. A lot of these end up with just a few hundred views, maybe a few thousand if they end up shared on Twitter, and occasionally I’ll end up with tens of thousands if I really hit a nerve. My recent post on Google’s geo APIs has racked up 42,349 views over the last few weeks, largely thanks to a stint on the front page of Hacker News. My post on distrusting data scientists never made it onto a big aggregator like that, but was widely shared on Twitter and through specialized blogs, and has had 13,792 views since it was published. The post I did on name analysis is more typical, with 2,871 views since it was published.

These articles usually take quite a lot of time to research and put together, which means I can rarely do more than one or two a month. I write them because I can’t help myself! I’m passionate about my work, and I love having a platform I can use to grab people by the lapels and rant at them.


It takes less time to write up my notes on a technical problem I’ve had to figure out. I think of these as trails of breadcrumbs through the forest, and my hope is that anyone else who hits the issue can Google it and find something helpful. I rely on other people’s write-ups as a starting point for almost any bug I hit, so my goal is to pay back some of that help, and keep the ecosystem of user-written documentation alive. A typical example would be my post on debugging Javascript errors on iOS which gets 1,932 views a year. These tend to be evergreen, keeping steady traffic for years, and almost everyone finds them through search engines.


The easiest posts, and so the most frequent, are my short link digests. I have a large list of blogs I follow through Feedly, I run across interesting articles while searching on technical topics, and I pick up more from the folks I follow on Twitter. I also find the ‘newest’ page of Hacker News full of neglected gems. A lot of my favorite links never make it to the front page, which is usually heavy on controversy and unkind to interesting-but-unsensational projects.

I’ve been collecting links for years, and copying Nat Torkington’s Radar post format (plus 25% extra) gave me a fun way to share them with the world. The posts don’t get a massive number of clicks – the last three got 66, 61, and 91 views total – but people seem to like them. I end up having a lot of conversations with folks I never would have been in touch with, and it feels good to shine a light on projects that deserve more attention. My favorite result is seeing a startup or framework get picked up by a publication with a much bigger audience, since I seem to have a decent number of journalists and other bloggers following my posts.

Five short links


Photo by Romana Klee

The joy of unrepresentative samples – It’s uncontroversial in the commercial world that biased samples can still produce useful results, as long as you are careful. There are techniques that help you understand your sample, like bootstrapping, and we’re lucky enough to have frequent external validation because we’re almost always measuring so we can make changes, and then we see if they work according to our models. The comments on this post are worth reading because the approach seems to offend some sociologists viscerally. (via Trey Causey and Benjamin Lind)
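As a minimal illustration of the bootstrapping technique mentioned above, here’s a sketch in Python: resample your data with replacement many times, and use the spread of the recomputed statistic to judge how much your sample can actually tell you. The data and helper names here are invented for the example:

```python
import random

def bootstrap_ci(sample, stat, n_resamples=2000, alpha=0.05, seed=42):
    """Estimate a confidence interval for `stat` by resampling with replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(sample) for _ in sample]) for _ in range(n_resamples)
    )
    # Take the middle (1 - alpha) slice of the bootstrap distribution.
    low = estimates[int(n_resamples * alpha / 2)]
    high = estimates[int(n_resamples * (1 - alpha / 2))]
    return low, high

# e.g. conversions from a (possibly biased) sample of fifteen site visitors
visits = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
mean = lambda xs: sum(xs) / len(xs)
low, high = bootstrap_ci(visits, mean)
```

A wide interval from a small or lopsided sample is exactly the kind of self-knowledge the post argues for: the bias doesn’t go away, but at least you can see how shaky your estimate is.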

Humanize – A Javascript library that handles the common language transformations like translating numbers into ordinal text (eg 1 into ‘first’), turning lists into comma-separated strings with ‘and’ between the last two entries, and other goodies. I wonder if this will be translated into languages other than English?
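To give a feel for what these transformations involve, here’s a rough Python sketch of two of them. These are my own simplified versions, not Humanize’s actual API, and the ordinal helper produces ‘1st’ rather than the word ‘first’:

```python
def ordinal(n: int) -> str:
    """1 -> '1st', 2 -> '2nd', 11 -> '11th', 23 -> '23rd'."""
    if 10 <= n % 100 <= 20:
        # The teens are all 'th': 11th, 12th, 13th...
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def join_with_and(items) -> str:
    """['a'] -> 'a'; ['a', 'b'] -> 'a and b'; ['a', 'b', 'c'] -> 'a, b and c'."""
    items = list(items)
    if len(items) <= 1:
        return "".join(items)
    return " and ".join([", ".join(items[:-1]), items[-1]])
```

The teens special case is a good example of why these libraries are harder to internationalize than they look – every language has its own pile of exceptions.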

Should Excel spreadsheets be subject to external peer review? – Making it easy to get more sets of eyes on your data.

Thoughts on Intel’s upcoming software guard extensions – My conclusion after reading this overview is that the complexity of modern processors is mind-boggling, and it’s becoming increasingly impossible to verify the security of any of the hardware or software we use by inspection.

Black Midi – Jamming insane numbers of notes into an ancient music format, and playing them back with the dinkiest software you can find. A thing of beauty.

Five short links


Photo by Chintermeyer

Black Perl – “BEFOREHAND: close door, each window & exit; wait until time. / open spellbook, study, select it, confess, tell, deny;” – A compilable poem, beautifully weird.

Relationship Timelines – It’s rare that a network visualization illuminates, rather than impresses, but XKCD’s Lord of the Rings et al narrative charts actually added something to my understanding of the movies. Skye Bender-deMoll has pulled together some similar research examples to try to figure out how to create similar graphs automatically, and I’m hoping he succeeds.

A deadly gift from the stars – A meditation on the fragility and unlikeliness of life, built around Iain Banks’ hope that his cancer was caused by cosmic rays rather than something more banal.

Freedreno updates – I spent months working with the then-ATI driver engineers debugging problems with the way we used their Radeon GPUs at Apple, but I was never able to see their source code. That means I’m excited to see an open-source driver for what looks like a very similar chip, the Adreno, used on a lot of mobile devices. The development process is fascinating too, I wish I’d been able to instrument the drivers to understand performance problems in the depth Rob is able to.

GitSpatial – Github are making a big effort with their GeoJSON support, which is an interesting expansion outside of their traditional code focus and into data. GitSpatial is an intriguing layer on top of that support, adding a query API with Github as the backing store.

Five short links


Photo by Tambako the Jaguar

Where’s my fusion reactor? – An engrossing overview of the state of smaller fusion research projects. For the past half-century, fusion has perpetually been twenty years away, so I’d love one of these to come out of the shadows and surprise us all.

Smathermather’s weblog – I don’t often link to entire blogs, but Stephen Mather’s is so full of impressive geo-hacking posts it would be an injustice to link to just one of them. I am particularly fond of his use of POV-Ray for analyzing the available views from particular points in the landscape though. I spent the summer of 1990 furiously rendering 160×120 images using POV trying to create the ultimate mirror-ball on a chess-board. It left me amazed that there were programmers generous enough to give the software away for free, and itching to write something myself.

Finding important words in a document using TF/IDF – A straightforward explanation of a powerful approach that’s often cloaked in jargon.
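The core of TF/IDF really does fit in a few lines. Here’s a minimal sketch in Python – a word scores highly when it’s frequent in one document (term frequency) but rare across the collection (inverse document frequency):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Score each term in each document: term frequency x inverse document frequency."""
    n_docs = len(docs)
    # How many documents contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the dog ate my homework".split(),
]
scores = tf_idf(docs)
```

A word like ‘the’ that appears in every document gets an IDF of log(1) = 0, so it scores zero everywhere – which is exactly the stop-word suppression that makes the technique useful. Production implementations add smoothing and normalization on top, but this is the whole idea.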

Unusually effective debugging – Early in my career I noticed that I spent most of my time debugging, and that the biggest difference between the most productive programmers and the least was how effective they were at it. You end up debugging when there’s a mismatch between the mental model of what you think your code should be doing, and how it’s actually being executed. This article has some excellent advice on ways to find the flaw in your mental model as quickly as possible: “It’s about killing your darlings, looking for evidence to prove your theories false. It’s about ignoring the how and why and describing, as precisely as possible, what the problem is. It’s about imagining a huge multidimensional search space of possibilities and looking for ways to eliminate half or whole dimensions, recursively, until you’ve isolated the fault.”
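That recursive halving of the search space is the same idea behind git bisect. Here’s a toy sketch of the search, with numbered commits standing in for any ordered space of possibilities – the names and scenario are mine, not the article’s:

```python
def bisect_first_bad(candidates, is_bad):
    """Find the first element where `is_bad` flips from False to True,
    assuming a single transition (e.g. commits ordered oldest to newest)."""
    low, high = 0, len(candidates) - 1
    while low < high:
        mid = (low + high) // 2
        if is_bad(candidates[mid]):
            high = mid        # the first failure is here or earlier
        else:
            low = mid + 1     # the failure is strictly later
    return candidates[low]

# e.g. commits 0..99, where a regression landed at commit 57
commits = list(range(100))
first_bad = bisect_first_bad(commits, lambda c: c >= 57)
```

Each test eliminates half the remaining candidates, so a hundred commits take seven checks instead of a hundred – which is the payoff the article describes for eliminating whole dimensions of the fault space at once.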

Akkie, and the 101 things you can do with a CD-ROM drive’s eject function – There’s a zen-like beauty in focusing on the possibilities of misusing a single basic component in creative ways. Feeding hamsters, Twitter notifications, ringing bells – all pure hacks in the best way possible.

Five short links


Photo by Ken-ichi Ueda

Using public data to extract money by shaming people – There is a big difference between theoretically public, and being publicized. The traditional computer science model of privacy is binary, either information is secret or not, but real-world security has always relied on shades of accessibility, enforced by mechanisms that make it hard to gather and distribute protected data sets in bulk. Fifty years ago someone could have gone down to a courthouse, copied parking tickets from paper files, and taken out thousands of classified ads in the local newspaper to run the same scheme, but they didn’t because the time and money involved meant it wouldn’t make a profit. We’ve now removed almost all the friction from data transfers, and so suddenly the business model is viable.

Cargo Cult Analytics – All the measurements in the world won’t help you if you don’t know what your goal is.

How to ruin your technical session in ten easy stages – I’ve given some terrible talks, usually when I’ve over-committed myself and not spent enough time preparing. I love “anti-planning”, where you list all the ways you’d screw up a project if you were deliberately trying to sabotage it, and then use that as a check-list of the dangers to watch out for, so this post will be on my mind for next time.

Notes on Intel microcode – A demonstration of how little we actually know about our CPUs, despite building a civilization that relies on them. Just like hard drive controller subversion, this provides an attack surface that almost nobody would think of guarding. The techniques used to investigate the encrypted microcode updates are worth studying as outstanding hacks too.

Null Island – Nestled off the coast of West Africa at latitude, longitude (0˚, 0˚), Null Island is the home of a surprising amount of geo data, though I never knew its name until Gnip gave me a cool t-shirt. After mentioning my appreciation, I was pleased to find out that my friend Michal Migurski was one of the original discoverers!
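Most of Null Island’s population is failed geocodes defaulting to (0, 0), so a common defensive habit is to filter those points out before analysis. A minimal sketch, with invented sample coordinates:

```python
def is_null_island(lat: float, lon: float, tolerance: float = 1e-6) -> bool:
    """Flag coordinates sitting at (0, 0) - almost always a failed geocode,
    not a real location off the coast of West Africa."""
    return abs(lat) < tolerance and abs(lon) < tolerance

# e.g. a batch of geocoder results, two of which silently failed
points = [(37.77, -122.42), (0.0, 0.0), (51.5, -0.12), (0.0, 0.0)]
clean = [(lat, lon) for lat, lon in points if not is_null_island(lat, lon)]
```

It’s crude – a record could legitimately sit in the Gulf of Guinea – but dropping exact zeroes catches the overwhelmingly common failure mode before it skews your maps and statistics.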

Why you should stop pirating Google’s geo APIs


Picture by Scott Vandehey

This morning I ran across a wonderful open source project called “Crime doesn’t climb”, analyzing how crime rates vary with altitude in San Francisco. Then I reached this line, and honestly couldn’t decide whether to cry or scream: “Here’s the code snippet that queries the Google Elevation API (careful – Google rate limits aggressively)”.

Google is very clear about the accepted usage of all their geo APIs, here’s the quote that’s repeated in almost every page: “The Elevation API may only be used in conjunction with displaying results on a Google map; using elevation data without displaying a map for which elevation data was requested is prohibited.”

The crime project isn’t an exception, it’s common to see geocoding and other closed APIs being used in all sorts of unauthorized ways. Even tutorials openly recommend going this route.

So what? Everyone ignores the terms, and Google doesn’t seem to enforce them energetically. People have projects to build, and the APIs are conveniently to hand, even if they’re technically breaking the terms of service. Here’s why I care, and why I think you should too:

Google’s sucking up all the oxygen

Because everyone’s using closed-source APIs from Google, there’s very little incentive to improve the open-source alternatives. Microsoft loved it when people in China pirated Windows, because that removed a lot of potential users for free alternatives, and so hobbled their development, and something very similar is happening in the geo world. Open geocoding alternatives would be a lot further along if crowds of frustrated geeks were diving in to improve them, rather than ignoring them.

You’re giving them a throat to choke

Do you remember when the Twitter API was a wonderful open platform to build your business on? Do you remember how well that worked out? If you’re relying on Google’s geo APIs as a core part of your projects you already have a tricky dependency to manage even if it’s all kosher. If you’re not using them according to the terms of service, you’re completely at their mercy if it becomes successful. Sometimes the trade-off is going to be worth it, but you should at least be aware of the alternatives when you make that choice.

A lot of doors are closed

Google aggressively rate-limits its API usage, so you won’t be able to run bulk data analysis, and you can only access the data in a handful of ways. For example, the crime project was forced to run point sampling across the city to estimate the proportion of the city at each elevation, when full access to the data would have allowed them to calculate that much more directly and precisely. By starting with a closed API, you’re drastically limiting the answers you’ll be able to pull from the data.
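To make that trade-off concrete, here’s a toy sketch comparing direct calculation over raw data with the point-sampling estimate a rate-limited API forces on you. The grid, threshold, and sample counts are invented for illustration:

```python
import random

# A toy 100x100 elevation grid standing in for full access to the raw data.
random.seed(0)
grid = [[random.uniform(0, 300) for _ in range(100)] for _ in range(100)]

def exact_fraction_above(grid, threshold):
    """With the raw data you can just count cells directly."""
    cells = [v for row in grid for v in row]
    return sum(v > threshold for v in cells) / len(cells)

def sampled_fraction_above(grid, threshold, n_samples=2000, seed=1):
    """A rate-limited point API forces you to estimate by querying random points."""
    rng = random.Random(seed)
    hits = sum(
        grid[rng.randrange(len(grid))][rng.randrange(len(grid[0]))] > threshold
        for _ in range(n_samples)
    )
    return hits / n_samples

exact = exact_fraction_above(grid, 150.0)
estimate = sampled_fraction_above(grid, 150.0)
```

The estimate converges on the true value, but only after thousands of queries against a rate limiter, and the sampling error never entirely goes away – a calculation that’s one pass over the raw data becomes slow, noisy, and expensive.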

You’re missing out on all the fun

I’m not RMS, I love open-source for very pragmatic reasons. One of the biggest is that I hate hitting black boxes when I’m debugging! When I was first using Yahoo’s old Placemaker API, I was driven crazy by its habit of marking any reference to “The New York Times” as being in New York. I ended up having to patch around this habit for all sorts of nouns, doing a massive amount of work when I knew that it would be far simpler to tweak the original algorithm for my use case. When I run across bugs or features I’d like to add to open-source software, I can dive in, make the changes, and anyone else who has the same problem also benefits. It’s not only more efficient, it’s a lot more satisfying too.

So, what can you do?

There’s a reason Google’s geo APIs are dominant – they’re well-documented, have broad coverage, and are easy to access. There’s nothing in the open world that matches them overall. There are good solutions out there though, so all I’d ask is that you look into what’s available before you default to closed data.

I’ve put my money where my mouth is, by pulling together the Data Science Toolkit as an open VM that wraps a lot of the geo community’s greatest open-source projects in a friendly and familiar interface, even emulating Google’s geocoder URL structure. Instead of using Google’s elevation API, the crime project could have used NASA’s SRTM elevation data through the coordinates2statistics JSON endpoint, or even logged in to the PostGIS database that drives it to run bulk calculations.

There are a lot of other alternatives too. I have high hopes for Nominatim, OpenStreetMap’s geocoding service, though a lot of my applications require a more ‘permissive’ interface that accepts messier input. PostGIS now comes with a geocoder for US Census ‘Tiger’ data pre-installed too. Geonames has a great set of data on places all around the world you can explore.

If you don’t see what you want, figure out if there are any similar projects you might be able to extend with a little effort, or that you can persuade the maintainers to work on for you. If you need neighborhood boundaries, why not take a look at building them in Zetashapes and contributing them back? If Nominatim doesn’t work well for your country’s postal addresses, dig into improving their parser. I know only a tiny percentage of people will have the time, skills, or inclination to get involved, but just by hearing about the projects, you’ve increased the odds you’ll end up helping.

I want to live in a world where basic facts about the places we live and work are freely available, so it’s a lot easier to build amazing projects like the crime analysis that triggered this rant. Please, at least find out a little bit about the open alternatives before you use Google’s geo APIs – you might be pleasantly surprised at what’s out there!

Five short links


Photo by Igor Schwarzmann

A guide for the lonely bioinformatician – I run across a lot of data scientists who are isolated in their teams, which is a recipe for failure. This guide has some great practical steps you can take to connect to other people both inside and outside your organization.

A guide for the young academic politician – From 1908, but still painfully funny, and even more painfully true. “You will begin, I suppose, by thinking that people who disagree with you and oppress you must be dishonest. Cynicism is the besetting and venial fault of declining youth, and disillusionment its last illusion. It is quite a mistake to suppose that real dishonesty is at all common. The number of rogues is about equal to the number of men who act honestly; and it is very small. The great majority would sooner behave honestly than not. The reason why they do not give way to this natural preference of humanity is that they are afraid that others will not; and the others do not because they are afraid that they will not. Thus it comes about that, while behavior which looks dishonest is fairly common, sincere dishonesty is about as rare as the courage to evoke good faith in your neighbors by showing that you trust them.”

The most mysterious radio transmission in the world – A Russian radio station that’s been transmitting a constant tone, interrupted every few months by mysterious letters, since the 1970s.

The surprising subtleties of zeroing a register – x86 CPUs recognize common instructions that programmers use to zero registers, such as XORing or subtracting a register from itself, and replace them on the fly with no-cost references to a hidden constant zero register. A good reminder that even when you think you’re programming the bare metal, you’re still on top of ever-increasing layers of indirection.

Why do so many incompetent men become leaders? – Evidence-based analysis of why “we tend to equate leadership with the very psychological features that make the average man a more inept leader than the average woman”.