Five short links

fivemeerkats

Photo by Tambako the Jaguar

Where’s my fusion reactor? – An engrossing overview of the state of smaller fusion research projects. For the past half-century, fusion has perpetually been twenty years away, so I’d love one of these to come out of the shadows and surprise us all.

Smathermather’s weblog – I don’t often link to entire blogs, but Stephen Mather’s is so full of impressive geo-hacking posts it would be an injustice to link to just one of them. I am particularly fond of his use of POV-Ray for analyzing the available views from particular points in the landscape though. I spent the summer of 1990 furiously rendering 160×120 images using POV trying to create the ultimate mirror-ball on a chess-board. It left me amazed that there were programmers generous enough to give the software away for free, and itching to write something myself.

Finding important words in a document using TF/IDF – A straightforward explanation of a powerful approach that’s often cloaked in jargon.
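The core computation fits in a few lines; this is a minimal sketch with a toy corpus and no smoothing (real implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization):

```python
import math
from collections import Counter

# Toy corpus: each document is a list of words.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "fusion reactors are twenty years away".split(),
]

def tf_idf(doc, corpus):
    """Score each term in doc: term frequency times log inverse document frequency."""
    counts = Counter(doc)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc)                          # how often the term appears here
        df = sum(1 for d in corpus if term in d)       # how many documents contain it
        idf = math.log(len(corpus) / df)               # rarer across the corpus = higher weight
        scores[term] = tf * idf
    return scores

scores = tf_idf(docs[0], docs)
# Words shared across documents ("the", "cat") score low; words unique
# to this document ("sat", "mat") rise to the top.
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```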

Unusually effective debugging – Early in my career I noticed that I spent most of my time debugging, and that the biggest difference between the most productive programmers and the least was how effective they were at it. You end up debugging when there’s a mismatch between the mental model of what you think your code should be doing, and how it’s actually being executed. This article has some excellent advice on ways to find the flaw in your mental model as quickly as possible: “It’s about killing your darlings, looking for evidence to prove your theories false. It’s about ignoring the how and why and describing, as precisely as possible, what the problem is. It’s about imagining a huge multidimensional search space of possibilities and looking for ways to eliminate half or whole dimensions, recursively, until you’ve isolated the fault.”

Akkie, and the 101 things you can do with a CD-ROM drive’s eject function – There’s a zen-like beauty about focusing on the possibilities of misusing a single basic component in creative ways. Feeding hamsters, twitter notifications, ringing bells, all pure hacks in the best way possible.

Five short links

fivespots

Photo by Ken-ichi Ueda

Using public data to extract money by shaming people – There is a big difference between being theoretically public and being publicized. The traditional computer science model of privacy is binary, either information is secret or not, but real-world security has always relied on shades of accessibility, enforced by mechanisms that make it hard to gather and distribute protected data sets in bulk. Fifty years ago someone could have gone down to a courthouse, copied parking tickets from paper files, and taken out thousands of classified ads in the local newspaper to run the same scheme, but they didn’t because the time and money involved meant it wouldn’t make a profit. We’ve now removed almost all the friction from data transfers, and so suddenly the business model is viable.

Cargo Cult Analytics – All the measurements in the world won’t help you if you don’t know what your goal is.

How to ruin your technical session in ten easy stages – I’ve given some terrible talks, usually when I’ve over-committed myself and not spent enough time preparing. I love “anti-planning”, where you list all the ways you’d screw up a project if you were deliberately trying to sabotage it, and then use that as a check-list of the dangers to watch out for, so this post will be on my mind for next time.

Notes on Intel microcode – A demonstration of how little we actually know about our CPUs, despite building a civilization that relies on them.  Just like hard drive controller subversion, this provides an attack surface that almost nobody would think of guarding. The techniques used to investigate the encrypted microcode updates are worth studying as outstanding hacks too.

Null Island – Nestled off the coast of West Africa at latitude, longitude (0˚, 0˚), Null Island is the home of a surprising amount of geo data, though I never knew its name until Gnip gave me a cool t-shirt. After mentioning my appreciation, I was pleased to find out that my friend Michal Migurski was one of the original discoverers!

Why you should stop pirating Google’s geo APIs

skullandcrossbones

Picture by Scott Vandehey

This morning I ran across a wonderful open source project called “Crime doesn’t climb”, analyzing how crime rates vary with altitude in San Francisco. Then I reached this line, and honestly couldn’t decide whether to cry or scream: “Here’s the code snippet that queries the Google Elevation API (careful–Google rate limits aggressively)”

Google is very clear about the accepted usage of all their geo APIs, here’s the quote that’s repeated in almost every page: “The Elevation API may only be used in conjunction with displaying results on a Google map; using elevation data without displaying a map for which elevation data was requested is prohibited.”

The crime project isn’t an exception; it’s common to see geocoding and other closed APIs being used in all sorts of unauthorized ways. Even tutorials openly recommend going this route.

So what? Everyone ignores the terms, and Google doesn’t seem to enforce them energetically. People have projects to build, and the APIs are conveniently to hand, even if they’re technically breaking the terms of service. Here’s why I care, and why I think you should too:

Google’s sucking up all the oxygen

Because everyone’s using closed-source APIs from Google, there’s very little incentive to improve the open-source alternatives. Microsoft loved it when people in China pirated Windows, because that removed a lot of potential users for free alternatives, and so hobbled their development, and something very similar is happening in the geo world. Open geocoding alternatives would be a lot further along if crowds of frustrated geeks were diving in to improve them, rather than ignoring them.

You’re giving them a throat to choke

Do you remember when the Twitter API was a wonderful open platform to build your business on? Do you remember how well that worked out? If you’re relying on Google’s geo APIs as a core part of your projects you already have a tricky dependency to manage even if it’s all kosher. If you’re not using them according to the terms of service, you’re completely at their mercy if it becomes successful. Sometimes the trade-off is going to be worth it, but you should at least be aware of the alternatives when you make that choice.

A lot of doors are closed

Google rate-limits its API usage strictly, so you won’t be able to run bulk data analysis. You can also only access the data in a handful of ways. For example, for the crime project they were forced to run point sampling across the city to estimate the proportion of the city that was at each elevation, when having full access to the data would have allowed them to calculate that much more directly and precisely. By starting with a closed API, you’re drastically limiting the answers you’ll be able to pull from the data.
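To make the limitation concrete, point sampling of the kind the crime project used looks roughly like this sketch, where `elevation_at` is a hypothetical stand-in for whatever rate-limited point lookup you’re stuck with (here it fakes a single smooth hill so the code runs offline):

```python
import random

def elevation_at(lat, lon):
    # Hypothetical stand-in for a rate-limited elevation API call;
    # fakes one smooth hill so the sketch is runnable without a network.
    d2 = (lat - 37.76) ** 2 + (lon + 122.44) ** 2
    return 100.0 * max(0.0, 1.0 - d2 * 400)

random.seed(0)
# Bounding box roughly covering San Francisco.
LAT_MIN, LAT_MAX = 37.70, 37.82
LON_MIN, LON_MAX = -122.52, -122.36

# Sample random points and bucket them into 25-unit elevation bands to
# estimate the fraction of the area that falls in each band.
samples = 500
bands = {}
for _ in range(samples):
    lat = random.uniform(LAT_MIN, LAT_MAX)
    lon = random.uniform(LON_MIN, LON_MAX)
    band = int(elevation_at(lat, lon) // 25) * 25
    bands[band] = bands.get(band, 0) + 1

proportions = {band: count / samples for band, count in sorted(bands.items())}
print(proportions)
```

With full access to the underlying raster you could compute the same histogram exactly in one pass, instead of burning hundreds of rate-limited calls on an estimate.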

You’re missing out on all the fun

I’m not RMS, I love open-source for very pragmatic reasons. One of the biggest is that I hate hitting black boxes when I’m debugging! When I was first using Yahoo’s old Placemaker API, I was driven crazy by its habit of marking any reference to “The New York Times” as being in New York. I ended up having to patch around this habit for all sorts of nouns, doing a massive amount of work when I knew that it would be far simpler to tweak the original algorithm for my use case. When I run across bugs or features I’d like to add to open-source software, I can dive in, make the changes, and anyone else who has the same problem also benefits. It’s not only more efficient, it’s a lot more satisfying too.

So, what can you do?

There’s a reason Google’s geo APIs are dominant – they’re well-documented, have broad coverage, and are easy to access. There’s nothing in the open world that matches them overall. There are good solutions out there though, so all I’d ask is that you look into what’s available before you default to closed data.

I’ve put my money where my mouth is, by pulling together the Data Science Toolkit as an open VM that wraps a lot of the geo community’s greatest open-source projects in a friendly and familiar interface, even emulating Google’s geocoder URL structure. Instead of using Google’s elevation API, the crime project could have used NASA’s SRTM elevation data through the coordinates2statistics JSON endpoint, or even logged in to the PostGIS database that drives it to run bulk calculations.

There are a lot of other alternatives too. I have high hopes for Nominatim, OpenStreetMap’s geocoding service, though a lot of my applications require a more ‘permissive’ interface that accepts messier input. PostGIS now comes with a geocoder for US Census ‘Tiger’ data pre-installed too. Geonames has a great set of data on places all around the world you can explore.

If you don’t see what you want, figure out if there are any similar projects you might be able to extend with a little effort, or that you can persuade the maintainers to work on for you. If you need neighborhood boundaries, why not take a look at building them in Zetashapes and contributing them back? If Nominatim doesn’t work well for your country’s postal addresses, dig into improving their parser. I know only a tiny percentage of people will have the time, skills, or inclination to get involved, but just by hearing about the projects, you’ve increased the odds you’ll end up helping.

I want to live in a world where basic facts about the places we live and work are freely available, so it’s a lot easier to build amazing projects like the crime analysis that triggered this rant. Please, at least find out a little bit about the open alternatives before you use Google’s geo APIs, you might be pleasantly surprised at what’s out there!

Five short links

fivecoffee

Photo by Igor Schwarzmann

A guide for the lonely bioinformatician – I run across a lot of data scientists who are isolated in their teams, which is a recipe for failure. This guide has some great practical steps you can take to connect to other people both inside and outside your organization.

A guide for the young academic politician – From 1908, but still painfully funny, and even more painfully true. “You will begin, I suppose, by thinking that people who disagree with you and oppress you must be dishonest. Cynicism is the besetting and venial fault of declining youth, and disillusionment its last illusion. It is quite a mistake to suppose that real dishonesty is at all common. The number of rogues is about equal to the number of men who act honestly; and it is very small. The great majority would sooner behave honestly than not. The reason why they do not give way to this natural preference of humanity is that they are afraid that others will not; and the others do not because they are afraid that they will not. Thus it comes about that, while behavior which looks dishonest is fairly common, sincere dishonesty is about as rare as the courage to evoke good faith in your neighbors by showing that you trust them.”

The most mysterious radio transmission in the world – A Russian radio station that’s been transmitting a constant tone, interrupted every few months by mysterious letters, since the 1970s.

The surprising subtleties of zeroing a register – x86 CPUs recognize common instructions that programmers use to zero registers, such as XORing or SUBtracting a register from itself, and replace them on the fly with no-cost references to a hidden constant zero register. A good reminder that even when you think you’re programming the bare metal, you’re still on top of ever-increasing layers of indirection.

Why do so many incompetent men become leaders? – Evidence-based analysis of why “we tend to equate leadership with the very psychological features that make the average man a more inept leader than the average woman”.

 

How to save an image to disk from PostGIS

PostGIS’s new raster support has been incredibly useful to me, but problems can be hard to debug. One of my favorite techniques is saving out intermediate images so I can see what’s going on, so I sat down and tried to figure out how to do that in Postgres. I found some handy functions like ST_AsPng(), ST_AsJpeg() and ST_AsTiff() to convert rasters to bitmap images in memory, but was surprised that none of the examples showed how to save the results as files from the psql client. As it turns out, that’s because it’s very awkward to do! I spent some time Googling, and managed to piece together a solution from various different sources, so here’s a trail of breadcrumbs for anyone else hitting the same problem.

The key problem is that the postgres client has no way to write out plain-old untouched binary data. You might think that the BINARY modifier to COPY would do the trick, but that actually writes out a header and other fields as well as the data we want. It turns out that your best bet is to encode the data as hexadecimal text, and then use an external program to convert it back to the raw binary data you want. Here’s the recipe:

First, in the psql client run this (with the table and file names changed as appropriate):

COPY (SELECT encode(ST_AsPNG(raster), 'hex') AS png FROM table_with_raster) TO '/tmp/myimage.hex';

If you cat the /tmp/myimage.hex file, you’ll see a stream of hex numbers in text, like this: “89504e470d0a1a0a0000…“. You’ll now need to open up a normal unix terminal and run the xxd command to stitch the text back into a binary file:

xxd -p -r /tmp/myimage.hex > /tmp/myimage.png

Voila! You now have a file representing your raster, so you can actually see what’s going on with your data. Any suggestions on how to improve this are very welcome in the comments section, it doesn’t seem like it should be this hard.
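As a sanity check on the decode half of the recipe: binascii.unhexlify in Python performs the same conversion as `xxd -p -r`, and the first eight bytes of the hex dump quoted above are just the standard PNG file signature:

```python
import binascii

# The hex dump psql wrote is plain text; unhexlify turns each pair of
# hex digits back into one raw byte, exactly like `xxd -p -r`.
hex_text = "89504e470d0a1a0a"   # start of the dump from the example above
raw = binascii.unhexlify(hex_text)
print(raw)  # b'\x89PNG\r\n\x1a\n' -- the PNG file signature
```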

Five short links

russianfive

Photo by Leo-setä

Are your recommendations any good? – Does the output of your machine learning pass the ‘So what?’ test? We’re in the enviable position of being able to predict a lot of things we couldn’t before, but we need to be careful that the predictions actually matter.

Meet Marvin Heiferman – “1.2 billion photographs are made per day. Half-a-trillion per year.”  I sometimes feel like I’m taking crazy pills trying to explain how mind-blowingly world-changing humanity’s photo-taking habit is, so it was great to run across Marvin’s deep thinking on the subject.

Parsing C++ is literally undecidable – I spent over a decade working almost exclusively in C++, and the arcana involved in the language design still amazes me. I’ve lost more neurons than I’d like remembering Koenig lookup rules, and this example just reminds me how much more insanity can emerge from a language with such a complex definition.

Evidence-based data analysis – If we were able to define best practices for the data analysis pipeline, then we’d be able to compare and trust the results of data science a lot more. We’d no longer be able to pick and choose our techniques to get artificially good results.

Transmitting live from the ocean below the Antarctic ice – Bring a little bit of the world’s strange vastness to your desk.

Five short links

fivekids

Picture by Wackystuff

Colors of the Internet – The palettes of the most popular 1,000 websites worldwide.

Compression bombs – Create a tiny compressed file that expands into terabytes when decompressed, persuade a user to open it in their browser, and sit back as their drive fills up and their machine becomes unresponsive.
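The arithmetic behind such a bomb is easy to check with Python’s zlib (a benign, in-memory demonstration; real bombs nest archives inside archives to multiply the ratio):

```python
import zlib

# Highly repetitive data compresses at extreme ratios: ten megabytes of
# zero bytes shrink to roughly ten kilobytes, since deflate tops out
# near 1032:1. Nesting archives compounds ratios like this into terabytes.
data = b"\x00" * (10 * 1024 * 1024)   # 10 MB of zeros
compressed = zlib.compress(data, 9)
ratio = len(data) / len(compressed)
print(f"{len(data)} bytes -> {len(compressed)} bytes ({ratio:.0f}:1)")
```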

What is the best comment in source code you have ever encountered? – ‘stop(); // Hammertime!‘ might be my favorite.

Unix at 25 – How AT&T’s house rules of ‘no support, no bug fixes, and no credit’ ended up fostering an incredible community ethos of collaboration. ‘The general attitude of AT&T toward Unix–“no advertising, no support, no bug fixes, payment in advance”–made it necessary for users to band together.’

‘User’ Considered Harmful – Break the habit of referring to the wonderful human beings who take the time to interact with your software as ‘users’.

Five short links

grainyfive

Photo by N-ino

Word2Vec – Given a large amount of training text, this project figures out which words show up together in sentences most often, and then constructs a small vector representation for each word that captures those relationships. It turns out that simple arithmetic works on these vectors in an intuitive way, so that vector(‘Paris’) – vector(‘France’) + vector(‘Italy’) results in a vector that is very close to vector(‘Rome’). It’s not just elegantly neat, this looks like it will be very useful for clustering, and any other application where you need a sensible representation of a word as a number.
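The arithmetic is easy to illustrate with hand-built toy vectors (these three-dimensional ‘embeddings’ are constructed just for the example; real word2vec vectors are learned from text and run to hundreds of dimensions):

```python
import math

# Capitals are built as country + a shared "capital-ness" offset, so the
# Paris - France + Italy analogy holds by construction.
capital_offset = [0.3, 0.3, 1.0]
country = {
    "France":  [1.0, 0.1, 0.2],
    "Italy":   [0.1, 1.0, 0.2],
    "Germany": [0.2, 0.1, 0.9],
}
vec = dict(country)
for name, cap in [("France", "Paris"), ("Italy", "Rome"), ("Germany", "Berlin")]:
    vec[cap] = [c + o for c, o in zip(country[name], capital_offset)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# vector('Paris') - vector('France') + vector('Italy')
query = [p - f + i for p, f, i in zip(vec["Paris"], vec["France"], vec["Italy"])]
nearest = max((w for w in vec if w not in ("Paris", "France", "Italy")),
              key=lambda w: cosine(vec[w], query))
print(nearest)  # Rome
```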

Charles Bukowski, William Burroughs, and the Computer – How two different writers handled onrushing technology, including Bukowski’s poem on the “16 bit Intel 8088 Chip”. This led me down a Wikipedia rabbit hole, since I’d always assumed the 8088 was fully 8-bit, but the truth proved a lot more interesting.

Sane data updates are harder than you think – Tales from the trenches in data management. Adrian Holovaty’s series on crawling and updating data is the first time I’ve seen a lot of techniques that are common amongst practitioners actually laid out clearly.

Randomness != Uniqueness – Creating good identifiers for data is hard, especially once you’re in a distributed environment. There are also tradeoffs in creating IDs that are decomposable: that makes internal debugging and management much easier, but it also reveals information to external folks, often more than you might expect.
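A decomposable ID in the Snowflake style makes the tradeoff concrete; here’s a toy sketch (the field layout of timestamp, machine id, and sequence counter is illustrative, not any particular system’s):

```python
import threading
import time

# Hypothetical layout: 41-bit millisecond timestamp | 10-bit machine id
# | 12-bit sequence counter. Anyone holding the ID can recover when and
# where it was minted -- handy for internal debugging, but exactly the
# kind of information leak the post warns about.
MACHINE_ID = 7
_counter = 0
_lock = threading.Lock()

def next_id():
    global _counter
    with _lock:
        _counter = (_counter + 1) & 0xFFF                 # wrap at 12 bits
        millis = int(time.time() * 1000) & ((1 << 41) - 1)
        return (millis << 22) | (MACHINE_ID << 12) | _counter

def decompose(id_):
    """Recover the fields -- the 'decomposable' property in action."""
    return {"millis": id_ >> 22,
            "machine": (id_ >> 12) & 0x3FF,
            "seq": id_ & 0xFFF}

print(decompose(next_id()))
```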

Lessons from a year’s worth of hiring data – A useful antidote to the folklore prevalent in recruiting, this small study throws up a lot of intriguing possibilities. The correlation between spelling and grammar mistakes and candidates who didn’t make it through the interview process was especially interesting, considering how retailers like Zappos use Mechanical Turk to fix stylistic errors in user reviews to boost sales.

Five short links

marblefive

Photo by Adolfo Chavez

TI Calculator Simulator – A geeky labor of love, simulating one of the seminal 1970s calculators. I’m awestruck that a machine with only 320 words of ROM and a CPU that’s minimal beyond belief could be so incredibly useful.

The Mighty Dictionary – A detailed explanation of how the Python dictionary works under the hood. The hash-attack freakouts are a good reminder of how useful understanding even the most black-box parts of your stack can be.
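The hash-attack worry is easy to demonstrate in miniature: a dict keeps working when every key collides, but only by falling back to equality comparisons, so lookups degrade toward a linear scan (a toy sketch with a deliberately bad hash):

```python
class BadHash:
    """Every instance hashes to the same bucket, on purpose."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42  # deliberate: all instances collide
    def __eq__(self, other):
        return isinstance(other, BadHash) and self.n == other.n

# The dict still works -- collisions are resolved with __eq__ -- but each
# lookup now scans through the colliding entries, so operations that
# should be O(1) degrade toward O(n). Hash-flooding attacks manufacture
# exactly this situation with adversarially chosen string keys.
d = {BadHash(i): i for i in range(100)}
print(len(d), d[BadHash(5)])  # 100 5
```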

A Gauss machine gun – Sometimes I wish I was a physical engineer.

The Boolean Trap – “Code is written once, but read (and debugged and changed) many times” – words to live by.

Frak – Impressive abuse of the regular expression engine to match arbitrary sets of strings. It’s evil, but I’ve resorted to similar hacks myself to get good text-matching performance out of scripting languages.

Five short links

tally

Photo by ahojohnebrause

Telepath logger – An open source keylogger, with a lot of fun extras like webcam snapshots, light-level, and music-monitoring.

A failure of authorship and peer review – An inadvertent #overlyhonestmethods moment in a chemistry paper.

Datahacker – Some beautiful experiments with 3d maps in the browser.

Black holes, lifeloggers, and space brain – Adam Becker’s been doing some great stories for New Scientist, one of my favorite mags, and it’s well worth following his blog too.

Comments on the Mill CPU – I’m a hopeless romantic when it comes to off-beat CPU architectures (I still pine for the clockless ARM Amulet, despite knowing about the problems that have emerged over the decades), so I was excited to read the original announcement of the Mill approach. It’s good to see some in-depth analysis from an expert, explaining a bit more of the history and the challenges the Mill architecture faces, especially from dog-slow memory.