Five short links

Photo by ahojohnebrause

Max Headroom and the strange world of pseudo-CGI – I've always been fascinated by cargo cult analog tributes to technology. Maybe my early exposure to Max gave me the bug?

Reidentification as basic science – Arvind does a fantastic job of explaining why the research he does is so important. I love learning more about people from data, and most of the interesting insights come from interrogating it in unusual ways and finding unexpected connections, which is what his work is all about.

A 21cm radio telescope for the cost-conscious – Beautiful geekery. Who doesn't want to map the Milky Way's radio emissions using nothing more sophisticated than a $20 USB TV dongle?

How Google Code Search worked – An eminently-practical guide to implementing a regular-expression search engine, from the author of the late-lamented Google Code Search. It even comes with working source code!

3D lightning – Calculating the three-dimensional path of a lightning bolt from two simultaneous pictures taken from different spots.
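
To make the idea concrete, here's a minimal sketch of one common way to triangulate a point from two views. It is not necessarily the method the linked article uses. It assumes you already know each camera's position and the direction of the ray through the matching pixel in each photo; the function name and inputs are hypothetical. Because two measured rays rarely intersect exactly, it returns the midpoint of their closest approach.

```python
def triangulate_from_two_rays(origin1, direction1, origin2, direction2):
    """Estimate a 3D point from two camera rays.

    Each ray is a camera position (origin) plus a direction vector.
    Returns the midpoint of the shortest segment joining the two rays.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    w = [p - q for p, q in zip(origin1, origin2)]
    a = dot(direction1, direction1)
    b = dot(direction1, direction2)
    c = dot(direction2, direction2)
    d = dot(direction1, w)
    e = dot(direction2, w)

    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel, so there is no unique closest point")

    # Distances along each ray to the points of closest approach.
    t1 = (b * e - c * d) / denom
    t2 = (a * e - b * d) / denom

    closest1 = [p + t1 * x for p, x in zip(origin1, direction1)]
    closest2 = [p + t2 * x for p, x in zip(origin2, direction2)]
    return [(x + y) / 2 for x, y in zip(closest1, closest2)]
```

Repeating this for a series of matched points along the bolt would give a rough three-dimensional path.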

Five short links

Photo by Tony Preece

CLAVIN – A very promising open source geotagging project that analyzes unstructured text and identifies geographic entities. It has some very neat tricks up its sleeve to disambiguate common names like 'Springfield' based on the context.

The Sokal Hoax: At whom are we laughing? – Post-modernism makes an easy target for hard scientists, but this is a good reminder that some of the giants of physics made even more meaningless pronouncements about fields they knew nothing about.

Name-cleaver – A scrumptious little project from Sunlight Labs that handles a lot of the messy data cleanup work around people and organization names.

altmetrics: a manifesto – On the topic of scientists being silly, the way we measure academic output is antiquated beyond belief, so it was great to see this from my friend Cameron Neylon. We can do way better than citations.

Improving the security of your SSH private key files – This is what happens when hackers (in the old-school sense) get interested in a topic. Martin's curiosity about how SSH works led him to uncover some sub-par default settings that make a passphrase on your keys a lot less effective than you might think. I didn't know about those particular problems, but I've always followed my Apple and kept my keys on an encrypted DMG.

Five short links

Photo by Eldeeem

The Cartography of Bullshit – A righteous rant against a piece of pop-sociology, digging into just how flimsy the underlying statistics are. It hits home because numbers I've mined have ended up in similar columns – a White Power group even used some of my research to 'prove' Mexicans were conquering Texas based on the numbers of Juans versus Johns! Take all studies on controversial subjects like race with a massive pinch of salt.

Welcome, recent graduates – Advice I wish I'd had when I looked for my first post-college job.

Sublime DataConverter – We've ended up using CSV for lists of objects where the property names remain constant and JSON for messier data structures and as a programming model post-transport. We've homebrewed a limited set of routines to automatically scan headers or walk all objects and extract all possible properties so we can automatically convert between the two representations, but this project is a much more general approach to the same problem.
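
As a rough illustration of the "walk all objects and extract all possible properties" approach described above, here's a minimal Python sketch. It is not our actual routines or the linked project's code; the function names are hypothetical, and it only handles flat objects (no nested structures).

```python
import csv
import io


def collect_property_names(objects):
    """Walk every object and return the union of keys, in first-seen order."""
    names = []
    seen = set()
    for obj in objects:
        for key in obj:
            if key not in seen:
                seen.add(key)
                names.append(key)
    return names


def objects_to_csv(objects):
    """Convert a list of flat dicts to CSV text, leaving missing values empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=collect_property_names(objects),
        restval="",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(objects)
    return buffer.getvalue()


def csv_to_objects(text):
    """Scan the header row and turn each CSV row into a dict of strings."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]
```

Note that the round trip is lossy: every value comes back as a string, and a missing property comes back as an empty string rather than being absent.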

The Split-Apply-Combine Strategy for Data Analysis – A technical but enlightening read from Hadley Wickham, covering ways of applying the same algorithms across many different representations of data.
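
For anyone who hasn't met the pattern the title names, here's a minimal sketch of it in plain Python. It isn't taken from the paper; the helper name and the sample rows are made up for illustration.

```python
from collections import defaultdict


def split_apply_combine(rows, key, apply):
    """Split rows into groups by key, apply a function to each group,
    and combine the results into a dict keyed by group."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)  # split
    # apply to each group, then combine into one result
    return {group: apply(members) for group, members in groups.items()}


def mean_value(members):
    """Average the hypothetical 'value' field across a group of rows."""
    return sum(row["value"] for row in members) / len(members)


rows = [
    {"city": "SF", "value": 2},
    {"city": "SF", "value": 4},
    {"city": "LA", "value": 10},
]
averages = split_apply_combine(rows, key=lambda row: row["city"], apply=mean_value)
```

The same three steps work whether the data lives in a list of dicts, a table, or an array; only the split and combine mechanics change.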

Nightmare after nightmare: Students trying to replicate work – Remember what I said about taking studies with a pinch of salt? Even with help from the original authors, PhD students had incredible trouble reproducing the results of published papers. This isn't just a problem for social science; all science is a messy business and we need to keep our skepticism intact. That isn't a free pass to ignore evolution and climate change though!

No more heatmaps that are just population maps!

I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes, and a couple of major new features, and you can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high resolution (sub-km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places actually have an unusually high occurrence of X, rather than just having more people.
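
Here's a minimal sketch of the kind of normalization I mean, assuming you've already looked up a population figure for each place. The function name, the place names, and the numbers are all hypothetical, and the lookup itself (for example via coordinates2statistics) is not shown.

```python
def per_capita_rates(event_counts, populations, per=1000, min_population=1):
    """Convert raw counts per place into rates per `per` residents.

    Places with no known population, or fewer than `min_population`
    residents, are skipped to avoid dividing by zero or producing
    wildly noisy rates.
    """
    rates = {}
    for place, count in event_counts.items():
        population = populations.get(place, 0)
        if population < min_population:
            continue
        rates[place] = per * count / population
    return rates


# Made-up numbers: the city has ten times the raw count of the town,
# but the town has ten times the rate once population is accounted for.
counts = {"city": 500, "town": 50, "desert": 1}
people = {"city": 1_000_000, "town": 10_000, "desert": 0}
rates = per_capita_rates(counts, people)
```

A heatmap of `counts` would just light up the city; a heatmap of `rates` shows where X is unusually common.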

I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.

text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.

I've expanded language support, with a new Ruby gem that you can get via 'gem install dstk' (which includes unit testing), and an R Package adding the two new APIs to Ryan Elmore's original, available as RDSTK. The Python and JavaScript clients have been updated to the latest APIs too.

There's also an official .ova version for people using VMware, up at http://static.datasciencetoolkit.org/dstk_0.50.ova

What's still to be done?

The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.

The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.

Unit testing has shown that text2sentences isn't working at all!

Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.

pete@jetpac.com

Five short links

Photo by Curtis Perry

The Declassification Engine – "Saving history from official secrecy". A fascinating concept that shows how the firehose of cheap distributed computing power fundamentally changes what privacy and secrecy mean. We can probably reconstruct a lot of information that people think they've hidden in these documents, but what are the rules?

A 63-bit floating point type for 64-bit OCaml – I've never used the language, but I adore the bit-fiddling that goes into floating-point representations, and this is a lovely hack on top of them.

Local geocoder – A lovely minimal reverse geocoder that's self-contained, including data. I've been excited to see a blossoming of open geocoding solutions: Nominatim has improved in leaps and bounds, PostGIS now has some strong capabilities, and I've been having fun with the Data Science Toolkit of course!

How to say nothing in 500 words – Ancient advice about writing that's still useful. "Call a fool a fool"!

Olympians Festival – I've been getting a lot out of the local TheaterPub nights in San Francisco, so I'm excited to make it to this twelve-night festival with a whopping 36 new plays in November! I'm also a sucker for the Greek myths, ever since I grew up with Tony 'Blackadder' Robinson's retelling of the Iliad as a kid.

Five short links

Photo by Grant Hutchinson

Assuming everybody else sucked – If an industry is behaving in an apparently irrational way, try to figure out the internal logic that's driving that behavior. You'll be much more effective at breaking the rules if you understand what they are first.

Storing and publishing sensor data – Now that we're scattering sensors around like confetti, we're generating ever-growing mounds of time-series data, so here's a good overview of where you can shove it.

100,000 Stars – This WebGL exploration of the universe is so good I feel like it should have been plastered all over the internet already, but maybe I've been living under a rock?

Mapping the product manifold – I started off in image processing, carried what I'd learned to unstructured text, and now I'm fascinated to see techniques flowing back the other way. We're going to be doing crazily effective recognition of images, language, and every other kind of noisy signal within a few years.

What happened to the crypto dream? – A clear-eyed examination of where the crypto dream of the '90s ended up – "the demand for technologies that will upset that power balance is quite low".

We’re all starting to track ourselves

We’re releasing a massive and growing amount of information about who we are, where we go, and when. There are hundreds of millions of public checkins already out there, and millions more are being created every day. People think of Foursquare as the leading source, but actually Instagram, Facebook, Twitter, Flickr, and Google Plus all produce incredible numbers of geo-located checkins, some of them many, many more than Foursquare.

This is going to cause big changes in our world. We’ve already taught our computers what we buy and read, now we’re telling them where we spend our real-world lives. Just our presence at a location at a particular time becomes powerful data when it’s combined with all the other people doing the same thing. We’re instrumenting our movements at a very detailed level, and sending them out into the ether. Even more amazingly, we’re adding high-resolution photos and detailed comments to the checkins.

It’s hard to overstate how effective this data can be at solving intractable problems. Economists, sociologists, and epidemiologists would kill to have detailed pictures of the lives we lead at this kind of scale. There will be applications we haven’t even thought of too, connecting us with people we should be talking to, introducing us to new experiences, all sorts of feedback that will change how we live.

It’s a scary new world to contemplate too of course, which is why I keep blogging about what I’m up to. Recently I’ve been working with my team at Jetpac analyzing billions of photos from all sorts of social sources, to help both tourists and locals figure out where to go and what to do. I want to share an internal tool we use to explore the data, a map interface to the checkins that people have shared publicly. If you want to get a concrete feel for how our world’s changing, check it out:

https://www.jetpac.com/map

It’s still an experimental tool so apologies for any bugs, but I hope you find this glimpse of the mountain of public data we’re all creating as fascinating as I do! You can find all of the individual photos and other checkins out there on the public web, but seeing them accumulated together in one place still blows my mind.

Five short links

Photo by Neil Platform1

GeoURI – I have no earthly use for these, but I love that they exist, and are even an IETF standard!
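
For the curious, a geo URI looks like geo:37.786971,-122.399677;u=35, with latitude, longitude, an optional altitude, and optional parameters such as the uncertainty in meters. Here's a simplified parsing sketch; the function name is hypothetical, and it skips the stricter validation the standard calls for (no range checks, no percent-decoding).

```python
def parse_geo_uri(uri):
    """Parse a geo URI into its coordinates and common parameters.

    Simplified sketch: handles 'geo:lat,lon[,alt][;crs=...][;u=...]'
    without full standards-level validation.
    """
    scheme, separator, rest = uri.partition(":")
    if separator != ":" or scheme.lower() != "geo":
        raise ValueError("not a geo URI: %r" % uri)

    coords_part, *param_parts = rest.split(";")
    coords = [float(value) for value in coords_part.split(",")]
    if len(coords) not in (2, 3):
        raise ValueError("expected latitude,longitude[,altitude]")

    params = {}
    for part in param_parts:
        name, _, value = part.partition("=")
        params[name.lower()] = value

    return {
        "latitude": coords[0],
        "longitude": coords[1],
        "altitude": coords[2] if len(coords) == 3 else None,
        "uncertainty": float(params["u"]) if "u" in params else None,
        "crs": params.get("crs", "wgs84"),  # WGS 84 is the default when omitted
    }
```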

Nathaniel Bowditch – He created the American Practical Navigator over two hundred years ago. He improved the data quality of previous works and made the results widely available in a form non-specialists could easily understand. That approach transformed navigation then, and it's still incredibly effective today across all sorts of fields.

Digital Elevation Data – On that topic, Jonathan de Ferranti has spent years painstakingly correcting open-source geographic data about the height of the earth's surface, and then releasing the results openly to anyone who needs them. It may be hard for non-geo folks to understand how tough a problem this is, and how hard he's worked on it, so here are some example renders and an independent review.

Sentiment Analysis Corpora – A fantastic summary and comparison of the raw data sets you need to build sentiment analysis algorithms.

A Major Breakthrough in Image Processing – It's time to retire Lena!