Five short links

Fivelocks
Photo by Tony Preece

CLAVIN – A very promising open source geotagging project that analyzes unstructured text and identifies geographic entities. It has some very neat tricks up its sleeve to disambiguate common names like 'Springfield' based on the context.

The Sokal Hoax: At whom are we laughing? – Post-modernism makes an easy target for hard scientists, but this is a good reminder that some of the giants of physics made even more meaningless pronouncements about fields they knew nothing about.

Name-cleaver – A scrumptious little project from Sunlight Labs that handles a lot of the messy data cleanup work around people and organization names.

altmetrics: a manifesto – On the topic of scientists being silly, the way we measure academic output is antiquated beyond belief, so it was great to see this from my friend Cameron Neylon. We can do way better than citations.

Improving the security of your SSH private key files – This is what happens when hackers (in the old-school sense) get interested in a topic. Martin's curiosity about how SSH works led him to uncover some sub-par default settings that make a passphrase on your keys a lot less effective than you might think. I didn't know about those particular problems, but I've always followed my Apple and kept my keys on an encrypted DMG.

Five short links

Fivestar
Photo by Eldeeem

The Cartography of Bullshit – A righteous rant against a piece of pop-sociology digging into just how flimsy the underlying statistics are. It hits home because numbers I've mined have ended up in similar columns – a White Power group even used some of my research to 'prove' Mexicans were conquering Texas based on the numbers of Juans versus Johns! Take all studies on controversial subjects like race with a massive pinch of salt.

Welcome, recent graduates – Advice I wished I'd had when I looked for my first post-college job. 

Sublime DataConverter – We've ended up using CSV for lists of objects where the property names remain constant and JSON for messier data structures and as a programming model post-transport. We've homebrewed a limited set of routines to automatically scan headers or walk all objects and extract all possible properties so we can automatically convert between the two representations, but this project is a much more general approach to the same problem.
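To make the "walk all objects and extract all possible properties" idea concrete, here's a minimal sketch in Python. It isn't our homebrewed code or the Sublime DataConverter implementation; the function names are made up for illustration, and it assumes flat objects with no nesting.

```python
import csv
import io
import json


def collect_fieldnames(records):
    """Walk every object and collect the union of property names, in first-seen order."""
    fieldnames = []
    seen = set()
    for record in records:
        for key in record:
            if key not in seen:
                seen.add(key)
                fieldnames.append(key)
    return fieldnames


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text (hypothetical helper)."""
    records = json.loads(json_text)
    fieldnames = collect_fieldnames(records)
    buffer = io.StringIO()
    # restval fills in properties that a particular object doesn't have.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def csv_to_json(csv_text):
    """Scan the header row and turn each CSV row into a JSON object (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([dict(row) for row in reader])
```

The round trip is lossy: every value comes back from CSV as a string, and a missing property becomes an empty string, which is part of why a more general tool is attractive.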

The Split-Apply-Combine Strategy for Data Analysis – A technical but enlightening read from Hadley Wickham, covering ways of applying the same algorithms across many different representations of data.

Nightmare after nightmare: Students trying to replicate work – Remember what I said about taking studies with a pinch of salt? Even with help from the original authors, PhD students had incredible trouble reproducing the results of published papers. This isn't just a problem for social science; all science is a messy business and we need to keep our skepticism intact. That isn't a free pass to ignore evolution and climate change though!

No more heatmaps that are just population maps!

I'm pleased to announce that there's a brand new 0.50 version of the DSTK out! It has a lot of bug fixes, and a couple of major new features, and you can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high resolution (sub km-squared) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places actually have an unusually high occurrence of X, rather than just having more people.
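As a rough sketch of what that normalization looks like, here's a small Python example. It doesn't call the API at all: the place names and numbers are invented, and in practice the population figures would come from something like coordinates2statistics. The function name is hypothetical too.

```python
def occurrences_per_thousand(counts, populations):
    """Return occurrences per 1,000 residents for each place with a known population."""
    rates = {}
    for place, count in counts.items():
        population = populations.get(place, 0)
        if population > 0:
            rates[place] = 1000.0 * count / population
    return rates


# Hypothetical numbers, purely for illustration.
counts = {"Big City": 500, "Small Town": 50}
populations = {"Big City": 1000000, "Small Town": 10000}

rates = occurrences_per_thousand(counts, populations)
# Rank by rate rather than raw count, so the small town comes out on top.
ranked = sorted(rates, key=rates.get, reverse=True)
```

On raw counts the big city wins by a factor of ten, which is exactly the population map in disguise; per resident, the small town has ten times the rate.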

I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.

text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.

I've expanded language support, with a new Ruby gem that you can get via 'gem install dstk' (which includes unit testing), and an R Package adding the two new APIs to Ryan Elmore's original, available as RDSTK. The Python and Javascript clients have been updated to the latest APIs too.

There's also an official .ova version for people using VMware, up at http://static.datasciencetoolkit.org/dstk_0.50.ova

What's still to be done?

The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.

The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.

Unit testing has shown that text2sentences isn't working at all!

Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.

pete@jetpac.com

Five short links

Stationfive
Photo by Curtis Perry

The Declassification Engine – "Saving history from official secrecy". A fascinating concept that shows how the firehose of cheap distributed computing power fundamentally changes what privacy and secrecy mean. We can probably reconstruct a lot of information that people think they've hidden in these documents, but what are the rules?

A 63-bit floating point type for 64-bit OCaml – I've never used the language, but I adore the bit-fiddling that goes into floating-point representations, and this is a lovely hack on top of them.

Local geocoder – A lovely minimal reverse geocoder that's self-contained, including data. I've been excited to see a blossoming of open geocoding solutions: Nominatim has improved in leaps and bounds, PostGIS now has some strong capabilities, and I've been having fun with the Data Science Toolkit of course!

How to say nothing in 500 words – Ancient advice about writing that's still useful. "Call a fool a fool"!

Olympians Festival – I've been getting a lot out of the local TheaterPub nights in San Francisco, so I'm excited to make it to this twelve-night festival with a whopping 36 new plays in November! I'm also a sucker for the Greek myths, ever since I grew up with Tony 'Blackadder' Robinson's retelling of the Iliad as a kid.

Five short links

Fivetype
Photo by Grant Hutchinson

Assuming everybody else sucked – If an industry is behaving in an apparently irrational way, try to figure out the internal logic that's driving that behavior. You'll be much more effective at breaking the rules if you understand what they are first.

Storing and publishing sensor data – Now we're scattering sensors around like confetti, we're generating ever-growing mounds of time-series data, so here's a good overview of where you can shove it.

100,000 Stars – This WebGL exploration of the universe is so good I feel like it should have been plastered all over the internet already, but maybe I've been living under a rock?

Mapping the product manifold – I started off in image processing, carried what I'd learned to unstructured text, and now I'm fascinated to see techniques flowing back the other way. We're going to be doing crazily effective recognition of images, language, and every other kind of noisy signal within a few years.

What happened to the crypto dream? – A clear-eyed examination of where the crypto dream of the 90's ended up – "the demand for technologies that will upset that power balance is quite low".

We’re all starting to track ourselves

Mapscreenshot

We’re releasing a massive and growing amount of information about who we are, where we go, and when. There are hundreds of millions of public checkins already out there, and millions more are being created every day. People think of Foursquare as the leading source, but actually Instagram, Facebook, Twitter, Flickr, and Google Plus all produce incredible numbers of geo-located checkins, some of them many, many more than Foursquare.

This is going to cause big changes in our world. We’ve already taught our computers what we buy and read, now we’re telling them where we spend our real-world lives. Just our presence at a location at a particular time becomes powerful data when it’s combined with all the other people doing the same thing. We’re instrumenting our movements at a very detailed level, and sending them out into the ether. Even more amazingly, we’re adding high-resolution photos and detailed comments to the checkins.

It’s hard to overstate how effective this data can be at solving intractable problems. Economists, sociologists, and epidemiologists would kill to have detailed pictures of the lives we lead at this kind of scale. There will be applications we haven’t even thought of too, connecting us with people we should be talking to, introducing us to new experiences, all sorts of feedback that will change how we live.

It’s a scary new world to contemplate too of course, which is why I keep blogging about what I’m up to. Recently I’ve been working with my team at Jetpac analyzing billions of photos from all sorts of social sources, to help both tourists and locals figure out where to go and what to do. I want to share an internal tool we use to explore the data, a map interface to the checkins that people have shared publicly. If you want to get a concrete feel for how our world’s changing, check it out:

https://www.jetpac.com/map

It’s still an experimental tool so apologies for any bugs, but I hope you find this glimpse of the mountain of public data we’re all creating as fascinating as I do! You can find all of the individual photos and other checkins out there on the public web, but seeing them accumulated together in one place still blows my mind.

Five short links

Starknot
Photo by Neil Platform1

GeoURI – I have no earthly use for these, but I love that they exist, and are even an IETF standard!

Nathaniel Bowditch – He created the American Practical Navigator over two hundred years ago. He improved the data quality of previous works and made the results widely available in a form non-specialists could easily understand. That approach transformed navigation then, and it's still incredibly effective today across all sorts of fields.

Digital Elevation Data – On that topic, Jonathan de Ferranti has spent years painstakingly correcting open-source geographic data about the height of the earth's surface, and then releasing the results openly to anyone who needs them. It may be hard for non-geo folks to understand how tough a problem this is, and how hard he's worked on it, so here are some example renders and an independent review.

Sentiment Analysis Corpora – A fantastic summary and comparison of the raw data sets you need to build sentiment analysis algorithms.

A Major Breakthrough in Image Processing – It's time to retire Lena!

Open Sentiment Analysis

Smileyfingers
Photo by Courtney Carmody

Sentiment analysis is fiendishly hard to solve well, but easy to solve to a first approximation. I've been frustrated that there have been no easy free libraries that make the technology available to non-specialists like me. The problem isn't with the code; there are some amazing libraries like NLTK out there, but everyone guards their training sets of word weights jealously. I was pleased to discover that SentiWordNet is now CC-BY-SA, but even better I found that Finn Årup Nielsen has made a drop-dead simple list of words available under an Open Database License!

With that in hand, I added some basic tokenizing code and was able to implement a new text2sentiment API endpoint for the Data Science Toolkit:

http://www.datasciencetoolkit.org/developerdocs#text2sentiment

Give it a try, it's as simple as a curl call from the terminal:

curl -d "I hate this hotel" "http://www.datasciencetoolkit.org/text2sentiment"

{"score": -3.0}

I've been having a blast with it, simple-minded as it is, so I hope you do too!
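For anyone curious how far a word list plus basic tokenizing gets you, here's a minimal sketch in Python. It is not the actual text2sentiment code: the tiny word list is hypothetical (a real list has thousands of scored entries), and averaging the scores of the matched words is an assumption about how to combine them.

```python
import re

# Hypothetical miniature word list, for illustration only.
WORD_SCORES = {
    "hate": -3.0,
    "terrible": -3.0,
    "love": 3.0,
    "great": 3.0,
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def text_to_sentiment(text, word_scores=WORD_SCORES):
    """Average the scores of the words that appear in the list (an assumed rule)."""
    scores = [word_scores[token] for token in tokenize(text) if token in word_scores]
    if not scores:
        return {"score": 0.0}
    return {"score": sum(scores) / len(scores)}
```

Something this simple has no idea about negation or sarcasm ("not great" scores as positive), which is the gap between a first approximation and solving the problem well.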

Five short links

Earthlight

A Global Poverty Map Derived from Satellite Data – This is an old paper from 2006, but I love the idea of using how much light a neighborhood sends into the night sky to measure how wealthy it is. Richness is highly correlated with wastefulness, apparently.

Open Multi-lingual WordNets – We’re mapping our inner worlds too, these open data sets are incredibly useful information on word meanings for anyone working with computers and human languages.

The Invisible City – A fake Canadian city briefly appeared on OpenStreetMap, complete with an elaborate public transport network. Or was it briefly a real place blinking in and out of existence, with only a lone volunteer mapper spotting it?

The Dark Side of Social Capital – We usually think of community as a good thing, but anybody who grew up in a small town can tell you that the power can be used to exclude outsiders too.

K2C 1N5 – Ervin Ruci is being hounded by the Canadian Postal Service for the crime of making a crowdsourced database of postal codes freely available, and now they’ve decided they own the copyright to the words “postal code” too!

Five short links

Fiveoclock
Photo by Tasty Goodness

Yoyodyne – How a fictional company was born in the novels of Thomas Pynchon, was adopted by Buckaroo Banzai and Star Trek, and ended up in the GPL.

What will be left of our cities? – The nitty-gritty details of what will happen to our concrete, brick, and steel long after we're dead and gone.

On glitch art, and the fascinating mistakes computers make – I was a terrible VJ with footage, but I had so much fun with live feeds and static. Don't believe technology's mask of perfection, engineers know what a rats' nest every product is under the hood.

Is MS Office the quiet villain of global finance? – Our kids will look back on the last couple of decades as a time when we fell under the spell of cold hard numbers, without really looking at how they were produced.

Search history and accidental class warfare – A variant of the echo chamber effect, and an example of the law of unintended consequences. Recommendation algorithms are becoming our century's version of press barons.