No more heatmaps that are just population maps!

I'm pleased to announce that a brand new 0.50 version of the DSTK is out! It has a lot of bug fixes and a couple of major new features. You can get it on Amazon's EC2 as ami-7b9df412, download the Vagrant box from http://static.datasciencetoolkit.org/dstk_0.50.box, or grab it as a BitTorrent stream from http://static.datasciencetoolkit.org/dstk_0.50.torrent

What are the new features?

The biggest is the integration of high-resolution (finer than one square kilometer) geostatistics for the entire globe. You can get population density, elevation, weather and more using the new coordinates2statistics API call. Why is this important? No more heatmaps that are just population maps, for the love of god! I'm using this extensively to normalize my data analysis so that I can tell which places really have an unusually high occurrence of X, rather than just having more people.
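To make the normalization idea concrete, here's a minimal sketch in Python. It assumes you've already looked up a population density for each place (for example with coordinates2statistics); the function name, the input layout and the sample numbers are all hypothetical, made up for illustration rather than taken from the DSTK.

```python
def normalize_by_population(counts, densities):
    """Return occurrences per unit of population density for each place.

    counts:    dict of place name -> raw number of occurrences of X
    densities: dict of place name -> population density for that place
               (assumed to come from a lookup such as coordinates2statistics)
    Places with a missing or zero density are skipped, since no rate
    can be computed for them.
    """
    rates = {}
    for place, count in counts.items():
        density = densities.get(place)
        if not density:
            continue
        rates[place] = count / density
    return rates


# Hypothetical numbers, for illustration only.
counts = {"city": 900, "small_town": 30, "desert": 2}
densities = {"city": 9000.0, "small_town": 100.0, "desert": 0.0}

rates = normalize_by_population(counts, densities)
# Rank places by normalized rate instead of by raw count.
ranked = sorted(rates, key=rates.get, reverse=True)
```

On raw counts the city dominates, but once you divide by density the small town comes out on top, which is exactly the signal a plain heatmap hides.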

I've also added the text2sentiment method, which has been a big help as I've been categorizing positive and negative comments.
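As a rough sketch of that kind of categorization, the Python below splits comments into buckets once each one has a numeric sentiment score. It assumes a score above zero means positive and below zero means negative; the function name and the sample scores are hypothetical and are not real text2sentiment output.

```python
def split_by_sentiment(scored_comments):
    """Group (comment, score) pairs into positive, negative and neutral lists.

    Assumption: score > 0 is positive, score < 0 is negative, 0 is neutral.
    """
    buckets = {"positive": [], "negative": [], "neutral": []}
    for comment, score in scored_comments:
        if score > 0:
            buckets["positive"].append(comment)
        elif score < 0:
            buckets["negative"].append(comment)
        else:
            buckets["neutral"].append(comment)
    return buckets


# Hypothetical scores, for illustration only.
scored = [
    ("I love this place", 3.0),
    ("Worst meal ever", -2.5),
    ("It was a restaurant", 0.0),
]
buckets = split_by_sentiment(scored)
```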

text2people now incorporates information from the US Census on which ethnic groups are most likely to have a particular surname, to help you do a rough-and-ready ethnic makeup analysis of a list of names.
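One simple way to turn per-name likelihoods into a rough makeup for a whole list is to average them, as in the Python sketch below. The input layout, the function name and the placeholder group labels are assumptions for illustration; this is not the actual text2people response format, and the numbers are not Census data.

```python
from collections import defaultdict


def estimate_makeup(per_name_likelihoods):
    """Average per-name group likelihoods into one rough overall makeup.

    per_name_likelihoods: list of dicts, one per name, each mapping a
    group label to the likelihood (0.0 to 1.0) for that surname.
    """
    if not per_name_likelihoods:
        return {}
    totals = defaultdict(float)
    for likelihoods in per_name_likelihoods:
        for group, share in likelihoods.items():
            totals[group] += share
    count = len(per_name_likelihoods)
    return {group: total / count for group, total in totals.items()}


# Placeholder values, for illustration only.
names = [
    {"group_a": 0.8, "group_b": 0.2},
    {"group_a": 0.4, "group_b": 0.6},
]
makeup = estimate_makeup(names)
```

Averaging treats every name equally and says nothing certain about any one person, so it's only suitable for the rough-and-ready aggregate view described above.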

I've expanded language support, with a new Ruby gem (including unit tests) that you can get via 'gem install dstk', and an R package, available as RDSTK, that adds the two new APIs to Ryan Elmore's original. The Python and JavaScript clients have been updated to the latest APIs too.

There's also an official .ova version for people using VMware, up at http://static.datasciencetoolkit.org/dstk_0.50.ova

What's still to be done?

The size has ballooned, from about 5GB to nearly 20GB! Most of this is the elevation and other global data, so I'm considering making these optional in the future if that's a problem for a lot of people.

The new surname analysis in text2people has a very high latency on the first request (tens of seconds), which isn't acceptable, so I'll be figuring out a fix for that.

Unit testing has shown that text2sentences isn't working at all!

Thanks to everyone who's contributed to the project so far, both coders and the many good folks who make data openly available! It's exciting to help democratize these tools, and I'm looking forward to hearing feedback on how to keep improving that process.

pete@jetpac.com

2 responses

  1. Thanks for the work you put into this! Besides the toolkit, even this blog post turns out to be highly relevant for me. I wasn’t aware that there is a world wide population density data set. Will have to find out more about that.

  2. I’m new to geomaps and spatial statistics in general. Unfortunately, I don’t get the joke about heat maps and population maps. Could you elaborate?
