Five short links

Lichenstarfish
Photo by Phillip Hay

Kartograph – An open-source web component for rendering beautiful interactive maps using SVG. Fantastic work by Gregor Aisch.

Hard science, soft science, hardware, software – I have a blog crush on John D. Cook's site; it's full of thought-provoking articles like this one. As someone who's learned a lot from the humanities, I think he gets the distinction between the sciences exactly right. Disciplines that don't have elegant theoretical frameworks and clear-cut analytical tools for answering questions do take a lot more work to arrive at usable truths.

Don't fear the web – A good overview of moral panics on the internet, and how we should react to the dangers of new technology.

Using regression isolation to decimate bug time-to-fix – Once you're dealing with massive, interdependent software systems, there's a whole different world of problems. This takes me back to my days of working with multi-million-line code bases, where automating testing and bug reporting becomes essential.
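
One common way to automate this kind of isolation is a binary search over the ordered list of changes, in the spirit of a bisect. The sketch below is only an illustration under stated assumptions and not necessarily the approach the linked article takes: the name first_bad_change, the is_broken callback and the revision labels are all hypothetical, and it assumes that once the bug appears it stays in every later change.

```python
from typing import Callable, Sequence


def first_bad_change(changes: Sequence[str],
                     is_broken: Callable[[str], bool]) -> str:
    """Return the earliest change for which is_broken() is True.

    Assumptions (hypothetical setup): changes are ordered oldest to newest,
    the newest change is known to be broken, and the breakage persists in
    every change after the one that introduced it.
    """
    lo, hi = 0, len(changes) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_broken(changes[mid]):
            hi = mid          # culprit is at mid or earlier
        else:
            lo = mid + 1      # culprit is strictly after mid
    return changes[lo]


# Made-up revision labels; in practice is_broken would build and test a change.
changes = [f"r{n}" for n in range(100, 116)]
culprit = first_bad_change(changes, lambda c: int(c[1:]) >= 109)
print(culprit)  # r109
```

With sixteen candidate changes this needs four test runs rather than up to sixteen, which is where the time saving comes from on a large code base.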

Humanitarian OpenStreetMap Team – I knew OSM did wonderful work around the world, but I wasn't aware of HOT until now. It's great to see it all collected in one place.

Five short links

Fivelonglinks
Photo by Jody Morgan

Open Data Handbook Launched – I love what the Open Knowledge Foundation are doing with their manuals. Documentation is hard and unglamorous work, but it has an amazing impact. I'm looking forward to their upcoming title on data journalism.

My first poofer Workshop – This one's already gone, but I'm hoping there will be another soon. I can't think of a better way to spend an afternoon than learning to build your very own ornamental flamethrower.

Using photo networks to reveal your home town – Very few people understand how the sheer volume of data that we're producing makes it possible to make scarily accurate guesses from seemingly sparse fragments of information. A single piece viewed in isolation looks harmless, but pull enough together and the result becomes very revealing.

Introducing SenseiDB – Another intriguing open-source data project from LinkedIn. There's a strong focus on the bulk loading process, which in my experience is the hardest part to engineer. Reading the documentation leaves me wanting more information on their internal DataBus protocol; I bet that includes some interesting tricks.

IPUMS and NHGIS – As someone who recently spent far too long trying to match the BLS's proprietary codes for counties with the US Census's FIPS standard, I know how painful the process of making statistics usable can be. There's a world of difference between file dumps in obscure formats with incompatible time periods and units, and a clean set that you can perform calculations on. I was excited to discover the work being done at the University of Minnesota to create unified data sets that cover a long period of time, and much of the world.
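
One small, concrete piece of that matching pain: a county FIPS code is a five-character string, two digits for the state followed by three for the county, and the leading zeros are easily lost when a file is read as numbers. The sketch below only illustrates that padding step. The function names are hypothetical, and it says nothing about how the BLS's own codes map onto FIPS.

```python
def county_fips(state_code, county_code) -> str:
    """Build a 5-character county FIPS code: 2-digit state + 3-digit county."""
    return f"{int(state_code):02d}{int(county_code):03d}"


def normalize_fips(value) -> str:
    """Restore leading zeros on a combined county code that was read as a number."""
    return str(value).strip().zfill(5)


# San Francisco County, California: state 06, county 075.
sf_from_parts = county_fips(6, 75)
sf_from_number = normalize_fips(6075)
print(sf_from_parts, sf_from_number)  # 06075 06075
```

Keeping the codes as strings from the moment a file is loaded avoids the problem in the first place.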

Data scientists came out of the closet at Strata

Outofthecloset
Photo by Sarah Ackerman

Roger Magoulas asked me an interesting question during Strata – what was the biggest theme that emerged from this year's gathering? It took a bit of thought, but I realized that I was seeing a lot of people from all kinds of professions and organizations becoming conscious and open about their identity as data scientists.

The term itself has received a lot of criticism and there are always worries about 'big-data-washing', but what became clear from dozens of conversations was that it's describing something very real and innovative. The people I talked to came from professions as diverse as insurance actuaries, physicists, marketers, geologists, quants, biologists, web developers, and they were all excited about the same new tools and ways of thinking. Kaggle is concrete proof that the same machine-learning skills can be applied across a lot of different domains to produce better results than traditional approaches, and the same is being proved for all sorts of other techniques from NoSQL databases to Hadoop.

A year ago, your manager would probably roll her eyes if you were in a traditional sector and she caught you experimenting with the standard data science tools. These days, there's an awareness and acceptance that they have some true advantages over the old approaches, and so people have been able to make an official case for using them within their jobs. There's also been a massive amount of cross-fertilization, as it's become clear how transferable across domains the best practices are.

This year thousands of people across the world have realized they have problems and skills in common with others they would never have imagined talking to. It's been a real pleasure seeing so much knowledge being shared across boundaries, as people realize that 'data scientist' is a useful label for helping them connect with other people and resources that can help with their problems. We're starting to develop a community, and a surprising amount of the growth is from those who are announcing their professional identity as data scientists for the first time.