Five short links


Photo by Holeymoon

Sourcetree – I don't often recommend commercial software, mostly because my personal stack is largely open source these days, but I've fallen in love with this tool for exploring my git repositories. Git's new Mac app is fantastic too, but it's focused on 'doing things'. I've found Sourcetree a wonderful way to explore and understand code. I just discovered they've been acquired by Atlassian, so I guess I'm not the only fan!

The Sketchbook Project – I never lose my sense of wonder at how many ways the web can be used to drive creativity. By offering to scan people's sketchbooks they've motivated a community of artists from all over the world, and given me a vast set of material to browse through when my imagination needs a jump-start.

Smart Meter surveillance – How your electricity meter can reveal what TV channel you're watching. My German's not good enough to follow the main paper, but the abstract sounds very plausible. From my perspective not something to freak out over, but a good example of all the unexpected ways we leak information about our lives. We measure more and more things to improve efficiency, but the by-product is that the same data can be used for many unintended purposes too.

The Brown Revolution – An unfortunate name, but a compelling idea for sustainable grazing. I'm normally skeptical of agricultural 'silver bullets' like this, but I know from my experience maintaining trails how effective thoughtful drainage can be. When water's compressed into a narrow stream by a gully it will cut through even packed soil like a plasma torch, but keep it spread out in a wide sheet using a shallow 'rolling dip' and you'll have a surface that can survive years of storms.

Apple insiders remember Steve Jobs – I'm very sad we've lost Steve; he always seemed more like a superhero than a mortal to me. At the Guardian's request I contributed a few thoughts about my time working at Apple, and how he was a constant presence even though I barely met him. I'll be thinking of his family.

Five short links

Photo by Nick P

Is teaching MapReduce healthy? – The working conditions inside Hadoop are terrible, but it's rapidly becoming the default for large-scale data processing. Does that mean students should learn the MapReduce approach? It feels a lot like the debates over teaching ugly, confusing, widely-used C/C++ versus beautiful, elegant but niche functional languages, and which should come first in the curriculum. This article gave me some fantastic glimpses into the wider world of distributed frameworks and techniques, and left me itching to try Bloom.

Amazon comments on Spot price spikes – A refreshingly detailed and open response from a large company. I'm disappointed that the prices have suddenly become so volatile though, since that severely limits the places I can use them.

PhantomJS – A WebKit-based headless browser, lighter-weight than Selenium and driven by JavaScript. I hope I get a chance to use this, and that they can sever the vestigial dependency on X Windows soon. My main use case would be generating screenshots.

Teaching data to speak humanely – Looking at the Facebook timeline, and how old metaphors of interaction disappear as the interface gets closer to the content.

Pictures of the Big Bang – Gorgeous snapshots of our universe's very first moments, courtesy of a computer simulation.


How I saved $1,000 on my monthly EC2 costs


Photo by Paul Hohmann

If you’ve been a user of my Data Science Toolkit or OpenHeatMap sites, you may have noticed they’ve been a bit flaky recently. The back story is that I’ve started a new company (of which more soon) and I had to cut back on how much I was spending on my own servers. I was paying over $1,200 a month for all the different systems I’d set up over the last three years! This was mostly because it was quicker and easier to set up a new instance than to worry about jamming a project onto an existing one, and I never got around to cleaning things up. I doubt there are many people who’ve been as lazy about this as me, but if you’re looking at cutting costs, I would start by figuring out if there are servers you can merge.

By cutting out several sites that were no longer being updated or heavily used, I was able to cut that in half, but I then reached two sites that have decent traffic. I started by merging DSTK and OHM onto one large server, which mostly worked but caused some hiccups. That got me down to around $400 a month, including another small instance for some legacy Mailana labs sites.

I then decided to switch to ‘spot instances’, Amazon’s auction model for buying spare server capacity cheaply. Most of the time it’s only about 12 cents an hour, a third of the normal price. I switched, but then started to experience some serious price spikes that kept taking the server down and required manual intervention to set everything up again. At some points the price went to $15 an hour, so there are obviously capacity limits being hit. That’s very different from my experience just a few months ago, when prices seemed a lot more stable. At this point I’d never recommend spot instances for user-facing servers, because the downtime seems too high. They’re still a great deal for things like MapReduce backend processing.
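To make the trade-off concrete, here is a rough sketch of the arithmetic in Python. It assumes a 730-hour month, takes the on-demand rate to be three times the 12-cent spot price (my reading of "a third of the normal price"), and uses a deliberately simplified model of spot behavior in which the instance only runs in hours where the market price is at or below your maximum bid. The price history and the $0.50 bid are made up for illustration, not real Amazon data.

```python
HOURS_PER_MONTH = 730  # assumption: 8,760 hours in a year / 12 months


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of running one instance non-stop for a month at a flat hourly rate."""
    return hourly_rate * hours


def simulate_spot(hourly_prices, max_bid):
    """Simplified spot model (an assumption, not Amazon's exact rules):
    the instance is up only in hours where the market price is at or
    below max_bid, and each of those hours is billed at the market price."""
    hours_up = 0
    cost = 0.0
    for price in hourly_prices:
        if price <= max_bid:
            hours_up += 1
            cost += price
    return hours_up, cost


spot_rate = 0.12                # typical spot price from the post, $/hour
on_demand_rate = spot_rate * 3  # implied by "a third of the normal price"

print(f"spot:      ${monthly_cost(spot_rate):.2f} a month")
print(f"on-demand: ${monthly_cost(on_demand_rate):.2f} a month")

# Hypothetical price history: a quiet day with a three-hour spike to $15.
prices = [0.12] * 21 + [15.00] * 3
hours_up, cost = simulate_spot(prices, max_bid=0.50)
print(f"up {hours_up} of {len(prices)} hours, paid ${cost:.2f}")
```

The spike barely changes the bill in this model, since you are outbid rather than charged $15, but it does cost you three hours of downtime plus whatever manual work it takes to bring the server back. That is the part that hurts for a user-facing site.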

I need 64-bit support for a lot of the frameworks DSTK relies on, so I couldn’t go down to a small instance, but I did realize that micro instances were x86_64, so my next step was to try running both sites off one of those. Rather predictably, the processing requirements and lack of memory crippled the tiny instance, so the site was extremely flaky. It would have been only $14 a month though, so I spent some time trying to fix the issues, for example by configuring swap space. My conclusion was that micros are too limited for anything but light web serving.

Today I finally bit the bullet and bought a one-year reserved m1.large instance, which cost me about $900 up front and will cost another $1,100 in running costs over the next year. I also rolled my small instance into the same server, so I’m down to about $170 a month! This is around $100 less than the unreserved cost, so I’d seriously look at reserving servers as a way of managing your costs, if you can stomach the up-front deposit.
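The $170 figure is just the up-front fee and the year's running costs spread over twelve months. A minimal sketch of that calculation follows, along with a rough break-even estimate. The unreserved monthly cost here is an assumption: I'm reading "around $100 less" as a monthly difference and adding it to the reserved figure, so treat the break-even number as a ballpark rather than a quote from Amazon's price list.

```python
def amortized_monthly(upfront, running_per_year, months=12):
    """Spread the up-front fee and a year's running costs evenly over the term."""
    return (upfront + running_per_year) / months


def break_even_months(upfront, running_per_year, unreserved_monthly):
    """Months until the up-front fee is paid back by the cheaper running rate.

    Returns None if the reserved running rate is not actually cheaper.
    """
    reserved_running_monthly = running_per_year / 12
    monthly_saving = unreserved_monthly - reserved_running_monthly
    if monthly_saving <= 0:
        return None
    return upfront / monthly_saving


reserved = amortized_monthly(upfront=900, running_per_year=1100)
unreserved = reserved + 100  # assumption: "around $100 less" is per month

print(f"reserved, amortized: ${reserved:.2f} a month")
print(f"unreserved (assumed): ${unreserved:.2f} a month")

months = break_even_months(900, 1100, unreserved)
print(f"up-front fee recovered after about {months:.1f} months")
```

If you expect to shut the server down before the break-even point, paying the unreserved rate is the cheaper option under these assumptions.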

I’ve also added a link to my O’Reilly books to the sites, in the hope I’ll cover some of my costs:

Improve your data skills (and keep this server running!) by buying my guides:

I’m happy I’ve found a solution that should allow me to keep offering the DSTK and OpenHeatMap services without breaking my bank account. Apologies to all the users who have suffered through the transition, but things should be a lot more stable from now on.