Using encrypted DMGs to store sensitive data on OS X

It has been years since I used a desktop machine for development, but working on a laptop does make physical security harder. If someone steals your machine, how much do they have access to? If you're on a Mac, here's a tip Apple taught me for securing your data.

You're probably familiar with .DMG files from software downloads, but not many people know that it's easy to create your own, and they can be writable and encrypted with a password. They work a lot like external drives, but are just stored as a file on your main machine. The advantage is that you have to enter a password before you mount them, otherwise they're just meaningless random data, even if someone has taken your machine and changed the root password. They're not a magic bullet, but they're a useful general-purpose tool to use as part of your security strategy. For example, Apple required us to keep any source code we had on our laptops in them. Here's how you build your own:

– Open up the 'Disk Utility' application, under Applications/Utilities.

– Select 'New Image' from the top toolbar.

– Choose the name and the size you want (it's hard to change afterwards), and under encryption pick 256-bit AES. Everything else you can leave at the defaults.

– On the next screen pick a strong password, and very importantly, uncheck 'Remember password in my keychain'! We want to ensure that anyone who wants access to the data has to enter our original password, not get access to it via a reset user password, so we don't want it stored in the keychain.
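If you'd rather script those steps, here's a rough command-line sketch using hdiutil, the tool that ships with OS X for working with disk images. The function name, the volume name, the size and the output path are all placeholders I've made up for illustration, and the HFS+J filesystem is my assumption about what the Disk Utility defaults give you. Run without a password option, hdiutil should prompt you for the password in the terminal.

```shell
# Sketch only: create an AES-256 encrypted disk image from the command line.
# create_encrypted_dmg is a hypothetical helper, not part of OS X.
create_encrypted_dmg() {
    if [ "$#" -ne 3 ]; then
        echo "usage: create_encrypted_dmg VOLUME_NAME SIZE OUTPUT.dmg" >&2
        return 2
    fi
    if ! command -v hdiutil >/dev/null 2>&1; then
        echo "hdiutil not found; this only works on a Mac" >&2
        return 1
    fi
    # -size takes values like 100m or 2g; HFS+J is an assumed filesystem choice.
    hdiutil create -size "$2" -fs HFS+J -encryption AES-256 -volname "$1" "$3"
}

# Example (placeholder values):
#   create_encrypted_dmg ssh 100m ~/ssh.dmg
#   hdiutil attach ~/ssh.dmg      # prompts for the password and mounts the volume
#   hdiutil detach /Volumes/ssh   # the command-line version of the eject icon
```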

You'll now have a DMG file on your drive, and it will be automatically mounted when you create it. To test it's working, go to the Finder and hit the eject icon next to where it appears in the sidebar. After that's complete, double-click on the DMG file and you should be prompted to enter your password. Again, make sure you don't check the remember password option. You should see it appear in the sidebar again.

The volume will remain mounted for as long as you're logged in, so you need to make sure you have a password set on your screensaver. With that in place, an attacker will need to reboot your machine to reset account passwords. Rebooting unmounts the disk image, so they would then need to enter the image's password to get access to the data on it.

I use this a lot for things like SSH credentials, so I'll usually create a symbolic link by running something like:

ln -s /Volumes/ssh/.ssh /Users/petewarden/.ssh
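One wrinkle is that the link dangles whenever the image isn't mounted. Here's a small sketch of a wrapper that only creates the link when the target directory exists, and refuses to overwrite a real file or directory. The function name link_ssh_dir is made up for illustration, and the paths in the usage comment are just the ones from the command above.

```shell
# Sketch only: link a directory on the mounted image into place.
# link_ssh_dir is a hypothetical helper name.
link_ssh_dir() {
    volume_dir="$1"   # directory on the mounted image
    link_path="$2"    # where the symbolic link should live
    if [ ! -d "$volume_dir" ]; then
        echo "not mounted or missing: $volume_dir" >&2
        return 1
    fi
    # Don't clobber a real file or directory that isn't already a link.
    if [ -e "$link_path" ] && [ ! -L "$link_path" ]; then
        echo "refusing to replace real file or directory: $link_path" >&2
        return 1
    fi
    # -f replaces an old link, -n stops ln from descending into it.
    ln -sfn "$volume_dir" "$link_path"
}

# Example:
#   link_ssh_dir /Volumes/ssh/.ssh /Users/petewarden/.ssh
```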

There are alternatives of course, such as using FileVault, TrueCrypt or an encrypted USB key, but I've found this a simple and straightforward way to help secure your data.


Five short links

Photo by Brian Fuller

Visualizing Jane Austen – How do the names of key characters recur in Austen's novels? Matthew Hurst has put together some simple but interesting visualizations.

Elasticache – You know how I was just asking for a no-hassle, pay-as-you-go, true database-as-a-service? It isn't persistent, but Amazon's new memcached on-demand is a big step forward. It's a shame you still have to worry about nodes, and I hope it doesn't turn out to have hidden flaws (like Amazon's SimpleDB's reliance on manual sharding) but I'm excited to try it out.

Mumakil – So, I can't actually get this to run, but in theory this should be a great way of doing bulk loads and dumps from Cassandra using Hadoop. I'll also be digging into Brisk over the next few weeks, but I like the idea of something like Mumakil that's laser-focused on data transfer, as a complement to Datastax's more general tools.

Junar – An Argentinean data-marketplace startup. As a data consumer, it's great to see a thousand flowers blossom in this area, and it will be interesting to see how their offerings start to specialize.

Map Tile URL formats – A machine-readable collection describing the format of many different public web services that offer map image tiles (though keep in mind that most of them have lots of conditions attached).

Five short links


Photo by Susanna S.

Data Patterns – A pithy, useful and opinionated (in a good way) collection of advice and techniques for dealing with common data problems, from parsing HTML and threading scrapers to the joy of CSV for data storage. It's early days and there's lots more to be filled out, but what's there is great.

The Guild of Silicon Valley – This article makes me want to grow a chin beard. One funny thing about the 'new wave' of data technologies like Hadoop, Lucene and Cassandra is that they're written in Java, a language most startup web developers avoid like the plague. The painful thing about Java and C++ is that they force you to think hard up front about what you're building before you dive in. The insight of agile programming is that for smaller projects that's a waste, but these show you still need it for industrial-grade frameworks. Or maybe it's just that Doug Cutting's a force of nature and it happens to be his favorite language, since he's responsible for two of the three projects above?

WeoGeo – The interface is mind-boggling, but if you persevere, there's a rich set of free and commercial geographic data sets available. I discovered a compendium of cell tower locations from the FCC I was unaware of, amongst other goodies.

Scaling Up Machine Learning – Solid advice from people who've obviously been fighting in the trenches.

Xeround – I'm tired of spending my time dealing with database housekeeping for uninteresting transactional data problems, so I love the idea of a relational database that just works, a turnkey service that I don't have to set up but that can still scale. I haven't used it or similar services like ScaleDB, so I'm sure there are caveats, but it's a problem that needs solving. Today it feels like I have to build my own power plant just to get electricity. I'd much rather pay somebody else to deal with a lot of the solved database issues so I can focus on the more interesting problems.

Porting Flash/Flex 3’s Matrix, Point and Rectangle classes to Javascript

Photo by Free Wildebeest

I started off writing the OpenHeatMap renderer for Flash using Flex3, and then ported the code to vanilla Javascript to support HTML5. There were many things that felt poorly designed in Flex, but the 2D geometry support was a pleasure to use. To minimize the differences between code for the two renderers I ended up rewriting the bulk of the Matrix, Point and Rectangle classes in Javascript. Today I needed to reuse some of my OpenHeatMap functions in another project, so it seemed like a good chance to split off the classes and relicense them as BSD.

Why should you care? You almost certainly don't, unless you're somebody who's porting a big project from Flash to Javascript. In that case you're probably sobbing in a corner, rocking back-and-forth and clutching your knees, thanks to all the other painful issues you're dealing with. If you emerge from your fugue state long enough to notice, you'll be happy though, trust me.

The code's up at

Securing Cassandra on EC2


Photo by Edward Ross

Over the last couple of months I've been creating a large-scale data processing pipeline for my new startup. I've used all of the technologies involved before, but never all together or in an environment where the processing is so user-driven. The main ingredients are a Ruby/Sinatra frontend, a Postgres database for small-scale transactional information like user accounts, a Cassandra cluster for big data, and Hadoop for processing, all hosted on EC2. I've learned lots of lessons about integration, but one of the ones I found the least guidance on was security. I'll be talking about Hadoop at some point, but here's what I discovered about Cassandra:

– Most people use it on machines that are completely inaccessible from the outside world, so security just means keeping attackers outside your firewall. Since with EC2 your machines have to be minimally-accessible from outside the data center, it isn't straightforward to implement this strategy.

– I love the Datastax material on Cassandra, but their guide to setting up on EC2 suggests that you allow port 9160 to be reached from any address. This allows anyone who discovers the address of the machine to log in and look through your data. I don't want to beat up on them, that's actually a good way to get started with minimal hassle when you're experimenting, but it's worth calling out the implications.

– There's password authentication built into Cassandra but it's not very mature. As one of the commenters on this thread says, "I am not aware of anyone using the security features of the SimpleAuthenticator anywhere in production", and my research showed a lot of fiddly things to get wrong, so I'm not ready to rely on it to protect my users' data.

So, what did I end up doing? I set up strict firewall rules using Amazon's security groups feature to block every port but 22 for ssh on my Cassandra cluster. I then added some exceptions, to allow any other machines in the 'Cassandra' security group to access port 7000 for internal cluster communications, and machines in the 'Frontend' and 'Hadoop' groups to call 9160, the external interface to the data. These machines themselves are in EC2, locked down behind their own firewall rules.
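To make the shape of those rules concrete, here's a sketch that prints them as AWS command-line calls rather than running them, so you can read them over first. It's an illustration, not the exact commands I used: the function name is made up, the group names are the ones described above, and I'm assuming the groups already exist and can be referred to by name. Security groups block everything by default, so the only rule open to the world is port 22.

```shell
# Sketch only: print (not run) the ingress rules for the Cassandra group.
# print_cassandra_rules is a hypothetical helper name.
print_cassandra_rules() {
    ingress="aws ec2 authorize-security-group-ingress --protocol tcp"
    # ssh from anywhere
    echo "$ingress --group-name Cassandra --port 22 --cidr 0.0.0.0/0"
    # internal cluster communication, only between Cassandra nodes
    echo "$ingress --group-name Cassandra --port 7000 --source-group Cassandra"
    # the external data interface, only for the frontend and Hadoop machines
    for group in Frontend Hadoop; do
        echo "$ingress --group-name Cassandra --port 9160 --source-group $group"
    done
}

print_cassandra_rules
```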

This makes the security problem very similar to the standard Cassandra setup within an intranet, where the goal is to keep attackers outside. It means I have to use ssh tunneling or similar techniques if I want to develop on my local machine connecting to the cluster, but that's not too much of an inconvenience.
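For the tunneling part, a minimal sketch looks something like the following. It builds a standard ssh local port forward, so that connections to port 9160 on my laptop get carried over ssh to the cluster. The function name and the example host are placeholders, and I'm assuming Cassandra is reachable as localhost from the machine you ssh into. If it's bound to a private address instead, pass that as the second argument.

```shell
# Sketch only: build the ssh command for a local tunnel to Cassandra.
# cassandra_tunnel_cmd is a hypothetical helper name.
cassandra_tunnel_cmd() {
    ssh_target="$1"                   # user@host you can reach on port 22
    cassandra_host="${2:-localhost}"  # Cassandra's address as seen from that host
    port="${3:-9160}"                 # the external data interface
    # -N: no remote command, just forwarding; -L: local port forward
    echo "ssh -N -L $port:$cassandra_host:$port $ssh_target"
}

# Example (placeholder host); run the printed command, then point your
# local client at localhost:9160
#   cassandra_tunnel_cmd me@cassandra-node.example.com
```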

Five short links

Photo by somethingstartedcrazy

Airport security using your online profile – Most people now have multiple public profiles on different services like Facebook and LinkedIn. Unlike traditional self-supplied information, these are hard to fake because they require significant numbers of other people to implicitly supply references by friending you. You can imagine creating a large number of fake accounts all friending each other, but that structure will stick out in a social graph like a sore thumb. Then there's semi-public data like credit reports on top of that, which requires either years of preparation or cooperation from multiple private companies to fake. This means that if you can verify that a person is who they say they are, you can be very sure about whether that identity is a real or made-up one. I first heard about this as a problem that spies in foreign countries now face when building new identities, but this article indicates that airline security in the US may rely on similar data as a signal when screening travelers.

Mapbox's Wax – I only recently discovered Mapbox, but I've been blown away by the quality of their work. Wax is their Javascript library that makes it easy to use a whole bunch of different map technologies through a common interface.

Girls go geek again – It's eye-opening to see how female-friendly computing used to be, and depressing to see how much ground we've lost since the early 80's. Don't dismiss this as a hippy political-correctness problem, just think about all of the kick-ass github projects that don't exist because the authors didn't go into our field.

Open-source data journalism with BuzzData – I've been excited to see Peter Forde's vision of a socially-focused data site become a reality. It ties in with one of my big dreams, of seeing every journalistic story that references data make the raw numbers available for follow-ups and responses, just like scientific papers.

HMS Pinafore – The Pirates of Penzance came up on my iTunes shuffle a few days ago, and I decided to see if there were any Gilbert and Sullivan performances coming up. As luck would have it, Lamplighters were finishing up a run of HMS Pinafore at Mountain View, so a few of us ventured down in some trepidation (South Bay isn't normally where I head for my culture). I was amazed, it was by far the best G&S production I've ever seen. The singing of the main players and the chorus was clear, rich and powerful, the choreography was crisp but still full of life, and the orchestra was note-perfect. The acting made the show though, with Robby Stafford stealing scenes left and right as Dick Deadeye. I'm going to be following the Lamplighters' schedule closely from now on, that was one of the best shows I've seen all year and I'm looking forward to catching more.

Data Science Toolkit security fix

Photo by Lee Haywood

Just a quick note and apology to users of the DSTK EC2 AMI. The default public key that Amazon adds to ~/.ssh/authorized_keys wasn't being removed automatically during the AMI creation process as I expected, so I had unknowingly been given login access to any unmodified instances created from a DSTK AMI. Happily Amazon's audit procedures spotted the problem, so I've now gone ahead and built a new version with my public key removed. Apologies to everyone, that was my mistake. To be clear, the worst case was that I would be able to log in to a server you'd created; it didn't give anyone else access.

I've updated the docs and released ami-a971b7c0 as the current version of DSTK 0.35. I recommend users of the current AMI either switch to the new one, or just edit ~/.ssh/authorized_keys to remove the first line containing my key. It will be the line that begins ssh-rsa AAAAB3Nza… Be careful not to delete your own login credentials, or you'll be unable to log into the box yourself!
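If you'd rather not edit the file by hand, here's a cautious sketch of the same fix. It assumes my key really is the first line of the file, which you should check by eye before running it. It matches on position rather than on the key text, because plenty of other ssh-rsa keys start with the same characters, so matching on that prefix could delete your own key too. The function name is made up, and it keeps a backup and refuses to run if the file has fewer than two keys, so you can't lock yourself out.

```shell
# Sketch only: drop the first line of an authorized_keys file, keeping a backup.
# remove_first_key is a hypothetical helper name.
remove_first_key() {
    keys_file="$1"
    if [ ! -f "$keys_file" ]; then
        echo "no such file: $keys_file" >&2
        return 1
    fi
    # Count non-empty lines; refuse if removing one would leave no keys.
    count=$(grep -c . "$keys_file" || true)
    if [ "$count" -lt 2 ]; then
        echo "only $count key line(s) in $keys_file; leaving it alone" >&2
        return 1
    fi
    cp -p "$keys_file" "$keys_file.bak"
    # Rewrite in place from the backup so the file keeps its permissions.
    tail -n +2 "$keys_file.bak" > "$keys_file"
}

# Example:
#   remove_first_key ~/.ssh/authorized_keys
```

Keep your current ssh session open and check that you can still log in from a second terminal before you delete the .bak file.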

To be sure I wasn't missing anything else I've been studying the Amazon guide to creating shared AMIs, and use the following commands just before I build the image to wipe sensitive data like website visitors, command line histories and the default server SSH keys: 

sudo rm -rf /var/log/apache2/*

sudo rm -rf ~/.ssh/authorized_keys

sudo rm -rf /etc/ssh/ssh_host_dsa_key

sudo rm -rf /etc/ssh/

sudo rm -rf /etc/ssh/ssh_host_rsa_key

sudo rm -rf /etc/ssh/

history -c



Five short links

Photo by Billie Hara

I'm frantically coding for an upcoming launch, so apologies if you're waiting for an email reply. I'm looking forward to showing the world what we're working on though, and to posting the lessons I've learned about using Cassandra and Hadoop in production.

TinkerPop – The equivalent of the LAMP stack for graph data processing, pulling together the best open-source tools to help build a turnkey pipeline. The developers' avatars on the right of the page scare me though.

Gridded Population of the World – I've been looking for something like this for a long time. It's a breakdown of the surface of the earth as a grid, with an estimate of how many people live in each square. Why is this useful? Almost every geographic density map you produce will be dominated by the places people actually live, with the signal you really cared about drowned by the fact that most people live packed into a few urban areas. This data set could be used to correct for that, at least as a first approximation, so you can tell if more people than you'd expect by their raw population are visiting your site from particular areas, for example.

Data Alchemists – I like Ben's idea, resigned as I am to the dominance of the term data scientist.

ISPs are hijacking search queries – This is a hard-to-explain but important story. ISPs are using a third-party service to capture and redirect their users' searches. The redirection is painful, but handing over everything their users are searching for to a random bunch of marketing companies with no disclosure is a really bad idea.

Three.js – I'm in love with this library, and with WebGL in general. The combination of the ease of use of Javascript, the power of OpenGL and the fact that the latest versions of Chrome, Firefox and Safari all support it means that we'll be seeing a ton of beautiful WebGL pages over the next few months. Check out some of the examples to get inspired.