Over the last couple of months I've been creating a large-scale data processing pipeline for my new startup. I've used all of the technologies involved before, but never all together or in an environment where the processing is so user-driven. The main ingredients are a Ruby/Sinatra frontend, a Postgres database for small-scale transactional information like user accounts, a Cassandra cluster for big data, and Hadoop for processing, all hosted on EC2. I've learned lots of lessons about integration, but one of the ones I found the least guidance on was security. I'll be talking about Hadoop at some point, but here's what I discovered about Cassandra:
– Most people use it on machines that are completely inaccessible from the outside world, so security just means keeping attackers outside your firewall. Since with EC2 your machines have to be minimally-accessible from outside the data center, it isn't straightforward to implement this strategy.
– I love the Datastax material on Cassandra, but their guide to setting up on EC2 suggests that you allow port 9160 to be reached from any address. This allows anyone who discovers the address of the machine to log in and look through your data. I don't want to beat up on them, that's actually a good way to get started with minimal hassle when you're experimenting, but it's worth calling out the implications.
– There's password authentication built into Cassandra but it's not very mature. As one of the commenters on this thread says "I am not aware of anyone using the security features of the SimpleAuthenticator anywhere in production" and my research showed a lot of fiddly things to get wrong, so I'm not ready to rely on it to protect my user's data.
So, what did I end up doing? I set up strict firewall rules using Amazon's security groups feature to block every port but 22 for ssh on my Cassandra cluster. I then added some exceptions, to allow any other machines in the 'Cassandra' security group to access port 7000 for internal cluster communications, and machines in the 'Frontend' and 'Hadoop' groups to call 9160, the external interface to the data. These machines themselves are in EC2, locked down behind their own firewall rules.
This makes the security problem very similar to the standard Cassandra setup within an intranet, where the goal is to keep attackers outside. It means I have to use ssh tunneling or similar techniques if I want to develop on my local machine connecting to the cluster, but that's not too much of an inconvenience.