Over the last couple of months I've been creating a large-scale data processing pipeline for my new startup. I've used all of the technologies involved before, but never all together or in an environment where the processing is so user-driven. The main ingredients are a Ruby/Sinatra frontend, a Postgres database for small-scale transactional information like user accounts, a Cassandra cluster for big data, and Hadoop for processing, all hosted on EC2. I've learned lots of lessons about integration, but one of the areas where I found the least guidance was security. I'll be talking about Hadoop at some point, but here's what I discovered about Cassandra:
– Most people use it on machines that are completely inaccessible from the outside world, so security just means keeping attackers outside your firewall. With EC2 your machines have to be at least minimally accessible from outside the data center, so this strategy isn't straightforward to implement.
– I love the Datastax material on Cassandra, but their guide to setting up on EC2 suggests that you allow port 9160 to be reached from any address. This allows anyone who discovers the address of the machine to connect and look through your data. I don't want to beat up on them; that's actually a good way to get started with minimal hassle when you're experimenting, but it's worth calling out the implications.
– There's password authentication built into Cassandra, but it's not very mature. As one of the commenters on this thread says, "I am not aware of anyone using the security features of the SimpleAuthenticator anywhere in production", and my research showed a lot of fiddly things to get wrong, so I'm not ready to rely on it to protect my users' data.
So, what did I end up doing? I set up strict firewall rules using Amazon's security groups feature to block every port but 22 for ssh on my Cassandra cluster. I then added some exceptions, to allow any other machines in the 'Cassandra' security group to access port 7000 for internal cluster communications, and machines in the 'Frontend' and 'Hadoop' groups to call 9160, the external interface to the data. These machines themselves are in EC2, locked down behind their own firewall rules.
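To make the rule set above concrete, here is a minimal sketch that models it in plain Python. The group names and port numbers come from the description above; the `Rule` class, `CASSANDRA_INGRESS` list and `is_allowed` function are hypothetical names made up for illustration and are not the EC2 API. It also assumes port 22 is open to any address, which is one reading of "block every port but 22".

```python
from dataclasses import dataclass

# Marker for "any address"; an assumption about how port 22 is exposed.
ANYWHERE = "0.0.0.0/0"


@dataclass(frozen=True)
class Rule:
    """One ingress exception: a port plus who may reach it."""
    port: int
    source: str  # a security group name, or ANYWHERE


# Ingress rules for the 'Cassandra' security group, as described in the post.
CASSANDRA_INGRESS = [
    Rule(22, ANYWHERE),       # ssh
    Rule(7000, "Cassandra"),  # internal cluster communications
    Rule(9160, "Frontend"),   # external interface to the data
    Rule(9160, "Hadoop"),
]


def is_allowed(rules, port, source_group=None):
    """Return True if some rule admits traffic to `port`.

    `source_group` is the caller's security group; None models a machine
    that belongs to none of the groups (for example, an outside attacker).
    Anything not matched by a rule is blocked by default.
    """
    for rule in rules:
        if rule.port != port:
            continue
        if rule.source == ANYWHERE or rule.source == source_group:
            return True
    return False
```

With this model, `is_allowed(CASSANDRA_INGRESS, 9160)` is false for an outside machine, while `is_allowed(CASSANDRA_INGRESS, 9160, "Frontend")` is true, which is the default-deny behavior the setup is aiming for.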
This makes the security problem very similar to the standard Cassandra setup within an intranet, where the goal is to keep attackers outside. It means I have to use ssh tunneling or similar techniques if I want to develop on my local machine connecting to the cluster, but that's not too much of an inconvenience.
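For the ssh tunneling mentioned above, a small helper can build the port-forwarding command. This is only a sketch: `tunnel_command` is a hypothetical name, and the gateway login and private address in the usage note are placeholders, not values from this setup. It relies on the standard OpenSSH options `-L` (local port forward) and `-N` (no remote command).

```python
def tunnel_command(gateway, cassandra_addr, port=9160, local_port=None):
    """Build an ssh command that forwards a local port to Cassandra.

    gateway        -- ssh login for a machine reachable on port 22
                      (hypothetical, e.g. "user@node.example.com")
    cassandra_addr -- address the Cassandra interface listens on, as seen
                      from the gateway (hypothetical, e.g. "10.0.0.5")
    port           -- remote port; 9160 is the interface named in the post
    local_port     -- local end of the tunnel; defaults to `port`
    """
    if local_port is None:
        local_port = port
    forward = f"{local_port}:{cassandra_addr}:{port}"
    # -N: do not run a remote command, just hold the tunnel open.
    # -L: forward connections on the local port to cassandra_addr:port.
    return ["ssh", "-N", "-L", forward, gateway]
```

For example, `tunnel_command("user@node.example.com", "10.0.0.5")` returns a list that can be passed to `subprocess.run`; while it is running, a local client would connect to `localhost:9160` instead of the cluster address.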
Interesting blog post; it is still fairly up to date.
I found it while looking for how to properly lock down a cluster.
Even Cassandra 2.0 still lacks some useful security features, so the gaps have to be compensated for with firewall rules. Unfortunately the Datastax documentation also seems to lack a reasonable chapter on this.
Currently there are three interfaces to protect:
1. App->Cassandra (RPC Thrift/Native interface)
– can enable authentication (PasswordAuthenticator stores accounts in Cassandra cluster)
– PasswordAuthenticator seems OK to use, since it works with BCrypt hashes (imho)
– access can be restricted to localhost, but that means the app needs to be deployed on the same machines, which gives suboptimal load-balancing behavior
– can use transport security
2. Cassandra cluster replication, node<->node (storage interface)
– currently no authentication possible
– access cannot be restricted to localhost because cluster members need to contact each other
– drawback: the storage ports of all cluster servers need to be moved to a protected subnet or firewalled
– can use transport security
3. Admin scripts->Cassandra (JMX interface)
– can enable authentication (JMX password file)
– drawback: JMX passwords are stored unhashed in a text file on disk, which is not state of the art; JMX currently cannot use the same account store as the RPC interface (PasswordAuthenticator)
– access cannot be restricted to localhost due to JMX design limits (although some Cassandra patches seem to exist for that?)
– can use transport security
A further option for restricting all 3 interfaces could be to use client certificates to partly replace firewall rules, but I'm not sure whether that works for all 3, and it requires dealing with lots of keys/certs… (I haven't tried that, and I actually prefer proper and simple firewall rules)
It looks like interfaces 2 and 3 could still use some improvements…
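One way to check that firewall rules really cover the three interfaces listed above is to probe their ports from a machine outside the cluster. The sketch below assumes Cassandra's default port numbers (9160 Thrift, 9042 native, 7000 and 7001 storage, 7199 JMX); a given cluster may be configured differently, and the function names are made up for illustration.

```python
import socket

# Cassandra default ports; an assumption, since every one is configurable.
DEFAULT_PORTS = {
    "thrift rpc": 9160,
    "native transport": 9042,
    "storage": 7000,
    "storage (ssl)": 7001,
    "jmx": 7199,
}


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def exposed_ports(host, ports=None, timeout=2.0):
    """Return the subset of `ports` (name -> port) that accept connections."""
    if ports is None:
        ports = DEFAULT_PORTS
    return {
        name: port
        for name, port in ports.items()
        if is_reachable(host, port, timeout)
    }
```

Run from outside the protected subnet, `exposed_ports("<node address>")` should ideally come back empty; any entry it returns is an interface that the firewall is not covering from that vantage point.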