Lessons from a Cassandra disaster

Disaster
Photo by Earthworm

Yesterday one of my nightmares came true; our backend went down for seven hours! I'd received an email from Amazon warning me that one of the instances in our Cassandra cluster was having reliability issues and would be shut down soon, so I had to replace it with a new node. I'm pretty new to Cassandra and I'd never done that in production, so I was nervous. Rightfully so at it turned out.

It began simply enough, I created a new server using a Datastax AMI, gave it the cluster name, pointed it at one of the original nodes as a seed, and set 'bootstrapping' to true. It seemed to do the right thing, connecting to the cluster, picking a new token and streaming down data from the existing servers. After about an hour it appeared to complete, but the state shown with nodetool ring was still Joining, so it never became part of the cluster. After researching this on the web without any clear results, I popped over to the #cassandra IRC channel and asked for advice. I was running 0.8.1 on the original nodes and 0.8.8 on the new one, since that was the only Datastax AMI available, so the only suggestion I had was to upgrade all the nodes to a recent version and try again.

This is where things started to get tough. There's no obvious way to upgrade a Datastax image and IRC gave me no suggestions, so I decided to try to figure how to do it myself from the official binary releases. I took 0.8.7 and looked at where the equivalent files to the ones in the archive lived on disk. Some of them were in /usr/share/cassandra, others in /usr/bin, so I made backup copies of those directories on the machine I was upgrading. I then copied over the new files, and tried restarting Cassandra. I hit an error, and then I made the fatal mistake of trying to restore the original /usr/bin by first moving out the updated one, thus bricking that server.

Up until now the Cassandra cluster had still been functional, but the loss of the node I had the code contact initially meant we lost access to the data. Luckily I'd set things up so that the frontend was mostly independent of the backend data store, so we were still able to accept new users, but we couldn't process them or show their profiles. I considered rejigging the code so that we could limp along with two of the three nodes working, but my top priority was safeguarding the data, so I decided to focus on getting the cluster back up as quickly as I could.

I girded my loins and took another try at upgrading a second node to 0.8.7, since that was the most likely cause of the failure-to-join issue according to IRC. I was painstaking about how I did it this time though, and after a little trial and error, it worked. Here's my steps:

There were a couple of gotchas. You shouldn't copy the bin/cassandra.in.sh file from the distribution, that contains settings like the location of the library files that you want to retain from the Datastax AMI, and if you see this error:

ERROR 22:15:44,518 Exception encountered during startup.

java.lang.NullPointerException

at org.apache.cassandra.db.ColumnFamilyStore.scrubDataDirectories(ColumnFamilyStore.java:606)

it means you've forgotten to run Cassandra as an su user!

Finally I was able to upgrade both remaining nodes to 0.8.7, and retry adding a new node. Maddeningly it still made it all the way through the streaming and indexing, only to pause on Joining forever! I turned back to IRC and explained what I'd been doing, and asked for suggestions. Nobody was quite sure what was going on, but a couple of people suggested turning off bootstrapping and retrying. To my great relief, it worked! It didn't even have to restream, the new node slotted nicely into the cluster within a couple of minutes. Things were finally up and running again, but the downtime definitely gave me a few grey hairs. Here's what I took away from the experience:

Practice makes perfect. I should have set up a dummy cluster and tried a dry run of the upgrade there. It's cheap and easy to fire up extra machines for a few hours, and would have saved a lot of pain.

Paranoia pays. I was thankful I'd been conservative in my data architecture. I'd specified three-way replication, so that even if I'd bricked the second machine, no data would have been lost. I also kept all the non-recoverable data either on a separate PostGres machine, or in a Cassandra table that was backed up nightly. The frontend was still able to limp along with reduced functionality when the backend data store was down. There's still lots of potential showstoppers of course, but the defence-in-depth approach worked during this crisis.

Communicate clearly. I was thankful that I'd asked around the team before making the upgrade, since I knew there was a chance of downtime whenever you had to upgrade a database server. We had no demos to give that afternoon, so the consequences were a lot less damaging than they could have been.

The Cassandra community rocks. I'm very grateful for all the help the folks on the #cassandra IRC channel gave me. I chose it for the backend because I knew there was an active community of developers who I could turn to when things went wrong, even when the documentation was sparse. There's no such thing as a mature distributed database, so having experienced gurus to turn to is essential, and Cassandra has a great bunch of folks willing to help.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: