I've spent a lot of the last two years wrestling with different database technologies, from vanilla relational systems to exotic key/value stores, but for OpenHeatMap I'm storing all data and settings in S3. To most people that sounds insane, but I've actually been very happy with that decision. How did I get to this point?
Like most people, I started by using MySQL. This worked pretty well for small data sets, though I did have to waste more time than I'd like on housekeeping tasks. The server or process would crash, or I'd change machines, or I'd run out of space on the drive, or a file would be corrupted, and I'd have to mess around getting it running again.
As I started to accumulate larger data sets (eg millions of Twitter updates) MySQL started to require more and more work to keep running well. Indexing is great for medium-scale data sets, but once the index itself grew too large, lots of hard-to-debug performance problems popped up. By the time I was recompiling the source code and instrumenting it, I'd realized that its abstraction model was now more of a hindrance than a help. If you need to craft your SQL around the details of your database's storage and query-optimization algorithms, then you might as well use a more direct low-level interface.
That led me to my dalliance with key-value stores, and my first love was Tokyo Cabinet/Tyrant. Its brutally minimal interface was delightfully easy to get predictable performance from. Unfortunately it was very high maintenance, and English-language support was very hard to find, so after a couple of projects using it I moved on. I still found the key/value interface the right level of abstraction for my work; its essential property was the guarantee that any operation would take a known amount of time, regardless of how large my data grew.
So I put Redis and MongoDB through their paces. My biggest issue was their poor handling of large data loads, and I submitted patches to implement Unix file sockets as a faster alternative to TCP/IP through localhost for that sort of upload. Mongo's support team are superb, and their responsiveness made Mongo the winner in my mind. Still, I realized I was finding myself wasting too much time on the same mundane maintenance chores that frustrated me back in the MySQL days, which led me to look into databases-as-a-service.
The most well-known of these is Google's AppEngine datastore, but they don't have any way of loading large data sets, and I wasn't going to be able to run all my code on the platform. Amazon's SimpleDB was extremely alluring on the surface, so I spent a lot of time digging into it. They didn't have a good way of loading large data sets either, so I set myself the goal of building my own tool on top of their API. I failed. Their manual sharding requirements, extremely complex programming interface and mysterious threading problems made an apparently straightforward job into a death-march.
While I was doing all this, I had a revelation. Amazon already offered a very simple and widely used key/value database: S3. I'm used to thinking of it as a file system, and anyone who's been around databases for a while knows that file systems make attractive small-scale stores that become problematic with large data sets. What I realized was that S3 was actually a massively scalable key/value store dressed up to look like a file system, and so it didn't suffer from the 'too many files in a directory' sort of scaling problems. Here are the advantages it offers:
– Widely used. I can't emphasize how important this is for me, especially after spending so much time on more obscure systems. There are all sorts of beneficial effects that flow from using a tool that lots of others also use, from copious online discussions to the reassurance that it won't be discontinued.
– Reliable. We have very high expectations of up-time for file systems, and S3 has had to meet these. It's not perfect, but backups are easy as pie, and with so many people relying on it there's a lot of pressure to keep it online.
– Zero maintenance. I've never had to reboot my S3 server or repair a corrupted table. Enough said.
– Distributed and scalable. I can throw whatever I want at S3, and access it from anywhere else. The system hides all the details from me, so it's easy to have a whole army of servers and clients all hammering the store without it affecting performance.
Of course there's a whole shed-load of features missing, most obviously the fact that you can't run any kind of query. The thing is, I couldn't run arbitrary queries on massive datasets anyway, no matter what system I used. At least with S3 I can fire up Elastic MapReduce and feed my data through a Hadoop pipeline to pull out analytics.
So that's where I've ended up, storing all of the data generated by OpenHeatMap as JSON files within both private and public S3 buckets. I'll eventually need to pull in a more complex system like MongoDB as my concurrency and flexibility requirements grow, but it's amazing how far a pseudo-file system can get you.
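In practice, "storing data as JSON files in S3 buckets" boils down to treating the bucket as a dictionary of JSON documents, with one `put_object`/`get_object` pair per primary-key lookup. Here's a minimal sketch of that pattern; the client-method shapes follow boto3's S3 API, but the `JsonStore`/`FakeS3` classes and the bucket and key names are made up for illustration, and the in-memory fake is only there so the example runs without AWS credentials (in production you'd pass in `boto3.client("s3")` instead).

```python
import io
import json


class JsonStore:
    """Thin key/value layer over an S3-style object store.

    `client` only needs put_object/get_object in boto3's shape;
    for real use you would pass boto3.client("s3").
    """

    def __init__(self, client, bucket):
        self.client = client
        self.bucket = bucket

    def put(self, key, value):
        # One JSON document per key -- primary-key access only.
        self.client.put_object(
            Bucket=self.bucket,
            Key=key,
            Body=json.dumps(value).encode("utf-8"),
            ContentType="application/json",
        )

    def get(self, key):
        resp = self.client.get_object(Bucket=self.bucket, Key=key)
        return json.loads(resp["Body"].read())


class FakeS3:
    """In-memory stand-in so the sketch runs without AWS credentials."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body, **kwargs):
        self.objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}


store = JsonStore(FakeS3(), "openheatmap-example")
store.put("maps/1234.json", {"title": "Population", "views": 42})
print(store.get("maps/1234.json")["title"])  # -> Population
```

The wrapper gives you exactly the "known time per operation, regardless of data size" property described above, and nothing more: no queries, no indexes, just get and put by key.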
I am slowly reaching a similar conclusion, especially since my data load is already optimized to be “primary-key access only”. What are your thoughts/experiences on the topic now, almost 2 years after you wrote this?
Still happy? Pain points?
I’m still pretty happy with this approach for lightweight projects. The main pain points have been that S3 tools are not optimized for this use case (eg doing an ‘ls’ of hundreds of thousands of files isn’t fun) and that running analytics across your data isn’t practical. OpenHeatMap is still using this approach and chugging along nicely though! These days I would also take a serious look at DynamoDB.
I’m in the same boat. Just a little curious about the latency between EC2 and S3. Do you create the data on S3 from an EC2 instance? How about performance/latency?
Might want to check out Snowflake. It allows for SQL querying of JSON, Avro, and XML at scale (and uses S3 under the covers…).
Since these are all REST-based calls, how do you model the resulting object?
Could we use MyBatis/Hibernate as a JPA layer?
BTW you could use engines such as Hive, Presto, Impala etc. to run SQL queries directly on the S3 files.
I’ve come to the same conclusion, Pete. Especially when paired with AWS Lambda it makes total sense. The only thing I did differently was to pair it with Redis for indexing (I use RedisLabs). I’ve created a Python library (https://github.com/kevproject/kev) that does this pretty easily. It is declarative by nature and has a Django model type feel.
I also took this approach (partly based on this article, IIRC) and ended up writing something for Node. Just in case anyone here finds it useful or, if anyone wants to provide some feedback as well, that would be great. Serverless + Node + Lambda + S3 makes for some super easy development for quick projects.
I’m looking at doing something like this but on the order of hundreds of millions of files in a single bucket. I don’t need to do much other than load the individual json files to process data from them, although I do want to be able to do that quickly, looping through the bucket. I have also tested AWS’ new offering, Athena, as a query engine on top and that works great. Any thoughts on gotchas when trying to store/load 100MM files in an S3 Bucket? It’s data I rarely access, but once in a while I need to scan through it all to process it. This makes running and paying for something like RDS or Dynamo unnecessarily expensive.
What are the scenarios when you need to process/reprocess the S3 files? Is there a good use case for firing off a lambda function when the file itself is modified? As long as your available lambda processes are high enough, you could process an enormous amount of data simply by updating the files (or even re-copying them to trigger the process.)
Lambda is an interesting idea. Generally these files are processed in batch over the period of a few days, and then low tens of thousands are accessed directly until we process them all in batch again later. We get updates all at once as well, but maybe I can still work Lambda into our workflow here.
Would you expect there to be issues requesting specific files by Key when there are 100MM in a bucket? I’m wondering if the lookup is going to be painfully slow at that scale.
If you do plan to go this route, make sure you think your workflow through properly. Make use of filters and such so that you don’t end up with a nasty little infinite loop where you read a file, then update it, which causes another Lambda event and a read and write again. It can be a very expensive mistake, having been there a couple of times 🙂
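The usual way to break that read-write-trigger loop is to keep input and output under different key prefixes, filter the trigger to the input prefix, and double-check the prefix inside the handler. Here's a sketch of such a handler; the event shape follows S3's notification format, but the prefixes, handler body, and key names are hypothetical (the real fetch/transform/write with boto3 is elided).

```python
INPUT_PREFIX = "incoming/"    # hypothetical layout: raw files land here...
OUTPUT_PREFIX = "processed/"  # ...and results are written here

def handler(event, context=None):
    """Sketch of an S3-triggered Lambda that won't re-trigger itself.

    Writing results under a different prefix (and configuring the S3
    trigger with a filter on INPUT_PREFIX) breaks the loop where each
    write fires another event. The record shape follows S3's
    notification format.
    """
    processed = []
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        if not key.startswith(INPUT_PREFIX):
            continue  # ignore our own output -- belt and braces
        out_key = OUTPUT_PREFIX + key[len(INPUT_PREFIX):]
        # ...fetch `key`, transform it, write to `out_key` with boto3...
        processed.append(out_key)
    return processed


event = {"Records": [
    {"s3": {"object": {"key": "incoming/maps/1.json"}}},
    {"s3": {"object": {"key": "processed/maps/1.json"}}},  # own output
]}
print(handler(event))  # -> ['processed/maps/1.json']
```

The in-handler prefix check is deliberately redundant with the trigger filter: if someone later widens the trigger configuration, the code still refuses to reprocess its own output.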
I can’t imagine that accessing tens of thousands of files will have too much of a measurable effect on S3. If you have enough processes with Lambda (you’ll have to raise your default limit of 100) then it will likely be the most performant option available to you. Loading a file from S3 with Lambda is faster than accessing most databases (assuming a fairly small file size).
Nice article, Matt! I was Googling for ideas on how to leverage S3 and I found your site. I have a sort-of-similar situation — I have 10 images per client (around 20 clients now, and 100 by the end of the year) and I plan to create a folder for each client and store the images in them. I can easily access each “folder” from the app I am creating, and I believe this is pretty scalable *if* I create a new folder per client.
Going by your statement “S3 was actually a massively scalable key/value store dressed up to look like a file system, and so it didn’t suffer from the ‘too many files in a directory’ sort of scaling problems”, my idea should be scalable — right? I hope so 🙂 Don’t want to go down the DB route…
Nice read. What did you do re version control (e.g., preventing concurrent update overwrites) as S3 doesn’t support this?