One of the key principles you learn about relational database design is always normalize your data. This gets quietly thrown out of the window once you have to scale up web applications, either to cope with large numbers of users, or large amounts of data. I’m hitting this point with the 500,000 emails in the Enron collection, and chatting with Kwin from Oblong made me realize what a dirty little industry secret this optimization is.
Normalization means storing any fact in just one location in your database, so that you avoid update anomalies where you have conflicting information in multiple places. For instance, if you have someone’s address stored in several tables, you might forget and only update one of them when it changed, but still be using their old address for queries relying on the other tables. The downside to normalization is that many queries require joins to fetch all the information you need, and these can be slow on large data sets.
Denormalization means duplicating facts across different tables so that reads can avoid joins and so run much faster. It’s dangerous because you lose the safety net of automatically robust updates, and you have to make your data-writing code more complex. For my work on mail, and for most web services, the real time is spent on reading data, which is why it’s such an appealing optimization.
It’s actually just another form of caching, on the memory-intensive end of the classic performance/memory tradeoff spectrum. Memcached is another layer of caching that works well if you’ve got a lot of repeated queries, though again it complicates the update logic. Indexing within a database is another form of caching frequently needed data, though that’s handled behind the scenes for you.
There’s some fascinating case studies out there on how sites like Ebay and Flickr have broken all the old rules to get the performance they need. Google’s BigTable doesn’t specify anything about normalization, but the fact that it’s a simple map between keys and values, with no complex queries possible, makes it very tempting to duplicate your data with keys for the common read operations.