Storage and retrieval

This article is part of the Systems Design Series following the book Designing Data-Intensive Applications by Martin Kleppmann. If you missed the previous article, check that out first.

This chapter is all about how databases store data and how they retrieve it again. There are two main kinds of data stores:

  1. OLTP. This includes your traditional relational and NoSQL databases. They comfortably store terabytes of data, are highly available, and are designed for the end-users of your applications.
  2. OLAP. This includes data warehouses from names like Oracle and IBM Db2. They scale into the petabytes of data, are primarily read-only copies of data from several OLTP databases, and are designed for business analysts.

One thing to note about OLAP data stores is the simplicity of their data models. While the entire last chapter was dedicated to the variety of OLTP data models, there are really only two models in use for OLAP systems:

  1. Star schemas, named after the star-like shape representing the central fact table that points out to supporting dimension tables
  2. Snowflake schemas, which add on sub-dimensional tables to the above star schema with a more complex, snowflake-like pattern

The seemingly obvious comparison is that snowflake schemas are more normalized, and therefore more complex, because of their sub-dimensional tables. Conversely, star schemas are simpler to conceptualize, with less branching. Either way, the fact tables in both tend to be very wide, often with 100+ columns.

Column-store databases for big data

Wide-column fact tables benefit from column-store databases. As the name suggests, you’re inverting the way data is stored and retrieved. Rather than store data in rows, where each entry corresponds to a single object and various properties of that object in columns, you store all of the values for a single property together.

Since a query against a massively wide fact table likely touches only a handful of columns, data organized by column lends itself well to these kinds of analytics queries. You can also partition and shard data by column, which makes searching against particular columns even more efficient.

Since all data in a given column adheres to the same type, you can think of a column as one giant array. This makes column-store databases effective at compressing data with techniques such as bitmap encoding. With more data fitting in memory, column stores can make effective use of CPU caches and spend fewer cycles iterating through data. You can further improve efficiency when grouping or filtering by sorting your columnar data.
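To make the idea concrete, here is a minimal sketch of bitmap encoding in Python; the names are illustrative, and a real column store would also run-length encode the bitmaps.

```python
# Bitmap encoding for a low-cardinality column: one bitmap per distinct
# value, with a bit per row. Long runs of 0s and 1s compress very well.

def bitmap_encode(column):
    """Build one bitmap (list of 0/1 flags) per distinct value."""
    bitmaps = {value: [0] * len(column) for value in set(column)}
    for row, value in enumerate(column):
        bitmaps[value][row] = 1
    return bitmaps

countries = ["US", "DE", "US", "US", "FR", "DE"]
bitmaps = bitmap_encode(countries)
# bitmaps["US"] == [1, 0, 1, 1, 0, 0]

# A filter like `WHERE country IN ('US', 'FR')` becomes a bitwise OR:
matches = [us | fr for us, fr in zip(bitmaps["US"], bitmaps["FR"])]
# matches == [1, 0, 1, 1, 1, 0]
```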

Finally, column stores can leverage highly specialized materialized views called data cubes, which precompute aggregates across multiple dimensions so that a lot of aggregated data can be referenced with fast, simple lookups.
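As a toy illustration of that precomputation (not any particular warehouse’s implementation), here is a two-dimensional “cube” of sales totals, keyed by hypothetical date and product dimensions:

```python
# A toy two-dimensional data cube: precompute the sum of sales for every
# (date, product) cell so aggregate queries become lookups, not scans.
from collections import defaultdict

sales = [
    ("2024-01-01", "widget", 120.0),
    ("2024-01-01", "gadget", 80.0),
    ("2024-01-02", "widget", 95.0),
]

cube = defaultdict(float)
for date, product, amount in sales:
    cube[(date, product)] += amount

# Totals along one dimension collapse the other, and can be precomputed too:
by_date = defaultdict(float)
for (date, _product), amount in cube.items():
    by_date[date] += amount
# by_date["2024-01-01"] == 200.0
```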

All of these make reads incredibly efficient for even the largest data sets. The clear downside is the cost to write data. A table with 500 columns might now be spread out across 500 files, one for each column, so adding a single row means writing 500 values to 500 files. If one of those writes happens to max out a file’s size on disk, you’ll have to perform even more writes to partition and segment your data. In other words, writing is expensive. As we’ll see later in this post, there are structures like LSM-Trees that can effectively speed up data storage.

Data retrieval techniques

Storing data is straightforward. Write data to an append-only log and save the file to disk.

Retrieving data, on the other hand, is where things get interesting. The book presents a progression of increasingly sophisticated techniques for retrieving data efficiently.

Hash index

Indexing is the general strategy for retrieving data quickly. By keeping an additional in-memory structure for just the data you search against, you trade space for speed. The downside, besides the additional space requirements, is the cost of writes: data stored in multiple places requires multiple writes and takes more time.

One way to index is with a hash table that maps each key to its location on disk, just as in-memory hash maps/tables relate keys to values. An example is Bitcask, the default storage engine powering Riak.

As we saw from the YouTube example, we need to count video views. The video offers Cassandra as the data storage mechanism of choice, but the book might suggest that Riak with Bitcask is the ideal fit here. Bitcask keeps an in-memory hash table that maps each key to a byte offset in the data file on disk, separate from the data itself. This means that even with frequent writes to a key, reads stay fast: one hash lookup and one disk seek.
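Here is a minimal sketch of that design: an append-only log on disk plus an in-memory hash table mapping each key to the byte offset of its latest value. The class and method names are illustrative, not Bitcask’s actual API.

```python
import os

class HashIndexStore:
    """Append-only log on disk + in-memory hash index (Bitcask-style)."""

    def __init__(self, path):
        self.index = {}                # key -> byte offset in the log file
        self.f = open(path, "a+b")

    def set(self, key, value):
        self.f.seek(0, os.SEEK_END)
        self.index[key] = self.f.tell()            # index the latest write
        self.f.write(f"{key},{value}\n".encode())  # sequential append

    def get(self, key):
        self.f.seek(self.index[key])   # one hash lookup, one disk seek
        _, value = self.f.readline().decode().rstrip("\n").split(",", 1)
        return value

store = HashIndexStore("views.log")
store.set("video:42", "1001")
store.set("video:42", "1002")   # the old entry stays until compaction
print(store.get("video:42"))    # -> 1002
```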

Everything about this sounds great, right? We have a method that is:

  • Scalable: we can shard the data on disk while keeping the indexes in memory
  • Fault-tolerant: if the database crashes, the index can be rebuilt by rereading the log from disk (Bitcask speeds this up by snapshotting each segment’s hash map to disk)
  • Eventually consistent: indexes can now be updated independently of the data stored on disk
  • High throughput: the right hashing strategy limits collisions, which allows for a lot of conflict-free writes

There are two limitations with hash indexes:

  1. Slow range queries. Keys in a hash index are not stored in sorted order, so scanning a range of values means looking up every key individually; you gain no speed searching a range.
  2. Size limitations. The hash table of keys must fit in memory. If you run out of memory, the whole thing falls apart, so applications with an extreme number of distinct keys are the one case where this design doesn’t scale.

SSTable

The next evolution of hash indexes is Sorted String Tables, also called SSTables. SSTables provide some additional benefits:

  • They only store keys once, saving on storage space.
  • They sort the keys to further speed lookups.
  • They use a variant of merge sort to merge and compact segments.
  • The in-memory buffer of incoming writes (the memtable) can be implemented as a red-black or AVL tree, allowing you to write in any order and read back in sorted order.

I didn’t touch on segments earlier. A segment is just a means of sectioning off a data log: to prevent any one file from growing without bound, you break the log into segment files. To reduce the space requirements, we compact segments by removing duplicate keys and keeping only the most recent write to each key.
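A minimal sketch of that merge-and-compact step, assuming each segment is a list of (key, value) pairs sorted by key and the newer segment wins on duplicates:

```python
import heapq

def merge_segments(newer, older):
    """Merge two sorted segments, keeping only the newest value per key."""
    merged = []
    # Tag entries with the segment's age so equal keys sort newest-first.
    entries = heapq.merge(
        ((key, 0, value) for key, value in newer),
        ((key, 1, value) for key, value in older),
    )
    for key, _age, value in entries:
        if merged and merged[-1][0] == key:
            continue                   # older duplicate: drop it
        merged.append((key, value))
    return merged

newer = [("apple", 9), ("cherry", 3)]
older = [("apple", 1), ("banana", 2)]
print(merge_segments(newer, older))
# -> [('apple', 9), ('banana', 2), ('cherry', 3)]
```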

This scheme, an in-memory balanced tree (the memtable) that is periodically flushed out to a cascading tier of SSTables, is aptly called a Log-Structured Merge-Tree, or LSM-Tree for short. Lucene, the Apache search library that powers Elasticsearch and Solr, uses something like an LSM-Tree to store its dictionary of terms, with SSTable-like files mapping those terms to postings.

More specifically, if you’re dealing with a dictionary of words, you’d want a structure that tailors itself to strings instead of numbers. We’ve already evolved past the hash table, so that is out. A self-balancing binary tree is out, too. Prefix structures like tries are specific to strings but suffer from the same problem of taking up lots of space and eating away at RAM. So what’s left? That’s where the mighty Bloom filter comes to the rescue.

A Bloom filter is a probabilistic data structure that is sort of an inversion of other data structures. Rather than telling you exactly what it contains, a Bloom filter tells you that an element is definitely not in the set or that it might be in the set. The “might” part here is key because it’s not guaranteeing membership: there is a chance of a false positive, though never a false negative.

With a sufficiently large bit array and the right number of hash functions, you can drive the false-positive rate low enough to get extremely accurate results from a very small amount of indexing space. This addresses our biggest problem with indexes: when the data is really large and you’re worried you’ll run out of in-memory storage for your index, a Bloom filter can handle extreme scale and still perform well, with the tradeoff that your results aren’t always 100% accurate.
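Here is a minimal Bloom filter sketch, using a single Python integer as the bit array and k salted hashes per key; the sizes are illustrative, not tuned.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=5):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                  # a big int doubling as a bit array

    def _positions(self, key):
        for i in range(self.k):        # k salted hashes -> k bit positions
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # Any zero bit means "definitely not present"; all ones means
        # "might be present" (false positives possible, false negatives not).
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("kitty42")
print(bf.might_contain("kitty42"))    # True
print(bf.might_contain("doggo7"))     # False (almost certainly)
```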

B-tree

We’ve talked about efficiently retrieving data in memory. What if you want to retrieve data on disk? The B-tree is the most widely used and heavily studied database index structure. I had to build one of these for my relational database class in college. It’s another self-balancing tree structure, used in all of your major RDBMSs like Postgres and MySQL.

While both LSM-Trees and B-trees contain key-value pairs sorted by key, the strategy for laying out data is completely different. Log-structured indexes break the database into variable-size segments, which can be quite large, spanning multiple megabytes. B-tree indexes, by contrast, use fixed-size blocks of typically 4 KB. Since disk pages are also aligned in fixed-size blocks, the B-tree is better suited to sorting and retrieving data on disk rather than in memory.

B-trees are structured in layers, where each node in a lower level covers the range of keys that falls between two adjacent keys in its parent. As we saw earlier, hash indexes suffer from poor performance on range queries. B-trees do not.

The branching factor determines how wide each layer of the tree is before the range must be split down to a lower layer. With conventional 4 KB pages and a branching factor of 500, a tree only four levels deep can store up to 256 terabytes of data. That’s a flat and efficient tree!
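A minimal sketch of how a lookup walks those layers (structure only; a real B-tree also handles inserts and page splits):

```python
import bisect

class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys = keys              # sorted separator (or leaf) keys
        self.children = children      # child pages; None for leaves
        self.values = values          # row data; None for interior nodes

def search(node, key):
    i = bisect.bisect_left(node.keys, key)
    if node.children is None:         # leaf page: exact match or miss
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None
    if i < len(node.keys) and node.keys[i] == key:
        i += 1                        # keys equal to a separator live to its right
    return search(node.children[i], key)

leaf1 = Node(keys=[10, 20], values=["a", "b"])
leaf2 = Node(keys=[30, 40], values=["c", "d"])
root = Node(keys=[30], children=[leaf1, leaf2])
print(search(root, 40))               # -> 'd', reached in two levels

# The capacity math: 500**4 pages * 4 KB/page ~= 256 TB in just 4 levels.
```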

In terms of system design, how does this compare in performance to the in-memory solutions?

  • It is also scalable because the partitions are small, 4 KB pages.
  • The data structure inherently supports range queries as a result of the branching factor.
  • Fault tolerance is achieved through a write-ahead log (WAL), which writes each change to the WAL before it is applied to the B-tree (hence, write-ahead); see the sketch after this list.
  • Locks on the B-tree called latches can be introduced to preserve consistency in an ACID-compliant manner, achieving a higher level of consistency than with a log-structured index.
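Here is a minimal sketch of the write-ahead idea: the change record is flushed to the log on disk before the tree itself is modified, so a crash can be recovered by replaying the log. The names are illustrative.

```python
import os

class Database:
    def __init__(self, wal_path):
        self.wal = open(wal_path, "ab")
        self.tree = {}                 # stand-in for the real B-tree pages

    def put(self, key, value):
        self.wal.write(f"SET {key} {value}\n".encode())
        self.wal.flush()
        os.fsync(self.wal.fileno())    # durable *before* touching the tree
        self.tree[key] = value         # now safe to modify the page

    @staticmethod
    def recover(wal_path):
        db = Database(wal_path)
        with open(wal_path, "rb") as f:
            for line in f:             # replay every logged change on restart
                _, key, value = line.decode().rstrip("\n").split(" ", 2)
                db.tree[key] = value
        return db
```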

When comparing log-structured against B-tree indexes, the main rule of thumb is write performance versus read performance. Though performance testing should confirm this for your implementation, log-structured indexes are generally considered better for writes, since they turn random writes into sequential appends. Conversely, B-trees are considered better for reads: an LSM-Tree read may have to check the memtable and several SSTable segments, on top of the background segmenting and compaction competing for resources.

On the other hand, B-trees offer a more predictable indexing scheme because each key lives in exactly one place and is not rewritten over and over by compaction. Data is divided into small, fixed-size pages, and no sizable background computation is required for compaction and segmentation. This is particularly useful for transactional databases where ACID guarantees are a must, which would explain why B-trees have stood the test of time for so long.

Other indexes

The book offers a few more indexing suggestions, such as secondary indexes for joins; nonclustered indexes, where the value is a reference into a heap file, versus clustered indexes, where the row data is stored directly within the index; multi-column indexes like R-trees; and fuzzy-search indexes like those we talked about with Elasticsearch and Solr.

Then there are data stores that keep the whole database in memory, as is the case with Memcached or Redis. Now the entire thing, both the index and the raw data, is stored in memory; you only need a hard disk for fault tolerance and durability. Further, in-memory databases allow for more flexibility in the data structures they offer. It is not uncommon to interface directly with a priority queue or a set in something like Redis.
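For instance, a Redis sorted set works naturally as a priority queue. This sketch assumes a Redis server on localhost and the redis-py client package:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Enqueue jobs with a numeric priority as the score.
r.zadd("jobs", {"send-email": 2, "resize-image": 1, "charge-card": 0})

# Pop the highest-priority (lowest-score) job atomically.
print(r.zpopmin("jobs"))   # -> [(b'charge-card', 0.0)]
```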

