This article is part of the Systems Design Series following the book Designing Data-Intensive Applications by Martin Kleppmann. If you missed the previous article, check that out first.
This chapter covers another central tenet of distributed systems: partitioning. Also known as sharding, partitioning is how we split large data sets into logical chunks.
Strategies for partitioning
If replication is about copying all data to different databases and datacenters, then partitioning is about slicing up that data across those databases and datacenters.
There are a few ways you can split up that data:
- By key. The key can split the rows of data. Imagine 26,000 names evenly split alphabetically. You could put 1,000 entries in a database for the A user names, 1,000 more for the B names, and so on.
- By hashed key. You could hash the same key and then split up the data that way as well. The easiest way to ensure information is evenly split is to use a modulo (%) operator (hint: just because it is easy doesn't mean it is helpful). If you run the modulo over the number of datacenters, you will always have a key mapped into a datacenter ID.
- Add randomness for active keys. The above assumes that all keys are read and written in an even distribution. For the ones that are not, you can break up keys further with prefix or suffix IDs. Now the same key can be segmented across shards, as the first sketch after this list shows.
- By a secondary key. Like the prefix/suffix keys, you can use an additional key with a secondary index. This makes searching the partitioned data faster.
- For local document search. You could make the secondary key the index key, with the value being the original primary key you added to the shard in the first place; each shard indexes only its own documents.
- For global document search. Also called a term index, the index covers documents from every shard, but the index itself is partitioned by term. This is how full-text search databases work. Both flavors are shown in the second sketch after this list.
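To make the hash and hot-key ideas concrete, here is a minimal Python sketch. The partition count, the hot-key set, and the fan-out value are all hypothetical; the point is only that hashing spreads keys evenly, while a random suffix spreads a single hot key across several partitions at the cost of fanning out reads.

```python
import hashlib
import random

NUM_PARTITIONS = 8          # hypothetical cluster size
HOT_KEY_FANOUT = 4          # how many sub-keys a hot key is spread across
HOT_KEYS = {"celebrity_42"} # keys known to receive disproportionate traffic

def partition_for(key: str) -> int:
    """Map a key to a partition with a stable hash (not Python's hash(),
    which is randomized per process)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def write_key(key: str) -> tuple[str, int]:
    """For hot keys, append a random suffix so writes spread across partitions."""
    if key in HOT_KEYS:
        key = f"{key}#{random.randrange(HOT_KEY_FANOUT)}"
    return key, partition_for(key)

def read_key(key: str) -> list[int]:
    """Reads of a hot key must fan out to every suffix it may live under."""
    if key in HOT_KEYS:
        return sorted({partition_for(f"{key}#{i}") for i in range(HOT_KEY_FANOUT)})
    return [partition_for(key)]

print(write_key("alice"))          # e.g. ('alice', 3)
print(write_key("celebrity_42"))   # e.g. ('celebrity_42#1', 5)
print(read_key("celebrity_42"))    # every partition the suffixes could map to
```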
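And here is a second sketch contrasting the two secondary-index flavors, using made-up data: a document-partitioned (local) index must scatter the query to every shard and gather the results, while a term-partitioned (global) index answers from the single shard that owns the term.

```python
# Hypothetical data: user records partitioned by user id (the primary key).
partitions = [
    {"u1": {"name": "Ada",  "color": "red"},
     "u4": {"name": "Bob",  "color": "blue"}},
    {"u2": {"name": "Cleo", "color": "red"},
     "u3": {"name": "Dan",  "color": "green"}},
]

def local_index_search(color: str) -> list[str]:
    """Document-partitioned (local) index: each shard indexes only its own
    rows, so a query must scatter to every shard and gather the results."""
    hits = []
    for part in partitions:
        hits += [uid for uid, row in part.items() if row["color"] == color]
    return hits

# Term-partitioned (global) index: the term itself determines which shard
# holds its entry, so one lookup answers the query.
global_color_index = {
    "red":   ["u1", "u2"],  # lives on whichever shard owns the term "red"
    "blue":  ["u4"],
    "green": ["u3"],
}

def global_index_search(color: str) -> list[str]:
    return global_color_index.get(color, [])

print(local_index_search("red"))   # ['u1', 'u2'] after querying every shard
print(global_index_search("red"))  # ['u1', 'u2'] from a single shard
```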
Strategies for rebalancing
Over time, these partitioning strategies will skew unevenly. The easiest way to visualize this is to imagine the old hardcover encyclopedias. They are not 26 volumes split by letter. Instead, letters like Q and X are combined while A and S get their own book.
While you may have an even balance of data to start, it will not necessarily evolve that way. You will need methods to rebalance your data from time to time.
Remember how I suggested that we use the modulo operator to segment our keys across the datacenters? This works until you add more datacenters. Then your modulo value needs to change, and you need to rebalance all of your data, which creates a lot of rewriting as your data scales. This is not ideal. Here are a few better ways:
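A quick sketch shows why: going from ten nodes to eleven changes the result of the modulo for roughly nine out of every ten keys, so nearly all of the data has to move.

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    """Naive modulo placement: hash the key, then mod by the node count."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_nodes

keys = [f"user_{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
print(f"{moved / len(keys):.0%} of keys change nodes")  # roughly 90% move
```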
- With a fixed number of partitions. Rather than use the physical shards as segments, create arbitrary partitions within each shard node. That way, as your shards grow or shrink, the number of partitions stays the same. Then, you simply adjust how many partitions you allocate to each node (see the sketch after this list). Riak and Elasticsearch use this method.
- With a dynamic number of partitions. The previous approach is sensible but suffers from the same problem as before: the data can still grow in skewed portions, and if your partition boundaries are chosen poorly, your data will be lopsided. With dynamic rebalancing, the database splits partitions that grow too large and merges ones that shrink, much like a B-Tree restructures itself as keys are added and removed. The data evolves, and so too do the partitions. HBase and MongoDB use this method.
- With a proportional number of partitions to nodes. This is somewhat of a blended approach. The number of partitions is not strictly fixed; it changes as the number of nodes grows and shrinks. This is not the same as dynamic rebalancing, because the partition count tracks the number of nodes rather than the size of the data or boundaries that you define. Cassandra uses this method.
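As a rough illustration of the fixed-partition approach (the numbers and the rebalancing policy below are invented, not any particular database's), keys always hash to one of a fixed set of partitions, so adding a node only moves whole partitions, and only as many as are needed to even out the load.

```python
NUM_PARTITIONS = 256  # fixed when the cluster is created (hypothetical value)

def rebalance(assignment: dict[int, str], nodes: list[str]) -> dict[int, str]:
    """Move just enough whole partitions so each node holds roughly its fair
    share. Keys never re-hash: a key's partition is stable for life."""
    assignment = dict(assignment)
    target = NUM_PARTITIONS // len(nodes)
    counts = {n: 0 for n in nodes}
    for owner in assignment.values():
        counts[owner] += 1
    for partition, owner in assignment.items():
        if counts[owner] > target:
            # hand this partition to any node still under its fair share
            for candidate in nodes:
                if counts[candidate] < target:
                    counts[owner] -= 1
                    counts[candidate] += 1
                    assignment[partition] = candidate
                    break
    return assignment

# Three nodes own all 256 partitions; then a fourth node joins.
before = {p: ["node-a", "node-b", "node-c"][p % 3] for p in range(NUM_PARTITIONS)}
after = rebalance(before, ["node-a", "node-b", "node-c", "node-d"])
moved = sum(1 for p in before if before[p] != after[p])
print(f"{moved} of {NUM_PARTITIONS} partitions moved")  # 64 move, 192 stay put
```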
These strategies all depend on the configurations you make as a developer. You can choose how you want to deploy this rebalancing: manually or automatically.
With manual rebalancing, a developer assigns partitions to nodes. This requires more effort and time, but it allows you to respond to the needs of your data. Conversely, automatic rebalancing lets the software do this work for you. Tools like Couchbase and Riak take this approach, generating a suggested assignment that an administrator approves, but the downside is that the software can make unpredictable and possibly suboptimal choices for segmenting and rebalancing your data. A wrong move by the software could be slow and costly.
How to find data
With all of this data split up among so many physical computers, how do you go about finding it?
- You can keep track of all of them on the client. The obvious solution is for the client to track each shard, the same way you keep your encyclopedias facing out on your bookshelf: when you walk up, you can see which volume holds the letter you want. The downside is that this is tedious and must be reconfigured every time you add or remove nodes.
- You can have your request routed for you. A middle layer can interpret your query or mutation and figure out where your data is for you. This is like in the old days of telephones when operators connected you to the person on the other end. You did not need to know their phone number. You just needed to remember the number of the operator so they could route the call for you.
- You can throw a request to any node, and that node will route it. This is like calling up a corporation and going through their touchtone service. You know you want to reach Dell customer support, so all calls go to a 1-800 number, but it is up to the service to find the exact extension for you. The main number might connect you directly if you know your party's extension, but otherwise you navigate the tree of options until you land in the right place.
Tools like Apache ZooKeeper handle this configuration management in many popular databases like HBase and Kafka. Engines like Riak and Cassandra use a gossip protocol to chain requests from node to node until the data is found.
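As a tiny illustration of the routing-tier option (the key-range boundaries and node names below are made up), the router holds the partition map, the kind of metadata a coordination service such as ZooKeeper would keep, and a binary search over the range boundaries tells it which node owns a key.

```python
import bisect

class Router:
    def __init__(self, boundaries: list[str], nodes: list[str]):
        # Key-range partitioning: boundaries[i] is the first key of partition i+1.
        assert len(nodes) == len(boundaries) + 1
        self.boundaries = boundaries
        self.nodes = nodes

    def node_for(self, key: str) -> str:
        """Find which partition's key range contains the key, then its node."""
        return self.nodes[bisect.bisect_right(self.boundaries, key)]

router = Router(boundaries=["g", "n", "t"],
                nodes=["node-1", "node-2", "node-3", "node-4"])
print(router.node_for("alice"))    # node-1 (keys before 'g')
print(router.node_for("mallory"))  # node-2 (keys 'g' up to 'n')
print(router.node_for("zed"))      # node-4 (keys from 't' onward)
```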
Partitioning is straightforward and not nearly as dense as replication. There are only so many ways you can split up data. In the first post of this series, I noted I would be skipping the next chapter on transactions. The last sentence of the chapter explains why:
In this chapter, we explored ideas and algorithms mostly in the context of a database running on a single machine. Transactions in distributed databases open a new set of difficult challenges, which we’ll discuss in the next two chapters.
Since I care about educating this audience on distributed systems and systems design, it seems only fair to focus on the chapters that tackle transactions in a distributed setting.