Not that Cap

An idea!

Let’s pretend you woke up this morning and thought, “Today’s the day! Today’s the day I write a key-value store!”

You’re going to make a great key-value store. It’s going to take keys and values and store them. It’s going to let you access and delete values for a given key. Your minimum viable product comes together quickly – you pick a language, wrap some functionality around a hash map, and call it a day.
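That first version might look something like the sketch below: a thin wrapper around a Python dictionary. The class and method names are made up for illustration.

```python
class KeyValueStore:
    """A thin wrapper around a hash map: store, fetch, and delete values by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("abc", 123)
assert store.get("abc") == 123
store.delete("abc")
```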

The next day, you wake up and think, “Hey, maybe I can make this better. Maybe I’ll make it distributed!” You realize this will add some complexity to your key-value store, and you think it’ll be just the thing you need to differentiate yourself from all the other fancily-dressed hash maps.

So, what’s special about making your key-value store distributed? For one, it’s more complicated, but you know deep down that’s more of a drawback than a benefit. Fault tolerance is a possible benefit. You like the idea of having multiple instances of your store, with the rest continuing to work if something happens to one of them.

You sit down and you work feverishly, and after a while, you’ve made your key-value store distributed. You named it Hydra. Hail Hydra.

Hydra works by having a set of nodes that know about each other. When a client contacts a node to update the value for a given key, that Hydra node updates its own internal store, and then broadcasts a message to all the other Hydra nodes to tell them about the update. Let’s say we have nodes A, B, and C. If someone tells node A to store the value 123 at the key abc, node A will first add abc: 123 to its internal state, and then broadcast an update to B and C to tell them about the new value it just got.
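A rough sketch of that update path is below. It’s illustrative only: the nodes live in one process, and in-process method calls stand in for the network messages a real Hydra would send.

```python
class HydraNode:
    """One Hydra node: applies a write locally, then broadcasts it to its peers."""

    def __init__(self, name):
        self.name = name
        self.peers = []   # the other HydraNode instances in the cluster
        self.store = {}   # this node's internal key-value state

    def handle_client_put(self, key, value):
        # 1. Update our own internal store first.
        self.store[key] = value
        # 2. Broadcast the update to every other node, fire-and-forget.
        for peer in self.peers:
            peer.handle_replicate(key, value)

    def handle_replicate(self, key, value):
        # A peer accepted a write and is telling us about it.
        self.store[key] = value


# Nodes A, B, and C all know about each other.
a, b, c = HydraNode("A"), HydraNode("B"), HydraNode("C")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]

a.handle_client_put("abc", 123)   # a client writes to node A
assert b.store["abc"] == c.store["abc"] == 123
```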

You take Hydra and release it into the wild. You deploy it somewhere, you make a website, and you offer it as a service. You gain a few users, and then a few more, and then several more! It’s growing quite rapidly. However, as it grows, you begin to get disturbing bug reports.

Users are saying that occasionally they’ll see keys with stale values, but when they query again, they get the latest value. You look at the logs of your nodes, and you don’t see anything obviously wrong. What could be the problem here? To steer us in the right direction, let’s look at one of the core ideas of distributed systems – the CAP theorem!

CAP theorem

Eric Brewer proposed in the late 1990s that a web service can only provide two out of the following three guarantees – consistency, availability, and partition tolerance. This concept, known as the CAP theorem or Brewer’s theorem, is one of the building blocks of distributed systems theory today.

What are these three things? Why are they important? Let’s start with consistency.

In their proof of Brewer’s theorem, Seth Gilbert and Nancy Lynch define this consistency guarantee as requiring a total ordering of operations in which each operation looks as if it completed instantaneously. In other words, a read should always return the most recent write – no stale values. This seems relevant to the issues we’re having with Hydra, but let’s not stop here.

Availability is similarly straightforward – every request eventually returns a successful response. Gilbert and Lynch note that there’s no bound on how long a request can take, as long as it eventually returns.

Partition tolerance is a little trickier. First, we need to define what a partition is. In a network partition, some subset of nodes in the system is incapable of communicating with another subset of nodes. All messages sent between the two groups are lost. Partition tolerance, therefore, is the ability to handle this situation.

The importance of these characteristics is self-evident. Always returning the most recent write, all requests eventually returning, and handling node failures gracefully – these are desirable in every system! It seems cruel that we can only have two of the three, but as we’ll see later, it’s not actually as strict as the theorem makes it out to be.

Choose your own distributed system

If we can only have two of the three guarantees, let’s take a look at what the choices mean for our system. We’ve got three options – a consistent and partition tolerant system, an available and partition tolerant system, and a consistent and available system.

In a consistent-partition tolerant system, we’re choosing to sacrifice availability. Our system will always be available when things are going well, but when a network partition (the thing we’re tolerating) occurs, the system chooses to stop responding to requests or otherwise exhibits unavailability. In these systems, there’s often a designated master node that is in charge of receiving and handling all updates. Losing this node can cause unavailability, as can losing enough nodes to block the election of a new master.

In an available-partition tolerant system, we’re sacrificing the guarantee that users always read the most recent write. In a network partition, we keep responding to requests, but they might return a stale value. If some subset of nodes is partitioned from another subset, they don’t stop returning values, but instead either keep on working normally or continue operating in a degraded capacity. For example, if a partitioned node can’t contact a majority of the other nodes, a system might choose to not accept any writes at that node, or it might return its last known value, which isn’t guaranteed to be the most recent one.
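To make the contrast between these two choices concrete, here’s a small sketch of how a single node might answer a read during a partition under each policy. The majority check and the mode flag are illustrative, not taken from any particular system.

```python
def has_majority(reachable_peers, cluster_size):
    """True if this node plus the peers it can still reach form a majority of the cluster."""
    return (1 + len(reachable_peers)) > cluster_size // 2


def handle_read(key, store, reachable_peers, cluster_size, mode):
    if mode == "CP":
        # Consistent + partition tolerant: refuse to answer rather than risk a stale value.
        if not has_majority(reachable_peers, cluster_size):
            raise RuntimeError("unavailable: partitioned away from the majority")
        return store.get(key)
    # "AP": available + partition tolerant: always answer, even if the value might be stale.
    return store.get(key)


# A node cut off from the rest of a 3-node cluster.
local_store = {"abc": 123}
print(handle_read("abc", local_store, reachable_peers=[], cluster_size=3, mode="AP"))  # 123, possibly stale
# handle_read("abc", local_store, reachable_peers=[], cluster_size=3, mode="CP")       # raises: unavailable
```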

What about a consistent-available system? We’ve talked about how we can choose to respond in the case of a network partition, but what happens if we look at an unreliable network and say, “Not today”? It turns out that this isn’t practical.

The fallacies of distributed computing tell us that the network is a scary place, with changing topologies and latencies and chaos everywhere. By choosing to ignore partition tolerance in a distributed system, we’re opting into failure from the very beginning.

While the CAP theorem says that you can only pick two out of three of consistency, availability, and partition tolerance, the real world tells us that forgetting partition tolerance isn’t an option. Accordingly, distributed systems have to balance between consistency and availability, knowing they’ll have to sacrifice somewhere to meet their use cases and requirements. Coda Hale has an awesome blog post going into further detail about why you really can’t pick consistent-available. What’s the point of having a distributed system if it isn’t partition tolerant?

Poking holes in the CAP theorem

When applied, the CAP theorem tells us that we have to choose between consistency and availability. This choice seems very binary, but many systems “cheat” it by weakening consistency to provide stronger availability (or vice versa).

In our example systems above, we discussed how an available-partition tolerant system might reject a write request if it can’t contact a certain number of other nodes in the system. You might have thought, “This doesn’t sound like availability!”

In practice, consistency and availability are treated as more of a sliding scale. A distributed system might choose to have weaker availability guarantees so that it can meet a desired level of consistency. These tradeoffs are how the CAP theorem manifests itself in implementations – you can’t have both 100% availability and the strongest possible level of consistency. You have to make sacrifices in one (or both) areas. Let’s take a look at an actual system and show how it addresses these tradeoffs.

Cassandra, a distributed NoSQL database, is typically classified as an available-partition tolerant system. While classifying systems according to the CAP theorem is often a giant oversimplification, it’s helpful here as a quick frame of reference.

In Cassandra, each node is in charge of coordinating with other nodes to persist data and respond to requests. When a user attempts a write, the receiving node persists the data to a tunable number of replicas. Changing the required number of replicas for a successful write lets users adjust Cassandra’s consistency and availability guarantees.

The consistency level can be set to require a write to be replicated to every other node in the system before being successful. By requiring it to be copied to every other node, every node will have that most recent write. This requirement, however, means that any node failure will block all writes to the system. Maybe that works well for you, but maybe it doesn’t.

If it doesn’t work well for you, you can take it to the opposite extreme, where every write only requires the data to be on a single node – the one that receives the request. In this scenario, as long as there’s one good and healthy node, your write will succeed. Data replication will spread it out, but if you try to request the data before it’s been replicated, you’ll get a stale read.

In between these two extremes, the tradeoffs come into play. You could tell Cassandra to require a quorum of nodes for a successful write, meaning the write needs to be successfully replicated to a majority of all nodes. Our availability guarantees are stronger than requiring the data to be successfully replicated everywhere, as we can now tolerate the failure of almost half of the nodes. Our consistency guarantees are stronger than requiring just one successful replica, because a majority of the replicas now have the data, making us less likely to get a stale read.
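A quick back-of-the-envelope calculation shows why quorum sits in the middle. This assumes, as in the discussion above, that every one of the N nodes holds a replica.

```python
def quorum(n):
    """Smallest number of nodes that forms a majority of an n-node cluster."""
    return n // 2 + 1

n = 5
print(quorum(n))        # 3: a write must reach 3 of the 5 nodes to succeed
print(n - quorum(n))    # 2: writes can still succeed with up to 2 nodes down
# After a quorum write, a majority of nodes holds the new value, so any single
# read is more likely (though still not guaranteed) to see the latest write.
```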

Just like we can tune our availability and consistency on writes, Cassandra lets us adjust the requirements for reads as well. We can tell it to read from all nodes before returning a response, guaranteeing us the most recent value. Similarly to our writes, we can tell it to return the response from a single node, meaning we might get a stale value. We are guaranteed to get a value, assuming there’s a healthy node that can service the request.
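As a concrete illustration, here’s roughly what that tuning looks like with the DataStax Python driver. The consistency levels are Cassandra’s own; the keyspace demo and table kv used here are made up for the example.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")   # hypothetical keyspace with a kv(key, value) table

# Write: require a majority of replicas to acknowledge before reporting success.
write = SimpleStatement(
    "INSERT INTO kv (key, value) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("abc", "123"))

# Read: ask a single replica -- fast and available, but possibly stale.
# Swap in ConsistencyLevel.ALL to insist on hearing from every replica instead.
read = SimpleStatement(
    "SELECT value FROM kv WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read, ("abc",)).one()
```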

Cassandra shows how the CAP theorem plays out in practice. By giving us multiple options, we can choose what’s important to us and decide for ourselves how we want the system to respond in both normal operations and failure scenarios.

An Alternative CAP

If systems often adopt a hybrid approach, then we should consider a more nuanced view than the one the CAP theorem provides. In 2012, Daniel Abadi proposed an alternative way of classifying a distributed system, called PACELC (pronounced pass-elk).

The CAP theorem is all about how a system behaves when confronted with a network partition. When that’s not happening, the system isn’t really choosing between consistency and availability. Abadi points out that the choice is actually between consistency and latency.

We saw this behavior when we looked at Cassandra. Requiring writes to be replicated to every node gives us stronger consistency, but it takes longer: the node we’re writing to has to tell every other node in the system about the new write, and if any of those requests is slow, we have to wait. On the other hand, if we only tell a single node about our write, there’s far less room for latency to creep in. But if we only tell one node, then we’re likely to hit stale reads.
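A toy calculation makes the latency side of that tradeoff visible. The acknowledgement times below are invented numbers, not measurements from any real cluster.

```python
# Hypothetical acknowledgement times (in ms) from three replicas for a single write.
ack_ms = [4, 6, 250]   # one replica is having a bad day

print(max(ack_ms))        # wait for ALL replicas: 250 ms, the slowest replica sets the pace
print(min(ack_ms))        # wait for ONE replica: 4 ms, fast but the others may lag behind
print(sorted(ack_ms)[1])  # wait for a QUORUM (2 of 3): 6 ms, the second-fastest ack
```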

It’s this tension during happy times that PACELC helps explore and classify. “If there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?”

When there’s a network partition, a system trades off between availability and consistency, just like the CAP theorem says. However, when things are going well, a system makes choices that are biased towards stronger consistency or reduced latency. In this framework, Cassandra is a PA/EL system: it picks availability when there’s a partition, and otherwise it chooses to reduce latency.

If a system is undergoing network partitions all the time, something’s not great. PACELC is useful here because it helps explain the choices and tradeoffs the system makes for the majority of its time – when things are happy and successful. This framework doesn’t just help us understand existing systems; it helps us when designing them as well.

Back to Hydra

You weren’t thinking about CAP or PACELC when you started writing Hydra, but now that you know about them, you can see some of the flaws in your system. Users are seeing stale values, but it seems they’re expecting consistency from you. Perhaps this was a marketing problem?

After learning about CAP, PACELC, and how systems apply these, you’ve got a greater understanding of what you want Hydra to be. You want Hydra to be resilient to failures, and after user feedback, you want to avoid stale reads. When you sit down and whiteboard it out, you realize how problematic a network partition is to the current iteration of Hydra. If a replication message is dropped or lost along the way, the nodes that missed it will keep serving the old value until the key is overwritten. How do you get all nodes to agree?
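That failure mode is easy to see in a tiny sketch, with plain dictionaries standing in for the nodes’ internal stores and the lost message simulated by simply never delivering it.

```python
# Node A accepts a new write and broadcasts it, but the message to node C is lost.
a_store, b_store, c_store = {"abc": 123}, {"abc": 123}, {"abc": 123}

a_store["abc"] = 456      # A applies the client's new write
b_store["abc"] = 456      # the replication message to B arrives
# ...the replication message to C is dropped, and Hydra never retries it.

print(c_store["abc"])     # still 123, and it stays stale until "abc" is written again
```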

This question hits at one of the hard problems in distributed systems – consensus. How can we design Hydra in such a way that we’re tolerant of failures and provide our desired level of consistency? CAP and PACELC are great for telling us what things we need to be aware of when designing a system, but they’re less helpful in actually designing that system. We know that consistency and availability are in constant tension, but how do we resolve that? To do this, we’ll next take a look at Raft, a consensus algorithm that helps us design a consistent and partition tolerant system.