dmai/blog

System Design

CAP Theorem & Distributed Consensus

Mar 21, 2026 · 12 min read

In any distributed system, partition tolerance is a must. Network failures will happen, and your system needs to handle them. This means that in practice, CAP theorem really boils down to a single choice: do you prioritize consistency or availability when a network partition occurs?

Once you understand that tradeoff, the next question is: how do nodes actually agree on anything? That's where consensus algorithms come in. Raft is the most approachable one, and it's the engine behind systems like etcd, Consul, and CockroachDB.

The CAP Theorem

Three Guarantees, Pick Two

The CAP triangle: Consistency, Availability, Partition Tolerance - pick two guarantees

Eric Brewer proposed in 2000 that a distributed data store can only provide two of three guarantees at the same time:

→Consistency - Every read returns the most recent write, or an error. All nodes see the same data at the same time.
→Availability - Every request gets a non-error response, even if some nodes are down. The system always answers.
→Partition Tolerance - The system keeps operating even when network messages between nodes are lost or delayed.

The “pick two” framing is a bit misleading. In any distributed system, network partitions will happen. Switches fail. Cables get cut. Cloud availability zones lose connectivity. You don't get to opt out of P.

So the real choice is: when a partition occurs, do you sacrifice consistency or availability?

→CP systems: Reject requests they can't confirm are consistent. You might get an error, but you'll never get stale data.
Examples: HBase, ZooKeeper, etcd, MongoDB (strong reads)
→AP systems: Always respond, even if the data might be outdated. You get an answer, but it might not be the latest one.
Examples: Cassandra, DynamoDB, CouchDB, DNS

CP vs AP in Action

CP system vs. AP system during a network partition: a node cut off from its leader or peers must either reject the request or serve what it has

During a partition, a CP system would rather return an error than risk serving outdated data. A banking system does this: if the node handling your request can't verify your balance with the primary, it refuses the transaction.

An AP system takes the opposite approach. A social media feed keeps loading even if some backend nodes are unreachable. You might see a post from 30 seconds ago instead of the absolute latest, but the app stays responsive.
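The two behaviors above can be sketched as a toy replica. This is a minimal illustration, not a real client library: the `Replica` class, `PartitionError`, and the `read_cp`/`read_ap` methods are all invented names for this example.

```python
from dataclasses import dataclass

class PartitionError(Exception):
    """Raised by a CP replica that cannot confirm it has the latest data."""

@dataclass
class Replica:
    value: str          # locally stored (possibly stale) value
    partitioned: bool   # True if cut off from the leader/peers

    def read_cp(self) -> str:
        # CP choice: refuse to answer rather than risk serving stale data.
        if self.partitioned:
            raise PartitionError("cannot confirm latest value")
        return self.value

    def read_ap(self) -> str:
        # AP choice: always answer, even if the value may be outdated.
        return self.value

node = Replica(value="balance=500", partitioned=True)
print(node.read_ap())        # AP: serves the possibly-stale local value
try:
    node.read_cp()           # CP: raises instead of risking staleness
except PartitionError as e:
    print(f"CP read failed: {e}")
```

Same node, same data, same partition: the only difference is which guarantee the read path chooses to break.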

Neither is universally better. The right choice depends on what your users can tolerate: stale data or downtime.

Beyond CAP: PACELC

PACELC: if Partition (P), choose Availability (A) or Consistency (C); Else (E), choose Latency (L) or Consistency (C)

CAP only describes behavior during a partition. But partitions are rare. Most of the time your system is running fine. PACELC extends CAP to cover normal operation too.

The idea: if there's a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.

→DynamoDB (PA/EL): During partition: favor availability. Normal: favor low latency. Eventual consistency by default.
→MongoDB (PC/EC): During partition: favor consistency. Normal: also favor consistency. Reads go to the primary by default.
→Cassandra (PA/EL): During partition: stay available. Normal: prioritize fast reads. Tunable consistency per query.

PACELC is more useful in practice because it captures the tradeoff you make every day, not just during rare failure scenarios.

Real-World Tradeoffs

The consistency vs availability decision isn't abstract. It shows up in real product decisions every day.

Parking System → Consistency

If two drivers see the same spot as available and both try to park, you have a conflict. The system needs to reject one of them. Showing an error is better than double-booking a spot.

Banking / Payments → Consistency

You can't show a balance of $500 when the real balance is $0. An ATM that's temporarily unavailable is better than one that dispenses money it shouldn't.

Social Media Feed → Availability

If your Instagram feed is 30 seconds behind, you won't notice. But if the app refuses to load because one server is unreachable, you'll switch to another app.

E-commerce Product Page → Availability

Showing a product page with a slightly outdated price is better than showing an error. The inventory check at checkout is where you enforce consistency.

Collaborative Document Editing → Availability

Google Docs lets everyone keep typing even during network hiccups. Edits merge later. Blocking all users until every keystroke is confirmed would make the product unusable.

The pattern: choose consistency when wrong data causes real-world harm. Choose availability when a slightly stale experience is better than no experience at all.

Distributed Consensus

The Problem

If you have three database replicas and a client writes a value, how do all three nodes agree on what that value is? What if one node is slow? What if the leader crashes mid-write?

Consensus is the process of getting multiple nodes to agree on a single value, even when some nodes fail or messages get lost. Without it, replicas drift apart and your system returns contradictory answers.

This is the foundation of every CP system. Leader election, log replication, distributed locks, configuration management - they all depend on consensus under the hood.

Raft: Leader Election

A five-node cluster (A through E) with A as the current leader

Raft was designed in 2014 specifically to be understandable. It breaks consensus into three sub-problems: leader election, log replication, and safety.

A Raft cluster has at most one leader per term. All writes go through the leader. If the leader goes down, the remaining nodes detect the failure (via heartbeat timeouts) and elect a new one.

Election process:

1.A follower notices it hasn't heard from the leader (heartbeat timeout).
2.It becomes a candidate and increments its term number.
3.It votes for itself and sends vote requests to all other nodes.
4.If it gets votes from a majority, it becomes the new leader.
5.The new leader starts sending heartbeats to maintain authority.

The term number is critical. It acts as a logical clock. If a node receives a message with a higher term, it knows its information is outdated and steps down. This prevents stale leaders from causing split-brain scenarios.
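The election steps and the term rule can be sketched in a few lines. This is a deliberately simplified model: the `Node` class and method names are invented for illustration, and real Raft adds randomized election timeouts, log-up-to-date checks on votes, and RPC handling.

```python
class Node:
    def __init__(self, node_id, cluster_size):
        self.node_id = node_id
        self.cluster_size = cluster_size
        self.term = 0               # logical clock
        self.voted_for = None       # at most one vote per term
        self.state = "follower"

    def request_vote(self, candidate_id, candidate_term):
        # A higher term means our information is outdated: step down.
        if candidate_term > self.term:
            self.term = candidate_term
            self.voted_for = None
            self.state = "follower"
        # Grant the vote only once per term.
        if candidate_term == self.term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

    def start_election(self, peers):
        # Steps 2-4: become candidate, bump term, vote for self, ask peers.
        self.state = "candidate"
        self.term += 1
        self.voted_for = self.node_id
        votes = 1 + sum(p.request_vote(self.node_id, self.term) for p in peers)
        if votes > self.cluster_size // 2:   # majority wins
            self.state = "leader"
        return self.state

nodes = [Node(i, 5) for i in range(5)]
# Node 0's heartbeat timer fires first; it campaigns against the other four.
print(nodes[0].start_election(nodes[1:]))  # "leader"
```

Note how the one-vote-per-term rule does the heavy lifting: two candidates in the same term cannot both collect a majority, so at most one leader can emerge per term.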

Raft: Log Replication

ClientLeaderTerm 3Follower 1Follower 2Follower 3
1. Client sends write to leader

Once a leader is elected, it's responsible for accepting writes and replicating them to followers. Every write becomes an entry in an ordered log.

The leader appends the entry to its own log, sends it to all followers, and waits for a majority to acknowledge. Once a majority confirms, the entry is committed and the leader responds to the client. Even if a minority of nodes are down or slow, the system keeps making progress.

This is what makes Raft a CP protocol. A write is only considered successful when a majority of nodes have it. If the leader crashes after committing, the new leader is guaranteed to have all committed entries because it needed a majority vote - and at least one of those voters has the latest data.
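The majority-commit rule can be sketched the same way. Again a toy model with invented names (`Leader`, `Follower`), skipping real Raft details like log indexes, terms on entries, and retrying unreachable followers:

```python
class Follower:
    def __init__(self, reachable=True):
        self.log = []
        self.reachable = reachable

    def append(self, entry):
        if not self.reachable:
            return False            # partitioned or down: no acknowledgment
        self.log.append(entry)
        return True

class Leader:
    def __init__(self, followers):
        self.log = []               # ordered log of entries
        self.followers = followers

    def write(self, entry):
        # 1. Append to the leader's own log.
        self.log.append(entry)
        # 2. Replicate to followers and count acknowledgments.
        acks = 1                    # the leader counts itself
        for f in self.followers:
            if f.append(entry):
                acks += 1
        # 3. Commit only once a majority of the cluster has the entry.
        cluster_size = len(self.followers) + 1
        return "committed" if acks > cluster_size // 2 else "not committed"

# 5-node cluster with one follower unreachable: 4 acks, majority is 3.
leader = Leader([Follower(), Follower(), Follower(), Follower(reachable=False)])
print(leader.write("x=1"))  # "committed"
```

The same cluster with three unreachable followers would return "not committed": the system stalls rather than acknowledge a write a majority might never have, which is exactly the CP behavior described above.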

Other Consensus Algorithms

Paxos came first (1989, published 1998). It solves the same fundamental problem as Raft but is notoriously hard to understand and even harder to implement correctly. Most engineers who work with Paxos rely on battle-tested libraries rather than writing it from scratch.

Multi-Paxos and Viewstamped Replication extend basic Paxos for continuous operation (not just single-value consensus). Raft is essentially a more structured and teachable version of Multi-Paxos.

PBFT (Practical Byzantine Fault Tolerance) handles a different threat model: nodes that actively lie or behave maliciously. This is relevant for blockchain systems but overkill for trusted datacenter environments.

Where This Shows Up

You don't usually implement consensus yourself, but understanding it helps you reason about the systems you depend on:

→etcd / Consul: Use Raft for distributed key-value storage. Kubernetes stores all cluster state in etcd.
→ZooKeeper: Uses ZAB (similar to Raft). Coordinates distributed systems - leader election, config management, distributed locks.
→CockroachDB / TiKV: Use Raft per data range to replicate and agree on writes across nodes.
→Kafka (KRaft): Replaced ZooKeeper with a built-in Raft-based controller for metadata consensus.

The pattern is the same everywhere: a small cluster of nodes (usually 3 or 5) runs consensus to agree on state, and the rest of the system trusts that source of truth. Understanding Raft gives you a mental model for how all of these work.
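The "usually 3 or 5" sizing falls out of simple quorum arithmetic, sketched here (helper names are illustrative):

```python
def majority(n: int) -> int:
    """Smallest number of nodes that constitutes a quorum."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """How many nodes can fail while a majority remains reachable."""
    return (n - 1) // 2

for n in (3, 4, 5):
    print(f"{n} nodes: quorum {majority(n)}, tolerates {fault_tolerance(n)} failures")
# 3 nodes: quorum 2, tolerates 1 failure
# 4 nodes: quorum 3, tolerates 1 failure  <- an even size buys nothing
# 5 nodes: quorum 3, tolerates 2 failures
```

This is why consensus clusters use odd sizes: a 4-node cluster tolerates no more failures than a 3-node one, but adds replication cost.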

© 2026 dmai/blog Engineer Notes. All rights reserved.