Skip to main content

Understanding the CAP Theorem in Distributed Systems

· 5 min read
Hieu Nguyen
Senior Software Engineer at OCB

A plain-English guide to the CAP Theorem — the fundamental trade-off in distributed databases, what happens during networking failures, and why you must choose between consistency and availability.

When building a system for a single local database, data accuracy and uptime seem easy to guarantee. But when you scale out to a distributed system (like microservices with replicated databases), everything changes. You can no longer have it perfectly your way.

In my experience working with high-volume banking systems and e-wallets, choosing the right database isn't just about SQL vs NoSQL — it's about what happens when the network fails. Do you show the user outdated data, or do you show them an error?

This dilemma is formalized by the CAP Theorem (also known as Brewer's theorem). Let's break it down.

What is the CAP Theorem?

The CAP Theorem states that a distributed data store can only guarantee two out of the following three characteristics simultaneously:

  1. Consistency
  2. Availability
  3. Partition Tolerance

Since network partitions (the "P") are unavoidable in a distributed system, the real choice you make is always between Consistency (C) and Availability (A).

1. Consistency (C)

In a highly consistent system, every read receives the most recent write (or an error). If you update a user's wallet balance on Node A, and immediately read it from Node B, you are guaranteed to see the updated balance. It ensures all nodes see the same data at the exact same time.

2. Availability (A)

Every request receives a (non-error) response, without the guarantee that it contains the most recent write. If Node A is disconnected from Node B, Node B will still respond to reads using whatever old data it has.

3. Partition Tolerance (P)

The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes. In real life, network cables get cut, routers fail, and data centers lose power. You cannot choose to ignore Partition Tolerance.

The Illusion of "Choose 2"

You'll often see diagrams showing a triangle where you can pick any combination: CA, CP, or AP.

          Consistency
/\
/ \
/ \
CA / \ CP
/ \
/ \
/____________\
Availability Partition
AP Tolerance

This is misleading. Because network partitions (P) are a fact of life, your system must be Partition Tolerant. Therefore, when a partition occurs, your system is forced to choose between CP and AP.

  • Drop Availability (CP): Cancel the operation to ensure data remains consistent. (Show an error).
  • Drop Consistency (AP): Proceed with the operation using possibly stale data. (Always respond).

Real-World Scenarios

Imagine an ATM system built on a distributed database with two nodes: Node A (Hanoi) and Node B (HCM City). Currently, my balance is $100.

Suddenly, the network link between Hanoi and HCM City goes down (a Partition occurs). I withdraw $50 at the Hanoi ATM (updating Node A to $50). Immediately after, I check my balance on the mobile app, which routes to Node B in HCM City.

Node B cannot talk to Node A to sync the balance. What should Node B do?

Scenario 1: The CP Choice (Consistency over Availability)

Node B knows it cannot talk to Node A and realizes its data might be stale. To prevent a consistency error (like me withdrawing another $100), Node B refuses to process the request.

  • Result: The mobile app shows "Service Unavailable."
  • Use case: Financial systems, banking balances, inventory management.
  • Databases: MongoDB, Redis, HBase, traditional relation databases (PostgreSQL/MySQL) configured for synchronous replication.

Scenario 2: The AP Choice (Availability over Consistency)

Node B decides that answering the user is more important than absolute accuracy. It returns the data it currently has ($100).

  • Result: The mobile app shows a $100 balance, even though the actual balance is $50. Eventually (when the network recovers), Node B will sync and fix the balance.
  • Use case: Social media feeds, likes/comments, shopping cart items, metric tracking.
  • Databases: Cassandra, DynamoDB, CouchDB. (Often described as implementing "Eventual Consistency").

Beyond CAP: The PACELC Theorem

In 2010, the CAP theorem was expanded to PACELC to account for system behavior when the network is perfectly fine (no partitions).

PACELC states:

If there is a Partition (P), how does the system trade off Availability and Consistency (A and C)? Else (E), when the system is running normally, how does it trade off Latency (L) and Consistency (C)?

In a normal state (no partition), you still have a trade-off:

  • Do you update Node A, wait for it to sync to Node B, and then reply to the user? (High Consistency, High Latency).
  • Do you update Node A, reply "Success!" immediately, and sync to Node B in the background? (Low Latency, Eventual Consistency).

Key Takeaways

  1. You don't pick two out of three. You assume P (network partitions will happen), and you decide whether your system should act CP or AP when they do.
  2. CP Systems prioritize correctness. If nodes can't communicate, they stop serving data to prevent inconsistencies. Ideal for ledgers and billing.
  3. AP Systems prioritize uptime. They will always return data, even if it's stale. Ideal for recommendations, feeds, and analytics.
  4. Consistency isn't binary. In reality, "Eventual Consistency" is a spectrum, and tools like Cassandra let you tune consistency levels on a per-query basis.

Thanks for reading! Understanding the CAP theorem is a rite of passage for any backend engineer and a staple in system design interviews. For more insights on building robust backend systems, check out my Clean Architecture Guide and Saga Pattern post. 🚀