Kafka — A System That Had to Exist
From monoliths to distributed logs — why Kafka is an inevitable answer at scale.
1. Systems Change As Scale Changes
In the early days, most systems are simple.
You have:
- One application
- One database
- Users send requests
- Application writes to the database
This architecture works well for small systems. It is easy to understand, easy to deploy, and easy to maintain.
But as a system grows, things start to change. New requirements appear:
- More users
- More features
- More services
- More data
- More teams
- More integrations
The architecture begins to evolve.
| Stage | Architecture |
|---|---|
| Stage 1 | Monolith → Database |
| Stage 2 | Services → Shared Database |
| Stage 3 | Services → RPC calls |
| Stage 4 | Services → Message Queue |
| Stage 5 | Services → Distributed Log |
Kafka appears at Stage 5. It is not a random piece of technology. It is a natural evolution of systems that need to handle large amounts of data and many independent services.
To understand Kafka, we must first understand the problems that forced systems to evolve this way.
2. The Fundamental Problem
At scale, companies face a common problem:
Multiple systems need the same stream of data, reliably, at high throughput, independently, and with the ability to replay past events.
Let's break this problem down.
Problem 1 — Point-to-Point Integration Creates Tight Coupling
Imagine a system where Service A generates events, and multiple services need those events.
For example:
- Order Service creates orders
- Payment Service needs order events
- Notification Service needs order events
- Analytics Service needs order events
- Fraud Detection needs order events
If Service A sends data directly to each service:
```
Service A → Service B
Service A → Service C
Service A → Service D
Service A → Service E
```
Now Service A is tightly coupled to every other service.
This creates several problems:
- Service A must know all consumers
- If one consumer is down, retries pile up
- Adding a new consumer requires changes in Service A
- Failures propagate across services
This architecture does not scale organizationally or technically.
We need decoupling between producers and consumers.
Problem 2 — Databases Are Not Built for Event Streams
A common early solution is to write everything to a database and let other services read from the database.
But databases are designed for:
- Transactions
- Queries
- Updates
- Random reads and writes
- Maintaining current state
Event data is different:
- It is append-only
- It is high throughput
- It is written continuously
- It is read by many systems
- It often needs to be stored for a long time
- It is sequential in nature
Using a traditional database for high-throughput event streams causes:
- Write bottlenecks
- Lock contention
- Expensive scaling
- Replication overhead
- Performance degradation
We need a system optimized for append-only, high-throughput writes.
Problem 3 — Multiple Consumers, Different Speeds
Different systems process data at different speeds.
For example:
- Notifications must be sent immediately
- Fraud detection must process quickly
- Analytics can process later
- Data warehouse may process in batches at night
If a slow consumer blocks a fast consumer, the entire system slows down.
So we need a system where:
- Multiple consumers can read the same data
- Each consumer can read at its own speed
- Slow consumers do not affect fast consumers
This is a very important requirement.
Problem 4 — Systems Fail, So Replay Is Necessary
In distributed systems, failures are not rare events. They are normal events.
Things that fail regularly:
- Consumers crash
- Machines restart
- Network calls fail
- Deployments introduce bugs
- Downstream systems go down
If a consumer is down for 2 hours, what happens to the data produced during those 2 hours?
If the system deletes messages after they are consumed (like traditional queues), then replay becomes difficult or impossible.
So we need a system where:
- Data is stored for a long time
- Consumers can replay it from any point in time
This requirement leads to a very important idea: Instead of a queue, we need a log.
3. First Principles of Distributed Systems (The Physics)
Before designing any distributed system, we must accept some fundamental truths.
These are not design choices. These are constraints of reality.
| First Principle | Implication |
|---|---|
| Networks fail | We must handle retries and partial failure |
| Machines fail | We need replication |
| Disks are persistent | Data should be stored on disk |
| Sequential disk writes are fast | Use append-only logs |
| Random disk writes are slow | Avoid random updates |
| Memory is fast but volatile | Use memory as cache, not source of truth |
| Batching improves throughput | Group operations together |
| Consumers run at different speeds | Consumers must be independent |
| Coordination is expensive | Avoid central coordination |
| Global ordering is hard | Limit ordering guarantees |
| Replay is necessary | Store data for long periods |
| Backpressure is necessary | Consumers should control read rate |
| Systems must scale horizontally | Split data across machines |
If we design a system while respecting these constraints, we will naturally arrive at something that looks like Kafka.
4. Deriving the System Step-by-Step
Now we design the system logically from the problems and constraints.
Step 1 — We Need a Buffer Between Producers and Consumers
To decouple producers and consumers, we introduce a buffer.
```
Producers → Buffer → Consumers
```
This buffer is traditionally called a queue.
But queues have a problem:
- Once a message is consumed, it is deleted
- New consumers cannot read old messages
- Replay is difficult
So a queue is not enough.
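The limitation above can be seen in a few lines. This is a hypothetical in-memory sketch (the names are illustrative, not any real API): once a message is popped from a queue, a consumer that arrives later never sees it.

```python
from collections import deque

# A plain queue: consuming a message removes it permanently.
queue = deque(["order-1", "order-2", "order-3"])

# The first consumer drains the queue.
first_consumer = [queue.popleft() for _ in range(3)]
print(first_consumer)  # ['order-1', 'order-2', 'order-3']

# A second consumer added later finds nothing to read.
print(list(queue))     # [] — the history is gone, replay is impossible
```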
Step 2 — We Need Replay → So We Store Messages in a Log
Instead of deleting messages after consumption, we append messages to a log.
A log is an append-only data structure:
```
Message 1 → Message 2 → Message 3 → Message 4 → Message 5 → ...
```
Now consumers can:
- Read messages now
- Read messages later
- Re-read messages
- Start from any point in time
This is a major conceptual shift:
A queue is for tasks. A log is for events and history.
Kafka is based on the log, not the queue.
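The shift from queue to log can be sketched just as briefly. Again a hypothetical in-memory model: messages are appended, never deleted, and any reader can start from any position.

```python
# A minimal append-only log.
log = []

def append(message):
    log.append(message)      # append-only: no updates, no deletes

def read_from(offset):
    return log[offset:]      # any reader can start anywhere in the history

for i in range(1, 6):
    append(f"event-{i}")

print(read_from(0))  # full history: ['event-1', ..., 'event-5']
print(read_from(3))  # replay from the middle: ['event-4', 'event-5']
```

Note that nothing about consumption mutates the log; readers are pure observers of an immutable history.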
Step 3 — One Log Is Not Enough → We Partition the Log
If we store everything in one log:
- One machine
- One disk
- One bottleneck
- Limited throughput
To scale, we split the log into partitions.
```
Partition 0 → Log
Partition 1 → Log
Partition 2 → Log
Partition 3 → Log
```
Now we get:
- Parallel writes
- Parallel reads
- Horizontal scalability
Partition becomes the unit of scalability.
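A common way to route messages to partitions is to hash a message key, so all messages with the same key land in the same partition and stay in order. This sketch uses `zlib.crc32` for a stable hash; Kafka's own default partitioner differs in detail, so treat this as an illustration of the idea, not the real algorithm.

```python
import zlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

def partition_for(key):
    # Stable hash of the key, modulo partition count:
    # the same key always maps to the same partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def produce(key, value):
    partitions[partition_for(key)].append(value)

for event in ["created", "paid", "shipped"]:
    produce("order-42", event)
produce("order-99", "created")

# All events for order-42 land in one partition, in order.
p = partition_for("order-42")
print(p, partitions[p][:3])
```

This is also why partitioning costs global ordering: order is preserved per partition (and therefore per key), not across the whole topic.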
Step 4 — Machines Fail → So We Add Replication
If a machine storing a partition fails, we lose data.
So each partition must be replicated:
```
Partition 0 → Leader + Followers
Partition 1 → Leader + Followers
Partition 2 → Leader + Followers
```
Now if a leader fails, a follower can take over.
This gives us fault tolerance.
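The failover idea can be sketched with a toy model (all names hypothetical; real replication involves in-sync tracking, acknowledgements, and leader election, none of which is modeled here):

```python
# Toy model of leader/follower replication for one partition.
class Replica:
    def __init__(self):
        self.log = []

def replicate(leader, followers, message):
    leader.log.append(message)       # write goes to the leader first
    for f in followers:              # followers copy the leader's writes
        f.log.append(message)

leader = Replica()
followers = [Replica(), Replica()]

for m in ["a", "b", "c"]:
    replicate(leader, followers, m)

# If the leader fails, an in-sync follower has the full log
# and can be promoted.
new_leader = followers[0]
print(new_leader.log)  # ['a', 'b', 'c']
```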
Step 5 — Many Consumers → We Introduce Consumer Groups
If multiple consumers read the same partition independently, they will duplicate work.
So we create consumer groups:
- Each partition is read by only one consumer in a group
- Multiple consumers can read different partitions in parallel
This gives us consumer scaling.
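A simple round-robin assignment shows the rule "each partition belongs to exactly one consumer in the group." This is a sketch of one possible assignment strategy, not Kafka's actual rebalance protocol.

```python
# Round-robin assignment of partitions to consumers in a group:
# every partition goes to exactly one consumer.
def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3, 4, 5]
consumers = ["consumer-a", "consumer-b", "consumer-c"]

print(assign(partitions, consumers))
# {'consumer-a': [0, 3], 'consumer-b': [1, 4], 'consumer-c': [2, 5]}
```

Note the corollary: adding consumers beyond the partition count buys nothing, because some consumers would sit idle. The partition count caps a group's parallelism.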
Step 6 — Consumers Need to Track Their Position → Offset
Consumers need to know where they stopped reading.
So each message in the log gets a number called an offset.
```
Offset 0 → Message
Offset 1 → Message
Offset 2 → Message
Offset 3 → Message
```
Consumers store the offset they have processed.
This means:
- Consumers control their position
- Consumers can replay messages
- The system does not need complex tracking logic
This design simplifies the system and improves scalability.
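The offset mechanics fit in a few lines. In this hypothetical sketch, "committing" is just saving an integer; resuming after a crash and replaying history are both nothing more than choosing where to read from.

```python
# The consumer owns its position: an integer offset into the log.
log = [f"event-{i}" for i in range(10)]
committed_offset = 0

def poll(offset, max_records=3):
    return log[offset:offset + max_records]

# Process a batch, then commit the new position.
records = poll(committed_offset)
committed_offset += len(records)

# Crash and restart: resume exactly where we committed.
resumed = poll(committed_offset)
print(resumed)   # ['event-3', 'event-4', 'event-5']

# Replay: just read from an earlier offset again.
replayed = poll(0)
print(replayed)  # ['event-0', 'event-1', 'event-2']
```

This also hints at the trade-off listed later: if the consumer crashes after processing but before committing, it reprocesses the batch on restart, so duplicate processing is possible.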
Step 7 — Consumers Pull Data Instead of Server Pushing
If the server pushes data:
- The server must track consumer state
- Backpressure is difficult
- Slow consumers can cause problems
If consumers pull data:
- Consumers control their own speed
- Backpressure is natural
- The system is simpler
So the system uses a pull model.
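A pull loop makes the backpressure argument concrete. In this illustrative sketch, the consumer decides both the batch size and when to ask again, so a slow consumer simply polls less often; the server never has to track or throttle anyone.

```python
# A consumer-driven pull loop: the consumer chooses the batch size
# and the pace. Backpressure falls out of the design for free.
log = [f"event-{i}" for i in range(7)]

def fetch(offset, max_records):
    return log[offset:offset + max_records]

offset = 0
batches = []
while offset < len(log):
    batch = fetch(offset, max_records=3)  # consumer-chosen batch size
    batches.append(batch)
    offset += len(batch)
    # A real consumer would process the batch here, then poll again
    # whenever it is ready — sleeping if it needs to slow down.

print(batches)
# [['event-0', 'event-1', 'event-2'],
#  ['event-3', 'event-4', 'event-5'],
#  ['event-6']]
```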
5. What Kafka Really Is
If we combine all the steps, we get:
Kafka is a distributed, replicated, partitioned, append-only log where consumers control their own position.
This is Kafka in one sentence.
Everything else — producers, brokers, consumer groups — is just the implementation built around this core idea.
6. Trade-offs Kafka Makes
Every engineering system is a set of trade-offs. Kafka is no different.
| Design Choice | Benefit | Trade-off |
|---|---|---|
| Partitioning | Horizontal scaling | No global ordering |
| Replication | Fault tolerance | More disk usage |
| Pull model | Backpressure, scalability | Slightly higher latency |
| Batching | High throughput | Increased latency |
| Offset-based consumption | Replay capability | Duplicate processing possible |
| Log storage | Event history | Disk heavy |
| Consumer groups | Parallel processing | Rebalance complexity |
Kafka is not magic. It is a set of carefully chosen trade-offs optimized for high-throughput event streaming.
7. Conclusion — Logs as the Backbone of Modern Systems
Traditional databases store the current state of the world.
But modern systems are increasingly built around events — things that happen over time:
- Order created
- Payment completed
- User signed up
- Item shipped
- Page viewed
- Sensor updated
These are not just states. These are events in a timeline.
Kafka does not just store data. Kafka stores the history of how data changed.
This leads to a powerful idea:
Databases store state. Logs store history. Modern systems are built on both.
Kafka became popular not because it was trendy, but because at scale, a distributed log becomes a necessity.
Kafka is not just a tool. It is a data backbone for modern distributed systems.
And once you understand the problems, the constraints, and the trade-offs, you realize something important:
A system like Kafka was not invented. It was inevitable.