Kafka — A System That Had to Exist

From monoliths to distributed logs — why Kafka is an inevitable answer at scale.

1. Systems Change As Scale Changes

In the early days, most systems are simple.

You have:

  • One application
  • One database
  • Users send requests
  • Application writes to the database

This architecture works well for small systems. It is easy to understand, easy to deploy, and easy to maintain.

But as a system grows, things start to change. New requirements appear:

  • More users
  • More features
  • More services
  • More data
  • More teams
  • More integrations

The architecture begins to evolve.

Stage   | Architecture
Stage 1 | Monolith → Database
Stage 2 | Services → Shared Database
Stage 3 | Services → RPC calls
Stage 4 | Services → Message Queue
Stage 5 | Services → Distributed Log

Kafka appears at Stage 5. It is not a random piece of technology. It is a natural evolution of systems that need to handle large amounts of data and many independent services.

To understand Kafka, we must first understand the problems that forced systems to evolve this way.

2. The Fundamental Problem

At scale, companies face a common problem:

Multiple systems need the same stream of data, reliably, at high throughput, independently, and with the ability to replay past events.

Let's break this problem down.

Problem 1 — Point-to-Point Integration Creates Tight Coupling

Imagine a system where Service A generates events, and multiple services need those events.

For example:

  • Order Service creates orders
  • Payment Service needs order events
  • Notification Service needs order events
  • Analytics Service needs order events
  • Fraud Detection needs order events

If Service A sends data directly to each service:

Service A → Service B
Service A → Service C
Service A → Service D
Service A → Service E

Now Service A is tightly coupled to every other service.

This creates several problems:

  • Service A must know all consumers
  • If one consumer is down, retries pile up
  • Adding a new consumer requires changes in Service A
  • Failures propagate across services

This architecture does not scale organizationally or technically.

We need decoupling between producers and consumers.

Problem 2 — Databases Are Not Built for Event Streams

A common early solution is to write everything to a database and let other services read from the database.

But databases are designed for:

  • Transactions
  • Queries
  • Updates
  • Random reads and writes
  • Maintaining current state

Event data is different:

  • It is append-only
  • It is high throughput
  • It is written continuously
  • It is read by many systems
  • It often needs to be stored for a long time
  • It is sequential in nature

Using a traditional database for high-throughput event streams causes:

  • Write bottlenecks
  • Lock contention
  • Expensive scaling
  • Replication overhead
  • Performance degradation

We need a system optimized for append-only, high-throughput writes.

Problem 3 — Multiple Consumers, Different Speeds

Different systems process data at different speeds.

For example:

  • Notifications must be sent immediately
  • Fraud detection must process quickly
  • Analytics can process later
  • Data warehouse may process in batches at night

If a slow consumer blocks a fast consumer, the entire system slows down.

So we need a system where:

  • Multiple consumers can read the same data
  • Each consumer can read at its own speed
  • Slow consumers do not affect fast consumers

This is a very important requirement.

Problem 4 — Systems Fail, So Replay Is Necessary

In distributed systems, failures are not rare events. They are normal events.

Things that fail regularly:

  • Consumers crash
  • Machines restart
  • Network calls fail
  • Deployments introduce bugs
  • Downstream systems go down

If a consumer is down for 2 hours, what happens to the data produced during those 2 hours?

If the system deletes messages after they are consumed (like traditional queues), then replay becomes difficult or impossible.

So we need a system where:

  • Data is stored for a long time
  • Consumers can replay it from any point in time

This requirement leads to a very important idea: Instead of a queue, we need a log.

3. First Principles of Distributed Systems (The Physics)

Before designing any distributed system, we must accept some fundamental truths.

These are not design choices. These are constraints of reality.

First Principle                   | Implication
Networks fail                     | We must handle retries and partial failure
Machines fail                     | We need replication
Disks are persistent              | Data should be stored on disk
Sequential disk writes are fast   | Use append-only logs
Random disk writes are slow       | Avoid random updates
Memory is fast but volatile       | Use memory as cache, not source of truth
Batching improves throughput      | Group operations together
Consumers run at different speeds | Consumers must be independent
Coordination is expensive         | Avoid central coordination
Global ordering is hard           | Limit ordering guarantees
Replay is necessary               | Store data for long periods
Backpressure is necessary         | Consumers should control read rate
Systems must scale horizontally   | Split data across machines

If we design a system while respecting these constraints, we will naturally arrive at something that looks like Kafka.

4. Deriving the System Step-by-Step

Now we design the system logically from the problems and constraints.

Step 1 — We Need a Buffer Between Producers and Consumers

To decouple producers and consumers, we introduce a buffer.

Producers → Buffer → Consumers

This buffer is traditionally called a queue.

But queues have a problem:

  • Once a message is consumed, it is deleted
  • New consumers cannot read old messages
  • Replay is difficult

So a queue is not enough.
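
The limitation above can be seen in a few lines. This is a toy in-memory queue, not a real broker; the point is only that delete-on-consume destroys history:

```python
from collections import deque

# A minimal delete-on-consume queue (a sketch, not a real broker):
# once a message is taken, it is gone for every other consumer.
queue = deque(["order-1", "order-2", "order-3"])

first = queue.popleft()    # one consumer reads "order-1"

# A consumer that joins now can never see "order-1" again:
remaining = list(queue)
```

Any consumer added after the fact, and any consumer that crashed mid-stream, has permanently lost the earlier messages.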

Step 2 — We Need Replay → So We Store Messages in a Log

Instead of deleting messages after consumption, we append messages to a log.

A log is an append-only data structure:

Message 1
Message 2
Message 3
Message 4
Message 5
...

Now consumers can:

  • Read messages now
  • Read messages later
  • Re-read messages
  • Start from any point in time

This is a major conceptual shift:

A queue is for tasks. A log is for events and history.

Kafka is based on the log, not the queue.
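
The shift from queue to log can be sketched in a few lines. The `Log` class below is an illustrative model, not Kafka's API; it only demonstrates that an append-only structure lets any reader start from any position:

```python
# A minimal append-only log (a sketch): messages are numbered
# by position and never deleted on read.
class Log:
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # position of the new message

    def read(self, start=0):
        return self._messages[start:]

log = Log()
for event in ["created", "paid", "shipped"]:
    log.append(event)

late_reader = log.read(0)   # a reader that joins late replays everything
recent_only = log.read(2)   # another reader starts near the tail
```

Reading never mutates the log, so replay falls out of the data structure for free.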

Step 3 — One Log Is Not Enough → We Partition the Log

If we store everything in one log:

  • One machine
  • One disk
  • One bottleneck
  • Limited throughput

To scale, we split the log into partitions.

Partition 0 → Log
Partition 1 → Log
Partition 2 → Log
Partition 3 → Log

Now we get:

  • Parallel writes
  • Parallel reads
  • Horizontal scalability

The partition becomes the unit of scalability.
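
Routing a message to a partition is typically done by hashing its key, which also preserves per-key ordering: every event for the same key lands in the same partition. A sketch (using CRC32 for a stable hash; Kafka's default partitioner hashes key bytes with murmur2):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # zlib.crc32 is stable across runs, unlike Python's salted built-in hash()
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# All events for the same key map to the same partition,
# so per-key ordering is preserved within that partition.
p_first = partition_for("user-42")
p_again = partition_for("user-42")
```

Ordering is guaranteed only within a partition, not across partitions, which is exactly the trade-off partitioning makes.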

Step 4 — Machines Fail → So We Add Replication

If a machine storing a partition fails, we lose data.

So each partition must be replicated:

Partition 0 → Leader + Followers
Partition 1 → Leader + Followers
Partition 2 → Leader + Followers

Now if a leader fails, a follower can take over.

This gives us fault tolerance.
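
The failover idea can be modeled in a few lines. This sketch (with made-up broker names) copies every write from the leader to its followers, so a follower holds the full log when promotion is needed:

```python
# A sketch of leader/follower replication for one partition.
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

leader = Replica("broker-1")
followers = [Replica("broker-2"), Replica("broker-3")]

def append(message):
    leader.log.append(message)
    for follower in followers:      # replicate the write to every follower
        follower.log.append(message)

append("order-1")
append("order-2")

# The leader fails; promote a follower that has the full log.
new_leader = followers[0]
```

Real Kafka is more careful (followers pull from the leader, and only in-sync replicas are eligible for promotion), but the shape is the same: replicated logs make a machine failure survivable.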

Step 5 — Many Consumers → We Introduce Consumer Groups

If multiple consumers read the same partition independently, they will duplicate work.

So we create consumer groups:

  • Each partition is read by only one consumer in a group
  • Multiple consumers can read different partitions in parallel

This gives us consumer scaling.
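
Assignment can be sketched with a simple round-robin over partitions (Kafka ships several assignment strategies; this is only the idea, with illustrative names):

```python
# A sketch of consumer-group assignment: each partition is owned
# by exactly one consumer in the group.
partitions = [0, 1, 2, 3]
consumers = ["consumer-a", "consumer-b"]

assignment = {c: [] for c in consumers}
for i, p in enumerate(partitions):
    # Round-robin: partition i goes to consumer i mod group size.
    assignment[consumers[i % len(consumers)]].append(p)
```

Every partition appears in exactly one consumer's list, so the group processes partitions in parallel without duplicating work. It also means the partition count caps useful parallelism: a group with more consumers than partitions leaves some consumers idle.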

Step 6 — Consumers Need to Track Their Position → Offset

Consumers need to know where they stopped reading.

So each message in the log gets a number called an offset.

Offset 0 → Message
Offset 1 → Message
Offset 2 → Message
Offset 3 → Message

Consumers store the offset they have processed.

This means:

  • Consumers control their position
  • Consumers can replay messages
  • The system does not need complex tracking logic

This design simplifies the system and improves scalability.
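
Offset tracking and replay fit in a few lines. In this sketch the consumer keeps its own position in a plain variable; rewinding it is all that replay requires:

```python
# A sketch of consumer-side offset tracking.
log = ["msg-0", "msg-1", "msg-2", "msg-3"]

offset = 0                          # the consumer's committed position

def poll(n):
    global offset
    batch = log[offset:offset + n]
    offset += len(batch)            # "commit" after processing the batch
    return batch

first_batch = poll(2)               # reads from offset 0, commits offset 2
offset = 0                          # rewind: replay from the beginning
replayed = poll(4)                  # reads the full history again
```

The broker stays simple because it never tracks who has read what; it only serves "give me messages from offset N".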

Step 7 — Consumers Pull Data Instead of the Server Pushing It

If the server pushes data:

  • The server must track consumer state
  • Backpressure is difficult
  • Slow consumers can cause problems

If consumers pull data:

  • Consumers control their own speed
  • Backpressure is natural
  • The system is simpler

So the system uses a pull model.
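
The pull loop looks like this in miniature (names like `max_records` are illustrative, echoing the consumer setting of the same spirit in real clients): the consumer asks for a bounded batch, processes it at its own pace, and polls again.

```python
# A sketch of the pull model: the consumer requests at most
# `max_records` per poll, so a slow consumer simply polls less
# often — backpressure happens naturally.
log = [f"event-{i}" for i in range(10)]

def poll(position, max_records):
    batch = log[position:position + max_records]
    return batch, position + len(batch)

position = 0
seen = []
while position < len(log):
    batch, position = poll(position, max_records=3)
    seen.extend(batch)              # process at the consumer's own speed
```

Nothing in this loop requires the server to know how fast the consumer is; the read rate is entirely in the consumer's hands.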

5. What Kafka Really Is

If we combine all the steps, we get:

Kafka is a distributed, replicated, partitioned, append-only log where consumers control their own position.

This is Kafka in one sentence.

Everything else — producers, brokers, consumer groups — is detail built around this core idea.

6. Trade-offs Kafka Makes

Every engineering system is a set of trade-offs. Kafka is no different.

Design Choice            | Benefit                   | Trade-off
Partitioning             | Horizontal scaling        | No global ordering
Replication              | Fault tolerance           | More disk usage
Pull model               | Backpressure, scalability | Slightly higher latency
Batching                 | High throughput           | Increased latency
Offset-based consumption | Replay capability         | Duplicate processing possible
Log storage              | Event history             | Disk heavy
Consumer groups          | Parallel processing       | Rebalance complexity

Kafka is not magic. It is a set of carefully chosen trade-offs optimized for high-throughput event streaming.

7. Conclusion — Logs as the Backbone of Modern Systems

Traditional databases store the current state of the world.

But modern systems are increasingly built around events — things that happen over time:

  • Order created
  • Payment completed
  • User signed up
  • Item shipped
  • Page viewed
  • Sensor updated

These are not just states. These are events in a timeline.

Kafka does not just store data. Kafka stores the history of how data changed.

This leads to a powerful idea:

Databases store state. Logs store history. Modern systems are built on both.

Kafka became popular not because it was trendy, but because at scale, a distributed log becomes a necessity.

Kafka is not just a tool. It is a data backbone for modern distributed systems.

And once you understand the problems, the constraints, and the trade-offs, you realize something important:

A system like Kafka was not invented. It was inevitable.