Kafka — A System That Had to Exist

From monoliths to distributed logs — why Kafka is an inevitable answer at scale.

1. Systems Change As Scale Changes

In the early days, most systems are simple.

You have:

  • One application
  • One database
  • Users send requests
  • Application writes to the database

This architecture works well for small systems. It is easy to understand, easy to deploy, and easy to maintain.

But as a system grows, things start to change. New requirements appear:

  • More users
  • More features
  • More services
  • More data
  • More teams
  • More integrations

The architecture begins to evolve.

Stage   | Architecture
Stage 1 | Monolith → Database
Stage 2 | Services → Shared Database
Stage 3 | Services → RPC calls
Stage 4 | Services → Message Queue
Stage 5 | Services → Distributed Log

Kafka appears at Stage 5. It is not a random piece of technology. It is a natural evolution of systems that need to handle large amounts of data and many independent services.

To understand Kafka, we must first understand the problems that forced systems to evolve this way.

2. The Fundamental Problem

At scale, companies face a common problem:

Multiple systems need the same stream of data, reliably, at high throughput, independently, and with the ability to replay past events.

Let's break this problem down.

Problem 1 — Point-to-Point Integration Creates Tight Coupling

Imagine a system where Service A generates events, and multiple services need those events.

For example:

  • Order Service creates orders
  • Payment Service needs order events
  • Notification Service needs order events
  • Analytics Service needs order events
  • Fraud Detection needs order events

If Service A sends data directly to each service:

Service A → Service B
Service A → Service C
Service A → Service D
Service A → Service E

Now Service A is tightly coupled to every other service.

This creates several problems:

  • Service A must know all consumers
  • If one consumer is down, retries pile up
  • Adding a new consumer requires changes in Service A
  • Failures propagate across services

This architecture does not scale organizationally or technically.

We need decoupling between producers and consumers.

Problem 2 — Databases Are Not Built for Event Streams

A common early solution is to write everything to a database and let other services read from the database.

But databases are designed for:

  • Transactions
  • Queries
  • Updates
  • Random reads and writes
  • Maintaining current state

Event data is different:

  • It is append-only
  • It is high throughput
  • It is written continuously
  • It is read by many systems
  • It often needs to be stored for a long time
  • It is sequential in nature

Using a traditional database for high-throughput event streams causes:

  • Write bottlenecks
  • Lock contention
  • Expensive scaling
  • Replication overhead
  • Performance degradation

We need a system optimized for append-only, high-throughput writes.

Problem 3 — Multiple Consumers, Different Speeds

Different systems process data at different speeds.

For example:

  • Notifications must be sent immediately
  • Fraud detection must process quickly
  • Analytics can process later
  • Data warehouse may process in batches at night

If a slow consumer blocks a fast consumer, the entire system slows down.

So we need a system where:

  • Multiple consumers can read the same data
  • Each consumer can read at its own speed
  • Slow consumers do not affect fast consumers

This is a very important requirement.

Problem 4 — Systems Fail, So Replay Is Necessary

In distributed systems, failures are not rare events. They are normal events.

Things that fail regularly:

  • Consumers crash
  • Machines restart
  • Network calls fail
  • Deployments introduce bugs
  • Downstream systems go down

If a consumer is down for 2 hours, what happens to the data produced during those 2 hours?

If the system deletes messages after they are consumed (like traditional queues), then replay becomes difficult or impossible.

So we need a system where:

  • Data is stored for a long time
  • Consumers can replay it from any point in time

This requirement leads to a very important idea: Instead of a queue, we need a log.

3. First Principles of Distributed Systems (The Physics)

Before designing any distributed system, we must accept some fundamental truths.

These are not design choices. These are constraints of reality.

First Principle                   | Implication
Networks fail                     | We must handle retries and partial failure
Machines fail                     | We need replication
Disks are persistent              | Data should be stored on disk
Sequential disk writes are fast   | Use append-only logs
Random disk writes are slow       | Avoid random updates
Memory is fast but volatile       | Use memory as cache, not source of truth
Batching improves throughput      | Group operations together
Consumers run at different speeds | Consumers must be independent
Coordination is expensive         | Avoid central coordination
Global ordering is hard           | Limit ordering guarantees
Replay is necessary               | Store data for long periods
Backpressure is necessary         | Consumers should control read rate
Systems must scale horizontally   | Split data across machines

If we design a system while respecting these constraints, we will naturally arrive at something that looks like Kafka.

4. Deriving the System Step-by-Step

Now we design the system logically from the problems and constraints.

Step 1 — We Need a Buffer Between Producers and Consumers

To decouple producers and consumers, we introduce a buffer.

Producers → Buffer → Consumers

This buffer is traditionally called a queue.

But queues have a problem:

  • Once a message is consumed, it is deleted
  • New consumers cannot read old messages
  • Replay is difficult

So a queue is not enough.
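
The limitation above can be seen in a few lines. This is a toy in-memory queue, not a real broker; the point is only that delete-on-consume destroys history:

```python
from collections import deque

# A minimal delete-on-consume queue (a sketch, not a real broker):
# once a message is taken, it is gone for every other consumer.
queue = deque(["order-1", "order-2", "order-3"])

first = queue.popleft()    # one consumer reads "order-1"

# A consumer that joins now can never see "order-1" again:
remaining = list(queue)
```

Any consumer added after the fact, and any consumer that crashed mid-stream, has permanently lost the earlier messages.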

Step 2 — We Need Replay → So We Store Messages in a Log

Instead of deleting messages after consumption, we append messages to a log.

A log is an append-only data structure:

Message 1
Message 2
Message 3
Message 4
Message 5
...

Now consumers can:

  • Read messages now
  • Read messages later
  • Re-read messages
  • Start from any point in time

This is a major conceptual shift:

A queue is for tasks. A log is for events and history.

Kafka is based on the log, not the queue.
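
The shift from queue to log can be sketched in a few lines. The `Log` class below is an illustrative model, not Kafka's API; it only demonstrates that an append-only structure lets any reader start from any position:

```python
# A minimal append-only log (a sketch): messages are numbered
# by position and never deleted on read.
class Log:
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # position of the new message

    def read(self, start=0):
        return self._messages[start:]

log = Log()
for event in ["created", "paid", "shipped"]:
    log.append(event)

late_reader = log.read(0)   # a reader that joins late replays everything
recent_only = log.read(2)   # another reader starts near the tail
```

Reading never mutates the log, so replay falls out of the data structure for free.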

Step 3 — One Log Is Not Enough → We Partition the Log

If we store everything in one log:

  • One machine
  • One disk
  • One bottleneck
  • Limited throughput

To scale, we split the log into partitions.

Partition 0 → Log
Partition 1 → Log
Partition 2 → Log
Partition 3 → Log

Now we get:

  • Parallel writes
  • Parallel reads
  • Horizontal scalability

The partition becomes the unit of scalability.
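
Routing a message to a partition is typically done by hashing its key, which also preserves per-key ordering: every event for the same key lands in the same partition. A sketch (using CRC32 for a stable hash; Kafka's default partitioner hashes key bytes with murmur2):

```python
import zlib

NUM_PARTITIONS = 4

def partition_for(key: str) -> int:
    # zlib.crc32 is stable across runs, unlike Python's salted built-in hash()
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# All events for the same key map to the same partition,
# so per-key ordering is preserved within that partition.
p_first = partition_for("user-42")
p_again = partition_for("user-42")
```

Ordering is guaranteed only within a partition, not across partitions, which is exactly the trade-off partitioning makes.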

Step 4 — Machines Fail → So We Add Replication

If a machine storing a partition fails, we lose data.

So each partition must be replicated:

Partition 0 → Leader + Followers
Partition 1 → Leader + Followers
Partition 2 → Leader + Followers

Now if a leader fails, a follower can take over.

This gives us fault tolerance.
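
The failover idea can be modeled in a few lines. This sketch (with made-up broker names) copies every write from the leader to its followers, so a follower holds the full log when promotion is needed:

```python
# A sketch of leader/follower replication for one partition.
class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []

leader = Replica("broker-1")
followers = [Replica("broker-2"), Replica("broker-3")]

def append(message):
    leader.log.append(message)
    for follower in followers:      # replicate the write to every follower
        follower.log.append(message)

append("order-1")
append("order-2")

# The leader fails; promote a follower that has the full log.
new_leader = followers[0]
```

Real Kafka is more careful (followers pull from the leader, and only in-sync replicas are eligible for promotion), but the shape is the same: replicated logs make a machine failure survivable.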

Step 5 — Many Consumers → We Introduce Consumer Groups

If multiple consumers read the same partition independently, they will duplicate work.

So we create consumer groups:

  • Each partition is read by only one consumer in a group
  • Multiple consumers can read different partitions in parallel

This gives us consumer scaling.
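
Assignment can be sketched with a simple round-robin over partitions (Kafka ships several assignment strategies; this is only the idea, with illustrative names):

```python
# A sketch of consumer-group assignment: each partition is owned
# by exactly one consumer in the group.
partitions = [0, 1, 2, 3]
consumers = ["consumer-a", "consumer-b"]

assignment = {c: [] for c in consumers}
for i, p in enumerate(partitions):
    # Round-robin: partition i goes to consumer i mod group size.
    assignment[consumers[i % len(consumers)]].append(p)
```

Every partition appears in exactly one consumer's list, so the group processes partitions in parallel without duplicating work. It also means the partition count caps useful parallelism: a group with more consumers than partitions leaves some consumers idle.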

Step 6 — Consumers Need to Track Their Position → Offset

Consumers need to know where they stopped reading.

So each message in the log gets a number called an offset.

Offset 0 → Message
Offset 1 → Message
Offset 2 → Message
Offset 3 → Message

Consumers store the offset they have processed.

This means:

  • Consumers control their position
  • Consumers can replay messages
  • The system does not need complex tracking logic

This design simplifies the system and improves scalability.
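
Offset tracking and replay fit in a few lines. In this sketch the consumer keeps its own position in a plain variable; rewinding it is all that replay requires:

```python
# A sketch of consumer-side offset tracking.
log = ["msg-0", "msg-1", "msg-2", "msg-3"]

offset = 0                          # the consumer's committed position

def poll(n):
    global offset
    batch = log[offset:offset + n]
    offset += len(batch)            # "commit" after processing the batch
    return batch

first_batch = poll(2)               # reads from offset 0, commits offset 2
offset = 0                          # rewind: replay from the beginning
replayed = poll(4)                  # reads the full history again
```

The broker stays simple because it never tracks who has read what; it only serves "give me messages from offset N".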

Step 7 — Consumers Pull Data Instead of the Server Pushing It

If the server pushes data:

  • The server must track consumer state
  • Backpressure is difficult
  • Slow consumers can cause problems

If consumers pull data:

  • Consumers control their own speed
  • Backpressure is natural
  • The system is simpler

So the system uses a pull model.
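
The pull loop looks like this in miniature (names like `max_records` are illustrative, echoing the consumer setting of the same spirit in real clients): the consumer asks for a bounded batch, processes it at its own pace, and polls again.

```python
# A sketch of the pull model: the consumer requests at most
# `max_records` per poll, so a slow consumer simply polls less
# often — backpressure happens naturally.
log = [f"event-{i}" for i in range(10)]

def poll(position, max_records):
    batch = log[position:position + max_records]
    return batch, position + len(batch)

position = 0
seen = []
while position < len(log):
    batch, position = poll(position, max_records=3)
    seen.extend(batch)              # process at the consumer's own speed
```

Nothing in this loop requires the server to know how fast the consumer is; the read rate is entirely in the consumer's hands.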

5. What Kafka Really Is

If we combine all the steps, we get:

Kafka is a distributed, replicated, partitioned, append-only log where consumers control their own position.

This is Kafka in one sentence.

Everything else — producers, brokers, consumer groups — is detail built around this core idea.

6. Trade-offs Kafka Makes

Every engineering system is a set of trade-offs. Kafka is no different.

Design Choice            | Benefit                   | Trade-off
Partitioning             | Horizontal scaling        | No global ordering
Replication              | Fault tolerance           | More disk usage
Pull model               | Backpressure, scalability | Slightly higher latency
Batching                 | High throughput           | Increased latency
Offset-based consumption | Replay capability         | Duplicate processing possible
Log storage              | Event history             | Disk heavy
Consumer groups          | Parallel processing       | Rebalance complexity

Kafka is not magic. It is a set of carefully chosen trade-offs optimized for high-throughput event streaming.

7. Conclusion — Logs as the Backbone of Modern Systems

Traditional databases store the current state of the world.

But modern systems are increasingly built around events — things that happen over time:

  • Order created
  • Payment completed
  • User signed up
  • Item shipped
  • Page viewed
  • Sensor updated

These are not just states. These are events in a timeline.

Kafka does not just store data. Kafka stores the history of how data changed.

This leads to a powerful idea:

Databases store state. Logs store history. Modern systems are built on both.

Kafka became popular not because it was trendy, but because at scale, a distributed log becomes a necessity.

Kafka is not just a tool. It is a data backbone for modern distributed systems.

And once you understand the problems, the constraints, and the trade-offs, you realize something important:

A system like Kafka was not invented. It was inevitable.