
As a System Designer, let me take you through the journey of how data processing eventually gave birth to Kafka and real-time stream processing.
In the early days, data processing was simple: capture the data, store it somewhere, and then process it later. Big data itself wasn't really a new invention so much as the result of large data sets in a domain accumulating over a long period. And when big data came along, the pattern didn't change. We just started capturing huge amounts of data, dumping it into massive storage systems, and only then processing it. That's how the idea of a Data Lake was born. A data lake is basically a giant parking lot for data: you throw everything in there, and when needed, you take chunks out, process them in batches, and try to make sense of it.
But here’s the catch — while many organizations were busy expanding their data lakes, a new wave of companies started rethinking the entire approach. Instead of treating data as something that “sits and waits” to be processed, they started treating everything happening inside a business as a live stream of events.
Think about it: orders being placed, payments getting processed, users clicking buttons, sensors pushing updates. All of this can flow as continuous event streams. And when data is seen as a stream in motion, it completely changes how we build applications. Instead of pulling old data and processing it later, apps can tap directly into the live stream, consume events as they arrive, and react instantly, enabling real-time actions and decision-making.
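To make that concrete, here's a minimal sketch of an application tapping into a live stream with Kafka's Java consumer API. The broker address, the "orders" topic, and the group id are all hypothetical names for illustration:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderReactor {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address and consumer group, for illustration only.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-reactor");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders")); // hypothetical topic name
            while (true) {
                // Poll returns whatever events have arrived so far;
                // the app reacts to each one immediately.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("reacting to order %s: %s%n",
                            record.key(), record.value());
                }
            }
        }
    }
}
```

Notice there's no "load yesterday's data" step anywhere: the loop simply waits for the next events and handles them as they show up.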
This is a shift in perspective:
A data lake is stationary. A stream is dynamic: data in motion. When we work on data that's already sitting in storage, we call it data processing.
When we process a flow of events as they're happening, that's stream processing. And when we're doing it in seconds or even milliseconds, we've entered the world of real-time stream processing.

Now, here's the thing: you already know how batch data processing works. It's straightforward:
Read from a database or data lake, process it using SQL or some program, and store the result back. That's true whether it's small data or big data.

But stream processing? That's a whole different ball game. A stream is endless. It never stops. Data keeps flowing into your system every second. You can't just wait for it all to arrive and then process it, because it never ends.

So how do you deal with that? How do you process an unbounded flow of data in real time?
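One common answer, before we even name a technology, is to flip the mental model: instead of waiting for "all the data", you keep a small piece of running state and update it as each event arrives. A minimal sketch of that idea in plain Java, with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

// Incremental processing: state is updated one event at a time,
// so the stream never needs to "end" before results exist.
public class RunningRevenue {
    private final Map<String, Double> totalsByProduct = new HashMap<>();

    // Called once per incoming order event; the totals are always current.
    public double onOrderEvent(String productId, double amount) {
        return totalsByProduct.merge(productId, amount, Double::sum);
    }
}
```

Real systems layer partitioning, fault tolerance, and time windows on top of this idea, which is where dedicated streaming platforms come in.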
These questions are exactly why technologies like Apache Kafka came into existence. Kafka provides a way to create, manage, and process streams of data — reliably, at scale, and in real time. If you’ve worked with data lakes before, you’ll notice the difference right away. With streams, you’re no longer waiting. You’re reacting.
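As a sketch of what that reacting can look like in practice, here's a small Kafka Streams topology that counts events per key over one-minute windows, so results keep flowing out even though the stream never ends. The application id, broker address, and "orders" topic are assumptions for illustration:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class OrderCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical application id and broker address.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG,
                Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders"); // hypothetical topic

        // Windowing tames the unbounded stream: a count is emitted per
        // one-minute window instead of waiting for the stream to "finish".
        orders.groupByKey()
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
              .count()
              .toStream()
              .foreach((windowedKey, count) ->
                      System.out.printf("orders for %s this minute: %d%n",
                              windowedKey.key(), count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```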