In Part 3 of this series on the real-time streaming ecosystem, I will cover streaming platforms: distributed publish-and-subscribe systems for events. Part 1 of this blog series describes the real-time use case I’m trying to assemble. If you haven’t read it yet, please do so first.
Streaming Platform vs. Stream Processing Platform
The terms "streaming platform" and "stream processing platform" are often used interchangeably, but there are some differences between the two.
A streaming platform is a software system that publishes event streams for other applications to subscribe to. It holds events and maintains them in a stream so that many consumers can process them in real time, each at its own latency. It’s important to understand that a streaming platform enables the processing and analysis of streaming data in real time or near real time; it doesn’t do the processing and analysis itself.
On the other hand, a stream processing platform can process and analyze event streams sourced from streaming platforms.
Using the water metaphor, think of a streaming platform as a system that organizes and distributes water, then provides it to many stream processors, like the inlets into your home. That makes stream processors the pipes that route, clean, and prepare the water for consumption.
Streaming platforms are the central nervous systems for events. They receive sensory inputs from sources, process them, and signal responses in real time. Streaming platforms also provide a way to monitor the pulse of the business by measuring its health against key performance indicators. Businesses can head off incidents before they cause significant damage. Streaming platforms are quickly becoming a requirement for any business.
Existing Streaming Platforms
The diagram below shows the current streaming platform ecosystem. It includes open-source projects and managed providers. You can find more details in Part 1 of this series.
Distribution of Data
Many of the existing streaming platforms hold their data in very similar constructs. These constructs are an abstract representation of a set of lower-level partitions of event data. In Apache Kafka, the abstraction is called a topic and the lower-level partitions are simply called partitions. Other streaming platforms use different names: in Memphis, topics are called “stations” and partitions are called “streams”; in Apache Pulsar, topics are called “topics” and partitions are called “ledgers”; in Gazette, topics are called “selectors” and partitions are called “journals.” In this series, I’ll be using the terms “topic” and “partition.”
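For concreteness, here is a minimal sketch of creating a Kafka topic with several partitions using the confluent-kafka Python client. The broker address, topic name, and partition count are illustrative assumptions, not a recommendation.

```python
# A minimal sketch: create a Kafka topic with multiple partitions using the
# confluent-kafka AdminClient. Broker address and topic name are assumptions.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Six partitions lets up to six consumer instances in one group read in parallel.
# replication_factor=1 assumes a single local broker; use 3+ in production.
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=6, replication_factor=1)]
)

for topic, future in futures.items():
    try:
        future.result()  # block until the broker confirms creation
        print(f"Created topic {topic}")
    except Exception as err:
        print(f"Failed to create {topic}: {err}")
```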
A partition is the mechanism that scales out the streaming platform. The more partitions a topic has, the more it can distribute the event load, which allows more consumer instances to process the events in parallel. Partitioning the events is the job of the source connector or producing application. At the beginning of developing any event-generating application, engineers need to think about how to distribute the events so that consumers can process them efficiently. They do this by designing a key for each event; that key is hashed to determine which partition the event is sent to. A key that balances the events across all partitions provides the best horizontal scale.
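To make the key-to-partition idea concrete, here is a rough sketch using the confluent-kafka Python producer. The topic, keys, and event shape are made up for illustration; the point is that the client hashes each key to pick a partition, so events with the same key always land on the same partition and stay ordered.

```python
# A sketch of key-based partitioning. Events with the same key always hash to
# the same partition, so a well-spread key (here a hypothetical customer_id)
# balances load across all partitions.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

events = [
    {"customer_id": "c-1001", "action": "checkout", "amount": 42.50},
    {"customer_id": "c-2002", "action": "checkout", "amount": 17.25},
]

for event in events:
    producer.produce(
        "orders",
        key=event["customer_id"],          # hashed by the client to pick a partition
        value=json.dumps(event).encode(),  # ordering is preserved per key/partition
    )

producer.flush()
```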
Smarter Trend
In a previous post, I talked about a growing trend of making smarter messaging brokers. You can read it here.
In streaming, the developer experience with real-time systems hasn’t been great. The learning curve has always been high, and it was hard to implement the specific security requirements companies had. Part of a growing trend is to make this experience simpler, enabling rapid development with models that are easy to understand. This is especially true for streaming platforms.
Some of these enhancements made the streaming platform harder to manage and required deeper knowledge of it. That deeper knowledge is often secondary to engineers’ real work, which tends to lead companies to search for enterprise or fully managed versions of the streaming platform. These offerings relinquish the need for deep platform knowledge so that engineers can focus on applications that directly support the business. Here are some of those features:
Auto-balancing of data across the distributed streaming platform
Tiered storage
Auto-scaling of topics
Dumber clients to shorten the learning curve
Schema registry for simple management of schema evolution (see the sketch after this list)
Intelligent and intuitive user interfaces like the console and CLI
Easier security setup
Built-in cross-region replication
Built-in stateless functions
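As one example of these smarter features, here is a rough sketch of registering an Avro schema with a schema registry, using Confluent’s Python client. The registry URL, subject name, and schema are assumptions for illustration only.

```python
# A rough sketch of registering an Avro schema with a schema registry,
# assuming a registry at localhost:8081. Subject name and schema are made up.
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

registry = SchemaRegistryClient({"url": "http://localhost:8081"})

order_schema = Schema(
    schema_str="""
    {
      "type": "record",
      "name": "Order",
      "fields": [
        {"name": "customer_id", "type": "string"},
        {"name": "action", "type": "string"},
        {"name": "amount", "type": "double"}
      ]
    }
    """,
    schema_type="AVRO",
)

# The registry enforces compatibility rules as this schema evolves, so
# producers and consumers can upgrade independently.
schema_id = registry.register_schema("orders-value", order_schema)
print(f"Registered schema id {schema_id}")
```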
Use Case
In Part 2 of this series, I introduced connectors, change data capture (CDC), ELT, and rETL.
Streaming platforms maintain events as real-time streams for many consumers to subscribe to. The natural next step in the real-time streaming pipeline is to configure connectors to write to a streaming platform.
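As a rough illustration of that step, the sketch below registers a hypothetical Debezium Postgres CDC connector with Kafka Connect’s REST API. The hostnames, credentials, and connector name are assumptions, not a prescription.

```python
# An illustrative sketch of registering a CDC source connector (Debezium for
# Postgres) with Kafka Connect's REST API. All names and credentials are
# placeholders for the example.
import requests

connector = {
    "name": "orders-postgres-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        "database.user": "replicator",
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",  # change events land in topics prefixed with "shop."
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```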
The diagram above has been modified to include a connector capturing events from sources. The connector converts the events into a stream and sends it to a topic. The topic maintains the stream and publishes it for consumers. But where does the transformation of the data happen? In almost every data pipeline, there is a need to cleanse, enrich, and prepare the data for consumption. In the next post of this series, I’ll go over the stream processors that can perform these transformations in a stream.
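Until then, here is a minimal sketch of what subscribing to the topic’s raw stream looks like from the consumer side, again using the confluent-kafka Python client and the illustrative “orders” topic from earlier.

```python
# A minimal sketch of a consumer subscribing to the topic the connector feeds.
# Each consumer group gets its own view of the stream; adding instances with
# the same group.id spreads the partitions across them.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-readers",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)  # wait up to 1s for the next event
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue
        event = json.loads(msg.value())
        print(f"partition={msg.partition()} key={msg.key()} event={event}")
finally:
    consumer.close()
```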