Real-Time Streaming Ecosystem Part 1
Setting up the use case
The real-time ecosystem is growing very quickly, and each member is finding its niche in real-time solutions. In this post, part 1 of a multi-part series, I’ve compiled the open source and vendor real-time solutions into an end-to-end real-time analytical use case. There are so many in the market now that it’s time to start organizing and categorizing the members of this thriving ecosystem.
In the diagram below, I’ve mapped out all the vendors and open source projects that are related to real-time solutions. It’s overwhelming and will continue to evolve quickly. For experienced real-time architects and newcomers alike, getting started with real-time can be overwhelming, which can lead to wrong decisions and bad implementations.
I've limited cloud service providers (CSPs) to their streaming platforms because those are the entry points for streaming use cases into their clouds. CSPs already have the advantage on many fronts, including access to existing customers and large sales and marketing teams. CSPs may have other products or solutions that contribute to real-time use cases, but they don't need the extra help with exposure. 😉
CSPs tend to be followers when it comes to real-time services. The pulse of the real-time market is easier to read from the smaller, scrappier players in the ecosystem. One of my goals is to show that real-time streaming, once a data frontier to be conquered, has now crossed the chasm to the early majority. We see this in the growing number of scrappy startups in the market. Data experts are venturing into the real-time/streaming space because of needs expressed by the market.
I'd love to bring some organization to the ecosystem to help newcomers get a jump start on a real-time solution that serves their needs.
In the diagram above, I’ve defined seven categories into which I’ve placed open source projects and vendors that contribute to the real-time, streaming ecosystem. You may find some logos that fall into multiple categories.
First I want to clearly define the categories.
Streaming platforms provide publish and subscribe streaming capabilities. Streaming platforms are implemented as transaction logs (also called commit logs), which allow consumers to keep an offset (the position of the last record that was read). Transaction logs and offsets allow consumers to control how they read events from the stream without affecting other consumers reading from the same stream. The offsets and transaction logs also make it possible to replay data when new tables are derived from older events. This is usually a requirement in real-time streaming use cases, which I will talk more about in a later post in this series. Transaction logs are also append-only and immutable. Data also has a retention period; events that exceed it get deleted.
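To make the offset idea concrete, here is a minimal Python sketch of an append-only log in which each consumer tracks its own position. The `CommitLog` class, its method names, and the count-based `retention` setting are all hypothetical simplifications for illustration; real streaming platforms typically enforce retention by time or size and persist the log to disk.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy append-only log; records are never modified once written."""
    retention: int = 1000          # hypothetical: maximum number of records kept
    _records: list = field(default_factory=list)
    _base_offset: int = 0          # offset of the oldest retained record

    def append(self, event) -> int:
        """Append an event and return the offset it was written at."""
        self._records.append(event)
        offset = self._base_offset + len(self._records) - 1
        # Enforce retention by dropping the oldest records.
        overflow = len(self._records) - self.retention
        if overflow > 0:
            del self._records[:overflow]
            self._base_offset += overflow
        return offset

    def read(self, offset: int, max_records: int = 10):
        """Return (records, next_offset), starting at the requested offset."""
        start = max(offset, self._base_offset) - self._base_offset
        batch = self._records[start:start + max_records]
        return batch, self._base_offset + start + len(batch)


log = CommitLog(retention=3)
for name in ["e0", "e1", "e2"]:
    log.append(name)

# Two consumers read the same log at their own pace without affecting each other.
dashboard_batch, dashboard_offset = log.read(0, max_records=2)
audit_batch, audit_offset = log.read(0, max_records=10)
```

Because reading never removes anything, a consumer that is added later can start from the oldest retained offset and rebuild its view from older events.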
Kafka is the most popular streaming platform. Many other streaming platforms follow its design and even implement its API.
Streaming platforms can be divided into subgroups.
Kafka Compliant: Apache Kafka, Redpanda, AWS MSK, Azure Event Hubs
These are streaming platforms that can be accessed through the Kafka protocol. This allows clients written for Kafka to migrate to another platform. In a sense, the Kafka API is becoming the standard asynchronous API.
JVM: Apache Kafka, Apache Pulsar
These are streaming platforms that are executed using the Java virtual machine.
Non-JVM: Redpanda (C++), Gazette (GoLang), Memphis (GoLang), AWS Kinesis (C++), Azure Event Hubs (C#)
Kafka Providers: Confluent, AWS, Instaclustr, Aiven, Tibco, Cloudera
Pulsar Providers: StreamNative, DataStax
Connectors, CDC, ELT, rETL
These are solutions that do not hold state when streaming data from source to sink. They have transformation capabilities, but they cannot create a materialized view. They can, however, invoke UPSERT commands on the sink.
Connectors act as sources or sinks. Most of the time, connectors adapt to data stores where data lives at rest, and they have the difficult task of converting data at rest into data in motion.
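As a rough illustration of a stateless sink connector, the Python sketch below transforms each event on its own and pushes it to the sink as an UPSERT. It uses an in-memory SQLite database as a stand-in sink; the `orders` table, the event fields, and the function names are hypothetical, and the `ON CONFLICT` syntax assumes a reasonably recent SQLite.

```python
import sqlite3


def to_row(event: dict) -> tuple:
    """Stateless, per-event transformation: no memory of earlier events."""
    return (event["id"], event["status"].upper())


def run_connector(events, sink: sqlite3.Connection) -> None:
    """Push each transformed event to the sink as an UPSERT."""
    sink.execute(
        "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, status TEXT)"
    )
    for event in events:
        sink.execute(
            "INSERT INTO orders (id, status) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
            to_row(event),
        )
    sink.commit()


sink = sqlite3.connect(":memory:")
run_connector(
    [
        {"id": 1, "status": "created"},
        {"id": 2, "status": "created"},
        {"id": 1, "status": "shipped"},  # later event for the same key
    ],
    sink,
)
rows = sink.execute("SELECT id, status FROM orders ORDER BY id").fetchall()
```

The connector itself keeps nothing between events. The latest value per key lives in the sink, which is why the UPSERT matters.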
Change Data Capture (CDC)
Some connectors have CDC capabilities. CDC is the ability to capture change events in a database and send them downstream. There are a few ways to implement CDC, but only one of them can be considered true streaming. The others only emulate streaming. I will cover CDC in Part 2 of this series.
ELT stands for extract, load, and transform. The idea here is the connector extracts data from a source then loads it into an analytical data store or a data warehouse sink. Then the connector sends transformation logic to the sink as SQL.
The logic defined in the SQL usually needs to be represented as a series of steps or, better yet, a workflow. The easiest way to represent a workflow in SQL is to use CTEs (common table expressions). See below for an example.
with step1a as (
  select a, b from stream_table1 where ....
),
step2a as (
  select a, count(b) from step1a group by a
),
step1b as (
  select a, d from stream_table2 where ....
),
step2b as (
  select a, count(d) from step1b group by a
)
select * from step2a x join step2b y on x.a = y.a
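The workflow logic for the SQL above is this: step1a filters stream_table1 and step2a counts b for each a; in parallel, step1b filters stream_table2 and step2b counts d for each a; the final select joins the two results.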
I will talk more about this approach in Part 2 of this series.
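For readers who want to try the CTE workflow locally, here is a small Python sketch that runs it against an in-memory SQLite database. The sample rows, the `b_count` and `d_count` aliases, and the `IS NOT NULL` filters are placeholders I made up, since the original filters are elided. The sketch reads step2a from step1a and joins the two branches on a.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stream_table1 (a TEXT, b INTEGER);
    CREATE TABLE stream_table2 (a TEXT, d INTEGER);
    INSERT INTO stream_table1 VALUES ('x', 1), ('x', 2), ('y', 3), ('y', NULL);
    INSERT INTO stream_table2 VALUES ('x', 10), ('y', 20), ('y', 30), ('z', 40);
    """
)

# Each CTE is one step of the workflow; the final SELECT joins both branches.
QUERY = """
WITH step1a AS (SELECT a, b FROM stream_table1 WHERE b IS NOT NULL),  -- placeholder filter
     step2a AS (SELECT a, COUNT(b) AS b_count FROM step1a GROUP BY a),
     step1b AS (SELECT a, d FROM stream_table2 WHERE d IS NOT NULL),  -- placeholder filter
     step2b AS (SELECT a, COUNT(d) AS d_count FROM step1b GROUP BY a)
SELECT x.a, x.b_count, y.d_count
FROM step2a x JOIN step2b y ON x.a = y.a
ORDER BY x.a
"""
result = conn.execute(QUERY).fetchall()
```

In an ELT setup this SQL would run inside the sink data store rather than in the connector. SQLite only stands in for that store here.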
rETL stands for reverse ETL (extract, transform, and load). To better explain rETL, we need to understand ETL. ETL is the process of extracting data from a source and transforming the data BEFORE it gets to the sink. This is unlike ELT, where the transformation happens in the sink data store.
rETL is the reverse of ETL where the sink becomes the source. The process extracts data from an analytical data store or a data warehouse (considered the system of truth) to a third party system (typically a CRM or another SaaS application) to take action. rETL is also called data activation.
The transformation in rETL, oddly, can occur in the data warehouse. So technically it should be TEL: transform the data in the data warehouse, extract the transformed data, then load it into the third party system.
Whether rETL solutions qualify as real-time streaming solutions is debatable. I’ll cover this more in Part 2 of this series.
Stateful Stream Processors
Stateful stream processors tend to already have some connectors built. Therefore we will assume they have most of the capabilities in the Connectors, CDC, ELT, rETL category.
Stateful stream processors can hold state for complex logic, transformations, and joins, which is logic that cannot be accomplished with connectors.
For example, below is an Apache Flink SQL query that uses MATCH_RECOGNIZE, a SQL clause for complex event processing (CEP) that enables pattern matching across events in the stream.
SELECT T.aid, T.bid, T.cid
FROM MyTable
MATCH_RECOGNIZE (
  PARTITION BY userid
  ORDER BY proctime
  MEASURES
    A.id AS aid,
    B.id AS bid,
    C.id AS cid
  PATTERN (A B C)
  DEFINE
    A AS name = 'a',
    B AS name = 'b',
    C AS name = 'c'
) AS T
In a later post, I’ll break this syntax down into its parts. Basically, the SQL above is looking for records that come in the exact order shown below.
name | aid | bid | cid |
-------------------------
a    | foo | bar | 1   |
b    | foo | bar | 2   |
c    | foo | bar | 3   |
If this pattern is found, then the stream processor sends the records downstream.
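To show why this kind of matching needs state, here is a minimal Python sketch of the same idea outside of Flink. It is not how Flink implements MATCH_RECOGNIZE. The `SequenceMatcher` class is hypothetical, it assumes events arrive already ordered, and it keeps a partial match per userid until an 'a', 'b', 'c' sequence arrives back to back.

```python
from collections import defaultdict

PATTERN = ("a", "b", "c")  # event names that must arrive consecutively


class SequenceMatcher:
    def __init__(self):
        # State: for each userid, the ids of the events matched so far.
        self._partial = defaultdict(list)

    def process(self, userid, name, event_id):
        """Feed one event; return (aid, bid, cid) when the pattern completes."""
        partial = self._partial[userid]
        if name == PATTERN[len(partial)]:
            partial.append(event_id)
        elif name == PATTERN[0]:
            # The sequence broke, but this event can start a new one.
            self._partial[userid] = [event_id]
            return None
        else:
            partial.clear()
            return None
        if len(partial) == len(PATTERN):
            match = tuple(partial)
            partial.clear()
            return match
        return None


events = [
    ("u1", "a", 1), ("u2", "a", 10), ("u1", "b", 2), ("u2", "b", 11),
    ("u1", "x", 3), ("u2", "c", 12), ("u1", "a", 4), ("u1", "b", 5),
    ("u1", "c", 6),
]
matcher = SequenceMatcher()
matches = []
for userid, name, event_id in events:
    found = matcher.process(userid, name, event_id)
    if found is not None:
        matches.append((userid, found))
```

The per-user dictionary is the state. A stateless connector sees one event at a time and has nowhere to remember that an 'a' and a 'b' already arrived for a given user.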
Real-Time OLAPs can consume data from a streaming platform or be the sink in a stateful stream processor pipeline. They persist event data in such a way that queries against it have very low latency. This serves applications or dashboards that submit synchronous requests (request and response) to present results to the user.
These solutions combine stateful stream processing with an OLAP database. They are able to build their own materialized views without the help of an external stateful stream processor, and they can serve analytical queries to dashboards or applications.
Lake houses have streaming features. Tables in lake houses can act like topics in a streaming platform. These tables can then be used as sources and sinks in Apache Flink or Apache Spark.
Use Case Requirements
Let me first define what a “real-time use case” is. It’s a use case that requires capturing data as it is produced and serving it quickly to a consumer with context. The consumer can then act on the data in near real-time. This “data” is called EVENTS.
Requirement 1: The flow of data that reaches the consumer must always be in a stream to ensure real-time delivery. If at any point the events are removed from the stream, the flow no longer qualifies as a real-time solution.
Requirement 2: The latency from the source of the event to the display of the analytics should be within 1 minute. Analytics that arrive after 1 minute are considered too late and have lost their value. Some examples of use cases that require this SLA are:
Personalization and product recommendation to keep customer’s attention.
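Requirement 2 can be expressed as a simple check on two timestamps. The Python sketch below is only an illustration; the `is_on_time` function and the `MAX_LATENCY` constant are hypothetical names, and it assumes both timestamps come from comparable, synchronized clocks.

```python
from datetime import datetime, timedelta, timezone

MAX_LATENCY = timedelta(minutes=1)  # the 1-minute SLA from Requirement 2


def is_on_time(event_time: datetime, served_time: datetime) -> bool:
    """True if the analytics reached the consumer within the SLA."""
    return served_time - event_time <= MAX_LATENCY


event_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
fast = is_on_time(event_time, event_time + timedelta(seconds=45))
slow = is_on_time(event_time, event_time + timedelta(seconds=61))
```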
These requirements will start to disqualify some of the logos in the diagram above. I will identify them and provide reasons why they were disqualified as we go through the series of posts.
Use Case Simple Diagram
Organizing the ecosystem could be best accomplished by defining a set of categories and then building a simple use case that identifies the niche for each of the players in the ecosystem.
The goal is to get the events produced on the left to the consumers on the right. The event producers could be almost anything. A lot of it is driven by humans which explains why humans also exist on the right as consumers. Humans want to know what other humans are doing including themselves. Also on the right is a gear that represents an application. Applications also exist on both sides because they both can produce and react to events.
We will assume the consumers need these events as close to real-time as possible. We also assume that any time lost (even seconds) between the production and the consumption of an event could mean a loss in revenue and/or put the business at risk. In other words, the value of the event degrades as time moves forward.
What we need to do is fill in the middle of the diagram which I’ll do in a series of posts.