Stream Processing vs Real-time OLAP vs Streaming Database
How to use them and where they overlap
Recently there has been a lot of interest in real-time, streaming data. Naturally, this has led to a lot of questions, especially about how to process and serve real-time data. If you do a web search on this topic, you’ll find many products and vendors in this space, which can be very confusing. As the saying goes:
“use the right tool for the job.”
But if you don’t know how to use the tool, how do you know if it’s the right tool?
The goal of this post is to describe the differences between stream processing, real-time OLAP databases, and streaming databases. This knowledge will help you organize and structure your research so you can make the best product decision for your real-time use case.
Typical Real-Time Solution
Figure 1 shows a typical real-time data flow that captures data from a source, transforms it, and serves it using a real-time OLAP database. The diagram ends with a real-time visualization delivered to an end user.
We’re going to focus on the second and third icons from the left: Real-Time Transformation and Real-Time OLAP.
Real-time transformation engines are distributed, stateful stream processors that can perform complex transformations on real-time data and handle high-throughput data at scale. In this post, we focus only on stream processors that support ANSI SQL. The role of the stream processor is to preprocess the data before it reaches the real-time OLAP database.
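To make "stateful preprocessing" concrete, here is a minimal sketch in Python of what a stream processor does conceptually: it holds state (here, a running count per key) and emits updated results downstream as each event arrives. Real engines express this in SQL and run it distributed and fault-tolerant; the event shape and field names below are invented for illustration.

```python
from collections import defaultdict

def aggregate_stream(events):
    """Toy stateful aggregation: a running count of views per page.

    A real stream processor keeps this state across a cluster and
    survives failures; this sketch only shows the core idea of
    emitting an updated aggregate for every incoming event.
    """
    state = defaultdict(int)          # the operator's state
    for event in events:              # in practice this input never ends
        state[event["page"]] += 1
        yield dict(state)             # push the updated result downstream

# Hypothetical page-view events
events = [{"page": "/home"}, {"page": "/docs"}, {"page": "/home"}]
snapshots = list(aggregate_stream(events))
# final snapshot: {"/home": 2, "/docs": 1}
```

In the pipeline from Figure 1, each emitted snapshot is what lands in the RTOLAP database, so the database only has to serve precomputed results.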
Real-time OLAP (RTOLAP) databases are databases that can serve data at a high rate of queries per second (QPS). End users could be actual people or other applications, and the higher the QPS, the better their experience. The stream processors take away the need for these databases to preprocess the data, so they can focus their resources on serving higher QPS.
In the previous sections, we learned about stream processors and RTOLAP databases. Streaming databases simply try to do both: they have stream processing capabilities as well as the ability to serve real-time data at high QPS like an RTOLAP (see Figure 2).
A streaming database cannot be created by simply putting a SQL-based stream processor and an RTOLAP together. The two must share the same SQL processor/engine for both transforming data and serving data. This provides a single SQL interface both for user-facing applications that pull data from an RTOLAP at high QPS and for developers building transformations that push data downstream in a data pipeline.
Push and Pull Queries
In a streaming database, the queries that transform data and the queries that serve data are known as push and pull queries, respectively. Push queries “push” their results downstream to either an operator or storage. These queries are good for preprocessing real-time data to prepare it for the analytical queries run by end users. This is exactly what the stream processor was doing in Figure 1.
The analytical queries are the pull queries because the end user is “pulling” the preprocessed data from the database. This is exactly what the RTOLAP was doing in Figure 1.
In Figure 3 below, push queries can push to a table or a Kafka topic. Applications subscribe to the topic, while end users run analytical pull queries against the RTOLAP table. Streaming databases are able to do both of these tasks in real time.
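The push/pull split can be modeled in a few lines of Python. This is an illustrative sketch, not any vendor's API: the push side continuously maintains a "materialized view" as events arrive, and the pull side reads the current state of that view on demand, like an end user's analytical query.

```python
from collections import defaultdict

# The "materialized view" table kept up to date by the push query
view = defaultdict(int)

def push_query(event):
    """Push: results flow into storage continuously as data arrives."""
    view[event["user"]] += event["amount"]

def pull_query(user):
    """Pull: the end user requests the precomputed result on demand."""
    return view[user]

# Hypothetical purchase events flowing through the push query
for e in [{"user": "a", "amount": 5},
          {"user": "b", "amount": 3},
          {"user": "a", "amount": 2}]:
    push_query(e)

pull_query("a")  # returns 7
```

Note that the expensive work (the aggregation) happens on the push side; the pull side is a cheap lookup, which is what makes high QPS possible.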
In many streaming databases, a Kafka topic can appear as just another table. Developers who use a streaming database need to know the difference. This is where things start to get confusing, because each streaming database has its own opinionated approach to defining them.
The tables that are actually Kafka topics are sometimes called “streams,” “append streams,” or “append-only streams.” If you were to query one of these tables, you would get an unbounded result. This means the results would never end, because streams don’t end.
The tables used to execute analytical pull queries are usually called “materialized views.” This is because the aggregation and preprocessing are continuously pushed into the table, “materializing” the result. End users then read that result with pull queries. If you were to query one of these tables, you would get a bounded result. This means the result returned would eventually end.
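The bounded/unbounded distinction can also be sketched in Python, with an invented stream shape: a stream behaves like an infinite generator that only terminates if the client imposes a limit, while a materialized view is a finite collection whose query simply returns.

```python
import itertools

def stream():
    """An append-only stream: querying it never terminates on its own."""
    n = 0
    while True:            # no natural end -- the stream keeps appending
        yield {"offset": n}
        n += 1

# A query over the stream is unbounded; the client must cut it off:
first_three = list(itertools.islice(stream(), 3))

# A query over a materialized view is bounded -- it returns and ends.
# (Contents are hypothetical.)
materialized_view = [{"page": "/home", "views": 2},
                     {"page": "/docs", "views": 1}]
result = list(materialized_view)   # finite result set
```

Without the `islice` limit, iterating the stream would loop forever, which is exactly why querying a stream-backed table yields an unbounded result.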
There is a lot more to this topic and it deserves its own blog/book. For now, this is enough to understand that streaming databases can serve analytical queries and process streams of data as part of a data pipeline.
This is the “spicy” part of the blog. Figure 4 plots vendors and open source projects that provide a stream processing engine, a real-time OLAP database, or a streaming database.
The x-axis measures how well a product can perform stream processing. This includes the ability to:
Handle high throughput streaming data.
Consume from popular streaming platforms like Kafka, Redpanda, Pulsar, Kinesis, etc.
Handle failures either from nodes in its cluster or clients.
Define transformations using SQL.
Hold state for complex transformations, joins, and aggregations.
The y-axis measures a product’s QPS performance. Products that score high here are usually the RTOLAP databases.
Products far right on the x-axis and low on the y-axis are pure stream processors. Products high on the y-axis and far left on the x-axis are pure RTOLAP databases. Any product in or near the blue box starts to behave like a streaming database. Below is the icon legend.
Some additional information:
RisingWave - Stream processor similar to ksqlDB but uses Hummock (an LSM-tree-based storage engine built in Rust). In fact, all of RisingWave is built in Rust. Preliminary benchmarks put its QPS above ksqlDB’s.
Materialize - Streaming database implemented in Rust with in-memory storage. The SaaS version uses an object store.
Timeplus - a unified database that supports both streaming query and historical query, written in C++. Transformation tasks are done in the stream processing layer and historical queries are done in underlying RTOLAP. (see details here)
StarRocks - Open source RTOLAP. Can also be used as a data warehouse. Provides transformation capabilities. Vendor is CelerData.
Tinybird - Uses ClickHouse as its underlying RTOLAP and exposes REST endpoints over real-time data. Provides transformation capabilities.
Decodable, Delta Stream, and Popsink all use Apache Flink. Delta Stream has the ability to swap out Flink for an alternative stream processing solution, and it also plans to enable pull queries in the first half of this year.
Apache Pinot - Open source RTOLAP. Provides the highest QPS of the RTOLAP databases listed here. Vendor is StarTree.
ClickHouse - Open source RTOLAP.
Apache Druid - Open source RTOLAP. Vendor is Imply.
Striim and GCP Dataflow - Managed stream processors.
A product that is not on the graph but deserves a mention is Memgraph. It’s a streaming graph database, which is slightly different from the streaming databases discussed in this post. It appears to be the only one of its kind.
The information used to construct the graph came from documentation found on vendor sites, open source project sites, and GitHub, as well as from questions asked in each corresponding community.
If you are a vendor and would like to adjust the position of your logo on the graph, please reach out to me on LinkedIn.