Dictionary.com defines “real-time” as meaning
The actual time during which a process or event occurs.
Even if you were standing next to the event as it occurs, there’s additional time for the light to reach your eyes and for your brain to process that information. By the time that has completed, the actual time has already passed. In real-time analytics, we have to take the same consideration: time to travel and time to process.
The term “analytics” in the context of data can mean many things to many different people. To some it could mean machine learning and to others simple joins and aggregations of data. At the end, they all have a common goal:
Discovering useful information, informing conclusions, and supporting decision-making.
Together we can define real-time analytics as discovering useful information from (near) actual time events to inform conclusion and supporting decision-making. Let’s apply it to a technical solution.
Event capture and processing
The event that has occurred is your data and it needs to travel to some input that can receive it. That receiver is usually some message broker like Apache Kafka facilitated by some producer like a Kafka connector. The message brokers are the eyes to all events happening in real-time and publishes them to the appropriate parts of the brain.
Next the events need to be set in a context. For example if you see a tree falling from a distance, it may not mean anything to you. But if you see your car in the path of the falling tree, then that event probably would matter much more.
Processing the event involves cleansing and enriching to form a full view of the effects of the event including its surroundings. This helps form conclusions of what is about to happen. In the case of the image above, the processing of all the data concludes you will need a ride home.
In real-time analytics, this would mean capturing the event in a messaging broker, processing it in real-time, then building a table to be displayed in a dashboard.
In the diagram above, the table being used in the dashboard is called a materialized view.
Oxford defines the term “materialize” below. When something materializes in front of you, it usually does it on its own. This understanding will be important later as we try to understand materialized views.
Traditional Views
To understand materialized views, we first need to understand traditional views. Both traditional views and materialized views both live in a database. Traditional views (or just “views”) are defined as a SQL statement that gets executed when the client selects from the view.
CREATE VIEW view_name AS
SELECT column1, column2, ...
FROM table_name
WHERE condition;
—- then select
SELECT * FROM view_name;
The view’s SQL is processed at the time submitted by the client. If there is a JOIN defined in the SQL or any complex logic, it is all processed at request time which could create a longer running query.
In the diagram above, table 1 & 2 represent data that is persisted. The application from the right executes the SQL “select * from view.” The view then JOINS two tables together then the data is returned to the client. This execution is done synchronously in a request/response pattern. Another way to describe these types of queries is to call them pull queries because the client is “pulling” the result from the VIEW to display in the dashboard. Notice the missing disk cylinder under the VIEW in the diagram. This indicates that the data in the view is not persisted and therefore needs to run the SQL to get latest changes each time. The diagram below may help simplify this concept.
In the dashboard, pull queries are submitted to the view via polling model (refresh interval) to get the latest data. Usually you can set the refresh interval from 5 seconds to 5 min or manually. If we set the refresh to every 5 seconds for example, then the query in your VIEW will be executed every 5 seconds. If you have many users viewing the dashboard, then the query will be executed every 5 seconds for every client watching the dashboard. This can put a lot of stress on the system serving the data.
Understanding Materialized Views
Alternatively, a materialized view executes the SQL but then persists it. In the diagram below, notice the cylinder under the materialized view. This means that the pull queries are only being served data that has already been preprocessed. These queries are a lot faster and can possibly handle millions of end users.
Any complex JOINs or transformations are done by the push query. The diagram below should help you understand this.
Both push and pull queries work together to provide real-time analytics.
Where to put what?
If you’re not lost, you may be asking “where do I put my transformation or my aggregation - the push or the pull query?” This is where you need to apply the requirements of your use case. Here are some suggestions.
For transformations that deal with cleansing, it’s best to do those in the push query. Most likely, the users executing the pull query will expect the data to be cleansed.
Any heavy workloads would be best done in the push query. You normally would want that work already done to make the pull queries fast.
Joining data is very use case specific. If you want fast pull queries, it’s best to get the join work out of the way and not part of the pull query latency. But if you want you users to define the queries and latency isn’t an issue, then you would put it in the pull query. The push query will limit you to one join.
Aggregations are best done in the pull query since they are often decided by the user requesting the information. Putting them in the push query will limit how you can slice-and-dice your data.
Summary
Materialized views can be fed by asynchronous stream-like processes. Clients issuing pull queries on them can expect real-time data because of this natural streaming characteristic. And since the data is persisted in a materialized view, the pull queries are fast.
In real-time analytics, you want the data to always be moving. Putting them at rest will create latency. But you also want to provide enough ad hoc / slice-and-dicing of your data so that users can build compelling reports and dashboards that are customer facing.
Keep the push/pull query in mind when building a solution. Understand the tradeoffs you may encounter when you implement them.
For more information about materialized views, join me at RTA Summit 2023. You can also use my discount code link here.