This post will cover connectors, change data capture (CDC), ELT, and rETL solutions and providers. Below is the current list of tools that fall into these categories.
This is Part 2 of a multi-post series that will cover the real-time ecosystem. If you haven’t yet, read Part 1 of this series to get a description of the use case.
Connectors
In the diagram below, I’ve represented the flow of data from the source systems as a stream of 1’s and 0’s. Traveling from left to right, data are extracted from the source systems by the connector, represented as the plug in the middle.
Connectors are special because many of them have to transform data-at-rest into data-in-motion. Before I continue, it’s important to understand what data-in-motion means.
Data-in-motion does not mean batching or micro-batching. A dataset batched from one system to another has a beginning and an end. The applications that process batched data can be started with cron and then stopped when the data ends.
Data-in-motion is the opposite. There is no beginning or end to data-in-motion. Applications processing data-in-motion are always running, listening for new data to arrive even when none is arriving.
Data-in-motion implies streaming event data. Converting data-at-rest into data-in-motion can be difficult, because many source systems do not provide a way to stream data. This forces the connector to emulate a stream from the source system. The semantics of emulating a stream can get complex because data on a stream are considered events. Taking a snapshot of the data is not enough to emulate a stream; you also have to filter for recent events in the snapshot and send only those recent events downstream. We’ll talk about this more in the CDC section.
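To make this concrete, here is a minimal sketch of stream emulation in Python. The `fetch_snapshot` and `emit` helpers are stand-ins I’ve invented for illustration; a real connector would read from the actual source system, publish to a real destination, and track offsets so it can recover from failures.

```python
import time

def fetch_snapshot():
    """Stand-in for reading the source's current state, e.g. a SELECT over
    the whole table; returns {primary_key: row}."""
    # In a real connector this would query the source system.
    return {1: {"name": "Ada"}, 2: {"name": "Grace"}}

def emit(event):
    """Stand-in for sending a change event downstream, e.g. publishing it
    to a streaming platform topic."""
    print(event)

def emulate_stream(poll_interval_seconds=5):
    """Emulate data-in-motion from a data-at-rest source: poll the source and
    emit only what changed since the previous poll, forever."""
    previous = {}
    while True:  # no end -- the application keeps listening for new data
        current = fetch_snapshot()
        for key, row in current.items():
            if key not in previous:
                emit({"op": "insert", "key": key, "after": row})
            elif row != previous[key]:
                emit({"op": "update", "key": key, "before": previous[key], "after": row})
        for key in previous.keys() - current.keys():
            emit({"op": "delete", "key": key, "before": previous[key]})
        previous = current
        time.sleep(poll_interval_seconds)

emulate_stream()
```

Note that the application never exits: that is what makes the emulated output data-in-motion rather than just another batch.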
Connectors also have to convert a source protocol to a sink protocol. For example, an FTP (File Transfer Protocol) connector will need to convert the FTP protocol to its destination’s protocol. The connector has to find some commonality between the two protocols in order to make them compatible. The result is that the number of connectors these providers have to build grows quickly: roughly one for every source and sink pairing.
CDC
As mentioned previously, streams contain events. Snapshots of data aren’t events. Events are changes that have occurred to an entity (like a name change, for example). If no changes were made to an entity, then there’s no event. CDC (change data capture) is the process of capturing events that have occurred in a transactional database. These include insert, update, and delete transactions on records. Connectors that support CDC capture the changes in one of three ways:
Listening to the write-ahead log (WAL). As your applications interact with their data stores, each transaction is automatically recorded in a construct called the write-ahead log. This happens in real time, and each recorded transaction is effectively an event, which gives the WAL natural streaming properties. Debezium is an example of a connector that can read the WALs of a set of transactional (OLTP) databases.
Comparing snapshots. This involves taking a snapshot of a table and comparing it to a previous snapshot to filter out the changes. This can be compute intensive, especially if the table is large. It is also not true realtime: snapshots are taken at intervals, so a change that is reverted between two snapshots is lost entirely. Change-and-revert activity is exactly the kind of suspicious behavior you might want to detect, and this approach will miss it.
Comparing update timestamps. This approach saves the timestamp of the last batch of changes and filters for records whose update timestamps occur after it. It requires the table to include an update-timestamp column that is set every time the record changes (a minimal sketch of this approach follows below).
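Here is a minimal sketch of the update-timestamp approach, using Python’s built-in sqlite3 module as a stand-in for the source database; the `customers` table and its `updated_at` column are assumptions for illustration.

```python
import sqlite3
import time

# Assumed source table (illustrative):
#   CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)
# where updated_at is set on every insert and update.

def poll_changes(conn, poll_interval_seconds=5):
    """Yield changed rows forever, using a timestamp watermark to filter for
    records updated after the last batch we saw."""
    watermark = "1970-01-01T00:00:00"
    while True:
        rows = conn.execute(
            "SELECT id, name, updated_at FROM customers "
            "WHERE updated_at > ? ORDER BY updated_at",
            (watermark,),
        ).fetchall()
        for id_, name, updated_at in rows:
            yield {"op": "upsert", "id": id_, "name": name, "updated_at": updated_at}
            watermark = max(watermark, updated_at)
        time.sleep(poll_interval_seconds)

# Usage sketch: each yielded row is treated as a change event to send downstream.
# for event in poll_changes(sqlite3.connect("source.db")):
#     publish(event)  # hypothetical publish to a streaming platform
```

One limitation worth noting: a hard delete never updates a timestamp, so this approach cannot emit delete events unless the source soft-deletes rows.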
Fortunately, most OLTP databases have some way of reading their WAL, enabling connectors to capture change events in realtime. Some OLTP databases also have native support for submitting events to a Kafka-compatible cluster. Below is an illustration of a WAL in an OLTP database.
NOTE:
WALs were originally intended to replicate transactions between database instances for high availability during outages. Applications write to an active database, which is replicated to a passive database. CDC connectors tap into this stream of transactions and submit them to other systems as events.
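As a sketch of what sits on the other end of a WAL-based CDC connector, here is a small Python consumer reading change events that Debezium has published to Kafka. The topic name follows Debezium’s server.schema.table convention from its tutorial, and the broker address and converter settings are assumptions (this assumes the JSON converter).

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",      # assumed Debezium topic name
    bootstrap_servers="localhost:9092",   # assumed broker address
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
)

OPS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}

for message in consumer:  # runs forever: each WAL transaction arrives as an event
    if message.value is None:  # tombstones follow deletes; skip them here
        continue
    # With the JSON converter the change event may be wrapped in a "payload" field.
    payload = message.value.get("payload", message.value)
    print(
        OPS.get(payload.get("op"), "unknown"),
        "before:", payload.get("before"),
        "after:", payload.get("after"),
    )
```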
ELT
Extract, load, and transform (ELT) is the process of extracting data from a source system, loading it into an analytical database (OLAP), and invoking a SQL statement that runs on the OLAP datastore and contains the transformation logic. Let’s look into this in depth.
The diagram below illustrates the ELT process.
The data gets extracted from the OLTP database.
The data gets loaded into the OLAP datastore.
The SQL that transforms the data gets executed when the load completes.
The invocation of the SQL statement in the OLAP only gets triggered after the load has completed. This implies that the data flowing through the pipeline is batched, because it has an end. If the data were a stream, it would never end and the SQL would never be executed.
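Here is a minimal sketch of that sequence, using Python’s sqlite3 module as a stand-in for the OLAP datastore; the table names and transform SQL are invented for illustration. The point is that the transform can only run because the load finishes.

```python
import sqlite3

def extract():
    """Stand-in for the extract step: a finite batch of rows from the OLTP source."""
    return [(1, "alice", 120.0), (2, "bob", 80.0), (3, "alice", 45.0)]

olap = sqlite3.connect(":memory:")  # stand-in for the warehouse/OLAP datastore
olap.execute("CREATE TABLE raw_orders (customer_id INTEGER, customer TEXT, amount REAL)")

# Extract and Load: the batch has an end, so this call returns.
olap.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", extract())
olap.commit()

# Transform: triggered only after the load completes.
olap.executescript("""
    DROP TABLE IF EXISTS order_totals;
    CREATE TABLE order_totals AS
    SELECT customer, SUM(amount) AS total_amount
    FROM raw_orders
    GROUP BY customer;
""")

print(olap.execute("SELECT * FROM order_totals ORDER BY customer").fetchall())
# [('alice', 165.0), ('bob', 80.0)]
```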
Even if there were some way the SQL could be continuously triggered, the data is now at rest in the OLAP. This means the transformed data only exists in the OLAP, and serving it to other systems in real-time would be difficult. OLAP databases don’t typically have WALs because they don’t handle transactions the way OLTP databases do. External systems that need the data in the OLAP will have to schedule reads from it, which means the data pipeline is no longer realtime.
There are variations of this pattern, but all of them require some polling of the OLAP datastore to find new information. If this pattern still meets the SLA requirements for the use case, then it’s appropriate to use. Just keep in mind that the pipeline is not realtime; it just happens to finish executing within the SLA requirements.
rETL
Reverse extract, transform, and load (rETL) is also called data activation. It is the process of getting data out of an OLAP, like a data warehouse, and delivering the transformed data to other systems. Most of the time these other systems are external SaaS services.
The term “data activation” implies that the data is inactive. Inactive data are data stored in places that are inaccessible, or at least hard to get to. Getting data out of an OLAP or data warehouse can be hard because they are optimized only for analytical queries. OLAPs are the last mile for getting analytical data to end users; they are not optimized for realtime processing.
Today rETL is necessary because the OLAP or data warehouse is usually the last stop of the data pipeline. As mentioned previously, other systems have difficulty accessing the data in an OLAP system that isn’t meant to serve many subscribing systems. So how can we keep realtime data “active?”
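A minimal sketch of rETL follows, assuming a hypothetical SaaS contacts endpoint and using sqlite3 as a stand-in for the warehouse; real rETL tools add field mapping, batching, rate limiting, and retries on top of this.

```python
import sqlite3
import requests  # pip install requests

SAAS_URL = "https://api.example-crm.com/v1/contacts"  # hypothetical SaaS endpoint
API_KEY = "..."                                       # hypothetical credential

warehouse = sqlite3.connect("warehouse.db")  # stand-in for the OLAP/warehouse

# "Reverse" the flow: read already-transformed rows back out of the warehouse...
rows = warehouse.execute(
    "SELECT email, lifetime_value FROM customer_metrics"
).fetchall()

# ...and push them into the external SaaS system.
for email, lifetime_value in rows:
    requests.post(
        SAAS_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"email": email, "lifetime_value": lifetime_value},
        timeout=10,
    )
```

Because this reads from the OLAP, it has to run on a schedule and inherits whatever staleness the warehouse already has.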
Connect to Streaming Platforms
Streaming platforms like Apache Kafka enable publishing real-time data into topics. These topics can be subscribed to by many systems and can retain real-time data. By using connectors to publish data-in-motion into a streaming platform, you keep the data in an active state. This allows you to serve it to multiple subscribers, like the OLAP or data warehouse as well as external SaaS systems, without having to reactivate your data.
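Here is a minimal sketch of publishing data-in-motion to Kafka with the kafka-python client; the topic name, broker address, and event shape are assumptions for illustration.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a change event once; the OLAP, a warehouse sink connector, and any
# SaaS sinks can all subscribe to the same topic and read it independently.
event = {"op": "update", "key": 42, "after": {"name": "Ada"}}  # illustrative event
producer.send("customer-changes", value=event)
producer.flush()
```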
Connector Architecture
To keep the pipeline realtime, connectors should extract data from source systems like OLTP databases and publish it into a streaming platform.
Connectors should also act more like agents that live in the same infrastructure as the source or sink system. Agents sit alongside the source or sink systems and act as a proxy to the streaming platform. This simplifies the architecture by avoiding having to open up secured or private networks to the connector.
This unfortunately requires customers to manage these connectors themselves. Most SaaS-managed connectors, by contrast, require you to open up your secured network so they can establish connectivity to your systems.
Many OLTPs and OLAPs have the ability to integrate with streaming platforms directly. CockroachDB, Apache Pinot, Rockset, ClickHouse, Apache Druid, and others all have this ability. This is a feature all data stores should provide.
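As one example of this ability, here is a sketch of ClickHouse consuming a Kafka topic directly via its Kafka table engine, issued through the clickhouse-driver Python client. The broker, topic, table, and column names are assumptions for illustration.

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("localhost")  # assumed ClickHouse host

# 1. A table backed by the Kafka topic (the stream).
client.execute("""
    CREATE TABLE IF NOT EXISTS customer_changes_queue (id UInt64, name String)
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'localhost:9092',
             kafka_topic_list = 'customer-changes',
             kafka_group_name = 'clickhouse-consumer',
             kafka_format = 'JSONEachRow'
""")

# 2. A regular table to hold the data for analytical queries.
client.execute("""
    CREATE TABLE IF NOT EXISTS customers (id UInt64, name String)
    ENGINE = MergeTree ORDER BY id
""")

# 3. A materialized view that continuously moves rows from the stream into the table.
client.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS customers_mv TO customers
    AS SELECT id, name FROM customer_changes_queue
""")
```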
Alternatively, BYOC (bring your own cloud) is a deployment model that deploys managed components into the source’s or sink’s infrastructure. The components are still controlled by the service provider. This is becoming a more popular model because it enables all the capabilities of the services without compromising the security standards of the company.
Next Post
In the next post, I’ll talk more about streaming platforms and their providers. They all have their unique advantages and, as always, your use case will drive the decision when choosing a streaming platform.