The data mesh concept is a decentralized approach to data architecture. Its core pillar is reassigning data ownership to the engineering team that produced the data - the origin domain. This shift in ownership depends on the other three pillars that make up the data mesh concept. Here are all four pillars of this approach, which aims to strike a balance between decentralization and centralization:
Reassigning data ownership to the domain that captured the data.
Rethinking data as a product, which allows domains to share data with one another.
Simplifying the production and consumption of data products through self-services.
Applying policies to protect domains from each other.
This is a greatly simplified summary of the ideas that compose the data mesh concept. The most important point to understand is that data is owned by domains and hence decentralized. Because of this, we need to make it easy for domains to take part in creating a harmonious mesh of shared data products.
In a data mesh, the domains own and produce data products. The self-services and federated data governance that support the domains can be deployed and managed centrally or decentrally, creating different degrees of decentralization in the data mesh.
Centralized Data Mesh
Yes, this can sound like an oxymoron: a decentralized concept implemented using centralized infrastructure (for processing and storage) and self-services (APIs). This model is easy for teams that are early in their data mesh journey because data can be published and served from a centralized location. It suits domains that don’t have the infrastructure or the skill set to self-serve their data products.
Domains are equipped with tools and APIs that enable them to produce and publish data products. No additional infrastructure is needed for domains to transform and store their data products, which means the data must be replicated to a centralized location to be refined and published to other domains.
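To make this concrete, here is a minimal sketch of what publishing through such a centralized self-service API could look like in Python. The endpoint URL, the payload fields, and the `orders.completed` product are hypothetical illustrations, not a real data mesh API:

```python
# Hypothetical sketch: a domain publishing a data product to a
# centralized self-service API. Endpoint, fields, and product name
# are illustrative assumptions, not a real product or standard.
import requests

CENTRAL_API = "https://mesh.example.com/api/v1/data-products"  # hypothetical

payload = {
    "domain": "orders",                      # the owning domain
    "name": "orders.completed",              # data product identifier
    "schema_version": "1.3.0",               # tracked in the central schema registry
    "location": "s3://mesh-central/orders/completed/",  # centralized storage
    "sla": {"freshness_minutes": 15},        # part of the data contract
}

resp = requests.post(CENTRAL_API, json=payload, timeout=10)
resp.raise_for_status()
print("Published:", resp.json())
```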
Scaling in a centralized data mesh is vertical: infrastructure is monitored and scaled as needed to meet and maintain SLAs. It is straightforward to assemble the metadata into a centralized data catalog and schema registry. A repository for artifacts like libraries is also needed for building and sourcing domain data.
Replication can be done in two ways, push or pull; a brief sketch of both models follows this list:
Pull Model - In the pull model, the centralized services initiate connections to internal domain data systems. They are given controlled access to these systems to pull data for centralized processing and serving.
Push Model - In the push model, domains initiate connections to the centralized infrastructure. They manage credentials for the centralized infrastructure and push their data to it.
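The following sketch contrasts the two directions. The database, endpoint, credential handling, and table names are assumptions for illustration:

```python
# Illustrative sketch of the two replication directions; all hosts,
# credentials, and table names are hypothetical.
import sqlite3  # stand-in for any domain-internal database

import requests

def pull_from_domain(domain_db_path: str) -> list[tuple]:
    """Pull model: the centralized service initiates the connection
    and reads data out of a domain-internal system."""
    with sqlite3.connect(domain_db_path) as conn:
        return conn.execute("SELECT * FROM orders").fetchall()

def push_to_central(rows: list[dict]) -> None:
    """Push model: the domain initiates the connection and uploads
    its data to the centralized infrastructure."""
    resp = requests.post(
        "https://mesh.example.com/api/v1/ingest/orders",  # hypothetical endpoint
        json=rows,
        headers={"Authorization": "Bearer <domain-managed-credential>"},
        timeout=10,
    )
    resp.raise_for_status()
```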
Your network topology may, in the end, decide which approach you choose. One-way private networking like AWS PrivateLink will force you into a push model because the centralized infrastructure cannot initiate a connection to internal domain systems. You may need to implement both push and pull models, which adds complexity to the APIs and the workflows behind them.
With centralized infrastructure, expect your data products to be processed, stored, and served on multi-tenant systems. Depending on the sensitivity and policies (like GDPR) attached to the data product, centralized multi-tenant infrastructure may not be allowed.
Another limitation is providing global access to data products: centralized infrastructure makes consumption difficult for domains that do not reside in or near its region. Domains consuming data products over vast distances will experience timeouts and disconnects, creating a frustrating user experience. You can resolve this by decentralizing more of the infrastructure used to share data.
Data Mesh Control Plane & Data Plane
In this data mesh model, the domains hold the data (the data plane), and a central server acts as the control plane. The control plane configures, manages, and controls the data plane and is still considered centralized. It controls the connections between domains, and these connections make up the data plane.
Domains publish and register only the metadata of their data products with the control plane; metadata can include schemas and lineage. In contrast to the centralized data mesh, domains are equipped with the infrastructure to serve their data products, and that infrastructure is monitored and controlled by the control plane. The control plane still holds the data catalog, schema registry, and monitoring systems, but data products do not “pass through” it.
Data products are shared between domains through direct connections. The control plane governs these connections, which can be granted, denied, and severed. This means that policies like GDPR are still enforced by the control plane.
The control plane can deny connections because of data policies or a lack of resources that could affect SLAs. Companies have the freedom to create any policy related to domain connectivity; these policies can be implemented as a simple rule or a complex workflow. SLAs, schemas, and policies together compose a data contract.
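As an illustration, a data contract and a toy connectivity policy check might look like the following sketch; the field names and policy vocabulary are assumptions, not a standard:

```python
# A data contract, as it might be registered with the control plane.
# Field names and the policy vocabulary are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DataContract:
    product: str              # data product identifier
    schema_ref: str           # pointer into the central schema registry
    sla_freshness_minutes: int
    policies: list[str] = field(default_factory=list)  # e.g. ["gdpr", "eu-only"]

def may_connect(contract: DataContract, consumer_region: str) -> bool:
    """A toy policy check the control plane might run before granting
    a direct domain-to-domain connection."""
    if "eu-only" in contract.policies and not consumer_region.startswith("eu-"):
        return False
    return True

contract = DataContract(
    product="orders.completed",
    schema_ref="registry://orders.completed/3",
    sla_freshness_minutes=15,
    policies=["gdpr", "eu-only"],
)
print(may_connect(contract, "us-east-1"))  # False: connection denied
```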
This model’s data plane can support a data mesh in which data products are shared globally by replicating data locally to the consuming domain. Domains can subscribe to a feed to build a local replica of the data from which they consume.
This approach works very nicely for dimensional data. In a data warehousing star schema, dimension tables have primary keys, so when replicating dimensional data, you will need to support UPSERT (insert or update) while building the local replica.
Fact data in data warehousing is append (or insert) only, so it will grow in proportion to its throughput. Consider holding only what you need of this data, either by implementing retention or by storing it locally in cheap cloud storage.
These techniques can be implemented using real-time streams or batching. If you need to backfill historical data, a one-time batch process to copy and upload the data will suffice; a real-time stream will then keep both your dimensional and fact data fresh. Keeping the star schema allows you to build your own analytics using joins instead of having them predetermined by replicating a denormalized view.
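Here is a minimal sketch of the replica-building logic, assuming a replication feed that tags each record as dimensional or fact; the record shapes are invented for illustration:

```python
# Minimal sketch of building a local replica from a replication feed.
# The feed format and record shapes are assumptions for illustration.
local_dims: dict[str, dict] = {}   # dimension replica, keyed by primary key
local_facts: list[dict] = []       # fact replica, append-only

def apply_record(record: dict) -> None:
    if record["kind"] == "dimension":
        # UPSERT: insert new primary keys, overwrite existing ones
        local_dims[record["pk"]] = record["row"]
    else:
        # Facts are append (insert) only and grow with throughput;
        # a real implementation would apply retention here.
        local_facts.append(record["row"])

feed = [
    {"kind": "dimension", "pk": "cust-1", "row": {"name": "Ada"}},
    {"kind": "fact", "row": {"cust": "cust-1", "amount": 9.99}},
    {"kind": "dimension", "pk": "cust-1", "row": {"name": "Ada Lovelace"}},  # update
]
for record in feed:
    apply_record(record)

print(local_dims["cust-1"])  # {'name': 'Ada Lovelace'}: the update won
```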
Consider databases like Apache Pinot, which support upserts, streaming ingestion, joins, and backfilling, to meet all of these requirements.
In this data mesh model, governance is still a centralized component. Policy enforcement is done by the control plane. To some, this may infringe on a domain’s ability to be autonomous.
Peer-To-Peer Data Mesh
In this approach, all data and metadata are decentralized, giving complete autonomy to every domain in the mesh. Domains handle their own scalability and security, enforce their own policies, and publish their own data products. However, without a centralized location to discover data products, it can be difficult for domains to find the data products they need.
In this model, domains cannot discover each other without some kind of search mechanism. When domains publish a data product, they can register it in a DNS-like system. Data product crawlers can then crawl the network looking for published data products and make them available for semantic search.
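A toy sketch of such a DNS-like registry, with invented product names and endpoints, could look like this:

```python
# Toy sketch of a DNS-like registry for a peer-to-peer mesh. The
# registry structure and lookup semantics are illustrative assumptions.
REGISTRY: dict[str, str] = {}  # data product name -> domain endpoint

def register(product: str, endpoint: str) -> None:
    """A domain announces a published data product, like a DNS record."""
    REGISTRY[product] = endpoint

def resolve(product: str) -> str | None:
    """A consuming domain (or a crawler) resolves a product name
    to a peer endpoint it can connect to directly."""
    return REGISTRY.get(product)

register("orders.completed", "https://orders.acme.example/products/completed")
print(resolve("orders.completed"))
```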
A P2P data mesh is probably most suitable for B2B use cases where multiple businesses share data: partnerships are formed between businesses, which then establish dedicated connections between themselves. It may also be suitable for publicly available data products with minimal security and SLA requirements.
How Does Streaming Data Mesh Fit In?
A streaming data mesh can be implemented using either of the two previous approaches: a data mesh control and data plane, or peer-to-peer. Interestingly, there is an overlap between the concepts behind the P2P approach followed by blockchain-based technologies and streaming (e.g., immutability), which is also reflected in hybrid streaming/blockchain technologies such as Goldsky (https://goldsky.com/). In practice, you will probably want to implement the streaming form of data mesh using the former approach, e.g., by publishing data to centralized Kafka clusters. It is possible to provide a "self-serve infrastructure" so that each domain can, for example, spin up its own Kafka clusters; in practice, though, this would mean that each domain has to take care of many aspects concerning security, authentication, and authorization.
The big advantage of using a streaming flavor of data mesh is that data can always be published using the push mechanism, decreasing the load on the source systems. On the other side of the equation, the target domains can always pull the data whenever they please without depending too much on the availability of the source systems in the source domains. On top of that, it is possible to consume the data in near real time, which is required for an increasing number of use cases, especially in the B2C context: none of your customers wants to see stale data or wait too long for fresh responses.
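As a minimal sketch, assuming a centralized Kafka cluster and the confluent-kafka Python client, the push and pull sides could look like this; the broker address, topic, and group names are placeholders:

```python
# Minimal sketch with the confluent-kafka Python client: the source
# domain pushes, the target domain pulls at its own pace. Broker
# address, topic, and group names are placeholders.
from confluent_kafka import Consumer, Producer

# Source domain: push a data product record to the central cluster.
producer = Producer({"bootstrap.servers": "central-kafka:9092"})
producer.produce("orders.completed", key="order-42", value='{"amount": 9.99}')
producer.flush()

# Target domain: pull whenever it pleases; Kafka retains the data,
# decoupling the consumer from the source system's availability.
consumer = Consumer({
    "bootstrap.servers": "central-kafka:9092",
    "group.id": "analytics-domain",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.completed"])
msg = consumer.poll(10.0)
if msg is not None and msg.error() is None:
    print(msg.key(), msg.value())
consumer.close()
```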
In addition, basing your implementation of data mesh on streaming yields the advantage that you can always decide, for all the data flowing through your organization, whether to consume it in a streaming or batch fashion. A streaming data mesh is backward compatible with the non-streaming version of data mesh. To convert streaming data to batch or bulk data, newer technologies such as streaming databases (e.g., Materialize or RisingWave) can help if the target domains reading the data cannot or do not wish to build this conversion themselves.
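If a target domain prefers to hand-roll this conversion instead, a simple (and simplistic) sketch is to micro-batch the stream into periodic bulk files; again, the broker, topic, and file layout are assumptions for illustration:

```python
# Sketch of a hand-rolled stream-to-batch conversion that a streaming
# database such as Materialize or RisingWave would otherwise handle.
# Broker, topic, and file layout are assumptions for illustration.
import json
import time

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "central-kafka:9092",
    "group.id": "batch-converter",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders.completed"])

batch: list[bytes] = []
deadline = time.time() + 3600  # flush one batch file per hour

while time.time() < deadline:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() is not None:
        continue
    batch.append(msg.value())

# Write the accumulated records as one bulk file for batch consumers.
with open(f"orders_{int(deadline)}.jsonl", "w") as f:
    for value in batch:
        f.write(json.dumps(json.loads(value)) + "\n")
consumer.close()
```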