One of the critical ideas behind data fabric is the ability to access any data asset in the organization through a central, easy-to-use access point. This can be achieved with a data virtualization layer that abstracts away the complexity of the underlying systems and offers a single point of access. Beyond centralized access, this layer often provides other capabilities, such as caching, security, modeling, and cross-source federation, which can be enforced consistently across the organization. Together, these features give end users the impression that all company data is consolidated in a single system, even when it is spread across dozens of heterogeneous systems. Data fabric vendors implement two main architectures to provide this capability:
- Specialized data virtualization layers
- Data engines with data virtualization extensions
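To make the idea of cross-source federation concrete, the following minimal Python sketch uses two in-memory SQLite databases as stand-ins for heterogeneous source systems. A hypothetical virtualization function pushes a simple aggregation down to each source and merges the partial results centrally; all names (`crm`, `billing`, `federated_totals`) are illustrative, not part of any vendor's API.

```python
import sqlite3

# Two in-memory SQLite databases stand in for heterogeneous source systems
# (e.g., a CRM system and a billing system). All names are hypothetical.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                    [(1, 100.0), (1, 50.0), (2, 75.0)])

def federated_totals():
    """Answer a query that spans both sources, as a virtualization
    layer would: push work down to each source, then combine."""
    # Push-down: each source executes the part of the query it can
    # handle locally (a lookup here, an aggregation there).
    names = dict(crm.execute("SELECT id, name FROM customers"))
    totals = billing.execute(
        "SELECT customer_id, SUM(amount) FROM invoices "
        "GROUP BY customer_id")
    # Federation step: merge the partial results centrally, so the
    # caller sees one consolidated answer instead of two systems.
    return {names[cid]: total for cid, total in totals}

print(federated_totals())  # {'Acme': 150.0, 'Globex': 75.0}
```

The end user queries one logical interface and never sees that the customer names and the invoice amounts live in different systems; where the join and aggregation actually execute is precisely the performance question the two architectures below answer differently.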
In this paper, we explore both architectures in detail, focusing on the implications of these implementation decisions for query execution performance. To illustrate the differences between the two architectures, we have performed extensive benchmarks using TPC-H, which show how each architecture performs under different scenarios.