Data lakes and semantic layers have existed for a long time, each living in its own walled garden, tightly coupled to fairly narrow use cases. As data and analytics infrastructure migrates to the cloud, many are rethinking how these foundational technology components fit into the modern data and analytics stack. In this article, we’ll look at how the data lakehouse and the semantic layer together break down the traditional relationship between the data lake and analytics infrastructure. We’ll see how a semantic lakehouse can dramatically simplify cloud data architecture, eliminate redundant data movement, and reduce both time to value and cloud costs.
Traditional data and analytics architecture
In 2006, Amazon introduced Amazon Web Services (AWS) as a new way to offload on-premises data centers to the cloud. A core AWS service was its file store, Amazon S3, and with it the first cloud data lake was born. Other cloud vendors later introduced their own versions of cloud data lake infrastructure.
For most of its life, the cloud data lake has been relegated to the role of dumb, cheap storage—a staging area where raw data waits to be processed into something useful. For analytics, the data lake served as a holding pen until the data could be copied and loaded into a purpose-built analytics platform: typically a relational cloud data warehouse or OLAP cubes, proprietary business intelligence (BI) tool extracts such as Tableau Hyper or Power BI Premium, or all of the above. As a result of this pattern, data is stored at least twice, once in its raw form and once in its “analytics optimized” form.
Not surprisingly, most traditional cloud analytics architectures look like the diagram below:
Image 1: Traditional data and analytics stack
As you can see, the “Analytics Warehouse” does most of the work of delivering analytics to consumers. The problems with this architecture are as follows:

- Data is stored twice, which increases cost and creates operational complexity.
- The data in the analytics warehouse is a snapshot, which means it is stale the moment it lands.
- The data in the analytics warehouse is usually a subset of the data in the data lake, which limits the questions consumers can ask.
- The analytics warehouse scales separately from the cloud data platform, introducing additional cost, security concerns, and operational complexity.
Given these drawbacks, you may ask, “Why would a cloud data architect choose this design pattern?” The answer lies in the demands of analytics consumers. While data lakes can theoretically serve analytical queries directly to consumers, in practice they are too slow and incompatible with popular analytics tools.
If only the data lake could deliver the benefits of an analytics warehouse without storing the data twice!
Birth of the Data Lakehouse
The term “lakehouse” was introduced in the seminal 2020 Databricks white paper “What Is a Lakehouse?” by Ben Lorica, Michael Armbrust, Reynold Xin, Matei Zaharia, and Ali Ghodsi. The authors introduced the idea that a data lake could serve as an engine for delivering analytics, not just a static file store.
Data lakehouse vendors delivered on this vision by offering high-speed, scalable query engines that operate on raw data files in the data lake and expose an ANSI-standard SQL interface. With this major innovation, proponents of the architecture argue that a data lake can behave like an analytics warehouse without the need to duplicate data.
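The core idea—standard SQL served directly over raw files, with no warehouse copy—can be sketched with Python’s standard library. Real lakehouse engines (for example, Spark SQL or Trino) query open formats like Parquet in place on object storage; in this illustrative sketch, `sqlite3` stands in for the query engine and a local CSV file stands in for raw data landed in the lake. All file, table, and column names are assumptions for the example.

```python
import csv
import os
import sqlite3
import tempfile

# A "raw" file in the lake: order events landed as CSV.
raw = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
csv.writer(raw).writerows([
    ["order_id", "region", "amount"],
    ["1", "EMEA", "120.50"],
    ["2", "APAC", "75.00"],
    ["3", "EMEA", "42.25"],
])
raw.close()

# A lakehouse engine exposes standard SQL over such files in place.
# sqlite3 is only a stand-in, so we register the rows in an in-memory table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (order_id INTEGER, region TEXT, amount REAL)")
with open(raw.name, newline="") as f:
    rows = list(csv.reader(f))[1:]  # skip the header row
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Consumers query with plain ANSI SQL -- no copy into a separate warehouse.
for region, total in con.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"):
    print(region, total)   # APAC 75.0 / EMEA 162.75

os.unlink(raw.name)
```

The point of the sketch is the access pattern, not the engine: the file lands once, in the lake, and SQL is evaluated against it directly.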
However, it turns out that the analytics warehouse performs other important functions that a data lakehouse architecture alone does not satisfy, including:

- Delivering consistent “speed of thought” query performance (responses within roughly two seconds) across a wide range of queries.
- Presenting a business-friendly semantic layer that lets consumers query without writing SQL.
- Enforcing data governance and security at query time.
So, for the data lakehouse to truly replace the analytics warehouse, we need something more.
The role of the semantic layer
I’ve written extensively about the role of the semantic layer in the modern data stack. To summarize, a semantic layer is a logical, business-friendly view of data that uses data virtualization to translate physical data into business terms at query time.
By adding a semantic layer platform on top of the data lakehouse, we can eliminate the analytics warehouse entirely, because the semantic layer platform:

- Provides “speed of thought” queries on the data lakehouse using data virtualization and automated query performance tuning.
- Provides a business-friendly semantic layer that replaces the proprietary semantic views embedded inside each BI tool and lets business users query without writing SQL.
- Enforces data governance and security at query time.
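The translation step described above—business-friendly names resolved into physical SQL at query time, with governance applied on the way through—can be sketched as follows. This is a minimal, hypothetical model, not any vendor’s API: the metric, dimension, and table names are invented for illustration, and `sqlite3` again stands in for the lakehouse engine.

```python
import sqlite3

# Physical table in the lakehouse (sqlite3 stands in for the lake engine).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fct_orders (ord_amt REAL, rgn_cd TEXT)")
con.executemany("INSERT INTO fct_orders VALUES (?, ?)",
                [(120.50, "EMEA"), (75.00, "APAC"), (42.25, "EMEA")])

# A minimal semantic model: business-friendly names mapped to physical
# columns and expressions, plus a governance row filter. All names are
# hypothetical.
SEMANTIC_MODEL = {
    "metrics":    {"Total Revenue": "SUM(ord_amt)"},
    "dimensions": {"Region": "rgn_cd"},
    "table":      "fct_orders",
    "row_filter": "rgn_cd != 'RESTRICTED'",   # enforced at query time
}

def query(metric, by):
    """Translate a business question into physical SQL at query time."""
    m = SEMANTIC_MODEL
    sql = (f"SELECT {m['dimensions'][by]}, {m['metrics'][metric]} "
           f"FROM {m['table']} WHERE {m['row_filter']} "
           f"GROUP BY {m['dimensions'][by]} ORDER BY 1")
    return list(con.execute(sql))

# The consumer asks for "Total Revenue by Region" -- no SQL required.
print(query("Total Revenue", by="Region"))   # [('APAC', 75.0), ('EMEA', 162.75)]
```

Because the translation happens at query time, the business definition of “Total Revenue” lives in one place rather than being re-implemented inside each BI tool, and the row filter is applied on every query regardless of which tool issued it.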
A semantic layer platform supplies the pieces the data lakehouse is missing. By adding a semantic layer to a data lakehouse, organizations can:
- Eliminate data duplication and simplify data pipelines.
- Consolidate data governance and security.
- Provide a “single source of truth” for business metrics.
- Reduce operational complexity by keeping data in the data lake.
- Give analytics consumers access to more data, and more timely data.
Image 2: The new data lakehouse stack with a semantic layer

Semantic Lakehouse: Everyone Wins
Everyone wins with this architecture. Consumers get access to more granular data with less latency. IT and data engineering teams have less data to move and transform. Finance spends less on cloud infrastructure.
As you can see, by adding a semantic layer to a data lakehouse, organizations can simplify their data and analytics operations and deliver more data, faster, to more consumers, at lower cost.