Lakehouse architecture

The data lake versus data warehouse debate has raged for over a decade. But is a resolution finally in sight with the so-called ‘data lakehouse’, or is this just another example of a new buzzword generating hype? Exasol’s Market Intelligence Lead, Helena Schwenk, investigates.

Why are we debating the data warehouse versus data lake yet again?

This question has been asked for years by data professionals, as the merits of one approach have been weighed up against the other. The debate, however, has grown ever more heated in recent years, thanks to the greater prevalence of the cloud as a location for data analytics workloads, the brittleness of Hadoop deployments and the emergence of a new concept: the data ‘lakehouse’.

If that phrase is new to you, you’re not alone; it’s a fairly new concept. But in short, the data lakehouse refers to a hybrid data architecture that aims to mix the best of a data warehouse and data lake.

To understand how a data lake, data warehouse or data lakehouse can underpin a modern analytics infrastructure, it’s worth unpicking some of their similarities and differences.

In terms of similarities, all are fundamentally used for the management of transactional and operational data that form the basis of BI and advanced analytical workloads. This means they’re used for a wide-ranging set of analytical use cases across both business and developer functions.

However, it’s also worth remembering that each serves different goals, as borne out by their definitions.

Data warehouses are optimized for well-known, predefined and repeatable analytics needs that can be scaled across many users in the organization. As such, they are best suited to more structured and governed data, such as in the financial services or healthcare sectors. They generally support a SQL processing strategy and are suited to complex queries, high levels of concurrent access and high-performance data access requirements.

Data lakes collate raw or unrefined data captured from a diverse array of sources, and are also likely to contain data types that aren’t subject to rigorous governance. They support a range of different processing styles and approaches, including machine learning and batch-orientated workloads. Data lakes aren’t typically optimized for performance and the demands of production delivery, such as concurrency, latency and workload management.

The boundaries are not rigid, though. For instance, a data warehouse can also be used for operationalizing data science, whereby machine learning models are run against governed data. And a data lake can, for example, introduce analysis using approaches that leverage star schemas for batch-orientated queries.

Data lakehouses, on the other hand, aim to combine elements of data warehousing with core elements of the data lake. Put simply, they are designed to provide the lower costs of cloud storage for raw data alongside support for certain analytics concepts, such as SQL access to curated data tables or support for large-scale processing of machine learning workloads.

While this sounds good in principle, the lakehouse is an emerging and immature concept, which means there are differing views on how best to realize it. The main reason for this is that there are proponents on either side of the architectural divide. Those on the data warehouse side of the fence build the lakehouse around relational technology concepts, while those on the data lake side have roots in machine learning and Spark processing, where support for processing Java, Python and R workloads is paramount.

Although the lakehouse is an interesting concept, it’s still very ill-defined and (unsurprisingly) subject to a lot of hype and speculation. And though a heated discussion will continue for some time, it’s unlikely to irrevocably remove the need for either the data lake or data warehouse, or, more importantly, to overturn the enormous amount of innovation in the market. This is especially true when you consider the decades of data warehouse development seen in areas such as query and performance optimization, in-database analytics, columnar storage and compression.

The strategy for data democratization is co-existence

Choosing between a data warehouse or data lake or, dare I say, lakehouse need not be an either/or decision. You’re unlikely to find that replacing one with the other is an optimum solution.
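
To make the hybrid idea discussed in this article a little more concrete, here is a minimal sketch in Python of raw, loosely governed records sitting alongside curated tables that can be queried with SQL. It is a toy illustration only, not how any particular lakehouse product works: the sample records, the table names (`dim_customer`, `fact_order`) and the helper functions are all hypothetical, and an in-memory SQLite database stands in for the curated layer.

```python
import json
import sqlite3

# Hypothetical raw "lake" records: semi-structured JSON lines kept as captured.
RAW_EVENTS = [
    '{"order_id": 1, "customer": "acme", "amount": "120.50", "note": "rush"}',
    '{"order_id": 2, "customer": "globex", "amount": "80.00"}',
    '{"order_id": 3, "customer": "acme", "amount": "19.50", "extra": {"a": 1}}',
    'not valid json',  # raw zones often hold records that fail validation
]


def build_curated_tables(raw_lines):
    """Load parseable raw records into a tiny star schema (one dimension, one fact).

    Returns the SQLite connection and the number of rejected raw records.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dim_customer ("
        "customer_id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
    )
    conn.execute(
        "CREATE TABLE fact_order ("
        "order_id INTEGER PRIMARY KEY, "
        "customer_id INTEGER REFERENCES dim_customer(customer_id), "
        "amount REAL)"
    )
    rejected = 0
    for line in raw_lines:
        try:
            record = json.loads(line)
            order_id = int(record["order_id"])
            name = str(record["customer"])
            amount = float(record["amount"])
        except (ValueError, KeyError, TypeError):
            # The raw record stays in the "lake"; it just isn't curated.
            rejected += 1
            continue
        conn.execute(
            "INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (name,)
        )
        customer_id = conn.execute(
            "SELECT customer_id FROM dim_customer WHERE name = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT INTO fact_order VALUES (?, ?, ?)",
            (order_id, customer_id, amount),
        )
    conn.commit()
    return conn, rejected


def revenue_by_customer(conn):
    """Run a warehouse-style SQL aggregate over the curated star schema."""
    return conn.execute(
        "SELECT c.name, SUM(f.amount) "
        "FROM fact_order AS f "
        "JOIN dim_customer AS c ON c.customer_id = f.customer_id "
        "GROUP BY c.name ORDER BY c.name"
    ).fetchall()


if __name__ == "__main__":
    conn, rejected = build_curated_tables(RAW_EVENTS)
    print("rejected raw records:", rejected)
    for name, total in revenue_by_customer(conn):
        print(name, total)
```

The point of the sketch is only the division of labour: the raw list keeps everything, including fields and records that never make it into a governed table, while the curated tables support predictable, repeatable SQL queries. A real deployment would involve cloud object storage, table formats and a query engine, none of which are modelled here.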