The evolution of Big Data towards the lakehouse model: benefits, challenges and use cases
Today, we live in a world where data is one of the most valuable assets for businesses. The rise of Artificial Intelligence in recent years has made it even more important, as data is the fundamental raw material behind its development. Big Data is certainly nothing new. It has been with us for more than two decades and, over time, the architectures used to store and leverage this data have evolved to improve both system performance and cost efficiency.

In recent years, the lakehouse architecture has established itself as a new approach to storing and leveraging data, promising to optimise costs, improve data consistency and quality, and deliver high performance. But is this really the definitive solution for our systems?

Data has become an extremely valuable business asset.

The evolution of data architectures

To understand the concept of a data lakehouse architecture, it is essential to first look at data warehouse and data lake architectures. All three are ways of storing large volumes of data so that it can later be analysed and used.

The idea behind data warehouses emerged during the 1980s and 1990s, when companies began storing data from their business processes and needed to make use of that information. This architecture is normally made up of several layers:

- Data sources: transactional systems, applications and other sources that generate data.
- Data extraction and transformation process: a process in which data is extracted, cleansed, transformed and prepared for storage.
- Data storage: a central repository of structured and modelled data for analytics, generally using star or snowflake schemas. Today, some of the most widely used tools are Snowflake, BigQuery, Redshift and Synapse.
- Data exploitation: Business Intelligence (BI) tools that enable reporting, dashboards and analytics.
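The warehouse layers above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: it uses the standard library's sqlite3 module as a stand-in for a real warehouse engine, and the sample orders, the table names (dim_customer, dim_product, fact_sales) and the helper functions are all hypothetical, introduced only for this example.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from a transactional source system.
RAW_ORDERS = [
    {"order_id": 1, "customer": " Alice ", "product": "laptop", "amount": "1200.50"},
    {"order_id": 2, "customer": "bob", "product": "Mouse", "amount": "25.00"},
    {"order_id": 3, "customer": "ALICE", "product": "mouse", "amount": "25.00"},
    {"order_id": 4, "customer": "bob", "product": "laptop", "amount": None},  # invalid
]


def transform(rows):
    """Cleanse raw rows: drop records without an amount, normalise text, cast types."""
    clean = []
    for row in rows:
        if row["amount"] is None:
            continue
        clean.append(
            {
                "order_id": row["order_id"],
                "customer": row["customer"].strip().title(),
                "product": row["product"].strip().lower(),
                "amount": float(row["amount"]),
            }
        )
    return clean


def load(conn, rows):
    """Load cleansed rows into a small star schema: two dimensions and one fact table."""
    conn.executescript(
        """
        CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE fact_sales (
            order_id INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES dim_customer(customer_id),
            product_id INTEGER REFERENCES dim_product(product_id),
            amount REAL
        );
        """
    )
    for row in rows:
        # Dimension rows are inserted once; repeated names are ignored.
        conn.execute("INSERT OR IGNORE INTO dim_customer (name) VALUES (?)", (row["customer"],))
        conn.execute("INSERT OR IGNORE INTO dim_product (name) VALUES (?)", (row["product"],))
        # The fact row stores surrogate keys looked up from the dimensions.
        conn.execute(
            """
            INSERT INTO fact_sales (order_id, customer_id, product_id, amount)
            SELECT ?, c.customer_id, p.product_id, ?
            FROM dim_customer c, dim_product p
            WHERE c.name = ? AND p.name = ?
            """,
            (row["order_id"], row["amount"], row["customer"], row["product"]),
        )
    conn.commit()


def revenue_by_customer(conn):
    """A typical BI-style query: join the fact table to a dimension and aggregate."""
    cursor = conn.execute(
        """
        SELECT c.name, SUM(f.amount)
        FROM fact_sales f JOIN dim_customer c USING (customer_id)
        GROUP BY c.name ORDER BY c.name
        """
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load(conn, transform(RAW_ORDERS))
report = revenue_by_customer(conn)
print(report)
```

The point of the sketch is the ordering: data is cleansed and modelled before it is stored, so the reporting query at the end can stay short and fast. The trade-off is that any change to the schema has to be made in the transformation and loading steps first.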
The challenge today is no longer simply storing data, but making efficient and sustainable use of it.

Over time, the volume, variety and speed of data increased significantly (Big Data, IoT, social media, etc.), creating the need for a new architecture capable of storing vast amounts of data in any format, whether structured, semi-structured or unstructured. This architecture is known as a data lake, and its usual layers are:

- Data sources: this first layer is the origin of the data that will feed the system.
- Data ingestion: raw data from source systems is ingested directly into storage.
- Storage: all data is stored in its raw, unorganised form. Some of the most widely used tools are Amazon S3, Azure Data Lake Storage Gen2 and Google Cloud Storage.
- Transformation and cleansing: data is cleansed and transformed before being sent to analytics and exploitation tools.
- Data exploitation: Business Intelligence and machine learning tools are used.

Advantages and disadvantages of these architectures

Because they work with clean, consistent and structured data, data warehouses provide high data quality, strong governance, and fast, efficient querying. However, they offer less flexibility, schema changes can be slow to implement, and storage and processing costs are often high.

A data warehouse prioritises performance, while a data lake prioritises flexibility.

By contrast, data lakes make it possible to store large volumes of data in their original format, providing greater flexibility and enabling rapid data ingestion without the need for prior transformation. In addition, they tend to be more scalable and cost-effective, making them ideal for advanced analytics, data science and machine learning workloads. On the other hand, data lakes can become disorganised if they are not managed properly: queries may become more complex and slower, and users have to process and cleanse the data before using it, which increases complexity and means the data starts out at a lower level of quality.
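The cost of deferring cleansing to read time can be illustrated with a small Python sketch. It assumes a local temporary directory standing in for object storage such as Amazon S3. The file names, field names and helper functions are hypothetical, chosen only to show raw data landing untouched and being validated when it is read.

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest_raw(lake_dir: Path) -> None:
    """Ingestion: land source data as-is, with no validation or transformation."""
    (lake_dir / "sensors.jsonl").write_text(
        '{"device": "a1", "temp_c": 21.5}\n'
        '{"device": "a2", "temp_c": "n/a"}\n'
        '{"device": "a3"}\n',
        encoding="utf-8",
    )
    (lake_dir / "legacy.csv").write_text("device,temp_f\nb1,68.0\nb2,\n", encoding="utf-8")


def to_float(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def read_clean(lake_dir: Path):
    """Apply a schema at read time: unify formats and units, count rejected records."""
    readings, rejected = [], 0
    for path in sorted(lake_dir.iterdir()):
        if path.suffix == ".jsonl":
            with path.open(encoding="utf-8") as handle:
                records = [json.loads(line) for line in handle if line.strip()]
            pairs = [(r.get("device"), to_float(r.get("temp_c"))) for r in records]
        elif path.suffix == ".csv":
            with path.open(newline="", encoding="utf-8") as handle:
                rows = list(csv.DictReader(handle))
            pairs = []
            for row in rows:
                temp_f = to_float(row.get("temp_f"))
                # Convert Fahrenheit to Celsius so every source shares one unit.
                temp_c = None if temp_f is None else round((temp_f - 32) * 5 / 9, 2)
                pairs.append((row.get("device"), temp_c))
        else:
            continue  # unknown formats stay in the lake, unread
        for device, temp_c in pairs:
            if device and temp_c is not None:
                readings.append({"device": device, "temp_c": temp_c})
            else:
                rejected += 1
    readings.sort(key=lambda reading: reading["device"])
    return readings, rejected


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    ingest_raw(lake)
    readings, rejected = read_clean(lake)

print(readings, rejected)
```

Ingestion here is a plain file write, which is why data lakes accept new sources so quickly. All of the validation logic sits in read_clean, so every consumer of the lake either reuses it or reimplements it.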
Architecture use cases

Based on the advantages and disadvantages of each architecture, it is important to choose the one best suited to the system being developed. In scenarios where reliable, fast and consistent reporting is required, a data warehouse architecture should be used. Where there are large volumes of information in different formats and a need to carry out machine learning workloads, a data lake architecture is more appropriate.

However, when theory is put into practice, real-world use cases are often far more complex. In many scenarios, companies manage large volumes of data from multiple sources and in both structured and unstructured formats, while also needing to generate fast and reliable reports.

The complexity of modern data requires a balance of performance, flexibility and governance.

These requirements create an architectural dilemma: a data warehouse offers performance and reliability for analytics, while a data lake provides flexibility and scalability for storing heterogeneous data. So, which solution is best suited to this context? Faced with this challenge, many companies have opted to combine both architectures as follows:

- Data sources: transactional systems, applications and other sources that generate data.
- Data ingestion: raw data from source systems is ingested directly into storage.
- Data lake: all data is stored in its raw and unorganised form.
- Transformation and cleansing: a process in which data is extracted from the data lake, cleansed, transformed and prepared for storage.
- Data warehouse: a central repository of structured and modelled data for analytics.
- Data exploitation: Business Intelligence and machine learning tools are used.

■ Although this architecture addresses many business requirements, it also introduces greater operational complexity. The use of multiple layers and tools increases the resources required for storage, processing and maintenance, thereby driving up associated costs.
Lakehouse architecture

In 2019, Databricks popularised the data lakehouse architecture with the aim of combining the advantages of data lakes and data warehouses. The core idea is to store data in a data lake while adding a transactional layer that allows it to be managed as though it were stored in data warehouse tables.

In Databricks, this layer is implemented using the Delta Lake storage format. Data is stored as Apache Parquet files on low-cost storage, while a transaction log keeps track of all operations performed on the tables. This provides capabilities such as ACID transactions, data versioning, schema enforcement and efficient SQL querying. It also makes it possible to optimise storage management through techniques such as compaction and the removal of obsolete files.

The lakehouse model aims to combine the scalability of the data lake with the reliability of the data warehouse.

This approach does, however, introduce greater operational complexity, stemming both from the management of multiple files and versions and from the need to establish governance policies across heterogeneous data environments. Likewise, these architectures require a higher level of technical specialisation from developers.

■ Therefore, although the data lakehouse architecture can reduce costs in certain areas, it also introduces new complexities and operational overheads.

Conclusion

As with any architecture, the lakehouse model comes with both advantages and disadvantages. At present, there is no single solution capable of meeting every business requirement. For that reason, the most effective approach is to identify the most efficient option for each specific use case, rather than pursuing a perfect solution that, in practice, does not exist.

The growth in data volume and variety drove the evolution from the data warehouse to the data lake.
May 25, 2026