
Demystifying Databases, Data Warehouses, Data Lakes, and Data Lake Houses

Databases for transactions, data warehouses for analytics, and data lakes and lake houses for analytics with flexibility and scalability.

By Anirudha Bhadoriya · Nov. 29, 23 · Opinion


Have you ever wondered how data warehouses are different from Databases? And what are Data Lakes and Data Lake Houses? Let’s understand these with a hypothetical example. 

Bookster.biz is the new sensation in selling books worldwide. The business is flourishing, and they need to keep track of a lot of data: a large catalog of millions of books, millions of customers worldwide placing billions of orders to buy books. How do they keep track of all this data? How do they ensure their website and apps don’t grind to a halt because of all this load?

Databases to the Rescue

Databases are the workhorses of websites and mobile apps, handling all the data and millions of transactions. These databases come in many flavors (we will cover the different types of databases in a separate post), but the most popular ones are Relational Databases (aka RDBMSs), like MySQL, Postgres, and Oracle.

Bookster would likely have tables and a schema along the following lines (not exhaustive, for brevity):

  • BookCatalog: book ID, ISBN, title, authors, description, publisher, …
  • BookInventory: book ID, number of books available for sale, ...
  • Users: user ID, user name, email, …
  • Orders: Order ID, book ID, user ID, payment information, order status, …

When a user orders a book, Bookster must change two records together: decrementing the book's inventory and inserting a new entry in the Orders table. RDBMSs support transactions that make such operations atomic: either all of them succeed or all of them fail. Imagine if two or more users could order the last copy of a popular book. Without transaction support, every one of them could place the order, and Bookster would end up with one happy customer and many pissed-off ones. Similarly, if the database host crashes mid-processing, the data may be left inconsistent without transactions.
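
To make this concrete, here is a minimal sketch of that order flow as a single transaction. It uses SQLite purely for illustration, and the table and column names simply extend the hypothetical Bookster schema above; a production system would run the same pattern against MySQL, Postgres, and so on.

import sqlite3

# A minimal sketch of the atomic order flow. SQLite is used purely for
# illustration; table and column names follow the hypothetical Bookster schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE BookInventory (book_id INTEGER PRIMARY KEY, quantity INTEGER);
    CREATE TABLE Orders (
        order_id INTEGER PRIMARY KEY AUTOINCREMENT,
        book_id INTEGER, user_id INTEGER, order_status TEXT
    );
    INSERT INTO BookInventory VALUES (42, 1);   -- last copy of a popular book
""")

def place_order(book_id: int, user_id: int) -> int:
    """Decrement inventory and record the order as a single atomic unit."""
    with conn:  # BEGIN ... COMMIT on success, ROLLBACK on any exception
        cur = conn.execute(
            "UPDATE BookInventory SET quantity = quantity - 1 "
            "WHERE book_id = ? AND quantity > 0",
            (book_id,),
        )
        if cur.rowcount == 0:            # nothing left to sell: abort both steps
            raise RuntimeError("out of stock")
        cur = conn.execute(
            "INSERT INTO Orders (book_id, user_id, order_status) "
            "VALUES (?, ?, 'PLACED')",
            (book_id, user_id),
        )
        return cur.lastrowid

print(place_order(42, 1001))   # succeeds and returns the new order id
# A second call for book 42 now raises "out of stock" instead of overselling.

If a second order for the same last copy comes in, its UPDATE matches no rows, the exception rolls the transaction back, and no orphan order row is ever written without a matching inventory decrement.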

This database interaction type is called Online Transaction Processing (aka OLTP), where the read and write operations happen very fast on a small amount of data, i.e., precisely two rows in the previous example. 

This is great. The customers are now happy, and they can order books fast. But the management wants to know what's going on with the business. Which books are the best-sellers in different categories? Which authors are trending, and which are not selling much? How many orders are coming from which geographies or demographics? Answers to these questions are hard to get from the operational database alone.

Data Warehouses Shine for Analytical Queries

Data Warehouses (DWs) can handle large amounts of data, e.g., billions of orders, millions of book entries, etc. Bookster can load the data from the Database to the DW to answer the management questions. The analytical queries read a lot of data and summarise it in some form, like listing the total number of orders for a particular book broken down by geography and demographics. Examples of popular DWs are AWS Redshift, GCP BigQuery, etc. 
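
To see the shape of such a query, here is a rough sketch of that aggregation using pandas as a stand-in for a real warehouse engine; the column names are hypothetical.

import pandas as pd

# A rough sketch of an analytical (OLAP-style) aggregation, using pandas in
# place of a real warehouse engine. Column names are hypothetical.
orders = pd.DataFrame({
    "book_id":   [42, 42, 7, 7, 7],
    "geography": ["US", "IN", "US", "UK", "UK"],
    "age_group": ["18-25", "26-35", "18-25", "36-45", "18-25"],
})

# "Total orders for each book, broken down by geography and demographics"
summary = (
    orders.groupby(["book_id", "geography", "age_group"])
          .size()
          .reset_index(name="total_orders")
          .sort_values("total_orders", ascending=False)
)
print(summary)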

This database interaction type is called Online Analytical Processing (aka OLAP), where the workload is mostly reads over large amounts of data. The data is uploaded to the DW in batches or streamed in. The loading process is known as ETL (Extract, Transform, and Load) and runs regularly to keep the DW in sync with updates to the Database. DWs typically don't update data in place; instead, a newer version of a record is appended.
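
A batch ETL job of that kind might look roughly like the following sketch. SQLite stands in for both the operational database and the warehouse here, and every table and column name is a hypothetical placeholder.

import sqlite3

# A minimal batch-ETL sketch: extract new orders from the OLTP database,
# transform them into the warehouse's layout, and load (append) them.
# SQLite stands in for both systems; all names are hypothetical.
oltp = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

oltp.executescript("""
    CREATE TABLE Orders (order_id INTEGER, book_id INTEGER, user_id INTEGER,
                         order_status TEXT, created_at TEXT);
    INSERT INTO Orders VALUES (1, 42, 1001, 'PLACED', '2023-11-28T10:15:00');
""")
warehouse.execute("""
    CREATE TABLE fact_orders (order_id INTEGER, book_id INTEGER,
                              user_id INTEGER, status TEXT, order_date TEXT)
""")

# Extract: only rows changed since the last successful run
rows = oltp.execute(
    "SELECT order_id, book_id, user_id, order_status, created_at "
    "FROM Orders WHERE created_at >= ?", ("2023-11-28",)
).fetchall()

# Transform: reshape into the warehouse fact-table layout (keep the date only)
facts = [(r[0], r[1], r[2], r[3], r[4][:10]) for r in rows]

# Load: append-only; the warehouse keeps newer versions rather than updating in place
with warehouse:
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?, ?)", facts)

print(warehouse.execute("SELECT * FROM fact_orders").fetchall())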

Like RDBMSs, DWs have a notion of schema: tables and their columns are well defined up front, and the ETL process converts the data into the appropriate schema before loading it.

Some data doesn't fit a schema easily but can still be used by Machine Learning (ML) processes. For example, customers review books as text or video, and some rockstar ML engineers want to generate popular books by training an LLM on all of them. Such data can't be forced into a strict schema anymore. Data Lakes help here by storing even larger amounts of data in a variety of formats while still allowing efficient processing.

Data Lakes and Data Lake Houses Are the Relatively New Kids on the Block

Data Lakes (DLs) remove the friction of converting data into a specific format up front, irrespective of if and when it will be used. Vast amounts of data in different native formats like JSON, text, binary, images, and videos can be stored in a DL and converted to a specific schema at read time, only when the data actually needs to be processed. The processing is flexible and scalable, as DLs support big data processing frameworks like Apache Spark. On the flip side, this flexibility becomes a drawback if most of the ingested data is of low quality due to a lack of data quality checks or governance, turning the DL into a 'Data Swamp' instead.
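
As a rough illustration of schema-on-read, the PySpark sketch below reads raw JSON review files straight out of a lake path and applies a schema only at query time. The bucket path and field names are made-up placeholders.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# A schema-on-read sketch with PySpark: raw review files sit in the lake in
# their native JSON form, and a schema is applied only when we read them.
# The path and field names are hypothetical placeholders.
spark = SparkSession.builder.appName("bookster-lake").getOrCreate()

review_schema = StructType([
    StructField("book_id", IntegerType()),
    StructField("user_id", IntegerType()),
    StructField("review_text", StringType()),
])

# The files were dumped as-is; the schema is imposed at read time.
reviews = (
    spark.read
         .schema(review_schema)
         .json("s3://bookster-data-lake/raw/reviews/")
)

reviews.groupBy("book_id").count().show()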

That's where the clever people at Databricks combined the goodness of DWs with DLs to create Data Lake Houses (DLHs). DLHs are more flexible than DWs, allowing the schema to be applied at write time or at read time, as needed, but with stricter mechanisms for data quality checks and metadata management, aka Data Governance. Like DLs, DLHs also keep the flexibility of big data processing.
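
As a loose sketch of the lakehouse idea, the snippet below writes and reads a table in the open-source Delta Lake format, which layers ACID transactions and schema enforcement on top of files in the lake. It assumes a Spark session configured with the delta-spark package; the path and columns are hypothetical.

from pyspark.sql import SparkSession

# A rough lakehouse-style sketch using the open-source Delta Lake format.
# Assumes the delta-spark package is available; path and columns are made up.
spark = (
    SparkSession.builder.appName("bookster-lakehouse")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame(
    [(1, 42, 1001, "PLACED")],
    ["order_id", "book_id", "user_id", "status"],
)

# Writes go through Delta's transaction log; appends with a mismatched
# schema are rejected unless schema evolution is explicitly enabled.
orders.write.format("delta").mode("append").save("s3://bookster-lakehouse/orders")

# The same table can then be queried much like a warehouse table.
spark.read.format("delta").load("s3://bookster-lakehouse/orders") \
     .groupBy("status").count().show()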

The following table summarises the differences between these technologies:


Database
  • Key characteristics: Fast, small queries; transaction support
  • Suitable for: Online use cases (OLTP)
  • Drawbacks: Not ideal for large analytical queries
  • Examples: RDBMSs such as MySQL

Data Warehouse
  • Key characteristics: Slow, large queries; no updates after write
  • Suitable for: Analytics (OLAP)
  • Drawbacks: Less flexible due to strict schema and lack of support for big data processing frameworks
  • Examples: AWS Redshift, Google BigQuery, Snowflake*

Data Lake
  • Key characteristics: Unstructured data; schema on read; flexible big data processing
  • Suitable for: Analytics (OLAP)
  • Drawbacks: Data quality issues due to lack of Data Governance
  • Examples: Snowflake*, AWS Lake Formation**, Databricks Delta Lake**

Data Lake House
  • Key characteristics: Structured or unstructured data; flexible, with better Data Governance; supports big data processing
  • Suitable for: Analytics (OLAP)
  • Drawbacks: More complex, lower performance, and more expensive compared to a DW
  • Examples: Snowflake*, AWS Lake Formation**, Databricks Delta Lake**

*Snowflake can be configured as a Data Warehouse, Data Lake, or Data Lake House.

**AWS Lake Formation and Databricks Delta Lake can be configured as either a Data Lake or a Data Lake House.


