
Databricks: An Understanding Inside the WH

This article presents an understanding of Databricks with helpful links and an explanation of the author's knowledge of the topic.

By Barath Ravichander · Dec. 21, 23 · Opinion


Below is a summarized write-up of my understanding of Databricks. There are many types of data warehouses on the market, but here we will focus on Databricks alone.

Databricks follows a concept similar to a data catalog backed by a Hive metastore. Your data resides in S3, not in a storage database sitting on a local HDD or SSD.

Once the data is in S3, the process resembles the data catalog in Glue: crawlers read the data and make it ready for users.

The source data can be in any format, but it is stored internally in Parquet format.

The data stays in your S3 bucket, and on top of it sits Unity Catalog, which applies fine-grained governance to your data.
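For instance, once a table is registered, Unity Catalog exposes it through a three-level namespace (catalog.schema.table). A minimal sketch in Python; the catalog, schema, and table names here are hypothetical:

Python

# Read a Unity Catalog-governed table via the three-level namespace.
# In a Databricks notebook, `spark` is already provided.
df = spark.table("main.sales.orders")   # <catalog>.<schema>.<table>
df.show(5)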

Data Loading

One feature of Databricks that I particularly liked is Auto Loader. Loading files is common practice in any OLTP or OLAP database, but the small difference here is that the file can arrive in any format: Auto Loader infers the data structure and loads it in Parquet format. Say you have a CSV with four rows and four columns; as soon as Auto Loader identifies the file in the specified location, the data is loaded into Databricks in Parquet format.
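Here is a minimal Auto Loader sketch in Python for that CSV scenario; the S3 paths and table name are hypothetical:

Python

# Incrementally ingest CSV files as they land in S3; Auto Loader infers
# and tracks the schema. Paths and table name are placeholders.
df = (spark.readStream
      .format("cloudFiles")                                   # Auto Loader source
      .option("cloudFiles.format", "csv")                     # incoming file format
      .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/orders")
      .load("s3://my-bucket/landing/orders/"))

(df.writeStream
   .option("checkpointLocation", "s3://my-bucket/_checkpoints/orders")
   .trigger(availableNow=True)                                # process pending files, then stop
   .toTable("orders_bronze"))                                 # stored as a Delta (Parquet-based) table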

There are many other ways to load data, such as dbt, and we can also use Glue if you are on AWS; you can read more on Glue with Delta Lake.

Another way to load data, familiar from any data warehouse, is COPY INTO. Using a SQL query, you just give the path and copy the files into a Delta table using the table name.

You can also play around using SQL, Python, R, and Scala. 

SQL
 
COPY INTO my_table
FROM '/path/to/files'
FILEFORMAT = <format>                     -- e.g., CSV, JSON, PARQUET
FORMAT_OPTIONS ('mergeSchema' = 'true')   -- merge schemas across the source files
COPY_OPTIONS ('mergeSchema' = 'true');    -- evolve the target table schema if needed


Reference SQL Source: COPY INTO

The data is stored in Delta Lake in three tiers: Bronze, Silver, and Gold (a minimal sketch of the flow follows the list).

Bronze: Used for raw data ingestion and any historical data.

Silver: Cleansed, filtered, or otherwise augmented data.

Gold: Used for business-level aggregations.
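As a rough illustration, here is how data might move through the tiers; the table and column names are hypothetical:

Python

# Bronze -> Silver: cleanse the raw data.
bronze = spark.table("orders_bronze")
silver = (bronze
          .dropDuplicates(["order_id"])            # drop duplicate ingests
          .filter("order_id IS NOT NULL"))         # basic quality filter
silver.write.mode("overwrite").saveAsTable("orders_silver")

# Silver -> Gold: business-level aggregation.
gold = (spark.table("orders_silver")
        .groupBy("region")
        .agg({"amount": "sum"}))                   # total order amount per region
gold.write.mode("overwrite").saveAsTable("orders_gold")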

Instance Types

There are two ways you can spin up compute in Databricks: through serverless or through on-demand instances. These on-demand instances can be Photon-enabled instance types, running on Graviton or other instance families.

It's easy to spin up one on the Databricks page: Databricks on AWS.
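If you would rather script it than click through the UI, the Databricks Python SDK can create a cluster. A rough sketch, assuming the databricks-sdk package; all values are placeholders:

Python

# Create an on-demand cluster programmatically (pip install databricks-sdk).
# Credentials come from the environment or ~/.databrickscfg.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="demo-cluster",
    spark_version="14.3.x-scala2.12",   # a Databricks Runtime version
    node_type_id="m6gd.xlarge",         # a Graviton instance type on AWS
    num_workers=2,
    autotermination_minutes=30,         # terminate when idle to control cost
).result()                              # block until the cluster is running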

You can also calculate your instance pricing on the Pricing Calculator page.

IDE Setups

IDEs are important from a developer's perspective: they shape how your teams collaborate on, run, and commit code to the relevant repositories.

There are a couple of options: you can use notebooks, which development teams can collaborate on, the SQL Editor, or one of the many extensions available for common IDEs. The one I liked was the VS Code plugin.
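With the local tooling set up, Databricks Connect lets you run code from your IDE against a remote cluster. A small sketch, assuming the databricks-connect package and an already-configured workspace profile:

Python

# Run Spark code from a local IDE against a Databricks cluster
# (pip install databricks-connect); assumes credentials are configured,
# e.g., in ~/.databrickscfg.
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.getOrCreate()

# Queries execute on the remote cluster while you edit and debug locally.
spark.table("orders_gold").show(5)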

Conclusion

Databricks offers a data warehouse ecosystem that reads data directly from S3, so there is no need for a separate storage layer. The combination of ingestion, a data/AI platform, and data warehousing is the Databricks Lakehouse.

The above is my understanding of how Databricks works, based on my initial knowledge. I will keep adding more details to these blogs. Please share your experience with Databricks.

Links are provided at the relevant checkpoints above.


Opinions expressed by DZone contributors are their own.
