
Do We Need Data Normalization Anymore?

Data normalization is supposed to be the way you optimize databases to ensure their integrity. But maybe with many databases, normalization becomes redundant.

By Jennifer Reif · Mar. 21, 24 · Analysis


Many different roles in the technology world come into contact with data normalization as a routine part of many projects. Developers, database administrators, domain modelers, business stakeholders, and many more work through the normalization process as naturally as breathing. And yet, can something that seems so integral become obsolete?

As the database landscape becomes more diverse and hardware becomes more powerful, we might wonder whether the practice of data normalization is required anymore. Should we be fretting over optimizing data storage and querying so that we return the minimum amount of data? Or, if we should, do certain data structures make it more vital to solve those problems than others?

In this article, we will review the process of data normalization and evaluate when it is needed, and whether it is still a necessary part of digitally storing and retrieving data.

What Is Data Normalization?

Data normalization is the process of optimizing data structures in a relational database to ensure data integrity and query efficiency. It reduces redundancy and improves accuracy by putting the data through a series of steps to normalize the structure (normal forms). At its core, data normalization helps avoid insert, update, and delete data anomalies. These anomalies occur when creating new data, updating existing data, or deleting data, and they cause challenges in keeping data values in sync (integrity). We will talk more about this when we step through the normalization process.

The steps entail verifying keys (links to related data), separating unrelated entities into other tables, and inspecting the rows and columns as a unified data object. While the full list of normal form steps is rather rigorous, we will focus on those most commonly applied in business practice: 1st, 2nd, and 3rd normal forms. Other normal forms are mostly used in academics and statistics. Normal form steps must be done in order, and we cannot move to the next normal form until the previous one is complete.

How Do We Do Data Normalization?

Since we have three normal forms to take our data through, we will have three steps to this process. They are as follows:

  1. 1st normal form (1NF)
  2. 2nd normal form (2NF)
  3. 3rd normal form (3NF)

A database professor from college taught my class to memorize the three normal forms as “the key, the whole key, and nothing but the key” (like the courtroom oath for the truth). I had to refresh some of the normal form details for this article, but that basic phrase has always stuck with me. Hopefully, it might help you remember them, too.

I recently came across a coffee shop data set that seems to be a good fit for us to use as an example of normalizing a data set. With a bit of adjustment for our examples here, we can step through the process.

Denormalized Data

| transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn |
|---|---|---|---|---|---|---|---|---|---|
| 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1,2 | Jamaican Coffee River Sm, Oatmeal Scone | 1,1 | 2.45,3.00 | N,N |
| 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1,2 | Cappuccino Lg, Jumbo Savory Scone | 2,1 | 4.25,3.75 | N,N |


The data contains sales receipts for the company and was originally published on the Kaggle Coffee Shop sample data repository, though I have also created a GitHub repository for today’s post. The data displayed above shows sales made to a customer for ordered products. 

Why is this data a problem? Earlier, we mentioned normalization to solve insert, update, and delete anomalies. If we try to insert a new row into this data, we could create a duplicate, or worse yet, have to gather all of the information on customers, products, and receipt date/time in order to create it. If we needed to update or delete the products purchased on a receipt, we would need to sort through the list in each product column to search for the value. So let's see how to reduce redundancy and improve integrity by normalizing this data.
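To make the update anomaly concrete, renaming a product in this denormalized table would mean rewriting comma-separated strings rather than changing a single value. A minimal sketch, assuming the table is named denormalized_receipts and we want to rename the Oatmeal Scone:

```sql
-- Hypothetical sketch: because products are packed into one comma-separated
-- column, renaming a product requires string surgery on every matching row.
UPDATE denormalized_receipts
SET    product = REPLACE(product, 'Oatmeal Scone', 'Oat Scone')
WHERE  product LIKE '%Oatmeal Scone%';
```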

1st Normal Form: The Key

For the first step in our "the key, the whole key, and nothing but the key," a table should have a primary key (a single column or set of columns) that ensures a row is unique. Each column in a row should also contain only a single value; i.e., no nested tables.

Our example data set needs some work to get it to 1NF. While we can get unique rows with a combination of date/time or maybe date/time/customer, it is often much simpler to reference rows with a generated unique value of some sort. Let’s do that by adding a transaction_id field to our receipt table.

There are also several rows that have more than one item ordered (transaction_ids 156 and 199), so a few columns have line items with more than one value. We can correct this by separating the rows with multiple values into separate rows.

1NF Data

| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn |
|---|---|---|---|---|---|---|---|---|---|---|
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 2 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 2 | Jumbo Savory Scone | 1 | 3.75 | N |


With this data, a composite (multi-column) key uniquely identifies a row by the combination of transaction_id and line_item_id, as a single receipt cannot contain multiple line item #1s. Here is a look at the data if we just boil the table down to those primary key values.

| transaction_id | line_item_id |
|---|---|
| 150 | 1 |
| 156 | 1 |
| 156 | 2 |
| 165 | 1 |
| 199 | 1 |
| 199 | 2 |


Each combination of those two values is unique. We have applied the first normal form to our data, but we still have some potential data anomalies. If we wanted to add a new receipt, we might need to create multiple lines (depending on how many line items it contained), and duplicate transaction ID, date, time, and other information on each row. Updates and deletes cause similar problems because we would need to ensure we get all the rows affected for data to be consistent. This is where the second normal form comes into play.
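Before we do, here is a minimal SQL sketch of the single table in its 1NF shape, with the composite primary key called out. The table name, column types, and lengths are assumptions based on the example data:

```sql
-- 1NF sketch (names and types are assumptions, not the original schema).
CREATE TABLE receipt_1nf (
    transaction_id   INTEGER       NOT NULL,
    transaction_date DATE          NOT NULL,
    transaction_time TIME          NOT NULL,
    instore_yn       CHAR(1)       NOT NULL,
    customer         VARCHAR(100)  NOT NULL,
    loyalty_num      VARCHAR(20),
    line_item_id     INTEGER       NOT NULL,
    product          VARCHAR(100)  NOT NULL,
    quantity         INTEGER       NOT NULL,
    unit_price       NUMERIC(10,2) NOT NULL,
    promo_item_yn    CHAR(1)       NOT NULL,
    -- Composite key: a single receipt cannot contain two line item #1s.
    PRIMARY KEY (transaction_id, line_item_id)
);
```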

2nd Normal Form: The Whole Key

The second normal form ensures that each non-key column is fully dependent on the whole key. This is more of a concern for tables with more than one column as the primary key (like our coffee receipt table). Here is our data again in its first normal form:

| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn |
|---|---|---|---|---|---|---|---|---|---|---|
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 2 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 2 | Jumbo Savory Scone | 1 | 3.75 | N |


We will need to evaluate each non-key field to see if we have any partial dependencies; i.e., the column depends on only part of the key and not the whole key. Since transaction_id and line_item_id make up our primary key, let's start with the transaction_date field. The transaction date does depend on the transaction ID, as the same transaction ID could not be used again on another day. However, the transaction date doesn't depend on the line item ID at all. Line items can be reused across transactions, days, and even customers.

OK, so we already found that the table does not follow the second normal form, but let's check another column. What about the customer column? The customer is not dependent on both the transaction ID and line item ID. If someone gave us a transaction ID, we would know which customer made the purchase, but if we were given a line item ID, we wouldn't know which single customer that receipt belonged to. After all, multiple customers could have ordered one, two, or six items on their receipts. The customer is linked to the transaction ID (assuming multiple customers cannot split a receipt), but the customer is not dependent upon the line item. We need to fix these partial dependencies.
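One way to sanity-check a suspected partial dependency is to group by part of the key and count the distinct values of the column in question; if every group returns 1, that column depends on only that part of the key. An illustrative query against the hypothetical receipt_1nf table sketched earlier:

```sql
-- If customer depended on the whole key, some transaction_ids would map to
-- more than one distinct customer; here every group should return 1.
SELECT transaction_id,
       COUNT(DISTINCT customer) AS customers_per_transaction
FROM   receipt_1nf
GROUP  BY transaction_id;
```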

The most direct solution is to create a separate table for order line items, leaving the columns that are only dependent on transaction_id in the receipt table. The updated data in the second normal form looks like the tables below.

Receipt

| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num |
|---|---|---|---|---|---|
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 |


Receipt Line Item

| transaction_id | line_item_id | product_id | product | quantity | unit_price | promo_item_yn |
|---|---|---|---|---|---|---|
| 150 | 1 | 28 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 1 | 34 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2 | 77 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 1 | 54 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 1 | 41 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2 | 79 | Jumbo Savory Scone | 1 | 3.75 | N |


Now let’s test that our change fixed the issue and follows the second normal form. For our Receipt table, transaction_id becomes the sole primary key. Transaction date is unique based on the transaction_id, as is transaction_time; i.e., there can only be one date and time for a transaction id.

An order is placed either in-store or not, so whether a purchase was made in-store depends only on the transaction_id. Since customers cannot split a receipt, a transaction would also tell us a unique customer. Finally, if someone gave us a transaction ID, we could identify the single customer loyalty number attached to it.

Next is the Receipt Line Item table. Line items are dependent upon the transaction (receipt) with which they are associated, so we retained the transaction ID on our line item table. The combination of transaction_id and line_item_id becomes our composite key on the line item table. Product_id and product are determined based on the transaction and line item together. A single transaction ID wouldn’t tell us which product (if the receipt contains multiple products purchased), and a single line item ID wouldn’t tell us which purchase was being referenced (different receipts could order the same products). This means the product_id and product values are dependent on the whole key.

We can also associate a quantity from the transaction_id and line_item_id. Quantities could be the same across receipts or line item IDs, but the combination of both keys gives us a single value for quantity. We also cannot uniquely identify our unit_price or promo_item_yn column values without both the transaction ID and line item ID fields together.
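A rough SQL sketch of this 2NF split might look like the following; the table and column names follow the example, while the types and constraints are assumptions:

```sql
-- 2NF sketch: columns dependent only on transaction_id move to receipt.
CREATE TABLE receipt (
    transaction_id   INTEGER PRIMARY KEY,
    transaction_date DATE         NOT NULL,
    transaction_time TIME         NOT NULL,
    instore_yn       CHAR(1)      NOT NULL,
    customer         VARCHAR(100) NOT NULL,
    loyalty_num      VARCHAR(20)
);

-- Line items depend on the whole (transaction_id, line_item_id) key.
CREATE TABLE receipt_line_item (
    transaction_id INTEGER       NOT NULL REFERENCES receipt (transaction_id),
    line_item_id   INTEGER       NOT NULL,
    product_id     INTEGER       NOT NULL,
    product        VARCHAR(100)  NOT NULL,
    quantity       INTEGER       NOT NULL,
    unit_price     NUMERIC(10,2) NOT NULL,
    promo_item_yn  CHAR(1)       NOT NULL,
    PRIMARY KEY (transaction_id, line_item_id)
);
```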

Although we have satisfied the second normal form, some data anomalies still exist. If we tried to create a new product for purchase or a new customer, we couldn’t create them in our current tables because we may not have receipts tied to them yet. If we needed to update a product or customer (for typo or name change), we would need to update all the line item rows with those values. If we wanted to delete a product or customer, we couldn’t unless we removed receipts or line items that referenced them. To solve these problems, we can move to the third normal form.

3rd Normal Form: And Nothing but the Key

The third normal form ensures that non-key fields are dependent on nothing but the key. In other words, they are not dependent on other non-key fields, causing a transitive dependency. Let’s review our 2NF data again:

Receipt

| transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num |
|---|---|---|---|---|---|
| 150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 |
| 156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 |
| 165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 |
| 199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 |


Receipt Line Item

| transaction_id | line_item_id | product_id | product | quantity | unit_price | promo_item_yn |
|---|---|---|---|---|---|---|
| 150 | 1 | 28 | Columbian Medium Roast Sm | 1 | 2.00 | N |
| 156 | 1 | 34 | Jamaican Coffee River Sm | 1 | 2.45 | N |
| 156 | 2 | 77 | Oatmeal Scone | 1 | 3.00 | N |
| 165 | 1 | 54 | Morning Sunrise Chai Rg | 2 | 2.50 | N |
| 199 | 1 | 41 | Cappuccino Lg | 2 | 4.25 | N |
| 199 | 2 | 79 | Jumbo Savory Scone | 1 | 3.75 | N |


On our Receipt table, we need to check the non-key fields (everything except transaction_id) to see if the values depend on other non-key fields. The values for transaction date, time, and in-store do not change based on each other or the customer or loyalty number associated, so they are properly dependent on nothing but the key.

But what about customer info? The value of the loyalty number could change if the customer changes. For instance, if we needed to delete or update the customer who made the purchase, the loyalty number would also need to be deleted or updated along with it. So, the loyalty number is dependent on the customer, which is a non-key field. This means our Receipt table is not in the third normal form.

What about our Receipt Line Item table? Quantity, unit price, and promo item values don’t vary based on the values of each other, nor on the product information, because the three fields state the value of an item at the time of purchase. However, the product is dependent on the product_id because the value would change based on which product ID was referenced. So this table also needs some updates to comply with the third normal form.
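As with the earlier partial-dependency check, a quick way to spot a transitive dependency is to test whether one non-key column fully determines another. An illustrative query, using the table and column names from the example:

```sql
-- If loyalty_num depends on customer (a non-key field), each customer
-- should map to exactly one loyalty number.
SELECT customer,
       COUNT(DISTINCT loyalty_num) AS loyalty_nums_per_customer
FROM   receipt
GROUP  BY customer;
```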

Again, the best method to solve these issues is to pull the related columns to separate tables and leave a reference ID (foreign key) to link the original tables with the new ones. We eliminate data anomalies on insert, update, and delete, as well as reduce data redundancy and improve efficiency for storage and querying.

Receipt

| transaction_id | transaction_date | transaction_time | instore_yn | customer_id |
|---|---|---|---|---|
| 150 | 2019-04-01 | 12:24:53 | Y | 604 |
| 156 | 2019-04-01 | 12:30:00 | N | 32 |
| 165 | 2019-04-01 | 16:44:46 | Y | 127 |
| 199 | 2019-04-01 | 14:24:55 | Y | 112 |


Receipt Line Item

| transaction_id | line_item_id | product_id | quantity | unit_price | promo_item_yn |
|---|---|---|---|---|---|
| 150 | 1 | 28 | 1 | 2.00 | N |
| 156 | 1 | 34 | 1 | 2.45 | N |
| 156 | 2 | 77 | 1 | 3.00 | N |
| 165 | 1 | 54 | 2 | 2.50 | N |
| 199 | 1 | 41 | 2 | 4.25 | N |
| 199 | 2 | 79 | 1 | 3.75 | N |


Product

| product_id | product |
|---|---|
| 28 | Columbian Medium Roast Sm |
| 34 | Jamaican Coffee River Sm |
| 77 | Oatmeal Scone |
| 54 | Morning Sunrise Chai Rg |
| 41 | Cappuccino Lg |
| 79 | Jumbo Savory Scone |


Customer

| customer_id | customer | loyalty_num |
|---|---|---|
| 604 | Camille Tyler | 102-192-8157 |
| 32 | Griffith Lindsay | 769-005-9211 |
| 127 | Stuart Nunez | 796-362-1661 |
| 112 | Allistair Ramirez | 253-876-9471 |
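To make the final structure concrete, here is roughly how the four 3NF tables might be declared in SQL. This is a sketch; the types, lengths, and foreign key details are assumptions:

```sql
-- 3NF sketch: customer and product details live in their own tables,
-- referenced by foreign keys from receipt and receipt_line_item.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    customer    VARCHAR(100) NOT NULL,
    loyalty_num VARCHAR(20)
);

CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    product    VARCHAR(100) NOT NULL
);

CREATE TABLE receipt (
    transaction_id   INTEGER PRIMARY KEY,
    transaction_date DATE    NOT NULL,
    transaction_time TIME    NOT NULL,
    instore_yn       CHAR(1) NOT NULL,
    customer_id      INTEGER NOT NULL REFERENCES customer (customer_id)
);

CREATE TABLE receipt_line_item (
    transaction_id INTEGER       NOT NULL REFERENCES receipt (transaction_id),
    line_item_id   INTEGER       NOT NULL,
    product_id     INTEGER       NOT NULL REFERENCES product (product_id),
    quantity       INTEGER       NOT NULL,
    unit_price     NUMERIC(10,2) NOT NULL,
    promo_item_yn  CHAR(1)       NOT NULL,
    PRIMARY KEY (transaction_id, line_item_id)
);
```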


Data Normalization Outside Relational Databases

So does this process of data normalization make sense outside of relational databases? Is it needed for document, columnar, key-value, and/or graph databases?

From my perspective, the goals of data normalization - reducing redundancy, improving data integrity, and increasing query performance - are still highly valuable no matter the database you are working with. However, the process and rules for the normal forms in relational data normalization likely are not a one-for-one match with other data models. Let’s see some examples using our memory key of “the key, the whole key, and nothing but the key” for three main categories of databases: relational, document, and graph. 

Relational databases were designed to optimize for assembling data into various meaningful sets by joining tables in SQL queries. We already stepped through the normalization process from this perspective, so the benefits for redundancy, efficiency, and data integrity are hopefully clear from our earlier discussion.

In document databases, the model is optimized for grouping related information together in a single document, so that looking up a single customer retrieves all receipts and any other details. This statement is already in conflict with our data redundancy goal because we could potentially duplicate or allow inconsistencies in product information in order to store that data with a customer. Primary keys for documents that will be queried still make sense to avoid multiple lookups, but additional normalization steps may or may not be in conflict with the goals of the database model itself.

Graph databases balance the data integrity provided by relational databases with the pre-assembled relationship data provided by documents, creating a unique model optimized for assembling data relationships without creating more data redundancy. Unique entities identified by a primary key are still important for query and storage efficiency, but relationships are stored as separate entities, naturally teasing related data apart without analyzing each field for partial or non-key dependencies. Normalization exists here, but it feels more organic and less process-driven.

Wrap Up

In summary, we covered the process of data normalization as it pertains to the traditional relational database world. We discussed each step of the three normal forms and applied each one to a coffee shop receipt data set.

Finally, we looked at what data normalization looks like in other types of databases (document and graph) and which normal forms make sense based on the structure of the database model.


Published at DZone with permission of Jennifer Reif, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
