DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Enterprise AI Trend Report: Gain insights on ethical AI, MLOps, generative AI, large language models, and much more.

2024 Cloud survey: Share your insights on microservices, containers, K8s, CI/CD, and DevOps (+ enter a $750 raffle!) for our Trend Reports.

PostgreSQL: Learn about the open-source RDBMS' advanced capabilities, core components, common commands and functions, and general DBA tasks.

AI Automation Essentials. Check out the latest Refcard on all things AI automation, including model training, data security, and more.

Related

  • Unveiling the Clever Way: Converting XML to Relational Data
  • Enhancing Performance: Optimizing Complex MySQL Queries for Large Datasets
  • Recovering an MS SQL Database From Suspect Mode: Step-By-Step Guide
  • Keep Calm and Column Wise

Trending

  • 6 Agile Games to Enhance Team Building and Creativity
  • PostgresML: Streamlining AI Model Deployment With PostgreSQL Integration
  • OWASP Top 10 Explained: SQL Injection
  • Python for Beginners: An Introductory Guide to Getting Started
  1. DZone
  2. Data Engineering
  3. Databases
  4. The End of Data Refactoring?

The End of Data Refactoring?

For “extract/transform/load,” are you transforming your application’s data for the purposes of your applications or the requirements of your database?

By 
Jennifer Reif user avatar
Jennifer Reif
DZone Core CORE ·
Mar. 28, 24 · Analysis
Like (1)
Save
Tweet
Share
922 Views

Join the DZone community and get the full member experience.

Join For Free

In a former project, I remember going through requirements meetings, hashing out a data model, drawing up a database design, and submitting it to a DBA team for review and approval. There were numerous back-and-forth communications on naming, data types, and structure conventions. Weeks later, the tables were created in the development environment, so that I could ingest test data and build and test the code against it.

When requirements changed, understanding of the data model improved, test data iterations produced different results, or scope creeped in the project, we would start the data model -> database design -> DBA review/approval -> development creation process all over again.

This project is not the only one of its kind. There are a substantial number of project hours burned during each iteration of development, production, as well as maintenance/enhancement projects. By reducing or eliminating the translation step between a data model and database design, we can dramatically improve time-to-market and maintenance costs down the road.

What Is Data Refactoring?

The roots of data refactoring likely point back to code refactoring applied to computer programs. Martin Fowler defines refactoring as “a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.” There are entire books written on this topic, but how does this apply to data and databases?

Code refactoring is often done after an initial draft of code or when improvements/features need to be made. Any of these changes can impact a project before or after initial implementation to make code cleaner, more efficient, and more maintainable. Some refactorings might also occur because of changes to the data structure supporting the code, such as due to a deeper understanding of the data model or additions/subtractions to the collected data itself. Refactoring happens in the data and data model, too.

Notice that while code refactoring can occur without changes to the data model, code is nearly always impacted by changes to the underlying data model. Data refactoring becomes the source of refactors for the model, storage system, and code. Each piece needs careful planning and thorough understanding, which adds considerable project resources.

We see positive benefits from refactoring through better data and code, so isn’t the time spent well worth it? What’s the trouble?

Looking at the data refactoring component of a project, there are several tasks that accompany it. The first is making changes to the data model to fit the new case, feature, etc. The second is to align the data model with the storage technology’s format, i.e. relational, graph, or other. Each type of data storage format comes with its own set of rules, so we often end up making further changes to the data to accommodate the required structure.

This “translation” step between the data model and data storage steps is what can be cumbersome, accompanied by approvals to properly align data structures between the real-world and structured formats in a database. Graph databases can shorten or cut the translation phase because they more naturally model data as it exists in the real world.

Data Refactoring for Databases

In a previous article, we utilized a coffee shop data set with sales receipts, products, and customers. We can use the same data set to look at data refactoring today, but take a different section of the data — stores and their staff assignments. All of the data, change scripts, and more are available in the related Github repository.

Our example data set is in Tables 1 and 2 below.

sales_outlet_id

sales_outlet_type

store_square_feet

store_address

store_city

store_state_province

store_telephone

store_postal_code

manager

1

headquarters

0

unknown

New York

NY

111-222-3333

44444

2

2

warehouse

3400

164-14 Jamaica Ave

Jamaica

NY

972-871-0402

11432

1

3

retail

1300

32-20 Broadway

Long Island City

NY

777-718-3190

11106

6


Table 1. The shop location table

staff_id

first_name

last_name

position

start_date

location

1

Sue

Tindale

CFO

8/3/2001

1

2

Ian

Tindale

CEO

8/3/2001

1

3

Marny

Hermione

Roaster

10/24/2007

2

4

Chelsea

Claudia

Roaster

7/3/2003

2

5

Alec

Isadora

Roaster

4/2/2008

2

6

Xena

Rahim

Store Manager

7/24/2016

3

7

Kelsey

Cameron

Coffee Wrangler

10/18/2003

3

8

Hamilton

Emi

Coffee Wrangler

2/9/2005

3

9

Caldwell

Veda

Coffee Wrangler

9/9/2013

3

10

Ima

Winifred

Coffee Wrangler

12/10/2016

3


Table 2. The staff table

Shop locations include data about the location type, size, address details, phone number, and assigned manager. The staff table contains names, positions, start dates, and assigned shop locations for each associate.

A graph representation of this data could look something like we see below.

Figure 1: Graph representation of shop locations and staff

Figure 1: Graph representation of shop locations and staff

In the graph model, we have two main entities (nodes: Shop and Staff. The relationships between these nodes tell us how they are connected. Either a staff member is assigned to work at a location or a shop is managed by a particular staff member.

Next, let’s take a look at how a few specific refactorings would impact each of these models.

Refactoring 1: Adding a New Column

Storing additional data is a common change in projects. In our coffee shop case, we might want to also track the open date for a shop location (i.e. how long a location has been in operation).

In a relational format, the process would be to add a new column to the table. This likely means explaining the change to stakeholders, writing and executing a Data Definition Language (DDL) statement to alter the table structure (or dropping the entire table and rebuilding it with the new column in the DDL), adding the new data to the population set, then importing the actual data to the table. It may also require change approval steps added between certain tasks. This change does not impact the staff table, so no changes are required there.

sales_outlet_id

sales_outlet_type

store_open_date

store_square_feet

store_address

store_city

store_state_province

store_telephone

store_postal_code

manager

1

headquarters

8/1/2016

0

na

New York

NY

111-222-3333

44444

2

2

warehouse

7/3/2003

3400

164-14 Jamaica Ave

Jamaica

NY

972-871-0402

11432

1

3

retail

8/3/2001

1300

32-20 Broadway

Long Island City

NY

777-718-3190

11106

6


Table 3. Refactor #1 for shop location table

In a graph format, we would need to add a new property to the Shop node. Similar to our relational process above, we would need to explain the change to stakeholders and get any necessary change approvals. However, dropping the data structure, making the addition, and setting the new data structure is all eliminated in a graph because there is no strict DDL. The structure is not forced on the data — rather, the data itself determines the structure and can be adapted when data shifts.


Figure 2. Refactor #1 for graph

Figure 2. Refactor #1 for graph

Example scripts for both relational and graph processes are included in the code repository on GitHub.

Refactoring 2: Adding a New Table and Relationship

Next, perhaps we have issues with staffing coverage at our locations, so we want to retain employee addresses to help us determine who might be able to cover a shift at another location.

While we could store employee addresses directly in the Staff table, addresses are more likely to change than other data, and we might want to keep staff personal details separate from their business information. We can create a separate table to store addresses, which means creating a foreign key relationship between the staff row and related address with a new column, as well as statements for the new table structure and data insert.

staff_id

first_name

last_name

position

start_date

location

address_id

1

Sue

Tindale

CFO

8/3/2001

1

1

2

Ian

Tindale

CEO

8/3/2001

1

2

3

Marny

Hermione

Roaster

10/24/2007

2

3

4

Chelsea

Claudia

Roaster

7/3/2003

2

4

5

Alec

Isadora

Roaster

4/2/2008

2

5

6

Xena

Rahim

Store Manager

7/24/2016

3

6

7

Kelsey

Cameron

Coffee Wrangler

10/18/2003

3

7

8

Hamilton

Emi

Coffee Wrangler

2/9/2005

3

8

9

Caldwell

Veda

Coffee Wrangler

9/9/2013

3

9

10

Ima

Winifred

Coffee Wrangler

12/10/2016

3

10


Table 4. Add foreign key column address_id to Staff table

address_id

staff_id

staff_address

staff_city

staff_state_province

staff_postal_code

staff_telephone

1

1

111 Atlantic Avenue

Brooklyn

NY

11201

999-888-7777

2

2

111 Atlantic Avenue

Brooklyn

NY

11201

777-666-5555

3

3

94-22 117th Street

Jamaica

NY

11419

123-456-7890

4

4

171-12 104th Avenue

Jamaica

NY

11433

987-654-3210

5

5

8210 Surrey Place

Jamaica

NY

11432

234-567-8901

6

6

13 46th Street

Long Island City

NY

11101

345-678-9012

7

7

50-6 46th Street

Woodside

NY

11377

456-789-0123

8

8

212 Leonard Street

Brooklyn

NY

11211

567-890-1234

9

9

33-52 74th Street

Jackson Heights

NY

11372

678-901-2345

10

10

33-1 29th Street

Astoria

NY

11106

789-012-3456


Table 5. New staff_address table

In the graph, we would need to add the data for a new StaffAddress node and its relationship to Staff nodes. Existing data would not be impacted, so we would not need to alter Staff entities in the database.

Figure 3: Refactor No. 2 for graph

Refactoring 3: Adding Data To Existing Tables

For our third and final refactoring, business is booming, and we might want to add newly hired staff members to a new shop location.

In the relational structure, creating a new staff member in the table means we would need a shop location assignment to make the row complete. We would likely need some sort of dependency rule (constraint) that ensures we could not insert a value in the Staff table’s location column if it does not exist in the Shop_Location table. If we want to add the new staff member’s address to the staff_address table, we’d need to set up the same guardrails in that table, as well. This means any new staff member assigned to a new location requires us to first create the location, then the staff member, and then their address. Doing these steps in the wrong order would result in errors.

sales_outlet_id

sales_outlet_type

store_open_date

store_square_feet

store_address

store_city

store_state_province

store_telephone

store_postal_code

manager

1

headquarters

8/1/2016

0

na

New York

NY

111-222-3333

44444

2

2

warehouse

7/3/2003

3400

164-14 Jamaica Ave

Jamaica

NY

972-871-0402

11432

1

3

retail

8/3/2001

1300

32-20 Broadway

Long Island City

NY

777-718-3190

11106

6

4

retail

9/15/2022

1400

376 Union Avenue

Brooklyn

NY

978-878-0488

11211

11


Table 6. Shop_location table with new row

staff_id

first_name

last_name

position

start_date

location

address_id

1

Sue

Tindale

CFO

8/3/2001

1

1

2

Ian

Tindale

CEO

8/3/2001

1

2

3

Marny

Hermione

Roaster

10/24/2007

2

3

4

Chelsea

Claudia

Roaster

7/3/2003

2

4

5

Alec

Isadora

Roaster

4/2/2008

2

5

6

Xena

Rahim

Store Manager

7/24/2016

3

6

7

Kelsey

Cameron

Coffee Wrangler

10/18/2003

3

7

8

Hamilton

Emi

Coffee Wrangler

2/9/2005

3

8

9

Caldwell

Veda

Coffee Wrangler

9/9/2013

3

9

10

Ima

Winifred

Coffee Wrangler

12/10/2016

3

10

11

Jasmine

Patterson

Store Manager

9/12/2022

4

11

12

Jose

Vino

Coffee Wrangler

09/13/2022

4

12


Table 7. Staff table with 2 new staff

address_id

staff_id

staff_address

staff_city

staff_state_province

staff_postal_code

staff_telephone

1

1

111 Atlantic Avenue

Brooklyn

NY

11201

999-888-7777

2

2

111 Atlantic Avenue

Brooklyn

NY

11201

777-666-5555

3

3

94-22 117th Street

Jamaica

NY

11419

123-456-7890

4

4

171-12 104th Avenue

Jamaica

NY

11433

987-654-3210

5

5

8210 Surrey Place

Jamaica

NY

11432

234-567-8901

6

6

13 46th Street

Long Island City

NY

11101

345-678-9012

7

7

50-6 46th Street

Woodside

NY

11377

456-789-0123

8

8

212 Leonard Street

Brooklyn

NY

11211

567-890-1234

9

9

33-52 74th Street

Jackson Heights

NY

11372

678-901-2345

10

10

33-1 29th Street

Astoria

NY

11106

789-012-3456

11

11

107 Irving Avenue

Brooklyn

NY

11237

890-123-4567

12

12

43-1 Cambridge Place

Brooklyn

NY

11238

901-234-5678


Table 8. Staff_address table with 2 new addresses

For our graph version, we simply need to add the new data. The structure remains the same, and existing data is not impacted.

Figure 4: Refactor No. 3 for graph

Graphs Reduce Data Refactoring

We have seen how data refactoring impacts both relational and graph databases. Relational databases require a more intensive process to make changes because of the separation of table structure and actual data. In contrast, graphs remove the extra translation step between real-world data and database structure because they more naturally model data as it exists in the real world.

Time spent in today’s example might have seemed trivial, but what happens when you have thousands, millions, or billions of shops, staff members, and addresses? Business-critical systems discourage teams from making changes due to the labor involved and potential impacts. Graphs keep the data and model we already have and alter only what has changed.

Graph refactoring allows businesses to be adaptable and agile, giving them the power to morph as the industry or data around them changes. Moving to graphs in a current project can reduce time-to-market now, as well as improve future maintainability, risk mitigation, and additional feature development.

Learn more about graph modeling and refactoring through Neo4j GraphAcademy, where you can find self-paced and free online courses!

Database Database design Data (computing)

Published at DZone with permission of Jennifer Reif, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Unveiling the Clever Way: Converting XML to Relational Data
  • Enhancing Performance: Optimizing Complex MySQL Queries for Large Datasets
  • Recovering an MS SQL Database From Suspect Mode: Step-By-Step Guide
  • Keep Calm and Column Wise

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: