DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Enterprise AI Trend Report: Gain insights on ethical AI, MLOps, generative AI, large language models, and much more.

2024 Cloud survey: Share your insights on microservices, containers, K8s, CI/CD, and DevOps (+ enter a $750 raffle!) for our Trend Reports.

PostgreSQL: Learn about the open-source RDBMS' advanced capabilities, core components, common commands and functions, and general DBA tasks.

AI Automation Essentials. Check out the latest Refcard on all things AI automation, including model training, data security, and more.

Related

  • Recovering an MS SQL Database From Suspect Mode: Step-By-Step Guide
  • Keep Calm and Column Wise
  • SQL Data Storytelling: A Comprehensive Guide
  • Restoring the MS SQL Server Database in Easy Steps

Trending

  • DZone's Article Submission Guidelines
  • Organizing Knowledge With Knowledge Graphs: Industry Trends
  • Getting Started With NCache Java Edition (Using Docker)
  • Data Processing in GCP With Apache Airflow and BigQuery
  1. DZone
  2. Data Engineering
  3. Data
  4. The Difficulty of SQL Stems From Relational Algebra

The Difficulty of SQL Stems From Relational Algebra

This article explores how the fundamental difficulty of SQL actually stems from its theoretical foundation, namely relational algebra.

By 
Judy Liu user avatar
Judy Liu
·
Mar. 27, 24 · Tutorial
Like (2)
Save
Tweet
Share
500 Views

Join the DZone community and get the full member experience.

Join For Free

In the field of structured data computing, SQL is still the most widely used working language, not only adopted by all relational databases but also targeted by many new big data platforms.

For a certain computing technology, people usually care about two efficiencies. One is the descriptive efficiency of operations, and the other is the execution efficiency of operations. This is easy to understand. If the descriptive efficiency is too low, it means that the development cost is too high and it is difficult to write programs for calculation; If the execution efficiency is low, it takes a long time to obtain results, which greatly reduces its practical value. Simply put, it means writing simply and running fast.

However, we have previously analyzed some computational scenarios for structured data, and SQL actually does not perform well in both simple writing and fast running. When the situation is slightly more complex, it is difficult to handle, often resulting in thousands of lines of nested N layers of code and tens of gigabytes of computation requiring several hours to run. Two typical examples are calculating the number of days a stock has been continuously rising and the TopN for the entire set and within groups of a large set. The details will not be repeated here, and those interested can refer to previous posts.

How many days does a stock continuously rise for the longest time? Originally a simple task, due to the lack of ordered computing ability, it needs to be written in a multi-layered nested form, which is difficult to write and understand:

SQL
select max(ContinuousDays) from (

    select count(*) ContinuousDays from (

        select sum(UpDownTag) over (order by TradeDate) NoRisingDays from (

            select TradeDate,case when Price>lag(price) over ( order by TradeDate)then 0 else 1 end UpDownTag from Stock ))

    group by NoRisingDays )


To get the top 10 out of 100 million pieces of data, and to get the top 10 of each group after grouping, there is a significant difference in writing, and they all have ORDER BY in the statement, which means that high complexity large sorting has to be performed, and can only rely on the database optimizer to find low complexity algorithms.

SQL
SELECT TOP 10 * FROM Orders ORDER BY Amount DESC
SELECT * FROM (
    	SELECT *, ROW_NUMBER() OVER (PARTITION BY Area ORDER BY Amount DESC) rn FROM Orders )
WHERE rn<=10


We have also analyzed the reasons. SQL lacks discreteness, resulting in incomplete set orientation, difficult ordered calculations, and difficulty in describing high-performance algorithms…

However, there is a deeper reason behind this, as the fundamental difficulty of SQL actually stems from its theoretical foundation, namely relational algebra. 

To explain this view, we need to analyze what using programs to implement calculations is actually doing.

Essentially, the process of writing a program is the process of translating problem-solving ideas into a precise formal language that can be executed by a computer. For example, just like elementary school students solving practical problems, after analyzing the problem and coming up with a solution, they also need to list four arithmetic expressions. The same goes for calculating with a program. Not only do we need to come up with a solution to the problem, but we also need to translate the solution into actions that the computer can understand and execute in order to implement the calculation. 

For the formal language used to describe computational methods, its core lies in the algebraic system used. We cannot strictly define the concept of an algebraic system here, we will only explain it in layman’s terms. To solve a certain computational problem, people define some data types and a set of operational rules for these data types, ensuring the closeness and consistency of these operations, which can be called an algebraic system. For example, the familiar rational numbers and their four arithmetic operations are an algebraic system used to solve the numerical calculation needs in daily life. Closeness refers to the requirement that the calculation result must still be a defined data object, such as the result of the four arithmetic operations on rational numbers being still a rational number. Consistency refers to the fact that these operations cannot produce contradictory results. For example, we need to agree that they cannot be divided by 0, otherwise dividing a certain number by 0 into any number will lead to logical contradictions.

If the design of this algebraic system is not carefully considered, and the provided data types and operations are inconvenient, it will make it very difficult to describe the algorithm. At this point, a strange phenomenon occurs the difficulty of translating the solution to the code far exceeds that of solving the problem itself. 

For example, we have learned to use Arabic numerals for daily calculations since childhood, and implementing addition, subtraction, multiplication, and division is very convenient. Everyone naturally believes that numerical operations should be like this. In fact, not necessarily! I think most people know about something called Roman numerals. I’m not sure if the Roman numeral system also has familiar addition, subtraction, multiplication, and division operations (which cannot be easily implemented like Arabic numerals, and the definition of operations may be different). I’ve also been confused about how ancient Romans went shopping.

Then, we know that whether a program can be written simply is actually an algebraic problem behind programming languages.

As we have mentioned before, running fast is essentially the same thing as writing simple, that is, it can make high-performance algorithms easy to write. In this way, running fast is still an algebraic problem.

Let’s make another analogy:

Most students who have attended elementary school in China know the story of Gauss calculating 1+2+3+…+100. Ordinary people just add 100 times step by step. The Gauss child is very smart and has found that 1+100=101, 2+99=101, …, 50+51=101, and the result is 50 multiplied by 101, he quickly finished calculating and went home for lunch.

After hearing this story, we all feel that Gauss is very clever and can come up with such a clever method, which is simple and fast. This is not wrong, but it is easy to overlook one point: in the era of Gauss, multiplication already existed in the human arithmetic system (also an algebra)! As mentioned earlier, when we learn the four arithmetic operations from a young age, we may take multiplication for granted, but in fact, multiplication was invented later than addition. Although multiplication is defined by addition, its characteristics can be utilized to calculate using a 9x9 table, which greatly improves efficiency. If there were no multiplication in Gauss’s era, even with clever Gauss, it would be impossible to quickly solve this problem.

The mathematical foundation of SQL is relational algebra, which is an algebraic system used to perform batch-structured data calculations. This is also why databases that use SQL are also called relational databases.

Relational algebra was invented fifty years ago, and the application requirements and hardware environment fifty years ago are vastly different from today. Due to the large number of existing users and the lack of mature new technologies, SQL based on relational algebra remains the most important database development language today. Although there have been some improvements in the past few decades, the foundation has not changed. Faced with contemporary complex requirements and hardware environments, relational databases are not as adept.

Relational algebra is too simple and lacks sufficient data types and operations. Therefore, when using SQL to describe the solution to a problem, it is necessary to find convoluted ways to implement it. For example, the problem of stock price increases, because relational algebra applies the theory of unordered sets in mathematics and does not create the concept of order for SQL, turns a simple problem into a difficult problem, even if it takes a detour, it is difficult to write. This leads to the phenomenon that the difficulty of solving translation problems is greater than that of solving the problem itself, as mentioned earlier. The top 10 problem is also the same case, the aggregation operation designed for relational algebra does not include TOPN and it does not have a set data type. Therefore, this operation cannot be designed as an aggregation operation, and can only be described as a large sort.

Relational algebra is like an arithmetic system that only involves addition but has not yet invented multiplication, and many things are inevitable if not done well.

When the calculation is simple or the performance requirements are not high, using SQL is still relatively convenient, after all, there are many masters and related software is also very rich. However, the data requirements in modern applications are becoming increasingly complex, and the amount of data is also increasing. Continuing to use SQL will seriously affect work efficiency. Moreover, unfortunately, this problem is theoretical, and no matter how optimization in engineering is done, it is of no use and can only be improved to a limited extent, not eradicated. However, the vast majority of database developers would not have thought of this layer, or rather, to cater to the compatibility of existing users, they did not intend to think of this layer. So, the mainstream database industry has been circling in this circle.

Then what should we do? How do we make calculations writing simpler and running faster?

Invent new algebra! Algebra with ‘multiplication.’ This is the difference between esProc SPL. We gave the algebraic foundation of SPL a mathematical name: discrete datasets. SPL is the formal language of this algebra. The advantages of SPL have been discussed multiple times in the previous articles. With the support of new algebra, we are truly able to write simple yet run fast.

sql Data (computing)

Published at DZone with permission of Judy Liu. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Recovering an MS SQL Database From Suspect Mode: Step-By-Step Guide
  • Keep Calm and Column Wise
  • SQL Data Storytelling: A Comprehensive Guide
  • Restoring the MS SQL Server Database in Easy Steps

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: