How to Geo-Partition Data in Distributed SQL

How to achieve high performance and regulatory compliance with partitioning.

By Karthik Ranganathan · Dec. 01, 2020 · Tutorial


We are excited to announce the availability of row-level geo-partitioning in YugabyteDB, a feature heavily requested by our user community and enterprise customers alike. This feature allows fine-grained, per-row control over pinning data in a user table to geographic locations, thereby allowing data residency to be managed at the database level.

Making the nodes of a multi-region database cluster aware of the location characteristics of the data they store allows the database to conform to regulatory requirements such as GDPR by keeping the appropriate subset of data local to each region. It is also arguably the most intuitive way to eliminate the high latency that would otherwise be incurred when operations have to reach faraway regions.

In keeping with our 100% open source ethos, this feature is available in YugabyteDB under the Apache 2.0 license, so you can try it out right from your laptop! This blog post dives into the details of the feature: the high-level design, the use cases it benefits, and how to use it.

Why Geo-Partition Data?

Geo-partitioning of data is critical to some global applications. However, there is no single multi-region deployment topology that all global applications use; the topology varies significantly with the needs of the application, and several of these topologies are both common and critical.

In order to understand why geo-partitioning is useful, it is necessary to understand some of these deployment topologies, which are summarized in the table below.


| | Synchronous multi-region | Asynchronous multi-region | Geo-partitioned multi-region |
|---|---|---|---|
| Data replication across geographies | All data replicated across regions | All data replicated inside region, some data replicated across regions | Data partitioned across regions, partitions replicated inside region |
| Latency of queries from different regions | High | Low | Low |
| Consistency semantics | Transactional | Eventual consistency | Transactional |
| Schema changes across regions | Transparently managed | Manually propagated | Transparently managed |
| Data loss on region failure | None | Some data loss | No data loss (partial unavailability is possible) |

As seen from the table above, geo-partitioning satisfies use cases that need low latency without sacrificing transactional consistency, while still propagating schema changes transparently across regions. It makes it easy for developers to move data closer to users for lower latency and higher performance, and to meet data residency requirements imposed by regulations such as GDPR.

How Does It Work?

Geo-partitioning of data enables fine-grained, row-level control over the placement of table data across different geographical locations. This is accomplished in two simple steps – first, partitioning a table into user-defined table partitions, and subsequently pinning these partitions to the desired geographic locations by configuring metadata for each partition.

The first step of creating user-defined table partitions is done by designating a column of the table as the partition column that will be used to geo-partition the data. The value of this column for a given row is used to determine the table partition that the row belongs to. The figure below shows this.

The second step involves configuring the partitions created in step one to pin data to the respective geographic locations by setting the appropriate metadata. Note that the data in each partition can be configured to get replicated across multiple zones in a cloud provider region, or across multiple nearby regions / datacenters.

An entirely new geographic partition can be introduced dynamically by adding a new table partition and configuring it to keep the data resident in the desired geographic location. Data in one or more of the existing geographic locations can be purged efficiently simply by dropping the necessary partitions. Users of traditional RDBMS would recognize this scheme as being close to user-defined list-based table partitions, with the ability to control the geographic location of each partition.
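
For example, purging the data for one location is a single DDL statement. Below is a minimal sketch, assuming a table partitioned by a geo_partition column with a per-location partition named transactions_eu (these names match the example later in this post and are illustrative):

SQL

-- Illustrative sketch: dropping the EU partition purges only the rows
-- pinned to the EU location; all other partitions are untouched.
DROP TABLE transactions_eu;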

In this deployment, users can access their data with low latencies because the data resides on servers that are geographically close by, and the queries do not need to access data in far away geographic locations. This is shown in the diagram below.

Example Scenario

Let us look at this feature in the context of a use case. Say that a large but imaginary bank, Yuga Bank, wants to offer an online banking service to users in many countries by processing their deposits, withdrawals, and transfers. The following attributes would be required in order to build such a service.

  • Transactional semantics with high availability: Consistency of data is paramount in a banking application, hence the database should be ACID compliant. Additionally, users expect the service to always be available, making high availability and resilience to failures a critical requirement.
  • High performance: The online transactions need to be processed with a low latency in order to ensure a good end-user experience. This requires that the data for a particular user is located in a nearby geographic region. Putting all the data in a single location in an RDBMS would mean the requests for users residing far away from that location would have very high latencies, leading to a poor user experience.
  • Data residency requirements for compliance: Many countries have regulations around which geographic regions the personal data of their residents can be stored in, and bank transactions being personal data are subject to these requirements. For example, GDPR has a data residency stipulation which effectively requires that the personal data of individuals in the EU be stored in the EU. Similarly, India has a requirement issued by the Reserve Bank of India (or RBI for short) making it mandatory for all banks, intermediaries, and other third parties to store all information pertaining to payments data in India – though in case of international transactions, the data on the foreign leg of the transaction can be stored in foreign locations.

Pitfalls of the Traditional “One Database Per Region” Approach

It is possible to deploy and manage independent relational databases in the different geographic regions, each storing the data for the appropriate set of users to achieve both data locality and compliance with regulatory requirements. The disadvantages of such an approach are:

  • Since a given user can travel and perform transactions from different geographic regions, the data for that user can get fragmented over different independent databases over time. Operations such as viewing the transaction history for a user can get hard to implement.
  • Additionally, the application would need to encode the database deployment topology in order to connect to the correct set of databases for a given user, and would need to be constantly updated as the deployment topology changes. This can make the application development very complex.
  • Ensuring high availability and scalability of multiple independent single-node RDBMS deployments in production is operationally hard and error prone.

Using Geo-Partitioning for the Example Scenario

In the geo-partitioning approach, we simply deploy one YugabyteDB cluster across the different regions and create a geo-partitioned table for storing the user transactions as shown below.

Step 1. Create the Parent Table and Partitions

First, we create the parent table that contains a geo_partition column which is used to create list-based partitions for each geographic region we want to partition data into.

SQL

CREATE TABLE transactions (
    user_id   INTEGER NOT NULL,
    account_id INTEGER NOT NULL,
    geo_partition VARCHAR,
    account_type VARCHAR NOT NULL,
    amount NUMERIC NOT NULL,
    txn_type VARCHAR NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
) PARTITION BY LIST (geo_partition);


Next, we create one partition per desired geography under the parent table. In the example below, we create three table partitions – one for the EU region called transactions_eu, another for the India region called transactions_india, and a third default partition for the rest of the regions called transactions_default.

SQL

CREATE TABLE transactions_eu
    PARTITION OF transactions
      (user_id, account_id, geo_partition, account_type,
       amount, txn_type, created_at,
       PRIMARY KEY (user_id HASH, account_id, geo_partition))
    FOR VALUES IN ('EU');

CREATE TABLE transactions_india
    PARTITION OF transactions
      (user_id, account_id, geo_partition, account_type,
       amount, txn_type, created_at,
       PRIMARY KEY (user_id HASH, account_id, geo_partition))
    FOR VALUES IN ('India');

CREATE TABLE transactions_default
    PARTITION OF transactions
      (user_id, account_id, geo_partition, account_type,
       amount, txn_type, created_at,
       PRIMARY KEY (user_id HASH, account_id, geo_partition))
    DEFAULT;


Note that these statements above will create the partitions, but will not pin them to the desired geographical locations. This is done in the next step. The table and partitions created so far can be viewed using the \d command.

SQL

yugabyte=# \d
                List of relations
 Schema |         Name         | Type  |  Owner
--------+----------------------+-------+----------
 public | transactions         | table | yugabyte
 public | transactions_default | table | yugabyte
 public | transactions_eu      | table | yugabyte
 public | transactions_india   | table | yugabyte
(4 rows)


Step 2. Pin Partitions to Geographic Locations

Now that we have a table with the desired three partitions, the final step is to pin the data of these partitions to the desired geographical locations. In the example below, we are going to use regions and zones in the AWS cloud.

First, we pin the data of the EU partition transactions_eu to live across three zones of the Europe (Frankfurt) region eu-central-1 as shown below.

Shell

$ yb-admin --master_addresses <yb-master-addresses>           \
    modify_table_placement_info ysql.yugabyte transactions_eu \
    aws.eu-central-1.eu-central-1a,aws.eu-central-1.eu-central-1b,\
    ... 3


Second, we pin the data of the India partition transactions_india to live across three zones in India – Asia Pacific (Mumbai) region ap-south-1 as shown below.

Shell

$ yb-admin --master_addresses <yb-master-addresses>              \
    modify_table_placement_info ysql.yugabyte transactions_india \
    aws.ap-south-1.ap-south-1a,aws.ap-south-1.ap-south-1b,... 3


Finally, pin the data of the default partition transactions_default to live across three zones in the US West (Oregon) region us-west-2. This is shown below.

Shell

$ yb-admin --master_addresses <yb-master-addresses>                \
    modify_table_placement_info ysql.yugabyte transactions_default \
    aws.us-west-2.us-west-2a,aws.us-west-2.us-west-2b,... 3


Step 3. Pinning User Transactions to Geographic Locations

Now, the setup should automatically be able to pin rows to the appropriate regions based on the value set in the geo_partition column. Let us test this by inserting a few rows of data and verifying they are written to the correct partitions.

First, we insert a row into the table with the geo_partition column value set to EU below.

SQL

INSERT INTO transactions
    VALUES (100, 10001, 'EU', 'checking', 120.50, 'debit');


The row above should be inserted into the transactions_eu partition, and not into any of the others. We can verify this as shown below. Note that we have turned on expanded auto output formatting for better readability by running the statement shown below.

SQL

yugabyte=# \x auto
Expanded display is used automatically.


The row must be present in the transactions table, as seen below.

SQL

yugabyte=# select * from transactions;
-[ RECORD 1 ]-+---------------------------
user_id       | 100
account_id    | 10001
geo_partition | EU
account_type  | checking
amount        | 120.5
txn_type      | debit
created_at    | 2020-11-07 21:28:11.056236


Additionally, the row must be present only in the transactions_eu partition, which can be easily verified by running the select statement directly against that partition. The other partitions should contain no rows.

SQL

yugabyte=# select * from transactions_eu;
-[ RECORD 1 ]-+---------------------------
user_id       | 100
account_id    | 10001
geo_partition | EU
account_type  | checking
amount        | 120.5
txn_type      | debit
created_at    | 2020-11-07 21:28:11.056236

yugabyte=# select count(*) from transactions_india;
 count
-------
     0

yugabyte=# select count(*) from transactions_default;
 count
-------
     0
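
Note that because the partition a row belongs to is derived from the geo_partition column, a query that includes geo_partition in its WHERE clause can be pruned to a single partition instead of touching every region. A quick way to sanity-check this is shown below; this is a sketch that assumes the PostgreSQL-compatible EXPLAIN exposed by YSQL, and the exact plan output will vary by version.

SQL

-- Only the transactions_eu partition should appear in the query plan,
-- since the WHERE clause restricts the scan to the 'EU' partition.
EXPLAIN SELECT * FROM transactions
    WHERE user_id = 100 AND geo_partition = 'EU';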


Now, let us insert data into the other partitions.

SQL

INSERT INTO transactions
    VALUES (200, 20001, 'India', 'savings', 1000, 'credit');
INSERT INTO transactions
    VALUES (300, 30001, 'US', 'checking', 105.25, 'debit');


These can be verified as shown below.

SQL

yugabyte=# select * from transactions_india;
-[ RECORD 1 ]-+---------------------------
user_id       | 200
account_id    | 20001
geo_partition | India
account_type  | savings
amount        | 1000
txn_type      | credit
created_at    | 2020-11-07 21:45:26.011636

yugabyte=# select * from transactions_default;
-[ RECORD 1 ]-+---------------------------
user_id       | 300
account_id    | 30001
geo_partition | US
account_type  | checking
amount        | 105.25
txn_type      | debit
created_at    | 2020-11-07 21:45:26.067444


Step 4. Users Traveling Across Geographic Locations

To make things interesting, let us say user 100, whose first transaction was performed in the EU region, travels to India and the US and performs two more transactions. This can be simulated using the following statements.

SQL

INSERT INTO transactions
    VALUES (100, 10001, 'India', 'savings', 2000, 'credit');
INSERT INTO transactions
    VALUES (100, 10001, 'US', 'checking', 105, 'debit');


Now, each of the transactions would be pinned to the appropriate geographic locations. This can be verified as follows.

SQL

yugabyte=# select * from transactions_india where user_id=100;
-[ RECORD 1 ]-+---------------------------
user_id       | 100
account_id    | 10001
geo_partition | India
account_type  | savings
amount        | 2000
txn_type      | credit
created_at    | 2020-11-07 21:56:26.760253

yugabyte=# select * from transactions_default where user_id=100;
-[ RECORD 1 ]-+---------------------------
user_id       | 100
account_id    | 10001
geo_partition | US
account_type  | checking
amount        | 105
txn_type      | debit
created_at    | 2020-11-07 21:56:26.794173


All the transactions made by the user can efficiently be retrieved using the following SQL statement.

SQL

yugabyte=# select * from transactions where user_id=100 order by created_at desc;
-[ RECORD 1 ]-+---------------------------
user_id       | 100
account_id    | 10001
geo_partition | US
account_type  | checking
amount        | 105
txn_type      | debit
created_at    | 2020-11-07 21:56:26.794173
-[ RECORD 2 ]-+---------------------------
user_id       | 100
account_id    | 10001
geo_partition | India
account_type  | savings
amount        | 2000
txn_type      | credit
created_at    | 2020-11-07 21:56:26.760253
-[ RECORD 3 ]-+---------------------------
user_id       | 100
account_id    | 10001
geo_partition | EU
account_type  | checking
amount        | 120.5
txn_type      | debit
created_at    | 2020-11-07 21:28:11.056236


Step 5. Adding a New Geographic Location

Assume that after a while, our fictitious Yuga Bank gets a lot of customers across the globe, and wants to offer the service to residents of Brazil, which also has data residency laws. Thanks to row-level geo-partitioning, this can be accomplished easily. We can simply add a new partition and pin it to the AWS South America (São Paulo) region sa-east-1 as shown below.

SQL

CREATE TABLE transactions_brazil
    PARTITION OF transactions
      (user_id, account_id, geo_partition, account_type,
       amount, txn_type, created_at,
       PRIMARY KEY (user_id HASH, account_id, geo_partition))
    FOR VALUES IN ('Brazil');

Shell

$ yb-admin --master_addresses <yb-master-addresses>               \
    modify_table_placement_info ysql.yugabyte transactions_brazil \
    aws.sa-east-1.sa-east-1a,aws.sa-east-1.sa-east-1b,... 3


And with that, the new region is ready to store transactions of the residents of Brazil.

SQL

INSERT INTO transactions
    VALUES (400, 40001, 'Brazil', 'savings', 1000, 'credit');

yugabyte=# select * from transactions_brazil;
-[ RECORD 1 ]-+-------------------------
user_id       | 400
account_id    | 40001
geo_partition | Brazil
account_type  | savings
amount        | 1000
txn_type      | credit
created_at    | 2020-11-07 22:09:04.8537


Conclusion

The YugabyteDB 2.5 release adds row-level geo-partitioning, a critical feature that enables some key use cases, along with follower reads, to the already extensive set of multi-region features in YugabyteDB. These new features, combined with the ability to perform synchronous replication across three or more regions and asynchronous replication (called xCluster replication) across two or more regions, make YugabyteDB the distributed SQL database with the most comprehensive set of multi-region deployment options. These options, spanning multiple data centers, regions, and/or clouds, give users even more control to bring data close to their customers for performance, cost, or compliance reasons. All of the features mentioned are 100% open source under the Apache 2.0 license.
