DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Zones

Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

Enterprise AI Trend Report: Gain insights on ethical AI, MLOps, generative AI, large language models, and much more.

2024 Cloud survey: Share your insights on microservices, containers, K8s, CI/CD, and DevOps (+ enter a $750 raffle!) for our Trend Reports.

PostgreSQL: Learn about the open-source RDBMS' advanced capabilities, core components, common commands and functions, and general DBA tasks.

AI Automation Essentials. Check out the latest Refcard on all things AI automation, including model training, data security, and more.

Related

  • The Long Road to Java Virtual Threads
  • Exploring Exciting New Features in Java 17 With Examples
  • Proper Java Exception Handling
  • Generics in Java and Their Implementation

Trending

  • Harnessing the Power of SIMD With Java Vector API
  • DSL Validations: Properties
  • Build Your Own Programming Language
  • Enhancing Performance With Amazon Elasticache Redis: In-Depth Insights Into Cluster and Non-Cluster Modes
  1. DZone
  2. Data Engineering
  3. Data
  4. Proper String Normalization for Comparison Purposes

Proper String Normalization for Comparison Purposes

This article exposes some of the most important points to take into consideration when comparing strings from different systems.

By 
Bertrand Florat user avatar
Bertrand Florat
·
Updated Jan. 22, 21 · Tutorial
Like (3)
Save
Tweet
Share
11.4K Views

Join the DZone community and get the full member experience.

Join For Free

TL;DR

In Java, do:

Java
 




xxxxxxxxxx
1


 
1
String normalizedString = Normalizer.normalize(originalString,Normalizer.Form.NFKD)
2
.replaceAll("[^\\p{ASCII}]", "").toLowerCase().replaceAll("\\s{2,}", " ").trim();


Introduction to String Normalization

Nowadays, most strings are Unicode-encoded and we are able to work with many different native characters with diacritical signs/accents (like ö, é, À) or ligatures (like æ or ʥ). Characters can be stored in UTF-8 (for instance) and associated glyphs can be displayed properly if the font supports them.

However, we often observe recurring difficulties when comparing strings issued from different information systems and/or initially typed by humans.

The human brain is a machine to fill gaps. Hence it has absolutely no problem reading or typing 'e' instead of 'ê'.

But what if the word 'tête' ('head' in French) is correctly stored in an UTF-8 encoded database, but you have to compare it with text created by an end-user that's the missing accent?

We also have to deal with legacy systems or modern ones filled up with legacy data that doesn't support the Unicode standard.

Another simple illustration of this problem is the use of ligatures. Imagine a product database storing various items with an ID and a description. Some items contain ligatures (a combination of several letters joined together to create a single character like ’Œuf’ - egg in French). Like most French people, I have no idea of how to produce such a character, even using a French keyboard. I would search for the items descriptions using oeuf. Obviously, our code has to take care of ligatures if we want to return a useful result containing ’Œuf’.

How can we fix that mess?

Rule #1: Don't Ever Compare Human Text if You Can

When you can, never compare strings to heterogeneous systems. It is surprisingly tricky to do it properly (even if it is possible to handle most cases like we will see below). Instead, compare sequences, UUID or any other ASCII-based strings without spaces or ‘special’ characters. Strings that come from different information systems will probably store data differently (lower/upper case, with/without diacritics, etc.). On the contrary, good ids are made up of plain ASCII strings and are thus free from encoding issues.

Example:

System 1 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"Œeuf brouillé"}

System 2 : {"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"OEUF BROUILLE"}

If you compare ids, everything is simple and you can go home early. If you compare descriptions, you'll have to normalize it as a prerequisite or you'll be in big trouble.

Character normalization is the action of computing a canonical form of a string. To avoid false positives when comparing strings coming from several information systems, normalize both strings and compare the result of their normalization.

In the previous example, we would compare normalize("Œeuf brouillé") with normalize("OEUF BROUILLE"). Using a proper normalization function, we should then compare 'oeuf brouille' with 'oeuf brouille' but if the normalization function is buggy or partial, the strings would mismatch. For example, if the normalize() function doesn't handle ligatures properly, you would get a false positive by comparing 'œuf brouille' with 'oeuf brouille'.

Rule #2: Normalize in Memory

It is better to compare strings at the last possible moment and to do so in memory and not to normalize strings at storage time. There are at least two reasons for this:

  1. If you only store a normalized version of your string, you lose information. You may need proper diacritics later for display purposes or other reasons. As an IT professional, one of your tasks is to never lose information humans provided you.
  2. What if some items have been stored before the normalization routine has been set up? What if the normalization function changed over time?

To avoid these common pitfalls, simply compare in memory normalize(<data system 1>) with normalize(<data system 2>). The CPU overhead should be negligible if you don't compare thousands of items per second.

Rule #3: Always Trim Externally and Internally

Another common trap when dealing with strings typed by humans is the presence of spaces at the beginning or in the middle of a sequence of characters.

As an example, look at these strings: ' Wiliam' (note the space at the beginning), 'Henry ' (note the space at the end), 'Gates  III' (see the double space in the middle of this family name, did you notice it at first?).

Appropriate solution:

  1. Trim the text to remove spaces at the beginning and at the end of the text.
  2. Remove superfluous spaces in the middle of the string.

In Java, one of the ways to achieve this is:

Java
 




x


 
1
s = s.replaceAll("\\s{2,}", " ").trim();


Rule #4: Harmonize Letter Casing

This is the most well-known and straightforward normalization method: simply put every letter into lower or upper case. As far as I know, there is no preference for one or the other choice. Most developers (myself included) use lower case.

In Java, just use toLowerCase():

Java
 




xxxxxxxxxx
1


 
1
s = s.toLowerCase();


Rule #5: Transform Characters With Diacritical Signs to ASCII

When typed out, diacritical signs are often omitted in favor of their ASCII version. For example, one can type the German word 'schon' instead of 'schön'.

Unicode proposes four Normalization forms that may be used for that purpose (NFC, NFD, NFKD, and NFKC). Check-out this enlightening illustration.

Detailing all these forms would go beyond the scope of this article but, basically, some Unicode characters can be encoded either as a single combined character or in a decomposed form. For instance, 'é' can be encoded as \u00e9 code point or as the decomposed form '\u0065' ('e' letter) + '\u0301' (the diacritic '◌́'') afterward.

We will perform an NFD ("Canonical Decomposition") normalization method on the initial text to make sure that every character with an accent is converted to its decomposed form. Then, all we have to do is to drop the diacritics and only keep the 'base' simple characters.

In Java, both operations can be done this way:

Java
 




xxxxxxxxxx
1


 
1
s = Normalizer.normalize(s, Normalizer.Form.NFD)
2
    .replaceAll("[^\\p{ASCII}]", "");


Note: even if the code covers this issue, I prefer the NFKD transformation to deal with ligatures as well (see below).

Rule #6: Decompose Ligatures to a Set of ASCII Characters

The other thing to understand is that Unicode maintains some compatibility mapping between about 5000 ‘composite’ characters (like ligatures or precomposed Roman numerals) and a list of regular characters. Characters supporting this feature are documented (check the 'decomposition' attribute in the Unicode characters documentation).

For instance; the Roman numeral Ⅻ (U+216B) can be decomposed with NFKD normalization as an 'X' and two 'I's. Likewise, the ij (U+0133) character (like in 'fijn' - 'nice' in Dutch) can be decomposed into an 'i' and a 'j'.

For these kinds of 'Siamese twin' characters, we have to apply the NFKD ("Compatibility Decomposition") normalization form that both decomposes the characters (see 'Rule #5' previously) but also maps ligatures to several 'base' characters. You can then drop the remaining diacritics.

In Java, use:

Java
 




x


 
1
s = Normalizer.normalize(s, Normalizer.Form.NFKD)
2
    .replaceAll("[^\\p{ASCII}]", "");


Now the bad news: for obscure reasons, Unicode doesn't support the decomposition equivalence of some widely used ligatures like the French 'œ' and 'æ' or the German eszett 'ß'. If you need to handle them, you will have to write your own replacements before applying the NFKD normalization:

Java
 




xxxxxxxxxx
1


 
1
    s = s.replaceAll("œ", "oe");
2
    s = s.replaceAll("æ", "ae");
3
    s = Normalizer.normalize(s, Normalizer.Form.NFKD)
4
    .replaceAll("[^\\p{ASCII}]", "");


Rule #7: Beware Punctuation

This is more of a minor issue, but, according to the context, you may want to normalize some special punctuation characters as well.

For example, in a literary context, like a text-revision software, it may be a good idea to map the em/long dash ('—') character to the regular ASCII hyphen ('-').

As far as I know, Unicode doesn't provide mapping for that, so you'll have to do it yourself the old fashioned way:

Java
 




xxxxxxxxxx
1


 
1
s = s.replaceAll("—", "-");


Final Word

String normalization is very helpful when comparing strings issued from different systems or to perform appropriate comparisons. Even fully English localized projects can benefit from it, for instance to take care of casing or trailing spaces or when dealing with foreign words with accents.

This article exposes some of the most important points to take into consideration but it is far from exhaustive. For instance, we omitted Asian character manipulation or cultural normalization of semantically equivalents items (like the 'St' abbreviation of 'Saint') but I hope it is a good start for most projects.

Strings Data Types Java (programming language) Comparison (grammar) Space (architecture)

Published at DZone with permission of Bertrand Florat. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • The Long Road to Java Virtual Threads
  • Exploring Exciting New Features in Java 17 With Examples
  • Proper Java Exception Handling
  • Generics in Java and Their Implementation

Partner Resources


Comments

ABOUT US

  • About DZone
  • Send feedback
  • Community research
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 100
  • Nashville, TN 37211
  • support@dzone.com

Let's be friends: