🛠️ToolsShed

Unicode Text Normalizer

Normalize Unicode text using NFC, NFD, NFKC, or NFKD forms.


About this tool

Unicode text normalization converts strings into a consistent form so that different code point sequences representing the same character are treated identically. For example, "é" can be stored as the single code point U+00E9 or as "e" followed by the combining acute accent U+0301. This matters when comparing user input, matching database records, or handling text from diverse sources such as web forms, APIs, or international documents. Without normalization, two strings that look identical on screen can differ at the code point and byte level, causing unexpected mismatches.
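To make the "looks identical, differs in code" problem concrete, here is a small sketch that lists the code points behind two renderings of "é". The helper name `describe` is made up for this illustration; only the standard `unicodedata` module is used.

```python
import unicodedata

def describe(text: str) -> list[str]:
    """List each code point in text as 'U+XXXX NAME' (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

composed = "\u00e9"     # single precomposed code point
decomposed = "e\u0301"  # base letter followed by a combining mark

print(composed == decomposed)  # False: the code point sequences differ
print(describe(composed))      # ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
print(describe(decomposed))    # ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
```

Both strings render the same way in most fonts, which is why a plain equality check is not enough before normalization.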

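To use this tool, paste or type your text into the input field, select a normalization form (NFC, NFD, NFKC, or NFKD), and click the Normalize button. The tool displays the normalized result along with character counts and byte sizes before and after. NFC and NFKC are the most common choices for general use and web applications; NFD and NFKD are useful when you need base characters and combining marks as separate code points, for example in linguistic analysis or when comparing decomposed forms. The compatibility forms (NFKC and NFKD) also fold compatibility variants such as full-width letters and ligatures into their plain equivalents, so they can discard distinctions that NFC and NFD preserve.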
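The before-and-after counts described above can be approximated in a few lines of Python. This is a rough sketch, not the tool's actual implementation: the function name `size_report` is hypothetical, "characters" are counted as code points, and byte sizes assume UTF-8 encoding.

```python
import unicodedata

def size_report(text: str) -> dict[str, tuple[int, int]]:
    """Map each form to (code point count, UTF-8 byte count).

    Hypothetical helper; UTF-8 is an assumption about how bytes are counted.
    """
    report = {"input": (len(text), len(text.encode("utf-8")))}
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, text)
        report[form] = (len(normalized), len(normalized.encode("utf-8")))
    return report

# 'e' + combining acute accent, a space, then the 'fi' ligature (U+FB01)
sample = "e\u0301 \ufb01"
for label, (chars, size) in size_report(sample).items():
    print(f"{label:5} chars={chars} bytes={size}")
# input chars=4 bytes=7
# NFC   chars=3 bytes=6
# NFD   chars=4 bytes=7
# NFKC  chars=4 bytes=5
# NFKD  chars=5 bytes=6
```

The canonical forms leave the ligature alone, while the compatibility forms expand it to "fi", which is why the counts move in different directions.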

Developers working with multilingual text, content managers handling international user data, and anyone dealing with text imports from legacy systems will benefit from this tool. Unicode normalization becomes particularly important when building search functions, form validation systems, or data deduplication pipelines where consistency is critical.
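As one possible illustration of a deduplication step, the sketch below builds a comparison key from NFKC normalization followed by case folding. The names `dedup_key` and `deduplicate` are hypothetical, and choosing NFKC plus `casefold()` is an assumption about what "the same value" should mean; a real pipeline may want a stricter or looser key.

```python
import unicodedata

def dedup_key(value: str) -> str:
    """Comparison key: NFKC form, then case-folded (an assumed policy)."""
    return unicodedata.normalize("NFKC", value).casefold()

def deduplicate(values: list[str]) -> list[str]:
    """Keep the first occurrence of each value, comparing by dedup_key."""
    seen: set[str] = set()
    unique: list[str] = []
    for value in values:
        key = dedup_key(value)
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique

names = [
    "Caf\u00e9",                 # precomposed 'é'
    "Cafe\u0301",                # 'e' + combining acute accent
    "\uff23\uff41\uff46\u00e9",  # full-width 'Caf' + precomposed 'é'
    "Tea",
]
print(deduplicate(names))  # ['Café', 'Tea']
```

The original spelling of the first occurrence is kept; only the key used for comparison is normalized.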


Code Implementation

import unicodedata

text = "e\u0301"  # 'e' + combining acute accent (looks like 'é')

# NFC: Canonical Decomposition, then Canonical Composition
nfc = unicodedata.normalize("NFC", text)
print(f"NFC:  {nfc!r}  len={len(nfc)}")   # 'é'  len=1

# NFD: Canonical Decomposition
nfd = unicodedata.normalize("NFD", text)
print(f"NFD:  {nfd!r}  len={len(nfd)}")   # 'é'  len=2  ('e' + U+0301, rendered as one glyph)

# NFKC: Compatibility Decomposition, then Canonical Composition
full_width = "\uff41\uff42\uff43"  # full-width 'ａｂｃ'
nfkc = unicodedata.normalize("NFKC", full_width)
print(f"NFKC: {nfkc!r}")  # 'abc'

# NFKD: Compatibility Decomposition
nfkd = unicodedata.normalize("NFKD", full_width)
print(f"NFKD: {nfkd!r}")  # 'abc'

# Check if two strings are canonically equivalent
def canon_equal(a: str, b: str) -> bool:
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canon_equal("é", "e\u0301"))  # True
