#Unicode and .NET Requirements
XML technologies rely heavily on Unicode — for character encoding, string normalization, collation (sorting), and regular expression character classes. Getting this right in .NET requires understanding how .NET handles globalization, and specifically why PhoenixmlDb requires ICU.
This isn't just academic. Incorrect Unicode handling causes:
-
normalize-unicode()returning wrong results or throwing errors -
Collation-based sorting producing incorrect order
-
Regular expression character classes (
\p{L},\p{N}) not matching expected characters -
String comparisons failing for non-ASCII text
#Contents
#Why Unicode Matters for XML
XML was designed for Unicode from the start. The XML specification requires:
-
All XML processors must support UTF-8 and UTF-16
-
Element and attribute names can contain Unicode characters (not just ASCII)
-
The
xml:langattribute specifies content language -
XPath/XQuery string functions operate on Unicode codepoints
-
Collation determines sort order for different languages
XPath 3.1/4.0 has dedicated Unicode functions:
normalize-unicode("café", "NFC") (: Unicode normalization :)
compare("straße", "strasse", "http://www.w3.org/2013/collation/UCA?lang=de")
characters("café") (: split into grapheme clusters :)
codepoint("é") (: Unicode codepoint value :)
upper-case("straße") (: "STRASSE" — language-aware :)These functions depend on the Unicode Character Database (UCD) and locale-aware collation data. In .NET, this data comes from ICU.
#.NET Globalization Modes
.NET has two globalization modes:
#ICU Mode (Required)
Uses the International Components for Unicode (ICU) library for:
-
Unicode normalization (NFC, NFD, NFKC, NFKD)
-
Locale-aware string comparison and sorting
-
Case conversion with language-specific rules
-
Regular expression character classes
-
Calendar and number formatting
ICU is the same library used by Chrome, Node.js, Java, and most other platforms. It provides correct, standards-compliant Unicode behavior.
#Invariant Globalization Mode (Not Compatible)
When DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1 is set or <InvariantGlobalization>true</InvariantGlobalization> is in your project file, .NET uses a minimal, ordinal-only globalization implementation:
-
No locale-aware sorting
-
No Unicode normalization beyond basic case folding
-
Simplified regex character classes
-
Faster but incorrect for many internationalization scenarios
PhoenixmlDb does not work correctly in invariant mode. The XPath/XQuery functions that depend on Unicode behavior will produce wrong results or throw exceptions.
#ICU Requirement
PhoenixmlDb requires ICU-based globalization. This is configured in Directory.Build.props:
<PropertyGroup>
<InvariantGlobalization>false</InvariantGlobalization>
</PropertyGroup>
Critical: The environment variable DOTNET_SYSTEM_GLOBALIZATION_INVARIANT takes precedence over the project setting. If this variable is set to 1 in your environment (common in Docker images), PhoenixmlDb will silently produce incorrect results.
#What Breaks Without ICU
|
Function |
With ICU |
Without ICU |
|---|---|---|
|
|
Correctly normalized string |
May throw or return incorrect result |
|
|
Language-correct comparison |
Ordinal comparison (wrong for German) |
|
|
|
|
|
|
|
May not recognize |
|
|
Locale-aware key |
Throws or returns ordinal key |
|
|
System locale |
May return empty or "iv" |
#Checking Your Configuration
In your .NET application:
// Check if ICU is available
Console.WriteLine($"Globalization mode: {System.Globalization.CultureInfo.CurrentCulture.Name}");
Console.WriteLine($"ICU version: {System.Globalization.CultureInfo.CurrentCulture.CompareInfo}");
// This will throw in invariant mode:
try
{
"café".Normalize(System.Text.NormalizationForm.FormD);
Console.WriteLine("ICU: Available");
}
catch (PlatformNotSupportedException)
{
Console.WriteLine("ICU: NOT available — invariant mode!");
}#Platform Setup
#Windows
ICU is included with Windows 10+ and .NET 5+. No additional setup needed.
<!-- Directory.Build.props — this is all you need -->
<InvariantGlobalization>false</InvariantGlobalization>#Linux
Most Linux distributions include ICU. If you're using a minimal container image, you may need to install it:
Debian/Ubuntu:
apt-get install -y libicu-devAlpine Linux:
apk add icu-libs
Important for Docker: Many minimal .NET Docker images (especially Alpine-based) set DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=1 by default. You must either:
-
Unset the variable:
ENV DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false -
Or install ICU and unset:
FROM mcr.microsoft.com/dotnet/runtime:10.0-alpine RUN apk add --no-cache icu-libs ENV DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false -
Or use the non-Alpine images (which include ICU by default):
FROM mcr.microsoft.com/dotnet/runtime:10.0 # ICU included, no extra setup needed
#macOS
ICU is included with macOS. No additional setup needed.
#CI/CD Pipelines
If your CI runs on minimal containers, ensure ICU is available:
GitHub Actions:
- name: Install ICU (if needed)
run: |
if [ -f /etc/alpine-release ]; then
apk add --no-cache icu-libs
fi
env:
DOTNET_SYSTEM_GLOBALIZATION_INVARIANT: false#Unicode Normalization
#What It Is
The same character can be represented multiple ways in Unicode:
-
éas a single codepoint: U+00E9 (precomposed) -
éas two codepoints: U+0065 (e) + U+0301 (combining accent) (decomposed)
Both look identical on screen, but they're different byte sequences. If you compare them without normalization, they won't match.
#Normalization Forms
|
Form |
Name |
Use |
|---|---|---|
|
NFC |
Canonical Composition |
Default for most use. Precomposes characters. |
|
NFD |
Canonical Decomposition |
Decomposes characters. Useful for stripping accents. |
|
NFKC |
Compatibility Composition |
NFC + resolves compatibility characters (e.g., fi → fi) |
|
NFKD |
Compatibility Decomposition |
NFD + resolves compatibility characters |
#In XPath
normalize-unicode("café") (: NFC — default :)
normalize-unicode("café", "NFD") (: decomposed form :)
normalize-unicode("file", "NFKC") (: "file" — compatibility normalization :)#In .NET
"café".Normalize(NormalizationForm.FormC); // NFC
"café".Normalize(NormalizationForm.FormD); // NFD#When to Normalize
-
Comparing strings from different sources (user input vs database)
-
Indexing text for search
-
Hashing strings (different normalizations produce different hashes)
-
Storing text in a database (choose one form and stick with it — NFC is standard)
#Collation
Collation determines how strings are compared and sorted. It's language-dependent:
|
Language |
Sort Order |
Why |
|---|---|---|
|
English |
a, b, c, ... z |
Alphabetical |
|
German |
ä sorts with a |
ä is a variant of a |
|
Swedish |
ä sorts after z |
ä is a separate letter |
|
Spanish |
ñ sorts between n and o |
ñ is a separate letter |
#In XPath/XSLT
(: Default collation — usually Unicode Collation Algorithm :)
sort(("ä", "z", "a"))
(: Depends on collation — could be "a, ä, z" or "a, z, ä" :)
(: Explicit German collation :)
sort(("ä", "z", "a"), "http://www.w3.org/2013/collation/UCA?lang=de")
(: "a, ä, z" — German rules :)
(: Explicit Swedish collation :)
sort(("ä", "z", "a"), "http://www.w3.org/2013/collation/UCA?lang=sv")
(: "a, z, ä" — Swedish rules :)In XSLT sorting:
<xsl:sort select="name" collation="http://www.w3.org/2013/collation/UCA?lang=de"/>#In .NET
// Default comparison (culture-dependent)
string.Compare("ä", "z", StringComparison.CurrentCulture);
// German comparison
var german = new CultureInfo("de-DE");
string.Compare("ä", "z", false, german); // ä before z
// Swedish comparison
var swedish = new CultureInfo("sv-SE");
string.Compare("ä", "z", false, swedish); // ä after z#Collation in PhoenixmlDb
PhoenixmlDb supports the W3C Unicode Collation Algorithm (UCA) with language parameters:
http://www.w3.org/2013/collation/UCA?lang=en
http://www.w3.org/2013/collation/UCA?lang=de&strength=secondaryStrength levels: | Level | Ignores | Example | |-------|---------|---------| | Primary | Case + accents | a = á = A | | Secondary | Case only | a = A ≠ á | | Tertiary (default) | Nothing | a ≠ A ≠ á |
#Regular Expressions
XPath/XQuery regex uses Unicode character classes:
matches("café123", "\p{L}+") (: matches "café" — Unicode letters :)
matches("café123", "\p{N}+") (: matches "123" — Unicode numbers :)
matches("café", "\p{Ll}+") (: matches "caf" — lowercase letters :)
matches("Ω", "\p{Lu}") (: true — uppercase Greek letter :)
In invariant mode, \p{L} may not recognize all Unicode letters — it might miss accented characters, non-Latin scripts, or characters added in recent Unicode versions.
#XPath vs .NET Regex Differences
|
Feature |
XPath Regex |
.NET Regex |
|---|---|---|
|
Syntax base |
XML Schema regex |
Perl-compatible |
|
Anchoring |
Implicit full match |
Partial match unless |
|
Character classes |
|
Same (with ICU) |
|
Backreferences in match |
Not supported |
Supported |
|
Backreferences in replace |
|
|
|
Named groups |
Not supported |
|
|
Lookahead/lookbehind |
Not supported |
Supported |
|
Flags |
|
|
The most important difference: XPath matches() tests the entire string by default. matches("hello world", "hello") is false because "hello" doesn't match the whole string. You need matches("hello world", ".*hello.*") or contains() instead.
#Common Issues and Solutions
#Issue: normalize-unicode() Throws PlatformNotSupportedException
Cause: Invariant globalization mode is active.
Fix:
# Check for the environment variable
echo $DOTNET_SYSTEM_GLOBALIZATION_INVARIANT
# Unset it
unset DOTNET_SYSTEM_GLOBALIZATION_INVARIANT
# Or in .env / Docker:
DOTNET_SYSTEM_GLOBALIZATION_INVARIANT=false#Issue: Sorting Produces Wrong Order for Non-English Text
Cause: Default collation is ordinal (invariant mode) or wrong locale.
Fix: Specify the collation explicitly:
<xsl:sort select="name" collation="http://www.w3.org/2013/collation/UCA?lang=de"/>#Issue: String Comparison Fails for Accented Characters
Cause: Comparing strings with different Unicode normalization forms.
Fix: Normalize before comparing:
normalize-unicode($a, "NFC") = normalize-unicode($b, "NFC")#Issue: Docker Container Silently Breaks Unicode
Cause: Alpine-based images default to invariant mode.
Fix: Use the standard (non-Alpine) images, or install ICU:
FROM mcr.microsoft.com/dotnet/aspnet:10.0
# ICU included — no DOTNET_SYSTEM_GLOBALIZATION_INVARIANT needed#Issue: Tests Pass Locally But Fail in CI
Cause: CI environment has different globalization settings.
Fix: Add to your test project:
<PropertyGroup>
<InvariantGlobalization>false</InvariantGlobalization>
</PropertyGroup>And ensure CI has ICU installed.
#Verifying Your Environment
Run this quick check to confirm everything is configured correctly:
using System.Globalization;
using System.Text;
// 1. Check ICU availability
Console.WriteLine($"Culture: {CultureInfo.CurrentCulture.Name}");
// 2. Check normalization
var nfc = "café".Normalize(NormalizationForm.FormC);
var nfd = "café".Normalize(NormalizationForm.FormD);
Console.WriteLine($"NFC length: {nfc.Length}, NFD length: {nfd.Length}");
Console.WriteLine($"NFC == NFD (ordinal): {nfc == nfd}");
Console.WriteLine($"NFC == NFD (normalized): {string.Compare(nfc, nfd, CultureInfo.InvariantCulture, CompareOptions.None) == 0}");
// 3. Check collation
var german = new CultureInfo("de-DE");
var result = string.Compare("ä", "b", false, german);
Console.WriteLine($"German: ä vs b = {result} (should be < 0)");If any of these fail or throw, your environment needs ICU configuration.