Why Your Fancy Unicode Text Reverts to Plain Letters
The bug report that made no sense
A support ticket landed on my desk a few years back with a screenshot attached: a user's display name showed as 𝐉𝐚𝐦𝐢𝐞 in the signup preview, but after they hit save and reloaded, it read plain Jamie. No error. No truncation. The bold letters had simply... un-bolded themselves. The user was convinced our app had a rendering bug. The frontend team swore the field was untouched. I spent a frustrating hour staring at the database column before I realized the truth: nothing was broken. Our backend was doing exactly what the Unicode standard told it to do, and the user's "font" had never been a font at all.
If you have ever pasted a stylish name out of a fancy text generator and watched it collapse into boring ASCII the moment a server touched it, this is the mechanism behind it. The styling was never decoration layered on top of letters. It was the letters — different codepoints entirely — and a normalization pass folded them back home.
Those "fonts" are real codepoints, not styling
Start with one concrete character. The bold A you copy out of a generator is not the letter A wearing a bold attribute. It is U+1D400 MATHEMATICAL BOLD CAPITAL A, a distinct character living up in the Supplementary Multilingual Plane (Plane 1, codepoints from U+10000 to U+1FFFF). Unicode is carved into 17 planes of 65,536 codepoints each. Plane 0, the Basic Multilingual Plane, holds everyday Latin, Cyrillic, CJK, and the like. Plane 1 holds emoji, ancient scripts, musical notation — and the Mathematical Alphanumeric Symbols block (U+1D400–U+1D7FF) that powers almost every "Instagram bold/italic/script font" you have ever seen.
So when a tool gives you 𝐀, it has substituted a mathematical symbol that happens to look like a bold A. Your browser's font renderer draws it. But to a database, a search index, or a username validator, it is a completely different character with a completely different numeric identity.
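To make that "different numeric identity" concrete, here is a minimal sketch that prints the codepoint, plane, and block membership of each character in a string. The helper name describeCodePoints is made up for illustration; the block range is the one quoted above.

```javascript
// Hypothetical helper: report each character's codepoint, its plane, and
// whether it falls in Mathematical Alphanumeric Symbols (U+1D400–U+1D7FF).
function describeCodePoints(str) {
  // Array.from iterates a string by codepoint, not by UTF-16 code unit.
  return Array.from(str, (ch) => {
    const cp = ch.codePointAt(0);
    return {
      char: ch,
      hex: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      plane: cp >> 16, // each plane spans 0x10000 codepoints
      mathAlnum: cp >= 0x1d400 && cp <= 0x1d7ff,
    };
  });
}

// Plain A followed by MATHEMATICAL BOLD CAPITAL A.
console.log(describeCodePoints("A\u{1D400}"));
```

The plain A reports plane 0 and the bold A reports plane 1, which is the whole difference a validator or index sees.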
Here is the part that bites. Unicode defines a property called compatibility equivalence. U+1D400 is declared compatibility-equivalent to plain U+0041 (A). The standard is explicit that these characters carry the same underlying meaning and differ only in presentation. Any system that performs NFKC normalization — Normalization Form KC, Compatibility Composition — will decompose that mathematical bold A back to its canonical letter:
const fancy = "𝐉𝐚𝐦𝐢𝐞";
console.log(fancy); // 𝐉𝐚𝐦𝐢𝐞
console.log(fancy.normalize("NFKC")); // Jamie
console.log("𝐀".normalize("NFKC") === "A"); // true
That last line is the whole ticket. The K in NFKC stands for compatibility; the spec uses K because C was already taken by composition. NFKC does not strip styling, because there is no styling to strip. It maps each compatibility character to the character it is equivalent to. The visual "boldness" is an artifact of the glyph the font happened to draw for U+1D400; once the codepoint changes to U+0041, the boldness has nowhere to live.
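It helps to see that only the compatibility forms do this. The sketch below runs the same bold A through all four normalization forms; the variable names are illustrative only.

```javascript
const boldA = "\u{1D400}"; // MATHEMATICAL BOLD CAPITAL A

// Record, per normalization form, whether the bold A folds to plain "A".
const foldsToPlain = {};
for (const form of ["NFC", "NFD", "NFKC", "NFKD"]) {
  foldsToPlain[form] = boldA.normalize(form) === "A";
}

console.log(foldsToPlain);
// { NFC: false, NFD: false, NFKC: true, NFKD: true }
```

A pipeline that only applies NFC will leave fancy text untouched, which is one reason the same name can behave differently across systems.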
Why backends run NFKC at all
This is not a quirk of one ORM. NFKC is a common normalization choice for identifiers. Unicode's identifier annex (UAX #31) describes NFKC-based folding for case-insensitive identifiers, and the mapping steps used for internationalized domain names also apply compatibility folding, so that two strings that look the same cannot resolve to different accounts. PostgreSQL (13 and later) exposes normalize(text, NFKC), Python ships unicodedata.normalize, and many username pipelines call one of them before uniqueness checks.
The security reason is sound. If admin and 𝐚𝐝𝐦𝐢𝐧 were allowed to coexist, an attacker could register a confusable impersonator. So login forms, search boxes, and @mention resolvers flatten compatibility characters on purpose. Your fancy name survives in chat messages and bios because those are often stored verbatim — but the instant it passes through a field that demands a canonical identity, it normalizes. That asymmetry is exactly why the same string "works" in one box and "reverts" in another within the same product.
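A minimal sketch of that identity check, assuming a Set as a stand-in for a database uniqueness constraint. The function names are hypothetical, and the lowercasing step is an extra assumption that the text above does not require.

```javascript
// Hypothetical canonicalization for identifier fields.
// Assumption: identifiers are also case-insensitive, hence toLowerCase().
function canonicalIdentity(name) {
  return name.normalize("NFKC").toLowerCase();
}

// Stand-in for a uniqueness index; a real system would use a DB constraint.
const taken = new Set();

function register(name) {
  const key = canonicalIdentity(name);
  if (taken.has(key)) return { ok: false, key };
  taken.add(key);
  return { ok: true, key };
}

console.log(register("admin"));
// { ok: true, key: 'admin' }

// The same word spelled with mathematical bold small letters.
console.log(register("\u{1D41A}\u{1D41D}\u{1D426}\u{1D422}\u{1D427}"));
// { ok: false, key: 'admin' }
```

A message body or bio would skip canonicalIdentity entirely and store the input as typed, which is the asymmetry described above.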
The length gotcha that crashes the validator
Now the second trap, and the one that has produced more 3 a.m. pages than the normalization itself. Plane 1 characters do not fit in a single UTF-16 code unit. JavaScript strings are UTF-16, so any codepoint at or above U+10000 is stored as a surrogate pair — two code units, a high surrogate (U+D800–U+DBFF) and a low surrogate (U+DC00–U+DFFF). That means .length lies to you:
"𝐀".length // 2 — it's a surrogate pair under the hood
[..."𝐀"].length // 1 — spread iterates by codepoint
"𝐀".codePointAt(0) // 119808 (0x1D400)
"𝐀".charCodeAt(0) // 55349 (the high surrogate alone)
I watched a "maximum 20 characters" username rule reject a perfectly reasonable eleven-letter fancy name because every glyph counted as two. Worse, a naive slice(0, 20) once chopped a string right between a high and low surrogate, persisting a lone half-character. An orphaned surrogate cannot be encoded as well-formed UTF-8, and the downstream service that tried to re-encode it threw on insert. The fix: never use str.length for human-facing counts. Use [...str].length, which spreads the string by its iterator (codepoint by codepoint), or Intl.Segmenter if you care about grapheme clusters like flag emoji. If you are hashing such a string for a dedupe key, normalize first and only then feed it to a hash generator; otherwise 𝐉𝐚𝐦𝐢𝐞 and Jamie produce two different digests for what your backend will treat as one identity.
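The counting and slicing advice can be sketched as three small helpers. The names are made up for illustration, and graphemeLength assumes a runtime that ships Intl.Segmenter.

```javascript
// Count by codepoint rather than by UTF-16 code unit.
function codePointLength(str) {
  return [...str].length;
}

// Keep at most `max` codepoints; a surrogate pair is never split.
function truncateCodePoints(str, max) {
  return [...str].slice(0, max).join("");
}

// Count grapheme clusters (assumes Intl.Segmenter is available).
function graphemeLength(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

// "Jamie" in mathematical bold letters.
const displayName = "\u{1D409}\u{1D41A}\u{1D426}\u{1D422}\u{1D41E}";
console.log(displayName.length);           // 10
console.log(codePointLength(displayName)); // 5

// A naive cut at an odd code-unit index ends on a lone high surrogate.
const naiveCut = displayName.slice(0, 3);
const lastUnit = naiveCut.charCodeAt(naiveCut.length - 1);
console.log(lastUnit >= 0xd800 && lastUnit <= 0xdbff); // true

// The codepoint-aware cut keeps three whole characters (six code units).
console.log(truncateCodePoints(displayName, 3).length); // 6

// Two regional-indicator codepoints form one flag grapheme.
console.log(graphemeLength("\u{1F1EF}\u{1F1F5}")); // 1
```

Whichever unit you pick, apply the same one in the frontend validator and the backend constraint, or the two will disagree about the same name.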
The tofu exception: the hole where italic h should be
There is one more layer, and it is my favorite because it exposes lazy generators. The Mathematical Alphanumeric Symbols block is not a clean, contiguous grid. When Unicode built it, twenty-four letterlike symbols had already been encoded years earlier in the Basic Multilingual Plane. Rather than duplicate them, the standard left holes in Plane 1 at exactly those slots and pointed elsewhere.
The famous one is mathematical italic small h. By the block's regular base + offset pattern it should sit at U+1D455. But that codepoint is unassigned — a reserved hole — because the character already existed as U+210E PLANCK CONSTANT, sitting in the Letterlike Symbols block (U+2100–U+214F) back in the BMP. A correct generator emits U+210E for italic h. A lazy one that just does codepoint = 0x1D44E + (letter - 'a') emits U+1D455, which no font can render, so you get tofu — the .notdef box □. You can see both behaviors directly:
const ITALIC_BASE = 0x1D44E; // mathematical italic small 'a'
function naiveItalic(c) {
  return String.fromCodePoint(ITALIC_BASE + (c.charCodeAt(0) - 97));
}
console.log(naiveItalic('g')); // 𝑔 fine, real codepoint
console.log(naiveItalic('h')); // U+1D455 — unassigned, renders as tofu
console.log("ℎ"); // ℎ the actual Planck-constant glyph
console.log(naiveItalic('h').normalize("NFKC")); // — stays broken, no mapping
console.log("ℎ".normalize("NFKC")); // h — folds correctly
Note the asymmetry in those last two lines. The real Planck-constant character NFKC-folds to a clean h, so it round-trips through a backend the way every other letter does. The tofu codepoint at U+1D455 has no compatibility mapping at all — it is unassigned — so it neither renders nor normalizes. Input h through a careless generator and your output is a permanent broken box that no amount of normalization will rescue.
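One way a generator can avoid the hole is an exception table consulted before the base-plus-offset arithmetic. This is a minimal sketch covering only lowercase italic, with hypothetical names; a full generator would need a table per style.

```javascript
const ITALIC_SMALL_A = 0x1d44e; // MATHEMATICAL ITALIC SMALL A

// Slots left unassigned in Plane 1 because the character already lived in
// the BMP. Lowercase italic has exactly one: h, encoded as PLANCK CONSTANT.
const ITALIC_SMALL_EXCEPTIONS = { h: 0x210e };

function italicSmall(c) {
  // Pass through anything that is not a single ASCII lowercase letter.
  if (c.length !== 1 || c < "a" || c > "z") return c;
  const cp =
    ITALIC_SMALL_EXCEPTIONS[c] ?? ITALIC_SMALL_A + (c.charCodeAt(0) - 97);
  return String.fromCodePoint(cp);
}

const styled = Array.from("hash", italicSmall).join("");
console.log(styled);                   // both h's come from U+210E
console.log(styled.normalize("NFKC")); // hash
```

Because every emitted codepoint is assigned and carries a compatibility mapping, the output both renders and folds back cleanly.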
Across the italic, script, fraktur, and double-struck styles, the same kind of pre-existing letters (think script capital B at U+212C, double-struck capital R at U+211D, or the blackletter capitals in the Letterlike Symbols block) leave their own holes. Twenty-four reserved gaps, twenty-four chances for a base-plus-offset loop to spit out tofu. When I evaluate whether a generator is built correctly, I do not check whether bold A looks bold; anyone can get that right. I type a name containing a lowercase italic h and a script capital B, and I watch for boxes.
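Watching for boxes can be automated with a round-trip check: styled output is healthy only if NFKC folds it back to the plain input. The sketch below is one possible harness, with hypothetical helper names, run against a deliberately naive generator.

```javascript
// A styled string is healthy if NFKC folds it back to the plain input.
// Unassigned codepoints have no mapping, so they fail this check.
function foldsBackTo(plain, styled) {
  return styled.normalize("NFKC") === plain;
}

// Deliberately naive base-plus-offset styler, for demonstration only.
function naiveStyler(base, firstLetter) {
  const first = firstLetter.charCodeAt(0);
  return (c) => String.fromCodePoint(base + (c.charCodeAt(0) - first));
}

const naiveItalicSmall = naiveStyler(0x1d44e, "a");   // italic small a
const naiveScriptCapital = naiveStyler(0x1d49c, "A"); // script capital A

console.log(foldsBackTo("g", naiveItalicSmall("g")));   // true
console.log(foldsBackTo("h", naiveItalicSmall("h")));   // false, U+1D455 is a hole
console.log(foldsBackTo("B", naiveScriptCapital("B"))); // false, U+1D49D is a hole
console.log(foldsBackTo("B", "\u212C"));                // true, SCRIPT CAPITAL B
```

Running every letter of every style through foldsBackTo flags each hole without hard-coding the list of reserved slots.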
What to actually do with this
Three rules survive every variation of this problem. First, treat fancy text as decorative and ephemeral: it is safe in a message body, doomed in an identifier field, and you cannot fight the NFKC fold, because it is the spec working as designed. Second, never measure or slice these strings with .length; iterate by codepoint with the spread operator so surrogate pairs count as one and never get cut in half. Third, if you build or pick a generator, judge it by its handling of the Plane 1 holes and their BMP stand-ins, not its happy path.
The next time a bold username quietly turns plain on save, you will know it is not a bug. It is U+1D400 going home to U+0041, exactly as Unicode promised it would.
Tools used in this guide
- Fancy Text Generator — Type normal text and copy Unicode style variants for profiles, bios, headings, notes, and quick social posts.
- SHA-256 Hash Generator — Paste text and generate a SHA-256 digest locally for checksums, examples, cache keys, and debugging.