The Encoding Bugs Behind Mojibake, Emoji Length, and Broken Hashes

By Alpha Loop · Published June 12, 2026 · Updated June 20, 2026 · 8 min read

Three bugs that look unrelated until you weigh them in bytes

A few months ago I lost an afternoon to a deduplication job. The pipeline was supposed to skip files it had already ingested by comparing a SHA-256 of the filename plus a few metadata fields. It worked perfectly in tests and then, in production, started re-ingesting the same files over and over. The string was identical on screen. The digest was not. That bug, it turned out, was the same bug as two others I'd written off as cosmetic months earlier — café showing up as cafÃ© in a CSV export, and a length check that insisted a single emoji was two characters long.

These three symptoms have one root, and once you see it you stop treating encoding problems as a grab-bag of unrelated curiosities. Here they are:

Mojibake. You type café, it round-trips through some boundary, and comes back as cafÃ©.
Emoji length lies. '😀'.length returns 2, even though you see one glyph.
Hash divergence. Two systems run SHA-256 over what looks like the same text and get two different digests.

The single cause: a string is not one thing. There are three distinct layers — bytes (UTF-8, what files and networks carry), code units (UTF-16, what a JavaScript string actually stores), and code points (the abstract Unicode scalar values you think in). Every one of these bugs is a place where code written for one layer collides with data living in another. Let me walk each symptom back to the layer that betrayed it.

Symptom 1: mojibake is a decoding disagreement

café in UTF-8 is five bytes: 63 61 66 C3 A9. The c, a, f are plain ASCII (one byte each), and é (U+00E9) is two bytes, C3 A9. That two-byte rule for accented Latin characters is the whole story here.

Now suppose something downstream — an old CSV reader, a misconfigured database connection, a Content-Type header that forgot its charset — decides those bytes are Latin-1 (ISO-8859-1) instead of UTF-8. Latin-1 is a one-byte-per-character encoding, so it reads C3 A9 as two characters: C3 is Ã, A9 is ©. The result is cafÃ©. I confirmed this exactly:

const bytes = new TextEncoder().encode("café"); // 63 61 66 c3 a9
const wrong = Buffer.from(bytes).toString("latin1");
console.log(wrong); // "cafÃ©"

Input café → output cafÃ©. No data was lost or corrupted in transit; every byte arrived intact. The bug is purely interpretive: the producer wrote UTF-8, the consumer read Latin-1. Mojibake is never random garbage — it's a deterministic mismatch, which is why the Ã© pattern is so recognizable. Whenever you see a stray Ã in front of a punctuation-looking character, you are looking at a UTF-8 multibyte sequence being read one byte at a time. The fix is never to "clean" the string after the fact; it's to make both ends agree on the encoding before a single byte moves.
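One way to make a disagreement like this fail loudly instead of silently producing cafÃ© is to decode with a strict UTF-8 decoder. The sketch below is only an illustration: decodeUtf8Strict is a hypothetical helper name, and it assumes a runtime where TextDecoder supports the fatal option (current browsers and standard Node.js builds do).

```javascript
// Hypothetical helper (not from the article): decode as UTF-8 and throw on
// malformed input instead of silently substituting characters.
function decodeUtf8Strict(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

const utf8Bytes = new TextEncoder().encode("caf\u00e9"); // 63 61 66 c3 a9
const latin1Bytes = Buffer.from("caf\u00e9", "latin1");  // 63 61 66 e9

console.log(decodeUtf8Strict(utf8Bytes) === "caf\u00e9"); // true

try {
  decodeUtf8Strict(latin1Bytes);
} catch (err) {
  // A lone 0xE9 is not a complete UTF-8 sequence, so the strict decoder rejects it.
  console.log("rejected:", err.name);
}
```

A strict decoder only catches one direction of the mismatch. Latin-1 accepts every byte value, so UTF-8 bytes read as Latin-1 never raise an error, which is why the encoding still has to be agreed on up front.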

Symptom 2: emoji length lies because JS counts UTF-16 code units

JavaScript strings are sequences of UTF-16 code units, not code points and not bytes. For characters in the Basic Multilingual Plane (everything below U+10000), one code unit equals one code point and .length matches your intuition. Most emoji live above that ceiling.

😀 is U+1F600. UTF-16 can't fit it in a single 16-bit unit, so it encodes it as a surrogate pair: two code units, D83D DE00. That's why:

"😀".length;          // 2  — counts UTF-16 code units
[..."😀"].length;     // 1  — the iterator yields code points
"😀".codePointAt(0);  // 128512 (0x1F600)

The spread operator and for...of use the string iterator, which is code-point aware, so they recover the human answer of 1. Plain .length, .charAt(), and index access (str[0]) are code-unit operations and will happily hand you one half of a surrogate pair: a lone, meaningless \uD83D. This is the same trap behind a fancy bold or italic letter from a Unicode text generator reporting a length of 2: most of those styled glyphs are also astral-plane code points stored as surrogate pairs. If you truncate a string with slice and the cut lands between the two halves of a surrogate pair, you split a character in half, and the orphaned surrogate typically renders as the replacement glyph �. The lesson isn't "emoji are special"; it's that .length answers a question (how many code units?) that is rarely the one you're asking (how many characters?).
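The different ways of counting can be put side by side. This is a minimal sketch with hypothetical helper names; the grapheme count assumes a runtime that ships Intl.Segmenter (recent browsers and Node.js releases), and it goes one step beyond code points to approximate what a reader would call a character.

```javascript
// Hypothetical helpers for illustration; none of these names come from the article.
const countCodeUnits = (s) => s.length;
const countCodePoints = (s) => [...s].length;
const countGraphemes = (s) => {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(s)].length;
};
// Truncate on code-point boundaries so a surrogate pair is never split.
const truncateCodePoints = (s, max) => [...s].slice(0, max).join("");

const grin = "\u{1F600}";
console.log(countCodeUnits(grin), countCodePoints(grin), countGraphemes(grin)); // 2 1 1

const decomposed = "e\u0301"; // e + combining acute accent
console.log(countCodePoints(decomposed), countGraphemes(decomposed)); // 2 1

const text = "a\u{1F600}b";
console.log(text.slice(0, 2) === "a\uD83D");               // true: slice cut the pair in half
console.log(truncateCodePoints(text, 2) === "a\u{1F600}"); // true: the emoji survives intact
```

Truncating by code point avoids orphaned surrogates, but it can still separate a base letter from its combining accent; segmenting by grapheme is the safer choice when the limit is meant in visible characters.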

Symptom 3: hashes diverge because the digest is over bytes, not text

This is the angle that ties the whole thing together, and it's where my deduplication bug lived. A cryptographic hash function does not hash "a string." It hashes a sequence of bytes. There is no text in SHA-256 — only an octet stream. So before any hashing happens, your string has to be serialized into bytes, and the encoding you pick for that step is part of the input. Change the encoding, change the digest. Same characters, different bytes, different hash.

Here is the proof, with the actual digests I generated:

const { createHash } = require("crypto");
const sha = (buf) => createHash("sha256").update(buf).digest("hex");

sha(Buffer.from("café", "utf8"));
// 850f7dc43910ff890f8879c0ed26fe697c93a067ad93a7d50f466a7028a9bf4e

sha(Buffer.from("café", "utf16le"));
// 8c9f3eed8d0b4c75bdde53bf22d847cb5a1b1318e9d5ce0186142c5602ca9baa

Same four characters on screen. Two completely different fingerprints, because UTF-8 serializes café as 63 61 66 C3 A9 (5 bytes) while UTF-16LE serializes it as 63 00 61 00 66 00 E9 00 (8 bytes). If your client hashes in UTF-8 and your server hashes a UTF-16 string, your signature checks will fail forever and you'll blame the algorithm. The algorithm is fine; the byte streams were never the same.
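The byte sequences quoted above are easy to check directly. A small sketch, assuming Node.js; toHex is a hypothetical helper, and the é is written as an escape so the example does not depend on how the source file itself was saved.

```javascript
// Hypothetical helper: render a buffer as space-separated hex bytes.
const toHex = (buf) => [...buf].map((b) => b.toString(16).padStart(2, "0")).join(" ");

const word = "caf\u00e9"; // café with a precomposed é

console.log(toHex(Buffer.from(word, "utf8")));    // 63 61 66 c3 a9
console.log(toHex(Buffer.from(word, "utf16le"))); // 63 00 61 00 66 00 e9 00
```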

It gets subtler. Two strings can look identical to a human, and be canonically equivalent in Unicode terms, yet still produce different bytes because of Unicode normalization. The character é can be stored two ways: as a single precomposed code point U+00E9 (the NFC form), or as a base e followed by a combining acute accent U+0301 (the NFD form). They render identically. They are not the same bytes:

const enc = (s) => Buffer.from(s, "utf8");
sha(enc("café".normalize("NFC")));
// 850f7dc43910ff890f8879c0ed26fe697c93a067ad93a7d50f466a7028a9bf4e
sha(enc("café".normalize("NFD")));
// 81ef060bcd98adc7824eb5c1ada83c32491b16018e11e79f00ab9d09e04b015a

NFC gives é as C3 A9; NFD gives 65 CC 81 — the letter e followed by the two-byte combining accent. Different byte count, different digest. macOS historically stored filenames in a decomposed form while most of the web emits composed form, which is exactly the kind of cross-platform mismatch that bites a deduplication job comparing filenames from two sources. That was my afternoon: one source handed me NFC, another handed me NFD, the strings printed identically in every log, and the hashes refused to match.
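A deduplication key that survives this mismatch has to pick one normalization form before serializing. The following is a sketch under that assumption, not the pipeline described above: canonicalSha256 is a hypothetical name, and NFC plus UTF-8 is simply the combination this article recommends.

```javascript
const { createHash } = require("crypto");

// Hypothetical helper: normalize to NFC, serialize as UTF-8, then hash the bytes.
function canonicalSha256(text) {
  const bytes = Buffer.from(text.normalize("NFC"), "utf8");
  return createHash("sha256").update(bytes).digest("hex");
}

const composed = "caf\u00e9";    // precomposed é (NFC form)
const decomposed = "cafe\u0301"; // e + combining acute accent (NFD form)

console.log(composed === decomposed);                                   // false
console.log(canonicalSha256(composed) === canonicalSha256(decomposed)); // true
```

Both sides of a comparison need to apply the same step; normalizing only one source leaves the mismatch in place.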

And then there's the byte-order mark. A UTF-8 BOM is the three bytes EF BB BF prepended to the text. It's invisible in most editors and contributes nothing to the meaning, but it's still bytes, so the hash sees it:

const withBom = Buffer.concat([Buffer.from([0xEF,0xBB,0xBF]), Buffer.from("café","utf8")]);
sha(withBom);
// 8236b2d43f17df6d2b0756436e79b8f602f7b1126b6d3a9ab23f991ca0ae76c0
sha(Buffer.from("café","utf8"));
// 850f7dc43910ff890f8879c0ed26fe697c93a067ad93a7d50f466a7028a9bf4e

One file saved "UTF-8" and another saved "UTF-8 with BOM" in Notepad or Excel will hash differently despite having identical visible content. If you've ever had a checksum mismatch on a file that "looks the same," check the first three bytes before you check anything else. You can watch this happen live by pasting a string into a SHA-256 hash generator and toggling a leading space or a BOM — the digest changes completely, which is the avalanche property doing its job over a byte stream you didn't think mattered.
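If a BOM should not count as part of the content, one option is to drop it before hashing. A minimal sketch, assuming Node.js buffers; stripUtf8Bom is a hypothetical helper, and whether stripping is appropriate depends on what the checksum is meant to certify (the exact file, or the text inside it).

```javascript
// Hypothetical helper: return the buffer without a leading UTF-8 BOM (EF BB BF).
function stripUtf8Bom(buf) {
  const hasBom =
    buf.length >= 3 && buf[0] === 0xef && buf[1] === 0xbb && buf[2] === 0xbf;
  return hasBom ? buf.subarray(3) : buf;
}

const plain = Buffer.from("caf\u00e9", "utf8");
const withBom = Buffer.concat([Buffer.from([0xef, 0xbb, 0xbf]), plain]);

console.log(withBom.equals(plain));               // false: three extra bytes
console.log(stripUtf8Bom(withBom).equals(plain)); // true
console.log(stripUtf8Bom(plain).equals(plain));   // true: nothing to strip
```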

In the browser: where the bytes actually come from

In a browser, crypto.subtle.digest makes the bytes-not-text reality explicit, because it refuses to take a string at all. You must hand it an ArrayBuffer or a view over one (a typed array such as Uint8Array, or a DataView), and the standard way to produce one from text is TextEncoder, which always emits UTF-8:

async function sha256(text) {
  const bytes = new TextEncoder().encode(text); // UTF-8 bytes, always
  const digest = await crypto.subtle.digest("SHA-256", bytes);
  return [...new Uint8Array(digest)]
    .map(b => b.toString(16).padStart(2, "0")).join("");
}

await sha256("café");
// 850f7dc43910ff890f8879c0ed26fe697c93a067ad93a7d50f466a7028a9bf4e

Notice that this matches the Node UTF-8 digest above to the character — because both went through the same UTF-8 serialization. The Web Crypto API forcing you through TextEncoder is a feature: it removes the "which encoding?" ambiguity at the point where it matters most. The catch is that it locks you into UTF-8, so if you ever need to match a hash produced by a system that serialized in UTF-16, you have to build that byte buffer yourself rather than reaching for TextEncoder.
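Building the UTF-16 byte buffer by hand is short, because a JavaScript string already exposes its UTF-16 code units. This is a sketch with a hypothetical helper name, assuming the other system used little-endian UTF-16 with no BOM; a big-endian or BOM-prefixed producer would need the corresponding change.

```javascript
// Hypothetical helper: serialize a string as UTF-16LE bytes without TextEncoder.
function encodeUtf16le(text) {
  const bytes = new Uint8Array(text.length * 2); // two bytes per code unit
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < text.length; i++) {
    // charCodeAt returns the UTF-16 code unit; `true` selects little-endian order.
    view.setUint16(i * 2, text.charCodeAt(i), true);
  }
  return bytes;
}

console.log(Array.from(encodeUtf16le("caf\u00e9"), (b) => b.toString(16).padStart(2, "0")).join(" "));
// 63 00 61 00 66 00 e9 00
```

The resulting Uint8Array can be passed to crypto.subtle.digest in place of the TextEncoder output in the sha256 function above.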

The one rule that dissolves all three

Stop asking "how long is this string" or "what is the hash of this string" as if a string were a single object. It isn't. It's bytes when it's in a file or on the wire, code units while it sits in a JS variable, and code points when you reason about it as text. Three of the nastiest, most time-wasting bugs in everyday development — mojibake, off-by-emoji length, and irreproducible hashes — are all the same mistake: code operating at one layer against data defined at another.

Concretely: decide your encoding (UTF-8, almost always) and your normalization form (NFC, almost always) explicitly and as early in the pipeline as you can. Normalize before you hash, before you index, before you compare. Count with the string iterator, not .length, when code points are what you mean, and remember that a single visible glyph (an e plus a combining accent, for instance) can still span several code points. And when a checksum mismatches on text that looks identical, don't debug the algorithm; dump the raw bytes of both inputs and find the BOM, the stray combining accent, or the UTF-16/UTF-8 split. The bytes never lie. They were just answering a question you didn't know you'd asked.
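When two inputs look identical but hash differently, locating the first differing byte usually names the culprit. A rough sketch, assuming Node.js and that both values are already in memory as strings; firstByteDifference is a hypothetical helper, not a library function.

```javascript
// Hypothetical helper: offset of the first differing UTF-8 byte, or -1 if identical.
function firstByteDifference(a, b) {
  const x = Buffer.from(a, "utf8");
  const y = Buffer.from(b, "utf8");
  const shared = Math.min(x.length, y.length);
  for (let i = 0; i < shared; i++) {
    if (x[i] !== y[i]) return i;
  }
  // Same prefix: either identical, or one input simply has extra trailing bytes.
  return x.length === y.length ? -1 : shared;
}

console.log(firstByteDifference("caf\u00e9", "cafe\u0301"));     // 3  (c3 vs 65: NFC vs NFD)
console.log(firstByteDifference("\ufeffcaf\u00e9", "caf\u00e9")); // 0  (ef vs 63: leading BOM)
console.log(firstByteDifference("caf\u00e9", "caf\u00e9"));      // -1 (byte-identical)
```

For files, reading the raw bytes from disk and comparing them directly is preferable, since decoding to a string first can already hide a BOM.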

Tools used in this guide

SHA-256 Hash Generator — Paste text and generate a SHA-256 digest locally for checksums, examples, cache keys, and debugging.
Fancy Text Generator — Type normal text and copy Unicode style variants for profiles, bios, headings, notes, and quick social posts.