Document PII Redactor

Strip names, emails, phone numbers, SSNs, credit cards, IBANs, and IP addresses out of a .pdf, .docx, .txt, or .md file. PDF pages that contain detections are rasterised so the underlying text cannot be recovered. Everything runs in your browser; nothing is uploaded.

How It Works

Drop your document

Pick a .pdf, .docx, .txt, or .md file up to 50 MB. Parsed locally with pdf.js, mammoth.js, or the File API — never uploaded.

Choose detection layers

Regex + NER catches names, organisations, and locations on top of structured PII. Regex-only skips the model entirely.

Detect locally

A deterministic regex layer runs first (Luhn-checked cards, mod-97-checked IBANs). BERT NER runs in a Web Worker for the AI layer.

Export redacted copy

PDFs get opaque black bars on rasterised pages; DOCX, TXT, and MD get stable [REDACTED_NAME_1] tokens. Same format in, same format out.

What the Document PII Redactor Detects

The redactor runs two cooperating layers, both entirely in your browser. The deterministic regex layer runs first and catches structured PII with high precision. The optional on-device NER layer then catches unstructured PII like personal names that no pure regex can reliably recognise. The output is exported in the same format you supplied, so a redacted PDF stays a PDF and a redacted DOCX stays a DOCX.

Category	Layer	Replacement	Validation
Email addresses	Regex	`[REDACTED_EMAIL_n]`	RFC 5322 lookalike
Phone numbers	Regex	`[REDACTED_PHONE_n]`	International + national formats
US Social Security Numbers	Regex	`[REDACTED_SSN_n]`	`###-##-####`
Credit card numbers	Regex	`[REDACTED_CARD_n]`	13–19 digits + Luhn checksum
IBANs	Regex	`[REDACTED_IBAN_n]`	Mod-97 checksum (ISO 13616)
IPv4 / IPv6 addresses	Regex	`[REDACTED_IPV4_n]` / `[REDACTED_IPV6_n]`	Octet / hextet range checks
URLs	Regex	`[REDACTED_URL_n]`	`http(s)://` prefix
Person names	BERT NER	`[REDACTED_NAME_n]`	CoNLL-2003 PER class
Organisations	BERT NER	`[REDACTED_ORG_n]`	CoNLL-2003 ORG class
Locations	BERT NER	`[REDACTED_LOC_n]`	CoNLL-2003 LOC class
Other named entities	BERT NER	`[REDACTED_MISC_n]`	CoNLL-2003 MISC class

How to Redact a Document

Drop a .pdf, .docx, .txt, or .md file into the drop area (up to 50 MB). The file is read locally, not uploaded.
Pick a detection mode: Regex + NER for maximum coverage, or Regex only for instant deterministic redaction.
Click Redact document. Detected items are replaced with stable, category-labelled tokens for text outputs, or covered with opaque black rectangles for PDFs.
Click Download redacted to save the document in the same format as the original.
Open the redacted file in your usual app to confirm the PII has been removed before sharing.

Why PDF Redaction Must Rasterise

A "redacted" PDF that only paints a black rectangle on top of the original text stream is not actually redacted — copy-pasting, searching, or rendering the file in a different reader can recover the underlying characters. To prevent this, every PDF page that contains a detected PII span is rendered to a canvas at 2.5× scale, the black redaction rectangles are painted into the bitmap, and the entire page is re-embedded in the output PDF as a PNG image. The original text stream for those pages is discarded. Pages with no detections are copied as-is so the rest of the document retains its original quality and selectable text.
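
As a rough illustration of the geometry involved, the sketch below maps a text run's bounding box from PDF user space (origin at the bottom-left, units in points) to a pixel rectangle on a canvas rendered at 2.5× scale. The function name, the box shape, and the assumption of an unrotated page with no crop-box offset are all hypothetical simplifications, not the tool's actual code.

```typescript
// Hypothetical shapes for illustration only.
interface PdfBox { x: number; y: number; width: number; height: number; } // points, bottom-left origin
interface CanvasRect { x: number; y: number; width: number; height: number; } // pixels, top-left origin

// Map a PDF-space box to canvas pixels, assuming an unrotated page.
// Edges are rounded outward so the black bar never leaves a sliver of glyph visible.
function toCanvasRect(box: PdfBox, pageHeightPts: number, scale: number): CanvasRect {
  const left = Math.floor(box.x * scale);
  const right = Math.ceil((box.x + box.width) * scale);
  // Flip the y axis: PDF y grows upward, canvas y grows downward.
  const top = Math.floor((pageHeightPts - box.y - box.height) * scale);
  const bottom = Math.ceil((pageHeightPts - box.y) * scale);
  return { x: left, y: top, width: right - left, height: bottom - top };
}

// A 100 x 12 pt run near the top of a US Letter page (792 pt tall).
const rect = toCanvasRect({ x: 72, y: 700, width: 100, height: 12 }, 792, 2.5);
console.log(rect);
```

Because the rectangle is painted into the bitmap before the page is re-embedded as an image, there is no separate overlay object that a reader could remove.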

Key Features

Four formats in, same format out — accepts .pdf, .docx, .txt, and .md; exports a redacted copy in the same extension.
Two-layer detection — deterministic regex (with Luhn and mod-97 validation) plus on-device BERT NER for unstructured names, organisations, and locations.
Stable tokens — repeated values get the same placeholder, so the redacted document still reads naturally.
Verifiable PDF redaction — affected pages are rasterised, not just overlaid, so the original text cannot be recovered by selecting or extracting it from the output.
Offline after first load — the ~110 MB NER model is cached in IndexedDB on first use; the regex layer works offline immediately.
Regex-only fallback — skip the model download entirely if you only need to redact structured PII like emails, phones, SSNs, cards, IBANs, and IPs.

When to Use a Document Redactor

Sharing case files or contracts — strip client names, addresses, and bank details before sending a sample to a colleague.
Submitting a CV or résumé — remove email, phone, and address before posting to a public job board.
Reviewing internal incident reports — redact employee names and IP addresses before circulating a post-mortem.
GDPR / HIPAA / SOC 2 reviews — produce an auditable in-browser path for any team member who needs to share an internal document externally.
Compliance archival — keep a long-term copy of a customer record with personal data removed but structure preserved.

Why On-Device Detection Matters

Cloud-based document-redaction services typically require uploading the very file you are trying to protect, which exposes the unredacted document to a third party before any redaction happens. This tool uses pdf.js, mammoth.js, and Transformers.js with a Web Worker to run the entire pipeline on your device: your document never crosses the network, and no detection result is logged anywhere outside this tab. Pair it with the PII Masker for AI Prompts when copy-pasting into ChatGPT or Claude, and the Prompt Secret Scrubber for AWS / GCP / OpenAI keys.

Validation Details

Credit cards. Numbers between 13 and 19 digits are accepted only if they pass the Luhn (mod-10) checksum, eliminating most false positives on order numbers and reference codes.
IBANs. Validated with the ISO 13616 mod-97 algorithm — a country prefix and matching checksum digits are required.
IPv4. Each octet is range-checked (0–255), so timestamp-like sequences such as 23.59.59.999 are rejected.
NER spans. Aggregated with the Transformers.js simple strategy so multi-token entities (e.g. "Jane Doe", "Bank of America") are merged into a single redaction.
PDF text runs. Every text run that overlaps a detected PII character span is added to the redaction list for its page — an entity split across multiple runs is still fully covered.
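
The checksum and range checks above can be sketched in a few lines. This is a minimal illustration of the Luhn, mod-97, and octet checks under stated assumptions: the function names are hypothetical, and the IBAN check verifies only the general shape and checksum, not per-country lengths.

```typescript
// Illustrative validators; the names are hypothetical, not the tool's API.

// Luhn (mod-10): walking from the right, double every second digit,
// subtract 9 from any result above 9, and require the total to end in 0.
function luhnValid(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d{13,19}$/.test(digits)) return false;
  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    let d = digits.charCodeAt(digits.length - 1 - i) - 48;
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// IBAN mod-97: move the first four characters to the end, map letters to
// 10..35 (A = 10), and require the resulting number mod 97 to equal 1.
function ibanValid(candidate: string): boolean {
  const iban = candidate.replace(/\s/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    // Digits contribute one decimal digit, letters contribute two.
    const value = code >= 65 ? code - 55 : code - 48;
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}

// IPv4: exactly four dot-separated parts, each 1 to 3 digits and at most 255.
function ipv4Valid(candidate: string): boolean {
  const parts = candidate.split(".");
  if (parts.length !== 4) return false;
  return parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
}
```

Running the remainder incrementally keeps the IBAN arithmetic within ordinary number precision, which matters because the full rearranged value can exceed 60 decimal digits.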

Frequently Asked Questions

Is my document uploaded anywhere?

No. Files are parsed locally with pdf.js for PDFs, mammoth.js for DOCX, and the browser File API for text and Markdown. Detection and export happen entirely in your browser — nothing reaches our servers or any third party.

Which file types are supported?

Plain text (.txt), Markdown (.md), Microsoft Word (.docx), and PDF (.pdf). Maximum file size is 50 MB. Legacy .doc, .rtf, .pptx, and .xlsx are not supported.

How does the PDF redaction work?

pdf.js extracts each text run with its bounding box. Every text run that overlaps a detected PII span is added to a redaction list. The page is rendered to a canvas at 2.5× scale, opaque black rectangles are painted into the bitmap, and the page is re-embedded as a PNG image so the original text stream cannot be recovered.
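
The run-selection step can be pictured as an interval-overlap test. The sketch below assumes each text run has been assigned a character range within the page's concatenated text (joined here with single spaces, which is an assumption); the type and function names are hypothetical.

```typescript
// Hypothetical shapes for illustration only.
interface TextRun { text: string; start: number; end: number; } // [start, end) in page text
interface Span { start: number; end: number; }                  // detected PII, [start, end)

// Assign each run a character range, assuming runs are joined with one space.
function indexRuns(texts: string[]): TextRun[] {
  const runs: TextRun[] = [];
  let offset = 0;
  for (const text of texts) {
    runs.push({ text, start: offset, end: offset + text.length });
    offset += text.length + 1; // +1 for the joining space
  }
  return runs;
}

// Keep every run whose range intersects at least one detected span.
function runsToRedact(runs: TextRun[], spans: Span[]): TextRun[] {
  return runs.filter((run) =>
    spans.some((span) => span.start < run.end && run.start < span.end)
  );
}

// "Jane Doe" (characters 8 to 16) straddles the first two runs.
const runs = indexRuns(["Contact Jane", "Doe at", "the office"]);
const hits = runsToRedact(runs, [{ start: 8, end: 16 }]);
console.log(hits.map((r) => r.text));
```

Selecting whole runs means some neighbouring non-PII text in the same run is also covered, which errs on the side of over-redaction rather than leaking part of an entity.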

How are DOCX, TXT, and Markdown documents redacted?

The document is parsed to plain text (mammoth.js handles DOCX paragraph extraction). Detected PII is replaced with stable category-labelled tokens like [REDACTED_NAME_1] and [REDACTED_EMAIL_1]. The redacted text is exported in the same format you supplied.
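
A minimal sketch of stable token assignment follows, assuming detections arrive as character spans with a category label. The email pattern is a deliberately simple stand-in for the real regex layer, and all names are hypothetical.

```typescript
// Hypothetical shape for illustration only.
interface Detection { start: number; end: number; category: string; }

// Simplified email finder standing in for the regex layer.
function findEmails(text: string): Detection[] {
  const re = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
  const found: Detection[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    found.push({ start: m.index, end: m.index + m[0].length, category: "EMAIL" });
  }
  return found;
}

// Replace each detected span with a token; identical values in the same
// category reuse the token they were given on first sight.
function redactText(text: string, detections: Detection[]): string {
  const tokens = new Map<string, string>();   // category + value -> token
  const counters = new Map<string, number>(); // category -> last index used
  const sorted = detections.slice().sort((a, b) => a.start - b.start);
  let out = "";
  let cursor = 0;
  for (const d of sorted) {
    if (d.start < cursor) continue; // skip spans overlapping an earlier one
    const key = d.category + "\u0000" + text.slice(d.start, d.end);
    let token = tokens.get(key);
    if (token === undefined) {
      const n = (counters.get(d.category) || 0) + 1;
      counters.set(d.category, n);
      token = "[REDACTED_" + d.category + "_" + n + "]";
      tokens.set(key, token);
    }
    out += text.slice(cursor, d.start) + token;
    cursor = d.end;
  }
  return out + text.slice(cursor);
}

const sample = "Contact jane@example.com or bob@example.org. Again: jane@example.com";
console.log(redactText(sample, findEmails(sample)));
```

Because the second mention of the same address reuses its first token, a reader of the redacted text can still tell which references point to the same person or account.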

What kinds of PII does it detect?

Emails, phone numbers, US SSNs, credit cards (Luhn-validated), IBANs (mod-97-validated), IPv4 / IPv6 addresses, and URLs via the regex layer; names, organisations, locations, and miscellaneous entities via the BERT NER layer.

Why is the first redaction slower?

Regex + NER mode triggers a one-time ~110 MB download of the BERT NER model and caches it in IndexedDB. Subsequent runs reuse the cache. Regex-only mode requires no model.

Does it work offline?

Yes. Once the NER model is cached you can redact documents with no connection. Regex-only mode is offline-capable out of the box.

Will it catch every name?

No NER model is perfect. BERT-base-NER is accurate on Western personal names and known organisations but can miss nicknames, single-word names, rare entities, and names broken across multiple text runs in a PDF. Always eyeball the redacted output before sharing.

Which model powers the NER layer?

Xenova/bert-base-NER, an ONNX conversion of dslim/bert-base-NER (Apache 2.0). It is a BERT-base model fine-tuned on CoNLL-2003 for PER / ORG / LOC / MISC tagging.
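
CoNLL-2003 taggers label tokens with B- (begin) and I- (inside) prefixes, which is why multi-token entities need a merge step before redaction. The sketch below shows one simple way to merge such tags into spans. It is an illustration of the idea, not the Transformers.js implementation, and the token shape is an assumption.

```typescript
// Hypothetical shapes for illustration only.
interface TaggedToken { start: number; end: number; label: string; } // e.g. "B-PER", "I-PER", "O"
interface Entity { start: number; end: number; type: string; }

// Merge consecutive tagged tokens into entity spans: an I- tag of the same
// type extends the open entity; anything else tagged starts a new one.
function mergeBioTokens(tokens: TaggedToken[]): Entity[] {
  const entities: Entity[] = [];
  let current: Entity | null = null;
  for (const t of tokens) {
    if (t.label === "O") {
      current = null; // an untagged token closes any open entity
      continue;
    }
    const prefix = t.label.slice(0, 2);
    const type = t.label.slice(2);
    if (prefix === "I-" && current !== null && current.type === type) {
      current.end = t.end;
    } else {
      current = { start: t.start, end: t.end, type };
      entities.push(current);
    }
  }
  return entities;
}

// Token offsets for "Jane Doe works at Bank of America".
const merged = mergeBioTokens([
  { start: 0, end: 4, label: "B-PER" },
  { start: 5, end: 8, label: "I-PER" },
  { start: 9, end: 14, label: "O" },
  { start: 15, end: 17, label: "O" },
  { start: 18, end: 22, label: "B-ORG" },
  { start: 23, end: 25, label: "I-ORG" },
  { start: 26, end: 33, label: "I-ORG" },
]);
console.log(merged);
```

Each merged span then maps to a single placeholder such as [REDACTED_NAME_1] rather than one token per word.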

Privacy & Security

Every step runs locally — pdf.js and mammoth.js parse your document in the browser, the regex layer runs in plain JavaScript, and the NER layer runs through Xenova/bert-base-NER inside a Web Worker. PDF pages with detections are rasterised so the underlying text cannot be recovered. No file, no detection result, and no redacted output ever reaches our servers or any third party.