Private Document Q&A
Chat with a PDF, DOCX, TXT, or Markdown file using on-device retrieval-augmented generation. MiniLM embeddings index the document, Llama 3.2 1B answers your question — everything runs in your browser.
Drop a document, or browse
.txt · .md · .docx · .pdf — up to 30 MB. Nothing is uploaded.
First-time download: the ~23 MB embedding model now; the ~990 MB Llama 3.2 1B language model on your first question.
Built with Meta Llama — license details
How It Works
Drop your document
Pick a .pdf, .docx, .txt, or .md file up to 30 MB. Parsed locally with pdf.js, mammoth.js, or the File API — never uploaded.
Index every passage
The document is split into sentence-aware chunks and embedded with MiniLM in a Web Worker. The 384-dimensional vectors stay in your tab.
Retrieve the top-3 matches
Your question is embedded into the same vector space. Cosine similarity ranks every chunk client-side; the best three feed the answer model.
Llama 3.2 answers with citations
Llama 3.2 1B generates a grounded answer from the retrieved passages — and lists the exact excerpts and page numbers it used.
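The retrieval step above boils down to a cosine-similarity ranking over the stored vectors. A minimal sketch, assuming the passage vectors have already been produced by MiniLM (function names here are illustrative, not the tool's actual code):

```javascript
// Cosine similarity between two equal-length vectors.
function cosine(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank every passage vector against the question vector and
// return the indices of the top-k matches (k = 3 in this tool).
function topK(questionVec, passageVecs, k = 3) {
  return passageVecs
    .map((vec, index) => ({ index, score: cosine(questionVec, vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map(({ index }) => index);
}
```

Because cosine similarity is scale-invariant, it compares the direction of the 384-dimensional vectors rather than their magnitude, which is what makes a short question comparable to a long passage.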
What "Private Document Q&A" Means
Cloud-based "chat with your PDF" tools typically require you to upload the very file you want to keep confidential — contracts, financial statements, medical letters, internal reports. This tool flips that threat model. The document is parsed and indexed in your browser, the embedding model and the language model are downloaded once and cached in IndexedDB, and every question is answered without a single byte leaving the tab. No account. No telemetry. No third-party API call.
The On-Device Models
| Stage | Model | License | Size |
|---|---|---|---|
| Passage and question embeddings | Xenova/all-MiniLM-L6-v2 (ONNX export of sentence-transformers/all-MiniLM-L6-v2) | Apache 2.0 | ~23 MB |
| Answer generation | onnx-community/Llama-3.2-1B-Instruct-ONNX (Meta Llama 3.2 1B Instruct) | Llama 3.2 Community License | ~990 MB (q4f16, WebGPU) |
| PDF parser | pdf.js | Apache 2.0 | Bundled |
| DOCX parser | mammoth.js | BSD-2-Clause | Bundled |
How to Use Private Document Q&A
- Drop a .pdf, .docx, .txt, or .md file into the upload area (up to 30 MB).
- Wait for the indexer — the document is split into sentence-aware passages and each one is embedded into a 384-dimensional vector on your device.
- Type a question about the document. The tool retrieves the three most relevant passages and passes them to the answer model.
- Read the grounded answer. Expand Show source passages to verify every claim against the original text, with page numbers for PDFs.
- Ask follow-up questions — the document index is reused, so only the answer step runs each time.
Why Retrieval-Augmented Generation Beats a Plain LLM
Generative language models are notoriously prone to hallucination — they confidently invent facts when the prompt is ambiguous. Retrieval-augmented generation (RAG) sidesteps that by attaching the most relevant document excerpts to the prompt itself. The language model is no longer "answering from memory"; it is summarising the excerpts you can see for yourself. The system prompt also explicitly instructs Llama 3.2 to refuse when the answer is not in the excerpts — so unanswerable questions return "I could not find this in the document." rather than a fabricated reply.
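In outline, the grounding works by splicing the retrieved excerpts into the prompt itself. A hypothetical sketch of that assembly — the tool's actual system prompt and few-shot examples are not published on this page, so the wording below is illustrative only:

```javascript
// Assemble a chat-style prompt that grounds the model in the
// retrieved excerpts and instructs it to refuse otherwise.
// (Illustrative only — the tool's real system prompt differs.)
function buildRagMessages(question, excerpts) {
  const context = excerpts
    .map((e, i) => `[${i + 1}] (page ${e.page}) ${e.text}`)
    .join("\n\n");
  return [
    {
      role: "system",
      content:
        "Answer ONLY from the excerpts below. If the answer is not in them, " +
        'reply exactly: "I could not find this in the document."\n\n' + context,
    },
    { role: "user", content: question },
  ];
}
```

Because the excerpts are numbered, the same structure also makes it cheap to render each one back to the user as a citation.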
Key Features
- Four formats in — accepts .pdf, .docx, .txt, and .md documents up to 30 MB.
- Grounded answers — every reply is generated from passages retrieved by cosine similarity against the document, not from the model's parametric memory.
- Inline citations — each answer expands to reveal the exact source passages, with page numbers for PDFs, so you can verify every claim.
- Offline after first load — the ~1 GB combined model bundle (MiniLM + Llama 3.2 1B) is cached in IndexedDB and works offline forever after.
- Reusable index — once a document is embedded, every follow-up question only re-runs the small embedding pass and the answer pass.
- Browser-only privacy — no upload, no account, no API key, no telemetry, no third-party service in the loop.
When to Use Private Document Q&A
- Reviewing a contract — ask "what is the notice period for termination?" without sending the contract to a hosted AI.
- Skimming research papers — ask "what dataset did the authors use?" or "what were the main findings?" on a 30-page PDF.
- Onboarding documents — put targeted questions to an employee handbook or a long PRD instead of scrolling.
- Compliance reviews — query an internal policy without copying snippets into ChatGPT or Claude.
- Financial statements — ask "what was the operating margin in Q3?" against a 10-Q PDF.
How It Compares to "Chat With Your PDF" Services
Hosted services like ChatPDF, Humata, and AskYourPDF upload your document to their servers, store it in a vector database they own, and run a third-party LLM (typically OpenAI or Anthropic) over the retrieved chunks. The privacy contract is "we promise not to look at it." This tool replaces every step of that pipeline with an in-browser equivalent: pdf.js parses the file, MiniLM embeds the chunks into IndexedDB-cached memory, cosine similarity runs over Float32 vectors in JavaScript, and Llama 3.2 1B generates the answer in a Web Worker. Pair it with the Document PII Redactor if you need to strip personal data from the file before storing the redacted copy elsewhere, or the PII Masker for AI Prompts when you do need to use a hosted LLM and want to scrub names, emails, and IDs first.
Limits and Trade-offs
- Document length. The index is capped at 80 passages (~96,000 characters of content). For very long PDFs, split the file before uploading.
- Adaptive context. Documents under ~12,000 characters (~3,000 tokens) skip retrieval entirely — the full document is passed to Llama in document order. Longer documents fall back to top-3 cosine-similarity retrieval. The answer card labels which mode ran.
- Top-k window. When retrieval runs, only the three highest-similarity passages are passed to the answer model. Questions that need synthesis across many paragraphs may need to be asked more narrowly.
- Reasoning depth. Llama 3.2 1B is a small on-device model — it is strong at extracting and lightly synthesising facts from the supplied passages but not at deep multi-step reasoning. For complex chained logic, use the answer as a starting point and verify against the cited excerpts.
- First-time download. Around 1 GB total the first time (23 MB MiniLM + 990 MB Llama 3.2 1B in q4f16). Cached forever after, and re-used by the Resume Builder and Cover Letter Generator on this site.
- Browser support. Chrome, Edge, Firefox, and Brave are supported. Safari is hard-gated until JavaScriptCore reclaims WebAssembly memory during decoder runs.
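The adaptive-context rule above amounts to a single threshold check before each question. A sketch, assuming the ~12,000-character cutoff described in this section (the constant and function names are illustrative):

```javascript
// Decide which context mode to run, per the adaptive-context rule:
// short documents skip retrieval entirely; long ones use top-3
// cosine-similarity retrieval. Threshold mirrors the one stated above.
const FULL_CONTEXT_CHAR_LIMIT = 12000; // ≈ 3,000 tokens

function chooseContextMode(documentText) {
  return documentText.length < FULL_CONTEXT_CHAR_LIMIT
    ? "full-document"
    : "top-3-retrieval";
}
```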
Frequently Asked Questions
Is my document uploaded anywhere?
No. PDFs are parsed with pdf.js, DOCX files with mammoth.js, and TXT/Markdown via the browser File API. The embedding model and the language model run in a Web Worker on your device via Transformers.js. The document, the question, the retrieved passages, and the answer are all kept inside your browser tab.
Which file types are supported?
Plain text (.txt), Markdown (.md), Microsoft Word (.docx), and PDF (.pdf). Maximum file size is 30 MB. Legacy .doc, .rtf, .pptx, and .xlsx are not supported.
How does the retrieval work?
The document is split into sentence-aware passages of ~1,200 characters. Each passage is embedded with Xenova/all-MiniLM-L6-v2 into a 384-dimensional vector. Your question is embedded into the same space; cosine similarity ranks every chunk; the top three feed the answer model.
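A sentence-aware chunker of this kind can be sketched as follows. The sentence splitter here is deliberately naive (it assumes sentences end in ., !, or ?); the tool's real segmentation rules are not documented on this page:

```javascript
// Split text into passages of roughly `maxChars` characters,
// breaking only at sentence boundaries so no sentence is cut in half.
// (Naive sketch: a sentence is anything ending in ., ! or ?.)
function chunkBySentence(text, maxChars = 1200) {
  const sentences = text.match(/[^.!?]+[.!?]+(\s+|$)|[^.!?]+$/g) ?? [];
  const passages = [];
  let current = "";
  for (const sentence of sentences) {
    if (current && current.length + sentence.length > maxChars) {
      passages.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) passages.push(current.trim());
  return passages;
}
```

Each passage returned here would then be embedded once and stored alongside its page number, so follow-up questions only re-embed the question itself.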
Which language model generates the answers?
Meta's Llama 3.2 1B Instruct (onnx-community/Llama-3.2-1B-Instruct-ONNX) under the Llama 3.2 Community License. A 1-billion-parameter decoder-only instruction-tuned model running locally via Transformers.js on WebGPU with a chat-template prompt that forbids inventing facts outside the retrieved passages. The required "Built with Meta Llama" attribution badge is rendered on this page.
Why is the first question slower than the rest?
The first upload triggers a one-time ~23 MB MiniLM download. The first question then triggers a one-time ~990 MB Llama 3.2 1B download (q4f16). Both are cached in IndexedDB; subsequent documents and questions reuse the cache and start instantly. The same Llama files are reused by the Resume Builder and Cover Letter Generator on this site.
Does it work offline?
Yes — once both models are cached you can index new documents and ask questions with no internet connection. The IndexedDB cache persists across sessions on the same device.
Why is Safari not supported?
Llama 3.2 1B is decoder-only; each generated token grows the KV-cache tensor. Safari's JavaScriptCore cannot run FinalizationRegistry callbacks during a synchronous WebAssembly block, so the WASM heap grows without bound during generation. We hard-gate the tool to Chrome, Edge, Firefox, and Brave until Safari fixes this.
Will the model invent answers that are not in the document?
Its system prompt and three few-shot examples steer Llama 3.2 away from padding answers with invented items and force it to reply "I could not find this in the document." when the answer is missing. Every answer also lists the exact passages the model saw, so you can verify it against the source text.
How accurate is it on long documents?
The index is capped at 80 passages, and top-3 retrieval limits how much context the model sees. Split very long PDFs into shorter parts and ask narrowly scoped questions for the best results.
Why do answers on short documents look better than on long ones?
When the entire document fits below ~12,000 characters (~3,000 tokens) the tool skips retrieval and passes the full document to Llama 3.2 in document order. This is important for brochures and flyers with multi-column or boxed layouts: PDF text extraction returns text in reading order, not visual order, so section headings and their body text can get scrambled — passing the whole document lets the model re-associate them itself. Each answer shows a small "Full document context" or "Top-3 retrieval" badge so you can see which mode ran.
Privacy & Security