Supported formats

Coverage matrix

Format	Text content	Tables	Headers/Footers	Comments	Track changes	Embedded images	Notes
`.docx`	✅	✅	✅	⚠️ stripped, not redacted	⚠️ stripped, not redacted	❌ (signatures pass through)	Output preserves styles, lists, tables
`.pdf` (text layer)	✅	✅ (if extractable)	✅	n/a	n/a	⚠️ not OCR-checked	Uses PyMuPDF redaction annotations
`.pdf` (scanned or hybrid)	✅ with local OCR	⚠️ layout-dependent	✅ as OCR text	n/a	n/a	✅ for scanned pages	Requires Tesseract with `eng` + `rus` language packs
`.xlsx`	✅	✅ (cells)	❌ (no headers in xlsx)	⚠️ stripped	n/a	❌	Formulas preserved; values redacted
`.txt` / `.md` / `.csv`	✅ (line by line)	⚠️ treated as plain text	n/a	n/a	n/a	n/a	Original encoding and newlines preserved
`.html` / `.htm`	✅ (block-level text)	✅ (`td` text)	n/a	⚠️ warned, left unchanged	n/a	❌	`mailto:`/`tel:`/`alt`/`title` attributes redacted too
`.rtf`	✅ (paragraph text)	⚠️ flattened to paragraphs	n/a	n/a	n/a	❌ dropped with a warning	Input only — output is a fresh `.docx`

Legend:

✅ - detected and redacted in place
⚠️ - handled with the caveat stated in the cell; where the cell says "stripped", the content is removed entirely from the output rather than redacted in place, because the surrounding structure isn’t user-visible
❌ - not handled; surfaces a warning to the user

Scanned and hybrid PDFs are processed locally through Tesseract OCR. The downloaded PDF keeps the original pages and overlays redaction/token boxes where OCR found text; it does not receive a full raw OCR text layer. If Tesseract or the English/Russian language packs are missing, scanned-page processing is rejected with installation guidance instead of silently skipping pages.
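The fail-fast check described above could be sketched as follows. This is a minimal illustration, not the project's actual code: the function names and error wording are hypothetical, and only the `tesseract --list-langs` command and the `eng`/`rus` pack names are taken from Tesseract and from this page.

```python
import shutil
import subprocess

REQUIRED_LANGS = {"eng", "rus"}  # language packs named on this page


def missing_langs(list_langs_output: str) -> set[str]:
    """Return the required language codes absent from `--list-langs` output."""
    # Every non-empty line is treated as a candidate code; the header line
    # ("List of available languages ...") never equals a real code.
    available = {line.strip() for line in list_langs_output.splitlines() if line.strip()}
    return REQUIRED_LANGS - available


def ensure_ocr_ready() -> None:
    """Hypothetical guard: raise instead of silently skipping scanned pages."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("Tesseract not found on PATH; install it to process scanned PDFs.")
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # Depending on the Tesseract version, the list may go to stdout or stderr.
    missing = missing_langs(result.stdout + "\n" + result.stderr)
    if missing:
        raise RuntimeError(
            f"Missing Tesseract language packs: {', '.join(sorted(missing))}."
        )
```

Calling `ensure_ocr_ready()` before the first scanned page is touched keeps the failure visible to the user rather than producing a partially processed PDF.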

Plain text (`.txt`, `.md`, `.csv`)

Text files are anonymized line by line. Markdown markup and CSV columns are treated as plain text: there is no structural parsing, and replacements stay within their lines. UTF-8 and UTF-8-BOM input keeps its encoding on output; cp1251 input is also preserved but surfaces an informational `TEXT_ENCODING_LEGACY` warning. Binary content or an unsupported text encoding is rejected. A text file whose content legitimately starts with `%PDF` is still processed as text; binary content under a text suffix falls back to magic-byte detection.
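A rough sketch of the encoding and newline round trip follows. It rests on several assumptions: the null-byte test stands in for whatever binary detection the tool really uses, `EMAIL_RE` is a toy detector for the demo, and the function names are hypothetical. Only the supported encodings and the `TEXT_ENCODING_LEGACY` code come from this page.

```python
import codecs
import re
from typing import Callable

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # toy detector, demo only


def detect_text_encoding(raw: bytes) -> tuple[str, list[str]]:
    """Return (encoding, warnings); raise ValueError for unsupported input."""
    if b"\x00" in raw:  # assumption: a null byte is treated as binary content
        raise ValueError("binary content is rejected")
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig", []  # this codec strips the BOM on decode, re-adds it on encode
    try:
        raw.decode("utf-8")
        return "utf-8", []
    except UnicodeDecodeError:
        pass
    try:
        raw.decode("cp1251")
    except UnicodeDecodeError:
        raise ValueError("unsupported text encoding") from None
    return "cp1251", ["TEXT_ENCODING_LEGACY"]


def anonymize_text_file(
    raw: bytes, redact_line: Callable[[str], str]
) -> tuple[bytes, list[str]]:
    """Redact each line separately, keeping encoding and line endings."""
    encoding, warnings = detect_text_encoding(raw)
    out = []
    # keepends=True keeps each line's own terminator (\n, \r\n or \r).
    for line in raw.decode(encoding).splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        out.append(redact_line(body) + ending)
    return "".join(out).encode(encoding), warnings
```

Because the redaction callback only ever sees one line without its terminator, a replacement cannot spill across lines, and mixed `\r\n`/`\n` endings survive unchanged.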

HTML (`.html`, `.htm`)

Text of block-level elements (`p`, `li`, `td`, headings, `div`, …) is anonymized, including text spanning inline tags. `href` attributes with `mailto:`/`tel:` targets and `alt`/`title` attributes are redacted as well. Content the engine can’t safely rewrite is left unchanged and warned about: `<script>` bodies, HTML comments, and selected `<meta>` tags (`author`, `description`, `og:*`). These warnings escalate to unsafe when the content looks like PII. Hidden text (`display:none` / `hidden`) is anonymized and adds an informational warning.
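To illustrate the "left unchanged and warned about" path, here is a small sketch built on the standard-library `html.parser`. The class and function names are hypothetical, the email regex is a toy stand-in for the real PII check, and the actual engine may use a different parser entirely.

```python
import re
from html.parser import HTMLParser

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # toy PII check, demo only
WARNED_META = ("author", "description")


class UnrewrittenContentScanner(HTMLParser):
    """Collect content that is reported but not rewritten."""

    def __init__(self) -> None:
        super().__init__()
        self.findings: list[tuple[str, str]] = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script":
            self._in_script = True
        elif tag == "meta":
            key = (attr.get("name") or attr.get("property") or "").lower()
            if key in WARNED_META or key.startswith("og:"):
                self.findings.append(("meta", attr.get("content") or ""))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script and data.strip():
            self.findings.append(("script", data.strip()))

    def handle_comment(self, data):
        self.findings.append(("comment", data.strip()))


def scan_unrewritten(html: str) -> list[tuple[str, bool]]:
    """Return (kind, escalate) pairs; escalate is True when content looks like PII."""
    scanner = UnrewrittenContentScanner()
    scanner.feed(html)
    scanner.close()
    return [(kind, bool(EMAIL_RE.search(content))) for kind, content in scanner.findings]
```

Each finding maps to one warning; the boolean marks the cases that would escalate to unsafe under the rule described above.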

RTF (`.rtf`)

RTF is input-only: paragraph text is extracted with `striprtf`, anonymized, and written into a fresh `.docx`. There is no RTF writer, and source formatting is not preserved (informational `RTF_CONVERTED_TO_DOCX` warning). Embedded objects and pictures are dropped during conversion and surface the risky `RTF_OBJECT_DROPPED` warning.
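The conversion path might look roughly like this. It is a sketch under assumptions: `python-docx` is assumed as the `.docx` writer (this page names only `striprtf`), the function names are hypothetical, and detecting embedded content by scanning for the `\pict` and `\object` control words is a simplification of whatever the tool actually does.

```python
import re
from typing import Callable

# \pict and \object are the RTF control words that introduce pictures and
# embedded objects; the lookahead avoids matching longer control words.
EMBEDDED_RE = re.compile(r"\\(?:pict|object)(?![a-zA-Z])")


def has_embedded_objects(rtf_source: str) -> bool:
    """Heuristic check for pictures or embedded objects in raw RTF."""
    return EMBEDDED_RE.search(rtf_source) is not None


def rtf_to_anonymized_docx(
    rtf_source: str, out_path: str, anonymize: Callable[[str], str]
) -> list[str]:
    """Extract paragraph text, anonymize it, and write a fresh .docx."""
    # Third-party imports are kept local so the helper above works without them.
    from docx import Document  # python-docx (assumed writer)
    from striprtf.striprtf import rtf_to_text

    warnings = ["RTF_CONVERTED_TO_DOCX"]
    if has_embedded_objects(rtf_source):
        warnings.append("RTF_OBJECT_DROPPED")

    document = Document()
    for paragraph in rtf_to_text(rtf_source).splitlines():
        document.add_paragraph(anonymize(paragraph))
    document.save(out_path)
    return warnings
```

Only plain paragraph text reaches the output, which is why source formatting cannot survive the round trip.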

What “warning” means

When you upload a file with unsupported features, the UI shows warnings in three tiers, plus OCR-specific warnings:

Informational — “Document contains comments. Those will be removed entirely from the output.” Common, expected, safe to ignore.
Risky but allowed — “Document contains hyperlinks; their text is redacted but the underlying URL targets aren’t analyzed.”
Unsafe — download blocked — “Document contains attachments. We can’t be sure their contents are PII-clean. Download blocked.” You’d need to manually extract and process the attachment.
OCR warnings — low-confidence OCR or unchecked embedded images mean the output needs visual review, especially for low-quality scans.
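One way to model the tiers and the download gate is sketched below. The type names, the `needs_visual_review` flag, and the `ATTACHMENT_PRESENT` and `OCR_LOW_CONFIDENCE` codes are hypothetical; `RTF_CONVERTED_TO_DOCX` and `RTF_OBJECT_DROPPED` are the codes mentioned on this page.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFORMATIONAL = 1  # common, expected, safe to ignore
    RISKY = 2          # allowed, but the user should be aware
    UNSAFE = 3         # download blocked


@dataclass(frozen=True)
class DocWarning:
    code: str
    severity: Severity
    needs_visual_review: bool = False  # set for OCR-related warnings


def download_allowed(warnings: list[DocWarning]) -> bool:
    """A single unsafe warning is enough to block the download."""
    return all(w.severity < Severity.UNSAFE for w in warnings)


def review_required(warnings: list[DocWarning]) -> bool:
    """True when any warning asks for a visual check of the output."""
    return any(w.needs_visual_review for w in warnings)
```

Keeping the OCR flag separate from the severity lets a low-quality scan request visual review without forcing it into one of the three tiers.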

Roadmap

Planned: Password-protected PDFs
Planned: Image content in `.docx`/`.xlsx` (signatures, logos with PII)
Future: OCR export (download the recognized document as an editable `.docx`)
MVP-2: Additional language packs