Supported formats
Coverage matrix
| Format | Text content | Tables | Headers/Footers | Comments | Track changes | Embedded images | Notes |
|---|---|---|---|---|---|---|---|
| .docx | ✅ | ✅ | ✅ | ⚠️ stripped, not redacted | ⚠️ stripped, not redacted | ❌ (signatures pass through) | Output preserves styles, lists, tables |
| .pdf (text layer) | ✅ | ✅ (if extractable) | ✅ | n/a | n/a | ❌ | Uses PyMuPDF redaction annotations |
| .pdf (scanned, no text) | ❌ | ❌ | ❌ | n/a | n/a | ❌ | MVP-1 brings OCR |
| .xlsx | ✅ | ✅ (cells) | ❌ (no headers in xlsx) | ⚠️ stripped | n/a | ❌ | Formulas preserved; values redacted |
Legend:
- ✅ — detected and redacted in place
- ⚠️ — content removed entirely from output (not redacted in place because the surrounding structure isn’t user-visible)
- ❌ — not handled; surfaces a warning to the user
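For readers who want to reason about the matrix programmatically, here is a minimal sketch that encodes it as data and lists the features that are not redacted in place for a given format. The names `Handling`, `COVERAGE`, and `features_needing_warning`, along with the format and feature keys, are hypothetical and chosen for illustration; they are not the product's actual API.

```python
from enum import Enum


class Handling(Enum):
    REDACTED = "redacted in place"    # ✅ in the matrix
    STRIPPED = "removed from output"  # ⚠️ in the matrix
    UNSUPPORTED = "not handled"       # ❌ in the matrix
    NOT_APPLICABLE = "n/a"


R, S, U, NA = (Handling.REDACTED, Handling.STRIPPED,
               Handling.UNSUPPORTED, Handling.NOT_APPLICABLE)

# Column order mirrors the coverage matrix (the Notes column is omitted).
FEATURES = ["text", "tables", "headers_footers",
            "comments", "track_changes", "embedded_images"]

# Hypothetical encoding of the matrix rows; the keys are illustrative.
COVERAGE = {
    "docx":        dict(zip(FEATURES, [R, R, R, S, S, U])),
    "pdf_text":    dict(zip(FEATURES, [R, R, R, NA, NA, U])),
    "pdf_scanned": dict(zip(FEATURES, [U, U, U, NA, NA, U])),
    "xlsx":        dict(zip(FEATURES, [R, R, U, S, NA, U])),
}


def features_needing_warning(fmt: str) -> list[str]:
    """Return features that are stripped or unsupported for this format."""
    row = COVERAGE[fmt]
    return [name for name, handling in row.items()
            if handling in (Handling.STRIPPED, Handling.UNSUPPORTED)]


print(features_needing_warning("docx"))
```

Keeping the matrix in one data structure like this would let the documentation table and any runtime checks be generated from the same source, though whether the tool is built that way is an assumption.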
What “warning” means
When you upload a file with unsupported features, the UI shows a warning at one of three tiers:
- Informational — “Document contains comments. Those will be removed entirely from the output.” Common, expected, safe to ignore.
- Risky but allowed — “Document contains hyperlinks; their text is redacted but the underlying URL targets aren’t analyzed.”
- Unsafe — download blocked — “Document contains attachments. We can’t be sure their contents are PII-clean. Download blocked.” You’d need to manually extract and process the attachment.
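The three tiers can be sketched as an ordered enum, where the most severe warning on a document decides whether the download is allowed. This is only an illustration of the behaviour described above: `Tier`, `FEATURE_TIERS`, `summarize`, and the feature names are hypothetical, and the real detection logic is not shown in this document.

```python
from enum import IntEnum


class Tier(IntEnum):
    INFORMATIONAL = 1  # expected, safe to ignore
    RISKY = 2          # allowed, but not fully analyzed
    UNSAFE = 3         # download blocked


# Hypothetical mapping, taken from the three examples in the list above.
FEATURE_TIERS = {
    "comments": Tier.INFORMATIONAL,
    "hyperlinks": Tier.RISKY,
    "attachments": Tier.UNSAFE,
}


def summarize(detected_features):
    """Return (worst tier or None, download_allowed) for a document."""
    tiers = [FEATURE_TIERS[f] for f in detected_features if f in FEATURE_TIERS]
    if not tiers:
        return None, True  # nothing to warn about
    worst = max(tiers)
    # Only the Unsafe tier blocks the download.
    return worst, worst < Tier.UNSAFE


worst, allowed = summarize(["comments", "attachments"])
print(worst.name, allowed)
```

Using an `IntEnum` makes "most severe warning wins" a plain `max()` call, which is one simple way to model a blocking tier alongside non-blocking ones.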
Roadmap
- MVP-1: OCR for scanned PDFs (Tesseract baseline)
- MVP-1.1: Password-protected PDFs
- MVP-1.2: Image content in .docx/.xlsx (signatures, logos with PII)
- MVP-2: Additional language packs