
Supported formats

Coverage matrix

Format | Text content | Tables | Headers/Footers | Comments | Track changes | Embedded images | Notes
.docx |  |  |  | ⚠️ stripped, not redacted | ⚠️ stripped, not redacted | ❌ (signatures pass through) | Output preserves styles, lists, tables
.pdf (text layer) |  | ✅ (if extractable) |  | n/a | n/a |  | Uses PyMuPDF redaction annotations
.pdf (scanned, no text) |  |  |  | n/a | n/a |  | MVP-1 brings OCR
.xlsx | ✅ (cells) |  | ❌ (no headers in xlsx) | ⚠️ stripped | n/a |  | Formulas preserved; values redacted

Legend:

  • ✅ — detected and redacted in place
  • ⚠️ — content removed entirely from output (not redacted in place because the surrounding structure isn’t user-visible)
  • ❌ — not handled; surfaces a warning to the user
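As a rough illustration, the legend can be modeled as a small lookup. This is a minimal sketch, not the product's actual data model: the `Handling` enum, the `COVERAGE` dictionary, the feature keys, and `describe` are all hypothetical names, and only a few cells that the matrix states explicitly are encoded.

```python
from enum import Enum


class Handling(Enum):
    """Legend symbols as hypothetical enum members (names are assumptions)."""
    REDACTED = "detected and redacted in place"        # ✅
    STRIPPED = "removed entirely from the output"      # ⚠️
    UNSUPPORTED = "not handled; a warning is surfaced"  # ❌
    NOT_APPLICABLE = "not applicable"                   # n/a


# Only a handful of cells stated explicitly in the matrix are encoded here.
COVERAGE = {
    (".docx", "comments"): Handling.STRIPPED,
    (".docx", "track_changes"): Handling.STRIPPED,
    (".docx", "embedded_images"): Handling.UNSUPPORTED,
    (".xlsx", "headers_footers"): Handling.UNSUPPORTED,
}


def describe(fmt: str, feature: str) -> str:
    """Return a one-line description of how a feature is handled."""
    handling = COVERAGE.get((fmt, feature))
    if handling is None:
        return f"{fmt} {feature}: not listed in this sketch"
    return f"{fmt} {feature}: {handling.value}"


print(describe(".docx", "comments"))
# prints ".docx comments: removed entirely from the output"
```

Keeping "stripped" and "redacted" as separate states matters because stripped content never reaches the output at all, while redacted content keeps its place in the document.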

What “warning” means

When you upload a file with unsupported features, the UI shows warnings in one of three tiers:

  • Informational — “Document contains comments. Those will be removed entirely from the output.” Common, expected, safe to ignore.
  • Risky but allowed — “Document contains hyperlinks; their text is redacted but the underlying URL targets aren’t analyzed.”
  • Unsafe — download blocked — “Document contains attachments. We can’t be sure their contents are PII-clean. Download blocked.” You’d need to manually extract and process the attachment.
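The three tiers above could be driven by a simple feature scan. The sketch below is an assumption-laden illustration for .docx only, not the tool's real detection logic: `Tier`, `detect_docx_features`, `FEATURE_TIER`, and `evaluate` are hypothetical names. It relies on the standard OOXML package layout (comments in `word/comments.xml`, embedded objects under `word/embeddings/`, `<w:hyperlink>` elements in `word/document.xml`) and treats "attachments" as embedded objects, which is a guess about what the UI means.

```python
import io
import zipfile
from enum import Enum
from typing import List, Set, Tuple


class Tier(Enum):
    INFORMATIONAL = 1
    RISKY_BUT_ALLOWED = 2
    UNSAFE = 3  # download blocked


# Hypothetical mapping taken from the three example warnings above.
FEATURE_TIER = {
    "comments": Tier.INFORMATIONAL,
    "hyperlinks": Tier.RISKY_BUT_ALLOWED,
    "attachments": Tier.UNSAFE,
}


def detect_docx_features(data: bytes) -> Set[str]:
    """Scan a .docx (a ZIP package) for features that trigger warnings."""
    found: Set[str] = set()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = zf.namelist()
        if "word/comments.xml" in names:
            found.add("comments")
        if any(n.startswith("word/embeddings/") for n in names):
            found.add("attachments")
        if "word/document.xml" in names:
            # Byte-level check only; field-code hyperlinks are not detected.
            if b"<w:hyperlink" in zf.read("word/document.xml"):
                found.add("hyperlinks")
    return found


def evaluate(data: bytes) -> Tuple[List[Tuple[str, Tier]], bool]:
    """Return (warnings sorted by severity, download_allowed)."""
    warnings = sorted(
        ((f, FEATURE_TIER[f]) for f in detect_docx_features(data)),
        key=lambda item: item[1].value,
        reverse=True,
    )
    download_allowed = all(tier is not Tier.UNSAFE for _, tier in warnings)
    return warnings, download_allowed
```

In this sketch a single unsafe feature is enough to block the download, while informational and risky warnings are only reported.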

Roadmap

  • MVP-1: OCR for scanned PDFs (Tesseract baseline)
  • MVP-1.1: Password-protected PDFs
  • MVP-1.2: Image content in .docx/.xlsx (signatures, logos with PII)
  • MVP-2: Additional language packs