Architecture

Three layers

┌─────────────────────────────────────────┐
│  webapp (FastAPI + htmx on 127.0.0.1)   │  ← user interaction
├─────────────────────────────────────────┤
│  cli  (argparse — anonymize subcommands)│  ← scripts / automation
├─────────────────────────────────────────┤
│  core (parsers, detection, tokens,      │  ← pure Python, no network
│        writers, language packs)         │
└─────────────────────────────────────────┘

core is a regular Python package with no network dependency. An integration test asserts that no socket opens during normal processing.
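One way such a no-network check can be written is to swap out socket.socket for the duration of a test so that any attempt to create a socket fails loudly. The sketch below is illustrative only: forbid_sockets, NetworkAccessError, and process_offline are hypothetical names, and process_offline merely stands in for a real core operation.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when guarded code tries to create a socket."""


def _blocked_socket(*args, **kwargs):
    raise NetworkAccessError("socket opened during processing")


@contextmanager
def forbid_sockets():
    """Temporarily replace socket.socket with a callable that always raises."""
    original = socket.socket
    socket.socket = _blocked_socket
    try:
        yield
    finally:
        # Always restore the real class, even if the guarded code failed.
        socket.socket = original


def process_offline(text: str) -> str:
    # Hypothetical stand-in for a core operation; it performs no I/O.
    return text.replace("alice@example.com", "[Email_1]")


def test_processing_opens_no_socket():
    with forbid_sockets():
        assert process_offline("mail alice@example.com") == "mail [Email_1]"
```

A guard like this only catches code that goes through the socket module; an extension that talks to the operating system directly would not be seen by it.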

The CLI and webapp both depend on core and present different UIs for the same operations.
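The layering can be pictured as a thin argument-parsing wrapper around a shared function. This is a minimal sketch, not the project's actual CLI: the text subcommand, the anonymizer program name, and anonymize_text are assumptions made for illustration.

```python
import argparse


def anonymize_text(text: str) -> str:
    # Hypothetical stand-in for an operation that would live in core.
    return text.replace("Alice", "[Person_1]")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="anonymizer")
    sub = parser.add_subparsers(dest="command", required=True)
    # "text" is an invented subcommand used only for this example.
    text_cmd = sub.add_parser("text", help="anonymize a string")
    text_cmd.add_argument("value")
    return parser


def main(argv: list[str]) -> str:
    args = build_parser().parse_args(argv)
    # The CLI layer only parses arguments; the work happens in the shared function.
    return anonymize_text(args.value)
```

A web handler would call the same anonymize_text function, so neither front end carries logic of its own.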

A separate anonymizer.updater process handles in-app updates — see Updates.

Data flow for a single file

Parse: core.parsers reads .docx (python-docx + lxml), .pdf (PyMuPDF), or .xlsx (openpyxl + lxml).
Detect: registered detectors run over the parsed text. Regex detectors (emails, phones, IBANs) and NER detectors (spaCy + Natasha for names, addresses, companies) emit Entity records with offsets.
Tokenize: TokenManager assigns stable IDs per category ([Person_1], [Person_2], …) within a session.
Write: core.writers produces a new file with detected spans replaced by token strings. The original structure (table cells, page layout, cell formulas) is preserved.
Metadata clear: core.metadata strips authorship, comment history, and revision info from the output container.
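The Detect, Tokenize, and Write steps can be sketched on plain text. The names Entity and TokenManager come from the description above, but their fields and methods here (start, end, token_for, and so on) are assumptions, as are detect_emails and replace_spans. Real writers operate on document structure, not on a flat string.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    category: str
    start: int  # offset of the first character
    end: int    # offset one past the last character
    text: str


# Deliberately simple email pattern, good enough for the example.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def detect_emails(text: str) -> list[Entity]:
    return [Entity("Email", m.start(), m.end(), m.group())
            for m in EMAIL_RE.finditer(text)]


class TokenManager:
    """Hands out one stable token per (category, value) within a session."""

    def __init__(self) -> None:
        self._tokens: dict[tuple[str, str], str] = {}
        self._counters: dict[str, int] = {}

    def token_for(self, entity: Entity) -> str:
        key = (entity.category, entity.text)
        if key not in self._tokens:
            n = self._counters.get(entity.category, 0) + 1
            self._counters[entity.category] = n
            self._tokens[key] = f"[{entity.category}_{n}]"
        return self._tokens[key]


def replace_spans(text: str, entities: list[Entity], tokens: TokenManager) -> str:
    # Assign tokens in reading order so numbering follows first appearance.
    ordered = sorted(entities, key=lambda e: e.start)
    replacements = [(e, tokens.token_for(e)) for e in ordered]
    # Apply right to left so the offsets of earlier spans stay valid.
    for entity, token in reversed(replacements):
        text = text[:entity.start] + token + text[entity.end:]
    return text
```

Because the same TokenManager instance is reused, a value that appears twice, or in a second document of the same session, maps to the same token.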

The original file is never modified.
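A simple way to hold to that guarantee is to derive a separate output path and create it exclusively. The sketch below assumes a hypothetical _anonymized suffix and hypothetical helper names; the project's real naming scheme may differ.

```python
from pathlib import Path


def output_path_for(source: Path, suffix: str = "_anonymized") -> Path:
    # report.docx -> report_anonymized.docx, in the same directory.
    return source.with_name(f"{source.stem}{suffix}{source.suffix}")


def write_output(source: Path, data: bytes) -> Path:
    target = output_path_for(source)
    # Mode "xb" fails with FileExistsError if the target already exists,
    # so neither the source nor an earlier output is overwritten.
    with target.open("xb") as fh:
        fh.write(data)
    return target
```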

Pluggable detectors

Detectors implement a simple interface:

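from __future__ import annotations  # ParsedDoc and Entity are defined elsewhere in core

from typing import Protocol


class Detector(Protocol):
    name: str                 # detector name
    category: str             # category of the entities this detector emits
    enabled_by_default: bool  # whether the detector runs unless switched off

    def detect(self, doc: ParsedDoc) -> list[Entity]: ...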

Detectors are registered in core.detection.registry. See Detectors catalog.

Pluggable language packs

NER detectors delegate to a LanguagePack. Russian (Natasha) and English (spaCy) ship with MVP-0. Adding a new language never requires changes to core.detection or core.tokens.
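The delegation can be pictured as follows. This is a sketch under stated assumptions: the LanguagePack members shown here (language, find_entities), the Span type, and the NerDetector and GazetteerPack classes are invented for illustration. The toy pack matches known names literally, where a real pack would wrap an NER model such as spaCy or Natasha.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Span:
    label: str
    start: int
    end: int


class LanguagePack(Protocol):
    language: str

    def find_entities(self, text: str) -> list[Span]: ...


class GazetteerPack:
    """Toy pack: literal lookup of known names instead of a real NER model."""

    def __init__(self, language: str, names: dict[str, str]) -> None:
        self.language = language
        self._names = names  # surface form -> label

    def find_entities(self, text: str) -> list[Span]:
        spans: list[Span] = []
        for name, label in self._names.items():
            start = text.find(name)
            while start != -1:
                spans.append(Span(label, start, start + len(name)))
                start = text.find(name, start + len(name))
        return sorted(spans, key=lambda s: s.start)


class NerDetector:
    """Knows nothing about any particular language; it only dispatches."""

    def __init__(self) -> None:
        self._packs: dict[str, LanguagePack] = {}

    def add_pack(self, pack: LanguagePack) -> None:
        self._packs[pack.language] = pack

    def detect(self, text: str, language: str) -> list[Span]:
        pack = self._packs.get(language)
        if pack is None:
            raise LookupError(f"no language pack for {language!r}")
        return pack.find_entities(text)
```

In this arrangement, supporting another language means writing a new pack and calling add_pack; the detector class itself stays unchanged.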

See Language packs.