Architecture
Three layers
┌───────────────────────────────────────────┐
│ webapp (FastAPI + htmx on 127.0.0.1)      │ ← user interaction
├───────────────────────────────────────────┤
│ cli    (argparse: anonymize subcommands)  │ ← scripts / automation
├───────────────────────────────────────────┤
│ core   (parsers, detection, tokens,       │ ← pure Python, no network
│         writers, language packs)          │
└───────────────────────────────────────────┘

core is a regular Python package with no network dependency. An integration test asserts that no socket opens during normal processing.
The CLI and webapp both depend on core and present different UIs for the same operations.
A separate anonymizer.updater process handles in-app updates — see Updates.
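The no-network guarantee for core could be checked with a test along the following lines. This is a minimal sketch, not the project's actual integration test: `forbid_sockets`, `NetworkAccessError`, and `process_text` are hypothetical names, and `process_text` is a stand-in for a real call into core.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code under test tries to create a socket."""


@contextmanager
def forbid_sockets():
    """Temporarily replace socket.socket so any attempt to open one fails."""
    original = socket.socket

    def guard(*args, **kwargs):
        raise NetworkAccessError("socket opened during processing")

    socket.socket = guard
    try:
        yield
    finally:
        socket.socket = original  # always restore the real class


def process_text(text: str) -> str:
    """Hypothetical stand-in for a call into core; performs no I/O."""
    return text.upper()


def test_processing_opens_no_socket():
    # Any socket creation inside the block raises NetworkAccessError
    with forbid_sockets():
        assert process_text("hello") == "HELLO"
```

Patching `socket.socket` catches most pure-Python network access, because the standard library and most HTTP clients create their connections through it.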
Data flow for a single file
- Parse: core.parsers reads .docx (python-docx + lxml), .pdf (PyMuPDF), or .xlsx (openpyxl + lxml)
- Detect: registered detectors run over the parsed text. Regex detectors (emails, phones, IBANs) and NER detectors (spaCy + Natasha for names, addresses, companies) emit Entity records with offsets
- Tokenize: TokenManager assigns stable IDs per category ([Person_1], [Person_2], …) within a session
- Write: core.writers produces a new file with detected spans replaced by token strings. Original structure (table cells, page layout, cell formulas) is preserved
- Metadata clear: core.metadata strips authorship, comment history, and revision info from the output container
The original file is never modified.
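The detect, tokenize, and write steps can be illustrated on a plain string. This is a simplified sketch under stated assumptions: the `Entity` fields, the `TokenManager.token_for` method, and the `detect_emails` and `anonymize` helpers are hypothetical and only mirror the behavior described above. The real core operates on parsed documents rather than raw strings.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """Assumed shape: a detected span with character offsets."""
    category: str
    start: int
    end: int
    text: str


EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def detect_emails(text: str) -> list[Entity]:
    """Regex detector: emit one Entity per email address found."""
    return [Entity("Email", m.start(), m.end(), m.group())
            for m in EMAIL_RE.finditer(text)]


@dataclass
class TokenManager:
    """Assigns a stable token per (category, value) within one session."""
    _tokens: dict[tuple[str, str], str] = field(default_factory=dict)
    _counters: dict[str, int] = field(default_factory=dict)

    def token_for(self, entity: Entity) -> str:
        key = (entity.category, entity.text)
        if key not in self._tokens:
            n = self._counters.get(entity.category, 0) + 1
            self._counters[entity.category] = n
            self._tokens[key] = f"[{entity.category}_{n}]"
        return self._tokens[key]


def anonymize(text: str, tokens: TokenManager) -> str:
    """Return a new string; the input is left untouched."""
    entities = sorted(detect_emails(text), key=lambda e: e.start)
    # Assign tokens in reading order so numbering follows the document
    replacements = [(e, tokens.token_for(e)) for e in entities]
    # Replace from the end so earlier offsets stay valid
    out = text
    for e, tok in reversed(replacements):
        out = out[:e.start] + tok + out[e.end:]
    return out
```

Reusing one `TokenManager` across calls is what keeps IDs stable within a session: the same value maps to the same token in every file processed.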
Pluggable detectors
Detectors implement a simple interface:
    from typing import Protocol

    class Detector(Protocol):
        name: str
        category: str
        enabled_by_default: bool

        def detect(self, doc: ParsedDoc) -> list[Entity]: ...

Detectors are registered in core.detection.registry. See Detectors catalog.
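As an illustration, a detector satisfying this interface might look like the sketch below. `ParsedDoc` and `Entity` here are simplified stand-ins for the real types, and `REGISTRY` and `register` are hypothetical: the actual API of core.detection.registry is not shown in this document. The IBAN pattern is deliberately loose and skips checksum validation.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass
class ParsedDoc:
    """Stand-in for the real parsed document type (assumed to expose text)."""
    text: str


@dataclass(frozen=True)
class Entity:
    """Stand-in for the real Entity record."""
    category: str
    start: int
    end: int


class Detector(Protocol):
    name: str
    category: str
    enabled_by_default: bool

    def detect(self, doc: ParsedDoc) -> list[Entity]: ...


class IbanDetector:
    """Loose regex detector; matches the IBAN shape without checksum checks."""
    name = "iban"
    category = "IBAN"
    enabled_by_default = True

    _pattern = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

    def detect(self, doc: ParsedDoc) -> list[Entity]:
        return [Entity(self.category, m.start(), m.end())
                for m in self._pattern.finditer(doc.text)]


# Hypothetical registry: a plain dict keyed by detector name
REGISTRY: dict[str, Detector] = {}


def register(detector: Detector) -> None:
    REGISTRY[detector.name] = detector


register(IbanDetector())
```

Because `Detector` is a `Protocol`, `IbanDetector` does not need to inherit from it; matching attributes and a matching `detect` method are enough for a static type checker.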
Pluggable language packs
NER detectors delegate to a LanguagePack. Russian (Natasha) and English (spaCy) ship with MVP-0. Adding a new language never requires changes to core.detection or core.tokens.
See Language packs.
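The delegation described above can be sketched as follows. Everything here is an assumption made for illustration: the `LanguagePack` members (`code`, `find_names`), the `PACKS` mapping, and the toy `TitlePack`, which looks for a word after a fixed title instead of wrapping Natasha or spaCy. The point is only that `detect_names` stays unchanged when a new pack is registered.

```python
from typing import Protocol


class LanguagePack(Protocol):
    """Assumed interface; the real LanguagePack may differ."""
    code: str

    def find_names(self, text: str) -> list[tuple[int, int]]: ...


class TitlePack:
    """Toy pack: treats the word after a fixed title as a name."""

    def __init__(self, code: str, titles: tuple[str, ...]):
        self.code = code
        self.titles = titles

    def find_names(self, text: str) -> list[tuple[int, int]]:
        spans = []
        for title in self.titles:
            pos = text.find(title + " ")
            while pos != -1:
                start = pos + len(title) + 1
                end = start
                # Extend the span over the following alphabetic word
                while end < len(text) and text[end].isalpha():
                    end += 1
                if end > start:
                    spans.append((start, end))
                pos = text.find(title + " ", end)
        return sorted(spans)


# Hypothetical pack registry keyed by language code
PACKS: dict[str, LanguagePack] = {}


def register_pack(pack: LanguagePack) -> None:
    PACKS[pack.code] = pack


def detect_names(text: str, lang: str) -> list[str]:
    """Language-agnostic detector: delegates span finding to the pack."""
    return [text[s:e] for s, e in PACKS[lang].find_names(text)]


register_pack(TitlePack("en", ("Mr.", "Ms.")))
register_pack(TitlePack("ru", ("г-н", "г-жа")))
```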