Architecture

Three layers

┌──────────────────────────────────────────┐
│ webapp (FastAPI + htmx on 127.0.0.1)     │ ← user interaction
├──────────────────────────────────────────┤
│ cli    (argparse: anonymize subcommands) │ ← scripts / automation
├──────────────────────────────────────────┤
│ core   (parsers, detection, tokens,      │ ← pure Python, no network
│         writers, language packs)         │
└──────────────────────────────────────────┘

core is a regular Python package with no network dependency. An integration test asserts that no socket opens during normal processing.
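The no-socket guarantee can be checked with a small guard like the one below. This is a minimal sketch, not the project's actual integration test: `forbid_sockets`, `NetworkAccessError`, and `process_document` are hypothetical names, and `process_document` merely stands in for a call into core.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code under test tries to create a socket."""


@contextmanager
def forbid_sockets():
    """Temporarily replace socket.socket so any attempt to open one fails."""
    original = socket.socket

    def guarded(*args, **kwargs):
        raise NetworkAccessError("socket opened during processing")

    socket.socket = guarded
    try:
        yield
    finally:
        socket.socket = original  # always restore the real class


def process_document(text: str) -> str:
    # Hypothetical stand-in for a call into core; it performs no I/O.
    return text.upper()


def test_processing_opens_no_socket() -> None:
    with forbid_sockets():
        assert process_document("hello") == "HELLO"
```

A guard of this kind only catches sockets created through the `socket` module attribute while the context is active; code that imported the class earlier with `from socket import socket` would bypass it.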

The CLI and webapp both depend on core and present different UIs for the same operations.

A separate anonymizer.updater process handles in-app updates — see Updates.

Data flow for a single file

  1. Parse: core.parsers reads .docx (python-docx + lxml), .pdf (PyMuPDF), or .xlsx (openpyxl + lxml)
  2. Detect: registered detectors run over the parsed text. Regex detectors (emails, phones, IBANs) and NER detectors (spaCy + Natasha for names, addresses, companies) emit Entity records with offsets
  3. Tokenize: TokenManager assigns stable IDs per category ([Person_1], [Person_2], …) within a session
  4. Write: core.writers produces a new file with detected spans replaced by token strings. Original structure (table cells, page layout, cell formulas) is preserved
  5. Metadata clear: core.metadata strips authorship, comment history, and revision info from the output container

The original file is never modified.
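The tokenize and write steps can be illustrated on a plain string. This is a simplified sketch under stated assumptions, not the real implementation: the `Entity` fields, the `token_for` method, and the `replace_spans` helper are hypothetical, spans are assumed not to overlap, and reusing one token for a repeated value is an assumed reading of "stable IDs".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Hypothetical record shape: a category plus character offsets.
    category: str
    start: int
    end: int


class TokenManager:
    """Assigns one token per distinct (category, value) pair in a session."""

    def __init__(self) -> None:
        self._tokens = {}                  # (category, value) -> token string
        self._counters = defaultdict(int)  # category -> last number issued

    def token_for(self, category: str, value: str) -> str:
        key = (category, value)
        if key not in self._tokens:
            self._counters[category] += 1
            self._tokens[key] = f"[{category}_{self._counters[category]}]"
        return self._tokens[key]


def replace_spans(text: str, entities: list[Entity], tokens: TokenManager) -> str:
    """Return a new string with each entity span replaced by its token."""
    # Assign tokens in reading order so numbering follows first appearance.
    ordered = sorted(entities, key=lambda e: e.start)
    assigned = [(e, tokens.token_for(e.category, text[e.start:e.end])) for e in ordered]
    # Substitute from the end so earlier offsets stay valid.
    for e, token in reversed(assigned):
        text = text[:e.start] + token + text[e.end:]
    return text


text = "Alice met Bob. Alice left."
entities = [Entity("Person", 0, 5), Entity("Person", 10, 13), Entity("Person", 15, 20)]
print(replace_spans(text, entities, TokenManager()))
# → [Person_1] met [Person_2]. [Person_1] left.
```

Because `replace_spans` builds and returns a new string, the input is left untouched, which mirrors the rule that the original file is never modified.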

Pluggable detectors

Detectors implement a simple interface:

from typing import Protocol

class Detector(Protocol):
    name: str
    category: str
    enabled_by_default: bool

    # ParsedDoc and Entity are core types; Entity records carry offsets.
    def detect(self, doc: ParsedDoc) -> list[Entity]: ...

Detectors are registered in core.detection.registry. See Detectors catalog.
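As an illustration of the interface, a regex detector for email addresses might look like the sketch below. The `Entity` and `ParsedDoc` definitions here are hypothetical minimal stand-ins (the real types live in core and may differ), `EmailDetector` is an invented example class, and the pattern is deliberately simple rather than a complete address grammar.

```python
import re
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Entity:
    # Hypothetical fields; the docs only say records carry offsets.
    category: str
    start: int
    end: int
    text: str


@dataclass
class ParsedDoc:
    # Hypothetical minimal shape: the plain text extracted by a parser.
    text: str


class Detector(Protocol):
    name: str
    category: str
    enabled_by_default: bool

    def detect(self, doc: ParsedDoc) -> list[Entity]: ...


class EmailDetector:
    """Regex detector; it satisfies Detector structurally, without inheritance."""

    name = "email"
    category = "Email"
    enabled_by_default = True

    # Simplified pattern for illustration only.
    _pattern = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

    def detect(self, doc: ParsedDoc) -> list[Entity]:
        return [
            Entity(self.category, m.start(), m.end(), m.group())
            for m in self._pattern.finditer(doc.text)
        ]


detector: Detector = EmailDetector()
found = detector.detect(ParsedDoc("Write to ann@example.com or bob@example.org."))
print([(e.text, e.start, e.end) for e in found])
```

Since `Detector` is a `Protocol`, any class with matching attributes and a `detect` method is accepted by a type checker; how it is then added to the registry is not shown here.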

Pluggable language packs

NER detectors delegate to a LanguagePack. Russian (Natasha) and English (spaCy) ship with MVP-0. Adding a new language never requires changes to core.detection or core.tokens.

See Language packs.
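The separation between detection code and language packs can be sketched as follows. Everything in this example is hypothetical: the real LanguagePack interface is not shown on this page, so the `code` attribute, the `find_names` method, and the dictionary-based lookup are assumptions, and `ToyEnglishPack` uses a fixed word list in place of a real NER model.

```python
from typing import Protocol


class LanguagePack(Protocol):
    # Hypothetical interface, for illustration only.
    code: str

    def find_names(self, text: str) -> list[tuple[int, int]]: ...


class ToyEnglishPack:
    """Stand-in for an NER model: finds names from a fixed list."""

    code = "en"
    _known = ("Alice", "Bob")

    def find_names(self, text: str) -> list[tuple[int, int]]:
        spans = []
        for name in self._known:
            start = text.find(name)
            while start != -1:
                spans.append((start, start + len(name)))
                start = text.find(name, start + 1)
        return sorted(spans)


class NameDetector:
    """Detection code that depends only on the LanguagePack interface."""

    def __init__(self, packs: dict[str, LanguagePack]) -> None:
        self._packs = packs

    def detect(self, text: str, language: str) -> list[tuple[int, int]]:
        # Delegate to whichever pack is registered for the language.
        return self._packs[language].find_names(text)


packs: dict[str, LanguagePack] = {"en": ToyEnglishPack()}
detector = NameDetector(packs)
print(detector.detect("Alice met Bob.", "en"))
# → [(0, 5), (10, 13)]
```

In a design like this, supporting another language means adding one more entry to `packs`; `NameDetector` itself stays unchanged.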