Tokens

When a detector finds an entity, the entity is not replaced by a generic placeholder. Instead, TokenManager assigns it a stable, numbered token:

Иван Петров → [Person_1]
ООО Ромашка → [Company_1]
i.petrov@example.com → [Email_1]

If Иван Петров appears five times in a document, all five occurrences become [Person_1]. A second person becomes [Person_2].
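
The numbering rule above can be sketched in a few lines of Python. This is only an illustration, not the real TokenManager: the class name `SketchTokenManager`, the method `token_for`, and the exact-string matching of entities are all assumptions made for the example.

```python
from collections import defaultdict


class SketchTokenManager:
    """Illustrative stand-in for TokenManager (hypothetical, not the real API)."""

    def __init__(self):
        # (category, entity text) -> token already issued in this session
        self._tokens = {}
        # category -> highest number issued so far in this session
        self._counters = defaultdict(int)

    def token_for(self, category, text):
        """Return the existing token for an entity, or issue the next number."""
        key = (category, text)
        if key not in self._tokens:
            self._counters[category] += 1
            self._tokens[key] = f"[{category}_{self._counters[category]}]"
        return self._tokens[key]


manager = SketchTokenManager()
first = manager.token_for("Person", "Иван Петров")
again = manager.token_for("Person", "Иван Петров")    # repeat occurrence
other = manager.token_for("Person", "Анна Смирнова")  # a second person
company = manager.token_for("Company", "ООО Ромашка")

print(first, again, other, company)
# [Person_1] [Person_1] [Person_2] [Company_1]
```

Each category keeps its own counter, so the first company is [Company_1] even after two people have been numbered.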

Why numbered tokens

AI tools reading the redacted document can still tell:

  • Which entities are the same (co-reference resolution preserved)
  • Which category each token belongs to (Person, Company, Tax_ID, …)
  • Where structural relationships hold ([Person_1] is the CEO of [Company_2])

A flat [REDACTED] would lose all of that.

Session vs cross-session

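Scope                                          | Stable?
-----------------------------------------------|--------------------------------------------------
Within one anonymization run                   | Yes: the same entity always gets the same number
Across two separate runs of the same document  | No: fresh numbering each time
Across runs of different documents             | No: independent numbering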

This is by design. Cross-session stable tokens would require storing a mapping from real entity to token ID somewhere, and any such stored mapping is a PII leak vector. If you need to compare two redacted documents, run both in a single batch via the CLI.
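
To see why batching matters, here is a minimal sketch of the difference between separate runs and one shared run. The function `redact_batch`, its pre-detected `entities` input, and the plain string replacement are hypothetical simplifications; the real tool's detectors and CLI are not modelled here.

```python
def redact_batch(documents, entities):
    """Redact documents using one in-memory mapping shared across the batch.

    `entities` is a list of (category, text) pairs assumed to have been
    found by a detector already (hypothetical input for this sketch).
    """
    mapping = {}   # (category, text) -> token, lives only for this call
    counters = {}  # category -> highest number issued so far
    redacted = []
    for doc in documents:
        for category, text in entities:
            if text not in doc:
                continue
            key = (category, text)
            if key not in mapping:
                counters[category] = counters.get(category, 0) + 1
                mapping[key] = f"[{category}_{counters[category]}]"
            doc = doc.replace(text, mapping[key])
        redacted.append(doc)
    return redacted  # the mapping is discarded here, nothing is persisted


entities = [("Person", "Анна Смирнова"), ("Person", "Иван Петров")]
doc_a = "Анна Смирнова met Иван Петров."
doc_b = "Иван Петров signed the contract."

# Two separate runs: each starts numbering from 1.
separate_a = redact_batch([doc_a], entities)[0]
separate_b = redact_batch([doc_b], entities)[0]

# One batch: both documents share the same numbering.
batch_a, batch_b = redact_batch([doc_a, doc_b], entities)

print(separate_a)  # [Person_1] met [Person_2].
print(separate_b)  # [Person_1] signed the contract.
print(batch_b)     # [Person_2] signed the contract.
```

In the separate runs, [Person_1] refers to a different person in each output, so the two redacted documents cannot be compared. In the batch, Иван Петров is [Person_2] in both.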

Token format

[<Category>_<N>]
  • Category is one of Person, Company, Email, Phone, Tax_ID, IBAN, Card, Address, Date, IP, MAC, URL
  • N is a positive integer, starting from 1, unique within the category for that session

The brackets are part of the token. AI tools should treat the whole bracketed expression as a single opaque term.
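
A consumer that needs to locate tokens in redacted text could match the format with a regular expression. The sketch below is one way to do that in Python; the names `TOKEN_RE` and `find_tokens` are made up for the example, and rejecting leading zeros (such as [Person_01]) is an assumption, since the format only says N is a positive integer.

```python
import re

# Categories as listed in the token format above.
CATEGORIES = (
    "Person", "Company", "Email", "Phone", "Tax_ID", "IBAN",
    "Card", "Address", "Date", "IP", "MAC", "URL",
)

# [<Category>_<N>] with N >= 1; the brackets are part of the match.
TOKEN_RE = re.compile(r"\[(" + "|".join(CATEGORIES) + r")_([1-9][0-9]*)\]")


def find_tokens(text):
    """Return (token, category, number) for every token found in text."""
    return [
        (m.group(0), m.group(1), int(m.group(2)))
        for m in TOKEN_RE.finditer(text)
    ]


sample = (
    "[Person_1] is the CEO of [Company_2]; tax ID [Tax_ID_1]. "
    "[REDACTED] and [Person_0] are not tokens."
)
tokens = find_tokens(sample)
print(tokens)
# [('[Person_1]', 'Person', 1), ('[Company_2]', 'Company', 2), ('[Tax_ID_1]', 'Tax_ID', 1)]
```

Note that Tax_ID contains an underscore itself, so splitting a token on the first underscore would misparse it; matching against the known category list avoids that.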