Tokens
When a detector finds an entity, it doesn’t get replaced by a generic placeholder. Instead, TokenManager assigns a stable, numbered token:
Иван Петров → [Person_1]
ООО Ромашка → [Company_1]
i.petrov@example.com → [Email_1]

If Иван Петров appears five times in a document, all five occurrences become [Person_1]. A second person becomes [Person_2].
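The numbering behaviour described above can be sketched in a few lines of Python. This is only an illustration: the class name `SketchTokenManager` and its `token_for` method are hypothetical and are not the real TokenManager API, and the second person's name is invented for the example.

```python
from collections import defaultdict


class SketchTokenManager:
    """Illustrative stand-in; NOT the real TokenManager API."""

    def __init__(self):
        # (category, entity text) -> token already issued in this session
        self._tokens = {}
        # category -> number of tokens issued so far in that category
        self._counters = defaultdict(int)

    def token_for(self, category, text):
        """Return the existing token for an entity, or issue the next number."""
        key = (category, text)
        if key not in self._tokens:
            self._counters[category] += 1
            self._tokens[key] = f"[{category}_{self._counters[category]}]"
        return self._tokens[key]


manager = SketchTokenManager()
first = manager.token_for("Person", "Иван Петров")
again = manager.token_for("Person", "Иван Петров")    # same entity, same token
other = manager.token_for("Person", "Анна Смирнова")  # new person, next number
company = manager.token_for("Company", "ООО Ромашка") # counters are per category

print(first, again, other, company)
# [Person_1] [Person_1] [Person_2] [Company_1]
```

Keeping one counter per category is what lets [Person_1] and [Company_1] coexist without colliding.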
Why numbered tokens
AI tools reading the redacted document can still tell:
- Which entities are the same (co-reference resolution preserved)
- Which category each token belongs to (Person, Company, Tax_ID, …)
- Where structural relationships hold ([Person_1] is the CEO of [Company_2])
A flat [REDACTED] would lose all of that.
Session vs cross-session
| Scope | Stable? |
|---|---|
| Within one anonymization run | Yes — same entity always gets the same number |
| Across two separate runs of the same document | No — fresh numbering each time |
| Across runs of different documents | No — independent numbering |
This is by design. Cross-session stable tokens would require storing a mapping from real entity to token ID somewhere, and that stored mapping is a PII leak vector. If you need to compare two redacted documents, run both in a single batch via the CLI.
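A toy model may help show why separate runs are not comparable while a single batch is. The function `number_entities` below is a hypothetical sketch, not the tool's code or its CLI; it handles only the Person category and takes pre-detected names as input, and the names are made up.

```python
def number_entities(documents):
    """Number names across ONE run; `documents` is a list of name lists."""
    seen = {}  # name -> token, lives only for the duration of this run
    redacted = []
    for names in documents:
        # setdefault returns the existing token or stores the next number
        redacted.append(
            [seen.setdefault(name, f"[Person_{len(seen) + 1}]") for name in names]
        )
    return redacted


doc_a = ["Alice", "Bob"]
doc_b = ["Bob", "Alice"]

# Two separate runs: numbering restarts, so Bob gets a different token each time.
run_a = number_entities([doc_a])
run_b = number_entities([doc_b])

# One batch: the mapping is shared, so tokens line up across both documents.
batch = number_entities([doc_a, doc_b])

print(run_a, run_b)
# [['[Person_1]', '[Person_2]']] [['[Person_1]', '[Person_2]']]
print(batch)
# [['[Person_1]', '[Person_2]'], ['[Person_2]', '[Person_1]']]
```

In the separate runs, Bob is [Person_2] in one output and [Person_1] in the other. In the batch he is [Person_2] in both, and the mapping is still discarded once the run ends.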
Token format
[<Category>_<N>]
- Category is one of Person, Company, Email, Phone, Tax_ID, IBAN, Card, Address, Date, IP, MAC, URL
- N is a positive integer, starting from 1, unique within the category for that session
The brackets are part of the token. AI tools should treat the whole bracketed expression as a single opaque term.
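For downstream tooling that needs to find tokens in redacted text, the format above maps onto a regular expression. The sketch below is an assumption-laden example, not part of the tool: `TOKEN_RE` and `find_tokens` are hypothetical names, and rejecting leading zeros in N is an assumption drawn from "positive integer, starting from 1".

```python
import re

# Category names as listed in the token format description.
CATEGORIES = ["Person", "Company", "Email", "Phone", "Tax_ID", "IBAN",
              "Card", "Address", "Date", "IP", "MAC", "URL"]

# Brackets are part of the token; N has no leading zero (assumption).
TOKEN_RE = re.compile(r"\[(" + "|".join(CATEGORIES) + r")_([1-9][0-9]*)\]")


def find_tokens(text):
    """Return (whole token, category, number) for every token in the text."""
    return [(m.group(0), m.group(1), int(m.group(2)))
            for m in TOKEN_RE.finditer(text)]


tokens = find_tokens("[Person_1] is the CEO of [Company_2], tax ID [Tax_ID_3].")
non_tokens = find_tokens("[REDACTED] and [Person_0] are not valid tokens.")

print(tokens)
# [('[Person_1]', 'Person', 1), ('[Company_2]', 'Company', 2), ('[Tax_ID_3]', 'Tax_ID', 3)]
print(non_tokens)
# []
```

Returning the whole bracketed string alongside its parts keeps the token usable as a single opaque term while still exposing the category and number.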