Tokens

When a detector finds an entity, the entity is not replaced by a generic placeholder. Instead, TokenManager assigns it a stable, numbered token:

Иван Петров  → [Person_1]
ООО Ромашка  → [Company_1]
i.petrov@example.com → [EMAIL_1]

If Иван Петров appears five times in a document, all five occurrences become [Person_1]. A second person becomes [Person_2].
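The numbering rule can be sketched as follows. This is a minimal illustration, not the actual TokenManager implementation: the class name `TokenSession`, the method `token_for`, and matching on the exact entity string are assumptions made for the example, and the second person's name is invented.

```python
from collections import defaultdict


class TokenSession:
    """Hypothetical stand-in for one anonymization run (not the real TokenManager)."""

    def __init__(self):
        # (category, original text) -> token already assigned in this session
        self._assigned = {}
        # category -> last number handed out in this session
        self._counters = defaultdict(int)

    def token_for(self, category, text):
        """Return the existing token for an entity, or assign the next number."""
        key = (category, text)
        if key not in self._assigned:
            self._counters[category] += 1
            self._assigned[key] = f"[{category}_{self._counters[category]}]"
        return self._assigned[key]


session = TokenSession()
print(session.token_for("Person", "Иван Петров"))    # [Person_1]
print(session.token_for("Company", "ООО Ромашка"))   # [Company_1]
print(session.token_for("Person", "Иван Петров"))    # [Person_1] (same entity, same token)
print(session.token_for("Person", "Анна Смирнова"))  # [Person_2]
```

Each category keeps its own counter, so the first person and the first company both receive the number 1.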

Why numbered tokens

AI tools reading the redacted document can still tell:

Which entities are the same (co-reference resolution preserved)
Which category each token belongs to (Person, Company, NUMBER, …)
Where structural relationships hold ([Person_1] is the CEO of [Company_2])

A flat [REDACTED] would lose all of that.

Session vs cross-session

Scope	Stable?
Within one anonymization run	Yes — same entity always gets the same number
Across two separate runs of the same document	No — fresh numbering each time
Across runs of different documents	No — independent numbering

This is by design. Cross-session stable tokens would require persisting a mapping from real entity to token ID somewhere, and that stored mapping would itself be a PII leak vector.
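To illustrate the scoping, here is a small sketch in which the mapping exists only for the duration of one run. The function `assign_tokens` is hypothetical, and numbering by order of first appearance is an assumption of the example rather than documented behavior.

```python
def assign_tokens(entities):
    """Number (category, text) pairs for ONE run; nothing is stored afterwards."""
    mapping = {}   # exists only inside this call, never written to disk
    counters = {}  # category -> last number used in this run
    for category, text in entities:
        if (category, text) not in mapping:
            counters[category] = counters.get(category, 0) + 1
            mapping[(category, text)] = f"[{category}_{counters[category]}]"
    return mapping


# Two independent runs: numbering restarts from 1 each time.
run_a = assign_tokens([("Person", "Alice"), ("Person", "Bob")])
run_b = assign_tokens([("Person", "Bob"), ("Person", "Alice")])

print(run_a[("Person", "Alice")])  # [Person_1]
print(run_b[("Person", "Alice")])  # [Person_2]
```

Because no state is shared between runs, a token number carries meaning only inside the output of the run that produced it.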

Token format

[<Category>_<N>]

Category is one of Person, Company, ADDRESS, DATE, EMAIL, PHONE, NUMBER, DATA, CRYPTO
N is a positive integer, starting from 1, unique within the category for that session

NUMBER covers financial and government identifiers (IBAN, cards, tax IDs, passports, contract numbers, …). CRYPTO covers cryptocurrency wallet addresses, transaction hashes, and private/extended keys.

The brackets are part of the token. AI tools should treat the whole bracketed expression as a single opaque term.
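A consumer of redacted text could recognize tokens with a pattern like the one below. This is a sketch built from the format described above; the names `TOKEN_RE` and `find_tokens` are made up for the example, and rejecting leading zeros in N is an assumption.

```python
import re
from collections import Counter

CATEGORIES = ("Person", "Company", "ADDRESS", "DATE", "EMAIL",
              "PHONE", "NUMBER", "DATA", "CRYPTO")

# [<Category>_<N>] where N is a positive integer (no leading zero, by assumption)
TOKEN_RE = re.compile(r"\[(" + "|".join(CATEGORIES) + r")_([1-9][0-9]*)\]")


def find_tokens(text):
    """Return (category, number) pairs for every token, in order of appearance."""
    return [(m.group(1), int(m.group(2))) for m in TOKEN_RE.finditer(text)]


sample = "[Person_1] is the CEO of [Company_2]. Contact [Person_1] at [EMAIL_1]."
tokens = find_tokens(sample)
print(tokens)
# [('Person', 1), ('Company', 2), ('Person', 1), ('EMAIL', 1)]

# Repeated tokens refer to the same entity, so counting them is meaningful.
print(Counter(tokens)[("Person", 1)])  # 2
```

The pattern matches the brackets together with the category and number, which keeps each token a single unit during parsing.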