Language packs
A language pack bundles the NER model and the tokenization rules for one language. MVP-0 ships with two: Russian (via Natasha) and English (via spaCy en_core_web_lg).
What a pack provides
- An NER model that emits Entity records for Persons, Companies, and Locations
- A tokenizer that respects the language’s sentence boundaries and case
- Optional surname/given-name postprocessing (Russian patronymics, English suffixes like Jr.)
Packs are loaded on demand. If you process only .docx files in English, the Russian pack stays on disk but never loads into memory.
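On-demand loading can be pictured with a small sketch. This is not the product's loader: the `LazyPackLoader` class and the stand-in factories are hypothetical names used only to illustrate the idea that a pack is built the first time it is requested and reused afterwards.

```python
from typing import Callable, Dict, List


class LazyPackLoader:
    """Hypothetical loader: builds a pack the first time it is requested."""

    def __init__(self, factories: Dict[str, Callable[[], object]]) -> None:
        self._factories = factories            # language code -> function that builds the pack
        self._loaded: Dict[str, object] = {}   # packs already in memory

    def get(self, lang: str) -> object:
        if lang not in self._loaded:
            # First request for this language: pay the load cost now.
            self._loaded[lang] = self._factories[lang]()
        return self._loaded[lang]

    def loaded_languages(self) -> List[str]:
        return sorted(self._loaded)


# Stand-in factories; a real pack would load its NER model and tokenizer here.
loader = LazyPackLoader({
    "ru": lambda: "russian-pack",
    "en": lambda: "english-pack",
})

loader.get("en")                  # only the English pack is built
print(loader.loaded_languages())  # ['en']
```

With this shape, a run that only ever asks for `"en"` never calls the Russian factory, which mirrors the behaviour described above.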
Active packs
The default ANONYMIZER_LANG_PACKS=ru,en activates both. To disable English (faster startup):

    export ANONYMIZER_LANG_PACKS=ru
    anonymize

The active set is detected from the document’s predominant language. If a .docx mixes RU and EN paragraphs, both packs run.
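A rough sketch of how the enabled set and the per-document active set might interact. Only the ANONYMIZER_LANG_PACKS variable and its default come from this page. The function names and the script-counting heuristic (Cyrillic versus Latin letters) are assumptions made for illustration; the product's actual language detector is not described here, and Cyrillic text is not necessarily Russian.

```python
import os
import re
from typing import List, Optional

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")


def enabled_packs(default: str = "ru,en") -> List[str]:
    """Read the comma-separated pack list from the environment."""
    raw = os.environ.get("ANONYMIZER_LANG_PACKS", default)
    return [code.strip().lower() for code in raw.split(",") if code.strip()]


def paragraph_language(text: str) -> Optional[str]:
    """Assumed heuristic: pick the script with more letters in the paragraph."""
    cyr = len(CYRILLIC.findall(text))
    lat = len(LATIN.findall(text))
    if cyr == 0 and lat == 0:
        return None  # no letters at all, e.g. a paragraph of digits
    return "ru" if cyr >= lat else "en"


def active_packs(paragraphs: List[str], enabled: List[str]) -> List[str]:
    """Keep only the enabled packs whose language appears in the document."""
    detected = {paragraph_language(p) for p in paragraphs}
    return [code for code in enabled if code in detected]


paragraphs = [
    "Договор подписал Иван Петров.",
    "The contract was signed by John Smith.",
]
print(active_packs(paragraphs, ["ru", "en"]))  # both packs run on a mixed document
print(active_packs(paragraphs, ["ru"]))        # English disabled via the variable
```

In this sketch a disabled pack is never selected, even when the document contains paragraphs in that language.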
Roadmap for additional languages
MVP-2 adds the remaining languages from the customer brief (German, Spanish, Italian, French, Portuguese, Dutch, Polish, Czech, Turkish, and others, 25 in total). Until then, language-agnostic regex detectors (emails, phones, IBANs, etc.) still work on documents in any language.
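To show why such detectors do not depend on a language pack, here is a minimal sketch with two deliberately simple patterns. The patterns, the `DETECTORS` table, and `find_identifiers` are illustrative assumptions, not the product's detectors; a production IBAN detector would also validate the checksum, and phone numbers are left out because their formats vary too much for a short example.

```python
import re
from typing import List, Tuple

# Simplified patterns for illustration only.
DETECTORS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Country code, two check digits, then groups of four with optional spaces.
    "IBAN": re.compile(
        r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
    ),
}


def find_identifiers(text: str) -> List[Tuple[str, str]]:
    """Return (label, matched text) pairs for every detector hit."""
    hits = []
    for label, pattern in DETECTORS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


# German text: no German pack exists yet, but the patterns still match.
text = "Kontakt: anna.schmidt@example.de, IBAN DE89 3704 0044 0532 0130 00"
print(find_identifiers(text))
```

The patterns match on character shape alone, so the surrounding language never enters the decision.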
Adding a custom pack
If you need a language not yet supported, see the engineer-facing guide in the product repo: docs/agents/extending-language.md. It walks through the LanguagePackRegistry registration.