Status: Complete · WordPiece-uncased (1.727 fertility) selected for AzBERT pretraining · Component 1 of the AzBERT pipeline
TL;DR#
I trained and compared three tokenizer algorithms — SentencePiece Unigram, SentencePiece BPE, and WordPiece — each in cased and uncased variants (6 total), on a preprocessed 100 GB Azerbaijani corpus. WordPiece-uncased was selected for AzBERT pretraining with fertility 1.727, UNK rate 0.0053%, and 4.572 characters per token — outperforming all SPM variants and mBERT (fertility 2.846). All six variants were evaluated on a held-out test set of 47,934 documents.
The entire preprocessing and training pipeline processes ~100 GB of raw text via MongoDB streaming cursors and multiprocessing — no data ever fits in memory.
Problem#
Azerbaijani is a morphologically rich agglutinative language with ~10 million speakers and almost no dedicated NLP infrastructure. Every Azerbaijani NLP application in production — including e-qanun.ai, which I also built — relies on multilingual tokenizers like mBERT or XLM-R that were never designed for Azerbaijani.
The cost of this is measurable. On our held-out test set, mBERT fragments the average Azerbaijani word into 2.85 subword tokens (fertility 2.846). A word like müqavilələrindən (from their contracts) becomes six pieces. This fragmentation wastes token budget, degrades attention locality, and limits downstream task performance on a language that packs dense semantic content per word.
The fix requires training a tokenizer on a large, clean, representative Azerbaijani corpus — which first requires building that corpus.
Corpus#
| Collection | Documents | Size | Source Type |
|---|---|---|---|
| hplt3_final | 11,000,000 | 32.3 GB | Filtered web crawl |
| fineweb2_final | 7,200,000 | 21.2 GB | Curated web |
| nllb_final | 96,000,000 | — | Sentence-level parallel |
| culturax_final | 5,000,000 | 13.0 GB | mC4 + OSCAR |
| cc100_final | 3,900,000 | 5.5 GB | Web crawl |
| news_v1 | 3,200,000 | 7.9 GB | News articles |
| court_cases | 1,900,000 | 1.3 GB | Legal text |
| pdfs_part1 | 1,600,000 | 1.2 GB | PDF extractions |
| e-qanun | 531,000 | 466 MB | Legislative text (100% included) |
| folklor | 338,000 | — | Folkloric / dialectal (100% included) |
Total raw corpus: ~100–150 GB, ~95.7 billion characters (measured post-normalization steps 1–3).
The folklor and e-qanun collections are included in full — folkloric text provides irreplaceable dialectal coverage, and legislative text is a high-signal, near-zero-noise domain. For the larger web crawl sources, sampling rates are set by per-collection character analysis: contamination rates, Cyrillic ratios, UTF-8 mojibake presence, and replacement character frequency.
Tokenizer training sample: 1,550,000 documents drawn across all 10 collections after evaluation holdout separation.
Preprocessing Pipeline#
All scripts follow the same architecture: MongoDB streaming cursor → multiprocessing pool → batch bulk_write. No collection is ever loaded into memory.
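A minimal sketch of that shared skeleton, assuming a text field and illustrative database/collection names (each step swaps in its own transform):

```python
from itertools import islice
from multiprocessing import Pool

from pymongo import MongoClient, UpdateOne

BATCH_SIZE = 1000

def transform(text: str) -> str:
    """Stand-in for the step-specific pass (sanitization, normalization, ...)."""
    return text.strip()

def process_batch(docs):
    """Worker: turn one batch of documents into bulk-write operations."""
    return [
        UpdateOne({"_id": d["_id"]}, {"$set": {"text": transform(d["text"])}})
        for d in docs
    ]

def batched(cursor, size):
    """Yield fixed-size chunks from a streaming cursor; nothing is fully materialized."""
    it = iter(cursor)
    while chunk := list(islice(it, size)):
        yield chunk

def main():
    coll = MongoClient()["corpus"]["hplt3_final"]  # names are illustrative
    cursor = coll.find({}, {"text": 1}, batch_size=BATCH_SIZE)  # server-side streaming
    with Pool() as pool:
        # imap keeps only a bounded window of in-flight batches in memory
        for ops in pool.imap(process_batch, batched(cursor, BATCH_SIZE)):
            coll.bulk_write(ops, ordered=False)

if __name__ == "__main__":
    main()
```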
Step 1 — Entity Sanitization#
Regex-based detection replaces structured entities with placeholder tokens that the tokenizer will learn as atomic units:
| Entity | Placeholder |
|---|---|
| URLs | [URL] |
| Email addresses | [EMAIL] |
| Phone numbers | [PHONE] |
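A minimal sketch of the replacement pass, with deliberately simplified patterns (the production regexes are more exhaustive); the input/output pair below shows the intended effect:

```python
import re

# Simplified, illustrative patterns; the production regexes are more exhaustive.
ENTITY_PATTERNS = [
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s()/-]{7,}\d"), "[PHONE]"),
]

def sanitize_entities(text: str) -> str:
    """Replace structured entities with placeholders the tokenizer learns as atomic units."""
    for pattern, placeholder in ENTITY_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```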
Input: "Visit https://example.com or call +994501234567"
Output: "Visit [URL] or call [PHONE]"Step 2 — Emoji Normalization#
All emoji are replaced with [EMOJI], using a precompiled regex covering all major Unicode emoji ranges. This step’s effect on tokenizer quality is isolated in Experiment 2: emoji removal alone reduces mean fertility by 0.215 points.
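A sketch of the precompiled pattern; the ranges below are a representative subset of what the full regex covers:

```python
import re

# Representative subset of Unicode emoji ranges (the full pattern covers more blocks).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # misc symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001FAFF"  # supplemental symbols & pictographs
    "\u2600-\u27BF"          # misc symbols / dingbats
    "]+"
)

def normalize_emoji(text: str) -> str:
    """Collapse each emoji run into a single [EMOJI] placeholder."""
    return EMOJI_RE.sub("[EMOJI]", text)
```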
Step 3 — Punctuation Normalization#
Visually similar punctuation canonicalized to ASCII equivalents using a translation table:
| Source | Canonical |
|---|---|
" " „ | " (U+0022) |
' ' | ' (U+0027) |
— – − | - (U+002D) |
… | ... |
| NBSP / thin space / zero-width space | regular space |
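A sketch of the translation-table approach, assuming Python's str.translate (shown with a subset of the mappings above; the full table covers more characters):

```python
# Subset of the canonicalization table; str.translate maps chars to strings in one pass.
PUNCT_TABLE = str.maketrans({
    "\u201c": '"', "\u201d": '"', "\u201e": '"',   # curly/low double quotes -> "
    "\u2018": "'", "\u2019": "'",                  # curly single quotes -> '
    "\u2014": "-", "\u2013": "-", "\u2212": "-",   # em dash, en dash, minus sign -> -
    "\u2026": "...",                               # horizontal ellipsis -> ...
    "\u00a0": " ", "\u2009": " ", "\u200b": " ",   # NBSP, thin space, zero-width -> space
})

def normalize_punct(text: str) -> str:
    """Canonicalize visually similar punctuation to ASCII equivalents."""
    return text.translate(PUNCT_TABLE)
```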
Step 4 — Character Statistics#
Corpus-wide character frequency analysis across all collections, streaming per-document into a shared Counter. Output (char_stats.json) drives all subsequent alphabet decisions — rare character thresholds, Cyrillic contamination rates, and vocabulary boundary choices.
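A hedged sketch of this statistics pass under the same streaming architecture (collection names illustrative; the real script iterates all ten collections):

```python
import json
from collections import Counter
from itertools import islice
from multiprocessing import Pool

from pymongo import MongoClient

def count_chars(texts):
    """Worker: character frequencies for one batch of documents."""
    counts = Counter()
    for text in texts:
        counts.update(text)
    return counts

def batched(iterable, size=1000):
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

def main():
    total = Counter()  # corpus-wide counter, merged in the parent process
    client = MongoClient()
    for name in ["hplt3_final", "fineweb2_final"]:  # ... all 10 collections
        texts = (d["text"] for d in client["corpus"][name].find({}, {"text": 1}))
        with Pool() as pool:
            for part in pool.imap_unordered(count_chars, batched(texts)):
                total.update(part)
    with open("char_stats.json", "w", encoding="utf-8") as f:
        json.dump(dict(total.most_common()), f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    main()
```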
Tokenizer Training#
After preprocessing, three algorithms were trained on the cleaned, sampled corpus (1.55M documents):
| Algorithm | Cased | Uncased |
|---|---|---|
| SentencePiece Unigram | spm_files_3 | spm_files_2 |
| SentencePiece BPE | spm_files_5 | spm_files_4 |
| WordPiece | wp_1 | wp_2 |
Shared Configuration:
- Vocabulary size: 64,000
- Normalizer: NFKC
- Pre-tokenizer: Whitespace (for WordPiece); none for SPM
- Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK], [EMOJI], [URL], [PHONE], [EMAIL]
WordPiece specifics:
- Continuation prefix: ##
- Max chars per word: 100
- Vocabulary composition (uncased): 79.6% whole-word, 17.7% continuation, 2.7% single-char tokens
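A sketch of the WordPiece-uncased setup with HuggingFace Tokenizers under this configuration (file paths are hypothetical; the NFKC-then-lowercase ordering is an assumption):

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers, trainers
from tokenizers.models import WordPiece

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                  "[EMOJI]", "[URL]", "[PHONE]", "[EMAIL]"]

tokenizer = Tokenizer(WordPiece(unk_token="[UNK]", max_input_chars_per_word=100))
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFKC(), normalizers.Lowercase()]  # uncased variant; drop Lowercase for cased
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=64_000,
    special_tokens=SPECIAL_TOKENS,
    continuing_subword_prefix="##",
)

# Plain-text shards exported from the sampled MongoDB corpus (hypothetical paths)
tokenizer.train(["shards/part_0.txt", "shards/part_1.txt"], trainer)
tokenizer.save("wp_2/tokenizer.json")
```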
All six variants were evaluated on a held-out test set (47,934 samples) with per-token metrics and suffix atomicity analysis.
Evaluation & Algorithm Comparison#
All six tokenizer variants were evaluated on a held-out test set of 47,934 Azerbaijani documents, with metrics spanning efficiency (fertility, i.e. mean subword tokens per word, and chars/token), reliability (UNK rate), and linguistic quality (suffix atomicity).
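For concreteness, a minimal sketch of how these corpus-level metrics can be computed (whitespace word-splitting and the ## prefix convention assumed; the actual evaluation harness may differ):

```python
from tokenizers import Tokenizer

def evaluate(tokenizer_path, docs):
    """Corpus-level fertility, UNK rate, and chars/token for one tokenizer."""
    tok = Tokenizer.from_file(tokenizer_path)
    words = tokens = unks = chars = 0
    for doc in docs:
        enc = tok.encode(doc)
        words += len(doc.split())                 # whitespace words (assumption)
        tokens += len(enc.tokens)
        unks += sum(t == "[UNK]" for t in enc.tokens)
        chars += sum(len(t[2:] if t.startswith("##") else t) for t in enc.tokens)
    return {
        "fertility_mean": tokens / words,         # mean subword tokens per word
        "unk_rate_pct": 100.0 * unks / tokens,
        "chars_per_token": chars / tokens,
    }
```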
Performance by Algorithm#
Ranking across key metrics (lower fertility and UNK are better):
| Metric | wp_2_uncased (Selected) | wp_1_uncased | spm_files_2_uncased | spm_files_4_uncased | wp_1_cased | spm_files_3_cased |
|---|---|---|---|---|---|---|
| Fertility mean | 1.727 ✓ | 1.756 | 1.838 | 1.845 | 1.833 | 1.932 |
| UNK rate | 0.0053% | 0.0012% | 0.0000% | 0.0000% | 0.0012% | 0.0000% |
| Chars / token | 4.572 | 4.537 | 4.791 | 4.779 | 4.375 | 4.601 |
| Continuation rate | 100% | 100% | 34.9% | 35.2% | 100% | 38.3% |
| Suffix atomicity | 100% | 100% | 90.9% | 95.5% | 100% | 72.7% |
| Vocab utilization | 51.1% | 51.3% | 51.8% | 53.1% | 53.9% | 54.1% |
Key observations:
WordPiece dominates on fertility. All WordPiece variants (1.727–1.833) outperform SPM variants (1.838–1.932) despite SPM’s byte-fallback coverage (0.0% UNK).
Uncased consistently beats cased. Across all algorithms, uncased variants come in 0.08–0.11 fertility points below their cased counterparts, with no loss in suffix atomicity (100% for WordPiece, 90–95% for uncased SPM).
WordPiece vs. SPM trade-off: WordPiece achieves 100% continuation rate (all subwords are learned tokens) and perfect suffix atomicity for both uncased variants; SPM variants have zero UNK via byte-fallback but fragment more aggressively. For sequence-efficient pretraining, WordPiece’s 1.727 fertility gains are worth the 0.0053% UNK cost.
Selection criterion: WordPiece-uncased (wp_2) selected for AzBERT pretraining based on lowest fertility, 100% suffix atomicity (preserves linguistic structure), and 51% vocabulary utilization (room for rare subwords without waste).
Comparison vs. Existing Tokenizers#
Final evaluation of selected tokenizer (wp_2_uncased) against established multilingual baselines:
| Metric | wp_2_uncased | HPLT (32k) | XLM-R (250k) | mBERT (119k) |
|---|---|---|---|---|
| Fertility mean | 1.727 | 2.068 | 2.167 | 2.846 |
| UNK rate | 0.0053% | 0.0% | 0.0002% | 0.0045% |
| Chars / token | 4.572 | 4.163 | 3.733 | 2.776 |
| Suffix atomicity | 100% | 45.5% | 100% | 100% |
Summary: WordPiece-uncased cuts fertility by 39% vs. mBERT, 20% vs. XLM-R, and 16% vs. HPLT, while maintaining full suffix atomicity and a practical vocabulary size (64k vs. XLM-R's 250k).
Selected Tokenizer: WordPiece-uncased (wp_2)#
Why WordPiece-uncased was selected for AzBERT pretraining:
| Criterion | Rationale |
|---|---|
| Fertility (1.727) | 16% lower than HPLT, 20% lower than XLM-R, 39% lower than mBERT. Minimizes token overhead on morphologically rich Azerbaijani. |
| Suffix atomicity (100%) | All subword boundaries are learned, preserving Azerbaijani morphology. SPM variants reach 90–95% uncased and as low as 72.7% cased (word-final characters sometimes dropped). |
| UNK rate (0.0053%) | Negligible; comparable to HPLT and mBERT. SPM’s zero UNK comes via byte-fallback, which degrades pretraining signal. |
| Vocabulary utilization (51.1%) | Balanced: 33k out of 64k tokens appear in training data, leaving room for domain-specific subwords in downstream tasks without overfitting. |
| Continuation rate (100%) | All subwords are in the vocabulary (no backoff to byte pairs or fallback). Improves downstream prediction and reduces inference latency. |
Key Findings#
| Finding | Evidence |
|---|---|
| WordPiece outperforms SPM on fertility | WordPiece 1.73–1.83, SPM 1.84–1.95. Gap widens when cased (+0.08–0.11). WordPiece’s learned continuation tokens beat SPM’s split-first approach for Azerbaijani. |
| Uncasing is a free lunch | Lowercasing reduces fertility by 0.08–0.11 across all algorithms with zero downside on suffix atomicity or UNK. Strongly recommended for pretraining. |
| SPM trades fertility for zero UNK via byte-fallback | SPM achieves 0.0% UNK (byte pairs handle any character) but sacrifices 0.11–0.20 fertility points. For Azerbaijani pretraining, rare-subword handling via learned UNK is more efficient. |
| WordPiece’s 100% suffix atomicity matters | All 22 common Azerbaijani suffixes are single tokens (##lar, ##lər, ##dan, etc.), preserving morphological structure for downstream tasks. SPM achieves 86–95% (spm_files_3 only 72.7% on cased). A verification sketch follows this table. |
| Vocabulary utilization of 51–54% is healthy | ~33k of 64k tokens appear in 1.55M training documents. Reserves room for domain-specific subwords without overfitting or token waste. |
| All three algorithms beat mBERT convincingly | Even worst-case (SPM cased, 1.932) beats mBERT (2.846) by 32%. Monolingual > multilingual for morphologically rich Azerbaijani. |
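As referenced above, a quick way to verify the suffix-atomicity claim against any trained variant (the suffix list is a sample of the 22 tested; the path is hypothetical):

```python
from tokenizers import Tokenizer

# Sample of the 22 common suffixes tested; each should be a single continuation token.
SUFFIXES = ["lar", "lər", "dan", "dən", "da", "də", "ın", "in"]

tok = Tokenizer.from_file("wp_2/tokenizer.json")  # hypothetical path
vocab = tok.get_vocab()
atomic = [s for s in SUFFIXES if f"##{s}" in vocab]
print(f"atomic suffixes: {len(atomic)}/{len(SUFFIXES)}")
```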
Stack#
Python · HuggingFace Tokenizers · SentencePiece · MongoDB · multiprocessing · NFKC normalization · regex
Component 1 of the AzBERT pretraining pipeline. WordPiece-uncased (fertility 1.727, 100% suffix atomicity) selected and ready for pretraining.