
Azerbaijani Tokenizer — Three Algorithms, 64k Vocab, 1.727 Fertility

Status: Complete · WordPiece-uncased (1.727 fertility) selected for AzBERT pretraining · Component 1 of the AzBERT pipeline

TL;DR

I trained and compared three tokenizer algorithms — SentencePiece Unigram, SentencePiece BPE, and WordPiece — each in cased and uncased variants (6 total), on a preprocessed 100 GB Azerbaijani corpus, and evaluated all six on a held-out set of 47,934 test samples. WordPiece-uncased was selected for AzBERT pretraining with a fertility of 1.727, a UNK rate of 0.0053%, and 4.572 characters per token — outperforming every SPM variant and mBERT (fertility 2.846).

The entire preprocessing and training pipeline processes ~100 GB of raw text via MongoDB streaming cursors and multiprocessing — no collection is ever loaded into memory in full.


Problem

Azerbaijani is a morphologically rich agglutinative language with ~10 million speakers and almost no dedicated NLP infrastructure. Every Azerbaijani NLP application in production — including e-qanun.ai, which I also built — relies on multilingual tokenizers like mBERT or XLM-R that were never designed for Azerbaijani.

The cost of this is measurable. mBERT fragments the average Azerbaijani word into 2.85 subword tokens (fertility 2.846 on the held-out test set below). A word like müqavilələrindən (from their contracts) becomes six pieces. This fragmentation wastes token budget, degrades attention locality, and limits downstream task performance in a language that packs dense semantic content into every word.

The fix requires training a tokenizer on a large, clean, representative Azerbaijani corpus — which first requires building that corpus.


Corpus

| Collection | Documents | Size | Source Type |
|---|---|---|---|
| hplt3_final | 11,000,000 | 32.3 GB | Filtered web crawl |
| fineweb2_final | 7,200,000 | 21.2 GB | Curated web |
| nllb_final | 96,000,000 | | Sentence-level parallel |
| culturax_final | 5,000,000 | 13.0 GB | mC4 + OSCAR |
| cc100_final | 3,900,000 | 5.5 GB | Web crawl |
| news_v1 | 3,200,000 | 7.9 GB | News articles |
| court_cases | 1,900,000 | 1.3 GB | Legal text |
| pdfs_part1 | 1,600,000 | 1.2 GB | PDF extractions |
| e-qanun | 531,000 | 466 MB | Legislative text (100% included) |
| folklor | 338,000 | | Folkloric / dialectal (100% included) |

Total raw corpus: ~100–150 GB, ~95.7 billion characters (measured after normalization steps 1–3).

The folklor and e-qanun collections are included in full — folkloric text provides irreplaceable dialectal coverage, and legislative text is a high-signal, near-zero-noise domain. For the larger web crawl sources, sampling rates are set by per-collection character analysis: contamination rates, Cyrillic ratios, UTF-8 mojibake presence, and replacement character frequency.

Tokenizer training sample: 1,550,000 documents drawn from all 10 collections after setting aside the evaluation holdout.


Preprocessing Pipeline

Tokenizer pipeline: corpus ingestion → entity sanitization → emoji normalization → punctuation normalization → character statistics → tokenizer training (3 algorithms × 2 variants) → evaluation & selection

All scripts follow the same architecture: MongoDB streaming cursor → multiprocessing pool → batch bulk_write. No collection is ever loaded into memory.
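
A minimal sketch of that shared skeleton, assuming pymongo; the connection string, batch size, and the clean_text step are illustrative placeholders:

```python
# Shared script skeleton: stream from MongoDB, clean in worker processes,
# write back in bulk. clean_text stands in for one preprocessing step.
from multiprocessing import Pool

from pymongo import MongoClient, UpdateOne

BATCH_SIZE = 1_000

def clean_text(text: str) -> str:
    """Placeholder for one pipeline step (e.g. entity sanitization)."""
    return text.strip()

def process_batch(batch):
    # Runs in a worker process; returns picklable UpdateOne operations.
    return [UpdateOne({"_id": _id}, {"$set": {"text": clean_text(text)}})
            for _id, text in batch]

def run(collection_name: str) -> None:
    coll = MongoClient("mongodb://localhost:27017")["corpus"][collection_name]
    cursor = coll.find({}, {"text": 1}).batch_size(BATCH_SIZE)  # streaming cursor
    with Pool(processes=8) as pool:
        batch, pending = [], []
        for doc in cursor:  # only one batch of documents is in memory at a time
            batch.append((doc["_id"], doc.get("text", "")))
            if len(batch) == BATCH_SIZE:
                pending.append(pool.apply_async(process_batch, (batch,)))
                batch = []
            while pending and pending[0].ready():  # flush finished work
                coll.bulk_write(pending.pop(0).get(), ordered=False)
        if batch:
            pending.append(pool.apply_async(process_batch, (batch,)))
        for result in pending:
            coll.bulk_write(result.get(), ordered=False)

if __name__ == "__main__":
    run("news_v1")
```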

Step 1 — Entity Sanitization

Regex-based detection replaces structured entities with placeholder tokens that the tokenizer will learn as atomic units:

| Entity | Placeholder |
|---|---|
| URLs | [URL] |
| Email addresses | [EMAIL] |
| Phone numbers | [PHONE] |

Input:  "Visit https://example.com or call +994501234567"
Output: "Visit [URL] or call [PHONE]"
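
A sketch of this step; the patterns below are simplified stand-ins for the production regexes:

```python
# Entity sanitization: replace structured entities with placeholder tokens.
# These patterns are illustrative, not the exact production ones.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
PHONE_RE = re.compile(r"\+?\d[\d\s\-()]{7,}\d")

def sanitize_entities(text: str) -> str:
    text = URL_RE.sub("[URL]", text)      # URLs first: they may contain @ or digits
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(sanitize_entities("Visit https://example.com or call +994501234567"))
# -> Visit [URL] or call [PHONE]
```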

Step 2 — Emoji Normalization

All emoji are replaced with [EMOJI] using a precompiled regex that covers all major Unicode emoji ranges. This step’s effect on tokenizer quality is isolated in Experiment 2: emoji removal alone reduces mean fertility by 0.215 points.
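
A sketch of the replacement, with only a few representative emoji blocks spelled out (the production pattern covers all major ranges):

```python
# Emoji normalization: collapse any emoji run into a single [EMOJI] token.
import re

EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\u2600-\u27BF"          # misc symbols & dingbats
    "]+"
)

def normalize_emoji(text: str) -> str:
    return EMOJI_RE.sub("[EMOJI]", text)

print(normalize_emoji("Salam 👋 dünya 🌍"))  # -> Salam [EMOJI] dünya [EMOJI]
```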

Step 3 — Punctuation Normalization

Visually similar punctuation canonicalized to ASCII equivalents using a translation table:

| Source | Canonical |
|---|---|
| Typographic double quotes (“ ” „) | " (U+0022) |
| Typographic single quotes / apostrophes (‘ ’) | ' (U+0027) |
| Dashes (– —) | - (U+002D) |
| Ellipsis (…) | ... (three periods) |
| NBSP / thin space / zero-width space | Regular space |
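
A sketch of the translation-table approach via str.maketrans; the mappings shown are representative, not the full table:

```python
# Punctuation canonicalization with a single translation table.
PUNCT_TABLE = str.maketrans({
    "\u201C": '"', "\u201D": '"', "\u201E": '"',  # curly double quotes -> U+0022
    "\u2018": "'", "\u2019": "'",                  # curly single quotes -> U+0027
    "\u2013": "-", "\u2014": "-",                  # en/em dashes -> U+002D
    "\u2026": "...",                               # ellipsis -> three periods
    "\u00A0": " ", "\u2009": " ", "\u200B": " ",   # NBSP / thin / zero-width -> space
})

def normalize_punct(text: str) -> str:
    return text.translate(PUNCT_TABLE)

print(normalize_punct("\u201CSalam\u201D \u2013 d\u00FCnya\u2026"))
# -> "Salam" - dünya...
```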

Step 4 — Character Statistics

Corpus-wide character frequency analysis across all collections, streaming per-document into a shared Counter. Output (char_stats.json) drives all subsequent alphabet decisions — rare character thresholds, Cyrillic contamination rates, and vocabulary boundary choices.
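
A single-process sketch of this pass (the production version streams every collection and merges per-worker counters); connection details are illustrative:

```python
# Character statistics: stream documents, accumulate a corpus-wide Counter,
# and dump it as char_stats.json for downstream alphabet decisions.
import json
from collections import Counter

from pymongo import MongoClient

def char_stats(collection_names):
    db = MongoClient("mongodb://localhost:27017")["corpus"]
    counts = Counter()
    for name in collection_names:
        for doc in db[name].find({}, {"text": 1}):  # streaming cursor
            counts.update(doc.get("text", ""))      # per-document update
    return counts

if __name__ == "__main__":
    stats = char_stats(["e-qanun", "folklor"])
    with open("char_stats.json", "w", encoding="utf-8") as f:
        json.dump(dict(stats.most_common()), f, ensure_ascii=False, indent=2)
```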


Tokenizer Training

After preprocessing, three algorithms were trained on the cleaned, sampled corpus (1.55M documents):

| Algorithm | Cased | Uncased |
|---|---|---|
| SentencePiece Unigram | spm_files_3 | spm_files_2 |
| SentencePiece BPE | spm_files_5 | spm_files_4 |
| WordPiece | wp_1 | wp_2 |

Shared Configuration:

  • Vocabulary size: 64,000
  • Normalizer: NFKC
  • Pre-tokenizer: Whitespace (for WordPiece); none for SPM
  • Special tokens: [PAD] [UNK] [CLS] [SEP] [MASK] [EMOJI] [URL] [PHONE] [EMAIL]

WordPiece specifics:

  • Continuation prefix: ##
  • Max chars per word: 100
  • Vocabulary composition (uncased): 79.6% whole-word, 17.7% continuation, 2.7% single-char tokens
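
A minimal sketch of how this configuration maps onto the HuggingFace Tokenizers API for the uncased WordPiece variant; the training-file path is illustrative:

```python
# WordPiece-uncased training setup, following the configuration above.
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
                  "[EMOJI]", "[URL]", "[PHONE]", "[EMAIL]"]

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]",
                                       max_input_chars_per_word=100))
tokenizer.normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Lowercase(),   # uncased variant; drop this line for cased
])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.WordPieceTrainer(
    vocab_size=64_000,
    special_tokens=SPECIAL_TOKENS,
    continuing_subword_prefix="##",
)
tokenizer.train(files=["corpus_sample.txt"], trainer=trainer)
tokenizer.save("wp_2_tokenizer.json")
```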

All six variants were evaluated on a held-out test set (47,934 samples) with per-token metrics and suffix atomicity analysis.


Evaluation & Algorithm Comparison

All six tokenizer variants were evaluated on a held-out test set of 47,934 Azerbaijani documents, with metrics spanning efficiency (fertility, chars/token), reliability (UNK rate), and linguistic quality (suffix atomicity).
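
Fertility here is subword tokens per whitespace-separated word (lower is better). A sketch of how these metrics can be computed for one variant, treating corpus-level ratios, whitespace word-splitting, and the [UNK] literal as simplifying assumptions:

```python
# Per-variant evaluation metrics over a list of held-out documents.
from tokenizers import Tokenizer

def evaluate(tokenizer_path, docs):
    tok = Tokenizer.from_file(tokenizer_path)
    n_words = n_tokens = n_chars = n_unk = 0
    for doc in docs:
        enc = tok.encode(doc)
        n_words += len(doc.split())
        n_tokens += len(enc.tokens)
        n_chars += len(doc)
        n_unk += sum(t == "[UNK]" for t in enc.tokens)
    return {
        "fertility": n_tokens / n_words,        # subword tokens per word
        "chars_per_token": n_chars / n_tokens,  # compression proxy
        "unk_rate_pct": 100 * n_unk / n_tokens, # % of tokens that are [UNK]
    }
```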

Performance by Algorithm

Ranking across key metrics (lower fertility and UNK are better):

| Metric | wp_2_uncased (Selected) | wp_1_uncased | spm_files_2_uncased | spm_files_4_uncased | wp_1_cased | spm_files_3_cased |
|---|---|---|---|---|---|---|
| Fertility mean | 1.727 | 1.756 | 1.838 | 1.845 | 1.833 | 1.932 |
| UNK rate | 0.0053% | 0.0012% | 0.0000% | 0.0000% | 0.0012% | 0.0000% |
| Chars / token | 4.572 | 4.537 | 4.791 | 4.779 | 4.375 | 4.601 |
| Continuation rate | 100% | 100% | 34.9% | 35.2% | 100% | 38.3% |
| Suffix atomicity | 100% | 100% | 90.9% | 95.5% | 100% | 72.7% |
| Vocab utilization | 51.1% | 51.3% | 51.8% | 53.1% | 53.9% | 54.1% |

Key observations:

  1. WordPiece dominates on fertility. All WordPiece variants (1.727–1.833) outperform SPM variants (1.838–1.932) despite SPM’s byte-fallback coverage (0.0% UNK).

  2. Uncased consistently beats cased. Across all algorithms, uncased variants achieve lower fertility (cased variants run 0.08–0.10 points higher) without sacrificing suffix atomicity (100% for WordPiece; 90–95% for uncased SPM).

  3. WordPiece vs. SPM trade-off: WordPiece achieves a 100% continuation rate (every subword is a learned token) and perfect suffix atomicity in both uncased variants; SPM variants reach zero UNK via byte-fallback but fragment more aggressively. For sequence-efficient pretraining, WordPiece’s fertility advantage is worth the 0.0053% UNK cost.

  4. Selection criterion: WordPiece-uncased (wp_2) was selected for AzBERT pretraining based on the lowest fertility, 100% suffix atomicity (preserving linguistic structure), and 51% vocabulary utilization (room for rare subwords without waste).

Comparison vs. Existing Tokenizers

Final evaluation of selected tokenizer (wp_2_uncased) against established multilingual baselines:

| Metric | wp_2_uncased | HPLT (32k) | XLM-R (250k) | mBERT (119k) |
|---|---|---|---|---|
| Fertility mean | 1.727 | 2.068 | 2.167 | 2.846 |
| UNK rate | 0.0053% | 0.0% | 0.0002% | 0.0045% |
| Chars / token | 4.572 | 4.163 | 3.733 | 2.776 |
| Suffix atomicity | 100% | 45.5% | 100% | 100% |

Summary: WordPiece-uncased outperforms mBERT by 39% on fertility, beats HPLT and XLM-R on fertility as well, and maintains full suffix atomicity with a practical vocabulary size (64k vs. 250k).
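
For reproducibility, the same fertility measurement can be applied to the public baselines via HuggingFace transformers; the document list below stands in for the held-out evaluation set:

```python
# Baseline fertility measurement with public multilingual tokenizers.
from transformers import AutoTokenizer

def fertility(tok, docs):
    n_tokens = sum(len(tok.tokenize(d)) for d in docs)
    n_words = sum(len(d.split()) for d in docs)
    return n_tokens / n_words

docs = ["Müqavilələrindən asılı olmayaraq, tərəflər öhdəliklərini yerinə yetirməlidirlər."]
mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
xlmr = AutoTokenizer.from_pretrained("xlm-roberta-base")
print(f"mBERT fertility: {fertility(mbert, docs):.3f}")
print(f"XLM-R fertility: {fertility(xlmr, docs):.3f}")
```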


Selected Tokenizer: WordPiece-uncased (wp_2)

Fertility comparison across algorithms: wp_2_uncased 1.727, wp_1_uncased 1.756, spm_files_2 1.838, HPLT 2.068, XLM-R 2.167, mBERT 2.846

Why WordPiece-uncased was selected for AzBERT pretraining:

| Criterion | Rationale |
|---|---|
| Fertility (1.727) | 16% lower than HPLT, 20% lower than XLM-R, 39% lower than mBERT. Minimizes token overhead on morphologically rich Azerbaijani. |
| Suffix atomicity (100%) | All subword boundaries are learned, preserving Azerbaijani morphology. SPM variants achieve 90–95% (word-final characters are sometimes dropped). |
| UNK rate (0.0053%) | Negligible; comparable to HPLT and mBERT. SPM’s zero UNK comes via byte-fallback, which degrades the pretraining signal. |
| Vocabulary utilization (51.1%) | Balanced: ~33k of 64k tokens appear in the training data, leaving room for domain-specific subwords in downstream tasks without overfitting. |
| Continuation rate (100%) | Every subword is a learned vocabulary entry (no backoff to byte pairs or fallback splitting). Improves downstream prediction and reduces inference latency. |

Key Findings

| Finding | Evidence |
|---|---|
| WordPiece outperforms SPM on fertility | WordPiece 1.73–1.83 vs. SPM 1.84–1.95. The gap widens when cased (+0.08–0.11). WordPiece’s learned continuation tokens beat SPM’s split-first approach for Azerbaijani. |
| Uncasing is a free lunch | Removing case distinctions reduces fertility by 0.08–0.11 across all algorithms with zero downside on suffix atomicity or UNK rate. Strongly recommended for pretraining. |
| SPM trades fertility for zero UNK via byte-fallback | SPM achieves 0.0% UNK (byte pairs handle any character) but sacrifices 0.11–0.20 fertility points. For Azerbaijani pretraining, rare-subword handling via a learned UNK is more efficient. |
| WordPiece’s 100% suffix atomicity matters | All 22 common Azerbaijani suffixes are single tokens (##lar, ##lər, ##dan, etc.), preserving morphological structure for downstream tasks. SPM achieves 86–95% (spm_files_3 only 72.7% when cased). |
| Vocabulary utilization of 51–54% is healthy | ~33k of 64k tokens appear in the 1.55M training documents, reserving room for domain-specific subwords without overfitting or token waste. |
| All three algorithms beat mBERT convincingly | Even the worst variant (SPM cased, 1.932) beats mBERT (2.846) by 32%. Monolingual > multilingual for morphologically rich Azerbaijani. |
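
A sketch of the suffix-atomicity check underlying the table above: a suffix counts as atomic when its ## continuation form is a single vocabulary entry. The suffix list is an illustrative subset of the 22 tested:

```python
# Suffix atomicity: fraction of common suffixes that exist as single
# continuation tokens in the trained vocabulary.
from tokenizers import Tokenizer

COMMON_SUFFIXES = ["lar", "lər", "dan", "dən", "nın", "nin"]  # subset of 22

def suffix_atomicity(tokenizer_path):
    vocab = Tokenizer.from_file(tokenizer_path).get_vocab()
    atomic = [s for s in COMMON_SUFFIXES if f"##{s}" in vocab]
    return 100 * len(atomic) / len(COMMON_SUFFIXES)

print(f"atomicity: {suffix_atomicity('wp_2_tokenizer.json'):.1f}%")
```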

Stack

Python · HuggingFace Tokenizers · SentencePiece · MongoDB · multiprocessing · NFKC normalization · regex


Component 1 of the AzBERT pretraining pipeline. WordPiece-uncased (fertility 1.727, 100% suffix atomicity) selected and ready for pretraining.