<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>HuggingFace Accelerate on Rauf Ibishov</title><link>http://raufibishov.com/tags/huggingface-accelerate/</link><description>Recent content in HuggingFace Accelerate on Rauf Ibishov</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>© 2026 Rauf Ibishov</copyright><lastBuildDate>Thu, 01 Jan 2026 00:00:00 +0000</lastBuildDate><atom:link href="http://raufibishov.com/tags/huggingface-accelerate/index.xml" rel="self" type="application/rss+xml"/><item><title>AzNEOBERT — Azerbaijani BERT from Scratch on 12B Tokens</title><link>http://raufibishov.com/projects/az-neobert/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>http://raufibishov.com/projects/az-neobert/</guid><description>&lt;p&gt;&lt;em&gt;Status: &lt;strong&gt;In training&lt;/strong&gt; · Phase 1 active on 8× NVIDIA H200 · Component 2 of the AzBERT pipeline&lt;/em&gt;&lt;/p&gt;

&lt;h2 id="tldr"&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;I am training AzNEOBERT, an Azerbaijani encoder language model built from scratch on the NeoBERT architecture. Training runs on 8× NVIDIA H200 GPUs scheduled via SLURM, over ~12.2B tokens of Azerbaijani text drawn from 11 corpus collections. The infrastructure stack combines DeepSpeed ZeRO-2, Flash Attention 3, and &lt;code&gt;torch.compile&lt;/code&gt;, reaching a throughput of 1.24M tokens/sec. Loss dropped from 11.1 (random init) to 2.35 within the first 4,100 steps.&lt;/p&gt;
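&lt;p&gt;As a rough illustration of this stack, here is a minimal HuggingFace Accelerate sketch with a DeepSpeed ZeRO-2 plugin and &lt;code&gt;torch.compile&lt;/code&gt;. It is not the actual training script: &lt;code&gt;build_model&lt;/code&gt; and &lt;code&gt;build_dataloader&lt;/code&gt; are hypothetical placeholders for the NeoBERT encoder and the tokenized corpus loader.&lt;/p&gt;
&lt;pre&gt;&lt;code class="language-python"&gt;# Hypothetical sketch, not the project's real script.
import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 2 shards optimizer state and gradients across the 8 GPUs.
ds_plugin = DeepSpeedPlugin(zero_stage=2)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=ds_plugin)

model = build_model()              # placeholder: NeoBERT-style encoder
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
train_loader = build_dataloader()  # placeholder: tokenized Azerbaijani batches

model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)
model = torch.compile(model)       # fused kernels for extra throughput

for batch in train_loader:
    loss = model(**batch).loss     # assumes the model returns an MLM loss
    accelerator.backward(loss)     # handles ZeRO-2 gradient partitioning
    optimizer.step()
    optimizer.zero_grad()
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;A script like this would be started with &lt;code&gt;accelerate launch&lt;/code&gt; (or under &lt;code&gt;srun&lt;/code&gt; in a SLURM job), with Accelerate handling process-group setup across the 8 GPUs.&lt;/p&gt;</description><media:content xmlns:media="http://search.yahoo.com/mrss/" url="http://raufibishov.com/projects/az-neobert/feature.svg"/></item></channel></rss>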