ATLAS multilingual scaling laws for global language model training
Most AI usage now spans many languages, but public scaling research has mainly focused on English-only models. ATLAS, a new multilingual scaling-law study from Google DeepMind and Google Cloud, provides quantitative guidance on how model size, data volume, and language mix interact when serving hundreds of languages.
Executive snapshot
ATLAS is the largest public multilingual pretraining study to date and proposes practical rules for adding languages without sacrificing model quality.
- Over 50% of AI model users speak non-English languages, yet prior public scaling laws focus largely on English-only training, leaving a gap for teams building for global users [S2][S3].
- ATLAS runs 774 multilingual pretraining experiments on 10M-8B parameter models, covering 400+ languages in training, evaluating on 48 languages, and estimating cross-lingual transfer for ~1,400 language pairs [S1][S2].
- When doubling the number of training languages from K to 2K, ATLAS finds that increasing model size by 1.18× and total data by 1.66× maintains performance, even though per-language data falls to 83% of the original level [S1].
- Training with a multilingual vocabulary and fully multilingual data carries a measurable compute tax compared with monolingual training at the same quality level, especially for English, while low-resource languages show diminishing returns once available data is exhausted [S2].
- For 2B-parameter models, fine-tuning a strong multilingual checkpoint outperforms pretraining from scratch until roughly 144-283B training tokens; beyond that range, pretraining from scratch tends to yield better ultimate performance, with the crossover point rising with model size [S1][S2].
Implication for marketers and product teams: multilingual coverage can be expanded with modest extra compute if language mixes and model choices follow ATLAS-style scaling rules, making wider language support more cost-predictable than before.
ATLAS study design, datasets, and multilingual model samples
ATLAS (Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality) is presented as an ICLR 2026 paper and summarized in a Google Research blog post [S1][S2]. It focuses on how to choose model size, data volume, and language mixtures when training large language models to serve many languages at once.
Key study and dataset facts
- Authors and sponsors: Led by Shayne Longpre (Google Cloud) and Sayna Ebrahimi (Google DeepMind), with collaborators at Google [S2].
- Model scale: 10M-8B parameters, with 774 pretraining runs covering monolingual, bilingual, and massively multilingual configurations [S1][S2].
- Data: Uses MADLAD-400, a web-scale corpus spanning 400+ languages [S4].
- Evaluation: Uses a vocabulary-insensitive loss metric to compare models trained with different vocabularies and data mixes across 750+ independent runs [S1][S2][S5] (a hedged sketch of such a metric follows this list).
- Cross-lingual transfer: Estimates how much the languages in each of ~1,400 language pairs help or hurt one another during training, forming a large transfer matrix [S1][S2].
- Checkpoints: Compares training from scratch with fine-tuning from a strong multilingual Unimax checkpoint [S2][S6].
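The exact metric is defined in [S5]; as a hedged illustration only, the sketch below uses a bits-per-byte normalization (summed token loss divided by the UTF-8 byte count of the evaluated text), a common way to make models with different tokenizers comparable. It is not necessarily the formulation ATLAS uses, and the function name, example sentence, and loss values are invented for illustration.

```python
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    """Convert summed token-level negative log-likelihood (in nats) into bits
    per UTF-8 byte, so models with different tokenizers become comparable.
    Illustrative convention only; not necessarily ATLAS's exact metric."""
    return total_nll_nats / (math.log(2) * total_utf8_bytes)

# Hypothetical example: two models score the same sentence with different tokenizers.
text = "Habari ya dunia"                     # Swahili, roughly "Hello, world"
n_bytes = len(text.encode("utf-8"))

nll_model_a = 18.4   # made-up summed NLL; larger vocabulary, fewer tokens
nll_model_b = 19.1   # made-up summed NLL; smaller vocabulary, more tokens

print(f"model A: {bits_per_byte(nll_model_a, n_bytes):.3f} bits/byte")
print(f"model B: {bits_per_byte(nll_model_b, n_bytes):.3f} bits/byte")
```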
Key limitations and caveats
- Experiments are capped at 8B parameters; behavior at 30B-100B+ scale is not directly tested [S1][S2].
- Results are based on a single large web-scale dataset (MADLAD-400); domains such as code, enterprise documents, or speech are not evaluated [S1][S4].
- Evaluations cover 48 languages, a broad but partial subset of the 400+ training languages [S2].
- Interpretation: precise numeric factors (like the 1.18× / 1.66× scaling) should be treated as approximate guidance rather than exact policy when applied to very different domains or training recipes.
Key findings on multilingual model efficiency and language mixes
Compute costs of multilingual vocabularies and datasets
ATLAS investigates six languages in detail - English (EN), French (FR), Russian (RU), Chinese (ZH), Hindi (HI), and Swahili (SW) - and compares three setups: monolingual vocabulary and data, multilingual vocabulary with monolingual data, and fully multilingual vocabulary plus multilingual data [S2].
Main factual observations
- Scaling curves (optimal combinations of model size N and data size D) look similar across all six languages, suggesting that high-level scaling behavior is stable across families and scripts [S2].
- For a fixed target quality, training with a multilingual vocabulary and multilingual data requires more compute than monolingual training. In the ATLAS plots, the fully multilingual setup consistently lies above monolingual curves, with multilingual vocabulary plus monolingual data in between [S2].
- This compute tax is most noticeable for English, which already has abundant, high-quality web data. A shared multilingual vocabulary appears to reduce English-specific efficiency compared with an English-only vocabulary [S2].
- Low-resource languages show an upward bend in their scaling curves as training repeats the same limited data, leading to diminishing returns. ATLAS explicitly models this saturation effect [S2] (a minimal sketch of a saturating data term follows this list).
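The blog post does not spell out the functional form of that saturation term, so the sketch below is only an assumed illustration in the spirit of Chinchilla-style laws: a loss L(N, D) in which extra epochs over a fixed pool of unique tokens contribute geometrically less. The discounting formula and every constant are placeholders, not fitted ATLAS parameters.

```python
import math

def effective_tokens(total_tokens: float, unique_tokens: float, decay: float = 0.15) -> float:
    """Diminishing value of repeated data: past the unique pool, each extra epoch
    contributes geometrically less. Both the form and `decay` are assumptions."""
    if total_tokens <= unique_tokens:
        return total_tokens
    epochs = total_tokens / unique_tokens
    return unique_tokens * (1 - math.exp(-decay * epochs)) / (1 - math.exp(-decay))

def loss(n_params: float, d_tokens: float, unique_tokens: float,
         e: float = 1.7, a: float = 400.0, b: float = 1500.0,
         alpha: float = 0.34, beta: float = 0.28) -> float:
    """Chinchilla-style L(N, D) with a saturating data term; constants are placeholders."""
    d_eff = effective_tokens(d_tokens, unique_tokens)
    return e + a / n_params**alpha + b / d_eff**beta

# A low-resource language with a 2B-token unique pool: extra passes over the same
# data buy progressively less once the pool is exhausted.
for d in (2e9, 8e9, 32e9):
    print(f"D = {d:.0e} tokens -> loss ~ {loss(1e9, d, unique_tokens=2e9):.3f}")
```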
Interpretation (likely)
- For businesses targeting both English and many other languages in a single model, there is a real but manageable efficiency penalty on English when sharing capacity and vocabulary, relative to a pure English model.
- For very low-resource languages, simply adding more training steps on the same data yields weaker gains; data quality and augmentation matter more than repeat counts once the saturation bend appears.
Cross-lingual transfer patterns across 400+ languages
ATLAS measures language-to-language synergies with a large transfer matrix that quantifies how much training on language A helps or harms performance on language B [S1][S2].
Key empirical findings
- Language family and script are strong predictors of positive transfer. Sharing a script (for example, Latin) or language family is statistically associated with positive transfer at p < 0.001 [S1][S2].
- Expected patterns emerge:
  - Norwegian gains most from Swedish and German.
  - Malay gains from Indonesian.
  - Arabic gains from Hebrew.
- English, French, and Spanish provide broadly helpful transfer to many other languages, likely due to the volume and heterogeneity of their web text [S2].
- Transfer is not symmetric: language A can help B more than B helps A [S2].
- Some language pairs show interference (negative transfer), where training on one language reduces performance on another, so adding more languages is not automatically beneficial without careful selection [S1][S2]; the sketch below illustrates how such a transfer matrix might be queried.
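Concretely, a transfer matrix of this kind can be used to shortlist helper languages for a target. The scores and the `best_helpers` function below are invented placeholders for illustration, not the published ATLAS estimates.

```python
# Hypothetical transfer scores: transfer[src][tgt] = estimated benefit of training
# on `src` for evaluation on `tgt` (positive helps, negative hurts). The numbers
# are placeholders, not ATLAS's published estimates.
transfer = {
    "sv": {"no": 0.42, "ms": -0.03},   # Swedish
    "de": {"no": 0.31, "ms": 0.01},    # German
    "id": {"no": 0.02, "ms": 0.55},    # Indonesian
    "en": {"no": 0.12, "ms": 0.10},    # English
}

def best_helpers(target: str, k: int = 2) -> list[tuple[str, float]]:
    """Rank candidate helper languages by their estimated benefit to `target`."""
    scores = [(src, row[target]) for src, row in transfer.items() if target in row]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

print(best_helpers("no"))  # Norwegian -> e.g. [('sv', 0.42), ('de', 0.31)]
print(best_helpers("ms"))  # Malay     -> e.g. [('id', 0.55), ('en', 0.10)]

# Note: the matrix is not symmetric; transfer["sv"]["no"] need not equal
# a hypothetical transfer["no"]["sv"].
```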
Interpretation (likely)
- Positive transfer is strongest within shared scripts and families, giving practitioners a concrete way to group languages in training mixes.
- High-resource hub languages such as English, French, and Spanish often serve as effective anchors for regional models, but should not be assumed to help every target language equally.
Scaling to more languages: model and data growth rules
The curse of multilinguality refers to the tendency for per-language performance to degrade as more languages are packed into a fixed-capacity model [S7]. ATLAS formalizes this as a scaling law that includes the number of languages K alongside model size N and data size D [S1][S2].
Main quantitative rule
- When increasing the number of training languages from K to 2K, ATLAS finds that performance on a given language can be maintained by (a worked example follows this list):
  - increasing model size N by 1.18×, and
  - increasing total data D by 1.66×.
- Even though per-language data falls to about 83% of its original level, positive cross-lingual transfer across the enlarged language set offsets the capacity limits, so the performance drop expected from the curse is largely neutralized when these factors are applied [S1][S2].
- The study notes a mild capacity tax from increasing K, but emphasizes that positive cross-lingual transfer is substantial at these scales, so shared structure across languages carries much of the added coverage [S1][S2].
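As a worked example, the sketch below applies the reported 1.18× / 1.66× factors per doubling of the language count. Chaining the rule across more than one doubling, and the specific model and data sizes used, are illustrative assumptions rather than published results.

```python
import math

def scale_for_languages(n_params: float, d_tokens: float,
                        k_from: int, k_to: int,
                        n_factor: float = 1.18, d_factor: float = 1.66):
    """Apply the ATLAS doubling rule: each doubling of the language count
    multiplies model size by ~1.18x and total data by ~1.66x. Chaining the rule
    across multiple doublings is an extrapolation, not a published result."""
    doublings = math.log2(k_to / k_from)
    new_n = n_params * n_factor ** doublings
    new_d = d_tokens * d_factor ** doublings
    per_lang_ratio = (new_d / k_to) / (d_tokens / k_from)
    return new_n, new_d, per_lang_ratio

# Example: a 2B-parameter model trained on 200B tokens over 25 languages,
# expanded to 50 languages (hypothetical starting point).
n, d, ratio = scale_for_languages(2e9, 200e9, k_from=25, k_to=50)
print(f"model size: {n/1e9:.2f}B params, data: {d/1e9:.0f}B tokens, "
      f"per-language data: {ratio:.0%} of the original")
# -> roughly 2.36B params, 332B tokens, per-language data ~83% of before.
```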
Context and comparison
- Earlier work on multilingual machine translation found that adding many languages at a fixed model size could harm high-resource language quality, a phenomenon labeled the curse of multilinguality [S7].
- ATLAS does not contradict this effect; instead, it quantifies how much additional capacity and data are needed per extra language to keep performance stable [S1][S2].
Interpretation (likely)
- Doubling language coverage does not require doubling compute. The ATLAS ratios suggest that wider coverage is feasible with sub-linear increases in model size and total data, as long as training mixes are constructed to exploit positive transfer.
Pretraining versus fine-tuning multilingual checkpoints
For ten languages, ATLAS compares two training paths to reach a strong model for a specific language [S1][S2]:
- Pretraining from scratch on the target language (monolingual pretraining).
- Fine-tuning from a strong multilingual Unimax checkpoint, which already performs well across languages.
Key empirical results
- Fine-tuning from the multilingual checkpoint performs better at low compute budgets.
- Pretraining from scratch eventually overtakes fine-tuning if training continues for long enough [S2].
- For 2B-parameter models, the crossover point - where pretraining from scratch surpasses fine-tuning - typically appears between ~144B and 283B tokens, depending on the language [S1][S2].
- The crossover threshold, whether measured in tokens or in compute (C ≈ 6ND), rises with model size; ATLAS provides an estimated curve relating model size to the crossover point [S1][S2].
- Exact thresholds vary by base model and language mixture, so the reported numbers are approximate guidance rather than universal cutoffs [S1][S2] (a rough decision sketch follows this list).
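As a rough decision aid under these findings, the sketch below uses the standard C ≈ 6ND compute approximation and the 144B-283B token range reported for 2B-parameter models. Treating that range as a hard cutoff, and the helper function itself, are simplifying assumptions on top of the source.

```python
def training_compute(n_params: float, d_tokens: float) -> float:
    """Standard approximation of training FLOPs: C ~ 6 * N * D."""
    return 6.0 * n_params * d_tokens

def prefer_finetuning(d_tokens: float,
                      crossover_low: float = 144e9,
                      crossover_high: float = 283e9) -> str:
    """Rough guidance for ~2B-parameter models. The 144B-283B token range is
    from ATLAS; treating it as a sharp cutoff is a simplification, and the
    threshold rises with model size."""
    if d_tokens < crossover_low:
        return "fine-tune the multilingual checkpoint"
    if d_tokens > crossover_high:
        return "pretraining from scratch may eventually win"
    return "inside the reported crossover range: test both"

for budget in (20e9, 200e9, 400e9):
    flops = training_compute(2e9, budget)
    print(f"{budget/1e9:.0f}B tokens (~{flops:.2e} FLOPs): {prefer_finetuning(budget)}")
```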
Interpretation (likely)
- For many real-world business workloads, where training budgets are well below 100-200B tokens, fine-tuning a high-quality multilingual checkpoint is likely to be more efficient and to reach stronger performance faster.
- Training from scratch is mainly attractive for organizations able to sustain very large training budgets and with a need for the highest possible ceiling on a specific language or mix.
Business and marketing implications of multilingual scaling laws
This section interprets the technical findings for product, marketing, and analytics teams. All points here are interpretation, labeled by confidence.
Likely implications
- Global coverage is more affordable than previously assumed. The ATLAS rule for doubling languages (1.18× model size, 1.66× data) suggests that expanding from, for example, 25 to 50 languages does not require doubling compute [S1]. For businesses relying on custom or partner models, the marginal cost per new language is therefore moderate, provided languages are grouped thoughtfully.
- Ignoring non-English users carries opportunity cost. More than half of AI users operate in non-English languages [S3]. ATLAS shows that with structured training mixes, supporting these users does not require a linear increase in budget [S1][S2]. Businesses that continue to deploy English-only or English-first models will likely see weaker performance and adoption in large growth markets.
- Language mix design is a real performance lever. The cross-lingual transfer matrix demonstrates that some language pairs help substantially, while others interfere [S1][S2]. For teams commissioning or fine-tuning models, asking vendors how they choose language mixes - and whether they co-train target languages with related, script-aligned languages - becomes a meaningful technical and commercial question.
- Fine-tuning strong multilingual checkpoints is generally the practical route. Given the 144-283B token crossover range at 2B parameters [S1][S2], most enterprise fine-tuning tasks - often in the tens or hundreds of millions of tokens - fall well below this threshold. For such cases, adapting a strong multilingual base model is likely more cost-efficient than training a specialized model from scratch.
Tentative implications
- Separate English-only and multilingual models may coexist. Because multilingual vocabularies impose a compute tax on English [S2], organizations with large English-speaking user bases may keep an English-optimized model for peak benchmark performance and a multilingual model for broad coverage. Clear communication about which model powers which features will matter for expectation setting.
- Regional clusters may be a cost-effective path for long-tail languages. Given the strong positive transfer within families and scripts [S1][S2], one practical approach is to maintain regional multilingual models (for example, Latin-script European languages, Indic scripts, major Arabic-script languages), providing better quality for smaller markets than a single global mix that pairs them with very distant languages.
Speculative implications (to validate case by case)
- Training-data strategy could become a marketing differentiator. Vendors able to show that their models co-train client-relevant languages with empirically strong helper languages (for example, Catalan with Spanish, Portuguese, Italian [S2]) may deliver better quality on niche markets without needing huge proprietary corpora.
- Pricing models might shift from per language to per cluster. As more teams adopt ATLAS-style thinking, commercial structures could increasingly be tied to language clusters and marginal compute, rather than flat per-language fees.
Contradictions, research gaps, and open questions in multilingual AI
Factual gaps and limitations
- Model scale ceiling. ATLAS evaluates models up to 8B parameters [S1][S2]. Behavior at tens or hundreds of billions of parameters - where many commercial frontier models operate - remains untested within this framework. The 1.18× / 1.66× rule could change at larger scales.
- Domain coverage. MADLAD-400 is web-text oriented [S4]. How the same scaling laws behave for code, legal documents, medical text, or multimodal inputs is unknown.
- Task diversity. ATLAS reports results using vocabulary-insensitive loss and language modeling metrics [S1][S2][S5]. Downstream tasks such as search relevance, classification, or conversational quality in business settings might respond differently to language mixes and scaling.
- Evaluation languages. Only 48 languages are evaluated in detail [S2]. For many of the 400+ training languages, performance and transfer patterns remain indirectly inferred rather than directly measured.
- Dynamic data quality. The transfer matrix implicitly reflects current web data quality and volume [S1][S2][S4]. As the web changes or as curated datasets grow for under-served languages, transfer relationships could shift.
Contextual contradictions or tensions
- Prior curse of multilinguality results versus ATLAS. Earlier machine translation work documented clear performance declines when adding many languages to a fixed-size model [S7]. ATLAS confirms that there is still a capacity tax but shows that positive transfer can offset it if model size and data scale with K at specific rates [S1][S2]. This does not invalidate prior findings; it reframes them as a special case of under-provisioned capacity.
- One law versus many recipes. ATLAS proposes a unified scaling law for multilingual settings [S1], but real-world training recipes differ in tokenization, optimization schedules, and reinforcement-learning stages. The extent to which a single law can guide all these choices remains to be tested at scale.
Open questions for practitioners (tentative)
- How stable are transfer patterns as more curated, domain-specific data becomes available in under-served languages?
- To what extent do ATLAS findings apply to instruction-tuned and RLHF-aligned chat models, which are the main workhorses in many businesses?
- Could future gains from better pretraining layouts reduce the compute tax on English in multilingual settings, altering the tradeoff between single-language and shared models?
Data appendix: selected ATLAS metrics for quick reference
Factual summary table of selected metrics from ATLAS and related sources:
| Item | Metric | Value / Range | Source |
|---|---|---|---|
| Share of AI users speaking non-English languages | Proportion of users | >50% | [S3] |
| Model sizes studied | Parameters | 10M-8B | [S1][S2] |
| Pretraining runs | Count | 774 | [S1][S2] |
| Training languages | Count in corpus | 400+ | [S1][S2][S4] |
| Evaluation languages | Count | 48 | [S2] |
| Language pairs in transfer matrix | Approximate count | ~1,400 | [S1][S2] |
| Independent evaluation runs | Count | 750+ | [S2] |
| Doubling languages K → 2K | Model size multiplier (N) | 1.18× | [S1] |
| Doubling languages K → 2K | Data multiplier (D) | 1.66× | [S1] |
| Per-language data at 2K vs K | Relative tokens | 83% | [S1] |
| Crossover point, 2B models (pretrain vs fine-tune) | Training tokens | ~144B-283B | [S1][S2] |
Sources
- [S1] Longpre, S., Ebrahimi, S. et al. "ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality," arXiv:2510.22037, to appear at ICLR 2026 (as described in [S2]).
- [S2] Longpre, S., Ebrahimi, S. "ATLAS: Practical scaling laws for multilingual models," Google Research Blog, Jan 27, 2026.
- [S3] Anthropic. "Anthropic Economic Index - September 2025 Report," usage and economic impact of AI systems.
- [S4] Kudugunta, S. et al. "MADLAD-400: A Multilingual Dataset for Language Modeling in 400+ Languages," arXiv:2309.04662, 2023.
- [S5] Reference to vocabulary-insensitive loss, arXiv:2407.13623, as cited in [S2].
- [S6] Zhang, B. et al. "Unimax" multilingual model, as cited in [S2] and original paper.
- [S7] Arivazhagan, N. et al. "The Curse of Multilinguality: Scaling Neural Machine Translation to 100 Languages," arXiv:1907.05019, 2019.






