EuroBERT: Scaling Multilingual Encoders for European Languages
Title: EuroBERT: Scaling Multilingual Encoders for European Languages
Abstract:
Bidirectional encoder models have long served as the standard for generating general-purpose multilingual vector representations, which are essential for tasks such as classification, regression, and retrieval. Although these models are broadly applicable, their prominence has recently been eclipsed by the rapid advancements in generative, decoder-only architectures. Yet, many of the breakthroughs fueling this trend are not exclusive to decoder-based structures. This study re-evaluates the evolution of multilingual encoders by applying these recent insights, leading to the introduction of EuroBERT. EuroBERT is a series of multilingual encoders designed to support both widely spoken global languages and those specific to Europe.
The proposed models demonstrate superior performance compared to current alternatives across a wide variety of benchmarks, including multilingual proficiency, mathematical reasoning, and code understanding. Additionally, EuroBERT natively accommodates input sequences of up to 8,192 tokens. The paper also delves into the critical design choices underlying EuroBERT, providing detailed analysis of the training pipeline and dataset construction. To foster further research and development, we make the EuroBERT models, along with intermediate training checkpoints and the associated training framework, publicly available.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




