SindBERT, the Sailor: Charting the Seas of Turkish NLP
Title: SindBERT, the Sailor: Navigating the Waters of Turkish Natural Language Processing
Abstract:
While Transformer architectures have transformed Natural Language Processing (NLP), large-scale pre-training efforts have largely overlooked morphologically rich languages. To address this gap, we introduce SindBERT, a RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text drawn from mC4, OSCAR23, and Wikipedia, SindBERT is released in base and large configurations and is the first large-scale encoder-only language model available for Turkish.
We evaluate SindBERT on four tasks: part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP benchmark for linguistic acceptability. The results show that SindBERT is competitive with established Turkish and multilingual models. The large variant achieves the best score on two of the four tasks, but moving from base to large does not yield a consistent improvement. This flat scaling mirrors trends seen in models such as XLM-R and EuroBERT, suggesting that current Turkish benchmarks may already be saturated.
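As a rough illustration of how a linguistic-acceptability benchmark built on minimal pairs is commonly scored, the sketch below computes the fraction of pairs in which a model assigns the acceptable sentence a higher score than its unacceptable counterpart. This is an assumption about the general evaluation style, not the paper's actual protocol. The function name `minimal_pair_accuracy`, the toy sentences, and the hard-coded scores are hypothetical; in practice `score` would wrap a model-based sentence score such as a pseudo-log-likelihood.

```python
from typing import Callable, Iterable, Tuple


def minimal_pair_accuracy(
    pairs: Iterable[Tuple[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of (acceptable, unacceptable) pairs in which the
    acceptable sentence receives the strictly higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one minimal pair is required")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Illustrative pairs (not taken from TurBLiMP): each pair differs only in
# subject-verb agreement, with the acceptable sentence listed first.
toy_pairs = [
    ("Ben geldim.", "Ben geldin."),
    ("Çocuklar okula gitti.", "Çocuklar okula gittim."),
]

# Hypothetical scores standing in for a real model. The second pair is
# deliberately scored the wrong way round to show a non-trivial accuracy.
toy_scores = {
    "Ben geldim.": -3.1,
    "Ben geldin.": -7.4,
    "Çocuklar okula gitti.": -9.0,
    "Çocuklar okula gittim.": -8.2,
}


def toy_score(sentence: str) -> float:
    """Look up the hard-coded score for a toy sentence."""
    return toy_scores[sentence]


accuracy = minimal_pair_accuracy(toy_pairs, toy_score)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Because each pair is scored independently, a per-phenomenon breakdown only requires grouping the pairs before calling the function.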
Furthermore, a comparison with smaller but more carefully curated models such as BERTurk suggests that the quality and diversity of the training corpus can matter more than raw data volume. SindBERT therefore serves a dual purpose: it is an open-source resource for the Turkish NLP community, and it is an empirical case study of the limits of scaling that underlines the importance of corpus composition for morphologically rich languages. The SindBERT models are released under the MIT license and are compatible with both fairseq and Hugging Face.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





