arXiv

GeistBERT: Breathing Life into German NLP

June 2, 2026 · Raphael Scheible-Schmitt, Johann Frei

Title: GeistBERT: Revitalizing German Natural Language Processing

Abstract:

Advances in transformer-based language models have shown the benefits of pre-training on high-quality, language-specific corpora. German Natural Language Processing (NLP) stands to gain from current architectures and from datasets tailored to the linguistic characteristics of German. GeistBERT aims to improve German language processing through incremental training on a diverse corpus, optimizing performance across a range of NLP tasks.

GeistBERT was pre-trained with the fairseq framework, following the RoBERTa base configuration with Whole Word Masking (WWM), and was initialized from GottBERT weights. Training used a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens.
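As an illustration of how Whole Word Masking combines with dynamic masking, the following is a minimal Python sketch and not the fairseq implementation used for GeistBERT. It assumes the input is already split into words, each given as a list of subword pieces. The function name `whole_word_mask`, the `<mask>` placeholder string, and the masking probability of 0.15 are assumptions made for this example and are not taken from the abstract. The usual replacement of some selected tokens with random or unchanged tokens is omitted for brevity.

```python
import random

MASK = "<mask>"  # placeholder mask symbol (assumption for this sketch)


def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask whole words: either every subword piece of a word is masked or none.

    words: list of words, each given as a list of subword pieces.
    mask_prob: per-word masking probability (0.15 is an assumed default).
    Returns (masked_pieces, labels); labels hold the original piece at
    masked positions and None elsewhere.
    """
    if rng is None:
        rng = random.Random()
    masked, labels = [], []
    for pieces in words:
        hit = rng.random() < mask_prob  # one decision per word, not per piece
        for piece in pieces:
            masked.append(MASK if hit else piece)
            labels.append(piece if hit else None)
    return masked, labels


# Dynamic masking: the mask is re-sampled every time the sequence is seen,
# instead of being fixed once during preprocessing.
words = [["Donau", "dampf", "schiff"], ["fährt"], ["heute"]]
for epoch in range(3):
    masked, _ = whole_word_mask(words, mask_prob=0.5, rng=random.Random(epoch))
    print(epoch, masked)
```

Because the masking decision is made once per word, a compound such as the one in the example is hidden as a unit, so the model cannot recover a masked piece simply by reading its neighbouring pieces of the same word.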

To assess performance, we fine-tuned the model on established downstream tasks, encompassing Named Entity Recognition (NER) via CoNLL 2003 and GermEval 2014, text classification through GermEval 2018 (coarse and fine-grained) and 10kGNAD, and Natural Language Inference (NLI) using German XNLI. Evaluation metrics included $F_1$ score and accuracy.
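For reference, the two reported metrics can be computed for a classification task as in the sketch below. This is a plain-Python illustration, not the evaluation code behind the reported numbers. The helper names, the toy labels, and the choice of macro averaging are assumptions made for this example; the abstract does not state which F1 averaging was used, and NER is commonly scored at the entity level rather than per label as shown here.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_label(gold, pred, label):
    """F1 = 2 * precision * recall / (precision + recall) for one label."""
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 (macro averaging is an assumption)."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, lab) for lab in labels) / len(labels)


# Toy example with illustrative labels, not real benchmark data.
gold = ["OFFENSE", "OTHER", "OTHER", "OFFENSE"]
pred = ["OFFENSE", "OTHER", "OFFENSE", "OFFENSE"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```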

GeistBERT achieved strong results across all evaluated tasks, establishing itself as a leading base model and setting a new state-of-the-art (SOTA) result on the GermEval 2018 fine-grained text classification benchmark. Notably, the model outperformed several larger models, particularly on classification benchmarks. To support the German NLP research community, GeistBERT is released under the MIT license.
