arXiv

GeistBERT: Breathing Life into German NLP

Title: GeistBERT: Revitalizing German Natural Language Processing

Abstract:

The evolution of transformer-based language models has shown the clear benefit of pre-training on high-quality, language-specific corpora. German Natural Language Processing (NLP) stands to gain from contemporary architectures and datasets tailored to the linguistic characteristics of German. GeistBERT aims to improve German language processing through incremental training on a diverse corpus, optimizing performance across a range of NLP tasks.

GeistBERT was pre-trained with the fairseq framework using the RoBERTa base configuration and Whole Word Masking (WWM). The model was initialized from GottBERT weights and trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens.
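As a rough illustration of how Whole Word Masking combines with dynamic masking, the sketch below makes one masking decision per word and applies it to all of that word's subword pieces, then draws a fresh mask on each pass over the data. It is not the fairseq implementation: the fixed-size `toy_subwords` splitter, the `whole_word_mask` helper, and the masking probabilities are hypothetical stand-ins, and the usual RoBERTa replacement scheme (mask, random token, or unchanged) is simplified to always inserting the mask token.

```python
import random

MASK = "<mask>"

def toy_subwords(word, piece_len=4):
    """Hypothetical stand-in for a real subword tokenizer: fixed-size chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]

def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask every subword piece of each selected word.

    Returns (inputs, targets). targets holds the original piece at masked
    positions and None everywhere else.
    """
    if rng is None:
        rng = random.Random()
    inputs, targets = [], []
    for word in words:
        # One decision per word, not per piece: this is what makes it WWM.
        masked = rng.random() < mask_prob
        for piece in toy_subwords(word):
            inputs.append(MASK if masked else piece)
            targets.append(piece if masked else None)
    return inputs, targets

words = "Die Bundesregierung beschloss gestern neue Klimaschutzmaßnahmen".split()

# Dynamic masking: sample a new mask every time the sentence is visited,
# instead of fixing one mask during preprocessing.
for epoch in range(3):
    inputs, _ = whole_word_mask(words, mask_prob=0.3, rng=random.Random(epoch))
    print(epoch, inputs)
```

Because the decision is made per word, a long compound such as "Klimaschutzmaßnahmen" is either fully visible or fully hidden, so the model cannot recover a masked piece from its neighbouring pieces within the same word.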

To assess performance, we fine-tuned the model on established downstream tasks, encompassing Named Entity Recognition (NER) via CoNLL 2003 and GermEval 2014, text classification through GermEval 2018 (coarse and fine-grained) and 10kGNAD, and Natural Language Inference (NLI) using German XNLI. Evaluation metrics included $F_1$ score and accuracy.
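For readers less familiar with the two metrics, the following minimal sketch computes accuracy and a macro-averaged $F_1$ for a classification task. The averaging scheme is an assumption, since the abstract does not say which $F_1$ variant each benchmark uses (NER is typically scored at the entity level instead), and the helper names and toy labels are purely illustrative.

```python
def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for_label(gold, pred, label):
    """F1 for one label: harmonic mean of precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0  # also avoids division by zero when a label is never hit
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)

# Toy labels, not taken from any benchmark.
gold = ["A", "A", "A", "B"]
pred = ["A", "A", "B", "B"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Macro averaging weights every class equally, so it penalizes poor performance on rare classes more than accuracy does, which matters on imbalanced classification sets.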

GeistBERT delivered strong results across all evaluated tasks, ranking as a leading base model and setting a new state of the art (SOTA) on the GermEval 2018 fine-grained text classification benchmark. Notably, it outperformed several larger models, particularly on the classification benchmarks. To support the German NLP research community, GeistBERT is released under the MIT license.


Source: arXiv Generated at: 2026-06-02 00:00:00 UTC
