arXiv

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

June 2, 2026 · Christian Autenried, Cosimo Persia

Title: KliniskVestBERT: A BERT Architecture Tailored for Norwegian Clinical Documentation

Abstract

As Natural Language Processing (NLP) becomes more widely used in healthcare, there is a growing need for language models adapted to clinical terminology. This paper presents KliniskVestBERT, a collection of three BERT-based encoder models pre-trained on a large-scale corpus of real-world, de-identified Norwegian clinical records from Helse Vest. Each model is obtained by continuing the pre-training of an established language model (NB-BERT-large, NorBERT3-large, or ModernBERT) on this clinical dataset.
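
The abstract does not state which pre-training objective was used. As a minimal sketch, assuming the standard masked language modelling recipe from the original BERT setup (about 15% of tokens are selected; of those, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged), the data corruption step behind continued pre-training might look like the following. The function name `mask_tokens`, its parameters, and the toy vocabulary are illustrative assumptions, not taken from the paper.

```python
import random

MASK_TOKEN = "[MASK]"  # mask symbol used by BERT-style tokenizers


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Corrupt a token list for masked language modelling (hypothetical helper).

    Returns (inputs, labels). labels[i] holds the original token at every
    selected position and None elsewhere, so a training loss would only be
    computed where the model has something to reconstruct.
    """
    if rng is None:
        rng = random.Random()
    inputs, labels = [], []
    for token in tokens:
        # Leave most tokens untouched and unlabelled.
        if rng.random() >= mask_prob:
            inputs.append(token)
            labels.append(None)
            continue
        # Selected position: the model must predict the original token.
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            inputs.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            inputs.append(token)              # 10%: keep the token as-is
    return inputs, labels
```

Calling `mask_tokens("pasienten ble innlagt med brystsmerter".split(), vocab)` on an invented example sentence returns a corrupted copy plus the reconstruction targets. In continued pre-training, an existing checkpoint would be trained further on pairs like these drawn from the in-domain corpus rather than starting from random weights.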

The dataset reflects a representative cross-section of the Helse Vest patient population. It comprises carefully selected document types, such as discharge summaries, surgical reports, and nursing notes, written in both Bokmål and Nynorsk. This curation gives broad coverage of the linguistic diversity found in Norwegian healthcare settings.

The models were evaluated on three synthetic Norwegian clinical benchmark datasets and two real-world clinical tasks. Each clinically specialized model consistently outperforms its baseline counterpart, which underscores the benefit of domain-specific pre-training for NLP applications in the medical field. The project was a collaboration among all Helse Vest entities (Helse Bergen, Helse Fonna, Helse Førde, and Helse Stavanger), with DIPS serving as project lead under the direction of Helse Vest ICT.
