Bridging the Gap: Transfer Learning from English PLMs to Malaysian English
Malaysian English is a low-resource creole that integrates elements from Malay, Chinese, and Tamil alongside Standard English. Traditional Named Entity Recognition (NER) models often struggle to accurately identify entities within Malaysian English texts, largely due to the language’s unique morphosyntactic structures, semantic characteristics, and frequent code-switching between English and Malay.
To address these challenges, we present MENmBERT and MENBERT, pre-trained language models (PLMs) with contextual understanding tailored to Malaysian English. We fine-tuned both models on manually annotated entities and relations from the Malaysian English News Article (MEN) Dataset. This fine-tuning lets the PLMs learn representations that capture the nuances of Malaysian English needed for NER and Relation Extraction (RE) tasks.
In comparative evaluations, MENmBERT outperformed the bert-base-multilingual-cased model by 1.52% on NER tasks and by 26.27% on RE tasks. While the aggregate NER performance gains may appear modest, our deeper analysis reveals statistically significant improvements when assessing performance across the 12 distinct entity labels. These results indicate that pre-training language models on corpora that are both language-specific and geographically targeted offers a promising strategy for enhancing NER capabilities in low-resource contexts. Furthermore, the dataset and code released in this study serve as essential resources for NLP research focused on Malaysian English.
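To illustrate why a small aggregate NER gain can coexist with large differences on individual entity labels, the following is a minimal sketch of per-label versus micro-averaged F1 computed from BIO tag sequences. It is not the evaluation code from the study: the function names (`extract_spans`, `evaluate`), the exact-span-match criterion, and the label names `PERSON`, `LOCATION`, and `ORGANIZATION` are assumptions chosen for illustration, and the toy sequences are invented.

```python
from collections import defaultdict


def extract_spans(tags):
    """Return the set of (label, start, end) entity spans in a BIO tag sequence."""
    spans = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        # Open a span on "B-", or leniently on an "I-" with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def evaluate(gold_seqs, pred_seqs):
    """Return (per-label F1 dict, micro-averaged F1) using exact span matching."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_spans(gold), extract_spans(pred)
        for label, _, _ in g & p:
            tp[label] += 1
        for label, _, _ in p - g:
            fp[label] += 1
        for label, _, _ in g - p:
            fn[label] += 1
    labels = sorted(set(tp) | set(fp) | set(fn))
    per_label = {lab: f1(tp[lab], fp[lab], fn[lab]) for lab in labels}
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_label, micro


# Invented toy data with hypothetical label names (not taken from the MEN Dataset).
gold = [
    ["B-PERSON", "I-PERSON", "O", "B-LOCATION"],
    ["B-ORGANIZATION", "O", "B-LOCATION"],
]
pred = [
    ["B-PERSON", "I-PERSON", "O", "B-LOCATION"],
    ["O", "O", "B-LOCATION"],
]

per_label, micro = evaluate(gold, pred)
for lab, score in per_label.items():
    print(f"{lab}: F1={score:.3f}")
print(f"micro: F1={micro:.3f}")
```

In this toy case the micro-averaged score stays fairly high while one label is missed entirely, which is the kind of gap that only a per-label breakdown exposes.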
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





