Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction
Abstract:
The linguistic differences between Malaysian English (Manglish) and Standard English create significant hurdles for natural language processing (NLP) applications targeting this variety. Most available datasets are based on Standard English and are therefore insufficient for improving NLP performance on Malaysian English text. Experiments applying state-of-the-art Named Entity Recognition (NER) systems to Malaysian English news articles showed that these systems cannot handle the variety's morphosyntactic variations. To our knowledge, no annotated dataset currently exists that could be used to refine these models.
In response to this gap, we have developed the Malaysian English News (MEN) dataset, comprising 200 news articles that have been manually labeled with entities and relations. We subsequently fine-tuned the spaCy NER tool, demonstrating that utilizing a dataset specifically designed for Malaysian English leads to substantial improvements in NER performance. This study details our methodology for data collection and annotation, alongside a comprehensive analysis of the resulting dataset.
To ensure annotation quality, we measured inter-annotator agreement and had a subject matter expert adjudicate the disagreements. The final dataset includes 6,061 entities and 3,268 relation instances. The paper further describes the spaCy fine-tuning configuration and analyzes the resulting NER performance. This resource is intended to advance NLP research on Malaysian English, enabling researchers to accelerate their work, particularly in NER and relation extraction. The dataset and associated annotation guidelines are publicly available on GitHub.
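
As an illustration of what measuring inter-annotator agreement can involve, the sketch below computes Cohen's kappa for two annotators who labeled the same items. The abstract does not say which agreement statistic was used for MEN, so Cohen's kappa is an assumption here, chosen only because it is a common option. The function name `cohen_kappa`, the label set, and the sample labels are all hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginal rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    # Both annotators used a single identical label everywhere.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical token-level entity labels (illustrative only, not MEN data).
annotator_1 = ["PERSON", "O", "ORG", "O", "LOC", "O"]
annotator_2 = ["PERSON", "O", "ORG", "O", "ORG", "O"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")  # kappa = 0.75
```

In this toy example the annotators agree on five of six items, while one third of agreement is expected by chance, which gives a kappa of 0.75. Items where the labels differ are the ones that would go to an expert for adjudication.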
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC