arXiv

Malaysian English News Decoded: A Linguistic Resource for Named Entity and Relation Extraction

Abstract:

Malaysian English (Manglish) differs from Standard English in ways that create significant hurdles for natural language processing (NLP) applications targeting this variety. Most available datasets are based on Standard English and are therefore insufficient for improving NLP performance on Malaysian English text. Experiments with state-of-the-art Named Entity Recognition (NER) systems on Malaysian English news articles showed that these systems cannot handle the morphosyntactic variations of Malaysian English. To our knowledge, no annotated dataset currently exists for refining these models.

In response to this gap, we have developed the Malaysian English News (MEN) dataset, comprising 200 news articles that have been manually labeled with entities and relations. We subsequently fine-tuned the spaCy NER tool, demonstrating that utilizing a dataset specifically designed for Malaysian English leads to substantial improvements in NER performance. This study details our methodology for data collection and annotation, alongside a comprehensive analysis of the resulting dataset.
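As an illustration of how NER performance of the kind mentioned above is commonly scored, the following minimal sketch computes entity-level precision, recall, and F1 with exact span-and-label matching. The abstract does not state the evaluation protocol, so exact matching is an assumption here, and the `Entity` type, the `entity_prf` helper, the labels, and the spans are all hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Entity(NamedTuple):
    """A labeled character span (hypothetical representation)."""
    start: int
    end: int
    label: str


def entity_prf(gold: Iterable[Entity], pred: Iterable[Entity]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 under exact span+label match."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up annotations for a single article (not taken from the MEN dataset).
gold = [
    Entity(0, 12, "PERSON"),
    Entity(30, 42, "LOCATION"),
    Entity(50, 54, "ORGANIZATION"),
]
pred = [
    Entity(0, 12, "PERSON"),         # correct span and label
    Entity(30, 42, "ORGANIZATION"),  # correct span, wrong label: counts as an error
]

precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under exact matching, a prediction with the right span but the wrong label is both a false positive and a missed gold entity, which is why the second prediction lowers precision and recall together.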

To ensure annotation quality, we measured inter-annotator agreement and had a subject matter expert adjudicate the disagreements. The final dataset includes 6,061 entities and 3,268 relation instances. The paper also describes the spaCy fine-tuning configuration and analyzes the NER results. This resource is intended to advance NLP research on Malaysian English, particularly in NER and relation extraction. The dataset and the associated annotation guidelines are publicly available on GitHub.
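One common way to quantify inter-annotator agreement is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The abstract does not say which agreement measure was used, so the sketch below is an assumption for illustration only; the `cohen_kappa` helper and the token labels are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up token-level labels from two annotators (not from the MEN dataset).
annotator_a = ["PER", "PER", "ORG", "LOC", "O", "O"]
annotator_b = ["PER", "ORG", "ORG", "LOC", "O", "O"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.3f}")
```

Items on which the two annotators disagree, such as the second token here, are the ones that would go to an expert for adjudication.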


Source: arXiv
