How AI Fails: An Interactive Pedagogical Tool for Demonstrating Dialectal Bias in Automated Toxicity Models
Title: Exposing the Flaws in AI: An Interactive Educational Resource for Highlighting Dialectal Bias in Automated Toxicity Detection
Abstract:
As artificial intelligence-driven content moderation becomes a ubiquitous feature of daily life, accusations of algorithmic bias are frequently voiced. Though often uttered in jest, these comments underscore a serious underlying anxiety: how can users trust that a post marked as "inappropriate" hasn’t merely fallen prey to a prejudiced system? This study addresses this issue through a two-pronged methodology.
First, a quantitative assessment of a widely used toxicity classifier (unitary/toxic-bert) measured performance gaps between texts written in African-American English (AAE) and Standard American English (SAE). The results expose a pronounced, systematic disparity: on average, the model assigns AAE text a toxicity score 1.8 times higher than it assigns SAE text. In the "identity hate" category, it scores AAE content 8.8 times higher.
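As a rough illustration of how such a ratio can be computed, the sketch below divides the mean per-label score of one group by that of another. It is a minimal example under stated assumptions, not the study's evaluation code: the scores are invented placeholders (deliberately unrelated to the reported 1.8 and 8.8 figures), the function name `mean_score_ratio` is hypothetical, and the label keys `toxic` and `identity_hate` are assumed to match the classifier's output labels.

```python
from statistics import mean

# Hypothetical per-text scores, invented purely for illustration.
# In practice each dict would hold a classifier's per-label scores for one text.
aae_scores = [
    {"toxic": 0.60, "identity_hate": 0.08},
    {"toxic": 0.30, "identity_hate": 0.04},
]
sae_scores = [
    {"toxic": 0.20, "identity_hate": 0.01},
    {"toxic": 0.40, "identity_hate": 0.02},
]


def mean_score_ratio(group_a, group_b, label):
    """Return mean(label score in group_a) / mean(label score in group_b)."""
    mean_a = mean(row[label] for row in group_a)
    mean_b = mean(row[label] for row in group_b)
    if mean_b == 0:
        raise ValueError("reference group has a zero mean score")
    return mean_a / mean_b


for label in ("toxic", "identity_hate"):
    ratio = mean_score_ratio(aae_scores, sae_scores, label)
    print(f"{label}: AAE/SAE mean score ratio = {ratio:.2f}")
```

A ratio above 1 means the first group receives higher scores on average. A ratio of means says nothing about how the scores are distributed, so it is best read alongside per-group score distributions.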
Second, this paper presents an interactive educational tool designed to make these abstract biases concrete. Central to the tool is a user-adjustable "sensitivity threshold," which illustrates that the algorithmic score alone is not the primary source of harm. Rather, the greater danger lies in the human-defined, ostensibly neutral policies that translate these scores into discriminatory actions. By offering both empirical evidence of disparate impact and a public-facing resource, this work aims to cultivate critical awareness and literacy regarding AI systems.
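The role of the threshold can be sketched in a few lines. The example below is an assumption-laden illustration rather than the tool's implementation: the scores are invented, and `flag_rate` is a hypothetical helper that applies one uniform cutoff to two groups and reports the share of texts flagged in each.

```python
def flag_rate(scores, threshold):
    """Fraction of texts whose score meets or exceeds the moderation threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical toxicity scores for two groups of texts (illustrative only).
aae = [0.35, 0.55, 0.62, 0.80]
sae = [0.10, 0.25, 0.45, 0.70]

# The same scores yield different outcomes depending on the chosen policy.
for threshold in (0.3, 0.5, 0.9):
    print(
        f"threshold={threshold:.1f}  "
        f"AAE flagged={flag_rate(aae, threshold):.2f}  "
        f"SAE flagged={flag_rate(sae, threshold):.2f}"
    )
```

With these placeholder scores, the gap in flag rates appears at the lower thresholds and disappears at the highest one, although the scores themselves never change. The size of the disparity therefore depends on the policy choice as well as on the model.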
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC





