AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
Title: AICompanionBench: Evaluating the Efficacy of LLMs-as-Judges in Ensuring AI Companion Safety
The rapid expansion of AI companion services like Character.AI and Replika has heightened worries regarding the safety of human-AI interactions. Addressing this gap, this research presents AICompanionBench, which appears to be the inaugural publicly accessible benchmark dataset featuring human-AI companion dialogues categorized by detailed safety risk levels.
The dataset comprises 2,123 authentic conversations sourced from Replika and retrieved from Reddit. These entries were labeled via a collaborative process involving both humans and AI, spanning nine distinct categories: no-harm, substance abuse, physical aggression, verbal aggression, antisocial behavior, sexual behavior, self-harm and suicide, control, and manipulation.
Leveraging this benchmark, the study assesses the performance of 20 leading large language models (LLMs)—both open-source and closed-source—within an LLM-as-judge framework designed to identify unsafe exchanges. The analysis reveals significant disparities in model capabilities. While more robust models demonstrate high overall accuracy, they continue to face difficulties with subtle categories like manipulation and often misclassify harmless conversations as dangerous.
These results indicate that although contemporary LLMs are proficient at spotting overt harmful material, they lack the sensitivity required to detect implicit unsafe dynamics. This work provides the safety research community with a novel benchmark dataset for AI companionship and offers valuable perspectives on utilizing LLMs to oversee AI companion platforms. The dataset can be accessed publicly at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
Source: arXiv Generated at: 2026-06-04 00:00:00 UTC





