HumorRank: A Tournament-Based Leaderboard for Evaluating Humor Generation in Large Language Models
Abstract:
Evaluating humor in large language models (LLMs) is difficult because comedic appeal is inherently subjective and comparative, and it arises from interactions among several comedic devices rather than from a single measurable trait. As a result, current evaluation methods often yield fragmented scores or task-specific judgments, which makes it hard to compare models with one another. To address this, we present HumorRank, a framework that ranks textual humor generation with tournament-style structures built on theory-informed pairwise preferences. Using the General Theory of Verbal Humor (GTVH) to guide LLM-based comparative assessments, HumorRank evaluates nine models, spanning proprietary, open-weight, and specialized systems, on the SemEval-2026 MWAHAHA and Humor Transfer Bench datasets. Global rankings are derived by aggregating tournament outcomes with Bradley-Terry estimation. The rankings are stable across judges: independent Llama and Qwen judges reach a Kendall τ of 0.889 on both benchmarks. The resulting leaderboard shows distinct model stratification, indicating that effective humor generation depends not just on model size but on the ability to use specific comedic mechanisms such as incongruity, conciseness, escalation, and absurdity. HumorRank thus offers a scalable and interpretable method for benchmarking LLM-generated humor, moving beyond the limitations of isolated automatic metrics and constrained human evaluation.
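
The aggregation step named in the abstract (pairwise preferences, then Bradley-Terry strengths, then Kendall τ between judges) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the three placeholder models, the win counts for the two hypothetical judges, the function names `bradley_terry` and `kendall_tau`, and the choice of the classic minorization-maximization update. The abstract does not specify the fitting procedure or the data used by HumorRank.

```python
from itertools import combinations


def bradley_terry(wins, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is how often item i was preferred over item j (diagonal = 0).
    Uses the classic MM (Zermelo) fixed-point update; this sketch assumes
    every item wins at least once so that no strength collapses to zero.
    Returns strengths normalised to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists (no correction for ties)."""
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Toy, made-up win counts for three placeholder models under two
# hypothetical judges (10 comparisons per pair). Not data from the paper.
judge_a_wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]
judge_b_wins = [
    [0, 6, 8],
    [4, 0, 7],
    [2, 3, 0],
]

strengths_a = bradley_terry(judge_a_wins)
strengths_b = bradley_terry(judge_b_wins)
tau = kendall_tau(strengths_a, strengths_b)

print("judge A strengths:", [round(s, 3) for s in strengths_a])
print("judge B strengths:", [round(s, 3) for s in strengths_b])
print("Kendall tau between judges:", tau)
```

As a reading aid for the reported agreement figure: nine models give 36 model pairs, so under a no-ties assumption a τ of 0.889 is the value that two discordant pairs out of 36 would produce, since (34 - 2) / 36 ≈ 0.889.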
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC





