arXiv

Long Live Fine-Tuning: Task-Specific Transformers Outperform Zero-Shot LLMs for Misinformation Response Classification on Reddit

Abstract

As large language models (LLMs) are increasingly used to verify online information, a prevailing assumption holds that their scale and general-purpose capabilities suffice for the nuanced task of classifying misinformation discourse. This study challenges that premise by analyzing 900 Reddit comments related to three misinformation claims fact-checked by PolitiFact (covering environmental, health, and immigration topics). Each comment was assigned one of three labels: "belief" (propagating the claim), "fact-check" (correcting the claim), or "other."

The study evaluated nine models across three paradigms: BART-MNLI, three variants of Llama, three commercial frontier LLMs (Claude Haiku 4.5, Gemini Flash Lite 2.5, and Claude Sonnet 4.6), and fine-tuned versions of DistilBERT and RoBERTa. The models were tested under both universal and topic-specific label schemas. The results contradict the assumption that scale guarantees performance.

Fine-tuned RoBERTa achieved a macro-$F_1$ score of 0.62, significantly surpassing the best zero-shot performance of 0.50 recorded by Claude Haiku 4.5, all while incurring a fraction of the per-query cost. This supervised advantage was particularly pronounced in the "belief" category, an implicit and affective classification that zero-shot models consistently failed to detect adequately.
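For reference, macro-$F_1$ is conventionally the unweighted mean of the per-class $F_1$ scores. The summary does not state which variant the authors computed, so the standard definition below is an assumption, written for the three labels used here ($TP_c$, $FP_c$, and $FN_c$ are the true positives, false positives, and false negatives for class $c$):

$$
F_1^{(c)} = \frac{2\,TP_c}{2\,TP_c + FP_c + FN_c},
\qquad
\text{macro-}F_1 = \frac{1}{3} \sum_{c \,\in\, \{\text{belief},\ \text{fact-check},\ \text{other}\}} F_1^{(c)}
$$

Because every class carries equal weight regardless of how often it occurs, a class whose $F_1$ is near zero caps the macro score at roughly two thirds, which is why weak "belief" detection weighs so heavily on the overall result.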

Contrary to expectations, increasing model size did not yield better results. Llama-3-8B performed on par with the larger Llama-3-70B. Furthermore, Claude Sonnet 4.6 performed worse than the smaller Haiku model when using generic labels, with belief detection plummeting to 0.17. Additionally, Sonnet 4.6 refused to process a subset of comments identified as sensitive. The authors note that this is a consequence of safety alignment protocols rather than a limitation in the model’s inherent capacity.

The study also found that label schema and topic interact to shape zero-shot performance, causing the same model to vary by more than 0.13 macro-$F_1$ across different topics when labels were matched. In contexts where failing to identify belief statements is costly, task-specific fine-tuning remains the superior and more reliable approach, despite the growing dominance of large generative models.
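The per-topic variation described above can be measured by scoring each topic separately and taking the gap between the best and worst topic. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record layout, the helper names (`macro_f1`, `macro_f1_by_topic`, `topic_spread`), and the toy data are all hypothetical, and a class with no gold or predicted instances is scored as 0.0 by convention.

```python
from collections import defaultdict

LABELS = ("belief", "fact-check", "other")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1 (assumed standard definition)."""
    total = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Convention (assumption): a class that never appears scores 0.0.
        total += 2 * tp / denom if denom else 0.0
    return total / len(labels)


def macro_f1_by_topic(records):
    """records: iterable of (topic, gold_label, predicted_label) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for topic, gold, pred in records:
        grouped[topic][0].append(gold)
        grouped[topic][1].append(pred)
    return {topic: macro_f1(g, p) for topic, (g, p) in grouped.items()}


def topic_spread(scores):
    """Gap between the best- and worst-scoring topic."""
    return max(scores.values()) - min(scores.values())


# Toy data for illustration only; these are not results from the study.
records = [
    ("health", "belief", "belief"),
    ("health", "fact-check", "fact-check"),
    ("health", "other", "other"),
    ("immigration", "belief", "other"),  # a missed belief comment
    ("immigration", "fact-check", "fact-check"),
    ("immigration", "other", "other"),
]

scores = macro_f1_by_topic(records)
spread = topic_spread(scores)
print(scores, round(spread, 3))
```

In the toy data, a single missed "belief" comment is enough to open a large gap between topics, which illustrates why aggregate scores can hide topic-level weaknesses.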


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
