arXiv

Whose Name Comes Up? II: Benchmarking and Intervention-Based Auditing of LLM-Based Scholar Recommendation

June 3, 2026 · Lisette Espín-Noboa, Gonzalo Gabriel Méndez

Original: arXiv:2602.08873v2. Abstract: Large language models (LLMs) are now used for academic expert recommendation. Existing audits typically evaluate such recommendations in isolation, ignoring end-user inference-time interventions. Thus, it remains unclear whether failures (e.g., refusals, hallucinations, uneven coverage) stem from model choice or deployment decisions. We introduce LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench measures technical quality and social representation using nine metrics. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Our results show that each intervention entails distinct tradeoffs. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, while RAG primarily improves technical quality while reducing diversity and parity. Overall, end-user interventions reshape trade-offs rather than providing uniform gains. LLMScholarBench makes all these dynamics auditable across models and interventions in LLM-based scholar recommendations.

Rewrite: Large language models (LLMs) are increasingly used to recommend academic experts. Existing audits, however, typically evaluate these recommendations in isolation and overlook the interventions that end users apply at inference time. As a result, it is unclear whether failures such as refusals, hallucinations, or uneven coverage stem from the choice of model or from deployment decisions. To address this gap, we present LLMScholarBench, a benchmark for auditing LLM-based scholar recommendation that jointly evaluates model infrastructure and end-user interventions across multiple tasks. LLMScholarBench uses nine metrics to measure technical quality and social representation. We instantiate the benchmark in physics expert recommendation and audit 22 LLMs under three interventions: temperature variation, representation-constrained prompting, and retrieval-augmented generation (RAG) via web search. Each intervention carries its own trade-offs. Higher temperature degrades validity, consistency, and factuality. Representation-constrained prompting improves diversity at the expense of factuality, whereas RAG primarily improves technical quality but reduces diversity and parity. Overall, end-user interventions do not deliver uniform gains; they reshape the trade-offs. LLMScholarBench makes these dynamics auditable across models and interventions in LLM-based scholar recommendation.
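To make the temperature finding concrete, here is a minimal sketch of one way a consistency score could be computed across repeated runs of the same prompt. It is not LLMScholarBench's implementation: the abstract does not define its nine metrics, so the use of mean pairwise Jaccard overlap, the function names `jaccard` and `consistency`, and the placeholder outputs in `toy_runs` are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def consistency(runs: list[list[str]]) -> float:
    """Mean pairwise Jaccard similarity across repeated runs of one prompt.

    ASSUMPTION: this is an illustrative stand-in, not the benchmark's metric.
    """
    sets = [set(r) for r in runs]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # a single run is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up placeholder outputs (letters stand in for recommended names),
# keyed by a hypothetical intervention setting.
toy_runs = {
    "temperature=0.0": [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "C"]],
    "temperature=1.0": [["A", "B", "C"], ["A", "D", "E"], ["B", "F", "G"]],
}

scores = {setting: consistency(runs) for setting, runs in toy_runs.items()}
for setting, score in scores.items():
    print(f"{setting}: consistency={score:.2f}")
```

On this toy data the identical runs score 1.0 and the divergent runs score much lower, which is the kind of gap an audit over models and interventions would look for. A real audit would replace `toy_runs` with actual model outputs and report the other metrics alongside this one, since the abstract stresses that interventions trade one quality off against another.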

Source: arXiv · Generated at: 2026-06-03 00:00:00 UTC