arXiv

GEO-Bench: Benchmarking Ranking Manipulation in Generative Engine Optimization

June 2, 2026 · Ojas Nimase, Zhe Chen, Gengpei Qi, Yue Zhao, Xiyang Hu

Title: GEO-Bench: Establishing a Standard for Evaluating Ranking Manipulation in Generative Engine Optimization

Abstract:

As large language models (LLMs) become the primary arbiters of rankings for user queries, sorting products, documents, and recommendations, the potential for manipulating their outputs has emerged as a significant threat to information integrity and fairness. The field of generative engine optimization (GEO) has produced numerous manipulation techniques, but without a standardized evaluation framework their relative efficacy and detectability remain largely unknown: each study typically relies on its own datasets and metrics, which prevents meaningful comparison. To address this gap, we introduce GEO-Bench, a unified benchmark that assesses GEO ranking-manipulation attacks under a consistent protocol.

The benchmark integrates a diverse set of techniques: black-box prompt-based attacks such as TAP and Zero-Shot, white-box gradient-based methods such as STS, RAF, and StealthRank, and ten white-hat C-SEO strategies. We test these methods on five datasets using a fixed open-weight ranker, Llama-3.1-8B-Instruct. The evaluation framework scores each method along two axes: effectiveness, measured by NRG, Success@α, and Promote@α, and stealth, quantified by keyword violation rates and perplexity ratios.
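The stealth axis can be illustrated with a minimal Python sketch. The definitions below are assumptions made for illustration, not the paper's own formulas: perplexity is taken as the exponential of the negative mean per-token log-probability, the perplexity ratio is assumed to compare manipulated text against the original text, and a keyword violation is assumed to mean that a text contains any phrase from a flagged list. The function names and the example phrase are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities: exp(-mean log p)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_ratio(
    manipulated_logprobs: Sequence[float],
    original_logprobs: Sequence[float],
) -> float:
    """Assumed definition: PPL(manipulated text) / PPL(original text).

    Values well above 1 suggest the manipulated text is less fluent
    under the scoring language model than the original.
    """
    return perplexity(manipulated_logprobs) / perplexity(original_logprobs)


def keyword_violation_rate(
    texts: Sequence[str],
    flagged_phrases: Sequence[str],
) -> float:
    """Assumed definition: fraction of texts containing any flagged phrase.

    Matching is a case-insensitive substring check; a real detector
    may tokenize or normalize differently.
    """
    if not texts:
        return 0.0
    phrases = [p.lower() for p in flagged_phrases]
    flagged = sum(
        1 for text in texts if any(p in text.lower() for p in phrases)
    )
    return flagged / len(texts)
```

In practice the per-token log-probabilities would come from a scoring language model, which is left outside this sketch. The effectiveness metrics (NRG, Success@α, Promote@α) are not sketched because their definitions are not given in this abstract.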

The results reveal a trade-off between effectiveness and stealth across adversarial attacks. Black-box content-rewriting techniques match or surpass gradient-based attacks in rank promotion while generating more fluent text, and in certain domains they evade detection based on both keyword violations and perplexity. Our findings indicate that the attacker's access model (black-box versus white-box) does not reliably predict attack strength. By standardizing datasets, attack implementations, and evaluation metrics, GEO-Bench enables the first direct comparison across attack paradigms and supports the development of robust detection methods.

Source: arXiv · Generated at: 2026-06-02 00:00:00 UTC