arXiv

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts

June 2, 2026 · Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim, Dayoon Ko, Jeonghun Park, Haneul Yoo, Jaewon Cho, Junghun Park, Changyoon Lee, Kyochul Jang, Jaeyeon Kim, Eunsu Kim, Woojin Cho, Seungone Kim

Abstract: As frontier model evaluation shifts from basic competencies such as reasoning and instruction following toward more complex agentic tasks, benchmarks tailored to Korean environments remain scarce. To address this gap, we present K-BrowseComp, a benchmark of 400 problems for assessing web-browsing agents in Korean-specific contexts. A verified subset of 300 problems was written and validated by native Korean speakers. On this subset, leading LLMs such as GPT-5.5, DeepSeek-V4-Pro, and GLM-5.1 score only between 30.00% and 45.67%, a marked decline from their performance on the general BrowseComp benchmark. Korean LLMs developed under the nation’s Proprietary AI Foundation Model program score lower still, between 0.00% and 10.33%.

To probe these limitations further, we generated a synthetic split of 100 problems using difficult few-shot examples and targeted generation techniques that exploit the inherent asymmetry between solving and creating web-browsing problems. On this adversarially filtered diagnostic split, the most capable model reaches only 26.00%, highlighting the need for such targeted stress tests. Both the dataset and the associated code are publicly available.
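
As a rough reading aid, the reported percentages can be converted into whole-problem counts given the split sizes stated above (300 verified, 100 synthetic). The sketch below is only arithmetic on those figures. The function names are hypothetical, and the resulting counts are inferred here rather than reported in the abstract.

```python
# Split sizes as stated in the abstract.
VERIFIED_SIZE = 300
SYNTHETIC_SIZE = 100


def implied_correct(accuracy_pct: float, n_problems: int) -> int:
    """Whole number of solved problems implied by a reported accuracy (inferred)."""
    return round(accuracy_pct / 100 * n_problems)


def as_pct(correct: int, n_problems: int) -> float:
    """Accuracy in percent, rounded to two decimals as in the abstract."""
    return round(100 * correct / n_problems, 2)


# (label, reported accuracy in percent, split size)
reported = [
    ("verified, upper end of frontier range", 45.67, VERIFIED_SIZE),
    ("verified, lower end of frontier range", 30.00, VERIFIED_SIZE),
    ("verified, upper end of Korean LLM range", 10.33, VERIFIED_SIZE),
    ("synthetic, most capable model", 26.00, SYNTHETIC_SIZE),
]

for label, pct, n in reported:
    k = implied_correct(pct, n)
    # Round-trip check: the inferred count should reproduce the reported figure.
    print(f"{label}: {pct:.2f}% of {n} is about {k} problems ({as_pct(k, n):.2f}%)")
```

Under these assumptions, each reported figure maps to a whole number of problems that reproduces it after rounding, which is consistent with the stated split sizes.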

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC