arXiv

Taming System Complexity: Demystifying Software Engineering Agents in Diagnosing Linux Kernel Faults

June 2, 2026 · Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou

Title: Managing System Complexity: Unraveling the Role of Software Engineering Agents in Identifying Linux Kernel Defects

Original: arXiv:2505.19489v2 (announce type: replace)

Abstract: The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, a FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., 7.2% - 11.2% accuracy increase) with minimal costs.

Rewrite: The Linux kernel is the foundation for countless systems, so defects in it can have severe consequences that affect billions of users. Fault localization (FL), the task of identifying the buggy code elements in software, is essential to software quality assurance. Although Large Language Model (LLM) agents have achieved promising FL accuracy on recent benchmarks such as SWE-bench, it remains unclear how well they perform on the Linux kernel. FL is considerably harder there because of the large-scale codebase, limited observability, and the diverse factors that can contribute to a fault.

To address this gap, this study presents LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs. We use it to empirically evaluate state-of-the-art LLM agents. Our initial results show that existing agents struggle with the task: the best top-1 accuracy at file level is only 41.6%. In response, we propose LinuxFL$^+$, an enhancement framework designed to improve the FL effectiveness of LLM agents on the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (for example, accuracy increases of 7.2% to 11.2%) at minimal cost.
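To make the reported metric concrete, the following is a minimal sketch of how file-level top-k accuracy could be computed. It rests on an assumption that the abstract does not confirm: a bug counts as correctly localized if at least one of its truly buggy files appears among the agent's k highest-ranked files. The function name `top_k_accuracy`, the bug IDs, and the rankings are hypothetical and used only for illustration; they are not taken from LinuxFLBench.

```python
from typing import Dict, List, Set


def top_k_accuracy(
    rankings: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int = 1,
) -> float:
    """Fraction of bugs whose top-k ranked files include a truly buggy file.

    Assumption: one matching file is enough for a hit. The benchmark's
    actual scoring rule may differ.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one bug")
    hits = 0
    for bug_id, buggy_files in ground_truth.items():
        # A bug for which the agent produced no ranking counts as a miss.
        predicted = rankings.get(bug_id, [])[:k]
        if any(path in buggy_files for path in predicted):
            hits += 1
    return hits / len(ground_truth)


# Hypothetical example data: bug IDs and agent rankings are invented.
ground_truth = {
    "bug-1": {"drivers/net/ethernet/intel/e1000e/netdev.c"},
    "bug-2": {"fs/ext4/inode.c"},
    "bug-3": {"kernel/sched/core.c"},
    "bug-4": {"mm/page_alloc.c"},
}
rankings = {
    "bug-1": [
        "drivers/net/ethernet/intel/e1000e/netdev.c",
        "drivers/net/ethernet/intel/e1000e/ich8lan.c",
    ],
    "bug-2": ["fs/ext4/super.c", "fs/ext4/inode.c"],
    "bug-3": ["kernel/sched/core.c"],
    "bug-4": ["mm/vmscan.c", "mm/slub.c"],
}

print(top_k_accuracy(rankings, ground_truth, k=1))  # 0.5
print(top_k_accuracy(rankings, ground_truth, k=2))  # 0.75
```

In this toy data, two of the four bugs have a buggy file ranked first, so top-1 accuracy is 0.5. Widening the cutoff to k=2 picks up one more bug. Under this reading, a top-1 figure of 41.6% would mean the agent's first-ranked file was a buggy one for roughly two in five bugs.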
