arXiv

Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

June 3, 2026 · Atm Mizanur Rahman (University of Illinois Urbana-Champaign), Md Arid Hasan (University of Toronto), Syed Ishtiaque Ahmed (University of Toronto), Sharifa Sultana (University of Illinois Urbana-Champaign)

Title: Assessing the Reliability of Large Language Models in Handling Practical Consumer Device Repair Inquiries

Abstract:

Large language models (LLMs) face a significant but largely unexamined challenge in the domain of consumer device repair. These tasks demand sophisticated reasoning capabilities, including the ability to interpret incomplete problem statements, conduct hardware-specific diagnostics, offer actionable troubleshooting steps, and make safety-critical decisions. The stakes are high, as inaccurate guidance can lead to permanent data loss, battery hazards, or physical damage to the device.

To address this gap, we present a new benchmark comprising 991 authentic repair questions sourced from Reddit. The dataset covers three key areas: phone repair, computer repair, and data recovery. Each question is accompanied by reference solutions authored by professional technicians. Additionally, we provide Bangla translations of the data to assess cross-lingual capabilities.

We tested six state-of-the-art LLMs in both English and Bangla, evaluating their responses on four repair-specific metrics: correctness, completeness, practicality, and safety. Our findings indicate that while LLMs can offer helpful assistance, they remain unreliable for high-risk, real-world repair scenarios unless subjected to rigorous evaluation and equipped with explicit safeguards.

Phone repair emerged as the most challenging and safety-sensitive category. Across all models, we observed significant errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Furthermore, Bangla responses consistently underperformed compared to their English counterparts across all domains. Among the models tested, GPT-5.4 demonstrated the strongest overall performance.
