arXiv

SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

June 2, 2026 · Kuan Li, Shuo Zhang, Huacan Wang, Fangzhou Yu, Zecheng Sheng, Yi Gu, Weipeng Ming, Lei Xue, Chen Liu, Sen Hu, Ronghao Chen, Siyue Lin, Yuqing Hou, Xiaofeng Mou, Yi Xu

Title: SMH-Bench: Evaluating LLM Agents in Environment-Grounded Reasoning and Action within Smart Homes

Abstract: As smart homes transition into intricate, state-dependent living spaces, Large Language Models (LLMs) are increasingly required to interpret user intent, preferences, and multi-device dynamics. Yet, current benchmarks in this domain frequently rely on static instruction-to-API mappings or constrained simulations, which fall short of assessing an LLM’s ability to reason, interact, and act reliably in realistic domestic settings. To bridge this gap, we present SMH-Bench, a holistic evaluation framework for LLMs operating in smart-home contexts. Leveraging HomeEnv, a simulator that is both executable and verifiable, SMH-Bench comprises 1,100 high-quality tasks distributed across 7 main categories and 22 specific subcategories. The benchmark stratifies these tasks by home complexity, covering environments from small apartments to dense, multi-room setups equipped with up to 135 devices. Our experiments reveal that while state-of-the-art LLMs perform well on explicit control and query tasks, they demonstrate notable deficiencies in scheduling automation, resolving ambiguities, and engaging in personalized reasoning, particularly as the complexity of the home environment grows. We anticipate that SMH-Bench will support the advancement of smart-home agents that are more reliable, context-aware, and suitable for real-world deployment.
