TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning
Abstract:
Time series data informs high-stakes decisions across many real-world sectors. Although Large Language Model (LLM) agents can analyze data through natural language interfaces and auxiliary tools, it remains unclear whether they can perform reliable time series analysis in multi-turn conversations. Current evaluation benchmarks predominantly address single-step objectives, such as anomaly detection or forecasting, and thereby neglect practical workflows. In real-world scenarios, user objectives often shift dynamically, agents must build on previous analyses, and final conclusions are typically drawn from cumulative evidence.
To address this gap, we present TimeSage-MT, a multi-turn benchmark for agentic time series reasoning. The benchmark comprises 240 tasks and 2,680 dialogue turns, covers eight real-world domains, and ranges from basic data exploration to decision-oriented analysis. TimeSage-MT is constructed with a reproducible pipeline that transforms real time series datasets into multi-turn conversations with verifiable answers. It offers a standardized evaluation protocol and a public leaderboard to facilitate the comparison of time series agentic systems.
To validate the benchmark’s effectiveness, we evaluate state-of-the-art LLMs as well as TimeSage, a newly developed structured agent equipped with an extensive library of time series skills. Our findings reveal significant performance declines on decision-oriented tasks, primarily attributed to deficiencies in memory retention, uncertainty management, and domain-specific decision-making. TimeSage-MT thus highlights critical shortcomings in current agentic reasoning capabilities and establishes a rigorous foundation to guide future work in the field.
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




