arXiv

It's TIME: Towards the Next Generation of Time Series Forecasting Benchmarks

Original: arXiv:2602.12147v4, Announce Type: replace

Abstract: Time series foundation models (TSFMs) are revolutionizing the forecasting landscape from specific dataset modeling to generalizable task evaluation. However, we contend that existing benchmarks exhibit common limitations in four dimensions: constrained data composition dominated by reused legacy sources, compromised data integrity lacking rigorous quality assurance, misaligned task formulations detached from real-world contexts, and rigid analysis perspectives that obscure generalizable insights. To bridge these gaps, we introduce TIME, a next-generation task-centric benchmark comprising 50 fresh datasets and 98 forecasting tasks, tailored for strict zero-shot TSFM evaluation free from data leakage. Integrating large language models and human expertise, we establish a human-in-the-loop benchmark construction pipeline to ensure high data integrity and redefine task formulation by aligning forecasting configurations with real-world operational requirements and variate predictability. Furthermore, we propose a novel pattern-level evaluation perspective that moves beyond traditional dataset-level evaluations based on static meta labels. By leveraging structural time series features to characterize intrinsic temporal properties, this approach offers generalizable insights into model capabilities across diverse patterns. We evaluate 12 TSFMs and establish a multi-granular leaderboard to facilitate in-depth analysis and visualized inspection. The leaderboard is available at https://huggingface.co/spaces/Real-TSF/TIME-leaderboard.

Rewrite: The emergence of time series foundation models (TSFMs) is transforming the forecasting field, shifting the focus from isolated dataset modeling to the assessment of broadly applicable tasks. Nevertheless, we argue that current benchmarks suffer from four primary shortcomings: a restricted data mix heavily reliant on recycled historical sources, questionable data integrity due to insufficient quality control, task designs that are disconnected from practical scenarios, and inflexible analytical frameworks that hinder the extraction of universal insights. Addressing these deficiencies, we present TIME, a forward-looking, task-oriented benchmark featuring 98 forecasting tasks and 50 newly curated datasets. It is specifically designed to support rigorous zero-shot evaluation of TSFMs while preventing data leakage. To guarantee robust data quality and realign task definitions with practical operational needs and the predictability of individual variates, we developed a human-in-the-loop construction process that combines human judgment with large language models. Additionally, we introduce a new evaluation paradigm focused on patterns rather than static, dataset-level meta-labels. By utilizing structural time series attributes to define inherent temporal characteristics, this method provides broader insights into how models perform across various patterns. We assessed 12 TSFMs and created a multi-granular leaderboard to enable detailed analysis and visual review, which can be accessed at https://huggingface.co/spaces/Real-TSF/TIME-leaderboard.
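The pattern-level idea described above can be illustrated with a minimal sketch. It assumes one structural feature, a simple seasonal-strength score, and groups forecast errors by that feature instead of by dataset label. The abstract does not say which features or metrics TIME uses, so `seasonal_strength`, `pattern_level_mae`, the 0.5 threshold, and the toy series are all hypothetical choices made for illustration only.

```python
import math
import random
from statistics import fmean, pvariance


def seasonal_strength(series, period):
    """Share of variance explained by the average per-phase profile (0 to 1).

    Illustrative feature only; not taken from the TIME benchmark itself.
    """
    values = [float(v) for v in series]
    cycles = len(values) // period
    if cycles < 2:
        raise ValueError("need at least two full cycles")
    values = values[: cycles * period]
    total_var = pvariance(values)
    if total_var == 0.0:
        return 0.0
    # Average value at each position within the cycle.
    profile = [fmean(values[phase::period]) for phase in range(period)]
    residuals = [v - profile[i % period] for i, v in enumerate(values)]
    return max(0.0, 1.0 - pvariance(residuals) / total_var)


def pattern_level_mae(histories, forecasts, actuals, period, threshold=0.5):
    """Average MAE per pattern bucket rather than per dataset."""
    buckets = {"seasonal": [], "non_seasonal": []}
    for history, forecast, actual in zip(histories, forecasts, actuals):
        strength = seasonal_strength(history, period)
        label = "seasonal" if strength >= threshold else "non_seasonal"
        mae = fmean(abs(f - a) for f, a in zip(forecast, actual))
        buckets[label].append(mae)
    return {label: fmean(errors) for label, errors in buckets.items() if errors}


# Toy data: one strongly periodic series and one white-noise series.
rng = random.Random(0)
seasonal_series = [math.sin(2 * math.pi * t / 12) for t in range(240)]
noise_series = [rng.gauss(0.0, 1.0) for _ in range(240)]

scores = pattern_level_mae(
    histories=[seasonal_series, noise_series],
    forecasts=[[1.0, 1.0], [0.0, 0.0]],
    actuals=[[1.0, 2.0], [2.0, 2.0]],
    period=12,
)
print(scores)
```

Grouping by a computed property of each series, rather than by a fixed domain or frequency tag, is what lets the same report compare models across datasets that share a temporal pattern.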


Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC
