ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment
Title: ForeSci: Assessing LLM Agents on Their Ability to Judge Future AI Research Directions
Abstract: Strategic AI research frequently demands choices made in the absence of future data, such as identifying key bottlenecks, selecting research trajectories, or determining project positioning. To address this, we present ForeSci, a benchmark designed with temporal controls to test whether Large Language Model (LLM) agents can render forward-looking research judgments based on historical information. The benchmark comprises 500 tasks distributed across four rapidly evolving AI domains and four distinct decision categories. Each task is associated with an offline knowledge base aligned to a specific cutoff date; papers published after this cutoff are excluded during the generation phase and serve solely for validation purposes. To prevent agents from merely guessing future events, tasks are constructed from pre-cutoff taxonomy branches and evidence signals, while the backbone models used for answer generation are selected to precede the respective task cutoffs. We assess native LLMs, Hybrid Retrieval-Augmented Generation (RAG), and three specialized research-agent adaptations across four different backbones. Our findings indicate that while explicit evidence organization enhances traceability and factual grounding, the extent of these improvements varies significantly depending on the decision family. Diagnostic analysis uncovers a persistent issue of evidence-decision decoupling, where agents reference pertinent evidence yet predict incorrect research outcomes. Ultimately, ForeSci establishes a controlled framework for evaluating research agents as decision-making systems by focusing on forward-looking AI research judgment.
Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
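
The temporal control described in the abstract (papers published after the cutoff are withheld from generation and used only for validation, and backbones are chosen to precede the task cutoff) can be illustrated with a minimal sketch. Everything named here is hypothetical: the `Paper` record, `split_by_cutoff`, and `backbone_precedes_cutoff` are illustrative assumptions, not the benchmark's actual schema or code. The abstract also does not say which date defines a backbone "preceding" the cutoff, so a single generic date is assumed.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record type; the benchmark's real schema is not given in the abstract.
    title: str
    published: date


def split_by_cutoff(papers, cutoff):
    """Partition papers into a generation-time knowledge base and a validation-only set.

    Papers published on or before the cutoff are visible to the agent;
    papers published after it are held out and used only to validate answers.
    """
    knowledge_base = [p for p in papers if p.published <= cutoff]
    validation_only = [p for p in papers if p.published > cutoff]
    return knowledge_base, validation_only


def backbone_precedes_cutoff(backbone_date, task_cutoff):
    """Return True when the backbone's date is strictly before the task cutoff.

    Assumption: `backbone_date` stands for whichever date the benchmark uses
    (release date or training cutoff); the abstract does not specify which.
    """
    return backbone_date < task_cutoff
```

With this split, an agent would only ever query `knowledge_base`, while `validation_only` would be consulted after the answer is produced.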




