Technology
Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge
The paper introduces an "Identity Bridge" training method to overcome the reversal curse in LLMs. This low-cost approach enables models to learn higher-level rules, significantly improving logical reasoning performance.
Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation
LASEV is an LLM-based multi-agent system that generates precise educational videos by orchestrating specialized agents for reasoning, visualization, and narration. It ensures logical accuracy and synchronization through structured script assembly rather than direct pixel generation.
Prototype Transformer: Towards Language Model Architectures Interpretable by Design
The Prototype Transformer replaces self-attention with linear-cost prototypes, enabling inherent interpretability by learning identifiable concepts. It maintains competitive performance while offering transparent, scalable language modeling.
REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment
REAL resolves knowledge conflicts in VQA via Reasoning-Pivot Alignment, using RPA-SFT and RPGD to detect and mitigate contradictions. This approach significantly improves discrimination accuracy and overall performance across datasets.
Benchmarking at the Edge of Comprehension
Critique-Resilient Benchmarking uses adversarial verification to evaluate LLMs in the "post-comprehension regime," where human understanding is insufficient. This method maintains evaluation integrity by focusing on localized claims rather than full task comprehension.
LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation
LLM4Cov is an offline framework for high-coverage hardware verification that uses execution-aware agentic learning. A 4B model achieved 90.4% coverage, outperforming its teacher despite being significantly smaller.
LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
LLM-WikiRace reveals that while top LLMs excel at easy tasks, their performance drops significantly on hard reasoning challenges. Results show long-horizon planning, not just knowledge, is the primary bottleneck for current models.
PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering
PATRA addresses LLM limitations in time series reasoning by extracting trend and seasonality patterns and using balanced rewards. It outperforms baselines on time series question answering (TSQA) tasks, enhancing cross-modal understanding and deep logical analysis.
On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
The paper identifies "information self-locking" in RL-based LLM agents, where poor action selection and belief tracking create a bottleneck. It proposes AREW, an advantage reweighting technique, to alleviate this issue and boost performance by up to 60 points.
OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence
OpenHospital is an interactive simulation for evolving and benchmarking LLM-based collective intelligence. It enables physician agents to develop medical competence through dynamic patient interactions.
AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents
AgentProcessBench evaluates step-level quality in tool-using agents on 1,000 trajectories with human annotations. It reveals model fragilities and shows that process evaluation complements outcome-based supervision.
Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
The study introduces AWARE, a retrieval-aligned framework that significantly improves clinical risk prediction in EHRs by addressing data heterogeneity and imbalance. It outperforms naive retrieval methods.
Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory
Rashomon Memory uses parallel, goal-conditioned agents to encode contradictory experiences, employing argumentation to retrieve and explain multi-perspective interpretations.
Vision Language Models Cannot Reason About Physical Transformation
Vision-language models fail to reason about physical transformations, performing at chance on ConservationBench. They rely on textual priors rather than visual understanding and cannot maintain invariant representations of physical attributes.
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
FeynmanBench reveals that while multimodal LLMs excel at local diagram recognition, they fail at global topological and algebraic reasoning, exposing critical architectural limits in scientific diagram comprehension.
PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models
PECKER is an efficient machine unlearning method for diffusion models that uses saliency masks to prioritize critical parameter updates, reducing computational overhead while maintaining unlearning efficacy.
What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
The authors propose UILoop, an iterative UI-in-the-loop framework for multimodal GUI reasoning, enhancing element comprehension and transparency. They also introduce UI Comprehension-Bench, a new benchmark with 26,000 samples to evaluate these advancements.
MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction
MAVEN-T uses reinforced heterogeneous distillation to create a real-time, lightweight multi-agent trajectory predictor. It combines graph-based teacher knowledge with PPO-refined student training for safe, efficient autonomous driving.
Process Reward Agents for Steering Knowledge-Intensive Reasoning
Process Reward Agents (PRA) provide online, step-by-step rewards to guide knowledge-intensive reasoning, outperforming baselines on MedQA. PRA boosts accuracy across models from 0.5B to 8B parameters without retraining, decoupling reasoning engines from domain-specific rewards.
RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography
RadAgent is a tool-using AI agent that generates chest CT reports via stepwise, transparent reasoning. It outperforms 3D VLMs in accuracy, robustness, and faithfulness, enabling clinicians to inspect and validate AI decisions.