Global News Digest

Technology

arXiv

Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

The paper introduces an "Identity Bridge" training method to overcome the reversal curse in LLMs. This low-cost approach enables models to learn higher-level rules, significantly improving logical reasoning performance.

arXiv

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

LASEV is an LLM-based multi-agent system that generates precise educational videos by orchestrating specialized agents for reasoning, visualization, and narration. It ensures logical accuracy and synchronization through structured script assembly rather than direct pixel generation.

arXiv

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

The Prototype Transformer replaces self-attention with linear-cost prototypes, enabling inherent interpretability by learning identifiable concepts. It maintains competitive performance while offering transparent, scalable language modeling.

arXiv

REAL: Resolving Knowledge Conflicts in Knowledge-Intensive Visual Question Answering via Reasoning-Pivot Alignment

REAL resolves knowledge conflicts in VQA via Reasoning-Pivot Alignment, using RPA-SFT and RPGD to detect and mitigate contradictions. This approach significantly improves discrimination accuracy and overall performance across datasets.

arXiv

Benchmarking at the Edge of Comprehension

Critique-Resilient Benchmarking uses adversarial verification to evaluate LLMs in the "post-comprehension regime," where human understanding is insufficient. This method maintains evaluation integrity by focusing on localized claims rather than full task comprehension.

arXiv

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

LLM4Cov is an offline framework for high-coverage hardware verification that uses execution-aware agentic learning. A 4B model achieved 90.4% coverage, outperforming its teacher despite being significantly smaller.

arXiv

LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?

LLM-WikiRace reveals that while top LLMs excel at easy tasks, their performance drops significantly on hard reasoning challenges. Results show long-horizon planning, not just knowledge, is the primary bottleneck for current models.

arXiv

PATRA: Pattern-Aware Alignment and Balanced Reasoning for Time Series Question Answering

PATRA addresses LLM limitations in time series reasoning by extracting trend and seasonality patterns and using balanced rewards. It outperforms baselines on TSQA tasks, enhancing cross-modal understanding and deep logical analysis.

arXiv

On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

The paper identifies "information self-locking" in RL-based LLM agents, where poor action selection and belief tracking create a bottleneck. It proposes AREW, an advantage reweighting technique, to alleviate this issue and boost performance by up to 60 points.

arXiv

OpenHospital: A Thing-in-itself Arena for Evolving and Benchmarking LLM-based Collective Intelligence

OpenHospital is an interactive simulation for evolving and benchmarking LLM-based collective intelligence. It enables physician agents to develop medical competence through dynamic patient interactions.

arXiv

AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

AgentProcessBench evaluates step-level quality in tool-using agents using 1,000 trajectories and human annotations. It reveals model fragilities and shows process evaluation complements outcome-based supervision.

arXiv

Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints

The study introduces AWARE, a retrieval-aligned framework that significantly improves clinical risk prediction in EHRs by addressing data heterogeneity and imbalance. It outperforms naive retrieval methods.

arXiv

Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

Rashomon Memory uses parallel, goal-conditioned agents to encode contradictory experiences, employing argumentation to retrieve and explain multi-perspective interpretations.

arXiv

Vision Language Models Cannot Reason About Physical Transformation

Vision-language models fail to reason about physical transformations, performing at chance on ConservationBench. They rely on textual priors rather than visual understanding and cannot maintain invariant representations of physical attributes.

arXiv

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

FeynmanBench reveals that while multimodal LLMs excel at local diagram recognition, they fail at global topological and algebraic reasoning, exposing critical architectural limits in scientific diagram comprehension.

arXiv

PECKER: A Precisely Efficient Critical Knowledge Erasure Recipe For Machine Unlearning in Diffusion Models

PECKER is an efficient machine unlearning method for diffusion models. It uses saliency masks to prioritize updates to critical parameters, reducing computational overhead while maintaining unlearning efficacy.

arXiv

What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning

The authors propose UILoop, an iterative UI-in-the-loop framework for multimodal GUI reasoning, enhancing element comprehension and transparency. They also introduce UI Comprehension-Bench, a new benchmark with 26,000 samples to evaluate these advancements.

arXiv

MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

MAVEN-T uses reinforced heterogeneous distillation to create a real-time, lightweight multi-agent trajectory predictor. It combines graph-based teacher knowledge with PPO-refined student training for safe, efficient autonomous driving.

arXiv

Process Reward Agents for Steering Knowledge-Intensive Reasoning

Process Reward Agents (PRA) provide online, step-by-step rewards to guide knowledge-intensive reasoning, outperforming baselines on MedQA. PRA boosts accuracy across models from 0.5B to 8B parameters without retraining, decoupling reasoning engines from domain-specific rewards.

arXiv

RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography

RadAgent is a tool-using AI agent that generates chest CT reports via stepwise, transparent reasoning. It outperforms 3D VLMs in accuracy, robustness, and faithfulness, enabling clinicians to inspect and validate AI decisions.