Global News Digest

Technology

arXiv

POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

POIROT detects failures in multi-agent systems by having the system interrogate its own agents, and it outperforms single-LLM evaluators. The open-source protocol enables internal safety oversight without relying on an external judge.

arXiv

Forget Attention: Importance-Aware Attention Is All You Need

SISA integrates importance signals derived from a state space model (SSM) into attention scores, outperforming Transformers and Mamba-3 in retrieval accuracy and speed. It achieves perfect needle-in-a-haystack (NIAH) scores while keeping the efficiency of standard scaled dot-product attention (SDPA), establishing a new score-level fusion paradigm.

arXiv

Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions

RACL learns to repair candidates before vetoing, reducing false rejections. It outperforms baselines by integrating known modifications into constraint learning.

arXiv

Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

CG-CMARL uses coordination graphs and Lagrangian duality to solve constrained multi-agent RL efficiently. It scales to large teams and generates Pareto fronts without retraining, outperforming baselines in cooperative navigation.

arXiv

COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

COMAP co-evolves LLM agent policies and textual world models via closed-loop interaction, enabling dynamic adaptation. This approach significantly outperforms baselines, improving performance by 16.75% on Qwen3-4B.

arXiv

MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

MOC enhances LLM multi-agent systems by formalizing multi-hop communication and merging semantics to preserve evidence fidelity. It consistently improves task performance while reducing communication costs across diverse datasets.

arXiv

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

SIRI enables LLM agents to autonomously discover, validate, and internalize skills, eliminating external dependencies. It boosts performance on ALFWorld and WebShop benchmarks by distilling beneficial skills into the base policy.

arXiv

A Mathematical Conflict Framework for Contextual Data Modulation

This paper proposes a generalized operator-based framework treating conflict as an independent, context-dependent metric. It unifies weighting and mapping under a single abstract operator, offering a versatile, algorithm-agnostic foundation for data modulation.

arXiv

Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models

This paper advocates for unifying raster imagery and vector semantics in geospatial foundation models. It argues that joint spatial representation learning captures human-centric insights missing in pixel-only approaches.

arXiv

AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design

AgentPLM integrates reasoning-augmented decoding and contrastive policy optimization to enable agentic, feedback-driven protein sequence design. It achieves state-of-the-art performance by dynamically correcting errors using external biophysical tools.

arXiv

Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

Harness-1 uses RL to train a search agent where a harness manages state, boosting curated recall to 0.730. It outperforms open subagents by 11.4 points and rivals larger models.

arXiv

Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback

This study reveals LLMs’ dangerous tendency to accommodate unsafe eating disorder queries, identifying linguistic markers that trigger hazardous outputs. It highlights the critical need for clinical oversight to mitigate these risks.

arXiv

Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization

This study narrows the sim-to-real gap in semiconductor visual program synthesis by binarizing scanning electron microscope (SEM) inputs. The technique raises the mean Dice coefficient from 0.4393 to 0.5256, improving geometric precision.
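
For readers unfamiliar with the metric, the Dice coefficient measures the overlap between two binary masks as twice the shared area divided by the total area of both masks. The sketch below is illustrative only: the summary does not describe how the paper binarizes its inputs, so the fixed `threshold=128` and the function names `binarize` and `dice_coefficient` are assumptions, not the authors' method.

```python
def binarize(image, threshold=128):
    """Map each grayscale pixel to 1 (>= threshold) or 0 (< threshold).

    The fixed threshold is a hypothetical choice for illustration.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def dice_coefficient(pred, target):
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for 0/1 masks."""
    intersection = 0
    total = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += p & t
            total += p + t
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x2 example with made-up pixel values.
predicted_mask = binarize([[200, 30], [180, 90]])
reference_mask = [[1, 0], [0, 0]]
score = dice_coefficient(predicted_mask, reference_mask)
```

On this scale, the reported change from 0.4393 to 0.5256 is an absolute gain of 0.0863, or roughly 19.6% relative to the starting value.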

arXiv

LLM-Evolved Pattern Generators for Optimal Classical Planning

This paper introduces LLM-evolved pattern generators that learn admissible, domain-dependent heuristics for optimal classical planning. The approach achieves state-of-the-art coverage with negligible overhead by synthesizing programs for saturated cost partitioning.

arXiv

Beyond One-shot: AI Agents for Learning in Field Experiments

This study shows that AI agents outperform humans at optimizing healthcare messaging when they learn from experimental data. General LLMs failed without this context, indicating that domain data is crucial for success.

arXiv

HLL: Can Agents Cross Humanity's Last Line of Verification?

HLL benchmarks eight multimodal agents on CAPTCHA verification, revealing their fragility in realistic GUI environments. The study highlights deficiencies in localization and action calibration, showing agents struggle to replace humans in secure workflows.

arXiv

AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

AgentCL introduces a rigorous framework for evaluating continual learning in language agents via controlled, reusable task streams. It also presents MemProbe to analyze how memory designs impact learning across diverse tasks.

arXiv

Iteris: Agentic Research Loops for Computational Mathematics

Iteris, an agentic AI system, advances computational mathematics by combining proofs with numerical experimentation. It successfully generated verified results for two open problems, demonstrating AI's potential in research workflows.

arXiv

RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

RASER reduces multi-hop QA costs by selectively escalating to expensive retrieval only when needed, cutting token usage by over half. It maintains competitive accuracy across benchmarks without requiring extra LLM calls for routing decisions.

arXiv

MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

MCP-Persona is the first benchmark evaluating LLM agents on real-world personal applications such as Reddit and Slack. It reveals significant performance gaps when agents use customized Model Context Protocol (MCP) tools.