Technology
POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
POIROT has a multi-agent system audit itself for failures by interrogating its own agents, and it outperforms single-LLM evaluators. The open-source protocol enables internal safety oversight without relying on an external judge.
Forget Attention: Importance-Aware Attention Is All You Need
SISA integrates importance signals derived from a state-space model (SSM) into attention scores, outperforming Transformers and Mamba-3 in retrieval accuracy and speed. It achieves perfect needle-in-a-haystack (NIAH) scores while keeping the efficiency of standard scaled dot-product attention (SDPA), establishing a new score-level fusion paradigm.
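To make "score-level fusion" concrete, here is a minimal sketch of single-query attention in which a per-key importance value is added to the scaled dot-product score before the softmax. The additive form, the weight `alpha`, and the hand-written `importance` list are assumptions made for illustration; the summary does not say how SISA derives or combines its importance signal, so this should not be read as the paper's method.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fused_attention(query, keys, values, importance, alpha=1.0):
    """Single-query attention with a per-key bias added to the scores.

    ASSUMPTION: additive fusion `score + alpha * importance`. The real
    fusion rule and the source of `importance` are not given in the text.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [
        scale * sum(q * k for q, k in zip(query, key)) + alpha * imp
        for key, imp in zip(keys, importance)
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights


# Hypothetical toy inputs: keys 0 and 1 match the query equally well,
# but key 1 is marked as more important.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
importance = [0.0, 2.0, 0.0]

_, plain = fused_attention(query, keys, values, importance, alpha=0.0)
_, fused = fused_attention(query, keys, values, importance, alpha=1.0)
```

With `alpha=0.0` the two matching keys receive equal weight; with `alpha=1.0` the key flagged as important receives more. Because the bias only changes the scores, the softmax and the weighted sum are left as in ordinary attention.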
Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions
RACL learns to repair candidates before vetoing, reducing false rejections. It outperforms baselines by integrating known modifications into constraint learning.
Coordination Graphs for Constrained Multi-Agent Reinforcement Learning
CG-CMARL uses coordination graphs and Lagrangian duality to solve constrained multi-agent RL efficiently. It scales to large teams and generates Pareto fronts without retraining, outperforming baselines in cooperative navigation.
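As a rough illustration of the Lagrangian-duality idea only, the sketch below runs dual ascent on a one-variable toy problem: maximize a reward subject to a cost budget by alternating a primal solve with a multiplier update. The reward `-(x - 1)**2`, the cost `x`, the budget, and the step size are all hypothetical choices, and the coordination-graph part of CG-CMARL is not modelled here.

```python
def solve_primal(lam):
    """Maximize -(x - 1)**2 - lam * x over x in [0, 1].

    Setting the derivative to zero gives x = 1 - lam / 2, clipped to [0, 1].
    """
    return min(1.0, max(0.0, 1.0 - lam / 2.0))


def dual_ascent(budget=0.5, step=0.5, iterations=200):
    """Find x maximizing the toy reward subject to cost(x) = x <= budget."""
    lam = 0.0  # Lagrange multiplier, kept non-negative
    for _ in range(iterations):
        x = solve_primal(lam)
        violation = x - budget  # positive when the constraint is violated
        # Raise the multiplier when over budget, lower it when under.
        lam = max(0.0, lam + step * violation)
    return solve_primal(lam), lam


x_star, lam_star = dual_ascent()
```

Here the unconstrained optimum is `x = 1`, which exceeds the budget of 0.5, so the multiplier grows until the primal solution sits on the constraint boundary.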
COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
COMAP co-evolves LLM agent policies and textual world models via closed-loop interaction, enabling dynamic adaptation. The approach significantly outperforms baselines, improving performance by 16.75% on Qwen3-4B.
MOC: Multi-Order Communication in LLM-based Multi-Agent Systems
MOC enhances LLM multi-agent systems by formalizing multi-hop communication and merging semantics to preserve evidence fidelity. It consistently improves task performance while reducing communication costs across diverse datasets.
SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training
SIRI enables LLM agents to autonomously discover, validate, and internalize skills, eliminating external dependencies. It boosts performance on ALFWorld and WebShop benchmarks by distilling beneficial skills into the base policy.
A Mathematical Conflict Framework for Contextual Data Modulation
This paper proposes a generalized operator-based framework treating conflict as an independent, context-dependent metric. It unifies weighting and mapping under a single abstract operator, offering a versatile, algorithm-agnostic foundation for data modulation.
Spatial Representation Learning Beyond Pixels: Unifying Raster Data and Vector Semantics for Human-Centric Geospatial Foundation Models
This paper advocates for unifying raster imagery and vector semantics in geospatial foundation models. It argues that joint spatial representation learning captures human-centric insights missing in pixel-only approaches.
AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design
AgentPLM integrates reasoning-augmented decoding and contrastive policy optimization to enable agentic, feedback-driven protein sequence design. It achieves state-of-the-art performance by dynamically correcting errors using external biophysical tools.
Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses
Harness-1 uses RL to train a search agent where a harness manages state, boosting curated recall to 0.730. It outperforms open subagents by 11.4 points and rivals larger models.
Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback
This study reveals LLMs’ dangerous tendency to accommodate unsafe eating disorder queries, identifying linguistic markers that trigger hazardous outputs. It highlights the critical need for clinical oversight to mitigate these risks.
Bridging the Sim-to-Real Gap in Semiconductor Visual Program Synthesis via Input Binarization
Binarizing scanning electron microscope (SEM) inputs narrows the sim-to-real gap in semiconductor visual program synthesis. The technique improves the mean Dice coefficient from 0.4393 to 0.5256, enhancing geometric precision.
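For readers unfamiliar with the metric, the Dice coefficient of two binary masks A and B is 2|A and B| / (|A| + |B|), where |A and B| counts the pixels set in both. The sketch below thresholds a grayscale patch and scores it against a reference mask. The threshold of 128, the 3x3 patch, and the reference mask are illustrative assumptions, not values from the study, and fixed thresholding is only one possible way to binarize.

```python
def binarize(image, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to a 0/1 mask.

    ASSUMPTION: a fixed global threshold; the study's binarization
    procedure is not described in the summary.
    """
    return [[1 if px >= threshold else 0 for px in row] for row in image]


def dice(mask_a, mask_b):
    """Dice coefficient of two same-shaped 0/1 masks."""
    intersection = sum(
        a & b for row_a, row_b in zip(mask_a, mask_b) for a, b in zip(row_a, row_b)
    )
    total = sum(map(sum, mask_a)) + sum(map(sum, mask_b))
    # Convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical 3x3 grayscale patch and reference mask.
patch = [[200, 190, 20], [210, 40, 10], [30, 25, 15]]
reference = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]

score = dice(binarize(patch), reference)  # 3 shared pixels, 3 + 4 set pixels: 6/7
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixels.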
LLM-Evolved Pattern Generators for Optimal Classical Planning
This paper introduces LLM-evolved pattern generators that learn admissible, domain-dependent heuristics for optimal classical planning. The approach achieves state-of-the-art coverage with negligible overhead by synthesizing programs for saturated cost partitioning.
Beyond One-shot: AI Agents for Learning in Field Experiments
This study shows that AI agents outperform humans at optimizing healthcare messaging by learning from experimental data. General-purpose LLMs failed without this experimental context, indicating that domain data is crucial for success.
HLL: Can Agents Cross Humanity's Last Line of Verification?
HLL benchmarks eight multimodal agents on CAPTCHA verification, revealing their fragility in realistic GUI environments. The study highlights deficiencies in localization and action calibration, showing agents struggle to replace humans in secure workflows.
AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
AgentCL introduces a rigorous framework for evaluating continual learning in language agents via controlled, reusable task streams. It also presents MemProbe to analyze how memory designs impact learning across diverse tasks.
Iteris: Agentic Research Loops for Computational Mathematics
Iteris, an agentic AI system, advances computational mathematics by combining proofs with numerical experimentation. It successfully generated verified results for two open problems, demonstrating AI's potential in research workflows.
RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering
RASER reduces multi-hop QA costs by selectively escalating to expensive retrieval only when needed, cutting token usage by over half. It maintains competitive accuracy across benchmarks without requiring extra LLM calls for routing decisions.
MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
MCP-Persona is the first benchmark evaluating LLM agents on real-world personal applications like Reddit and Slack. It reveals significant performance gaps in using customized MCP tools.