Technology
MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention
MindClaw is a closed-loop framework for embodied Theory of Mind that enables precise, real-time assistance by integrating belief memory and cognitive triggers. It outperforms VLM baselines by optimizing intervention timing and maintaining dynamic environmental connectivity.
TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection
TriLens detects LLM hallucinations by tracking per-layer logit-lens entropy across attention, FFN, and residual streams. This white-box method uses compact entropy trajectories to identify internal uncertainty without storing high-dimensional states.
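As a rough illustration of the entropy-trajectory idea, the sketch below computes the Shannon entropy of a softmax distribution at each layer and keeps only the resulting scalar per layer. It is not the TriLens implementation: the function names and the toy logits are assumptions, and a real logit lens would first project each layer's hidden state through the model's unembedding matrix to obtain those logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_trajectory(per_layer_logits):
    """Map each layer's vocabulary logits to a single entropy value.

    Only one scalar per layer is kept, so the trajectory stays compact
    regardless of hidden-state or vocabulary size.
    """
    return [entropy(softmax(layer)) for layer in per_layer_logits]

# Hypothetical logits for a 4-token vocabulary at three successive layers.
toy_layers = [
    [0.0, 0.0, 0.0, 0.0],  # uniform: maximal uncertainty
    [2.0, 0.0, 0.0, 0.0],  # mild preference for token 0
    [8.0, 0.0, 0.0, 0.0],  # strong preference for token 0
]
trajectory = entropy_trajectory(toy_layers)
```

In this toy case the trajectory falls monotonically as the distribution sharpens. A detector along the lines described would presumably consume such trajectories from the attention, FFN, and residual streams as features.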
Before the Model Learns the Bug: Fuzzing RLVR Verifiers
This paper introduces a fuzzing framework for RLVR verifiers to detect flaws before models learn them. It quantifies verifier bugs via adversarial testing and performance metrics.
AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise
AnyEdit++ enhances long-form knowledge editing in LLMs by using Bayesian Surprise to identify semantic boundaries, ensuring structural coherence. This approach outperforms baselines in reasoning, coding, and narrative tasks.
CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation
CAREAgent generates executable clinical orders using structured reasoning and tools. It outperforms existing methods on ClinicalBench, achieving significant F1 score improvements.
Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking
The study introduces PAVE to diagnose how LLMs resolve conflicts between prior beliefs and retrieved evidence in RAG fact-checking. It reveals inconsistent arbitration behaviors and proposes a lightweight JSD-based method to improve factual reliability.
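For readers unfamiliar with the measure, the sketch below shows how Jensen-Shannon divergence (JSD) can quantify the shift between a model's label distribution before and after seeing retrieved evidence. The label set and probabilities are hypothetical, and this is not claimed to be the method proposed in the paper, only the standard JSD definition applied to such a pair.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 are skipped, following the 0 * log(0) = 0 convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def jensen_shannon_divergence(p, q):
    """Symmetric JSD in bits; lies in [0, 1] for base-2 logarithms."""
    # The mixture m is positive wherever p or q is, so the KL terms are finite.
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical distributions over (supported, refuted, not enough info).
prior_belief = [0.70, 0.20, 0.10]    # before retrieval
with_evidence = [0.15, 0.75, 0.10]   # after retrieval
shift = jensen_shannon_divergence(prior_belief, with_evidence)
```

A value near 0 means the evidence barely moved the model, while a value near 1 means the two distributions are almost disjoint.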
Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches
This survey examines Reasoning Language Models across 28 scientific fields, revealing significant adoption disparities. It proposes a maturity framework to address imbalances and guide broader, equitable integration in scientific research.
Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification
The paper introduces Expected Value Alignment (EVA) to improve generative reward modeling for formal math verification. EVA extracts continuous scores from discrete token distributions, reducing discretization artifacts while preserving interpretability.
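The general idea of turning a discrete score-token distribution into a continuous score can be sketched as a probability-weighted mean. The 1-to-5 scale, the probabilities, and the function names below are assumptions for illustration; the summary does not specify how EVA defines its score tokens or normalizes them.

```python
def expected_score(token_probs):
    """Probability-weighted mean over numeric score tokens.

    token_probs maps each candidate score (e.g. 1..5) to the probability the
    model assigns to the corresponding token. Probabilities are renormalized
    over the score tokens, since the rest of the vocabulary also holds mass.
    """
    total = sum(token_probs.values())
    if total <= 0.0:
        raise ValueError("score tokens carry no probability mass")
    return sum(score * p for score, p in token_probs.items()) / total

def argmax_score(token_probs):
    """Discrete baseline: return only the single most likely score token."""
    return max(token_probs, key=token_probs.get)

# Hypothetical distribution over score tokens 1..5 for one candidate proof.
probs = {1: 0.05, 2: 0.05, 3: 0.40, 4: 0.45, 5: 0.05}
continuous = expected_score(probs)
discrete = argmax_score(probs)
```

Here the discrete baseline reports 4 while the expected value is about 3.4. The expected value reflects the near-tie between 3 and 4 that the argmax discards, which is the kind of discretization artifact the summary refers to.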
SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision
SkillRevise iteratively improves LLM-authored agent skills using execution traces, boosting success rates from 36.05% to 61.63%. It outperforms baselines by leveraging empirical data for robust, transferable procedural knowledge.
The Case for Model Science: Verify, Explore, Steer, Refine
The authors propose "Model Science" to replace limited benchmarking with a systematic approach using Verify, Explore, Steer, and Refine. This discipline aims to explain model mechanisms and failures, drawing insights from fields like neuroscience and medicine.
Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts
DEFT is a novel DRL scheduler using a Mixture-of-Experts architecture to optimize dynamic cloud workflows with varying deadlines. It outperforms baselines by reducing costs and deadline violations via a graph-adaptive gating mechanism.
Can LLM Agents Sustain Long-Horizon Organizational Dynamics?
TaskWeave, a hierarchical framework using dependency-aware trace memory, enables LLM agents to sustain coherent, long-term organizational dynamics. Evaluated on a year-long IT simulation, it outperforms baselines in coherence and grounding.
"Skill issues": data-centric optimization of lakehouse agents
This study optimizes lakehouse agents via data-centric pipelines, achieving a 31.9% accuracy gain. By verifying lakehouse states rather than just outputs, it refines agent capabilities for write-heavy workflows.
The Shape of Wisdom: Decision Trajectories in Language Models
This study maps decision trajectories in LLMs, revealing that "unstable-correct" answers are most common. It offers a methodology to distinguish stable from precarious model outputs.
Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis
This study uses multimodal data and hierarchical knowledge graphs to predict advanced math learning behaviors and issue early academic warnings. Empirical results show the model effectively identifies at-risk students and enhances mastery through targeted interventions.
HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation
HomeFlow is a verifiable data flywheel using simulation and tree search to train smart home agents. Its models outperform GPT-5.5, achieving up to 87% task success rates on the new SmartHome-Bench.
Application of Algorithms in Energy-Efficient Design Platforms for Green Building
A novel BIM-integrated platform using evolutionary algorithms reduced office building energy use by 29.3% with minimal cost and discomfort. This validates its effectiveness for sustainable, energy-efficient green building design.
SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback
SIRIUS-SQL improves text-to-SQL by using execution feedback and reinforcement learning to generate diverse, executable candidates. It outperforms existing systems, achieving 75.88% on BIRD and 91.20% on SPIDER.
Emergent Ordinal Geometry in Transformers Trained on Local Comparisons
Transformers trained on local comparisons develop an internal number line, mirroring human symbolic distance effects. Their embeddings collapse into a 1D manifold recovering hidden rank order, bridging cognitive science and neural networks.
ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment
ANDES is an agent-native framework that enhances autonomous instruction alignment by providing a modular data synthesis skill. It overcomes agent context limits via a self-evolving World Tree, achieving state-of-the-art post-training results.