Global News Digest

Technology

arXiv

MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

MindClaw is a closed-loop framework for embodied Theory of Mind that enables precise, real-time assistance by integrating belief memory and cognitive triggers. It outperforms VLM baselines by optimizing intervention timing and maintaining dynamic environmental connectivity.

arXiv

TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

TriLens detects LLM hallucinations by tracking per-layer logit-lens entropy across attention, FFN, and residual streams. This white-box method uses compact entropy trajectories to identify internal uncertainty without storing high-dimensional states.
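
As a rough illustration of what a per-layer entropy trajectory looks like, the sketch below computes the Shannon entropy of the softmax distribution obtained from each layer's logits. It assumes the logit-lens projection has already been applied, so the input is just a list of per-layer logit vectors. The function names and toy numbers are hypothetical and are not taken from the TriLens paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_trajectory(per_layer_logits):
    """One scalar per layer: entropy of that layer's projected token distribution."""
    return [entropy(softmax(layer)) for layer in per_layer_logits]

# Toy example (assumed values): an early layer that is undecided over
# four tokens, and a later layer that strongly prefers the first token.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],   # uniform distribution, maximal entropy
    [8.0, 0.0, 0.0, 0.0],   # peaked distribution, low entropy
]
trajectory = entropy_trajectory(toy_logits)
```

Each layer is reduced to a single number, which is why a trajectory of this kind is compact compared with storing full hidden states.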

arXiv

Before the Model Learns the Bug: Fuzzing RLVR Verifiers

This paper introduces a fuzzing framework for RLVR verifiers to detect flaws before models learn them. It quantifies verifier bugs via adversarial testing and performance metrics.

arXiv

AnyEdit++: Adaptive Long-Form Knowledge Editing via Bayesian Surprise

AnyEdit++ enhances long-form knowledge editing in LLMs by using Bayesian Surprise to identify semantic boundaries, ensuring structural coherence. This approach outperforms baselines in reasoning, coding, and narrative tasks.

arXiv

CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

CAREAgent generates executable clinical orders using structured reasoning and tools. It outperforms existing methods on ClinicalBench, achieving significant F1 score improvements.

arXiv

Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking

The study introduces PAVE to diagnose how LLMs resolve conflicts between prior beliefs and retrieved evidence in RAG fact-checking. It finds that arbitration behavior is inconsistent and proposes a lightweight method based on Jensen-Shannon divergence (JSD) to improve factual reliability.
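
For readers unfamiliar with JSD, the sketch below shows the standard Jensen-Shannon divergence between two discrete distributions, for example a model's answer distribution before and after seeing retrieved evidence. How PAVE actually applies the divergence is not described in this summary, so the variable names and the before/after framing are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2) in nats."""
    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)

# Hypothetical label distributions over (supported, refuted).
prior_belief = [0.9, 0.1]      # assumed: model's belief before retrieval
with_evidence = [0.2, 0.8]     # assumed: model's belief after retrieval
shift = js_divergence(prior_belief, with_evidence)
```

A value near zero means the evidence barely moved the model, while a value near log(2) means the two distributions almost fully disagree.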

arXiv

Reasoning4Sciences: Bridging Reasoning Language Models to All Scientific Branches

This survey examines Reasoning Language Models across 28 scientific fields, revealing significant adoption disparities. It proposes a maturity framework to address imbalances and guide broader, equitable integration in scientific research.

arXiv

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

The paper introduces Expected Value Alignment (EVA) to improve generative reward modeling for formal math verification. EVA extracts continuous scores from discrete token distributions, reducing discretization artifacts while preserving interpretability.
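
The general idea of turning a discrete token distribution into a continuous score can be sketched as a probability-weighted average over the score tokens. The example below assumes the reward model emits probabilities for integer score tokens and renormalizes over those tokens only. The function name, the score range, and the numbers are hypothetical, and the paper's exact formulation may differ.

```python
def expected_score(score_probs):
    """Probability-weighted mean of numeric scores.

    score_probs maps each candidate score to the probability mass the
    model assigned to the corresponding token. Mass on non-score tokens
    is assumed to have been dropped, so we renormalize here.
    """
    total = sum(score_probs.values())
    if total <= 0.0:
        raise ValueError("score_probs must contain positive probability mass")
    return sum(score * p for score, p in score_probs.items()) / total

# Assumed toy distribution over the score tokens 1, 2, and 3.
probs = {1: 0.1, 2: 0.2, 3: 0.7}
hard_score = max(probs, key=probs.get)   # argmax decoding picks a single token
soft_score = expected_score(probs)       # expectation keeps the residual doubt
```

Argmax decoding would report 3 here, while the expectation is lower because it reflects the mass placed on scores 1 and 2. The result remains interpretable on the original score scale.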

arXiv

SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision

SkillRevise iteratively improves LLM-authored agent skills using execution traces, boosting success rates from 36.05% to 61.63%. It outperforms baselines by leveraging empirical data for robust, transferable procedural knowledge.

arXiv

The Case for Model Science: Verify, Explore, Steer, Refine

The authors propose "Model Science" to replace limited benchmarking with a systematic approach using Verify, Explore, Steer, and Refine. This discipline aims to explain model mechanisms and failures, drawing insights from fields like neuroscience and medicine.

arXiv

Deft Scheduling of Dynamic Cloud Workflows with Varying Deadlines via Mixture-of-Experts

DEFT is a deep reinforcement learning (DRL) scheduler that uses a Mixture-of-Experts architecture to optimize dynamic cloud workflows with varying deadlines. It outperforms baselines by reducing costs and deadline violations via a graph-adaptive gating mechanism.

arXiv

Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

TaskWeave, a hierarchical framework using dependency-aware trace memory, enables LLM agents to sustain coherent, long-term organizational dynamics. Evaluated on a year-long IT simulation, it outperforms baselines in coherence and grounding.

arXiv

"Skill issues": data-centric optimization of lakehouse agents

This study optimizes lakehouse agents via data-centric pipelines, achieving a 31.9% accuracy gain. By verifying lakehouse states rather than just outputs, it refines agent capabilities for write-heavy workflows.

arXiv

The Shape of Wisdom: Decision Trajectories in Language Models

This study maps decision trajectories in LLMs, revealing that "unstable-correct" answers are most common. It offers a methodology to distinguish stable from precarious model outputs.

arXiv

Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis

This study uses multimodal data and hierarchical knowledge graphs to predict advanced math learning behaviors and issue early academic warnings. Empirical results show the model effectively identifies at-risk students and enhances mastery through targeted interventions.

arXiv

HomeFlow: A Data Flywheel for Smart Home Agent Training with Verifiable Simulation

HomeFlow is a verifiable data flywheel using simulation and tree search to train smart home agents. Its models outperform GPT-5.5, achieving up to 87% task success rates on the new SmartHome-Bench.

arXiv

Application of Algorithms in Energy-Efficient Design Platforms for Green Building

A novel BIM-integrated platform using evolutionary algorithms reduced office building energy use by 29.3% with minimal cost and discomfort. This validates its effectiveness for sustainable, energy-efficient green building design.

arXiv

SIRIUS-SQL: Anchoring Multi-Candidate Text-to-SQL in Execution Feedback

SIRIUS-SQL improves text-to-SQL by using execution feedback and reinforcement learning to generate diverse, executable candidates. It outperforms existing systems, achieving 75.88% on BIRD and 91.20% on Spider.

arXiv

Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

Transformers trained on local comparisons develop an internal number line, mirroring human symbolic distance effects. Their embeddings collapse into a 1D manifold recovering hidden rank order, bridging cognitive science and neural networks.

arXiv

ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

ANDES is an agent-native framework that enhances autonomous instruction alignment by providing a modular data synthesis skill. It overcomes agent context limits via a self-evolving World Tree, achieving state-of-the-art post-training results.