Global News Digest

Technology

arXiv

LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition

LC-ERD mines latent logic via consistency-regulated reward decomposition to address label noise and coarse supervision in LLM self-alignment. It supports resilient self-evolving reasoning by cleaning the reasoning manifold and measuring the utility of individual steps.

arXiv

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

Whether multi-agent RL improves LLM workflows depends on tradeoffs: isolated policies yield higher peak accuracy but risk terminal cliffs, while shared policies suffer from asymmetric gradient dominance.

arXiv

Fundamental Limitation in Explaining AI

This paper proves a fundamental quadrilemma: AI performance, interpretability, faithfulness, and environmental complexity cannot all be maximized simultaneously. Thus, complete faithfulness in AI explanations is theoretically impossible.

arXiv

Test-Time Deep Thinking to Explore Implicit Rules

TTExplore uses a 7B Exp-Thinker to deduce implicit rules via stable RL, boosting agent performance by 14–19 points in embodied tasks.

arXiv

Hypothesis Generation and Inductive Inference in Children and Language Models

This study compares children and LLMs on an inductive inference task, finding that both discount unreliable evidence and that both dissociate task completion from rule generalization.

arXiv

Experiments in Agentic AI for Science

This study introduces DeepTS and DeepScribe, agentic AI frameworks for automating time-series data curation and physics lecture analysis. These systems leverage hybrid architectures to enhance scientific workflows beyond current LLM limitations.

arXiv

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

This paper introduces stochastic backtracking over a persistent prefix pool to improve test-time scaling. It achieves higher accuracy with fewer tokens than frontier-only PRM methods.

arXiv

BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

BatteryMFormer is a multi-level Transformer framework for early battery degradation forecasting. It outperforms baselines by leveraging aging-condition priors, meta-pattern memory, and dual-view encoding.

arXiv

FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

FrontierOR benchmarks LLMs on large-scale optimization, revealing that even top models struggle to outperform standard solvers in efficiency and quality.

arXiv

Cross-Entropy Games and Frost Training

Frost Training uses reward gradients in embedding space to accelerate LLM policy optimization in Cross-Entropy Games. It significantly boosts output quality and computational efficiency during GRPO training.

arXiv

RULER: Representation-Level Verification of Machine Unlearning

RULER introduces representation-level metrics (M2, M4) to verify machine unlearning, revealing that models passing output-level checks often retain the data they were meant to forget. This framework exposes hidden memorization across diverse domains where traditional methods fail.

arXiv

Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access

Agyn is an open-source platform for scalable AI agent deployment, featuring a Kubernetes-based serverless runtime, Terraform-based agent definition as code, and zero-trust access.

arXiv

Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

This study shows that LLM confidence calibration is highly sensitive to the measurement protocol, challenging assumptions about Instruct model improvements. It also shows that verbalized confidence fails to distinguish correct answers from plausible but incorrect ones.

arXiv

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

This study argues for evaluating fully deployed AI systems in low-resource contexts, integrating deployment variables like hardware limits and connectivity. It proposes a standardized reporting framework to replace abstract leaderboard metrics with actionable, context-aware assessments.

arXiv

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

FundaPod uses multi-persona AI agents and knowledge graph memory to support transparent, human-centric fundamental investment research. It enables independent agent analysis for portfolio managers to adjudicate divergent views and build verifiable investment plans.

arXiv

c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

c-TPE modifies Tree-Structured Parzen Estimators to handle inequality constraints in expensive hyperparameter optimization. It outperforms existing methods across 81 tasks and is available via OptunaHub.

arXiv

Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

Cookie-Bench introduces a reference-free, autonomous framework evaluating web generation via continuous on-screen interaction. It correlates strongly with human ratings, offering scalable, holistic assessment of functionality and aesthetics.

arXiv

Stability Analysis of Sharpness-Aware Minimization

This study proves SAM gets trapped at saddle points due to inferior diffusion compared to vanilla gradient descent. It shows momentum and batch size are crucial for alleviating this instability and improving generalization.

arXiv

DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle

DeepIPCv2 is an end-to-end autonomous driving system that uses LiDAR for robust perception and control. It outperforms existing methods in precision and in robustness to lighting variations; the code is to be open-sourced.

arXiv

Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning

This paper introduces a novel decision-focused learning method using score function gradient estimation to remove structural assumptions. It effectively handles nonlinear objectives and uncertain constraints, matching specialized techniques while offering broader applicability.