Technology
LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition
LC-ERD mines latent logic via consistency-regulated reward decomposition to address label noise and coarse supervision in LLM self-alignment. It enables resilient self-evolving reasoning by cleaning the reasoning manifold and measuring the utility of individual steps.
When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs
Whether multi-agent RL improves LLM workflows depends on tradeoffs among workflow, scale, and policy sharing: isolated policies yield higher peak accuracy but risk terminal cliffs, while shared policies suffer from asymmetric gradient dominance.
Fundamental Limitation in Explaining AI
This paper proves a fundamental quadrilemma: AI performance, interpretability, faithfulness, and environmental complexity cannot all be maximized simultaneously. Thus, complete faithfulness in AI explanations is theoretically impossible.
Test-Time Deep Thinking to Explore Implicit Rules
TTExplore uses a 7B Exp-Thinker to deduce implicit rules via stable RL, boosting agent performance by 14–19 points in embodied tasks.
Hypothesis Generation and Inductive Inference in Children and Language Models
This study compares children and LLMs in an inductive inference task, finding both discount unreliable evidence and dissociate task completion from rule generalization.
Experiments in Agentic AI for Science
This study introduces DeepTS and DeepScribe, agentic AI frameworks for automating time-series data curation and physics lecture analysis. These systems leverage hybrid architectures to enhance scientific workflows beyond current LLM limitations.
Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling
This paper introduces stochastic backtracking over a persistent prefix pool to improve test-time scaling. It achieves higher accuracy with fewer tokens than frontier-only PRM methods.
BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting
BatteryMFormer is a multi-level Transformer framework for early battery degradation forecasting. It outperforms baselines by leveraging aging-condition priors, meta-pattern memory, and dual-view encoding.
FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization
FrontierOR benchmarks LLMs on large-scale optimization, revealing that even top models struggle to outperform standard solvers in efficiency and quality.
Cross-Entropy Games and Frost Training
Frost Training uses reward gradients in embedding space to accelerate LLM policy optimization in Cross-Entropy Games. It significantly boosts output quality and computational efficiency during GRPO training.
RULER: Representation-Level Verification of Machine Unlearning
RULER introduces representation-level metrics (M2, M4) to verify machine unlearning, revealing that models passing output-level checks often retain forgotten data. This framework exposes hidden memorization across diverse domains where traditional methods fail.
Agyn: An Open-Source Platform for AI Agents with Scalable On-Demand Execution, Agent Definition as a Code, and Zero-Trust Access
Agyn is an open-source platform for scalable AI agent deployment, featuring a Kubernetes-based serverless runtime, Terraform-based agent definition as code, and zero-trust access.
Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration
This study shows that LLM confidence calibration is highly sensitive to the measurement protocol, challenging assumptions about improvements in Instruct models. It also shows that verbalized confidence fails to distinguish correct answers from plausible but incorrect ones.
Benchmarking AI for low-resource contexts: Thinking beyond leaderboards
This study argues for evaluating fully deployed AI systems in low-resource contexts, integrating deployment variables like hardware limits and connectivity. It proposes a standardized reporting framework to replace abstract leaderboard metrics with actionable, context-aware assessments.
FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research
FundaPod uses multi-persona AI agents and knowledge graph memory to support transparent, human-centric fundamental investment research. It enables independent agent analysis for portfolio managers to adjudicate divergent views and build verifiable investment plans.
c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization
c-TPE extends the Tree-structured Parzen Estimator to handle inequality constraints in expensive hyperparameter optimization. It outperforms existing methods across 81 tasks and is available via OptunaHub.
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
Cookie-Bench introduces a reference-free, autonomous framework evaluating web generation via continuous on-screen interaction. It correlates strongly with human ratings, offering scalable, holistic assessment of functionality and aesthetics.
Stability Analysis of Sharpness-Aware Minimization
This study proves SAM gets trapped at saddle points due to inferior diffusion compared to vanilla gradient descent. It shows momentum and batch size are crucial for alleviating this instability and improving generalization.
DeepIPCv2: LiDAR-powered Robust Environmental Perception and Navigational Control for Autonomous Vehicle
DeepIPCv2 is an end-to-end autonomous driving system that uses LiDAR for robust perception and control. It outperforms existing methods under varying lighting conditions and in precision; the code is to be open-sourced.
Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning
This paper introduces a novel decision-focused learning method using score function gradient estimation to remove structural assumptions. It effectively handles nonlinear objectives and uncertain constraints, matching specialized techniques while offering broader applicability.