Technology
TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning
TuneAgent uses RL-driven LLMs to autonomously tune Linux kernels, achieving up to 5.6% performance gains. It ensures valid configurations via structured rewards and a two-phase training approach.
Language-Native Materials Processing Design by Lightly Structured Text Database and Reasoning Large Language Model
This framework optimizes materials synthesis by using lightly structured text and reasoning LLMs to extract procedural logic from unstructured data. It successfully streamlined boron nitride nanosheet production, reducing trial-and-error cycles through iterative, evidence-based protocol refinement.
Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants
This paper argues ML fairness must quantify structural injustice via social determinants, not just sensitive attributes. Auditing these determinants before mitigation prevents new injustices and addresses systemic inequities.
Towards a Physics Foundation Model
The General Physics Transformer (GPhyT) demonstrates that a single model can simulate diverse physical phenomena, outperforming specialized solvers and enabling zero-shot generalization. This work establishes a viable foundation for universal Physics Foundation Models.
Deep Learning as the Disciplined Construction of Tame Objects
This paper uses tame geometry to provide convergence guarantees for stochastic gradient descent in nonsmooth, nonconvex deep learning. It frames deep learning models as compositions of tame functions, offering a rigorous mathematical framework for AI analysis.
T-POP: Test-Time Personalization with Online Preference Feedback
T-POP enables real-time LLM personalization by learning user preferences via online feedback and dueling bandits, without updating model parameters. It effectively solves the cold-start problem, outperforming existing baselines with rapid, data-efficient adaptation.
End-to-End Deep Learning for Predicting Metric Space-Valued Outputs
E2M predicts metric space-valued outputs via weighted Fréchet means, preserving intrinsic geometry without surrogate embeddings. It achieves state-of-the-art results on diverse structured data, including networks and distributions.
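As a minimal sketch of what a weighted Fréchet mean is (not E2M's implementation), consider one-dimensional distributions under the 2-Wasserstein metric. In that special case the minimizer of the weighted sum of squared distances is the weighted average of the quantile functions. The function names, the quantile-grid representation, and the choice of metric space are assumptions made for illustration.

```python
import numpy as np

def w2_squared(qa, qb):
    """Squared 2-Wasserstein distance between two 1-D distributions,
    each given by its quantile function sampled on a shared uniform grid."""
    return float(np.mean((np.asarray(qa) - np.asarray(qb)) ** 2))

def frechet_objective(candidate, quantiles, weights):
    """Weighted sum of squared distances from a candidate to each sample."""
    return sum(w * w2_squared(candidate, q) for q, w in zip(quantiles, weights))

def weighted_frechet_mean_w2(quantiles, weights):
    """Weighted Frechet mean for 1-D distributions under W2.
    Here it reduces to averaging quantile functions with normalized weights."""
    q = np.asarray(quantiles, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ q

# Hypothetical example: Uniform[0, 1] and Uniform[2, 3] on a 19-point grid.
grid = np.linspace(0.05, 0.95, 19)
samples = [grid, grid + 2.0]
mean_q = weighted_frechet_mean_w2(samples, [1.0, 1.0])
```

In a general metric space no closed form exists, and the minimizer of `frechet_objective` has to be found numerically.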
v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
v-HUB is a new benchmark for evaluating multimodal LLMs on video humor understanding. It reveals that audio cues significantly aid models in comprehending humor compared to visual-only inputs.
Distillation of Large Language Models via Concrete Score Matching
Concrete Score Distillation (CSD) improves LLM distillation by aligning relative logit differences, overcoming softmax blurring and shift invariance limits. It consistently outperforms recent methods in fidelity and diversity across various benchmarks.
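The shift-invariance point can be illustrated with a small sketch. Softmax gives the same distribution for logits `z` and `z + c`, so a loss on raw logits penalizes a student that differs from the teacher only by a constant, while a loss on pairwise logit differences does not. The unweighted pairwise loss below is a simplified, hypothetical stand-in and should not be read as the CSD objective itself.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def direct_logit_mse(student, teacher):
    """Mean squared error on raw logits (sensitive to constant shifts)."""
    return float(np.mean((student - teacher) ** 2))

def pairwise_logit_diff_loss(student, teacher):
    """Mean squared error on all pairwise logit differences z_i - z_j.
    Adding a constant to every logit leaves each difference unchanged."""
    ds = student[:, None] - student[None, :]
    dt = teacher[:, None] - teacher[None, :]
    return float(np.mean((ds - dt) ** 2))

teacher = np.array([1.0, 2.0, 3.0])
student = teacher + 5.0  # same softmax distribution, shifted logits

same_distribution = np.allclose(softmax(student), softmax(teacher))
raw_loss = direct_logit_mse(student, teacher)
diff_loss = pairwise_logit_diff_loss(student, teacher)
```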
Make a Video Call with LLM: A Measurement Campaign over Six Mainstream Apps
This study benchmarks six LLM video chat apps across quality, latency, and overhead, revealing that AI capabilities, not network latency, primarily drive user experience.
Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards
MAHALO aligns LLMs across verifiable and subjective rewards using PRM-guided decoding and multi-action heads, enabling concurrent optimization with minimal interference and flexible user control.
Verifying Meta-Awareness via Predictive Rewards in Reasoning Models
MAPR enhances reasoning models by predicting rollout statistics to optimize processing, boosting accuracy by 83.18% on AIME25 and accelerating training by 1.28x.
Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization
MADPO uses a reward model to adaptively weight DPO loss per sample, improving granular control over heterogeneous preference data. It outperforms existing methods by stabilizing training and preserving valuable signals.
Domain-Shift-Aware Conformal Prediction for Large Language Models
DS-CP adapts conformal prediction for LLMs under domain shifts by weighting calibration samples based on test prompt proximity. It ensures reliable coverage and computational efficiency, enhancing trustworthy uncertainty quantification.
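A rough sketch of proximity-weighted calibration, in the style of standard weighted conformal prediction, is given below. The exponential kernel on embedding distance, the `temperature` parameter, and the function names are assumptions for illustration and are not taken from DS-CP.

```python
import numpy as np

def proximity_weights(calib_emb, test_emb, temperature=1.0):
    """Hypothetical weighting: calibration prompts whose embeddings lie
    closer to the test prompt receive larger weights."""
    dist = np.linalg.norm(np.asarray(calib_emb) - np.asarray(test_emb), axis=1)
    return np.exp(-dist / temperature)

def weighted_conformal_threshold(scores, weights, alpha=0.1, test_weight=1.0):
    """Weighted (1 - alpha) quantile of calibration nonconformity scores.
    The test point carries mass at +infinity, so the threshold is infinite
    when the calibration mass alone cannot reach the 1 - alpha level."""
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(scores)
    s, w = scores[order], weights[order]
    cum = np.cumsum(w) / (w.sum() + test_weight)
    idx = int(np.searchsorted(cum, 1.0 - alpha, side="left"))
    return float(s[idx]) if idx < len(s) else float("inf")

# Hypothetical calibration data: 2-D embeddings and nonconformity scores.
calib_emb = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 4.0], [6.0, 8.0]])
calib_scores = np.array([1.0, 2.0, 3.0, 4.0])
w = proximity_weights(calib_emb, np.array([0.0, 0.0]), temperature=0.5)
threshold = weighted_conformal_threshold(calib_scores, w, alpha=0.4)
```

A candidate output would then be kept in the prediction set when its nonconformity score is at most `threshold`.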
HRTFformer: A Spatially-Aware Transformer for Individual HRTF Upsampling in Immersive Audio Rendering
HRTFformer is a transformer-based model that upsamples sparse HRTF data using spherical harmonics and attention mechanisms. It outperforms existing methods in accuracy and perceptual realism for immersive audio rendering.
Value Flows
Value Flows uses flow-based models to estimate full return distributions, improving decision-making by quantifying state uncertainty. It achieves a 1.3x improvement in success rate across 62 benchmark tasks.
StreamingVLM: Real-Time Understanding for Infinite Video Streams
StreamingVLM enables real-time comprehension of infinite video streams through efficient KV caching and supervised fine-tuning (SFT). It achieves 8 FPS on an H100, outperforming GPT-4o mini and boosting general VQA capabilities.
SHERLOCK: Towards Dynamic Knowledge Adaptation in LLM-enhanced E-commerce Risk Management
SHERLOCK integrates domain knowledge with LLMs to automate e-commerce fraud detection. It boosts investigation throughput by 386.7% and maintains accuracy via a self-evolving data flywheel.
Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?
This study finds that data leakage in current RL benchmarks hides poor generalization, so the benchmarks fail to distinguish genuine progress. It proposes new principles for robust evaluation to accurately assess RL methods.
Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization
Non-Transferable Examples (NTEs) recode data into model-specific subspaces, enabling authorized models to access information while blocking unauthorized ones. This training-free method ensures purpose limitation without relying on data perturbation or controlled training processes.