Global News Digest

Technology

arXiv

Cross-modal linkage risk in clinical vision-language models

Clinical vision-language models risk re-linking de-identified images to reports via shared embeddings. This privacy vulnerability persists even with pathology-matched negatives, posing significant data security concerns.
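As a rough illustration of the linkage risk described above, the sketch below matches each image embedding to its nearest report embedding by cosine similarity. The embeddings are made-up toy vectors, the function names are hypothetical, and the assumption that image i belongs to report i is ours for the demo; none of this is taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def link_images_to_reports(image_embs, report_embs):
    """For each image embedding, return the index of the most similar report."""
    return [
        max(range(len(report_embs)), key=lambda j: cosine(img, report_embs[j]))
        for img in image_embs
    ]

def linkage_rate(matches):
    """Fraction of images linked to their own report (assumes image i <-> report i)."""
    return sum(1 for i, j in enumerate(matches) if i == j) / len(matches)

# Toy embeddings (assumed values, for illustration only).
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2], [0.1, 0.0, 1.0]]
report_embs = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3], [0.2, 0.1, 0.9]]

matches = link_images_to_reports(image_embs, report_embs)
print(matches, linkage_rate(matches))
```

A high linkage rate on held-out, de-identified pairs would suggest that the shared embedding space alone is enough to re-associate an image with its report.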

arXiv

SeClaw: Spec-Driven Security Task Synthesis for Evaluating Autonomous Agents

SeClaw is a framework for synthesizing security tasks and evaluating autonomous LLM agents via execution-based, trajectory-aware assessments. It addresses limitations of manual benchmarks by enabling scalable, reproducible security testing.

arXiv

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

AdvCL repurposes adversarial perturbations to stabilize continual learning, reducing catastrophic forgetting and boosting robustness. Its plug-in modules offer a versatile geometric control mechanism for various CL paradigms.

arXiv

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

FOAM stabilizes Shampoo by adaptively adjusting damping and eigendecomposition frequency based on staleness error. This reduces computational costs while maintaining robust convergence and accuracy.

arXiv

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

This study audits NLP annotation reporting (2018–2025), revealing that while operational details are often documented, critical validity metrics like compensation and inter-annotator agreement are frequently omitted.

arXiv

Do Multimodal Agents Really Benefit from Tool Use? A Systematic Study of Capability Gains

This study finds that multimodal agents gain minimal capability from tool use, as most tool-solved problems were already solvable without them. Agents master tool mechanics rather than leveraging tools for genuine problem-solving.

arXiv

SPADE-Bench: Evaluating Spontaneous Strategic Deception in Agents via Plan-Action Divergence

SPADE-Bench evaluates spontaneous strategic deception in LLM agents by measuring plan-action divergence under pressure. This benchmark addresses critical safety gaps in autonomous systems by distinguishing deception from hallucination.

arXiv

Policy and World Modeling Co-Training for Language Agents

PaW co-trains policy and world models using on-policy RL rollouts, avoiding extra simulators or inference costs. It consistently outperforms RL baselines across benchmarks by leveraging inherent transition data for stable, informative supervision.

arXiv

When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures

This study maps attention-circuit emergence across three 1B-class models, revealing distinct developmental trajectories and separate transitions for capability and attention-sink formation. Key findings include an inherent L0/L1 BOS-floor and early circuit identification via capability screens.

arXiv

Evolutionary Discovery of Bivariate Bicycle Codes with LLM-Guided Search

An LLM-guided evolutionary search identified 465 novel quantum codes, including an indecomposable [[288,16,12]] code and high-weight variants, demonstrating the efficacy of AI in navigating complex algebraic design spaces.

arXiv

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

AutoForest automates forest plot generation from biomedical papers via end-to-end evidence extraction and synthesis. It streamlines meta-analysis by proposing ICOS, retrieving data, and creating publication-quality visuals.

arXiv

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

ODTQA-FoRe introduces a dataset for future tabular data forecasting, addressed by the TimeFore LLM agent framework. This system combines retrieval, forecasting, and analysis to improve accuracy and consistency in answering complex queries.

arXiv

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

This study introduces LLMFI to systematically analyze error propagation in LLMs, revealing 17 insights and proposing four software-only strategies to enhance inference reliability.

arXiv

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

GC-MoE predicts single-cell gene expression from histology images using a genomics-guided mixture of experts. It outperforms existing methods by modeling cell-type-specific variability and neighbor interactions.

arXiv

Initialization is Half the Battle: Generating Diverse Images from a Guidance Potential Posterior

DivIn generates diverse images by sampling initial noise from a guidance potential posterior using Langevin dynamics. This inference-time enhancement outperforms existing methods and complements trajectory-based techniques.
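To make the idea of drawing initial noise with Langevin dynamics concrete, here is a one-dimensional toy sketch of unadjusted Langevin sampling. The log-density (a standard normal prior plus a quadratic potential pulling toward 2.0), the step size, and all names are assumptions chosen for illustration; they are not DivIn's actual potential or settings.

```python
import math
import random

def grad_log_posterior(x):
    """Gradient of the toy log-posterior: log N(x; 0, 1) - (x - 2)^2 / 2.

    This posterior is Gaussian with mean 1.0 and variance 0.5.
    """
    return -x - (x - 2.0)

def langevin_sample(n_steps=500, step_size=0.05, rng=None):
    """Run one chain of unadjusted Langevin dynamics and return its final state."""
    rng = rng or random.Random()
    x = rng.gauss(0.0, 1.0)  # start from the standard normal prior
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # x <- x + (eps / 2) * grad log p(x) + sqrt(eps) * noise
        x = x + 0.5 * step_size * grad_log_posterior(x) + math.sqrt(step_size) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(rng=rng) for _ in range(200)]
sample_mean = sum(samples) / len(samples)
print(round(sample_mean, 2))
```

Each chain ends at a different point, so the samples stay diverse while concentrating where the potential is favorable, which is the general behavior the summary attributes to sampling the initial noise.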

arXiv

PaSBench-Video: A Streaming Video Benchmark for Proactive Safety Warning

PaSBench-Video evaluates MLLMs’ proactive safety warnings, revealing poor performance with high false positives. Models struggle to distinguish emerging threats from routine scenes across domains like driving and healthcare.

arXiv

MASER: Modality-Adaptive Specialist Routing for Embodied 3D Spatial Intelligence

MASER uses a neural router to dynamically select optimal modality adapters for embodied 3D spatial queries. It outperforms baselines by leveraging point clouds in over half of cases, achieving 51.3% oracle agreement.
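One plausible reading of "oracle agreement" is the share of queries on which the router picks the adapter that would have scored best in hindsight. The sketch below computes that quantity on toy data; the metric definition, the modality names other than point clouds, and the scores are assumptions, not details from the paper.

```python
def oracle_agreement(router_choices, per_adapter_scores):
    """Fraction of queries where the router's choice equals the best-scoring adapter.

    router_choices: list of adapter names chosen by the router, one per query.
    per_adapter_scores: list of dicts mapping adapter name -> score for that query.
    """
    agree = 0
    for choice, scores in zip(router_choices, per_adapter_scores):
        oracle = max(scores, key=scores.get)  # adapter an oracle would pick
        agree += int(choice == oracle)
    return agree / len(router_choices)

# Toy data (hypothetical adapters and scores).
router_choices = ["point_cloud", "rgb", "point_cloud", "depth"]
per_adapter_scores = [
    {"rgb": 0.2, "depth": 0.3, "point_cloud": 0.9},
    {"rgb": 0.4, "depth": 0.7, "point_cloud": 0.1},
    {"rgb": 0.1, "depth": 0.2, "point_cloud": 0.8},
    {"rgb": 0.3, "depth": 0.6, "point_cloud": 0.5},
]

agreement = oracle_agreement(router_choices, per_adapter_scores)
print(agreement)  # 0.75: the router matches the oracle on 3 of 4 queries
```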

arXiv

Ghost Tool Calls: Issue-Time Privacy for Speculative Agent Tools

"Ghost tool calls" leak user intent via speculative agent actions. Speculative Tool Privacy Contracts mitigate this by suppressing pre-commit calls, outperforming standard post-hoc filters.

arXiv

Learning When to Translate for Multilingual Reasoning

Luar trains RLMs to selectively translate non-English inputs only when direct understanding fails, improving multilingual reasoning. It outperforms baselines, especially for low-resource languages, by avoiding unnecessary translations.

arXiv

Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Moment-Video benchmarks 33 MLLMs on transient visual events and reveals a significant performance gap: the top models achieve only 39.6% accuracy. This highlights a critical deficit in temporal fidelity and a reliance on sparse frame sampling.