Global News Digest

arXiv

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

Title: Safety Gaps in Low-Resource Settings Stem from Action Deficits, Not Representation Gaps

Abstract

Safety alignment acquired in high-resource languages often fails to generalize effectively to low-resource contexts. While models typically reject malicious queries in English, they frequently fail to do so when those same prompts are translated into languages such as Swahili or Burmese. This cross-lingual vulnerability also affects adaptive steering techniques, including AdaSteer and CAST. In this study, we investigate the precise point where this transfer mechanism breaks down.

We analyzed three models (Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B) across 23 languages. The "harmfulness direction" derived from high-resource activations linearly separates harmful from harmless prompts in low-resource languages nearly as well as it does in high-resource ones, which indicates that the safety representation is present within the model. Despite this intact representation, the refusal rate on harmful prompts drops from 87.9% to 43.9%. The disparity suggests that the model struggles to translate the representation into an actual refusal. What fails to transfer, then, is not the underlying representation but the calibration of the safety decision.
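The separability claim can be illustrated with a small sketch. The abstract does not say how the harmfulness direction is derived, so this example assumes a common construction, the unit-norm difference between class means. The activations are synthetic Gaussian toy data, not model outputs, and every name here (`harm_direction`, `auroc`, `synthetic_activations`) is hypothetical. It fits the direction on "high-resource" samples and measures how well the same direction ranks harmful above harmless on separate "low-resource" samples.

```python
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def harm_direction(harmful, harmless):
    """Assumed construction: unit-norm difference of the two class means."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = sum(x * x for x in diff) ** 0.5
    return [x / norm for x in diff]

def project(rows, direction):
    """Scalar projection of each activation vector onto the direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in rows]

def auroc(pos_scores, neg_scores):
    """Probability that a random harmful score exceeds a random harmless one."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def synthetic_activations(rng, n, dim, shift):
    """Toy data: unit Gaussian noise, with `shift` added to coordinate 0."""
    return [[rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
            for _ in range(n)]

rng = random.Random(0)
# "High-resource" samples are used to fit the direction.
hi_harmful = synthetic_activations(rng, 200, 8, 3.0)
hi_harmless = synthetic_activations(rng, 200, 8, 0.0)
# "Low-resource" samples are only scored, never used for fitting.
lo_harmful = synthetic_activations(rng, 200, 8, 3.0)
lo_harmless = synthetic_activations(rng, 200, 8, 0.0)

direction = harm_direction(hi_harmful, hi_harmless)
lo_auroc = auroc(project(lo_harmful, direction), project(lo_harmless, direction))
print(f"AUROC of the high-resource direction on low-resource toy data: {lo_auroc:.3f}")
```

AUROC depends only on how the scores are ranked, so it can stay high even when a fixed decision threshold no longer sits between the two classes. That is the gap between representation and action described above.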

To address this, we propose recalibrating an existing high-resource gate rather than retraining the model. The gate is a low-rank logistic readout whose decision threshold is reset using as few as one to four examples per class in the target language, and it routes between refusal steering and harmfulness-direction ablation. The method improves mean refusal selectivity ($\Delta$ = harmful refusal rate $-$ harmless refusal rate) from 33.6 for the strongest adapted baseline to 54.5 while maintaining MMLU utility. These findings imply that some low-resource safety failures can be resolved by recalibrating existing representations instead of learning new ones. Our code is available at: https://github.com/rashadaziz/low-resource-safety.
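A minimal sketch of few-shot threshold recalibration and of the selectivity metric follows. It makes several assumptions that are not in the abstract: the gate's readout scores are treated as given numbers, the new threshold is set to the midpoint between the class means of a few target-language scores (the paper's actual rule is not stated here), and a prompt counts as "refused" when its score exceeds the threshold. The scores are invented toy values and the function names are hypothetical.

```python
def recalibrate_threshold(harmful_scores, harmless_scores):
    """Assumed rule: midpoint between the class means of a few labeled scores."""
    mean_harmful = sum(harmful_scores) / len(harmful_scores)
    mean_harmless = sum(harmless_scores) / len(harmless_scores)
    return (mean_harmful + mean_harmless) / 2.0

def refusal_rate(scores, threshold):
    """Percentage of prompts whose gate score exceeds the threshold."""
    return 100.0 * sum(s > threshold for s in scores) / len(scores)

def selectivity(harmful_scores, harmless_scores, threshold):
    """Delta = harmful refusal rate minus harmless refusal rate."""
    return (refusal_rate(harmful_scores, threshold)
            - refusal_rate(harmless_scores, threshold))

# Toy target-language scores: still well separated, but shifted below
# the threshold that worked in the source language.
source_threshold = 0.0
target_harmful = [-0.4, -0.9, -0.2, -1.1, -0.6, -0.3]
target_harmless = [-2.8, -3.1, -2.5, -3.4, -2.9, -2.6]

# Recalibrate from two labeled examples per class (the abstract reports 1 to 4).
new_threshold = recalibrate_threshold(target_harmful[:2], target_harmless[:2])

before = selectivity(target_harmful, target_harmless, source_threshold)
after = selectivity(target_harmful, target_harmless, new_threshold)
print(f"threshold {source_threshold:.2f} -> {new_threshold:.2f}")
print(f"selectivity {before:.1f} -> {after:.1f}")
```

In this toy case the readout itself is untouched. Only the scalar threshold moves, which is why a handful of labeled examples can be enough when the classes are already separated.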


Source: arXiv
Generated at: 2026-06-02 00:00:00 UTC
