Low-Resource Safety Failures Are Action Failures, Not Representation Failures
Title: Safety Gaps in Low-Resource Settings Stem from Action Deficits, Not Representation Gaps
Abstract
Safety alignment acquired in high-resource languages often fails to generalize effectively to low-resource contexts. While models typically reject malicious queries in English, they frequently fail to do so when those same prompts are translated into languages such as Swahili or Burmese. This cross-lingual vulnerability also affects adaptive steering techniques, including AdaSteer and CAST. In this study, we investigate the precise point where this transfer mechanism breaks down.
We analyzed three models (Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B) across 23 languages. Our findings indicate that the "harmfulness direction" derived from high-resource activations serves as an effective linear separator between harmful and harmless prompts in low-resource settings, performing nearly as well as it does in high-resource languages. This indicates that the necessary safety representation is present within the model. However, despite this intact representation, the refusal rate on harmful prompts drops from 87.9% to 43.9%. This disparity suggests that the model struggles to translate the identified representation into an actual refusal action. Consequently, the element that fails to transfer is not the underlying representation, but rather the calibration of the safety decision.
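To make the "linear separator" claim concrete, the following is a minimal sketch of how such a check could be run. It assumes the harmfulness direction is a difference of class means and that separability is measured by AUROC; neither choice is confirmed by the abstract. The activations are synthetic Gaussian stand-ins, and names such as `toy_activations` and `lo_offset` are hypothetical.

```python
import numpy as np

def harmfulness_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean.
    Assumption: difference of means; the paper may derive it differently."""
    d = harmful.mean(axis=0) - harmless.mean(axis=0)
    return d / np.linalg.norm(d)

def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = np.asarray(pos_scores)[:, None]
    neg = np.asarray(neg_scores)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

rng = np.random.default_rng(0)
dim, n = 64, 200
axis = np.zeros(dim)
axis[0] = 1.0  # toy "harmfulness" axis, purely illustrative

def toy_activations(shift, offset):
    # Synthetic stand-in for hidden states: noise + class shift + language offset.
    return rng.normal(size=(n, dim)) + shift * axis + offset

# High-resource language: no offset. Low-resource language: every activation is
# shifted along the same axis (an assumption of this toy, not a reported finding).
hi_harm = toy_activations(4.0, 0.0)
hi_safe = toy_activations(0.0, 0.0)
lo_offset = -2.0 * axis
lo_harm = toy_activations(4.0, lo_offset)
lo_safe = toy_activations(0.0, lo_offset)

# Estimate the direction on high-resource data only, then score both languages.
direction = harmfulness_direction(hi_harm, hi_safe)
auroc_hi = auroc(hi_harm @ direction, hi_safe @ direction)
auroc_lo = auroc(lo_harm @ direction, lo_safe @ direction)
print(f"AUROC high-resource: {auroc_hi:.3f}, low-resource: {auroc_lo:.3f}")
```

In this toy setup the language offset moves both classes together, so the ranking of harmful over harmless prompts along the direction is preserved even though the absolute scores change. That is one way a representation can remain intact while a fixed decision threshold stops working.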
To address this, we propose a method that recalibrates an existing high-resource gate rather than retraining the model. This approach utilizes a low-rank logistic readout, resetting its decision threshold with as few as one to four examples per class in the target language. This gate facilitates routing between refusal steering and harmfulness-direction ablation. The method substantially improves mean refusal selectivity ($\Delta$ = harmful $-$ harmless refusal), increasing it from 33.6 in the strongest adapted baseline to 54.5, all while maintaining MMLU utility. These findings imply that certain safety failures in low-resource scenarios can be resolved by recalibrating existing representations instead of learning new ones. Our code is available at: https://github.com/rashadaziz/low-resource-safety.
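The recalibration idea and the selectivity metric can be illustrated with a short sketch. It assumes the gate's readout is frozen and only its bias (equivalently, its decision threshold) is reset, and that the new threshold is the midpoint between the few-shot class means. The midpoint rule, the synthetic logits, and the names `recalibrated_bias` and `selectivity` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def recalibrated_bias(logits_harmful, logits_harmless):
    """Bias that places the decision boundary midway between the class means.
    Assumption: a midpoint rule; the paper's exact reset rule is not stated here."""
    midpoint = 0.5 * (np.mean(logits_harmful) + np.mean(logits_harmless))
    return -midpoint

def selectivity(logits_harmful, logits_harmless, bias):
    """Delta = harmful refusal rate - harmless refusal rate, in percentage points.
    A prompt counts as refused when its shifted logit is positive."""
    refuse_harmful = np.mean(np.asarray(logits_harmful) + bias > 0)
    refuse_harmless = np.mean(np.asarray(logits_harmless) + bias > 0)
    return 100.0 * float(refuse_harmful - refuse_harmless)

rng = np.random.default_rng(0)

# Toy logits from a frozen gate. The target-language logits are shifted down by 3
# relative to the source language (an assumption of this toy example).
hi_harm = rng.normal(2.0, 1.0, 500)
hi_safe = rng.normal(-2.0, 1.0, 500)
lo_harm = rng.normal(-1.0, 1.0, 500)
lo_safe = rng.normal(-5.0, 1.0, 500)

# Threshold calibrated on the high-resource language only.
source_bias = recalibrated_bias(hi_harm, hi_safe)

# Reset the threshold using k examples per class from the target language.
k = 4
few_shot_bias = recalibrated_bias(lo_harm[:k], lo_safe[:k])

# Evaluate on the remaining target-language examples.
delta_before = selectivity(lo_harm[k:], lo_safe[k:], source_bias)
delta_after = selectivity(lo_harm[k:], lo_safe[k:], few_shot_bias)
print(f"Delta with source threshold: {delta_before:.1f}")
print(f"Delta with recalibrated threshold: {delta_after:.1f}")
```

Only one scalar changes between the two evaluations, which is why a handful of labeled examples can suffice in this toy setting. The routing between refusal steering and harmfulness-direction ablation is not modeled here.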
Source: arXiv Generated at: 2026-06-02 00:00:00 UTC




