SoLoPO: Unlocking Long-Context Capabilities in LLMs via Short-to-Long Preference Optimization
Abstract:
Although pretraining techniques have expanded the context window sizes of large language models (LLMs), these models continue to struggle with effectively processing real-world long-context data. This limitation stems largely from inadequate long-context alignment, which is driven by poor data quality, training inefficiencies, and the absence of well-structured optimization objectives. To overcome these hurdles, we introduce Short-to-Long Preference Optimization (SoLoPO). Backed by both theoretical analysis and empirical validation, our framework decouples long-context preference optimization (PO) into two distinct phases: short-context PO and short-to-long reward alignment (SoLo-RA).
The short-context PO component uses preference pairs derived from shorter contexts to strengthen the model’s ability to use contextual information. SoLo-RA, in turn, promotes consistent reward scores for a response whether it is conditioned on a short or a long context, provided both contexts contain the same task-relevant information. This mechanism transfers the model’s proficiency with short contexts to long-context scenarios. SoLoPO is compatible with existing mainstream preference optimization algorithms and significantly streamlines both data construction and training workflows. Our experiments demonstrate that integrating SoLoPO improves all tested algorithms, yielding stronger length and domain generalization across various long-context benchmarks while also delivering substantial gains in computational and memory efficiency.
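The decomposition described above can be illustrated with a minimal sketch. The abstract does not state the loss, so everything concrete here is an assumption made for illustration: a DPO-style implicit reward, a Bradley-Terry loss on the short-context pair, an absolute-difference penalty standing in for SoLo-RA, the choice to apply that penalty to the chosen response only, and the hypothetical names `implicit_reward`, `short_context_po_loss`, `solo_ra_penalty`, `solopo_style_loss`, `beta`, and `alpha`. It should not be read as the paper's actual objective.

```python
import math


def implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO-style implicit reward (assumed form): beta * (log pi - log pi_ref)."""
    return beta * (logp_policy - logp_ref)


def short_context_po_loss(r_chosen_short, r_rejected_short):
    """Bradley-Terry style loss on a short-context pair: -log sigmoid(margin)."""
    margin = r_chosen_short - r_rejected_short
    # -log(sigmoid(m)) = softplus(-m), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def solo_ra_penalty(r_short, r_long):
    """Assumed alignment term: penalize any gap between the reward of the same
    response under the short context and under the long context."""
    return abs(r_short - r_long)


def solopo_style_loss(r_chosen_short, r_rejected_short, r_chosen_long, alpha=1.0):
    """Short-context PO plus a weighted short-to-long alignment penalty.

    Applying the penalty to the chosen response only, and weighting it by
    alpha, are illustrative assumptions.
    """
    po = short_context_po_loss(r_chosen_short, r_rejected_short)
    ra = solo_ra_penalty(r_chosen_short, r_chosen_long)
    return po + alpha * ra
```

Under these assumptions, only the alignment term touches the long context, and it needs a single response per example rather than a full preference pair. This is one plausible reading of how such a decomposition could reduce the cost of long-context data construction and training.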
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




