arXiv

Aligning Deep Implicit Preferences by Learning to Reason Defensively

Abstract:

To support meaningful, user-centric interactions, Large Language Models (LLMs) require robust personalized alignment. However, existing approaches face two limitations: they struggle to infer users' deep implicit preferences (such as unspoken objectives, semantic nuances, and risk thresholds), and they lack the defensive reasoning capabilities needed to handle real-world ambiguity. As a result, their responses tend to be shallow, fragile, and short-sighted.

To overcome these hurdles, we introduce Critique-Driven Reasoning Alignment (CDRA), a framework that reframes alignment from simple scalar reward matching to a structured reasoning process. To address the preference-inference gap, we introduce DeepPref, a novel benchmark of 3,000 preference-query pairs spanning 20 distinct topics. It was built by simulating a multi-dimensional cognitive council that generates critique-annotated reasoning chains, which deconstruct query semantics and expose latent risks.

Furthermore, to embed defensive reasoning, we propose the Personalized Generative Process Reward Model (Pers-GenPRM). This model treats reward modeling as a personalized reasoning exercise. It produces a critique chain to assess how well a response aligns with user preferences before deriving a final score based on that rationale. This interpretable, structured reward signal directs the policy model via Critique-Driven Policy Alignment, a process-level online reinforcement learning algorithm that incorporates both numerical and natural language feedback. Experimental results indicate that CDRA is highly effective at identifying and adhering to users' genuine preferences while maintaining robust reasoning capabilities. The associated code and dataset can be accessed at https://github.com/Zephyrian-Hugh/Deep-pref.
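
The critique-then-score idea behind Pers-GenPRM can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `CritiqueStep`, `RewardSignal`, `toy_critique_chain`, and `personalized_reward` are hypothetical, and a simple keyword check stands in for the generative reward model that would actually write the critiques. The sketch only shows the shape of the signal, in which a natural-language critique chain is produced first and the scalar score is then derived from that chain.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CritiqueStep:
    aspect: str      # preference dimension being examined (hypothetical field)
    comment: str     # natural-language critique for this aspect
    satisfied: bool  # whether the response meets the preference on this aspect


@dataclass
class RewardSignal:
    critiques: List[CritiqueStep]  # the rationale, kept for interpretability
    score: float                   # scalar derived from the rationale


def toy_critique_chain(preferences: Dict[str, str], response: str) -> List[CritiqueStep]:
    """Stand-in for a generative critic: checks one keyword per preference aspect.

    A real system would prompt a language model to write each critique;
    the keyword match here is only an assumption made for illustration.
    """
    text = response.lower()
    steps = []
    for aspect, keyword in preferences.items():
        ok = keyword.lower() in text
        verdict = "addresses" if ok else "does not address"
        steps.append(CritiqueStep(aspect, f"Response {verdict} '{keyword}'.", ok))
    return steps


def personalized_reward(preferences: Dict[str, str], response: str) -> RewardSignal:
    """Produce the critique chain first, then derive the score from it."""
    critiques = toy_critique_chain(preferences, response)
    if not critiques:
        return RewardSignal(critiques, 0.0)
    # Score is the fraction of preference aspects the critiques mark as satisfied.
    score = sum(c.satisfied for c in critiques) / len(critiques)
    return RewardSignal(critiques, score)


if __name__ == "__main__":
    prefs = {"risk tolerance": "low-risk", "budget": "budget"}
    signal = personalized_reward(prefs, "A low-risk index fund fits here.")
    for step in signal.critiques:
        print(f"[{step.aspect}] {step.comment}")
    print("score:", signal.score)
```

Because the score is computed from the critiques rather than alongside them, both the numerical value and the text of each critique remain available as feedback, which matches the abstract's description of a reward signal that carries numerical and natural language information.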


Source: arXiv Generated at: 2026-06-04 00:00:00 UTC
