Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
Authors: Judy Hanwen Shen, Archit Sharma, Jun Qin
The goal of aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money would be spent carefully collecting bespoke preference data tailored to each downstream application. In practice, however, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no efforts to measure and compare these datasets. In this paper, we systematically study preference datasets through three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment by providing perspectives that aid in training efficiency and iterative data collection for RLHF.
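As a rough illustration of the three axes named in the abstract, the sketch below computes simple proxy metrics over a preference dataset: its size (scale), the fraction of pairs where an external proxy reward model disagrees with the human label (a stand-in for label noise), and the mean score margin between chosen and rejected responses (a stand-in for information content). The `PreferencePair` structure, the `proxy_reward` callable, and the specific proxies are illustrative assumptions, not the metrics proposed in the paper.

```python
# Illustrative sketch only: simple proxy metrics for comparing preference
# datasets along the three axes named in the abstract (scale, label noise,
# information content). The concrete definitions are assumptions, not the
# paper's exact formulations.
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response labeled as preferred by the annotator
    rejected: str  # response labeled as dispreferred


def scale(dataset: List[PreferencePair]) -> int:
    """Scale: simply the number of labeled preference pairs."""
    return len(dataset)


def label_noise_rate(
    dataset: List[PreferencePair],
    proxy_reward: Callable[[str, str], float],
) -> float:
    """Label-noise proxy: fraction of pairs where a (hypothetical) proxy
    reward model ranks the rejected response above the chosen one."""
    flips = sum(
        proxy_reward(p.prompt, p.rejected) > proxy_reward(p.prompt, p.chosen)
        for p in dataset
    )
    return flips / max(len(dataset), 1)


def mean_reward_margin(
    dataset: List[PreferencePair],
    proxy_reward: Callable[[str, str], float],
) -> float:
    """Information-content proxy: average score gap between chosen and
    rejected responses; near-zero margins suggest uninformative pairs."""
    return mean(
        proxy_reward(p.prompt, p.chosen) - proxy_reward(p.prompt, p.rejected)
        for p in dataset
    )


if __name__ == "__main__":
    # Toy proxy reward (longer responses score higher) just to make the
    # sketch runnable; a real comparison would use a trained reward model.
    toy_reward = lambda prompt, response: float(len(response))
    data = [
        PreferencePair("Q1", chosen="a detailed answer", rejected="meh"),
        PreferencePair("Q2", chosen="ok", rejected="a much longer reply"),
    ]
    print(scale(data), label_noise_rate(data, toy_reward),
          mean_reward_margin(data, toy_reward))
```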
PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories
December 3, 2025 · Research areas: Human-Computer Interaction, Methods and Algorithms
Accommodating human preferences is essential for creating AI agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs to infer preferences from user interactions, but these approaches often produce broad and generic preferences, failing to capture the unique and individualized nature of human preferences. This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring…
On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization
October 9, 2024 · Research areas: Methods and Algorithms, Speech and Natural Language Processing · Conference: EMNLP
Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown…
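For context on the implicit reward mentioned in the abstract, the sketch below scores responses with the reward that a DPO-trained policy implicitly defines, r(x, y) = β (log π_θ(y|x) − log π_ref(y|x)), up to a prompt-only term that cancels in pairwise comparisons. The `policy_logprob` and `ref_logprob` helpers are hypothetical placeholders for whatever trained policy and frozen reference model are being compared; this is a minimal sketch, not the paper's evaluation code.

```python
# Minimal sketch of the implicit reward induced by Direct Preference
# Optimization (DPO), assuming access to hypothetical helpers that return
# the summed token log-probability of a response under the trained policy
# and under the frozen reference model.
from typing import Callable

LogProbFn = Callable[[str, str], float]  # (prompt, response) -> log-probability


def implicit_reward(
    prompt: str,
    response: str,
    policy_logprob: LogProbFn,  # log pi_theta(response | prompt), assumed helper
    ref_logprob: LogProbFn,     # log pi_ref(response | prompt), assumed helper
    beta: float = 0.1,
) -> float:
    """Score a response with the reward implicitly defined by a DPO policy."""
    return beta * (policy_logprob(prompt, response) - ref_logprob(prompt, response))


def prefers_chosen(
    prompt: str,
    chosen: str,
    rejected: str,
    policy_logprob: LogProbFn,
    ref_logprob: LogProbFn,
    beta: float = 0.1,
) -> bool:
    """Check whether the implicit reward agrees with a held-out preference pair,
    e.g. to probe how well it generalizes beyond the DPO training data."""
    return (
        implicit_reward(prompt, chosen, policy_logprob, ref_logprob, beta)
        > implicit_reward(prompt, rejected, policy_logprob, ref_logprob, beta)
    )
```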