Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison

Authors: Judy Hanwen Shen, Archit Sharma, Jun Qin

Aligning language models to human preferences requires data that reveal these preferences. Ideally, time and money would be spent carefully collecting and tailoring bespoke preference data for each downstream application. In practice, however, a select few publicly available preference datasets are often used to train reward models for reinforcement learning from human feedback (RLHF). While new preference datasets are being introduced with increasing frequency, there are currently no efforts to measure and compare these datasets. In this paper, we systematically study preference datasets from three perspectives: scale, label noise, and information content. We propose specific metrics for each of these perspectives and uncover different axes of comparison for a better understanding of preference datasets. Our work is a first step towards a data-centric approach to alignment, providing perspectives that aid training efficiency and iterative data collection for RLHF.

Related readings and updates.

PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories

December 3, 2025. Research areas: Human-Computer Interaction, Methods and Algorithms.

Accommodating human preferences is essential for creating AI agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs to infer preferences from user interactions, but they often produce broad and generic preferences, failing to capture the unique and individualized nature of human preferences. This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring…

On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

October 9, 2024. Research areas: Methods and Algorithms, Speech and Natural Language Processing. Conference: EMNLP.

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown…
