
Preference-based Reinforcement Learning (PbRL) has shown great promise in learning from binary human preference feedback on an agent's trajectory behaviors, where one of the major goals is to reduce the number of human feedback queries. While the binary labels directly indicate how good a trajectory behavior is, credit assignment still needs to be resolved, especially when feedback is limited. We propose PRIor On Rewards (PRIOR), which learns a forward dynamics world model to approximate a priori selective attention over states; this attention serves as a means to perform credit assignment over a given trajectory. Further, we propose an auxiliary objective that redistributes the total predicted return according to these PRIORs as a simple yet effective means of improving reward learning performance. Our experiments on six robot-manipulation and three locomotion PbRL benchmarks demonstrate PRIOR's significant improvements in feedback-sample efficiency and reward recovery. Finally, we present extensive ablations that study our design decisions and the ease of using PRIOR with existing PbRL methods.
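
To make the auxiliary objective concrete, the following is a minimal sketch of one possible reading of the abstract, not the authors' implementation. It assumes the per-state priors are already available as non-negative weights (here a plain list, `prior_weights`), that the preference loss is the Bradley-Terry cross-entropy commonly used in PbRL, and that the redistribution term is a mean squared error weighted by a hypothetical coefficient `aux_coef`. All function and parameter names are illustrative.

```python
import math


def redistribution_loss(pred_rewards, prior_weights):
    """Assumed auxiliary term: pull each per-step predicted reward toward
    its prior-weighted share of the total predicted return."""
    total_weight = sum(prior_weights)
    total_return = sum(pred_rewards)
    # Target for step t is (normalized prior weight at t) * (predicted return).
    targets = [w / total_weight * total_return for w in prior_weights]
    squared_errors = [(r - t) ** 2 for r, t in zip(pred_rewards, targets)]
    return sum(squared_errors) / len(pred_rewards)


def preference_loss(rewards_a, rewards_b, label):
    """Bradley-Terry cross-entropy on predicted returns.
    label is 1.0 if trajectory A is preferred, 0.0 if B is preferred."""
    ret_a, ret_b = sum(rewards_a), sum(rewards_b)
    # Numerically stable log(exp(ret_a) + exp(ret_b)).
    m = max(ret_a, ret_b)
    log_norm = m + math.log(math.exp(ret_a - m) + math.exp(ret_b - m))
    log_p_a = ret_a - log_norm
    log_p_b = ret_b - log_norm
    return -(label * log_p_a + (1.0 - label) * log_p_b)


def total_loss(rewards_a, priors_a, rewards_b, priors_b, label, aux_coef=0.1):
    """Preference loss plus the assumed redistribution term on both trajectories.
    aux_coef is a hypothetical weighting, not a value from the paper."""
    aux = redistribution_loss(rewards_a, priors_a) + redistribution_loss(rewards_b, priors_b)
    return preference_loss(rewards_a, rewards_b, label) + aux_coef * aux
```

Under these assumptions, the redistribution term is zero when the predicted rewards are already distributed in proportion to the priors, and it grows as the reward model spreads the same return across states in a way that disagrees with them.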

Related readings and updates.

On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization

Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning language models to human preferences. Central to RLHF is learning a reward function for scoring human preferences. Two main approaches for learning a reward model are 1) training an explicit reward model as in RLHF, and 2) using an implicit reward learned from preference data through methods such as Direct Preference Optimization (DPO). Prior work has shown…

Symbol Guided Hindsight Priors for Reward Learning from Human Preferences

This paper was accepted at the "Human in the Loop Learning Workshop" at NeurIPS 2022. Specifying reward functions for Reinforcement Learning is a challenging task that Preference-Based Learning methods bypass by learning instead from preference labels on trajectory queries. These methods, however, still require many preference labels and often achieve low reward recovery. We present…