Hindsight PRIORs for Reward Learning from Human Preferences
Authors: Mudit Verma, Rin Metcalf Susa
Preference-based Reinforcement Learning (PbRL) has shown great promise in learning from binary human preference feedback over an agent's trajectory behaviors, where a major goal is to reduce the amount of queried human feedback. While binary labels directly comment on the goodness of a trajectory behavior, credit assignment within the trajectory still needs to be resolved, especially when feedback is limited. We propose PRIor On Rewards (PRIOR), which learns a forward dynamics world model to approximate a priori selective attention over states; this attention serves as a means of performing credit assignment over a given trajectory. Further, we propose an auxiliary objective that redistributes the total predicted return according to these PRIORs, a simple yet effective way to improve reward learning performance. Our experiments on six robot-manipulation and three locomotion PbRL benchmarks demonstrate PRIOR's significant improvements in feedback-sample efficiency and reward recovery. Finally, we present extensive ablations that study our design decisions and demonstrate the ease of using PRIOR with existing PbRL methods.
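To make the redistribution idea concrete, below is a minimal, hypothetical sketch of what an auxiliary objective of this kind could look like: per-state rewards from a learned reward model are pushed toward the total predicted return, split across states in proportion to attention priors. All names (`RewardModel`, `prior_redistribution_loss`) and the exact loss form are illustrative assumptions, not the paper's actual implementation; the priors here are a stand-in for the world model's attention weights.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Predicts a scalar reward for each state in a trajectory (illustrative)."""
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (T, state_dim) -> per-state rewards: (T,)
        return self.net(states).squeeze(-1)

def prior_redistribution_loss(per_state_rewards: torch.Tensor,
                              priors: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss (assumed form): match each per-state reward to its
    prior-weighted share of the total predicted return."""
    total_return = per_state_rewards.sum().detach()  # total predicted return of the trajectory
    target = priors * total_return                   # redistribute return according to priors
    return torch.mean((per_state_rewards - target) ** 2)

# Usage with dummy data
T, state_dim = 50, 10
states = torch.randn(T, state_dim)
priors = torch.softmax(torch.randn(T), dim=0)  # stand-in for world-model attention weights
model = RewardModel(state_dim)
rewards = model(states)
aux_loss = prior_redistribution_loss(rewards, priors)
aux_loss.backward()
```

In this sketch the auxiliary loss would be added to the usual preference (e.g., Bradley-Terry) loss when training the reward model; the weighting between the two terms is left unspecified here.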