This paper was accepted at the Human-in-the-Loop Learning Workshop at NeurIPS 2022.
Preference-based reinforcement learning (RL) algorithms help avoid the pitfalls of hand-crafted reward functions by distilling them from human preference feedback, but they remain impractical due to the burdensome number of labels required from the human, even for relatively simple tasks. In this work, we demonstrate that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks. We hypothesize that REED-based methods better partition the state-action space and facilitate generalization to state-action pairs not included in the preference dataset. REED iterates between encoding environment dynamics in a state-action representation via a self-supervised temporal consistency task, and bootstrapping the preference-based reward function from the state-action representation. Whereas prior approaches train only on the preference-labelled trajectory pairs, REED exposes the state-action representation to all transitions experienced during policy training. We explore the benefits of REED within the PrefPPO [1] and PEBBLE [2] preference learning frameworks and demonstrate improvements across experimental conditions in both the speed of policy learning and the final policy performance. For example, on quadruped-walk and walker-walk with 50 preference labels, REED-based reward functions recover 83% and 66% of ground truth reward policy performance, whereas without REED only 38% and 21% are recovered. For some domains, REED-based reward functions yield policies that outperform policies trained on the ground truth reward.
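The sketch below illustrates the two alternating objectives described in the abstract: a self-supervised temporal consistency loss applied to all transitions experienced during policy training, and a preference loss applied only to labelled segment pairs, both sharing a state-action encoder from which the reward function is bootstrapped. This is a minimal illustration, not the paper's implementation; the SimSiam-style consistency objective, the Bradley-Terry preference model, and all module names, shapes, and hyperparameters are assumptions.

```python
# Minimal REED-style sketch (illustrative only; not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM, EMBED_DIM = 24, 6, 128  # hypothetical dimensions

class StateActionEncoder(nn.Module):
    """Shared representation of (state, action) pairs."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, EMBED_DIM))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

encoder = StateActionEncoder()
state_proj = nn.Linear(STATE_DIM, EMBED_DIM)   # embeds the next state
predictor = nn.Linear(EMBED_DIM, EMBED_DIM)    # predicts the next-state embedding
reward_head = nn.Linear(EMBED_DIM, 1)          # reward bootstrapped from the encoder
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(state_proj.parameters()) +
    list(predictor.parameters()) + list(reward_head.parameters()), lr=3e-4)

def temporal_consistency_loss(s, a, s_next):
    """Self-supervised objective on all transitions from policy training (assumed SimSiam-style)."""
    pred = predictor(encoder(s, a))
    target = state_proj(s_next).detach()       # stop-gradient target
    return -F.cosine_similarity(pred, target, dim=-1).mean()

def preference_loss(seg0, seg1, label):
    """Bradley-Terry loss on preference-labelled segment pairs.
    seg* = (states, actions) with shape (batch, time, dim); label in {0, 1}."""
    r0 = reward_head(encoder(*seg0)).sum(dim=1)  # summed predicted return of segment 0
    r1 = reward_head(encoder(*seg1)).sum(dim=1)
    logits = torch.cat([r0, r1], dim=-1)
    return F.cross_entropy(logits, label)

# One illustrative update with random tensors standing in for real transitions.
s, a, s_next = torch.randn(32, STATE_DIM), torch.randn(32, ACTION_DIM), torch.randn(32, STATE_DIM)
seg0 = (torch.randn(8, 50, STATE_DIM), torch.randn(8, 50, ACTION_DIM))
seg1 = (torch.randn(8, 50, STATE_DIM), torch.randn(8, 50, ACTION_DIM))
label = torch.randint(0, 2, (8,))
loss = temporal_consistency_loss(s, a, s_next) + preference_loss(seg0, seg1, label)
opt.zero_grad(); loss.backward(); opt.step()
```

The key point the sketch conveys is that only `preference_loss` requires human labels, while `temporal_consistency_loss` shapes the shared representation using every transition the agent observes, which is the mechanism the abstract credits for reducing label requirements.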