Better NN EA - page 13

 

Neural networks made easy (Part 68): Offline Preference-guided Policy Optimization 


Reinforcement learning is a universal framework for learning optimal behavior policies in the environment under exploration. Policy optimality is achieved by maximizing the rewards received from the environment during interaction with it. But herein lies one of the main problems of this approach: designing an appropriate reward function often requires significant human effort, and the rewards may be sparse and/or insufficient to express the true learning goal. As one option for solving this problem, the authors of the paper "Beyond Reward: Offline Preference-guided Policy Optimization" proposed the OPPO method (Offline Preference-guided Policy Optimization). The method replaces the reward given by the environment with the preferences of a human annotator between two trajectories completed in the environment under exploration. Let's take a closer look at the proposed algorithm.
  • www.mql5.com
Since the first articles devoted to reinforcement learning, we have in one way or another touched upon two problems: exploring the environment and determining the reward function. Recent articles have addressed the problem of exploration in offline learning. In this article, I would like to introduce you to an algorithm whose authors completely eliminated the reward function.
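To make the idea of learning from annotator preferences more concrete, here is a minimal sketch of a Bradley-Terry-style pairwise preference objective, a common building block in preference-based RL. This is an illustration only, not the OPPO implementation from the paper: the ScoreModel network, the preference_loss function, and the feature dimensions are hypothetical placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScoreModel(nn.Module):
    """Hypothetical score model: maps per-step (state, action) features to a
    scalar score for the whole trajectory. Stands in for whatever network the
    article's series would implement."""
    def __init__(self, feature_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, trajectory: torch.Tensor) -> torch.Tensor:
        # trajectory: (steps, feature_dim) -> scalar score summed over steps
        return self.net(trajectory).sum()

def preference_loss(model: ScoreModel,
                    traj_preferred: torch.Tensor,
                    traj_other: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the trajectory the annotator preferred
    # should receive the higher score, so we maximize log P(preferred > other).
    score_p = model(traj_preferred)
    score_o = model(traj_other)
    return -F.logsigmoid(score_p - score_o)

# Usage sketch with random tensors standing in for an annotated trajectory pair.
model = ScoreModel(feature_dim=8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
traj_a = torch.randn(100, 8)   # trajectory the annotator preferred
traj_b = torch.randn(100, 8)   # the rejected trajectory

optimizer.zero_grad()
loss = preference_loss(model, traj_a, traj_b)
loss.backward()
optimizer.step()
```

The key point the sketch illustrates is that no per-step reward from the environment is needed; the only supervision is a binary label saying which of the two trajectories the annotator preferred.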