Discussing the article: "Neural networks made easy (Part 68): Offline Preference-guided Policy Optimization"
Check out the new article: Neural networks made easy (Part 68): Offline Preference-guided Policy Optimization.
Since the first articles devoted to reinforcement learning, we have, in one way or another, touched upon two problems: exploring the environment and determining the reward function. Recent articles have been devoted to the problem of exploration in offline learning. In this article, I would like to introduce you to an algorithm whose authors completely eliminated the reward function.
In the context of offline preference-guided learning, the general approach consists of two steps: first, a reward function model is fitted to the preference labels with supervised learning; then, a policy is trained by any offline RL algorithm on transitions relabeled with the learned reward function. However, training the reward function separately may not directly instruct the policy how to act optimally. The preference labels define the learning task, so the goal is to learn the most preferred trajectory rather than to maximize a reward. For complex problems, a scalar reward can create an information bottleneck in policy optimization, which in turn leads to suboptimal Agent behavior. Additionally, offline policy optimization can exploit flaws in an imperfectly learned reward function, which leads to unwanted behavior.
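To make the two-step baseline concrete, below is a minimal Python sketch of its first step: fitting a reward model to preference labels with a Bradley-Terry style loss over trajectory segments. The class and function names, network sizes, and label convention are my own illustrative assumptions and do not come from the article.

```python
# Minimal sketch of step one of the conventional two-step pipeline:
# fitting a reward model to preference labels (Bradley-Terry loss).
# All names (RewardModel, segment tensors, label convention) are assumptions.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state, action) pair to a scalar reward estimate."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # states: [batch, T, state_dim], actions: [batch, T, action_dim]
        x = torch.cat([states, actions], dim=-1)
        return self.net(x).squeeze(-1)            # per-step rewards [batch, T]

def preference_loss(model, seg_a, seg_b, labels):
    """Bradley-Terry loss: labels = 1.0 if segment A is preferred, 0.0 otherwise."""
    ret_a = model(*seg_a).sum(dim=-1)             # summed predicted return of A
    ret_b = model(*seg_b).sum(dim=-1)             # summed predicted return of B
    logits = ret_a - ret_b                        # P(A preferred) = sigmoid(logits)
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

# Step two: the trained model relabels every transition in the offline dataset,
# and any offline RL algorithm is then trained on the relabeled data.
```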
As an alternative to this two-step approach, the authors of the Offline Preference-guided Policy Optimization (OPPO) method aim to learn the policy directly from the offline preference-labeled dataset. They propose a one-step algorithm that simultaneously models offline preferences and learns the optimal decision policy, without the need to train a reward function separately. This is achieved through the use of two objectives: one that models the preferences contained in the offline dataset, and one that extracts the optimal decision policy from them.
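For intuition only, here is a hedged conceptual sketch of such a one-step scheme: a segment encoder produces a context vector, a context-conditioned policy reconstructs behavior, and a learnable "optimal" context is pulled towards the contexts of preferred segments. This is not the article's implementation of OPPO; the module names, the loss composition, and the similarity-based preference score are simplifying assumptions.

```python
# Conceptual sketch of a one-step preference-guided scheme (not the article's code).
import torch
import torch.nn as nn

class SegmentEncoder(nn.Module):
    """Encodes a trajectory segment into a compact context vector z."""
    def __init__(self, state_dim, action_dim, z_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, z_dim),
        )

    def forward(self, states, actions):
        x = torch.cat([states, actions], dim=-1)   # [batch, T, state+action]
        return self.net(x).mean(dim=1)             # average over time -> [batch, z_dim]

class ContextPolicy(nn.Module):
    """Predicts actions conditioned on the state and a context vector."""
    def __init__(self, state_dim, action_dim, z_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, states, z):
        z = z.unsqueeze(1).expand(-1, states.shape[1], -1)
        return self.net(torch.cat([states, z], dim=-1))

def one_step_losses(encoder, policy, z_star, seg_a, seg_b, labels):
    """Joint losses: behavior reconstruction under each segment's own context plus
    a preference term that pulls the learnable 'optimal' context z_star
    (e.g. nn.Parameter(torch.zeros(z_dim))) towards preferred segments."""
    s_a, a_a = seg_a
    s_b, a_b = seg_b
    z_a, z_b = encoder(s_a, a_a), encoder(s_b, a_b)

    # Hindsight-style behavior cloning: reproduce each segment under its own context.
    bc = nn.functional.mse_loss(policy(s_a, z_a), a_a) + \
         nn.functional.mse_loss(policy(s_b, z_b), a_b)

    # Preference modeling: score a context by its closeness to z_star and ask
    # the preferred segment to score higher (labels = 1.0 means A is preferred).
    score_a = -(z_a - z_star).pow(2).sum(dim=-1)
    score_b = -(z_b - z_star).pow(2).sum(dim=-1)
    pref = nn.functional.binary_cross_entropy_with_logits(score_a - score_b, labels)
    return bc, pref
```

Under these assumptions, actions at inference time would be taken from the policy conditioned on the learned optimal context z_star, so the preference signal replaces an explicit reward function.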
Author: Dmitriy Gizlyk