Discussing the article: "Neural networks made easy (Part 61): Optimism issue in offline reinforcement learning"

 

Check out the new article: Neural networks made easy (Part 61): Optimism issue in offline reinforcement learning.

During offline learning, we optimize the Agent's policy based on the data in the training sample. The resulting strategy gives the Agent confidence in its actions. However, such optimism is not always justified and can increase risks during live operation of the model. Today we will look at one of the methods for reducing these risks.

Offline reinforcement learning methods have recently become widespread and show promise for solving problems of varying complexity. However, one of the main problems researchers face is the optimism that can arise during learning. The Agent optimizes its strategy based on the data from the training set and gains confidence in its actions. But the training set often cannot cover the entire variety of possible states and transitions of the environment. In a stochastic environment, such confidence is not entirely justified, and the Agent's optimistic strategy may lead to increased risks and undesirable consequences.

In search of a solution to this problem, it is worth looking at research in the field of autonomous driving, where algorithms are explicitly aimed at reducing risk (increasing user safety) and minimizing online training. One such method is the SeParated Latent Trajectory Transformer (SPLT-Transformer), presented in the paper "Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning" (July 2022).

Author: Dmitriy Gizlyk

 

Neural networks made easy (Part 61)

61 parts, can you see the result in monetary terms?

 
Vladimir Pastushak #:

Neural networks made easy (Part 61)

61 parts, can you see the result in monetary terms?

Easy: $200 * 61 = $12,200.
 

I must say a big thank you to the author, who takes a purely theoretical article and explains in plain language how to:

a) apply it in trading,

b) program and test it in the Strategy Tester.

Take a look at the original paper and see for yourself how much work Dmitriy has done: https://arxiv.org/abs/2207.10295.

Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning
  • arxiv.org
Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We...
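
To make the test-time search described in the abstract more concrete, below is a minimal sketch of the idea: a policy model proposes several candidate action sequences, a world model samples several possible futures for each candidate, and the candidate with the best worst-case return is kept. The callables sample_actions, sample_future and estimate_return are hypothetical placeholders standing in for the separately trained policy and world models; this is not the actual code from the article or the paper.

```python
import numpy as np

def splt_style_search(state, sample_actions, sample_future, estimate_return,
                      n_candidates=8, n_futures=8):
    """Return the candidate action sequence with the best worst-case return.

    sample_actions(state)          -> one candidate action sequence (policy model)
    sample_future(state, actions)  -> one possible trajectory (world model)
    estimate_return(trajectory)    -> scalar return estimate for that trajectory
    All three callables are placeholders for separately trained models.
    """
    best_actions, best_worst_case = None, -np.inf
    for _ in range(n_candidates):
        actions = sample_actions(state)                      # candidate behavior
        # Evaluate the candidate against several sampled futures of the world model
        returns = [estimate_return(sample_future(state, actions))
                   for _ in range(n_futures)]
        worst_case = min(returns)                            # pessimistic score
        if worst_case > best_worst_case:                     # keep the most robust plan
            best_actions, best_worst_case = actions, worst_case
    return best_actions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Dummy stand-ins just to show the call pattern
    chosen = splt_style_search(
        state=np.zeros(4),
        sample_actions=lambda s: rng.normal(size=(10, 2)),
        sample_future=lambda s, a: a + rng.normal(scale=0.1, size=a.shape),
        estimate_return=lambda traj: float(traj.sum()),
    )
    print(chosen.shape)
```

In the paper itself the candidates and futures are enumerated from the discrete latent codes of two VAE-style Transformer models rather than sampled at random, but the selection rule is the same minimax idea: maximize over policy candidates the minimum return over possible futures.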