Article "Neural networks made easy (Part 61): Optimism issue in offline reinforcement learning"

MetaQuotes 2024.06.05 10:26

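During offline training, we optimize the Agent's policy on the data in the training set. The resulting policy gives the Agent confidence in its actions. However, this optimism is not always justified and can increase risk when the model is in operation. Today we will look at one of the methods for reducing these risks.

Offline reinforcement learning methods have recently become widespread, and they show a lot of promise for solving problems of varying complexity. However, one of the main problems researchers face is the optimism that can arise during training. The Agent optimizes its strategy on the data in the training set and gains confidence in its actions. But the training set often cannot cover all possible states and transitions of the environment. In a stochastic environment, this confidence turns out to be not entirely justified. In such cases, the Agent's optimistic strategy can lead to increased risk and undesirable consequences.

In search of a solution to this problem, it is worth looking at research in autonomous driving. Algorithms in this field clearly aim to reduce risk (increase user safety) and to minimize online training. One such method is the Separated Latent Trajectory (SPLT) Transformer, presented in the paper "Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning" (July 2022).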

Author: Dmitriy Gizlyk

Vladimir Pastushak 2023.11.02 22:10 #1

Neural networks made easy (Part 61)

61 parts. Can the result be measured in money?

Denis Kirichenko 2023.11.03 07:53 #2

Vladimir Pastushak #:

Neural networks made easy (Part 61)

61 parts. Can the result be measured in money?

It's simple: $200 * 61 = $12,200.

Rashid Umarov 2023.11.03 11:44 #3

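I must express my sincere thanks to the author, who took a purely theoretical paper and explained in plain language how to

a) apply it to trading,

b) program it and test it in the Strategy Tester.

Read the original paper and see for yourself how much work Dmitriy has done: https://arxiv.org/abs/2207.10295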

Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning

arxiv.org

Impressive results in natural language processing (NLP) based on the Transformer neural network architecture have inspired researchers to explore viewing offline reinforcement learning (RL) as a generic sequence modeling problem. Recent works based on this paradigm have achieved state-of-the-art results in several of the mostly deterministic offline Atari and D4RL benchmarks. However, because these methods jointly model the states and actions as a single sequencing problem, they struggle to disentangle the effects of the policy and world dynamics on the return. Thus, in adversarial or stochastic environments, these methods lead to overly optimistic behavior that can be dangerous in safety-critical systems like autonomous driving. In this work, we propose a method that addresses this optimism bias by explicitly disentangling the policy and world models, which allows us at test time to search for policies that are robust to multiple possible futures in the environment. We...
