Discussing the article: "Neural networks made easy (Part 48): Methods for reducing overestimation of Q-function values"

 

Check out the new article: Neural networks made easy (Part 48): Methods for reducing overestimation of Q-function values.

In the previous article, we introduced the DDPG method, which allows training models in a continuous action space. However, like other Q-learning methods, DDPG is prone to overestimating Q-function values. This problem often results in training an agent with a suboptimal strategy. In this article, we will look at some approaches to overcoming this issue.

The problem of overestimating Q-function values arises quite often when training models with the DQN method and its derivatives. It is characteristic both of models with discrete actions and of problems in a continuous action space. The causes of this phenomenon and the methods for combating its consequences can be specific to each individual case, so an integrated approach to the problem is important. One such approach was presented in the article "Addressing Function Approximation Error in Actor-Critic Methods", published in February 2018, which proposed an algorithm called Twin Delayed Deep Deterministic policy gradient (TD3). The algorithm is a logical continuation of DDPG and introduces several improvements that boost the quality of model training.

First, the authors add a second Critic. The idea itself is not new and has previously been used for models with a discrete action space. However, the authors of the method brought their own understanding, vision and approach to using the second Critic.

The idea is that both Critics are initialized with random parameters and trained in parallel on the same data. Since they start from different initial parameters, they begin training from different states, yet, being trained on the same data, they should move towards the same (ideally global) minimum. It is quite natural that during training their predictions converge, although they never become identical due to the influence of various factors. Each of them is subject to the problem of overestimating the Q-function. But at any given moment one model may overestimate the Q-function while the other underestimates it, and even when both models overestimate it, the error of one will be smaller than that of the other. Based on these assumptions, the authors propose using the minimum of the two predictions as the target for training both Critics. This minimizes the impact of Q-function overestimation and the accumulation of errors during training.
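To make the target construction concrete, below is a minimal PyTorch-style sketch of this "take the minimum of two Critics" rule. The article itself implements everything in MQL5, so the function and network names here are illustrative assumptions rather than the article's code:

```python
import torch

def twin_critic_target(reward, next_state, next_action, done,
                       q1_target, q2_target, gamma=0.99):
    """Clipped double Q-learning target: use the smaller of the two
    target Critics' estimates to build a single target for both Critics.
    q1_target / q2_target are assumed to be callables (e.g. nn.Module)
    mapping (state, action) -> Q-value; gamma is an illustrative discount."""
    with torch.no_grad():
        q1 = q1_target(next_state, next_action)
        q2 = q2_target(next_state, next_action)
        q_min = torch.min(q1, q2)  # pessimistic estimate of the next-state value
        # Standard one-step TD target; (1 - done) zeroes out the bootstrap
        # term at the end of an episode.
        return reward + gamma * (1.0 - done) * q_min
```

Both Critics are then regressed towards this same target, so an optimistic error in one network cannot propagate unchecked through bootstrapping.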

Let's move on to training the models and testing the results. As usual, the models were trained on EURUSD H1 historical data for January–May 2023. The indicator parameters and all hyperparameters were set to their default values.

Training was quite prolonged and iterative. At the first stage, a database of 200 trajectories was created. The first training process was run for 1,000,000 iterations. The Actor's policy was updated once for every 10 updates of the Critics' parameters, and a soft update of the target models was carried out after every 1,000 Critic updates.
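That delayed-update schedule can be sketched as a simple loop. This is a rough Python sketch under assumptions: `update_critics`, `update_actor` and `soft_update_targets` are hypothetical callbacks standing in for the actual MQL5 training code, and `tau` is an illustrative soft-update coefficient, not a value from the article:

```python
import torch

def soft_update(target_net, online_net, tau=0.001):
    """Polyak averaging: target <- tau * online + (1 - tau) * target.
    The tau value here is an illustrative assumption."""
    with torch.no_grad():
        for t_param, param in zip(target_net.parameters(), online_net.parameters()):
            t_param.mul_(1.0 - tau).add_(param, alpha=tau)

def train_stage(update_critics, update_actor, soft_update_targets,
                iterations=1_000_000, policy_delay=10, target_delay=1_000):
    """Schematic first-stage loop: the Critics are updated on every iteration,
    the Actor only every `policy_delay` Critic updates, and the target models
    only every `target_delay` Critic updates (10 and 1,000 as described above)."""
    for it in range(1, iterations + 1):
        update_critics()
        if it % policy_delay == 0:
            update_actor()
        if it % target_delay == 0:
            soft_update_targets()
```

The second training stage described below reuses the same structure, only with shorter delays.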


After that, another 50 trajectories were added to the database of examples and the second stage of model training was launched. At the same time, the number of Critic updates between Actor updates and between target-model updates was reduced to 3 and 100, respectively.

After approximately 5 training cycles (with 50 trajectories added in each cycle), a model was obtained that was capable of generating profit on the training set. Over the 5 months of the training sample, the model earned a return of almost 10%. This is not an outstanding result: it made 58 trades, of which only about 40% were profitable. The profit factor was 1.05 and the recovery factor 1.50. The profit was achieved thanks to the size of the winning positions: the average profit per trade is 1.6 times the average loss, and the maximum profit is 3.5 times the maximum loss from a single trading operation.

Author: Dmitriy Gizlyk
