Discussing the article: "Neural networks made easy (Part 50): Soft Actor-Critic (model optimization)" - page 2

 
Viktor Kudriavtsev #:

Etc. Am I understanding this correctly? What confuses me is that the test often produces passes with a very large negative result, -7000 or even -9000, and there will be a lot of them in the database. Won't the network be trained to trade at a loss on purpose?

Let's look at the learning process without going too deep into the maths - in simple terms, so to speak. In classical reinforcement learning, each action is evaluated by the environment, which gives us a reward. We give this reward to the model as a target outcome. Thus, we train the model not to choose an action but to predict the expected reward for each action (the Q-function), and then, from the predicted rewards, we choose the action with the maximum reward to perform. Let me say right away that the option of doing nothing if all actions are unprofitable does not work, because we evaluate "doing nothing" as a separate action, and it also has its own reward level.
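As a rough illustration of this discrete case (a minimal Python sketch with made-up numbers, not the article's MQL5 code): the model predicts a reward per action, we simply take the action with the highest predicted reward, and "do nothing" is just one more action with its own estimate.

import numpy as np

# Hypothetical predicted rewards (Q-values) for one state.
# Action indices: 0 = buy, 1 = sell, 2 = do nothing.
q_values = np.array([-3.2, -7.5, -0.4])

action = int(np.argmax(q_values))   # index 2: even "do nothing" is chosen via its reward estimate
print(action, q_values[action])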

In a continuous action space, we cannot directly specify a reward for each action. After all, the model has to return the level of impact (trade volume, SL level, etc.), not the reward. Therefore, we use a Critic to evaluate actions. We feed the state and the action from the example database to the Critic's input, and it returns a predicted reward. We compare it with the actual reward from the examples and train the Critic to evaluate actions correctly. As a result, the Critic forms an abstract idea of how the state and the action influence the expected reward. For example, in a bullish trend, increasing the buy volume will increase the reward, while decreasing the sell volume will reduce the loss.
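A minimal sketch of this Critic update, assuming a PyTorch-style setup with invented layer and batch sizes (the article itself implements this in MQL5, so this is only an illustration of the idea):

import torch
import torch.nn as nn

state_dim, action_dim = 12, 3            # assumed sizes, for illustration only

# Critic: takes state + action, returns a predicted reward (Q-value)
critic = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
    nn.Linear(64, 1)
)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

# One mini-batch from the example database: states, actions and the rewards actually received
state  = torch.randn(32, state_dim)
action = torch.randn(32, action_dim)
reward = torch.randn(32, 1)

predicted = critic(torch.cat([state, action], dim=1))
loss = nn.functional.mse_loss(predicted, reward)   # compare the prediction with the actual reward
critic_opt.zero_grad()
loss.backward()
critic_opt.step()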

Next, we train the Actor. We take a separate state and feed it to the Actor's input. The Actor generates some action. We feed this action together with the initial state of the environment to the Critic, and it evaluates the Actor's action. We then tell the Critic that we need to improve the result (we specify a target result higher than the one received). In response, the Critic tells the Actor how it needs to change its action (it transmits an error gradient).
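And a similarly hedged sketch of this Actor step (again an illustrative PyTorch version, not the author's implementation; in practice the Critic would already be trained as in the previous sketch):

import torch
import torch.nn as nn

state_dim, action_dim = 12, 3                      # same assumed sizes as above

critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)   # only the Actor's parameters are updated

state    = torch.randn(32, state_dim)
proposed = actor(state)                                  # the Actor generates an action
value    = critic(torch.cat([state, proposed], dim=1))   # the Critic evaluates it

# "Ask for a better result": maximise the Critic's estimate. The error gradient flows
# back through the Critic into the Actor and tells it how to change its action,
# while the Critic itself is not updated by this optimiser.
actor_loss = -value.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()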

So, theoretically, we can get a positive result even from negative passes. But the problem is that the dependence of income on the action is not linear, and for a more correct evaluation of actions we need better passes.

Now, as for the actual work and training of the Actor. At the initial stage we initialise the model with random parameters, and the Actor's actions are just as random. In the process of training, the Critic says, for example, that we need to open a position of 1 lot. But we train the model with a low learning rate, and at the next iteration the Actor opens a position of only 0.01 lot. Obviously, we need about 100 such training iterations to achieve the desired result.

It would seem, why beat our heads against the wall? Let's increase the learning rate to 1 and memorise the gained experience at once. But there is another side of the coin: in this case, the model will immediately forget all the accumulated experience, and generalisation is out of the question.
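A toy numeric illustration of these last two paragraphs (the numbers and the update rule are made up): with a small learning rate the model needs many updates to reach a new target, while with a rate of 1 each new example simply overwrites everything learned before.

def update(value, target, lr):
    # one gradient-like step of the output towards the target
    return value + lr * (target - value)

# Small learning rate: slow, but experience from all examples accumulates.
v = 0.0
for target in [1.0, -1.0] * 50:     # alternating "examples"
    v = update(v, target, lr=0.01)
print(round(v, 3))                  # stays near the average of the targets (about 0)

# Learning rate = 1: each example is memorised instantly, the previous one is forgotten.
v = 0.0
for target in [1.0, -1.0] * 50:
    v = update(v, target, lr=1.0)
print(v)                            # always equals the last target seen (-1.0)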

That is why we keep beating our heads against the wall until we learn a simple truth. The whole process of model training is a directed selection of parameters by trial and error.

 
Dmitriy Gizlyk #:

Let's look at the learning process without going too deep into the maths - in simple terms, so to speak. In classical reinforcement learning, each action is evaluated by the environment, which gives us a reward. We give this reward to the model as a target outcome. Thus, we train the model not to choose an action but to predict the expected reward for each action (the Q-function), and then, from the predicted rewards, we choose the action with the maximum reward to perform. Let me say right away that the option of doing nothing if all actions are unprofitable does not work, because we evaluate "doing nothing" as a separate action, and it also has its own reward level.

In a continuous action space, we cannot directly specify a reward for each action. After all, the model has to return the level of impact (trade volume, SL level, etc.), not the reward. Therefore, we use a Critic to evaluate actions. We feed the state and the action from the example database to the Critic's input, and it returns a predicted reward. We compare it with the actual reward from the examples and train the Critic to evaluate actions correctly. As a result, the Critic forms an abstract idea of how the state and the action influence the expected reward. For example, in a bullish trend, increasing the buy volume will increase the reward, while decreasing the sell volume will reduce the loss.

Next, we train the Actor. We take a separate state and feed it to the Actor's input. The Actor generates some action. We feed this action together with the initial state of the environment to the Critic, and it evaluates the Actor's action. We then tell the Critic that we need to improve the result (we specify a target result higher than the one received). In response, the Critic tells the Actor how it needs to change its action (it transmits an error gradient).

So, theoretically, we can get a positive result even from negative passes. But the problem is that the dependence of income on the action is not linear, and for a more correct evaluation of actions we need better passes.

Now, as for the actual work and training of the Actor. At the initial stage we initialise the model with random parameters, and the Actor's actions are just as random. In the process of training, the Critic says, for example, that we need to open a position of 1 lot. But we train the model with a low learning rate, and at the next iteration the Actor opens a position of only 0.01 lot. Obviously, we need about 100 such training iterations to achieve the desired result.

It would seem, why beat our heads against the wall? Let's increase the learning rate to 1 and memorise the gained experience at once. But there is another side of the coin: in this case, the model will immediately forget all the accumulated experience, and generalisation is out of the question.

That is why we keep beating our heads against the wall until we learn a simple truth. The whole process of model training is a directed selection of parameters by trial and error.

I see. Thank you very much for such a clear explanation.

Then the question arises: why train the model for 100 000 iterations with a database of 200 trajectories and constantly repeat the process of collecting new examples? Why can't the Expert Advisor create a database of, say, 1000 trajectories, set 10 000 000 training iterations and be left to learn overnight, for a day, or for a week? Why is it necessary to constantly replenish the database and train for a small number of iterations?

 
Viktor Kudriavtsev #:

I see. Thank you very much for such a clear explanation.

Then the question arises: why train the model for 100 000 iterations with a database of 200 trajectories and constantly repeat the process of collecting new examples? Why can't the Expert Advisor create a database of, say, 1000 trajectories, set 10 000 000 training iterations and be left to learn overnight, for a day, or for a week? Why is it necessary to constantly replenish the database and train for a small number of iterations?

Theoretically you can, but it all comes down to resources. For example, say the TP can be up to 1000 points. In the concept of a continuous action space, that is 1000 options. Even if we take it in increments of 10 points, that's 100 variants. Take the same number of SL variants, or even half as many (50 variants). Add at least 5 variants of the trade volume and we get 100 * 50 * 5 = 25 000 variants. Multiply by 2 (buy / sell) and we get 50 000 variants for one candle. Multiply that by the length of the trajectory and you get the number of trajectories needed to fully cover the whole possible space.
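For what it's worth, the same back-of-the-envelope count in a few lines of Python (using the step sizes assumed above):

tp_variants     = 1000 // 10   # TP up to 1000 points in steps of 10 -> 100 variants
sl_variants     = 50           # roughly half as many SL variants
volume_variants = 5            # at least 5 trade-volume variants
directions      = 2            # buy / sell

per_candle = tp_variants * sl_variants * volume_variants * directions
print(per_candle)              # 50 000 action variants for a single candle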

In step-by-step learning, we sample trajectories in the immediate vicinity of the current Actor's actions. Thus we narrow down the area of study: we explore not all possible variants, but only a small area, searching for variants that improve the current strategy. After a small "tuning" of the current strategy, we collect new data in the area where these improvements have led us and determine the further vector of movement.
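A hedged sketch of what sampling "in the immediate vicinity of the current Actor's actions" might look like (illustrative Python; the action layout and noise scales are invented):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical current action: volume, TP in points, SL in points
current_action = np.array([0.10, 350.0, 150.0])
noise_scale    = np.array([0.02, 20.0, 10.0])   # explore only a small area around the current policy

# A handful of exploratory actions close to the current strategy
exploratory = current_action + rng.normal(0.0, noise_scale, size=(10, 3))
print(exploratory.round(2))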

This is reminiscent of finding a way out of an unknown maze, or of the path of a tourist walking down the street and asking passers-by for directions.

 
Dmitriy Gizlyk #:

Theoretically you can, but it all comes down to resources. For example, say the TP can be up to 1000 points. In the concept of a continuous action space, that is 1000 options. Even if we take it in increments of 10 points, that's 100 variants. Take the same number of SL variants, or even half as many (50 variants). Add at least 5 variants of the trade volume and we get 100 * 50 * 5 = 25 000 variants. Multiply by 2 (buy / sell) and we get 50 000 variants for one candle. Multiply that by the length of the trajectory and you get the number of trajectories needed to fully cover the whole possible space.

In step-by-step learning, we sample trajectories in the immediate vicinity of the current Actor's actions. Thus we narrow down the area of study: we explore not all possible variants, but only a small area, searching for variants that improve the current strategy. After a small "tuning" of the current strategy, we collect new data in the area where these improvements have led us and determine the further vector of movement.

This is reminiscent of finding a way out of an unknown maze, or of the path of a tourist walking down the street and asking passers-by for directions.

I see. Thank you.

I've noticed now that when Research.mqh collects examples, the results somehow form groups with a very close final balance within each group. And there seems to be some progress in Research.mqh (positive groups of outcomes have started to appear more often, or something like that). But with Test.mqh there seems to be no progress at all. It gives rather random results and, in general, more often finishes a pass at a loss. Sometimes it goes up and then down, and sometimes it goes straight down and then stalls. It also seems to increase the entry volume towards the end. Sometimes it trades not at a loss, but just around zero. I have also noticed that the number of trades changes: in one pass it opens about 150 trades over 5 months, in another about 500. Is what I am observing normal?

 
Viktor Kudriavtsev #:

I see. Thank you.

I've noticed now that when Research.mqh collects examples, the results somehow form groups with a very close final balance within each group. And there seems to be some progress in Research.mqh (positive groups of outcomes have started to appear more often, or something like that). But with Test.mqh there seems to be no progress at all. It gives rather random results and, in general, more often finishes a pass at a loss. Sometimes it goes up and then down, and sometimes it goes straight down and then stalls. It also seems to increase the entry volume towards the end. Sometimes it trades not at a loss, but just around zero. I have also noticed that the number of trades changes: in one pass it opens about 150 trades over 5 months, in another about 500. Is what I am observing normal?

The randomness is a result of the Actor's stochasticity. As training progresses, it will decrease. It may not disappear completely, but the results will be close to each other.

 
Dmitriy, something has now changed with my neural network: it has started to open trades strangely (it opens on one candle and closes on the next), and for some reason the balance and equity do not change. It just draws a straight line on the chart, and the balance change in the pass results is 0. Moreover, this happens with both Test.mqh and Research.mqh, and the whole database is now filled with such passes. Is this normal? What should I do: continue training, or should I delete the database, temporarily move the trained models to another folder, create a new database with random models, and then bring the models back and continue training them - to somehow get them off the straight line?
 
Viktor Kudriavtsev #:

It just draws a straight line on the chart, and the balance change in the pass results is 0. Moreover, this happens with both Test.mqh and Research.mqh, and the whole database is now filled with such passes. Is this normal? What should I do: continue training, or should I delete the database, temporarily move the trained models to another folder, create a new database with random models, and then bring the models back and continue training them - to somehow get them off the straight line?

The example database will not be "clogged" with passes without deals: Research.mq5 has a check and does not save such passes. But it is good that such a pass from Test.mq5 is saved. There is a penalty for the absence of deals when generating the reward, and it should help the model get out of this situation.
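A minimal sketch of that penalty idea (not the article's actual reward function; the penalty value here is invented):

def shaped_reward(balance_change, deals_count, no_deal_penalty=-1.0):
    # The base reward follows the balance change; a pass with no deals at all
    # receives an extra penalty, so "drawing a straight line" is not a comfortable optimum.
    reward = balance_change
    if deals_count == 0:
        reward += no_deal_penalty
    return reward

print(shaped_reward(0.0, 0))     # -1.0: doing nothing for the whole pass is punished
print(shaped_reward(25.0, 14))   # 25.0: a normal pass, no penalty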

 

Dmitriy, I have done more than 90 cycles (training - test - database collection) and my model still gives random results. Out of 10 runs of Test.mqh, roughly 7 end at a loss, 2-3 end around zero, and only once or twice every 4-5 cycles is there a run in profit. You indicated in the article that you got a positive result after 15 cycles. I understand that there is a lot of randomness in the system, but I do not understand why there is such a difference. I could understand it if my model gave a positive result after 30 cycles, or say 50, but it is already 90 and there is no visible progress...

Are you sure you have posted the same code that you trained yourself? Maybe you corrected something for your tests, accidentally forgot about it, and posted the wrong version?

And if, for example, the learning rate is increased by an order of magnitude, won't it learn faster?

I don't understand something...