Discussion of article "Random Decision Forest in Reinforcement learning" - page 5

 
Maxim Dmitrievsky :

Hi, not ready yet. When it is completed, I'll write to you.

Thanks for your reply.

Have you ever thought about implementing Q-learning as a reward function in your current implementation of Random Forest?

I mean, would it be possible to use the Bellman equation as a way to reward the agent, updating the reward matrix at each closed candle so it can take its decisions?

I have some sample MQL5 code of the Q formula to implement, and if you are interested I will post it here. I tried to implement it myself, but I'm not very good at the matrix handling and I'm still not 100% sure how to use the matrix correctly.
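
Just to show roughly what I have in mind (this is only a sketch; the names Q, alpha, discount, state and action are my own placeholders, not anything from your code), the update at each candle close would look something like this:

// Rough sketch of the Q-learning (Bellman) update, not code from the article
double Q[100][2];      // hypothetical Q matrix: 100 discretised states x 2 actions (0 = BUY, 1 = SELL)
double alpha    = 0.1; // learning rate
double discount = 0.9; // discount factor

void UpdateQ(int state,int action,double reward,int next_state)
  {
   // best value achievable from the next state
   double max_next = MathMax(Q[next_state][0],Q[next_state][1]);
   // Bellman update: move Q(s,a) towards reward + discount * max over next actions
   Q[state][action] += alpha*(reward + discount*max_next - Q[state][action]);
  }

The idea would be to call UpdateQ() once per closed candle with the floating profit as the reward, but as I said, I am not sure this is the right way to handle the matrix.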

By the way, I have to admit that the EA sometimes gives promising and surprising results in certain market conditions, but not consistently once the market changes. So I am also trying to implement something so that the EA will automatically switch predictors (indicators) as soon as one loss happens. I saw you mention this idea of automatically selecting predictors in one of your comments, and I think it is also the most important part.

 
FxTrader562 :

Thanks for your reply.

Have you ever thought about implementing Q-learning as a reward function in your current implementation of Random Decision Forest?

I mean, would it be possible to use the Bellman equation as a way to reward the agent, updating the reward matrix at each closed candle so it can take its decisions?

I have some sample MQL5 code of the Q formula to implement, and if you are interested I will post it here. I tried to implement it myself, but I'm not very good at the matrix handling and I'm still not 100% sure how to use the matrix correctly.

By the way, I have to admit that the EA sometimes gives promising and surprising results in certain market conditions, but not consistently once the market changes. So I am also trying to implement something so that the EA will automatically switch predictors (indicators) as soon as one loss happens. I saw you mention this idea of automatically selecting predictors in one of your comments, and I think it is also the most important part.

Yes, I was thinking about Q-learning. The thing is that the random forest itself approximates the policy, so there is no need for the Bellman equation. Besides, Q-learning would be overkill here.

I am now looking for a way to do automatic feature transformations, such as "kernel tricks". Then we can iteratively train the model and select the variant whose transformed features give a small classification error on a test subset.
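
As a rough illustration of what one candidate transformation could look like (only a sketch, the helper name is arbitrary), a degree-2 polynomial expansion of the features:

// Sketch of a simple "kernel trick"-style feature transformation: the original features
// plus all pairwise products (degree-2 polynomial expansion). Not code from the article.
void PolyExpand(const double &x[],double &features[])
  {
   int n=ArraySize(x);
   int k=0;
   ArrayResize(features,n+n*(n+1)/2);
   for(int i=0;i<n;i++)
      features[k++]=x[i];            // original indicator values
   for(int i=0;i<n;i++)
      for(int j=i;j<n;j++)
         features[k++]=x[i]*x[j];    // degree-2 terms

  }

Each such candidate can then be used to train the forest, and the variant with the smallest classification error on the test subset is kept.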

 
Maxim Dmitrievsky :

Yes, I was thinking about Q-learning. The thing is that the random forest itself approximates the policy, so there is no need for the Bellman equation. Besides, Q-learning would be overkill here.

I am now looking for a way to do automatic feature transformations, such as "kernel tricks". Then we can iteratively train the model and select the variant whose transformed features give a small classification error on a test subset.

Yes, I agree with you about policy convergence. But the current policy implementation does not take into account consecutive losses in the trading history, the floating profit/loss, and so on. So by implementing Q-values I mean that the agent would have full knowledge of the current floating profit of each open trade and of the previous consecutive losses and gains, and would therefore approximate a policy that maximises profit, NOT one that only maximises the number of profitable trades, which becomes irrelevant when large losses occur.

What I mean is that the profit from a series of winning trades can be wiped out by one large loss, yet this is irrelevant to the agent, since it is simply aiming to maximise the number of profitable trades. With a Q-value we can give the agent an immediate reward based on the current floating profit, which it checks at each new candle to take its next decision, maximising profit and minimising drawdown regardless of how many trades are won or lost.
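
Just to illustrate the idea (a rough sketch with my own names, not your code), the immediate reward could be taken from the floating profit like this:

// Rough sketch: map the current floating profit of the open position into a reward in [0..1].
// 0.5 means break-even, above 0.5 is floating profit, below 0.5 floating loss. The scale is arbitrary.
double FloatingProfitReward()
  {
   if(!PositionSelect(_Symbol))
      return(0.5);                                        // no open position: neutral reward
   double floating=PositionGetDouble(POSITION_PROFIT);    // current floating profit/loss
   double scale=100.0;                                    // hypothetical scaling constant
   return(MathMin(1.0,MathMax(0.0,0.5+floating/(2.0*scale))));
  }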

Anyway, if you plan to implement something to train the model iteratively and automatically after successive losses, that could be very useful. I will be looking for something like this in your next article.

Also, based on my training and testing of the EA over the last couple of weeks, I've noticed that you simply have to switch to a different model (different indicator values or indicators) automatically when a loss occurs. Otherwise the EA gives a series of profitable trades while the market suits the strategy for a certain amount of time, but once one loss happens, the same series of losses keeps repeating for quite some time.

 

FxTrader562 :

Yes, I agree with you about policy convergence. But the current policy implementation does not take into account consecutive losses in the trading history, the floating profit/loss, and so on. So by implementing Q-values I mean that the agent would have full knowledge of the current floating profit of each open trade and of the previous consecutive losses and gains, and would therefore approximate a policy that maximises profit, NOT one that only maximises the number of profitable trades, which becomes irrelevant when large losses occur.

What I mean is that the profit from a series of winning trades can be wiped out by one large loss, yet this is irrelevant to the agent, since it is simply aiming to maximise the number of profitable trades. With a Q-value we can give the agent an immediate reward based on the current floating profit, which it checks at each new candle to take its next decision, maximising profit and minimising drawdown regardless of how many trades are won or lost.

Anyway, if you plan to implement something to train the model iteratively and automatically after successive losses, that could be very useful. I will be looking for something like this in your next article.

Also, based on my training and testing of the EA over the last couple of weeks, I've noticed that you simply have to switch to a different model (different indicator values or indicators) automatically when a loss occurs. Otherwise the EA gives a series of profitable trades while the market suits the strategy for a certain amount of time, but once one loss happens, the same series of losses keeps repeating for quite some time.

So I think automatic optimisation would be useful to implement in this case. I think an article on automatic optimisation already exists, and if you can implement it for your current EA, the task will be complete.




 
FxTrader562 :

Yes, I agree with you about policy convergence. But the current policy implementation does not take into account consecutive losses in the trading history, the floating profit/loss, and so on. So by implementing Q-values I mean that the agent would have full knowledge of the current floating profit of each open trade and of the previous consecutive losses and gains, and would therefore approximate a policy that maximises profit, NOT one that only maximises the number of profitable trades, which becomes irrelevant when large losses occur.

What I mean is that the profit from a series of winning trades can be wiped out by one large loss, yet this is irrelevant to the agent, since it is simply aiming to maximise the number of profitable trades. With a Q-value we can give the agent an immediate reward based on the current floating profit, which it checks at each new candle to take its next decision, maximising profit and minimising drawdown regardless of how many trades are won or lost.

Anyway, if you plan to implement something to train the model iteratively and automatically after successive losses, that could be very useful. I will be looking for something like this in your next article.

Also, based on my training and testing of the EA over the last couple of weeks, I've noticed that you simply have to switch to a different model (different indicator values or indicators) automatically when a loss occurs. Otherwise the EA gives a series of profitable trades while the market suits the strategy for a certain amount of time, but once one loss happens, the same series of losses keeps repeating for quite some time.

So I think automatic optimisation would be useful to implement in this case. I think an article on automatic optimisation already exists, and if you can implement it for your current EA, the task will be complete.

For example, you can change the reward function to approximate the Sharpe ratio, or other metrics. I've tried different functions and realised that making it more complicated doesn't give much advantage.
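
Roughly like this (just a sketch, not code from the article): compute a Sharpe-like value over the returns of the last trades and squash it into the [0..1] range used for the reward.

// Sketch of a Sharpe-like reward over the last trades' returns, squashed into [0..1]
double SharpeReward(const double &returns[])
  {
   int n=ArraySize(returns);
   if(n<2)
      return(0.5);                           // no history yet: neutral reward
   double mean=0.0,var=0.0;
   for(int i=0;i<n;i++)
      mean+=returns[i];
   mean/=n;
   for(int i=0;i<n;i++)
      var+=MathPow(returns[i]-mean,2);
   double sd=MathSqrt(var/(n-1));
   if(sd==0.0)
      return(mean>0.0 ? 1.0 : 0.0);
   double sharpe=mean/sd;
   return(MathMin(1.0,MathMax(0.0,0.5+0.25*sharpe)));   // rough mapping to [0..1]
  }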

You can also read this: https://github.com/darden1/tradingrrl

Automatic optimisation is a good idea, but I'm working on improving the current algorithm now.

 
Maxim Dmitrievsky :

For example, you can change the reward function to approximate the Sharpe ratio. Or other metrics. I've tried different functions and realised that making it more complicated doesn't give much advantage.

You can also read this: https://github.com/darden1/tradingrrl

Automatic optimisation is a good idea, but now I'm working on improving the current algorithm.

Thanks for the article. I will be looking into it.

There is no doubt that in terms of ease of coding, speed of learning, and accuracy of results, the current implementation is still the best I have ever seen in machine learning, and by adding a few more indicators, it is likely that the results could even be greatly enhanced.

I completely agree with you that small complications make the results worse, and the EA learns best on its own. I tried applying a stop loss and take profit to limit the size of losses, and the results got worse with a tight stop loss.

But the only thing missing is iterative learning. I mean that the algorithm learns only during optimisation and after that it is totally dependent on the trained data, so in a sense we can't call it "reinforcement learning", because it doesn't learn during trading, only during training.

So I am only looking for some solution to automate the optimisation after every loss. I mean that after every loss, along with the reward update, the EA should call the optimiser to retrain on the last month's data. Or we can pause trading for a while after a loss, and once the optimisation is complete the EA resumes trading again. This way the trained trees (the Mtrees text file) will always contain the latest policy based on the current market.
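
Something along these lines is what I imagine, although I don't know how to connect it to your EA; RetrainModel() below is only a placeholder, not a function from your code:

// Hypothetical sketch: after a closed losing trade, pause and retrain on the last month of data
void RetrainModel(datetime from,datetime to) { /* placeholder: run the optimiser here */ }

bool retraining=false;

void CheckLastTradeAndRetrain()
  {
   if(!HistorySelect(TimeCurrent()-PeriodSeconds(PERIOD_D1),TimeCurrent()))
      return;
   int deals=HistoryDealsTotal();
   if(deals==0) return;
   ulong ticket=HistoryDealGetTicket(deals-1);             // most recent deal
   double last_profit=HistoryDealGetDouble(ticket,DEAL_PROFIT);
   if(last_profit<0.0 && !retraining)
     {
      retraining=true;                                     // pause new entries while retraining
      datetime to=TimeCurrent();
      datetime from=to-30*24*60*60;                        // roughly the last month
      RetrainModel(from,to);
      retraining=false;
     }
  }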

There are probably some articles on automatic optimisation, but I'm not an EA programmer and hence I haven't found a way to integrate it into your EA so far.

Since your current implementation already uses the policy from the prepared data, unlike other EAs it does not need to save any values after automatic optimisation. Just running the optimiser with the start and end dates and clicking the start button would be enough to automate the optimisation.

 
FxTrader562:

Thanks for the article. I'll be looking into it.

There is no doubt that in terms of ease of coding, speed of learning, and accuracy of results, the current implementation is still the best I have ever seen in machine learning, and by adding a few more indicators, it is likely that the results could even be greatly enhanced.

I completely agree with you that small complications make the results worse, and the EA learns best on its own. I tried applying a stop loss and take profit to limit the size of losses, and the results got worse with a tight stop loss.

But the only thing missing is iterative learning. I mean that the algorithm learns only during optimisation and after that it is totally dependent on the trained data, so in a sense we can't call it "reinforcement learning", because it doesn't learn during trading, only during training.

So I am only looking for some solution to automate the optimisation after every loss. I mean that after every loss, along with the reward update, the EA should call the optimiser to retrain on the last month's data. Or we can pause trading for a while after a loss, and once the optimisation is complete the EA resumes trading again. This way the trained trees (the Mtrees text file) will always contain the latest policy based on the current market.

There are probably some articles on automatic optimisation, but I'm not an EA programmer and hence I haven't found a way to integrate it into your EA so far.

Since your current implementation already uses the policy from the prepared data, unlike other EAs it does not need to save any values after automatic optimisation. Just running the optimiser with the start and end dates and clicking the start button would be enough to automate the optimisation.

I see what you mean, you need a virtual back tester for this. It is not difficult to write at all; maybe I will add it in one of the next articles.
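
In outline it is just a loop over the already closed bars (a sketch only; GetModelSignal() is a placeholder for the forest output, assuming, as in the reward coding above, that values near 1 mean sell and values near 0 mean buy):

// Bare-bones "virtual tester" sketch: replay closed bars and accumulate the profit
// the model would have made. GetModelSignal() is a placeholder, not the article's code.
double GetModelSignal(int bar) { return(0.5); }            // placeholder for the RDF output at that bar

double VirtualBacktest(int bars)
  {
   double equity=0.0;
   for(int i=bars;i>=2;i--)                                // from the oldest bar towards the newest closed one
     {
      double signal=GetModelSignal(i);
      double move=iClose(_Symbol,_Period,i-1)-iClose(_Symbol,_Period,i);  // next bar's price change
      if(signal>0.5) equity-=move;                         // treat values above 0.5 as SELL
      else           equity+=move;                         // and the rest as BUY
     }
   return(equity);
  }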

 
Maxim Dmitrievsky :

I see your point, this requires a virtual tester. It's not hard to write, I will probably add it to the next articles.

Thank you very much. I will look at it in your future articles.

Yes, I also think it should not be a difficult task, especially for your EA, since there is nothing much to do except run the optimiser from the start date to today's date, and the optimisation period can be specified in the EA input settings. I mean, there is no reading from or writing to the optimised files, as this is already done by your EA. But I don't know exactly how to do it, so I will wait for your update.

By the way, the most unusual thing that attracted me to the Random Decision Forest (RDF) is that I noticed the basic model of the RDF implementation is very similar to the game of Go, although I could be wrong in my observation. So if the AlphaGo machine learning algorithm can beat a complex game like Go, RDF can definitely beat the forex market. I mean, I strongly believe it is very easy to get 99%-accurate trades using RDF if sufficient input variables (indicators) are fed, and fed continuously, to develop and maintain an optimal policy while a trade is open.

Thanks again for your time.

 

Good afternoon,

I am posting the results of some experiments (obtained on pure trees, without fuzzy logic). I was going to attach them to a new article, but since the discussion of reward functions continues, I am posting them here as food for thought and discussion.

1. It did not seem quite right to me that, for SELL say, the random value is drawn over the whole interval 0..1, when we already know that the sell turned out to be unprofitable:

if(RDFpolisyMatrix[numberOfsamples-1][iNeuronEntra]==1.0)//SELL
   likelyhood = MathRandomUniform(0.0,0.6,unierr);   // losing SELL: draw from the lower range
else                                                 //BUY
   likelyhood = MathRandomUniform(0.4,1.0,unierr);   // losing BUY: draw from the upper range

By limiting the ranges to the opposite and uncertain values, the speed of learning increases many times over. With 2-3 runs (as I understand it, with a pass over random data) the quality of training is about the same as with 4-6 runs of the old variant (the scatter is wide, because there are many additional factors, but the efficiency has increased by far more than tens of percent).

2. In the initial implementation I found it strange that a randomly obtained value acts as the reinforcement. This easily creates a situation where a strong trend gets a lower reward.

My first attempt to get away from this:

if(RDFpolisyMatrix[numberOfsamples-1][iNeuronEntra]==1.0)//SELL
      nagrada=MathMin(1.0,0.61+NormalizeDouble(profit/WS.PriceStep()/250,2));

The idea: at 100 pips of profit and above the reward is 1, and below that it grows evenly (in this case starting from 0.61). The example is given for selling; for buying it is similar, with other levels. Theoretically, a stronger trend gets a higher reward. The results improved, but only slightly beyond the statistical error. At the same time, the file with the trees for the same conditions shrank significantly in size. Apparently this peculiar sorting of the results allowed the rules to be described more simply.
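
To spell out the "other levels" for buying, the mirrored branch could look roughly like this (the specific numbers below are only an example, not my actual values):

if(RDFpolisyMatrix[numberOfsamples-1][iNeuronEntra]==1.0)//SELL
   nagrada=MathMin(1.0,0.61+NormalizeDouble(profit/WS.PriceStep()/250,2));
else                                                      //BUY
   nagrada=MathMax(0.0,0.39-NormalizeDouble(profit/WS.PriceStep()/250,2));  // illustrative levels only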

To test the ensemble of trees, I decided to load the evaluation of a single tree:

updatePolicy(0); //for BUY
updatePolicy(1); //for SELL

and, out of habit, ran the training. To my surprise, similar training with this coarsened reward function showed a significant improvement: on the training interval, all other things being equal, the profit for 4 months exceeded the profit for 6 months of the old variant (I am speaking in comparisons, because the specific figures vary greatly with the training conditions, the pair, and the crookedness of the coder's hands), and, most interestingly, the results on the control interval improved as well. Coarsening the evaluation function improved the prediction! For a professional statistician there is probably nothing new here, and he could prove with formulas that it has to be this way, but for me it was a shock; as they say, I will have to get used to it. And now the question arises of how to further select and evaluate prediction functions.

I hope the time I have spent on these tests helps someone at least shorten their own search (or gives them the chance to make new mistakes, which they will then share with us).

 

And how realistic is it to teach the code kindly provided by the author of the article the simplest patterns of 3-5 bars?

P.S.: hmm, even under the influence of alcohol I write as if I were messaging a Chinese seller on AliExpress ))))