Machine learning in trading: theory, models, practice and algo-trading - page 2903

 

Hey, everybody!

Maybe someone can give me some advice. I am trying to predict the direction of a currency pair for the day (up or down) using the "DecisionTreeClassifier" model.

I use only 5 predictors; the target is the trend direction for the day, up (1) or down (-1). Dataset size: 999 rows and 6 columns (dataset attached).

But I ran into a problem: increasing "max_depth" keeps raising the accuracy on the training and test samples simultaneously. The accuracy on the test sample stops growing at max_depth=22 and stays constant at 0.780000. Results at different values of max_depth:


1) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=3)

Accuracy on training set: 0.539424 Accuracy on test set: 0.565000

2) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=5)

Accuracy on training set: 0.579474 Accuracy on test set: 0.585000

3) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=7)

Accuracy on training set: 0.637046 Accuracy on test set: 0.640000

4) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=9)

Accuracy on training set: 0.667084 Accuracy on test set: 0.700000

5) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=11)

Accuracy on training set: 0.700876 Accuracy on test set: 0.710000

6) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=13)

Accuracy on training set: 0.720901 Accuracy on test set: 0.720000

7) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=15)

Accuracy on training set: 0.734668 Accuracy on test set: 0.740000

8) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=17)

Accuracy on training set: 0.747184 Accuracy on test set: 0.760000

9) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=19)

Accuracy on training set: 0.755945 Accuracy on test set: 0.765000

10) clf_20=DecisionTreeClassifier(criterion='entropy', max_depth=22)

Accuracy on training set: 0.760951 Accuracy on test set: 0.780000


I am extremely confused by this situation, because I have heard that you should not use max_depth above 3-4, since overfitting becomes likely. But is this how an overfitted model behaves? It looks more like an underfitted model.

In this situation I don't understand what tree depth to choose, or even which model to use, and whether it is worth working further in this direction at all. Maybe something is missing (although the dataset is not just 100 rows). Is it possible to add more predictors, and how many can be added for a dataset of this size (I would add 2-5 more)?

The code is simple, I also attach it together with the dataset:



 
Elvin Nasirov #:

I am very confused by this situation, because I have heard that you should not use max_depth above 3-4, since overfitting becomes likely. But is this how an overfitted model behaves? It looks more like an underfitted model.

In this situation I don't understand what tree depth to choose, or even which model to use, and whether it is worth working further in this direction at all. Maybe something is missing (although the dataset is not just 100 rows). Is it possible to add more predictors, and how many can be added for a dataset of this size (I would add 2-5 more)?

The code is simple, I also attach it together with the dataset:

Hello.

More splits means more memory, and so a higher risk that the tree simply memorizes the sample.

I'm not proficient in Python, but:

1. Try splitting the sample without shuffling.

2. It still seems to me that you are training on the whole sample, not on the reduced (training) part.

 
Aleksey Vyazmikin #:

Hello.

More splits means more memory, and so a higher risk that the tree simply memorizes the sample.

I'm not proficient in Python, but:

1. Try splitting the sample without shuffling.

2. It seems to me that you are training on the whole sample, not on the reduced (training) part.

Thank you! It seems that you are right.

I replaced "clf_20.fit(X, y)" with "clf_20.fit(X_train, y_train)" in the above code, and the picture changed: the accuracy is now almost 50/50.

 
Elvin Nasirov #:

Thank you! I think you're right.

I replaced "clf_20.fit(X, y)" with "clf_20.fit(X_train, y_train)" in the above code, and the picture changed: the accuracy is now almost 50/50.

Such a result is normal. A result that is too good is always a reason to start looking for a bug in the code.

 
Aleksey Vyazmikin #:

Such a result is normal. A result that is too good is always a reason to start looking for a bug in the code.

I have another question, if I may.

It turns out that the best result is achieved at max_depth=1 and looks like this:

Accuracy on training set: 0.515021 Accuracy on test set: 0.503333

It seems to be extremely bad, no better than flipping a coin. Or can we consider this a good result and conclude that we have found a formalization under which the probability of a forex move equals the probability of a coin-flip outcome?

That is, for each combination of predictors there are two equally likely market moves, up or down, so the dataset needs to be supplemented with something that could tell whether, for the current combination, the move will be up or down.

 
Elvin Nasirov #:

Another question came up, if I may.

It turned out that the best result is achieved at max_depth=1 and looks like this:

Accuracy on training set: 0.515021 Accuracy on test set: 0.503333

It seems to be extremely bad, no better than flipping a coin. Or can we consider this a good result and conclude that we have found a formalization under which the probability of a forex move equals the probability of a coin-flip outcome?

That is, for each combination of predictors there are two equally likely market moves, up or down, so the dataset needs to be supplemented with something that could tell whether, for the current combination, the move will be up or down.

First, read about other metrics for evaluating training results: Recall and Precision. They are especially relevant for an unbalanced sample. A strategy can produce a positive financial outcome even when the classification is right and wrong with equal probability.
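As a rough illustration of these two points, here is a minimal pure-Python sketch. The labels, predictions, win and loss sizes are made-up numbers (assumptions), and the function names precision_recall and expectancy are hypothetical helpers, not part of any library. Precision is the share of predicted "up" days that really were up; recall is the share of real "up" days that the model caught. The expectancy function shows how a 50% hit rate can still pay when the average win is larger than the average loss.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for one class, computed from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def expectancy(p_win, avg_win, avg_loss):
    """Average result per trade: p_win * avg_win - (1 - p_win) * avg_loss."""
    return p_win * avg_win - (1.0 - p_win) * avg_loss

# Made-up example data (assumption): 1 = up day, -1 = down day.
y_true = [1, 1, 1, -1, -1, 1, -1, 1]
y_pred = [1, -1, 1, 1, -1, 1, -1, -1]

for cls in (1, -1):
    p, r = precision_recall(y_true, y_pred, positive=cls)
    print(f"class {cls:+d}: precision={p:.3f} recall={r:.3f}")

# A coin-flip hit rate with wins twice the size of losses (hypothetical sizes).
print("expectancy per trade:", expectancy(0.5, 2.0, 1.0))
```

scikit-learn offers the same metrics through sklearn.metrics.precision_score and recall_score; the hand-written version is used here only to make the definitions visible.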

Consider a more complex but logical target labeling. Predicting at the open how the day will close is harder than predicting the probability of a rise or fall by some percentage from the day's open, because there is a chance of identifying an intraday pattern.
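One possible reading of this suggestion, sketched under stated assumptions: label a day by whether the price moved a given percentage away from the open during the day, rather than by the sign of close minus open. The function name label_day, the 0.5% threshold and the sample bars are all hypothetical. Daily bars cannot tell which threshold was hit first, so days that touch both are treated as unlabeled here.

```python
def label_day(open_, high, low, pct=0.005):
    """Label one daily bar by a move of at least `pct` from the open.

    Returns 1 if only the upper threshold was reached, -1 if only the
    lower one was reached, and 0 if neither or both were reached
    (with daily bars the order of the two touches is unknown).
    """
    reached_up = high >= open_ * (1.0 + pct)
    reached_down = low <= open_ * (1.0 - pct)
    if reached_up and not reached_down:
        return 1
    if reached_down and not reached_up:
        return -1
    return 0

# Hypothetical (open, high, low) bars, not taken from the attached dataset.
bars = [
    (1.1000, 1.1080, 1.0990),  # rises by more than 0.5% from the open
    (1.1000, 1.1020, 1.0930),  # falls by more than 0.5% from the open
    (1.1000, 1.1030, 1.0980),  # stays inside both thresholds
    (1.1000, 1.1100, 1.0900),  # touches both thresholds: ambiguous
]
labels = [label_day(o, h, l) for o, h, l in bars]
print(labels)
```

Resolving the ambiguous days would need intraday data, which is outside this sketch.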

In my opinion, the sample is too small.

Think about creating predictors that can describe the market. From the predictor values, in my opinion, it should be possible to reconstruct the situation on the chart without looking at it.

I recommend trying CatBoost for training: it builds models quickly, and the problem of transferring models into code that works in MT5 without workarounds is already solved.

 
Elvin Nasirov #:

It turns out that the best result is achieved when max_depth=1 and it looks like this:

Accuracy on training set: 0.515021 Accuracy on test set: 0.503333

I also often see that the best result is at depth=1, which means that only one split on one of the features was made. Further splitting of the tree leads to overfitting on the training set and worse results on the test set.
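A depth comparison of this kind is less dependent on one lucky split if each depth is scored on several time-ordered folds. The sketch below is one way to do that with scikit-learn's TimeSeriesSplit, which always tests on rows that come after the training rows. The data is a synthetic stand-in with random labels (an assumption), so every depth is expected to land near 0.5; on a real dataset the same loop shows where the out-of-sample score stops improving.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (assumption): no real signal in the labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(999, 5))
y = rng.choice([-1, 1], size=999)

# Each fold trains on an initial stretch of rows and tests on the next stretch.
cv = TimeSeriesSplit(n_splits=5)

mean_scores = {}
for depth in (1, 3, 5, 9):
    clf = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=0
    )
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    mean_scores[depth] = scores.mean()
    print(f"max_depth={depth}: mean out-of-sample accuracy = {scores.mean():.3f}")
```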

 
elibrarius #:

I also often see that the best result is at depth=1, which means that only one split on one of the features was made. Further splitting of the tree leads to overfitting on the training set and worse results on the test set.

I checked the results yesterday: it turned out that the model predicted "1" in all cases, hence about 50/50 on average. You can do without the model and simply say "up" every time.
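A check like this can be made routine before trusting any accuracy figure. The minimal sketch below uses made-up labels (an assumption) and hypothetical helper names: it counts how often each class is predicted and compares the model's accuracy with the accuracy of always predicting the most frequent class. If the two numbers match and only one class is ever predicted, the model adds nothing.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent true class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up example (assumption): a balanced test set and a model that
# answers "up" (1) on every row.
y_true = [1, -1, 1, 1, -1, -1, 1, -1, 1, -1]
y_pred = [1] * len(y_true)

pred_counts = Counter(y_pred)
is_constant = len(pred_counts) == 1
model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)

print("predicted class counts:", dict(pred_counts))
print("model accuracy:", model_acc, "baseline accuracy:", baseline_acc)
print("model is a constant predictor:", is_constant)
```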

 
Trading as a professional pro trader
https://youtu.be/RS9jRVmW1j4

This is what support and resistance levels are, in my understanding...

Not everyone will understand it, but if they do, kudos to them...

 
mytarmailS #:
Trading as a professional pro trader
https://youtu.be/RS9jRVmW1j4

This is what support and resistance levels are, in my understanding...

Not everyone will understand it, but if they do, I commend them...

If you do, you can trade like this.


Have you already put these levels into code? There are so many levels there that it is not realistic to trade them by hand...
