Machine learning in trading: theory, models, practice and algo-trading - page 1325

 
Farkhat Guzairov:

Everything above was cool and very informative, but what does "control training" mean???

That is, for example, you train the system on a sample of 2014 data, then feed it a sample from 2015 and want to see the probability of the patterns holding? If so, then you don't need to swap anything, everything is correct. I just don't see a problem in the control giving results that differ from the expected ones - it will always be that way.

I use CatBoost for training; it can stop training based on a control (validation) sample, i.e. while the error is being reduced on the training sample, the result is checked in parallel on the control sample, and if the result on the control sample does not improve for a specified number of trees, training stops and all trees after the last improvement are cut off. Yes, the chronology is like this: I train on 2014, use 2015 to 2018 as the control for training, and check the result on 2018. Maybe it makes sense to swap them around, because the patterns found during training can still stop working over time, and it may be better to train on data that is closer to where the model will actually be applied - an open question.
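For reference, a minimal sketch of that kind of CatBoost setup (the variable names and parameter values here are placeholders for illustration, not the poster's actual configuration):

```python
from catboost import CatBoostClassifier

# X_train/y_train: training sample (e.g. 2014 data)
# X_val/y_val: control (validation) sample (e.g. 2015-2018 data)
model = CatBoostClassifier(
    iterations=5000,     # upper bound on the number of trees
    od_type="Iter",      # overfitting detector counts iterations without improvement
    od_wait=100,         # stop if the eval_set metric has not improved for 100 trees
    verbose=False,
)
model.fit(
    X_train, y_train,
    eval_set=(X_val, y_val),  # the error is monitored on the control sample
    use_best_model=True,      # cut off all trees added after the last improvement
)
```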

 
Farkhat Guzairov:

If we start from the practical application of ML in your case, then, in my opinion, we should proceed from the following.

Since it is simply unrealistic to get a 100% probability of the true outcome, follow a simple approach, for example the ratio of true to false results. If it is about 50/50, then we also need to understand what profit you get with these results: if the 50% of winners average 100 points and the remaining 50% of losers average 50 points, then I think your system is already suitable for practical use.

The classification accuracy in the table is the Precision metric - about 60% of entries turn out correct (for the best models). In the tester it will be higher, because some positions should close at break-even rather than reach take profit.

It's too early to apply it yet; we should get ready for the next stage - taking the models apart leaf by leaf, like a herbarium :)
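As a rough illustration of the expectancy argument and the Precision figure above (a minimal sketch; the numbers are only the ones quoted in this exchange, not measured results):

```python
# Figures quoted above (illustrative only):
p_win    = 0.5    # share of true (profitable) signals
avg_win  = 100.0  # average profit of a winning trade, in points
avg_loss = 50.0   # average loss of a losing trade, in points

expectancy = p_win * avg_win - (1 - p_win) * avg_loss
print(expectancy)  # 25.0 points per trade: positive, so usable in principle

# With ~60% precision and a symmetric 1:1 payoff the expectancy is also positive:
print(0.6 * 1.0 - 0.4 * 1.0)  # 0.2 of the unit risk per trade
```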

 
Aleksey Vyazmikin:

I use CatBoost for training; it can stop training based on a control (validation) sample, i.e. while the error is being reduced on the training sample, the result is checked in parallel on the control sample, and if the result on the control sample does not improve for a specified number of trees, training stops and all trees after the last improvement are cut off. Yes, the chronology is like this: I train on 2014, use 2015 to 2018 as the control for training, and check the result on 2018. Maybe it makes sense to swap them around, because the patterns found during training can still stop working over time, and it may be better to train on data that is closer to where the model will actually be applied - an open question.

Here is what I've noticed myself, for example. The more data is involved in training, the more "tight" (constrained) the system becomes, i.e. it outputs lower outcome probabilities. Why? You gave the answer yourself: in certain periods a model gives a positive result, and the same model gives a negative result in a different period. As a result you drive the system into a stupor, it becomes "tight" as I said. Maybe it becomes more intelligent, but that doesn't mean the intelligent system will give more true results; I'm afraid the proportions will stay at the same level, the system will just tell you its view of the current situation less often.

 
Aleksey Vyazmikin:

There was a newer lecture about boosting (in Python, with CatBoost as an option) by the same lecturer - I can't find it.


Interestingly, GBM solves the classification problem with regression trees.

Does anyone know? Do other boosting methods (packages) do the same?

 
Aleksey Vyazmikin:

And what conclusion can be drawn? It seems that the optimal volume is 60%-70% of the validation sample, i.e. training should happen on a smaller sample than the model validation. But the 30% split can't be ignored either - there the result is also decent on all indicators, while the splits right next to it, 40% and 50%, fail. I don't even know which matters more, the sample size or its composition, and how to tune this...

If 60-70% is good and 30% is also good, then there's a chance you just hit these numbers by accident.
You can try to repeat the calculations in full; if the second run gives the same result, then you can consider it a pattern. (For greater statistical significance you would need to repeat it 10 times.)
 
Farkhat Guzairov:

Here is what I've noticed myself, for example. The more data is involved in training, the more "tight" (constrained) the system becomes, i.e. it outputs lower outcome probabilities. Why? You gave the answer yourself: in certain periods a model gives a positive result, and the same model gives a negative result in a different period. As a result you drive the system into a stupor, it becomes "tight" as I said. Maybe it becomes more intelligent, but that doesn't mean the intelligent system will give more true results; I'm afraid the proportions will stay at the same level, the system will just tell you its view of the current situation less often.

I think it's better to have fewer but more precise signals in trading, and the models can be combined into ensembles of independent models - then the classification accuracy (precision) will stay high and the completeness (recall - the number of events classified as 1) will increase. The main thing is to somehow learn to generate models that differ from each other, again, as an option, through different sample splits.
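A minimal sketch of how such a combination could work (my illustration, not the poster's actual code: the OR-voting rule, the threshold and the model list are assumptions). Each model only fires when it is confident, so its entries stay selective, while together the models cover more events:

```python
import numpy as np

def ensemble_signal(models, X, threshold=0.7):
    # Probability of class 1 from each independently trained model.
    probas = np.column_stack([m.predict_proba(X)[:, 1] for m in models])
    # A trade signal fires if ANY model is confident enough:
    # each model stays selective, but together they flag more events.
    return (probas >= threshold).any(axis=1).astype(int)

# Example usage with models trained on different sample splits / seeds:
# signals = ensemble_signal([model_a, model_b, model_c], X_new)
```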

 
elibrarius:

Interestingly, GBM solves the classification problem with regression trees.

Does anyone know? Do other boosting methods (packages) do the same?

All the ones I know of do the same (this is mentioned in various places). There is no other way, because of how the training itself works. That's why I said earlier that the order of the trees, I think, can affect their weight in the answer, and that's what makes it reasonable to look at ensembles of leaves and at converting them into a single rule.
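For context, the reason is that gradient boosting fits each new tree to the gradient (pseudo-residuals) of the loss; for classification with log loss that gradient is a continuous value, so the trees are regression trees and only the summed output is passed through a sigmoid. A simplified sketch of the general GBM idea (not CatBoost's exact internals - the leaf values here are plain regression predictions rather than the Newton-step values real packages use):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_gbm_classifier(X, y, n_trees=100, lr=0.1, max_depth=3):
    """Toy gradient boosting for binary classification (y in {0, 1})."""
    F = np.zeros(len(y))                 # raw scores (log-odds), start from 0
    trees = []
    for _ in range(n_trees):
        residuals = y - sigmoid(F)       # negative gradient of the log loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)           # a *regression* tree on the residuals
        F += lr * tree.predict(X)        # shift the raw scores
        trees.append(tree)
    return trees

def predict_proba(trees, X, lr=0.1):
    F = sum(lr * t.predict(X) for t in trees)
    return sigmoid(F)                    # probability of class 1
```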

 
elibrarius:
If 60-70% is good and 30% is also good, then there's a chance you just hit these numbers by accident.
You can try to repeat the calculations in full; if the second run gives the same result, then you can consider it a pattern. (For greater statistical significance you would need to repeat it 10 times.)

Repeat how? I.e. it will come out the same, since the seed is fixed. I could take a new seed - I'll try later and see what happens.

On the other hand, 200 models were built per sample split, which is also not a small number.
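A minimal sketch of what that repetition could look like (assuming CatBoost and a simple chronological percentage split; the parameter values and split fractions are placeholders, not the actual setup discussed here - only the model's random seed is varied between runs):

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

def score_split(X, y, train_frac, seed):
    # Chronological split: first train_frac for training, the rest is halved
    # into a control (early-stopping) part and a final evaluation part.
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=train_frac, shuffle=False)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, shuffle=False)
    model = CatBoostClassifier(iterations=1000, od_type="Iter", od_wait=50,
                               random_seed=seed, verbose=False)
    model.fit(X_tr, y_tr, eval_set=(X_val, y_val), use_best_model=True)
    return model.score(X_test, y_test)   # accuracy on the untouched part

# Repeat each split fraction with several seeds to see whether the
# "30% vs 60-70%" picture is stable or just a lucky draw:
# for frac in (0.3, 0.4, 0.5, 0.6, 0.7):
#     scores = [score_split(X, y, frac, seed) for seed in range(10)]
#     print(frac, np.mean(scores), np.std(scores))
```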
 
No conclusions can be drawn from such a study in a non-stationary market.
 
Maxim Dmitrievsky:
No conclusions can be drawn from such a study in a non-stationary market

The sample is stationary; the split for training changed, but the split for independent evaluation stayed the same.

Please elaborate on your point.
