Machine learning in trading: theory, models, practice and algo-trading - page 163

 
SanSanych Fomenko:

1) If you look at the first publications of the author of the random forest algorithm, he seriously claimed that RF is not prone to overtraining at all and gave many examples. The randomForest package itself is built so as to exclude even the slightest suspicion of overtraining.

At the same time, random forest is the most overtrainable of algorithms. I have been burned by this myself.


2) The vast majority of machine learning publications are not tested on any analogue of the second file. The reason is trivial: the algorithms are NOT applied to time series, so a random split of file number one turns out to be quite sufficient. And that is indeed the case, for example, in handwritten text recognition.

1) Random forest, GBM, and any other method can overtrain. It goes unnoticed on well-behaved data and is very noticeable on heavily noisy data.
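Just to make the point concrete, here is a minimal sketch (Python/scikit-learn, not the R randomForest package discussed above, and purely illustrative): a forest "learns" pure noise, so in-sample accuracy looks perfect while out-of-sample accuracy stays at chance level.

```python
# Sketch: a random forest memorizing pure noise (scikit-learn; illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))        # 20 predictors of pure noise
y = rng.integers(0, 2, size=2000)      # labels unrelated to the predictors

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_tr, y_tr)
print("in-sample accuracy:    ", model.score(X_tr, y_tr))   # close to 1.0
print("out-of-sample accuracy:", model.score(X_te, y_te))   # close to 0.5 (chance)
```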

2) There are, there are publications discussing nested cross-validation with additional samples from a different time range.

 
Alexey Burnakov:

2) There are, there are publications discussing nested cross-validation with additional samples from a different time range.

If it's not too much trouble, a link, please.
 
SanSanych Fomenko:
If it's not too much trouble, a link, please.


One of the discussions: http://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection

Ibid: https://stats.stackexchange.com/questions/103828/use-of-nested-cross-validation

There are links to articles in the discussions.

One interesting article: http://www.andrewng.org/portfolio/preventing-overfitting-of-cross-validation-data/

As you can see from the title, it is about the overtraining that happens at the stage of evaluating models on the validation folds of cross-validation. Accordingly, in addition to cross-validation, we need yet another sample to evaluate the already selected model.
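For what it is worth, a minimal sketch of that scheme (Python/scikit-learn; my own illustration, not code from the linked articles; the data and parameter grid are placeholders): hyperparameters are tuned in an inner CV, the whole tuning procedure is scored in an outer CV, and a final time-separated hold-out is kept for the model that is ultimately selected.

```python
# Sketch: nested cross-validation plus a final deferred hold-out (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 10))
y = (X[:, 0] + rng.normal(scale=2.0, size=1500) > 0).astype(int)   # weak, noisy signal

# Keep the last 20% of observations as a time-separated hold-out.
split = int(len(X) * 0.8)
X_dev, y_dev = X[:split], y[:split]
X_hold, y_hold = X[split:], y[split:]

# Inner loop: hyperparameter search; outer loop: honest score of the whole search procedure.
inner = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8], "n_estimators": [100, 300]},
    cv=TimeSeriesSplit(n_splits=3),
)
outer_scores = cross_val_score(inner, X_dev, y_dev, cv=TimeSeriesSplit(n_splits=4))
print("nested CV accuracy:", outer_scores.mean())

# The finally selected model is still re-checked on data that never influenced the selection.
final = inner.fit(X_dev, y_dev).best_estimator_
print("hold-out accuracy :", final.score(X_hold, y_hold))
```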


In short (I already wrote about this):

A model selected via cross-validation must be revalidated on another, time-delayed sample.

And nested cross-validation implies building n k-fold cross-validations (on different data), followed by validation on n deferred samples (each time on different data).

And even that is not all. If selection happens again on top of the deferred samples, for example a committee of models is assembled based on those deferred samples, then the committee itself must be validated on yet one more deferred sample.

Ideally, this process:

k-fold cross-validation

-------------------------------- repeated n times

------------------------------------------------------------- a committee is formed on the resulting data

------------------------------------------------------------------------------------------------------------------------ the committee is validated on one more sample from the future

needs to be repeated not once but m times, so that at the very top level you get a DISTRIBUTION of results. This lowers the bias to a practically achievable minimum.

But in doing so, the expected value of, for example, FS may decrease many times... Pain.
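A rough sketch of the whole loop above (Python/scikit-learn on synthetic data; the chunk sizes, the committee rule, and the "better than chance" filter are my own placeholders): n k-fold cross-validations on different pieces of data, a committee formed from the surviving models, the committee checked on a later sample, and the whole thing repeated m times so that the top level yields a distribution of results.

```python
# Sketch of the scheme above: n k-fold CVs -> committee -> validation on a later
# sample, repeated m times (synthetic data, illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(6000, 8))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=2.0, size=6000) > 0).astype(int)

m, n, chunk = 5, 4, 1000          # m top-level repetitions, n CV blocks per repetition
committee_scores = []

for rep in range(m):
    start = rep * chunk
    X_fit, y_fit = X[start:start + chunk], y[start:start + chunk]       # training window
    X_fut, y_fut = X[start + chunk:start + chunk + 200], y[start + chunk:start + chunk + 200]

    committee = []
    for block in np.array_split(np.arange(len(X_fit)), n):              # n different data pieces
        model = RandomForestClassifier(n_estimators=100, random_state=rep)
        cv_acc = cross_val_score(model, X_fit[block], y_fit[block], cv=KFold(5)).mean()
        if cv_acc > 0.5:                                                 # keep only models that beat chance in CV
            committee.append(model.fit(X_fit[block], y_fit[block]))
    if not committee:
        continue

    # Committee vote, then an honest check on the sample from the "future".
    votes = np.mean([mdl.predict(X_fut) for mdl in committee], axis=0) > 0.5
    committee_scores.append((votes == y_fut).mean())

print("distribution of committee accuracy:", np.round(committee_scores, 3))
```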

 
Alexey Burnakov:

nested cross-validation with additional samples from a different time range.

I do something similar too. Suppose I have a year of data for training. I train 12 models: one on the January data, the second on the February data, the third on March, and so on. I select predictors and model parameters so that any of these models, trained on a small part of the data, trades well over the whole year; this gives me some hope that the predictors used have stable correlations between them. Decisions on new data are then made by this entire ensemble of models.

Of all the cross-validation methods I have tried, this one gave the best results on new data. But there are many unresolved problems: how many models should there be? I could train a hundred instead of 12, but is there any point? How to evaluate the trading also matters: you can choose almost anything, including rf or Sharpe, and the best one has to be picked experimentally.
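Something like the following, if I sketch it (Python/scikit-learn on synthetic daily data; the features, the month boundaries, and the majority-vote rule are placeholders of mine, not the actual setup): one model per month, and new data is decided by the whole ensemble.

```python
# Sketch: train one model per calendar month, decide on new data by majority vote
# of the whole ensemble (synthetic data, illustrative only).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
dates = pd.date_range("2015-01-01", "2015-12-31", freq="D")
X = rng.normal(size=(len(dates), 6))
y = (X[:, 0] + rng.normal(scale=2.0, size=len(dates)) > 0).astype(int)

# 12 models, each trained on the data of a single month.
models = []
for month in range(1, 13):
    mask = dates.month == month
    mdl = RandomForestClassifier(n_estimators=200, random_state=month)
    models.append(mdl.fit(X[mask], y[mask]))

# New data: majority vote of the 12 monthly models.
X_new = rng.normal(size=(5, 6))
votes = np.mean([mdl.predict(X_new) for mdl in models], axis=0)
print("ensemble decision:", (votes > 0.5).astype(int))
```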

 
And do you want me to give a hint about what I'm going to cover in detail in my article? Do you want it or not?
 
Dr.Trader:

I do something similar too. Suppose I have a year of data for training. I train 12 models: one on the January data, the second on the February data, the third on March, and so on. I select predictors and model parameters so that any of these models, trained on a small part of the data, trades well over the whole year; this gives me some hope that the predictors used have stable correlations between them. Decisions on new data are then made by this entire ensemble of models.

Of all the cross-validation methods I have tried, this one gave the best results on new data. But there are many unresolved problems: how many models should there be? I could train a hundred instead of 12, but is there any point? How to evaluate the trading also matters: you can choose almost anything, including rf or Sharpe, and the best one has to be picked experimentally.

Answer: 9.
 
Dr.Trader:

I do something similar too. Suppose I have a year of data for training. I train 12 models: one on the January data, the second on the February data, the third on March, and so on. I select predictors and model parameters so that any of these models, trained on a small part of the data, trades well over the whole year; this gives me some hope that the predictors used have stable correlations between them. Decisions on new data are then made by this entire ensemble of models.

Of all the cross-validation methods I have tried, this one gave the best results on new data. But there are many unresolved problems: how many models should there be? I could train a hundred instead of 12, but is there any point? How to evaluate the trading also matters: you can choose almost anything, including rf or Sharpe, and the best one has to be picked experimentally.

It's a fit. By selecting parameters and inputs you can easily obtain models that work over at least 3 years of the test.

I too have a number of models (100) that show good results on data outside of training. We are talking about 10 years... But that is only because the models were selected specifically on that test data (outside of training). In other words, fitting to the test.

Your next step is to evaluate these models, or any selected committee, on an additional delayed sample, preferably each model on its own unique data. Then you will see how quality on the test correlates with quality on a sample the model was not selected on.
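A small sketch of that check (Python, synthetic data; the thirty random-seed candidates stand in for whatever set of models is being selected): compare each model's score on the test used for selection with its score on a later sample the selection never saw. If the rank correlation is near zero, the "good" test results were just fitting the test.

```python
# Sketch: does quality on the selection test carry over to a delayed sample? (illustrative)
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(3000, 10))
y = (X[:, 0] + rng.normal(scale=3.0, size=3000) > 0).astype(int)   # weak, noisy signal

X_train, y_train = X[:1000], y[:1000]
X_test,  y_test  = X[1000:2000], y[1000:2000]   # used to SELECT models
X_late,  y_late  = X[2000:],     y[2000:]       # delayed sample, never used for selection

test_scores, late_scores = [], []
for seed in range(30):                           # 30 candidate models differing only by seed
    mdl = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=seed)
    mdl.fit(X_train, y_train)
    test_scores.append(mdl.score(X_test, y_test))
    late_scores.append(mdl.score(X_late, y_late))

rho, _ = spearmanr(test_scores, late_scores)
print("rank correlation between test and delayed-sample quality:", round(rho, 2))
```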
 
Alexey Burnakov:


One of the discussions: http://stats.stackexchange.com/questions/65128/nested-cross-validation-for-model-selection

Ibid: https://stats.stackexchange.com/questions/103828/use-of-nested-cross-validation

There are links to articles in the discussions.

One interesting article: http://www.andrewng.org/portfolio/preventing-overfitting-of-cross-validation-data/

As you can see from the title, it is about the overtraining that happens at the stage of evaluating models on the validation folds of cross-validation. Accordingly, in addition to cross-validation, we need yet another sample to evaluate the already selected model.

Thank you. It is nice to see that I am not the only one who cares.
 
You people are so boring, especially in the field of new knowledge.