Discussion of article "Advanced resampling and selection of CatBoost models by brute-force method"

 

New article Advanced resampling and selection of CatBoost models by brute-force method has been published:

This article describes one of the possible approaches to data transformation aimed at improving the generalizability of the model, and also discusses sampling and selection of CatBoost models.

A simple random sampling of labels used in the previous article has some disadvantages, such as:

  • Classes can be imbalanced. Suppose the market was mostly growing during the training period, while the general population (the entire history of quotes) contains both rises and falls. In this case, naive sampling will create more buy labels and fewer sell labels. Labels of one class will prevail over the other, so the model will learn to predict buy deals more often than sell deals, which may be invalid for new data.

  • Autocorrelation of features and labels. With random sampling, labels of the same class follow one another, while the features themselves (for example, price increments) change insignificantly. This can be illustrated with a regression model: autocorrelation will be observed in the model residuals, which can lead to an overestimation of the model and to overfitting. This situation is shown below:


Model 1 has autocorrelated residuals, which can be compared to the model overfitting to certain market properties (for example, the volatility of the training data) while other patterns are ignored. Model 2 has residuals with the same variance (on average), which indicates that the model captured more information or found other dependencies (beyond the correlation of neighboring samples).
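The residual autocorrelation described above is easy to check numerically. Below is an illustrative sketch (not from the article) that builds two hypothetical residual series - one with strong lag-1 autocorrelation, one close to white noise - and measures the lag-1 autocorrelation coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=500)

# Model 1: residuals generated as an AR(1) process -> strongly autocorrelated
resid_1 = np.empty(500)
resid_1[0] = noise[0]
for t in range(1, 500):
    resid_1[t] = 0.9 * resid_1[t - 1] + noise[t]

# Model 2: independent residuals -> roughly white noise
resid_2 = rng.normal(size=500)

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a series (mean-centered)."""
    x = x - x.mean()
    return float(np.dot(x[:-1], x[1:]) / np.dot(x, x))

print(lag1_autocorr(resid_1))  # close to 0.9
print(lag1_autocorr(resid_2))  # close to 0
```

A high lag-1 coefficient in the residuals is a warning sign that the model latched onto neighboring-sample correlation rather than a genuine dependency.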

Author: Maxim Dmitrievsky

 
Wasn't there an idea to use EM (Expectation Maximisation) instead of GMM?
 
Stanislav Korotky:
Wasn't there an idea to use EM (Expectation Maximisation) instead of GMM?

and it already works with the EM algorithm, as far as I understand.

There are ideas to use deep neural networks for this, still under study.

 
Maxim Dmitrievsky:

and it's already running the EM algorithm, as far as I can see.

There are ideas to use deep neural networks for this, still under study.

OK. There was also an approach of flipping the original series - that way the classes are balanced automatically.

 
Stanislav Korotky:

OK. There was also an approach of flipping the original series - that way the classes are balanced automatically.

As an option. You can also use oversampling, undersampling and their combinations. But that did not give significant improvements, while GMM did. Moreover, the more clusters, the better. Purely empirical.
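For reference, the oversampling-undersampling combination mentioned here can be sketched without any extra library: draw the same number of samples (with replacement) from each class, so the minority class is oversampled and the majority class is undersampled. The data and class sizes below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical imbalanced label set: 80 "buy" (1) and 20 "sell" (0)
y = np.array([1] * 80 + [0] * 20)
X = rng.normal(size=(100, 3))  # placeholder features

def resample_balanced(X, y, n_per_class, rng):
    """Draw n_per_class indices with replacement from each class:
    oversamples the minority, undersamples the majority."""
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
        for c in np.unique(y)
    ])
    return X[idx], y[idx]

X_bal, y_bal = resample_balanced(X, y, n_per_class=50, rng=rng)
print(np.bincount(y_bal))  # [50 50]
```

The imbalanced-learn library linked below provides ready-made versions of this and more elaborate schemes (SMOTE, Tomek links, etc.).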

Here is a good article about resampling, with examples: https://imbalanced-learn.readthedocs.io/en/stable/index.html.

Kernel density estimation is also worse than GMM. A deep neural network should, in theory, be better than GMM, because GMM does not work well with a large feature space.
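The GMM-based resampling being discussed can be sketched roughly as follows: fit a Gaussian mixture to the joint (features, label) distribution and then sample a new pseudo-dataset from it. The data, the label encoding as an extra column, and the component count are all illustrative assumptions, not the article's exact code:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)

# Hypothetical training set; the label is appended as an extra column
# so the mixture models the joint distribution of features and labels.
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(float)
data = np.column_stack([X, y])

# "The more clusters, the better" per the discussion; 10 components
# is an arbitrary illustrative choice.
gmm = GaussianMixture(n_components=10, random_state=0).fit(data)

# Draw a new, larger pseudo-sample from the fitted mixture and
# recover labels by thresholding the sampled label column.
new_data, _ = gmm.sample(1000)
X_new = new_data[:, :-1]
y_new = (new_data[:, -1] > 0.5).astype(int)
print(X_new.shape)
```

The resampled (X_new, y_new) can then be fed to CatBoost in place of the raw training set.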

 

Interesting article.

I got the feeling that, with this tricky move of random label assignment and pseudo-sample generation, we simply find dependencies from the training period that also turn out to be significant on the test.

What percentage of models fail the test?

It would be interesting to add a third sample: train on the first one, select good results based on the test, and check the selection result on the exam sample.

 
The main questionable point is training on the latest data and testing on older data. This is somewhat analogous to looking into the future: the latest market behavior incorporates something from earlier periods (market participants have memory, after all), but predicting in the opposite direction is harder. I think that if you rerun the algorithm in the canonical way (training on old data, testing on new data, which is closer to reality), the result will not be as good.
 
Stanislav Korotky:
The main questionable point is training on the latest data and testing on older data. This is somewhat analogous to looking into the future: the latest market behavior incorporates something from earlier periods (market participants have memory, after all), but predicting in the opposite direction is harder. I think that if you rerun the algorithm in the canonical way (training on old data, testing on new data, which is closer to reality), the result will not be as good.

As far as I understand, for this method it's only a matter of brute force time.

 
Aleksey Vyazmikin:

As far as I understand it, for this method it is only a matter of brute force time.

I didn't realise that. I could be wrong, but in the settings it is rigidly prescribed to train on the last year and test on the previous years, starting from 2015.

 
Stanislav Korotky:

I didn't realise that. I could be wrong, but in the settings it is rigidly prescribed to train on the last year and test on the previous years, starting from 2015.

That is what the brute force is for: its purpose is to find those patterns in 2020 that held over the entire period since 2015. Theoretically, more brute forcing may be needed, but the goal will be achieved. The other issue is that it is not clear whether it is a pattern or a fit, and without even a hypothetical answer to this question it is difficult to decide whether deploying the trading system on a real account makes sense...

 
Aleksey Vyazmikin:

That is what the brute force is for: its purpose is to find those patterns in 2020 that held over the entire period since 2015. Theoretically, more brute forcing may be needed, but the goal will be achieved. The other issue is that it is not clear whether it is a pattern or a fit, and without even a hypothetical answer to this question it is difficult to decide whether deploying the trading system on a real account makes sense...

It depends on what you consider a pattern. If it is the order of increments tied to time, it is a seasonal pattern in the behaviour of the increments; if without the time binding, it is the same sequence of increments with some tolerance in accuracy.

And it depends on what is considered fitting. If the series are knowingly identical, it is a fit; but the purpose of the test (no matter from which side) is to check the result on non-identical areas.

And training on the recent period is logical, but it amounts to the same thing: if we test in the depth of history, the result should be the same as if we train in the depth of history and test on the recent period.

We are only confirming the hypothesis that there are patterns common to the test and training segments.