Discussion of article "Advanced resampling and selection of CatBoost models by brute-force method"
Wasn't there an idea to use EM (Expectation Maximisation) instead of GMM?
And it already works with the EM algorithm, as far as I understand.
There are ideas to use deep neural networks for this, still under study.
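For reference, this is not really an either/or choice: a GMM is typically fitted with the EM algorithm, e.g. in scikit-learn's GaussianMixture, which is presumably the kind of implementation meant here. A minimal sketch of drawing pseudo-samples from a fitted mixture; the data, the number of components and the parameters are illustrative assumptions, not the article's settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # placeholder feature matrix

# more mixture components tended to help empirically, per the reply below
gmm = GaussianMixture(n_components=30, covariance_type='full', random_state=42)
gmm.fit(X)  # fitting runs the EM algorithm internally

X_new, _ = gmm.sample(n_samples=5000)  # draw pseudo-samples from the mixture
print(X_new.shape)  # (5000, 5)
```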
OK. An approach with flipping the original series was also in the works - that way the classes are balanced automatically.
As an option, you can use oversampling/undersampling and their combinations. But this did not give significant improvements, while GMM did. Moreover, the more clusters, the better - purely empirical.
Here is a good article about resampling, with examples: https://imbalanced-learn.readthedocs.io/en/stable/index.html
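A small sketch of the oversampling/undersampling combination mentioned above, using the imbalanced-learn library from that link; the synthetic dataset and the sampling ratios are assumptions for illustration:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# synthetic imbalanced binary labels, purely illustrative
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # roughly 9:1

# oversample the minority class to half the majority, then undersample
# the majority down to parity: one possible combination of the two
X_over, y_over = SMOTE(sampling_strategy=0.5, random_state=0).fit_resample(X, y)
X_res, y_res = RandomUnderSampler(sampling_strategy=1.0,
                                  random_state=0).fit_resample(X_over, y_over)
print(Counter(y_res))  # balanced classes
```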
Kernel density estimation is also worse than GMM. A deep neural network should be better than GMM in theory, because GMM does not work well with a large feature space.
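For comparison, the kernel density estimation alternative can also be sketched with scikit-learn; the bandwidth and the placeholder data are arbitrary assumptions, and both KDE and GMM degrade as the feature space grows:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))  # placeholder features

# only the 'gaussian' and 'tophat' kernels support sampling in scikit-learn;
# the bandwidth is an arbitrary assumption
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(X)
X_new = kde.sample(n_samples=5000, random_state=0)
print(X_new.shape)  # (5000, 5)
```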
Interesting article.
I got the feeling that with this tricky move of random label assignment and pseudo-sample generation, we simply find the dependencies from the training period that also turn out to be significant on the test.
What percentage of models fail the test?
It would be interesting to add a third sample: train on the first one, select good results using the test, and check the selection result on the exam sample.
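A hedged sketch of such a three-sample scheme; the chronological ordering and split fractions are assumptions for illustration. Models would be selected by their test score, and the exam score reported only once, without further tuning:

```python
import numpy as np

def three_way_split(X, y, train_frac=0.6, test_frac=0.2):
    """Chronological split into train / test (selection) / exam (final check)."""
    n = len(X)
    i1 = int(n * train_frac)
    i2 = int(n * (train_frac + test_frac))
    return (X[:i1], y[:i1]), (X[i1:i2], y[i1:i2]), (X[i2:], y[i2:])

# usage with a placeholder dataset
X = np.arange(1000).reshape(-1, 1)
y = np.zeros(1000)
train, test, exam = three_way_split(X, y)
print(len(train[0]), len(test[0]), len(exam[0]))  # 600 200 200
```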
The main questionable point is training on the latest data and testing on older data. This is somewhat analogous to looking into the future: the latest models incorporate something from the earlier ones (market participants have memory, after all), while in the opposite direction the future is harder to predict. I think that if you rerun the algorithm in the canonical way (training on old data, testing on new data, which is closer to reality), the result will not be as good.
As far as I understand, for this method it's only a matter of brute force time.
I didn't realise that. I could be wrong, but in the settings it is hard-coded to train on the last year and test on the previous years, starting from 2015.
So there is a brute force whose purpose is to find those patterns in 2020 which were in effect for the entire period since 2015. Theoretically it may be necessary to brute-force more, but the goal will be achieved. Another thing is that it is not clear whether it is a pattern or a fitting, and without even a hypothetical answer to this question it is difficult to decide whether it makes sense to put the trading system on a real account...
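To make the two split orientations being debated concrete, a small sketch with a synthetic daily series; the dates mirror the 2015-2020 setup discussed above, and the data itself is a placeholder:

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2015-01-01', '2020-12-31', freq='D')
rng = np.random.default_rng(0)
df = pd.DataFrame({'ret': rng.normal(size=len(idx))}, index=idx)

recent = df.loc['2020':]   # the article's training period (the last year)
history = df.loc[:'2019']  # the article's test period (2015-2019)

# article's orientation:  fit on `recent`, validate on `history`
# canonical orientation:  fit on `history`, validate on `recent`
print(len(history), len(recent))
```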
It depends on what you consider a regularity: if it is the order of increments tied to time, it is a seasonal regularity in the behaviour of the increments; if without that binding, then it is the same sequence of increments with some freedom in accuracy.
And it depends on what is considered fitting. If the series are known to be identical, then it is a fitting; but the purpose of the test (no matter from which side) is to check the result on non-identical areas.
And training on the recent period is logical, but it works both ways: if we train on the recent period and test deep in history, the result should be the same as if we trained deep in history and tested on the recent period.
We are only confirming the hypothesis that the same regularities are present in both the training and the test sections.
New article Advanced resampling and selection of CatBoost models by brute-force method has been published:
This article describes one of the possible approaches to data transformation aimed at improving the generalizability of the model, and also discusses sampling and selection of CatBoost models.
A simple random sampling of labels used in the previous article has some disadvantages, such as:
Model 1 has autocorrelation of residuals, which can be compared to model overfitting on certain market properties (for example, related to the volatility of training data), while other patterns are not taken into account. Model 2 has residuals with the same variance (on average), which indicates that the model covered more information or other dependencies were found (in addition to the correlation of neighboring samples).
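As a minimal illustration of the diagnostic described above, the lag-1 autocorrelation of residuals can be estimated as follows; the residuals here are placeholders, with a real model they would be the prediction errors y_true - y_pred:

```python
import numpy as np

rng = np.random.default_rng(1)
residuals = rng.normal(size=500)  # placeholder; use real prediction errors

def lag1_autocorr(r):
    """Lag-1 autocorrelation; near 0 for well-behaved residuals."""
    r = r - r.mean()
    return float(np.dot(r[:-1], r[1:]) / np.dot(r, r))

# a value far from 0 suggests the model latched onto serial structure
# (e.g. volatility clustering), as described for Model 1 above
print(lag1_autocorr(residuals))
```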
Author: Maxim Dmitrievsky