Discussion of article "Gradient Boosting (CatBoost) in the development of trading systems. A naive approach"

 

New article "Gradient Boosting (CatBoost) in the development of trading systems. A naive approach" has been published:

In this article, we train a CatBoost classifier in Python and export the model to MQL5, parse the model parameters, and build a custom strategy tester. Python and the MetaTrader 5 library are used to prepare the data and to train the model.

The compiled bot can be tested in the standard MetaTrader 5 Strategy Tester. Select the appropriate timeframe (it must match the one used for model training) and set the look_back and MA_period inputs to the same values as in the Python program. Let us check the model on the training period (training + validation subsamples):

Model performance (training + validation subsamples)

Comparing this result with the one obtained in the custom tester, the two are identical except for small deviations due to the spread. Now let us test the model on completely new data, from the beginning of the year:

Model performance on new data

The model performs worse on new data. This result has objective causes, which I will try to describe below.

Author: Maxim Dmitrievsky

 

There's no need to shuffle here

train_X, test_X, train_y, test_y = train_test_split(X, y, train_size = 0.5, test_size = 0.5, shuffle=True)

According to the help at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

shuffle bool, default=True

Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

The data is shuffled before splitting, so examples from the test period end up in the training set: future bars leak into the training data.
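For time-ordered market data, the leakage can be avoided by splitting chronologically. A minimal sketch, with random stand-ins for the article's feature matrix and labels:

import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data; in the article X and y are built from MetaTrader 5 bars.
X = np.random.rand(1000, 15)
y = np.random.randint(0, 2, size=1000)

# shuffle=False keeps the rows in time order: every test example comes
# after every training example, so no future bars leak into training.
train_X, test_X, train_y, test_y = train_test_split(
    X, y, train_size=0.5, test_size=0.5, shuffle=False)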

In general, I liked the article, it shows that it is quite easy to implement and use AI in trading.

 
Can you produce the last graph from the article, but without shuffling?
I guess validation would deteriorate, while the test on unknown data might improve.
 
elibrarius:

There's no need to shuffle here

According to the help at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

The data is shuffled before splitting, so examples from the test period end up in the training set: future bars leak into the training data.

In general, I liked the article, it shows that it is quite easy to implement and use AI in trading.

I do this on purpose, to even out the samples a bit. Without shuffling, the test result is worse, but it has almost no effect on the new data. I will show examples later.
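One way to check this is to train the model both ways and score each variant on a chronologically later segment. A rough sketch with synthetic stand-in data (in the article, X and y come from MetaTrader 5 bars and the CatBoost settings differ):

import numpy as np
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 15)             # stand-in features
y = np.random.randint(0, 2, size=2000)   # stand-in labels

# Keep the last 20% of rows aside as the "new data" segment.
X_hist, X_new = X[:1600], X[1600:]
y_hist, y_new = y[:1600], y[1600:]

for shuffle in (True, False):
    tr_X, te_X, tr_y, te_y = train_test_split(
        X_hist, y_hist, train_size=0.5, shuffle=shuffle)
    model = CatBoostClassifier(iterations=200, verbose=False)
    model.fit(tr_X, tr_y, eval_set=(te_X, te_y))
    print(f"shuffle={shuffle}: test acc {model.score(te_X, te_y):.3f}, "
          f"new-data acc {model.score(X_new, y_new):.3f}")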

 

That's what I don't get:

if dataset['close'][i] >= (dataset['close'][i + rand]):
    labels.append(1.0)
elif dataset['close'][i] <= (dataset['close'][i + rand]):
    labels.append(0.0)
else:
    labels.append(0.0)

The else branch can never be reached: the >= and <= conditions already cover every possible case.
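A behaviour-preserving cleanup would simply drop the dead branch (a sketch reusing the snippet's own dataset, rand and labels):

# The >= test already captures equality, so the elif/else pair is dead code.
# Ties (equal prices) still go to class 1.0, as in the original order of
# evaluation.
if dataset['close'][i] >= dataset['close'][i + rand]:
    labels.append(1.0)
else:
    labels.append(0.0)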

 
Stanislav Korotky:

That's what I don't get:

The else branch can never be reached: the >= and <= conditions already cover every possible case.

There's nothing wrong here. I have changed the conditions and still get the same artefacts.

 
Maxim, thank you a thousand times for sharing such articles...
Especially for importing the Python model into MQL5.
I am not an ALGLIB expert, but I strongly believe that XGBoost, CatBoost, and PyTorch are far superior for machine and deep learning.
 
Very interesting work! Thanks to the author.
 
I have a question related to the article, not directly but indirectly, through its dependence on CatBoost.

Can someone explain in simple terms how CatBoost (or, more generally, a histogram-based gradient boosting decision tree) splits on a feature (input variable) using the histogram? It is clear that statistics are collected for each bin (histogram bar): the total number of vectors whose value falls into the bin's range and their breakdown by output class (two classes in this case). But given a histogram with these statistics, how is the split chosen for building the next tree level?

 
Stanislav Korotky:
I have a question related to the article, not directly but indirectly, through its dependence on CatBoost.

Can someone explain in simple terms how CatBoost (or, more generally, a histogram-based gradient boosting decision tree) splits on a feature (input variable) using the histogram? It is clear that statistics are collected for each bin (histogram bar): the total number of vectors whose value falls into the bin's range and their breakdown by output class (two classes in this case). But given a histogram with these statistics, how is the split chosen for building the next tree level?

Trees are built independently of each other, and then the leaf values are calculated (by enumeration over the unquantized predictors) in such a way that the gradient error is reduced.

When selecting predictors for tree construction and for the splits, random coefficients are used, which in theory makes it possible to increase recall and prevent overfitting.
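For a generic histogram-based GBDT (a textbook sketch, not necessarily CatBoost's exact internals), reducing the gradient error means scanning the per-bin gradient sums cumulatively and taking the bin boundary with the best gain:

import numpy as np

def best_histogram_split(grad_per_bin, hess_per_bin, lam=1.0):
    """Generic histogram split: maximise
    G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam),
    where G and H are sums of per-example gradients and hessians."""
    G, H = grad_per_bin.sum(), hess_per_bin.sum()
    G_L = np.cumsum(grad_per_bin)[:-1]   # left-child sums per candidate cut
    H_L = np.cumsum(hess_per_bin)[:-1]
    gain = (G_L**2 / (H_L + lam)
            + (G - G_L)**2 / (H - H_L + lam)
            - G**2 / (H + lam))
    best = int(np.argmax(gain))
    return best, gain[best]              # cut between bin best and best + 1

# Toy statistics for 8 bins of one feature.
g = np.array([3.0, 2.5, 1.0, 0.2, -0.5, -1.5, -2.0, -2.7])
h = np.full(8, 10.0)
cut, gain = best_histogram_split(g, h)
print(f"best cut after bin {cut}, gain {gain:.3f}")

The randomness mentioned above appears to correspond to CatBoost's random_strength parameter, which adds noise to the split scores; the CatBoost documentation is the authoritative reference here.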

 
Stanislav Korotky:
I have a question related to the article, not directly but indirectly, through its dependence on CatBoost.

Can someone explain in simple terms how CatBoost (or, more generally, a histogram-based gradient boosting decision tree) splits on a feature (input variable) using the histogram? It is clear that statistics are collected for each bin (histogram bar): the total number of vectors whose value falls into the bin's range and their breakdown by output class (two classes in this case). But given a histogram with these statistics, how is the split chosen for building the next tree level?

It is better to ask the developers.