Machine learning in trading: theory, models, practice and algo-trading - page 3371

 
Forester #:

You've got this wrong. It looks like you have never looked at the tree-building code... There are no operations on a single row there at all!!! Only on sets of rows (full or batches).

In brief:
A random/full set of rows passed to training is sorted separately by each predictor/column. Different split points on it are checked (middle/percentile/random), statistics are computed for each, and the best split is selected for the whole set of rows, not for one/each row as you suggested.
According to the best split, the set of rows is divided into 2 subsets; then each subset is sorted again and the best split is selected for each of the parts, and so on until a stopping rule is reached (depth, number of examples per leaf, etc.)
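The procedure described above (sort the row set by each column, score candidate split points, keep the best one for the whole set, then recurse on the two halves) can be sketched roughly as follows. This is a simplified illustration in Python using midpoint candidates and Gini impurity, not the ALGLIB implementation:

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    # Evaluate candidate thresholds on every column for the WHOLE
    # set of rows at once; return (column, threshold) with the lowest
    # weighted impurity, or None if no split improves on the parent.
    n, m = X.shape
    best = None
    best_score = gini(y)
    for col in range(m):
        xs = np.sort(X[:, col])                # sort rows by this predictor
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue
            thr = (xs[i] + xs[i - 1]) / 2.0    # midpoint candidate
            left = y[X[:, col] <= thr]
            right = y[X[:, col] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (col, thr), score
    return best

def build_tree(X, y, depth=0, max_depth=3, min_rows=2):
    # Stop on depth, size, or purity; otherwise split and recurse
    # on the two subsets, exactly as described above.
    def leaf():
        vals, counts = np.unique(y, return_counts=True)
        return {"leaf": int(vals[np.argmax(counts)])}
    if depth >= max_depth or len(y) < min_rows or len(np.unique(y)) == 1:
        return leaf()
    split = best_split(X, y)
    if split is None:
        return leaf()
    col, thr = split
    mask = X[:, col] <= thr
    return {"var": col, "point": thr,
            "left": build_tree(X[mask], y[mask], depth + 1, max_depth, min_rows),
            "right": build_tree(X[~mask], y[~mask], depth + 1, max_depth, min_rows)}
```

Note that nothing here operates on a single row: every decision is made from the statistics of the whole subset reaching the node.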

You can see more details in the editor; you have the file:
\MQL5\Include\Math\Alglib\dataanalysis.mqh
The ClassifierSplit() function and the one it is called from.
You will understand it in a couple of hours, and you won't have to talk about searching for predictors row by row.
It will be clearer here; the code is more concise and has comments: https://habr.com/ru/companies/vk/articles/438560/.

1. RegressionTree() class

Writing XGBoost from scratch — Part 1: Decision trees
  • 2019.02.06
  • habr.com
Hi, Habr! After many searches for quality guides on decision trees and ensemble algorithms (boosting, random forest, etc.) with their direct implementation in programming languages, and finding nothing (whoever finds one, write in the comments, maybe I'll learn something new), I decided to make my own guide, which...
 
Forester #:

You've got this wrong. It looks like you have never looked at the tree-building code... There are no operations on a single row there at all!!! Only on sets of rows (full or batches).

In brief:
A random/full set of rows passed to training is sorted separately by each predictor/column. Different split points on it are checked (middle/percentile/random), statistics are computed for each, and the best split is selected for the whole set of rows, not for one/each row as you suggested.
According to the best split, the set of rows is divided into 2 subsets; then each subset is sorted again and the best split is selected for each of the parts, and so on until a stopping rule is reached (depth, number of examples per leaf, etc.)

You can see more details in the editor; you have the file:
\MQL5\Include\Math\Alglib\dataanalysis.mqh
The ClassifierSplit() function and the one it is called from.
You will understand it in a couple of hours, and you won't have to talk about searching for predictors row by row.

You are right about many things.

Let's go back to the beginning: what is a pattern in a random forest?

It is a single tree. Here is an example of one such tree from RF:

    left daughter right daughter split var split point status prediction
1               2              3         2  0.34154125      1          0
2               4              5         2  0.28238475      1          0
3               6              7         4  0.37505155      1          0
4               0              0         0  0.00000000     -1          2
5               8              9         5  0.64235664      1          0
6               0              0         0  0.00000000     -1          2
7              10             11         1  0.45438075      1          0
8              12             13         1  0.46271469      1          0
9              14             15         3  0.25803691      1          0
10             16             17         2  0.51942328      1          0
11             18             19         1  0.48839881      1          0
12             20             21         3  0.45243581      1          0
13              0              0         0  0.00000000     -1          2
14              0              0         0  0.00000000     -1          2
15             22             23         6  0.62789488      1          0
16             24             25         2  0.34224983      1          0
17             26             27         4  0.53845361      1          0
18             28             29         3  0.39207978      1          0
19             30             31         3  0.03596312      1          0
20             32             33         7  0.49380156      1          0
21              0              0         0  0.00000000     -1          2
22              0              0         0  0.00000000     -1          2
23             34             35         6  0.76472904      1          0
24              0              0         0  0.00000000     -1          1
25              0              0         0  0.00000000     -1          2
26             36             37         5  0.87588550      1          0
27             38             39         1  0.31089209      1          0
28             40             41         2  0.39193398      1          0
29             42             43         1  0.47068628      1          0
30             44             45         7  0.76420940      1          0
31             46             47         2  0.38380954      1          0

 [ reached getOption("max.print") -- omitted 185 rows ]

Total rows = 166+185! Not all of them fit in the printout.

There are 150 such trees in my model

split var is the variable number; there are 8 of them in the model.
split point is the best value of that particular variable, which was used for the split.
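A tree printed in this form can be applied to a data row by a simple table walk: start at node 1, compare the row's value of the split variable with the split point, move to the left or right daughter, and stop at a node with status -1 (a terminal node), returning its prediction. A hypothetical Python sketch, with the column meanings assumed from the printout above:

```python
def predict_from_tree(tree_rows, x):
    # tree_rows: dict node_id -> (left daughter, right daughter, split var,
    # split point, status, prediction), mirroring the printed columns.
    # x: dict mapping variable number -> value for one data row.
    node = 1                                   # traversal starts at node 1
    while True:
        left, right, var, point, status, pred = tree_rows[node]
        if status == -1:                       # -1 marks a leaf
            return pred
        # go left if the split variable is <= the split point, else right
        node = left if x[var] <= point else right

# Toy 3-node tree in the same layout as the printout:
toy = {
    1: (2, 3, 2, 0.34154125, 1, 0),   # split on variable 2
    2: (0, 0, 0, 0.0, -1, 2),         # leaf, predicts class 2
    3: (0, 0, 0, 0.0, -1, 1),         # leaf, predicts class 1
}
print(predict_from_tree(toy, {2: 0.30}))  # 0.30 <= 0.3415... -> class 2
```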
 
СанСаныч Фоменко #:

You are right about many things.

Back to the beginning: what is a pattern in a random forest?

It is a single tree. Here is an example of one such tree from RF:

Total rows = 166+185! Not all of them fit in the printout.

There are 150 such trees in my model

Consider again the path that forms a leaf. In my example above there are 5 splits. Isn't that a description of a double-top pattern with a trough? It is.
7 splits can describe a head-and-shoulders, etc.
Each leaf of a single tree describes a different pattern.

The forest is the opinion of a crowd (of trees).
The 1st tree says: this row falls into my 18th pattern/leaf, and the answer = 1.
The 2nd: the same row falls into my 215th pattern/leaf and gives the answer = 0.
3rd: = 1

...

We average and get the average opinion of 150 trees, for example = 0.78. Each had a different activated leaf/pattern.
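That averaging step is trivial to show in code; the votes below are made up, not taken from the model above:

```python
# Made-up answers (0 or 1) of 10 trees for the same input row;
# the forest's answer is simply their mean.
votes = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
avg = sum(votes) / len(votes)
print(avg)                      # 0.7
print(1 if avg > 0.5 else 0)    # majority decision: 1
```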

 
Forester #:

Consider again the path that forms a leaf. In my example above there are 5 splits. Isn't that a description of a double-top pattern with a trough? It is.
7 splits can describe a head-and-shoulders, etc.
Each leaf of a single tree describes a different pattern.

The forest is the opinion of a crowd (of trees).
The 1st tree says: this row falls into my 18th pattern/leaf, and the answer = 1.
The 2nd: the same row falls into my 215th pattern/leaf and gives the answer = 0.
3rd: = 1

...

We average and get the average opinion of 150 trees. Each had a different activated leaf/pattern.

We don't know how many leaves there are.

The number of trees is a parameter that can be changed to obtain the minimum sample size for training.

We see that 50 trees are enough, so it is convenient to consider a tree as a pattern.

 
СанСаныч Фоменко #:

How many leaves is unknown.

The number of trees is a parameter that can be changed to obtain the minimum sample size for training.

We see that 50 trees are enough, so it is convenient to consider a tree as a pattern.

The tree responds to each situation/row with one leaf/pattern. In other situations the response will be from other leaves/patterns.
 
Forester #:
The tree responds to each situation/row with one leaf/pattern. In other situations the response will come from other leaves/patterns.

It seems that not only the leaf, but even the tree doesn't decide anything on its own.

Here I found the formula for the final classifier:

    a(x) = (1/N) * Σ_{i=1}^{N} b_i(x)
Where

  • N - the number of trees;
  • i - the counter over trees;
  • b - a decision tree;
  • x - the sample we generated from the data.

It is also worth noting that for the classification task the decision is taken by majority vote, while for regression it is the mean.

 
СанСаныч Фоменко #:

It seems that not only the leaf, but even the tree doesn't decide anything on its own.

Here is the formula for the final classifier

It is also worth noting that for the classification task the decision is taken by majority vote, while for regression it is the mean.

Why doesn't it decide anything? It contributes (1/150) to the final answer.

From each tree one of the activated leaves/patterns participates in the voting (average).

The answer of the forest is the average of the answers of all trees (or, equivalently, of the activated leaves/patterns); that is what this formula computes. For binary classification the majority vote means: if the average is > 0.5, the answer is 1, otherwise 0.
But the 0.5 threshold is probably not the best option; if the package gives access to the averaged value, you can experiment with different thresholds.
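Sweeping the threshold is easy once the package returns the averaged value (probabilities rather than hard classes). A sketch with made-up forest averages and labels, just to show the mechanics:

```python
import numpy as np

# Made-up forest averages (fraction of trees voting 1) and true labels.
probs = np.array([0.82, 0.44, 0.67, 0.31, 0.55, 0.90, 0.12, 0.60])
labels = np.array([1,    0,    1,    0,    0,    1,    0,    1])

# Accuracy at each candidate threshold; 0.5 need not be the best one.
for thr in (0.3, 0.5, 0.6, 0.7):
    preds = (probs > thr).astype(int)
    acc = (preds == labels).mean()
    print(f"threshold {thr:.1f}: accuracy {acc:.2f}")
```

On real data you would pick the threshold on a validation set, not on the training sample.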

 
Forester #:
The tree responds to each situation/row with one leaf/pattern. In other situations the response will come from other leaves/patterns.
SanSanych Fomenko #:

It seems that not only the leaf, but even the tree doesn't decide anything on its own.

Not just one leaf, but all the trees are responsible for each situation; it's just that not all of them are activated, and the sum of the forecasts of those that are activated is the forecast of the model...


What the hell are you talking about, tree model experts?

 
mytarmailS #:

Not one leaf, but all the trees are responsible for each situation; it's just that not all of them are activated, and the sum of the forecasts of those that are activated is the forecast of the model.


What the hell are you talking about, tree model experts?

Did you say something new? If not, then by your own reckoning, it's rubbish too.
 
mytarmailS #:


Do you have any experience using LightGBM?
