Machine learning in trading: theory, models, practice and algo-trading - page 1277

 
elibrarius:

I understand this method differently.
For the predictor under study, you shouldn't feed in random values from a normal distribution, but simply shuffle the rows of that column.

In general - the results from the article are impressive. It is necessary to try it in practice.

shuffle it or not, what difference does it make?
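
For reference, the shuffling approach fits in a few lines; a minimal Python sketch assuming a fitted sklearn-style classifier and numpy arrays (function and argument names are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def column_shuffle_importance(model, X, y, n_repeats=5, seed=0):
    """Permutation importance: how much accuracy drops when one column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle only this column
            drops.append(baseline - accuracy_score(y, model.predict(X_perm)))
        scores[j] = np.mean(drops)
    return scores  # larger drop = more important predictor
```

The larger the drop in accuracy when a column is shuffled, the more the model relied on that predictor.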

 
elibrarius:
Alglib goes through all of the remaining 50% of predictors, divides each of them into 4 parts by quartiles, and out of all the variants chooses the split with the best error.

In principle, random splitting is not hard to add.
I haven't seen individual trees with good results on the test (45-50%), but a forest of them is more interesting).
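
A rough sketch of that quartile-split search - not Alglib's actual code, just the behavior described above, in Python with illustrative names:

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2.0 * p * (1.0 - p)

def best_quartile_split(X, y, candidate_features):
    """Try the 25/50/75% quantiles of each candidate feature as thresholds
    and return the (feature, threshold) pair with the lowest weighted impurity."""
    best = (None, None, np.inf)
    for j in candidate_features:
        for q in (0.25, 0.5, 0.75):
            thr = np.quantile(X[:, j], q)
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, thr, score)
    return best
```

A random-split variant would simply draw the threshold at random instead of trying only the quartile boundaries.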


I see, that's what I thought: there is a high probability that the root split in most of the trees will be the same, which by itself discards other options.

I assume that any leaf is just an attempt to describe a pattern, and we can't know in advance whether that description is correct or a random coincidence in the sample. That's why I collect leaves that are distinct and unique (not repeated) and check them separately, not the whole tree.

Alglib has excessive branching, so it's not learning but memorization. I think a forest is a good idea, but it will work correctly only if each tree contains unique rules (leaves) and the number of splits is not very large - 4-10.


elibrarius:
On predictor importance, I looked at the following:
The xgboost and lightGBM packages have built-in methods for estimating feature importance in tree-based models:

  1. Gain
    This measure shows the relative contribution of each feature to the model. To calculate it, we go through every tree, look at each node to see which feature caused the split and by how much the model's uncertainty was reduced according to the chosen metric (Gini impurity, information gain).
    Each feature's contribution is then summed over all trees.
  2. Cover
    Shows the number of observations covered by each feature. For example, you have 4 features and 3 trees. Suppose feature 1 covers 10, 5 and 2 observations in trees 1, 2 and 3 respectively. Then the importance of this feature is 17 (10 + 5 + 2).
  3. Frequency
    Shows how often a given feature occurs in tree nodes, i.e. the total number of split nodes that use the feature is counted over all trees.
They don't really distribute the importance correctly, though.
I have a forest trained on 5 bars that gives better results on the test than one trained on 100. But when trained on 100, the first 5 bars are not marked as important; some distant ones are instead.
When trained on 100 bars, the error of the individual trees and of the forest is lower - obviously at the expense of overfitting and of assigning importance to bars 30-100. But those are obviously not important, not just by common-sense logic, but by the fact that the forest trained on 5 bars gives better results.
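
For reference, all three measures above can be queried from a trained xgboost booster; a minimal sketch on toy data (parameter values are illustrative):

```python
import numpy as np
import xgboost as xgb

# toy data just to make the example self-contained
X = np.random.rand(500, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
dtrain = xgb.DMatrix(X, label=y, feature_names=["f0", "f1", "f2", "f3"])

booster = xgb.train({"objective": "binary:logistic", "max_depth": 3},
                    dtrain, num_boost_round=50)

# 'total_gain'  - roughly the "Gain" measure above (loss reduction summed over all trees)
# 'total_cover' - roughly the "Cover" measure above (observations summed over all trees)
# 'weight'      - roughly the "Frequency" measure above (number of splits using the feature)
for imp_type in ("total_gain", "total_cover", "weight"):
    print(imp_type, booster.get_score(importance_type=imp_type))
```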

Yes, the standard approaches to importance estimation are not very effective. As a score, I want to try some kind of uniqueness indicator: when the leaves are already built, we try replacing each predictor in turn with any other one (taking the grid split into account), collect statistics, compare the best replacement variant with the default variant by accuracy or some other indicator (the concept is what matters), and in this way accumulate points for each predictor over the entire model.

elibrarius:
After a cursory look at the code, I saw genetic selection of features for building a tree from the rpart package. That is, each tree is offered a different set of features for training. Thanks to the genetics, this feature selection is faster than a complete brute-force search.
But the tree itself is not some magic tree, just the one provided by rpart. I think it's standard there.

The tree itself is completely standard; the original idea of the script is to find the most significant predictors, and the genetics seems to help with that.

What I don't understand is how you can replace entropy with some other indicator (accuracy, recall, or something else) when creating a new generation.

 

Not long ago I was watching a lecture on machine learning, and it described a situation where a model works in a narrow range of probabilities. For boosting models this is considered almost the norm, because the model does not output a probability in its pure form, and because of this there is the concept of calibrating such a model so that its predictions can be interpreted correctly. I had exactly that situation last year, when the models gave results in the range of 40 to 60, and I was assured that this was a very bad sign... and I had my doubts, because the models were stable and gave good financial results.
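
The calibration mentioned there is usually just a second, small model fitted on held-out predictions; a hedged sketch with sklearn's CalibratedClassifierCV on toy data (the classifier and the data are stand-ins, not what the lecture used):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# toy data, a stand-in for real features/labels
rng = np.random.default_rng(0)
X = rng.random((2000, 10))
y = (X[:, 0] + 0.3 * rng.standard_normal(2000) > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

raw = GradientBoostingClassifier().fit(X_train, y_train)

# Platt scaling ('sigmoid') fits a small logistic model on held-out folds so the
# narrow raw scores can be read as probabilities; 'isotonic' is the other common choice
calibrated = CalibratedClassifierCV(GradientBoostingClassifier(), method="sigmoid", cv=3)
calibrated.fit(X_train, y_train)

p_raw = raw.predict_proba(X_test)[:, 1]
p_cal = calibrated.predict_proba(X_test)[:, 1]
print("raw score range:        %.2f .. %.2f" % (p_raw.min(), p_raw.max()))
print("calibrated score range: %.2f .. %.2f" % (p_cal.min(), p_cal.max()))
```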

 
Aleksey Vyazmikin:

Not long ago I was watching a lecture on machine learning, and it described a situation where a model works in a narrow range of probabilities. For boosting models this is considered almost the norm, because the model does not output a probability in its pure form, and because of this there is the concept of calibrating such a model so that its predictions can be interpreted correctly. I had exactly that situation last year, when the models gave results in the range of 40 to 60, and I was assured that this was a very bad sign... and I had my doubts, because the models were stable and gave good financial results.

Alexei, let's say the error probability is 99% - is it good or bad?

I understand that the remaining one percent is the probability of success.

Not much, but it's cool, because we already know where the mistake is and how to avoid it.

That said, the ratio is 99 to 1.
 
Renat Akhtyamov:

Alexei, let's say the error probability is 99% - is that good or bad?

My understanding is that the remaining one percent is the probability of success.

It's small, but it's cool because we already know where the mistake is and how to avoid it.

Such a high probability of error tells us that we don't know much about what's going on.

It's accurate enough and that's good, but it's a long way from success - 1% could just be a fluke.

And that's if we're talking specifically about probability.

 
Aleksey Vyazmikin:

I see, that's what I thought: there is a high probability that the root split in most of the trees will be the same, which by itself discards other options.

About 50%. But this parameter can be changed to any desired value in the other forest-building calls.

Aleksey Vyazmikin:
As a score, I want to try some kind of uniqueness indicator: when the leaves are already built, we try replacing each predictor in turn with any other one (taking the grid split into account), collect statistics, compare the best replacement variant with the default variant by accuracy or some other indicator (the concept is what matters), and in this way accumulate points for each predictor over the entire model.

Something similar to the permutation that Maxim found. But is there any point in substituting a predictor that varies from 0.1 to 0.2 with one that varies from 800 to 300000? No!
But shuffling its rows does make sense. The range of values and the probability distribution are preserved, while the value in each individual example becomes random.

Aleksey Vyazmikin:

What I don't understand is how you can replace entropy with some other indicator (accuracy, recall, or something else) when creating a new generation.

Some R packages allow you to use your own error function. Xgboost can, but there you have to work out a formula for the derivative of your function and feed it in along with it. For me, taking derivatives is a problem. Look at the description of the rpart package; maybe you can plug in your own functions there too, perhaps even without the derivative.
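
For what it's worth, the xgboost side looks roughly like this: the custom objective has to return the gradient and the hessian of the loss with respect to the raw prediction, which is exactly where the derivative problem comes in. A sketch on toy data, with log loss written out by hand (names are illustrative):

```python
import numpy as np
import xgboost as xgb

# toy data to keep the sketch self-contained
X = np.random.rand(500, 4)
y = (X[:, 0] > 0.5).astype(float)
dtrain = xgb.DMatrix(X, label=y)

def logloss_obj(preds, dtrain):
    """Custom objective: xgboost asks for the gradient and hessian of the loss
    with respect to the raw (margin) prediction."""
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the raw score
    grad = p - labels                  # first derivative of log loss
    hess = p * (1.0 - p)               # second derivative
    return grad, hess

booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=50, obj=logloss_obj)
```

A metric like accuracy or recall has no useful derivative, which is why it cannot be plugged in directly as the training objective.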

 
elibrarius:

Something similar to the permutation that Maxim found. But is there any point in substituting a predictor that varies from 0.1 to 0.2 with one that varies from 800 to 300000? No!
But shuffling its rows does make sense. The range of values and the probability distribution are preserved, while the value in each individual example becomes random.

I wrote,"let's assumea n. distribution." Find the mean and variance and go ahead. It's better to randomize by Noise than by shuffling.

There are a lot of fools here who like to twist words and screenshot them, trying to assert themselves with it later

 
Maxim Dmitrievsky:

I wrote,"Supposen. ras. Naturally it makes sense with normalized traits. Find the mean and variance and go ahead.

There are a lot of fools here who like to twist words and screenshot them, trying to assert themselves on this later
Normalization will help with the range - that's right.
But a normal distribution puts its probability mass in the center (around 0.5), while the real predictor may be shifted to one side, say around 0.8, or have a saddle of some sort around 0.2 and 0.8, or something else...
Shuffling preserves the distribution as well.
 
elibrarius:
Normalization will help with the range - that's right.
But a normal distribution puts its probability mass in the center (around 0.5), while the real predictor may be shifted to one side, say around 0.8, or have a saddle of some sort around 0.2 and 0.8, or something else...
Shuffling preserves the distribution as well.

Take the average and variance, lol, and don't worry about it.
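
Both variants fit in a few lines; a minimal sketch (function name and arguments are mine):

```python
import numpy as np

def randomize_column(X, j, method="shuffle", seed=0):
    """Two ways to destroy the information in column j:
    'shuffle' keeps the exact empirical distribution of the column,
    'noise' replaces it with N(mean, std) of that same column."""
    rng = np.random.default_rng(seed)
    X_out = X.copy()
    if method == "shuffle":
        X_out[:, j] = rng.permutation(X_out[:, j])
    else:  # 'noise'
        X_out[:, j] = rng.normal(X[:, j].mean(), X[:, j].std(), size=X.shape[0])
    return X_out
```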

 
Maxim Dmitrievsky:

Take the average and the variance, lol, and don't worry about it.

shuffling is simpler)

And thanks for the link to the interesting method (permutation)!
