Is there a pattern to the chaos? Let's try to find it! Machine learning on the example of a specific sample. - page 19

 

Profit is equal to the delta of price movement from some point in time.

Nothing surprises me yet.

We have already passed long tails ....

 
Renat Akhtyamov #:

Profit is equal to the delta of price movement from some point in time.

Nothing surprises me yet.

We have already passed long tails ....

Why did you just write that? Well right, you can't use any topic to mark your thoughts in the margins....

 
Aleksey Vyazmikin #:

Figure 13 shows that almost all of the available predictors are used, except for one, but I doubt that this is the root of the problem. So what matters is not so much which predictors are used as the order in which they are used when building the model?

Yes, it is. If you train 2 models with the same predictors, but one has the first split on one predictor and the other on the other, then the whole underlying tree for each variant will be quite different.

The other question is why boosting on the same dataset makes the first splits different. Is the column-sampling fraction != 1, as in a forest? In a forest it is there for randomness. But I think it should be == 1.
Then another option: a different Seed for the models? Try with the same one; if the result is then the same, I think it is very bad that the seed alone can make a profitable model unprofitable.

 
By the way, what does Seed randomise in CatBoost?
 
Aleksey Vyazmikin #:

Why did you just write that? Well right, you can't use any topic to mark your thoughts in the margins....

about your graphs

 
elibrarius #:

Yes, it is. If you train 2 models with the same predictors, but one has the first split according to one predictor and the other according to another, then the whole underlying tree of each variant will be quite different.

Which once again proves that the greedy method of selecting splits is flawed. I experimented with this myself when I was selecting leaves and came to the same conclusion.

elibrarius #:

The question is different: why does boosting on the same dataset make the first splits different? Is the column-sampling fraction != 1, as in a forest? In a forest it is there for randomness. But I think it should be == 1.

As I understand it, there is an analogous setting here for evaluating only part of the columns, but I have it set to always use all of them.

elibrarius #:

Then another option: different Seed for the models? Try with the same one, if the result is the same, I think it is very bad that seed can make a profitable model unprofitable.

Seed fixes the result, i.e. everything will be the same.

elibrarius #:
By the way, what does Seed randomise in CatBoost?

As I understand it, Seed sets the random number generator to a fixed starting value, and this generator is used at least for what they describe as "randomisation of the metric by which the best tree is chosen". The random value is multiplied by a coefficient which, as I understand it, is taken from the --random-strength parameter (it's 1 for me).

Here's the formula they give:

"Score += random_strength * Rand (0, lenofgrad * q)

q is a multiplier that decreases as the iteration increases. Thus, the random decreases near the end."

They also write there that a subsample can be used to build a tree, but I use the mode that applies the full sample, "--boosting-type Plain".
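To make the quoted score formula concrete, here is a minimal Python sketch of how such noise could change which split wins. It is not CatBoost's actual code: reading Rand(0, x) as a zero-mean normal draw with standard deviation x, the linear decay of q, and the function name noisy_best_split are all assumptions made for illustration.

```python
import random


def noisy_best_split(scores, grad_len, iteration, n_iterations,
                     random_strength, seed):
    """Return the index of the winning split after adding noise to each score.

    Assumptions (not taken from CatBoost's source):
    - Rand(0, x) is a normal draw with mean 0 and standard deviation x;
    - q decays linearly from 1 at the first iteration to 0 at the last.
    """
    rng = random.Random(seed)          # the seed fixes the noise sequence
    q = 1.0 - iteration / n_iterations  # hypothetical decay schedule
    sigma = grad_len * q
    noisy = [s + random_strength * rng.gauss(0.0, sigma) for s in scores]
    return max(range(len(noisy)), key=noisy.__getitem__)


# Two nearly equal candidates and one clearly worse candidate.
scores = [1.00, 0.98, 0.50]

# Which split wins for a range of seeds, with and without score noise.
with_noise = {noisy_best_split(scores, 1.0, 0, 100, 1.0, s) for s in range(200)}
without_noise = {noisy_best_split(scores, 1.0, 0, 100, 0.0, s) for s in range(200)}
print(with_noise, without_noise)
```

In this sketch, with random_strength = 0 (or at the last iteration, where q reaches 0) the choice no longer depends on the seed, while with random_strength = 1 and closely scored candidates different seeds can select different splits.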


There is also this effect: if, after training, I remove the columns that the model does not use, I can no longer get the same model with the same Seed, and it is not clear why.

 
Renat Akhtyamov #:

about your graphs

How does "Profit is equal to the delta of price movement from some point in time" apply to these charts?

And as for the phrase "We have already passed long tails ....": should I take it that I am offering you some form of training? I am not. Besides, tails on this forum usually come up when modelling the distribution density of price changes, which is not at all what I have on the histogram. And the point here is not risk, but rather that it is harder to build a model by chance than when you understand the structure of predictor significance and the dependence between predictors.

 
Aleksey Vyazmikin #:

How does "Profit is equal to the delta of price movement from some point in time" apply to these charts?

And as for the phrase "We have already passed long tails ....": should I take it that I am offering you some form of training? I am not. Besides, tails on this forum usually come up when modelling the distribution density of price changes, which is not at all what I have on the histogram. And the point here is not risk, but rather that it is harder to build a model by chance than when you understand the structure of predictor significance and the dependence between predictors.

I was responding to the claim that there is a pattern in chaos.

The pattern is just this kind of histogram: no matter what logic/approach/formula/theory, etc. you apply, you will find no other patterns.

 
Aleksey Vyazmikin #:

Which once again proves that the greedy method of selecting splits is flawed. I experimented with it myself when selecting leaves and came to the same conclusion.

And how would you do it without greediness? For each split you could also calculate the next one and select the pair at once, but in your case the computation time would increase 5000+ times. It's easier to average a hundred models.
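A rough count shows where a slowdown of that order could come from. The sketch below assumes, hypothetically, about 5000 candidate splits and a pair search that scores every second split under every first split; neither figure is stated in the thread beyond the "5000+" estimate, and the function names are made up for illustration.

```python
def greedy_evaluations(n_candidates: int) -> int:
    """Greedy search scores each candidate split once to pick the next split."""
    return n_candidates


def pair_evaluations(n_candidates: int) -> int:
    """Pair search scores each first split, then every second split under it."""
    return n_candidates + n_candidates * n_candidates


def slowdown(n_candidates: int) -> float:
    """How many times more scores the pair search computes than the greedy one."""
    return pair_evaluations(n_candidates) / greedy_evaluations(n_candidates)


N_CANDIDATES = 5000  # assumed number of candidate splits, not from the thread
print(slowdown(N_CANDIDATES))
```

Under these assumptions the extra work grows in proportion to the number of candidate splits, which is why averaging many cheaply built models looks more practical.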

As I understand it, there is an analogous setting here for evaluating only part of the columns, but I have it set to always use all of them.

They also write there that a subsample can be used to build a tree, but I use the mode that applies the full sample, "--boosting-type Plain".

That is the right choice for reducing the influence of randomness. Otherwise you need to average 20-100 models, as in a forest.

Aleksey Vyazmikin #:

As I understand it, Seed sets the random number generator to a fixed starting value, and this generator is used at least for what they describe as "randomisation of the metric by which the best tree is chosen". The random value is multiplied by a coefficient which, as I understand it, is taken from the --random-strength parameter (I have 1).

Here's the formula:

Score += random_strength * Rand (0, lenofgrad * q)

q is a multiplier that decreases as the iteration increases. Thus, the random decreases near the end.

So it turns out that the refining trees may not be the best ones, but randomly worse ones.
Hence the spread in the models from loss-making to profitable.
Judging by the distribution charts, there are more loss-making models, i.e. if we average them, the average result will be unprofitable.



Maybe try random-strength = 0? Hopefully changing the Seed will stop changing the model after that. Perhaps that will give a model with the best refining trees rather than randomly bad ones. If that best model turns out to be loss-making, then searching this data for the one of 10000 random models that happens to be the best is a path to losses on a real account.

Or still average several randomly selected models, as in a forest, because the best one may be overfitted.
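The averaging idea can be illustrated with a small simulation. It assumes each model's prediction is the true value plus independent zero-mean noise, which is a simplification: models trained on the same data are correlated, so the real gain from averaging would be smaller. The function name and all numbers are hypothetical.

```python
import random
import statistics


def averaged_spread(n_models, n_trials=2000, noise_sd=1.0, seed=0):
    """Standard deviation of the mean prediction of n_models noisy models.

    Each simulated model predicts true_value plus independent Gaussian noise
    (an assumption; real models built on one dataset are not independent).
    """
    rng = random.Random(seed)
    true_value = 0.0
    means = []
    for _ in range(n_trials):
        predictions = [true_value + rng.gauss(0.0, noise_sd)
                       for _ in range(n_models)]
        means.append(statistics.fmean(predictions))
    return statistics.pstdev(means)


# Spread of a single model versus an average of 25 models.
print(averaged_spread(1), averaged_spread(25))
```

With independent noise the spread of the average shrinks roughly with the square root of the number of models, so a single "lucky" model matters much less than in a search for the one best seed.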

 
Renat Akhtyamov #:

I was responding to the claim that there is a pattern in chaos.

The pattern is just this kind of histogram: no matter what logic/approach/formula/theory, etc. you apply, you will find no other patterns.

So what do you mean: there is a pattern, but you won't find it? Or is the pattern in the randomness itself?
