Machine learning in trading: theory, models, practice and algo-trading - page 2804

 
Aleksey Vyazmikin #:

I'm doing it now, including for a forum thread, to see if it makes sense for that sample.

It doesn't

 
mytarmailS #:

There's no point

You think that sample is hopeless?

 
Aleksey Vyazmikin #:

CatBoost randomly selects some of the predictors at each split or at each tree-building iteration (it depends on the settings), which means the information carried by strongly correlated predictors has a better chance of landing in that random subset - not the individual predictors themselves, but the information they share.

Yeah, and the creators of boosting don't know that...

They also don't know that it's possible to filter features by correlation))) how would they know, the method is only 50 years old))))

Do you really believe that you know more than they do?

Aleksey Vyazmikin #:

Do you think that sample is hopeless?

Sure... Boosting takes it all into account.

And don't address me with the formal "you", I'm probably younger than you.)

 
Aleksey Vyazmikin #:

You think that sample is hopeless?

https://datascience.stackexchange.com/questions/12554/does-xgboost-handle-multicollinearity-by-itself


Decision trees are by nature immune to multicollinearity. For example, if you have 2 features which are 99% correlated, when deciding upon a split the tree will choose only one of them. Other models, such as logistic regression, would use both features.

Since boosted trees use individual decision trees, they also are unaffected by multicollinearity.

========

You can use this approach: evaluate the importance of each feature and keep only the best features for your final model.


Which is actually what I was telling you earlier

Does XGBoost handle multicollinearity by itself?
  • 2016.07.02
  • datascience.stackexchange.com
I'm currently using XGBoost on a data-set with 21 features (selected from list of some 150 features), then one-hot coded them to obtain ~98 features. A few of these 98 features are somewhat redundant, for example: a variable (feature) $A$ also appears as $\frac{B}{A}$ and $\frac{C}{A}$. My questions are : From what I understand, the model is...
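
A minimal sketch (not from the thread) of the claim quoted above, using scikit-learn on synthetic data: a single decision tree concentrates its importance on one of two ~99%-correlated features, while logistic regression spreads weight over both. The data, feature layout and thresholds are assumptions made purely for illustration.

# Sketch, not from the thread: tree vs. logistic regression on two near-duplicate features.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)          # ~99% correlated copy of x1
noise = rng.normal(size=n)
X = np.column_stack([x1, x2, noise])
y = (x1 + 0.1 * noise > 0).astype(int)           # target driven by the shared signal

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
logit = LogisticRegression().fit(X, y)

print("corr(x1, x2):", round(np.corrcoef(x1, x2)[0, 1], 3))
print("tree importances :", tree.feature_importances_)   # almost all mass on one of x1/x2
print("logit coefficients:", logit.coef_[0])              # weight spread over both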
 
mytarmailS #:

Yeah, and the creators of boosting don't know that...

They also don't know that it's possible to filter features by correlation)) how could they know, the method is only 50 years old)))

Do you really believe you know more than they do?

Sure... Boosting takes it all into account.

And don't address me with the formal "you", I'm probably younger than you.)

I analyse the results of the models and I see that they grab highly correlated predictors - for example, time-based predictors, even when they differ only by a small lag.
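
A rough sketch of how such a check could look, assuming a fitted CatBoost model (`model`) and the pandas DataFrame of predictors it was trained on (`X`) - both names are assumptions for illustration, not code from the thread:

import numpy as np
import pandas as pd

def correlated_top_features(model, X: pd.DataFrame, top_n: int = 20, threshold: float = 0.9) -> pd.Series:
    # Importance of each predictor as reported by the fitted CatBoost model.
    importance = pd.Series(model.get_feature_importance(), index=model.feature_names_)
    top = importance.sort_values(ascending=False).head(top_n).index
    # Pairwise |correlation| among the top-importance predictors, upper triangle only.
    corr = X[top].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().sort_values(ascending=False)
    return pairs[pairs > threshold]      # e.g. lagged copies of the same time feature

# usage (hypothetical objects): print(correlated_top_features(model, X))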

I think they know all of this perfectly well, but they are not obliged to spell out truisms that are decades old....

As for the formal and informal "you" - I think it's better to let everyone address their interlocutor however suits them, as long as it carries no offence and doesn't hinder constructive dialogue.


mytarmailS #:

https://datascience.stackexchange.com/questions/12554/does-xgboost-handle-multicollinearity-by-itself


Decision trees are by nature immune to multicollinearity. For example, if you have 2 features which are 99% correlated, when deciding upon a split the tree will choose only one of them. Other models, such as logistic regression, would use both features.

Since boosted trees use individual decision trees, they also are unaffected by multicollinearity.

========

You can use this approach: evaluate the importance of each feature and keep only the best features for your final model.


Which is actually what I was telling you earlier

That's just it - yes, it will choose one, but how many times will that choice be repeated across splits and trees....

Besides, CatBoost has some differences from XGBoost, and the results differ from sample to sample; on average CatBoost is faster and even better, but not always.

 

Plus I have my own method of grouping similar predictors and selecting the best option from each group, and I need a control group in the form of correlation-based filtering...
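
For reference, a generic sketch of grouping predictors by pairwise correlation and keeping one representative per group - this only illustrates the correlation-based control, it is not the grouping method mentioned above; the clustering choice and threshold are assumptions:

import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def correlation_groups(X: pd.DataFrame, threshold: float = 0.9) -> list:
    # Distance = 1 - |correlation|; predictors closer than (1 - threshold) land in one group.
    corr = X.corr().abs()
    dist = 1.0 - corr.values
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(z, t=1.0 - threshold, criterion="distance")
    groups = {}
    for name, lab in zip(corr.columns, labels):
        groups.setdefault(lab, []).append(name)
    return list(groups.values())

# Keep e.g. the first predictor of each group as its representative:
# representatives = [g[0] for g in correlation_groups(X, threshold=0.9)]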

 
The script is still running - I guess I'll have to leave it overnight....
 
Aleksey Vyazmikin #:

CatBoost randomly selects some of the predictors at each split or at each tree-building iteration (it depends on the settings), which means the information carried by strongly correlated predictors has a better chance of landing in that random subset - not the individual predictors themselves, but the information they share.

Are you sure it picks predictors at random? I haven't used CatBoost, but I've looked at the code of basic boosting examples. All the predictors are used there, i.e. the best one is taken. The correlated one will be right next to it, just slightly worse. But at other split levels, or in the correction trees, another of the correlated predictors may turn out to be better.
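
For what it's worth, CatBoost does expose a feature-subsampling setting, rsm (the fraction of features considered when selecting each split; by default all features are used). A back-of-the-envelope sketch of the argument quoted above, under the simplifying assumption that features are drawn independently with probability r:

# If only a fraction r of features is considered per split, k near-duplicate predictors
# carrying the same information are "available" far more often than a single one.
def p_information_available(r: float, k: int) -> float:
    """Probability that at least one of k equivalent predictors is in the random subset."""
    return 1.0 - (1.0 - r) ** k

for k in (1, 2, 5, 10):
    print(f"k={k}: {p_information_available(0.3, k):.3f}")
# k=1: 0.300, k=2: 0.510, k=5: 0.832, k=10: 0.972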

 
Aleksey Vyazmikin #:

Plus I have my own method of grouping similar predictors and selecting the best option from each group, and I need a control group in the form of correlation-based filtering...

So throw me a couple of informative formulas to try out.
 
https://habr.com/ru/post/695276/ may be useful/interesting to some people
Хитрые методики сэмплинга данных (Tricky data sampling techniques)
  • 2022.10.27
  • habr.com
Anyone who has ever trained neural networks knows that it is customary to shuffle the dataset at every epoch so that the order of the batches does not repeat. Why do that? The usual explanation is that shuffling improves the networks' generalisation, makes the gradient estimate on the batches more accurate and reduces the chance of SGD getting stuck in local minima. Here you can see...
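
A minimal illustration of the practice the article describes - reshuffling the training set on every epoch so the batch order never repeats; the array shapes and batch size are arbitrary placeholders:

import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((1000, 10))          # placeholder features
y = rng.integers(0, 2, size=1000)            # placeholder labels
batch_size = 64

for epoch in range(3):
    order = rng.permutation(len(X))          # new random batch order every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # ... one training step on (xb, yb) would go here ...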