Machine learning in trading: theory, models, practice and algo-trading - page 3311

 
Aleksey Vyazmikin #:

Who has tried the "Compactness Profile" method?

The goal of the method is to eliminate inconsistent examples from the sample, which should improve learning and reduce the model size when k-nearest-neighbour methods are used.

I couldn't find an implementation in Python...

The work is experimental. Here is a quote from http://www.ccas.ru/frc/papers/students/VoronKoloskov05mmro.pdf

The work was performed within the framework of RFBR projects 05-01-00877, 05-07-90410 and the OMN RAS programme.

It is unlikely that a package was created for every experiment.

And besides, the experiment is artificial. Noise was added to a data set that is cleanly separated by classes, and the clean separation is on only 1 feature, the Y axis. If we remove the noise (all data from 0.2 to 0.8), we are left only with examples whose distance to the other class is at least 0.6. I'm talking about the hardest, 3rd variant in the picture:


Now go to real life and add your 5000 noise predictors to this single working feature. In clustering you calculate the total distance between points in this 5001-dimensional space; the working 0.6 separation will never be found in that chaos.

I think any classifier will handle this better; the same tree will find this single feature and split on it, first at 0.5, and then it will reach the splits at 0.2 and 0.8 followed by leaves with 100% purity.
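That claim is easy to check with a toy script (my own sketch, not the paper's exact setup): one informative feature that splits the classes at 0.5, plus a growing number of uniform noise columns. k-NN accuracy sinks as noise dimensions are added, while a shallow tree keeps finding the single working feature.

```python
# Toy check of the argument above (not the paper's exact experiment):
# one informative feature plus many noise features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
signal = rng.uniform(0, 1, size=n)
y = (signal > 0.5).astype(int)          # the class is defined by this one feature only

for n_noise in (0, 10, 100, 1000):
    noise = rng.uniform(0, 1, size=(n, n_noise))
    X = np.column_stack([signal, noise]) if n_noise else signal.reshape(-1, 1)
    knn = cross_val_score(KNeighborsClassifier(5), X, y, cv=5).mean()
    tree = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5).mean()
    print(f"noise dims={n_noise:4d}  kNN={knn:.2f}  tree={tree:.2f}")
```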

 
Aleksey Vyazmikin #:

Who has tried the "Compactness Profile" method?

The goal of the method is to eliminate inconsistent examples from the sample, which should improve learning and reduce the model size when k-nearest-neighbour methods are used.

I couldn't find an implementation in Python...

One of Vladimir Perervenko's articles described such a method, and of course there was an example with code.
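For what it's worth, the idea is simple enough to sketch by hand. Below is a minimal Python sketch of my own reading of it (not the code from that article and not the reference implementation from the paper): compute the profile Π(j), the share of objects whose j-th nearest neighbour belongs to a different class, and drop the objects whose neighbourhood mostly disagrees with their label. The function names and the k / min_agreement thresholds are mine.

```python
# A sketch of a "compactness profile" style filter. My own reading of the idea,
# not the reference implementation from the paper or the article.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def compactness_profile(X, y, max_j=20):
    """Pi(j): share of objects whose j-th nearest neighbour has a different class."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=max_j + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is the object itself
    neigh_labels = y[idx[:, 1:]]           # labels of the 1st..max_j-th neighbours
    return (neigh_labels != y[:, None]).mean(axis=0)

def filter_inconsistent(X, y, k=5, min_agreement=0.6):
    """Drop objects whose k nearest neighbours mostly disagree with their label."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    agreement = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)
    keep = agreement >= min_agreement
    return X[keep], y[keep]
```

Filtering like this before fitting a k-NN classifier is what the method promises: fewer stored examples and less label noise near the class boundary.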
 
Forester #:

The work is experimental. Here's a quote from http://www.ccas.ru/frc/papers/students/VoronKoloskov05mmro.pdf

It is unlikely that a package was created for every experiment.

And besides, the experiment is artificial. Noise was added to a data set that is cleanly separated by classes, and the clean separation is on only 1 feature, the Y axis. If we remove the noise (all data from 0.2 to 0.8), we are left only with examples whose distance to the other class is at least 0.6. I'm talking about the hardest, 3rd variant in the picture:


Now go to real life and add your 5000 noise predictors to this single working feature. In clustering you calculate the total distance between points in this 5001-dimensional space; the working 0.6 separation will never be found in that chaos.

I think any classifier will handle this better; the same tree will find this single feature and split on it, first at 0.5, and then it will reach the splits at 0.2 and 0.8 followed by leaves with 100% purity.

It never will. No ML algorithm will find it. Garbage has to be removed BEFORE training the model. "Garbage in, garbage out" is the law of statistics.

 
СанСаныч Фоменко #:

It never will. No ML algorithm will find it. Garbage has to be removed BEFORE training the model. "Garbage in, garbage out" is the law of statistics.

I'm talking about the specific artificial example on which the experiments were conducted. It is not a case of garbage in, garbage out there; the garbage in that example is easy to cut off.

 
This is exactly what the optimisers cannot understand: stability is improved through simplification, not through the search for a global maximum.
The simplest example is an SVM with a given margin between the support vectors. Cross-validation is even more flexible. And then we'll see; afterwards you can bring in half a page of math stats.
If you can't get into causal inference right from the start, you can begin by thinking at this level.
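A rough sketch of that "simplification over the global maximum" point, nothing more: in an SVM the parameter C trades margin width against training error, and cross-validation shows whether the simpler (small-C) model is more stable than the one that chases the in-sample maximum. The dataset here is synthetic with deliberately noisy labels.

```python
# Rough illustration of "simplification over maximisation": a smaller C in an
# SVM forces a wider margin (a simpler model); cross-validation shows whether
# the simpler model is more stable than one fitted to the in-sample maximum.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           flip_y=0.2, random_state=1)   # noisy labels on purpose

for C in (100.0, 1.0, 0.01):
    train_acc = SVC(C=C, kernel="rbf").fit(X, y).score(X, y)   # in-sample "maximum"
    cv_acc = cross_val_score(SVC(C=C, kernel="rbf"), X, y, cv=5).mean()
    print(f"C={C:6.2f}  train={train_acc:.2f}  cv={cv_acc:.2f}")
```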

Off-topic: have you played Starfield? Bethesda knows how to make atmospheric games. It's immersive.
 
Forester #:

I'm talking about the specific artificial example on which the experiments were conducted. It is not a case of garbage in, garbage out there; the garbage in that example is easy to cut off.

Let me clarify my point.

Any ML algorithm tries to reduce the error. Error reduction is more effective on garbage, because garbage is far more likely to contain "convenient" values for reducing the error. As a result, the "importance" of the garbage predictors will almost certainly come out higher than that of the NON-garbage ones. That is why there is preprocessing, which is much more labour-intensive than the model fitting itself.
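A small illustration of that effect, my own sketch rather than a proof: fit a forest on one weak signal column plus dozens of pure-noise columns and look at the impurity-based importances; the noise columns routinely pick up non-trivial "importance" simply because they offer convenient splits on the training set.

```python
# Illustration of the point above: with many noise predictors, some of them
# end up with noticeable "importance" because they offer convenient splits
# for reducing the training error.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n, n_noise = 500, 50
informative = rng.normal(size=(n, 2))
y = (informative[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)   # weak signal
X = np.column_stack([informative, rng.normal(size=(n, n_noise))])    # plus pure noise

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
imp = rf.feature_importances_
print("importance of the informative feature:", round(imp[0], 3))
print("top-5 noise-feature importances      :", np.sort(imp[2:])[-5:].round(3))
```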

 
СанСаныч Фоменко #:

Let me clarify my point.

Any ML algorithm tries to reduce the error. Error reduction is more effective on garbage, because garbage is far more likely to contain "convenient" values for reducing the error. As a result, the "importance" of the garbage predictors will almost certainly come out higher than that of the NON-garbage ones. That is why there is preprocessing, which is much more labour-intensive than the model fitting itself.

Please tell me, what is not garbage? I have never seen anyone here talk about clean input data, yet I hear about garbage on this forum all the time.

What does it look like? If you talk about garbage, it means you have also had non-garbage, otherwise there would be nothing to compare it with.

 
СанСаныч Фоменко #:

Let me clarify my point.

Any ML algorithm tries to reduce the error. Error reduction is more effective on garbage, because garbage is far more likely to contain "convenient" values for reducing the error. As a result, the "importance" of the garbage predictors will almost certainly come out higher than that of the NON-garbage ones. That is why there is preprocessing, which is much more labour-intensive than the model fitting itself.

Preprocessing is about normalisation, not about garbage.
Garbage is dealt with by feature selection and, partly, by feature engineering.

Sanych, stop feeding garbage into the inputs of people who aren't seasoned yet.
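If the split of responsibilities is drawn that way, then a feature-selection pass is what actually throws the garbage out before training. A minimal sketch of one possible approach (mutual-information ranking; the function name and the keep threshold are arbitrary):

```python
# Minimal sketch of dealing with garbage at the feature-selection stage,
# before any model is trained (one possible approach among many).
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def drop_garbage(X, y, keep=20):
    """Keep the `keep` features with the highest mutual information with the target."""
    selector = SelectKBest(score_func=mutual_info_classif, k=keep).fit(X, y)
    return selector.transform(X), selector.get_support(indices=True)

# Hypothetical usage on your own training arrays:
# X_clean, kept_idx = drop_garbage(X_train, y_train, keep=20)
```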
 
Ivan Butko #:

Please tell me, what is not garbage? I have never seen anyone here talk about clean input data, yet I hear about garbage on this forum all the time.

What does it look like? If you talk about garbage, it means you have also had non-garbage, otherwise there would be nothing to compare it with.

Nobody knows what is garbage and what is not; these are hypothetical concepts.

If people knew exactly what was what, there wouldn't be a 3K-page thread. )))

One simply assumes that anything beyond such-and-such limits is "garbage", and those limits are also hypothetical. That is why the expression "garbage in, garbage out" is nothing more than a nice phrase: what is garbage for one researcher is non-garbage for another. It is like Elliott waves.

 
Ivan Butko #:

Please tell me, what is not garbage? I have never seen anyone here talk about clean input data, yet I hear about garbage on this forum all the time.

What does it look like? If you talk about garbage, it means you have also had non-garbage, otherwise there would be nothing to compare it with.

It's a directional movement, a vector.

But extracting it from the garbage is a challenge.

For example, I would try to feed my indicator into a neural network as predictors and try to identify the signs of garbage and a garbage filter.
