Machine learning in trading: theory, models, practice and algo-trading - page 56

 
Alexey Burnakov:
Several years. Its result is here in the thread.
Please give me the link.
 
Vadim Shishkin:
Can you give me a link, please?
The whole thread is the result.
 
Yury Reshetov:

At least when the data is divided strictly into training and test samples by date, instead of first randomly shuffling the examples uniformly across the whole sample and only then dividing it into parts. After all, it may turn out that one part of the sample contains mostly strong directional trends while the other contains mostly sideways movement. Random shuffling reduces the probability that similar patterns end up concentrated in one part of the sample but not the other.

By the way, the built-in MetaTrader strategy tester has the same drawback: it divides the training sample and the forward test strictly by date. As a result, a change of market regime close to the dividing line can all but guarantee overtraining.

This is the key point in planning the experiment. In real trading there is a strict separation in time. This is how a model is tested on the future in the full sense of the word.

I see this too: in validation the market was predominantly falling, so shorts predominate. Well, in the future it may predominantly rise. Anything can happen.
 
Vadim Shishkin:
In other words, like any self-respecting trader, you have uttered the answer.
The answer of the universe, if you like.
 
Alexey Burnakov:
This is the key point in planning the experiment. In real trading there is a strict separation in time. This is how a model is tested on the future in the full sense of the word.

I see this too: in validation the market was predominantly falling, so shorts predominate. Well, in the future it may predominantly rise. Anything can happen.

This is called an unbalanced sample, and it is a known problem in machine learning.

To make it clearer, here is an example. Suppose we have a training sample in which uptrends prevail, so downtrends are represented less often than uptrends, i.e. we have an imbalance.

Suppose the sample contains 1,000 downward movements and 10,000 upward ones, and assume the classification error on upward movements is 10%. Ten percent of 10,000 examples is 1,000 false signals classified as predicting downward movements, while the sample contains only 1,000 genuine downward examples. This means that no matter how accurately downward movements themselves are classified, whenever the classifier predicts a potentially downward movement its error will be at least 50%. In other words, the more the examples of one class outnumber the other in the training sample, the greater the impact of misclassifying that class on the quality of the classifier's responses for the other class.
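A quick way to verify the arithmetic above (a minimal sketch in R; the counts and error rate are simply the numbers from the example):

n_up   <- 10000   # upward examples in the sample
n_down <- 1000    # downward examples in the sample
err_up <- 0.10    # classification error on upward movements

false_down <- n_up * err_up            # 1,000 upward moves mislabeled as "down"
true_down  <- n_down                   # at best, every real downward move is caught
true_down / (true_down + false_down)   # best-case precision of a "down" call: 0.5

So even in the best case, a "down" prediction is wrong at least 50% of the time, exactly as stated.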

For this reason it is very difficult to predict rare phenomena: earthquakes, volcanic eruptions, economic crises, and so on. If a phenomenon is rare and weakly represented in the sample, then even a small error on examples of the opposite, abundant classes becomes excessively large relative to the rare phenomenon.

Therefore the training sample must be balanced in advance, so that it contains the same number of examples for every class. Otherwise poorly represented classes are more likely to fail tests outside the training sample. In addition, when dividing the full sample into training and test parts, the examples must be shuffled with a PRNG (pseudo-random number generator) using a uniform distribution, so that examples with similar predictors do not crowd into one part while different ones land in the other. That is, to avoid an imbalance in the predictors, not only in the dependent variables.
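A minimal sketch of that recipe in base R (the data frame `dataset` and its columns are hypothetical, purely for illustration): balance the classes first, then shuffle with a seeded PRNG before splitting.

set.seed(42)                                   # seed the PRNG for reproducibility

# Toy data standing in for a real feature set (placeholder names)
dataset <- data.frame(x1 = rnorm(11000),
                      x2 = rnorm(11000),
                      class = factor(c(rep("up", 10000), rep("down", 1000))))

# 1. Balance: keep as many examples of each class as the rarest class has
n_min    <- min(table(dataset$class))
balanced <- do.call(rbind, lapply(split(dataset, dataset$class),
                                  function(d) d[sample(nrow(d), n_min), ]))

# 2. Shuffle uniformly, then split 70/30 into training and test parts
balanced <- balanced[sample(nrow(balanced)), ]
n_train  <- floor(0.7 * nrow(balanced))
train    <- balanced[seq_len(n_train), ]
test     <- balanced[-seq_len(n_train), ]

table(train$class); table(test$class)          # roughly balanced in both parts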

 
Yury Reshetov:

This is called an unbalanced sample, and it is a known problem in machine learning.

To make it clearer, here is an example. Suppose we have a training sample in which uptrends prevail, so downtrends are represented less often than uptrends, i.e. we have an imbalance.

Suppose the sample contains 1,000 downward movements and 10,000 upward ones, and assume the classification error on upward movements is 10%. Ten percent of 10,000 examples is 1,000 false signals classified as predicting downward movements, while the sample contains only 1,000 genuine downward examples. This means that no matter how accurately downward movements themselves are classified, whenever the classifier predicts a potentially downward movement its error will be at least 50%. In other words, the more the examples of one class outnumber the other in the training sample, the greater the impact of misclassifying that class on the quality of the classifier's responses for the other class.

For this reason it is very difficult to predict rare phenomena: earthquakes, volcanic eruptions, economic crises, and so on. If a phenomenon is rare and weakly represented in the sample, then even a small error on examples of the opposite, abundant classes becomes excessively large relative to the rare phenomenon.

Therefore the training sample must be balanced in advance, so that it contains the same number of examples for every class. Otherwise poorly represented classes are more likely to fail tests outside the training sample. In addition, when dividing the full sample into training and test parts, the examples must be shuffled with a PRNG using a uniform distribution, so that examples with similar predictors do not crowd into one part while different ones land in the other. That is, to avoid an imbalance in the predictors, not only in the dependent variables.

Yury, I get the idea. The sample may indeed be unbalanced both in training and in validation. But in reality you trade the future, where the skew may be very strong. The strategy should be robust to such an outcome.
 
Yury Reshetov:


And therefore the training sample must be balanced in advance, so that it contains the same number of examples for every class. Otherwise poorly represented classes are more likely to fail tests outside the training sample. In addition, when dividing the full sample into training and test parts, the examples must be shuffled with a PRNG using a uniform distribution, so that examples with similar predictors do not crowd into one part while different ones land in the other. That is, to avoid an imbalance in the predictors, not only in the dependent variables.

The caret package has a couple of functions for this: downSample/upSample decrease or increase the number of observations so that the classes come out fully balanced. Observations within a class are removed or duplicated by simple random sampling.
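For instance (a sketch: downSample/upSample are real caret functions, while the data frame and column names here are placeholders):

library(caret)

# 'dataset' is assumed to hold predictors plus an imbalanced factor column 'class'
down <- downSample(x = dataset[, c("x1", "x2")], y = dataset$class, yname = "class")
up   <- upSample(x = dataset[, c("x1", "x2")], y = dataset$class, yname = "class")

table(down$class)   # every class cut down to the size of the rarest one
table(up$class)     # every class grown to the size of the largest one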

PS.

Reshetov!

Start studying R. More and more often you slip into platitudes.

 
SanSanych Fomenko:

Reshetov!

Start studying R. More and more often you slip into platitudes.

I'm going to drop everything and become an adept of R in order to play with numbers with a serious face.
 
Alexey Burnakov:
Yury, I get the idea. The sample may indeed be unbalanced both in training and in validation. But in reality you trade the future, where the skew may be very strong. The strategy should be robust to such an outcome.
Well, stability is achieved by preventing potential overtraining, and an unbalanced training sample is a potential cause of overtraining on the poorly represented classes. A learning algorithm acts as it finds easiest, not as it ought to in order to improve generalization. If the sample is unbalanced, it will minimize training error on the least represented classes by rote memorization, because there are few examples of such classes and memorizing them is easier than generalizing. After such rote learning it is no surprise that, outside the training sample, the algorithm's errors fall most often on the poorly represented classes.
 

You make yourself a blind hold-out by date ranges, i.e. you split the data strictly by date (before day X: training; after it: validation).

The point is simple. In the real world no one will let you assess the quality of real trading on a mixed sample containing observations from both the future and the past. Every new observation will come after day X.

Hence, by using a mixed sample for validation (without separation by date), you overestimate the quality metric on validation. That is all. Unpleasant surprises will follow.
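A minimal sketch of such a date-based split in R (the data frame `quotes`, its `date` column, and the cutoff date are assumptions for illustration):

x_day <- as.Date("2015-01-01")                  # hypothetical day X

train_set <- quotes[quotes$date <  x_day, ]     # only the past: training
valid_set <- quotes[quotes$date >= x_day, ]     # only the "future": validation

# Nothing in valid_set precedes day X, so the metric measured on it
# is not inflated by mixing future observations into the past.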
