Machine learning in trading: theory, models, practice and algo-trading - page 5

 
SanSanych Fomenko:

I see no evidence that NS has coped with anything.

Overtraining is a universal evil in science, and in model building in particular.

Therefore the error needs to be measured on three sets:

  • the training set as rattle understands it (OOB, test, validation) will do just fine;
  • a set that lies outside the training set in terms of dates;
  • another set that also lies outside the training set in terms of dates.

The last two sets must be left unshuffled, since that is how they arrive in the terminal: bar after bar.

The error should be about the same on all three sets. In doing so, you will have to fix the set of predictors you take when training the model.
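
A minimal sketch in R of this three-set check; the file name, the "date" and "target" columns, the cut-off dates and the choice of a random forest are all assumptions, not the author's setup.

library(randomForest)

data_all <- read.csv("C:/quotes.csv")                  # hypothetical file
data_all$date <- as.Date(data_all$date)

train <- data_all[data_all$date <  as.Date("2015-01-01"), ]
oos1  <- data_all[data_all$date >= as.Date("2015-01-01") & data_all$date < as.Date("2015-07-01"), ]
oos2  <- data_all[data_all$date >= as.Date("2015-07-01"), ]

predictors <- setdiff(names(data_all), c("date", "target"))
model <- randomForest(x = train[, predictors], y = factor(train$target))

err <- function(s) mean(as.character(predict(model, s[, predictors])) != as.character(s$target))
c(train = err(train), oos1 = err(oos1), oos2 = err(oos2))
# all three errors should come out roughly the same size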

I did not take overtraining into account in this problem at first, there was no need. The first time the network had only one neuron in the inner layer. I checked it now: the error during long training is about 45% and does not go lower. During training the network gives inputs 1,3,5,7,9,11 higher weights, but it cannot really learn because of the lack of neurons. The idea was to squeeze the most correct result out of it under very limited conditions and see which inputs it would give more weight to.
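
A small sketch of that experiment, assuming the nnet package, the file from this thread and a numeric 0/1 "target" column: a single hidden neuron, then a look at which input weights it builds up.

library(nnet)

d <- read.csv("C:/dummy_set_features.csv")
x_cols <- setdiff(names(d), "target")

net1 <- nnet(d[, x_cols], d$target, size = 1, maxit = 500, trace = FALSE)
summary(net1)   # prints the i*->h1 weights; the informative inputs get the larger ones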

Now, having a validation file, one can already work more seriously. I divided the original file into two parts (85% and 15%), trained on the first part, occasionally stopping the training and measuring the error on both parts. The examples from the 15% part did not get into training, but the error on them decreased at about the same rate as on the training part. When it reached 0% on both parts of the file, I stopped training and checked the network on the second file; the error there was ~0%. It turns out that for this problem overtraining was not even reached, which is funny.
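
A rough sketch of this 85/15 procedure, assuming the nnet package, the file from this thread and a numeric 0/1 "target" column; the seed, the burst length and the number of bursts are arbitrary.

library(nnet)

d <- read.csv("C:/dummy_set_features.csv")
set.seed(1)
idx   <- sample(nrow(d), round(0.85 * nrow(d)))
train <- d[idx, ]
hold  <- d[-idx, ]

x_cols <- setdiff(names(d), "target")
err <- function(m, s) mean(round(predict(m, s[, x_cols])) != s$target)

net <- nnet(train[, x_cols], train$target, size = 10, maxit = 50, trace = FALSE)
for (i in 1:20) {                                    # train in short bursts
  cat(sprintf("train %.3f  holdout %.3f\n", err(net, train), err(net, hold)))
  net <- nnet(train[, x_cols], train$target, size = 10, maxit = 50,
              Wts = net$wts, trace = FALSE)          # continue from the current weights
}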


But a neural network on forex does worse than on this problem. At first the error drops on both samples, training and validation. Then the error keeps falling on the training sample but starts growing on the validation sample; at that point it is time to stop training. Afterwards I usually check the result on the history before the sample, and again on the history after the sample. For me that makes three sets of data, like yours. But the error on all three sets is so far different (and large outside the training period).

I had various ideas on how to improve the results, for example smoothing the output or adding a filter (<0.1 - sell, >0.9 - buy, anything in between - no trade). You can improve results by adding a filter to the network output and optimizing it as well, but in the forward test it did not help at all. Another idea was that if the network was trained on a certain period of history, then the filter could be optimized on the history before that period, since the error before the optimization period is probably related to the error after it. But it did not work: if there are three periods of history, "before training", "training" and "after training", then each of the three has its own optimal filter, and they are not related in any way.
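
For reference, the threshold filter mentioned above is trivial to express in R; the "out" vector of raw network outputs is just a placeholder.

to_signal <- function(out, lower = 0.1, upper = 0.9) {
  ifelse(out < lower, "sell",
         ifelse(out > upper, "buy", "no trade"))
}
to_signal(c(0.05, 0.50, 0.93))   # "sell"  "no trade"  "buy"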

So far I have concluded that the model should have a small error (<10%) on both the training and test samples, with no filters applied to the model output and no guesswork like "invert the result every fourth week". No timeframes below M15. I try various new ideas, and it is good if at least one in ten improves the result. Somehow this should eventually add up to a profitable model.

 
Dr.Trader:

I did not take overtraining into account in this problem at first, there was no need. The first time the network had only one neuron in the inner layer. I checked it now: the error during long training is about 45% and does not go lower. During training the network gives inputs 1,3,5,7,9,11 higher weights, but it cannot really learn because of the lack of neurons. The idea was to squeeze the most correct result out of it under very limited conditions and see which inputs it would give more weight to.

Now, having a validation file, one can already work more seriously. I divided the original file into two parts (85% and 15%), trained on the first part, occasionally stopping the training and measuring the error on both parts. The examples from the 15% part did not get into training, but the error on them decreased at about the same rate as on the training part. When it reached 0% on both parts of the file, I stopped training and checked the network on the second file; the error there was ~0%. It turns out that for this problem overtraining was not even reached, which is funny.


But a neural network on forex does worse than on this problem. At first the error drops on both samples, training and validation. Then the error keeps falling on the training sample but starts growing on the validation sample; at that point it is time to stop training. Afterwards I usually check the result on the history before the sample, and again on the history after the sample. For me that makes three sets of data, like yours. But the error on all three sets is so far different (and large outside the training period).

I had various ideas on how to improve the results, for example smoothing the output or adding a filter (<0.1 - sell, >0.9 - buy, anything in between - no trade). You can improve results by adding a filter to the network output and optimizing it as well, but in the forward test it did not help at all. Another idea was that if the network was trained on a certain period of history, then the filter could be optimized on the history before that period, since the error before the optimization period is probably related to the error after it. But it did not work: if there are three periods of history, "before training", "training" and "after training", then each of the three has its own optimal filter, and they are not related in any way.

So far I have concluded that the model should have a small error (<10%) on both the training and test samples, with no filters applied to the model output and no guesswork like "invert the result every fourth week". No timeframes below M15. I try various new ideas, and it is good if at least one in ten improves the result. Somehow this should eventually add up to a profitable model.

It's all about the data )

The data in all the sets should consist of mutually independent observations. But even then, validation will give the worst result.
 

I tried different models from Rattle; the forest also gave good results.

Step 1 - the forest learned something, and in its statistics inputs 1,3,5,7,9,11 stand out somehow. The error on the training file is 0%, on the validation file 46%.

Step 2 - I left only inputs 1,3,5,7,9,11 and the result in the file, and trained the forest again on the new file. Now the error is 0% on both the training and validation files; everything is great. The only nuance was that for the second step Rattle set the "Number of variables" parameter to 2, probably because the file is smaller. I changed it to 4, as in the first step.
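
The same two steps can be reproduced outside Rattle; this sketch assumes the randomForest package, that Rattle's "Number of variables" corresponds to its mtry parameter, and made-up column names for the six inputs and the target.

library(randomForest)

d <- read.csv("C:/dummy_set_features.csv")
d$target <- factor(d$target)

# Step 1: forest on all inputs
rf_all <- randomForest(target ~ ., data = d, ntree = 500, mtry = 4)
varImpPlot(rf_all)                        # inputs 1,3,5,7,9,11 should stand out

# Step 2: forest on the six selected inputs only
keep <- c("input_1", "input_3", "input_5", "input_7", "input_9", "input_11")   # assumed names
rf_sel <- randomForest(target ~ ., data = d[, c(keep, "target")],
                       ntree = 500, mtry = 4)
rf_sel                                    # prints the OOB error and confusion matrix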

 

Dr.Trader

It is nice to meet a kindred spirit working on the basis of rattle. In any case, we can compare results.

Now for my own experience.

We take rattle and models from it.

On the Model tab we train, obtaining the OOB value, and on the Evaluate tab we evaluate on the Validation and Testing sets. We get three figures.

I argue that if the predictor set was not previously cleared of noise predictors, these results mean nothing; most likely they are just a set of numbers.

On the Evaluate tab, in addition to getting the results listed above, you also need to put a set into the R Dataset window. It is very important that this set be obtained by mechanically splitting the original file by date: the file behind the first three figures covers, for example, January 1, 2014 to January 1, 2015, while the file in the R Dataset window must be strictly later than January 1, 2015, without any of the random sampling and other tricks used in R. Just plainly, mechanically.
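
A sketch in R of this "mechanical" split, with made-up file and column names: everything up to the cut-off date goes into the file that Rattle itself splits into Training/Validation/Testing, and everything after the cut-off stays in the R session so the Evaluate tab can pick it up in the R Dataset window.

d <- read.csv("C:/quotes.csv")                         # hypothetical source file
d$date <- as.Date(d$date)

rattle_part <- d[d$date <  as.Date("2015-01-01"), ]   # behind the first three figures
late_part   <- d[d$date >= as.Date("2015-01-01"), ]   # strictly later data, untouched

write.csv(rattle_part, "C:/rattle_part.csv", row.names = FALSE)
# late_part remains in the workspace and can be selected as an "R Dataset"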

I do not recognize any other way of evaluating how well the significant predictors have been separated from the noisy ones, because the method I propose imitates real trading.

Could you please post all four figures? Including the result on the file from the R Dataset window?

 
Dr.Trader:

I tried different models from Rattle; the forest also gave good results.

Step 1 - the forest learned something, and in its statistics inputs 1,3,5,7,9,11 stand out somehow. The error on the training file is 0%, on the validation file 46%.

Step 2 - I left only inputs 1,3,5,7,9,11 and the result in the file, and trained the forest again on the new file. Now the error is 0% on both the training and validation files; everything is great. The only nuance was that for the second step Rattle set the "Number of variables" parameter to 2, probably because the file is smaller. I changed it to 4, as in the first step.

I wonder... You have to set the depth to 6 variables in order to pick up all the relevant ones.
 
Dr.Trader:

I tried different models from Rattle; the forest also gave good results.

Step 1 - the forest learned something, and in its statistics inputs 1,3,5,7,9,11 stand out somehow. The error on the training file is 0%, on the validation file 46%.

Step 2 - I left only inputs 1,3,5,7,9,11 and the result in the file, and trained the forest again on the new file. Now the error is 0% on both the training and validation files; everything is great. The only nuance was that for the second step Rattle set the "Number of variables" parameter to 2, probably because the file is smaller. I changed it to 4, as in the first step.

In the first step, the forest used the noise inputs for training, which is not good.
 
SanSanych Fomenko:

Could you please post all four figures? Including the result on the file from the R Dataset window?

I did it with the neural network: dummy_set_features.csv is divided into 3 parts, 70%/15%/15%; dummy_set_validation.csv is used on the Evaluate tab as a "csv file" (it is essentially the same as an R Dataset, just a different file format).

I took the log from Rattle, removed the two limitations from the network call that I wrote about earlier, and ran it in R, but the model is still undertrained (35% error on the training sample). So it turns out that reducing the training sample makes the training result worse. But you can increase the number of neurons in the inner layer, which should improve learning.

I changed the number of neurons from 10 to 20 and re-ran the training; now the error on the training sample is 0%, on the validation sample 0.6%, and on the test sample 0.1%. The error on the dummy_set_validation.csv file is 0.3%. Everything is fine; the R code is in the attachment.
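
Roughly the same thing in plain R, assuming the nnet package, a numeric 0/1 "target" column and a random 70/15/15 split; raising maxit and MaxNWts here is only my guess at the two limitations mentioned above.

library(nnet)

d <- read.csv("C:/dummy_set_features.csv")
v <- read.csv("C:/dummy_set_validation.csv")

set.seed(1)
n   <- nrow(d)
idx <- sample(n)
train <- d[idx[1:round(0.70 * n)], ]
valid <- d[idx[(round(0.70 * n) + 1):round(0.85 * n)], ]
test  <- d[idx[(round(0.85 * n) + 1):n], ]

x_cols <- setdiff(names(d), "target")
net <- nnet(train[, x_cols], train$target, size = 20,
            maxit = 1000, MaxNWts = 10000, trace = FALSE)

err <- function(s) mean(round(predict(net, s[, x_cols])) != s$target)
c(train = err(train), valid = err(valid), test = err(test), csv_file = err(v))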

Interestingly, it turns out that neural network training problems are very different from tree problems. For me the problem with the NN is choosing the right number of internal neurons and suspending learning before the network starts to overtrain. In principle, the more neurons the better, but this greatly affects the required RAM and training time. Superfluous predictors do not hurt training much, the network usually trains no worse with them, but they increase training time, so it is desirable to drop them to make it faster.

Some time ago I used a self-written network in mt4; the EA was able to learn and trade right away. There were problems with training speed, weight initialization and learning algorithms. There were too many small problems, and it was hard to get good results even on the training samples, so I gave it up and now work with R. The nnet package can train with the BFGS algorithm, which removes all those small problems. But the package is limited to just one inner layer. I would like to have at least 3 inner layers, but with the BFGS algorithm, otherwise there will again be a lot of problems with learning.

Files:
r_nnet2.txt  10 kb
 
Dr.Trader:

I did it with the neural network: dummy_set_features.csv is divided into 3 parts, 70%/15%/15%; dummy_set_validation.csv is used on the Evaluate tab as a "csv file" (it is essentially the same as an R Dataset, just a different file format).

No, it's not the same.

The amazing thing is that I discuss this issue with many people and NO ONE does it the way I describe. And I know what I am writing about, because I spent half a year on exercises like yours, outside of rattle() as well. Whatever I put into the tester, I got a completely different error. Then I did what I described above, and the error in the tester practically coincided with the error on the file from the R Dataset.

I gave Alexei three files obtained by mechanically dividing one big file into three parts. On the first part we exercise, train, evaluate... And on the other two we check the figures obtained on the first. If the error on all three files is greater than 20%(!), or more precisely closer to 40%, then the model is not overtrained and you can work with it.

The three files mentioned above have 27 predictors and 6 target variables. The 27 predictors were selected from 170 predictors by my own algorithms. As of today, these 27 predictors do not lead to overtrained models. But the other 143 predictors in my set are noise, and on that noise you can easily get an error comparable to yours, yet the model turns out overtrained and unusable.

Why is the error on the noise predictors smaller than on the significant predictors?

The way I see it, since the model-fitting algorithm seeks to reduce the fitting error, it can always find something among the noisy, random values that fits better than the meaningful predictors. As a result, the smaller the error, the fewer meaningful predictors are involved in building the model!
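
A toy illustration of this point, on synthetic data rather than the author's set: a forest fitted to pure noise still drives the in-sample error to essentially zero, while the OOB error shows there is no real skill.

library(randomForest)
set.seed(1)
noise  <- as.data.frame(matrix(rnorm(500 * 50), ncol = 50))   # 50 pure-noise predictors
target <- factor(sample(0:1, 500, replace = TRUE))            # random target
rf <- randomForest(noise, target, ntree = 500)
mean(predict(rf, noise) != target)      # in-sample error: essentially 0
rf$err.rate[rf$ntree, "OOB"]            # OOB error: around 0.5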

PS

Judging by the error, either the NS is 100% overtrained, or it is looking ahead.

 

The "CSV file" and the "R dataset" on the evaluate tab are just different ways to specify the data source. If you feed the same data into them, then the result when testing the model will be the same.

If, before running rattle, you execute

dataset_validate <- read.csv("file:///C:/dummy_set_validation.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")

then this dataset_validate will be available on the Evaluate tab as an R Dataset. The result of testing the model will end up the same as if you simply selected the csv file option and specified the file C:/dummy_set_validation.csv; the data used for the test is identical in both cases.

The training itself was on a different file, dummy_set_features.csv, so looking ahead is impossible here, because the data in the two files is different and not time-dependent at all (it depends on a formula). I think the neural network did a great job: it found the 6 inputs that determine the result, reduced the influence of the other inputs, and described the desired formula with some neural logic of its own.

Just in case, I checked both files for duplicate rows. Here is the R code:

# read the data from the training and validation files
dataset_train <- read.csv("file:///C:/dummy_set_features.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")
dataset_validate <- read.csv("file:///C:/dummy_set_validation.csv", na.strings=c(".", "NA", "", "?"), strip.white=TRUE, encoding="UTF-8")

anyDuplicated(dataset_train) # index of the first duplicate row in the first file - 567, i.e. duplicates exist
anyDuplicated(dataset_validate) # index of the first duplicate row in the second file - 202
# So far these are just repeated rows within each file separately. They do no harm, but they do not improve anything either, so it is better to remove them.
dataset_train <- unique(dataset_train)
dataset_validate <- unique(dataset_validate)

# Now we can find out how many rows from dataset_train are repeated in dataset_validate
dataset_combined <- rbind(dataset_train, dataset_validate) # combine both tables into one
nrow(dataset_combined) - nrow(dataset_combined[!duplicated(dataset_combined), ]) # number of repeated rows - 23. There should be no such repeats, they must be removed.
# Remove the repeated rows from dataset_validate
dataset_validate <- dataset_validate[!(tail(duplicated(dataset_combined), nrow(dataset_validate))), ]
# Remove the combined table from memory, it is no longer needed
rm(dataset_combined)

If you run this before rattle, both the dataset_train and dataset_validate tables will be available for training and model checking, and they will no longer contain duplicates. Glory to R.
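
As a quick sanity check (my addition, not part of the original post), the same duplicated() trick confirms the cleanup: afterwards the two tables should share no rows at all.

sum(duplicated(rbind(dataset_train, dataset_validate)))   # expected to be 0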

The model validation file contained 23 rows repeated from the training sample, but besides those it has another 3000 unique rows, so the model evaluation could not have been significantly affected by this.

 
That is how it should be; the file is generated from random numbers.

In general, I overestimated the complexity of the task. The forest, in fact, already solves it. One wonders how that is possible, given how the forest works.