Machine learning in trading: theory, models, practice and algo-trading - page 6

 
Dr.Trader:

The "CSV file" and the "R dataset" on the evaluate tab are just different ways to specify the data source. If you feed the same data into them, then the result when testing the model will be the same.

If, before running rattle, you execute the code below:
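A minimal sketch of such a step, assuming the CSV path mentioned in the next paragraph (the data frame must be named dataset_validate for it to appear under that name):

# Read the validation CSV into a data frame, then start rattle;
# the data frame is then offered as an "R dataset" on the Evaluate tab.
dataset_validate <- read.csv("C:/dummy_set_validation.csv")
library(rattle)
rattle()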

Then dataset_validate will be available in the Evaluate tab as an R dataset. The result of testing the model will be the same as if you select the CSV file option and point it at C:/dummy_set_validation.csv; the test data are identical in both cases.

The training itself was on a different file, dummy_set_features.csv, so looking ahead is impossible here: the data in the two files are different and not time-dependent at all (only formula-dependent). I think the neural network did a great job: it found the 6 inputs that determine the result, suppressed the influence of the other inputs, and described the target formula with its own neural logic.

Just in case, I checked both files for duplicate rows, in case any remained. Here is the code in R:
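A sketch of what that check could look like, assuming both CSVs have identical columns and sit at the paths used earlier in this thread:

# Load both files and drop from the validation set any row that already
# occurs in the training set (or earlier in the validation set itself).
dataset_train    <- read.csv("C:/dummy_set_features.csv")
dataset_validate <- read.csv("C:/dummy_set_validation.csv")
dup <- duplicated(rbind(dataset_train, dataset_validate))[-seq_len(nrow(dataset_train))]
sum(dup)                                     # how many repeated rows were found
dataset_validate <- dataset_validate[!dup, ]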

If you run this before rattle, both the dataset_train and dataset_validate tables will be available for training and model checking, and there will no longer be any duplicates between them. Glory to R.

The model validation file contained 23 rows repeated from the training sample, but it also has another 3000 unique rows, so the model evaluation could not have been significantly affected.

Apparently I am incapable of explaining what I want.

1. Could you please provide the start and end dates for both files?

2. You can't remove anything from the dataset_validate file, since that file simulates the arrival of bar after bar.

 

I've been thinking about it since this morning; it's really not that simple.

According to your formula, only 2^6 = 64 combinations of the relevant inputs are possible. If the learning algorithm somehow determines the importance of these 6 inputs, it may well simply memorize all 64 combinations. Then it does not matter that each combination of all inputs in the validation sample is unique: the model takes only those 6 significant inputs and returns an answer it already knows. That is what happened with my neural network. I have now removed a couple of combinations of inputs 1, 3, 5, 7, 9, 11 from the training file, but left similar combinations in the validation file. The training error stayed at 0%, but on those new combinations the validation error rose to 50%. And that is bad: on forex the spread alone would push the deposit into the red. I am not sure yet what to do with numbers like these.
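A small illustration of that combinatorics, as a sketch in R using the formula quoted further down in the thread:

# All 2^6 = 64 combinations of the six relevant binary inputs and their target -
# few enough for a model to memorize outright.
combos <- expand.grid(rep(list(0:1), 6))
names(combos) <- paste0("input_", c(1, 3, 5, 7, 9, 11))
combos$output <- 1 - rowSums(combos) %% 2
nrow(combos)   # 64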

Apparently this is why many models do not work on forex: they simply memorize certain combinations but cannot cope with new ones.

 
SanSanych Fomenko:

Apparently I am incapable of explaining what I want.

1. Could you please give me the start and end dates for both files?

2. You cannot remove anything from the dataset_validate file, because this file simulates the arrival of bar after bar.

No, it's all right, I get it. What you say applies to forex, and I agree with it. But I was talking about Alexey's files and training the model on them.

https://c.mql5.com/3/96/dummy_set_features.zip - training

https://c.mql5.com/3/96/dummy_set_validation.zip - validation

The result in the files is determined by the formula "1-mod(sum(input_1 ; input_3 ; input_5 ; input_7 ; input_9 ; input_11);2)".
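Written as R code the target would be roughly this, assuming the inputs are 0/1 columns of a data frame (placeholder name 'dataset'):

# The same formula in R:
dataset$output <- with(dataset, 1 - (input_1 + input_3 + input_5 + input_7 + input_9 + input_11) %% 2)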

Please give me a link to the files you mentioned; I'll try to train the neural network on them.

 
Guys,

Did I get it right that the forest overtrains on the full data set, even though the important features were highlighted? So for the forest we need to clean the data and train again.

Did the NS learn okay without data cleaning?

Thanks
 
Dr.Trader:

No, it's all right, I get it. What you say applies to forex, and I agree with it. But I was talking about Alexey's files and training the model on them.

https://c.mql5.com/3/96/dummy_set_features.zip - training

https://c.mql5.com/3/96/dummy_set_validation.zip - validation

The result in the files is determined by the formula "1-mod(sum(input_1 ; input_3 ; input_5 ; input_7 ; input_9 ; input_11);2)".

Please give me a link to the files you mentioned; I'll try to train the neural network on them.

Here you go, in the archive.

This is an RData file. Open R, load rattle, and from it these data frames are available:

"R1.F1" "R1.F3" "R1.F4" "R1.F5" "R1.F6" - this is for training, has different target variables that are marked with Fi

"R2.F1" "R2.F3" "R2.F4" "R2.F5" "R2.F6" - is for testing

"Rat_DF1" "Rat_DF2" "Rat_DF3" is the file that contains all the target variables

Files:
ALL_cod.zip  3281 kb
 
Alexey Burnakov:
Guys,

Did I get it right that the forest overtrains on the full data set, even though the important features were highlighted? So for the forest we need to clean the data and train again.

Did the NS learn okay without data cleaning?

Thank you
Yes, for the task from the first post that's how it is. But if you apply an NS to forex, an NS trained on garbage will also overtrain and will work worse on new data, so input selection is just as relevant there.
 
Dr.Trader:
Yes, for the task from the first post that's how it is. But if you apply an NS to forex, an NS trained on garbage will also overtrain and will work worse on new data, so input selection is just as relevant there.

The test data are clean, so there is a pattern in any segment of them. On real data you can't do without splitting the training sample into several parts and monitoring performance on one or more subsamples while training the model. That is a straightforward rule of thumb.

But as a test of how easily "out of the box" models like random forests overtrain, it was useful: they simply learn the noise, and validation turns into a complete mess, even though they isolate the important predictors correctly.

And even on clean data, where 14 predictors are pure noise (random numbers with no connection to the output), the forest still used them for training. This is something I have run into myself before.

And one more aspect, a very important one. The test example contains mutually independent observations: each row does not depend on any other row in any way. In reality, on a time series, neighboring observations are dependent; for example, two neighboring values of a moving average are highly correlated. Because of that, any method - I emphasize, ANY - will fail on raw real data. The right way, ideally, to prepare real data for training is to sample it so that neighboring observations are not physically related. For a time series this means that if you use indicators with a window of 20, neighboring observations in the training sample should be taken at least 20 steps apart, so that there is no mutual dependence between them. Then statistics and the actual learning of patterns start to work, rather than random clusters of similar data. I hope you understand what I mean.
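A sketch of that thinning step; the data frame and column here are hypothetical, and the window of 20 is the example from the paragraph above:

# Keep every 20th observation so that indicator windows of length 20
# do not overlap between neighbouring rows of the training sample.
window  <- 20
thinned <- raw_data[seq(1, nrow(raw_data), by = window), ]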

When I come back from my business trip I will also conduct an experiment in public using real forex data. And maybe we will do it together with the practical purpose of making profit from patterns.

Alexey

 
Dr.Trader:

But since the NS is a black box, you cannot know the logic of its solution. You can look at the weights, compute the average absolute value for each input, and draw a diagram, and find out that inputs 1, 3, 5, 7, 9, 11 are more important than the rest. Yet the other inputs are also used for some reason; zero weights are nowhere to be found. In other words, it works the other way around: first we train, and only then can we determine the important inputs.

Try some kind of NS contrasting (pruning) algorithm. Such an algorithm can thin out both the inputs (so you don't have to cull them by "manual" inspection) and the connections (and hidden-layer neurons, of course). At the output you get a concise description of the working logic.
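As an aside on the weight-averaging idea quoted above, here is a rough sketch of ranking inputs by their mean absolute input-to-hidden weight with the nnet package. The data frame dataset_train and its output column are assumptions carried over from the earlier posts, and this is only a crude proxy for importance, not the pruning suggested here:

library(nnet)
# Fit a small network, then rank inputs by the mean absolute weight
# of their connections into the hidden layer.
dataset_train$output <- as.factor(dataset_train$output)
model <- nnet(output ~ ., data = dataset_train, size = 10, maxit = 500)
w     <- coef(model)                            # named weights: "i1->h1", "b->h1", "h1->o", ...
w_in  <- w[grep("^i[0-9]+->h", names(w))]       # keep input-to-hidden weights only
imp   <- tapply(abs(w_in), sub("->.*", "", names(w_in)), mean)
barplot(sort(imp, decreasing = TRUE), las = 2)  # inputs 1, 3, 5, 7, 9, 11 should stand out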
 
SanSanych Fomenko:

Here you go, in the archive.

This is an RData file. Open R, load rattle, and from it these data frames are available:

"R1.F1" "R1.F3" "R1.F4" "R1.F5" "R1.F6" - this is for training, has different target variables that are marked with Fi

"R2.F1" "R2.F3" "R2.F4" "R2.F5" "R2.F6" - is for testing

"Rat_DF1" "Rat_DF2" "Rat_DF3" is the file that contains all the target variables

Thank you, I tried it. I can see that a lot of work went into selecting the predictors: the neural network trained on them easily, and the result held up on the check dataset as well.

The results below refer to learning on R1.F3

1) Rattle gave a funny result. The NN with the standard configuration showed train/validate/test errors of 30%/29%/33%, and the error on R2.F3 is 35%. But this is really just a lucky case; with another configuration it could easily have under- or over-trained, here it simply got lucky.

2) Then I took a simple crude approach: training with no control over overfitting, 200 hidden neurons, the network trained until it stopped improving. Errors train/validate/test/R2.F3: 2%/30%/27%/45%. Well, that's clear: the network is overtrained.

3) Training with control over overfitting. Trees do not need this, but with a neural network you should always do it to avoid overtraining. The idea is to pause training from time to time and check the train/validate/test results. I don't know a golden rule for weighing those results against each other, but a perfectly normal approach is to train on the train dataset, watch the errors on the validate and test datasets, and stop training as soon as the validate/test errors stop dropping. This gives some guarantee against overtraining. R2.F3 is treated as unavailable during this whole process, and the test on it is done only after training ends. In this case the train/validate/test/R2.F3 errors are 27%/30%/31%/37%. There is some overtraining again, but not much. The learning process could have been stopped earlier, once the train error became noticeably smaller than the validate/test errors, but that is guesswork; it might have helped, or it might not.
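A rough sketch of that stopping rule with nnet; train_set, validate_set and the factor column target are hypothetical names, and the burst size and limits are arbitrary:

library(nnet)
# Train in short bursts, continuing from the current weights, and stop
# once the error on the validation set stops improving.
fit  <- nnet(target ~ ., data = train_set, size = 20, decay = 1e-4, maxit = 1, trace = FALSE)
best <- Inf
for (step in 1:200) {
  fit <- nnet(target ~ ., data = train_set, size = 20, decay = 1e-4, maxit = 10,
              Wts = fit$wts, trace = FALSE)     # warm start from the previous weights
  err <- mean(predict(fit, validate_set, type = "class") != validate_set$target)
  if (err < best) { best <- err; best_wts <- fit$wts } else break
}
# best_wts holds the weights at the lowest validation error seen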

The "R1.F1" target variable has three values, Rattle can't do that with neuronics and you have to write your own code in R, I skipped this dataset.

"R1.F4" "R1.F5" "R1.F6" gave approximately the same results for all 4 errors in Rattle neuronka, I think an adequate approach with neuronka will also give approximately the same results, I have not dealt with them further.

 

About the training methodology.

Here I described the standard method, which is usually applicable and gives good results: https://www.mql5.com/ru/blogs/post/661499

Let me briefly explain: we divide all the data into 2 parts - training and validation. The validation part should be as large as you think necessary. Here is how I did it. I take 15 years of one-minute quotes. I calculate the inputs and outputs, then I thin the data so that the observations are physically different (I take every n-th bar, where n is not less than the largest window used in the indicators and not less than the furthest lag into the future). Think about why this is correct: the data become mutually independent.

Next I take the training part (the largest), the 10 years furthest in the past, and create row indices for cross-validation. That is, I divide the data into 5 equal parts, rigidly separated by date. During cross-validation the model runs through training parameters such as depth, number of iterations, etc., training on four of the chunks and measuring the error on the fifth. It does this five times, each time choosing a different one of the five chunks as the test part. That gives five error metrics (for one set of training parameters); they are averaged and taken as the test error on data not involved in the training.

The model then repeats this n * m * s * d times, where n, m, s, d are the numbers of values of the training parameters. This is a grid search, and it can run to hundreds of iterations. You can also use random search or a genetic search. As you wish.

Then we get a table of cross-validation error metrics, one per set of training parameters. We simply take the best result and its parameters, and then train on the entire training sample with those parameters. For an NS a limit on the number of iterations is also specified there, so that overtraining does not occur.

And at the end we validate the model on the remaining 5 years of quotes to evaluate its out-of-sample performance.

All in all, cross-validation is so far the best choice for effective model training. I recommend giving it a try.

The package in R is caret.
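For reference, a minimal sketch of this scheme with caret; train_set and its target column are hypothetical, and a random forest is used only as an example model:

library(caret)
# 5-fold cross-validation over a parameter grid; caret then refits the
# best parameter set on the whole training sample automatically.
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(2, 4, 6, 8))        # example grid for a random forest
fit  <- train(target ~ ., data = train_set, method = "rf",
              trControl = ctrl, tuneGrid = grid)
fit$bestTune                                      # the parameter set with the lowest CV error
# Date-separated folds, as described above, can be passed in via trainControl(index = ...).
# predict(fit, newdata = validation_set)          # final out-of-sample check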

See you soon.

Alexei

FOLLOW-UP OF THE FOREX DATA ANALYSIS EXPERIMENT: first serious model training and results
  • 2016.02.27
  • Alexey Burnakov
  • www.mql5.com
The beginning is at these links: https://www.mql5.com/ru/blogs/post/659572 https://www.mql5.com/ru/blogs/post/659929 https://www.mql5.com/ru/blogs/post/660386 https://www.mql5.com/ru/blogs/post/661062