Discussion of article "Deep Neural Networks (Part V). Bayesian optimization of DNN hyperparameters" - page 2
Interesting results of optimisation - the error on the validation sample was less than on the training sample!
In other words, the model was trained on one set of data and learnt to work on different data, i.e. it extracted more information from the training data than was actually there - is this science fiction or a con...?
I appeal to the author - stop the trickery and falsification, finally write an Expert Advisor and show some results, at least on a demo account.
All models were trained on the training section, i.e. they tried to minimise the error on that section, but in the end the selection was done on the test section. If a model had not found a pattern in the data, the results would have been very bad either on the training section or on the section that follows the test section. But as the results showed, in both of those places they are not very different from the test section on which the selection was done.
That is, the NN did not learn something on one dataset that magically transfers to another; it found patterns that are common to both datasets.
If the results had been less stable, then selection on the test section (without taking the training error into account) could have led to fitting to it. In this case it didn't, but on other data (if no patterns were found) it could. Then it would be better to look for a balance between the errors, something like Err = (ErrLearn * 0.37 + ErrValid * 0.63).
Hello, Vladimir,
I don't quite understand why your NN is trained on the training data while its evaluation is done on the test data (if I'm not mistaken, you use it as validation).
In this case, won't you get a fit to the test section, i.e. won't you simply choose the model that happened to work best on the test section?
You should also take into account that the test section is quite small, so you may fit to one of the temporary patterns, which can stop working very quickly.
Maybe it is better to evaluate on the training section, or on the sum of the sections, or, as in Darch (when validation data is supplied), on Err = (ErrLearn * 0.37 + ErrValid * 0.63) - these coefficients are the defaults, but they can be changed.
There are many options and it is not clear which one is best. Your arguments in favour of the test section are interesting.
Good afternoon.
Let's be more specific. Data set X consists of 4 subsets: pretrain = 4001, train = 1000, test = 500 and test1 = 100 bars. For pre-training, pretrain is used as the training set and train as the validation set.
For fine-tuning, the pretrain set is again used as the training set, and the first 250 bars of the test set are used as the validation set.
Therefore, to determine the final quality, we use the last 250 bars of the test set as the test set; a sketch of this split is given below.
I don't see any contradiction. Do you agree?
Good luck
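For illustration, here is a minimal R sketch of such a split; the object names (price, X, val.fine, test.fine) and the exact index arithmetic are my assumptions, not code from the article.

# Hypothetical illustration of the split described above.
# 'price' is assumed to be a matrix with one row per bar, newest bar last.
n.pretrain <- 4001; n.train <- 1000; n.test <- 500; n.test1 <- 100
idx <- cumsum(c(n.pretrain, n.train, n.test, n.test1))
X <- list(
  pretrain = price[1:idx[1], ],
  train    = price[(idx[1] + 1):idx[2], ],
  test     = price[(idx[2] + 1):idx[3], ],
  test1    = price[(idx[3] + 1):idx[4], ]
)
# Pre-training: train on pretrain, validate on train.
# Fine-tuning:  train on pretrain, validate on the first 250 bars of test,
#               and keep the last 250 bars of test for the final quality check.
val.fine  <- X$test[1:250, ]
test.fine <- X$test[251:500, ]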
In any case, the general regularities found on the training section should be no fewer than on the test section, since they were found there together with all the others, and therefore the NN should not work worse on it.
In my opinion this is obvious, just as it is obvious that a teacher's knowledge should not be less than a student's, although the author of the article apparently tries to refute this: he wrote in the comments that he is neither a scientist nor a programmer, yet persists in teaching us how to program neural networks.
Of course, he has already posted a lot of R code, but I would like to see adequate results and, most importantly, to understand how many more articles he will write before at least one normal Expert Advisor appears.
Technically everything is done well, that's not the question.
I just think that 250 bars are not enough to evaluate a model, which is why I was wondering why only this section is used for selection.
Yes, it works in this particular case (you got good results on all sections), but I don't think it is universal.
After all, there could be data that is not so good. For example, one model might be trained to a 40% error on the training section and, purely by chance, show 30% on the test section, while a second model is trained to, say, 35% on both sections. The second one is obviously better, but selecting only on the test section will pick the first one. For comparison, there are the following options for model evaluation:
evaluation only on the training section,
or on the sum of all sections,
or, as in Darch (when validation data is supplied), on Err = (ErrLearn * 0.37 + ErrValid * 0.63) - these coefficients are the defaults, but they can be changed.
The last variant is the most interesting, because it takes both errors into account, giving greater weight to the validation section.
In principle, you can extend the formula, for example to Err = (ErrLearn * 0.25 + ErrValid * 0.35 + ErrTest * 0.4).
Maybe selection should even be done by error deltas: for example, if ErrLearn and ErrTest differ by more than 5%, reject such a model and choose from the remaining ones, as in the sketch below.
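To make that combined rule concrete, here is a minimal R sketch; the function name, the made-up ErrValid values in the example and the way the 5% threshold is applied are my assumptions, not code from Darch or from the article.

# Hypothetical helper: weighted score plus a stability (delta) filter.
# 'errs' is a data.frame with one row per model and columns ErrLearn, ErrValid, ErrTest.
select.model <- function(errs, w = c(0.25, 0.35, 0.40), max.delta = 0.05) {
  score  <- errs$ErrLearn * w[1] + errs$ErrValid * w[2] + errs$ErrTest * w[3]
  stable <- abs(errs$ErrLearn - errs$ErrTest) <= max.delta   # reject models with a large error gap
  cand <- which(stable)
  if (length(cand) == 0) cand <- seq_len(nrow(errs))         # fall back if everything is rejected
  cand[which.min(score[cand])]                               # row index of the chosen model
}

# The two models from the example above (ErrValid values invented for illustration):
errs <- data.frame(ErrLearn = c(0.40, 0.35),
                   ErrValid = c(0.35, 0.35),
                   ErrTest  = c(0.30, 0.35))
select.model(errs)   # returns 2: the more stable second model is chosen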
I ran an experiment.
Here are some results from 30 training runs (the model is not the same as the original one - on that one everything is perfect - so I removed the pre-training to get some poor results):
Legend: V1 = 1 - ErrLearn; V2 = 1 - ErrOOS; Value = 1 - (ErrLearn * 0.37 + ErrOOS * 0.63); for the OOS I glued together the data from Valid and Test.
Value V1 V2
0.5712 0.4988 0.6138 - OOS is good, but it is random, since Learn = 50%
0.5002 0.5047 0.4975 - this was more common
0.6719 0.6911 0.6606 - and like this a couple of times. Sorting by Value = 1 - (ErrLearn * 0.37 + ErrOOS * 0.63) pulls such variants to the top.
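A tiny R sketch of that sorting, using the three rows above, just to make the weighting explicit (the data frame is a stand-in for the full table of 30 runs):

# One row per training run, reconstructed from V1 and V2 above.
runs <- data.frame(ErrLearn = c(0.5012, 0.4953, 0.3089),
                   ErrOOS   = c(0.3862, 0.5025, 0.3394))
runs$Value <- 1 - (runs$ErrLearn * 0.37 + runs$ErrOOS * 0.63)
runs[order(runs$Value, decreasing = TRUE), ]   # runs with the best balance come first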
As for an Expert Advisor: one of the articles did include one, and I rewrote it in a couple of days for the new version of Darch - technically everything works perfectly. The Expert Advisor is simple: read OHLC, pass it to R and run the calculation function there; then, having received the commands from the NN, send the trade orders to the server. That's all. The rest - trailing, stops, money management - to your taste.
The most difficult part is exactly what is described in the articles: finding good data and properly training the NN on it. Vladimir shows all sorts of variants of models, ensembles, optimisations, selection and processing of predictors... I can't keep up with him)))
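A very rough sketch of what the R side of such a scheme could look like; the function name, the toy predictor, the thresholds and the generic predict() call are all my assumptions and only stand in for the real trained network.

# Hypothetical R-side entry point that an Expert Advisor could call.
# 'ohlc' is assumed to be a matrix with columns Open, High, Low, Close (newest bar last);
# 'model' is a previously trained model already loaded into the R session.
calc.signal <- function(ohlc, model) {
  ret <- diff(log(ohlc[, "Close"]))                    # toy predictor: log returns
  p <- as.numeric(predict(model, newdata = data.frame(ret1 = tail(ret, 1))))
  if (p > 0.6) 1 else if (p < 0.4) -1 else 0           # 1 = buy, -1 = sell, 0 = stay out
}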
What do you think,
maybe we should exclude from the hyperparameter search the variants where the number of neurons in the hidden layers gives n2 > n1?
For example, the networks 10 - 20 - 100 - 2 or 10 - 8 - 100 - 2.
If n2 > n1, you get compression to n1, then decompression to n2, and then compression again to the 1 or 2 neurons of the output layer. If we still end up with 2 neurons at the output, the decompression in the middle should not give any advantage after the data has already been compressed in the n1 layer. But a lot of time will be spent calculating these obviously worse variants.
Update: I think it can be done in the fitness function as follows: enumerate n1 as a number of neurons and n2 as a percentage of n1, then round and multiply by 2 for maxout.
I did it myself; hopefully the computational power will now be spent more efficiently. A sketch of the idea follows below.
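A minimal R sketch of that re-parameterisation inside a fitness function; the argument names (n1, n2.pct), the minimum width and the placeholder return value are my assumptions, not the article's actual code.

n.inputs <- 10                                # as in the 10 - 20 - 100 - 2 example above

fitness <- function(n1, n2.pct) {
  n2 <- max(2, round(n1 * n2.pct / 100)) * 2  # n2 as a % of n1, rounded, *2 for maxout
  layers <- c(n.inputs, n1, n2, 2)            # 2 output neurons
  # ... here the DNN would be trained with this 'layers' vector and its score returned ...
  layers                                      # placeholder return for the sketch
}

fitness(n1 = 20, n2.pct = 30)                 # -> layers 10 20 12 2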
Greetings.
From my long experiments with the parameters I have noticed only a few peculiarities:
- often better results with fact = 2
- or when both hidden layers have the same activation function.
I did not limit the ratio of neurons in the hidden layers. Non-optimal ratios drop out of consideration very quickly. But you can test this idea. I might find time to check the optimisation with rgenoud.
Good luck
I made the next layer be specified as a percentage of the previous one. But I have no particular desire to run a comparison. Purely theoretically, I hope I am right that decompression inside the network will not give it new information after the compression in the previous layers.
I tried genetics with GA::ga. It ran the NN 200-300 times for the computation. That's a lot, and there was no improvement in the results.
Although for Bayesian optimisation I assume we need more passes as well: not 20-30, but maybe up to 100. It often happens that the best result is one of the 10 random starting variants, and over the next 10-20 passes the optimiser finds nothing better. Maybe with 100 passes there would be improvements...
As for fact = 2 and identical activation functions in both hidden layers - for me it varies. ReLU is often good.
Genetics didn't find the best options for me either.
For the Bayesian approach you need to play not only with the number of passes but also with the number of points. You have to look for a faster variant - it is very tedious to deal with.
Good luck
PS. Aren't you switching to TensorFlow? It is simply a different level.
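For reference, a minimal sketch of how the number of random starting points and of optimiser passes are set, assuming the rBayesianOptimization package is the one in use here; the fitness function below is a dummy stand-in, not the article's DNN training code.

library(rBayesianOptimization)

# Dummy fitness: BayesianOptimization() expects a list with Score (and optionally Pred).
fit.dummy <- function(n1, n2.pct) {
  list(Score = -(n1 - 35)^2 - (n2.pct - 50)^2, Pred = 0)
}

res <- BayesianOptimization(
  fit.dummy,
  bounds      = list(n1 = c(10L, 100L), n2.pct = c(10L, 100L)),
  init_points = 20,   # random starting points
  n_iter      = 80,   # optimiser passes after the random start (about 100 evaluations in total)
  verbose     = TRUE
)
res$Best_Par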