Machine learning in trading: theory, models, practice and algo-trading - page 1783
What are the current states? If it is about clusters, you just need to check the statistics on the new data. If they are the same, you can build a trading system.
The subject area of clusters and statistics should be clearly understood. If they are the same on all instruments from '70 to '20, then it is possible )))
The problem is the size of the data: I will not even be able to create the features, so it will not even get to the training...
Make a sample of 50k. Let it be small, let it be not serious, let it be easier to overfit, let... The aim is not a production robot, but simply to reduce the error by joint creative work; the knowledge gained can then be transferred to any instrument and market. 50,000 is quite enough to see whether the features mean something.
All right, I'll make a small sample.
If you don't understand it, you don't have to write it that way. Why shift the whole OHLC? Nobody does that: you just shift the ZZ by one step, as if looking one step into the future for training, and that's all. Have you read at least one of Vladimir Perervenko's articles about deep learning? Please do. It is very inconvenient when well-established, optimal ways of handling the data already exist and everyone is used to them, and someone tries to do the same thing in their own way, differently; it is kind of pointless and annoying, and a cause of many errors for people who try to work with such author's data.
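The one-step shift described here can be sketched as follows. This is an illustrative Python sketch (the thread itself works in R); `zz` and the toy feature rows are assumptions, standing in for a ZigZag direction series where `zz[i]` is the ZZ direction on bar i:

```python
def make_target(zz):
    """Shift the ZigZag direction by one step: the target for bar i
    is the ZZ direction of the next bar (a one-step look into the future).
    The last bar has no 'next' value, so it drops out of training."""
    return zz[1:]

def align_features(features):
    """Drop the last feature row so features and target stay aligned."""
    return features[:-1]

# Toy example: ZZ direction per bar (1 = up leg, 0 = down leg)
zz = [1, 1, 0, 0, 1]
features = [[10.0], [10.5], [10.2], [9.8], [10.1]]

target = make_target(zz)      # ZZ of the *next* bar for each row
X = align_features(features)  # 4 rows, aligned with the 4 targets
```

Only the target is shifted; the OHLC-derived features stay on their own bars, which is the whole point of the complaint above.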
I have read his articles, but I don't understand R code, so I can't fully understand everything there.
So I'll ask you, since you understand this question. The classification takes place on the zero bar, when only the Open price is known; as I understand it, you do not use the Open of the zero bar, only information from bar 1 and earlier? In effect, the target determines the ZZ direction on the zero bar? It is not essential that it is the direction of the next bar that was predicted, is it? Otherwise I will have to do a lot of rework again, which is tiresome.
I just have a ready-made solution for taking the data and applying the model, not for calculating the model.
If after all this you still want to do something, I have the following requirements:
1) the data: 50-60k rows, no more; preferably one file, with the agreement that the last n candles will be the test
2) the data preferably without splicing, since then not only the latest prices but also support and resistance can be considered, which is impossible on spliced contracts
3) the target should already be included in the data
4) data in the format date,time,o,h,l,c,target
Or should I make a dataset ?
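The agreed format (date,time,o,h,l,c,target in a single file, with the last n candles held out as the test) could be handled with a sketch like this; the rows and the choice of n are toy assumptions, not real quotes:

```python
import csv
import io

# A few toy rows in the agreed date,time,o,h,l,c,target format
raw = """date,time,o,h,l,c,target
2020.01.09,10:00,100,101,99,100.5,1
2020.01.09,10:05,100.5,102,100,101.5,1
2020.01.09,10:10,101.5,101.6,100.2,100.4,0
2020.01.09,10:15,100.4,100.9,99.8,100.0,0
"""

rows = list(csv.DictReader(io.StringIO(raw)))

# Chronological split: the last n candles are the test set,
# everything before them is training data.
n_test = 2
train, test = rows[:-n_test], rows[-n_test:]
```

Keeping one file and slicing off the tail guarantees both sides use exactly the same split, which is what the sample-comparability requirement below is about.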
You can demand from those who have made a commitment - i.e. not from me :) Let's agree instead.
1. Let there be 50k for training and another 50k for the test (a sample outside of training).
2. ok.
3. okay.
4. okay.
Added: I understood that there are not enough plain bars in the Si-3.20 futures (22,793) and you don't want splicing.
Added a sample on Sber - I got an accuracy of 67%.
The classification is done on the last bar whose Close is known (i.e. a complete OHLC candle), and we predict the ZZ sign of the future candle. Why take into account the candle on which only the Open is known, I cannot understand - what is the advantage, besides the extra complexity both in understanding and in implementation? And if you realize that Open[i] is almost always equal to Close[i-1], then this approach leaves me with just one big question mark.
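The claim that Open[i] almost always equals Close[i-1] is easy to check on any bar series; a quick sketch with toy numbers (one deliberate gap), not real quotes:

```python
def open_equals_prev_close(opens, closes, tol=1e-9):
    """Fraction of bars whose Open matches the previous bar's Close
    (within a tolerance), i.e. bars that open without a gap."""
    hits = sum(
        abs(o - c_prev) <= tol
        for o, c_prev in zip(opens[1:], closes[:-1])
    )
    return hits / (len(opens) - 1)

# Toy series: one small gap between bar 2 and bar 3
opens  = [100.0, 100.5, 101.5, 100.6]
closes = [100.5, 101.5, 100.4, 100.0]

share = open_equals_prev_close(opens, closes)  # 2 of 3 transitions match
```

On liquid instruments and small timeframes this share is typically close to 1, which is why using the zero bar's Open adds little information over the previous bar's Close.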
I don't demand anything from you personally, come on )) The requirement concerns the sample: the sample must be the same for everyone, so that results can be compared, right? I think that's obvious.
And thank you for listening to me )
I probably named the numbers 50-60k spontaneously - why not double them? )))
)))
And thank you for uploading one file instead of two! )) I tried it first as-is, out of the box, so to speak...
Only the last n values are involved in the prediction, same as in your setup, because the error is the same.
There are 217 features; I know some of them are redundant, but I'm too lazy to clean them out.
Trained and validated on the file OHLC_Train.csv, 54,147 observations in total.
Trained the model on the first 10k observations (8k, to be exact: the first 2k were not used because the indicators were calculated on them).
Tested the model on the remaining 44k of data, so I think there is no overfitting; the test is 5.5 times the train: 44/8 = 5.5.
Of the models I tried boosting and random forest; boosting did not impress me, so I settled on the forest.
There is a strong class imbalance in the training set, but I am too lazy to do anything about it.
The final model on the current features is a forest with 200 trees.
On the train...
On the test:
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 12449  5303
         1  9260 17135

               Accuracy : 0.6701
                 95% CI : (0.6657, 0.6745)
    No Information Rate : 0.5083
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3381
 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.5734
            Specificity : 0.7637
         Pos Pred Value : 0.7013
         Neg Pred Value : 0.6492
             Prevalence : 0.4917
         Detection Rate : 0.2820
   Detection Prevalence : 0.4021
      Balanced Accuracy : 0.6686

       'Positive' Class : 0
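The caret statistics above follow directly from the four cells of the matrix; a pure-Python sketch (no R) that recomputes the headline numbers from those cells:

```python
# Cells of the confusion matrix above ('Positive' class is 0):
#                        Reference 0  Reference 1
pred0_ref0, pred0_ref1 = 12449,       5303
pred1_ref0, pred1_ref1 = 9260,        17135

n = pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1

accuracy    = (pred0_ref0 + pred1_ref1) / n
sensitivity = pred0_ref0 / (pred0_ref0 + pred1_ref0)  # recall of class 0
specificity = pred1_ref1 / (pred0_ref1 + pred1_ref1)  # recall of class 1
balanced    = (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement vs chance agreement from the marginals
p_chance = ((pred0_ref0 + pred0_ref1) * (pred0_ref0 + pred1_ref0)
            + (pred1_ref0 + pred1_ref1) * (pred0_ref1 + pred1_ref1)) / n ** 2
kappa = (accuracy - p_chance) / (1 - p_chance)
```

This reproduces Accuracy 0.6701, Sensitivity 0.5734, Specificity 0.7637, Balanced Accuracy 0.6686 and Kappa 0.3381 from the output above, so the numbers are internally consistent.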
As you can see, the results are identical to yours, and millions of data points are not needed: 50k is quite enough to find a pattern, if there is one at all.
So we got the same results; this is our starting point, and now this error has to be improved.
)) That's funny ))
I deleted all the so-called technical analysis indicators.
There are now 86 features, not 217 as in the example above.
And the quality of the model only improved )
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0 12769  5597
         1  8940 16841

               Accuracy : 0.6707
                 95% CI : (0.6663, 0.6751)
    No Information Rate : 0.5083
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3396
 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.5882
            Specificity : 0.7506
         Pos Pred Value : 0.6953
         Neg Pred Value : 0.6532
             Prevalence : 0.4917
         Detection Rate : 0.2892
   Detection Prevalence : 0.4160
      Balanced Accuracy : 0.6694

       'Positive' Class : 0
You can't understand it because you have the data in R: in the terminal you cannot know when the OHLC of the current bar has finished forming, which is why a reliable OHLC is available only from the first bar, not the zero bar. And the Open of the zero bar is the newest data in time - especially relevant for large timeframes, because I have the same class of predictors in the sample, but applied to different timeframes.
I've split the sample into two files: the first is for any elaborate attempts at training, the second for checking the results of that training.
Don't you have a way to save the model and test it on new data? If so, please check it; I gave the result on the OHLC_Exam.csv sample.
Could you send back these two files, also separated, but with your predictors and the column with the classification result added?
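Saving a fitted model and re-applying it to new data is a standard round-trip (in the thread's R world this would be saveRDS/readRDS on the model object). A minimal Python sketch of the idea, using pickle and a toy majority-class "model" invented purely for illustration:

```python
import pickle

class MajorityModel:
    """Toy stand-in for a trained classifier: always predicts the
    class that was most frequent in the training target."""
    def fit(self, y):
        self.majority = max(set(y), key=y.count)
        return self

    def predict(self, n_rows):
        return [self.majority] * n_rows

model = MajorityModel().fit([0, 1, 1, 1, 0])

# Persist the fitted model, then restore it as if in a new session
blob = pickle.dumps(model)
restored = pickle.loads(blob)

preds = restored.predict(3)  # identical behaviour after the round-trip
```

The point is that the restored object carries its learned state (here, the majority class), so it can score a fresh file such as the exam sample without retraining.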
Regarding the overfitting, or the lack of it:
In my opinion, it's clear overfitting.
Yes ... Everything is sadder on the new data (((
Confusion Matrix and Statistics

          Reference
Prediction     0     1
         0  9215  5517
         1  3654  7787

               Accuracy : 0.6496
                 95% CI : (0.6438, 0.6554)
    No Information Rate : 0.5083
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3007
 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.7161
            Specificity : 0.5853
         Pos Pred Value : 0.6255
         Neg Pred Value : 0.6806
             Prevalence : 0.4917
         Detection Rate : 0.3521
   Detection Prevalence : 0.5629
      Balanced Accuracy : 0.6507

       'Positive' Class : 0
Here are the files. Do NOT use the first 2k lines in the train,
and in the test, the first 100 lines.
UPD: the files do not fit here; contact me personally and I'll send them by mail.
There are no files in the attachment.
I changed the split of the sample into training and validation: for validation I took every 5th row, and got a funny graph.
On the OHLC_Exam.csv sample the Accuracy is 0.63.
Along the X axis, each new tree decreases the result, indicating overfitting due to an insufficient number of examples in the sample.
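Taking every 5th row for validation is a one-line slice; a sketch over toy row indices (whether counting starts at the 1st or the 5th row is a detail of the assumption):

```python
rows = list(range(20))  # stand-in for the sample's row indices

valid = rows[4::5]                              # every 5th row -> validation
valid_set = set(valid)
train = [r for r in rows if r not in valid_set]  # the rest -> training
```

Note that for bar data this interleaved split puts validation rows between nearly identical neighbouring training bars, which tends to make validation look better than a truly out-of-sample file like OHLC_Exam.csv.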
Compress the file with a zip.
Yes, yes, our models are overfitted...
Here is a link to download the files; even the compressed file does not fit on the forum:
https://dropmefiles.com.ua/56CDZB
Try the model on my features; I wonder what the accuracy will be.