Machine learning in trading: theory, models, practice and algo-trading - page 394

 
elibrarius:
If you start something that runs for a month, use an uninterruptible power supply for your computer; once, two weeks into a calculation, my power got cut off)))
And don't expect much from a GPU version; rewriting the code looks like the longer job to me, and if the author hasn't done it, it's unlikely anyone else will see that task through to the end.

Well, the author parallelized everything; now you just need to run it. I ran it for 3 days and got a model with 9 inputs, which is honestly a record. I don't really want to optimize for that long, but, as they say, the market demands it. So I'm looking for computing power: if anyone is able to run the dataset through the optimizer, ideally on 20-30 cores, I would be very grateful.
 

Mihail Marchukajtes:

Training takes days, even weeks

Apparently your algorithm is not optimal. On such small datasets you can safely use brute-force algorithms such as kNN, which are quasi-optimal; if an algorithm runs slower than kNN, it is probably a bad ML algorithm, or poorly configured. On a dataset like this, the whole cycle of training and running over the entire set should take no more than a second.
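
For illustration, a minimal sketch of that kNN baseline in R, assuming a few hundred rows like the dataset discussed (the data below is synthetic, just to show the timing):

library(class)  # provides knn()

set.seed(0)
n <- 450
x <- matrix(rnorm(n * 10), ncol = 10)             # 10 numeric predictors
y <- factor(ifelse(rowSums(x[, 1:3]) > 0, 1, 0))  # toy binary target

train <- sample(n, 0.8 * n)
system.time(
  pred <- knn(train = x[train, ], test = x[-train, ], cl = y[train], k = 5)
)
mean(pred == y[-train])  # holdout accuracy; the whole run finishes well under a second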
 
pantural:
Apparently your algorithm is not optimal. On such small datasets you can safely use brute-force algorithms such as kNN, which are quasi-optimal; if an algorithm runs slower than kNN, it is probably a bad ML algorithm, or poorly configured. On a dataset like this, the whole cycle of training and running over the entire set should take no more than a second.

I explained above: 100 splits, and each split is trained for 1000 epochs, and so on. You're fixated on a single training run of one network, while the point of the optimizer is to work the dataset over so that no questions remain about its suitability. It turns this file upside down, figuratively speaking, and you keep comparing that to a single training of a single network. IMHO, it is essentially an AI system in which, besides training a network, there are all sorts of optimization and preprocessing steps, and the training itself runs hundreds of times. Just saying...
 
Mihail Marchukajtes:

I explained above: 100 splits, and each split is trained for 1000 epochs, and so on. You're fixated on a single training run of one network, while the point of the optimizer is to work the dataset over so that no questions remain about its suitability. It turns this file upside down, figuratively speaking, and you keep comparing that to a single training of a single network. IMHO, it is essentially an AI system in which, besides training a network, there are all sorts of optimization and preprocessing steps, and the training itself runs hundreds of times. Just saying...
Yes, I am against all this training, but your contraption is exactly what a dunce would invent; even I can see that.
 
elibrarius:
MLP guesses right 95% of the time... I don't think the wheel you're reinventing is the right one) No offense.

You have a mistake.
The very first column in the table is the row number. You can't use that column for prediction, yet for some reason jPrediction requires it.

The target is arranged so that the first half of the rows is class 0 and the second half is class 1. So the network simply memorizes: if the row number is less than 228, it's class 0, otherwise class 1.
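
A minimal sketch of that leak, assuming 456 rows sorted by class as described (synthetic data, for illustration only):

n <- 456
rowNumber <- 1:n
target <- c(rep(0, n / 2), rep(1, n / 2))  # rows 1..228 are class 0, rows 229..456 are class 1

# a single threshold on the row number "classifies" perfectly:
pred <- ifelse(rowNumber <= 228, 0, 1)
mean(pred == target)  # 1.0 in-sample, although the column carries no real information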

 
Dr. Trader:

You have a mistake.
The very first column in the table is the row number. You can't use that column for prediction, yet for some reason jPrediction requires it.

The target is arranged so that the first half of the rows is class 0 and the second half is class 1. So the network simply memorizes: if the row number is less than 228, it's class 0, otherwise class 1.

And by the way, yes. I hadn't noticed that it's just the row number.

Without that column: Inputs to keep: 4,50,53,59,61,64,92,98,101,104,

Average error on the training set (60.0%) = 0.269 (26.9%) nLearns=2 NGrad=7376 NHess=0 NCholesky=0 codResp=2
Average error on the validation set (20.0%) = 0.864 (86.4%) nLearns=2 NGrad=7376 NHess=0 NCholesky=0 codResp=2
Average error on the test set (20.0%) = 0.885 (88.5%) nLearns=2 NGrad=7376 NHess=0 NCholesky=0 codResp=2

Clearly overfitting: 26.9% error on training versus 86-88% on validation and test. So the inputs need to be screened some other way.

Maybe screen by the weights of the inputs? Like you did for the problem in the first post of the thread...

I'm trying to rewrite the R script you attached so that it determines the column names and their number itself... but I don't know enough R.

 
elibrarius:

I'm trying to rewrite the R script you attached so that it determines the column names and their number itself... but I don't know enough R.


I was only starting to learn R back then; the script was almost entirely generated by rattle (a visual data-mining environment for R), which is why it's so convoluted and written to handle every case.


This is the part:

crs$input <- c("input_1", "input_2", "input_3", "input_4",
     "input_5", "input_6", "input_7", "input_8",
     "input_9", "input_10", "input_11", "input_12",
     "input_13", "input_14", "input_15", "input_16",
     "input_17", "input_18", "input_19", "input_20")

crs$numeric <- c("input_1", "input_2", "input_3", "input_4",
     "input_5", "input_6", "input_7", "input_8",
     "input_9", "input_10", "input_11", "input_12",
     "input_13", "input_14", "input_15", "input_16",
     "input_17", "input_18", "input_19", "input_20")

should be changed to...

crs$input <- colnames(crs$dataset)[-ncol(crs$dataset)]

crs$numeric <- crs$input

And it should be ok.
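
A quick check of those two replacement lines (my toy example; it assumes, as in the rattle-generated script, that the target is the last column of crs$dataset):

crs <- list(dataset = data.frame(input_1 = 1:3, input_2 = 4:6, target = c(0, 1, 0)))
crs$input <- colnames(crs$dataset)[-ncol(crs$dataset)]  # every column except the last
crs$numeric <- crs$input
crs$input  # "input_1" "input_2"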


In general, it's a bad approach; you shouldn't determine the importance of inputs that way. For some reason it worked that one time, but it never helped me again.

 

It is better to determine the importance of predictors like this:

library(vtreat)

sourceTable <- read.table("BuySell.csv", sep=";", header = TRUE, stringsAsFactors = FALSE)

# This line applies only to this particular file: in this csv the first
# column and the first row are filled in for a specific model and are not
# needed here, so remove them.
# For ordinary csv files this command is not needed.
sourceTable <- sourceTable[-1,-1]

# number of columns
sourceTable_ncol <- ncol(sourceTable)

# Scoring for classification, two classes only.
# outcometarget must equal the value of one of the classes.
# Choose either this designTreatmentsC, or designTreatmentsN, or designTreatmentsZ (commented out below).
# Mutual correlation of predictors is taken into account only in designTreatmentsC;
# duplicated or similar predictors will get a lower score.
set.seed(0)
treats <- designTreatmentsC(dframe = sourceTable,
                            varlist = colnames(sourceTable)[-sourceTable_ncol],
                            outcomename = colnames(sourceTable)[sourceTable_ncol],
                            outcometarget = 1,
                            verbose = FALSE
)

# # Scoring for regression, or when there are more than two classes
# sourceTable[,sourceTable_ncol] <- as.numeric(sourceTable[,sourceTable_ncol])
# set.seed(0)
# treats <- designTreatmentsN(dframe = sourceTable,
#                             varlist = colnames(sourceTable)[-sourceTable_ncol],
#                             outcomename = colnames(sourceTable)[sourceTable_ncol],
#                             verbose = FALSE
# )

# # Scoring of predictors without regard to the target.
# set.seed(0)
# treats <- designTreatmentsZ(dframe = sourceTable,
#                             varlist = colnames(sourceTable)[-sourceTable_ncol],
#                             verbose = FALSE
# )

# table with only the column name and its importance score
resultTable <- treats$scoreFrame[,c("varName", "sig")]

# sorting
resultTable <- resultTable[order(resultTable$sig),]

# As a rule of thumb, a predictor's score (sig) should be less than 1/<total number of predictors>.
# The lower the score, the better.
resultTable$testPassed <- resultTable$sig < 1/(sourceTable_ncol-1)

# For building a model and predicting, it is better to use only the predictors with testPassed == TRUE
resultTable
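
As a follow-up (my addition, not part of the original script), vtreat's prepare() can then build the treated frame restricted to the predictors that passed; varRestriction takes the list of variable names to keep:

passedVars <- resultTable$varName[resultTable$testPassed]
treatedTable <- prepare(treats, sourceTable, varRestriction = passedVars)
head(treatedTable)  # only the passing predictors plus the outcome column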
 

The results of the importance assessment are below. The higher a predictor sits in the table, the better. Only VVolum6, VDel1, VVolum9 and VQST10 passed the test.

In rattle you can build 6 models at once on these 4 predictors; SVM shows about 55% accuracy on the validation and test data. Not bad.

    varName sig testPassed
182 VVolum6_catB 3.220305e-06 TRUE
28 VDel1_catB 1.930275e-03 TRUE
186 VVolum9_catB 5.946373e-03 TRUE
143 VQST10_catB 8.458616e-03 TRUE
126 VQST_catB 1.843740e-02 FALSE
23 Del11_catP 2.315340e-02 FALSE
147 Volum_catP 2.331145e-02 FALSE
24 Del11_catB 2.429723e-02 FALSE
154 Volum3_catB 2.985041e-02 FALSE
12 Del5_catP 3.689965e-02 FALSE
120 QST9_catB 4.092966e-02 FALSE
130 VQST2_catB 4.136235e-02 FALSE
163 Volum9_catP 4.299684e-02 FALSE
109 QST2_catB 4.311742e-02 FALSE
32 VDel3_catB 4.704981e-02 FALSE
11 Del5_lev_x.1 4.725332e-02 FALSE
19 Del9_catB 5.316355e-02 FALSE
13 Del5_catB 5.472078e-02 FALSE
178 VVolum4_catB 5.705614e-02 FALSE
191 VVolum11_catB 5.749245e-02 FALSE
148 Volum_catB 6.281945e-02 FALSE
181 VVolum6_catP 6.534487e-02 FALSE
31 VDel3_catP 6.911261e-02 FALSE
74 VST11_catB 7.709038e-02 FALSE
134 VQST4_catB 9.536026e-02 FALSE
141 VQST9_catB 9.536026e-02 FALSE
162 Volum7_catB 9.589108e-02 FALSE
107 QST1_catB 9.589108e-02 FALSE
2 Del_catB 1.049703e-01 FALSE
151 Volum2_catP 1.071203e-01 FALSE
60 ST11_catB 1.076877e-01 FALSE
43 VDel10_catP 1.201338e-01 FALSE
184 VVolum7_catB 1.286891e-01 FALSE
121 QST10_catP 1.464880e-01 FALSE
38 VDel6_catB 1.479268e-01 FALSE
173 VVolum2_catP 1.663695e-01 FALSE
8 Del3_catB 1.703652e-01 FALSE
10 Del4_catB 1.755150e-01 FALSE
30 VDel2_catB 1.781568e-01 FALSE
37 VDel6_catP 1.797087e-01 FALSE
1 Del_catP 1.995316e-01 FALSE
112 QST4_catP 2.104902e-01 FALSE
15 Del6_catB 2.132517e-01 FALSE
27 VDel1_catP 2.313270e-01 FALSE
41 VDel9_catP 2.316597e-01 FALSE
100 VAD11_catP 2.320692e-01 FALSE
144 VQST11_lev_x.100 2.374690e-01 FALSE
123 QST11_catP 2.576971e-01 FALSE
145 VQST11_catP 2.626389e-01 FALSE
104 QST_catP 2.716664e-01 FALSE
160 Volum6_catB 2.776463e-01 FALSE
115 QST6_catP 3.034207e-01 FALSE
137 VQST6_catB 3.060767e-01 FALSE
102 QST_lev_x..100 3.061104e-01 FALSE
36 VDel5_catB 3.149911e-01 FALSE
99 VAD11_lev_x.0 3.340276e-01 FALSE
17 Del7_catB 3.431346e-01 FALSE
16 Del7_catP 3.819094e-01 FALSE
3 Del1_catP 3.912432e-01 FALSE
152 Volum2_catB 3.938369e-01 FALSE
44 VDel10_catB 3.965567e-01 FALSE
5 Del2_catP 4.363645e-01 FALSE
20 Del10_catP 4.409282e-01 FALSE
171 VVolum1_catP 4.550495e-01 FALSE
169 VVolum_catP 4.682515e-01 FALSE
46 VDel11_catP 4.693330e-01 FALSE
86 AD11_catP 4.742976e-01 FALSE
187 VVolum10_catP 4.963890e-01 FALSE
132 VQST3_catB 5.291401e-01 FALSE
14 Del6_catP 5.310502e-01 FALSE
124 QST11_catB 5.355186e-01 FALSE
177 VVolum4_catP 5.542335e-01 FALSE
150 Volum1_catB 5.552986e-01 FALSE
39 VDel7_catP 5.589613e-01 FALSE
185 VVolum9_catP 5.589901e-01 FALSE
59 ST11_catP 5.669251e-01 FALSE
188 VVolum10_catB 5.680089e-01 FALSE
21 Del10_catB 5.706515e-01 FALSE
9 Del4_catP 5.708557e-01 FALSE
142 VQST10_catP 5.725309e-01 FALSE
113 QST4_catB 5.856434e-01 FALSE
119 QST9_catP 5.922916e-01 FALSE
131 VQST3_catP 6.033950e-01 FALSE
168 Volum11_catB 6.156530e-01 FALSE
155 Volum4_catP 6.196455e-01 FALSE
170 VVolum_catB 6.244269e-01 FALSE
180 VVolum5_catB 6.279081e-01 FALSE
87 AD11_catB 6.372863e-01 FALSE
153 Volum3_catP 6.641713e-01 FALSE
73 VST11_catP 6.701117e-01 FALSE
172 VVolum1_catB 6.707140e-01 FALSE
183 VVolum7_catP 6.771533e-01 FALSE
55 ST6_catB 6.780044e-01 FALSE
42 VDel9_catB 6.925700e-01 FALSE
167 Volum11_catP 6.973599e-01 FALSE
179 VVolum5_catP 7.093678e-01 FALSE
125 VQST_catP 7.189573e-01 FALSE
146 VQST11_catB 7.195859e-01 FALSE
101 VAD11_catB 7.250369e-01 FALSE
25 VDel_catP 7.310211e-01 FALSE
108 QST2_catP 7.426980e-01 FALSE
29 VDel2_catP 7.486648e-01 FALSE
136 VQST6_catP 7.529104e-01 FALSE
103 QST_lev_x.0 7.600202e-01 FALSE
22 Del11_lev_x.0 7.600202e-01 FALSE
47 VDel11_catB 7.619000e-01 FALSE
140 VQST9_catP 7.684919e-01 FALSE
164 Volum9_catB 7.743767e-01 FALSE
4 Del1_catB 7.796789e-01 FALSE
158 Volum5_catB 7.804397e-01 FALSE
117 QST7_catP 7.843659e-01 FALSE
26 VDel_catB 7.904299e-01 FALSE
166 Volum10_catB 7.936121e-01 FALSE
165 Volum10_catP 8.017445e-01 FALSE
6 Del2_catB 8.104867e-01 FALSE
190 VVolum11_catP 8.133908e-01 FALSE
45 VDel11_lev_x.0 8.231377e-01 FALSE
189 VVolum11_lev_x.0 8.231377e-01 FALSE
105 QST_catB 8.431046e-01 FALSE
174 VVolum2_catB 8.506238e-01 FALSE
81 AD6_catP 8.552222e-01 FALSE
94 VAD6_catP 8.552222e-01 FALSE
110 QST3_catP 8.560370e-01 FALSE
35 VDel5_catP 8.633955e-01 FALSE
122 QST10_catB 8.651814e-01 FALSE
18 Del9_catP 8.816989e-01 FALSE
34 VDel4_catB 8.909886e-01 FALSE
176 VVolum3_catB 8.911481e-01 FALSE
159 Volum6_catP 9.086195e-01 FALSE
106 QST1_catP 9.218420e-01 FALSE
133 VQST4_catP 9.218420e-01 FALSE
70 VST9_catP 9.223350e-01 FALSE
129 VQST2_catP 9.276503e-01 FALSE
54 ST6_catP 9.371128e-01 FALSE
161 Volum7_catP 9.634046e-01 FALSE
138 VQST7_catP 9.991105e-01 FALSE
116 QST6_catB 9.992413e-01 FALSE
7 Del3_catP 9.993376e-01 FALSE
33 VDel4_catP 9.994999e-01 FALSE
40 VDel7_catB 9.995014e-01 FALSE
157 Volum5_catP 9.995728e-01 FALSE
156 Volum4_catB 9.995799e-01 FALSE
118 QST7_catB 9.995921e-01 FALSE
139 VQST7_catB 9.995937e-01 FALSE
175 VVolum3_catP 9.996133e-01 FALSE
149 Volum1_catP 9.996479e-01 FALSE
48 ST_catB 1.000000e+00 FALSE
49 ST1_catB 1.000000e+00 FALSE
50 ST2_catB 1.000000e+00 FALSE
51 ST3_catB 1.000000e+00 FALSE
52 ST4_catB 1.000000e+00 FALSE
53 ST5_catB 1.000000e+00 FALSE
56 ST7_catB 1.000000e+00 FALSE
57 ST9_catB 1.000000e+00 FALSE
58 ST10_catB 1.000000e+00 FALSE
61 VST_catB 1.000000e+00 FALSE
62 VST1_catB 1.000000e+00 FALSE
63 VST2_catB 1.000000e+00 FALSE
64 VST3_catB 1.000000e+00 FALSE
65 VST4_catB 1.000000e+00 FALSE
66 VST5_catB 1.000000e+00 FALSE
67 VST6_catP 1.000000e+00 FALSE
68 VST6_catB 1.000000e+00 FALSE
69 VST7_catB 1.000000e+00 FALSE
71 VST9_catB 1.000000e+00 FALSE
72 VST10_catB 1.000000e+00 FALSE
75 AD_catB 1.000000e+00 FALSE
76 AD1_catB 1.000000e+00 FALSE
77 AD2_catB 1.000000e+00 FALSE
78 AD3_catB 1.000000e+00 FALSE
79 AD4_catB 1.000000e+00 FALSE
80 AD5_catB 1.000000e+00 FALSE
82 AD6_catB 1.000000e+00 FALSE
83 AD7_catB 1.000000e+00 FALSE
84 AD9_catB 1.000000e+00 FALSE
85 AD10_catB 1.000000e+00 FALSE
88 VAD_catB 1.000000e+00 FALSE
89 VAD1_catB 1.000000e+00 FALSE
90 VAD2_catB 1.000000e+00 FALSE
91 VAD3_catB 1.000000e+00 FALSE
92 VAD4_catB 1.000000e+00 FALSE
93 VAD5_catB 1.000000e+00 FALSE
95 VAD6_catB 1.000000e+00 FALSE
96 VAD7_catB 1.000000e+00 FALSE
97 VAD9_catB 1.000000e+00 FALSE
98 VAD10_catB 1.000000e+00 FALSE
111 QST3_catB 1.000000e+00 FALSE
114 QST5_catB 1.000000e+00 FALSE
127 VQST1_catP 1.000000e+00 FALSE
128 VQST1_catB 1.000000e+00 FALSE
135 VQST5_catB 1.000000e+00 FALSE
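
To double-check that rattle result in plain R, a minimal sketch, assuming the treated frame from the prepare() call above (rattle's own SVM wraps kernlab; e1071::svm is fine for a quick check):

library(e1071)

set.seed(0)
outcomeName <- colnames(sourceTable)[sourceTable_ncol]  # target column name
form <- as.formula(paste0("as.factor(", outcomeName, ") ~ ."))

trainIdx <- sample(nrow(treatedTable), 0.6 * nrow(treatedTable))
fit <- svm(form, data = treatedTable[trainIdx, ])
pred <- predict(fit, treatedTable[-trainIdx, ])
actual <- as.factor(treatedTable[[outcomeName]])[-trainIdx]
mean(pred == actual)  # the post reports about 55% on held-out data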
 
elibrarius:
MLP guesses right 95% of the time... I don't think the wheel you're reinventing is the right one) No offense.
I'm building my own bicycle too, but on top of MLP, proven over decades (even though they say it's obsolete and you need to work with something cooler).


And try the decision trees in alglib too; they train faster and their quality is better than MLP's. Deep learning is also fast, but it's not in alglib.

The main thing is the speed/quality ratio. What's the point of waiting a week, a day, or even an hour... you'll never find the optimal combination that way.) A model should take a few seconds to train; then you can use genetics to select parameters or predictors automatically. That's true AI; otherwise it's rubbish)
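
The poster means alglib's trees in MQL5; as an R analogue (my sketch, with synthetic data), rpart fits a tree on a dataset of this size in well under a second, fast enough to sit inside a genetic search over parameters or predictors:

library(rpart)

set.seed(0)
df <- data.frame(matrix(rnorm(450 * 10), ncol = 10))  # columns X1..X10
df$y <- factor(ifelse(df$X1 + df$X2 > 0, 1, 0))       # toy binary target

system.time(fit <- rpart(y ~ ., data = df, method = "class"))  # a few milliseconds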
