Machine learning in trading: theory, models, practice and algo-trading - page 693

 
I wrote this code into the script and ran it several times, getting slightly different results time after time...
 
Vizard_:

set.seed(1234)

What is it and where do I put it?

 
Mihail Marchukajtes:

that's more than half the battle... that's how it is....

Inputs are 90% of the job; the models themselves are a matter of technique. caret is full of this stuff, with the appropriate wrappers.
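For instance, one of caret's built-in filters (a minimal sketch, not from the original post; it assumes the same forexFeatures file and layout as the vtreat example below, with a two-class target in the last column):

library(caret)

forexFeatures <- read.csv2("Qwe.txt", dec=".")
#per-predictor ROC-based importance against a two-class target; higher is better
imp <- filterVarImp(x = forexFeatures[, -ncol(forexFeatures)],
                    y = as.factor(forexFeatures[, ncol(forexFeatures)]))
imp[order(-imp[, 1]), , drop = FALSE] #best predictors at the top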

 

Here's an example with vtreat as well.

Strictly speaking it does data preprocessing, but you can also use it to score each predictor against the target. What I don't like is that the package doesn't take predictor interactions into account; use this code only if scoring the predictors one at a time against the target is enough for you.

forexFeatures <- read.csv2("Qwe.txt", dec=".") #dec="." so decimal fractions are parsed correctly
forexFeatures <- forexFeatures[,-1] #drop the first column

library(vtreat)


#designTreatmentsC is only suitable for classification with two classes
treatmentsC <- designTreatmentsC(dframe = forexFeatures,
                                 varlist=colnames(forexFeatures)[-ncol(forexFeatures)], #names of the predictor columns (here: all but the last column)
                                 outcomename = colnames(forexFeatures)[ncol(forexFeatures)], #name of the target column (here: the last column)
                                 outcometarget = "1") #the text or number of one of the classes
#process and sort the result
treatmentsC_scores <- treatmentsC$scoreFrame[order(treatmentsC$scoreFrame$sig),]
treatmentsC_scores <- treatmentsC_scores[!duplicated(treatmentsC_scores$origName),]
treatmentsC_scores <- treatmentsC_scores[,c("origName","sig")]
treatmentsC_scores$is_good <- treatmentsC_scores$sig <= 1/nrow(forexFeatures)
treatmentsC_scores #print the result table. The best predictors are at the top. The lower the sig score the better. Ideally (column is_good==TRUE) sig should be less than 1/nrow(forexFeatures); anything above that is bad


#designTreatmentsN is suitable for regression or for more than two classes. If there are only two classes the C function is preferable; it also seems to remove correlated predictors along the way.
treatmentsN <- designTreatmentsN(dframe = forexFeatures,
                                 varlist=colnames(forexFeatures)[-ncol(forexFeatures)], #names of the predictor columns (here: all but the last column)
                                 outcomename = colnames(forexFeatures)[ncol(forexFeatures)]) #name of the target column (here: the last column)
#process and sort the result
treatmentsN_scores <- treatmentsN$scoreFrame[order(treatmentsN$scoreFrame$sig),]
treatmentsN_scores <- treatmentsN_scores[!duplicated(treatmentsN_scores$origName),]
treatmentsN_scores <- treatmentsN_scores[,c("origName","sig")]
treatmentsN_scores$is_good <- treatmentsN_scores$sig <= 1/nrow(forexFeatures)
treatmentsN_scores #print the result table. The best predictors are at the top. The lower the sig score the better. Ideally (column is_good==TRUE) sig should be less than 1/nrow(forexFeatures); anything above that is bad
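Since the package's primary purpose is preprocessing, the same treatment plan can then be applied to the data with prepare(); a sketch (the pruneSig threshold here just reuses the 1/nrow(forexFeatures) rule of thumb from above, it is not a package default):

#keep only variables whose sig passes the threshold and build the treated frame
treatedFeatures <- prepare(treatmentsC, forexFeatures, pruneSig = 1/nrow(forexFeatures))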

 
Mihail Marchukajtes:
What is it and where do I put it?

set.seed sets the initial state of the random number generator. If you set it to the same value (e.g. 1234) before running the code, then the code that follows will execute the same way every time.

read.csv2(... etc.
set.seed(1234)
Boruta(TargetProf... etc.)

But maybe the algorithm needs more iterations (maxRuns = 1000000, for example); it may not converge within the small default and stop halfway through, leaving some attributes still marked Tentative.
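A minimal sketch of such a call using Boruta's x/y interface (assuming, as in the vtreat example, that the target sits in the last column of forexFeatures):

library(Boruta)

forexFeatures <- read.csv2("Qwe.txt", dec=".")
set.seed(1234) #fix the RNG state right before the stochastic call
b <- Boruta(x = forexFeatures[, -ncol(forexFeatures)],
            y = as.factor(forexFeatures[, ncol(forexFeatures)]),
            maxRuns = 1000000) #far above the default of 100
print(b)
TentativeRoughFix(b) #force a decision on anything still left Tentative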

 

I've been thinking about regression....
Regression in financial markets is not a smooth function but rather a step function with a step of 1 pt (both for the target and for the forecast). If, for example, we limit ourselves to a movement of +/- 100 pt, then there is an analogy with classification into 200 classes: at the output we predict the most probable class, for example +22 pt.
Doesn't this mean that for good results the structure/complexity of the model (number of neurons) for regression should be 200 times larger? And if we increase the step to 5 pt, a factor of 40 would be a little more economical, at the cost of less accuracy.
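The class counts in that analogy are easy to check; a sketch with a hypothetical vector of movements in points:

target_pt <- c(-37, 22, 4, 95) #hypothetical movements, already clipped to +/- 100 pt
cls_1pt <- cut(target_pt, breaks = seq(-100, 100, by = 1)) #1-pt step
cls_5pt <- cut(target_pt, breaks = seq(-100, 100, by = 5)) #5-pt step
nlevels(cls_1pt) #200 classes
nlevels(cls_5pt) #40 classes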

 
I ran the file through vtreat, and I admit the result was not great. Only 4 inputs were selected. Thanks for the tips..... I'll keep tinkering....
 
Vizard_:

Now remember boxplot, do something with the inputs, and run it again.
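A quick way to take that look in R (a minimal sketch, assuming the forexFeatures data frame from the code above, with the target in the last column):

forexFeatures <- read.csv2("Qwe.txt", dec=".")
#one box per predictor; wildly different scales or heavy outlier tails stand out immediately
boxplot(forexFeatures[, -ncol(forexFeatures)], las = 2, cex.axis = 0.6)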

The set.seed parameter is the same in both cases.

What do you want me to do about the inputs?

Yeah..... I admit I expected better from my inputs. In any case, I thought there would be more important ones, certainly not four out of 100. With so few inputs the models turn out very small, although I've noticed in practice that the simpler the model, the better it works. And looking at the obtained model and running my tests, I can see that this little shit is pretty damn good..... It's too early to draw conclusions, I need more tests. Keep digging....

 
Everywhere I've written
forexFeatures <- read.csv2("Qwe.txt")

it actually needs to be
forexFeatures <- read.csv2("Qwe.txt", dec=".")

I apologize, I didn't notice the format in your file. I will correct the old posts. Re-run the code; the results should be better, since all the numbers with decimal fractions were not being parsed correctly before.
 
Mihail Marchukajtes:

What should I do with the inputs?

For example, jPrediction scales the data to the interval [-1;1] and trains on those numbers. You can scale to the same interval in R as well, before evaluating the inputs.

forexFeatures <- read.csv2("Qwe.txt", dec=".")

#min-max scale every predictor column (all but the last, the target) to [-1;1]
for(i in 1:(ncol(forexFeatures)-1)){
  forexFeatures[,i] <- (forexFeatures[,i] - min(forexFeatures[,i]))/(max(forexFeatures[,i])-min(forexFeatures[,i]))*2-1
}

vtreat, Boruta, etc. ...

Tree-based estimation methods probably won't change their result; forests don't really care what interval the data comes in, but it's better to check. vtreat isn't picky about the interval either.


But in general he is talking about a non-linear transformation of the inputs even before feeding them into the neural network. Neural networks are very sensitive to their inputs, and if you process the input data in some special way, the results may improve. For example, I've heard of this trick: pass the inputs through a sigmoid.
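A minimal sketch of that trick (my own illustration, not from the post; it squashes each already-scaled predictor through the logistic sigmoid and maps it back to [-1;1]):

sigmoid <- function(x) 1/(1 + exp(-x)) #logistic sigmoid, maps to (0;1)
for(i in 1:(ncol(forexFeatures)-1)){
  forexFeatures[,i] <- sigmoid(forexFeatures[,i])*2 - 1 #non-linear squashing, back to [-1;1]
}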
