Machine learning in trading: theory, models, practice and algo-trading - page 21
I tried Y-scale too, R^2 in both cases (with and without Y-scale) came out the same (even though different packages are used in these cases!).
I understand that Y-scale can give an equally good result with fewer principal components. But if the result is still unsatisfactory even with all components (as it is for me now), then there is no difference. This way works faster, which matters more to me right now. But I haven't proved, in theory or in practice, that this method is suitable for selecting predictors... At first I had the idea of building a principal component model on all predictors and selecting predictors by looking at the component coefficients. But then I noticed that the model's R^2 drops when garbage is added. It makes sense to try different sets of predictors and look for those with a higher R^2, but so far this is just a theory.
I regularly make the following suggestion here: if you send me your set, we will compare my results with yours.
For me, the ideal is .RData: a data frame in which the target is binary and the predictors are preferably real numbers.
I used to train a forest and return the error on a validation sample. In principle it worked - if the forest overfits even a little, the error immediately tends to 50%.
Now I use GetPCrsquared(), the code above. I also have your example from feature_selector_modeller.txt, but I still have to work out which fragment of code I need from it, so I haven't tested it on my data yet.
This is the part you need from there:
library(infotheo) # measured in nats, converted to bits
library(scales)
library(GenSA)
#get data
sampleA <- read.table('C:/Users/aburnakov/Documents/Private/dummy_set_features.csv'
, sep= ','
, header = T)
#calculate parameters
predictor_number <- dim(sampleA)[2] - 1
sample_size <- dim(sampleA)[1]
par_v <- runif(predictor_number, min = 0, max = 1)
par_low <- rep(0, times = predictor_number)
par_upp <- rep(1, times = predictor_number)
#load functions to memory
# Noise floor for the multi-information among predictors: every predictor
# column is shuffled independently, then the requested quantile is returned.
shuffle_f_inp <- function(x = data.frame(), iterations_inp, quantile_val_inp){
  mutins <- c(1:iterations_inp)
  for (count in 1:iterations_inp){
    xx <- data.frame(1:dim(x)[1])
    for (count1 in 1:(dim(x)[2] - 1)){
      y <- as.data.frame(x[, count1])
      y$count <- sample(1 : dim(x)[1], dim(x)[1], replace = F)
      y <- y[order(y$count), ]
      xx <- cbind(xx, y[, 1])
    }
    mutins[count] <- multiinformation(xx[, 2:dim(xx)[2]])
  }
  quantile(mutins, probs = quantile_val_inp)
}
# Noise floor for the target: the target column is shuffled, and the requested
# quantile of (mutual information / target entropy) is returned.
shuffle_f <- function(x = data.frame(), iterations, quantile_val){
  height <- dim(x)[1]
  mutins <- c(1:iterations)
  for (count in 1:iterations){
    x$count <- sample(1 : height, height, replace = F)
    y <- as.data.frame(c(x[dim(x)[2] - 1], x[dim(x)[2]]))
    y <- y[order(y$count), ]
    x[dim(x)[2]] <- NULL
    x[dim(x)[2]] <- NULL
    x$dep <- y[, 1]
    rm(y)
    receiver_entropy <- entropy(x[, dim(x)[2]])
    received_inf <- mutinformation(x[, 1:(dim(x)[2] - 1)], x[, dim(x)[2]])
    corr_ff <- received_inf / receiver_entropy
    mutins[count] <- corr_ff
  }
  quantile(mutins, probs = quantile_val)
}
############### the fitness function
fitness_f <- function(par){
  # a predictor is included when its parameter reaches the threshold
  indexes <- c(1:predictor_number)
  for (i in 1:predictor_number){
    if (par[i] >= threshold) {
      indexes[i] <- i
    } else {
      indexes[i] <- 0
    }
  }
  local_predictor_number <- 0
  for (i in 1:predictor_number){
    if (indexes[i] > 0) {
      local_predictor_number <- local_predictor_number + 1
    }
  }
  if (local_predictor_number > 1) {
    # zero indexes are dropped by R, the last column is the target
    sampleAf <- as.data.frame(sampleA[, c(indexes[], dim(sampleA)[2])])
    pred_entrs <- c(1:local_predictor_number)
    for (count in 1:local_predictor_number){
      pred_entrs[count] <- entropy(sampleAf[count])
    }
    max_pred_ent <- sum(pred_entrs) - max(pred_entrs)
    # redundancy among predictors, net of the shuffled noise floor
    pred_multiinf <- multiinformation(sampleAf[, 1:(dim(sampleAf)[2] - 1)])
    pred_multiinf <- pred_multiinf - shuffle_f_inp(sampleAf, iterations_inp, quantile_val_inp)
    if (pred_multiinf < 0){
      pred_multiinf <- 0
    }
    pred_mult_perc <- pred_multiinf / max_pred_ent
    inf_corr_val <- shuffle_f(sampleAf, iterations, quantile_val)
    receiver_entropy <- entropy(sampleAf[, dim(sampleAf)[2]])
    received_inf <- mutinformation(sampleAf[, 1:local_predictor_number], sampleAf[, dim(sampleAf)[2]])
    # GenSA minimizes: the value is negative when the real information exceeds the noise floor
    if (inf_corr_val - (received_inf / receiver_entropy) < 0){
      fact_ff <- (inf_corr_val - (received_inf / receiver_entropy)) * (1 - pred_mult_perc)
    } else {
      fact_ff <- inf_corr_val - (received_inf / receiver_entropy)
    }
  } else if (local_predictor_number == 1) {
    sampleAf <- as.data.frame(sampleA[, c(indexes[], dim(sampleA)[2])])
    inf_corr_val <- shuffle_f(sampleAf, iterations, quantile_val)
    receiver_entropy <- entropy(sampleAf[, dim(sampleAf)[2]])
    received_inf <- mutinformation(sampleAf[, 1:local_predictor_number], sampleAf[, dim(sampleAf)[2]])
    fact_ff <- inf_corr_val - (received_inf / receiver_entropy)
  } else {
    fact_ff <- 0
  }
  return(fact_ff)
}
########## estimating threshold for variable inclusion
iterations = 5
quantile_val = 1
iterations_inp = 1
quantile_val_inp = 1
levels_arr <- numeric()
for (i in 1:predictor_number){
  levels_arr[i] <- length(unique(sampleA[, i]))
}
mean_levels <- mean(levels_arr)
optim_var_num <- log(x = sample_size / 100, base = round(mean_levels, 0))
if (optim_var_num / predictor_number < 1){
  threshold <- 1 - optim_var_num / predictor_number
} else {
  threshold <- 0.5
}
#run feature selection
start <- Sys.time()
sao <- GenSA(par = par_v, fn = fitness_f, lower = par_low, upper = par_upp
, control = list(
#maxit = 10
max.time = 1200
, smooth = F
, simple.function = F))
trace_ff <- data.frame(sao$trace)$function.value
plot(trace_ff, type = "l")
percent(- sao$value)
final_vector <- c((sao$par >= threshold), T)
names(sampleA)[final_vector]
final_sample <- as.data.frame(sampleA[, final_vector])
Sys.time() - start
In the data frame, the rightmost column is the target column.
ALL columns must be categorical (integer, character or factor).
And you have to load all the libraries.
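The shuffling idea behind shuffle_f can be illustrated without any packages. The sketch below is not the listing above: it computes empirical mutual information in base R and compares it with the quantile obtained after shuffling the target. The names mutual_info and shuffled_mi_quantile, the synthetic data, and the 100 iterations with a 0.95 quantile are assumptions made for illustration (the listing uses 5 iterations, quantile 1, and additionally divides by the target entropy).

```r
# Empirical mutual information (in nats) between two categorical vectors.
mutual_info <- function(x, y) {
  joint <- table(x, y) / length(x)                # joint probabilities
  indep <- outer(rowSums(joint), colSums(joint))  # product of the marginals
  nz <- joint > 0                                 # empty cells contribute 0
  sum(joint[nz] * log(joint[nz] / indep[nz]))
}

# Quantile of the mutual information measured after shuffling the target,
# i.e. an estimate of what pure noise produces on a sample of this size.
shuffled_mi_quantile <- function(x, y, iterations = 100, prob = 0.95) {
  null_mi <- replicate(iterations, mutual_info(x, sample(y)))
  unname(quantile(null_mi, probs = prob))
}

# Synthetic data (hypothetical): one informative and one garbage predictor.
set.seed(1)
n <- 500
signal <- sample(1:3, n, replace = TRUE)
noise  <- sample(1:3, n, replace = TRUE)
target <- ifelse(runif(n) < 0.8, as.integer(signal == 3), rbinom(n, 1, 0.5))

mi_signal    <- mutual_info(signal, target)
mi_noise     <- mutual_info(noise, target)
floor_signal <- shuffled_mi_quantile(signal, target)

print(c(mi_signal = mi_signal, mi_noise = mi_noise, floor = floor_signal))
```

A predictor is worth keeping only when its mutual information with the target clearly exceeds the shuffled quantile. The garbage predictor usually lands near that floor.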
A piece of code that shows how to convert numeric variables into categorical ones:
disc_levels <- 3 # how many equal-frequency levels are created for each variable
for (i in 1:56){
  naming <- paste(names(dat[i]), 'var', sep = "_")
  dat[, eval(naming)] <- discretize(dat[, eval(names(dat[i]))], disc = "equalfreq", nbins = disc_levels)[,1]
}
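For readers without infotheo at hand, here is a rough base-R sketch of the same equal-frequency idea. It is an assumption-laden stand-in, not infotheo's discretize: the helper equal_freq_bins, the synthetic columns rsi and macd, and the handling of ties are all hypothetical, and bin edges may differ from the package's.

```r
# Equal-frequency binning of one numeric vector into `nbins` integer levels.
equal_freq_bins <- function(v, nbins = 3) {
  breaks <- quantile(v, probs = seq(0, 1, length.out = nbins + 1), na.rm = TRUE)
  # unique() guards against duplicated breaks when the vector has many ties
  as.integer(cut(v, breaks = unique(breaks), include.lowest = TRUE))
}

# Synthetic data frame (hypothetical column names).
set.seed(42)
dat <- data.frame(rsi = runif(300, 0, 100), macd = rnorm(300))

# Discretize every numeric column instead of hard-coding the column count.
numeric_cols <- names(dat)[sapply(dat, is.numeric)]
for (col in numeric_cols) {
  dat[[paste(col, "var", sep = "_")]] <- equal_freq_bins(dat[[col]], nbins = 3)
}

print(table(dat$rsi_var))
```

Collecting the numeric column names before the loop avoids re-discretizing the newly added "_var" columns.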
I found this interesting function on the Internet.
Maybe in this form the algorithm will recognize the data better? But there is one problem: the output of the function is a variable "d" containing a matrix with two columns, "x" and "y". One of them is, so to speak, the price, and the other is time warped by the algorithm. The question is how to turn this matrix into a vector so that it does not lose its properties.
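The function itself is not shown here, so the following is only a sketch under assumptions: the matrix d below is a made-up stand-in, and it is assumed that "x" holds the warped time and "y" the price. Two simple ways to get a plain vector are taking one column as it is, or resampling the price onto an evenly spaced time grid with linear interpolation (approx() from base R). Whether either keeps the "properties" of interest depends on what the original function does.

```r
# Hypothetical stand-in for the function's output: a two-column matrix `d`.
d <- cbind(x = c(0, 0.5, 2, 2.5, 4),
           y = c(1.10, 1.12, 1.11, 1.15, 1.14))

# Option 1: keep only the price column as a plain numeric vector.
price <- d[, "y"]

# Option 2: resample the price onto an evenly spaced grid of the warped time,
# so that the position in the vector again corresponds to equal time steps.
regular_x <- seq(min(d[, "x"]), max(d[, "x"]), length.out = nrow(d))
resampled <- approx(x = d[, "x"], y = d[, "y"], xout = regular_x)$y

print(price)
print(resampled)
```

Option 2 keeps the warping information implicitly, because the values are re-spaced according to "x" instead of simply dropping it.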
Attached is my best set of predictors. trainData is D1 for eurusd for 2015, fronttestData is January 1, 2016 through June. The fronttest is a bit long: in real trading I am unlikely to trade for more than a month with the same settings, I just wanted to see how long the profitability of the model really lasts. fronttestData1, fronttestData2, fronttestData3 are separate slices of fronttestData: only January, only February, only March. I'm really only interested in lowering the error on fronttestData1, the rest is just for research. The predictor set contains mostly indicators and various calculations between them. With nnet I get a 30% error on fronttestData1, training with iteration control and tuning the number of hidden neurons. I think the 30% here is just a matter of chance: the model caught some trend in the market from March 2015 to February 2016. But the fact that the other periods are not loss-making is already good.
Here is a picture from the mt5 tester for 2014.01-2016.06, with the training period marked by a frame. It's already better than it was :) For now this is my limit. I have to solve a lot of problems with indicators, namely the fact that their default parameters are strictly tied to timeframes. For example, my experience is completely useless on H1: there the same indicator selection algorithm considers everything garbage. I should either add a bunch of variations with different parameters to the initial set of indicators, or generate random indicators from ohlc myself.
That makes more sense, thank you. It seems to me that only 3 categories per indicator will not be enough. Logically, I would make at least 100 levels, but would that be better, or would the algorithm lose its meaning?
Then the meaning will be lost. The algorithm counts the total number of levels of the input variables and how the response levels are distributed across those levels. Accordingly, if the number of response values at each input level is very low, it is impossible to estimate the statistical significance of the skew in probabilities.
If you make 100 levels and there are also many variables, the algorithm will return zero significance for any subset, which is reasonable given the limited sample size.
Example - good.
input level | number of observations
1           | 150
2           | 120
...
9           | 90
Here the significance within the response can be estimated.
Example - bad.
input level | number of observations
112         | 5
...
357         | 2
...
1045        | 1
Here the significance within the response cannot be estimated.
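The arithmetic behind the two examples can be sketched as follows. The helper names obs_per_cell and max_vars and the sample size of 8100 are assumptions for illustration. The second helper is one reading of the listing's line optim_var_num <- log(x = sample_size / 100, base = round(mean_levels, 0)): it solves levels^k = sample_size / 100 for k, i.e. the number of variables that still leaves about 100 observations per combined input level on average.

```r
# Average number of observations per combined input level when `n_vars`
# predictors with `n_levels` levels each are used together.
obs_per_cell <- function(sample_size, n_levels, n_vars) {
  sample_size / n_levels^n_vars
}

# How many variables can be combined while keeping about `min_obs`
# observations per combined level (same formula as optim_var_num above).
max_vars <- function(sample_size, n_levels, min_obs = 100) {
  log(sample_size / min_obs, base = n_levels)
}

sample_size <- 8100  # hypothetical sample size

few_levels  <- obs_per_cell(sample_size, n_levels = 3,   n_vars = 2)
many_levels <- obs_per_cell(sample_size, n_levels = 100, n_vars = 2)

print(c(few_levels = few_levels, many_levels = many_levels))
print(max_vars(sample_size, n_levels = 3))
print(max_vars(sample_size, n_levels = 100))
```

With 3 levels per variable there are hundreds of observations per combination. With 100 levels there is less than one observation per combination already for two variables, which matches the "bad" example.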
Not bad, but the out-of-sample periods themselves are short.
It is also not clear how many trades fall outside the sample. Dozens, hundreds? What is the order of magnitude?
I took a look at it.
Did I understand correctly that there are 107 lines (107 observations) in the total dataset?
No, the training set has 250-something rows (number of trading days in 2015). I trained the model on the trainData table. I tested it on fronttestData1. Everything else is for additional checks, you can ignore them
trainData - all year 2015.
fronttestData1 - January 2016
fronttestData2 - February 2016
fronttestData3 - March 2016
fronttestData - January 2016 - June 2016
For me that is very little - I use statistics. Even for the current window, 107 rows is very little for me. I use over 400 for the current window.
In general, in your sets the number of observations is comparable to the number of predictors. These are very specific sets. I have seen somewhere that such sets require special methods. I have no references, as I do not have such problems.
Unfortunately my methods are not suitable for your data.