Machine learning in trading: theory, models, practice and algo-trading - page 24

 

Alexey, I have another question about your function for sifting out predictors: what interval of return values should it have? I ran that fitness function on my data with random inputs (fitness_f(runif(predictor_number, min = 0, max = 1))) and saw results from 0 to 0.03. I also see in the code that in case of an error (zero inputs) the function returns 0. Something is wrong here: GenSA tries to minimize the result, so in the end it will simply converge to zero inputs and stop. Maybe you should change the sign of the fitness function's result so that GenSA works in the opposite direction?
If at some point of the optimization the fitness function starts returning negative numbers, and the lower they are the better, then everything is fine as it is.

 
Dr.Trader:

Alexey, I have another question about your function for sifting out predictors: what interval of return values should it have? I ran that fitness function on my data with random inputs (fitness_f(runif(predictor_number, min = 0, max = 1))) and saw results from 0 to 0.03. I also see in the code that in case of an error (zero inputs) the function returns 0. Something is wrong here: GenSA tries to minimize the result, so in the end it will simply converge to zero inputs and stop. Maybe you should change the sign of the fitness function's result so that GenSA works in the opposite direction?
If at some point of the optimization the fitness function starts returning negative numbers, and the lower they are the better, then everything is fine as it is.

That's a good question. You figure it out.

I am writing an article about this method and here is an excerpt:


Block diagram of the algorithm based on corrected mutual information and stochastic search for a subset of predictors.


  • a) Reduce the data set to categorical form by one of the known methods.
  • b) Evaluate the parameters of the data set, such as the number of rows, the number of predictors, and the average number of predictor levels (if the predictors have different numbers of levels), and from these values calculate the "optimal number" of predictors in the final subset.
  • c) Initialize a numeric vector whose length equals the number of input variables with random numbers uniformly distributed in [0, 1], and set the lower (0) and upper (1) bounds for the vector values; this vector is the argument of the SA function.
  • d) Initialize the functions that estimate the quantile of multiinformation, the quantile of mutual information, and the fitness function that combines all the calculations.
  • e) Set the number of Monte Carlo simulations for estimating the noise quantiles of MI and VI, and set the quantiles themselves (e.g., 0.9).
  • f) Set the running time or the number of iterations of the algorithm. The more the better.
  • g) Run the algorithm and wait for the results (a sketch of steps c to g in R follows this list).
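As a rough illustration, here is a minimal sketch of steps (c) to (g) using the GenSA package; fitness_f and threshold (the latter defined under point "b" below) are assumptions, not the author's actual code:

library(GenSA)

predictor_number <- 12                     # number of input variables
par_init <- runif(predictor_number, 0, 1)  # step (c): random start vector in [0, 1]

result <- GenSA(
  par     = par_init,
  fn      = fitness_f,                     # step (d): combined fitness function, lower is better
  lower   = rep(0, predictor_number),      # step (c): box constraints
  upper   = rep(1, predictor_number),
  control = list(max.time = 1200)          # step (f): time budget in seconds
)

# step (g): indices whose values passed the inclusion threshold from point "b"
selected <- which(result$par >= threshold)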


Point "b" needs clarification. The optimal number of variables is a conditional value that is determined by the formula:

optim_var_num <- log(x = sample_size / 100, base = round(mean_levels, 0))


The intuition is that, assuming the input variables are independent, the average number of levels of a variable must be raised to the power needed to obtain a total number of unique interacting levels such that each of them concentrates, on average, at least n observations, where n is taken to be 100.

We cannot afford too many input variables with too many levels, because a conservative estimate of the frequency of observations across the input levels would turn out too small to support a statistical conclusion about the dependence of the output variable on the set of inputs.

By setting the threshold above which the values of the vector are converted to 1 (the flag for including that variable's index), we make a probabilistic calculation:

threshold <- 1 - optim_var_num / predictor_number


Its essence is that we set the maximum probability value so that the calculated optimal number of inputs is selected. This logic can be checked with the binomial distribution.

For example, let's take our data: half of the entire set, which is used for training.

We have 17,973 rows and 12 predictors, each with 5 levels. Applying the formula above, we get an optimal number of predictors of 3.226.

Applying the threshold formula for including a predictor in the set, we obtain 0.731.
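These numbers are easy to verify (a quick check, not part of the original code):

sample_size      <- 17973
mean_levels      <- 5
predictor_number <- 12

optim_var_num <- log(x = sample_size / 100, base = round(mean_levels, 0))  # 3.226
threshold     <- 1 - optim_var_num / predictor_number                      # 0.731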

What is the most likely number of selected variables under the binomial distribution?


The maximum is at 3 variables. To be precise, 5 ^ 3.226 gives about 180 unique levels, which would hold an average of 100 observations each.
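A quick check in R (mine, not from the original post): the number of selected variables follows a binomial distribution with n = 12 and p = 1 - 0.731.

round(dbinom(0:6, size = 12, prob = 1 - 0.731), 3)
# approximately 0.023 0.103 0.208 0.255 0.211 0.124 0.053: the mode is at 3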
 
Continued. The value 0 at the output of the fitness function is the worst possible value: if no element of the par vector passes the threshold, the value is automatically 0. The best possible value is -1, which means the output is completely determined by the subset of inputs.
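Under this convention a simple sanity check could look like this (fitness_f is assumed, as above):

stopifnot(fitness_f(rep(0, predictor_number)) == 0)  # worst case: nothing selected
f <- fitness_f(runif(predictor_number, 0, 1))
stopifnot(f <= 0, f >= -1)                           # always within [-1, 0]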
 

Yes, I found a subset of predictors with a negative fitness value. There were a lot of predictors, several thousand, so I limited GenSA to only 2 iterations; it took 8 hours :). The fitness function result was 6%. On the fronttest with these predictors and nnet I got a 45% error. That is not much of an edge; I don't think the EA would have been profitable. I set the limit to 10 iterations this time to find a better result, started it again, and have been waiting for 24 hours now; I hope it ever completes. I should try genetic optimization (the GA library), whose multithreaded operation will be faster (GA maximizes rather than minimizes the result, i.e. the sign of the fitness function's result must be flipped for GA). I will keep experimenting.

I have read various articles on the principal-components model and tried not only to train the model, measure its R^2 and maximize it by selecting predictors, but also to really test the model on the fronttest data. The result is somewhat ambiguous. On the one hand, I increased the R^2 of the model by removing correlated pairs of predictors (the findCorrelation function from the caret library), but as it turned out, the R^2 on the fronttest data falls as a result. No miracle happened: the PCA model overfits too. I want to try a more complex estimation of predictors: split the training sample in two, one part for training proper and one for validation, train the PCA model, then immediately test it on the validation sample and return the minimum R^2 as the final result. If such a function is used to score a set of predictors and this value is maximized, it should find exactly those sets of predictors that give good results on both the training and the new data. We'll have to check (a rough sketch of this check follows below).
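A rough sketch of what that check might look like (df, target and the 0.9 cutoff are hypothetical names and values, not the actual code):

library(caret)

set.seed(1)
idx    <- createDataPartition(df$target, p = 0.5, list = FALSE)
train  <- df[idx, , drop = FALSE]
valid  <- df[-idx, , drop = FALSE]
x_cols <- setdiff(names(df), "target")

# drop one predictor from each highly correlated pair
drop_idx <- findCorrelation(cor(train[, x_cols]), cutoff = 0.9)
if (length(drop_idx) > 0) x_cols <- x_cols[-drop_idx]

# fit the principal components on the training half only
pca <- prcomp(train[, x_cols], center = TRUE, scale. = TRUE)
k   <- which(summary(pca)$importance[3, ] >= 0.95)[1]  # components covering 95% of variance
fit <- lm(train$target ~ ., data = as.data.frame(pca$x[, 1:k, drop = FALSE]))

# project the validation half with the same rotation and score out-of-sample R^2
new_pc <- as.data.frame(predict(pca, newdata = valid[, x_cols])[, 1:k, drop = FALSE])
pred   <- predict(fit, newdata = new_pc)
r2     <- 1 - sum((valid$target - pred)^2) / sum((valid$target - mean(valid$target))^2)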

Also, I must have misunderstood the text of that first article about PCA in this thread. It said that the number of components should be chosen to describe 95% of the variation; I thought that referred to the accuracy of predicting the target variable. But it's not like that: the principal components are built without the target variable at all, and the 95% is how accurately the raw data is described by those very components. Prediction accuracy has nothing to do with it.

 
Dr.Trader:

Yes, I found a subset of predictors with a negative fitness value. There were a lot of predictors, several thousand, so I limited GenSA to only 2 iterations; it took 8 hours :). The fitness function result was 6%. On the fronttest with these predictors and nnet I got a 45% error. That is not much of an edge; I don't think the EA would have been profitable. I set the limit to 10 iterations this time to find a better result, started it again, and have been waiting for 24 hours now; I hope it ever completes. I should try genetic optimization (the GA library), whose multithreaded operation will be faster (GA maximizes rather than minimizes the result, i.e. the sign of the fitness function's result must be flipped for GA). I will keep experimenting.

I have read various articles on the principal-components model and tried not only to train the model, measure its R^2 and maximize it by selecting predictors, but also to really test the model on the fronttest data. The result is somewhat ambiguous. On the one hand, I increased the R^2 of the model by removing correlated pairs of predictors (the findCorrelation function from the caret library), but as it turned out, the R^2 on the fronttest data falls as a result. No miracle happened: the PCA model overfits too. I want to try a more complex estimation of predictors: split the training sample in two, one part for training proper and one for validation, train the PCA model, then immediately test it on the validation sample and return the minimum R^2 as the final result. If such a function is used to score a set of predictors and this value is maximized, it should find exactly those sets of predictors that give good results on both the training and the new data. We'll have to check.

Also, I must have misunderstood the text of that first article about PCA in this thread. It said that the number of components should be chosen to describe 95% of the variation; I thought that referred to the accuracy of predicting the target variable. But it's not like that: the principal components are built without the target variable at all, and the 95% is how accurately the raw data is described by those very components. Prediction accuracy has nothing to do with it.

Yes, it turns out that you don't get it.

PCA can be used as a stand-alone tool, but the article doesn't discuss that.

What is discussed is how to filter out noise from some large set of predictors.

According to my understanding this is done in the following steps:

1. Y-aware. This is scaling the predictors depending on the target variable

2. With the help of the PCA algorithm, the set of predictors is ordered, and the part that explains 95% of the variance is taken.

Or rather (I haven't fully figured this out myself): with the help of the PCA algorithm a new set of predictors is built, obtained by multiplying the original set by the calculated coefficients (loadings). This set is ordered, and we take the number of these new vectors that explains 95% of the variance (a sketch of the scaling step follows below).
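For reference, a minimal sketch of the y-aware scaling step, after the win-vector articles (df, x_cols and target are hypothetical names):

# each predictor is rescaled by the slope of a univariate regression of the
# outcome on it, so a unit change in the scaled variable corresponds to a
# unit change in the predicted outcome
y_aware_scale <- function(x, y) {
  m <- lm(y ~ x)$coefficients[2]  # univariate slope
  m * (x - mean(x))               # center, then rescale
}

dmTrain <- as.data.frame(lapply(df[, x_cols], y_aware_scale, y = df$target))
princ   <- prcomp(dmTrain, center = FALSE, scale. = FALSE)  # as in the article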

PS.

Judging by the publications, Y-aware is the latest fashion in the field of filtering out noise predictors.

Good luck.

 
SanSanych Fomenko:

2. With the help of the PCA algorithm, the set of predictors is ordered, and the part that explains 95% of the variance is taken.

I haven't figured that out yet. (I will now write only about the y-aware approach, so as not to confuse it with the other one.) The article itself is here: http://www.r-bloggers.com/principal-components-regression-pt-2-y-aware-methods/

After the code "princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)" we have this situation: the data has been read, scaled by Y, and the principal components built. This function doesn't limit the number of components; you can build as many components as there are predictors. The first thing to do is to select only a part of them (it is recommended to take enough of them to describe 95% of the variation). In the article itself, the author looked at the graph of sdev (the standard deviations of the components) and said that 2 or 5 would be enough, because they stand out on the graph. In my case nothing stands out; the graph decreases smoothly.

There is an sdev vector, with as many entries as there are components. Can the number of components to take be calculated from it? The sum of all the numbers is not necessarily bounded by 1; I've seen a sum of 6, and probably more is possible.

> princ$sdev
[1] 0.17585066 0.15322845 0.13818021 0.13090573 0.12177070 0.11854969
[7] 0.11176954 0.10910302 0.10616631 0.10265987 0.10056754 0.09441041
[13] 0.09343688 0.08832101 0.08620753 0.08235531 0.08132748 0.07992235
[19] 0.07800569 0.07575063 0.07463254 0.07311194 0.07210698 0.07032990
[25] 0.06907964 0.06763711 0.06634935 0.06544930 0.06451703 0.06260861
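For what it's worth, the shares of explained variance can be computed from sdev directly (my sketch, not from the original post):

var_share <- princ$sdev^2 / sum(princ$sdev^2)     # squared sdev = component variance
n_comp    <- which(cumsum(var_share) >= 0.95)[1]  # smallest count covering 95%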
 
Dr.Trader:

I haven't figured that out yet. (I will now write only about the y-aware approach, so as not to confuse it with the other one.) The article itself is here: http://www.r-bloggers.com/principal-components-regression-pt-2-y-aware-methods/

After the code "princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)" we have this situation: the data has been read, scaled by Y, and the principal components built. This function doesn't limit the number of components; you can build as many components as there are predictors. The first thing to do is to select only a part of them (it is recommended to take enough of them to describe 95% of the variation). In the article itself, the author looked at the graph of sdev (the standard deviations of the components) and said that 2 or 5 would be enough, because they stand out on the graph. In my case nothing stands out; the graph decreases smoothly.

There is an sdev vector, with as many entries as there are components. Can the number of components to take be calculated from it? The sum of all the numbers is not necessarily bounded by 1; I've seen a sum of 6, and probably more is possible.

I ran rattle and got three tables:

  • Standard deviations: the values can be anything; their sum does not have to equal 1
  • Rotation: the coefficients by which the original vectors must be multiplied to obtain the new ones
  • Importance of components: this is what is being discussed

In the last one, the first column says that if you take only PC1, then 0.9761 of the variability (Cumulative Proportion) will be explained; if you take TWO components, PC1 and PC2, then 0.99996 will be explained, and so on.

(I don't know how to insert tables)

Importance of components:

                          PC1     PC2     PC3      PC4      PC5
Standard deviation     2.2092 0.34555 0.01057 0.008382 0.004236
Proportion of Variance 0.9761 0.02388 0.00002 0.000010 0.000000
Cumulative Proportion  0.9761 0.99996 0.99998 1.000000 1.000000

 

I looked for this table for a long time and finally found it in summary. The most obvious place, actually :) Thank you for pointing it out. This is a case where something is in the summary but not in the object's attributes.

summary(princ)$importance[3,]

It turns out that article has a sequel dedicated to this very question of component selection, with a special solution for Y-aware. I haven't tried it yet.

http://www.win-vector.com/blog/2016/05/pcr_part3_pickk/

 
Dr.Trader:

I looked for this table for a long time and finally found it in summary. The most obvious place, actually :) Thank you for pointing it out. This is a case where something is in the summary but not in the object's attributes.

It turns out that article has a sequel dedicated to this very question of component selection, with a special solution for Y-aware. I haven't tried it yet.

http://www.win-vector.com/blog/2016/05/pcr_part3_pickk/

In R, as soon as you get some object, immediately apply str and summary to it, and plot as well. You can see a lot of amazing things. The point is that an "object" in R is a much more complicated thing than in many programming languages (see the example below).
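For instance, with the prcomp object from above:

str(princ)      # internal structure: sdev, rotation, center, scale, x
summary(princ)  # dispatches to summary.prcomp: the importance table
plot(princ)     # scree plot of the component variances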
 
Dr.Trader:

Yes, I found a subset of predictors with a negative fitness value. There were a lot of predictors, several thousand, so I limited GenSA to only 2 iterations; it took 8 hours :). The fitness function result was 6%. On the fronttest with these predictors and nnet I got a 45% error. That is not much of an edge; I don't think the EA would have been profitable. I set the limit to 10 iterations this time to find a better result, started it again, and have been waiting for 24 hours now; I hope it ever completes. I should try genetic optimization (the GA library), whose multithreaded operation will be faster (GA maximizes rather than minimizes the result, i.e. the sign of the fitness function's result must be flipped for GA). I will keep experimenting.

I have read various articles on the principal-components model and tried not only to train the model, measure its R^2 and maximize it by selecting predictors, but also to really test the model on the fronttest data. The result is somewhat ambiguous. On the one hand, I increased the R^2 of the model by removing correlated pairs of predictors (the findCorrelation function from the caret library), but as it turned out, the R^2 on the fronttest data falls as a result. No miracle happened: the PCA model overfits too. I want to try a more complex estimation of predictors: split the training sample in two, one part for training proper and one for validation, train the PCA model, then immediately test it on the validation sample and return the minimum R^2 as the final result. If such a function is used to score a set of predictors and this value is maximized, it should find exactly those sets of predictors that give good results on both the training and the new data. We'll have to check.

Also, I must have misunderstood the text of that first article about PCA in this thread. It said that the number of components should be chosen to describe 95% of the variation; I thought that referred to the accuracy of predicting the target variable. But it's not like that: the principal components are built without the target variable at all, and the 95% is how accurately the raw data is described by those very components. Prediction accuracy has nothing to do with it.

I don't quite understand why it took so long. What optim_var_num did you get? It should be within 10. Set the time limit to 1200 seconds, and by then there should already be something.