Machine learning in trading: theory, models, practice and algo-trading - page 7

 
Dr.Trader:

Thank you, I tried it. I can see you have done a lot of work selecting the predictors: the neural network trained on them easily, and the result held up on the test dataset as well.

The results below refer to training on R1.F3.

1) There was a funny result with Rattle. An NN with the standard configuration showed train/validate/testing errors of 30%/29%/33%. The error on R2.F3 is 35%. But this is really just a lucky case; with another configuration it would easily have under- or over-fitted, here it simply got lucky.

2) Then I took a simple crude approach: training without monitoring, 200 hidden neurons, the network trained until it stopped improving. Errors train/validate/testing/R2.F3: 2%/30%/27%/45%. Well, that's clear: the network is overfitted.

3) Monitored training. Things are different here than with trees: with neural networks you should always train this way to avoid overfitting. The idea is to pause training from time to time and check the train/validate/testing results. I don't know a golden rule for comparing the results, but it is quite normal to train on the train dataset, then look at the errors on the validate and testing datasets, and stop training when the validate/testing errors stop decreasing. This gives some guarantee against overfitting. R2.F3 is considered unavailable during this whole process, and the test on it is done only after training ends. In this case the train/validate/testing/R2.F3 errors are 27%/30%/31%/37%. There is overfitting here again, but not much. We could have stopped training earlier, once the train error became noticeably smaller than the validate/testing errors, but that is guesswork: it might have helped, or it might not.
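For illustration, here is a minimal sketch of such a monitored-training loop in R, assuming the nnet package; dat_train, dat_validate and the factor target y are placeholder names, not from the post:

library(nnet)

# Train in short bursts, warm-starting from the previous weights,
# and stop when the validation error stops decreasing.
set.seed(1)
wts <- NULL; best_err <- Inf; stall <- 0; patience <- 5

for (i in 1:100) {
  fit <- if (is.null(wts))
    nnet(y ~ ., data = dat_train, size = 10, maxit = 20, trace = FALSE)
  else
    nnet(y ~ ., data = dat_train, size = 10, maxit = 20,
         Wts = wts, trace = FALSE)
  wts <- fit$wts                                   # warm start for the next burst
  err <- mean(predict(fit, dat_validate, type = "class") != dat_validate$y)
  if (err < best_err) { best_err <- err; stall <- 0 } else stall <- stall + 1
  if (stall >= patience) break                     # validation error has stalled
}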

The "R1.F1" target variable has three values, Rattle can't do that with neuronics and you have to write your own code in R, I skipped this dataset.

"R1.F4" "R1.F5" "R1.F6" gave approximately the same results for all 4 errors in Rattle neuronka, I think an adequate approach with neuronka will also give approximately the same results, I have not dealt with them further.

I got similar numbers with forest and ada.

And now, to get back to the matter at hand: how do we discard noise from an arbitrary list of predictors? I have an empirical algorithm that selected my 27 predictors out of 170. I have also used it to analyze other people's sets of predictors, likewise successfully. Based on this experience, I maintain that none of the R methods that use variable "importance" in their algorithms can clear a predictor set of noise.

I appeal to all readers of the thread: I am willing to do the corresponding analysis if the raw data is presented as an RData or Excel file that does not require processing.

And one more thing.

I am attaching a number of articles which supposedly solve the problem of clearing the original set of predictors of noise, and with much higher quality. Unfortunately I don't have time to try them at the moment. Maybe someone will give them a try and post the result?

 
SanSanych Fomenko:

I am attaching a number of articles which supposedly solve the problem of clearing the original set of predictors of noise, and with much higher quality. Unfortunately I don't have time to try them at the moment. Maybe someone will give them a try and post the result?

Thanks, I leafed through the document but didn't find what I need. I'm trying to train my model for forex, somewhere in the M15-H4 range. It is not enough for me to take data only from the last bar; I need to take it from tens of bars at once and put it one after another into one long array of model inputs. For example (open_bar1, close_bar1, hh_bar1, open_bar2, close_bar2, hh_bar2, open_bar3, close_bar3, hh_bar3, ...). If a selection method tells me that the time of the second bar should be removed, that won't help me; the method should tell me that, for example, all time data can be removed (indices 3, 6, 9, ...).
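As a sketch of that layout, base R's embed() can build such a long-array input and drop a whole feature group by name; the bars data frame below is made-up:

set.seed(1)
bars <- data.frame(open  = cumsum(rnorm(500)),
                   close = cumsum(rnorm(500)),
                   hour  = rep(0:23, length.out = 500))
n_bars <- 10

# One row per observation: each variable repeated for the last n_bars bars
# (bar1 = most recent, matching embed()'s column order).
X <- do.call(cbind, lapply(bars, function(col) embed(col, n_bars)))
colnames(X) <- as.vector(outer(1:n_bars, names(bars),
                               function(i, v) paste0(v, "_bar", i)))

# Removing a whole group (all time data) rather than a single bar's value:
X_no_time <- X[, !grepl("^hour_", colnames(X))]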

Did I understand correctly that in your file ALL_cod.RData one can also use Rat_DF1 for training (specifying the desired target) and then use Rat_DF2 and Rat_DF3 for checking? I have attached my R code for those who are interested; it implements error-controlled neural network training. To select another target variable you can simply replace "Short_Long.75" with one of "Short_Long.35", "Flet_Long", "Short_Flet", "Flet_In". This is more convenient than substituting different datasets.

 
Dr.Trader:
I'm trying to train my model for forex, somewhere in the M15-H4 range. It is not enough for me to take data only from the last bar; I need to take it from tens of bars at once and put it one after another into one long array of model inputs. For example (open_bar1, close_bar1, hh_bar1, open_bar2, close_bar2, hh_bar2, open_bar3, close_bar3, hh_bar3, ...). If a selection method tells me that the time of the second bar should be removed, that won't help me; the method should tell me that, for example, all time data can be removed (indices 3, 6, 9, ...).

Hmm. Time data may well be needed, because markets behave differently in different sessions.

Here is the set of features I have for forex:

> names(sampleA)
  [1] "lag_diff_2"        "lag_diff_3"        "lag_diff_4"        "lag_diff_6"        "lag_diff_8"        "lag_diff_11"       "lag_diff_16"
  [8] "lag_diff_23"       "lag_diff_32"       "lag_diff_45"       "lag_diff_64"       "lag_diff_91"       "lag_diff_128"      "lag_diff_181"
 [15] "lag_diff_256"      "lag_diff_362"      "lag_diff_512"      "lag_diff_724"      "lag_mean_diff_2"   "lag_mean_diff_3"   "lag_mean_diff_4"
 [22] "lag_mean_diff_6"   "lag_mean_diff_8"   "lag_mean_diff_11"  "lag_mean_diff_16"  "lag_mean_diff_23"  "lag_mean_diff_32"  "lag_mean_diff_45"
 [29] "lag_mean_diff_64"  "lag_mean_diff_91"  "lag_mean_diff_128" "lag_mean_diff_181" "lag_mean_diff_256" "lag_mean_diff_362" "lag_mean_diff_512"
 [36] "lag_mean_diff_724" "lag_max_diff_2"    "lag_max_diff_3"    "lag_max_diff_4"    "lag_max_diff_6"    "lag_max_diff_8"    "lag_max_diff_11"
 [43] "lag_max_diff_16"   "lag_max_diff_23"   "lag_max_diff_32"   "lag_max_diff_45"   "lag_max_diff_64"   "lag_max_diff_91"   "lag_max_diff_128"
 [50] "lag_max_diff_181"  "lag_max_diff_256"  "lag_max_diff_362"  "lag_max_diff_512"  "lag_max_diff_724"  "lag_min_diff_2"    "lag_min_diff_3"
 [57] "lag_min_diff_4"    "lag_min_diff_6"    "lag_min_diff_8"    "lag_min_diff_11"   "lag_min_diff_16"   "lag_min_diff_23"   "lag_min_diff_32"
 [64] "lag_min_diff_45"   "lag_min_diff_64"   "lag_min_diff_91"   "lag_min_diff_128"  "lag_min_diff_181"  "lag_min_diff_256"  "lag_min_diff_362"
 [71] "lag_min_diff_512"  "lag_min_diff_724"  "lag_sd_2"          "lag_sd_3"          "lag_sd_4"          "lag_sd_6"          "lag_sd_8"
 [78] "lag_sd_11"         "lag_sd_16"         "lag_sd_23"         "lag_sd_32"         "lag_sd_45"         "lag_sd_64"         "lag_sd_91"
 [85] "lag_sd_128"        "lag_sd_181"        "lag_sd_256"        "lag_sd_362"        "lag_sd_512"        "lag_sd_724"        "lag_range_2"
 [92] "lag_range_3"       "lag_range_4"       "lag_range_6"       "lag_range_8"       "lag_range_11"      "lag_range_16"      "lag_range_23"
 [99] "lag_range_32"      "lag_range_45"      "lag_range_64"      "lag_range_91"      "lag_range_128"     "lag_range_181"     "lag_range_256"
[106] "lag_range_362"     "lag_range_512"     "lag_range_724"     "symbol"            "month"             "day"               "week_day"
[113] "hour"              "minute"            "future_lag_2"      "future_lag_3"      "future_lag_4"      "future_lag_6"      "future_lag_8"
[120] "future_lag_11"     "future_lag_16"     "future_lag_23"     "future_lag_32"     "future_lag_45"     "future_lag_64"     "future_lag_91"
[127] "future_lag_128"    "future_lag_181"    "future_lag_256"    "future_lag_362"    "future_lag_512"    "future_lag_724"

I take data from moving averages, from highs and lows, and from price ranges within the window. Plus the time, the day, and even the month.)

My algorithms can realistically leave 10 or even 5 out of 114 predictors. And that's fine: in such data there is strong correlation between the PREDICTORS, and hence strong redundancy.
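As a quick illustration of that redundancy (a standard pairwise-correlation filter, not the selection method described below), caret::findCorrelation can be run on the numeric columns of a data frame like sampleA above:

library(caret)

num <- sampleA[, sapply(sampleA, is.numeric)]     # skip factors such as symbol
cors <- cor(num, use = "pairwise.complete.obs")
drop_idx <- findCorrelation(cors, cutoff = 0.95)  # indices of near-duplicates
reduced <- if (length(drop_idx)) num[, -drop_idx] else num
ncol(num); ncol(reduced)                          # the redundancy made visible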

 

Let me briefly describe my method of selecting informative features. The code is attached.

There are two sides to the question: how to select subsets, and how to measure the relevance of the selected predictors to the output variable.

The first question. I solve it by stochastic enumeration of predictor combinations using the Simulated Annealing method. In results it is similar to genetic algorithms and to stochastic gradient descent. The plus is that it escapes local minima and works on a principle found in nature. It can also work with a non-smooth error surface, although everything there is conditional.

For many problems its supporters consider it better than genetic algorithms, for example. The R package implementation is almost standard. The trick is that it is designed for continuous data, while I have predictor indices; so I create a continuous vector whose length is the total number of predictors, and whenever any scalar component breaks through a given threshold, the corresponding predictor is switched on.
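A minimal sketch of that continuous-to-binary trick, using base R's optim(method = "SANN") rather than whatever package the author used; the data and the relevance_score criterion here are made-up stand-ins:

set.seed(1)
predictors <- matrix(rnorm(500 * 20), ncol = 20,
                     dimnames = list(NULL, paste0("p", 1:20)))
target <- factor(predictors[, 3] + predictors[, 7] + rnorm(500) > 0)

# Stand-in criterion: summed absolute correlation with the target,
# minus a crude penalty for subset size (a placeholder for an
# MI-based relevance penalized for redundancy).
relevance_score <- function(x, y)
  sum(abs(cor(x, as.numeric(y)))) - 0.1 * ncol(x)

threshold <- 0.5
fitness <- function(par) {
  sel <- par >= threshold               # continuous vector -> binary mask
  if (!any(sel)) return(Inf)            # reject empty subsets
  -relevance_score(predictors[, sel, drop = FALSE], target)
}

sao <- optim(par = runif(ncol(predictors)), fn = fitness, method = "SANN",
             control = list(maxit = 5000, temp = 10))
final_vector <- sao$par >= threshold    # same thresholding as quoted below
colnames(predictors)[final_vector]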

The second question is even more subtle: the fitness function.

How do you measure whether a predictor (or a set of predictors) affects the output? The dependence can be nonlinear; standard regression can fail badly on some nonlinear problems. And I am not talking about training a black box and using its built-in importance estimator. I am talking about a separate method.

You have to understand that the dependence can be very complex and involve interactions, redundancy, and, again, nonlinearity. All of this applies to categorical data as well as to numeric data.

I opted for categorical data, since Information Theory offers good tools for it. To put it simply: there is an input and an output. If the state of the output depends even a little (probabilistically) on the input, they are dependent. There is a quantity called mutual information; it measures exactly this.
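A toy illustration with the infotheo package (my assumption for the "Information Theory package" mentioned below): an informative input shows clearly positive MI with the output, a noise input shows MI near zero.

library(infotheo)

set.seed(1)
x <- rnorm(1000)                                  # informative input
z <- rnorm(1000)                                  # pure noise
y <- factor(ifelse(x + rnorm(1000) > 0, "up", "down"))

d <- discretize(data.frame(x = x, z = z))         # equal-frequency binning
mutinformation(d$x, y)    # clearly > 0: x carries information about y
mutinformation(d$z, y)    # near 0: z is noise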

Now let's go deeper. MI measures something on the observed distribution of a finite-size sample. It is, of course, only a point estimate.

So we need to estimate the statistical bound on the information for the case of an independent input-output pair. This is done with a self-written function using numerical methods.
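One common numerical way to get such a bound (a sketch, not the author's own function) is a permutation test: shuffling the output destroys any dependence, so a high quantile of MI over many shuffles bounds what independence alone can produce. Reusing d and y from the sketch above:

# Null distribution of MI under independence, by permuting the output.
mi_null_quantile <- function(x_disc, y, n_perm = 1000, q = 0.95) {
  quantile(replicate(n_perm, mutinformation(x_disc, sample(y))), q)
}

mutinformation(d$x, y) > mi_null_quantile(d$x, y)  # TRUE: significant
mutinformation(d$z, y) > mi_null_quantile(d$z, y)  # FALSE: noise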

Even deeper: if we have two or more predictors, what do we do with them?

First, they can themselves be related, and the more related they are, the more redundancy there is in their set. This redundancy is measured by the so-called multiinformation. But multiinformation is also a point estimate on the sample, so its distribution quantile is likewise computed numerically, through another self-written function.
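infotheo also provides multiinformation(); a sketch of the same permutation idea applied to the redundancy estimate, with made-up strongly related predictors:

library(infotheo)

set.seed(2)
a <- rnorm(1000)
b <- a + rnorm(1000, sd = 0.3)          # nearly a copy of a -> redundant
subset_disc <- discretize(data.frame(a = a, b = b))

multiinformation(subset_disc)           # high: the pair is redundant

# Null quantile: shuffle each column independently to break the relations.
null_mi <- replicate(500,
  multiinformation(as.data.frame(lapply(subset_disc, sample))))
quantile(null_mi, 0.95)                 # the observed value far exceeds this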

Second, the number of levels of the predictor categories can be so large (say, 2^15) that nothing can be said about the dependence at those levels: there are very few observations per level.

Finally, when all this is done and put together, we can measure a dependence of any kind between an arbitrary number of predictors and outputs, on an arbitrary sample size, with a predetermined statistical significance. The basic functions themselves are taken from an Information Theory package.

All of this is in the attached file. Of course, it is not easy to understand without a stiff drink. There is also complete code for creating trading rules and validating them. All for your information and to deepen your knowledge.

In principle, the result is as usual:

[1] "1.69%"

> final_vector <- c((sao$par >= threshold), T)

> names(sampleA)[final_vector]

 [1] "lag_diff_23"      "lag_diff_45"      "lag_mean_diff_2"  "lag_mean_diff_8"  "lag_max_diff_11"  "lag_max_diff_181" "lag_min_diff_3"   "lag_min_diff_724"

 [9] "lag_sd_724"       "lag_range_32"     "symbol" "future_lag_181"  

After a day and a half of work, having enumerated tens of thousands of predictor combinations, the function returns the value of the fitness function (the statistically significant mutual information, penalized for redundancy in the predictor set) and the predictors themselves.

All of this, I repeat, is categorical and allows building human-readable rules and interpreting the pattern found.

For example, I get 1.7% of complete determinism (not bad for forex) and a set of inputs that together, at the 0.1 significance level (that is how I set up the experiment), determine the state of the binary output. That is, information is clearly present in the forex data; this point has been demonstrated experimentally.

After that, profitability can be evaluated on validation data and the trading system coded.

Alexey

 
Dr.Trader:


Did I understand correctly that in your file ALL_cod.RData one can also use Rat_DF1 for training (specifying the desired target) and then use Rat_DF2 and Rat_DF3 for checking? I have attached my R code for those who are interested; it implements error-controlled neural network training. To select another target variable you can simply auto-replace "Short_Long.75" in the file with one of "Short_Long.35", "Flet_Long", "Short_Flet", "Flet_In". This is more convenient than substituting different datasets.

Yes. Exactly, for convenience in Rattle.

One more nuance.

All target variables are derived from two ZZs (ZigZags): ZZ(35) and ZZ(25). And here there is one very unpleasant nuance that resonates with yours.

The target variable is a sequence of 0s and 1s corresponding to a ZZ leg. But we ALWAYS predict an individual element of the ZZ leg, not the leg itself. Therefore it is incorrect to say that we are predicting trends; it is correct to say that we are predicting an element of a trend. And if you add up all the predicted trend elements, you probably won't even get a trend.

 

Thank you for feature_selector_modeller.zip, I will look into it.

SanSanych Fomenko:

I appeal to all readers of the thread: I am willing to do the corresponding analysis if the raw data is presented as an RData or Excel file that does not require processing.

I have attached the file; it is a dataset from forex. Inside is an RData with two datasets, one for training and one for validation. Physically the data in the two datasets follow each other; they are split into two files only for convenience in testing the model. A model can be trained on this dataset: I manually sifted out the predictors and trained a neural network, and the minimum error on the validation dataset was 46%, which is not really profitable. One can start thinking about real profit once the error drops below 40%. Please try to sift the predictors from this file.

SanSanych Fomenko:

The target variable is a sequence of 0s and 1s corresponding to a ZZ leg. But we ALWAYS predict an individual element of the ZZ leg, not the leg itself. Therefore it is incorrect to say that we are predicting trends; it is correct to say that we are predicting an element of a trend. And if you add up all the predicted trend elements, you probably won't even get a trend.

I've tried different target variables. On the one hand, you can predict the price one bar ahead; the target variable is then 1 or 0 depending on whether the price goes up or down on the next bar. On small timeframes I never got a result; it seems the closing price there is a fairly random number. But on H1 and above there is already some positive result.

The other option is a ZigZag or other trend indicators. I got some positive results with it, but only on condition that the network's output goes through a filter: for instance, taking the average of the results over the last few bars, or using only results above a certain threshold. I think all this is not worth applying; it is more guesswork than exact calculation. The problem is that the model is expected to give only buy signals for the next 10-20 bars, yet it sometimes gives sell signals. In that case the deal gets reversed, commission and spread are paid, and this can happen several times per trend. So you either need very high precision, or you need to smooth the results to avoid such single-bar reversals. So yes, as you said, only a trend element is predicted; the trend itself cannot be assembled from such elements.
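A minimal sketch of such a filter (my own illustration, not code from the thread): smooth the network's raw output over the last k bars, then act only on confident values.

# prob: network output per bar in (0, 1); 1 = buy, 0 = sell.
smooth_signal <- function(prob, k = 5, hi = 0.7, lo = 0.3) {
  sm <- as.numeric(stats::filter(prob, rep(1 / k, k), sides = 1))
  ifelse(is.na(sm), 0L,                       # warm-up bars: stay out
         ifelse(sm >= hi, 1L,                 # confident buy
                ifelse(sm <= lo, -1L, 0L)))   # confident sell / stay out
}

set.seed(3)
prob <- pmin(pmax(0.5 + cumsum(rnorm(100, 0, 0.05)), 0), 1)  # fake output
table(smooth_signal(prob))   # single-bar flips are mostly filtered out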

Files:
set56.RData.zip  525 kb
 
Dr.Trader:


I've tried different target variables. On the one hand, you can predict the price one bar ahead; the target variable is then 1 or 0 depending on whether the price goes up or down on the next bar. On small timeframes I never got a result; it seems the closing price there is a fairly random number. But on H1 and above there is already some positive result.


My results are consistently the opposite. I easily predict the price movement a few minutes ahead (up to an hour) with 55% accuracy (Ask to Ask), with the best result at 23 minutes. This is binary classification on validation samples.

And as the forecast horizon increases, the accuracy slowly drops, down to 51% at 12 hours ahead. This relationship holds throughout the history. But the accuracy needs to be higher still to get into the black (to overcome the Ask-Bid spread).

We will discuss this later.

 
Dr.Trader:

Thanks for the feature_selector_modeller.zip, I will look into it.

I have attached the file; it is a dataset from forex. Inside is an RData with two datasets, one for training and one for validation. Physically the data in the two datasets follow each other; they are split into two files only for convenience in testing the model. A model can be trained on this dataset: I manually sifted out the predictors and trained a neural network, and the minimum error on the validation dataset was 46%, which is not really profitable. One can start thinking about real profit once the error drops below 40%. Please try to sift the predictors from this file.


I have not found a single usable predictor: all of them are noise. Your predictors have no predictive power for your target variable. There are some hints in 54, 55 and 56; you might be able to squeeze something out of them. Otherwise, as I see it, everything can be thrown out.
 
I see, thank you, I will look for other source data.
 
Dr.Trader:
I see, thank you, I will look for other source data.

Hold on. I will run your data through the dependence analysis as well.

One question before I start: does your data contain all bars in a row, or were the bars thinned out before sampling?
