Market etiquette or good manners in a minefield - page 81

 
When doing this rounding, don't forget to divide the number of values equal to zero by 2 in the distribution.
 
Yeah, got it (+/-0).
 
Neutron >> :

The point is that I'm not accumulating the statistics on the same training sample; I shift the sample by one count on each cycle, so the training runs do not coincide with each other. I don't remember why I did it that way, but it doesn't change the essence. Apparently, I wanted to show the quasi-stationary processes in the market and reflect their influence on the learning speed.

Here's what the results look like when averaging over 10 experiments on the same training sample (fig. left):

You can see that there is no statistical variation for weights with zero initialization.

The figure on the right was obtained with a network architecture of 12 inputs, 5 neurons in the hidden layer and 1 output neuron, and a training sample of 120 sets, i.e. a copy of your case. The statistics were gathered over 50 independent numerical experiments. Here too, everything works correctly.

No, I used the first difference of the opening prices as input (I thought that was clear from the context). Naturally, its mean is zero. I predicted the amplitude and sign of the next difference.
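Such a dataset could be built roughly as follows. This is only an illustrative sketch: the names `make_dataset`, `opens` and the window length `d` are my own, not anything from the posts above.

```python
import numpy as np

def make_dataset(opens, d=5):
    """Build training pairs from first differences of opening prices.

    Inputs: the last d price differences (a series with near-zero mean);
    target: the next difference, whose sign and amplitude are predicted.
    """
    x = np.diff(opens)                                # first differences
    X = np.array([x[i:i + d] for i in range(len(x) - d)])
    y = x[d:]                                         # next difference
    return X, y

# Synthetic price series standing in for real opening prices.
opens = np.cumsum(np.random.randn(50)) + 100.0
X, y = make_dataset(opens, d=5)
print(X.shape, y.shape)  # (44, 5) (44,)
```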

As for the theorem, I liked it. But it relates to our networks only as a special case!

You have proved a degenerate case, with the training-sample length tending to infinity. Indeed, in that case, for an input vector that is a random variable with zero expectation, we obtain zero weights: the best forecast of tomorrow's value of an integrated random series is today's value! But as soon as we take a training sample of finite length, the trained weights tend to an equilibrium that minimizes the squared error. As an example proving this statement, take the case of a system of linear algebraic equations (the same NS): the weights are then uniquely defined, the training error on the training sample is identically zero (the number of unknowns equals the number of equations), and the weights (the coefficients of the unknowns) are obviously not zero.
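The square-system argument can be checked numerically in a few lines. A minimal sketch, assuming zero-mean random inputs and as many training sets as unknowns:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                              # number of inputs = number of unknowns
A = rng.standard_normal((d, d))    # d training sets of zero-mean inputs
b = rng.standard_normal(d)         # targets

w = np.linalg.solve(A, b)          # unique weights for a square, full-rank system
err = A @ w - b                    # training error is identically zero
print(np.max(np.abs(err)) < 1e-9, np.allclose(w, 0))  # True False
```

The residual vanishes to machine precision, yet the weights themselves are clearly nonzero, exactly as claimed for a finite sample.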

I agree with your comment on my theorem. Indeed reducing the number of sets in the training set will deviate the weights from zeros. But I believe that the theorem is applicable to networks, because in order to calculate the correlation I don't need to use an infinite number of training sets. The statistical average R(m)=E{x[i]x[i+m]} is calculated as the sum(x[i]x[i+m]) of the available data. The theorem is significant in that it shows that the network will have predictive power only if these sums (correlations) are significantly different from zero; otherwise the weights will converge to zeros. This is why it is important to find training data with nonzero correlation between inputs and outputs. Those inputs that have low correlation can be discarded as they will not help the network in predictions.
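The correlation estimate described here can be sketched directly. Below, `sample_corr` estimates R(m) from finite data as the normalized sum of x[i]*x[i+m]; the white-noise and AR(1) series are my own illustrative stand-ins for "data without" and "data with" predictive correlation.

```python
import numpy as np

def sample_corr(x, m):
    """Estimate R(m) = E{x[i] x[i+m]} as the (normalized) sum of
    x[i]*x[i+m] over the available data."""
    x = np.asarray(x, dtype=float)
    n = len(x) - m
    return float(np.dot(x[:n], x[m:m + n]) / n)

rng = np.random.default_rng(1)
# White noise: lag correlations near zero, so weights converge to zero.
noise = rng.standard_normal(10_000)
# AR(1) series: neighboring samples correlate, so prediction is possible.
ar = np.empty(10_000)
ar[0] = 0.0
for i in range(1, len(ar)):
    ar[i] = 0.8 * ar[i - 1] + rng.standard_normal()

print(abs(sample_corr(noise, 1)) < 0.05, sample_corr(ar, 1) > 0.5)  # True True
```

An input whose correlation estimate is indistinguishable from zero, like the noise series here, is exactly the kind that can be discarded.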

As far as I understand, the training error in the charts above is not divided by 2 or by the number of training sets. Is that correct? I would like to run your inputs through my network to make sure everything works properly. Could you save them to a file exactly as they are fed to the network inputs and outputs, and post it here? You could use your 5-4-1 network with 40 samples to reduce the amount of data.
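The two normalizations being asked about here are easy to pin down in code. A small sketch (function name and flags are mine) showing the plain sum of squares versus the version divided by 2 and by the number of sets:

```python
import numpy as np

def training_error(err, halve=True, average=True):
    """Sum-of-squares training error with the two optional normalizations
    discussed above: the 1/2 factor and division by the number of sets."""
    e = float(np.sum(np.asarray(err, dtype=float) ** 2))
    if halve:
        e /= 2.0
    if average:
        e /= len(err)
    return e

err = [1.0, -2.0, 3.0]   # residuals on three training sets
print(training_error(err, halve=False, average=False))  # 14.0
print(round(training_error(err), 2))                    # 14/(2*3) = 2.33
```

Comparing error curves from two implementations only makes sense once both use the same convention, which is why the question matters.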

 

Distributions of the kagi constructions and of the transaction-series shoulders for different H


1. H = 1 (one spread)


2. H = 4


3. H = 15


 
gpwr >> :

I agree with your comment on my theorem. Indeed reducing the number of sets in the training sample will deviate weights from zeros. But I think that the theorem is applicable to networks for the reason that in order to calculate the correlation you don't need to use an infinite number of training sets. The statistical average R(m)=E{x[i]x[i+m]} is calculated as the sum(x[i]x[i+m]) of the available data. The theorem is significant in that it shows that the network will have predictive power only if these sums (correlations) are significantly different from zero; otherwise the weights will converge to zeros. This is why it is important to find training data with nonzero correlation between inputs and outputs. Those inputs that have low correlation can be discarded as they will not help the network in predictions.

As far as I understand your training error on the above charts is not divided by 2 or by the number of training sets. Is that correct? I would like to run your inputs on my network to make sure everything works properly. Could you save them in a file as they are fed to the network inputs and outputs, and put them here. You can use your 5-4-1 network with 40 samples to reduce the data.

I increased the epoch count to 1000 and tweaked the iRProp+ settings so that the weight step doesn't decay too quickly. I also removed the division of the learning error by 2*(number of epochs). Now I obtain more satisfying results, closer to Neutron's. The learning error for random initial weights is 2-3 times smaller than for zero weights, indicating that there is correlation between inputs and outputs. But I still don't like that from epoch 4 to 70 the learning error is almost unchanged. We need to improve the learning algorithm. Then again, most commercial NS packages use iRProp+, so I trust this algorithm. That leaves the slow and complex LM (Levenberg-Marquardt) and BFGS.
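For readers unfamiliar with the algorithm named here, the core of iRprop+ is a per-weight step size adapted from the sign of the gradient, with backtracking when the error grows. Below is a minimal sketch of that idea on a toy quadratic problem, not the actual network code from this thread; all names and the default constants are illustrative.

```python
import numpy as np

def irprop_plus(grad_fn, loss_fn, w, n_epochs=200,
                eta_plus=1.2, eta_minus=0.5,
                step_min=1e-6, step_max=1.0):
    """Minimal iRprop+ sketch: per-weight steps grow while the gradient
    keeps its sign, shrink on a sign flip, and flipped weights are
    backtracked if the error increased."""
    w = np.asarray(w, dtype=float).copy()
    step = np.full_like(w, 0.1)
    prev_g = np.zeros_like(w)
    prev_dw = np.zeros_like(w)
    prev_loss = loss_fn(w)
    for _ in range(n_epochs):
        g = grad_fn(w)
        loss = loss_fn(w)
        same = g * prev_g > 0
        flip = g * prev_g < 0
        step[same] = np.minimum(step[same] * eta_plus, step_max)
        step[flip] = np.maximum(step[flip] * eta_minus, step_min)
        if loss > prev_loss:          # backtrack weights whose gradient flipped
            w[flip] -= prev_dw[flip]
        g[flip] = 0.0                 # skip the update for flipped weights
        dw = -np.sign(g) * step
        w = w + dw
        prev_g, prev_dw, prev_loss = g, dw, loss
    return w

# Toy problem: recover target weights by minimizing the squared error.
target = np.array([0.3, -1.2, 2.0])
loss = lambda w: float(np.sum((w - target) ** 2))
grad = lambda w: 2.0 * (w - target)
w = irprop_plus(grad, loss, np.zeros(3))
print(loss(w) < 1e-4)  # True
```

Because only the gradient's sign is used, a plateau in the error curve (like the epoch 4-70 stretch mentioned above) can still hide slow per-weight step adaptation.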


 
gpwr >> :

I agree with your comment on my theorem.


Since you are so good at maths, why don't you try to prove another theorem, on the optimal input dimension of a network working on market time series (better yet, not raw series but series of kagi transactions) - that would be a really useful thing!

 
paralocus >> :

Since you are so good at maths, why don't you try to prove another theorem, on the optimal input dimension of a network working on market time series (preferably not raw series but series of kagi transactions) - that's the real thing!

I'll give it a try.

 
gpwr wrote >>

I agree with your comment on my theorem. It is true that reducing the number of sets in the training set will deviate the weights from zeros. But I believe that the theorem is applicable to networks, because in order to calculate the correlation I don't need to use an infinite number of training sets. The statistical average R(m)=E{x[i]x[i+m]} is calculated as the sum(x[i]x[i+m]) of the available data. The theorem is significant in that it shows that the network will have predictive power only if these sums (correlations) are significantly different from zero; otherwise the weights will converge to zeros. This is why it is important to find training data with nonzero correlation between inputs and outputs. Inputs that have low correlation can be discarded because they do not help the network in making predictions.

There is also a non-linear correlation between samples. It is captured by a two-layer nonlinear NS but not by a linear discriminator, for which you proved the limit theorem.

As far as I understood, the training error in the diagrams above is not divided by 2 or by the number of sets. Correct? I would like to run your input data through my network to make sure everything works properly. Could you save them to a file as they are fed to the network inputs and outputs, and post it here? You can use your 5-4-1 network with 40 samples to reduce the data.

Below is the file with the inputs I used.

Files:
dif.zip  14 kb
 
Neutron >> :

There is also a non-linear correlation between samples. It is captured by a two-layer nonlinear NS but not by a linear discriminator, for which you proved the limit theorem.

Below is the file with the input data I used.

Thank you. There's a lot to say about non-linear correlation; I'll give my thoughts on it shortly. In the meantime, I am intrigued by your interesting conclusion about the untrained network. The fact that the untrained network shows more accurate predictions on out-of-sample data alarms me. The variance of the untrained state is much larger than the variance of the trained state. And if the trained state is the global minimum of the error function (the squared error), then the variance of that state is zero, since there is only one global minimum. Since the network has many untrained states, there will be many different predictions for the same input data. You can see that in your graphs. All in all, an interesting but alarming conclusion.

 

This is where I don't have a complete understanding myself.

According to Alexander Ezhov and Sergey Shumsky ("Neurocomputing"), there is an optimal training-sample length P_opt = w^2/d at which the generalization error is minimized, where d is the dimension of the NS input and w is the number of all tunable parameters of the NS. From this point of view, the NS is overtrained if P < P_opt: the NS "memorizes" the training sample. The variant P > P_opt is not good either, because over a longer length there is a higher probability of a market trend reversal, which amounts to a decrease in the correlations between samples.
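Plugging the 12-5-1 architecture discussed earlier into this rule gives a concrete number. A sketch under the assumption that "all tunable parameters" includes one bias per neuron (the posts don't say; without biases the count would be 65 instead):

```python
def p_opt(layers, with_bias=True):
    """Optimal training-sample length P_opt = w^2 / d (Ezhov & Shumsky),
    where w is the number of tunable parameters and d the input dimension."""
    w = sum((layers[i] + (1 if with_bias else 0)) * layers[i + 1]
            for i in range(len(layers) - 1))
    d = layers[0]
    return w, w ** 2 / d

w, p = p_opt([12, 5, 1])   # the 12-input, 5-hidden, 1-output case above
print(w, round(p))          # 71 420
```

By this estimate the 120-set sample used earlier would sit well below P_opt, i.e. on the "memorization" side of the trade-off.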

On the other hand, the NS can be overtrained by an excessive number of training epochs, so that the generalization error starts to grow again - or it may not be... In general, we need to perform numerical experiments with a full set of statistics, which is in itself very resource-intensive! But it has to be done. It would make things much easier to prove the above equation for the optimal length of the training vector. gpwr, do you want to tinker?
