Market etiquette or good manners in a minefield - page 82

 
Neutron >> :

This is where I don't have a complete understanding myself.

According to Alexander Ezhov and Sergey Shumsky ("Neurocomputing"), there is an optimal training-sample length P_opt = w^2/d at which the generalization error is minimized, where d is the dimension of the network input and w is the total number of tunable parameters of the network. From this point of view, the network is overtrained if P < P_opt: it simply "memorizes" the training sample. The case P > P_opt is not good either, because over a longer history the probability of a market trend reversal is higher, which amounts to weaker correlations between samples.
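The rule of thumb quoted above is simple enough to sketch in code. This is only an illustration of the formula as stated in the post; the helper name and the example numbers are mine, not from the book.

```python
def optimal_sample_length(num_weights: int, input_dim: int) -> int:
    """Heuristic optimal number of training samples: P_opt = w^2 / d,
    where w is the number of tunable parameters and d the input dimension."""
    return num_weights ** 2 // input_dim

# Example: a single-layer perceptron with d = 10 inputs and one output
# has w = 10 weights + 1 bias = 11 tunable parameters.
d = 10
w = d + 1
print(optimal_sample_length(w, d))  # 121 // 10 = 12
```

With P below this value the network memorizes; well above it, the oldest samples may belong to a different market regime.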

On the other hand, a network can be over-trained by an excessive number of epochs, after which the generalization error starts to grow again, or at least stops improving. In general we need numerical experiments over a whole set of statistics, which is itself very laborious. But it must be done. It would simplify things greatly to prove the above equation for the optimal length of the training vector. gpwr, do you want to tinker with it?

If you look at your graphs


then several questions arise. As I understand it, the red line with circles is the average learning error over several statistical experiments with different random initial weights, and the blue line with circles is the average prediction error on out-of-sample data. Right? The thin lines show the range of scatter. Now the questions:

1. Does the bottom thin blue line correspond to the bottom thin red line? In other words, does out-of-sample prediction accuracy improve for statistical experiments with the smallest learning error?

2. Since the spread of the learning error does not narrow to zero, the training does not reach a global minimum.

I'm now very concerned with this question: should I look for a learning algorithm that reaches the global minimum, in the hope that predictions on untrained samples will be more accurate? I run my network and see how inconsistent its predictions are depending on where I stop training it. Even with the same number of epochs, say 1000, the predictions differ between runs on the same training samples: half of them say the price will go up, the other half that it will go down. I'm not happy with that. If you train for a very long time, the network gets closer to a global minimum and its predictions become the same across runs.
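The run-to-run instability described here can be reproduced with a toy experiment. Everything below is synthetic and illustrative, not the actual network from this thread: one linear neuron is trained on the same samples with different random initial weights, and after few epochs its prediction on a probe input still depends on the seed, while long training drives all runs to the same minimum.

```python
import random

def train_neuron(samples, targets, epochs, seed, lr=0.1):
    """Train a single linear neuron by stochastic gradient descent,
    starting from random weights determined by `seed`."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(len(samples[0]))]
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            y = sum(wi * xi for wi, xi in zip(w, x))
            err = t - y
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

samples = [[1.0, 0.5], [0.5, -1.0], [-1.0, 0.3]]
targets = [1.0, -1.0, 0.2]
probe = [0.2, -0.1]

# After only 3 epochs the prediction on `probe` can differ (even in
# sign) between seeds; after many epochs the runs coincide.
preds = []
for seed in range(5):
    w = train_neuron(samples, targets, epochs=3, seed=seed)
    preds.append(sum(wi * xi for wi, xi in zip(w, probe)))
print(preds)
```

For this linear toy problem the per-epoch update is a contraction, so every seed converges to the same weights given enough epochs, which matches the observation that very long training makes the predictions consistent.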

About the optimal number of samples, I will think about it; it is not easy. You have to know the statistics of the market and how fast its distribution changes. Increasing the number of samples will lead to a situation where the net, while still feeling out a cow, finds that the cow has meanwhile changed into a turtle; eventually the net concludes that it is a horned turtle with hooves. If you reduce the number of samples, say the net has only been allowed to feel the cow's horns, then there are many possible answers: cow, elk, goat, deer, etc.

 
gpwr wrote >>

Increasing the number of samples will lead to a situation where the net, while still feeling out a cow, finds that the cow has meanwhile changed into a turtle; eventually the net concludes that it is a horned turtle with hooves. If you reduce the number of samples, say the net has only been allowed to feel the cow's horns, then there are many possible answers: cow, elk, goat, deer, etc.

+5 I completely agree.

You should, however, leaf through Ezhov and Shumsky. Maybe you will get some ideas for the proofs.

The blue line with circles is the average prediction error on out-of-sample data. Correct?

Correct.

1. Does the bottom thin blue line correspond to the bottom thin red line? In other words, does out-of-sample prediction accuracy improve for statistical experiments with the smallest learning error?

Because it is so resource-intensive, I haven't done a full experiment yet. But I agree that it is needed, and I will make myself do it.

P.S. gpwr, I came across a link to the work of two Americans who, five years ago, proved the existence of and implemented a modified backpropagation (ORO) algorithm for a two-layer nonlinear network with ONE output neuron. With a special type of activation function (whose specific form does not affect the network's computing power), the learning speed of the new algorithm exceeds classical backpropagation by more than two orders of magnitude! Have you ever seen anything like that?

 
Neutron >> :

+5 I completely agree.

You should, however, leaf through Ezhov and Shumsky. Maybe you will get some ideas for the proofs.

Right.

Because it is so resource-intensive, I haven't done a full experiment yet. But I agree that it is needed, and I will make myself do it.

P.S. gpwr, I came across a link to the work of two Americans who, five years ago, proved the existence of and implemented a modified backpropagation (ORO) algorithm for a two-layer nonlinear network with ONE output neuron. With a special type of activation function (whose specific form does not affect the network's computing power), the learning speed of the new algorithm exceeds classical backpropagation by more than two orders of magnitude! Have you ever seen anything like that?

I've seen several fast variants of backpropagation (ORO):

QuickProp - 1988, Fahlman; a second-order derivative term is added to speed up convergence

RProp - Resilient back-Propagation - 1993, Riedmiller, Germany; the idea of the algorithm is to replace the gradient with its sign

iRProp - Improved RProp - 2000, Igel, Germany; the same as RProp, but the network takes a step back if the learning error of the previous epoch increased

SARProp - Simulated Annealing back-Propagation - 1998, Treadgold, Australia; for global convergence, a random step size is added under certain conditions when the error from the previous epoch increased

JRProp - Jacobi RProp - 2005, Anastasiadis, a Greek working in England; the same as iRProp, but with a slightly different rollback method when the error increases

GRProp, GJRProp - Global RProp/JRProp - 2005, Anastasiadis; at each epoch the smallest weight step is chosen and replaced according to a special formula

I have tried them all. RProp, iRProp and JRProp work almost identically. The global methods SARProp and GRProp don't work. You can easily find articles on all these algorithms.
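The common core of the RProp family listed above is easy to show in code: the weight update uses only the SIGN of the gradient, the step grows when the gradient keeps its sign and shrinks when it flips. The sketch below follows the standard constants from Riedmiller's 1993 paper (1.2, 0.5, step bounds) with the iRProp-style "skip the update on a sign flip" rule; it is a minimal illustration, not any of the cited implementations.

```python
def rprop_step(weight, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One RProp update for a single parameter."""
    if grad * prev_grad > 0:          # same sign: accelerate
        step = min(step * eta_plus, step_max)
    elif grad * prev_grad < 0:        # sign flip: overshoot, slow down
        step = max(step * eta_minus, step_min)
        grad = 0.0                    # skip this update (iRProp- rule)
    if grad > 0:                      # move against the gradient sign,
        weight -= step                # by `step`, ignoring its magnitude
    elif grad < 0:
        weight += step
    return weight, grad, step

# Usage: minimize f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
w, prev_g, step = 0.0, 0.0, 0.1
for _ in range(200):
    g = 2.0 * (w - 3.0)
    w, prev_g, step = rprop_step(w, g, prev_g, step)
print(w)  # converges close to 3
```

Because only the sign is used, the method is insensitive to the wildly different gradient magnitudes across layers, which is what makes it faster than plain backpropagation in practice.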

Take a look here in Russian

http://masters.donntu.edu.ua/2005/kita/tkachenko/library/article01/index.htm

www.iis.nsk.su/preprints/pdf/063.pdf

 

Thank you. I'll have a look.

Those two Americans developed their fast algorithm solely for a single-output network, i.e. it is something highly specialized.

 

Got myself a 2001i Pro.

Can you briefly comment on the distribution graphs I posted yesterday?

 

Well, of course.

They are correct. The first and third figures are of little interest: the statistics are too small in the third and H is too small in the first. The second figure, however, is representative:

For the Kagi distribution (figure on the left) we can note the absence of shoulder lengths smaller than the partitioning step H (paralocus, you are certainly an original in your unusual representations of the data, e.g. measuring the partitioning step in spreads instead of points...) and the exponential decrease in the frequency of shoulder lengths as the length grows. For the series of transactions we see an almost flat distribution of the frequency of lengths in the vicinity of +/-H, and the same exponential decay in the transition to lengths greater than H; this can be seen in the figure on the right. I think such a representation of the input data for the network (additionally normalized by H) is almost ideal, since it requires no "cunning" normalization and centering procedures (the expected value is identically zero). However, the question of the optimality of the Kagi representation remains open. The problem should be solved comprehensively, and the second important block in this chain is money management (MM). For a TS without reinvestment, the Kagi partitioning is indeed optimal.
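The Kagi partitioning discussed above can be sketched as follows. This is an illustration of the general idea (a new "shoulder" starts only after the price reverses by at least the threshold H), not paralocus's actual code; the function name and the toy series are mine.

```python
def kagi_partition(prices, H):
    """Return the pivot prices of a Kagi construction with threshold H.
    Consecutive pivot differences ("shoulders") are never smaller
    than H, except possibly the final unfinished one."""
    pivots = [prices[0]]
    extreme = prices[0]
    direction = 0                       # +1 up-shoulder, -1 down-shoulder
    for p in prices[1:]:
        if direction == 0:              # wait for the first move >= H
            if abs(p - pivots[0]) >= H:
                direction = 1 if p > pivots[0] else -1
                extreme = p
        elif direction == 1:
            if p > extreme:
                extreme = p             # extend the up-shoulder
            elif extreme - p >= H:      # reversal of at least H
                pivots.append(extreme)
                direction = -1
                extreme = p
        else:
            if p < extreme:
                extreme = p             # extend the down-shoulder
            elif p - extreme >= H:
                pivots.append(extreme)
                direction = 1
                extreme = p
    pivots.append(extreme)
    return pivots

print(kagi_partition([0, 1, 2, 1.5, 1, 3, 0.5], 1))  # [0, 2, 1, 3, 0.5]
```

Note how the construction itself guarantees the property observed in the left figure: no completed shoulder can be shorter than H, since a pivot is only recorded after a counter-move of at least H.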

 
Thank you. MM is still terra incognita for me. I tried several times to reinvest the money accumulated while trading one lot, and took a significant loss. At the beginning of this thread you wrote about MM in relation to leverage. But can the trader actually adjust the leverage? In my view, leverage = 100 and that's it; you can only choose pairs to reduce risk. I prefer AUDUSD (I got that from your post too). Well, my time for MM has not come yet. I will now work on the two-layer network: I'll code today and show you what I've got tomorrow.
 
Leverage is proportional to the value of the lot relative to the amount of capital. Therefore, by increasing or decreasing the size of the traded lot you are essentially changing the leverage. For analysis it is easier to use leverage than lot size, because it is dimensionless; that is why I used it in my formulas.
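As a minimal numerical illustration of this definition (all names and numbers below are invented for the example, not taken from the thread's formulas):

```python
def effective_leverage(lot_size, contract_value, capital):
    """Effective leverage = position value / account capital.
    Dimensionless, and it scales with the traded lot size."""
    return lot_size * contract_value / capital

# 0.5 lots of a 100,000-unit contract on a 10,000 account:
print(effective_leverage(0.5, 100_000, 10_000))  # 5.0
# Doubling the lot doubles the effective leverage:
print(effective_leverage(1.0, 100_000, 10_000))  # 10.0
```

So even with a broker-fixed maximum leverage of 100, the leverage actually at work in the account is set by the trader through the lot size.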
 

Essentially, the MetaTrader strategy tester is a black box with several inputs (MAs, stochastics and other TA indicators), a countable number of adjustable parameters (MA periods, optimal amplitudes, etc.) and a "tricky" algorithm for mixing it all inside. The output is a Sell/Buy or stop-trading order. There is an optimization procedure that chooses the best parameters by maximizing the TS's profit on historical data. Does that remind you of anything? Exactly: if we regard the TA indicators, together with the cunning (nonlinear) algorithm that processes them, as the nonlinear activation function of a multilayer perceptron, then all of us here have been doing the same thing for years: building and training our own neural networks! Only this fact is not obvious, which is why working with the strategy tester causes so many problems (curve fitting, instability of the found optimum, etc.). Many respectable people on the forum are often skeptical about neural networks, while spending all their free time doing exactly the same thing, and there seems to be nothing else! Think about it.

If this is indeed the case, then we obviously need to switch to the language of AI. Much of what has plagued us for years will become obvious. For example, a tester fitted to history simply means a history that is not long enough (measured in TS events, i.e. transactions, not in bars), or, equivalently, an excessive number of tunable parameters. Insufficient profitability means indicators with a linear transformation of price are being used (the nonlinear correlations between market events are not exploited), and so on. Another important point: NN theory proves that the computing power of a network does not depend on the specific form of its nonlinearity. It follows that there is little sense in stuffing clever, nontrivial indicators and price-processing algorithms into a TS; they cannot significantly affect its predictive power. What matters is minimizing the generalization error (in TS terms), and for that it is enough to choose the optimal length of historical data and the optimal number of indicators!

In short, we are all doing the same thing, whether we polish the strategy tester or write our own network. What matters is understanding exactly what we are doing and why.

P.S. I ran a single perceptron on synthetic data.

It is clearly seen that during training the neuron confidently rolls down toward the global minimum (left figure, in red); this is indicated by the dispersion shrinking to zero (thin lines), which characterizes the training process across experiments with different initialization weights. On the other hand, at some point in training the generalization error (the inverse of predictive ability) begins to grow again, showing that the neuron is losing its ability to generalize knowledge. The figure on the right shows the same data in different axes. The learning optimum is clearly visible.
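The "learning optimum" read off such a plot is just the minimum of the validation-error curve before it turns back up. A minimal early-stopping sketch (the curve below is a synthetic stand-in for the one in the figure, and the `patience` rule is a common convention, not necessarily what was used here):

```python
def optimal_epoch(val_errors, patience=3):
    """Index of the minimum of the validation-error curve, stopping
    once the error has failed to improve for `patience` epochs."""
    best_i, best = 0, val_errors[0]
    for i, e in enumerate(val_errors):
        if e < best:
            best_i, best = i, e
        elif i - best_i >= patience:
            break                      # error has been rising: stop
    return best_i

# Synthetic generalization-error curve: falls, bottoms out, rises.
val = [1.0, 0.7, 0.5, 0.42, 0.40, 0.43, 0.47, 0.55, 0.6]
print(optimal_epoch(val))  # 4
```

Stopping training at this epoch is precisely the practical answer to the question raised earlier about where to stop: not at the global minimum of the training error, but at the minimum of the generalization error.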

 

When I was "polishing the tester" I had a similar feeling, but it never came to anything... :)

But now some seemingly simple, yet more workable ideas have come up. Here is something I formulated yesterday:

The optimal Kagi partitioning of a tick time series with threshold H should be taken to be the partitioning that yields the minimum number of consecutive same-colored shoulders in the transaction series. In that case the average shoulder length equals the average take.


I.e. it actually comes out as you wrote: take equals stop! There is one subtle point here:

If the distribution of the resulting series of transactions is such that more than 50% of successive shoulders have different colors, then why do we need a network at all? (just don't kick me, I'm only asking... :))
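The check posed in that question is trivial to run on a transaction series: count what fraction of successive shoulders change color (sign). If it is well above 50%, a plain "bet on reversal" rule already has an edge without any network. The series below is synthetic; the function is my illustration.

```python
def alternation_fraction(shoulders):
    """Fraction of consecutive shoulder pairs with opposite sign
    (i.e. different 'color') in a transaction series."""
    pairs = list(zip(shoulders, shoulders[1:]))
    flips = sum(1 for a, b in pairs if a * b < 0)
    return flips / len(pairs)

# Signed shoulder lengths of a hypothetical transaction series:
shoulders = [+2, -1, -3, +1, -2, +1, +2, -1]
print(alternation_fraction(shoulders))  # 5 of 7 pairs flip: ~0.714
```

A fraction near 0.5 would mean the colors are essentially independent, and only then would a nonlinear predictor have something nontrivial left to extract.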


P.S. Corrected the typo
