Neural network - page 3

 
joo >> :

The point of minimizing/maximizing the target function E(w1,w2) is to find a global extremum. And if there are a million such global extrema, what difference does it make which one the NN falls into?

It is worse if it gets stuck in one of the local minima/maxima. But that is no longer the NN's problem; it is a problem of the optimization algorithm.


Described by gpwr - no way.


I agree that if all local minima are identical in depth, and hence global, it makes no difference which one the network ends up in. But even for a simplified version of the network with a noisy series, local minima also exist elsewhere on the surface E(w1,w2). So we need genetic optimization, or gradient descent from several variants of the initial values, to end up in the right valley. My example was intended to illustrate how the parallel mathematics of a neural network leads to a large number of local minima and a complicated (long) learning process. This learning process often consists of several steps using different optimization techniques, global (genetic algorithm, differential evolution, particle swarm optimization, ant colony optimization) and local (gradient descent, conjugate gradient, Levenberg-Marquardt, BFGS), and takes a long time.

The basis of neural network mathematics is Kolmogorov's theorem: any continuous function of n variables x[1]...x[n] can be represented as a sum of 2n+1 superpositions of continuous and monotone mappings of unit segments:
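In its standard form (written here in the same notation as the rest of the thread), the representation reads:

f(x[1]..x[n]) = sum( F[q]( sum( f[q,p](x[p]), p=1..n ) ), q=0..2n )

where the outer functions F[q] are continuous and the inner functions f[q,p] are continuous and monotone on the unit segment; the outer sum has exactly 2n+1 terms.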


Any continuous function can also be represented as an infinite Taylor series:
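For one variable, for example, the expansion around a point x0 reads:

f(x) = sum( f^(k)(x0)/k! * (x - x0)^k, k=0..inf )

with the multivariate case written analogously in terms of partial derivatives.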


A power series is a simplified version of a Taylor series:
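That is, the derivatives at the expansion point are replaced by free coefficients fitted to the data, and the series is truncated at a finite order:

y = a[0] + sum(a[i]*x[i], i=1..n) + sum(b[i,j]*x[i]*x[j], i=1..n, j=1..n) + ...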



Representing an unknown function as a power series is mathematically simpler than a neural network. I will explain below.

Let us take a power series of the first order:


(1) y = f(x[1]...x[n]) = a[0] + sum(a[i]*x[i], i=1...n)


This is nothing other than a linear function. If y, x[1], ..., x[n] are terms of the same series, then we have a linear autoregressive (AR) model. A single-layer neural network is also described by the same model (1).

Now let's take a second order power series:


(2) y = f(x[1]..x[n]) = a[0] + sum(a[i]*x[i], i=1..n) + sum(b[i,j]*x[i]*x[j], i=1..n,j=1..n)


and so on. The unknown parameters of a power-series model are the coefficients a[i], b[i,j], ..., which correspond to the partial derivatives of the function f(x[1]..x[n]) with respect to each input x[i]. The output of the model is a linear function of these coefficients, while it is a non-linear function of x[1]...x[n]. The model coefficients a[i], b[i,j], ... are found by minimizing the sum of squared errors, as in the training of a neural network:


E(a[i],b[i,j],...) = sum( (t[k]-y[k])^2, k=1...p)


But in the case of the neural network we get a nonlinear least squares problem, while in the case of a power series we get a linear least squares problem, which is solved quite simply: we take the derivatives of E(a[i],b[i,j],...) with respect to each coefficient a[i], b[i,j], ... and equate them to zero. We obtain a symmetric linear system of equations in the unknowns a[i], b[i,j], ..., which is solved by the Cholesky method.
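As a sketch of that one-step solution (a hypothetical Python/NumPy example, not from the original post): fit the one-input second-order model y = a0 + a1*x + b11*x^2 to noisy samples of cos(x) by forming the normal equations and solving them with a Cholesky factorization:

```python
import numpy as np

# Sketch: linear least squares for the power-series model
#   y = a0 + a1*x + b11*x^2
# fitted to noisy targets t = cos(x) + noise. The normal equations
# X^T X c = X^T t are symmetric positive definite, so one Cholesky
# factorization finds the coefficients in a single "iteration".
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
t = np.cos(x) + 0.05 * rng.standard_normal(x.size)

X = np.column_stack([np.ones_like(x), x, x**2])  # one column per coefficient
A = X.T @ X
b = X.T @ t

L = np.linalg.cholesky(A)                        # A = L L^T
c = np.linalg.solve(L.T, np.linalg.solve(L, b))  # forward/back substitution

print(c)  # close to [1, 0, -0.5], the leading Taylor coefficients of cos(x)
```

Unlike iterative network training, nothing here depends on an initial guess: the same coefficients come out every time.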

Advantages of the power series method as compared to the Kolmogorov method (neural network) are:

1. It is much easier and faster to train the model: only one iteration. Neural networks are trained in 1000-100000 iterations by a combination of different optimization methods.

2. The result of training a power series is unambiguous, i.e. there is only one minimum, which is both local and global. Repeated training of a neural network leads to different local minima, and hence to different values of the weights and different models of the same process (time series).

Below is the surface E(a,b) for the power series y = a*x + b*x^2 with "noisy" training data t[k] = cos(x[k]) + rnd:


Note that in contrast to the neural network there is only one minimum here.
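The uniqueness is easy to verify numerically (a hypothetical Python/NumPy sketch): E(a,b) is quadratic in the coefficients, so its Hessian is the constant matrix 2*X^T*X, and a positive definite Hessian implies exactly one minimum:

```python
import numpy as np

# Error surface E(a, b) for the model y = a*x + b*x^2 fitted to
# t[k] = cos(x[k]) + noise. E is quadratic in (a, b), so its Hessian
# is constant: H = 2 * X^T X. Positive definite H => a single minimum.
rng = np.random.default_rng(42)
x = np.linspace(0.1, 3.0, 100)
t = np.cos(x) + 0.1 * rng.standard_normal(x.size)

X = np.column_stack([x, x**2])            # columns for a and b
H = 2.0 * X.T @ X                          # Hessian of E(a, b)
print(np.linalg.eigvalsh(H) > 0)           # both True => unique minimum

a, b = np.linalg.solve(X.T @ X, X.T @ t)   # that unique minimizer
```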

The disadvantage of the nonlinear model based on a power series is the fast growth of the number of its coefficients as the order of the series increases. Suppose n is the number of inputs (x[1]..x[n]). The number of coefficients nc is given by the formulas:

order = 1, nc = n+1

order = 2, nc = (n+1)*(n+2)/2

order = 3, nc = (n+1)*(n+2)*(n+3)/6

order = 4, nc = (n+1)*(n+2)*(n+3)*(n+4)/24

...

For example, a 3rd order model of a process with 12 inputs has 455 coefficients. Even so, they are found faster than the weights of a neural network with fewer parameters. The problem is not that training the power series slows down, but that we need a sufficient number of training sets x[1..n][k], t[k]: their number must exceed the number of model coefficients to avoid degeneracy. In practice, power series of 2nd or 3rd order give satisfactory results.
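All these counts are the binomial coefficient C(n + order, order), which a few lines of Python (a hypothetical helper) can confirm:

```python
from math import comb

# Number of coefficients of an order-d power series in n inputs.
# Matches the formulas above: nc = (n+1)(n+2)...(n+d)/d! = C(n+d, d).
def num_coeffs(n: int, d: int) -> int:
    return comb(n + d, d)

print(num_coeffs(12, 1))  # 13
print(num_coeffs(12, 2))  # 91
print(num_coeffs(12, 3))  # 455, the 3rd-order, 12-input example above
```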

 
gpwr >> :

I am still new to nets, so I cannot speak authoritatively,

but I think all the net enthusiasts on this forum are trying to build a linear solver (a system of linear equations),

and then, in order to introduce unpredictability into the solution, they fit it by looping the input with the output.


I have come to the same conclusion as you by reflecting and trying to understand what others do.

But solving it this way, you will not answer the question:

what to do with newly arrived data that have no roots in the system - is it BUY or SELL?

Because there is no function that defines the model.


try to train the net on the painted area.

You want to train the net so that when coordinates from this area are input, the net outputs 1, and when the coordinates come from an unpainted area, 0.

The output of each neuron should have a classifier that takes a state depending on training:

true if the output is greater than a threshold d, false if it is less. (If I'm wrong, don't judge me too harshly.)

 
gpwr >> :

After thoroughly studying neural networks and trying various learning algorithms, from gradient descent to genetic algorithms, I have come to the conclusion that the mathematical apparatus of neural networks is not perfect.

You don't have to have an ideal in your hands. This all resonates with the question of what percentage of movement one can afford to skip in order to improve market entry reliability. Ideally one would like to take 100%, right on the zigzag ;-). In practice it would be a grail for many people to take at least 50% on each edge.

Judging by the information available, nets do work. The ambiguity problems are solved by choosing the configuration, the net size and its initialization. In principle, the problem of local minima is solved too - by annealing or the same genetic method (don't we choose there the probability of accepting "bad genes", which is equivalent to jumping out of a local valley?). Plus we also have to remember that there are committees of nets doing the work, not just one. And looking more broadly, is everything really limited to a fully connected backpropagation net with supervised learning? Why not try to put quotes and signals into the input vector and feed it to a Kohonen map, for example?

 
gpwr wrote >>

Do you have a network that generates stable profits?

What do you think "stable profits" means?

 
marketeer >> :

You don't have to have a perfect one in hand. This all echoes the question of what percentage of movement one can afford to skip in order to improve market entry reliability. Ideally one would like to take 100%, right on the zigzag ;-). In practice it would be a grail for many to take at least 50% of each edge.

Judging by the available information, nets do work. The ambiguity problems are solved by choosing the configuration, the net size and its initialization. In principle, the problem of local minima is solved too - by annealing or the same genetic method (don't we choose there the probability of accepting "bad genes", which is equivalent to jumping out of a local valley?). Plus we also have to remember that there are committees of nets doing the work, not just one. And looking more broadly, is everything really limited to a fully connected backpropagation net with supervised learning? Why not try to put quotes and signals into the input vector and feed them to a Kohonen map, for example?


You misunderstood the essence of my reasoning. I wasn't talking about the correlation between an "under-learned" network and trading results. It is written everywhere that the network should be trained until the error on the sample under test stops decreasing. I agree with that and don't want to argue about it. The essence of my reasoning was to show how a parallel network structure leads to difficulties in its optimization and how a non-linear model based on a power series is able to achieve the same goal as a neural network, but with a much simpler mathematical apparatus and a fast learning process leading to a unique result.

As for the committee of networks, I have an opinion: it's all useless. Here's a question for anyone who believes in network committees. Suppose one network gives the right signals 60% of the time. Another network also gives the right signals 60% of the time. Now let us combine these two networks and count only the signals given by both networks simultaneously. That is, if both networks indicate "buy" or both indicate "sell", the corresponding "buy" or "sell" signal is given. If one network indicates "buy" and the other "sell", no signal is given. What is the probability that these combined signals are correct?

You could phrase the same question differently. Take one meeting of scientists where everyone votes on the question "is there life on Mars?" from a biological point of view, and 60% of those voting answer the question correctly (by the way, I do not know the answer :). Take a meeting of other scientists who vote on the same question, but from an astronomical point of view, and only 60% of them are right. Then combine the two meetings (biologists and astronomers) into one and ask the same question. If you say that by some miracle the correctness of the answer rises above 60%, then you need to study statistics.

 
gpwr wrote >>

You have misunderstood the essence of my reasoning. I did not talk about the relationship between an "untrained" network and trading results. It is written everywhere that the network must be trained until the error in the sample under test ceases to decrease. I agree with that and don't want to argue about it. The essence of my reasoning was to show how a parallel network structure leads to difficulties in its optimization and how a non-linear model based on a power series is able to achieve the same goal as a neural network, but with a much simpler mathematical apparatus and a fast learning process leading to a unique result.

As for the committee of networks, I have an opinion: it's all useless. Here's a question for anyone who believes in network committees. Suppose one network gives the right signals 60% of the time. Another network also gives the right signals 60% of the time. Now let us combine these two networks and count only the signals given by both networks simultaneously. That is, if both networks indicate "buy" or both indicate "sell", the corresponding "buy" or "sell" signal is given. If one network indicates "buy" and the other "sell", no signal is given. What is the probability that these combined signals are correct?

We can formulate the same question in another way. Take one meeting of scientists where everyone votes on the question "is there life on Mars?" from a biological point of view, and 60% of those voting answer the question correctly (by the way, I do not know the answer :). Take a meeting of other scientists who vote on the same question, but from an astronomical point of view, and only 60% of them are right. Then combine the two meetings (biologists and astronomers) into one and ask the same question. If you say that by some miracle the correctness of the answer rises above 60%, then you need to study statistics.

This is not a very good example...

There are many ways of constructing algorithmic compositions (committees). You suggest voting; from my experience I can say it is far from the best way - simple weighting is often better.

The autocorrelation of the error function (FunkOsh[i] vs FunkOsh[i+1]) is usually significant, > 0.8, and the correlation between the error functions of the base algorithms tends to 1. Committees are built in the belief that the base algorithms compensate for each other; for that to do any good, there must be no correlation between their error functions.

And let's not forget about AdaBoost - it really works, but it has its own pitfalls.

 
gpwr >> :

You have probably forgotten, as indeed most neuro-writers on this forum forget, judging by the comments, about learning without a teacher. Why, if you use a NN in trading, do you have to teach the NN something? We cannot even adequately teach ourselves how to trade. When talking about a committee of networks, do you mean that each NN is trained independently? And why do they give signals separately from each other? When building NNs, much less a committee of NNs, the only correct approach is learning without a teacher. The brain has several parts, and more than a dozen sub-parts. Each of them performs a different function, processing information external to it. And the owner of this "committee" makes a single decision. How is this possible? It is possible because the committee of networks must function in connection with each other - as a complex; otherwise nothing will work and there will be a "split personality".

 
StatBars >> :

just a little bit ahead of me :)

 
gpwr >> :

About the network committee, I have an opinion: it's all useless. Here's a question for anyone who believes in network committees. Suppose one network gives the right signals 60% of the time. Another network also gives the right signals 60% of the time. Now let us combine these two networks and count only the signals given by both networks simultaneously. That is, if both networks indicate "buy" or both indicate "sell", the corresponding "buy" or "sell" signal is given. If one network indicates "buy" and the other "sell", no signal is given. What is the probability that these combined signals are correct?

We can formulate the same question in another way. Take one meeting of scientists where everyone votes on the question "is there life on Mars?" from a biological point of view, and 60% of those voting answer the question correctly (by the way, I do not know the answer :). Take a meeting of other scientists who vote on the same question, but from an astronomical point of view, and only 60% of them are right. Then combine the two meetings (biologists and astronomers) into one and ask the same question. If you say that by some miracle the correctness of the answer rises above 60%, then you need to study statistics.

the probability of a correct signal is (0.6*0.6)/(0.6*0.6+0.4*0.4) = 69.23%, that is in theory)
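Under the assumption that the two networks err independently, that figure follows from Bayes' rule, and a quick Monte Carlo run (hypothetical Python sketch) reproduces it:

```python
import random

# Monte Carlo check of the 69.23% claim, assuming the two networks'
# errors are independent (the assumption the committee debate hinges on):
# P(correct | both agree) = 0.6*0.6 / (0.6*0.6 + 0.4*0.4) = 9/13.
random.seed(1)
agree = correct = 0
for _ in range(200_000):
    truth = random.choice(("buy", "sell"))
    wrong = "sell" if truth == "buy" else "buy"
    s1 = truth if random.random() < 0.6 else wrong
    s2 = truth if random.random() < 0.6 else wrong
    if s1 == s2:                      # the committee acts only on agreement
        agree += 1
        correct += (s1 == truth)
print(correct / agree)                # close to 9/13 ≈ 0.6923
```

If the two networks' errors are correlated, as the later posts argue they usually are, the gain over 60% shrinks accordingly.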

the example with the meeting of scientists is incorrect - it slips from probability to the average.

Statistics is another fun part of mathematics :)

 
gpwr wrote >> It is written everywhere that the network must be trained until the error on the test sample stops decreasing.

It's actually much more complicated than that. If you train it to the minimum error on the test sample, you are likely to get an over-trained network...

Reason: