Hybrid neural networks

 

Who has tried it?


Combining backpropagation with Cauchy learning

The weight correction in the combined algorithm, which uses both backpropagation and Cauchy learning, consists of two components: (1) a directional component computed by the backpropagation algorithm, and (2) a random component drawn from the Cauchy distribution.

These components are computed for each weight, and their sum is the amount by which the weight changes. As in the Cauchy algorithm, once the weight change has been computed, the objective function is evaluated. If it improves, the change is kept. Otherwise, it is kept with a probability determined by the Boltzmann distribution.

The weight correction is calculated using the equations presented earlier for each of the algorithms:

w_mn,k(t+1) = w_mn,k(t) + η*[ a*Δw_mn,k(t) + (1 - a)*δ_n,k*OUT_m,j ] + (1 - η)*x_c,

where η is the coefficient controlling the relative size of the backpropagation and Cauchy components in the weight step. If η equals zero, the system becomes a pure Cauchy machine. If η equals one, it becomes a pure backpropagation machine.
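
A minimal Python sketch of this rule for a single weight (the function and parameter names, the Cauchy sampling via tan, and the exact form of the Boltzmann test are assumptions for illustration, not taken from the excerpt):

import math
import random

def combined_update(w, delta_w_prev, delta, out, error_fn,
                    eta=0.9, a=0.7, T=1.0):
    # One weight update mixing a backpropagation step with a Cauchy random step.
    # eta weighs the backpropagation component, (1 - eta) the Cauchy component,
    # and a is the momentum coefficient from the formula above.
    backprop_step = a * delta_w_prev + (1.0 - a) * delta * out

    # Random component x_c drawn from a Cauchy distribution at temperature T.
    x_c = T * math.tan(math.pi * (random.random() - 0.5))

    delta_w = eta * backprop_step + (1.0 - eta) * x_c
    w_new = w + delta_w

    # Keep the change if the error improved; otherwise keep it with the
    # Boltzmann probability exp(-dE / T), as in simulated annealing.
    e_old, e_new = error_fn(w), error_fn(w_new)
    if e_new < e_old or random.random() < math.exp(-(e_new - e_old) / T):
        return w_new, delta_w
    return w, delta_w_prev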


 
gumgum >> :

Can you go into more detail on quasi-Newton methods and LMA?

LMA = Levenberg-Marquardt algorithm

Quasi-Newton method

Second order gradient methods


A lot to write, in brief:


Newton's algorithm:
x_k+1 = x_k - s_k * H^(-1)(x_k) * grad f(x_k), where
H^(-1)(x_k) is the inverse of the Hessian matrix at the point x_k,
s_k is the step size,
grad f(x_k) is the gradient of the function at the point x_k.

So, instead of the exact H^(-1)(x_k), the quasi-Newton method uses an approximation of the Hessian H(x_k) in which the second-order partial derivatives are not computed exactly but estimated from finite differences of the gradients. Accordingly, there are two most frequently used update formulas for this approximation:


Broyden-Fletcher-Goldfarb-Shanno (BFGS)

Davidon-Fletcher-Powell (DFP).
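
To make the quasi-Newton idea concrete, here is a minimal numpy sketch of one BFGS iteration; grad_fn, the fixed step size and starting from H_inv = identity are assumptions (a real implementation would add a line search):

import numpy as np

def bfgs_step(x, grad_fn, H_inv, step=1.0):
    # One quasi-Newton step: x_{k+1} = x_k - s_k * H^(-1) * grad f(x_k),
    # where H_inv approximates the inverse Hessian and is updated from
    # gradient differences only, so no exact second derivatives are needed.
    g = grad_fn(x)
    x_new = x - step * H_inv @ g

    # BFGS update of the inverse-Hessian approximation.
    s = x_new - x                   # parameter step
    y = grad_fn(x_new) - g          # gradient change
    rho = 1.0 / (y @ s)
    I = np.eye(len(x))
    H_inv_new = (I - rho * np.outer(s, y)) @ H_inv @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
    return x_new, H_inv_new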


LMA:


It also belongs to the second-order methods, i.e. second-order derivative information has to be taken into account,


x_k+1 = x_k - s_k * H^(-1)(x_k) * grad f(x_k),

where H(x_k) is approximated as H = J^T * J, with J the Jacobian,

and correspondingly grad f(x_k) = J^T * E, where J^T is the transposed Jacobian and E is the network error vector. Then

x_k+1 = x_k - [J^T * J + mu*I]^(-1) * J^T * E,

where mu is a scalar: if mu is 0, we get Newton's method with the Hessian approximation; if mu -> +Inf, we get the gradient method with a small step.
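
A minimal numpy sketch of one such step, assuming helper functions jacobian(x) and errors(x) that return J and E for the current parameter vector (the names are illustrative, not from the post):

import numpy as np

def lm_step(x, jacobian, errors, mu=1e-3):
    # One Levenberg-Marquardt step: x_{k+1} = x_k - (J^T J + mu*I)^(-1) J^T E,
    # where J^T J approximates the Hessian and J^T E the gradient.
    J = jacobian(x)                        # shape (n_samples, n_params)
    E = errors(x)                          # shape (n_samples,)
    A = J.T @ J + mu * np.eye(J.shape[1])  # damped Hessian approximation
    return x - np.linalg.solve(A, J.T @ E)

In practice mu is adapted during training: it is decreased after a successful step (behaviour closer to Gauss-Newton) and increased after a failed one (behaviour closer to a small gradient step).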


There is more discussion here:


Neuro-synthesizer, constructor+lab

 
rip wrote >>

LMA = Levenberg-Marquardt algorithm

Quasi-Newton method

Second order gradient methods

[...]

Thanks.

The question remains... Where is the truth?

Upper left corner (RProp). Why is dE/dW set to 0 when dE/dW(t-1)*dE/dW(t) < 0?

 
gumgum >> :

Thank you.

The question remains... Where is the truth?

Upper left corner (RProp). Why is dE/dW set to 0 when dE/dW(t-1)*dE/dW(t) < 0?

A negative product of gradients indicates that the algorithm has jumped over the extremum it was looking for. That is why the memory cell where the gradient value for the current step is stored is zeroed (note that it is the memory cell, not the gradient itself), so that the third condition fires at the next step. It is a neat trick of the algorithm; you will see it if you read the article in full.
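
A tiny trace of what that zeroing does (the numbers are made up):

g_prev, g = 0.4, -0.3       # stored dE/dW(t-1) and fresh dE/dW(t)
if g_prev * g < 0:          # sign flipped: we jumped over the extremum
    g = 0.0                 # zero the *stored* value, not the real gradient
# On the next step this g becomes dE/dW(t-1), so the product
# dE/dW(t-1) * dE/dW(t) equals zero and the "third condition" fires.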

 
alsu wrote >>

A negative product of gradients indicates that the algorithm has jumped over the extremum it was looking for. That is why the memory cell where the gradient value for the current step is stored is zeroed (note that it is the memory cell, not the gradient itself), so that the third condition fires at the next step. It is a neat trick of the algorithm; you will see it if you read the article in full.

But then you would need dE/dW(t-1) = 0.

 
gumgum >> :

But then you would need dE/dW(t-1) = 0.

At this step it is dE/dW(t); at the next step it becomes dE/dW(t-1).

 
alsu wrote >>

At this step it is dE/dW(t); at the next step it becomes dE/dW(t-1).

Thank you. Can you give me a hint about JRprop: is q individual for each weight, or not?

 

I am already confused: some write deltaW = -n*dE/dW, others deltaW = n*dE/dW, and so on...

 
gumgum >> :

Thank you. Can you give me a hint about JRprop: is q individual for each weight, or not?

As far as I understand, q is the same for all weights.

 
gumgum >> :

I am already confused: some write deltaW = -n*dE/dW, others deltaW = n*dE/dW, and so on...

Let's start from the beginning. RProp is a heuristic algorithm: it uses only the sign of the first derivative of the error function with respect to each synapse weight.

If the product of derivatives is positive, dE/dW(t)*dE/dW(t-1) > 0, the sign has not changed: the error is still decreasing and we are moving in the right direction.

If the sign has changed, i.e. dE/dW(t)*dE/dW(t-1) < 0, we have overshot the minimum (a local minimum) and need to step back. First, to compensate for the minimum we have just overshot: in your example DELTAij(t) is computed from the previous value of delta and -eta. There is no need to correct Wij(t) at this step; we only return to the previous value of Wij, whereas the way you do it we step back twice from the point where the derivative changed sign.


Regarding <deltaW = -n*dE/dW versus deltaW = n*dE/dW>: it does not matter, you just have to understand which step does what, in which direction, and at which point in time.


Since this is a heuristic, exact adherence to the formulas is not important; it is the principle that counts.
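
To tie the three cases together, here is a minimal Python sketch of the heuristic with weight backtracking; the variable names and the eta/delta constants are the usual published defaults, not values taken from this thread:

import numpy as np

def rprop_step(w, grad, grad_prev, delta, dw_prev,
               eta_plus=1.2, eta_minus=0.5, delta_min=1e-6, delta_max=50.0):
    # One RProp step over a vector of weights: `delta` holds the per-weight
    # step sizes, `grad_prev` the stored gradients from the previous step,
    # `dw_prev` the previous weight changes.
    for i in range(len(w)):
        s = grad[i] * grad_prev[i]
        if s > 0:                         # same sign: keep going, speed up
            delta[i] = min(delta[i] * eta_plus, delta_max)
            dw_prev[i] = -np.sign(grad[i]) * delta[i]
            w[i] += dw_prev[i]
        elif s < 0:                       # sign flipped: we overshot the minimum
            delta[i] = max(delta[i] * eta_minus, delta_min)
            w[i] -= dw_prev[i]            # step back: undo the previous change
            grad[i] = 0.0                 # zero the stored gradient (see above)
        else:                             # product is zero: the "third condition"
            dw_prev[i] = -np.sign(grad[i]) * delta[i]
            w[i] += dw_prev[i]
    return w, grad.copy(), delta, dw_prev # grad.copy() is grad_prev for the next step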
