The simplest XOR gate neural network for mql5 🍺 part2


3 April 2023, 14:16
Lorentzos Roussos

Read Part 1: The simplest XOR gate neural network

Okay.

What do we have: nodes, layers, the net structure and the feed-forward functions.

So what does that mean in the spectrum of "a problem"?

It means that if our problem has parameters / features, we can send them in via the input layer and get a forecast or prediction from the network.

Now the problem is that the network is guessing, because it's at a random state.

What do we need to do to teach it? We need to get samples with a known correct answer, throw their features in the input layer, calculate the network, and compare the correct answer with the network's prediction.

Yeah no sh*t Lorentzos, we know that. So, each weight on the network is responsible for the prediction we get, right or wrong.

When that prediction is wrong we want to note how wrong we are and move in the opposite direction. I won't touch on the derivatives and why we use the chain rule and all that; if you are interested I have another also very serious blog post about that here.

So, to the task. What you hear all the time is: we have this loss function, we calculate the error and we back propagate the error, blah blah. What must we do in code though?

Let's look at the node structure first.

class snn_node{
  public:
        double weights[];            //weights to the previous layer
        double weight_adjustments[]; //the "learning" that will occur
        double bias,bias_adjustment; //the bias unit (a weight with a node of value 1.0 essentially)
        double error;                //we receive the error from the following layers here
        double output;               //the output value of the node
        snn_node(void){reset();}
       ~snn_node(void){reset();}
  void reset(){
       ArrayFree(weights);
       ArrayFree(weight_adjustments);
       bias=0.0;
       bias_adjustment=0.0;
       error=0.0;
       output=0.0;
       }
  void setup(int nodes_of_previous_layer,
             bool randomize_weights=true,
             double weights_min=-1.0,
             double weights_max=1.0){
       //we have the same number of weights as nodes in the previous layer
       ArrayResize(weights,nodes_of_previous_layer,0);
       ArrayResize(weight_adjustments,nodes_of_previous_layer,0);
       ArrayFill(weight_adjustments,0,nodes_of_previous_layer,0.0);
       //randomize the weights if requested
         if(randomize_weights){
         init_weights(weights_min,weights_max);
         }else{
         ArrayFill(weights,0,nodes_of_previous_layer,1.0);
         bias=1.0;
         }
       }
       //initialize weights
  void init_weights(double min,double max){
       double range=max-min;
       for(int i=0;i<ArraySize(weights);i++){
          weights[i]=(((double)MathRand())/32767.0)*range+min;
          }
       bias=(((double)MathRand())/32767.0)*range+min;
       }
       /* now the calculation:
          for this node to be calculated (feed forward)
          we need the nodes of the previous layer
       */
  void feed_forward(snn_node &previous_layer_nodes[]){
       output=0.0;
       error=0.0;
       //we sum the weights times the previous layer node values
         for(int i=0;i<ArraySize(previous_layer_nodes);i++){
            output+=previous_layer_nodes[i].output*weights[i];
            }
         //we add the bias
           output+=bias;
         //we send this through the activation function
           output=activationSigmoid(output);
       }
  void reset_weight_adjustments(){
       for(int i=0;i<ArraySize(weight_adjustments);i++){
          weight_adjustments[i]=0.0;
          }
       bias_adjustment=0.0;
       }
};

We will construct a back propagation function.

By the time this function is called for a node, we have already summed, into its error, the portions of the error that the nodes of the next layer hold this node responsible for.

But the error was produced by an output that had gone through the activation function of this node, right? So what do we need to do? We must get the slope of that, the effect the node has on the network. How? With the... derivative (you probably hate that word by now).

So, what do we do first? We multiply the error we have received by the derivative of the activation function. Easy. And what does this function need to receive?

2 things:

  • Whether or not we should send the error to the previous layer (in case this is layer 1: the layer before it is the input layer, and the input layer does not "learn")
  • the array of the nodes of the previous layer.

Okay, let's build it:

  void back_propagation(bool propagate_error,//if true we will send the error to the previous layer , this will be false if the previous layer is the input layer
                        snn_node &previous_layer_nodes[]){//the previous layer's nodes to receive the error
       /* imagine the error received is after the activation function of this node
          so , we need to bring it back to distribute it 
          we multiply it by the derivative of the activation function
       */
       error*=derivativeSigmoid(output);
       }

See, the error is what we had received from the right side of this layer for this node, and we just multiply it by the derivative of the sigmoid.

Cool.

Now if you've read the other post you have a fresh view on the derivatives for this process and why they are what they are, but let's briefly mention it again:

For a node we have weights that are multiplied by the output values of the nodes of the previous layer, and the products go into a box that sums them up.

So each weight's effect on the sum, and on the error by extension, depends on the output of the previous-layer node it connects to. That is how we are going to calculate the adjustment needed for this weight. It makes total sense, right?
The sum at this node received some values, and one weight affects the sum by the magnitude of "the output of the node it connects to".

How much this weight affected the sum that goes into the activation is proportional to how much this weight must change based on the error this node produced. Right?

Look how simple that is. Remember we are storing the weight adjustments in the equivalent array (the network is not learning yet):

  void back_propagation(bool propagate_error,
                        snn_node &previous_layer_nodes[]){
       error*=derivativeSigmoid(output);
         for(int i=0;i<ArraySize(weights);i++){
            weight_adjustments[i]+=error*previous_layer_nodes[i].output;
            }
           bias_adjustment+=error;
       }

Don't be confused by the fact that we are adding: we are just collecting the total adjustment that we are going to apply to this weight at the end of the batch (and to the bias too).

Magical, right? We took the error, we unpacked it, and now we know how much the weight must change for this correction.

Great. Now look at how simple the other part is (the back propagation proper, i.e. spreading the error backwards).

How "responsible" is a node of the previous layer for the error on this layer?

Simple: it is responsible by the amount of the weight. You do almost exactly the same thing to send the error back to the nodes of the previous layer:

  void back_propagation(bool propagate_error,
                        snn_node &previous_layer_nodes[]){
       error*=derivativeSigmoid(output);
       //send the error backwards, weighted by each connection
         if(propagate_error){
           for(int i=0;i<ArraySize(previous_layer_nodes);i++){
              previous_layer_nodes[i].error+=weights[i]*error;
              }
           }
       //collect the weight (and bias) adjustments for this sample
         for(int i=0;i<ArraySize(weights);i++){
            weight_adjustments[i]+=error*previous_layer_nodes[i].output;
            }
           bias_adjustment+=error;
       }

This is it. All that math and all the equations lead to that little function above.

Awesome. Now, you will need to call this function from a layer too, so let's add it there as well.

Do not worry about the whole structure of the node; the full source code include is attached at the bottom of this blog.

So, the layer back prop function. What do we need to send? Whether or not we must backpropagate to the previous layer, and the previous layer itself.

Then we loop through the nodes and call the back propagation for each one of them.

This is for the layer structure:

    void back_propagation(bool propagate_error,
                          snn_layer &previous_layer){
         for(int i=0;i<ArraySize(nodes);i++){
            nodes[i].back_propagation(propagate_error,previous_layer.nodes);
            }
         }

Great. And let's eliminate another issue you are probably wondering about right about now.

How do we get the error for the output layer? Don't we need a function for that?

Yes, and here's how simple that is too:

We create a function to calculate the error on the output nodes of the network.

We give it a distinct name so we don't call it by accident.

And what do we send in there? The correct answers of the sample!

That's it. And what do we expect it to return to us? The loss function value.

And that's where all the quadratic, Hellinger, Itakura-Saito and cross-entropy buzzwords come in.

This is the function you place them in.

For this example we use the simplest one, the quadratic loss function.

This function goes in the layer structure as well:

  double output_layer_calculate_loss(double &correct_outputs[]){
         //quadratic loss: half the sum of the squared differences
         double loss=0.0;
         for(int i=0;i<ArraySize(correct_outputs);i++){
            loss+=MathPow((nodes[i].output-correct_outputs[i]),2.0);
            }
         loss/=2.0;
         //each output node's error is the derivative of that loss: output minus correct
         for(int i=0;i<ArraySize(nodes);i++){
            nodes[i].error=nodes[i].output-correct_outputs[i];
            }
         return(loss);
         }

Excellent. Now let's pause and think.

We are missing 2 things: 1) how the net structure handles the error and backpropagates, 2) how the network learns.

Coffee break.

Let's draw some stuff.

The network will receive a sample for which we have a known outcome and push the "parameters/features/inputs" of this sample into the input layer,

triggering an avalanche of calculations:

The network will respond with forecast f (which in academic lingo is "a vector"; in our lingo it's an array that can be of any length).

We then have another array with the correct values for this sample; let's call it C (a vector of known outcomes in their lingo).

And what we must do now is compare C to f and:

  • tally how wrong the network is for this sample into a general total loss, using any of the equations the tutors have been stuffing our faces with without explaining where they go (  🤣 ,jk )
  • then measure how wrong each node of our output is
  • then send these errors back all the way to layer 1 and log the weight adjustments for the network

Let's professionally draw this : 

Good. What is the "output"? It's the last layer of the network. And what is the array f? It's the array of the nodes of the last layer of the network.

So, back to the net structure.

We will add an integer samples_calculated that will count how many samples we have "seen" with the orange eye right there, and a double total_loss that will measure how wrong this "batch" is in total (there's another buzzword that popped up out of nowhere, at the right time though), using the magical equations that smarter-than-us people have provided.

But wait, we wrote those equations in the calculate loss function. Aha. So what will we do in the net structure?

We will increase samples_calculated and call the last layer's calculate loss function. And that is it!

      void calculate_loss(double &sample_correct_outputs[]){
           samples_calculated++;
           //calculate the loss at the output layer and also add to the total loss
           total_loss+=layers[ArraySize(layers)-1].output_layer_calculate_loss(sample_correct_outputs);
           }

Easy, right? But wait a minute: the output_layer_calculate_loss call also prepared the error on the nodes of the output layer for us, right?

Yes. That means we can now call back propagation for the entire network, but let's separate the function for convenience.

      void back_propagation(){
             for(int i=ArraySize(layers)-1;i>=1;i--){
                  layers[i].back_propagation(((bool)(i>1)),layers[i-1]);
                }
           }

That's it. We start from the last layer and go all the way to layer 1, and we remember to send "false" when the layer is 1 because we cannot back propagate the error to layer 0 (which is the input layer).

Magic, almost done. By the time you call back_propagation() on the net structure your network is ready to "learn".

So let's do it.

Look at how simple this is; when you see it and compare it to the six boards full of equations in the MIT video you first watched, you won't believe your eyes.

We go back to the node structure, okay?

Let's assume all the back propagations have completed.

We need to do what?

Apply the array weight_adjustments (that has been collecting corrections) to the array of the weights — subtracting, since the adjustments are gradients — and do the same for the bias.

We need to send what though? A learning rate.

  void adjust(double learning_rate){
       for(int i=0;i<ArraySize(weights);i++){
          weights[i]-=learning_rate*weight_adjustments[i];
          }
       bias-=learning_rate*bias_adjustment;
       reset_weight_adjustments();
       }

Six lines of code, and on the sixth we don't forget to call the adjustment reset. That's it, this node has "learned".

How about the layer ?

    void adjust(double learning_rate){
         for(int i=0;i<ArraySize(nodes);i++){
            nodes[i].adjust(learning_rate);
            }
         }

Same thing: we send the learning rate, it goes into its nodes and calls adjust on each.

How about the entire network ?

      void adjust(double learning_rate){
           for(int i=1;i<ArraySize(layers);i++){
              layers[i].adjust(learning_rate);
              }
           }

Same thing: learning rate in, it calls all the layers (except layer 0) and adjusts their nodes.

That's it, you now have a working simple neural net library.

But wait, let's do a couple more dribbles.

Wouldn't it be wonderful if the network decelerated its learning rate as it got better and better? Yes.

Can we provide the means for the coder (you) to calculate that?

Yes. We can provide a calculation of the max possible loss per sample, and of the current loss per sample.

We plug those into the net structure:

    double get_max_loss_per_sample(){
           double max_loss_per_sample=((double)ArraySize(layers[ArraySize(layers)-1].nodes))/2.0;
           return(max_loss_per_sample);
           }
    double get_current_loss_per_sample(){
           double current_loss_per_sample=total_loss/((double)samples_calculated);
           return(current_loss_per_sample);
           }
      void reset_loss(){
           total_loss=0.0;
           samples_calculated=0;
           }

We also add a reset loss function for when we start a new batch. But wait, what is that? The max loss is the number of nodes of the last layer divided by 2?

Yes, for the quadratic loss function we are using.

Let's see the equation real quick. This is the loss per sample:

loss = (1/2) * Σ (output[i] - correct[i])²

That is, you take the prediction of the network at each output node i, you subtract the correct answer for that node, you square the difference, you sum this over all the output nodes, then divide by 2.

What is the most this equation can spit out per node if all the nodes are wrong? Taking into account that the sigmoid function outputs from 0.0 to 1.0, the biggest possible difference is 1 or -1, and both squared equal 1. So the max is the number of output nodes divided by 2; that is how we got there. If you use different loss functions you will need to change that.

Cool. Let's see how we would set up a net using this library, for the XOR problem at that.

Let's deploy this:

snn_net net;
  //setup the network for xor
    //input layer 2 inputs no weights
    net.add_layer().setup(2);
    //layer 1 connects to input layer , 2 nodes 
    net.add_layer().setup(2,2);
    //layer 2 connects to layer 1 , 2 nodes
    net.add_layer().setup(2,2);
    //output layer connects to layer 2 , 1 node
    net.add_layer().setup(1,2);

Easy. And how do we do the training part?

Let's do it for the XOR problem: we need 2 features and 1 output.

    double features[],results[];
    ArrayResize(features,2,0);
    ArrayResize(results,1,0);
    for(int i=0;i<1000;i++)
     {
       int r=MathRand();
       if(r>16000){features[0]=1.0;}else{features[0]=0.0;}
           r=MathRand();
       if(r>16000){features[1]=1.0;}else{features[1]=0.0;}
       if(features[0]!=features[1]){results[0]=1.0;}else{results[0]=0.0;}

       net.feed_forward(features);
       net.calculate_loss(results);   
       net.back_propagation();  
     } 

     net.adjust(0.1);
     net.reset_loss(); 

So what did we do? We trained on a batch of 1000 samples with a known answer:

  • We randomize the first feature to 0 or 1, then the second, and we store these 2 in the features array.
  • Then we calculate the correct outcome and put it in the results array (which has a size of 1).
  • Then we call (per sample) the feed forward of the net by dumping in the features of this sample.
  • Then we call the loss calculation and we send in the correct results array (for this sample; the results array has the same size as the nodes of the last layer).
  • Then we back propagate and accumulate adjustments for the 1000 samples.
  • Exiting the loop, we perform the adjustments collected with a 0.1 learning rate.
  • And then we reset the loss (total loss stat and samples calculated).

Awesome. And how do we adjust the learning rate based on the loss?

     double max_loss=net.get_max_loss_per_sample();
     double cur_loss=net.get_current_loss_per_sample();
     double a=(cur_loss/max_loss)*0.1;
     net.adjust(a);
     net.reset_loss(); 

We get the max loss possible per sample, and the current loss per sample.

We divide the current loss by the max loss, so if we have the maximum possible loss per sample this returns 1.

Then the result of this division is multiplied by a base learning rate (0.1 in our example),

and then we call the adjust function with that value.

So if the current loss divided by the max loss per sample returns 0.1, we get a learning rate of 0.1*0.1 => 0.01, slowing down as we approach the solution, as your EA reaches holy grail territory  😋

That is it. I'm attaching the source code and the example of a XOR network.

If you get something like this (without the Balenciaga Harry Potter),

it is also correct. Here is an explanation of what is happening:

 

