Discussion of article "Neural networks made easy (Part 5): Multithreaded calculations in OpenCL"
The article is extremely complicated for someone not immersed in OpenCL. I tried to understand the essence, I read the articles you referenced, and ... I understood nothing, or rather only a very small part.
Your article is probably excellent, but too complicated. I think you need to cover the very basics of OpenCL, and perhaps work with built-in libraries like OpenCL.mqh, because without them there will be no understanding, as the lack of discussion under the article shows. I don't understand your code, because it references many libraries I know nothing about.
The second thing I don't understand is how to use OpenCL in optimisation mode. What will happen if several instances of OpenCL access the video card simultaneously? And then there are many ways to speed up OpenCL calculations of one and the same thing, for example getting rid of double in favour of int, etc. There are a lot of questions.
I propose starting with a very simple example of multithreaded programming. Suppose there is ONE perceptron with n inputs; we need simple code that normalises the inputs into the range [-1, 1] and calculates the neuron's value with tanh using OpenCL, and the OpenCL code itself should be commented. Then consider a very simple trading Expert Advisor, optimised in the tester on history, using ONE simple indicator that gives only one value, like RSI or CCI. That is, everything as simplified as possible, without contrivances. Later, I think, it would not be a problem to extend this to many perceptrons and write one's own Expert Advisor.
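The setup being requested here can be sketched in plain Python (a hypothetical reference implementation of the math only, not OpenCL; the input numbers are made up for illustration):

```python
import math

def normalize(values):
    """Scale raw values linearly into the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def perceptron(inputs, weights, bias):
    """A single neuron: weighted sum of inputs plus bias, squashed by tanh."""
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return math.tanh(s)

raw = [52.3, 48.7, 61.2, 39.9]            # e.g. recent RSI readings (hypothetical)
x = normalize(raw)                        # all values now in [-1, 1]
y = perceptron(x, [0.5, -0.2, 0.3, 0.1], 0.05)
print(x, y)                               # y is in (-1, 1) because of tanh
```

Porting this to an OpenCL kernel would mainly mean moving the loop body of `perceptron` into the kernel and passing the arrays as buffers.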
Thanks for the article.
The article is just what is needed. Thanks to the author. I understood parallel calculations in neurons on the first read. One just needs to "mature".))
Well, one hasn't matured, the other ..... But I guess everybody wants to.
I agree that the article is necessary
I'm trying hard to understand it.
Maybe you can write what I asked you to write, if it is so easy for you.
All this is interesting, although it is unlikely I could reproduce it, but I was puzzled by one question: why the 10-fold speed-up if only one more core was involved in training?
As far as I understand, when training on OpenCL, it does parallel calculation of several indicators at once.
All this is interesting, although it is unlikely I could reproduce it, but I was puzzled by one question: why the 10-fold speed-up if only one more core was involved in training?
A kernel is not a physical or logical core, but a microprogram that is executed on all logical cores of a processor or video card in parallel.
Here are some points I don't understand
__kernel void FeedForward(__global double *matrix_w, __global double *matrix_i, __global double *matrix_o, int inputs, int activation)
  {
   int i=get_global_id(0);   //What does this line do?
   double sum=0.0;
   double4 inp, weight;
   int shift=(inputs+1)*i;   //What does this line do?
   for(int k=0; k<=inputs; k=k+4)
     {
      switch(inputs-k)
        {
         case 0:
            inp=(double4)(1,0,0,0);   //What does this line do?
            weight=(double4)(matrix_w[shift+k],0,0,0);
            break;
         case 1:
            inp=(double4)(matrix_i[k],1,0,0);
            weight=(double4)(matrix_w[shift+k],matrix_w[shift+k+1],0,0);
            break;
         case 2:
            inp=(double4)(matrix_i[k],matrix_i[k+1],1,0);
            weight=(double4)(matrix_w[shift+k],matrix_w[shift+k+1],matrix_w[shift+k+2],0);
            break;
         case 3:
            inp=(double4)(matrix_i[k],matrix_i[k+1],matrix_i[k+2],1);
            weight=(double4)(matrix_w[shift+k],matrix_w[shift+k+1],matrix_w[shift+k+2],matrix_w[shift+k+3]);
            break;
         default:
            inp=(double4)(matrix_i[k],matrix_i[k+1],matrix_i[k+2],matrix_i[k+3]);
            weight=(double4)(matrix_w[shift+k],matrix_w[shift+k+1],matrix_w[shift+k+2],matrix_w[shift+k+3]);
            break;
        }
      sum+=dot(inp,weight);   //What does this line do?
     }
   switch(activation)
     {
      case 0: sum=tanh(sum);            break;
      case 1: sum=pow((1+exp(-sum)),-1); break;
     }
   matrix_o[i]=sum;
  }
I think these need to be explained for people who are new to this, and the main thing is that I can't see from the code where everything is initialised and where everything is called.
Good day, Boris.
You have attached the kernel code. As Maxim wrote above, it is a microprogram that is executed on processor or video card cores (depending on the initialisation context). The whole point is that many such programs are launched at once and executed in parallel across all the cores. Each copy works with its own data.
int i=get_global_id(0); //What does this line do?
This line simply gets the ordinal number of the microprogram instance within the pool of copies running in parallel. In the code, this number determines which chunk of data the instance processes. In this case it corresponds to the index of the neuron in the layer.
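A rough way to picture this (a Python sketch of the indexing logic only, not real OpenCL host code): if the kernel is enqueued with a global work size equal to the number of neurons, each copy sees a different get_global_id(0) and picks its own row of the flat weight array:

```python
inputs = 3                       # inputs per neuron
neurons = 2                      # global work size = number of neurons in the layer
# Flat weight matrix: (inputs + 1) weights per neuron (the extra one is the bias weight)
matrix_w = [0.1, 0.2, 0.3, 0.4,  # weights of neuron 0
            0.5, 0.6, 0.7, 0.8]  # weights of neuron 1

slices = []
for i in range(neurons):         # i plays the role of get_global_id(0)
    shift = (inputs + 1) * i     # same formula as in the kernel
    slices.append(matrix_w[shift : shift + inputs + 1])

print(slices)                    # each "work item" sees only its own row
```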
Further, for even more parallelism inside the microprogram, vector variables are used - small arrays of a fixed size. In this case the vector dimension is 4, i.e. arrays of 4 elements.
double4 inp, weight;
But the size of the input array will not always be a multiple of 4. Therefore, to avoid reading past the end of the array, a switch is used and the missing values are filled with "0". In this case, for each neuron the weight array is 1 element longer than the input array; this extra element is the bias weight. Earlier we used an additional neuron for this purpose, for which we adjusted only the weights and never recalculated the output value, which always stayed equal to "1". Here, instead of constantly appending "1" to the input array, I wrote it directly in the code, so the size of the input array does not change.
inp=(double4)(1,0,0,0); //What does this line do? weight=(double4)(matrix_w[shift+k],0,0,0);
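The padding in the switch can be illustrated with a small Python sketch (an illustration of the indexing only, not the actual kernel; `pad_to_vec4` is a hypothetical helper name):

```python
def pad_to_vec4(inputs_arr, weights_arr, k):
    """Build one 4-element chunk starting at index k, mirroring the kernel's switch.
    inputs_arr has n elements; weights_arr has n + 1 (the last is the bias weight).
    In the tail chunk, a 1 is appended for the bias, then zeros."""
    n = len(inputs_arr)
    remaining = n - k
    if remaining >= 4:                                   # default case: a full chunk
        return inputs_arr[k:k + 4], weights_arr[k:k + 4]
    # cases 0..3: append the constant 1 for the bias, then pad both vectors with zeros
    inp = inputs_arr[k:n] + [1.0] + [0.0] * (3 - remaining)
    w = weights_arr[k:n + 1] + [0.0] * (3 - remaining)
    return inp, w

# n = 2 inputs -> one chunk: [x0, x1, 1, 0] against [w0, w1, w_bias, 0] (kernel's case 2)
print(pad_to_vec4([0.5, -0.5], [0.1, 0.2, 0.3], 0))
```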
The dot function returns the scalar (dot) product of two vectors, i.e. in our case it sums 4 products of input values by weights in a single call.
sum+=dot(inp,weight); //What does this line do?
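In plain Python terms, dot over 4-element vectors is just (a sketch of the math, not the OpenCL built-in):

```python
def dot4(a, b):
    """Scalar product of two 4-element vectors, like OpenCL's dot() for double4."""
    return sum(x * y for x, y in zip(a, b))

print(dot4([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]))  # 5.0
```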
A kernel is not a physical or logical core, but a microprogram that runs on all logical cores of a processor or video card in parallel.
So that's nothing new: there was 1 core carrying the load, and now there are two cores and the load is halved... Most likely the changes are more significant and the comparison is not correct.

New article Neural networks made easy (Part 5): Multithreaded calculations in OpenCL has been published:
We have previously discussed some types of neural network implementations. In the networks considered, the same operations are repeated for each neuron. A logical next step is to utilize the multithreaded computing capabilities provided by modern hardware in an effort to speed up the neural network learning process. One possible implementation is described in this article.
After selecting the technology, we need to decide how to split the calculations into threads. Do you remember the fully connected perceptron algorithm during a feed-forward pass? The signal moves sequentially from the input layer to the hidden layers and then to the output layer. There is no point in allocating a thread to each layer, as the layers must be computed sequentially: a layer's calculation cannot start until the result of the previous layer is received. However, the calculation of an individual neuron does not depend on the results of the other neurons in the same layer. This means we can allocate a separate thread to each neuron and send all neurons of a layer for parallel computation.
Going down to the level of a single neuron, we could consider parallelizing the multiplication of input values by their weight coefficients. However, the subsequent summation of the resulting values and the computation of the activation function are combined into a single thread. I decided to implement these operations in a single OpenCL kernel using vector functions.
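The split described above can be pictured in Python (each loop iteration standing in for one parallel work item; the weights and inputs are hypothetical numbers):

```python
import math

def feed_forward_layer(matrix_i, matrix_w, activation=0):
    """One layer pass: each neuron (a would-be parallel thread) computes the
    weighted sum of the inputs plus its bias weight, then applies the activation."""
    n = len(matrix_i)
    out = []
    for w in matrix_w:          # in OpenCL, each iteration is a separate work item
        s = sum(x * wk for x, wk in zip(matrix_i, w[:n])) + w[n]  # w[n] = bias weight
        out.append(math.tanh(s) if activation == 0 else 1.0 / (1.0 + math.exp(-s)))
    return out

# Two inputs, two neurons; each weight row is [w0, w1, bias]
print(feed_forward_layer([0.2, -0.1], [[0.5, 0.3, 0.0], [-0.4, 0.1, 0.2]]))
```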
Author: Dmitriy Gizlyk