"New Neural" is an Open Source neural network engine project for the MetaTrader 5 platform. - page 58

 
Urain:
It doesn't matter.
 
TheXpert:
It doesn't matter.

I do care. I want to develop, hear different opinions, and draw conclusions.

 

GPU CUDA is a powerful thing. I had code that ran for 2-3 hours on 16 Intel threads (4-core CPU). On 300+ CUDA cores the same run took 10 minutes, which impressed me. But developing the CUDA code was very difficult: all the calculations had to be split into parallel threads by hand and the memory use had to be optimized (unfortunately, CUDA cores have little memory). I could not manage it; a clever programmer helped me. I am still wary of his code and keep making all my changes in my original (non-parallelized) code. The opinion I have formed: once an algorithm has settled, it can be optimized and moved to the GPU with the help of a competent programmer; self-taught programmers like me can hardly do it on their own. But if you start straight away with GPU code for an algorithm that has not yet settled, it will take a long time to get it debugged. I always start with simple, even if suboptimal, code that I can understand myself, and only after getting the expected results do I begin optimizing it. I am an amateur, not a programmer :)
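For readers who have not touched CUDA, a minimal sketch of what "splitting the calculations into parallel threads by hand" means in practice (hypothetical code, not the code described above): a serial loop over the neurons of a layer becomes a kernel in which each GPU thread computes one neuron.

```
#include <cuda_runtime.h>
#include <math.h>

// Serial CPU version: one loop iteration per neuron of the layer.
void layer_forward_cpu(const float* w, const float* x, float* y,
                       int n_out, int n_in)
{
    for (int j = 0; j < n_out; ++j) {
        float s = 0.0f;
        for (int i = 0; i < n_in; ++i)
            s += w[j * n_in + i] * x[i];
        y[j] = tanhf(s);
    }
}

// CUDA version: the outer loop disappears; each thread computes one neuron.
__global__ void layer_forward_gpu(const float* w, const float* x, float* y,
                                  int n_out, int n_in)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // neuron index
    if (j >= n_out) return;
    float s = 0.0f;
    for (int i = 0; i < n_in; ++i)
        s += w[j * n_in + i] * x[i];
    y[j] = tanhf(s);
}

// Launch (d_w, d_x, d_y are device pointers prepared with cudaMalloc/cudaMemcpy):
//   layer_forward_gpu<<<(n_out + 255) / 256, 256>>>(d_w, d_x, d_y, n_out, n_in);
```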

 
gpwr:

GPU CUDA is a powerful thing. I had code that ran for 2-3 hours on 16 Intel threads (4-core CPU). On 300+ CUDA cores the same run took 10 minutes, which impressed me. But developing the CUDA code was very difficult: all the calculations had to be split into parallel threads by hand and the memory use had to be optimized (unfortunately, CUDA cores have little memory). I could not manage it; a clever programmer helped me. I am still wary of his code and keep making all my changes in my original (non-parallelized) code. The opinion I have formed: once an algorithm has settled, it can be optimized and moved to the GPU with the help of a competent programmer; self-taught programmers like me can hardly do it on their own. But if you start straight away with GPU code for an algorithm that has not yet settled, it will take a long time to get it debugged. I always start with simple, even if suboptimal, code that I can understand myself, and only after getting the expected results do I begin optimizing it. I am an amateur, not a programmer :)

Therefore we would like some advice from a specialist in GPU computing. yu-sha and JavaDev, where are you?

I am interested in the following questions:

1. What should we pay attention to first when providing for the possibility (for now only the capability, not the implementation) of computing the project, or its individual modules, on the GPU, so as to avoid a painful redesign later?

2. What are the memory limits of GPU cores? Is it possible, in principle, to fit into memory the entire executable code together with a single run over the history (tens or hundreds of thousands of bars)?

These are general questions for now; more detailed ones, I expect, will come later.
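An aside on question 2: the limits differ from card to card, but they can be queried directly from the CUDA runtime. A minimal sketch (hypothetical, not part of the project) that prints the numbers that matter:

```
#include <cuda_runtime.h>
#include <cstdio>

// Print the memory limits relevant to question 2: total on-board (global/DDR)
// memory and the much smaller on-chip shared memory available to one block.
int main()
{
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("No CUDA device found\n");
        return 1;
    }
    for (int d = 0; d < n; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        printf("Device %d: %s\n", d, p.name);
        printf("  global (DDR) memory : %zu MB\n", p.totalGlobalMem >> 20);
        printf("  shared mem per block: %zu KB\n", p.sharedMemPerBlock >> 10);
        printf("  registers per block : %d\n", p.regsPerBlock);
        printf("  multiprocessors     : %d\n", p.multiProcessorCount);
    }
    return 0;
}
```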

 
joo:

Therefore we would like some advice from a specialist in GPU computing. yu-sha and JavaDev, where are you?

I am interested in the following questions:

1. What should we pay attention to first when providing for the possibility (for now only the capability, not the implementation) of computing the project, or its individual modules, on the GPU, so as to avoid a painful redesign later?

2. What are the memory limits of GPU cores? Is it possible, in principle, to fit into memory the entire executable code together with a single run over the history (tens or hundreds of thousands of bars)?

These are general questions for now; more detailed ones, I expect, will come later.

Why bother with CUDA at all in a project intended for general use? Some users have one CUDA card, some have another, and some have no CUDA at all. I, for example, have no CUDA on my work laptop. The same network code will look very different depending on the number of CUDA cores and the amount of memory. Better to write this network engine first for an ordinary Intel processor, so that everyone can use it, and then optimize it for CUDA if that makes sense. By the way, it would be better to build the engine so that it can also work in the cloud. I'm not familiar with cloud computing. Does it exist anywhere at all?
 
gpwr:
Why bother with CUDA at all in a project intended for general use? Some users have one CUDA card, some have another, and some have no CUDA at all. I, for example, have no CUDA on my work laptop. The same network code will look very different depending on the number of CUDA cores and the amount of memory. Better to write this network engine first for an ordinary Intel processor, so that everyone can use it, and then optimize it for CUDA if that makes sense. By the way, it would be better to build the engine so that it can also work in the cloud. I'm not familiar with cloud computing. Does it exist anywhere at all?
I agree, at first the project should be done without CUDA. But one remark on that post: don't tailor it only to Intel, don't forget about AMD either.
 
gpwr:
Why bother with CUDA at all in a project intended for general use? Some users have one CUDA card, some have another, and some have no CUDA at all. I, for example, have no CUDA on my work laptop. The same network code will look very different depending on the number of CUDA cores and the amount of memory. Better to write this network engine first for an ordinary Intel processor, so that everyone can use it, and then optimize it for CUDA if that makes sense. By the way, it would be better to build the engine so that it can also work in the cloud. I'm not familiar with cloud computing. Does it exist anywhere at all?

MQ have promised that MQL5 will support not CUDA (which runs only on nVidia graphics cards) but OpenCL. OpenCL can run on any hardware with multi-core processors, both CPUs and GPUs (AMD, nVidia and Intel). So a project that supports calculations on both the CPU and the GPU will work for everyone.

And since MQL5 will support OpenCL, the cloud agents will support GPU computing as well.

 

Urain:

You can't always be right; that's what an Open Source project is for: to argue and to prove your point.

Do I need it?
 
joo:

Therefore we would like some advice from a specialist in GPU computing. yu-sha and JavaDev, where are you?

I am interested in the following questions:

1. What should we pay attention to first when providing for the possibility (for now only the capability, not the implementation) of computing the project, or its individual modules, on the GPU, so as to avoid a painful redesign later?

2. What are the memory limits of GPU cores? Is it possible, in principle, to fit into memory the entire executable code together with a single run over the history (tens or hundreds of thousands of bars)?

These are general questions for now; more detailed ones, I expect, will come later.

The main points are:

1) GPUs are only needed for training

2) A significant performance gain comes from computing the neuron values within a layer in parallel and, more importantly, from running hundreds of networks simultaneously

Maximum decomposition of the project into autonomous components is what you should pay attention to

The DDR memory on the GPU is more than enough to store a history of hundreds of thousands of bars

Per-core (on-chip) memory is severely limited (30-50 KB). Careful programming solves this problem too, and it is worth the effort, since this memory runs at the core frequency and an access costs about 0 clock cycles. There are also shared-memory bank conflicts, but those can be worked around as well.

There is an unpleasant Windows feature: the driver resets if a single run lasts longer than about 5 seconds.

That is why the maximum achievable is training 100-200 networks of ~100 neurons each on 100K bars of history.

With this (maximum) configuration we get an optimization speed of about 20-30 runs per second on a GTX 460 GPU.
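To make the "hundreds of networks simultaneously" and the 30-50 KB on-chip memory points concrete, here is a hypothetical CUDA sketch (names and sizes are assumptions, not the project's code): one block evaluates one network, its weights are staged in the small shared memory, and the bar history stays in the large global (DDR) memory. Keeping a single launch this small also helps it finish well under the ~5-second Windows driver limit mentioned above.

```
#include <cuda_runtime.h>

// Hypothetical sketch: one CUDA block per network, that network's weights staged
// in on-chip shared memory, the bar history read from global (DDR) memory,
// one fitness value written per network.
__global__ void evaluate_population(const float* all_weights, // [n_nets * n_weights]
                                    const float* bars,        // shared price history
                                    float*       fitness,     // [n_nets], zero-initialised
                                    int n_weights, int n_bars)
{
    extern __shared__ float w[];             // this block's copy of the weights
    int net = blockIdx.x;                    // block index = network index

    // All threads of the block cooperate in copying the weights on-chip.
    for (int i = threadIdx.x; i < n_weights; i += blockDim.x)
        w[i] = all_weights[net * n_weights + i];
    __syncthreads();

    // Toy stand-in for "run the network over the history": each thread takes a
    // slice of bars; a real engine would evaluate the network bar by bar here.
    float acc = 0.0f;
    for (int t = threadIdx.x; t < n_bars; t += blockDim.x)
        acc += w[t % n_weights] * bars[t];

    atomicAdd(&fitness[net], acc);           // reduce per-thread results per network
}

// Launch: one block per network, shared memory sized to one weight vector
// (it must stay within the 30-50 KB per-block limit mentioned above):
//   evaluate_population<<<n_nets, 128, n_weights * sizeof(float)>>>
//       (d_weights, d_bars, d_fitness, n_weights, n_bars);
```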

 

Folks, if you decide to get involved with the GPU, keep the following in mind. It is easy to parallelize learning algorithms in which one iteration is independent of another: for example, genetic algorithms, where the hundreds of networks in a generation are computed independently of each other; Manhattan (brute-force enumeration of all solutions); or the optimization of independent networks in a committee. But there are not many such algorithms. Methods based on gradient descent will be harder to parallelize, because one iteration depends on the previous one. In those cases you have to split the computation of the neurons within a layer into parallel threads, which is not always trivial.
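The difference described here shows up directly in the shape of the training loops. A hypothetical CUDA sketch with toy kernels (assumed names, not the project's code): in the genetic case a whole generation is one wide, independent kernel launch, while in the gradient case the outer loop cannot be parallelized and only the work inside each step can.

```
#include <cuda_runtime.h>

// One block per individual: every fitness evaluation is independent of the others.
__global__ void eval_individual(const float* genomes, float* fitness, int len)
{
    int id = blockIdx.x;
    float s = 0.0f;
    for (int i = threadIdx.x; i < len; i += blockDim.x)
        s += genomes[id * len + i];
    atomicAdd(&fitness[id], s);
}

// One thread per weight: stands in for one gradient-descent update.
__global__ void sgd_step(float* w, int len, float lr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) w[i] -= lr * w[i];          // placeholder for a real gradient
}

int main()
{
    const int pop = 256, len = 1024;
    float *genomes, *fitness, *w;
    cudaMalloc(&genomes, pop * len * sizeof(float));
    cudaMalloc(&fitness, pop * sizeof(float));
    cudaMalloc(&w,       len * sizeof(float));
    cudaMemset(genomes, 0, pop * len * sizeof(float));
    cudaMemset(w,       0, len * sizeof(float));

    // Genetic-style training: a whole generation is one wide, independent launch;
    // selection and crossover between launches can stay on the CPU.
    for (int gen = 0; gen < 10; ++gen) {
        cudaMemset(fitness, 0, pop * sizeof(float));
        eval_individual<<<pop, 128>>>(genomes, fitness, len);
        cudaDeviceSynchronize();
    }

    // Gradient-style training: step k+1 needs the weights produced by step k,
    // so the outer loop stays serial and only the inside of a step is parallel.
    for (int step = 0; step < 10; ++step) {
        sgd_step<<<(len + 255) / 256, 256>>>(w, len, 0.01f);
        cudaDeviceSynchronize();
    }

    cudaFree(genomes); cudaFree(fitness); cudaFree(w);
    return 0;
}
```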
