OpenCL and the tools for it. Reviews and impressions.

 
joo: You would never guess from that post that its author is the topic starter.... It is unclear why he started this thread.

Come back in a couple of years and we'll remind you of this thread.

Personally, this thread was very useful for me - even when the topic starter started scaring my immature spirit with horror stories about reduced calculation accuracy.

 
Off to wipe and reinstall Windows))) .NET refuses to install
 
Reshetov:

The optimization mode in MT5 is very slow when the genetic algorithm is enabled. I wrote an Expert Advisor for MT4, tested and optimized it. Optimization took no more than 5 minutes on a dual-core machine (of course MT4 uses only one core, but other tasks do not interfere, since they can run on the second core). I rewrote the same Expert Advisor for MT5 and ran its optimization. The optimization took more than an hour - nearly 2 hours, to be exact. What is the difference?

There is no difference now.

Well, MetaTrader 5 appears to be ahead even when testing by opening prices: Compare of Testing Speed in MetaTrader 4 and MetaTrader 5

As promised, we have simplified the bar opening test mode and made it faster.

 

Well, it's been two years.

The CUDA version of the EA is working, under MT4, so far only in testing mode. For now I cannot get the computational speedup promised by nVidia.

There are two problems here:

1. nVidia itself, which either exaggerates how quickly programs can be reworked, or does NOT prepare its documentation properly at all, or fundamentally changes essential aspects of the programming model.

2. Parallelizing algorithms for the GPU takes much longer than expected. When I started porting a program from a DLL to a CUDA DLL, I figured, based on my 20+ years of experience with the C language, that nVidia's promises should be divided by 3 and the algorithm porting time they quoted multiplied by, say, 3.

But it turned out that the general rule is: all the promises of nVidia must be divided by TEN and the estimated time of porting C to CUDA must be multiplied by 10.

Note: once you have understood how a GPU accelerator works, you can port an algorithm from C to CUDA in about THREE WEEKS. And you can do it head-on, just to check that it builds. At that point your program is executed by ONLY ONE of the hundreds or THOUSANDS of small processors on the video card. It runs about 70 (seventy) times SLOWER than on the CPU. Yes, slow, but it works.

Then, with considerably more effort, you can start parallelizing the program. At that stage it still runs 4-5 times slower than promised, that is, only 2-3 times faster than the central processor.

And reworking your ALGORITHM globally, so that it executes in PASSES, I repeat, in PASSES, and not sequentially the way they teach it in every university in the world - that is a genuinely difficult task.

Let's be clear: parallelizing an ordinary sequential algorithm along multithreading lines is difficult but nothing unusual - that is one thing, and it gets you a 5-10x speedup on the GPU. But converting a sequential algorithm into a "bundle" algorithm (I have no better word for it), so that it loads hundreds and thousands of GPU processors and delivers the 100x speedup promised by nVidia - that can be a real problem.
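To make the difference concrete, here is a minimal sketch (a toy kernel and sizes of my own, not the EA code): the same per-bar calculation launched "head-on" on a single GPU thread, and then spread across thousands of threads.

#include <cuda_runtime.h>

__global__ void calc_naive(const float *price, float *out, int n)
{
    // the "head-on" port: one GPU thread walks all bars sequentially - builds and runs, but slowly
    for (int i = 0; i < n; ++i)
        out[i] = price[i] * 0.5f;               // stand-in for the real per-bar math
}

__global__ void calc_parallel(const float *price, float *out, int n)
{
    // the massively parallel version: each thread handles one bar
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = price[i] * 0.5f;
}

int main(void)
{
    const int n = 1 << 20;                      // number of bars, arbitrary for the sketch
    float *price, *out;
    cudaMalloc(&price, n * sizeof(float));
    cudaMalloc(&out,   n * sizeof(float));

    calc_naive<<<1, 1>>>(price, out, n);                        // single-thread launch
    calc_parallel<<<(n + 255) / 256, 256>>>(price, out, n);     // thousands of threads
    cudaDeviceSynchronize();

    cudaFree(price);
    cudaFree(out);
    return 0;
}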

But it's solvable too. It is just a matter of time.

But there is also Crimea, Benderites and so on .....

3. MetaTrader-4 itself poses no problems when writing a DLL for CUDA.

The problem is that nVidia's software developers (2500 people) radically disagree with the multithreading model used in MetaTrader 4/5. nVidia fundamentally changed this model when moving from CUDA 3.2 to 4.0+. Moreover, if you start asking them why it used to be one way (the way MetaTrader-4 and hundreds of other multithreaded programs do it) and is now another, all you will hear in reply is "you have fundamentally misunderstood our concept".

I've heard that somewhere before.... recently.....

4. It is much easier to port a fresh algorithm from C to CUDA than from C straight to generic OpenCL, so that is the route I recommend. All the more so because nVidia is supposed to officially present CUDA 6 today, in which, theoretically, on the new Maxwell-series GPUs and under some operating systems, the amount of programming can be reduced significantly thanks to unified memory and the elimination of explicit transfers between CPU and GPU.
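For illustration, a minimal sketch of what that unification looks like in code, assuming CUDA 6+ and a toy kernel of my own: a single managed allocation is visible to both the CPU and the GPU, so the usual pair of copy calls disappears.

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main(void)
{
    const int n = 1024;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // one allocation, visible to host and device

    for (int i = 0; i < n; ++i) data[i] = (float)i;  // the host writes directly, no copy call

    add_one<<<(n + 255) / 256, 256>>>(data, n);   // the device works on the same pointer
    cudaDeviceSynchronize();                      // wait before the host touches the data again

    printf("%f\n", data[0]);                      // the host reads the result directly
    cudaFree(data);
    return 0;
}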

 

Well?

So?

No one is interested at all?

Not a single question or post in a year.

But it is interesting to nVidia: they read my complaints on this and other forums, gathered their art council, chewed it over every which way, decided that traders are people too and that a trading terminal is a program too, and introduced in the latest version of CUDA a special compiler switch for building highly multithreaded CUDA programs.

http://devblogs.nvidia.com/parallelforall/gpu-pro-tip-cuda-7-streams-simplify-concurrency/

with a compile line like

nvcc --default-stream per-thread ./pthread_test.cu -o pthreads_per_thread
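A minimal sketch of the idea, loosely following the blog post above (the kernel is a placeholder of mine, the file name is the one from the compile line): with that switch, each host thread gets its own default stream, so kernels launched from different threads can overlap instead of serializing on the single legacy default stream.

// pthread_test.cu - build with the nvcc line shown above
#include <cuda_runtime.h>
#include <pthread.h>

__global__ void busy_kernel(void)
{
    // burn a little time so the overlap between threads is visible in a profiler
    for (volatile int i = 0; i < 1000000; ++i) {}
}

void *launch_from_thread(void *arg)
{
    // with --default-stream per-thread, this launch goes into this thread's own stream
    busy_kernel<<<1, 64>>>();
    cudaStreamSynchronize(0);   // stream 0 here means "this thread's default stream"
    return NULL;
}

int main(void)
{
    const int num_threads = 8;
    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; ++i)
        pthread_create(&threads[i], NULL, launch_from_thread, NULL);
    for (int i = 0; i < num_threads; ++i)
        pthread_join(threads[i], NULL);

    cudaDeviceReset();
    return 0;
}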

 

Unfortunately, even the Xeon Phi didn't take off. And it's even closer to conventional programming.

Anyone who needs general-purpose computing power can now easily get it, without much strain, on general-purpose multiprocessor systems. The number of cores in Intel processors is growing fast enough.

For example, our servers have 40 cores each, and I even have a work machine with 20 cores and DDR4 memory. A Xeon-based server with 40 cores at 3 GHz unambiguously beats a low-frequency Xeon Phi with 56 or more cores, without a single line of code having to be rewritten.

 
Renat:

Anyone who needs general-purpose computing power can now easily get it, without much strain, on general-purpose multiprocessor systems. The number of cores in Intel processors is growing fast enough.

For example, our servers have 40 cores each, and I even have a work machine with 20 cores and DDR4 memory. A Xeon-based server with 40 cores at 3 GHz unambiguously beats a low-frequency Xeon Phi with 56 or more cores, without a single line of code having to be rewritten.

You are slightly mistaken (twice, in fact - on both points). I basically used to think the same, especially as I was getting into GPU programming.

(1). "Power in universal calculations" on a host-CPU can easily be obtained ONLY for the simplest algorithms. That's the sticking point for most OpenMP and GPU programmers. There are hundreds of millions of CUDA video cards sold, but only about 300 programs for it. For finance - only about 7-8 (not counting collections of useless libraries).

Check out the full list on the nVidia web site:

http://www.nvidia.com/object/gpu-applications.html?cFncE

(Our first commercially available CUDA-accelerated EA for MT4 trading terminal is *not* there yet).

This list has not changed for several years.

Why? Because a complex adaptive algorithm, which is easy to assemble from pieces on a host CPU, turns out to need more than just re-programming: it has to be BROKEN APART and restructured. And that is not such an easy task, because of:

a). the peculiarities and limitations of the CUDA/OpenCL GPU model (kernels of different sizes have to be run sequentially);

b). any data transfer over the PCI bus between the GPU and the host processor kills the entire speed gain, and in complex adaptive algorithms you cannot do without such transfers.
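On point b), a minimal sketch of the effect (a toy kernel and sizes of my own, not from any real EA): the same kernel run with the data kept resident on the GPU, and then with the data pushed across the PCI bus on every iteration; run the two variants under a profiler, or wrap the loops in timers, to see how quickly the transfers eat the gain.

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void step(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 1.0001f;               // stand-in for one adaptation step
}

int main(void)
{
    const int n = 1 << 20, iters = 1000;
    const size_t bytes = n * sizeof(float);
    float *host = (float *)malloc(bytes);
    float *dev;
    cudaMalloc(&dev, bytes);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    // Variant 1: the data stays resident on the GPU - one copy in, one copy out
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    for (int it = 0; it < iters; ++it)
        step<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);

    // Variant 2: the data crosses the PCI bus on every iteration - the bus becomes the bottleneck
    for (int it = 0; it < iters; ++it) {
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
        step<<<(n + 255) / 256, 256>>>(dev, n);
        cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(dev);
    free(host);
    return 0;
}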

(2). "not requiring a single line of code to be rewritten" is also only for simple algorithms. The situation is worsened by the fact that OpenMP - as a real alternative to the GPU - works mysteriously, i.e. sometimes it works, and sometimes it produces rubbish. It is an illusion that just by adding a pragma line in one place, the algorithm will immediately parallelize like that. It is far from it. In a complex algorithm such unexpected correlations occur between data and parallel threads that we cannot do without constructing a graph.

The GPU is an entirely different matter. There is more work up front, but the program then ALWAYS behaves correctly as far as timing is concerned. Moreover, a program restructured for CUDA (even before any kernels are written) translates to OpenMP trivially, with a single pragma line, and THAT does work. There is little point in stopping at OpenMP at that stage, though - it is far more promising and reliable to go on and add the CUDA/OpenCL kernels. Surprisingly, the kernels for CUDA GPUs turn out to be short, clear and reliable.
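To illustrate that "single pragma line" claim with a toy loop of my own (not the real code): once the calculation has been restructured CUDA-style, so that every iteration is independent and writes only its own element, the host version parallelizes with one pragma; the very same pragma on a loop with a carried dependency compiles just as happily but produces the rubbish mentioned above.

#include <omp.h>
#include <stddef.h>

// CUDA-style structure: each index i is independent and writes only out[i]
void step_host(const float *price, float *out, ptrdiff_t n)
{
    #pragma omp parallel for            // the single pragma line - safe here
    for (ptrdiff_t i = 0; i < n; ++i)
        out[i] = price[i] * 0.5f;       // stand-in for the per-bar math
}

// By contrast, out[i] = out[i-1] + price[i] carries a dependency from one
// iteration to the next; the same pragma still compiles, but the results are wrong.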

Well, and in terms of absolute speed and reliability, the host CPU has no chance against the GPU.

=Financial markets, and forex in particular, are a VERY compressed version of huge processes taking place around the globe.

=For this reason, an algorithm for price prediction cannot be simple. At present it has to be adaptive and, figuratively speaking, statistical.

=So a good algorithm of this kind cannot get anywhere without simulation and adaptive feedback.

=Therefore, while the host CPU may still do for placing orders (its speed is still sufficient for that), it is almost impossible to manage without the GPU for testing and optimization.

 

You stated that I was wrong twice, and then, under the guise of proof, offered a completely unrelated set of arguments.

I am right about the following (and said so from the start):

  1. For general-purpose (x86 CPU based) calculations/algorithms there is no point in switching to CUDA/OpenCL. The x86 architecture is tearing the GPU apart on every front: lower development cost, lower retraining cost, lower rewriting cost (a sheer disaster on the GPU side), higher final performance, lower complexity, a growing number of high-frequency cores, and base memory bandwidth jumping ahead with DDR4.
  2. Even the attempt at a many-core Xeon Phi died because of the attendant complexity (a Linux-based environment), losing out to the straightforward build-up of high-frequency cores in the base CPU.


I haven't mentioned OpenMP at all. From my point of view, OpenMP is a "silver bullet for wimps". If you are fighting for real performance, drop the OpenMP nonsense, hand-write correct, native multi-threaded code from the start, profile it and push it to the maximum.

You yourself have shown that there is not enough GPU computing software. Most GPU programs are just the simplest of cases, including password crackers and silly miners (games are not up for discussion).

My opinion is that CPUs and the underlying infrastructure are evolving fast enough that in practice they outperform GPUs in real-world applications. 3-4 years ago one could still believe in the potential of GPUs, but by now the picture has become clear.

 
Renat:

You stated that I was wrong twice, and then, under the guise of proof, offered a completely unrelated set of arguments.

I am right about the following (and said so from the start):

  1. For general-purpose (x86 CPU based) calculations/algorithms there is no point in switching to CUDA/OpenCL. The x86 architecture is tearing the GPU apart on every front: lower development cost, lower retraining cost, lower rewriting cost (a sheer disaster on the GPU side), higher final performance, lower complexity, a growing number of high-frequency cores, and base memory bandwidth jumping ahead with DDR4.
  2. Even the attempt at a many-core Xeon Phi died because of the attendant complexity (a Linux-based environment), losing out to the straightforward build-up of high-frequency cores in the base CPU.


I haven't mentioned OpenMP at all. From my point of view, OpenMP is a "silver bullet for wimps". If you are fighting for real performance, drop the OpenMP nonsense, hand-write correct, native multi-threaded code from the start, profile it and push it to the maximum.

You yourself have shown that there is not enough GPU computing software. Most GPU programs are just the simplest of cases, including password crackers and silly miners (games are not up for discussion).

My opinion is that CPUs and the underlying infrastructure are evolving fast enough that in practice they outperform GPUs in real-world applications. 3-4 years ago one could still believe in the potential of GPUs, but by now the picture has become clear.

1. Extrapolating the growth rate of host-CPU core counts, it is unlikely that in the next few years they will reach the 3000 cores that a good video card has TODAY, with each of those cores running at about 1 GHz - roughly 3000 GHz of aggregate clock against 120-240 GHz for a 40-60 core host. So the host processor simply cannot compete with the GPU on raw throughput. That does assume, however, a good program that can not only keep those 3000 cores busy but also WORK AROUND all the pitfalls of today's GPU hardware architecture. And the GDDR5 memory on an average video card today delivers about 150 GB/s, a level that all flavours of DDR4 (around 25 GB/s) are still a long way from.

How is a host processor with 40-60 cores supposed to compete with that, even at 4 GHz and with 25 GB/s memory?

2. Exotic devices like the Phi lack the broad support and versatility of a video card. That is why they have died out.

3. On the need for direct multithreaded programming - yes, I agree with you, but it is an arduous task. Writing a complex NEW adaptive algorithm in a multithreaded form straight away is very difficult; you have to work by evolution, so to speak. And besides, I hardly need to tell you how badly Windows handles multithreading once it is really loaded up (all sorts of delays appear). That is why the OS even introduced so-called fibers - lightweight threads.

Conclusion: There is nothing cheaper, more promising and reliable than GPU.

 
You are retelling a theory that everyone interested already knows.

The reality is that the CPU is faster in general-purpose tasks thanks to a combination of factors. This has now become clear. The GPU silver bullet categorically fails to hit the target.
Reason: