AMD or Intel as well as the memory brand

 
begemot61 >> :

Why? I'm also very interested in the speed of calculating serious things

Well, that makes three of us. Still not a lot.

 
joo >> :

I understood your idea perfectly well. But I think we are loading the tester the wrong way. You, on the other hand, don't seem to have understood my point. But by and large it doesn't matter. For getting our bearings "on the ground", so to speak, that last Expert Advisor will do just as well.

OK. That's hardly a casus belli between respectable gentlemen, is it? ))) I'm also interested specifically in code execution speed, as my indicators (I was surprised to discover) turn out to be quite resource-intensive even in their public versions.

 

I think grasn would also welcome the opportunity to calculate faster

 
joo >> :

No. It's just that nobody sees resource-intensive tasks in MT apart from the optimizer's work. And even those who do don't use them in their day-to-day work; at least most of them don't. But never mind. I will wait for MT5; the gain in code speed there is visible to the naked eye. And there is also CUDA. I've downloaded the toolkits from the nVidia site and will be studying them. Moving the code into a DLL is no problem anyway.

As for CUDA, I've seen examples of calculations accelerated by a factor of 10-100, for some medical applications. And CUDA programming is already taught in universities. But it is a very tedious business. The language is similar to C, but you have to partition the task correctly and take into account the peculiarities of the GPU and of integer arithmetic. It turns out to be very low-level programming, and not every task can easily be reduced to that form so as to get a real gain even after six months of work. Although operations on integer matrices, for example, translate to CUDA almost perfectly.
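To show what I mean about matrices, here is ordinary integer matrix multiplication in plain C (the size and the names are purely illustrative). Every element of the result is computed independently of all the others, which is exactly why in CUDA each (row, column) pair can simply be handed to its own GPU thread.

/* Plain C reference: c = a * b for small integer matrices.
   Each c[row][col] depends only on one row of a and one column of b,
   so on a GPU one thread per output element is the natural mapping. */
#include <stdio.h>

#define N 4   /* tiny size, just for the illustration */

void mat_mul_int(int a[N][N], int b[N][N], int c[N][N])
{
    for (int row = 0; row < N; ++row)        /* on a GPU: thread index along y */
        for (int col = 0; col < N; ++col) {  /* on a GPU: thread index along x */
            int sum = 0;
            for (int k = 0; k < N; ++k)
                sum += a[row][k] * b[k][col];
            c[row][col] = sum;               /* no output element depends on another */
        }
}

int main(void)
{
    int a[N][N], b[N][N], c[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            a[i][j] = i + j;
            b[i][j] = (i == j);              /* b = identity, so c must equal a */
        }
    mat_mul_int(a, b, c);
    printf("c[1][2] = %d (expected %d)\n", c[1][2], a[1][2]);
    return 0;
}

In CUDA the two outer loops disappear: the kernel computes one sum per thread, and the grid of threads covers the whole matrix.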
 
begemot61 >> :
As for CUDA, I've seen examples of calculations accelerated by a factor of 10-100, for some medical applications. And CUDA programming is already taught in universities. But it is a very tedious business. The language is similar to C, but you have to partition the task correctly and take into account the peculiarities of the GPU and of integer arithmetic. It turns out to be very low-level programming, and not every task can easily be reduced to that form so as to get a real gain even after six months of work. Although operations on integer matrices, for example, translate to CUDA almost perfectly.

There is also the OpenCL project, an open standard for parallel computing on heterogeneous devices (CPUs, GPUs and so on). Almost everyone is involved in it, including AMD and nVidia, and it offers a higher level of abstraction. The link contains a code sample which, as you can see, is C (the kernel language is based on the C99 standard).
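For orientation, the host side in plain C looks roughly like this - a minimal, illustrative sketch (not the linked sample) that adds two int arrays on whatever OpenCL device is found. The kernel itself is the short C99-based string handed to the driver; error handling and cleanup are omitted for brevity.

/* Minimal OpenCL host sketch in C. Assuming an OpenCL SDK is installed, build with e.g.:
   gcc -std=c99 cl_add.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

/* The kernel is OpenCL C (based on C99); one work-item handles one array element. */
static const char *src =
    "__kernel void add_int(__global const int *a,"
    "                      __global const int *b,"
    "                      __global int *c)"
    "{ size_t i = get_global_id(0); c[i] = a[i] + b[i]; }";

int main(void)
{
    enum { N = 1024 };
    int a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context       ctx  = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q    = clCreateCommandQueue(ctx, device, 0, &err);
    cl_program       prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel        kern = clCreateKernel(prog, "add_int", &err);

    /* Device buffers; the input ones are filled with copies of the host data. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);

    clSetKernelArg(kern, 0, sizeof da, &da);
    clSetKernelArg(kern, 1, sizeof db, &db);
    clSetKernelArg(kern, 2, sizeof dc, &dc);

    size_t global = N;                                                   /* N work-items in total */
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);  /* blocking read */

    printf("c[10] = %d (expected 30)\n", c[10]);
    return 0;
}

The kernel is compiled at run time for whichever device is picked, so the same source can run on both AMD and nVidia hardware; that is the higher level of abstraction I mean.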

 

I've studied the sources; I'll report back in the afternoon - it's bedtime now.

The results are more or less clear.

 

I will try to briefly describe my findings.

When optimizing an Expert Advisor, the tester uses several tens of MB of memory. I, for example, have an fxt-file for a year of minute data modelled on open prices, about 36 MB in size. This history is kept in memory and is accessed more or less randomly. In this mode the RAM cannot deliver data fast enough to feed the processor with everything it could handle in the "ideal" case, so the cache plays the key role here.
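To get a feel for how much this access pattern costs, here is a crude plain-C sketch (buffer size, stride and timing method are purely illustrative). It walks a buffer of roughly that size once sequentially and once in large strides; the strided walk touches the same number of elements but is typically several times slower, simply because almost every access misses the cache.

/* Crude illustration (not a rigorous benchmark): sequential vs. strided walk
   over ~36 MB of ints. The strided walk defeats the caches and the prefetcher. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BYTES (36u * 1024u * 1024u)        /* roughly the size of the fxt-file */
#define N     (BYTES / sizeof(int))

static double walk(const int *buf, size_t stride)
{
    clock_t t0 = clock();
    volatile long long sum = 0;            /* volatile so the reads are not optimized away */
    size_t idx = 0;
    for (size_t i = 0; i < N; ++i) {
        sum += buf[idx];
        idx += stride;
        if (idx >= N) idx -= N;            /* wrap around; N accesses in total either way */
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    int *buf = malloc(BYTES);
    if (buf == NULL) return 1;
    for (size_t i = 0; i < N; ++i) buf[i] = (int)i;

    printf("sequential: %.3f s\n", walk(buf, 1));     /* cache- and prefetch-friendly */
    printf("strided:    %.3f s\n", walk(buf, 4099));  /* mostly cache misses */
    free(buf);
    return 0;
}

The absolute numbers differ from machine to machine; what matters is the ratio between the two walks, and that ratio is exactly what a bigger and faster cache improves.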

Here begins the most interesting part.

1) Obviously, on cache misses the speed and latency of memory accesses play an important role. Here the processors can be divided into two groups:

a) Atom and Core 2 - the memory controller is in the "north bridge" (North Bridge - NB) chipset.

b) all the others, with the memory controller integrated into the processor (integrated memory controller, IMC).

In this case the processors from group "a" can lose significantly to the processors from group "b". That said, the Core i7's IMC is much more efficient than the one in AMD processors, which is one of the reasons for the Core i7's unconditional victory.

2) For a cache to mask memory latency effectively, it has to be as large as possible, as associative as possible (the "x-way" figure in the CPU-Z screenshots), and have as little latency of its own as possible.

And here the processors clearly line up in terms of speed depending on cache size (all other things being equal).

- The slowest CPU is Celeron with 512KB cache (I don't take Atom into account - its architecture is designed for economy rather than performance);

- Athlons - their small caches hurt less thanks to the IMC;

- Celeron 900 with 1 MB cache;

- Core 2 processors with 3-6 MB of cache; models with a larger cache stand somewhat apart;

- Phenom II - here 6 MB of cache (and with the highest associativity, as much as 48-way!) is combined with an IMC;

- and the fastest is the Core i7 - it combines all the most advanced features: a 3-channel (and generally very fast) IMC and the largest (and again very fast) 8 MB L3 cache.

Now, as for why the Phenom's efficiency goes down when overclocked while the Core i7's goes up.

In both of these processors the IMC and the L3 cache are clocked separately (while the L1/L2 caches always run at the CPU frequency).

But Belford's overclocking method is to raise the CPU multiplier (he has a Black Edition - BE - series processor with an unlocked multiplier; normally the multiplier is capped from above), which leaves the L3 cache unaffected.

Whereas the Core i7 (with the exception of the XE models) can only be overclocked by raising the base clock (BCLK). This also overclocks the IMC together with the L3 cache (in the Core iX this block is called the Uncore).

So the L3 of Belford's Phenom always stays at a fixed 2009.1 MHz. On YuraZ's machine, by contrast, it speeds up from 2.13 GHz at stock to 3.2 GHz when the processor is overclocked to 4 GHz (CPU = BCLK x 20, Uncore = BCLK x 16, so 4 GHz means a 200 MHz BCLK and a 200 x 16 = 3.2 GHz Uncore). And on the Xeon, at a CPU frequency of 3.33 GHz, the Uncore runs at 2.66 GHz.

Moreover, even at 2.13 GHz the Core i7's L3 cache is noticeably faster than the Phenom's L3 at 2 GHz, and naturally much faster still at 3.2 GHz, which is what gives the Core i7 its excellent scaling in this test.

For now this is at the level of conjecture, since I haven't done any detailed research. But it looks as if the optimization speed depends strongly on cache size and speed, and somewhat less on processor frequency.

 
Docent >> :

I will try to briefly describe my findings.

When optimizing an Expert Advisor, the tester uses several tens of MB of memory. I, for example, have an fxt-file for a year of minute data modelled on open prices, about 36 MB in size. This history is kept in memory and is accessed more or less randomly. In this mode the RAM cannot deliver data fast enough to feed the processor with everything it could handle in the "ideal" case, so the cache plays the key role here.

Here begins the most interesting part.

1) Obviously, on cache misses the speed and latency of memory accesses play an important role. Here the processors can be divided into two groups:

a) Atom and Core 2 - the memory controller is in the "north bridge" (North Bridge - NB) chipset.

b) all the others, with the memory controller integrated into the processor (integrated memory controller, IMC).

In this case the processors from group "a" can lose significantly to the processors from group "b". That said, the Core i7's IMC is much more efficient than the one in AMD processors, which is one of the reasons for the Core i7's unconditional victory.

2) For a cache to mask memory latency effectively, it has to be as large as possible, as associative as possible (the "x-way" figure in the CPU-Z screenshots), and have as little latency of its own as possible.

And here the processors clearly line up in terms of speed depending on cache size (all other things being equal).

- The slowest CPU is Celeron with 512KB cache (I don't take Atom into account - its architecture is designed for economy rather than performance);

- Athlons - their small caches hurt less thanks to the IMC;

- Celeron 900 with 1 MB cache;

- Core 2 processors with 3-6 MB of cache; models with a larger cache stand somewhat apart;

- Phenom II - here 6 MB of cache (and with the highest associativity, as much as 48-way!) is combined with an IMC;

- and the fastest is the Core i7 - it combines all the most advanced features: a 3-channel (and generally very fast) IMC and the largest (and again very fast) 8 MB L3 cache.

Now, as for why the Phenom's efficiency goes down when overclocked while the Core i7's goes up.

In both of these processors the IMC and the L3 cache are clocked separately (while the L1/L2 caches always run at the CPU frequency).

But Belford's overclocking method is to raise the CPU multiplier (he has a Black Edition - BE - series processor with an unlocked multiplier; normally the multiplier is capped from above), which leaves the L3 cache unaffected.

Whereas the Core i7 (with the exception of the XE models) can only be overclocked by raising the base clock (BCLK). This also overclocks the IMC together with the L3 cache (in the Core iX this block is called the Uncore).

So the L3 of Belford's Phenom always stays at a fixed 2009.1 MHz. On YuraZ's machine, by contrast, it speeds up from 2.13 GHz at stock to 3.2 GHz when the processor is overclocked to 4 GHz (CPU = BCLK x 20, Uncore = BCLK x 16, so 4 GHz means a 200 MHz BCLK and a 200 x 16 = 3.2 GHz Uncore). And on the Xeon, at a CPU frequency of 3.33 GHz, the Uncore runs at 2.66 GHz.

Moreover, even at 2.13 GHz the Core i7's L3 cache is noticeably faster than the Phenom's L3 at 2 GHz, and naturally much faster still at 3.2 GHz, which is what gives the Core i7 its excellent scaling in this test.

For now this is at the level of conjecture, since I haven't done any detailed research. But it looks as if the optimization speed depends strongly on cache size and speed, and somewhat less on processor frequency.

Thank you. I think it's very convincing. I agree.

 
Docent >>: But it looks as if the optimization speed depends strongly on cache size and speed, and somewhat less on processor frequency.

A little clarification. Would it be correct to assume that the speed of optimization is more dependent on cache size and performance than on CPU frequency?

 
HideYourRichess wrote >>

A little clarification. Is it correct to assume that the optimization speed is more dependent on cache size and performance than on processor frequency?

It seems so. But for now it is still a supposition, and I emphasized that in my post!
