AMD or Intel as well as the memory brand

 
begemot61 >> :

Why? I'm also very interested in the speed of calculating serious things

Well, that makes three of us. Still not a lot.

 
joo >> :

I understood your idea perfectly well. But I think we are loading the tester the wrong way. You, on the other hand, don't seem to have understood my point. But by and large it doesn't matter. For getting our bearings "on the ground", so to speak, that last Expert Advisor will do just as well.

OK. That's hardly a casus belli between respectable gentlemen, is it? ))) I'm also interested specifically in code execution speed, as my indicators (I was surprised to discover) turn out to be quite resource-intensive even in their public versions.

 

I think grasn would also welcome the opportunity to calculate faster

 
joo >> :

No. It's just that nobody sees resource-intensive tasks in MT apart from the optimizer's work. And even those who do don't use them in their day-to-day work; at least most of them don't. But never mind. I will wait for MT5; the gain in code speed there is visible to the naked eye. And there is also CUDA. I've downloaded the toolkits from the nVidia site and will be studying them. Moving the code into a DLL is no problem anyway.

As for CUDA, I've seen examples of calculations accelerated by a factor of 10-100, for some medical applications. And CUDA programming is already taught in universities. But it is a very tedious business. The language is similar to C, but you have to partition the task correctly and take into account the peculiarities of the GPU and of integer arithmetic. It turns out to be very low-level programming, and not every task can easily be reduced to that form so as to get a real gain even after six months of work. Although operations on integer matrices, for example, translate to CUDA almost perfectly.
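To show what I mean about matrices, here is ordinary integer matrix multiplication in plain C (the size and the names are purely illustrative). Every element of the result is computed independently of all the others, which is exactly why in CUDA each (row, column) pair can simply be handed to its own GPU thread.

/* Plain C reference: c = a * b for small integer matrices.
   Each c[row][col] depends only on one row of a and one column of b,
   so on a GPU one thread per output element is the natural mapping. */
#include <stdio.h>

#define N 4   /* tiny size, just for the illustration */

void mat_mul_int(int a[N][N], int b[N][N], int c[N][N])
{
    for (int row = 0; row < N; ++row)        /* on a GPU: thread index along y */
        for (int col = 0; col < N; ++col) {  /* on a GPU: thread index along x */
            int sum = 0;
            for (int k = 0; k < N; ++k)
                sum += a[row][k] * b[k][col];
            c[row][col] = sum;               /* no output element depends on another */
        }
}

int main(void)
{
    int a[N][N], b[N][N], c[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            a[i][j] = i + j;
            b[i][j] = (i == j);              /* b = identity, so c must equal a */
        }
    mat_mul_int(a, b, c);
    printf("c[1][2] = %d (expected %d)\n", c[1][2], a[1][2]);
    return 0;
}

In CUDA the two outer loops disappear: the kernel computes one sum per thread, and the grid of threads covers the whole matrix.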
 
begemot61 >> :
As for CUDA, I've seen examples of calculations accelerated by a factor of 10-100, for some medical applications. And CUDA programming is already taught in universities. But it is a very tedious business. The language is similar to C, but you have to partition the task correctly and take into account the peculiarities of the GPU and of integer arithmetic. It turns out to be very low-level programming, and not every task can easily be reduced to that form so as to get a real gain even after six months of work. Although operations on integer matrices, for example, translate to CUDA almost perfectly.

There is also the OpenCL project, an open standard for parallel computing on heterogeneous devices (CPUs, GPUs and so on). Almost everyone is involved in it, including AMD and nVidia, and it offers a higher level of abstraction. The link contains a code sample which, as you can see, is C (the kernel language is based on the C99 standard).
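For orientation, the host side in plain C looks roughly like this - a minimal, illustrative sketch (not the linked sample) that adds two int arrays on whatever OpenCL device is found. The kernel itself is the short C99-based string handed to the driver; error handling and cleanup are omitted for brevity.

/* Minimal OpenCL host sketch in C. Assuming an OpenCL SDK is installed, build with e.g.:
   gcc -std=c99 cl_add.c -lOpenCL */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

/* The kernel is OpenCL C (based on C99); one work-item handles one array element. */
static const char *src =
    "__kernel void add_int(__global const int *a,"
    "                      __global const int *b,"
    "                      __global int *c)"
    "{ size_t i = get_global_id(0); c[i] = a[i] + b[i]; }";

int main(void)
{
    enum { N = 1024 };
    int a[N], b[N], c[N];
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_context       ctx  = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q    = clCreateCommandQueue(ctx, device, 0, &err);
    cl_program       prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel        kern = clCreateKernel(prog, "add_int", &err);

    /* Device buffers; the input ones are filled with copies of the host data. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof a, a, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof b, b, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);

    clSetKernelArg(kern, 0, sizeof da, &da);
    clSetKernelArg(kern, 1, sizeof db, &db);
    clSetKernelArg(kern, 2, sizeof dc, &dc);

    size_t global = N;                                                   /* N work-items in total */
    clEnqueueNDRangeKernel(q, kern, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);  /* blocking read */

    printf("c[10] = %d (expected 30)\n", c[10]);
    return 0;
}

The kernel is compiled at run time for whichever device is picked, so the same source can run on both AMD and nVidia hardware; that is the higher level of abstraction I mean.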

 

I've studied the sources; I'll report back in the afternoon - it's bedtime now.

The results are more or less clear.

 

I will try to briefly describe my findings.

When optimizing an Expert Advisor, the tester uses several tens of MB of memory. I, for example, have an fxt-file for a year of minute data modelled on open prices, about 36 MB in size. This history is kept in memory and is accessed more or less randomly. In this mode the RAM cannot deliver data fast enough to feed the processor with everything it could handle in the "ideal" case, so the cache plays the key role here.
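To get a feel for how much this access pattern costs, here is a crude plain-C sketch (buffer size, stride and timing method are purely illustrative). It walks a buffer of roughly that size once sequentially and once in large strides; the strided walk touches the same number of elements but is typically several times slower, simply because almost every access misses the cache.

/* Crude illustration (not a rigorous benchmark): sequential vs. strided walk
   over ~36 MB of ints. The strided walk defeats the caches and the prefetcher. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BYTES (36u * 1024u * 1024u)        /* roughly the size of the fxt-file */
#define N     (BYTES / sizeof(int))

static double walk(const int *buf, size_t stride)
{
    clock_t t0 = clock();
    volatile long long sum = 0;            /* volatile so the reads are not optimized away */
    size_t idx = 0;
    for (size_t i = 0; i < N; ++i) {
        sum += buf[idx];
        idx += stride;
        if (idx >= N) idx -= N;            /* wrap around; N accesses in total either way */
    }
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    int *buf = malloc(BYTES);
    if (buf == NULL) return 1;
    for (size_t i = 0; i < N; ++i) buf[i] = (int)i;

    printf("sequential: %.3f s\n", walk(buf, 1));     /* cache- and prefetch-friendly */
    printf("strided:    %.3f s\n", walk(buf, 4099));  /* mostly cache misses */
    free(buf);
    return 0;
}

The absolute numbers differ from machine to machine; what matters is the ratio between the two walks, and that ratio is exactly what a bigger and faster cache improves.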

Here begins the most interesting part.

1) Obviously, on cache misses the speed and latency of memory accesses play an important role. Here the processors can be divided into two groups:

a) Atom and Core 2 - the memory controller is in the "north bridge" (North Bridge - NB) chipset.

b) all the others, with the memory controller integrated into the processor (integrated memory controller, IMC).

In this case the processors from group "a" can lose significantly to the processors from group "b". That said, the Core i7's IMC is much more efficient than the one in AMD processors, which is one of the reasons for the Core i7's unconditional victory.

2) For a cache to mask memory latency effectively, it has to be as large as possible, as associative as possible (the "x-way" figure in the CPU-Z screenshots), and have as little latency of its own as possible.

And here the processors clearly line up in terms of speed depending on cache size (all other things being equal).

- The slowest CPU is Celeron with 512KB cache (I don't take Atom into account - its architecture is designed for economy rather than performance);

- Athlons - their small caches hurt less thanks to the IMC;

- Celeron 900 with 1 MB cache;

- Core 2 processors with 3-6 MB of cache; models with a larger cache stand somewhat apart;

- Phenom II - here 6 MB of cache (and with the highest associativity, as much as 48-way!) is combined with an IMC;

- and the fastest is the Core i7 - it combines all the most advanced features: a 3-channel (and generally very fast) IMC and the largest (and again very fast) 8 MB L3 cache.

Now, as for why the Phenom's efficiency goes down when overclocked while the Core i7's goes up.

In both of these processors the IMC and the L3 cache are clocked separately (while the L1/L2 caches always run at the CPU frequency).

But Belford's overclocking method is to raise the CPU multiplier (he has a Black Edition - BE - series processor with an unlocked multiplier; normally the multiplier is capped from above), which leaves the L3 cache unaffected.

Whereas the Core i7 (with the exception of the XE models) can only be overclocked by raising the base clock (BCLK). This also overclocks the IMC together with the L3 cache (in the Core iX this block is called the Uncore).

So the L3 of Belford's Phenom always stays at a fixed 2009.1 MHz. On YuraZ's machine, by contrast, it speeds up from 2.13 GHz at stock to 3.2 GHz when the processor is overclocked to 4 GHz (CPU = BCLK x 20, Uncore = BCLK x 16, so 4 GHz means a 200 MHz BCLK and a 200 x 16 = 3.2 GHz Uncore). And on the Xeon, at a CPU frequency of 3.33 GHz, the Uncore runs at 2.66 GHz.

Moreover, even at 2.13 GHz the Core i7's L3 cache is noticeably faster than the Phenom's L3 at 2 GHz, and naturally much faster still at 3.2 GHz, which is what gives the Core i7 its excellent scaling in this test.

For now this is at the level of conjecture, since I haven't done any detailed research. But it looks as if the optimization speed depends strongly on cache size and speed, and somewhat less on processor frequency.

 
Docent >> :

I will try to briefly describe my findings.

When optimizing an Expert Advisor, the tester uses several tens of MB of memory. I, for example, have an fxt-file for a year of minute data modelled on open prices, about 36 MB in size. This history is kept in memory and is accessed more or less randomly. In this mode the RAM cannot deliver data fast enough to feed the processor with everything it could handle in the "ideal" case, so the cache plays the key role here.

Here begins the most interesting part.

1) Obviously, on cache misses the speed and latency of memory accesses play an important role. Here the processors can be divided into two groups:

a) Atom and Core 2 - the memory controller is in the "north bridge" (North Bridge - NB) chipset.

b) all the others, with the memory controller integrated into the processor (integrated memory controller, IMC).

In this case the processors from group "a" can lose significantly to the processors from group "b". That said, the Core i7's IMC is much more efficient than the one in AMD processors, which is one of the reasons for the Core i7's unconditional victory.

2) For a cache to mask memory latency effectively, it has to be as large as possible, as associative as possible (the "x-way" figure in the CPU-Z screenshots), and have as little latency of its own as possible.

And here the processors clearly line up in terms of speed depending on cache size (all other things being equal).

- The slowest CPU is Celeron with 512KB cache (I don't take Atom into account - its architecture is designed for economy rather than performance);

- Athlons - their small caches hurt less thanks to the IMC;

- Celeron 900 with 1 MB cache;

- Core 2 processors with 3-6 MB of cache; models with a larger cache stand somewhat apart;

- Phenom II - here 6 MB of cache (and with the highest associativity, as much as 48-way!) is combined with an IMC;

- and the fastest is the Core i7 - it combines all the most advanced features: a 3-channel (and generally very fast) IMC and the largest (and again very fast) 8 MB L3 cache.

Now, as for why the Phenom's efficiency goes down when overclocked while the Core i7's goes up.

In both of these processors the IMC and the L3 cache are clocked separately (while the L1/L2 caches always run at the CPU frequency).

But Belford's overclocking method is to raise the CPU multiplier (he has a Black Edition - BE - series processor with an unlocked multiplier; normally the multiplier is capped from above), which leaves the L3 cache unaffected.

Whereas the Core i7 (with the exception of the XE models) can only be overclocked by raising the base clock (BCLK). This also overclocks the IMC together with the L3 cache (in the Core iX this block is called the Uncore).

So the L3 of Belford's Phenom always stays at a fixed 2009.1 MHz. On YuraZ's machine, by contrast, it speeds up from 2.13 GHz at stock to 3.2 GHz when the processor is overclocked to 4 GHz (CPU = BCLK x 20, Uncore = BCLK x 16, so 4 GHz means a 200 MHz BCLK and a 200 x 16 = 3.2 GHz Uncore). And on the Xeon, at a CPU frequency of 3.33 GHz, the Uncore runs at 2.66 GHz.

Moreover, even at 2.13 GHz the Core i7's L3 cache is noticeably faster than the Phenom's L3 at 2 GHz, and naturally much faster still at 3.2 GHz, which is what gives the Core i7 its excellent scaling in this test.

For now this is at the level of conjecture, since I haven't done any detailed research. But it looks as if the optimization speed depends strongly on cache size and speed, and somewhat less on processor frequency.

Thank you. I think it's very convincing. I agree.

 
Docent >>: But it looks as if the optimization speed depends strongly on cache size and speed, and somewhat less on processor frequency.

A little clarification. Would it be correct to assume that the speed of optimization is more dependent on cache size and performance than on CPU frequency?

 
HideYourRichess wrote >>

A little clarification. Is it correct to assume that the optimization speed is more dependent on cache size and performance than on processor frequency?

It seems so. But for now it is still a supposition, and I emphasized that in my post!
