OpenCL: internal implementation tests in MQL5 - page 51

 
Mathemat:

And the result of your OpenCL Emulator (your second test) seems suspiciously weak even in comparison with mine (I had about 10 seconds on "GPU").

Yours was computed on the GPU built into the CPU, which has more cores than mine. Mine was computed on a pure CPU with 4 cores (device 1).
 
joo: Yours was computed on the GPU built into the CPU, which has more cores than mine. Mine was computed on a pure CPU with 4 cores.

No, that's not so, Andrei. IMHO, I was also computing on a pure CPU, with 2 cores. There are three arguments for this. The first is simply ironclad; the second and third are less solid:

1. I have only one OpenCL device on my system. And one of the devices has to be the host anyway, i.e. the bare CPU. That is what it is. It is not a GPU; I checked with GPU Caps.

So I suspect that the CPU device that emulates OpenCL can be any processor that supports SSE2 or above. It doesn't matter whether it has integrated graphics or not. That includes an Intel Core i5-750, by the way.

2. My CPU's integrated graphics is Intel HD Graphics. It has 6 pipelines but 24 processors (threads? shader units?). Compare graphics specs close to what you and I both have.

Since the frequency of those units is much lower than the CPU frequency, I can't imagine that 24 GPU threads at a lower frequency would speed up execution by more than 25 times compared with a pure CPU (and that is what I got). That leaves only one explanation: it is not the GPU, it is CPU emulation of OpenCL. I may be counting wrong, but I have no more detailed data on the graphics architecture of Intel CPUs.

Your integrated graphics are about the same as mine, Intel HD Graphics 2000, but something is off there. I can't explain such a strangely low result from a nearly top-of-the-range Intel Sandy Bridge CPU. By rough estimates, the speedup on it should be around 50x.

3. Officially, Intel's integrated graphics will get OpenCL support only with Ivy Bridge, along with Intel HD Graphics 2500 and 4000 (which will have 64 fast processing units, much faster than AMD's integrated ones). Right now that support really doesn't seem to exist.

I myself have yet to work out why my i3 is not speeding up my calculations.

 
Mathemat:

...

I myself have yet to figure out why my i3 isn't speeding up my calculations.

Hooray! It's working!

2012.04.09 15:32:01 Terminal CPU: GenuineIntel Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with OpenCL 1.1 (4 units, 3092 MHz, 4008 Mb, version 2.0)
2012.04.09 15:32:01 Terminal MetaTrader 5 x64 build 619 started (MetaQuotes Software Corp.)

2012.04.09 16:00:18 ParallelTester_00-02-316x7x3j (USDJPY,H1) CpuTime/GpuTime = 2.604419002781939
2012.04.09 16:00:18 ParallelTester_00-02-316x7x3j (USDJPY,H1) Result on Cpu MachResult==3.64642 at 1594 pass
2012.04.09 16:00:18 ParallelTester_00-02-316x7x3j (USDJPY,H1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.09 16:00:18 ParallelTester_00-02-316x7x3j (USDJPY,H1) CPU time = 243409 ms
2012.04.09 15:56:15 ParallelTester_00-02-316x7x3j (USDJPY,H1) Result on Gpu MachResult==3.64642 at 1594 pass
2012.04.09 15:56:15 ParallelTester_00-02-316x7x3j (USDJPY,H1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.09 15:56:15 ParallelTester_00-02-316x7x3j (USDJPY,H1) GPU time = 93460 ms
2012.04.09 15:54:41 ParallelTester_00-02-316x7x3j (USDJPY,H1) OpenCL init OK!
2012.04.09 15:42:27 ParallelTester_00-02-316x7x3j (EURUSD,M1) CpuTime/GpuTime = 2.573211516347179
2012.04.09 15:42:27 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Cpu MachResult==3.82222 at 3357 pass
2012.04.09 15:42:27 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.09 15:42:27 ParallelTester_00-02-316x7x3j (EURUSD,M1) CPU time = 243907 ms
2012.04.09 15:38:23 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Gpu MachResult==3.82222 at 3357 pass
2012.04.09 15:38:23 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.09 15:38:23 ParallelTester_00-02-316x7x3j (EURUSD,M1) GPU time = 94787 ms
2012.04.09 15:36:49 ParallelTester_00-02-316x7x3j (EURUSD,M1) OpenCL init OK!

 
Ashes:

Yay! It works!

2012.04.09 15:32:01 Terminal CPU: GenuineIntel Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with OpenCL 1.1 (4 units, 3092 MHz, 4008 Mb, version 2.0)
2012.04.09 15:32:01 Terminal MetaTrader 5 x64 build 619 started (MetaQuotes Software Corp.)

2012.04.09 16:00:18 ParallelTester_00-02-316x7x3j (USDJPY,H1) CpuTime/GpuTime = 2.604419002781939

2012.04.09 16:00:18 ParallelTester_00-02-316x7x3j (USDJPY,H1) CPU time = 243409 ms
2012.04.09 15:56:15 ParallelTester_00-02-316x7x3j (USDJPY,H1) Result on Gpu MachResult==3.64642 at 1594 pass
2012.04.09 15:56:15 ParallelTester_00-02-316x7x3j (USDJPY,H1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.09 15:56:15 ParallelTester_00-02-316x7x3j (USDJPY,H1) GPU time = 93460 ms

Well, this is better than 0.7 - at least it has some speedup. So what did you do?

Although... It's very weak. On my Pentium G840, like this:

2012.04.08 22:01:08 Terminal CPU: GenuineIntel Intel(R) Pentium(R) CPU G840 @ 2.80GHz with OpenCL 1.2 (2 units, 2793 MHz, 8040 Mb, version 2.0 (sse2))

2012.04.09 22:11:35 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) CpuTime/GpuTime = 26.0524992748719
2012.04.09 22:11:35 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) Result on Cpu MachResult==4.04242 at 1775 pass
2012.04.09 22:11:35 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.09 22:11:35 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) CPU time = 269461 ms
2012.04.09 22:07:05 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) Result on Gpu MachResult==4.04242 at 1775 pass
2012.04.09 22:07:05 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.09 22:07:05 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) GPU time = 10343 ms
2012.04.09 22:06:55 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) OpenCL init OK!

Do you think it is normal for a budget CPU to beat a more powerful one from the same company by almost an order of magnitude? They have almost the same architecture, and the i3's instruction set is at least no poorer...
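
For anyone who wants to check the ratios without rereading the logs, here is a minimal sketch that pulls the "CPU time" and "GPU time" lines out of terminal output and recomputes CpuTime/GpuTime. The regular expression and the function name `speedups` are my own assumptions for illustration; they are not part of the ParallelTester script or the terminal.

```python
import re

# Timing lines copied from the two logs above (i3-2100 and Pentium G840).
LOG = """\
2012.04.09 16:00:18 ParallelTester_00-02-316x7x3j (USDJPY,H1) CPU time = 243409 ms
2012.04.09 15:56:15 ParallelTester_00-02-316x7x3j (USDJPY,H1) GPU time = 93460 ms
2012.04.09 22:11:35 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) CPU time = 269461 ms
2012.04.09 22:07:05 ParallelTester_00-02-316x7x3j_080412 (EURUSD,H1) GPU time = 10343 ms
"""

# Assumed line layout: "<script> (<symbol>,<timeframe>) CPU|GPU time = <n> ms".
# \s* tolerates the missing space that some pasted log lines have.
TIME_RE = re.compile(r"(\S+) \((\w+),(\w+)\)\s*(CPU|GPU) time = (\d+) ms")


def speedups(log_text):
    """Return {script_name: CpuTime/GpuTime} for scripts that logged both timings."""
    times = {}
    for line in log_text.splitlines():
        match = TIME_RE.search(line)
        if match:
            script, _symbol, _timeframe, kind, ms = match.groups()
            times.setdefault(script, {})[kind] = int(ms)
    return {
        script: t["CPU"] / t["GPU"]
        for script, t in times.items()
        if "CPU" in t and "GPU" in t
    }


result = speedups(LOG)
for script, ratio in result.items():
    print(f"{script}: CpuTime/GpuTime = {ratio:.4f}")
```

The two ratios it recomputes match the CpuTime/GpuTime values the script itself logged, so the tenfold gap between the i3-2100 and the G840 is in the raw timings, not in the division.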

 
Mathemat:

Well, it's already better than 0.7, at least some speedup. And what have you done?

Although... it's very weak. On my Pentium G840 it goes like this:

Do you think it is normal for a budget CPU to beat a more powerful one from the same company by almost an order of magnitude? They have almost the same architecture, and the i3's instruction set is at least no poorer...

I updated the HD Graphics driver and installed the new AMD OpenCL SDK (the English version).

My guess about the difference in performance: different OpenCL versions, plus you start with a GPU frequency advantage over the CPU (assuming the GPU part is the same).

 
Ashes: My guess about the difference in performance: different OpenCL versions, plus you start with a GPU frequency advantage over the CPU (assuming the GPU part is the same).

I don't have any GPU OpenCL device; read my reply to joo. And you shouldn't have one either (I'm talking about integrated graphics). Everything runs as pure emulation on the CPU cores. Besides, the frequency advantage is 2x at most, while the performance differs by an order of magnitude. That doesn't add up.

Second: the version difference does not affect performance in any of the tests posted here. It was just as fast on 1.1 (on my G840).

 
Mathemat:

I don't have any GPU OpenCL device; read my reply to joo. And you shouldn't have one either (I'm talking about integrated graphics). Everything runs as pure emulation on the CPU cores. Besides, the frequency advantage is 2x at most, while the performance differs by an order of magnitude. That doesn't add up.

Second: the version difference does not affect performance in any of the tests posted here. It was just as fast on 1.1 (on my G840).

Looks like you are right. I rejoiced too early. Judging by the CPU load, the "GPU" part of the script is parallelized across 4 cores/threads (CPU load 100%), while the CPU part runs practically on one core (25-30%).

So it works out to 4 * 0.7 (overhead?) ~= 2.8 for CpuTime/GpuTime.
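
The same back-of-the-envelope model can be turned around: divide each measured ratio by the number of OpenCL units reported in the terminal log and see what per-unit efficiency it implies. This is only a sketch under the assumption that the CPU baseline runs on a single core, as the load figures above suggest; the function name `implied_efficiency` is made up for illustration.

```python
def implied_efficiency(speedup, units):
    """Per-unit efficiency implied by a measured speedup on `units` OpenCL units.

    Assumes the baseline ran on one core, so a value near 1.0 means
    ideal scaling and a value above 1.0 cannot come from threading alone.
    """
    return speedup / units


# Estimate from the post: 4 threads at roughly 0.7 efficiency each.
estimate = 4 * 0.7

# Measured CPU and GPU times (ms) and unit counts taken from the logs in this thread.
i3_2100 = implied_efficiency(243409 / 93460, 4)   # 4 units
g840 = implied_efficiency(269461 / 10343, 2)      # 2 units

print(f"estimated ratio for 4 units: {estimate:.2f}")
print(f"i3-2100 implied efficiency per unit: {i3_2100:.2f}")
print(f"G840 implied efficiency per unit:    {g840:.2f}")
```

On these numbers the i3-2100 lands a little below the 0.7 guess, while the G840 comes out far above 1.0 per unit, which is another way of stating the discrepancy: its result cannot be explained by spreading the work over two cores alone.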

It is not clear why there is such a difference between your result and mine, unless the AMD SDK does not use all the features of "someone else's" hardware in its x64 version (you have x32, as far as I understand). HD Graphics driver version: Intel Corporation, 14.02.2012, 8.15.10.2653.

Comparison of G840 and i3-2100 from intel.com:

P.S. Oops, I missed that you have 8040 Mb (so the x32/x64 explanation is out).

 
Mathemat:

Second: the version difference does not affect performance in any of the tests posted here. It was just as fast on 1.1 (on my G840).

OpenCL 1.2 gave a gain of about 10% over 1.1:

2012.04.10 09:41:19 Terminal MetaTrader 5 x64 build 619 started (MetaQuotes Software Corp.)
2012.04.10 09:41:19 Terminal CPU: GenuineIntel Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with OpenCL 1.1 (4 units, 3092 MHz, 4008 Mb, version 2.0)

2012.04.10 09:41:30 ParallelTester_00-02-316x7x3j (EURUSD,M1) OpenCL init OK!
2012.04.10 09:43:04 ParallelTester_00-02-316x7x3j (EURUSD,M1) GPU time = 94365 ms
2012.04.10 09:43:04 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 09:43:04 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Gpu MachResult==3.52408 at 1914 pass
2012.04.10 09:47:09 ParallelTester_00-02-316x7x3j (EURUSD,M1) CPU time = 244968 ms
2012.04.10 09:47:09 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 09:47:09 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Cpu MachResult==3.52408 at 1914 pass
2012.04.10 09:47:09 ParallelTester_00-02-316x7x3j (EURUSD,M1) CpuTime/GpuTime = 2.595962486091242
2012.04.10 10:20:22 ParallelTester_00-02-316x7x3j (EURUSD,M1) OpenCL init OK!
2012.04.10 10:21:56 ParallelTester_00-02-316x7x3j (EURUSD,M1) GPU time = 93756 ms
2012.04.10 10:21:56 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 10:21:56 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Gpu MachResult==4.06735 at 1519 pass
2012.04.10 10:25:58 ParallelTester_00-02-316x7x3j (EURUSD,M1) CPU time = 242426 ms
2012.04.10 10:25:58 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 10:25:58 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Cpu MachResult==4.06735 at 1519 pass
2012.04.10 10:25:58 ParallelTester_00-02-316x7x3j (EURUSD,M1) CpuTime/GpuTime = 2.58571184775076


2012.04.10 11:33:50 Terminal MetaTrader 5 x64 build 619 started (MetaQuotes Software Corp.)
2012.04.10 11:33:50 Terminal CPU: GenuineIntel Intel(R) Core(TM) i3-2100 CPU @ 3.10GHz with OpenCL 1.2 (4 units, 3092 MHz, 4008 Mb, version 2.0 (sse2,avx))

2012.04.10 11:34:14 ParallelTester_00-02-316x7x3j (EURUSD,M1) OpenCL init OK!
2012.04.10 11:35:41 ParallelTester_00-02-316x7x3j (EURUSD,M1) GPU time = 86923 ms
2012.04.10 11:35:41 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 11:35:41 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Gpu MachResult==4.27665 at 970 pass
2012.04.10 11:39:48 ParallelTester_00-02-316x7x3j (EURUSD,M1) CPU time = 246965 ms
2012.04.10 11:39:48 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 11:39:48 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Cpu MachResult==4.27665 at 970 pass
2012.04.10 11:39:48 ParallelTester_00-02-316x7x3j (EURUSD,M1) CpuTime/GpuTime = 2.841192779816619
2012.04.10 11:47:50 ParallelTester_00-02-316x7x3j (EURUSD,M1) OpenCL init OK!
2012.04.10 11:49:18 ParallelTester_00-02-316x7x3j (EURUSD,M1) GPU time = 87610 ms
2012.04.10 11:49:18 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 11:49:18 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Gpu MachResult==4.43566 at 781 pass
2012.04.10 11:53:24 ParallelTester_00-02-316x7x3j (EURUSD,M1) CPU time = 245873 ms
2012.04.10 11:53:24 ParallelTester_00-02-316x7x3j (EURUSD,M1) Count inticators = 16; Count history bars = 50,000; Count pass = 4096
2012.04.10 11:53:24 ParallelTester_00-02-316x7x3j (EURUSD,M1) Result on Cpu MachResult==4.43566 at 781 pass
2012.04.10 11:53:24 ParallelTester_00-02-316x7x3j (EURUSD,M1) CpuTime/GpuTime = 2.806449035498231
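
To put a number on "about 10%", here is a small sketch that averages the two runs per OpenCL version from the logs above and compares them. The variable names are mine; the values are copied from the log lines, and two runs per version is of course a very small sample.

```python
from statistics import mean

# "GPU time" values in ms, two runs per OpenCL version, copied from the logs above.
gpu_ms_v11 = [94365, 93756]
gpu_ms_v12 = [86923, 87610]

# Logged CpuTime/GpuTime ratios for the same runs.
ratio_v11 = [2.595962486091242, 2.58571184775076]
ratio_v12 = [2.841192779816619, 2.806449035498231]

# Relative drop in the OpenCL ("GPU") run time when moving from 1.1 to 1.2.
time_reduction = 1 - mean(gpu_ms_v12) / mean(gpu_ms_v11)

# Relative growth of the CpuTime/GpuTime ratio over the same change.
ratio_gain = mean(ratio_v12) / mean(ratio_v11) - 1

print(f"GPU time reduction: {time_reduction:.1%}")
print(f"CpuTime/GpuTime gain: {ratio_gain:.1%}")
```

Averaged this way, the OpenCL run time drops by roughly 7% and the ratio grows by roughly 9%, so "about 10%" is a fair rounding of the ratio gain.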

 
I take it that if we want to pass a 2D/multidimensional array of data to the GPU, we can represent the data as a structure and pass the structure?
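
Not an answer confirmed anywhere in this thread, but one common alternative to a structure is worth sketching: OpenCL buffer objects are linear memory, so a 2D array is often flattened row by row into a 1D array, and the kernel rebuilds the index as `i * n_cols + j`. The helper names below are hypothetical and the code only illustrates the indexing; it does not call any MQL5 or OpenCL function.

```python
def flatten_row_major(rows):
    """Flatten a rectangular 2D list into a 1D list, row by row.

    Returns the flat list together with the column count, which the
    consumer needs in order to rebuild 2D indices.
    """
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same length")
    flat = [value for row in rows for value in row]
    return flat, n_cols


def element_at(flat, n_cols, i, j):
    """Read element [i][j] from the flattened array, as a kernel would."""
    return flat[i * n_cols + j]


# Hypothetical data: 3 indicators x 4 bars.
table = [
    [1.0, 1.1, 1.2, 1.3],
    [2.0, 2.1, 2.2, 2.3],
    [3.0, 3.1, 3.2, 3.3],
]
flat, n_cols = flatten_row_major(table)
print(element_at(flat, n_cols, 2, 1))
```

With this layout only the flat array and the column count have to reach the device, and the same index formula works on both the host and the kernel side.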
 
Ashes: OpenCL 1.2 gave a gain of about 10% over 1.1:

Yes, roughly that. The gain is not dramatic anyway, nowhere near several-fold.

...unless the AMD SDK does not use all the features of "someone else's" hardware.

I had that hypothesis at the beginning as well.

But the CPU OpenCL tests in the GPU Caps utility all run quite decently; I checked. The tool is attached if you need it. Just run the CPU emulation tests.

In some tests the fps on i3 is several times higher than on G840 (on 4D Quaternion Julia Set - about 17 vs 4-5).

So, it turns out that AMD did a good job here.

The problem is in the terminal, which for some reason "understands" the G840 but "does not understand" higher-end Intel CPUs with the AMD APP SDK installed. I sent a message to Service Desk a week ago, but there has been no response so far.
