OpenCL: internal implementation tests in MQL5 - page 29

 
MetaDriver:

...

--

Make it 512 and see what you get. Don't be afraid to hack the program up, it'll only make it better. :) When you're done, post the results here.

OK! At 512 passes and 144000 bars:

PK      0       po_00-02 (GBPJPY,M5)    23:38:29        OpenCL init OK.
LS      0       po_00-02 (GBPJPY,M5)    23:38:30        Generation 001 (512 passes, 1186 ms) : MaxResult==81.21127; Average Result==24.14348
PR      0       po_00-02 (GBPJPY,M5)    23:38:32        Generation 002 (512 passes, 1170 ms) : MaxResult==88.56933; Average Result==45.67882
RF      0       po_00-02 (GBPJPY,M5)    23:38:33        Generation 003 (512 passes, 1170 ms) : MaxResult==100.78146; Average Result==66.20171
RF      0       po_00-02 (GBPJPY,M5)    23:38:34        Generation 004 (512 passes, 1170 ms) : MaxResult==107.30714; Average Result==82.67181
RG      0       po_00-02 (GBPJPY,M5)    23:38:35        Generation 005 (512 passes, 1170 ms) : MaxResult==115.61784; Average Result==93.52664
DG      0       po_00-02 (GBPJPY,M5)    23:38:36        Generation 006 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==100.41042
CG      0       po_00-02 (GBPJPY,M5)    23:38:37        Generation 007 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==103.95667
JF      0       po_00-02 (GBPJPY,M5)    23:38:39        Generation 008 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==105.85167
NI      0       po_00-02 (GBPJPY,M5)    23:38:40        Generation 009 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==106.22531
MI      0       po_00-02 (GBPJPY,M5)    23:38:41        Generation 010 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==106.33067
GH      0       po_00-02 (GBPJPY,M5)    23:38:42        Generation 011 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==106.23798
DK      0       po_00-02 (GBPJPY,M5)    23:38:43        Generation 012 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==106.02062
PK      0       po_00-02 (GBPJPY,M5)    23:38:44        Generation 013 (512 passes, 1170 ms) : MaxResult==116.37332; Average Result==105.62199
CJ      0       po_00-02 (GBPJPY,M5)    23:38:44        Optimization finished. Best result == 116.37332 at 13 generation.
RM      0       po_00-02 (GBPJPY,M5)    23:38:44        Total time of optimization == 15 sec 226 ms

And if 60 passes is the optimum, it's really cool:

FG      0       po_00-02 (GBPJPY,M5)    23:39:44        OpenCL init OK.
OO      0       po_00-02 (GBPJPY,M5)    23:39:44        Generation 001 (60 passes, 312 ms) : MaxResult==91.27985; Average Result==38.30907
RN      0       po_00-02 (GBPJPY,M5)    23:39:44        Generation 002 (60 passes, 312 ms) : MaxResult==94.08679; Average Result==48.68662
DR      0       po_00-02 (GBPJPY,M5)    23:39:45        Generation 003 (60 passes, 296 ms) : MaxResult==108.52215; Average Result==58.43468
IS      0       po_00-02 (GBPJPY,M5)    23:39:45        Generation 004 (60 passes, 312 ms) : MaxResult==129.80438; Average Result==65.32684
DP      0       po_00-02 (GBPJPY,M5)    23:39:45        Generation 005 (60 passes, 297 ms) : MaxResult==144.99834; Average Result==73.78468
MQ      0       po_00-02 (GBPJPY,M5)    23:39:46        Generation 006 (60 passes, 297 ms) : MaxResult==144.99834; Average Result==79.96281
QF      0       po_00-02 (GBPJPY,M5)    23:39:46        Generation 007 (60 passes, 312 ms) : MaxResult==152.74852; Average Result==85.70296
EG      0       po_00-02 (GBPJPY,M5)    23:39:46        Generation 008 (60 passes, 312 ms) : MaxResult==152.74852; Average Result==87.95421
PD      0       po_00-02 (GBPJPY,M5)    23:39:46        Generation 009 (60 passes, 296 ms) : MaxResult==152.74852; Average Result==89.29836
CE      0       po_00-02 (GBPJPY,M5)    23:39:47        Generation 010 (60 passes, 312 ms) : MaxResult==152.74852; Average Result==87.88991
OI      0       po_00-02 (GBPJPY,M5)    23:39:47        Generation 011 (60 passes, 296 ms) : MaxResult==152.74852; Average Result==85.3231
HK      0       po_00-02 (GBPJPY,M5)    23:39:47        Generation 012 (60 passes, 312 ms) : MaxResult==152.74852; Average Result==81.60567
IH      0       po_00-02 (GBPJPY,M5)    23:39:48        Generation 013 (60 passes, 297 ms) : MaxResult==152.74852; Average Result==77.38504
QI      0       po_00-02 (GBPJPY,M5)    23:39:48        Generation 014 (60 passes, 312 ms) : MaxResult==152.74852; Average Result==76.46695
EM      0       po_00-02 (GBPJPY,M5)    23:39:48        Optimization finished. Best result == 152.74852 at 14 generation.
PO      0       po_00-02 (GBPJPY,M5)    23:39:48        Total time of optimization == 4 sec 290 ms

//---

So that's the result on the weakest laptop presented in this thread. Very promising.
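To compare the two runs, here is a rough calculation of the average time per pass, using only the figures from the logs above. It is a sketch: the total time presumably includes initialization and other overhead, so the per-pass numbers are approximate.

```python
# Figures copied from the two optimization logs above.
runs = {
    "512 passes": {"generations": 13, "passes": 512, "total_ms": 15226, "best": 116.37332},
    "60 passes": {"generations": 14, "passes": 60, "total_ms": 4290, "best": 152.74852},
}

def ms_per_pass(run):
    """Average wall-clock time per pass over the whole optimization."""
    return run["total_ms"] / (run["generations"] * run["passes"])

for name, run in runs.items():
    print(f"{name}: {ms_per_pass(run):.2f} ms per pass, best result {run['best']}")
```

The 60-pass run finishes sooner overall, but each pass costs more on average (about 5.1 ms against about 2.3 ms), so the larger generation keeps the device busier.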

//---

Unfortunately, I can't discuss the subject freely: I haven't even gotten into joo's article or neural networks yet, and I've never dug into OpenCL. I can't use any code without understanding every single line of it. I want to know everything. ))) I'm still working on the engine of my trading program. There's so much to do that my head is already spinning. )))

 

I increased CountBars by a factor of 30 (to 4,320,000) to test how the CPU holds up under load.

No problem: it works and warms up, but it doesn't sweat too much. The temperature rose slowly and has already leveled off.

The red line is the temperature, the green line is the load of the cores.


That's why I love Intel's Sandy Bridge: it's "green". Yes, the graphics aren't great, but we'll see what Ivy Bridge brings...
 
Mathemat:

...

That's why I love Intel's Sandy Bridge: it's "green". Yes, the graphics aren't great, but we'll see what Ivy Bridge brings...

Oh. (chuckles) Now that's a real stress test. :) Mine would probably be dead by now.

And then there'll be Haswell, and Rockwell a little later... )))

 

An example of a Barnsley fern implementation in OpenCL.

The calculation is based on the Chaos Game algorithm (example) and uses a random number generator whose seed depends on the thread ID returned by get_global_id(0), so that every thread produces its own unique trajectory.
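For readers who haven't seen the chaos game before, here is a minimal CPU sketch of the idea in Python. It is not the attached kernel: the coefficients are the standard Barnsley fern maps, while `fern_trajectory`, `base_seed` and the seeding rule `base_seed + thread_id` are assumptions made for illustration, with `thread_id` standing in for get_global_id(0).

```python
import random

# Standard Barnsley fern IFS: (a, b, c, d, e, f, probability),
# applied as x' = a*x + b*y + e, y' = c*x + d*y + f.
FERN_MAPS = [
    (0.00, 0.00, 0.00, 0.16, 0.0, 0.00, 0.01),
    (0.85, 0.04, -0.04, 0.85, 0.0, 1.60, 0.85),
    (0.20, -0.26, 0.23, 0.22, 0.0, 1.60, 0.07),
    (-0.15, 0.28, 0.26, 0.24, 0.0, 0.44, 0.07),
]
WEIGHTS = [m[6] for m in FERN_MAPS]

def fern_trajectory(thread_id, n_points, base_seed=12345):
    """Play the chaos game for one 'thread' and return its points."""
    # ASSUMPTION: the seed is simply base_seed + thread_id; the real
    # kernel may derive its seed from the thread ID differently.
    rng = random.Random(base_seed + thread_id)
    x, y = 0.0, 0.0
    points = []
    for _ in range(n_points):
        # Pick one affine map at random, weighted by its probability.
        a, b, c, d, e, f, _prob = rng.choices(FERN_MAPS, weights=WEIGHTS)[0]
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    return points

# Two "threads" with different IDs follow different trajectories.
t0 = fern_trajectory(0, 1000)
t1 = fern_trajectory(1, 1000)
print(t0[:3], t1[:3])
```

Because each thread has its own seed, the threads don't repeat one another's points, and the union of all trajectories fills in the fern.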

IFS fern OpenCL

As you zoom in, the number of points required to maintain image quality grows quadratically, so in this implementation each kernel instance renders a fixed number of points that fall within the visible area.

The number of work items (threads) is set on line 191:

   uint  work  []={500};

and the number of points on line 233:

   float pointsneeded=float(MathRound(1500+scale));
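As a quick check of what these two settings add up to, here is a small Python sketch. It assumes that pointsneeded is a per-thread quota of visible points, as described above; the function name `visible_points_budget` is made up for this example.

```python
def visible_points_budget(scale, work_items=500):
    """Total visible points rendered across all work items.

    ASSUMPTION: every work item renders its own quota of
    round(1500 + scale) visible points, mirroring
    MathRound(1500+scale) from the snippet above.
    """
    points_needed = round(1500 + scale)
    return work_items * points_needed

print(visible_points_budget(1000))  # 500 work items at scale=1000
```

At scale=1000 that gives 500 x 2500 = 1,250,000 visible points under this assumption.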

UPD

IFS-fern.mq5 - CPU analogue

At scale=1000:


 
Beautiful.
 

I've built a three-layer network of 16x7x3 neurons. Actually, I built it the day before yesterday and debugged it today. Before that, the results didn't match when checked against the CPU. I won't describe the reasons here, at least not now, I'm too sleepy. :)

Timing results:

2012.03.08 04:46:13 ParallelTester_00-02-(16 x7x3) (EURUSD,M30)  CpuTime/GpuTime = 776.7218045112782
2012.03.08 04:46:13 ParallelTester_00-02-(16 x7x3) (EURUSD,M30)  Result on Cpu МахResult==1.06443 at 1004 pass
2012.03.08 04:46:13 ParallelTester_00-02-(16 x7x3) (EURUSD,M30)  Соunt inticators = 16; Count history bars = 144000; Count pass = 1024
2012.03.08 04:46:13 ParallelTester_00-02-(16 x7x3) (EURUSD,M30)  CPU time = 206608 ms
2012.03.08 04:42:46 ParallelTester_00-02-(16 x7x3) (EURUSD,M30)  Result on Gpu МахResult==1.06443 at 1004 pass
2012.03.08 04:42:46 ParallelTester_00-02-(16 x7x3) (EURUSD,M30)  Соunt inticators = 16; Count history bars = 144000; Count pass = 1024
2012.03.08 04:42:46 ParallelTester_00-02-(16 x7x3) (EURUSD,M30)  GPU time = 266 ms
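The CpuTime/GpuTime figure in the log is simply the ratio of the two timings. A small Python check, using only the numbers printed above:

```python
# Timings copied from the log above, in milliseconds.
cpu_ms = 206608
gpu_ms = 266

speedup = cpu_ms / gpu_ms
print(f"CpuTime/GpuTime = {speedup:.2f}")
```

In other words, the same 1024 passes over 144000 bars take roughly 777 times less time on the GPU in this test.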

Tomorrow I'll make an optimizer for this network. Then I'll be busy loading real data and bringing the tester up to realistic calculations that can be verified against the MT5 tester. After that I'll work on a generator of MLP + cl-code for networks, for their optimization.

I'm not posting the source code, out of greed, but the ex5 is attached for those who would like to test it on their hardware.

 
MetaDriver: I'm not posting the source code, out of greed, but the ex5 is attached for those who want to test it on their hardware.

Mine is as stable as things are under Putin:

2012.03.08 05:38:22    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    CpuTime/GpuTime = 24.08037178786222
2012.03.08 05:38:22    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    Result on Cpu МахResult==1.09311 at 771 pass
2012.03.08 05:38:22    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    Соunt inticators = 16; Count history bars = 144000; Count pass = 1024
2012.03.08 05:38:22    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    CPU time = 176172 ms
2012.03.08 05:35:26    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    Result on Gpu МахResult==1.09311 at 771 pass
2012.03.08 05:35:26    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    Соunt inticators = 16; Count history bars = 144000; Count pass = 1024
2012.03.08 05:35:26    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    GPU time = 7316 ms
2012.03.08 05:35:18    ParallelTester_00-02-j16x7x3z (EURUSD,H1)    OpenCL init OK!


By the way, note that in CPU runtime the difference between your system and mine (based on a Pentium G840) is not that big.

Is your RAM fast? I have 1333 MHz.

One more thing: interestingly, both CPU cores are loaded during the computations. The sharp drop in load at the end comes after the calculations have finished. What could that mean?


 
Mathemat:

...


1. By the way, note that in CPU execution time the difference between your system and mine (Pentium G840 based) is not that big.

2. Is your RAM fast? I have 1333 MHz.

1. I've been restoring my overclock in my spare time. I once had a really bad crash (I found out later that the drive's power cable had fallen out), so I pressed the "MemoryOK" button on the motherboard, hoping for a miracle. It still didn't work after that; the only effect was that the CMOS settings were reset to defaults. Now I've overclocked the processor to 3840 MHz again, so it runs faster.

2. I still can't figure it out. :) In particular, the benchmark Renat linked to shows 1600 MHz. Windows even shows 1033 MHz :)))), despite the fact that the memory itself is rated at 2 GHz, while my motherboard can handle up to 1866 (so to speak).

 
Mathemat:

One more thing: it's interesting that I have both cores loaded when calculating on CPU. The sharp drop in load at the end is after the end of calculations. What would it mean?

So maybe it's not running on the GPU at all? The driver is installed, but... My only explanation is that the calculation runs on the OpenCL CPU device, using all available cores and vector SSE instructions, of course. :)

The second possibility is that it computes on the CPU and the GPU simultaneously. I don't know how the driver implements this kind of CPU+GPU support, but in principle I don't rule out that way of launching OpenCL processing either.

This is just my speculation, mind you. Or, as it's fashionable to write nowadays, "IMHO". ;)

 
MetaDriver: The only explanation I have is that the calculation is done on OpenCL CPU using all available cores and vector SSE instructions, of course. :)

I doubt it, especially since I only have two cores. Where would the 25x speedup come from then?

If the Intel Math Kernel Library or Intel Performance Primitives were installed (I haven't downloaded them), it would be possible... in some cases. But that's unlikely, since they weigh hundreds of megabytes.

I'll have to see what Google has to say about it.
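A back-of-envelope bound shows where the doubt comes from. The sketch below is in Python and rests on loudly stated assumptions: ideal scaling across both cores and 4 single-precision lanes per SSE instruction. The lane count is an assumption, not something measured in this thread.

```python
cores = 2        # Pentium G840, as mentioned above
simd_lanes = 4   # ASSUMPTION: 128-bit SSE working on 32-bit floats

# Best-case gain of "all cores + SSE" over equally well-optimized
# scalar code running on a single core.
ideal_bound = cores * simd_lanes

# CPU time / GPU time from the log above, in milliseconds.
measured = 176172 / 7316

print(f"ideal bound: {ideal_bound}x, measured: {measured:.2f}x")
print(f"measured / bound = {measured / ideal_bound:.2f}")
```

Under these assumptions the measured ratio is about three times the ideal bound. Note that the bound only holds against an equally well-optimized scalar baseline; a slower baseline would inflate the ratio.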

Mathemat: Also, interestingly enough, my CPU computations have both cores loaded.

No, I meant pure CPU computation without any OpenCL. The load is just below 100%, with comparable values on each core. But when running the OpenCL code it goes up to 100%, which can easily be explained by GPU operation.
