OpenCL: internal implementation tests in MQL5

Vladimir Gomonov 2013.11.27 22:36 #691

tol64:

Maybe Renat can see what can be learnt from this. It is quite possible that new specification will give better performance in MQL5 as well, isn't it?

As for C#/C++, if need be, we can also dump it. The main thing is to get the maximum possible output. ;)

For the time being, I'm keeping myself from rewriting the CL-optimizer for Sharp hoping that the new MT4 will at least provide version 1.1 eventually. The language is the same, the compiler is the same and there are no principal obstacles (I don't really need OpenCL support in an MT4 tester, although I will continue doing so if it appears). If it is not implemented - I will think to the left.

Strategy Tester to support Help to upgrade to Why valenok2003 is against

Anatoli Kazharski 2013.11.29 15:51 #692

Tested some of the scripts in this thread on such a machine:

CPU-Z

CUDA-Z

For each script I will provide a link to the post where it was published so that others can quickly find it, run the tests and compare results if needed.

Test 1

Test 2

2013.11.29 14:29:13     ParallelOptimazer_00-02 (EURUSD,H1)     Generation 013 (1280 passes, 140 ms) : MaxResult==116.05191; Average Result==106.7991
2013.11.29 14:29:13     ParallelOptimazer_00-02 (EURUSD,H1)     Generation 014 (1280 passes, 125 ms) : MaxResult==116.05191; Average Result==106.77599
2013.11.29 14:29:13     ParallelOptimazer_00-02 (EURUSD,H1)     Generation 015 (1280 passes, 125 ms) : MaxResult==116.05191; Average Result==106.37561
2013.11.29 14:29:13     ParallelOptimazer_00-02 (EURUSD,H1)     Generation 016 (1280 passes, 140 ms) : MaxResult==116.05191; Average Result==106.64193
2013.11.29 14:29:13     ParallelOptimazer_00-02 (EURUSD,H1)     Optimization finished. Best result == 116.05191 at 16 generation.
2013.11.29 14:29:13     ParallelOptimazer_00-02 (EURUSD,H1)     Total time of optimization == 2 sec 122 ms

Test 3

scale = 1000

CPU

GPU

Test 4

2013.11.29 16:02:31     Tast_Mand_ (EURUSD,H1)  1872 msec

Test 5

2013.11.29 16:39:50     ParallelTester_00-01 x (EURUSD,H1)       CLGetInfoInteger() returned 2
2013.11.29 16:39:51     ParallelTester_00-01 x (EURUSD,H1)       OpenCL init OK!
2013.11.29 16:39:51     ParallelTester_00-01 x (EURUSD,H1)       GPU time = 62 ms
2013.11.29 16:39:51     ParallelTester_00-01 x (EURUSD,H1)       Соunt indicators = 16; Count history bars = 144000; Count pass = 1280
2013.11.29 16:39:51     ParallelTester_00-01 x (EURUSD,H1)       Result on Gpu МахResult==1.34787 at 699 pass
2013.11.29 16:40:05     ParallelTester_00-01 x (EURUSD,H1)       CPU time = 14492 ms
2013.11.29 16:40:05     ParallelTester_00-01 x (EURUSD,H1)       Соunt indicators = 16; Count history bars = 144000; Count pass = 1280
2013.11.29 16:40:05     ParallelTester_00-01 x (EURUSD,H1)       Result on Cpu МахResult==1.34787 at 699 pass
2013.11.29 16:40:05     ParallelTester_00-01 x (EURUSD,H1)       CpuTime/GpuTime = 233.741935483871

Test 6

2013.11.29 16:45:28     ParallelTester_00-01 x_cycle (EURUSD,H1) OpenCL init OK! Device number = 0
2013.11.29 16:45:28     ParallelTester_00-01 x_cycle (EURUSD,H1) GPU time = 577 ms
2013.11.29 16:45:28     ParallelTester_00-01 x_cycle (EURUSD,H1) Соunt indicators = 16; Count history bars = 144000; Count pass = 12800
2013.11.29 16:45:28     ParallelTester_00-01 x_cycle (EURUSD,H1) Result on Gpu МахResult==1.57161 at 7031 pass
2013.11.29 16:45:28     ParallelTester_00-01 x_cycle (EURUSD,H1) OpenCL init OK! Device number = 1
2013.11.29 16:45:29     ParallelTester_00-01 x_cycle (EURUSD,H1) GPU time = 546 ms
2013.11.29 16:45:29     ParallelTester_00-01 x_cycle (EURUSD,H1) Соunt indicators = 16; Count history bars = 144000; Count pass = 12800
2013.11.29 16:45:29     ParallelTester_00-01 x_cycle (EURUSD,H1) Result on Gpu МахResult==1.57161 at 7031 pass
2013.11.29 16:47:54     ParallelTester_00-01 x_cycle (EURUSD,H1) CPU time = 145144 ms
2013.11.29 16:47:54     ParallelTester_00-01 x_cycle (EURUSD,H1) Соunt indicators = 16; Count history bars = 144000; Count pass = 12800
2013.11.29 16:47:54     ParallelTester_00-01 x_cycle (EURUSD,H1) Result on Cpu МахResult==1.57161 at 7031 pass
2013.11.29 16:47:54     ParallelTester_00-01 x_cycle (EURUSD,H1) CpuTime/GpuTime = 265.8315018315018

Test7

2013.11.29 16:54:52     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     ========================================
2013.11.29 16:57:16     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     CPU time = 144691 ms
2013.11.29 16:57:16     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     Соunt indicators = 16; Count history bars = 144000; Count pass = 12800
2013.11.29 16:57:16     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     Result on Cpu МахResult==0.91969 at 4641 pass
2013.11.29 16:57:16     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     -------------------------
2013.11.29 16:57:16     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     Device number = 0
2013.11.29 16:57:17     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     GPU time = 593 ms
2013.11.29 16:57:17     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     CpuTime/GpuTime = 243.9983136593592
2013.11.29 16:57:17     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     Result on Gpu МахResult==0.91969 at 4641 pass
2013.11.29 16:57:17     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     ------------
2013.11.29 16:57:17     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     Device number = 1
2013.11.29 16:57:18     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     GPU time = 546 ms
2013.11.29 16:57:18     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     CpuTime/GpuTime = 265.0018315018315
2013.11.29 16:57:18     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     Result on Gpu МахResult==0.91969 at 4641 pass
2013.11.29 16:57:18     ParallelTester_00-01 x_new_cycle (EURUSD,H1)     ------------

Test 8

2013.11.29 17:08:08     vect_v2_all_devices (EURUSD,H1) =======================================
2013.11.29 17:08:08     vect_v2_all_devices (EURUSD,H1) OCL martices mul:         ROWS1 = 2000; COLSROWS = 2000; COLS2 = 2000
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) CPUTime = 64.085
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) ---------------
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) read = 4000000 elements
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) Device = 0: time = 0.251 sec.
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) CPUTime / GPUTotalTime = 255.319
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 1362,1715 ) = -5.34762192;    thirdCPU[ 1362,1715 ] = -5.34762192;    buf[ 1362,1715 ] = -5.34761715
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 365,218 ) = 1.04545093;    thirdCPU[ 365,218 ] = 1.04545093;    buf[ 365,218 ] = 1.04544997
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 1461,1678 ) = -0.26404253;    thirdCPU[ 1461,1678 ] = -0.26404253;    buf[ 1461,1678 ] = -0.26404306
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 1116,1765 ) = 0.61209172;    thirdCPU[ 1116,1765 ] = 0.61209172;    buf[ 1116,1765 ] = 0.61209279
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 256,499 ) = 2.50011539;    thirdCPU[ 256,499 ] = 2.50011539;    buf[ 256,499 ] = 2.50011611
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 528,1433 ) = 2.69000340;    thirdCPU[ 528,1433 ] = 2.69000340;    buf[ 528,1433 ] = 2.69000053
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 926,1280 ) = 4.74232054;    thirdCPU[ 926,1280 ] = 4.74232054;    buf[ 926,1280 ] = 4.74231577
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 361,1757 ) = 2.25322127;    thirdCPU[ 361,1757 ] = 2.25322127;    buf[ 361,1757 ] = 2.25322032
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 1441,400 ) = -1.65504980;    thirdCPU[ 1441,400 ] = -1.65504980;    buf[ 1441,400 ] = -1.65504801
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) sum( 1617,306 ) = -2.14686131;    thirdCPU[ 1617,306 ] = -2.14686131;    buf[ 1617,306 ] = -2.14686537
2013.11.29 17:09:12     vect_v2_all_devices (EURUSD,H1) ________________________
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) read = 4000000 elements
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) Device = 1: time = 0.734 sec.
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) CPUTime / GPUTotalTime = 87.309
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 370,1332 ) = 0.78463894;    thirdCPU[ 370,1332 ] = 0.78463894;    buf[ 370,1332 ] = 0.78463584
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 1346,515 ) = 4.13771629;    thirdCPU[ 1346,515 ] = 4.13771629;    buf[ 1346,515 ] = 4.13771629
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 632,631 ) = 0.53385985;    thirdCPU[ 632,631 ] = 0.53385985;    buf[ 632,631 ] = 0.53386015
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 930,102 ) = 6.17934942;    thirdCPU[ 930,102 ] = 6.17934942;    buf[ 930,102 ] = 6.17935467
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 507,167 ) = 2.76653004;    thirdCPU[ 507,167 ] = 2.76653004;    buf[ 507,167 ] = 2.76652718
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 1638,1623 ) = -3.40129304;    thirdCPU[ 1638,1623 ] = -3.40129304;    buf[ 1638,1623 ] = -3.40129256
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 208,649 ) = 8.09206963;    thirdCPU[ 208,649 ] = 8.09206963;    buf[ 208,649 ] = 8.09207344
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 298,741 ) = -0.59763604;    thirdCPU[ 298,741 ] = -0.59763604;    buf[ 298,741 ] = -0.59763324
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 1334,521 ) = -2.74508810;    thirdCPU[ 1334,521 ] = -2.74508810;    buf[ 1334,521 ] = -2.74508691
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) sum( 858,760 ) = -7.48025274;    thirdCPU[ 858,760 ] = -7.48025274;    buf[ 858,760 ] = -7.48025846
2013.11.29 17:09:13     vect_v2_all_devices (EURUSD,H1) ________________________

CPU-Z CPUID - System & hardware benchmark, monitoring, reporting

www.cpuid.com

CPU-Z is a freeware that gathers information on some of the main devices of your system.

Exit Strategy: Stepping Stops 75,000 options - 4GB Trading Gold/USD

Anatoli Kazharski 2013.12.01 12:18 #693

I also tried to test MetaDriver'sqpu_EMA-Rainbow indicator.

On the CPU, the result is sometimes up to 2x better. Here is the result:

2013.12.01 14:12:50     qpu_Future_EMA-Rainbow (EURUSD,M1)      Calculate 1000129 bars at CPU, time = 811 ms
2013.12.01 14:12:57     qpu_Future_EMA-Rainbow (EURUSD,M1)      OpenCL: GPU device 'GeForce GTX 650 Ti BOOST' selected
2013.12.01 14:12:58     qpu_Future_EMA-Rainbow (EURUSD,M1)      Calculate 1000129 bars at GPU (OpenCL), time = 1295 ms

//---

Volodya(MetaDriver), show me your results?

P.S. I changed my type in kernel code in gpuEMA function parameters from__global to__local. A little faster, but still slower than on CPU.

2013.12.01 14:29:46     qpu_Future_EMA-Rainbow (EURUSD,M1)      Calculate 1000129 bars at CPU, time = 795 ms
2013.12.01 14:29:51     qpu_Future_EMA-Rainbow (EURUSD,M1)      OpenCL: GPU device 'GeForce GTX 650 Ti BOOST' selected
2013.12.01 14:29:52     qpu_Future_EMA-Rainbow (EURUSD,M1)      Calculate 1000129 bars at GPU (OpenCL), time = 1061 ms

Files:

qpu_Future_EMA-Rainbow.mq5 14 kb

Vladimir Gomonov 2013.12.02 22:25 #694

tol64:

I also tried to test MetaDriver'sqpu_EMA-Rainbow indicator.

On the CPU, the result is sometimes up to 2x better. Here is the result:

Volodya(MetaDriver), show me your results?

P.S. Changed in kernel code in gpuEMA function parameters from__global to__local. A little faster, but still slower than on CPU.

My results are similar. This has long been discussed and it is logical - the task is too simple, transferring memory to and from the video card does not pay off. The advantage of GPU appears in more complex tasks.

Discussion of article "Speed NFA bans locking from Discussion of article "MQL5

Anatoli Kazharski 2013.12.03 00:00 #695

MetaDriver:
I have similar results. This has long been discussed, and it makes sense - the task is too simple, transferring memory to and from the video card doesn't pay off. The advantage of the GPU appears in more complex tasks.

I see, thanks, I will experiment with more complex tasks.

[Deleted] 2015.05.26 09:10 #696

An example of using GPU acceleration for trading (derivatives).

Mark Joshi - famous for his books on financial mathematics, in particular on derivatives and options trading, has reported here about his work:

http://ssrn.com/abstract=2388415

He translated his OOP-style work to CUDA GPU. He started it in 2010, then had a break, and from 2011 until summer 2014 he made it to working version 0.3. He managed to achieve acceleration of 100X... 137X times - and that's on an AMAZING algorithm, which is difficult.

The work used the QuantLib library in C++, which he himself admits he had to rework along the lines of "OOP ->-> procedural approach" - in order to make it all work on the CUDA GPU.

He writes:

"I have implemented Monte Carlo pricing of IRD with the LMM on the GPU with least-squares for early exercise features.

You can get the code from kooderive.sourceforge.net in both C++ and CUDA. The paper is at ......

I used a completely different code for CUDA than I had previously used for C++. In essence, I treat data as the central concept and use the code to act on the data. The style is very functional. It did took a lot of work because my previous C++ implementations had been object oriented."

His project itself is open source:

http://sourceforge.net/projects/kooderive/

OpenCl and the tools Errors, bugs, questions NVIDIA CUDA for strategy

OpenCL: internal implementation tests in MQL5 - page 70

CPU-Z

CUDA-Z