Features of the mql5 language, subtleties and tricks - page 322

 
Nikolai Semko #:
Perhaps using the GPU (OpenCL) will give some results, but only if there are multiple operations on one buffer or calculations more complex than a simple sum. For a single sum the GPU can be comparable or even slower because of transfer overhead.

For the sake of curiosity I generated the test code using OpenCL with an AI assistant.

2025.12.11 20:18:38.625 test_array_sum (EURUSD,M4)      ╔════════════════════════════════════════════════════════════╗
2025.12.11 20:18:38.625 test_array_sum (EURUSD,M4)      ║          CPU vs GPU BENCHMARK (OpenCL)                     ║
2025.12.11 20:18:38.625 test_array_sum (EURUSD,M4)      ╠════════════════════════════════════════════════════════════╣
2025.12.11 20:18:38.625 test_array_sum (EURUSD,M4)      ║  Elements: 100 M | Data size: 381 MB
2025.12.11 20:18:38.625 test_array_sum (EURUSD,M4)      ╚════════════════════════════════════════════════════════════╝
2025.12.11 20:18:38.758 test_array_sum (EURUSD,M4)      
2025.12.11 20:18:38.758 test_array_sum (EURUSD,M4)      ▓▓▓ TEST 1: SIMPLE SUM (only +=) ▓▓▓
2025.12.11 20:18:38.786 test_array_sum (EURUSD,M4)      CPU: 28269 µs, sum = 5050000000
2025.12.11 20:18:38.786 test_array_sum (EURUSD,M4)      --- Simple GPU ---
2025.12.11 20:18:38.787 test_array_sum (EURUSD,M4)      OpenCL: GPU device 'Intel Iris Xe Graphics' selected
2025.12.11 20:18:39.187 test_array_sum (EURUSD,M4)        GPU load:    112267 µs
2025.12.11 20:18:39.187 test_array_sum (EURUSD,M4)        GPU compute: 27958 µs
2025.12.11 20:18:39.187 test_array_sum (EURUSD,M4)        GPU total:   140225 µs
2025.12.11 20:18:39.187 test_array_sum (EURUSD,M4)        Sum: 5050000000
2025.12.11 20:18:39.229 test_array_sum (EURUSD,M4)      
2025.12.11 20:18:39.229 test_array_sum (EURUSD,M4)      ▓▓▓ TEST 2: HEAVY COMPUTE (sin × cos) ▓▓▓
2025.12.11 20:18:40.568 test_array_sum (EURUSD,M4)      CPU: 1339550 µs, sum = -136000000
2025.12.11 20:18:40.568 test_array_sum (EURUSD,M4)      --- Heavy GPU ---
2025.12.11 20:18:41.019 test_array_sum (EURUSD,M4)        GPU load:    105231 µs
2025.12.11 20:18:41.019 test_array_sum (EURUSD,M4)        GPU compute: 27317 µs
2025.12.11 20:18:41.019 test_array_sum (EURUSD,M4)        GPU total:   132548 µs
2025.12.11 20:18:41.019 test_array_sum (EURUSD,M4)        Sum: -136000000
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ╔════════════════════════════════════════════════════════════╗
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║                    RESULTS SUMMARY                         ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ╠════════════════════════════════════════════════════════════╣
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║ SIMPLE SUM (only +=):                                      ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   CPU:         28 ms
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   GPU total:   140 ms (load: 112 + compute: 27)
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   Winner:      CPU ✓ (5.0 x faster)
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║                                                            ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║ HEAVY COMPUTE (sin × cos):                                 ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   CPU:         1339 ms
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   GPU total:   132 ms (load: 105 + compute: 27)
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   Winner:      GPU ✓ (10.1 x faster)
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ╠════════════════════════════════════════════════════════════╣
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║                    CONCLUSIONS                             ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ╠════════════════════════════════════════════════════════════╣
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║ Memory bandwidth:                                          ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   CPU→GPU transfer: 3.31 GB/s
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   CPU throughput:   3.54 G elements/s
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   GPU throughput:   3.58 G elements/s
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║                                                            ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║ Recommendations:                                           ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   • Simple ops:    USE CPU (GPU 5.0 x slower)
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   • Heavy compute: USE GPU (10.1 x faster)
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║                                                            ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║ GPU is optimal when:                                       ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   • Data stays on GPU (pipeline)                           ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   • Heavy compute per element (sin, cos, sqrt, etc.)       ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ║   • Multiple passes over same data                         ║
2025.12.11 20:18:41.060 test_array_sum (EURUSD,M4)      ╚════════════════════════════════════════════════════════════╝


It is important to understand that this is the result on a very weak integrated laptop GPU, Intel Iris Xe Graphics, which uses shared system RAM.
If you install a discrete card for about 500 quid (e.g. an NVIDIA RTX 4070), GPU memory load speed will increase 4-5 times and computation speed roughly 10 times compared to my card.

Expected results on RTX 4070

Metric                 Iris Xe     RTX 4070       Difference
GPU load (400 MB)      110 ms      ~25 ms*        ~4x
GPU compute (simple)   26 ms       ~2-3 ms        ~10x
GPU compute (heavy)    28 ms       ~2-3 ms        ~10x
Memory bandwidth       ~3.6 GB/s   ~15-20 GB/s**  ~5x

*PCIe 4.0 x16 ≈ 25 GB/s theoretical
**real throughput through PCIe, not VRAM bandwidth

It would be interesting to see the result of this test on more advanced graphics cards. If anyone has one, please post the result.

PS The second run of loading the array into the GPU is always faster because of the L3 cache, where our array ends up. This is especially noticeable when the array is smaller than the L3 cache.
 
Nikolai Semko #:
The best option is to have all the logic inside a C++ DLL
But you can try a partial solution via loop unrolling

and enable OpenMP there

https://learn.microsoft.com/ru-ru/cpp/parallel/openmp/reference/openmp-directives?view=msvc-170

https://stackoverflow.com/questions/27056090/how-to-parallelize-this-array-sum-using-openmp

 
Nikolai Semko #:

For the sake of curiosity, I generated test code using OpenCL with AI assistant

Is the dialogue saved?

I get ERR_OPENCL_NOT_SUPPORTED.

 
fxsaber #:

Is the dialogue saved?

I get ERR_OPENCL_NOT_SUPPORTED.

Do the MT examples work? Maybe the drivers are not installed.

I have such a problem:

Forum on trading, automated trading systems and testing trading strategies

OpenCL in trading

Rorschach, 2024.11.13 20:37

I installed AMD APP SDK 3.0; the terminal writes "opencl.dll not found, please install OpenCL drivers". Are CPUs no longer supported? Third-party software sees the device and works with it.


Nikolai Semko #:
loss of auto-vectorisation (does not recognise pattern)

This is the problem https://www.mql5.com/ru/forum/495741/page2#comment_58102015

 
╔════════════════════════════════════════════════════════════╗
║          CPU vs GPU BENCHMARK (OpenCL)                     ║
╠════════════════════════════════════════════════════════════╣
║  Elements: 100 M | Data size: 381 MB
╚════════════════════════════════════════════════════════════╝

▓▓▓ TEST 1: SIMPLE SUM (only +=) ▓▓▓
CPU: 22857 µs, sum = 5050000000
--- Simple GPU ---
OpenCL: GPU device 'NVIDIA GeForce RTX 3060' selected
  GPU load:    73441 µs
  GPU compute: 1309 µs
  GPU total:   74750 µs
  Sum: 5050000000 ✓

▓▓▓ TEST 2: HEAVY COMPUTE (sin × cos) ▓▓▓
CPU: 1999283 µs, sum = -136000000
--- Heavy GPU ---
  GPU load:    75667 µs
  GPU compute: 1723 µs
  GPU total:   77390 µs
  Sum: -136000000 ✓

╔════════════════════════════════════════════════════════════╗
║                    RESULTS SUMMARY                         ║
╠════════════════════════════════════════════════════════════╣
║ SIMPLE SUM (only +=):                                      ║
║   CPU:         22 ms
║   GPU total:   74 ms (load: 73 + compute: 1)
║   Winner:      CPU ✓ (3.3 x faster)
║                                                            ║
║ HEAVY COMPUTE (sin × cos):                                 ║
║   CPU:         1999 ms
║   GPU total:   77 ms (load: 75 + compute: 1)
║   Winner:      GPU ✓ (25.8 x faster)
╠════════════════════════════════════════════════════════════╣
║                    CONCLUSIONS                             ║
╠════════════════════════════════════════════════════════════╣
║ Memory bandwidth:                                          ║
║   CPU→GPU transfer: 5.07 GB/s
║   CPU throughput:   4.38 G elements/s
║   GPU throughput:   76.39 G elements/s
║                                                            ║
║ Recommendations:                                           ║
║   • Simple ops:    USE CPU (GPU 3.3 x slower)
║   • Heavy compute: USE GPU (25.8 x faster)
║                                                            ║
║ GPU is optimal when:                                       ║
║   • Data stays on GPU (pipeline)                           ║
║   • Heavy compute per element (sin, cos, sqrt, etc.)       ║
║   • Multiple passes over same data                         ║
╚════════════════════════════════════════════════════════════╝
 
Rorschach #:

Do the MT examples work? Maybe the drivers are not installed.

I don't know where to look.
 
fxsaber #:
I don't know where to look.
OpenCL drivers are installed automatically on recent hardware. Perhaps your GPU is old.
You can download the free utility Geeks3D GPU Caps Viewer to check your video card and whether it supports OpenCL, and then manually download and install the drivers.


 
Nikolai Semko #:
OpenCL drivers are installed automatically on recent hardware. Perhaps your GPU is old.
You can download the free utility Geeks3D GPU Caps Viewer to check your video card and whether it supports OpenCL, and then manually download and install the drivers.


Just for the record: using the GPU for calculations and the CPU for rendering (and Blend2d is pure CPU) is, to some extent, a mega solution :-) With all sympathy for Blend, I honestly thought there would be a move to Skia: it is more hardware-accelerated.

modern way

 
Edgar Akhmadeev #:

Of course, the computational performance of a GPU card (NVIDIA GeForce RTX 3060), even at a cost of ~300 USD, compared to a CPU is impressive: more than 1000 times, i.e. about 0.017 nanoseconds per iteration of sum += (long)(sin(x) * cos(x) * 1000.0f);

It is hard even to imagine the result on an NVIDIA GPU cluster costing ~30,000 USD, the sale of which is strictly restricted by US government quotas and which many countries, especially China, are hunting for.
Recently Kazakhstan got a handful of such clusters, and, as far as I understand, that is why Pavel Durov came to Kazakhstan and met with President Tokayev.
It has reached the point where a single company, NVIDIA, determines the strategic security of states.

PS. We should make sure that there is no optimisation due to caching of previous calculations, since we have a very primitive array [1,2,3,...,100,1,2,...,100,...,100].
Please insert this line at line 117:

for (int i = 0; i < Size; Array[i++] = rand());

and double-check the result to make sure that caching plays no role.

 
Maxim Kuznetsov #:

Just for the record: using the GPU for calculations and the CPU for rendering (and Blend2d is pure CPU) is, to some extent, a mega solution :-) With all sympathy for Blend, I honestly thought there would be a move to Skia: it is more hardware-accelerated.

modern way

Yes, I agree. Meanwhile, cloud providers are starting to actively add GPU solutions to their infrastructure, mainly for LLMs.
I already have experience installing and running a local LLM, LLaMA 3.1, and I understand how important a powerful GPU is. To run such an LLM comfortably you will have to shell out about 3000 USD for a GPU. CPUs can't cope at all.