Features of the mql5 language, subtleties and tricks - page 323

 
Nikolai Semko #:


PS. We should make sure that there is no optimisation due to caching of the last calculations, because we have a very primitive array [1,2,3,…,100, 1,2,…,100, …, 1,2,…,100]
please insert this line into line 117:

and double-check the result to make sure that caching has nothing to do with it.

on an RTX 4060

2025.12.13 00:40:30.662 test_array_sum (EURUSD,H1) ▓▓▓ TEST 1: SIMPLE SUM (only +=) ▓▓▓

2025.12.13 00:40:30.676 test_array_sum (EURUSD,H1) CPU: 14067 µs, sum = 1638475978008

2025.12.13 00:40:30.676 test_array_sum (EURUSD,H1) --- Simple GPU ---

2025.12.13 00:40:30.744 test_array_sum (EURUSD,H1) OpenCL: GPU device 'NVIDIA GeForce RTX 4060' selected

2025.12.13 00:40:30.789 test_array_sum (EURUSD,H1)   GPU load:    40551 µs

2025.12.13 00:40:30.789 test_array_sum (EURUSD,H1)   GPU compute: 1691 µs

2025.12.13 00:40:30.789 test_array_sum (EURUSD,H1)   GPU total:   42242 µs

2025.12.13 00:40:30.789 test_array_sum (EURUSD,H1)   Sum: 1638475978008

2025.12.13 00:40:30.801 test_array_sum (EURUSD,H1) ▓▓▓ TEST 2: HEAVY COMPUTE (sin × cos) ▓▓▓

2025.12.13 00:40:31.992 test_array_sum (EURUSD,H1) CPU: 1190328 µs, sum = -5115631

2025.12.13 00:40:31.992 test_array_sum (EURUSD,H1) --- Heavy GPU ---

2025.12.13 00:40:32.082 test_array_sum (EURUSD,H1)   GPU load:    40659 µs

2025.12.13 00:40:32.082 test_array_sum (EURUSD,H1)   GPU compute: 1674 µs

2025.12.13 00:40:32.082 test_array_sum (EURUSD,H1)   GPU total:   42333 µs

2025.12.13 00:40:32.082 test_array_sum (EURUSD,H1)   Sum: -5137126

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ╔════════════════════════════════════════════════════════════╗

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║                    RESULTS SUMMARY                         ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ╠════════════════════════════════════════════════════════════╣

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║ SIMPLE SUM (only +=):                                      ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   CPU:         14 ms

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   GPU total:   42 ms (load: 40 + compute: 1)

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   Winner:      CPU ✓ (3.0 x faster)

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║                                                            ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║ HEAVY COMPUTE (sin × cos):                                 ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   CPU:         1190 ms

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   GPU total:   42 ms (load: 40 + compute: 1)

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   Winner:      GPU ✓ (28.1 x faster)

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ╠════════════════════════════════════════════════════════════╣

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║                    CONCLUSIONS                             ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ╠════════════════════════════════════════════════════════════╣

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║ Memory bandwidth:                                          ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   CPU→GPU transfer: 9.18 GB/s

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   CPU throughput:   7.11 G elements/s

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   GPU throughput:   59.14 G elements/s

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║                                                            ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║ Recommendations:                                           ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   • Simple ops:    USE CPU (GPU 3.0 x slower)

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   • Heavy compute: USE GPU (28.1 x faster)

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║                                                            ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║ GPU is optimal when:                                       ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   • Data stays on GPU (pipeline)                           ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   • Heavy compute per element (sin, cos, sqrt, etc.)       ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ║   • Multiple passes over same data                         ║

2025.12.13 00:40:32.097 test_array_sum (EURUSD,H1) ╚════════════════════════════════════════════════════════════╝
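The log above makes the tradeoff easy to model: GPU total time is dominated by the fixed ~40 ms host-to-device transfer, while the kernel time barely changes between the simple and heavy tests. A quick sketch built from the figures in this log (the variable names are mine, the numbers are copied from the output):

```python
# Rough cost model from the RTX 4060 log: the GPU pays a fixed transfer
# cost, the CPU pays per-element work.

gpu_load_us    = 40551   # host->device transfer (fixed cost)
gpu_compute_us = 1691    # kernel time, almost independent of per-element work

cpu_simple_us = 14067    # CPU, simple += over 100M elements
cpu_heavy_us  = 1190328  # CPU, sin*cos per element

gpu_total_us = gpu_load_us + gpu_compute_us

# Simple sum: the transfer alone already exceeds the whole CPU run.
assert gpu_load_us > cpu_simple_us
print(f"simple: CPU wins {gpu_total_us / cpu_simple_us:.1f}x")   # ~3.0x

# Heavy compute: the fixed transfer cost is amortised by per-element work.
print(f"heavy:  GPU wins {cpu_heavy_us / gpu_total_us:.1f}x")    # ~28.2x
```

The breakeven point is simply the amount of per-element CPU work that exceeds the fixed transfer cost; this is why the summary below recommends the GPU only when data stays on the device or the per-element maths is heavy.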

 
Nikolai Semko #:

PS. We should make sure that there is no optimisation due to caching of the last calculations, because we have a very primitive array [1,2,3,…,100, 1,2,…,100, …, 1,2,…,100]
please insert this line into line 117:

and double-check the result to make sure that caching has nothing to do with it.

on the RTX 4060 with the new line added

CPU: 13790
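The caching concern itself can also be checked outside MQL5: sum the same repeating [1..100] pattern and a shuffled copy of it. If the runtime were reusing a cached result for the predictable pattern, the shuffled run (same values, unpredictable order) would behave differently. A minimal Python sketch, not the original test_array_sum code:

```python
import random
import time

N = 1_000_000
pattern  = [(i % 100) + 1 for i in range(N)]   # 1,2,...,100 repeating, as in the test
shuffled = pattern[:]
random.shuffle(shuffled)                        # same values, unpredictable order

t0 = time.perf_counter(); s1 = sum(pattern);  t1 = time.perf_counter()
t2 = time.perf_counter(); s2 = sum(shuffled); t3 = time.perf_counter()

# Same sum either way; comparable timings mean no result-caching shortcut.
assert s1 == s2
print(f"pattern: {t1 - t0:.4f}s  shuffled: {t3 - t2:.4f}s  sum={s1}")
```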


 
lynxntech #:

on the RTX 4060 with the new line added

CPU: 13790


thanks! So caching has no effect on the result. A video card costing ~250 USD beats a perfectly normal modern processor by roughly 700 times in raw compute (1674 µs vs 1190328 µs in the heavy test).
In short, everyone should urgently upgrade their GPUs. Including me :)))
And it's time to rewrite Canvas with OpenCL support.

By the way, I even tried writing in Python with OpenCL as a test. Everything runs far faster than equivalent native C++ programs written without the GPU.

 
Maxim Kuznetsov #:

I just have to mention: using the graphics card for calculations and the CPU for rendering (and Blend2d is pure CPU) is, to some extent, a mega-solution :-) With all my sympathy for Blend, I honestly thought there would be a move to Skia, since it is more hardware-accelerated.

modern way

By the way, I thought the same thing about Skia:
https://www.mql5.com/ru/forum/487541/page4#comment_56817317
"Надеюсь, что не уйдет в долгострой" ("I hope it doesn't drag on forever"), 2025.05.29, www.mql5.com
 
╔════════════════════════════════════════════════════════════╗
║          CPU vs GPU BENCHMARK (OpenCL)                     ║
╠════════════════════════════════════════════════════════════╣
║  Elements: 100 M | Data size: 381 MB
╚════════════════════════════════════════════════════════════╝

▓▓▓ TEST 1: SIMPLE SUM (only +=) ▓▓▓
CPU: 31423 µs, sum = 1638263169596
--- Simple GPU ---
OpenCL: GPU device 'NVIDIA GeForce RTX 3060' selected
  GPU load:    73800 µs
  GPU compute: 1249 µs
  GPU total:   75049 µs
  Sum: 1638263169596 ✓

▓▓▓ TEST 2: HEAVY COMPUTE (sin × cos) ▓▓▓
CPU: 2541158 µs, sum = -143796
--- Heavy GPU ---
  GPU load:    78563 µs
  GPU compute: 1752 µs
  GPU total:   80315 µs
  Sum: -164921 ✗

╔════════════════════════════════════════════════════════════╗
║                    RESULTS SUMMARY                         ║
╠════════════════════════════════════════════════════════════╣
║ SIMPLE SUM (only +=):                                      ║
║   CPU:         31 ms
║   GPU total:   75 ms (load: 73 + compute: 1)
║   Winner:      CPU ✓ (2.4 x faster)
║                                                            ║
║ HEAVY COMPUTE (sin × cos):                                 ║
║   CPU:         2541 ms
║   GPU total:   80 ms (load: 78 + compute: 1)
║   Winner:      GPU ✓ (31.6 x faster)
╠════════════════════════════════════════════════════════════╣
║                    CONCLUSIONS                             ║
╠════════════════════════════════════════════════════════════╣
║ Memory bandwidth:                                          ║
║   CPU→GPU transfer: 5.04 GB/s
║   CPU throughput:   3.18 G elements/s
║   GPU throughput:   80.06 G elements/s
║                                                            ║
║ Recommendations:                                           ║
║   • Simple ops:    USE CPU (GPU 2.4 x slower)
║   • Heavy compute: USE GPU (31.6 x faster)
║                                                            ║
║ GPU is optimal when:                                       ║
║   • Data stays on GPU (pipeline)                           ║
║   • Heavy compute per element (sin, cos, sqrt, etc.)       ║
║   • Multiple passes over same data                         ║
╚════════════════════════════════════════════════════════════╝
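The bandwidth figure in this summary can be sanity-checked from the raw log lines: 100 M 4-byte values is ~381 MiB (matching the header), and dividing by the measured load time lands close to the reported 5.04 GB/s, which suggests the summary is actually reporting GiB/s. A quick check with figures copied from the log (the 4-byte element size is inferred from the 381 MB header, not stated in the log):

```python
# Sanity-check the bandwidth figures from the RTX 3060 log above.
elements   = 100_000_000        # "Elements: 100 M"
bytes_each = 4                  # inferred: 381 MB header / 100 M elements
load_us    = 73_800             # "GPU load: 73800 µs"

size_mib = elements * bytes_each / 2**20
print(f"data size: {size_mib:.0f} MiB")          # ~381, matches the header

gib_per_s = elements * bytes_each / (load_us / 1e6) / 2**30
print(f"transfer:  {gib_per_s:.2f} GiB/s")       # ~5.05, close to the log's 5.04
```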
 
Nikolai Semko #:

thank you! So caching has no effect on the result. A video card costing ~250 USD beats a perfectly normal modern processor by roughly 700 times in raw compute.
In short, everyone should urgently upgrade their GPUs. Including me :)))
And it's time to rewrite Canvas with OpenCL support.

In 2D/3D, where matrix multiplications occur, switching to OpenCL is understandable. But in algotrading I don't see the use case. The main tool there is the Tester, and I don't understand how OpenCL can help it.


PS

 

I'm dabbling in LLMs on my RTX 3060 with 12 GB; not enough power. Let's not even talk about the CPU. A 30B model is ~15 GB, and a 48K context eats many more gigabytes of VRAM on top of that.

If you are serious about local LLMs, you should put in 2 RTX 5xxx GPUs with 24 GB of VRAM each.

And if you are very serious, you should put in 6 single-slot GPUs with blower coolers (such novelties exist now, damned expensive) and 3 PSUs. The catch is that on regular motherboards the extra PCIe slots are wired with only x2 lanes, and that is a bottleneck. Not for inference (x2 is enough there), but for training, where there is heavy traffic between the card and the CPU. I don't even know if there are motherboards with enough lanes to give x8 to 6 cards. Mine holds 2 cards.

And if it gets really serious, I'll have to splurge on a home mini-supercomputer. They say it can beat GPU servers.

All this, of course, on condition that there is a financial result. Which, in my opinion, is not far off. In the meantime, we need to get ready for the new technologies.
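The "30B model - 15 GB" figure implies roughly 4 bits per parameter, i.e. a quantised model; the KV cache for a long context then comes on top of the weights. A back-of-the-envelope sketch: the 30B and 48K numbers are from the post above, while the layer count, hidden size, and GQA factor are typical assumed values for a 30B-class model, not measured ones:

```python
# Back-of-the-envelope VRAM estimate for the 30B / 48K setup in the post.
# Layer/hidden/GQA constants are typical assumptions, not measured.

params = 30e9
bits   = 4                                  # 4-bit quantisation -> matches "30B - 15 GB"
weights_gb = params * bits / 8 / 1e9
print(f"weights:  {weights_gb:.0f} GB")     # 15 GB

# KV cache: 2 tensors (K,V) x layers x hidden x 2 bytes (fp16) x context,
# divided by the grouped-query-attention factor (assumed 8).
layers, hidden, ctx, gqa = 60, 6656, 48_000, 8
kv_gb = 2 * layers * hidden * 2 * ctx / gqa / 1e9
print(f"KV cache: {kv_gb:.0f} GB")          # roughly 10 GB on top of the weights
```

Under these assumptions, weights plus KV cache already overflow a 12 GB card, which matches the post's experience on the RTX 3060.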

 
Nikolai Semko #:

2. optimisation in the tester:

  • MT5 genetic algorithm uses CPU
  • Grid optimisation 10000 passes × 100 parameters - GPU can count fitness functions in parallel
  • Custom optimisers with OpenCL give x50 to speed
In all these sub-items, I don't see a way to speed up. There's a sequential enumeration of ticks.
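Sequential ticks inside one pass don't prevent parallelism across passes: each pass (one parameter set) must walk the ticks in order, but no pass depends on another's result, which is exactly the shape GPUs like (one work-item per parameter set). A toy sketch of that structure; the "strategy" and parameters here are invented purely for illustration:

```python
# Each pass walks ticks strictly in order (sequential inside),
# but the passes are independent of each other (parallel across).

ticks = [(i % 7) - 3 for i in range(1_000)]        # toy tick deltas (invented)

def run_pass(threshold):
    """One tester pass: sequential over ticks, returns a fitness value."""
    position, profit = 0, 0
    for d in ticks:                                # order matters inside a pass
        if position:
            profit += position * d
        position = 1 if d > threshold else -1
    return profit

# Grid of parameter sets: no pass reads another's state, so this map()
# could just as well be a GPU kernel launch, one work-item per set.
fitness = list(map(run_pass, range(-3, 4)))
best = max(range(len(fitness)), key=fitness.__getitem__)
print("fitness per threshold:", fitness, "best index:", best)
```

This is the sense in which a grid optimisation of 10,000 passes can use the GPU: the tick loop stays sequential, but 10,000 copies of it run side by side.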
 
Edgar Akhmadeev #:

I'm dabbling in LLMs on my RTX 3060 with 12 GB; not enough power. Let's not even talk about the CPU. A 30B model is ~15 GB, and a 48K context eats many more gigabytes of VRAM on top of that.

If you are serious about local LLMs, you should put in 2 RTX 5xxx GPUs with 24 GB of VRAM each.

And if you are very serious, you should put in 6 single-slot GPUs with blower coolers (such novelties exist now, damned expensive) and 3 PSUs. The catch is that on regular motherboards the extra PCIe slots are wired with only x2 lanes, and that is a bottleneck. Not for inference (x2 is enough there), but for training, where there is heavy traffic between the card and the CPU. I don't even know if there are motherboards with enough lanes to give x8 to 6 cards. Mine holds 2 cards.

And if it gets really serious, I'll have to splurge on a home mini-supercomputer. They say it can beat GPU servers.

All this, of course, on condition that there is a financial result. Which, in my opinion, is not far off. In the meantime, we need to get ready for the new technologies.

I agree.
I've already done some preliminary calculations: a satisfactory home GPU server for research work is about 7-10k USD (for an LLM with ~70 billion parameters).
And a proper server for an LLM with 300-400 billion parameters is already ~70-100k USD. And even then you still have to manage to find a GPU cluster for 30k.
 
fxsaber #:
In all these sub-items, I don't see a way to speed up. There's a sequential enumeration of ticks.

You will have to write your own tester instead of using the MT one.