MQL5 compiler is smarter than it seems: the compiler sees 8 different indexes and 8 separate boundary checks at each iteration

Dominik Egert 2025.12.01 06:46 #3201

Alain Verleyen #:

The preprocessor knows nothing about the value of 'o' or any variable.

What are you REALLY trying to achieve?

I guess he is looking for something like the C/C++ constexpr feature.

MQL's preprocessor does not support a constexpr-like feature.
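For readers who do not know the C++ feature mentioned here, a minimal sketch in C++ (not MQL5) of what constexpr allows: the compiler evaluates a function on constant arguments, and the result can be used wherever a compile-time constant is required. The names pick_suffix, kUseSmall, kSuffix and buffer are hypothetical and exist only for this illustration.

```cpp
// C++ (not MQL5): illustration of compile-time evaluation with constexpr.
// pick_suffix, kUseSmall, kSuffix and buffer are hypothetical names.
constexpr bool kUseSmall = false;

// Evaluated by the compiler when called with constant arguments.
constexpr int pick_suffix(bool use_small) {
    return use_small ? 3 : 5;
}

// The result is itself a compile-time constant.
constexpr int kSuffix = pick_suffix(kUseSmall);
static_assert(kSuffix == 5, "value selected at compile time");

// An array size must be a constant expression, so this only compiles
// because kSuffix is known to the compiler.
int buffer[kSuffix];
```

Note that even in C++ this is a compiler feature, not a preprocessor one: a constexpr value cannot be pasted into an identifier with ##.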

amrali 2025.12.01 07:06 #3202

This way you can use compile-time conditions:

#define mcr2(z) x##z = 5
#define mcr(z) mcr2(z)

// uncomment next line and re-check
//#define MYCOND

#ifdef MYCOND
  #define num 3
#endif

#ifndef MYCOND
  #define num 5
#endif

void OnStart()
{
   char x3 = 0, x4 = 0, x5 = 0;   
   mcr(num);
   Print("x3 = ",x3,", x4 = ",x4,", x5 = ",x5);
}
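
The two-level macro (mcr calling mcr2) is what makes this work. A sketch of the same technique in C++, where the rule is well defined: an argument that is a direct operand of ## is not macro-expanded, so one extra level of indirection is needed to turn num into 5 before pasting. The thread's example suggests the MQL5 preprocessor follows the same rule, but that is inferred from the example rather than taken from documentation. The macro and function names below are hypothetical.

```cpp
// C++ sketch of the indirection trick; all names here are hypothetical.
#define NUM 5
#define PASTE_DIRECT(z)   x##z           // z is a ## operand: NOT expanded first
#define PASTE_IMPL(z)     x##z
#define PASTE_EXPANDED(z) PASTE_IMPL(z)  // z is expanded to 5, then pasted

int demo() {
    int x5 = 0;
    int xNUM = 0;
    PASTE_DIRECT(NUM) = 1;    // becomes: xNUM = 1;
    PASTE_EXPANDED(NUM) = 2;  // becomes: x5 = 2;
    return x5 * 10 + xNUM;    // encodes both results in one value
}
```

Here demo() returns 21: the indirect macro wrote to x5, while the direct one wrote to a variable literally named xNUM.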

Andrei Iakovlev 2025.12.01 12:08 #3203

amrali #:
This way you can use compile-time conditions

Works just fine. Thank you.

fxsaber 2025.12.11 19:30 #3204

Why is the second option 20 times slower?

#import "msvcrt.dll"
  ulong memcpy( const int &dst, const ulong src, const int cnt );
  ulong memcpy( const int &dst[], const int &src[], const int cnt );  
#import

void OnStart()
{
  int Array[];
  ArrayResize(Array, (int)1e8);

  ArrayInitialize(Array, 1);
  
  const ulong StartTime1 = GetMicrosecondCount();  
  const int Size1 = ArraySize(Array);
  
  int Sum1 = 0;
  
  for (int i = 0; i < Size1; i++)
    Sum1 += Array[i];
    
  Print(GetMicrosecondCount() - StartTime1); // 32029
  Print(Sum1);

  const ulong StartTime2 = GetMicrosecondCount();  
  const ulong Size2 = msvcrt::memcpy(Array, Array, 0) + Size1 * sizeof(int);

  int Sum2 = 0;
  int Value = 0;
    
  for (ulong i = msvcrt::memcpy(Array, Array, 0); i < Size2; i += sizeof(int))
  {
    msvcrt::memcpy(Value, i, sizeof(int));
    
    Sum2 += Value;
  }

  Print(GetMicrosecondCount() - StartTime2); // 686099
  Print(Sum2);      
}

amrali 2025.12.11 19:45 #3205

fxsaber #:
Why is the second option 20 times slower?

1. calling an external DLL.

2. copying from memory.

Nikolai Semko 2025.12.11 20:02 #3206

fxsaber #:
Why is the second option 20 times slower?

The memcpy(Array, Array, 0) trick to get the address is a clever hack, but using it for element-by-element access to an array is an anti-pattern. DLL calls make sense only for batch operations, where one call processes a large block of data.

If you need low-level memory access in MQL5, it is better to put all the logic inside a single DLL function than to call into the DLL millions of times from MQL.

The reason is simple: the overhead of a DLL call is paid on every iteration.
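
A minimal sketch of the "one call per block" idea, written in C++ as the DLL side. SumInts is a hypothetical export name, not an existing API; on the MQL5 side it would presumably be imported with a declaration along the lines of `long SumInts(const int &data[], const int count);`, which is also an assumption. The DLL_EXPORT macro is only there so the sketch compiles outside Windows too.

```cpp
// Hypothetical DLL export: SumInts is an assumed name for illustration.
#include <cstdint>

#if defined(_WIN32)
  #define DLL_EXPORT extern "C" __declspec(dllexport)
#else
  #define DLL_EXPORT extern "C"
#endif

// Sums `count` 32-bit integers in a single call, so the cost of crossing
// the MQL5/DLL boundary is paid once instead of once per element.
DLL_EXPORT int64_t SumInts(const int32_t* data, int32_t count) {
    int64_t sum = 0;  // 64-bit accumulator, cannot overflow for int32 counts
    for (int32_t i = 0; i < count; ++i)
        sum += data[i];
    return sum;
}
```

With this shape the script makes one DLL call for the whole array instead of 1e8 calls, and the inner loop has no bounds checks.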


fxsaber 2025.12.11 20:08 #3207

@amrali, @Nikolai Semko, thank you!

I wanted to implement direct access to the array without checking if the index is correct.

Nikolai Semko 2025.12.11 20:35 #3208

fxsaber #:

@amrali, @Nikolai Semko, thank you!

I wanted to implement direct access to the array without checking if the index is correct.

The best option is to implement all the logic inside a C++ DLL.

But you can try a partial solution through loop unrolling:

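int i = 0;
// Main loop: process 8 elements per iteration.
for (; i <= Size - 8; i += 8) {
    Sum += Array[i]   + Array[i+1] + Array[i+2] + Array[i+3] +
           Array[i+4] + Array[i+5] + Array[i+6] + Array[i+7];
}
// Tail loop: handle the remaining Size % 8 elements.
for (; i < Size; i++)
    Sum += Array[i];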

fxsaber 2025.12.11 22:03 #3209

Nikolai Semko #:

But you can try a partial solution through loop unrolling

This variant is three times slower.

Nikolai Semko 2025.12.11 23:36 #3210

fxsaber #:

This variant is three times slower.

This confirms that the MQL5 compiler is smarter than it seems.
Apparently, the compiler sees 8 different indexes and 8 separate bounds checks in each iteration, and as a result it loses auto-vectorisation (it no longer recognises the pattern).
About 0.3 ns per iteration is, I think, almost the limit. Even a C++ DLL will not give a big gain. We are practically limited by memory bandwidth, not by CPU throughput.
Beyond that, the only option is to change the architecture of the solution. Using the GPU (OpenCL) might help, but only if there are multiple operations on the same buffer or calculations more complex than a plain sum. For a single sum, the GPU can be comparable or even slower because of transfer overhead.
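
The per-iteration figure can be checked against the timings posted earlier in the thread (32029 µs and 686099 µs for 1e8 int elements). A small C++ sketch of that arithmetic; the helper names are hypothetical, and 1 GB is taken as 1e9 bytes.

```cpp
// Back-of-the-envelope figures from the timings posted in the thread.
// Helper names are hypothetical.
constexpr double kElements = 1e8;         // array length used in the script
constexpr double kBytesPerElement = 4.0;  // sizeof(int) in MQL5

// Nanoseconds spent per element for a run that took `microseconds`.
constexpr double ns_per_element(double microseconds) {
    return microseconds * 1000.0 / kElements;
}

// Effective read throughput in GB/s (1 GB = 1e9 bytes).
constexpr double gigabytes_per_second(double microseconds) {
    return (kElements * kBytesPerElement) / (microseconds * 1e-6) / 1e9;
}
```

For the plain loop, ns_per_element(32029) is about 0.32 ns and gigabytes_per_second(32029) is about 12.5 GB/s. For the memcpy variant, ns_per_element(686099) is about 6.9 ns per element, roughly 21 times more. Whether 12.5 GB/s is close to the machine's limit depends on hardware that the thread does not specify.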


Features of the mql5 language, subtleties and tricks - page 321