Features of the mql5 language, subtleties and tricks - page 308

 

b5179, I tried to speed up calculations by creating a table of pre-calculated data. But I didn't get any acceleration, because every access to an array element by index in MQL5 triggers a bounds check, even when the code is provably error-free.


Then I decided to try to create an object that has FASTER access to the index than the array. And I "succeeded"!

I put it in quotes because I got unexplained results.

#include <fxsaber\TicksShort\Table.mqh> // https://www.mql5.com/en/code/61126
#include <fxsaber\Benchmark\Benchmark.mqh> // https://www.mql5.com/en/code/31279

#define  AMOUNT 50 // If you make it 60+, it will slow down hundreds of times!

void OnStart()
{
  long Array[2048];
  ArrayInitialize(Array, 1);

  TABLE<long> Table;
  Table = Array;
    
  long Sum1 = 0;
  long Sum2 = 0;
    
  const ulong StartTime1 = GetMicrosecondCount();
    for (int i = 0; i < 1e10 / AMOUNT; i++)
      for (uchar j = 0; j < AMOUNT; j++)
        Sum1 += Array[j];
  Print(GetMicrosecondCount() - StartTime1);

  const ulong StartTime2 = GetMicrosecondCount();
  _BV( // If you comment out this line and the line four below, the slowdown will be 200 times!
  for (int i = 0; i < 1e10 / AMOUNT; i++)
    for (uchar j = 0; j < AMOUNT; j++)
      Sum2 += Table[j];
     , 1)
  Print(GetMicrosecondCount() - StartTime2);
      
  Print(Sum1);
  Print(Sum2);
}


The performance of this code depends greatly on whether compiler optimisation is enabled, and also on the selected instruction set (x64, AVX, ...).

432795 // Time to calculate the sum of array elements by index.
Alert: Bench_Stack = 0, 1 <= Time[Test6.mq5 187 in OnStart: for (int i = 0; i < 1 e10 / AMOUNT; i++) for (uchar j = 0; j < AMOUNT; j++) Sum2 += Table[j];] = 403864 mcs.
403951 // Time to calculate the sum of "table" elements by index.
10000000000
10000000000

It turns out there is a way to access items by index that is 8-9% faster than indexing an array.


And then the inexplicable things begin, which I highlighted in the source code. For some reason the table is summed this quickly only when the loop is wrapped in the macro. Because of that macro I had to add the corresponding #include, and I cannot reproduce the macro's effect without it - the code slows down hundreds of times.


The same wild slowdown also occurs when the number of calculated elements increases slightly - see the comment on AMOUNT.


Another complaint about the compiler. In the source code the array index is of type uchar, and the corresponding static array has 2048 elements - obviously more than UCHAR_MAX. So why does the compiler still check the index against the array bounds on every access?


Please do not emit such checks in situations like this. And please explain the anomalous behaviour of the script above. Also: is it possible to build data tables in static arrays so that the index is not checked on every access?

Search string: Uluchshenie 130.
 
fxsaber #:

b5179, I tried to speed up calculations by creating a table of pre-calculated data. But I didn't get any acceleration, because every access to an array element by index inside MQL5 causes checking of index correctness, even if everything in the code is error-free.

There are a few points to consider here:

1. Index checking for arrays is done at compile time only (if the compiler can deduce the array bounds) and never at runtime; that is why you may get an "array out of index" runtime error if your index is beyond the array bounds.

2. The faster access times achieved here are not due to an object or a structure; they come merely from the compiler's optimization of the "switch" statement.

When compiling a small-to-medium switch with consecutive cases (0 to n) in optimization mode, the compiler converts the series of multiple if-else comparisons, O(n), into a jump table: an array of pointers to code blocks indexed by the case number at runtime, giving O(1) direct access similar to a lookup table backed by an array.

3. Optimization of a switch depends on the number of cases plus the optimization mode. Switches with many cases, or complex ones, do not get this jump-table optimization and remain a chain of if-else comparisons.

4. A lookup table using an array will always be the standard and most reliable method, regardless of the number of cases, their complexity, or the compiler's optimization mode.

5. It would be more interesting if you measured random access times instead of sequential ones, to negate the effect of the CPU cache on the results:

const ulong StartTime1 = GetMicrosecondCount();
for (int i = 0; i < INT_MAX; i++) { 
   uchar j = (uchar) (((ulong)i * 601) % AMOUNT);  // Random index. 601 is a prime > AMOUNT
   Sum1 += Array[j];
}
Print(GetMicrosecondCount() - StartTime1);

 
amrali #:
There are a few things to consider here:

1. Index checking for arrays is only done at compile time (if the compiler can detect array boundaries) and never at runtime, so you may get an "array out of index" error at runtime if your index is outside the array boundaries.

I disagree. The out-of-bounds check for arrays happens at runtime (as does a pointer-validity check every time you access an object through a pointer).

 
fxsaber #:

I disagree. The out-of-bounds check for arrays happens at runtime (as does a pointer-validity check every time you access an object through a pointer).

Yes, it seems I was wrong in point #1 - I did not know about the new runtime access-control checks.

Still, the other points can provide valid explanations for your findings.

Another way to defeat spatial and temporal locality:
   // Generate an array of randomized indices
   int indices[][2];
   ArrayResize(indices, AMOUNT);
   for(int i = 0; i < AMOUNT; i++) {
      indices[i][0] = MathRand();
      indices[i][1] = i;
   }
   ArraySort(indices);

   // Iterate the array in a randomized order
   const ulong StartTime1 = GetMicrosecondCount();
   for (int i = 0; i < 1e10 / AMOUNT; i++)
      for (uchar j = 0; j < AMOUNT; j++) {
         int k = indices[j][1];
         Sum1 += Array[k];
      }
   Print(GetMicrosecondCount() - StartTime1);
 
amrali #:
Another way to defeat spatial and temporal locality is :

A good way to mix up the indexes. Thanks.

 

Simpler mix (Fixed):

const ulong StartTime1 = GetMicrosecondCount();
for (int i = 0; i < INT_MAX; i++) { 
   uchar j = (uchar) (((ulong)i * 67) % AMOUNT);  // Pseudo-random index. 67 is prime and 67 % AMOUNT != 1
   Sum1 += Array[j];
}
Print(GetMicrosecondCount() - StartTime1);
 
b5179, another interesting compiler behaviour.
#define  AMOUNT 2000

void OnStart()
{
  long Array[AMOUNT];
  ArrayInitialize(Array, 1);
  
  long Sum = 0;

  const ulong StartTime1 = GetMicrosecondCount();
    for (int i = 0; i < 1e10 / AMOUNT; i++)
      for (uint j = 0; j < AMOUNT; j++)
// Sum += Array[j]; // 1880657
        Sum += Array[j] << 1; // 1154557: 40% faster in AVX mode!
  Print(GetMicrosecondCount() - StartTime1);
      
  Print(Sum);
  Print("Compiler Version: " + (string)__MQLBUILD__ + " " + __CPU_ARCHITECTURE__); // Compiler Version: 5179 AVX
}

If you do an additional bit operation during addition, the execution speed increases by 40%.

Search string: Oshibka 139.

 
fxsaber #:
b5179, another interesting compiler behaviour.

If you do an additional bit operation during addition, the execution speed increases by 40%.

Search string: Oshibka 139.

By the way, I accidentally found that ArrayInitialize(array, 0) is much faster than ZeroMemory(array) for long[] arrays.
This could make a difference in a class constructor or a reset() method.
 
amrali #:
By the way, I accidentally found that ArrayInitialize(array, 0) is much faster than ZeroMemory(array) for long[] arrays.

The second function zeroes the array size.

 
fxsaber #:

The second function zeroes the array size.

Wrong!
ZeroMemory() zeros all the array elements. (The size does not change).

See the example in the docs: