Features of the mql5 language, subtleties and tricks - page 276

 
Edgar Akhmadeev #:

Ah, the days of assembly language programming... I used to calculate speed by knowing the number of clock cycles of each instruction. I had to, back in the 80s, when I wrote a driver for our laboratory audio card and a speech synthesiser for the GTS.

And on topic: what if you count the number of math operations and estimate the speed of each type ( * / + - )?

Nowadays you won't see it and you won't calculate it... well, or you have to study pipelines, register renaming, caches and so on.

It used to be simple: here is a processor, it executes instructions one by one, and the total time is the sum of them all. Now it pulls a batch of instructions into a cache, arranges them into chains, schedules the operations and executes them all simultaneously. Roughly like that.

 
Maxim Kuznetsov #:

Nowadays you won't see it and you won't calculate it... well, or you have to study pipelines, register renaming, caches and so on.

It used to be simple: here is a processor, it executes instructions one by one, and the total time is the sum of them all. Now it pulls a batch of instructions into a cache, arranges them into chains, schedules the operations and executes them all simultaneously. Roughly like that.

Very roughly like that. The x86 ISA is an interpreted "language".

Edit: actually, Intel does specify the latency of each instruction.

See this post for more details:

But generally, yes, you are right. It's not easy.
 

My observations about the results:

   uint  K = 536895458;
   uint  L = 1468000  ;
   uint  m = 146097   ;
   uint  q = 2939745  ;
   uint  r = 2141     ;
   uint  p = 197913   ;
   ulong t = (ulong)time;
   int   n = (int)(t / 86400)                   ;  // Unix day
   uint  N = ((uint)n) + K                      ;  // Computational calendar day
   uint  a = 4 * N + 3                          ;
   uint  c = a / m                              ;
   uint  e = a % m / 4                          ;
   uint  b = 4 * e + 3                          ;

   ulong f = ((ulong)q) * b                     ;
   uint  z = (uint)(f >> 32)                    ;
   uint  h = ((uint)f) / q / 4                  ;
   uint  d = r * h + p                          ;
   uint  Y = 100 * c + z                        ;
   uint  M = d >> 16                            ;
   uint  D = ((ushort)d) / r                    ;
   uint  J = h >= 306                           ;  // Map from Computational to Gregorian calendar
   int Y_G = int((Y - L) + J)                   ;
   int M_G = int(J ? M - 12 : M)                ;
   int D_G = int(D + 1)                         ;

This code executes fast because:

* it uses the least possible number of division and modulo operations (which take many processor cycles);

* most variables are 32-bit, unsigned types;

* there is only one branch (the conditional operator), so the CPU pipeline is not flushed frequently;

* the code benefits from the AVX2 instruction set, which supports FMA (fused multiply-add); the code mostly uses * and +.

For contrast, see why this code is slower:

bool TimeToJulian(datetime time, MqlDateTime& dt_struct)
  {
   int x = (int)(time / 86400)                           ;  // Unix day
   int J = x + 2440588                                   ;  // Julian day
   int f = J + j + (((4 * J + B) / 146097) * 3) / 4 + C  ;
   int e = r * f + v                                     ;
   int g = (e % p) / r                                   ;
   int h = u * g + w                                     ;
   int D = (h % s) / u + 1                               ;  // Map from Julian to Gregorian calendar
   int M = (h / s + m) % n + 1                           ;
   int Y = (e / p) - y + (n + m - M) / n                 ;

No branching, but a lot of div/mod operations.

One trick I learned recently about the datetime type: casting datetime variables to ulong (or, even better, to uint) gives a huge speed-up (depending on the surrounding code).

Here in this code, the second version of TimeHour() is faster:

int TimeHour(const datetime t)
  {
   return (int)(t / 3600) % 24;
  }

int TimeHour2(const datetime t)
  {
   return (int)((uint)t / 3600) % 24;
  }
  1.10 ns, checksum = 20499563265481417   // TimeHour
  0.67 ns, checksum = 20499563265481417   // TimeHour2

That's why fxsaber found a slow-down in TimeToJulian when hh:mm:ss are calculated: I had forgotten to cast to ulong :-)

bool TimeToJulian(datetime time, MqlDateTime& dt_struct)
  {
   int x = (int)(time / 86400)                           ;  // Unix day
   int J = x + 2440588                                   ;  // Julian day
   int f = J + j + (((4 * J + B) / 146097) * 3) / 4 + C  ;
   int e = r * f + v                                     ;
   int g = (e % p) / r                                   ;
   int h = u * g + w                                     ;
   int D = (h % s) / u + 1                               ;  // Map from Julian to Gregorian calendar
   int M = (h / s + m) % n + 1                           ;
   int Y = (e / p) - y + (n + m - M) / n                 ;
#ifndef WITHOUT_HOURS
   int HH  = (int)((time / 3600) % 24)                   ;
   int MM  = (int)((time / 60) % 60)                     ;
   int SS  = (int)(time % 60)                            ;
#endif //#ifndef WITHOUT_HOURS      

Exploiting the CPU branch predictor can also help with micro-optimization; see here: https://stackoverflow.com/a/11227902

bool TimeToStructFast_fxsaber(datetime time, MqlDateTime& dt_struct)
  {
//   int  isleap  = ((year & 3) == 0);
//   int  leapadj = ((doy < (isleap + 59)) ? 0 : (2 - isleap));
//   int  mon     = ((((doy + leapadj) * 12) + 373) / 367);
//   int  day     = doy - Months[mon] - (isleap && doy > 59);
     int  mon     = (doy < 59) ? ((doy + 1) >> 5) + 1 : (((doy - !(year & 3)) * 67 + 2209 ) >> 11);   

Although the isleap check is eliminated from the main instruction path, it still gets executed about 84% of the time (1 - 59/365). That's why there was no big improvement here. The CPU predicts the branch at (doy < 59) ? correctly 84% of the time, but then it encounters another branch, !(year & 3), so the instruction pipeline has to be flushed and refilled.

Finally, I think this is about as far as we can optimize: the first 4 functions are all within an acceptable 10% variability, and each of them is 4-5 times faster than MQL's built-in TimeToStruct.

IMHO these functions could be optimized further by adding a cache, much like the built-in TimeToStruct function: https://www.mql5.com/ru/forum/170952/page274#comment_55238816

 3.42 ns, checksum = 1230902265328330   // TimeToStruct2100
 2.97 ns, checksum = 1230902265328330   // TimeToStructFast
 3.65 ns, checksum = 1230902265328330   // TimeToStructFast_fxsaber
 3.18 ns, checksum = 1230902265328330   // TimeToCalendar
 7.04 ns, checksum = 1230902265328330   // TimeToJulian
18.47 ns, checksum = 1230902265328330  /// MQL's TimeToStruct()

TimeToJulian may be less optimized for the job.

The sub-functions (TimeYear, TimeDayOfWeek, etc.) that extract individual time components (yyyy/mm/dd hh:mm:ss) are 10-20 times faster than TimeToStruct().

  0.64 ns, checksum = 20497723281054842   // TimeYear
 18.21 ns, checksum = 20497723281054842  /// MQL's TimeToStruct()

High-performance code with complex for-loops that scan the quote history for certain patterns or bars (starts of trading weeks, other statistics) will benefit most from these optimizations.

Regular code for a trading robot may also benefit, but to a lesser extent.

On the other side of the coin, composing datetime variables from time components (yyyy/mm/dd hh:mm:ss) using the custom CreateDateTime() as an alternative to StructToTime() gains a more than 30x speed-up.

  1.45 ns, checksum = 40671201835781217   // CreateDateTime
 34.83 ns, checksum = 40671201835781217  /// MQL's StructToTime()

Composing datetime variables is really slow on the MT5 platform and needs some consideration from the devs.

Features of the mql5 language, subtleties and tricks - Use TimeToCalendar with the AVX compiler.
  • 2024.11.27
  • fxsaber
  • www.mql5.com
The fastest replacement for the built-in TimeToStruct function; I recommend using it. My previous variant had a bug, but I will check other uses of TimeToStruct. TimeToCalendar with the AVX compiler is far ahead of any other implementation.
 

For all the cached versions, I suggest this fix for a possible bug.

Initially, when the function is called repeatedly with times on 1970/1/1, it returns zeros for the year, month and day fields.

Initializing last_days to -1 fixes this bug, so the returned dates are correctly 1970/1/1.

bool TimeToStructFast_Cached(datetime time, MqlDateTime& dt_struct)
  {
   static const int Months[13] = {0, -1, 30, 58, 89, 119, 150, 180, 211, 242, 272, 303, 333};
   //static int last_days = 0;
   static int last_days = -1;   
   static MqlDateTime last_result = {};

   const uint t = (uint)time;
   const int  n = (int)(t / (24 * 3600));
   if (last_days != n)
     {
 

A slightly more optimized version:


bool TimeToStructMQLplus(datetime timestamp, MqlDateTime& dt_struct)
{
    static const int Months[] = { 0, 11512692, 11512196, 11511744, 11511248, 11510766, 11510272, 11509790, 11509296, 11508797, 11508318, 11507822, 11507342 };

    const uint t            = (uint)timestamp;
    const int  n            = (int)(t / 86400);
    const int  tn           = (n << 2) | 2;

    dt_struct.day_of_year   = (tn % 1461) >> 2;
    dt_struct.year          = (tn / 1461) + 1970;
    const int  isleap       = !(dt_struct.year & 3);

    dt_struct.mon           = ((((dt_struct.day_of_year + ((dt_struct.day_of_year < (isleap + 59)) ? 0 : (2 - isleap))) * 12) + 373) / 367);
    dt_struct.day           = n - (int)((dt_struct.year * 5844 - Months[dt_struct.mon]) >> 4);
    #ifndef WITHOUT_HOURS
        dt_struct.hour      = (int)(t / 3600) % 24;
        dt_struct.min       = (int)(t / 60) % 60;
        dt_struct.sec       = (int)(t % 60);

    #endif //#ifndef WITHOUT_HOURS
    dt_struct.day_of_week   = (n + 4) % 7;

   return (true);
}


Results:


Compiler Version: 4620 AVX2 + FMA3, optimization - true
12th Gen Intel Core i7-12700K, AVX2 + FMA3
With hours (dt.hour+ dt.min+ dt.sec - on), random datetimes[].
1970.01.01 00:00:35 - 2097.11.29 23:59:51
 4.87 ns, checksum = 1235575018417502   // TimeToStruct2100
 4.43 ns, checksum = 1235575018417502   // TimeToStructFast
 5.12 ns, checksum = 1235575018417502   // TimeToStructFast_fxsaber
 4.83 ns, checksum = 1235575018417502   // TimeToCalendar
10.17 ns, checksum = 1235575018417502   // TimeToJulian
 4.29 ns, checksum = 1235575018417502   // TimeToStructMQLplus
21.04 ns, checksum = 1235575018417502  /// MQL's TimeToStruct()

Compiler Version: 4620 AVX, optimization - true
12th Gen Intel Core i7-12700K, AVX2 + FMA3
With hours (dt.hour+ dt.min+ dt.sec - on), random datetimes[].
1970.01.01 00:00:03 - 2097.11.29 23:59:19
 4.46 ns, checksum = 1240645497052022   // TimeToStruct2100
 4.19 ns, checksum = 1240645497052022   // TimeToStructFast
 4.71 ns, checksum = 1240645497052022   // TimeToStructFast_fxsaber
 4.51 ns, checksum = 1240645497052022   // TimeToCalendar
 9.38 ns, checksum = 1240645497052022   // TimeToJulian
 3.97 ns, checksum = 1240645497052022   // TimeToStructMQLplus
19.64 ns, checksum = 1240645497052022  /// MQL's TimeToStruct()

Compiler Version: 4620 X64 Regular, optimization - true
12th Gen Intel Core i7-12700K, AVX2 + FMA3
With hours (dt.hour+ dt.min+ dt.sec - on), random datetimes[].
1970.01.01 00:00:12 - 2097.11.29 23:59:10
 4.32 ns, checksum = 1237792403157277   // TimeToStruct2100
 4.02 ns, checksum = 1237792403157277   // TimeToStructFast
 4.68 ns, checksum = 1237792403157277   // TimeToStructFast_fxsaber
 4.15 ns, checksum = 1237792403157277   // TimeToCalendar
 8.70 ns, checksum = 1237792403157277   // TimeToJulian
 3.89 ns, checksum = 1237792403157277   // TimeToStructMQLplus
19.82 ns, checksum = 1237792403157277  /// MQL's TimeToStruct()
 
Dominik Egert #:

A slightly more optimized version:



Results:


Great job, Dominik, you were able to break the 5-nanosecond record from this video: https://youtu.be/0s9F4QWAl-E?si=ZzdFCrFtU8Ean2NJ&t=590

 
Dominik Egert #:

A slightly more optimised version:

Wow! The first version is consistently the best on all processors (x86/AVX/AVX2).

We should probably just make a script to test the correctness of all the algorithms. It's simple, IMHO: in a loop over dates, compare all the calculation results with the built-in function.

 
amrali #:
Great job, Dominik, you were able to break the 5-nanosecond record from this video: https://youtu.be/0s9F4QWAl-E?si=ZzdFCrFtU8Ean2NJ&t=590


I guess that's due to the CPU. It would be interesting to see results from an i9-14900KS or some other top-grade CPU.
 
Dominik Egert #:

A slightly more optimized version:



Results:


AVX2 slower than X64? That's strange.
 
Alain Verleyen #:
AVX2 slower than X64? That's strange.

I suspect the MQL compiler only applies AVX extensions to double values. Probably the memory management is different when compiling with AVX extensions.

Maybe the optimizer misses some x86 optimizations while looking for AVX optimizations.

It would be interesting if you could get a statement from MQ on that.