Machine learning in trading: theory, models, practice and algo-trading - page 3281

 
fxsaber #:

Well, you need Pearson.

I'm not sure how to do it, and I'm sleepy.

Something similar.

>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> b = np.array([1, 2, 3])
>>> a = (a - np.mean(a)) / (np.std(a))
>>> b = (b - np.mean(b)) / (np.std(b))
>>> np.correlate(a, b, 'full')
array([-1.8973666 , -1.42302495,  0.9486833 ,  0.9486833 ,  0.9486833 ,
        0.9486833 ,  0.9486833 ,  0.9486833 ,  0.9486833 , -1.42302495,
       -1.8973666 ])
>>> 
 
Maxim Dmitrievsky #:

I'm not sure how to do it, and I'm sleepy.

something similar

Yeah, that's not it.

 
fxsaber #:

Yeah, that's not it.

It's close to it, though. Look into it yourself, I'm off.
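For comparison, here is a minimal NumPy sketch of what a sliding Pearson correlation looks like. The likely reason the `np.correlate` snippet above is "not it": it standardizes the whole long array once, whereas Pearson standardizes every window of the long array separately, so the snippet's values are not confined to [-1, 1]. The function name `sliding_pearson_naive` is made up for this illustration.

```python
import numpy as np

def sliding_pearson_naive(array, pattern):
    """Pearson correlation of `pattern` with every same-length window of `array`."""
    array = np.asarray(array, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    m = len(pattern)
    # One coefficient per window start: len(array) - m + 1 values in total.
    return np.array([
        np.corrcoef(array[i:i + m], pattern)[0, 1]
        for i in range(len(array) - m + 1)
    ])

a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
b = np.array([1, 2, 3])
corr = sliding_pearson_naive(a, b)
print(corr)
```

Every window of `a` is a rising straight line, just like `b`, so all seven coefficients come out as 1 up to rounding. This version recomputes each window from scratch and is only meant as a correctness reference, not as something fast.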

 
fxsaber #:

Trying to quickly find similar short strings in a long string.

This Alglib-based implementation takes more than six seconds to find similar short strings (length 300) in a string of one million elements.

I sped it up.

#include <fxsaber\Math\Math.mqh> // https://www.mql5.com/ru/code/17982

const vector<double> GetCorr( const double &Array[], const double &Pattern[], const int Step = 1 )
{
  double Corr[];  
  MathCorrelationPearson(Array, Pattern, Corr, Step);
  
  ArrayRemove(Corr, 0, ArraySize(Pattern) - 1);  
  
  vector<double> Res;
  Res.Swap(Corr);
  
  return(Res);
}

#property script_show_inputs

input int inRows = 300; // Length of the short string
input int inCols = 1000000; // Length of the long string

// Search for a similar string within the long string.
void OnStart()
{  
  if (inRows < inCols)
  {
    PrintCPU(); // https://www.mql5.com/ru/forum/86386/page3256#comment_49538685
    
    double Array[]; // Long string to search in.
    double Pattern[]; // Short string to compare against.
    CMatrixDouble Matrix;
    
    FillData(Array, Pattern, Matrix, inRows, inCols); // https://www.mql5.com/ru/forum/86386/page3278#comment_49725614
            
    Print(TOSTRING(inRows) + TOSTRING(inCols));

    vector<double> vPattern;  
    vPattern.Assign(Pattern);

    ulong StartTime, StartMemory; // https://www.mql5.com/ru/forum/86386/page3256#comment_49538685

    BENCH(vector<double> Vector1 = GetCorr(Matrix, vPattern)) // https://www.mql5.com/ru/forum/86386/page3278#comment_49725614
    BENCH(vector<double> Vector2 = GetCorr(Array, Pattern))
    BENCH(vector<double> Vector3 = GetCorr(Array, Pattern, -1))
    
    Print(TOSTRING(IsEqual(Vector1, Vector2)));
    Print(TOSTRING(IsEqual(Vector3, Vector2)));
  }      
}


Result.

EX5: 4000 AVX Release.
TerminalInfoString(TERMINAL_CPU_NAME) = Intel Core i7-2700 K  @ 3.50 GHz 
TerminalInfoInteger(TERMINAL_CPU_CORES) = 8 
TerminalInfoString(TERMINAL_CPU_ARCHITECTURE) = AVX 
inRows = 300 inCols = 1000000 
vector<double> Vector1 = GetCorr(Matrix, vPattern) - 7158396 mcs, 8 MB
vector<double> Vector2 = GetCorr(Array, Pattern) - 364131 mcs, 8 MB
vector<double> Vector3 = GetCorr(Array, Pattern, -1) - 323935 mcs, 7 MB
IsEqual(Vector1, Vector2) = true 
IsEqual(Vector3, Vector2) = true 

Now it takes about 300 milliseconds.
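How `MathCorrelationPearson` achieves this internally is not shown in the thread. As a rough illustration of one common way to speed up a sliding Pearson correlation, the NumPy sketch below gets the per-window sums from prefix sums and computes only the sliding dot products directly. The function name `sliding_pearson` and the NaN convention for constant windows are assumptions of this sketch, not the library's behaviour.

```python
import numpy as np

def sliding_pearson(array, pattern):
    """Pearson correlation of `pattern` with every same-length window of `array`."""
    array = np.asarray(array, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    m = len(pattern)

    # Window sums of x and x^2 from prefix sums: sum(x[i:i+m]) = S[i+m] - S[i].
    s1 = np.concatenate(([0.0], np.cumsum(array)))
    s2 = np.concatenate(([0.0], np.cumsum(array * array)))
    sum_x = s1[m:] - s1[:-m]
    sum_xx = s2[m:] - s2[:-m]

    # Sliding dot products sum(x[i+j] * pattern[j]), computed directly in O(N*M).
    sum_xy = np.correlate(array, pattern, mode="valid")

    sum_y = pattern.sum()
    sum_yy = np.dot(pattern, pattern)

    cov = sum_xy - sum_x * sum_y / m
    var_x = np.clip(sum_xx - sum_x * sum_x / m, 0.0, None)  # clip rounding noise below zero
    var_y = sum_yy - sum_y * sum_y / m

    denom = np.sqrt(var_x * var_y)
    # Constant windows have zero variance; report NaN there instead of dividing by zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denom > 0, cov / denom, np.nan)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
p = rng.normal(size=30)
fast = sliding_pearson(x, p)
check = np.array([np.corrcoef(x[i:i + 30], p)[0, 1] for i in range(len(x) - 30 + 1)])
print(np.allclose(fast, check))
```

The comparison against `np.corrcoef` on each window is there as a sanity check. Prefix sums lose some precision on very long or badly scaled series, so a production version would need more care than this.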

 
fxsaber #:

Now it takes about 300 milliseconds.

When no matrix can handle it.

inRows = 30000 inCols = 10000000 
vector<double> Vector2 = GetCorr(Array, Pattern) - 10567928 mcs, 76 MB
vector<double> Vector3 = GetCorr(Array, Pattern, -1) - 3006838 mcs, 77 MB

It takes three seconds to find similar strings of length 30K in a string of 10M.

 
fxsaber #:

When no matrix can handle it.

It takes three seconds to find similar strings of length 30K in a string of 10M.

Very cool, but just as useless.
Is this an example with fft()?
 
mytarmailS #:
Is this an example with fft()?

The 300/1M case does not use FFT; the 30K/10M case does.
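To illustrate why FFT pays off only for long patterns: the sliding dot products cost O(N*M) when computed directly, but roughly O(N log N) through FFT regardless of the pattern length. The NumPy sketch below shows the FFT route and checks it against the direct one. It is an illustration of the general technique, not a reconstruction of the MQL5 code; the name `sliding_dot_fft` is hypothetical.

```python
import numpy as np

def sliding_dot_fft(array, pattern):
    """Sliding dot products sum(array[i+j] * pattern[j]) computed through FFT."""
    array = np.asarray(array, dtype=float)
    pattern = np.asarray(pattern, dtype=float)
    n, m = len(array), len(pattern)
    size = n + m - 1  # full linear convolution length; zero-padding avoids wrap-around

    # Correlation equals convolution with the reversed pattern.
    spectrum = np.fft.rfft(array, size) * np.fft.rfft(pattern[::-1], size)
    full = np.fft.irfft(spectrum, size)

    # Keep only the positions where the pattern fully overlaps the array.
    return full[m - 1:n]

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
p = rng.normal(size=300)
direct = np.correlate(x, p, mode="valid")
via_fft = sliding_dot_fft(x, p)
print(np.allclose(direct, via_fft))
```

These dot products are the only part of a sliding Pearson correlation whose cost grows with the pattern length, so swapping them for the FFT version is where the gain on a 30K pattern would come from. For a pattern of 300 the direct product can easily be the faster of the two.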

 
fxsaber #:

When no matrix can handle it.

It takes three seconds to find similar strings of length 30K in a string of 10M.

Impressive result!

 

I took a sample covering 2010 to 2023 (47k rows), split it into 3 parts in chronological order, and decided to see what happens if these parts are swapped.

The subsample sizes are: train 60%, test 20%, exam 20%.

I made the following combinations; (-1) is the standard, chronological order. Each subsample has its own colour.


I trained 101 models with different seeds for each set of samples and got the following result.


All metrics are standard, and it can be seen that it is difficult to determine the average profit of the models (AVR Profit), as well as the percentage of models whose profit exceeds 3000 points on the last sample that did not participate in training.

Maybe the relative success of variants -1 and 0 comes down to the size of the training sample, and it should be reduced? In general, it seems that Recall reacts to this.

In your opinion, should the results of such combinations be comparable to each other in our case? Or is the data irretrievably outdated?
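The layouts described in this post can be sketched roughly as follows. This assumes the three parts are contiguous row ranges of 60/20/20 and that a "combination" is simply the order in which train, test and exam follow each other in time; the exact combinations used in the post may differ, and the helper `split_in_order` is hypothetical.

```python
from itertools import permutations

FRACTIONS = {"train": 0.6, "test": 0.2, "exam": 0.2}

def split_in_order(n_rows, order):
    """Return {role: (start, stop)} with contiguous row ranges laid out in `order`."""
    bounds = {}
    start = 0
    for k, role in enumerate(order):
        # The last segment takes whatever rows remain, so rounding never drops a row.
        stop = n_rows if k == len(order) - 1 else start + round(n_rows * FRACTIONS[role])
        bounds[role] = (start, stop)
        start = stop
    return bounds

n_rows = 47_000
layouts = {order: split_in_order(n_rows, order) for order in permutations(FRACTIONS)}
for order, bounds in layouts.items():
    print(order, bounds)
```

The first layout, `("train", "test", "exam")`, is the chronological one. In any layout where train sits later in time than exam, the model is evaluated on data older than what it was trained on, which is worth keeping in mind when comparing the results.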

 
Aleksey Vyazmikin #:

I took a sample covering 2010 to 2023 (47k rows), split it into 3 parts in chronological order, and decided to see what happens if these parts are swapped.

The subsample sizes are: train 60%, test 20%, exam 20%.

I made the following combinations; (-1) is the standard, chronological order. Each subsample has its own colour.


I trained 101 models with different seeds for each set of samples and got the following result.


All metrics are standard, and it can be seen that it is difficult to determine the average profit of the models (AVR Profit), as well as the percentage of models whose profit exceeds 3000 points on the last sample that did not participate in training.

Maybe the relative success of variants -1 and 0 comes down to the size of the training sample, and it should be reduced? In general, it seems that Recall reacts to this.

In your opinion, should the results of such combinations be comparable to each other in our case? Or is the data irretrievably outdated?

Yet another homemade approach...

There is cross-validation. It has been explained over and over, and it is widely used...
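For reference, a minimal sketch of one cross-validation scheme that keeps chronological order: walk-forward splits with an expanding training window. The reply does not say which scheme is meant, so this choice, the function name `walk_forward_splits` and the `min_train_fraction` parameter are all assumptions made for illustration.

```python
def walk_forward_splits(n_rows, n_folds, min_train_fraction=0.5):
    """Yield (train_rows, test_rows) ranges; each test block follows its training rows in time."""
    first_test = int(n_rows * min_train_fraction)
    fold_size = (n_rows - first_test) // n_folds
    for k in range(n_folds):
        test_start = first_test + k * fold_size
        # The last fold absorbs the remainder left by integer division.
        test_stop = n_rows if k == n_folds - 1 else test_start + fold_size
        yield range(0, test_start), range(test_start, test_stop)

for train_rows, test_rows in walk_forward_splits(47_000, n_folds=4):
    print(len(train_rows), len(test_rows))
```

Each fold trains on everything before its test block, so no fold ever sees data from its own future. Averaging a metric over the folds gives a less seed-dependent picture than a single train/test/exam cut.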