Machine learning in trading: theory, models, practice and algo-trading - page 3416

 
Forester #:
I think this formula is involved in training / finding cluster centres. For prediction you just need to find the nearest centre by C[]

Anyway, we need to see what's in the array.....

Are you familiar with AlgLib - can you show some primitive code for how to do clustering and apply it to new data?

To be honest, I don't understand their abbreviated function parameters - what to feed in and what comes out....

 
Renat Akhtyamov #:

mu is the middle of a segment, a cluster in this case, I take it.

If it were a circle, the formula would work.

Mu, as I understand it, is the average value for each predictor in the cluster.

The deltas between the predictor values and the cluster's values, summed up, determine the proximity to the cluster in multidimensional space.

The result is used to select the closest cluster, as I understand it. I.e. it is necessary to do the calculation for all clusters.

When applying the model, we simply don't recalculate the average values.

 
Aleksey Vyazmikin #:

Anyway, we'll have to see what's in the array.....

Are you familiar with AlgLib - can you show some primitive code for how to perform clustering and apply it to new data?

To be honest, I don't understand their abbreviated function parameters - what to feed in and what comes out....

It's been 4 years since I've done Alglib.
 
Forester #:
//| INPUT PARAMETERS:                                                |
//|     XY          -   dataset, array [0..NPoints-1,0..NVars-1].    | - array [rows, columns]
//|     NPoints     -   dataset size, NPoints>=K                     | - number of rows
//|     NVars       -   number of variables, NVars>=1                | - number of columns
//|     K           -   desired number of clusters, K>=1             | - how many clusters you want
//|     Restarts    -   number of restarts, Restarts>=1              | - not clear, try 1
//| OUTPUT PARAMETERS:                                               |
//|     Info        -   return code:                                 |
//|                     * -3, if task is degenerate (number of       |
//|                           distinct points is less than K)        |
//|                     * -1, if incorrect                           |
//|                           NPoints/NFeatures/K/Restarts was passed|
//|                     *  1, if subroutine finished successfully    |
//|     C           -   array[0..NVars-1,0..K-1], matrix whose       | - array of centres
//|                     columns store cluster centres                |
//|     XYC         -   array[NPoints], which contains cluster       | - cluster number for each row of XY
//|                     indexes                                      |

It seems to be simple - see my comments on the right.

 
Aleksey Vyazmikin #:

M, as I understand it, is the mean value for each predictor that happens to be in the cluster.

And the average of all of them? That is the cluster's centre for this column.

 
mytarmailS #:
Will the TS make money on the test: yes/no -

probability via binary classification.

Or, via regression, the slope of the capital curve on the test, or FV, or Sharpe, or how much the TS will earn on the test.

Or, better, two models together: both classification and regression.


Then you can generate 1000 TSs, select the 20 best by the probability that they will work on the test, then select the n best of those by the Sharpe regression.

Cool

This is a serious experiment.
Like a dataset of 1000 models with estimates of their performance on new data? Why categorise them? You can just sort them.

Ah, to see on average which parameters are more likely to lead to success.
 
Maxim Dmitrievsky #:
Like a dataset of 1000 models with estimates of their performance on new data? And why categorise them - you can just sort them

Ah, to see on average which parameters are more likely to lead to success

No, that's not right... I'll try to explain it again, forget about the models for now....

You have a lot of TSs optimised on the train period, and there is a test period.


Create a dataset for the model:

target = from the test we see whether the TS worked on the test (this is the target: YES/NO).

data = (features) the TS parameters, the capital curve, the trades, FV, Sharpe (and if the TS is based on ML, the guts of the model).


Then we train it like a real model, to answer whether a particular TS will work on the test or not.

 
Aleksey Vyazmikin #:

It's all a bit pointless. Until you can detect the probability shift in a single leaf, the models will keep falling apart.

And to work with a leaf or a quantum segment you need quite a lot of responses in the history, and that isn't available, and without it there isn't enough statistical data.... so the models will be questionable...

Where does the bias come from?

If there are not enough examples in the leaves and the models fall apart, then why talk about leaves at all.
 
Forester #:
I haven't done Alglib for about 4 years.
Here, in an old file, I found my kmeans test with a predict function:

#include <Math\Alglib\alglib.mqh>
   CMatrixDouble MatrixLearn;//the training part of the data
// fill MatrixLearn, e.g. MatrixLearn[row].Set(col,123.0); or read it from a file with LoadFullMatrix(), see below
   CMatrixDouble c;//cluster centres - with them any row can be assigned to one of the clusters
   int klusters = 4, restarts=10;
   int xyc[],xycp[];//cluster number for each row

   KMeans(klusters, restarts, MatrixLearn, c, xyc);//clustering: data, number of clusters, number of restarts - //as a result the matrix can be split into klusters matrices and processed separately
   predict_CKMeans(MatrixLearn,MatrixLearn.Size(),ins_all,klusters,c,xycp);//ins_all is a global: the number of input columns
   Print("Predicted cluster indexes");
   string t0="";for(int i=0; i<(ArraySize(xycp)>20?20:ArraySize(xycp)); i++){ t0+=(string)(xycp[i])+", ";} Print(t0+"... (",ArraySize(xycp)," items)."); //print the first 20 elements
   int ac = ArrayCompare(xyc,xycp); Print("ArrayCompare: ",ac);


   // -------------- clustering --------------

   void KMeans(int klusters, int Restarts, CMatrixDouble &mtr, CMatrixDouble &c, int &xyc[])//clustering
      {
   //clustering in Data Mining becomes valuable when it is one of the stages of data analysis, of building a complete analytical solution. It is often easier for an analyst to single out groups of similar objects, study their features and build a separate model for each group than to create one general model for all the data.
      Print("Clustering:");
      int info;
      CAlglib::KMeansGenerate(mtr,mtr.Size(),ins_all,klusters,Restarts,info,c,xyc);//ins_all is a global: the number of input columns
      // XYC holds the number of the cluster each row was assigned to, c holds the cluster centre for each column
      // as a result the matrix can be split into klusters matrices and processed separately
      Print("cluster indexes");
      string t0;for(int i=0; i<(ArraySize(xyc)>20?20:ArraySize(xyc)); i++){ t0+=(string)(xyc[i])+", ";} Print(t0+"... (",ArraySize(xyc)," items)."); //print the first 20 elements

      int xycp[];//check
      predict_CKMeans(mtr,mtr.Size(),ins_all,klusters,c,xycp);
      Print("Predicted cluster indexes");
      t0="";for(int i=0; i<(ArraySize(xycp)>20?20:ArraySize(xycp)); i++){ t0+=(string)(xycp[i])+", ";} Print(t0+"... (",ArraySize(xycp)," items)."); //print the first 20 elements
      int ac = ArrayCompare(xyc,xycp);
      Print("ArrayCompare: ",ac);
      //Print("matrix whose columns store cluster's centers:"); printMatrix(c);
   }
   
//+------------------------------------------------------------------+
//| k-means++ clusterization    -> predict cluster number            |
//| INPUT PARAMETERS:                                                |
//|     XY          -   dataset, array [0..NPoints-1,0..NVars-1].    |
//|     NPoints     -   dataset size, NPoints>=K                     |
//|     NVars       -   number of variables, NVars>=1                |
//|     K           -   desired number of clusters, K>=1             |
//| OUTPUT PARAMETERS:                                               |
//|     CT           -   array[0..NVars-1,0..K-1].matrix whose columns|
//|                     store cluster's centers                      |
//|     XYC         -   array[NPoints], which contains cluster       |
//|                     indexes                                      |
//+------------------------------------------------------------------+

void predict_CKMeans(CMatrixDouble &xy,const int npoints,
                                    const int nvars,const int k,
                                    CMatrixDouble &ct,int &xyc[])
{//--- fill XYC with center numbers
   ArrayResize(xyc,npoints);
    for(int i=0;i<npoints;i++){
        int cclosest=-1;
        double dclosest=1E300, tmp;//--- start with a huge distance so any real one is smaller
        for(int j=0;j<k;j++){
         double v=0.0;
         for(int i_=0;i_<nvars;i_++){
            tmp=xy[i][i_]-ct[i_][j];
            v+=tmp*tmp;
         }
            if(v<dclosest){cclosest=j;dclosest=v;}//--- check
        }
        xyc[i]=cclosest;//--- change value
    }
}


//+---------------- load the full training matrix (inputs and outputs) from a CSV file ----------------+
void LoadFullMatrix(string fName, string delimiter=","){//load the full training matrix (inputs and outputs) from a CSV file
   int file_handle=FileOpen(fName,FILE_READ|FILE_TXT|FILE_ANSI|FILE_COMMON);

      if(file_handle!=INVALID_HANDLE){
         string s; int k; ushort u_sep=StringGetCharacter(delimiter,0); string r[]; int cols=0,row=0; bool first_line=false;
         while(!FileIsEnding(file_handle)){
            s=FileReadString(file_handle);
            if(!first_line && row==0){first_line=true;continue;}//skip the 1st line - it may contain column names
            k=StringSplit(s,u_sep,r);
            if(cols==0){cols=k;}
            if(cols!=k){Alert("Matrix dimension mismatch (",cols,"!=",k,") in row ",row);continue;}

            MatrixLearn.Resize(row+1, cols);
            for(int i=0;i<k;i++) {
               MatrixLearn[row].Set(i,(double)r[i]);
            }
            row++;
         }
         FileClose(file_handle);
         rows=row;          //rows, ins_all, ins, outs and MatrixLearn are globals
         ins_all=cols-outs;
         ins=ins_all;
         Print("Matrix (",ins_all," + ",outs,") = ",cols," x ",rows," read from file ",fName);
      }
}

