Mutual information as criteria for Stepwise Feature Selection

MetaTrader 5 — Statistics and analysis | 21 November 2024, 09:53

1 408

Francis Dube

Introduction

Mutual information is a valuable tool for identifying effective predictors, especially when dealing with complex, nonlinear relationships. It can uncover dependencies that other methods might miss, making it particularly suitable for models that can exploit such intricacies. This article explores the application of mutual information in feature selection, focusing on the algorithm proposed by Hanchuan Peng, Fuhui Long, and Chris Ding in their research paper titled, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy".

We will begin by discussing the estimation of mutual information for continuous variables, then delve into the feature selection process itself. Finally, we will illustrate the algorithm's effectiveness through examples involving both synthetic and real-world datasets.

Estimating the mutual information of continuous variables

Mutual information, denoted as I(X;Y), quantifies the shared information between two random variables, X and Y. For discrete variables, its calculation is relatively straightforward and involves summing over joint and marginal probabilities.

Discrete Mutual Information

However, continuous variables pose a significant challenge due to their infinite range of possible values. A common approach to address this involves discretizing the continuous variables into bins and then applying the discrete mutual information formula. Unfortunately, the accuracy of this method is highly sensitive to the bin width or the number of bins chosen, which can lead to substantial variability in the estimated mutual information.

Continuous Mutual Information

To calculate mutual information for continuous variables, we must transition from discrete summations to continuous integrals. This requires knowledge of the joint and marginal probability density functions (PDFs), which are often unavailable due to limited data. The Parzen window technique provides a solution by approximating the PDFs directly from the data. This method involves placing a window function over the data and calculating a weighted sum of the data points within the window. The weight assigned to each data point decreases with its distance from the center of the window. By moving this window across the data space and calculating the weighted sum at each point, we can construct an estimate of the probability density function.

Parzen Density Approximation

This technique is particularly useful in scenarios where the underlying distribution is unknown or complex. By adapting the window shape and size, we can adjust the smoothness and accuracy of the density estimate. However, it is important to note that the choice of window parameters can significantly impact the quality of the estimation. The effectiveness of Parzen density estimation hinges on selecting an appropriate weighting function, which should ideally integrate to unity and exhibit rapid decay as the argument moves away from zero. A popular choice for this weighting function is the Gaussian function, characterized by a scale parameter, sigma.

Gaussian Weight Function

However, determining the optimal value for sigma is a challenging task, often resulting in significant variability in the estimated density. This sensitivity to the choice of sigma is a major drawback of the Parzen window method, making it less than ideal for tasks such as evaluating predictor candidates, where accurate density estimation is a prerequisite.

An alternative to the Parzen window method for estimating mutual information between continuous variables is adaptive partitioning. Adaptive partitioning offers a significant improvement over the naive approach of fixed binning. By iteratively subdividing regions with high information content, adaptive partitioning effectively focuses computational resources on the most relevant areas of the data space. This targeted approach results in more accurate estimates of mutual information, particularly when dealing with complex, non-uniform distributions. A key strength of adaptive partitioning lies in its ability to adapt to the underlying data structure. The recursive partitioning process ensures that regions with significant dependencies are further subdivided, while regions with minimal information remain intact. This dynamic approach avoids the common pitfalls of fixed binning, which can either over-smooth or overfit the data.

Furthermore, adaptive partitioning incorporates statistical tests to guide the partitioning process. By evaluating the statistical significance of potential subdivisions, adaptive partitioning can distinguish between genuine information and random noise. This helps prevent overfitting, which could otherwise lead to inflated estimates of mutual information. The primary problem with the naive method is that it pays excessive attention to areas of the bivariate domain with few or no cases, while potentially neglecting dense regions where most of the information lies.

Adaptive Partitioning

Adaptive partitioning begins with a coarse-grained partitioning of the data space. For each partition, the mutual information between the variables is calculated. If this value exceeds a predefined threshold, the partition is further subdivided into smaller sub-partitions. This recursive partitioning process continues until a stopping criterion is met. If the partitioning process stops too early, the resulting partitions may be too coarse, causing a loss of detail and a downward bias in the estimated mutual information. Conversely, if the partitioning process continues for too long, the partitions may become overly fine-grained, resulting in overfitting and an inflated estimate of mutual information due to noise.

To determine whether a partition should be further subdivided, a chi-square test is employed. This test assesses the independence between the two variables within the partition. The partition is divided into four sub-partitions, creating a 2x2 contingency table. The number of data points falling into each of the four sub-partitions is counted. Under the null hypothesis of independence between the two variables, the expected number of data points in each sub-partition is calculated based on the marginal totals. The calculated chi-square statistic is then compared to a critical value from the chi-square distribution with 1 degree of freedom. If the calculated value exceeds the critical value, the null hypothesis of independence is rejected, indicating a significant relationship between the two variables within the partition.

If the chi-square test is significant, the partition is further subdivided into smaller partitions. This process continues recursively until the partitions become sufficiently homogeneous, or the stopping criterion is met. If the chi-square test is not significant, the partition is considered homogeneous, and no further subdivision is necessary.

To address the possibility of a more complex relationship within a partition that might be missed by a simple 2x2 split, the algorithm incorporates an additional check. If the 2x2 chi-square test fails to detect a significant relationship, but the partition remains relatively large, a more refined 4x4 split is performed. A chi-square test is then applied to this 4x4 partition to assess the presence of a non-random distribution. If the 4x4 test also fails to detect a significant relationship, the partition is considered homogeneous, and its contribution to the overall mutual information is calculated. No further subdivision is necessary in this case.

If either the 2x2 or 4x4 chi-square test indicates a non-uniform distribution within a partition, the partition is subdivided into four smaller sub-partitions. These sub-partitions are handled differently by checking their size. If a sub-partition is too small, it is considered terminal. Its contribution to the overall mutual information is computed, and the partition is no longer considered for further subdivision. Otherwise, if a sub-partition is sufficiently large, it is flagged as a candidate for future subdivision.

The standard two-by-two chi-square test statistic is shown in the equation below.

2 by 2 Chi-Squate test statistic

Adaptive partitioning in MQL5

The Capm class in mutualinfo.mqh implements the adaptive partitioning algorithm to estimate mutual information for continuous variables. To manage the recursive partitioning process, it utilizes custom structure IntStack. This structure holds information about the partitions and any sub-partitions contained in it. The structure has six members, detailed below:

Xstart and Xstop: These define the range of indices for one variable within the current enclosing partition.
Ystart and Ystop: These define the range of indices for a second variable within the current enclosing partition.
DataStart and DataStop: These specify the starting and ending indices of the data points within an enclosing partition, marking the sub-partitions.

//+------------------------------------------------------------------+
//| int type stack structure                                         |
//+------------------------------------------------------------------+
struct IntStack
  {
   int Xstart ;     // X value (rank) at which this rectangle starts
   int Xstop ;      // And stops
   int Ystart ;     // Ditto for Y
   int Ystop ;
   int DataStart ;  // Starting index into indices for the cases in this
   int DataStop ;   // rectangle, and the (inclusive) ending index
  };

The Capm constructor takes three arguments:

no_split: A boolean flag that determines how to handle values that are approximately equal. If set to true, the algorithm ensures that such values are grouped together during partitioning, preventing them from being separated into different partitions.
dep_vals: A vector containing the values of one of the variables to be analyzed. This variable is typically the one whose mutual information will be calculated against numerous other variables.
chi_critical_value: A threshold for the chi-square tests used in the partitioning process. This value determines the significance level for the tests and influences the depth of the partitioning. Values between 4.0 and 8.0 are generally suitable, with 6.0 being a common choice with universal utility.

//+------------------------------------------------------------------+
//|  constructor                                                     |
//+------------------------------------------------------------------+
Capm::Capm(bool no_split,vector &dep_vals, double chi_critical_value = 6.0)
  {
   m_no_split = no_split;
   m_size = int(dep_vals.Size()) ;
   m_chi_crit = chi_critical_value;

   if(ArrayResize(m_indices,m_size)<0 || !m_tempvals.Resize(m_size) || ArrayResize(m_ranks,m_size)<0
      || (m_no_split && ArrayResize(m_tied_ranks,m_size)<0))
     {
      Print(__FUNCTION__, " Arrayresize error ", GetLastError());
      return;
     }

   for(int i=0 ; i<m_size ; i++)
     {
      m_tempvals[i] = dep_vals[i] ;
      m_indices[i] = i ;
     }

   qsortdsi(0, m_size-1, m_tempvals, m_indices) ;

   for(int i=0 ; i<m_size ; i++)
     {
      m_ranks[m_indices[i]] = i ;
      if(! m_no_split)
         continue ;
      if(i < m_size-1  &&
         m_tempvals[i+1] - m_tempvals[i] < 1.e-12 * (1.0+fabs(m_tempvals[i])+fabs(m_tempvals[i+1])))
         m_tied_ranks[i] = 1 ;
      else
         m_tied_ranks[i] = 0 ;
     }
  }

The Capm class constructor begins by setting the no_split flag, which determines whether to account for ties in the data, and the chi_critical_value, which sets the threshold for the chi-square test used to identify significant relationships between variables. Next, the constructor allocates memory for arrays to store the data, ranks, and tied values. It copies the variable values into a temporary vector and sorts them. The original indices of the data points are preserved in the m_indices array. Finally, the constructor assigns ranks to the sorted data points and identifies any tied values, marking them in the m_tied_ranks array if the no_split flag is set.

//+------------------------------------------------------------------+
//| Mutual information using the Adaptive partitioning method        |
//+------------------------------------------------------------------+
class Capm
  {

public:
                     Capm(bool no_split,vector &dep_vals, double chi_critical_value = 6.0) ;
                    ~Capm(void) ;
   double            fit(vector &raw) ;

private:
   bool               m_no_split;  // respect ties
   int m_size ;                        // Number of cases
   int m_ranks[] ;                     // 'Dependent' variable ranks
   int m_tied_ranks[] ;                // tied[i] != 0 if case with rank i == case with rank i+1
   double m_chi_crit ;                 // Chi-square test criterion
   int               m_indices[];      // indices
   vector            m_tempvals;       // temp values
  } ;

The fit() method in the Capm class is designed to estimate the mutual information between one variable (provided to the constructor) and another variable supplied as an argument. It takes a single vector as input, representing the values of a second variable. This vector should have the same dimension as the vector provided to the constructor. By calling the fit() method multiple times with different vectors, we can efficiently calculate the mutual information between the one variable and various variables.

//+------------------------------------------------------------------+
//|  mutual information for continuous variables                     |
//+------------------------------------------------------------------+
double Capm::fit(vector &raw)
  {
   int i, k, ix, iy, nstack, splittable ;
   int current_indices[], x[], fullXstart, fullXstop, fullYstart, fullYstop, ipos ;
   int trialXstart[4], trialXstop[4], trialYstart[4], trialYstop[4] ;
   int ipx, ipy, xcut[4], ycut[4], iSubRec, x_tied[], ioff ;
   int X_AllTied, Y_AllTied ;
   int centerX, centerY, currentDataStart, currentDataStop ;
   int actual[4], actual44[16] ;
   vector expected(16), xfrac(4), yfrac(4) ;
   double diff, testval;
   double px, py, pxy, MI ;

   IntStack stack[];

   if(ArrayResize(stack,1,256)<0)
     {
      Print(__FUNCTION__," arrayresize error ", GetLastError());
      return EMPTY_VALUE;
     }

   if(ArrayResize(current_indices,m_size)<0 || ArrayResize(x,m_size)<0
      || (m_no_split && ArrayResize(x_tied,m_size)<0))
     {
      Print(__FUNCTION__, " Arrayresize error ", GetLastError());
      return EMPTY_VALUE;
     }

   for(i=0 ; i<m_size ; i++)
     {
      m_tempvals[i] = raw[i] ;
      m_indices[i] = i ;
     }

   if(!qsortdsi(0, m_size-1, m_tempvals, m_indices))
     {
      Print(__FUNCTION__, " Arrayresize error ", GetLastError());
      return EMPTY_VALUE;
     }

   for(i=0 ; i<m_size ; i++)
     {
      x[m_indices[i]] = i ;
      if(! m_no_split)
         continue ;
      if(i < m_size-1  &&
         m_tempvals[i+1] - m_tempvals[i] < 1.e-12 * (1.0+fabs(m_tempvals[i])+fabs(m_tempvals[i+1])))
         x_tied[i] = 1 ;
      else
         x_tied[i] = 0 ;
     }

   for(i=0 ; i<m_size ; i++)
     {
      m_indices[i] = i ;
     }

   stack[0].Xstart = 0 ;
   stack[0].Xstop = m_size-1 ;
   stack[0].Ystart = 0 ;
   stack[0].Ystop = m_size-1 ;
   stack[0].DataStart = 0 ;
   stack[0].DataStop = m_size-1 ;
   nstack = 1 ;

   MI = 0.0 ;
   while(nstack > 0)
     {

      --nstack ;
      fullXstart = stack[nstack].Xstart ;
      fullXstop = stack[nstack].Xstop ;
      fullYstart = stack[nstack].Ystart ;
      fullYstop = stack[nstack].Ystop ;
      currentDataStart = stack[nstack].DataStart ;
      currentDataStop = stack[nstack].DataStop ;


      centerX = (fullXstart + fullXstop) / 2 ;
      X_AllTied = (x_tied.Size())  && (x_tied[centerX] != 0) ;
      if(X_AllTied)
        {
         for(ioff=1 ; centerX-ioff >= fullXstart ; ioff++)
           {
            if(! x_tied[centerX-ioff])
              {
               X_AllTied = 0 ;
               centerX -= ioff ;
               break ;
              }
            if(centerX + ioff == fullXstop)
               break ;
            if(! x_tied[centerX+ioff])
              {
               X_AllTied = 0 ;
               centerX += ioff ;
               break ;
              }
           }
        }

      centerY = (fullYstart + fullYstop) / 2 ;
      Y_AllTied = (m_tied_ranks.Size())  && (m_tied_ranks[centerY] != 0) ;
      if(Y_AllTied)
        {
         for(ioff=1 ; centerY-ioff >= fullYstart ; ioff++)
           {
            if(! m_tied_ranks[centerY-ioff])
              {
               Y_AllTied = 0 ;
               centerY -= ioff ;
               break ;
              }
            if(centerY + ioff == fullYstop)
               break ;
            if(! m_tied_ranks[centerY+ioff])
              {
               Y_AllTied = 0 ;
               centerY += ioff ;
               break ;
              }
           }
        }

      if(X_AllTied  ||  Y_AllTied)
         splittable = 0 ;
      else
        {
         trialXstart[0] = trialXstart[1] = fullXstart ;
         trialXstop[0] = trialXstop[1] = centerX ;
         trialXstart[2] = trialXstart[3] = centerX+1 ;
         trialXstop[2] = trialXstop[3] = fullXstop ;
         trialYstart[0] = trialYstart[2] = fullYstart ;
         trialYstop[0] = trialYstop[2] = centerY ;
         trialYstart[1] = trialYstart[3] = centerY+1 ;
         trialYstop[1] = trialYstop[3] = fullYstop ;

         for(i=0 ; i<4 ; i++)
            expected[i] = (currentDataStop - currentDataStart + 1) *
                          (trialXstop[i]-trialXstart[i]+1.0) / (fullXstop-fullXstart+1.0) *
                          (trialYstop[i]-trialYstart[i]+1.0) / (fullYstop-fullYstart+1.0) ;

         actual[0] = actual[1] = actual[2] = actual[3] = 0 ;
         for(i=currentDataStart ; i<=currentDataStop ; i++)
           {
            k = m_indices[i] ;
            if(x[k] <= centerX)
              {
               if(m_ranks[k] <= centerY)
                  ++actual[0] ;
               else
                  ++actual[1] ;
              }
            else
              {
               if(m_ranks[k] <= centerY)
                  ++actual[2] ;
               else
                  ++actual[3] ;
              }
           }

         testval = 0.0 ;
         for(i=0 ; i<4 ; i++)
           {
            diff = fabs(actual[i] - expected[i]) - 0.5 ;
            testval += diff * diff / expected[i] ;
           }

         splittable = (testval > m_chi_crit)  ?  1 : 0 ;

The method begins by initializing a stack to manage the partitioning process. The entire dataset is initially pushed onto the stack as the root partition. Subsequently, the algorithm iteratively pops a partition from the stack and evaluates its homogeneity. A chi-square test is applied to assess the statistical significance of any potential relationships between the variables within the partition. If the test indicates a significant relationship, the partition is split into four sub-partitions, and each sub-partition is pushed back onto the stack for further consideration. This process continues until all partitions meet the stopping criterion or become homogeneous.

if(! splittable && fullXstop-fullXstart > 30 && fullYstop-fullYstart > 30)
           {
            ipx = fullXstart - 1 ;
            ipy = fullYstart - 1 ;
            for(i=0 ; i<4 ; i++)
              {
               xcut[i] = (fullXstop - fullXstart + 1) * (i+1) / 4 + fullXstart - 1 ;
               xfrac[i] = (xcut[i] - ipx) / (fullXstop - fullXstart + 1.0) ;
               ipx = xcut[i] ;
               ycut[i] = (fullYstop - fullYstart + 1) * (i+1) / 4 + fullYstart - 1 ;
               yfrac[i] = (ycut[i] - ipy) / (fullYstop - fullYstart + 1.0) ;
               ipy = ycut[i] ;
              }
            for(ix=0 ; ix<4 ; ix++)
              {
               for(iy=0 ; iy<4 ; iy++)
                 {
                  expected[ix*4+iy] = xfrac[ix] * yfrac[iy] *
                                      (currentDataStop - currentDataStart + 1) ;
                  actual44[ix*4+iy] = 0 ;
                 }
              }
            for(i=currentDataStart ; i<=currentDataStop ; i++)
              {
               k = m_indices[i] ;
               for(ix=0 ; ix<3 ; ix++)
                 {
                  if(x[k] <= xcut[ix])
                     break ;
                 }
               for(iy=0 ; iy<3 ; iy++)
                 {
                  if(m_ranks[k] <= ycut[iy])
                     break ;
                 }
               ++actual44[ix*4+iy] ;
              }
            testval = 0.0 ;
            for(ix=0 ; ix<4 ; ix++)
              {
               for(iy=0 ; iy<4 ; iy++)
                 {
                  diff = fabs(actual44[ix*4+iy] - expected[ix*4+iy]) - 0.5 ;
                  testval += diff * diff / expected[ix*4+iy] ;
                 }
              }
            splittable = (testval > 3 * m_chi_crit)  ?  1 : 0 ;
           }
        }

      if(splittable)
        {

         for(i=currentDataStart ; i<=currentDataStop ; i++)
            current_indices[i] = m_indices[i] ;

         ipos = currentDataStart ;
         for(iSubRec=0 ; iSubRec<4 ; iSubRec++)
           {

            if(actual[iSubRec] >= 3)
              {
               stack[nstack].Xstart = trialXstart[iSubRec] ;
               stack[nstack].Xstop = trialXstop[iSubRec] ;
               stack[nstack].Ystart = trialYstart[iSubRec] ;
               stack[nstack].Ystop = trialYstop[iSubRec] ;
               stack[nstack].DataStart = ipos ;
               stack[nstack].DataStop = ipos + actual[iSubRec] - 1 ;
               ++nstack ;
               if(ArrayResize(stack,nstack+1,256)<0)
                 {
                   Print(__FUNCTION__," arrayresize error ", GetLastError());
                   return EMPTY_VALUE;
                 }

               if(iSubRec == 0)
                 {
                  for(i=currentDataStart ; i<=currentDataStop ; i++)
                    {
                     k = current_indices[i] ;
                     if(x[k] <= centerX  &&  m_ranks[k] <= centerY)
                        m_indices[ipos++] = current_indices[i] ;
                    }
                 }
               else
                  if(iSubRec == 1)
                    {
                     for(i=currentDataStart ; i<=currentDataStop ; i++)
                       {
                        k = current_indices[i] ;
                        if(x[k] <= centerX  &&  m_ranks[k] > centerY)
                           m_indices[ipos++] = current_indices[i] ;
                       }
                    }
                  else
                     if(iSubRec == 2)
                       {
                        for(i=currentDataStart ; i<=currentDataStop ; i++)
                          {
                           k = current_indices[i] ;
                           if(x[k] > centerX  &&  m_ranks[k] <= centerY)
                              m_indices[ipos++] = current_indices[i] ;
                          }
                       }
                     else
                       {
                        for(i=currentDataStart ; i<=currentDataStop ; i++)
                          {
                           k = current_indices[i] ;
                           if(x[k] > centerX  &&  m_ranks[k] > centerY)
                              m_indices[ipos++] = current_indices[i] ;
                          }
                       }
              }

            else
              {
               if(actual[iSubRec] > 0)
                 {
                  px = (trialXstop[iSubRec] - trialXstart[iSubRec] + 1.0) / m_size ;
                  py = (trialYstop[iSubRec] - trialYstart[iSubRec] + 1.0) / m_size ;
                  pxy = (double) actual[iSubRec] / m_size ;
                  MI += pxy * log(pxy / (px * py)) ;
                 }
              }
           }
        }
      else
        {
         px = (fullXstop - fullXstart + 1.0) / m_size ;
         py = (fullYstop - fullYstart + 1.0) / m_size ;
         pxy = (currentDataStop - currentDataStart + 1.0) / m_size ;
         MI += pxy * log(pxy / (px * py)) ;
        }
     }

   return MI;

For partitions that are deemed homogeneous, either due to a non-significant chi-square test or a minimal number of data points, the mutual information is calculated. This calculation involves estimating the joint and marginal probability distributions within the partition and applying the formula below.

Mutual Information Contribution

The process continues recursively, with partitions being split and evaluated until no further significant splits are possible. The final estimate of mutual information is obtained by summing the contributions from all terminal partitions, providing a comprehensive estimate based on the data's structure. If the method encounters an error, it returns the equivalent to MQL5's built in EMPTY_VALUE constant. In an upcoming section, we will use the Capm class in the implementation of a stepwise feature selection algorithm.

Predictor selection using mutual information

When applying mutual information to the task of feature selection, our goal is to identify a subset of predictors from a larger set of candidate predictors that maximizes the joint dependency with a target variable. This joint dependency, is a generalization of mutual information, where one of the quantities is a collection of variables rather than a single variable. Calculating the joint dependency for larger subsets of predictors becomes computationally intractable due to the curse of dimensionality. This occurs because the required multidimensional partitioning of the data leads to extremely sparse cells, especially as the number of dimensions (predictors) increases.

Additionally, the sheer number of possible feature combinations, can be overwhelming even for moderately sized feature sets. This combinatorial explosion makes it impractical to exhaustively search all possible subsets, necessitating the use of efficient search strategies.

A common approach to feature selection is forward stepwise selection. This method starts with an empty model and iteratively adds the feature that provides the largest improvement to the model's performance, as measured by a chosen criterion. While this approach is efficient, it suffers from the limitation of being greedy. It may miss optimal combinations of features, such as cases where two features, when combined, provide significantly better predictive power than either one alone. This is because the algorithm focuses on incremental improvements at each step, potentially overlooking synergistic relationships between features.

While other techniques like higher-order methods and backward selection exist, forward stepwise selection remains the most practical choice due to its computational efficiency, especially when dealing with large datasets and limited computational resources. Additionally, when applied to the specific context of joint dependency, forward stepwise selection exhibits a surprising property: it can effectively approximate the optimal feature subset, even though it does not explicitly optimize the joint dependency measure.

Peng, Long, and Ding proposed a feature selection algorithm based on mutual information, known as the Maximum Relevance, Minimum Redundancy (MRMR) criterion. This algorithm efficiently approximates the optimal feature subset without explicitly calculating the joint dependency, which is computationally intractable. The MRMR criterion is based on the concept of relevance and redundancy. The relevance of a feature set S to the target variable Y is defined as the average mutual information between Y and each feature in S.

Relevance

Here ∣S∣ is the number of features in S and I(Xi,Y) is the mutual information between feature Xi and Y.

While maximizing relevance is intuitive, it can lead to a suboptimal feature subset due to redundancy. If we simply select features based on their individual relevance to the target variable, we may end up with a set of highly correlated features that provide little additional information. To address this issue, we need to consider not only the relevance of a feature but also its redundancy with the already selected features.

The redundancy of a predictor candidate Xi regarding an already selected feature set S is defined as the average mutual information between Xi and each feature in S:

Redundancy

Where I(Xi,Xj) is the mutual information between candidate feature Xi and the feature Xj already in S.

It is important to note that the mathematical expression for redundancy is identical to the expression for relevance. The only difference lies in the interpretation. When calculating the mutual information between a candidate feature and the target variable, we refer to it as relevance. However, when calculating the mutual information between a candidate feature and a feature already in the selected set, we refer to it as redundancy.

Essentially, both concepts involve measuring the amount of information shared between two variables, but the context determines whether it is considered relevant or redundant. In relevance, we are concerned with the relationship between the feature and the target variable, while in redundancy, we focus on how much overlap exists between the candidate feature and the already selected features.

To summarize, the MRMR algorithm operates as follows:

The feature with the highest mutual information with the target variable Y is selected as the first feature.
At each subsequent step, the algorithm selects the feature that maximizes the following criterion.

MRMR Criterion

This criterion balances the feature's relevance to the target variable Y (i.e., I(X_i, Y)) with its redundancy with the already selected feature. The remarkable aspect of the MRMR algorithm is that it effectively approximates the optimal feature subset, even though directly optimizing the joint dependency is computationally intractable. This result, while not immediately obvious, is mathematically proven in the original paper.

To assess the statistical significance of the selected features, two Monte Carlo permutation (MCP) tests can be employed:

Individual Feature Significance: This test evaluates the significance of each individual feature by comparing its relevance to the target variable to a null distribution obtained through permutation. By permuting the feature values, we create a null distribution that assumes the feature has no relationship with the target. If the observed relevance is significantly higher than the permuted values, we can conclude that the feature is likely to be truly informative.
Overall Model Significance: This test assesses the significance of the entire selected feature set by comparing the cumulative sum of the individual feature relevance to a null distribution obtained through permutation. By permuting the target variable values, we create a null distribution that assumes the features have no relationship with the target. If the observed cumulative relevance is significantly higher than the permuted values, we can conclude that the selected feature set is likely to be predictive of the target variable.

MQL5 implementation of stepwise selection based on mutual information

The Cmrmr class in mutualinfo.mqh implements the MRMR algorithm and takes the following parameters in its constructor:

num_reps: The number of replications for the Monte Carlo permutation tests. Setting this to a value less than or equal to 1 disables the tests.
max_preds: The maximum number of features to be selected.
chisquare_thresh: The chi-square test threshold for the adaptive partitioning method used to estimate mutual information.
m_verbose: A boolean flag to control whether to display detailed output during the algorithm's execution.

//+-----------------------------------------------------------------------+
//|Relevance minus redundancy for building an optimal subset of predictors|
//+-----------------------------------------------------------------------+
class Cmrmr
  {
private:
   int               m_max_preds;
   int               m_reps;
   bool              m_verbose;
   int               m_samples;
   int               m_vars;
   matrix            m_preds;
   vector            m_target;
   vector            m_crits;
   double            m_chisquare_thresh;
   ulong             m_selected_vars[];
   matrix            m_critical_values;
   vector            mutualinfo(int which_size,int &which[],vector &targs);
public:
                     Cmrmr(int num_reps,int max_preds, double chisquare_thresh,bool verbose);
                    ~Cmrmr(void);
   bool              StepWise(matrix &predictors, vector &targets);
   matrix            GetCriticalValues(void);
   bool              GetSelectedVars(ulong &output[]);
  };

To initiate the feature selection process using the Cmrmr class, we need to instantiate an object of this class and then call its StepWise() method. The StepWise() method takes two primary inputs:

A matrix of candidate predictor variables: This matrix should contain the data for all the potential features that a practitioner wants to consider.
A vector of the target variable: This vector should contain the values of a target variable.

The method returns a boolean value indicating the success or failure of the feature selection process. The StepWise() method in the Cmrmr class initializes the necessary data structures and then proceeds to the MCP test loop. If MCP tests are not requested (`num_reps` is less than or equal to 1), the loop is executed only once. Within the loop, the method calculates the mutual information between each candidate predictor and the target variable using an instance of the Capm class.

//+------------------------------------------------------------------+
//|  Stepwise feature selection based on mutual information          |
//+------------------------------------------------------------------+
bool Cmrmr::StepWise(matrix &predictors,vector &targets)
  {
   if(m_selected_vars.Size())
      ArrayFree(m_selected_vars);

   m_preds = predictors;
   m_samples = int(m_preds.Rows());
   m_vars = int(m_preds.Cols());
   m_max_preds = m_max_preds>=m_vars?m_vars:m_max_preds;
   m_target = targets;
   
   
   int i, j, k, ivar, irep;
   int index[], stepwise_mcpt_count[], solo_mcpt_count[], stepwise_ivar[], which_preds[],original_stepwise_ivar[] ;

   int nkept,best_ivar;
   vector casework(m_samples), sorted(m_samples), mutual(m_vars);
   vector crit(m_vars), relevance(m_vars), original_relevance(m_vars), current_crits(m_vars), sorted_crits(m_vars);
   double best_crit, dtemp, group_pvalue,solo_pvalue;
   vector stepwise_crit(m_vars), original_stepwise_crit(m_vars);
   double sum_relevance;
   vector original_sum_relevance(m_vars), sum_redundancy(m_vars);
   vector target = m_target;

   best_crit = -DBL_MAX;
   best_ivar = -1;
   nkept = m_max_preds;

   if(ArrayResize(index,m_vars)<0 || ArrayResize(stepwise_mcpt_count,m_vars)<0 ||
      ArrayResize(solo_mcpt_count,m_vars)<0 || ArrayResize(which_preds,m_vars)<0 || ArrayResize(stepwise_ivar,m_vars)<0 ||
      ArrayResize(original_stepwise_ivar,m_vars)<0)
     {
      Print(__FUNCTION__," array resize error ", GetLastError());
      return false;
     }

   int unif_error;
   for(irep=0 ; irep<m_reps ; irep++)
     {
      if(irep)
        {
         i = m_samples ;
         while(i > 1)
           {
            j = (int)(MathRandomUniform(0.0,1.0,unif_error) * i);
            if(unif_error)
              {
               Print(__FUNCTION__," Mathrandomuniform error ", GetLastError());
               return false;
              }
            if(j >= i)
               j = i - 1 ;
            dtemp = target[--i] ;
            target[i] = target[j] ;
            target[j] = dtemp ;
           }
        }

      for(i=0 ; i<m_vars ; i++)
         which_preds[i] = i ;

      crit = mutualinfo(m_vars,which_preds,target);

      for(ivar=0 ; ivar<m_vars ; ivar++)
        {
         relevance[ivar] = crit[ivar] ;
         if(ivar == 0  ||  crit[ivar] > best_crit)
           {
            best_crit = crit[ivar] ;
            best_ivar = ivar ;
           }
        }

      stepwise_crit[0] = best_crit ;
      stepwise_ivar[0] = best_ivar ;
      sum_relevance = best_crit ;
      if(irep == 0)
        {

         original_stepwise_crit[0] = best_crit ;
         original_stepwise_ivar[0] = best_ivar ;
         original_sum_relevance[0] = sum_relevance ;
         stepwise_mcpt_count[0] = 1 ;

         for(ivar=0 ; ivar<m_vars ; ivar++)
           {
            index[ivar] = ivar ;
            original_relevance[ivar] = sorted_crits[ivar] = current_crits[ivar] = crit[ivar] ;
            solo_mcpt_count[ivar] = 1 ;
           }

         if(!qsortdsi(0, m_vars-1, sorted_crits, index))
           {
            Print(__FUNCTION__, " failed qsort ");
            return false;
           }

         if(m_verbose)
            Print("       Variable         MI") ;

         for(i=m_vars-1 ; i>=0 ; i--)
           {
            k = index[i] ;

            if(m_verbose)
               PrintFormat("%15s %12.4lf",string(k), current_crits[k]) ;
           }
        }

      else
        {
         if(sum_relevance >= original_sum_relevance[0])
            ++stepwise_mcpt_count[0] ;
         for(ivar=0 ; ivar<m_vars ; ivar++)
           {
            if(relevance[ivar] >= original_relevance[ivar])
               ++solo_mcpt_count[ivar] ;
           }
        }

      for(i=0 ; i<m_vars ; i++)
         sum_redundancy[i] = 0.0 ;

      for(nkept=1 ; nkept<m_max_preds ; nkept++)
        {

         if(irep == 0 && m_verbose)
           {
            Print("Predictors so far   Relevance   Redundancy   Criterion") ;
            for(i=0 ; i<nkept ; i++)
              {
               k = stepwise_ivar[i] ;
               PrintFormat("%15s %12.4lf %12.4lf %12.4lf",string(k), relevance[k], relevance[k] - stepwise_crit[i], stepwise_crit[i]) ;
              }
           }

         k = 0 ;
         for(i=0 ; i<m_vars ; i++)
           {
            for(j=0 ; j<nkept ; j++)
              {
               if(stepwise_ivar[j] == i)
                  break ;
              }
            if(j == nkept)
               which_preds[k++] = i ;
           }
         if(k != (m_vars - nkept))
           {
            Print(__FUNCTION__, " failed assertion ", __LINE__);
            return false;
           }

         k = stepwise_ivar[nkept-1] ;

         casework = m_preds.Col(k);

         crit = mutualinfo(m_vars-nkept,which_preds,casework);

         for(i=0 ; i<(m_vars-nkept) ; i++)
           {
            k = which_preds[i] ;
            sum_redundancy[k] += crit[i] ;
            index[i] = k ;
            sorted_crits[i] = current_crits[i] = ((relevance[k] - sum_redundancy[k]) / double(nkept)) ;
            if(i == 0  ||  current_crits[i] > best_crit)
              {
               best_crit = current_crits[i] ;
               best_ivar = k ;
              }
           }

         stepwise_crit[nkept] = best_crit ;
         stepwise_ivar[nkept] = best_ivar ;
         sum_relevance += relevance[best_ivar] ;

         if(irep == 0)
           {
            original_stepwise_crit[nkept] = best_crit ;
            original_stepwise_ivar[nkept] = best_ivar ;
            original_sum_relevance[nkept] = sum_relevance ;
            stepwise_mcpt_count[nkept] = 1 ;

            if(!qsortdsi(0, m_vars-nkept-1, sorted_crits, index))
              {
               Print(__FUNCTION__, " failed qsort ");
               return false;
              }

            if(m_verbose)
              {
               Print("Additional candidates, in order of decreasing relevance minus redundancy") ;
               Print("       Variable     Relevance   Redundancy   Criterion") ;

               for(i=m_vars-nkept-1 ; i>=0 ; i--)
                 {
                  k = index[i] ;
                  PrintFormat("%15s %12.4lf %12.4lf %12.4lf",
                              string(k), relevance[k], sum_redundancy[k] / nkept,
                              relevance[k] - sum_redundancy[k] / nkept) ;
                 }
              }
           }

         else
           {
            if(sum_relevance >= original_sum_relevance[nkept])
               ++stepwise_mcpt_count[nkept] ;
           }

        }

     }
//---
   m_critical_values = matrix::Zeros(nkept,m_reps>1?5:3);
//---
   if(ArrayResize(m_selected_vars,nkept)<0)
     {
      Print(__FUNCTION__, " failed array resize ", GetLastError());
      return false;
     }
//---
   if(m_verbose)
     {
      if(m_reps > 1)
         Print("Final predictors ||  Relevance ||  Redundancy || Criterion  ||  Solo pval || Group pval") ;
      else
         Print("Final predictors  ||  Relevance ||  Redundancy ||  Criterion") ;
     }
//---
   for(i=0 ; i<nkept ; i++)
     {
      k = original_stepwise_ivar[i] ;
      m_selected_vars[i] = ulong(k);

      m_critical_values[i][0] = original_relevance[k];
      m_critical_values[i][1] = original_relevance[k] - original_stepwise_crit[i];
      m_critical_values[i][2] = original_stepwise_crit[i];

      if(m_critical_values.Cols()>3)
        {
         group_pvalue = (double) stepwise_mcpt_count[i] / (double) m_reps;
         solo_pvalue = (double) solo_mcpt_count[k] / (double) m_reps;
         m_critical_values[i][3] = solo_pvalue;
         m_critical_values[i][4] = group_pvalue;
        }

      if(m_verbose)
        {
         if(m_reps > 1)
            PrintFormat("%15s %12.4lf %12.4lf %12.4lf    %8.3lf    %8.3lf",string(k), m_critical_values[i][0], m_critical_values[i][1], m_critical_values[i][2],solo_pvalue,group_pvalue) ;
         else
            PrintFormat("%15s %12.4lf %12.4lf %12.4lf",string(k), m_critical_values[i][0], m_critical_values[i][1], m_critical_values[i][2]);
        }
     }

   return true;

  }

The mutualinfo() method, which is a private method within the Cmrmr class, is responsible for estimating the mutual information between a given predictor and the target variable using the adaptive partitioning method. It returns a vector of mutual information approximations for the selected predictors specified as column indices to mutualinfo().

//+------------------------------------------------------------------+
//| continuous mutual infomation                                     |
//+------------------------------------------------------------------+
vector Cmrmr::mutualinfo(int which_size,int &which[],vector &targs)
  {
   vector res = vector::Zeros(m_vars);

   res.Fill(-DBL_MAX);

   vector vars;

   Capm mia(true,targs,m_chisquare_thresh);

   for(int i = 0; i<which_size; i++)
     {
      vars = m_preds.Col(which[i]);
      res[i] = mia.fit(vars);
     }

   return res;

  }

The StepWise() method stores the calculated mutual information values in the `relevance` vector, as these values will be needed in subsequent iterations to compute the redundancy term. The feature with the highest mutual information is selected as the first feature, and its information is stored in stepwise_crit and stepwise_ivar variables. Additionally, the sum_relevance variable is initialized to keep track of the cumulative relevance of the selected features, which will be used later for the MCP test.

For the initial, unpermuted replication, the algorithm stores the original values of the relevance and criterion for the first selected feature. It also initializes counters for the overall model significance test and individual feature significance tests. Additionally, a table displaying the mutual information of each candidate feature with the target variable is output to provide insights into the initial feature rankings. This table helps visualize the initial selection process and serves as a reference for comparison with the permuted values during the significance testing phase.

If we are not in the initial, unpermuted replication, the algorithm proceeds to perform the two MCP tests, designed to assess the statistical significance of the feature set selected by the MRMR algorithm. To efficiently manage the evolving redundancy as new features are added to the selected set, the algorithm uses the `sum_redundancy` array. The array is initialized with zeros. This array is used to store the cumulative redundancy for each candidate feature relative to the features that have already been selected.

At the start, since no features are selected, all values in this array are set to zero. When a new feature is added to the selected set, its redundancy with each of the remaining candidate features must be updated. The redundancy of a feature regarding the growing selected set is defined as the average mutual information between the candidate feature and the features already selected. The mutual information between the new feature and each of the remaining candidate features is computed. This quantifies how much information the new feature shares with each of the selected features. The redundancy value for each remaining feature is updated by adding the mutual information value between the newly added feature and the existing feature.

After updating the redundancy values, the algorithm calculates the MRMR (Maximum Relevance, Minimum Redundancy) criterion for each of the remaining candidate features. The MRMR criterion is calculated by subtracting the average redundancy of each feature from its relevance. Once the MRMR criterion is calculated for all the remaining candidate features, the feature with the highest MRMR score is selected. The algorithm repeats this process iteratively, updating the `sum_redundancy` array and recalculating the MRMR criterion each time a new feature is added. As more features are selected, the redundancy of the remaining features with the selected set is continuously updated, ensuring that the feature selection process considers the evolving relationships between features.

If this is the initial, unpermuted replication, the original values of these quantities are preserved for later comparison during the permutation tests. Otherwise, the counters for the "group" and "solo" permutation tests are incremented. After completing the desired number of iterations, the algorithm concludes by displaying a table summarizing the selected features, their relevance, and their p-values from the permutation tests. After `StepWise()` is called and executes successfully, `GetCriticalValues()` can be invoked to get a matrix of the table of values displayed in verbose mode. Whilst, `GetSelectedVars()` retrieves the indices of the selected variables (columns) corresponding to the `m_max_preds` property of the class.

//+------------------------------------------------------------------+
//| get the matrix of decision variables                             |
//| matrix rows arranged in terms of descending relevance of each    |
//| candidate predictor. Most relevant predictors critical values are|
//| in the first row, etc.                                           |
//| Column 0 is the calculated relevance                             |
//| Column 1 is the calculated redundancy                            |
//| Column 2 is the calculated Stepwise Mutual information           |
//| Column 3 and 4 will exist only if MCP test is specified          |
//| Column 3 is the Solo probability value                           |
//| Column 4 are the Group probability values                        |
//+------------------------------------------------------------------+
matrix Cmrmr::GetCriticalValues(void)
{
  return m_critical_values;
}
//+--------------------------------------------------------------------------+
//|get the indices of the most relevant predictors as determined by algorithm|
//+--------------------------------------------------------------------------+
bool Cmrmr::GetSelectedVars(ulong &output[])
{
 return (ArrayCopy(output,m_selected_vars)>0);
}

In the next section, we will see how the Cmrmr class is applied using synthetic and real-world datasets.

Examples of relevance minus redundancy

To effectively illustrate the application of the MRMR algorithm, we will start by applying it to a synthetic dataset. The dataset consists of 100 samples and 10 potential predictor variables (features). The target variable Y is constructed by summing the first four columns of the matrix, so the relationship between the target and the predictors is deterministic for these columns. However, columns 4 to 7 provide no real information about the target, serving as irrelevant features. Variables in columns 8 and 9 are combinations of variables at column indices {1,3} and {0,2} respectively.

//---
   MathSrand(125);

   matrix rdata(100,10);

   rdata.Random(0.0,1.0);

   vector dep = rdata.Col(0) + rdata.Col(1) + rdata.Col(2) + rdata.Col(3);

   vector sum02 = rdata.Col(0) + rdata.Col(2);

   vector sum13 = rdata.Col(1) + rdata.Col(3);

   if(!rdata.Col(sum13,8) || !rdata.Col(sum02,9))
     {
      Print(" Failed column insertion ", GetLastError());
      return;
     }

The objective is to use the MRMR algorithm to identify the columns as the most relevant predictors. This experiment is implemented in the RelevanceMinusRedundancy.mq5 script, which allows for customization of various hyperparameters, including the number of Monte Carlo permutations and the maximum number of features to select. The script is configured to run in verbose mode, providing detailed output during the feature selection process.

//inputs
input int NumReplications = 100;
input int MaxPredictors = 10;
input double ChiSquareThreshold = 6.0;

The initial output of the RelevanceMinusRedundancy.mq5 script are the individual mutual information values between each candidate predictor and the target variable. Arranged in descending order of mutual information, allowing for easy identification of the most relevant features.

Variable         MI
      9       0.2675
      8       0.1696
      0       0.1662
      3       0.1336
      2       0.0823
      1       0.0645
      6       0.0000
      7       0.0000
      4       0.0000
      5       0.0000

As expected, the first feature selected by the MRMR algorithm is the one with the highest individual mutual information with the target variable. In this case, it is the one in column index 9.

Predictors so far   Relevance   Redundancy   Criterion
            9       0.2675       0.0000       0.2675

The second feature selected is the variable in column index 8. Looking at the results of the analysis, this variable has high relevance and also minimal redundancy with the first selected variable.

Predictors so far   Relevance   Redundancy   Criterion
            9       0.2675       0.0000       0.2675
            8       0.1696       0.0000       0.1696

The third feature selected demonstrates how redundancy affects the selection process. Even though features in columns 0, 1, 2, and 3 are directly related to the target variable, their high redundancy with the already selected features (columns 8 and 9) reduces their overall MRMR criterion. Therefore, the algorithm selects less directly related features that have lower redundancy with the existing set. Only after selecting a few unrelated features will the algorithm start to prioritize the features in columns 0, 1, 2, and 3, as their redundancy with the selected set decreases.

Additional candidates, in order of decreasing relevance minus redundancy
Variable     Relevance   Redundancy   Criterion
    5       0.0000       0.0000       0.0000
    4       0.0000       0.0000       0.0000
    7       0.0000       0.0000       0.0000
    6       0.0000       0.0000       0.0000
    1       0.0645       0.0600       0.0044
    0       0.1662       0.1351       0.0311
    2       0.0823       0.1426      -0.0603
    3       0.1336       0.1781      -0.0445

While the irrelevant variables in this synthetic example had zero relevance, It is important to note that in real-world scenarios, irrelevant features may still exhibit some degree of a relationship with the target variable, leading to non-zero redundancy. In such cases, the MRMR algorithm effectively balances relevance and redundancy, prioritizing features that provide the unique information.

Final predictors ||  Relevance ||  Redundancy || Criterion  || Solo pval|| Group pval
9                     0.2675        0.0000       0.2675       0.010       0.010
8                     0.1696        0.0000       0.1696       0.010       0.010
4                     0.0000        0.0000       0.0000       1.000       0.010
5                     0.0000        0.0000       0.0000       1.000       0.010
6                     0.0000        0.0000       0.0000       1.000       0.010
7                     0.0000        0.0000       0.0000       1.000       0.010
1                     0.0645        0.0738      -0.0093       0.010       0.010
0                     0.1662        0.1811      -0.0149       0.010       0.010
2                     0.0823        0.1077      -0.0254       0.010       0.010
3                     0.1336        0.1584      -0.0247       0.010       0.010

The p-values obtained from the Monte Carlo permutation tests provide valuable insights into the statistical significance of the selected features. Lower p-values indicate stronger evidence that the feature is truly informative and not due to chance. A p-value threshold of 0.05 is commonly used to determine statistical significance.

Next is a demonstration of the algorithm on a more realistic dataset. The StepWiseFeatureSelectionByMutualInformation.mq5 script provides a practical application of the MRMR algorithm on financial data. It is based on the premise of predicting future returns of a specific symbol using a set of technical indicators. The script calculates four indicators: Money Flow Index (MFI), Moving Average (MA), Bears Power, and Bulls Power, for a range of look back periods. These indicators serve as the candidate features for the feature selection process. By applying the MRMR algorithm, the script aims to identify the most informative combination of indicators that can effectively predict future returns. This involves selecting features that are relevant with the target variable (future returns) while minimizing redundancy with other selected features.

//+------------------------------------------------------------------+
//|                  StepWiseFeatureSelectionByMutualInformation.mq5 |
//|                                  Copyright 2024, MetaQuotes Ltd. |
//|                                             https://www.mql5.com |
//+------------------------------------------------------------------+
#property copyright "Copyright 2024, MetaQuotes Ltd."
#property link      "https://www.mql5.com"
#property version   "1.00"
#property script_show_inputs
#resource "\\Indicators\\LogReturns.ex5"
#include<mutualinfo.mqh>
#include<ErrorDescription.mqh>
//+------------------------------------------------------------------+
//|indicator type                                                    |
//+------------------------------------------------------------------+
enum SELECT_INDICATOR
  {
   MFI=0,//MFI
   MA,//MA
   BEARS,//BEARS
   BULLS//BULLS
  };
//--- input parameters
input int NumReplications = 100;
input int MaxPredictors = 10;
input double ChiSquareThreshold = 6.0;
input bool VerboseOutPut = false;
input uint     period_inc=2;//lookback increment
input uint     max_lookback=6;
input ENUM_MA_METHOD         AppliedMA = MODE_SMA;
input ENUM_APPLIED_PRICE     AppliedPrice = PRICE_CLOSE;
input datetime SampleStartDate=D'2019.12.31';
input datetime SampleStopDate=D'2022.12.31';
input string   SetSymbol="BTCUSD";
input ENUM_TIMEFRAMES SetTF = PERIOD_D1;
//----
string predictor_names[];                      // variable names
int size_sample,                      //training set size
    size_observations,                //size of of both training and testing sets combined
    maxperiod,                        //maximum lookback
    indicator_handle=INVALID_HANDLE;  //long moving average indicator handle
//---
vector indicator[];                   //indicator indicator values;
//---
matrix feature_matrix;          //full matrix of features;
//+------------------------------------------------------------------+
//| Script program start function                                    |
//+------------------------------------------------------------------+
void OnStart()
  {
//---get relative shift of sample set
   int samplestart,samplestop,num_features;
   samplestart=iBarShift(SetSymbol!=""?SetSymbol:NULL,SetTF,SampleStartDate);
   samplestop=iBarShift(SetSymbol!=""?SetSymbol:NULL,SetTF,SampleStopDate);
   num_features = int((max_lookback/period_inc)*4);
//---check for errors from ibarshift calls
   if(samplestart<0 || samplestop<0)
     {
      Print(ErrorDescription(GetLastError()));
      return;
     }
//---set the size of the sample sets
   size_observations=(samplestart - samplestop) + 1 ;
   maxperiod=int(max_lookback);
//---check for input errors
   if(size_observations<=0 || maxperiod<=0)
     {
      Print("Invalid inputs ");
      return;
     }
//---allocate memory
   if(ArrayResize(indicator,num_features+1)<0)
     {
      Print(ErrorDescription(GetLastError()));
      return;
     }
//----get the full collection of indicator values
   int period_len;
   int k=0;
//---
   for(SELECT_INDICATOR select_indicator = 0; select_indicator<4; select_indicator++)
     {
      for(int iperiod=0; iperiod<int((indicator.Size()-1)/4); iperiod++)
        {
         period_len=int((iperiod+1) * period_inc);
         int try
               =10;
         predictor_names.Push(EnumToString(select_indicator)+"_"+string(period_len));
         while(try)
           {
            switch(select_indicator)
              {
               case MFI:
                  indicator_handle=iMFI(SetSymbol!=""?SetSymbol:NULL,SetTF,period_len,VOLUME_TICK);
                  break;
               case MA:
                  indicator_handle=iMA(SetSymbol!=""?SetSymbol:NULL,SetTF,period_len,0,AppliedMA,AppliedPrice);
                  break;
               case BEARS:
                  indicator_handle=iBearsPower(SetSymbol!=""?SetSymbol:NULL,SetTF,period_len);
                  break;
               case BULLS:
                  indicator_handle=iBullsPower(SetSymbol!=""?SetSymbol:NULL,SetTF,period_len);
                  break;
              }

            if(indicator_handle==INVALID_HANDLE)
               try
                  --;
            else
               break;
           }

         if(indicator_handle==INVALID_HANDLE)
           {
            Print("Invalid indicator handle ",EnumToString(select_indicator)," ", GetLastError());
            return;
           }

         Comment("copying data to buffer for indicator ",period_len);
         try
               = 0;
         while(!indicator[k].CopyIndicatorBuffer(indicator_handle,0,samplestop,size_observations) && try
                  <10)
              {
               try
                  ++;
               Sleep(5000);
              }

         if(try
               <10)
               ++k;
         else
           {
            Print("error copying to indicator buffers ",GetLastError());
            Comment("");
            return;
           }

         if(indicator_handle!=INVALID_HANDLE && IndicatorRelease(indicator_handle))
            indicator_handle=INVALID_HANDLE;
        }
     }

//---resize matrix
   if(!feature_matrix.Resize(size_observations,indicator.Size()-1))
     {
      Print(__LINE__);
      Print(ErrorDescription(GetLastError()));
      Comment("");
      return;
     }
//---copy collected data to matrix
   for(ulong i = 0; i<feature_matrix.Cols(); i++)
      if(!feature_matrix.Col(indicator[i],i))
        {
         Print(__LINE__);
         Print(ErrorDescription(GetLastError()));
         Comment("");
         return;
        }
//---
   int try
         = 10;
   while(try
            >-1 && indicator_handle == INVALID_HANDLE)
        {
         indicator_handle=iCustom(SetSymbol!=""?SetSymbol:NULL,SetTF,"\\Indicators\\LogReturns",0,1,1);
         try
            --;
        }
//---
   if(try
         <0)
        {
         Print("Could not initialize returns indicator ");
         Print(ErrorDescription(GetLastError()));
         Comment("");
         return;
        }
   else
     {
      try
            = 10;
     }
//---
   while(try
            >-1 && !indicator[indicator.Size()-1].CopyIndicatorBuffer(indicator_handle,0,samplestop-1,size_observations))
        {
         Sleep(2000);
         try
            --;
        }
//---
   if(try
         <0)
        {
         Print("Could not collect returns indicator info ");
         Print(ErrorDescription(GetLastError()));
         Comment("");
         return;
        }
   else
     {
      IndicatorRelease(indicator_handle);
      Comment("");
     }
//--- display the dataset
   string console;
   for(uint i = 0; i<predictor_names.Size(); i++)
      console+=StringFormat(" %12s ",predictor_names[i]);
   console+=" NextBar Returns (Target) ";
   Print(console);
   for(ulong i = 0; i<feature_matrix.Rows(); i++)
     {
      console = "";
      for(ulong j = 0; j<feature_matrix.Cols(); j++)
         console += StringFormat(" %12.6lf ",feature_matrix[i][j]);
      console+=StringFormat(" %12.6lf ",indicator[indicator.Size()-1][i]);
      Print(console);
     }
//---
   Cmrmr mstep(NumReplications,MaxPredictors,ChiSquareThreshold,VerboseOutPut);
//---
   if(!mstep.StepWise(feature_matrix,indicator[indicator.Size()-1]))
      return;
//---
   Print(" Final predictor labels ");
   ulong variables[];
   if(mstep.GetSelectedVars(variables))
     {
      for(uint i = 0; i<variables.Size(); i++)
         Print(predictor_names[variables[i]]);
     }
   return;
  }
//+------------------------------------------------------------------+

Below is a snippet of the dataset of indicators collected from the MetaTrader 5 terminal, along with the target variable in the last column. Altogether, there are 12 candidate predictors (indicators).

MFI_2         MFI_4         MFI_6          MA_2          MA_4          MA_6       BEARS_2       BEARS_4       BEARS_6       BULLS_2       BULLS_4       BULLS_6  NextBar Returns(Target)
0.000000     53.151442     46.921608   7213.654000   7279.552750   7260.007333    -76.371267   -101.867797   -107.113146     87.265733     61.769203     56.523854     -0.003564
0.000000     26.316595     34.451328   7180.962000   7244.558000   7255.420333    -41.795089    -70.889078    -83.462247     42.344911     13.250922      0.677753     -0.032447
0.000000      0.000000     36.720977   7053.740000   7133.697000   7204.281833   -127.731696   -210.813447   -251.244462    153.948304     70.866553     30.435538      0.054562

What follows are the results of the predictor analysis using the default parameters of the script. First, we look at the table of mutual information values. Here, we see that some indicators have zero mutual information with the target. The Moving Average indicator with a period length of 6 has the highest mutual information in relation to the target.

PS          Variable         MI
ND                 5       0.0308
LG                 4       0.0293
MJ                 3       0.0279
GM                 6       0.0227
JP                 9       0.0182
IS                 1       0.0165
MF                 8       0.0135
JI                 7       0.0126
HL                 0       0.0000
JO                 2       0.0000
GS                10       0.0000
HD                11       0.0000

Naturally, MA_6 is the first selection made by the algorithm. Since the first pick has been made, it will be interesting to see if there are any other candidate predictors that share information with it (MA_6). This can be inferred by examining the redundancy values below.

ME   Predictors so far   Relevance   Redundancy   Criterion
GH                 5       0.0308       0.0000       0.0308
II   Additional candidates, in order of decreasing relevance minus redundancy
FE          Variable     Relevance   Redundancy   Criterion
LE                 0       0.0000       0.0095      -0.0095
ED                 1       0.0165       0.0869      -0.0704
FD                 2       0.0000       0.1149      -0.1149
PE                11       0.0000       0.2768      -0.2768
ID                 8       0.0135       0.3251      -0.3116
NS                 7       0.0126       0.3492      -0.3366
CR                10       0.0000       0.3484      -0.3484
ES                 6       0.0227       0.4030      -0.3802
IS                 9       0.0182       0.4285      -0.4103
CR                 3       0.0279       2.5363      -2.5084
DR                 4       0.0293       3.0096      -2.9803

MFI_2 will be the next pick for the algorithm, as it has the least redundancy among all remaining candidate predictors, despite having zero relevance to the target variable. Also, note how the predictors in column indices 3 and 4 (MA_2 and MA_4, respectively) have the highest levels of redundancy with MA_6. This makes sense, as they are the same indicator calculated with shorter window lengths. Below is the predictor set after two more selections, both of which also have zero relevance to the target variable. Of note is the top predictor in the set of candidates that have not been selected. The algorithm begins considering relevant predictors due to the dilutive effect the last two picks had on the selected predictors as a group.

PQ   Predictors so far   Relevance   Redundancy   Criterion
FL                 5       0.0308       0.0000       0.0308
JK                 0       0.0000       0.0095      -0.0095
DK                 2       0.0000       0.1796      -0.1796
EE  Additional candidates, in order of decreasing relevance minus redundancy
JH         Variable     Relevance   Redundancy   Criterion
CH                6       0.0227       0.1469      -0.1241
PH                9       0.0182       0.1465      -0.1283
GI               10       0.0000       0.2024      -0.2024
PG                7       0.0126       0.2133      -0.2007
PF               11       0.0000       0.2308      -0.2308
GG                8       0.0135       0.2664      -0.2529
QG                1       0.0165       0.4679      -0.4514
KF                3       0.0279       0.8840      -0.8561
LF                4       0.0293       1.0372      -1.0079

We conclude our commentary on the results by skipping ahead and displaying the final set of predictors chosen by the algorithm.

OQ  Final predictors ||  Relevance ||  Redundancy || Criterion  ||  Solo pval || Group pval
HN                5       0.0308       0.0000       0.0308       0.010       0.010
DJ                0       0.0000       0.0095      -0.0095       1.000       0.010
JG                2       0.0000       0.1796      -0.1796       1.000       0.010
HD                6       0.0227       0.1620      -0.1393       0.010       0.010
FQ                9       0.0182       0.2023      -0.1842       0.010       0.010
DL               11       0.0000       0.2798      -0.2798       1.000       0.010
KJ                1       0.0165       0.2932      -0.2767       0.010       0.010
OG                7       0.0126       0.3298      -0.3172       0.010       0.010
OE               10       0.0000       0.4151      -0.4151       1.000       0.010
JP                8       0.0135       0.4585      -0.4450       0.010       0.010
RN   Final predictor labels
NO  MA_6
DE  MFI_2
NN  MFI_6
KP  BEARS_2
DJ  BULLS_2
FL  BULLS_6
PQ  MFI_4
MK  BEARS_4
FM  BULLS_4
OF  BEARS_6

Conclusion

In this article, we have explored the application of mutual information in feature selection, focusing on the MRMR (Max-Dependency, Max-Relevance, Min-Redundancy) algorithm proposed by Peng, Long, and Ding. We began by discussing the estimation of mutual information for continuous variables, highlighting its ability to capture complex, nonlinear dependencies between features and the target variable. Through our exploration of the MRMR algorithm, we demonstrated how it effectively balances the competing goals of maximizing relevance to the target variable while minimizing redundancy with already selected features.

By iteratively adding features based on their MRMR scores and assessing their statistical significance through Monte Carlo permutation tests, the algorithm ensures that the selected feature set provides the most informative and unique predictors for the model. The synthetic and real-world examples illustrated the practical utility of the MRMR algorithm in real-world data scenarios, where irrelevant features and multicollinearity often complicate feature selection. The MRMR algorithm's ability to prioritize relevant features and avoid redundant ones ensures that it remains a valuable tool in the feature selection process, especially in cases where the relationships between variables are intricate and nonlinear.

All code referenced in the article is attached below. The following table describes all the source code files that accompany the article.

FileName	Description
MQL5/include/np.mqh	A header file of matrix and vector utility funtions.
MQL5/include/mutualinfo.mqh	A header file containing the definition of Capm class which implements the adaptive partitioning method for estimating the mutual information of continous variables. And the definition of Cmrmr class that implements Stepwise feature selection based on Mutual information.
MQL5/scripts/RelevanceMinusRedundancy.mq5	A script demonstrating the use of the Cmrmr class using synthetic data
MQL5/scripts/StepWiseFeatureSelectionByMutualInformation.mq5	Another script also demonstrating the application of the Cmrmr class on a more realistic dataset.

Attached files |

Download ZIP

np.mqh (74.16 KB)

mutualinfo.mqh (26.74 KB)

RelevanceMinusRedundancy.mq5 (1.69 KB)

StepWiseFeatureSelectionByMutualInformation.mq5 (7.39 KB)

Mql5.zip (19.35 KB)

Warning: All rights to these materials are reserved by MetaQuotes Ltd. Copying or reprinting of these materials in whole or in part is prohibited.