Deep Neural Networks (Part II). Working out and selecting predictors

MetaTrader 5 Integration | 18 September 2017, 12:20
Vladimir Perervenko


Introduction

In the previous article we considered different aspects of obtaining and preparing input data and the goal variable. To run the scripts of this article, you either need to execute all the scripts of the first part or load the calculation results from the first part's attachment into RStudio.

1. Creating features

Creating features is the science (and art) of extracting additional information from the data at hand. Our goal is not to add new data but to make better use of what we already have. New features make it possible to label, characterize and divide the training data set more precisely. This provides extra accuracy.

This process can be divided into two stages:

  • Transformation. Depending on the scenario, this can be one of four types of transformation: normalizing data, removing skewness of variables, removing outliers and discretization.
  • Creating features. Extracting a new variable from the existing ones is called creating a new feature. This can reveal hidden relationships in the data set.

1.1. Feature transformation


1.1.1. Transformation


What is transformation of a variable?

In data modeling, a transformation is the replacement of a variable by a function of that variable. For instance, the variable x could be replaced by its square root, cube root or logarithm. In other words, a transformation is a process that changes the distribution of a variable and its relationship with other variables.

Let us recall when transformation of a variable is useful.

  • When we want to change the scale of a variable or standardize its values for better understanding. This transformation is required if variables have different scales. It does not change the shape of the distribution.

  • When complex nonlinear and curvilinear relationships between variables have to be transformed into a linear relationship. A linear relationship is easier to interpret and gives a better forecasting capability. In such a case, a scatter plot can be used to find a relationship between two continuous variables. A logarithmic transformation is usually used in such a situation.

  • When an asymmetrical distribution needs to be changed to a symmetrical one for simpler interpretation and analysis. Some modeling methods require a normal distribution of variables. Therefore, when we deal with a skewed distribution, we can use transformations that reduce skewness. For a right-skewed distribution we take the square or cube root or a logarithm of the variable, whereas a left-skewed distribution is smoothed out by squaring/cubing or by the exponential function.

  • When a continuous variable needs to be transformed into a discrete one. The method of such a transformation is discretization.

What are the general methods of transformation?


There are various methods used for transforming variables. We have already mentioned some of them: square and cube root, logarithms, trigonometric functions and segmentation. Let us look into some methods in detail and identify their advantages and disadvantages; a short sketch follows the list.

  1. Taking the logarithm. This is a general transformation method used for changing the shape of the distribution of a variable. This is usually used to reduce the right skewness. This function is not applicable to zero and negative values.

  2. Square/cube root. This function has a significant impact on the distribution of the variable though not as powerful as taking the logarithm. The advantage of the cube root is that it can be used for zero and negative values. The square root can be taken only of positive values and zero.

  3. Discretization/binning. This is used for the categorization of values. Discretization can be applied to the original values, percentiles or frequencies. The choice of the categorization method is based on the nature of the data. We can carry out a joint segmentation of interdependent variables.
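
To make the list concrete, here is a minimal sketch of all three methods on a synthetic right-skewed sample (all names are illustrative only):

x <- rexp(1000)                          # synthetic right-skewed data
x.log  <- log(x + 1)                     # 1. logarithm, shifted by 1 to handle zeros
x.cube <- sign(x) * abs(x)^(1/3)         # 2. sign-preserving cube root, valid for negative values
x.disc <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
              include.lowest = TRUE)     # 3. discretization into quartiles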

Any transformation of data leads to the change of the distribution. To illustrate this, we will use examples of two transformation methods.

Two problems of our initial data set are outliers and right skewness. We have already considered ways to remove outliers. Now, let us try to remove/reduce the asymmetry first and then remove the outliers.

Method 1.

To get rid of the strong right skewness of the x data set, we will take the logarithm to the base 2 and then remove the outliers. As the values of the variables in the initial data set are much smaller than 1 and there are negative values amongst them, we will take the logarithm of each variable after adding 1 to it, to increase accuracy. Let us see what happens to the skewness.

evalq({x.ln <- apply(x, 2, function(x) log2(x + 1))
       sk.ln <- skewness(x.ln)},
      env)
 > env$sk.ln
               ftlm      stlm      rbci      pcci   v.fatl
Skewness -0.2715663 -2.660613 -4.484301 0.4267873 1.253008
          v.satl   v.rftl     v.rstl    v.ftlm     v.stlm
Skewness 1.83489 2.065224 -0.0343451 -15.62414 0.01529019
            v.pcci
Skewness 0.1811206

Three variables (stlm, rbci and v.ftlm) now have a pronounced left skewness. The v.fatl, v.satl and v.rftl variables are still skewed to the right. The skewness of the other variables has evened out. Let us remove and impute the outliers in this data set and then look at the skewness and distribution of the variables:

evalq({
  foreach(i = 1:ncol(x.ln), .combine = "cbind") %do% {
    remove_outliers(x.ln[ ,i])
  } -> x.ln.out
  colnames(x.ln.out) <- colnames(x.ln)
  },  
env)
evalq({
  foreach(i = 1:ncol(x.ln), .combine = "cbind") %do% {
    capping_outliers(x.ln[ ,i])
  } -> x.ln.cap
  colnames(x.ln.cap) <- colnames(x.ln)
},  
env)
evalq({
  sk.ln.out <- skewness(x.ln.out) 
  sk.ln.cap <- skewness(x.ln.cap)
}, 
env)
> env$sk.ln.out
              ftlm       stlm       rbci        pcci
Skewness -0.119055 -0.3549119 -0.1099921 -0.01476384
              v.fatl      v.satl      v.rftl     v.rstl
Skewness -0.02896319 -0.03634833 -0.06259749 -0.2120127
              v.ftlm      v.stlm      v.pcci
Skewness -0.05819699 -0.01661317 -0.05420077
> env$sk.ln.cap
               ftlm       stlm       rbci        pcci
Skewness -0.1814781 -0.4582045 -0.1658855 -0.02849945
              v.fatl      v.satl     v.rftl     v.rstl
Skewness -0.04336238 -0.04400781 -0.0692754 -0.2269408
              v.ftlm      v.stlm      v.pcci
Skewness -0.06184128 -0.02856397 -0.06258243

Data in both data sets (x.ln.out and x.ln.cap) are nearly symmetrical. The distribution is shown on the diagrams below.

par(mfrow = c(2,2))
boxplot(env$x.ln, 
              main = "x.ln with outliers",
              xlab = "")
boxplot(env$x.ln.out, 
              main = "x.ln.out without outliers",
              xlab = "")
boxplot(env$x.ln.cap, 
              main = "x.ln.cap with imputed outliers",
              xlab = "")
par(mfrow = c(1,1))


Fig.1. Log-transformed data with and without outliers


Fig.2. Log-transformed data with outliers imputed

The results are very similar to the previous transformation, with one exception: the range of variation of the variables has become wider.

Let us convert the x.ln.cap matrix into a dataframe with the Data and Class columns and look at the variation and covariation of the set:

 evalq(x.ln.cap %>% tbl_df() %>% 
        cbind(Data = dataSetClean$Data, .,
              Class = dataSetClean$Class) -> 
        dataSetLnCap, 
      env) 

Plot charts:

require(GGally)
evalq(ggpairs(dataSetLnCap, columns = 2:7, 
              mapping = aes(color = Class),
              title = "PredLnCap1"), 
      env)
evalq(ggpairs(dataSetLnCap, columns = 8:13, 
              mapping = aes(color = Class),
              title = "PredLnCap2"), 
      env)


Fig.3. Parameters of log-transformed data, part 1


Fig. 4. Parameters of log-transformed data, part 2

Method 2. 

Transform the data using the sin(2*pi*x) function, remove and impute outliers and then assess the skewness, distribution of outliers and covariation of transformed variables on charts.

evalq({x.sin <- apply(x, 2, function(x) sin(2*pi*x))
      sk.sin <- skewness(x.sin)
      },
env)
#----------
evalq({
  foreach(i = 1:ncol(x.sin), .combine = "cbind") %do% {
    remove_outliers(x.sin[ ,i])
  } -> x.sin.out
  colnames(x.sin.out) <- colnames(x.sin)
},  
env)
#-----------------
evalq({
  foreach(i = 1:ncol(x.sin), .combine = "cbind") %do% {
    capping_outliers(x.sin[ ,i])
  } -> x.sin.cap
  colnames(x.sin.cap) <- colnames(x.sin)
},  
env)
#-----------
evalq({
  sk.sin.out <- skewness(x.sin.out) 
  sk.sin.cap <- skewness(x.sin.cap)
}, 
env) 

What is the skewness of these transformed data sets?

> env$sk.sin
                ftlm        stlm        rbci         pcci
Skewness -0.02536085 -0.04234074 -0.00587189 0.0009679463
             v.fatl    v.satl     v.rftl      v.rstl
Skewness 0.03280465 0.5217757 0.05611136 -0.02825112
             v.ftlm     v.stlm     v.pcci
Skewness 0.04923953 -0.2123434 0.01738377
> env$sk.sin.out
                ftlm        stlm        rbci       pcci
Skewness -0.02536085 -0.04234074 -0.00587189 0.03532892
             v.fatl      v.satl      v.rftl      v.rstl
Skewness 0.00360966 -0.02380975 -0.05336561 -0.02825112
               v.ftlm     v.stlm       v.pcci
Skewness 0.0009366441 0.01835948 0.0008843329
> env$sk.sin.cap
                ftlm        stlm        rbci       pcci
Skewness -0.02536085 -0.04234074 -0.00587189 0.03283132
              v.fatl      v.satl      v.rftl      v.rstl
Skewness 0.007588308 -0.02424707 -0.04106469 -0.02825112
              v.ftlm      v.stlm      v.pcci
Skewness 0.007003051 0.009237835 0.002101687

As you can see, this transformation made all the data sets symmetrical. Let us see what these sets look like:

par(mfrow = c(2, 2))
boxplot(env$x.sin, main = "x.sin with outlier")
abline(h = 0, col = 2)
boxplot(env$x.sin.out, main = "x.sin.out without outlier")
abline(h = 0, col = 2)
boxplot(env$x.sin.cap, main = "x.sin.cap with capping outlier")
abline(h = 0, col = 2)
par(mfrow = c(1, 1))


Fig.5. Data set transformed by the sin() function

At first glance, these data sets look better than the previous ones (the initial and transformed ones).

Now, we want to see the distribution of NA in variables after the outliers have been removed.

require(VIM)
evalq(a <- aggr(x.sin.out), env)


Fig.6. Distribution of NA in the data set

The left part of the chart shows the relative number of undefined data in every variable. The right side shows combinations of examples with a different number of NA (increasing from bottom to top). We can see the values:

> print(env$a)

 Missings in variables:
 Variable Count
     pcci   256
   v.fatl   317
   v.satl   289
   v.rftl   406
   v.ftlm   215
   v.stlm   194
   v.pcci   201

What is the distribution of NA in variables?

 par(mfrow = c(3, 4))
evalq(
  foreach(i = 1:ncol(x.sin.out)) %do% {
    barMiss(x.sin.out, pos = i, only.miss = TRUE, 
            main = "x.sin.out without outlier")
  }, env
)
par(mfrow = c(1, 1))


Fig.7. Distribution of NA in variables

Observed values of the variable are shown in blue and the number of NA of other variables in different ranges of the values of the current variable are shown in red. The bar on the right represents the contribution of the current variable to the total number of NA of all variables.

Finally, let us take a look at the variation and covariation of the transformed data set with imputed outliers.

#---------------
evalq(x.sin.cap %>% tbl_df() %>% 
        cbind(Data = dataSetClean$Data, .,
              Class = dataSetClean$Class) -> 
        dataSetSinCap, 
      env) 
require(GGally)
evalq(ggpairs(dataSetSinCap, columns = 2:7, 
              mapping = aes(color = Class),
              title = "dataSetSinCap1 with capping outlier "), 
      env)
evalq(ggpairs(dataSetSinCap, columns = 8:13, 
              mapping = aes(color = Class),
              title = "dataSetSinCap2 with capping outlier"), 
      env)
#---------------------------


Fig.8. Parameters of sin()-transformed data, part 1


Fig.9. Parameters of sin()-transformed data, part 2


1.1.2. Normalization


We are preparing data for a neural network, hence the variables must be brought within the range { -1..+1 }. For that, the preProcess()::caret function with method = "spatialSign" is going to be used. Alternatively, the data can be centered or scaled before normalization. This is a very simple process and we are not going to consider it in this article.

There is one thing to bear in mind though. Parameters of normalization obtained from the training data set are to be used for the test and validation sets.
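
As a brief hedged sketch of this rule (x.train and x.test are hypothetical data frames), the parameters are learned once on the training set and then reused:

require(caret)
prep.cs <- preProcess(x.train, method = c("center", "scale"))  # parameters from train only
x.train.n <- predict(prep.cs, x.train)
x.test.n  <- predict(prep.cs, x.test)    # the same train parameters applied to test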

For further use of the data set obtained in the previous calculations (dataSet, without removing the highly correlated variables), let us divide it into train/val/test sets and bring them within the range (-1, +1) without standardization.

When performing normalization with standardization, keep in mind that when the normalization parameters (mean/median, sd/mad) are defined, the parameters for imputing outliers must be defined too. Going forward, they will be used for train/val/test. Earlier we wrote two functions for this purpose: prep.outlier() and treatOutlier().

Sequence of operations:

  1. Define parameters for outliers in train
  2. Remove outliers in train
  3. Define parameters of standardization in train
  4. Impute outliers in train/val/test
  5. Normalize train/val/test.

We are not going to consider this variant in detail here; a brief sketch is given below, and you can study it further on your own.
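
A minimal sketch of this sequence, assuming the prep.outlier() and treatOutlier() functions from the first part are loaded (the parameter names match their use later in this article) and x.tr is a hypothetical numeric train data.frame:

require(foreach)
require(caret)
# 1/2. define the outlier parameters on train only
foreach(i = 1:ncol(x.tr), .combine = "cbind") %do% {
  prep.outlier(x.tr[ ,i]) %>% unlist()
} -> pre.outl
colnames(pre.outl) <- colnames(x.tr)
# 4. impute the outliers with those parameters (repeat the same for val/test)
foreach(i = 1:ncol(x.tr), .combine = "cbind") %do% {
  treatOutlier(x.tr[ ,i], impute = TRUE, fill = TRUE,
               lower = pre.outl['lower.25%', i], upper = pre.outl['upper.75%', i],
               med = pre.outl['med', i],
               cap1 = pre.outl['cap1.5%', i], cap2 = pre.outl['cap2.95%', i])
} %>% as.data.frame() -> x.tr.cap
colnames(x.tr.cap) <- colnames(x.tr)
# 3/5. standardization parameters from train, applied to all sets
preProcess(x.tr.cap, method = c("center", "scale")) -> preproc
x.tr.n <- predict(preproc, x.tr.cap)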

Divide the data into train/val/test:

evalq(
{
  train = 1:2000
  val = 2001:3000
  test = 3001:4000
  DT <- list()
  # elements of a list cannot reference each other during construction,
  # so clean is created first
  clean <- data.frame(dataSet) %>% na.omit()
  list(clean = clean, 
       train = clean[train, ], 
       val = clean[val, ], 
       test = clean[test, ]) -> DT
}, env) 

Define the normalization parameters on the train set and normalize the train/val/test sets:

 require(foreach)
evalq(
{
 preProcess(DT$train, method = "spatialSign") -> preproc 
 list(train = predict(preproc, DT$train), 
        val = predict(preproc, DT$val),
        test = predict(preproc, DT$test)
       ) -> DTn
}, 
env) 

Let us take a look at the total statistics of the train set:

 > table.Stats(env$DTn$train %>% tk_xts())
Using column `Data` for date_var.
                     ftlm      stlm      rbci      pcci
Observations    2000.0000 2000.0000 2000.0000 2000.0000
NAs                0.0000    0.0000    0.0000    0.0000
Minimum           -0.5909   -0.7624   -0.6114   -0.8086
Quartile 1        -0.2054   -0.2157   -0.2203   -0.2110
Median             0.0145    0.0246    0.0147    0.0068
Arithmetic Mean    0.0070    0.0190    0.0085    0.0028
Geometric Mean    -0.0316   -0.0396   -0.0332   -0.0438
Quartile 3         0.2139    0.2462    0.2341    0.2277
Maximum            0.6314    0.8047    0.7573    0.7539
SE Mean            0.0060    0.0073    0.0063    0.0065
LCL Mean (0.95)   -0.0047    0.0047   -0.0037   -0.0100
UCL Mean (0.95)    0.0188    0.0333    0.0208    0.0155
Variance           0.0719    0.1058    0.0784    0.0848
Stdev              0.2682    0.3252    0.2800    0.2912
Skewness          -0.0762   -0.0221   -0.0169   -0.0272
Kurtosis          -0.8759   -0.6688   -0.8782   -0.7090
                   v.fatl    v.satl    v.rftl    v.rstl
Observations    2000.0000 2000.0000 2000.0000 2000.0000
NAs                0.0000    0.0000    0.0000    0.0000
Minimum           -0.5160   -0.5943   -0.6037   -0.7591
Quartile 1        -0.2134   -0.2195   -0.1988   -0.2321
Median             0.0015    0.0301    0.0230    0.0277
Arithmetic Mean    0.0032    0.0151    0.0118    0.0177
Geometric Mean    -0.0323   -0.0267   -0.0289   -0.0429
Quartile 3         0.2210    0.2467    0.2233    0.2657
Maximum            0.5093    0.5893    0.6714    0.7346
SE Mean            0.0058    0.0063    0.0062    0.0074
LCL Mean (0.95)   -0.0082    0.0028   -0.0003    0.0033
UCL Mean (0.95)    0.0146    0.0274    0.0238    0.0321
Variance           0.0675    0.0783    0.0757    0.1083
Stdev              0.2599    0.2798    0.2751    0.3291
Skewness          -0.0119   -0.0956   -0.0648   -0.0562
Kurtosis          -1.0788   -1.0359   -0.7957   -0.7275
                   v.ftlm    v.stlm    v.rbci    v.pcci
Observations    2000.0000 2000.0000 2000.0000 2000.0000
NAs                0.0000    0.0000    0.0000    0.0000
Minimum           -0.5627   -0.6279   -0.5925   -0.7860
Quartile 1        -0.2215   -0.2363   -0.2245   -0.2256
Median            -0.0018    0.0092   -0.0015   -0.0054
Arithmetic Mean   -0.0037    0.0036   -0.0037    0.0013
Geometric Mean    -0.0426   -0.0411   -0.0433   -0.0537
Quartile 3         0.2165    0.2372    0.2180    0.2276
Maximum            0.5577    0.6322    0.5740    0.9051
SE Mean            0.0061    0.0065    0.0061    0.0070
LCL Mean (0.95)   -0.0155   -0.0091   -0.0157   -0.0124
UCL Mean (0.95)    0.0082    0.0163    0.0082    0.0150
Variance           0.0732    0.0836    0.0742    0.0975
Stdev              0.2706    0.2892    0.2724    0.3123
Skewness           0.0106   -0.0004   -0.0014    0.0232
Kurtosis          -1.0040   -1.0083   -1.0043   -0.4159

This table shows us that the variables are symmetrical and have very close parameters.

Now, let us take a look at the distribution of variables in the train/val/test sets:

 boxplot(env$DTn$train %>% 
          dplyr::select(-c(Data, Class)),
        horizontal = T, main = "Train")
abline(v = 0, col = 2)
boxplot(env$DTn$test %>% 
          dplyr::select(-c(Data, Class)),
        horizontal = T, main = "Test")
abline(v = 0, col = 2)
boxplot(env$DTn$val %>% 
          dplyr::select(-c(Data, Class)),
        horizontal = T, main = "Val")
abline(v = 0, col = 2) 


Fig.10. Distribution of variables in the train/val/test sets after normalization

The distribution is nearly the same in all sets. We also have to consider the correlation and covariation of the variables in the train set:

require(GGally)
evalq(ggpairs(DTn$train, columns = 2:7, 
              mapping = aes(color = Class),
              title = "DTn$train1 "), 
      env)
evalq(ggpairs(DTn$train, columns = 8:14, 
              mapping = aes(color = Class),
              title = "DTn$train2"), 
      env)


Fig.11. Variation and covariation of the train set, part 1


Fig.12. Variation and covariation of the train set, part 2

There are no highly correlated data, the distribution is compact and has no outliers. The data can be divided well. On the face of it, there are only two problematic variables: stlm and v.rstl. We will verify this statement later when we assess the relevance of predictors. Now, we can look at the correlation of these predictors with the goal variable:

> funModeling::correlation_table(env$DTn$train %>% tbl_df %>%
+                    select(-Data), str_target = 'Class')
   Variable Class
1     Class  1.00
2    v.fatl  0.38
3      ftlm  0.34
4      rbci  0.28
5    v.rbci  0.28
6    v.satl  0.27
7      pcci  0.24
8    v.ftlm  0.22
9    v.stlm  0.22
10   v.rftl  0.18
11   v.pcci  0.08
12     stlm  0.03
13   v.rstl -0.01
The named variables are at the bottom of the table with very small correlation coefficients. The relevance of the v.pcci variable also has to be verified. Let us check the v.fatl variable in the train/val/test sets:
require(ggvis)
evalq(
  DTn$train %>% ggvis(~v.fatl, fill = ~Class) %>% 
  group_by(Class) %>%  layer_densities() %>% 
  add_legend("fill", title = "DTn$train$v.fatl"),
  env)
evalq(
  DTn$val %>% ggvis(~v.fatl, fill = ~Class) %>% 
    group_by(Class) %>%  layer_densities() %>% 
    add_legend("fill", title = "DTn$val$v.fatl"),
  env)
evalq(
  DTn$test %>% ggvis(~v.fatl, fill = ~Class) %>% 
    group_by(Class) %>%  layer_densities() %>% 
    add_legend("fill", title = "DTn$test$v.fatl"),
  env) 


Fig.13. Distribution of the v.fatl variable in the train set after normalization


Fig.14. Distribution of the v.fatl variable in the valid set after normalization


Fig.15. Distribution of the v.fatl variable in the test set after normalization

The analysis performed shows that normalization often produces a good distribution of predictors without outliers or highly correlated data. Largely, though, this depends on the nature of the raw data.


1.1.3. Discretization


Discretization refers to the process of transformation of a continuous variable to a discrete one by dividing its values into areas. The boundaries of these areas can be set using various methods.

The separation methods can be divided into two groups: quantitative methods, which do not take the relationship with the goal variable into account, and methods that do.

The first group of methods is nearly fully covered by the cut2()::Hmisc function. The sample can be divided into a preset number of areas with specified boundaries, into quartiles, into areas with a minimal number of examples in each, or into equal-frequency areas.
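
A short hedged illustration of these options (the vector x here is synthetic):

require(Hmisc)
x <- rnorm(1000)
table(cut2(x, g = 4))                 # four equal-frequency groups (quartiles)
table(cut2(x, m = 250))               # groups with at least 250 observations each
table(cut2(x, cuts = c(-1, 0, 1)))    # preset boundaries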

The second group of methods is more interesting because it divides the variable into areas connected with the levels of the goal variable. Let us consider several packages realizing these methods.

Discretization. This package provides a set of supervised ("with trainer") discretization algorithms, which can be grouped into "top-down" and "bottom-up" approaches. Let us consider some of them using our dataSet as an example.

At first, we will clean the set (without removing highly correlated variables) and then divide it into train/val/test sets in the ratio 2000/1000/1000.

require(discretization)
require(caret)
require(pipeR)
evalq(
  {
    dataSet %>%
    preProcess(.,
               method = c("zv", "nzv", "conditionalX")) %>%
    predict(., dataSet) %>%
    na.omit -> dataSetClean
    train = 1:2000
    val = 2001:3000
    test = 3001:4000
    DT <- list()
    list(train = dataSetClean[train, ], 
         val = dataSetClean[val, ], 
         test = dataSetClean[test, ]) -> DT
  }, 
  env)

We will use the mdlp()::discretization function, which performs discretization by minimum description length. It discretizes the continuous attributes of the matrix using the entropy criterion, with minimal description length as the stopping rule.

evalq(
  pipeline({
    DT$train
    select(-Data)
    as.data.frame()
    mdlp()}) -> mdlp.train, envir = env)

The function returns a list with two slots. They are: cutp - a dataframe with cutoff points for each variable and Disc.data - a dataframe with labeled variables.

> env$mdlp.train %>% str()
List of 2
 $ cutp     :List of 12
  ..$ : num [1:2] -0.0534 0.0278
  ..$ : chr "All"
  ..$ : num -0.0166
  ..$ : num [1:2] -0.0205 0.0493
  ..$ : num [1:3] -0.0519 -0.0055 0.019
  ..$ : num 0.000865
  ..$ : num -0.00909
  ..$ : chr "All"
  ..$ : num 0.0176
  ..$ : num [1:2] -0.011 0.0257
  ..$ : num [1:3] -0.03612 0.00385 0.03988
  ..$ : chr "All"
 $ Disc.data:'data.frame':      2000 obs. of  13 variables:
  ..$ ftlm  : int [1:2000] 3 3 3 3 3 2 1 1 1 1 ...
  ..$ stlm  : int [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ rbci  : int [1:2000] 2 2 2 2 2 2 1 1 1 1 ...
  ..$ pcci  : int [1:2000] 2 2 1 2 2 1 1 2 3 2 ...
  ..$ v.fatl: int [1:2000] 4 4 3 4 3 1 1 2 3 2 ...
  ..$ v.satl: int [1:2000] 1 1 1 2 2 1 1 1 1 1 ...
  ..$ v.rftl: int [1:2000] 1 2 2 2 2 2 2 2 1 1 ...
  ..$ v.rstl: int [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ v.ftlm: int [1:2000] 2 2 1 1 1 1 1 1 2 1 ...
  ..$ v.stlm: int [1:2000] 1 1 1 2 2 1 1 1 1 1 ...
  ..$ v.rbci: int [1:2000] 4 4 3 3 2 1 1 2 3 2 ...
  ..$ v.pcci: int [1:2000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ Class : Factor w/ 2 levels "-1","1": 2 2 2 2 2 1 1 1 1 1 ...

What does the first slot tell us?

Three variables were not split, because their values are not connected with the goal variable. These are variables 2, 8 and 12 (stlm, v.rstl, v.pcci). They can be removed without loss of quality of the data set. Note that these variables were defined as irrelevant in the previous part of the article.

Four variables are divided into two classes, three variables are divided into three classes and two variables are divided into four classes.

Segment the val/test sets using the cutoff points obtained from the train set. For that, remove the unsplit variables from the train set and save the rest in the train.d dataframe. Then use the findInterval() function to label the test/val sets with the cutoff points obtained earlier.

evalq(
  {
    mdlp.train$cutp %>% 
    lapply(., function(x) is.numeric(x)) %>%
    unlist -> idx   # bool
    #----train-----------------
    mdlp.train$Disc.data[ ,idx] -> train.d
    #---test------------
    DT$test %>% 
      select(-c(Data, Class)) %>%
      as.data.frame() -> test.d
  
    foreach(i = 1:length(idx), .combine = 'cbind') %do% {
      if (idx[i]) {findInterval(test.d[ ,i], 
                   vec = mdlp.train$cutp[[i]],
                   rightmost.closed = FALSE, 
                   all.inside = F,
                   left.open = F)}
    } %>% as.data.frame() %>% add(1) %>%
      cbind(., DT$test$Class) -> test.d
    colnames(test.d) <- colnames(train.d)
    #-----val-----------------
    DT$val %>% 
      select(-c(Data, Class)) %>%
      as.data.frame() -> val.d
    foreach(i = 1:length(idx), .combine = 'cbind') %do% {
      if (idx[i]) {findInterval(val.d[ ,i], 
                                vec = mdlp.train$cutp[[i]],
                                rightmost.closed = FALSE, 
                                all.inside = F,
                                left.open = F)}
    } %>% as.data.frame() %>% add(1) %>%
      cbind(., DT$val$Class) -> val.d
    colnames(val.d) <- colnames(train.d)
  },env
)

What do these sets look like?

> env$train.d %>% head()
  ftlm rbci pcci v.fatl v.satl v.rftl v.ftlm v.stlm v.rbci Class
1    3    2    2      4      1      1      2      1      4     1
2    3    2    2      4      1      2      2      1      4     1
3    3    2    1      3      1      2      1      1      3     1
4    3    2    2      4      2      2      1      2      3     1
5    3    2    2      3      2      2      1      2      2     1
6    2    2    1      1      1      2      1      1      1    -1
> env$test.d %>% head()
  ftlm rbci pcci v.fatl v.satl v.rftl v.ftlm v.stlm v.rbci Class
1    1    1    1      2      1      1      1      1      2    -1
2    1    1    3      3      1      1      2      2      3    -1
3    1    1    2      2      1      1      1      2      2    -1
4    2    1    2      3      1      1      2      2      3     1
5    2    2    2      3      1      1      1      2      3     1
6    2    2    2      4      1      1      2      2      3     1
> env$val.d %>% head()
  ftlm rbci pcci v.fatl v.satl v.rftl v.ftlm v.stlm v.rbci Class
1    2    2    2      2      2      2      1      2      2     1
2    2    2    2      2      2      2      1      2      2     1
3    2    2    2      3      2      2      1      2      2     1
4    2    2    2      4      2      2      2      2      3     1
5    2    2    2      3      2      2      1      2      2     1
6    2    2    2      3      2      2      2      2      2     1

> env$train.d$v.fatl %>% table()
.
  1   2   3   4 
211 693 519 577 
> env$test.d$v.fatl %>% table()
.
  1   2   3   4 
 49 376 313 262 
> env$val.d$v.fatl %>% table()
.
  1   2   3   4 
 68 379 295 258 

Further use of the sets with discrete data depends on the model. If this is a neural network, the predictors will have to be transformed into dummy variables, as the sketch below shows.
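
A hedged sketch of such a conversion with dummyVars()::caret (train.f, dv and train.dum are illustrative names):

require(caret)
evalq({
  train.f <- train.d
# the integer codes must become factors, otherwise dummyVars treats them as numeric
  train.f[ , -ncol(train.f)] <- lapply(train.f[ , -ncol(train.f)], as.factor)
  dv <- dummyVars(Class ~ ., data = train.f)
# one 0/1 column per level of every predictor
  train.dum <- predict(dv, train.f) %>% as.data.frame()
}, env)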

How well do these variables divide the classes, and how well do they correlate with the goal variable? Let us make these relationships visual with cross_plot()::funModeling, which shows how an input variable correlates with the goal variable, computing the likelihood coefficient for each range of the input. Let us consider the three variables v.fatl, ftlm and v.satl, divided into 4, 3 and 2 ranges respectively. Plot the charts:

evalq(
  cross_plot(data = train.d, 
             str_input = c("v.fatl", "ftlm", "v.satl"), 
             str_target = "Class", 
             auto_binning = F,
             plot_type = "both"), #'quantity' 'percentual'
  env
  )


Fig.16. Cross-plot of the v.fatl/Class variable


Fig.17. Cross-plot of the ftlm/Class variable


Fig.18. Cross-plot of the v.satl/Class variable

You can see that the predictors correlate well with the levels of the goal variable and have well-pronounced thresholds dividing the levels of the Class variable.

Predictors can also simply be divided into equal areas (in a non-optimal way) to see whether they still correlate with the goal variable. Let us divide the three previous variables and two bad ones (stlm, v.rstl) from the train set into 10 equal-frequency areas and look at their cross-plots with the goal variable:

evalq(
  cross_plot(
      DT$train  %>% select(-Data) %>%
      select(c(v.satl, ftlm, v.fatl, stlm, v.rstl, Class)) %>%
      as.data.frame(), 
      str_input = Cs(v.satl, ftlm, v.fatl, stlm, v.rstl), 
      str_target = "Class", 
      auto_binning = T,
      plot_type = "both"), #'quantity' 'percentual'
  env
)

Plot five charts of these variables:


Fig.19. Cross-plot of the v.satl variable (10 areas) vs Class


Fig.20. Cross-plot of the ftlm variable (10 areas) vs Class


Fig.21. Cross-plot of the v.fatl variable (10 areas) vs Class


Fig.22. Cross-plot of the stlm variable (10 areas) vs Class


Fig.23. Cross-plot of the v.rstl variable (10 areas) vs Class

It is clear from the diagrams that even when the variables are divided into 10 discrete equal-frequency areas, v.fatl, ftlm and v.satl keep a well-pronounced threshold dividing the levels of the goal variable. It is also clear why the two other variables (stlm, v.rstl) are irrelevant. This is an efficient way to identify the importance of predictors. We will come back to it later in this article.

Now, let us see how the input variables correlate with the goal variable by comparing them using the Bayesian posterior conversion rate. This is useful for comparing categorical values that do not have an internal order. For that, we will use the bayesian_plot()::funModeling function. Let us take the v.fatl, ftlm and v.satl variables from the train.d, val.d and test.d sets.

#------BayesTrain-------------------
evalq(
  {
    bayesian_plot(train.d, input = "v.fatl", 
                  target = "Class", 
                  title = "Bayesian comparison train$v.fatl/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
evalq(
  {
    bayesian_plot(train.d, input = "ftlm", 
                  target = "Class", 
                  title = "Bayesian comparison train$ftlm/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
evalq(
  {
    bayesian_plot(train.d, input = "v.satl", 
                  target = "Class", 
                  title = "Bayesian comparison train$v.satl/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
#------------BayesTest------------------------
evalq(
  {
    bayesian_plot(test.d, input = "v.fatl", 
                  target = "Class", 
                  title = "Bayesian comparison test$v.fatl/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
evalq(
  {
    bayesian_plot(test.d, input = "ftlm", 
                  target = "Class", 
                  title = "Bayesian comparison test$ftlm/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
evalq(
  {
    bayesian_plot(test.d, input = "v.satl", 
                  target = "Class", 
                  title = "Bayesian comparison test$v.satl/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
#-------------BayesVal---------------------------------
evalq(
  {
    bayesian_plot(val.d, input = "v.fatl", 
                  target = "Class", 
                  title = "Bayesian comparison val$v.fatl/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
evalq(
  {
    bayesian_plot(val.d, input = "ftlm", 
                  target = "Class", 
                  title = "Bayesian comparison val$ftlm/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
evalq(
  {
    bayesian_plot(val.d, input = "v.satl", 
                  target = "Class", 
                  title = "Bayesian comparison val$v.satl/Class",
                  plot_all = F, extra_above = 5, 
                  extra_under = 5)
  },env
)
#------------------------------------------


Fig.24. Bayesian comparison of variables with the goal variable in the train set


Fig.25. Bayesian comparison of variables with the goal variable in the val set


Fig.26. Bayesian comparison of variables with the goal variable in the test set

We can see that the correlation of the predictors with the goal variable drifts more in the variables with four levels, and less in the variables with two. Going forward, it will be useful to check how the accuracy of the model is affected by using only two-range predictors.

The same task of splitting a variable into areas closely fitting the levels of the goal variable can be solved another way - by using the smbinning package. You can check it yourself. The previous article considers another interesting method of discretization. It can be implemented using the "RoughSets" package.

Discretization is an efficient method of transforming predictors. Unfortunately, not all the models can work with factor predictors.


1.2. Creating new features

Creating a variable is the process of building new variables from the existing ones. Let us look at a data set where date (dd-mm-yy) is an input variable. We can create new variables that are better connected with the goal variable: day, month, year, day of the week. This step is used to reveal hidden relationships in the variable.

Creating derivative variables refers to the process of creating a new variable using a set of functions and various methods from an existing variable. The type of the variable to create depends on the curiosity of the business analyst, the set of hypotheses and their theoretical knowledge. The choice of methods is extensive. Taking the logarithm, segmentation, raising to the nth power are only few examples of the transformation methods. 

Creating dummy variables is another popular method of working with variables. Usually, dummy variables are used to transform categorical variables into numerical ones. A dummy variable takes only the values 0 and 1. For a categorical variable with more than two classes, we can create N or N - 1 dummy variables, as the toy example below shows.
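
A toy illustration of the N and N - 1 encodings (the factor f is hypothetical):

f <- factor(c("up", "down", "flat", "up"))
model.matrix(~ f - 1)   # N dummy columns, one per level
model.matrix(~ f)       # intercept plus N - 1 dummy columns (first level dropped)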

In this article we discuss situations that we come across on a daily basis as analysts. Listed below are several ways to extract maximum information from a data set.

  1. Use the values of date and time as variables. New variables can be created by taking into account differences in date and time.
  2. Create new ratios and proportions. Instead of storing past inputs and outputs in the data set, their ratios can be included; these may have greater significance (see the sketch after this list).
  3. Use standard transformations. Looking at the fluctuations and areas of the variable together with the output, we can see if correlation will improve after basic transformations. Most frequently used transformations are log, exp, square and trigonometric variations.
  4. Check variables for seasonality and create a model for the required period (week, month, session etc).
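
As a hedged example of point 2, ratio features could be built from the OHLC columns of the pr set (hl.range and co.ratio are illustrative names; the small constant avoids division by zero):

require(dplyr)
evalq(pr %>% mutate(hl.range = (High - Low) / Close,        # relative bar range
                    co.ratio = (Close - Open) / (High - Low + 1e-8)) -> pr.rat,
      env)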

It is intuitive that the behavior of the market on Monday differs from its behavior on Wednesday and Thursday. This means that the day of the week is an important feature. The time of day is equally important for the market: it defines whether the Asian, European or American session is active. How can we define these features?
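
Before turning to the package used below, here is a hedged sketch of the session feature; the hour boundaries (server time) are assumptions for illustration only:

require(dplyr)
evalq(pr %>% mutate(session = cut(lubridate::hour(Data),
                                  breaks = c(-1, 7, 15, 23),
                                  labels = c("Asian", "European", "American"))) -> pr.sess,
      env)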

We will use the timekit package for that. tk_augment_timeseries_signature() is the central function of the package. It augments the time index of the initial pr data set with a whole series of time-based features, which can be useful both as additional features and as grouping parameters. What are these data?

  • Index: the index (timestamp) value
  • Index.num: the numeric value of the index in seconds, base "1970-01-01 00:00:00"
  • diff: the difference in seconds from the previous index value
  • year: year, index component
  • half: half-year, index component
  • quarter: quarter, index component
  • month: month, index component with base 1
  • month.xts: month, index component with base 0, as implemented in xts
  • month.lbl: month label as an ordered factor, starting in January and ending in December
  • day: day, index component
  • hour: hour, index component
  • minute: minute, index component
  • second: second, index component
  • hour12: hour component on the 12-hour scale
  • am.pm: morning (am) = 1, afternoon (pm) = 2
  • wday: day of the week with base 1; Sunday = 1, Saturday = 7
  • wday.xts: day of the week with base 0, as implemented in xts; Sunday = 0, Saturday = 6
  • wday.lbl: day-of-week label as an ordered factor, starting on Sunday and ending on Saturday
  • mday: day of the month
  • qday: day of the quarter
  • yday: day of the year
  • mweek: week of the month
  • week: number of the week in the year (starts on Sunday)
  • week.iso: ISO week number of the year (starts on Monday)
  • week2: modulus for fortnightly frequency
  • week3: modulus for three-week frequency
  • week4: modulus for four-week frequency

Let us take the initial pr data set, augment it with the tk_augment_timeseries_signature() function, add the mday, wday.lbl and hour variables to the initial set, remove the incomplete rows (NA) and group the data by days of the week.

evalq(
  {
    tk_augment_timeseries_signature(pr) %>%
    select(c(mday, wday.lbl,  hour)) %>% 
    cbind(pr, .) -> pr.augm
    pr.compl <- pr.augm[complete.cases(pr.augm), ]
    pr.nest <- pr.compl %>% group_by(wday.lbl) %>% nest() 
  },
  env)
> str(env$pr.augm)
'data.frame':   8000 obs. of  33 variables:
 $ Data    : POSIXct, format: "2017-01-10 11:00:00" ...
 $ Open    : num  123 123 123 123 123 ...
 $ High    : num  123 123 123 123 123 ...
 $ Low     : num  123 123 123 123 123 ...
 $ Close   : num  123 123 123 123 123 ...
 $ Vol     : num  3830 3360 3220 3241 3071 ...
 ..................................................
 $ zigz    : num  123 123 123 123 123 ...
 $ dz      : num  NA -0.0162 -0.0162 -0.0162 -0.0162 ...
 $ sig     : num  NA -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ mday    : int  10 10 10 10 10 10 10 10 10 10 ...
 $ wday.lbl: Ord.factor w/ 7 levels "Sunday"<"Monday"<..: 3 3 3 3 3 3 3 3 3 3 ...
 $ hour    : int  11 11 11 11 12 12 12 12 13 13 ...

The same result can be obtained with the lubridate library, also removing the data for Saturday.

require(lubridate)
evalq({pr %>% mutate(.,
                     wday = wday(Data), # numeric day of the week: Sunday = 1, ..., Saturday = 7
                     day = day(Data),
                     hour = hour(Data)) %>%
    filter(wday != 7) -> pr1   # drop Saturday; comparing with "Sat" only works for labeled wday
  pr1.nest <- pr1 %>% na.omit %>% 
    group_by(wday) %>% nest()}, 
  env
)
#-------
str(env$pr1)
'data.frame':   7924 obs. of  33 variables:
 $ Data  : POSIXct, format: "2017-01-10 11:00:00" ...
 $ Open  : num  123 123 123 123 123 ...
 $ High  : num  123 123 123 123 123 ...
 $ Low   : num  123 123 123 123 123 ...
 $ Close : num  123 123 123 123 123 ...
 $ Vol   : num  3830 3360 3220 3241 3071 ...
 ..........................................
 $ zigz  : num  123 123 123 123 123 ...
 $ dz    : num  NA -0.0162 -0.0162 -0.0162 -0.0162 ...
 $ sig   : num  NA -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ wday  : int  3 3 3 3 3 3 3 3 3 3 ...
 $ day   : int  10 10 10 10 10 10 10 10 10 10 ...
 $ hour  : int  11 11 11 11 12 12 12 12 13 13 ...

Data grouped by the days of the week look as follows (Sunday = 1, Monday = 2 and so on):

> env$pr1.nest
# A tibble: 5 × 2
   wday                  data
  <int>                <list>
1     4 <tibble [1,593 × 32]>
2     5 <tibble [1,632 × 32]>
3     6 <tibble [1,624 × 32]>
4     2 <tibble [1,448 × 32]>
5     3 <tibble [1,536 × 32]>

Additionally, the dL and dH variables from the pr data set, taken over the last three bars, can be used as features.
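
A hedged sketch of this idea (assuming pr contains the dL and dH columns from the first part; the lag names are illustrative):

require(dplyr)
evalq(pr %>% mutate(dL1 = lag(dL, 1), dL2 = lag(dL, 2), dL3 = lag(dL, 3),
                    dH1 = lag(dH, 1), dH2 = lag(dH, 2), dH3 = lag(dH, 3)) -> pr.lag,
      env)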

2. Choosing predictors

There are many ways and criteria for assessing the importance of predictors. Some of them were considered in the previous articles. Since the emphasis in this article is on visualization, let us compare one visual and one analytical method of evaluating the importance of predictors.

2.1. Visual evaluation

We will use the smbinning package. Earlier, we used the funModeling package to assess predictors and came to the conclusion that visualizing a relationship is a simple and reliable way of identifying the relevance of predictors. We are going to test how the smbinning package handles normalized and transformed data, and find out how the transformation of predictors impacts their importance.

Gather in one set log-transformed, sin-transformed, tanh-transformed and normalized data and assess the dependence of the goal variable and predictors in these sets.

The sequence of processing of the primary set (shown on the diagram below) is as follows: clean the raw data in dataSet (without removing the highly correlated data), divide dataSet into the train/val/test sets to get the DT set, and then carry out the actions for each type of transformation according to the block diagram. Let us gather everything in one script:


Fig.27. Block diagram of preliminary processing

Clean the set, divide it into train/val/test sets and remove unnecessary data:

#----Clean---------------------
require(caret)
require(pipeR)
evalq(
  {
    train = 1:2000
    val = 2001:3000
    test = 3001:4000
    DT <- list()
    dataSet %>%
      preProcess(., method = c("zv", "nzv", "conditionalX")) %>%
      predict(., dataSet) %>%
      na.omit -> dataSetClean
    list(train = dataSetClean[train, ], 
         val = dataSetClean[val, ], 
         test = dataSetClean[test, ]) -> DT
    rm(dataSetClean, train, val, test)
  }, 
  env)

Process all outliers in all sets:

#------outlier-------------
evalq({
# define the new list for the result
  DTcap <- list()
# go through the three sets
  foreach(i = 1:3) %do% {
    DT[[i]] %>% 
# remove the (Data, Class) columns
      select(-c(Data, Class)) %>%
# transform into data.frame and store in the temporary variable x
      as.data.frame() -> x
    if (i == 1) {
# define the outlier parameters on the first (train) set only;
# the inner loop uses j so as not to clobber the outer index i
      foreach(j = 1:ncol(x), .combine = "cbind") %do% {
        prep.outlier(x[ ,j]) %>% unlist()
      } -> pre.outl
      colnames(pre.outl) <- colnames(x)
    } 
# impute the outliers with the 5/95% caps and store the result in x.cap
    foreach(j = 1:ncol(x), .combine = "cbind") %do% {
      stopifnot(exists("pre.outl", envir = env))
      lower = pre.outl['lower.25%', j] 
      upper = pre.outl['upper.75%', j]
      med = pre.outl['med', j]
      cap1 = pre.outl['cap1.5%', j] 
      cap2 = pre.outl['cap2.95%', j] 
      treatOutlier(x = x[ ,j], impute = T, fill = T, 
                   lower = lower, upper = upper, 
                   med = med, cap1 = cap1, cap2 = cap2) 
      } %>% as.data.frame() -> x.cap
    colnames(x.cap) <- colnames(x)
    return(x.cap)
  } -> DTcap
# remove unnecessary variables
  rm(lower, upper, med, cap1, cap2, x.cap, x)
}, env)

Transform the variables in all DTcap sets (now without outliers) with the log2(x + 1) function. Get the DTLn list with three sets of log-transformed variables.

#------logtrans------------
evalq({
  DTLn <- list()
  foreach(i = 1:3) %do% {
    DTcap[[i]] %>% 
      apply(., 2, function(x) log2(x + 1)) %>%
      as.data.frame() %>%
      cbind(., Class = DT[[i]]$Class)
  } -> DTLn
},
env)

Transform the variables in all DTcap sets with the sin(2*pi*x) function. Get the DTSin list with three sets of sin-transformed variables.

#------sintrans--------------
evalq({
  DTSin <- list()
  foreach(i = 1:3) %do% {
    DTcap[[i]] %>% 
      apply(., 2, function(x) sin(2*pi*x)) %>%
      as.data.frame() %>%
      cbind(., Class = DT[[i]]$Class)
  } -> DTSin
},
env)

Transform the variables in all DTcap sets with the tanh(x) function. Get the DTTanh list with three sets of tanh-transformed variables.

#------tanhTrans----------
evalq({
  DTTanh <- list()
  foreach(i = 1:3) %do% {
    DTcap[[i]] %>% 
      apply(., 2, function(x) tanh(x)) %>%
      as.data.frame() %>%
      cbind(., Class = DT[[i]]$Class)
  } -> DTTanh
},
env)

Normalize the DT, DTLn, DTSin, DTTanh sets.

#------normalize-----------
evalq(
  {
# define parameters of normalization
    preProcess(DT$train, method = "spatialSign") -> preproc 
    list(train = predict(preproc, DT$train), 
         val = predict(preproc, DT$val),
         test = predict(preproc, DT$test)
 ) -> DTn
  }, 
  env) 
#--ln---
evalq(
  {
    preProcess(DTLn[[1]], method = "spatialSign") -> preprocLn 
    list(train = predict(preprocLn, DTLn[[1]]), 
         val = predict(preprocLn, DTLn[[2]]),
         test = predict(preprocLn, DTLn[[3]])
    ) -> DTLn.n
  }, 
  env)
#---sin---
evalq(
  {
    preProcess(DTSin[[1]], method = "spatialSign") ->  preprocSin 
    list(train = predict(preprocSin, DTSin[[1]]), 
         val = predict(preprocSin, DTSin[[2]]),
         test = predict(preprocSin, DTSin[[3]])
    ) -> DTSin.n
  }, 
  env)
#-----tanh-----------------
evalq(
  {
    preProcess(DTTanh[[1]], method = "spatialSign") -> preprocTanh 
    list(train = predict(preprocTanh, DTTanh[[1]]), 
         val = predict(preprocTanh, DTTanh[[2]]),
         test = predict(preprocTanh, DTTanh[[3]])
    ) -> DTTanh.n
  }, 
  env)

Use the mdlp()::discretization function to discretize the DT set:

##------discretize----------
#--------preCut---------------------
# define the cutpoints
require(pipeR)
require(discretization)
evalq(
  #require(pipeR) 
# takes some time !
  pipeline({
    DT$train
    select(-Data)
    as.data.frame()
    mdlp() 
  }) -> mdlp.train, 
  env)
#-------cut_opt----------
evalq(
  {
    DTd <- list()
    mdlp.train$cutp %>% 
# define the columns that have to be discretized
      lapply(., function(x) is.numeric(x)) %>%
      unlist -> idx   # bool
    #----train-----------------
    mdlp.train$Disc.data[ ,idx] -> DTd$train 
    #---test------------
    DT$test %>% 
      select(-c(Data, Class)) %>%
      as.data.frame() -> test.d
# rearrange data according to calculated ranges    
    foreach(i = 1:length(idx), .combine = 'cbind') %do% {
      if (idx[i]) {
        findInterval(test.d[ ,i], 
        vec = mdlp.train$cutp[[i]],
        rightmost.closed = FALSE, 
        all.inside = F,
        left.open = F)
        }
    } %>% as.data.frame() %>% add(1) %>%
      cbind(., DT$test$Class) -> DTd$test
    colnames(DTd$test) <- colnames(DTd$train)
    #-----val-----------------
    DT$val %>% 
      select(-c(Data, Class)) %>%
      as.data.frame() -> val.d
# rearrange data according to calculated ranges  
    foreach(i = 1:length(idx), .combine = 'cbind') %do% {
      if (idx[i]) {
        findInterval(val.d[ ,i], 
        vec = mdlp.train$cutp[[i]],
        rightmost.closed = FALSE, 
        all.inside = F,
        left.open = F)
        }
    } %>% as.data.frame() %>% add(1) %>%
      cbind(., DT$val$Class) -> DTd$val 
    colnames(DTd$val) <- colnames(DTd$train)
# tidy up
    rm(test.d, val.d)
  }, 
  env
)

Let us recall what variables the original data set DT$train contains:

require(funModeling)
plot_num(env$DT$train %>% select(-Data), bins = 20)


Fig.28. Distribution of variables in the DT$train data set

Use the capabilities of the smbinning package to identify relevant predictors in the train subsets of all the normalized data sets obtained earlier (DTn, DTLn.n, DTSin.n and DTTanh.n). The goal variable in this package must be numeric and have the values (0, 1). Let us write a function for the necessary conversions.

#--------------------------------
require(smbinning)
targ.int <- function(x){
  x %>% tbl_df() %>%
  mutate(Cl = (as.numeric(Class) - 1) %>%
           as.integer()) %>%
  select(-Class) %>% as.data.frame()
}

Additionally, this package does not accept variables that have a dot in the name. The function below will rename all variables with a dot into variables with an underscore.

renamepr <- function(X){
  X %<>% rename(v_fatl = v.fatl,
               v_satl = v.satl,
               v_rftl = v.rftl,
               v_rstl = v.rstl,
               v_ftlm = v.ftlm,
               v_stlm = v.stlm,
               v_rbci = v.rbci,
               v_pcci = v.pcci)
  return(X)
}
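
The same can be done generically in one line, assuming every dot in a predictor name should become an underscore:

renamepr2 <- function(X){
  colnames(X) <- gsub(".", "_", colnames(X), fixed = TRUE)
  return(X)
}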

Calculate and plot charts with relevant predictors.

par(mfrow = c(2,2))
#--Ln--------------
evalq({
  df <- renamepr(DTLn.n[[1]]) %>% targ.int
  sumivt.ln.n = smbinning.sumiv(df = df, y = 'Cl')
  smbinning.sumiv.plot(sumivt.ln.n, cex = 0.7)
  rm(df)
}, 
env)
#---Sin-----------------
evalq({
  df <- renamepr(DTSin.n[[1]]) %>% targ.int
  sumivt.sin.n = smbinning.sumiv(df = df, y = 'Cl')
  smbinning.sumiv.plot(sumivt.sin.n, cex = 0.7)
  rm(df)
  }, 
env)
#---norm-------------
evalq({
  df <- renamepr(DTn[[1]]) %>% targ.int
  sumivt.n = smbinning.sumiv(df = df, y = 'Cl')
  smbinning.sumiv.plot(sumivt.n, cex = 0.7)
  rm(df)
  }, 
env)
#-----Tanh----------------
evalq({
  df <- renamepr(DTTanh.n[[1]]) %>% targ.int
  sumivt.tanh.n = smbinning.sumiv(df = df, y = 'Cl')
  smbinning.sumiv.plot(sumivt.tanh.n, cex = 0.7)
  rm(df)
  }, 
env)
par(mfrow = c(1,1))


Fig.29. Importance of predictors in the train subset of normalized sets

The five predictors v_fatl, ftlm, v_satl, rbci and v_rbci are strong in all sets, though their order differs. The four predictors pcci, v_ftlm, v_stlm and v_rftl are of average strength. The predictors v_pcci and stlm are weak. The importance values can be listed for each set in decreasing order:

env$sumivt.ln.n
     Char     IV               Process
5  v_fatl 0.6823    Numeric binning OK
1    ftlm 0.4926    Numeric binning OK
6  v_satl 0.3737    Numeric binning OK
3    rbci 0.3551    Numeric binning OK
11 v_rbci 0.3424    Numeric binning OK
10 v_stlm 0.2591    Numeric binning OK
4    pcci 0.2440    Numeric binning OK
9  v_ftlm 0.2023    Numeric binning OK
7  v_rftl 0.1442    Numeric binning OK
12 v_pcci 0.0222    Numeric binning OK
2    stlm     NA No significant splits
8  v_rstl     NA No significant splits

The last three variables can be discarded. This way, only the five strongest and the four average ones are left. Let us define the names of the best variables (IV > 0.1):

evalq(sumivt.sin.n$Char[sumivt.sin.n$IV > 0.1] %>% 
        na.omit %>% as.character() -> best.sin.n, 
      env)
> env$best.sin.n
[1] "v_fatl" "ftlm"   "rbci"   "v_rbci" "v_satl" "pcci"  
[7] "v_ftlm" "v_stlm" "v_rftl"

Let us look at the v_fatl and ftlm variables in more detail.

evalq({
    df <- renamepr(DTTanh.n[[1]]) %>% targ.int
    x = 'v_fatl'
    y = 'Cl'
    res <- smbinning(df = df, 
                        y = y,
                        x = x) 
  #res$ivtable # Tabulation and Information Value
  #res$iv # Information value
  #res$bands # Bins or bands
  #res$ctree  # Decision tree from partykit
  par(mfrow = c(2,2))
  sub = paste0(x, "  vs  ", y) #rbci vs Cl"
  boxplot(df[[x]]~df[[y]],
          horizontal = TRUE, 
          frame = FALSE, col = "lightblue",
          main = "Distribution")
  mtext(sub,3) #ftlm
  smbinning.plot(res, option = "dist",
                 sub = sub) #"pcci vs Cl")
  smbinning.plot(res, option = "goodrate", #"badrate"
                 sub = sub) #"pcci vs Cl")
  smbinning.plot(res, option = "WoE",
                 sub = sub) #"pcci vs Cl")
  par(mfrow = c(1, 1))
}, env)


Fig.30. Connection of the ranges of the v_fatl variable with the Cl goal variable

Along with useful information, the res object contains the points of dividing the variable into ranges optimally connected with the goal variable. In our particular case there are four ranges.

> env$res$cuts
[1] -0.3722 -0.0433  0.1482

We are going to do the same calculation for the ftlm variable and plot charts:


Fig.31. Connection of the ranges of the ftlm variable with the Cl goal variable

Range cutoff points:

> env$res$cuts
[1] -0.2084 -0.0150  0.2216

The cutoff points will allow us to discretize the variables in our sets and see how much the following items differ:

  • how the important variables defined earlier with the mdlp::discretization function differ from those defined with the smbinning::smbinning function;
  • splitting variables into ranges.

We already have one data set (DTd) discretized with the mdlp::discretization function. We are going to do the same, but this time using the smbinning::smbinning function and for the train subset only.

Define the cut points:

evalq({
  res <- list()
  DT$train %>% renamepr() %>% targ.int() -> df
  x <- colnames(df)
  y <- "Cl"
  foreach(i = 1:(ncol(df) - 1)) %do% {
    smbinning(df, y = y, x = x[i])
  } -> res
  res %>% lapply(., function(x) x[1] %>% is.list) %>%
    unlist -> idx
}, env) 

Discretize the DT$train subset:

evalq({
  DT1.d <- list()
  DT$train %>% renamepr() %>% 
    targ.int() %>% select(-Cl) -> train
  foreach(i = 1:length(idx), .combine = 'cbind') %do% {
    if (idx[i]) {
      findInterval(train[ ,i], 
                   vec = res[[i]]$cuts,
                   rightmost.closed = FALSE, 
                   all.inside = F,
                   left.open = F)
    }
  } %>% as.data.frame() %>% add(1) %>%
    cbind(., DT$train$Class) -> DT1.d$train
  colnames(DT1.d$train) <- colnames(train)[idx] %>%
    c(., 'Class')
},
env)

Identify the best variables with the importance greater than 0.1 in increasing order:

evalq({
  DT$train %>% renamepr() %>% targ.int() -> df
  sumivt.dt1.d = smbinning.sumiv(df = df, y = 'Cl')
  sumivt.dt1.d$Char[sumivt.dt1.d$IV > 0.1] %>% 
    na.omit %>% as.character() -> best.dt1.d
  rm(df)
}, 
env)

Plot a chart of the split variables in the DTd$train set:

require(funModeling)
plot_num(env$DTd$train)


Fig.32. Variables of the DTd$train set discretized with the mdlp function

The chart of the DT1.d set with all variables and with the best ones is shown below.

plot_num(env$DT1.d$train)


Fig.33. Variables of the DT1.d$train set discretized with the smbinning function

plot_num(env$DT1.d$train[ ,env$best.dt1.d])


Fig.34. Variables of the DT1.d$train set discretized with the smbinning function (the best are arranged in increasing order of information importance)

What can we see on the charts? The variables defined as important are the same in both cases, but the splitting into ranges differs. It has to be tested which variant gives a better prediction with a model.

2.2. Analytical evaluation

There are many analytical methods to identify the importance of predictors by various criteria. We considered some of them earlier. Now, I would like to test an unusual approach to the choice of predictors.

We are going to use the varbvs package, which implements fast algorithms for fitting Bayesian variable selection models and computing Bayes factors, where the outcome (goal variable) is modeled with linear or logistic regression. The algorithms are based on the variational approximations described in "Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies" (P. Carbonetto and M. Stephens, Bayesian Analysis 7, 2012, pages 73-108). This software has been used with large data sets containing more than a million variables and with thousands of samples.

The varbvs() function takes a matrix of predictors as input, and the goal variable must be a numeric vector (0, 1). Using this method, let us test which predictors are defined as important in our set with normalized data, DTTanh.n$train.

require(varbvs)
evalq({
  train <- DTTanh.n$train %>% targ.int() %>%  as.matrix()
  fit <- varbvs(X = train[ ,-ncol(train)] , 
                Z = NULL,
                y = train[ ,ncol(train)] %>% as.vector(),
                "binomial", 
                logodds = seq(-2,-0.5,0.1),
                optimize.eta = T,
                initialize.params = T,
                verbose = T, nr = 100
                )
  print(summary(fit))
}, env)

Welcome to           --       *                              *               
VARBVS version 2.0.3 --       |              |               |               
large-scale Bayesian --       ||           | |    |          || |     |   |  
variable selection   -- |     || | |    |  | ||  ||        |||| ||    |   || 
****************************************************************************
Copyright (C) 2012-2017 Peter Carbonetto.
See http://www.gnu.org/licenses/gpl.html for the full license.
Fitting variational approximation for Bayesian variable selection model.
family:     binomial   num. hyperparameter settings: 16 
samples:    2000       convergence tolerance         1.0e-04
variables:  12         iid variable selection prior: yes 
covariates: 0          fit prior var. of coefs (sa): yes 
intercept:  yes        fit approx. factors (eta):    yes 
Finding best initialization for 16 combinations of hyperparameters.
-iteration-   variational    max.   incl variance params
outer inner   lower bound  change   vars   sigma      sa
 0016 00018 -1.204193e+03 6.1e-05 0003.3      NA 3.3e+00
Computing marginal likelihood for 16 combinations of hyperparameters.
-iteration-   variational    max.   incl variance params
outer inner   lower bound  change   vars   sigma      sa
 0016 00002 -1.204193e+03 3.2e-05 0003.3      NA 3.3e+00
Summary of fitted Bayesian variable selection model:
family:     binomial   num. hyperparameter settings: 16
samples:    2000       iid variable selection prior: yes
variables:  12         fit prior var. of coefs (sa): yes
covariates: 1          fit approx. factors (eta):    yes
maximum log-likelihood lower bound: -1204.1931
Hyperparameters: 
        estimate Pr>0.95             candidate values
sa          3.49 [3.25,3.6]          NA--NA
logodds    -0.75 [-1.30,-0.50]       (-2.00)--(-0.50)
Selected variables by probability cutoff:
>0.10 >0.25 >0.50 >0.75 >0.90 >0.95 
    3     3     3     3     3     3 
Top 5 variables by inclusion probability:
  index variable   prob PVE coef*  Pr(coef.>0.95)
1     1     ftlm 1.0000  NA 2.442 [+2.104,+2.900]
2     4     pcci 1.0000  NA 2.088 [+1.763,+2.391]
3     3     rbci 0.9558  NA 0.709 [+0.369,+1.051]
4    10   v.stlm 0.0356  NA 0.197 [-0.137,+0.529]
5     6   v.satl 0.0325  NA 0.185 [-0.136,+0.501]
*See help(varbvs) about interpreting coefficients in logistic regression.

As you can see, the five best predictors have been identified (ftlm, pcci, rbci, v.stlm, v.satl). They are among the top ten that we identified earlier, though in a different order and with different importance weights. Since we already have a fitted model, let us check what result it gives on the validation and test sets.

Validation set:

#-----------------
evalq({
  val <- DTTanh.n$val %>% targ.int() %>%
    as.matrix()
  y = val[ ,ncol(val)] %>% as.vector()
  pr <- predict(fit, X = val[ ,-ncol(val)] , 
                Z = NULL)
  
}, env)
cm.val <- confusionMatrix(table(env$y, env$pr))
> cm.val
Confusion Matrix and Statistics

   
      0   1
  0 347 204
  1 137 312
                                          
               Accuracy : 0.659           
                 95% CI : (0.6287, 0.6884)
    No Information Rate : 0.516           
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3202          
 Mcnemar's Test P-Value : 0.0003514       
                                          
            Sensitivity : 0.7169          
            Specificity : 0.6047          
         Pos Pred Value : 0.6298          
         Neg Pred Value : 0.6949          
             Prevalence : 0.4840          
         Detection Rate : 0.3470          
   Detection Prevalence : 0.5510          
      Balanced Accuracy : 0.6608          
                                          
       'Positive' Class : 0  

The result does not look impressive at all. Test set:

evalq({
  test <- DTTanh.n$test %>% targ.int() %>%
    as.matrix()
  y = test[ ,ncol(test)] %>% as.vector()
  pr <- predict(fit, X = test[ ,-ncol(test)] , 
                Z = NULL)
  
}, env)
cm.test <- confusionMatrix(table(env$y, env$pr))
> cm.test
Confusion Matrix and Statistics

   
      0   1
  0 270 140
  1 186 404
                                        
               Accuracy : 0.674         
                 95% CI : (0.644, 0.703)
    No Information Rate : 0.544         
    P-Value [Acc > NIR] : < 2e-16       
                                        
                  Kappa : 0.3375        
 Mcnemar's Test P-Value : 0.01269       
                                        
            Sensitivity : 0.5921        
            Specificity : 0.7426        
         Pos Pred Value : 0.6585        
         Neg Pred Value : 0.6847        
             Prevalence : 0.4560        
         Detection Rate : 0.2700        
   Detection Prevalence : 0.4100        
      Balanced Accuracy : 0.6674        
                                        
       'Positive' Class : 0  

The result is nearly the same. This means that the model was not overfitted and generalizes well.

So, according to varbvs, the best predictors are ftlm, pcci, rbci, v.stlm and v.satl.
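
For reference, the same list can be extracted programmatically by averaging the posterior inclusion probabilities over the 16 hyperparameter settings. This is only a sketch; it assumes that the varbvs 2.0 fit object exposes the alpha matrix and the logw vector as described in the package documentation:

evalq({
  w <- c(normalizelogweights(fit$logw))   # weights of the 16 hyperparameter settings
  pip <- c(fit$alpha %*% w)               # averaged posterior inclusion probabilities
  names(pip) <- colnames(train)[-ncol(train)]
  sort(pip, decreasing = TRUE)[1:5]       # the five most probable predictors
}, env)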


2.3. Neural network

As we are studying neural networks, let us test which predictors the neural network will select as the most important ones.

We are going to use the FCNN4R package, which provides an interface to the core routines of the FCNN C++ library. FCNN is based on a completely new representation of the neural network, which offers efficiency, modularity and extensibility. FCNN4R enables standard learning algorithms (backpropagation, Rprop, simulated annealing, stochastic gradient) and pruning algorithms (minimum magnitude, Optimal Brain Surgeon), though I see this package above all as an efficient computation engine.

Users can easily implement their own algorithms using the fast gradient methods, along with the network restructuring functionality (removing weights and redundant neurons, reordering the input data and merging networks).

Networks can be exported as C functions so that they can be integrated into any software solution.
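
For example, a trained network could be written out as C source with mlp_export_C(). This is a sketch only: the file name is arbitrary, and the assumed signature is mlp_export_C(fname, net).

require(FCNN4R)
# sketch: export a small random network to a standalone C function
small.net <- mlp_rnd_weights(mlp_net(layers = c(2, 3, 1)), a = 0.2)
mlp_export_C("net_eval.c", small.net)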

Create a fully connected network with two hidden layers. The number of neurons in each layer: input = 12 (the number of predictors), hidden = 8 and 5, output = 1. Initialize the weights with random values in the range +/- 0.17. Set the activation functions of each layer of the neural network (except the input one) to c("tanh", "tanh", "sigmoid"). Prepare the train/val/test sets.

The script below carries out this sequence of actions.

require(FCNN4R)
evalq({
mlp_net(layers = c(12, 8, 5, 1), name = "n.tanh") %>%
  mlp_rnd_weights(a = 0.17) %>% 
  mlp_set_activation(layer = c(2, 3, 4), 
  activation = c("tanh", "tanh", "sigmoid"), #"threshold", "sym_threshold",
                                            #"linear", "sigmoid", "sym_sigmoid",
                                            #"tanh", "sigmoid_approx",
                                            #"sym_sigmoid_approx"), 
                 slope = 0) -> Ntanh #show() 
#-------
train <- DTTanh.n$train %>% targ.int() %>% as.matrix()
test <- DTTanh.n$test %>% targ.int() %>%  as.matrix()
val <- DTTanh.n$val %>% targ.int() %>% as.matrix()
}, env)

We are going to use the rprop training method. Set the constants: tol is the error level at which training must stop once reached, max_ep is the number of epochs after which training must stop, l2reg is the regularization coefficient. Train the network with these parameters and visually assess the resulting network and the training error.

evalq({
  tol <- 1e-1
  max_ep = 1000
  l2reg = 0.0001
net_rp <- mlp_teach_rprop(Ntanh, 
                          input = train[ ,-ncol(train)], 
                          output = train[ ,ncol(train)] %>% as.matrix(),
                          tol_level = tol, 
                          max_epochs = max_ep, 
                          l2reg = l2reg,
                          u = 1.2, d = 0.5, 
                          gmax = 50, gmin = 1e-06, 
                          report_freq = 100)
}, env)
plot(env$net_rp$mse, t = "l", 
     main = paste0("max_epochs =", env$max_ep, " l2reg = ", env$l2reg))


Fig.35. Error in neural network training

evalq(mlp_plot(net_rp$net, FALSE), envir = env)


Fig.36. Structure of the neural network

Prune

Minimum magnitude pruning is a simple to use algorithm: at each step, the weights with the smallest absolute value are switched off. This algorithm requires retraining the network at nearly every step and gives suboptimal results.

evalq({
  tol <- 1e-1
  max_ep = 1000
  l2reg = 0.0001
  mlp_prune_mag(net_rp$net, 
                input = train[ ,-ncol(train)], 
                output = train[ ,ncol(train)] %>% as.matrix(),
                tol_level = tol,  
                max_reteach_epochs = max_ep, 
                report = FALSE,
                plots = TRUE) -> net_rp_prune
  
}, env)


Fig.37. Pruned neural network

We can see that, with this structure of the neural network, the initial setup, the activation functions and the training error, a neural network with the structure (12, 2, 1, 1) is sufficient. Which predictors did the neural network choose?

evalq(
  best <- train %>% tbl_df %>%  select(c(1,5,7,8,10,12)) %>% colnames(),
           env)
env$best
[1] "ftlm"   "v.fatl" "v.rftl" "v.rstl" "v.stlm"
[6] "v.pcci"

The v.rstl and v.pcci variables are not present among the best nine variables defined earlier.

I would like to stress that we have shown here that a neural network can select important predictors independently and automatically. This choice depends not only on the predictors but also on the structure and parameters of the network.

Happy experimenting!

Conclusion

In the following part, we will discuss ways to remove noise examples from the set, how to reduce the size of the inputs and what effect this has, as well as ways to divide the original data into train/val/test sets.

Application

1. You can download the FeatureTransformation.R, FeatureSelect.R, FeatureSelect_analitic.R and FeatureSelect_NN.R scripts, along with the RData file containing the results of the scripts from the first part of this article, from Git /Part_II.


Translated from Russian by MetaQuotes Ltd.
Original article: https://www.mql5.com/ru/articles/3507
