How to calculate the Gini Index? - General

Forester 2022.10.24 13:18 #28001

elibrarius #:

https://www.mql5.com/ru/blogs/post/723619

By the way, does anyone know how to calculate the root of the Gini index (I understand how to calculate the root, but the Gini index itself)? I'd prefer a code example. It would be interesting to experiment with it.
As I noted at the time: "The Gini index and the Gini coefficient are different things; don't confuse them."
For the Gini coefficient we use a random number generator (RNG), i.e. we add noise. I attached to the article the code I found for calculating it in R and Python.
The index is something else.


Aleksey Nikolayev 2022.10.25 15:43 #28002

elibrarius #:

By the way, does anyone know how to calculate the root of the Gini index (I understand how to calculate the root, but the Gini index itself)? I'd prefer an example of the code. It would be interesting to experiment with it.
As I noted at the time: "The Gini index and the Gini coefficient are different things; don't confuse them."
For the Gini coefficient we use a random number generator (RNG), i.e. we add noise. I attached to the article the code I found for calculating it in R and Python.
The index is something else.

There is the Gini coefficient and there is Gini impurity. The first is used as a metric in binary classification (see the article linked below). The second is used in decision trees as an analogue of entropy.

The Gini coefficient. From economics to machine learning

2018.03.06
habr.com

Interesting fact: in 1912 the Italian statistician and demographer Corrado Gini wrote his famous work "Variability and Mutability", and in the same year the Titanic sank in the waters of the Atlantic. What could these two events possibly have in common? It's simple: their consequences have found wide application in machine learning. And if the dataset...
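To make the distinction concrete, here is a minimal base R sketch of both quantities. The function names (gini_impurity, auc_rank, gini_coefficient) are illustrative and are not taken from the linked article. The sketch assumes labels coded as 0/1 for the coefficient and uses the common relation Gini = 2 * AUC - 1, with AUC computed from ranks.

```r
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini_impurity <- function(labels) {
  p <- as.numeric(table(labels)) / length(labels)  # class frequencies
  1 - sum(p^2)
}

# AUC from ranks (Mann-Whitney statistic); y must be coded 0/1
auc_rank <- function(y, score) {
  r  <- rank(score)          # ties get average ranks
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Gini coefficient as a binary classification metric
gini_coefficient <- function(y, score) 2 * auc_rank(y, score) - 1

gini_impurity(c(0, 0, 1, 1))                               # 0.5, a 50/50 node
gini_coefficient(c(0, 0, 1, 1), c(0.1, 0.2, 0.8, 0.9))     # 1, perfect ranking
```

Impurity is a property of one set of labels (a tree node), while the coefficient compares labels with model scores, which is why the two should not be mixed up.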

Aleksey Vyazmikin 2022.10.25 19:10 #28003

mytarmailS, you once made a script that I decided to use again:

library('caret')

way <-         "D:\\FX\\MT5_CB\\MQL5\\Files\\Po_Vektoru_TP_0_SL_0\\EURUSD_0\\Setup\\"  # trailing separator so paste0(way, file.name) forms a valid path
df1 = read.csv("D:\\FX\\MT5_CB\\MQL5\\Files\\Po_Vektoru_TP_0_SL_0\\EURUSD_0\\Setup\\train.csv", header = TRUE, sep = ";",dec = ".")


cor.test.range <- seq(from = 0.1,to = 0.9,by = 0.1)  # range of correlation thresholds to try

get.findCorrelation <- function(data , not.used.colums , cor.coef){
  library('caret')
  df2 <-  cor(     data[, ! colnames(data)  %in%  not.used.colums])  
  not.need <- findCorrelation(df2, cutoff=cor.coef) 
  not.need.nms <- colnames(df2[,not.need])  # names of the variables that failed the correlation test
  reduced_Data <- data[, ! colnames(data)  %in%  not.need.nms]
  return(reduced_Data)}


for(i in 1:length(cor.test.range)){
  
    reduced_Data <- get.findCorrelation(data = df1 , 
                                      not.used.colums = c("Target_100_Buy","Target_100_Sell","Target_P","Time","Target_100"),
                                      cor.coef = cor.test.range[i] )


  
  #reduced_Data <- get.findCorrelation(data = reduced_Data , 
  #                                    not.used.colums = c("Target_100_Buy","Target_100_Sell","Target_P","Time","Target_100"),
  #                                    cor.coef = cor.test.range[i]*-1 )  
    
  file.name <- paste0("train2_" , cor.test.range[i] , ".csv")
  final.way <- paste0(way , file.name)
  
  
  #write.csv2(x = reduced_Data,file = final.way,row.names = F)  # this may be the better option
  
   write.table(reduced_Data, file = final.way,
               append = FALSE, quote = FALSE, sep=";",
               eol = "\n", na = "NA", dec = ".", row.names = FALSE,
               col.names = TRUE, qmethod = c("escape", "double"),
               fileEncoding = "")
}

I ran it on a sample, and it gives an error - I can't understand where to look for the error and how to fix it - maybe you know, since you use these libraries/packages?

R version 4.0.5 (2021-03-31) -- "Shake and Throw"
Copyright (C) 2021 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

[Workspace loaded from F:/FX/R/.RData]

Loading required package: Matrix
Error: package or namespace load failed for ‘Matrix’ in getClassDef(class1):
 reached elapsed time limit
> source('F:/FX/R/Viborka_Korrelyaciya_v_02.R', echo=TRUE)

> library('caret')
Loading required package: lattice
Loading required package: ggplot2

> #way <- "F:\\FX\\Открытие Брокер_Demo\\MQL5\\Files\\Proboy_236_TP_4_SL_4\\Si\\Setup\\"
> #df1 = read.csv("F:\\FX\\Открытие Брокер_Demo\\MQL5\\Files\ ..." ... [TRUNCATED] 

> df1 = read.csv("D:\\FX\\MT5_CB\\MQL5\\Files\\Po_Vektoru_TP_0_SL_0_G_2_Bi_v2\\EURUSD_0\\Setup\\train.csv", header = TRUE, sep = ";",dec = ".")

> cor.test.range <- seq(from = 0.1,to = 0.9,by = 0.1)  # range of correlation thresholds to try

> get.findCorrelation <- function(data , not.used.colums , cor.coef){
+   library('caret')
+   df2 <-  cor(     data[, ! colnames(data)  %in%  not.use .... [TRUNCATED] 

> for(i in 1:length(cor.test.range)){
+   
+     reduced_Data <- get.findCorrelation(data = df1 , 
+                                       not.used.co .... [TRUNCATED] 
Warning messages:
1: package ‘caret’ was built under R version 4.1.0 
2: package ‘ggplot2’ was built under R version 4.1.0 
> source('F:/FX/R/Viborka_Korrelyaciya_v_02.R', echo=TRUE)

> library('caret')

> #way <- "F:\\FX\\Открытие Брокер_Demo\\MQL5\\Files\\Proboy_236_TP_4_SL_4\\Si\\Setup\\"
> #df1 = read.csv("F:\\FX\\Открытие Брокер_Demo\\MQL5\\Files\ ..." ... [TRUNCATED] 

> df1 = read.csv("D:\\FX\\MT5_CB\\MQL5\\Files\\Po_Vektoru_TP_0_SL_0\\EURUSD_0\\Setup\\train.csv", header = TRUE, sep = ";",dec = ".")

> cor.test.range <- seq(from = 0.1,to = 0.9,by = 0.1)  # range of correlation thresholds to try

> get.findCorrelation <- function(data , not.used.colums , cor.coef){
+   library('caret')
+   df2 <-  cor(     data[, ! colnames(data)  %in%  not.use .... [TRUNCATED] 

> for(i in 1:length(cor.test.range)){
+   
+     reduced_Data <- get.findCorrelation(data = df1 , 
+                                       not.used.co .... [TRUNCATED] 
Error in findCorrelation_fast(x = x, cutoff = cutoff, verbose = verbose) : 
  The correlation matrix has some missing values.
In addition: Warning message:
In cor(data[, !colnames(data) %in% not.used.colums]) :
  the standard deviation is zero

Everything worked fine on a binary sample.

mytarmailS 2022.10.26 06:25 #28004

Aleksey Vyazmikin #:

You once made a script that I decided to use again.

I ran it on a sample, and it gives an error - I can't understand where to find the error and how to fix it - maybe you know, since you use these libraries/packages?

Everything worked fine on a binary sample.

I don't use this library. I wrote that script once, I think just for you. If it works with the old data, something must be wrong in your new data.

Aleksey Vyazmikin 2022.10.26 06:46 #28005

mytarmailS #:
I don't use this library. I wrote that script once, I think just for you. If it works with the old data, something must be wrong in your new data.

Yeah, for me. It works with the binary sample; looking back, I had mostly run it on binary data. Too bad it doesn't tell you which column/row is wrong.

mytarmailS 2022.10.26 07:16 #28006

Aleksey Vyazmikin #:

Yeah, for me. It works with the binary sample; looking back, I had mostly run it on binary data. Too bad it doesn't tell you which column/row is wrong.

Double-check your data to make sure it matches the binary.

If you can't figure it out, send me a small piece of your data that the script doesn't work with, as well as the script itself and remind me what it should do.

I'll try to help...


Aleksey Vyazmikin 2022.10.26 13:57 #28007

mytarmailS #:
Double-check your data to make sure it matches the binary.

If you can't figure it out, send me a small piece of your data that the script doesn't work with, as well as the script itself and remind me what it should do.

I'll try to help...

Why do they have to match the binary? I just said that the script works, but it doesn't work with all data.

I cut the sample and attached the script in a separate archive.

The script removes correlated columns from the sample and saves the new sample.

Columns are excluded depending on the correlation threshold.

Files:

test_mini.zip 5324 kb

Viborka_Korrelyaciya_v_02.zip 2 kb
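The thresholded removal described above can be sketched in base R. This is a simplified greedy stand-in written for illustration, not the algorithm used by caret::findCorrelation; the function name drop_correlated and the toy data are hypothetical, and the sketch assumes numeric columns with no missing values and non-zero variance.

```r
# Greedily drop columns until no pair has |correlation| above the cutoff
drop_correlated <- function(data, cutoff) {
  cm <- abs(cor(data))
  diag(cm) <- 0                      # ignore self-correlation
  keep <- colnames(cm)
  repeat {
    sub <- cm[keep, keep, drop = FALSE]
    if (length(keep) < 2 || max(sub) <= cutoff) break
    # locate the most correlated pair among the remaining columns
    idx  <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    # of the two, remove the one more correlated with everything else
    mean_cor <- colMeans(sub[, pair, drop = FALSE])
    keep <- setdiff(keep, pair[which.max(mean_cor)])
  }
  data[, keep, drop = FALSE]
}

# Toy sample: y is almost a multiple of x, z is only weakly related
x <- 1:10
toy <- data.frame(x = x,
                  y = 2 * x + rep(c(0.1, -0.1), 5),
                  z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))

# One reduced sample per threshold, as in the loop over cor.test.range
for (cutoff in c(0.2, 0.9)) {
  reduced <- drop_correlated(toy, cutoff)
  cat("cutoff", cutoff, "keeps", ncol(reduced), "column(s)\n")
}
```

A lower threshold removes more columns, which is why the script writes a separate file for each threshold.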


mytarmailS 2022.10.26 16:36 #28008

Aleksey Vyazmikin #:

Why do they have to match the binary? I just said that the script works, but it doesn't work with all data.

I don't know, maybe you changed the separator or something...

And I still don't understand what error the script is giving.

And why did you install the packages on the newer R and use the old R?

mytarmailS 2022.10.26 18:29 #28009

Aleksey Vyazmikin #:

Here you go. I had to rewrite it from scratch; the code was so bad that I couldn't understand what it was doing.

df <- read.csv(file = file.choose(), header = TRUE, sep = ";", dec = ".", stringsAsFactors = FALSE)


#  columns that must not take part in the correlation search
not_used_vars <- c("Target_100_Buy","Target_100_Sell","Target_P","Time","Target_100")


#  keep the unused columns separately
not_used_vars_df <- df[, not_used_vars]


#  data frame for the search, without the not_used_vars columns
df <- df[!names(df) %in% not_used_vars]


#  see ?caret::findCorrelation for how the function works and for examples
#  findCorrelation expects a correlation matrix and returns the indices of the
#  columns to REMOVE at the given correlation threshold ("cutoff = 0.9")
corr_columns <- caret::findCorrelation(cor(df), cutoff = 0.9, exact = FALSE, names = FALSE)


#  drop the correlated columns (if any), keeping the uncorrelated ones
if (length(corr_columns) > 0) df <- df[, -corr_columns, drop = FALSE]


#  combine everything into the resulting data frame
df <- cbind.data.frame(not_used_vars_df , df)


#  save the result
res_save_way <- "C:\\......\\not_correl_data.csv"
write.csv2(x = df,file = res_save_way,row.names = F)

Vladimir Perervenko 2022.10.26 19:33 #28010

Aleksey Vyazmikin #:

You once made a script that I decided to use again.

I ran it on a sample, and it gives an error - I can't understand where to find the error and how to fix it - maybe you know, since you use these libraries/packages?

Everything worked fine on a binary sample.

The error says that undefined values (NA) have appeared in the correlation matrix, and the findCorrelation function cannot work with them. Open the package documentation and read the function description.
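The NA values come from the warning in the log: a column with zero standard deviation has no defined correlation with anything. Since the error does not name the offending column, a small base R check can find it before cor() is called. This is a sketch on a made-up data frame; the column names are hypothetical.

```r
# Toy data frame: column "b" is constant, so its standard deviation is zero
df <- data.frame(a = c(1, 2, 3, 4),
                 b = c(5, 5, 5, 5),
                 c = c(2, 4, 6, 9))

# cor() warns and fills the row and column of "b" with NA
cm_bad <- suppressWarnings(cor(df))
anyNA(cm_bad)                        # TRUE

# Standard deviation of every column; constant columns give exactly 0
col_sd   <- vapply(df, sd, numeric(1), na.rm = TRUE)
zero_var <- names(col_sd)[is.na(col_sd) | col_sd == 0]
zero_var                             # "b"

# Drop the constant columns and recompute the correlation matrix
clean   <- df[, !names(df) %in% zero_var, drop = FALSE]
cm_good <- cor(clean)
anyNA(cm_good)                       # FALSE
```

Running the same check on the real sample would list the columns to inspect or exclude before the correlation filter.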

The scripts are messy and full of unnecessary intermediate results. Below is the corrected script.

#=====================================================================
 require(tidyft)
#--get  df1------------------------------------------------------------
way <-         "D:\\FX\\MT5_CB\\MQL5\\Files\\Po_Vektoru_TP_0_SL_0\\EURUSD_0\\Setup\\"  # trailing separator so paste0(way, ...) forms a valid path
df1 = read.csv(paste0(way, "train.csv"), header = TRUE, sep = ";",dec = ".")
#df1 = fread(paste0(way, "train1.csv"))
#fst::write_fst(df1, "train1.fst")
#-----archive--------------------------------
 ft <- as_fst(df1) #
 rm(df1)


#---constants--------------------------------------------
 cor.test.range <- seq(from = 0.1,to = 0.9,by = 0.1)  #  range of correlation thresholds to try
not.used.colums = c("Target_100_Buy","Target_100_Sell","Target_P","Time","Target_100")

ft %>% select_fst(cols = not.used.colums, negate = TRUE)-> dt
#--function--------------------------------------------
 get.findCor<- function(data , cor.coef = cor.test.range){
    import::here(.from = caret, findCor = findCorrelation)
    data %>%
        cor(method = "kendall", use = "pairwise" ) %>%
        findCor(cutoff = cor.coef, exact = FALSE, names = TRUE)->nms
        if (length(nms) == 0) return(data)  # nothing exceeds the cutoff: keep all columns
        select_dt(data, cols = nms, negate = TRUE)
}
#----Calculate--------------------------------------------------------------
for(i in seq_len(length(cor.test.range))){
    get.findCor(dt, cor.coef = cor.test.range[i])-> dt.n
    paste0("train2_" , cor.test.range[i]*10 , ".csv") %>%
        paste0(way , .) %>% fwrite(dt.n, .)
    rm(dt.n)
}

Explanations in order:

1. You don't need to load the "caret" package into the global scope. It is very heavy and pulls in a lot of dependencies and data, while you only need one function from it. Import that function directly inside the get.findCor function.

The tidyft package is a very fast dataframe manipulation package. Use it.
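Point 1 can be illustrated without installing anything. The sketch below uses tools::file_ext purely as a stand-in for caret::findCorrelation, because the tools package ships with R but is not attached by default; the wrapper name get_ext is hypothetical. The pkg::fun form loads the namespace and fetches one function without calling library().

```r
# Use one function from a package without attaching the whole package.
# tools::file_ext stands in here for caret::findCorrelation.
get_ext <- function(path) {
  file_ext <- tools::file_ext   # local alias, visible only inside this function
  file_ext(path)
}

get_ext("train2_0.9.csv")       # "csv"
```

The import::here() call in the script above has the same intent: the imported name exists only inside the function that needs it.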


Machine learning in trading: theory, models, practice and algo-trading - page 2801