Machine learning in trading: theory, models, practice and algo-trading - page 3615

 
Maxim Dmitrievsky #:
I'd have to write an article anyway; it would be a lot of text.

What the hell do you need an article for...

There's a function, just show the data for it, that's it!!!

 
mytarmailS #:

What the hell do you need an article for...

There's a function, just show the data for it, that's it!!!

My code is complicated, with all sorts of add-ons; there's no simple version. It would take a long time to explain.
 
Maxim Dmitrievsky #:
My code is complicated, with all sorts of add-ons; there's no simple version. It would take a long time to explain.

You don't need anything fancy.

Your function takes one argument, "dataset", so show it.

THAT'S IT!!!!!!!!!!!!!!!

 
mytarmailS #:

You don't need anything fancy.

Your function takes one argument, "dataset", so show it.

THAT'S IT!!!!!!!!!!!!!!!

Any dataset will do. Features + labels in the last column. The labels do not participate in the clustering.

Copy the function and test it on your own data in your own environment. That's the least painful way.

The chance of error is minimal; the function is simple.
 
Maxim Dmitrievsky #:
Any dataset will do. Features + labels in the last column. The labels do not participate in the clustering.

Copy the function and test it on your own data in your own environment. That's the least painful way.

How the f..k are we going to compare your results and mine if the dataset is "any"...

Go back to my first message about reproducibility and confidence that it works correctly...


Are you drunk or something?

 
Yes 😀
 

Try it

We still won't be able to compare the results exactly, because k-means is random; the output labels will come out slightly different.
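A minimal sketch of that effect, using only base R (the toy data is made up): two k-means runs on the same points produce differently numbered and slightly different clusterings unless a seed is fixed, which is why the function below calls set.seed.

x <- matrix(rnorm(400), ncol = 2)    # 200 toy points, 2 features
a <- kmeans(x, centers = 5)$cluster  # first run
b <- kmeans(x, centers = 5)$cluster  # second run, different random start
table(a, b)                          # cluster ids come out permuted and shifted
mean(a == b)                         # so raw label agreement is not meaningful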

library(dplyr)
library(stats)

fix_labels_subset_mean <- function(dataset, n_clusters = 200, subset_size = 100) {
  # Cluster the feature columns with k-means; the last column must be
  # named 'labels' and is excluded from the clustering
  set.seed(123)  # for reproducibility
  kmeans_result <- kmeans(dataset[, -ncol(dataset)], centers = n_clusters)
  dataset$clusters <- kmeans_result$cluster

  # Compute the mean 'labels' value for each cluster
  cluster_means <- dataset %>%
    group_by(clusters) %>%
    summarise(mean_label = mean(labels))

  # Rank clusters by how far their mean label is from 0.5 and keep the
  # subset_size most one-sided ones
  sorted_clusters <- cluster_means %>%
    mutate(distance = abs(mean_label - 0.5)) %>%
    arrange(desc(distance)) %>%
    head(subset_size) %>%
    pull(clusters)

  # Map each selected cluster to a hard 0/1 label based on its mean
  mean_to_new_value <- cluster_means %>%
    filter(clusters %in% sorted_clusters) %>%
    mutate(new_value = ifelse(mean_label < 0.5, 0.0, 1.0)) %>%
    select(clusters, new_value)

  # Overwrite 'labels' only for rows that fall in the selected clusters
  dataset <- dataset %>%
    left_join(mean_to_new_value, by = "clusters") %>%
    mutate(labels = ifelse(!is.na(new_value), new_value, labels)) %>%
    select(-clusters, -new_value)

  return(dataset)
}

# Example usage
# dataset <- read.csv("path_to_your_dataset.csv")
# result <- fix_labels_subset_mean(dataset)
Files:
dataset.csv (11854 KB)
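For anyone who wants a quick smoke test before touching real data, here is a hypothetical run on synthetic data (the sizes, seed, and parameter values are my own choices, not from the post; the only real requirement is a data frame whose last column is named labels):

library(dplyr)

set.seed(42)
toy <- as.data.frame(matrix(rnorm(5000), ncol = 5))  # 1000 rows, 5 features
toy$labels <- rbinom(1000, 1, 0.5)                   # binary labels, last column

fixed <- fix_labels_subset_mean(toy, n_clusters = 50, subset_size = 25)

# Rows in the 25 most one-sided clusters now carry hardened 0/1 labels;
# all other rows keep their original labels
mean(toy$labels != fixed$labels)  # share of labels that were changed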
 
We still won't be able to compare the results exactly, because k-means is random; the output labels will come out slightly different.
 
Then take any dataset you want. Apply the fix to the training sample and train on it. Then compare in the tester, on the test set, with and without the fix. That's it.

It is the original and the corrected datasets that should be compared; look for the improvement in the tester on new data.
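For an offline version of that comparison, a sketch under stated assumptions: the 70/30 holdout and the logistic regression standing in for the strategy tester are mine, and the training sample is assumed to have well over 200 rows so the default n_clusters works.

library(dplyr)

set.seed(7)
n  <- nrow(dataset)                          # 'dataset' as loaded above
tr <- sample(n, round(0.7 * n))              # 70/30 train/test split
train <- dataset[tr, ]
test  <- dataset[-tr, ]

# Same model class trained on raw vs. fixed training labels
m0 <- glm(labels ~ ., data = train, family = binomial)
m1 <- glm(labels ~ ., data = fix_labels_subset_mean(train), family = binomial)

# Out-of-sample accuracy as a crude stand-in for the tester's verdict
acc <- function(m) mean((predict(m, test, type = "response") > 0.5) == test$labels)
c(original = acc(m0), fixed = acc(m1))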
 
So?