Machine learning in trading: theory, models, practice and algo-trading - page 3612

 
Forester #:

Your method makes the model easier to train: there is less noise, so training is simpler. It is easier for the same tree to find a cluster in the data in which 100% of the examples are 1, rather than 60%.
But by relabelling 40% of the examples with the other class you add noise to the trades: you put 1 where there should be 0.

Still, your charts look stable and promising enough on OOS.
Do you plan to publish a signal? Even without a subscription, just to see what happens in real life on OOS.

I don't want signals, I want a hedge fund. Or just a lot of money on the card. I don't play signals.

Yes, you are correct in your interpretation of the whole process.
 

Are we gonna Google it?


 
Maxim Dmitrievsky #:
I don't want signals, I want a hedge fund. Or just a lot of money on the card.

)))))))))

Like a grown-up.

 
I recently came across a very savvy ML resource:
Open Source Research for Business (home page for FirmAI Open Science): www.firmai.org, 2021.03.30
 

To see all the horror going on in the labelling, I printed the number of elements in each cluster and the deviation of the cluster's mean label from 0.5, in descending order:

Cluster 83: Count = 18, Mean = 0.33
Cluster 70: Count = 187, Mean = 0.22
Cluster 82: Count = 436, Mean = 0.10
Cluster 73: Count = 457, Mean = 0.09
Cluster 89: Count = 961, Mean = 0.09
Cluster 7: Count = 380, Mean = 0.09
Cluster 94: Count = 670, Mean = 0.09
Cluster 79: Count = 681, Mean = 0.08
Cluster 38: Count = 250, Mean = 0.08
Cluster 62: Count = 24, Mean = 0.08
Cluster 14: Count = 483, Mean = 0.08
Cluster 23: Count = 516, Mean = 0.08
Cluster 41: Count = 1181, Mean = 0.08
Cluster 60: Count = 412, Mean = 0.08
Cluster 2: Count = 667, Mean = 0.08
Cluster 12: Count = 651, Mean = 0.07
Cluster 0: Count = 646, Mean = 0.07
Cluster 39: Count = 1077, Mean = 0.07
Cluster 43: Count = 455, Mean = 0.06
Cluster 4: Count = 581, Mean = 0.06
Cluster 99: Count = 713, Mean = 0.06
Cluster 71: Count = 1368, Mean = 0.06
Cluster 72: Count = 1487, Mean = 0.06
Cluster 31: Count = 1340, Mean = 0.06
Cluster 51: Count = 1340, Mean = 0.06
Cluster 1: Count = 756, Mean = 0.06
Cluster 69: Count = 936, Mean = 0.05
Cluster 93: Count = 590, Mean = 0.05
Cluster 90: Count = 852, Mean = 0.05
Cluster 77: Count = 936, Mean = 0.05
Cluster 8: Count = 274, Mean = 0.05
Cluster 34: Count = 1050, Mean = 0.05
Cluster 86: Count = 979, Mean = 0.05
Cluster 58: Count = 508, Mean = 0.05
Cluster 66: Count = 632, Mean = 0.04
Cluster 55: Count = 556, Mean = 0.04
Cluster 75: Count = 1118, Mean = 0.04
Cluster 80: Count = 2365, Mean = 0.04
Cluster 65: Count = 334, Mean = 0.03
Cluster 53: Count = 1592, Mean = 0.03
Cluster 63: Count = 237, Mean = 0.03
Cluster 22: Count = 1155, Mean = 0.03
Cluster 27: Count = 1454, Mean = 0.03
Cluster 98: Count = 1062, Mean = 0.03
Cluster 88: Count = 17, Mean = 0.03
Cluster 5: Count = 767, Mean = 0.03
Cluster 67: Count = 1543, Mean = 0.03
Cluster 46: Count = 645, Mean = 0.03
Cluster 24: Count = 1028, Mean = 0.03
Cluster 85: Count = 498, Mean = 0.03
Cluster 37: Count = 1992, Mean = 0.03
Cluster 57: Count = 1523, Mean = 0.02
Cluster 26: Count = 476, Mean = 0.02
Cluster 28: Count = 695, Mean = 0.02
Cluster 10: Count = 1630, Mean = 0.02
Cluster 32: Count = 1522, Mean = 0.02
Cluster 50: Count = 651, Mean = 0.02

The list above is not complete. There are only 3 clusters with a deviation >= 0.1, i.e. where the proportions of buy and sell labels differ from 50/50 by at least 10 percentage points (a 60/40 split or worse).

In the remaining clusters, the two labels occur with almost 50/50 probability.

This is hell for training the model, because it cannot be confident about anything.
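
For reference, a minimal sketch of how such a table can be produced (consistent with the function posted further down this page), assuming dataset is a pandas DataFrame with the binary target in a 'labels' column and KMeans cluster ids already written to a 'clusters' column; the helper name print_cluster_stats is just for illustration:

import pandas as pd

def print_cluster_stats(dataset: pd.DataFrame) -> None:
    # Group the binary targets by cluster id
    grouped = dataset.groupby('clusters')['labels']
    counts = grouped.count()                   # number of samples per cluster
    deviation = grouped.mean().sub(0.5).abs()  # |mean label - 0.5|
    # Print clusters from the most lopsided to the most balanced
    for cluster in deviation.sort_values(ascending=False).index:
        print(f"Cluster {cluster}: Count = {counts[cluster]}, Mean = {deviation[cluster]:.2f}")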

Let's see what kind of trading comes out if you select, fix (relabel) and trade only these 3 clusters:

Not great: sitting around hand-picking these pieces, even though they are in the plus on OOS.

But if you take a whole bunch of clusters above a certain probability threshold and fix them:

Already better, at a threshold of 0.03. I prepared a bit of code for the main purpose: to calculate correct combinations of clusters.

 
Maxim Dmitrievsky #:

To see all the horror going on in the labelling.

And is there the full code, from loading the data to testing the model?
 
mytarmailS #:
Do you have the full code, starting from data loading to model testing?

I'll post it later when I finish the probabilities.

Maybe an article.

 

And clearly, the more clusters you split the data into, the more good ones you can find, but the number of samples in them will be small.

In fact, most of the dataset is rubbish. Often even fixing it doesn't help, or there are too few deals.

In the example I split into 500 clusters; the best ones are printed below. It makes no sense to sort through and combine them manually.

Iteration: 0
Cluster 479: Count = 9, Mean = 0.50
Cluster 437: Count = 12, Mean = 0.50
Cluster 195: Count = 22, Mean = 0.41
Cluster 255: Count = 10, Mean = 0.40
Cluster 52: Count = 26, Mean = 0.38
Cluster 246: Count = 14, Mean = 0.36
Cluster 420: Count = 27, Mean = 0.35
Cluster 354: Count = 20, Mean = 0.35
Cluster 366: Count = 32, Mean = 0.34
Cluster 271: Count = 18, Mean = 0.33
Cluster 229: Count = 69, Mean = 0.33
Cluster 349: Count = 26, Mean = 0.31
Cluster 373: Count = 39, Mean = 0.29
Cluster 326: Count = 85, Mean = 0.29
Cluster 289: Count = 66, Mean = 0.29
Cluster 295: Count = 14, Mean = 0.29
Cluster 353: Count = 18, Mean = 0.28
Cluster 202: Count = 61, Mean = 0.27
Cluster 296: Count = 29, Mean = 0.26
Cluster 286: Count = 33, Mean = 0.26
Cluster 344: Count = 57, Mean = 0.25
Cluster 13: Count = 73, Mean = 0.25
Cluster 101: Count = 44, Mean = 0.25
Cluster 397: Count = 20, Mean = 0.25
Cluster 209: Count = 28, Mean = 0.25
Cluster 43: Count = 43, Mean = 0.24
Cluster 262: Count = 84, Mean = 0.24
Cluster 88: Count = 61, Mean = 0.24
Cluster 129: Count = 38, Mean = 0.24
Cluster 277: Count = 19, Mean = 0.24
Cluster 90: Count = 171, Mean = 0.24
Cluster 417: Count = 135, Mean = 0.23
Cluster 60: Count = 45, Mean = 0.23
Cluster 190: Count = 15, Mean = 0.23
Cluster 128: Count = 89, Mean = 0.23
Cluster 460: Count = 22, Mean = 0.23
Cluster 154: Count = 110, Mean = 0.23
Cluster 109: Count = 65, Mean = 0.22
Cluster 385: Count = 36, Mean = 0.22
Cluster 328: Count = 71, Mean = 0.22
Cluster 191: Count = 46, Mean = 0.22
Cluster 120: Count = 44, Mean = 0.20
Cluster 379: Count = 54, Mean = 0.20
Cluster 56: Count = 138, Mean = 0.20
Cluster 360: Count = 131, Mean = 0.20
Cluster 87: Count = 50, Mean = 0.20
Cluster 431: Count = 43, Mean = 0.20
Cluster 259: Count = 26, Mean = 0.19
Cluster 232: Count = 13, Mean = 0.19
Cluster 436: Count = 55, Mean = 0.19
Cluster 111: Count = 45, Mean = 0.19
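
The "Iteration: 0" header above suggests the clustering and selection are repeated over several runs. A minimal sketch of such a loop, assuming the same column layout as the function below; the helper name and parameters here are hypothetical, not the author's code:

import pandas as pd
from sklearn.cluster import KMeans

def print_iterations(dataset: pd.DataFrame, n_iterations: int = 5, n_clusters: int = 500) -> None:
    # Feature columns: everything except the first column and the trailing 'labels' column
    features = dataset[dataset.columns[1:-1]]
    for i in range(n_iterations):
        print(f"Iteration: {i}")
        # A fresh clustering per iteration, seeded differently each time
        clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=i).fit(features).labels_
        stats = dataset.assign(clusters=clusters).groupby('clusters')['labels'].agg(['count', 'mean'])
        stats['dev'] = (stats['mean'] - 0.5).abs()
        # Show the most lopsided clusters first
        for cluster, row in stats.sort_values('dev', ascending=False).head(50).iterrows():
            print(f"Cluster {cluster}: Count = {int(row['count'])}, Mean = {row['dev']:.2f}")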
 

A function to output this information and to fix clusters by a probability threshold rather than by the number of best clusters:

It can be rewritten in other languages via ChatGPT, I guess.

import pandas as pd
from sklearn.cluster import KMeans

def find_best_clusters_proba(dataset, n_clusters=200, probability_threshold=0.02) -> pd.DataFrame:
    # Apply KMeans clustering to the feature columns
    dataset['clusters'] = KMeans(n_clusters=n_clusters).fit(dataset[dataset.columns[1:-1]]).labels_
    # Compute the mean of 'labels' for each cluster
    cluster_means = dataset.groupby('clusters')['labels'].mean()
    # Compute the number of elements in each cluster
    cluster_counts = dataset.groupby('clusters')['labels'].count()
    # Compute the absolute deviation of each cluster's mean label from 0.5
    abs_diff_from_05 = cluster_means.sub(0.5).abs()
    # Select all clusters whose deviation is greater than or equal to the probability threshold
    sorted_clusters = abs_diff_from_05[abs_diff_from_05 >= probability_threshold].index
    # Sort the clusters by their absolute deviation for printing
    sorted_clusters_for_print = abs_diff_from_05.sort_values(ascending=False).index
    for cluster in sorted_clusters_for_print:
        count = cluster_counts[cluster]
        mean = abs_diff_from_05[cluster]
        print(f"Cluster {cluster}: Count = {count}, Mean = {mean:.2f}")
    # Map each selected cluster to its majority class (0 or 1)
    mean_to_new_value = {cluster: 0.0 if mean < 0.5 else 1.0 for cluster, mean in cluster_means.items() if cluster in sorted_clusters}
    # Overwrite 'labels' with the majority class, only for rows in the selected clusters
    dataset['labels'] = dataset.apply(lambda row: mean_to_new_value[row['clusters']] if row['clusters'] in mean_to_new_value else row['labels'], axis=1)
    # Create the 'meta_labels' column: 1 for rows in selected clusters, 0 otherwise
    dataset['meta_labels'] = dataset['clusters'].apply(lambda x: 1 if x in sorted_clusters else 0)
    # Drop the auxiliary 'clusters' column
    dataset = dataset.drop(columns=['clusters'])
    return dataset
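
For context, here is one possible way to use the two returned columns downstream. This is a hypothetical sketch, not the author's full pipeline: the file name, the train/test split and GradientBoostingClassifier are assumptions. The 'labels' column drives the direction model, and 'meta_labels' a second model that decides whether to trade at all:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical example: 'dataset.csv' is an assumed file whose first column is
# auxiliary (e.g. time or price), the middle columns are features and the last
# column is the binary target 'labels', matching dataset.columns[1:-1] above
dataset = pd.read_csv('dataset.csv')
dataset = find_best_clusters_proba(dataset, n_clusters=100, probability_threshold=0.03)

# After the call, 'labels' and 'meta_labels' are the last two columns
X = dataset[dataset.columns[1:-2]]
X_train, X_test, y_train, y_test, m_train, m_test = train_test_split(
    X, dataset['labels'], dataset['meta_labels'], test_size=0.3, shuffle=False)

main_model = GradientBoostingClassifier().fit(X_train, y_train)  # direction (buy/sell)
meta_model = GradientBoostingClassifier().fit(X_train, m_train)  # trade / do not trade
print('main accuracy:', main_model.score(X_test, y_test))
print('meta accuracy:', meta_model.score(X_test, m_test))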
 
You can't understand the whole process from one function; show it with the data and the model, so that the whole process can be run and made sense of.