Discussion of article "Advanced resampling and selection of CatBoost models by brute-force method" - page 14

 
mytarmailS:

Anyway, I don't know, maybe my GMM is wrong ))) But I don't see any difference with it or without it; in my opinion everything is decided by the target and nothing else....


I have 60k data points in total.

I take the first 10k and randomly select 500 points.

I either train the model on them directly, or first train the GMM and then train the model.

Then I test on the remaining 50k.

And even in the usual way you can find models just as good as with the GMM, and they are generated with the same frequency.

For example:

a model without GMM, trained on 500 points, tested on the 50k
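Roughly, the comparison described above could be sketched like this (features and labels are placeholder arrays for the 60k dataset, and all parameters here are arbitrary, just to fix the idea):

import numpy as np
from sklearn.mixture import GaussianMixture
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# features, labels - placeholder arrays for the 60k dataset
idx = np.random.choice(10000, 500, replace=False)           # 500 random points from the first 10k
X_tr, y_tr = features[idx], labels[idx]
X_te, y_te = features[10000:], labels[10000:]                # the remaining 50k for the test

# 1) model trained directly on the 500 points
m1 = CatBoostClassifier(iterations=500, verbose=False).fit(X_tr, y_tr)

# 2) GMM fitted on the same 500 points (features + label), then a larger synthetic set is sampled
gmm = GaussianMixture(n_components=5, covariance_type='full').fit(np.column_stack([X_tr, y_tr]))
synth = gmm.sample(5000)[0]
X_s, y_s = synth[:, :-1], (synth[:, -1] >= 0.5).astype(int)
m2 = CatBoostClassifier(iterations=500, verbose=False).fit(X_s, y_s)

print('without GMM:', accuracy_score(y_te, m1.predict(X_te)))
print('with GMM:   ', accuracy_score(y_te, m2.predict(X_te)))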


=================================================================================================

I saw an interesting thing to think about.....

There is a point of view that the market should be divided into states and a different strategy traded in each state, but all the attempts known to me were unsuccessful: either the states cannot be detected, or the model trades badly even within a "supposedly single" state.

But with this approach you can see quite clearly which market regimes the model "likes" and which it doesn't.

Probably because the returns of the moving average are used as features, the model works better in a flat market.

You can manually divide the history into states and feed those periods into the training set. You need to balance the examples by "states", or make artificial ones via GMM. How come, I never got results like that on a bare model. Maybe a few moving averages make it possible.
 
Maxim Dmitrievsky:
You can manually divide the history into states and feed those periods into the training set. You need to balance the examples by "states", or make artificial ones via GMM.

Yes, you can do HMM by states, but it will all be recognised through a sliding window, and therefore with a lag of the window size, and therefore ...... )

I just saw that the states are really clearly visible here, and it seemed interesting.
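For illustration, a rough sketch of that "HMM over a sliding window" (using hmmlearn's GaussianHMM; returns is a placeholder array of price returns, the window size is arbitrary, and the lag is exactly the drawback mentioned above):

import numpy as np
from hmmlearn.hmm import GaussianHMM

# returns - placeholder 1-D array of price returns
def hmm_states(returns, n_states=2, window=500):
    states = np.full(len(returns), -1)
    for t in range(window, len(returns)):
        win = returns[t - window:t].reshape(-1, 1)
        model = GaussianHMM(n_components=n_states, covariance_type='full', n_iter=50)
        model.fit(win)
        states[t] = model.predict(win)[-1]   # state of the latest bar in the window
        # note: state indices are not guaranteed to keep the same meaning between windows
    return states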

 
mytarmailS:

Yes, you can do HMM by states, but it will all be recognised through a sliding window, and therefore with a lag of the window size, and therefore ...... )

I just saw that the states are really clearly visible here, and it seemed interesting.

Trend periods are usually rarer than flat ones, so it seems to me it will always be like this; they need to be resampled. The same clustering can be used to divide the data into states.
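A rough sketch of dividing the history into states by clustering, just to illustrate the idea (the rolling-statistics features and KMeans here are my own arbitrary choice, not anything from the article):

import pandas as pd
from sklearn.cluster import KMeans

# close - placeholder array of close prices
def market_states(close, window=50, n_states=2):
    close = pd.Series(close)
    ret = close.pct_change().fillna(0)
    feats = pd.DataFrame({
        'volatility': ret.rolling(window).std(),
        'trend': (close - close.rolling(window).mean()) / close.rolling(window).std()
    }).dropna()
    states = KMeans(n_clusters=n_states, n_init=10).fit_predict(feats)
    return pd.Series(states, index=feats.index)   # state index per bar, e.g. flat vs trend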
 
Maxim Dmitrievsky:
How come, I never got results like that on a bare model. Maybe a few moving averages make it possible.

That was with the GMM; I've tried it both ways, this way and that.

 
Maxim Dmitrievsky:

I have an obsessive idea of creating a training sample by optimising distributions or functions.

Without starting from any sample at all, just generate "something" and test it on real data.

But I don't know how to realise it yet.
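Just to make the idea concrete, a very rough sketch of what "optimising a distribution instead of a sample" could look like: class-conditional normal distributions are parameterised, a synthetic training set is sampled from them, and the parameters are scored on real data (X_real, y_real are placeholders, and the random search is the crudest possible optimiser):

import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# X_real, y_real - placeholder real data used only for scoring, never for training
N_FEAT = 5

def sample_set(params, n=1000):
    # params: means and stds of each feature for class 0 and class 1, flattened
    mu0, sd0, mu1, sd1 = np.split(np.asarray(params), 4)
    X0 = np.random.normal(mu0, np.abs(sd0), size=(n // 2, N_FEAT))
    X1 = np.random.normal(mu1, np.abs(sd1), size=(n // 2, N_FEAT))
    return np.vstack([X0, X1]), np.r_[np.zeros(n // 2), np.ones(n // 2)]

def score(params):
    X, y = sample_set(params)
    model = CatBoostClassifier(iterations=200, verbose=False).fit(X, y)
    return accuracy_score(y_real, model.predict(X_real))   # tested on real data only

# the crudest possible optimiser: random search over the distribution parameters
best_params = max((np.random.uniform(-1, 1, 4 * N_FEAT) for _ in range(50)), key=score)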

=====================================================

I also have an idea to improve quality by removing bad trees from the model; this may also help.
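One simple variant of the "removing bad trees" idea, assuming it is enough to cut off trailing trees rather than arbitrary ones: measure quality iteration by iteration with staged_predict and truncate the ensemble with shrink() (X_train, y_train, X_val, y_val are placeholders):

import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score

# X_train, y_train, X_val, y_val - placeholders
model = CatBoostClassifier(iterations=1000, verbose=False).fit(X_train, y_train)

# quality on the validation set after each added tree
accs = [accuracy_score(y_val, pred) for pred in model.staged_predict(X_val, eval_period=1)]

best_iter = int(np.argmax(accs)) + 1
model.shrink(ntree_end=best_iter)   # keep only the first best_iter trees, drop the rest
print('kept', best_iter, 'of', len(accs), 'trees, val accuracy:', max(accs))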

 
mytarmailS:

I have an obsessive idea of creating a training sample by optimising distributions or functions.

Without starting from any sample at all, just generate "something" and test it on real data.

But I don't know how to realise it yet.

=====================================================

I also have an idea to improve quality by removing bad trees from the model; this may also help.

You're the one who wants to get into the thick of stochastic modelling.
 
Maxim Dmitrievsky:

A curious approach to balancing the classes. It could be adapted to our purposes. It just caught my eye.

https://towardsdatascience.com/augmenting-categorical-datasets-with-synthetic-data-for-machine-learning-a25095d6d7c8

I tried to integrate this approach into the clusteriser from the article, not as a class-balancing method but as a generator of a new, balanced dataset.

There's a handy function there for calculating the Mahalanobis distance between two one-dimensional arrays. The article describes it as a multivariate generalisation of how many standard deviations a sample is from the mean of the distribution.

I haven't fully explored this metric yet, but the author suggests using it to assess whether the generated features belong to a particular class.

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.mahalanobis.html

To calculate this indicator, we need two one-dimensional arrays and the inverse of the covariance matrix.
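For reference, a tiny self-contained example of this function: the distance is sqrt((u - v)^T * VI * (u - v)), where VI is the inverse covariance matrix (the data here is just random for illustration):

import numpy as np
from scipy.spatial.distance import mahalanobis

data = np.random.normal(size=(1000, 3))          # some sample with 3 features
mu = data.mean(axis=0)                           # mean of the distribution
VI = np.linalg.inv(np.cov(data, rowvar=False))   # inverse covariance matrix

x = np.array([1.5, -0.3, 0.7])                   # the point being assessed
print(mahalanobis(x, mu, VI))                    # "how many generalised sigmas" x is from mu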

In our case, the first array is the generated feature vector, and the second is the array of feature means from the GMM. The covariance matrix is also taken from the GMM. A GMM is prepared for each class separately. In addition, the mean and standard deviation of each feature, as well as the labels, are collected; these are needed for generating new data.

import numpy as np
import numpy.linalg as lnalg
from sklearn import mixture
from scipy.spatial.distance import mahalanobis

# initialization
# pr and add_labels() are defined elsewhere (price data and labelling from the article's code)
pr_c = add_labels(pr.copy(), min=60, max=120, add_noize=0.0)
X = pr_c[pr_c.columns[2:]]
x_train = X.copy()

arr_classes = np.sort(X.labels.unique())
cols = X.columns.tolist()

gmms = dict()       # (mean, covariance) of the single GMM component, per class
inv_sig = dict()    # (mean, inverse covariance) per class, for mahalanobis()
desc_df = list()    # descriptive statistics (mean, std, quantiles) per class

# create descriptive statistics, train a GMM, extract means and covariances for each class
for ind, cls in enumerate(arr_classes):

    desc_df.append(x_train[x_train['labels'] == cls].describe())

    x_trainGMM = x_train[x_train['labels'] == cls].values[:, :-1]   # drop the label column
    gmm = mixture.GaussianMixture(n_components=1, covariance_type='full').fit(x_trainGMM)
    gmms[cls] = (gmm.means_[0], gmm.covariances_[0])                # n_components=1, single component

    # invert the covariance matrix for the mahalanobis calculation
    mu, sig = gmms[cls]
    isig = lnalg.inv(sig)
    inv_sig[cls] = mu, isig

Everything is ready for generating and selecting new data. Below, features for each class are generated randomly from the mean and standard deviation, in a quantity 60 times larger than requested; this is necessary so that there is something to choose from. The labels are then reduced to 0/1.

def brute_force(samples=5000):
    gen = []

    for index_cl, cls in enumerate(arr_classes):

        dlt = samples * 60                       # generate 60x more candidates than needed
        sub_arr = np.zeros((dlt, len(cols), 1))

        # generate candidate samples and labels for class cls
        col_counter = 0
        for col in cols:
            sub_arr[:, col_counter] = np.random.normal(loc=desc_df[index_cl][col]['mean'],
                                                       scale=desc_df[index_cl][col]['std'],
                                                       size=(dlt, 1)
                                                       )
            col_counter += 1
        sub_arr = sub_arr.reshape(sub_arr.shape[:-1])

        # binarise the generated labels (last column) to 0/1
        sub_arr[:, -1] = np.where(sub_arr[:, -1] >= 0.5, 1, 0)

        mh = np.zeros(arr_classes.shape[0])
        counter = 0

        # selection of the most successful samples
        for index, i in enumerate(sub_arr):
            for m_index, m_cls in enumerate(arr_classes):
                mu, isig = inv_sig[m_cls]
                mh[m_index] = mahalanobis(i[:-1], mu, isig)

            # if the GMM assignment matches the generated label (classes are assumed to be 0 and 1), add the sample
            if np.argmin(mh) == i[-1]:
                gen = np.append(gen, i)
                counter += 1
            if counter == int(samples / 2):      # enough samples for this class, move on
                break


...

Then, for each generated sample, the Mahalanobis distance is calculated with respect to the GMM mean vectors of both classes. We get an array of two values that show how close the generated sample is to each class; the minimum value shows which class it belongs to. If the generated label coincides with it, we add the sample to the training set. When the set fills up to the specified size, we move on to the next class. This way we get a perfectly balanced sample.

This does not eliminate the trial-and-error fiddling and the complicated relationship with randomness, but if you try hard enough you can get a decent result.

If I have the time and energy, I will try to seed the generator with feature distributions restricted to the 25-75 quantile range; maybe it will give something.
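Just as a sketch of what that could mean: replace np.random.normal in the generator with a normal distribution truncated to the 25-75 quantile range (scipy.stats.truncnorm; sample_iqr is a hypothetical helper, and the '25%'/'75%' columns are the ones already collected in desc_df by describe()):

from scipy.stats import truncnorm

# hypothetical replacement for np.random.normal inside brute_force():
# draw from a normal distribution truncated to the 25-75 quantile range of the feature
def sample_iqr(desc, col, size):
    mean, std = desc[col]['mean'], desc[col]['std']
    q25, q75 = desc[col]['25%'], desc[col]['75%']
    a, b = (q25 - mean) / std, (q75 - mean) / std        # bounds in units of std
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size)

# usage inside the generator loop:
# sub_arr[:, col_counter] = sample_iqr(desc_df[index_cl], col, (dlt, 1))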

I also tried to use this distance indicator to evaluate the choice of target and features. The idea is that with correctly selected features and labels, the average value of this indicator should decrease.

# average Mahalanobis distance of correctly assigned samples over the whole training set
# (despite the 'Accuracy' label below, lower values are better here)
results = np.zeros(x_train.shape[0])
mh = np.zeros(arr_classes.shape[0])
for index, i in enumerate(x_train.to_numpy()):
    for m_ind, m_cls in enumerate(arr_classes):
        mu, isig = inv_sig[m_cls]                  # the dict is keyed by class label
        mh[m_ind] = mahalanobis(i[:-1], mu, isig)

    if np.argmin(mh) == i[-1]:
        results[index] = mh[np.argmin(mh)]

acc = results.sum() / results.shape[0]

print('Accuracy:', acc)

I ran all the "successful" combinations of target and features I had, and also reproduced "unsuccessful" combinations. In this cursory analysis, the indicator decreases for successful variants and increases for unsuccessful ones. There may be some correlation, but it needs checking. If you have some kind of grid scanner or GA ready, you could check it.

 
welimorn:

Not bad, that really caught my eye. I tried to integrate this approach into the clusteriser from the article, not as a class-balancing method but as a generator of a new, balanced dataset.

I ran all the "successful" combinations of target and features I had, and also reproduced "unsuccessful" combinations. In this cursory analysis, the indicator decreases for successful variants and increases for unsuccessful ones. There may be some correlation, but it needs checking. If you have some kind of grid scanner or GA ready, you could check it.

No scanner yet. Great, I will have to take a close look. In the meantime, I've been gathering information about additional approaches that can improve the model (besides encoders). I'll probably write it up as an article soon.

 
Maxim Dmitrievsky:

No scanner yet. Great, I will have to take a close look. In the meantime, I've been gathering information about additional approaches that can improve the model (besides encoders). I'll probably write it up as an article soon.

Regarding the combining of successful models during the search that you mentioned: I have tried combining successful models with different feature sets. This technique evens out the drawdown in some parts of the history. I also noticed that adding models with R^2 from 0.65 improves the results, even when models with R^2 of 0.85-0.95 are already present.
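A rough sketch of such a combination, just to fix the idea: keep only the models whose R^2 passes a threshold (0.65 here) and average their signals (models, feature_sets, X_val, y_val, X_new are placeholders; in this discussion R^2 is measured on the balance curve, an ordinary r2_score is used below purely as a stand-in):

import numpy as np
from sklearn.metrics import r2_score

# models, feature_sets, X_val, y_val, X_new - placeholders; each model has its own feature set
def ensemble_predict(models, feature_sets, X_val, y_val, X_new, r2_min=0.65):
    preds = []
    for model, cols in zip(models, feature_sets):
        r2 = r2_score(y_val, model.predict(X_val[cols]))   # stand-in for the balance-curve R^2
        if r2 >= r2_min:                                   # even 0.65 models smooth the drawdown
            preds.append(model.predict(X_new[cols]))
    return np.mean(preds, axis=0)                          # averaged signal of the accepted models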

 
welimorn:

Regarding the combining of successful models during the search that you mentioned: I have tried combining successful models with different feature sets. This technique evens out the drawdown in some parts of the history. I also noticed that adding models with R^2 from 0.65 improves the results, even when models with R^2 of 0.85-0.95 are already present.

Yes, but often at the expense of reducing the number of trades by 10-20%.