Machine learning in trading: theory, models, practice and algo-trading - page 3698

 
prosvetlenniy_mudrec #:
Wait, I didn't realise it straight away: so they are just non-linear regression (that's what the mathematicians at our Russian MSU say), not intelligence at all?
Yes, Big (non-)Linear Models 😁
 
Maxim Dmitrievsky #:
Yes, Big (non-)Linear Models 😁

Got it, Thanks.

 
prosvetlenniy_mudrec #:

Got it. Thank you.

Except that they use texts as data. Then they are trained to predict the next word. The principle is the same; nothing dangerous so far :)
 
Maxim Dmitrievsky #:
Except that they use texts as data. Then they are trained to predict the next word. The principle is the same; nothing dangerous there yet :)

This is all nonsense. Since my childhood in the USSR I have been wondering when a neural network will be able to create, to acquire desires and the ability to think; even back then it scared me, forgive the cowardice. :)

Why am I writing about this? Because Alice has become very clever now.

 
prosvetlenniy_mudrec #:

This is all nonsense. Since my childhood in the USSR I have been wondering when a neural network will be able to create, to acquire desires and the ability to think; even back then it scared me, forgive the cowardice. :)

Why am I writing about this? Because Alice has become very clever now.

Yeah, well, so is a calculator. It's just unusual for a programme to write in your language. It's new.
 

You're right, it's unusual. But what happens when she creates her own language? I mean, we all know that's going to happen. There are different opinions on that, though.

I'm sure Alice will have a chat with ChatGPT one day. That would be a very interesting conversation. :)

 

Claude 3.7 remains; as usual, it is a strong model for programming. Unfortunately, Gemini and Claude initially suggested installing the nonconformist package, which I couldn't get running on my Mac. I had to ask them to rewrite the code using the options available to me.

Later I will test the Gemini and Claude variants and compare them with my own. ChatGPT is out, although I'll look into the package it suggested later; I just couldn't get it up and running.

GitHub - donlnz/nonconformist: Python implementation of the conformal prediction framework.
Python implementation of the conformal prediction framework. Primarily to be used as an extension to the scikit-learn library.
 

Oh man, there's no way to fully sort through what gets generated. There was a lot of unnecessary stuff; I've cleaned it up. Now we'll ask it for a description :)

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

# get_prices(), get_features(), get_labels_one_direction() and the hyper_params dict
# come from the author's own pipeline and are assumed to be defined elsewhere.

def meta_learners_manual_conformal_analogue(models_number: int,
                                            iterations: int,
                                            depth: int,
                                            bad_samples_fraction: float,
                                            epsilon_threshold: float = 0.1):
    """Modification of the meta_learners function that detects potentially incorrect labels
    without the nonconformist library, emulating the idea of conformal predictions."""
    global hyper_params # Assume hyper_params is a global dictionary, or pass it as an argument
    
    # Check for the existence of hyper_params, if it is not defined, you can set default values or print an error
    if 'hyper_params' not in globals():
        # This part can be removed if hyper_params is always defined globally
        print("Warning: 'hyper_params' not defined globally. Using mock hyper_params for structure demonstration.")
        hyper_params = {
            'markup': 0.001, 'min': 1, 'max': 15, 'direction': 1,
            'forward': pd.Timestamp.now() + pd.Timedelta(days=1),
            'backward': pd.Timestamp.now() - pd.Timedelta(days=30*12)
        }
        # If get_labels_one_direction etc. cannot work without real hyper_params,
        # it would be better to raise an exception here instead:
        # raise NameError("Global variable 'hyper_params' is not defined.")

    # Use the original method of data retrieval
    dataset = get_labels_one_direction(get_features(get_prices()),
                                       markup=hyper_params['markup'],
                                       min=hyper_params['min'],
                                       max=hyper_params['max'],
                                       direction=hyper_params['direction'])
    
    data = dataset[(dataset.index < hyper_params['forward']) & (dataset.index > hyper_params['backward'])].copy()

    if data.empty:
        print("Warning: DataFrame 'data' is empty after initial filtering. Check hyper_params['forward'] / ['backward'] and dataset indexes.")
        data['meta_labels'] = np.nan
        return data

    # Define the feature column names.
    # Original code: X = data[data.columns[1:-2]]
    # i.e. the first column is skipped (probably an ID) and the last two are skipped too (labels etc.).
    if data.shape[1] < 4:  # at least (skipped col, one feature, and the two trailing cols) for the slice [1:-2]
        raise ValueError("Not enough columns in 'data' to extract features using the data.columns[1:-2] rule. "
                         f"Number of columns: {data.shape[1]}, columns: {data.columns.tolist()}")
    feature_column_names = data.columns[1:-2].tolist()
    if not feature_column_names:
        raise ValueError(f"The feature list is empty after slicing data.columns[1:-2]. "
                         f"Check the data structure. Columns: {data.columns.tolist()}")

    BAD_WAIT = pd.Index([])
    BAD_TRADE = pd.Index([])

    for i in range(models_number):
        print(f "Model processing {i+1}/{models_number}...")
        
        sample_frac = 0.3
        # Make sure there is enough data for sampling and subsequent splits
        if len(data) * sample_frac < 20 : # Minimum quantity to divide into 3 parts + stratification
            print(f "Warning: Total data size({len(data)}) слишком мал для выборки {sample_frac*100}% и последующих разделений. "
                  "Skip all iterations.")
            break 
            
        current_sample_df = data.sample(frac=sample_frac, random_state=i)
        
        X_current_sample = current_sample_df[feature_column_names]
        y_current_sample = current_sample_df['labels']

        if len(X_current_sample) < 6:  # minimum for 3 splits, 2 per split for stratification
            print(f"Skipping iteration {i+1} due to an insufficient number of samples "
                  f"({len(X_current_sample)}) in current_sample_df for splitting.")
            continue
        
        stratify_opt_y_current = y_current_sample if y_current_sample.nunique() > 1 else None
        
        try:
            # Split current_sample_df into: 1. Training (Proper Train), 2. Calibration (Calibration), 3. CatBoost validation (Eval Set)
            X_proper_train_temp, X_cb_eval, y_proper_train_temp, y_cb_eval = train_test_split(
                X_current_sample, y_current_sample, test_size=0.2, shuffle=True, 
                stratify=stratify_opt_y_current, random_state=i
            )

            if len(X_proper_train_temp) < 2: # Need at least 2 for the following division
                 raise ValueError("Insufficient data after the first split for further splitting for training and calibration.")

            stratify_opt_y_proper_temp = y_proper_train_temp if y_proper_train_temp.nunique() > 1 else None
            X_prop_train, X_calib, y_prop_train, y_calib = train_test_split(
                X_proper_train_temp, y_proper_train_temp, test_size=0.25, # 0.25 * 0.8 = 0.2 (20% of current_sample_df for calibration)
                shuffle=True, stratify=stratify_opt_y_proper_temp, random_state=i
            )
        except ValueError as e:  # catch errors from train_test_split (e.g. too few members of a class to stratify)
            print(f"Error during data splitting in iteration {i+1}: {e}. Skipping iteration.")
            continue

        if X_prop_train.empty or X_calib.empty or X_cb_eval.empty:
            print(f"Skipping iteration {i+1} because one of the splits (training, calibration or CB validation) is empty.")
            continue

        meta_m = CatBoostClassifier(iterations=iterations,
                                    depth=depth,
                                    eval_metric='Accuracy', # Your original was custom_loss=['Accuracy'], eval_metric='Accuracy'
                                    verbose=False,
                                    use_best_model=True,
                                    random_seed=i,
                                    early_stopping_rounds=10 if iterations > 20 else None)
        
        meta_m.fit(X_prop_train, y_prop_train, eval_set=(X_cb_eval, y_cb_eval), plot=False)
        
        # --- Beginning manual implementation of an analogue of the conformal approach ---

        # 1. Obtain predicted probabilities for the calibration set
        calib_probs = meta_m.predict_proba(X_calib)
        
        # 2. Compute "non-conformity estimates" for the calibration set
        # Score = 1 - P(true_mark)
        calib_scores = np.zeros(len(y_calib))
        # Use y_calib.values or y_calib.to_numpy() for numpy operations if y_calib is a Series
        y_calib_numpy = y_calib.to_numpy() if isinstance(y_calib, pd.Series) else y_calib

        for k_idx, true_label_val in enumerate(y_calib_numpy):
            true_label_idx = int(true_label_val)
            if 0 <= true_label_idx < calib_probs.shape[1]:
                calib_scores[k_idx] = 1.0 - calib_probs[k_idx, true_label_idx]
            else:
                calib_scores[k_idx] = 1.0 # Maximum non-conformity

        if len(calib_scores) == 0:
            print(f"Warning: the calibration set is empty or scores could not be computed in iteration {i+1}. Skipping this iteration.")
            continue

        # 3. Prediction for current_sample_df (coreset) and calculation of "empirical p-values"
        X_coreset_features = current_sample_df[feature_column_names]
        y_coreset_true_labels = current_sample_df['labels']
        
        coreset_probs = meta_m.predict_proba(X_coreset_features)
        
        calculated_p_values = np.zeros(len(y_coreset_true_labels))
        y_coreset_true_labels_numpy = y_coreset_true_labels.to_numpy() if isinstance(y_coreset_true_labels, pd.Series) else y_coreset_true_labels

        for k in range(len(y_coreset_true_labels_numpy)):
            true_label_class_idx = int(y_coreset_true_labels_numpy[k])
            
            current_score = 1.0 
            if 0 <= true_label_class_idx < coreset_probs.shape[1]:
                current_score = 1.0 - coreset_probs[k, true_label_class_idx]
            else:  # label out of range (unlikely for 0/1 labels): treat as maximally non-conformal
                pass  # current_score is already 1.0

            p_val = (np.sum(calib_scores >= current_score) + 1.0) / (len(calib_scores) + 1.0)
            calculated_p_values[k] = p_val

        # --- End of manual implementation ---
        
        suspicious_mask = calculated_p_values < epsilon_threshold
        suspicious_original_indices = current_sample_df.index[suspicious_mask]
        
        flagged_bad_samples_df = current_sample_df.loc[suspicious_original_indices]
        
        diff_negatives_w_indices = flagged_bad_samples_df[flagged_bad_samples_df['labels'] == 0].index
        diff_negatives_t_indices = flagged_bad_samples_df[flagged_bad_samples_df['labels'] == 1].index
        
        idx_type_cast = pd.DatetimeIndex if isinstance(data.index, pd.DatetimeIndex) else pd.Index
        # Accumulate with repeats (append rather than union), so that value_counts() below
        # reflects how many iterations flagged each index as suspicious
        BAD_WAIT = BAD_WAIT.append(idx_type_cast(diff_negatives_w_indices))
        BAD_TRADE = BAD_TRADE.append(idx_type_cast(diff_negatives_t_indices))

    # Initialise to_mark_w and to_mark_t with empty Series if BAD_WAIT/BAD_TRADE are empty
    # to avoid errors when calling .mean() or accessing by index
    to_mark_w = pd.Series(BAD_WAIT.value_counts()) if not BAD_WAIT.empty else pd.Series(dtype='int64')
    to_mark_t = pd.Series(BAD_TRADE.value_counts()) if not BAD_TRADE.empty else pd.Series(dtype='int64')

    threshold_w = (to_mark_w.mean() * bad_samples_fraction) if not to_mark_w.empty else 0
    threshold_t = (to_mark_t.mean() * bad_samples_fraction) if not to_mark_t.empty else 0
    
    marked_idx_w = to_mark_w[to_mark_w > threshold_w].index if not to_mark_w.empty else pd.Index([])
    marked_idx_t = to_mark_t[to_mark_t > threshold_t].index if not to_mark_t.empty else pd.Index([])

    data['meta_labels'] = 1.0
    data.loc[data.index.isin(marked_idx_w), 'meta_labels'] = 0.0
    data.loc[data.index.isin(marked_idx_t), 'meta_labels'] = 0.0
    
    return data
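
For reference, a minimal usage sketch (hypothetical parameter values, not from the original post; it assumes the author's get_prices/get_features/get_labels_one_direction helpers and the global hyper_params dict are already defined):

# Hypothetical call; the argument values are illustrative only
cleaned = meta_learners_manual_conformal_analogue(models_number=25,
                                                  iterations=500,
                                                  depth=4,
                                                  bad_samples_fraction=0.5,
                                                  epsilon_threshold=0.1)

# Rows flagged as suspicious get meta_labels == 0.0; here they are simply filtered out for illustration
filtered = cleaned[cleaned['meta_labels'] == 1.0]
print(f"Kept {len(filtered)} of {len(cleaned)} rows after the conformal-style filtering")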
 
General philosophy: Identifying "suspicious" labels using ensemble and conformal prediction emulation

The main goal of this code is to identify and label data instances whose source labels look "incorrect" or "anomalous" in terms of machine learning models. Rather than blindly trusting the original markup, the code tries to find those data instances that systematically "confuse" the models.

The key idea is to:

Use an ensemble of models (meta-learning): Train multiple models (models_number) on different subsamples of the data. If different models trained on different slices of data consistently find a particular example "weird" (in the context of its label), this reinforces the belief that something is wrong with the label.
Emulate conformal predictions (manually): Instead of using an off-the-shelf nonconformist library, the code implements the basic logic of conformal analysis. This allows you to assess how "typical" or "non-conformal" each example in the calibration set is with respect to the model's predictions, and then use this information to assess the "strangeness" of other examples.
Focus on probabilities: Instead of simply predicting classes, the code analyses the predicted probabilities. "Non-conformity" here is defined as 1 - P(true_label). That is, if the model is confident about the true label (high probability), the example is conformal; if the model is not confident or assigns a high probability to a different class, the example is non-conformal (a tiny numeric sketch follows this list).
Use thresholding and aggregation: The decision whether to mark a label as "bad" is not based on a single prediction, but on how often a given example was flagged as "suspicious" across all iterations of the ensemble, and whether that count exceeded a certain threshold.
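
A tiny numeric sketch of the non-conformity idea (the probabilities below are invented for illustration, not taken from the author's data):

import numpy as np

# Hypothetical predicted probabilities for three examples of a binary task: columns = P(class 0), P(class 1)
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.7, 0.3]])
true_labels = np.array([0, 1, 1])  # the third example's label disagrees with the model

# Non-conformity score = 1 - P(true label): low when the model agrees with the label
scores = 1.0 - probs[np.arange(len(true_labels)), true_labels]
print(scores)  # roughly [0.1, 0.2, 0.7] -> the third example looks "non-conformal"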

Structure and key steps:

Data preparation (hyper_params, dataset, data):
Philosophy: Start by obtaining and filtering relevant data. Global hyper_params drive this process by defining timelines and other parameters for feature and label extraction.
Key point: Validation and checks for empty data or insufficient columns are a manifestation of defensive programming, preventing errors in later steps.

Main loop (by models_number):
Philosophy: Each iteration of the loop represents training one model from the ensemble and evaluating the "suspiciousness" of labels using it.
Steps within the loop:
Sampling (current_sample_df):
Philosophy: Create diversity in the training data for each model. This helps to make the ensemble more robust and reduce the impact of outliers or specific features of a single subsample. Each model sees a slightly different "view" of the data.
Data partitioning (X_prop_train, X_calib, X_cb_eval):
Philosophy: This is the heart of the emulation of the conformal approach.
X_prop_train, y_prop_train: The "right" training set. The model (CatBoost) will learn from it.
X_calib, y_calib: The calibration set. It is not used to train the main model. Instead, it is used to see how the model estimates "non-conformality" on examples it has not seen during training. This allows a baseline level of "expected non-conformality" to be established.
X_cb_eval, y_cb_eval: Set for internal validation of CatBoost (for eval_metric and early_stopping_rounds). Helps to get the best version of the CatBoost model.
Important: Stratification at partitioning (stratify=...) is used to preserve the distribution of classes in each subsample, which is critical for classification tasks, especially with unbalanced classes.
Model training (CatBoostClassifier):
Philosophy: Using gradient boosting (CatBoost) as a powerful classifier. use_best_model=True and early_stopping_rounds help prevent overfitting and select the optimal model based on X_cb_eval.
Manual implementation of the analogue of the conformal approach:
Step 1: Probabilities for the calibration set (calib_probs): We obtain the probabilities predicted by the model for each class on the calibration set.
Step 2: Nonconformity estimates for calibration (calib_scores):
Philosophy: For each example on the calibration set, a "non-conformity score" is computed. In this code, it is defined as 1.0 - calib_probs[k_idx, true_label_idx]. That is, the lower the probability assigned to the true class, the higher the non-conformity score.
These scores form a distribution that shows how "wrong" the model's predictions can be on data that it did not see during training, but for which we know the true labels.
Step 3: Prediction for current_sample_df and computing "empirical p-values":
Philosophy: We now want to evaluate each example from the current subsample (current_sample_df, which here acts as the "coreset" or test set for this iteration of conformal analysis).
For each example k from current_sample_df:
Its own non-conformality score (current_score) is computed using the same principle: 1.0 - coreset_probs[k, true_label_class_idx].
This current_score is compared to the calib_scores distribution.
"Empirical p-value" (p_val) is calculated as the fraction of examples in the calibration set (plus one) that have a non-conformity score at least as high (>=) as the current_score of the current example. The formula (np.sum(calib_scores >= current_score) + 1.0) / (len(calib_scores) + 1.0) is the standard formula for obtaining p-values in conformal prediction (adding 1 to the numerator and denominator for correctness and avoiding division by zero/p-value=0).
Interpretation of p-value: A small p-value means that the current example (with its true label) is more "non-conformal" (more "weird") than most examples in the calibration set.
Identification of "suspicious" (suspicious_mask, flagged_bad_samples_df):
Philosophy: Samples whose p-value is below a given threshold (epsilon_threshold) are considered "suspicious". The epsilon_threshold is a significance level; for example, 0.1 means that we flag examples that are among the 10% "most non-conformal" relative to the calibration set.
The indices of these suspicious examples are added to BAD_WAIT (examples with label 0) or BAD_TRADE (examples with label 1).
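
A standalone toy computation of that p-value formula, with invented calibration scores (not the author's data):

import numpy as np

# Hypothetical non-conformity scores of the calibration examples
calib_scores = np.array([0.05, 0.10, 0.20, 0.30, 0.60])

def empirical_p_value(current_score, calib_scores):
    # Fraction of calibration scores that are at least as large as the current score
    return (np.sum(calib_scores >= current_score) + 1.0) / (len(calib_scores) + 1.0)

print(empirical_p_value(0.15, calib_scores))  # (3 + 1) / (5 + 1) ≈ 0.67 -> conformal, label looks fine
print(empirical_p_value(0.90, calib_scores))  # (0 + 1) / (5 + 1) ≈ 0.17 -> fairly non-conformal
# With epsilon_threshold = 0.1, neither would be flagged here; with a larger calibration set,
# a score above all calibration scores would push the p-value below 0.1.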

Aggregation of results and final labelling:
Philosophy: After all iterations, we aggregate information about which examples were labelled as "suspicious" and how often.
to_mark_w and to_mark_t (built from BAD_WAIT.value_counts() and BAD_TRADE.value_counts()): count how many times each unique index was marked as suspicious for class 0 and class 1, respectively.
Thresholds (threshold_w, threshold_t):
Philosophy: Instead of flagging everything that has been deemed suspicious at least once, a stricter criterion is introduced. The threshold is calculated from the average frequency of suspicious hits and the bad_samples_fraction parameter. This is a heuristic to weed out random hits and focus on systematically "bad" labels.
Only samples that have been flagged as suspicious more often than this threshold are finally recognised as having an incorrect label (a toy illustration of this aggregation follows this list).
Final labelling (data['meta_labels']):
Initially, all labels are considered correct (meta_labels = 1.0).
Then, for identified "bad" examples, meta_labels are set to 0.0.
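
A toy illustration of the frequency-plus-threshold aggregation (hypothetical timestamps and counts, with bad_samples_fraction = 0.5):

import pandas as pd

# Hypothetical: how many ensemble iterations flagged each timestamp as suspicious (class 0 side)
to_mark_w = pd.Series({'2024-01-05': 7, '2024-02-11': 1, '2024-03-02': 5, '2024-03-20': 2})

bad_samples_fraction = 0.5
threshold_w = to_mark_w.mean() * bad_samples_fraction  # 3.75 * 0.5 = 1.875

marked_idx_w = to_mark_w[to_mark_w > threshold_w].index
print(list(marked_idx_w))  # ['2024-01-05', '2024-03-02', '2024-03-20'] -> these rows get meta_labels = 0.0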

Key philosophical pillars of the code:

Distrust of source data: It is assumed that source labels may contain errors.
Ensemble strength: The judgement of multiple models trained on different data is more reliable than the judgement of a single model.
Statistical evaluation of "strangeness": Conformal prediction emulation provides a statistically valid way to measure how much a given example (with its label) sticks out from the overall picture learnt by the model.
Iterative refinement: The process is repeated, and information about "bad" labels is accumulated.
Pragmatic heuristics: The final decision about a bad label is made based on frequency and a threshold (bad_samples_fraction), which is a practical compromise.
Defensive programming: Lots of checks on data sizes, possibility of stratification, etc. to make the code more robust to different inputs.

Possible areas for thought (not criticism, but aspects to think about):

Choice of non-conformity measure: 1 - P(true_label) is one common measure. There are others, for example margin-based scores (a sketch follows this list).
Sensitivity to hyperparameters: epsilon_threshold, bad_samples_fraction, sample_frac (for subsamples) - these parameters can significantly affect the result, and their optimal choice may require experimentation.
Handling multiclass classification: The current code, especially the part with BAD_WAIT and BAD_TRADE and the final labelling, is clearly geared towards binary classification (labels 0 and 1). An adaptation would be required for a multiclass task.
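
For instance, a margin-based score, an alternative measure not used in the code above (the probabilities below are invented for illustration):

import numpy as np

# Hypothetical probabilities for a binary task: columns = P(class 0), P(class 1)
probs = np.array([[0.90, 0.10],
                  [0.55, 0.45]])
true_labels = np.array([0, 1])

# Margin-based non-conformity: probability of the best other class minus P(true label).
# Positive when another class out-scores the true one, negative when the true class clearly wins.
p_true = probs[np.arange(len(true_labels)), true_labels]
p_other = np.where(true_labels == 0, probs[:, 1], probs[:, 0])
margin_scores = p_other - p_true
print(margin_scores)  # roughly [-0.8, 0.1] -> the second example's label is borderline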

Overall, this is a thoughtful approach to identifying potentially erroneous labels that attempts to introduce the principles of conformal analysis without direct reliance on specialised libraries, relying on ensemble and statistical evaluation.
 

And the main differences between the generated code and the original one:

In brief, instead of directly comparing the original labels with the predicted labels and counting prediction errors, p-values obtained through a preliminary calibration step are used (a toy contrast sketch follows).
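
To make the contrast concrete, a toy sketch (invented labels; the "direct comparison" part is a plain-language guess at the original approach's core, not the author's actual meta_learners code):

import numpy as np

# Hypothetical predicted and original labels for five samples
predicted = np.array([0, 1, 1, 0, 1])
original  = np.array([0, 1, 0, 0, 0])

# Direct comparison: any disagreement is immediately treated as a "bad" label
direct_bad_mask = predicted != original  # [False, False, True, False, True]
print(direct_bad_mask)

# The conformal-style variant above instead flags a sample only when its empirical p-value,
# computed against the calibration scores, drops below epsilon_threshold, i.e. when the model
# is not just wrong about the label but atypically "non-conformal" relative to the calibration set.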