Gaussian Processes in Machine Learning (Part 1): Classification Model in MQL5

MetaTrader 5 — Statistics and analysis | 16 June 2026, 13:39

Evgeniy Chernish

Introduction

We continue our exploration of Gaussian processes (GP) as a machine learning model. In the previous article, we examined the regression problem in detail, where the main goal was to predict continuous values. Today we turn to a considerably more complex topic: classification. Its main difficulty is that inference for classification in Gaussian processes has no closed-form solution, so approximate methods such as the Laplace approximation are required.

To effectively solve this complex problem, we will develop a modular library of Gaussian processes in MQL5. This approach will allow us to structure the code by separating the GP model into independent components and will provide a solid foundation for further improvements and extensions. This library will become a universal tool for both regression and classification tasks.

In the first part of the article, we will examine in detail the theory of GP classification, including the mathematics underlying the approximate methods. We will also introduce the main class of the library — GaussianProcess, which will unite all components of the model, as well as the GPOptimizationObjective class responsible for integration with the Alglib optimization library.

Classification

Classification is a machine learning task that involves assigning one of the predefined categories to an object. For example, in finance, classification can help predict whether a stock price will rise or fall based on historical data.

In this article, we will focus on binary classification, where an object belongs to one of two classes, such as "rise" (+1) or "fall" (-1). Unlike methods such as support vector machines (SVM) or decision trees, which in their basic form output only a class label, GPs produce probabilistic predictions. For example, a model might say that there is a 75% chance that a stock will rise. Such information is especially valuable in trading, where the degree of confidence in a prediction helps make informed decisions and filter out unreliable signals.

Unfortunately, solving a classification problem using GP is significantly more complex than regression. This is related to the type of likelihood used:

In regression, Gaussian likelihood is typically used. The combination of the GP (as prior distribution of the function) and the Gaussian likelihood allows us to obtain the posterior distribution analytically, which simplifies all calculations.
For classification where the targets are discrete class labels, Gaussian likelihood is not suitable. Instead, one may use, for example, the logit likelihood. This results in the posterior distribution also not being Gaussian and not having a closed-form solution.

As a consequence, we have to resort to complex methods of approximate inference. The basic idea of these methods is to approximate a true non-Gaussian posterior distribution with a Gaussian distribution centered at its mode. In this article, we will focus on the Laplace approximation, as it is one of the simplest and most effective approaches for obtaining a Gaussian approximation of the posterior distribution.

For binary classification, the underlying idea of GP-based prediction is quite simple. We start with a prior distribution of the latent functions f(x). Imagine that the GP generates not just one function, but an infinite set of possible functions, each of which is a potential "latent" dependency in the data. Then each of these potential realizations of the latent function f(x) is "passed" through the logistic function (sigmoid). The sigmoid transforms any real number (the value of f(x)) into a probability between 0 and 1, which will be our prior probability π(x) of belonging to the "+1" class:

$$\pi(x) \triangleq p(y = +1 \mid x) = \sigma(f(x))$$

It is important to note that π is a deterministic function of f, but since f itself is stochastic (random, a sample from the GP), then the function π also becomes stochastic. This concept is clearly illustrated in Fig. 1 and 2 for one-dimensional input space X.

Fig. 1. Realization of the latent f(x) function

Figure 1 shows just one possible realization of the latent function, illustrating the typical behavior of functions drawn from a GP with the given kernel hyperparameters.
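A realization like the one described here can be drawn by factorizing the prior covariance matrix and multiplying the factor by a vector of standard normal numbers. The script below is a minimal sketch of that idea, not the library's SamplePriorGP method: the RBF kernel form, the hyperparameter values, the jitter term, and the helper names RandNormal, CholeskyLower and SamplePriorRBF are all assumptions made for illustration.

```mql5
// Sketch only: draws one realization f ~ N(0, K) on a 1-D grid.
// The RBF kernel form and the values sigma_f = 1, length = 1 are assumptions.

// Standard normal variate via the Box-Muller transform
double RandNormal()
{
   double u1 = (MathRand() + 1.0) / 32769.0; // strictly inside (0,1), so log(u1) is finite
   double u2 = (MathRand() + 1.0) / 32769.0;
   return MathSqrt(-2.0 * MathLog(u1)) * MathCos(2.0 * M_PI * u2);
}

// Lower-triangular Cholesky factor L of a symmetric positive definite K (K = L * L')
bool CholeskyLower(const matrix &K, matrix &L)
{
   int n = (int)K.Rows();
   L = matrix::Zeros(n, n);
   for(int j = 0; j < n; j++)
   {
      double d = K[j][j];
      for(int k = 0; k < j; k++)
         d -= L[j][k] * L[j][k];
      if(d <= 0.0)
         return false;               // matrix is not positive definite
      L[j][j] = MathSqrt(d);
      for(int i = j + 1; i < n; i++)
      {
         double s = K[i][j];
         for(int k = 0; k < j; k++)
            s -= L[i][k] * L[j][k];
         L[i][j] = s / L[j][j];
      }
   }
   return true;
}

// Builds the RBF covariance matrix on the grid x and samples f = L * z, z ~ N(0, I)
bool SamplePriorRBF(const vector &x, const double sigma_f, const double length, vector &f)
{
   int n = (int)x.Size();
   matrix K = matrix::Zeros(n, n);
   for(int i = 0; i < n; i++)
      for(int j = 0; j < n; j++)
      {
         double d = x[i] - x[j];
         K[i][j] = sigma_f * sigma_f * MathExp(-0.5 * d * d / (length * length));
      }
   for(int i = 0; i < n; i++)
      K[i][i] += 1e-6;               // small jitter for numerical stability
   matrix L;
   if(!CholeskyLower(K, L))
      return false;
   vector z = vector::Zeros(n);
   for(int i = 0; i < n; i++)
      z[i] = RandNormal();
   f = L.MatMul(z);
   return true;
}

void OnStart()
{
   MathSrand((int)GetTickCount());
   int n = 25;
   vector x = vector::Zeros(n);
   for(int i = 0; i < n; i++)
      x[i] = -5.0 + 10.0 * i / (n - 1);
   vector f;
   if(SamplePriorRBF(x, 1.0, 1.0, f))
      Print("Sample of f(x) on the grid: ", f);
   else
      Print("Cholesky factorization failed");
}
```

Each run produces a different curve; a shorter length scale gives a more rapidly varying function, a larger sigma_f a larger amplitude.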

Fig. 2. The same function transformed using the sigmoid

Fig. 2 demonstrates the result of applying the logistic (sigmoid) function to the same function f(x):

$$\pi(x) = \sigma(f(x)) = \frac{1}{1 + \exp(-f(x))}$$

Thus, we obtain a prior probability distribution of class membership π(x)=σ(f(x)), which at this stage does not yet take into account the training data y. Without observations of y, this prior distribution remains just our initial hypothesis, unsupported by empirical evidence; without it, the model lacks information about which of its initial assumptions were correct and which require revision.
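The mapping from a latent value to a class probability is a single function call. The sketch below shows one way to write it so that it does not overflow for large negative arguments; the function name Sigmoid and the sample latent values are illustrative and are not taken from the library.

```mql5
// Numerically stable logistic function: maps any real z to (0, 1)
double Sigmoid(const double z)
{
   if(z >= 0.0)
      return 1.0 / (1.0 + MathExp(-z));
   double e = MathExp(z);            // z < 0, so e is in (0, 1) and cannot overflow
   return e / (1.0 + e);
}

void OnStart()
{
   // Hypothetical values of the latent function f(x) at five points
   double f_values[] = {-4.0, -1.0, 0.0, 1.0, 4.0};
   for(int i = 0; i < ArraySize(f_values); i++)
      PrintFormat("f = %+.1f  ->  pi = %.4f", f_values[i], Sigmoid(f_values[i]));
}
```

A latent value of zero corresponds to a probability of exactly 0.5, and the sign of f(x) determines which class is more likely.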

Naturally, the choice of prior assumptions significantly influences the final posterior results. This is a key feature of the Bayesian approach, since the properties of the prior distribution of functions, and therefore the final model, depend on the researcher's decision on the kernel type.

Inference

So, to make informed predictions, we need to take into account real training data y. This is where inference comes into play. Its main goal is to transform our prior beliefs into posterior ones, that is, ones adjusted to take into account observed data. For classification, this process naturally divides into two sequential steps.

Step 1: Predictive distribution of the latent function f∗

In the first step, we compute p(f*|X, y, x*), the posterior distribution of the latent function f* at a new test point x* given the observed training data (X, y). It is defined by the following integral:

$$p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, f)\, p(f \mid X, y)\, df$$

where:

p(f*∣X, x*, f) is the conditional distribution of the latent function f* at a new test point x* given the latent functions f at the training points X. This distribution is always normal, since the GP by definition has a joint normal distribution,
p(f|X, y) is the posterior distribution of the latent functions f on the training data. Due to the nonlinear likelihood function (sigmoid), it is not Gaussian.

It is important to note that since p(f|X, y) is not normal, this integral has no closed-form solution. This means that we will need approximate methods to calculate it.

Step 2: Final predictive probability π*

In the second step, we use this predictive distribution to form a final probability prediction π* — the probability that a test point x* belongs to the positive class (y* = +1):

$$\pi_* \triangleq p(y_* = +1 \mid X, y, x_*) = \int \sigma(f_*)\, p(f_* \mid X, y, x_*)\, df_*$$

Here σ(f*) is the logistic (sigmoid) function, which transforms the value of the latent function f* into a probability between 0 and 1. The integral itself means that we average these probabilities over all possible values of f*, weighted by their posterior predictive distribution. In essence, this one-dimensional integral is the mathematical expectation of the function σ(f*) with respect to the distribution p(f*|X, y, x*).

Again, for the logit likelihood this integral has no closed-form solution. Therefore, here too we will need approximate methods. Looking ahead, we will say that our GP library implements three such approximations, which allows you to choose the appropriate method depending on the requirements for accuracy and computational costs:

probit approximation,
numerical integration,
Monte Carlo method.

These two steps we just described — computing the posterior distribution of the latent function and then integrating to obtain the predictive probability — represent the general framework for Bayesian inference in GP. These are the two integrals that we should calculate to obtain the desired prediction, and both require the use of approximate methods.

Laplace approximation

As we have already found out, Bayesian inference for classification involves integrals that are not solvable in closed form. The Laplace approximation solves this problem by approximating the non-Gaussian distribution p(f∣X, y) with the Gaussian distribution q(f∣X, y). Since the conditional distribution p(f*∣X, x*, f) is also Gaussian, the resulting predictive distribution p(f*∣X, y, x*) also becomes Gaussian. This allows us to derive analytical formulas for the mean and variance of f*, which significantly simplifies further calculations. Thus, the beauty and computational efficiency of the Laplace approximation lies in its ability to reduce the computation of the posterior distribution and predictions to operations on Gaussian distributions.

It is important to understand that the Laplace approximation is a compromise. It turns a problem with no closed-form solution into a computationally tractable one, but at the expense of accurately representing the true shape of the posterior distribution. The quality of this normal approximation depends directly on how close the true distribution p(f∣X, y) is to normal: the closer it is, the more accurate the approximation.

If we are interested in the true distribution of p(f*∣X, y, x*), and not its approximation, then MCMC (Markov Chain Monte Carlo) methods are usually used for this. Although the MCMC method can provide more accurate estimates, it is computationally very expensive and difficult to implement. MCMC can be used as a gold standard for comparison with approximate inference methods.

Now let's take a closer look at what Laplace approximation is. This approximation is built around the mode (maximum) of the true posterior distribution p(f∣X, y). It uses a second order Taylor expansion of the logarithm of the posterior density around this mode. Mathematically, we approximate the logarithm of the posterior density as follows:

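$$\log p(f \mid X, y) \approx \log p(\hat{f} \mid X, y) - \tfrac{1}{2}(f - \hat{f})^\top A\,(f - \hat{f}), \qquad q(f \mid X, y) = \mathcal{N}(f \mid \hat{f}, A^{-1})$$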

where:

q(f∣X, y) is a Gaussian approximation for the posterior distribution p(f∣X, y),
f_hat = argmax(f) p(f|X, y) — mode of the posterior distribution,
A = −∇∇ log p(f|X, y)|f=f_hat - Hessian of the negative logarithm of the posterior distribution at the mode point.

First of all, to perform the Laplace approximation, we need to find the most probable value of the latent function f, that is, the mode f_hat. To obtain the posterior p(f∣X, y), we use Bayes' rule. We already know that this rule relates the posterior distribution to the likelihood p(y∣f), the prior p(f∣X), and the marginal likelihood p(y∣X) as follows:

$$p(f \mid X, y) = \frac{p(y \mid f)\, p(f \mid X)}{p(y \mid X)}$$

To maximize p(f∣X, y) with respect to f, we do not need to know the normalization constant p(y∣X), since it does not depend on f and, therefore, does not affect the position of the maximum. Therefore, we can work with an unnormalized posterior distribution, which is proportional to the product of the likelihood and the prior p(y∣f)p(f∣X).

To simplify the calculations and avoid numerical problems with very small probability values, we take the logarithm of this unnormalized posterior distribution. Due to the property of logarithms, the product of probabilities becomes the sum of their logarithms:

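$$\Psi(f) \triangleq \log p(y \mid f) + \log p(f \mid X) = \log p(y \mid f) - \tfrac{1}{2} f^\top K^{-1} f - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log 2\pi$$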

Ψ(f) is the objective function that we will maximize using Newton's method to find the mode of the latent function. Newton's method requires calculating the first and second derivatives of Ψ(f) with respect to f.

Differentiating this equation with respect to f, we obtain:

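$$\nabla\Psi(f) = \nabla\log p(y \mid f) - K^{-1} f, \qquad \nabla\nabla\Psi(f) = \nabla\nabla\log p(y \mid f) - K^{-1} = -W - K^{-1}$$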

where:

W = −∇∇ log p(y|f) - negative Hessian of the log-likelihood, which is a diagonal matrix.

Once we have calculated the gradient and Hessian, we iteratively find the mode using Newton's method:

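$$f^{\text{new}} = f - (\nabla\nabla\Psi)^{-1}\nabla\Psi = (K^{-1} + W)^{-1}\left(W f + \nabla\log p(y \mid f)\right)$$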

At each iteration, Newton's method updates our current mode guess in the direction determined by the gradient and the Hessian until convergence is achieved.
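The Newton iteration can be sketched in a few lines for the logistic likelihood. The code below is an illustration under stated assumptions, not the library's LaplaceInference implementation: it uses the update f_new = (K^-1 + W)^-1 (W f + ∇log p(y|f)) rewritten as K (I + W K)^-1 (W f + ∇log p(y|f)) to avoid inverting K, takes full Newton steps without a line search, and inverts a small matrix directly, which is acceptable only for toy sizes. The function names and the 3-point data set are made up.

```mql5
// Numerically stable logistic function
double Sigmoid(const double z)
{
   if(z >= 0.0)
      return 1.0 / (1.0 + MathExp(-z));
   double e = MathExp(z);
   return e / (1.0 + e);
}

// Newton iterations for the mode of Psi(f) with a logistic likelihood.
// K: n x n prior covariance at the training inputs; y: labels in {-1, +1}.
bool FindMode(const matrix &K, const vector &y, vector &f,
              const int max_iter = 50, const double tol = 1e-8)
{
   int n = (int)y.Size();
   f = vector::Zeros(n);
   for(int it = 0; it < max_iter; it++)
   {
      vector b = vector::Zeros(n);          // b = W*f + grad log p(y|f)
      matrix M = matrix::Identity(n, n);    // M = I + W*K
      for(int i = 0; i < n; i++)
      {
         double p = Sigmoid(f[i]);
         double w = p * (1.0 - p);          // W_ii = -d2/df2 log p(y_i|f_i)
         double g = 0.5 * (y[i] + 1.0) - p; // d/df log p(y_i|f_i)
         b[i] = w * f[i] + g;
         for(int j = 0; j < n; j++)
            M[i][j] += w * K[i][j];
      }
      // f_new = (K^-1 + W)^-1 b = K (I + W K)^-1 b
      matrix M_inv = M.Inv();
      vector v = M_inv.MatMul(b);
      vector f_new = K.MatMul(v);
      double step = 0.0;
      for(int i = 0; i < n; i++)
         step = MathMax(step, MathAbs(f_new[i] - f[i]));
      f = f_new;
      if(step < tol)
         return true;                       // converged
   }
   return false;
}

void OnStart()
{
   // Toy 3-point problem: covariance values and labels are made up for illustration
   matrix K = {{1.0, 0.6, 0.2}, {0.6, 1.0, 0.6}, {0.2, 0.6, 1.0}};
   vector y = {1.0, -1.0, 1.0};
   vector f_hat;
   if(FindMode(K, y, f_hat))
      Print("Posterior mode f_hat: ", f_hat);
   else
      Print("Newton iterations did not converge");
}
```

At convergence the gradient of Ψ vanishes, so the result satisfies f_hat = K ∇log p(y|f_hat), which is a convenient sanity check. A production implementation would normally work with a Cholesky factor of B = I + W^(1/2) K W^(1/2) instead of an explicit inverse.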

Once the mode is found, we can calculate the covariance matrix of the approximating Gaussian distribution. This matrix is equal to the inverse of the negative Hessian of Ψ(f) evaluated at the mode f_hat.

Thus, the covariance matrix Σ of our Gaussian approximation is:

$$\Sigma = A^{-1} = (K^{-1} + W)^{-1}$$

This completes the description of the first step of the Laplace approximation - finding a normal approximation of the posterior distribution.

Prediction in Laplace approximation

Once we have obtained q(f∣X, y), we can proceed to the second step of inference - prediction for new test points x∗. At this stage we want to find the predictive distribution p(f∗∣X, y, x∗). Due to the Laplace approximation that made p(f∣X, y) Gaussian (in the form q(f∣X, y)), and the fact that p(f∗∣X, x∗, f) is also Gaussian, the resulting predictive distribution p(f∗∣X, y, x∗) also becomes Gaussian. This allows us to analytically obtain its posterior mean and variance.

The mean value of the latent function f* for a new test point x* (mu_f_star) is calculated as:

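$$\mu_{f_*} = \mathbb{E}_q[f_* \mid X, y, x_*] = k(x_*)^\top K^{-1}\hat{f} = k(x_*)^\top \nabla\log p(y \mid \hat{f})$$

where k(x_*) is the vector of covariances between the test point x_* and the training inputs.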

The variance of the latent function Var(f*) for a new test point x* (Sigma_f_star) is calculated as:

$$\sigma^2_{f_*} = \mathbb{V}_q[f_* \mid X, y, x_*] = k(x_*, x_*) - k(x_*)^\top (K + W^{-1})^{-1} k(x_*)$$

Now that we have the mean and variance of the predictive distribution, we can finally calculate the desired probability of belonging to the class π∗:

$$\pi_* \approx \int \sigma(f_*)\, \mathcal{N}(f_* \mid \mu_{f_*}, \sigma^2_{f_*})\, df_*$$

This formula is the heart of probability prediction in the Laplace-based GP classifier.

You may have noticed that we do not compute the class probabilities simply as σ(E[f∗]), that is, by substituting the posterior mean of f∗ directly into the sigmoid function. This approach, known as MAP-prediction (Maximum A Posteriori Prediction), certainly has the right to exist.

However, by calculating the MAP prediction σ(E[f∗]), we ignore the uncertainty in f*: we simply take the central estimate of f* (the mean) and convert it into a probability. When we calculate E[σ(f*)] (which corresponds to integration), we take into account the entire shape of the distribution of f*. This gives a more accurate and meaningful predictive probability, especially when there is significant uncertainty in f* (i.e. a large variance V[f*]) or when the distribution of f* is asymmetric. This approach is called the averaged predictive probability.

Understanding this difference has important practical implications:

If your only goal is to obtain a binary class label (e.g., "buy" or "sell", +1 or -1), then using the simpler MAP prediction may be sufficient, since it will yield the same label as the more computationally expensive average prediction.
However, if you care about the probabilities themselves, then the averaged predictive probabilities (E[σ(f*)]) are still more accurate, since they fully take into account the uncertainty of the model.
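The difference between the two kinds of prediction is easy to see numerically. The sketch below compares σ(E[f*]) with E[σ(f*)] for a Gaussian f*, computing the expectation in two ways: by trapezoidal integration and by a widely used closed-form approximation that replaces the sigmoid with a rescaled probit function, σ(μ / sqrt(1 + πσ²/8)). Whether the library's PROBIT mode uses exactly this formula is an assumption; the function names and the values of the mean and variance are illustrative.

```mql5
// Numerically stable logistic function
double Sigmoid(const double z)
{
   if(z >= 0.0)
      return 1.0 / (1.0 + MathExp(-z));
   double e = MathExp(z);
   return e / (1.0 + e);
}

// MAP-style prediction: the posterior mean is passed straight through the sigmoid
double PredictMAP(const double mu)
{
   return Sigmoid(mu);
}

// Averaged prediction E[sigma(f*)], f* ~ N(mu, var), by the trapezoidal rule
// on the interval mu +/- 8 standard deviations (var must be positive)
double PredictAveragedNumeric(const double mu, const double var, const int steps = 2000)
{
   double sd  = MathSqrt(var);
   double a   = mu - 8.0 * sd;
   double h   = 16.0 * sd / steps;
   double sum = 0.0;
   for(int k = 0; k <= steps; k++)
   {
      double fv  = a + k * h;
      double z   = (fv - mu) / sd;
      double pdf = MathExp(-0.5 * z * z) / (sd * MathSqrt(2.0 * M_PI));
      double wgt = (k == 0 || k == steps) ? 0.5 : 1.0;
      sum += wgt * Sigmoid(fv) * pdf;
   }
   return sum * h;
}

// Closed-form approximation of the same expectation via a rescaled probit function
double PredictAveragedProbit(const double mu, const double var)
{
   return Sigmoid(mu / MathSqrt(1.0 + M_PI * var / 8.0));
}

void OnStart()
{
   double mu = 1.0, var = 4.0;   // hypothetical mean and variance of f*
   PrintFormat("MAP prediction        : %.4f", PredictMAP(mu));
   PrintFormat("Averaged (numerical)  : %.4f", PredictAveragedNumeric(mu, var));
   PrintFormat("Averaged (probit apx.): %.4f", PredictAveragedProbit(mu, var));
}
```

With a large variance, both averaged values are pulled toward 0.5 relative to the MAP value while staying on the same side of 0.5, so the predicted label is unchanged and only the stated confidence drops.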

In trading, a simple binary class label ("buy" or "sell") is not sufficient. We need the fine gradation of certainty that probabilities provide. The probability value allows us to filter trading signals. A signal with a probability of success of 0.51 (which is only slightly better than random guessing) will have much less value than a signal with a probability of 0.60. This allows the trader to set thresholds for entering a trade. For example, we can decide that trades will only be opened when the probability of success is higher than 0.55 or 0.60, thereby reducing the number of false signals.
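The threshold rule described above can be sketched as a small helper. The function name, the return codes and the default threshold of 0.55 are illustrative choices; the input is assumed to be the predicted probability of the "+1" (rise) class.

```mql5
// Converts P(y = +1) into a trading signal using a confidence threshold:
// +1 = buy, -1 = sell, 0 = stay out of the market
int SignalFromProbability(const double p_up, const double threshold = 0.55)
{
   if(p_up > threshold)
      return 1;                    // confident enough in a rise
   if(1.0 - p_up > threshold)
      return -1;                   // confident enough in a fall
   return 0;                       // probability too close to 0.5, skip the trade
}

void OnStart()
{
   // Hypothetical predicted probabilities of the "+1" class
   double probs[] = {0.51, 0.60, 0.42, 0.30};
   for(int i = 0; i < ArraySize(probs); i++)
      PrintFormat("p_up = %.2f -> signal %d", probs[i], SignalFromProbability(probs[i]));
}
```

Raising the threshold reduces the number of trades and keeps only the signals the model is most certain about.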

Marginal likelihood in Laplace approximation

Now that we understand the inference mechanism in GP for classification, the question arises: how to tune our model for optimal predictions? The answer lies in the log marginal likelihood (LML). This is the objective function we use to optimize the hyperparameters θ of our model. Without calculating it, it is impossible to find the parameters that best explain our data:

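$$\log q(y \mid X, \theta) = -\tfrac{1}{2}\hat{f}^\top K^{-1}\hat{f} + \log p(y \mid \hat{f}) - \tfrac{1}{2}\log|B|$$

where

$$B = I + W^{1/2} K W^{1/2}$$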
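For a small problem the Laplace log marginal likelihood can be evaluated directly from its three terms: the quadratic form -½ f_hat' K^-1 f_hat, the log-likelihood log p(y|f_hat), and -½ log|B| with B = I + W^(1/2) K W^(1/2). The sketch below assumes a logistic likelihood with labels in {-1, +1} and uses an explicit inverse and determinant, which is fine for illustration but not how a numerically careful implementation would do it (a Cholesky factor of B is the usual choice). All names are hypothetical.

```mql5
// Numerically stable logistic function
double Sigmoid(const double z)
{
   if(z >= 0.0)
      return 1.0 / (1.0 + MathExp(-z));
   double e = MathExp(z);
   return e / (1.0 + e);
}

// log p(y|f) = log sigmoid(y*f) for one observation, written to avoid overflow
double LogitLogLik(const double y, const double f)
{
   double z = y * f;
   if(z >= 0.0)
      return -MathLog(1.0 + MathExp(-z));
   return z - MathLog(1.0 + MathExp(z));
}

// Laplace approximation of the log marginal likelihood at a given mode f_hat
double LaplaceLML(const matrix &K, const vector &y, const vector &f_hat)
{
   int n = (int)y.Size();
   vector sw = vector::Zeros(n);            // square roots of the diagonal of W
   double loglik = 0.0;
   for(int i = 0; i < n; i++)
   {
      double p = Sigmoid(f_hat[i]);
      sw[i] = MathSqrt(p * (1.0 - p));
      loglik += LogitLogLik(y[i], f_hat[i]);
   }
   matrix B = matrix::Identity(n, n);       // B = I + W^(1/2) K W^(1/2)
   for(int i = 0; i < n; i++)
      for(int j = 0; j < n; j++)
         B[i][j] += sw[i] * K[i][j] * sw[j];
   matrix K_inv = K.Inv();
   vector a = K_inv.MatMul(f_hat);          // a = K^-1 f_hat
   double quad = f_hat.Dot(a);
   return -0.5 * quad + loglik - 0.5 * MathLog(B.Det());
}

void OnStart()
{
   // Toy inputs; f_hat would normally come from the Newton iterations
   matrix K = {{1.0, 0.0}, {0.0, 1.0}};
   vector y = {1.0, -1.0};
   vector f_hat = {0.0, 0.0};
   PrintFormat("Laplace LML = %.6f", LaplaceLML(K, y, f_hat));
}
```

The optimizer works with the negative of this value, so a larger LML corresponds to a smaller objective.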

Having defined the objective function, the next step is to calculate its partial derivatives with respect to the hyperparameters θ, because we will optimize using analytical gradients. Compared with gradients obtained by numerical differentiation, this speeds up calculations several times and lets the optimizer move more efficiently and accurately toward the minimum of the negative log marginal likelihood (NLML).

The LML gradient consists of an explicit and an implicit part:

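$$\frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j} = \left.\frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j}\right|_{\text{explicit}} + \sum_{i=1}^{n} \frac{\partial \log q(y \mid X, \theta)}{\partial \hat{f}_i}\, \frac{\partial \hat{f}_i}{\partial \theta_j}$$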

Formula for calculating the explicit part:

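$$\left.\frac{\partial \log q(y \mid X, \theta)}{\partial \theta_j}\right|_{\text{explicit}} = \tfrac{1}{2}\hat{f}^\top K^{-1}\frac{\partial K}{\partial \theta_j}K^{-1}\hat{f} - \tfrac{1}{2}\operatorname{tr}\!\left((W^{-1} + K)^{-1}\frac{\partial K}{\partial \theta_j}\right)$$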

Here the main problem is to calculate the derivative of the kernel matrix K with respect to each hyperparameter. We will deal with the implementation of derivatives for the selected kernel function in the second part of the article.

The implicit part consists of two factors. The first factor in the implicit part is found using the formula:

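$$\frac{\partial \log q(y \mid X, \theta)}{\partial \hat{f}_i} = -\tfrac{1}{2}\left[(K^{-1} + W)^{-1}\right]_{ii}\frac{\partial^3}{\partial f_i^3}\log p(y \mid \hat{f})$$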

To calculate this formula, we will have to calculate the third derivative of the logarithm of the likelihood.
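For the logistic likelihood with labels in {-1, +1}, all three derivatives have simple closed forms in terms of π = σ(f): the first is (y + 1)/2 − π, the second is −π(1 − π), and the third is −π(1 − π)(1 − 2π). The sketch below implements them and checks the first derivative against a central finite difference. The function names are hypothetical, and the library's LogitLikelihood class may organize this differently.

```mql5
// Numerically stable logistic function
double Sigmoid(const double z)
{
   if(z >= 0.0)
      return 1.0 / (1.0 + MathExp(-z));
   double e = MathExp(z);
   return e / (1.0 + e);
}

// log p(y|f) = log sigmoid(y*f), y in {-1, +1}
double LogitLogLik(const double y, const double f)
{
   double z = y * f;
   if(z >= 0.0)
      return -MathLog(1.0 + MathExp(-z));
   return z - MathLog(1.0 + MathExp(z));
}

// First, second and third derivatives of log p(y|f) with respect to f
void LogitDerivatives(const double y, const double f, double &d1, double &d2, double &d3)
{
   double p = Sigmoid(f);
   d1 = 0.5 * (y + 1.0) - p;               // gradient term
   d2 = -p * (1.0 - p);                    // equals -W_ii
   d3 = -p * (1.0 - p) * (1.0 - 2.0 * p);  // needed for the implicit part of the LML gradient
}

void OnStart()
{
   double y = -1.0, f = 0.7, h = 1e-5;     // arbitrary test point
   double d1, d2, d3;
   LogitDerivatives(y, f, d1, d2, d3);
   double d1_num = (LogitLogLik(y, f + h) - LogitLogLik(y, f - h)) / (2.0 * h);
   PrintFormat("analytic d1 = %.8f, finite difference = %.8f", d1, d1_num);
   PrintFormat("d2 = %.8f, d3 = %.8f", d2, d3);
}
```

The same finite-difference comparison is a cheap way to validate analytical gradients of any new likelihood before handing them to the optimizer.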

The second factor of the implicit part is calculated as follows:

$$\frac{\partial \hat{f}}{\partial \theta_j} = (I + K W)^{-1}\frac{\partial K}{\partial \theta_j}\nabla\log p(y \mid \hat{f})$$

In conclusion, we note that NLML is needed not only for estimating hyperparameters, but also for comparing different models (for example, with different types of kernels). Models with a lower NLML value are considered better because it means a higher marginal likelihood, meaning that the model explains the observed data better.

In addition, by using the marginal likelihood to optimize hyperparameters, GPs automatically address the trade-off between data fit and model complexity. NLML naturally penalizes overly complex models, which helps to prevent overfitting. Thanks to this, there is usually no need for separate early-stopping criteria of the kind used, for example, when training neural networks: NLML optimization itself seeks the balance. This is one of the main advantages of the Bayesian approach to Gaussian processes.

Gaussian process library

Now that we have covered all the necessary theoretical concepts, let's move on to practical implementation. Our main goal is to create a universal GP library in MQL5 that will serve as a reliable tool for prediction tasks. This library will have a modular architecture, where the GP model is broken down into independent, interchangeable components, which will allow for easy expansion of its capabilities and ensure ease of maintenance. It will be developed taking into account the following key functional features:

Flexibility in kernel selection: the ability to easily connect existing covariance kernels, as well as create their combinations (SumKernel, ProductKernel) to model more complex dependencies;
Support for various likelihood functions;
Support for various posterior distribution inference methods for classification and regression problems;
Multifunctionality: the library must be universal, which will allow solving both regression and binary classification problems;
Hyperparameter optimization: using analytical gradients to improve the speed and accuracy of the training process. Integration with the Alglib library should ensure efficient optimization of model hyperparameters.

Let's take a closer look at the library structure. It consists of six main components, each of which implements specific functionality:

The GaussianProcess class is the central hub of the library, managing the entire lifecycle of a GP model — from initialization and hyperparameter optimization to performing predictions on new data.

GPOptimizationObjective class: This auxiliary class serves as a "bridge" between our library and the Alglib optimization library. It adapts the objective function and its gradient to the format required by Alglib (via inheritance from CNDimensional_Grad).

IKernel interface: Defines a set of methods for various covariance functions (kernels). It includes such implementations as RBFKernel, LinearKernel, PeriodicKernel, and their combinations (SumKernel, ProductKernel).

ILikelihood interface: Defines a set of methods for likelihood functions. Implementations include GaussianLikelihood for regression and LogitLikelihood for binary classification.

IInference interface: Provides methods for inferring the posterior distribution of the latent GP function. Currently, ExactInference and LaplaceInference are implemented.

Auxiliary Structures and Utilities (StructUtils.mqh): A set of common enumerations, data structures (for inference and prediction results), and functions needed to work with data, matrices, and graphs to visualize results.

Thanks to this modular structure and well-defined interfaces, we can easily add new kernels, inference methods, and likelihood functions, allowing for easy future development of the library.

GaussianProcess class

The GaussianProcess class is the central class of the library. It encapsulates all the logic needed to build, train, and predict a GP model. Designed using the composition principle, GaussianProcess does not directly contain kernel, likelihood, or inference functionality. Instead, it integrates these components using three main interfaces:

kernel (IKernel),
likelihood function (ILikelihood),
inference method (IInference).

This allows the GP model to be flexibly adapted to various prediction tasks without changing the underlying GaussianProcess class.

//+------------------------------------------------------------------+
//| Gaussian process class                                           |
//+------------------------------------------------------------------+
class GaussianProcess
{
private:
    IKernel*      m_kernel;       // pointer to the selected kernel
    ILikelihood*  m_likelihood;   // pointer to the selected likelihood function
    IInference*   m_inference;    // pointer to the selected inference method
    matrix        m_X_train;      // Training input data Nxd
    vector        m_y_train;      // Training target data Nx1

    GPInferenceResult m_last_inference_result; // Structure storing the latest inference results
    
    int m_last_termination_type;  // Optimization operation completion code
    int m_last_iterations_count;  // Number of iterations performed by the optimizer
    double m_last_nlml_value;     // Final NLML value after optimization
    
private:
    // Auxiliary function for numerical integration
    double CalculateNumericalProbability(double mu_f_star, double sigma_f_star_diag, LogitLikelihood *likelihood);

public:
    // Class constructor
    GaussianProcess(IKernel* kernel, ILikelihood* likelihood, IInference* inference,
    const matrix &X_train, const vector &y_train);
    // Static method for creating a GaussianProcess object with input parameters validation
    static GaussianProcess* Create(IKernel* kernel, ILikelihood* likelihood, IInference* inference,
    const matrix &X_train, const vector &y_train);
    
    // Destructor
    ~GaussianProcess();
    
    // --- Methods for getting the model state ---
    // Return the results of the last inference operation
    GPInferenceResult GetLastInferenceResult() const;
    // Return the completion type of the last hyperparameter optimization
    int GetLastTerminationType() const;
    // Return the number of iterations performed during the last hyperparameter optimization
    int GetLastIterationsCount() const;
    // Return the negative logarithm of the marginal likelihood after optimization
    double GetLastNLML() const;
    // Return the pointer to the kernel in use
    IKernel* GetKernel() const;
    // Return the current values of all hyperparameters being optimized.
    vector GetCurrentHyperparameters();
  
    // --- Training and configuration methods ---
    // Run the full model training process, including hyperparameter optimization
    bool Fit();
    // Perform a single inference step without hyperparameter optimization
    bool PerformInference();
    // Set the training data for the model 
    void SetTrainingData(const matrix& X, const vector& y);
    // Set the given hyperparameters for the kernel and likelihood function
    void SetHyperparameters(const vector &params);
    // Method called by the optimizer to calculate the objective function (NLML)
    double CalculateNLMLObjective(const vector &hyperparameters);
    
    // --- Prediction ---
    // Perform a prediction for new test data.
    // The mode parameter selects how classification probabilities are computed (PROBIT, NUM_INTEGR, MONTE_CARLO)
    bool Predict(const matrix &X_test, GPPredictionResult &result, PredictMode mode = PROBIT);
    
    // --- Auxiliary methods ---
    // Static method for generating samples from prior GP
    static bool SamplePriorGP(const matrix &x, IKernel* kernel, int num_samples, matrix &f_samples,
                              bool plot_samples = false, int plot_display_seconds = 10);
    //--- Method for logging the final values of hyperparameters
    void PrintOptimizedKernelParameters();  
};

Let's look at the main methods of the class:

There are two main ways to create an instance of a class:

Create method: Use this method to safely create a GaussianProcess object. This method performs the necessary checks on the input data (X_train, y_train, interface pointers) and returns a pointer to the object or NULL on error.

//+------------------------------------------------------------------+
//| Create method                                                    |
//+------------------------------------------------------------------+
GaussianProcess* GaussianProcess::Create(IKernel* kernel, ILikelihood* likelihood, IInference* inference,
const matrix &X_train, const vector &y_train)
{
    // 1. Check for NULL pointers
    if (kernel == NULL || likelihood == NULL || inference == NULL) {
        Print("ERROR: Kernel, Likelihood, or Inference pointer is NULL");
        return NULL;
    } 
    // 2. Check the validity of X_train and y_train inputs
    if (X_train.Rows() == 0 || y_train.Size() == 0 || X_train.Rows() != y_train.Size()) {
        Print("ERROR: Invalid training data dimensions");
        return NULL;
    }
    // 3. Check the compatibility of 'likelihood' and 'inference'
    string likelihood_name = likelihood.GetName();
    string inference_name  = inference.GetName();   
    if (inference_name == "ExactInference" && likelihood_name != "GaussianLikelihood") {
        Print("ERROR: ExactInference supports only GaussianLikelihood!");
        delete kernel; delete likelihood; delete inference;
        return NULL;
    }    
    // 4. If all checks are passed, create the object
    GaussianProcess* gp_model = new GaussianProcess(kernel, likelihood, inference,X_train, y_train);
    if (gp_model == NULL) {
        Print("ERROR: Failed to create GaussianProcess object");
        delete kernel; delete likelihood; delete inference;
        return NULL;
    }    
    return gp_model;
}

Class constructor: Provides a direct way of initialization, without any data checks. If you are confident in your data, you can create an object using the constructor.

//+------------------------------------------------------------------+
//| GaussianProcess class constructor                                |
//+------------------------------------------------------------------+
GaussianProcess::GaussianProcess(IKernel* kernel, ILikelihood* likelihood, IInference* inference,
const matrix &X_train, const vector &y_train) :
    m_kernel(kernel),
    m_likelihood(likelihood),
    m_inference(inference),
    m_X_train(X_train), 
    m_y_train(y_train), 
    m_last_termination_type(0),
    m_last_iterations_count(0),
    m_last_nlml_value(0.0){ }

Fit() method: runs the full model training process. It optimizes the hyperparameters of the kernel and of the likelihood function using the MinBLEIC optimizer, which minimizes the negative log marginal likelihood (NLML).

//+------------------------------------------------------------------+
//| Method for training the model                                    |
//+------------------------------------------------------------------+
bool GaussianProcess::Fit()
{
    // Create the GPOptimizationObjective object and pass it a pointer to the current GaussianProcess object.
    // The pointer is stored in the private m_gp field of that class, which uses it to call
    // CalculateNLMLObjective and get the NLML value for the current set of hyperparameters
    GPOptimizationObjective objective_func(GetPointer(this));
    CNDimensional_Rep frep; 
    CObject Obj;
    vector initial_hyperparams = GetCurrentHyperparameters(); // Get the initial values of the hyperparameters
    double theta[];
    ArrayResize(theta, (int)initial_hyperparams.Size()); 
    VectorToArray(initial_hyperparams,theta);
    int num_params = (int)initial_hyperparams.Size();
    double s[];
    double bndl[];
    double bndu[];
    ArrayResize(s, num_params);
    ArrayResize(bndl, num_params);
    ArrayResize(bndu, num_params);

    int param_idx = 0; 
    IKernel* kernels_to_process[]; // array of pointers to the IKernel interface
    
    // Logic for obtaining kernels to set boundaries 
    // This block of code determines what type of kernel we are dealing with
    // and fills the kernels_to_process array with the corresponding pointers:
    if (dynamic_cast<SumKernel*>(m_kernel) != NULL) {          // Check if the current kernel m_kernel is a SumKernel object  
        SumKernel* sum_k = dynamic_cast<SumKernel*>(m_kernel); // If yes, then we cast the m_kernel type to the SumKernel* type
        sum_k.GetKernels(kernels_to_process); // and call the GetKernels() method, which fills the kernels_to_process array with all the kernels included in the sum
    } else if (dynamic_cast<ProductKernel*>(m_kernel) != NULL) { // Similar logic if the kernel is a ProductKernel object
        ProductKernel* prod_k = dynamic_cast<ProductKernel*>(m_kernel);
        prod_k.GetKernels(kernels_to_process);
    } else {
        ArrayResize(kernels_to_process,1); // If the kernel is neither a sum nor a product (i.e. it is not a composite kernel), 
        kernels_to_process[0] = m_kernel; // then the kernels_to_process array simply contains a pointer to m_kernel.
    }

    // This loop iterates over each base kernel found in the kernels_to_process array
    // and assigns each of its hyperparameters a scale s, a lower bound bndl, and an upper bound bndu
    for(int i = 0; i < ArraySize(kernels_to_process); i++) {
        IKernel* current_k = kernels_to_process[i];
            string kernel_name = current_k.GetName();   
            if (kernel_name == "RBFKernel") {
                if (param_idx + 2 <= num_params) {    
                    s[param_idx] = 1.0; bndl[param_idx] = 1e-3; bndu[param_idx] = 1e3; param_idx++;    
                    s[param_idx] = 1.0; bndl[param_idx] = 1e-3; bndu[param_idx] = 1e3; param_idx++;    
                } 
            } else if (kernel_name == "LinearKernel") {
                if (param_idx + 1 <= num_params) {
                    s[param_idx] = 1.0; bndl[param_idx] = 1e-3; bndu[param_idx] = 1e3; param_idx++;    
                } 
            } else if (kernel_name == "PeriodicKernel") {
                if (param_idx + 3 <= num_params) {
                    s[param_idx] = 1.0; bndl[param_idx] = 1e-3; bndu[param_idx] = 1e3; param_idx++;    
                    s[param_idx] = 1.0; bndl[param_idx] = 1e-3; bndu[param_idx] = 1e3; param_idx++;    
                    s[param_idx] = 1.0; bndl[param_idx] = 1e-3; bndu[param_idx] = 1e3; param_idx++;    
                } 
            }           
    }
    
    // --- Add bounds and scales for likelihood parameters (if any) ---
    // LogitLikelihood has no hyperparameters, so this block will be skipped for it
    // GaussianLikelihood has 1 parameter (sigma)
    if (m_likelihood.GetNumHyperparameters() > 0) {
        if (param_idx + m_likelihood.GetNumHyperparameters() <= num_params) {
            s[param_idx] = 1.0;           // Scale
            bndl[param_idx] = 1e-10;      // Lower bound
            bndu[param_idx] = 1e3;        // Upper bound
            param_idx++;
        }
    }
    CMinBLEICStateShell state;
    CMinBLEICReportShell rep; // object that will contain a report on the optimization results
    //-----------------------  optimizer stopping criteria
    double epsg = 0.0001;     //Gradient precision (0 means gradient stopping is disabled)
    double epsf = 0.0000;     //Precision by function value    
    double epsw = 0.0000;     //Accuracy by parameters 
    //-------------------------   
    double epso = 0.00001;    //Parameters for external convergence conditions in BLEIC
    double epsi = 0.00001;    //Parameters for internal convergence conditions in BLEIC
    CAlglib::MinBLEICCreate(theta, state);       // initialize the optimizer. It creates the initial state for MinBLEIC using the initial hyperparameter values from the theta array.
    CAlglib::MinBLEICSetBC(state, bndl, bndu);   // Set the lower (bndl) and upper (bndu) bounds for each parameter
    CAlglib::MinBLEICSetScale(state, s);         //Sets the scales (s) for each parameter. This can help the optimizer work more efficiently with parameters of different orders of magnitude.
    CAlglib::MinBLEICSetInnerCond(state,epsg,epsf,epsw);    
    CAlglib::MinBLEICSetOuterCond(state, epso, epsi);    
    CAlglib::MinBLEICOptimize(state, objective_func, frep, 0, Obj); // start the optimization    
    CAlglib::MinBLEICResults(state, theta, rep); // optimization report
    
    m_last_termination_type = rep.GetTerminationType();
    m_last_iterations_count = rep.GetInnerIterationsCount();
    m_last_nlml_value = objective_func.GetNLML(); // Get the final NLML
//------------------------------------------------------------------------------------    
//    The TerminationType field contains the completion code, which can be:
// -8    internal integrity control detected infinite or NAN values in
//       function/gradient. Abnormal termination signalled.
// -3    inconsistent constraints. Feasible point is
//       either nonexistent or too hard to find. Try to
//       restart optimizer with better initial approximation
//  1    relative function improvement is no more than EpsF.
//  2    relative step is no more than EpsX.
//  4    gradient norm is no more than EpsG
//  5    MaxIts steps was taken
//  7    stopping conditions are too stringent,
//       further improvement is impossible,
//       X contains best point found so far.
//  8    terminated by user who called minbleicrequesttermination(). X contains
//       point which was "current accepted" when termination request was
//       submitted.
//-------------------------------------------------------------------------------------  
// Determine the success of optimization based on TerminationType
    bool success = true;     
    if (m_last_termination_type < 0)
    {
        Print("Error: GP optimization failed. Completion type: ", m_last_termination_type);
        success = false;
    } 
    // Update the model hyperparameters after optimization
    vector optimized_hyperparams;
    optimized_hyperparams.Assign(theta);
    SetHyperparameters(optimized_hyperparams);   
    return success;     
}

Inside the Fit() method, we prepare everything necessary for the optimizer to work effectively.

A special object objective_func (GPOptimizationObjective) is created that represents the NLML objective function and its analytical gradient in a format understandable by Alglib. A pointer to the current GaussianProcess object is passed to its constructor (this is necessary to call the CalculateNLMLObjective method).

Next, we obtain the current values of all model hyperparameters in the theta array. These values (obtained from the kernel and the likelihood function) serve as the starting point for the search for the optimum. For each hyperparameter, a scale (s), a lower bound (bndl) and an upper bound (bndu) are specified. The bounds keep the optimizer out of ill-posed or meaningless regions (e.g. negative length scales or variances). The optimizer uses the scales to normalize the parameters, which improves stability and convergence speed, especially when the parameters differ greatly in order of magnitude. Here every scale is set to s = 1.0.

Next, we declare kernels_to_process, an array of pointers to the IKernel interface. It stores the list of all base kernels whose hyperparameters need to be optimized. If we have a simple (non-composite) kernel, this array has only one element: a pointer to that kernel. If it is a SumKernel or ProductKernel, the array stores pointers to all the kernels that are part of the composition.

Next, using the dynamic_cast operator, we check whether the current kernel m_kernel (which is a field of the GaussianProcess class and points to the user-selected kernel) is an instance of SumKernel or ProductKernel. If so, a type cast to SumKernel or ProductKernel occurs, and the GetKernels() method is called, which fills the kernels_to_process array with all the kernels included in the sum or product of kernels. If the kernel is neither a sum nor a product (i.e. it is a regular kernel, such as an RBFKernel), then the kernels_to_process array simply contains a pointer to m_kernel itself.

After that, we loop through each base kernel found in kernels_to_process and assign a scale s and bounds bndl and bndu to each of its hyperparameters.

Finally, after all the kernel hyperparameters, the likelihood function hyperparameters are processed. The Gaussian likelihood has one parameter, while the logit likelihood has no parameters. After all parameters have been prepared, the optimization process starts.
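The bookkeeping described above can be sketched in a few lines. The sketch below is written in Python rather than MQL5, purely for illustration; the table KERNEL_PARAM_COUNTS and the function build_scales_and_bounds are hypothetical names that do not exist in the library. The parameter counts and bound values are taken from the Fit() listing. Unlike the listing, the sketch loops over all likelihood parameters instead of handling exactly one.

```python
# Hypothetical lookup table: number of hyperparameters per base kernel,
# matching the counts used in the Fit() loop (RBF: 2, Linear: 1, Periodic: 3).
KERNEL_PARAM_COUNTS = {"RBFKernel": 2, "LinearKernel": 1, "PeriodicKernel": 3}


def build_scales_and_bounds(kernel_names, likelihood_param_count,
                            kernel_bounds=(1e-3, 1e3),
                            likelihood_bounds=(1e-10, 1e3)):
    """Return (s, bndl, bndu) lists: kernel parameters first, likelihood last."""
    s, bndl, bndu = [], [], []
    # Kernel hyperparameters, in the order the base kernels are listed.
    for name in kernel_names:
        for _ in range(KERNEL_PARAM_COUNTS[name]):
            s.append(1.0)
            bndl.append(kernel_bounds[0])
            bndu.append(kernel_bounds[1])
    # Likelihood hyperparameters come after all kernel hyperparameters.
    for _ in range(likelihood_param_count):
        s.append(1.0)
        bndl.append(likelihood_bounds[0])
        bndu.append(likelihood_bounds[1])
    return s, bndl, bndu


if __name__ == "__main__":
    # RBF + Periodic kernel with a Gaussian likelihood (one noise parameter).
    scales, lower, upper = build_scales_and_bounds(["RBFKernel", "PeriodicKernel"], 1)
    print(len(scales), lower, upper)
```

A table-driven layout like this keeps the parameter order in one place, which matters because the optimizer only sees a flat vector and relies on that order being identical everywhere.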

The CalculateNLMLObjective() method acts as a link between the main GaussianProcess class and the external Alglib optimizer. This is the objective function that the MinBLEIC optimizer repeatedly calls (via the GPOptimizationObjective class) to evaluate the current hyperparameter values. Its main task is to return the NLML value for a given set of hyperparameters.

//+-------------------------------------------------------------------+
//| Method that will be called by the optimizer to calculate NLML     |
//+-------------------------------------------------------------------+
double GaussianProcess::CalculateNLMLObjective(const vector &hyperparameters)
{
    //  Set all hyperparameters (kernels and likelihoods)
    SetHyperparameters(hyperparameters);    
    //  Call the inference function, which will calculate NLML
    m_inference.Infer(m_X_train, m_y_train, m_kernel, m_likelihood,m_last_inference_result);
    if (!m_last_inference_result.success) {    
        Print("Inference Error !");
        return DBL_MAX;    
    }    
    return m_last_inference_result.nlml_value;
}

At each iteration, the MinBLEIC optimizer proposes a new set of hyperparameters. The first thing CalculateNLMLObjective() does is take this set (hyperparameters) and use the SetHyperparameters() method to update the corresponding parameters inside the kernel (m_kernel) and likelihood function (m_likelihood) objects. This is very important because all subsequent NLML calculations must be based on these current hyperparameter values.

After the hyperparameters are updated, the method calls Infer() on the inference object (m_inference). This is the main step where all the complex mathematical calculations aimed at estimating the posterior distribution take place.

The inference results, including the NLML value and its gradients (which will be used by the Grad function), are stored in the private class field m_last_inference_result.

If the inference is successful, the method returns NLML.
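If the inference fails, the method returns DBL_MAX so that the optimizer treats the point as extremely bad. The pattern can be sketched as follows. This is a Python illustration only; safe_objective and PENALTY are hypothetical names, and the choice of which exceptions count as an "inference failure" is an assumption made for the sketch.

```python
import math

# Assumed stand-in for DBL_MAX: the largest finite double.
PENALTY = 1.7976931348623157e308


def safe_objective(compute_nlml, hyperparameters):
    """Evaluate compute_nlml and map any failure to a large penalty value."""
    try:
        value = compute_nlml(hyperparameters)
    except (ValueError, ZeroDivisionError, OverflowError):
        # For example, a failed Cholesky factorization or a log of a negative number.
        return PENALTY
    if math.isnan(value) or math.isinf(value):
        return PENALTY
    return value


if __name__ == "__main__":
    print(safe_objective(lambda t: t[0] ** 2, [3.0]))        # regular evaluation
    print(safe_objective(lambda t: math.sqrt(-1.0), [0.0]))  # failure is penalized
```

One caveat worth keeping in mind: a gradient-based optimizer also needs a gradient at the failed point, so a penalty value alone does not tell it which direction leads back to the valid region.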

The GaussianProcess::SetHyperparameters(const vector &params) method is responsible for distributing and setting optimized values of the kernel hyperparameters and the likelihood function.

//+------------------------------------------------------------------+    
//| Method for setting hyperparameters                               |
//+------------------------------------------------------------------+
void GaussianProcess::SetHyperparameters(const vector &params)
{
//+------------------------------------------------------------------+    
//This is a call of the polymorphic SetHyperparameters method on the object pointed to by m_kernel.
//Since m_kernel is a pointer to a base type (IKernel*), calling SetHyperparameters
//will be redirected to a concrete implementation of this method in the derived kernel class
//m_kernel refers to. For example, if m_kernel actually points to an object
//RBFKernel, RBFKernel::SetHyperparameters(params) is called. If this is SumKernel, 
//the SumKernel::SetHyperparameters(params) method is called, and so on.    
//+------------------------------------------------------------------+
    int kernel_params_count = m_kernel.GetNumHyperparameters();
    int likelihood_params_count = m_likelihood.GetNumHyperparameters();
    // Set kernel parameters
    vector kernel_hps(kernel_params_count);
    for(int i = 0; i < kernel_params_count; i++) {
        kernel_hps[i] = params[i];
    }
    m_kernel.SetHyperparameters(kernel_hps);    
    // Set the likelihood parameters
    vector likelihood_hps(likelihood_params_count);
    for(int i = 0; i < likelihood_params_count; i++) {
        likelihood_hps[i] = params[kernel_params_count + i];
    }
    m_likelihood.SetHyperparameters(likelihood_hps);
}

The params vector contains all hyperparameters of the GP model in a fixed order: first come all hyperparameters of the kernel (or kernels, if it is a composite kernel), and then come the parameters of the likelihood function. The key feature of this method is the use of polymorphism. The same call to m_kernel.SetHyperparameters() behaves differently depending on the actual type of the object pointed to by m_kernel at runtime.
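The fixed ordering of the flat parameter vector can be illustrated with a short Python sketch. The function names are hypothetical and the length check is an addition that the MQL5 method does not perform. The second helper shows how a composite kernel could, under the same ordering assumption, pass each base kernel its own slice.

```python
def split_hyperparameters(params, kernel_count, likelihood_count):
    """Split a flat vector into (kernel part, likelihood part)."""
    if len(params) != kernel_count + likelihood_count:
        raise ValueError("params length does not match the model's parameter count")
    kernel_hps = list(params[:kernel_count])
    likelihood_hps = list(params[kernel_count:])
    return kernel_hps, likelihood_hps


def split_by_counts(values, counts):
    """Split the kernel part further, one consecutive slice per base kernel."""
    if sum(counts) != len(values):
        raise ValueError("counts do not add up to the number of values")
    slices, start = [], 0
    for count in counts:
        slices.append(list(values[start:start + count]))
        start += count
    return slices


if __name__ == "__main__":
    # RBF (2 parameters) + Linear (1 parameter) kernel, Gaussian likelihood (1 parameter).
    theta = [1.5, 0.7, 2.0, 0.1]
    kernel_part, likelihood_part = split_hyperparameters(theta, 3, 1)
    print(split_by_counts(kernel_part, [2, 1]), likelihood_part)
```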

The Predict() method. This is essentially what the model is built for: making predictions on new data.

//+------------------------------------------------------------------+
//| Prediction method for regression and classification              |
//+------------------------------------------------------------------+    
bool GaussianProcess::Predict(const matrix &X_test, GPPredictionResult &result,PredictMode predict_mode)
{     
    // 1. Check that the model has been trained
    if (!m_last_inference_result.success) {
        Print("Error: Predict - Inference results not available");
        return false;
    }
    // 1.1 Check the match of the number of features
    if (X_test.Cols() != m_X_train.Cols()) {
        Print("Error: Predict - Number of features in X_test  must match X_train ");
        return false;
    }

    int N_train = (int)m_X_train.Rows();
    int N_test = (int)X_test.Rows();

    // 2. K_s and K_ss are calculated regardless of the type of inference/likelihood
    matrix K_s = m_kernel.Compute(m_X_train, X_test);            
    matrix K_ss = m_kernel.Compute(X_test, X_test);
    
    // --- 3. Logic for calculating mu_f_star and Sigma_f_star (common for both types of problems) ---
    //------------------------- Algorithm 2.1 GPML----------------------------------------
    if (m_inference.GetName() == "ExactInference") {
        // For ExactInference
        matrix L_K_noisy = m_last_inference_result.L_K_noisy;
        vector alpha = m_last_inference_result.alpha;
        
        result.mu_f_star = K_s.Transpose() @ alpha;            

        matrix V(N_train, N_test);    
        if (!L_K_noisy.LinearEquationsSolution(K_s, V)) {
            Print("Error: Predict (Exact) - LinearEquationsSolution failed");
            return false;        
        }          
        result.Sigma_f_star = K_ss - V.Transpose() @ V;

    } else if (m_inference.GetName() == "LaplaceInference") {
    //------------------------- Algorithm 3.2 GPML ----------------------------------------
        matrix W = -1 * m_last_inference_result.H;    
        matrix L_B = m_last_inference_result.L_B;    
        matrix sW = m_last_inference_result.sW;        
        vector f_hat = m_last_inference_result.mu_f_train;    
        vector grad_f_hat = m_likelihood.LogLikelihoodGradient(f_hat, m_y_train);
        // Eq[f*∣X,y,x*]=k(x*)^T K^−1 f_hat = k(x*)^T ∇log p(y∣f_hat) 
        result.mu_f_star = K_s.Transpose() @ grad_f_hat;
        
        matrix SwKs = sW @ K_s;
        matrix V(N_train, N_test);    
        if (!L_B.LinearEquationsSolution(SwKs, V)) {
            Print("Error: Predict (Laplace) - LinearEquationsSolution failed");
            return false;
        }
        // Vq[f*|X, y,x*] = Kss - Ks^T(K + W^-1)^-1 Ks
        result.Sigma_f_star = K_ss - V.Transpose() @ V;
    }    
    
    // --- 4. Likelihood-specific logic ---
    if (m_likelihood.GetName() == "GaussianLikelihood") {
        // --- 4.1. Regression (GaussianLikelihood) ---
        double noise_variance = 0.0;
        vector likelihood_params = m_likelihood.GetHyperparameters();
        if (likelihood_params.Size() > 0) {
            noise_variance = likelihood_params[0] * likelihood_params[0]; // sigma^2
        }
        result.Sigma_y_star = result.Sigma_f_star + matrix::Identity(N_test, N_test) * noise_variance;
        result.mu_y_star = result.mu_f_star; // For Gaussian likelihood mu_y_star = mu_f_star

    } else if (m_likelihood.GetName() == "LogitLikelihood") {
        // --- 4.2. Classification (LogitLikelihood) ---
          // Make sure m_likelihood is a LogitLikelihood to access the sigmoid method
        LogitLikelihood *logit = dynamic_cast<LogitLikelihood*>(m_likelihood);
        if (logit == NULL) {
            Print("Error: Failed to cast m_likelihood to LogitLikelihood in Predict");
            return false;
        }
        
        result.predicted_probabilities.Resize(N_test);
        result.predicted_labels.Resize(N_test);
        double mc_samples_array[]; 
 
        for (int i = 0; i < N_test; i++) {
            double mu_f_star_i = result.mu_f_star[i];  //mean of the posterior distribution q(f*|X,y,x*)
            double sigma_f_star_diag_i = result.Sigma_f_star[i, i]; // variance of the posterior distribution q(f*|X,y,x*)
            //------------------- 1) Probit approximation ----------------------
            if (predict_mode == PROBIT) {
                double k_i = 1.0 / MathSqrt(1.0 + M_PI / 8.0 * sigma_f_star_diag_i);
                result.predicted_probabilities[i] = logit.sigmoid(mu_f_star_i * k_i);
            }
            //------------------- 2) Numerical integration ---------------------
            else if (predict_mode == NUM_INTEGR) {
                result.predicted_probabilities[i] = CalculateNumericalProbability(
                    mu_f_star_i,
                    sigma_f_star_diag_i,
                    logit
                );
            }
            //------------------- 3) Monte Carlo method ------------------------
            else if (predict_mode == MONTE_CARLO) {      
                // Number of samples for Monte Carlo
                int num_samples = 10000;                
                ArrayResize(mc_samples_array, num_samples); 
                double std_dev_f_star_i = MathSqrt(sigma_f_star_diag_i);
                // Generate num_samples values from N(mu_f_star_i, std_dev_f_star_i)             
                MathRandomNormal(mu_f_star_i, std_dev_f_star_i, num_samples, mc_samples_array);
                double sum_sigmoid_samples = 0.0;
                for (int s = 0; s < num_samples; s++) {
                    sum_sigmoid_samples += logit.sigmoid(mc_samples_array[s]);
                }
            //To get the expected probability p(y*=+1|X,y,x*)
            //we calculate the arithmetic mean of all obtained values σ(f_sample*). 
            //By the law of large numbers, when num_samples is large enough, 
            //this average will be a good approximation of the true value of the integral     
                result.predicted_probabilities[i] = sum_sigmoid_samples / num_samples;
            }    
            // Predicted labels (+1 or -1)
            result.predicted_labels[i] = (result.predicted_probabilities[i] >= 0.5) ? 1.0 : -1.0;
        }
    }
    return true;    
}

The prediction results (mean, variance, probabilities, class labels) are stored in the GPPredictionResult structure.

First of all, we calculate the matrices K* and K**. These matrices are the basis for predictions in GP. They are needed to calculate the mean and variance of the latent function f* at new test points. The logic here depends on which inference method (ExactInference or LaplaceInference) was used during training, since they provide different components for the prediction formulas (Algorithm 2.1 for Exact, Algorithm 3.2 for Laplace from the book GPML by Rasmussen and Williams).

If ExactInference was used, the pre-computed L_K_noisy and alpha are retrieved. If LaplaceInference was used, W, L_B, sW and f_hat (mode) are extracted. In both cases, the result is the mean (mu_f_star) and covariance matrix (Sigma_f_star) of the latent function for each test point.
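For the exact-inference branch, the computation of the latent mean and variance can be reproduced on a toy problem. The following is a minimal Python sketch of Algorithm 2.1 from GPML, assuming a one-dimensional input, an RBF kernel with unit length scale and unit signal variance, and a noise standard deviation called noise. All function names are hypothetical, and the small hand-written Cholesky routines merely stand in for the matrix operations used in the MQL5 listing.

```python
import math


def rbf(a, b, length=1.0, sigma_f=1.0):
    """Squared-exponential kernel for scalar inputs."""
    return sigma_f ** 2 * math.exp(-0.5 * (a - b) ** 2 / length ** 2)


def cholesky(A):
    """Lower-triangular L with L * L^T = A (A must be positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def forward_sub(L, b):
    """Solve L x = b for lower-triangular L."""
    x = [0.0] * len(b)
    for i in range(len(b)):
        x[i] = (b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i]
    return x


def backward_sub(L, b):
    """Solve L^T x = b for lower-triangular L."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(x_train, y_train, x_star, noise=1e-3):
    """Latent mean and variance at one test point (GPML Algorithm 2.1)."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (noise ** 2 if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)                                   # K + sigma^2 I = L L^T
    alpha = backward_sub(L, forward_sub(L, y_train))  # alpha = (K + sigma^2 I)^-1 y
    k_s = [rbf(x, x_star) for x in x_train]
    mean = sum(k * a for k, a in zip(k_s, alpha))     # k*^T alpha
    v = forward_sub(L, k_s)                           # v = L^-1 k*
    var = rbf(x_star, x_star) - sum(t * t for t in v) # k** - v^T v
    return mean, var


if __name__ == "__main__":
    print(gp_predict([0.0, 1.0, 2.0], [0.0, 0.8, 0.9], 1.0))   # at a training point
    print(gp_predict([0.0, 1.0, 2.0], [0.0, 0.8, 0.9], 10.0))  # far from the data
```

At a training point with small noise the mean reproduces the observed value and the variance collapses towards zero, while far from the data the prediction reverts to the prior (zero mean, unit variance).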

As we have already discussed in the theoretical part of the article, there is a problem of calculating the integral to obtain the class probability. Therefore, approximations are used:

predict_mode == PROBIT (Probit approximation):

This is a frequently used quick approximation. It replaces the sigmoid function with the cumulative distribution function of the normal distribution, which is similar in shape. This allows us to calculate the integral analytically.

predict_mode == NUM_INTEGR (Numerical integration):

In this mode, the CalculateNumericalProbability auxiliary function is called. It numerically approximates the integral by dividing the range of f* into discrete intervals and summing the values. It may be more accurate, but slower.

predict_mode == MONTE_CARLO (Monte Carlo Method):

This is a stochastic method. A large number of random samples f* are generated from the posterior distribution q(f*∣X, y, x*). For each sample f*, sigma(f*) is calculated.

The arithmetic mean of all these values sigma(f*) is an approximation of the desired probability p(y*=+1|X, y, x*). This is the most computationally expensive method. To generate samples from a normal distribution, the MathRandomNormal standard library function is used.

Based on the calculated probabilities, a decision is made about the predicted class label for each of the above approximation methods. If the probability of belonging to class +1 is greater than or equal to 0.5, then +1 is predicted, otherwise -1.
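To get a feel for how close the three approximations are, they can be compared on a single latent mean and variance. The Python sketch below is illustrative only and assumes a strictly positive variance. The function names are hypothetical, and numerical_integration is a plain trapezoidal rule over mu ± 8 standard deviations, which is an assumption and not necessarily how CalculateNumericalProbability is implemented in the library.

```python
import math
import random


def sigmoid(x):
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def probit_approx(mu, var):
    """sigma(kappa * mu) with kappa = 1 / sqrt(1 + pi * var / 8)."""
    kappa = 1.0 / math.sqrt(1.0 + math.pi * var / 8.0)
    return sigmoid(kappa * mu)


def numerical_integration(mu, var, half_width=8.0, steps=2000):
    """Trapezoidal rule for the integral of sigmoid(f) * N(f | mu, var)."""
    std = math.sqrt(var)
    a = mu - half_width * std
    h = 2.0 * half_width * std / steps
    total = 0.0
    for i in range(steps + 1):
        f = a + i * h
        pdf = math.exp(-0.5 * ((f - mu) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * sigmoid(f) * pdf
    return total * h


def monte_carlo(mu, var, samples=20000, seed=0):
    """Average of sigmoid(f) over samples f ~ N(mu, var)."""
    rng = random.Random(seed)
    std = math.sqrt(var)
    return sum(sigmoid(rng.gauss(mu, std)) for _ in range(samples)) / samples


if __name__ == "__main__":
    mu, var = 1.0, 2.0
    print("probit     :", probit_approx(mu, var))
    print("integration:", numerical_integration(mu, var))
    print("monte carlo:", monte_carlo(mu, var))
```

Note that the probit approximation only shrinks the mean towards zero, so it never changes which side of 0.5 the probability falls on. The predicted label therefore depends on the sign of the latent mean, while the variance affects only the confidence.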

GPOptimizationObjective class

//+------------------------------------------------------------------+
//| Class for the Alglib optimizer objective function                |
//+------------------------------------------------------------------+
class GPOptimizationObjective : public CNDimensional_Grad
{
private:  
    GaussianProcess* m_gp; // pointer to GaussianProcess object
    double nlml;           // Last computed negative log marginal likelihood (NLML)
public:
    // Constructor 
    GPOptimizationObjective(GaussianProcess* gp_instance) : m_gp(gp_instance), nlml(0.0) {}
    double GetNLML() { return nlml; }
    ~GPOptimizationObjective() {}

    // Grad method that will be called by the optimizer
    virtual void Grad(CRowDouble &w, double &func,CRowDouble &grad, CObject &obj) override {     
        // Convert CRowDouble to a vector for passing to GP
        vector hyperparameters(w.Size());
        for(int i = 0; i < (int)w.Size(); i++) { 
           hyperparameters[i] = w[i];        
        }
        
        // Call the GP method to calculate NLML
        func = m_gp.CalculateNLMLObjective(hyperparameters);
        nlml = func; 
        
        GPInferenceResult current_result = m_gp.GetLastInferenceResult();
        if (!current_result.success) {
            Print("Warning: GPOptimizationObjective::Grad - Gradient calculation failed");
            for(int i = 0; i < (int)w.Size(); i++) {
                grad.Set(i, DBL_MAX);
            }
            return;
        }
        // Fill grad with elements from current_result.nlml_gradient
        for(int i = 0; i < (int)w.Size(); i++) {
            grad.Set(i, current_result.nlml_gradient[i]);
        }
    }      
};

This class is the link between our GaussianProcess class and the external Alglib optimization library (specifically, the MinBLEIC optimizer).

Alglib requires that the objective function it optimizes conform to a certain interface. This is exactly what GPOptimizationObjective is for. It inherits from CNDimensional_Grad, the Alglib base class that defines this interface. This base class provides virtual methods that the GPOptimizationObjective class should implement. These methods allow Alglib optimizers to work with any objective function, provided that it provides both the function value and its gradient.

The private member GaussianProcess* m_gp contains a pointer to our GaussianProcess object. This allows the GPOptimizationObjective class to call the CalculateNLMLObjective method to perform the necessary calculations.

The Grad() method is the most important part of this class. It overrides the virtual method from CNDimensional_Grad and is called by the Alglib optimizer on each iteration. Grad() receives the current hyperparameter vector w from Alglib and must return the value of the objective function in func and its gradient in grad.
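Because the optimizer trusts the analytical gradient it is given, it is worth checking that gradient against finite differences before running a long optimization. The Python sketch below shows one way to do this, with a toy quadratic standing in for the NLML. All names are hypothetical, and the check is a general technique, not part of the library described here.

```python
def numerical_gradient(f, w, eps=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def toy_objective(w):
    """Stand-in for the NLML: returns (value, analytic gradient)."""
    value = (w[0] - 1.0) ** 2 + 3.0 * (w[1] + 2.0) ** 2
    grad = [2.0 * (w[0] - 1.0), 6.0 * (w[1] + 2.0)]
    return value, grad


def max_gradient_error(value_and_grad, w):
    """Largest absolute difference between analytic and numerical gradients."""
    _, analytic = value_and_grad(w)
    numeric = numerical_gradient(lambda v: value_and_grad(v)[0], w)
    return max(abs(a - n) for a, n in zip(analytic, numeric))


if __name__ == "__main__":
    print(max_gradient_error(toy_objective, [0.5, 0.5]))
```

A large discrepancy at a few randomly chosen hyperparameter vectors usually points to an error in the derivative of the kernel or of the approximate marginal likelihood.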

Conclusion

Let's sum up the intermediate results.

In the first part of the article, we laid a solid theoretical foundation for understanding the GP classification model. We have examined in detail the principles of operation of GP for binary classification and the Laplace approximation method. This method is critically important because it makes the classification problem practical and computationally efficient for online trading needs, unlike the accurate but incredibly expensive MCMC method.

Having dealt with the theoretical constructs, we moved on to practical implementation, designing and describing two key classes of our GP library:

GaussianProcess: the main class that encapsulates all the logic for building, training, and predicting a GP model,
GPOptimizationObjective: acts as a middleman, preparing the objective function and its gradient in the format required by the Alglib library for hyperparameter optimization.

In the second part, we will complete the library implementation by providing:

detailed description and implementation code of key interfaces: IKernel (for various kernels), IInference (for inference methods) and ILikelihood (for likelihood functions);
examples of the library operation on synthetic data to clearly demonstrate its capabilities;
practical application in trading: we will develop indicators for classification and regression based on our library, demonstrating how GPs can be used to make trading decisions.

Translated from Russian by MetaQuotes Ltd.
Original article: https://www.mql5.com/ru/articles/18875

Attached files: GP.mqh (77.47 KB)

