Creating training and testing samples

We have come quite a long way in building our library for constructing neural networks. We have completed the basic dispatcher class for our neural network and created everything necessary for building a fully connected neural layer. There is still much work ahead. However, we can already build our first neural network and test its performance on real data. Since we will have several implementations of various architectural solutions and will want to compare their performance, we will take a small data subset. Let's create two data samples: a larger one for training the neural network and a smaller one for testing the trained network.

Allocating a separate sample for training is common practice. During training, the weights of a neural network are adjusted so that it describes the training dataset as accurately as possible. With a sufficiently large number of weights, the neural network can learn the training sample down to the smallest detail. However, in doing so, it loses the ability to generalize the data. In this state, the neural network is called "overfitted". Overfitting cannot be detected on the training sample alone. However, if you compare the performance of the neural network on the training dataset with its performance on data outside the training dataset, the difference in results will clearly reveal it. A slight deterioration of the results on the test sample is acceptable, but it should not be drastic. Of course, the data in the samples must be comparable. Most often, to achieve this, the available data is randomly divided into two sets in a ratio of 70-80% for the training dataset and 20-30% for the testing dataset. In most cases, it is necessary to divide the general population into three subsamples:

  • training 60%
  • validation 20%
  • test 20%
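The three-way split above is easy to sketch outside MQL5. The following C++ fragment (an illustration of the general practice, not part of our script; the `SplitDataset` helper is hypothetical) shuffles the record indices and partitions them 60/20/20:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <random>
#include <vector>

// Split indices 0..n-1 into training/validation/test subsets (60/20/20).
// Shuffling first makes each subset a random draw from the whole dataset.
struct Split
  {
   std::vector<int> train, valid, test;
  };

Split SplitDataset(int n, unsigned seed = 42)
  {
   std::vector<int> idx(n);
   std::iota(idx.begin(), idx.end(), 0);   // 0, 1, ..., n-1
   std::mt19937 gen(seed);
   std::shuffle(idx.begin(), idx.end(), gen);
   int n_train = (int)(n * 0.6);
   int n_valid = (int)(n * 0.2);
   Split s;
   s.train.assign(idx.begin(), idx.begin() + n_train);
   s.valid.assign(idx.begin() + n_train, idx.begin() + n_train + n_valid);
   s.test.assign(idx.begin() + n_train + n_valid, idx.end());
   return s;
  }
```

Because the indices are shuffled before partitioning, each subset samples the whole period rather than a contiguous block of it.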

The validation dataset is used to select the best training parameters and neural network architecture. However, we will not be using a validation dataset at this point, as we would like to compare different implementations under otherwise equal conditions.

To generate samples, let's create the script create_initial_data.mq5. In the script we will specify the following parameters:

  • The period for loading data is specified as a start date and an end date; within this period, we will retrieve historical data and indicator data from the server;
  • Timeframe used to load the analyzed data;
  • Number of analyzed historical bars per pattern;
  • Name of files for recording training and test samples;
  • Data normalization flag.

Earlier, we discussed at length the importance of normalizing the data fed into the neural network. Now we can verify in practice how data normalization affects the results of training. It is precisely to assess the impact of this factor that I introduced the data normalization parameter. It is important to note here that the data fed into the neural network must be comparable across the training dataset, the testing dataset, and real-world operation. Therefore, in practice, we will need to store the normalization parameters and apply them to the data arriving during live operation of the neural network.

Recall that in the section on selecting the input data to feed into the neural network, we selected two indicators: RSI and MACD. We will use them during the process of training neural networks within the practical experiments outlined in this book.

Let's look at the script algorithm. First, by analogy with the scripts discussed when selecting the source data, we will attach the selected indicators to the chart and obtain handles for accessing the indicator data.

//+------------------------------------------------------------------+
//| External parameters for script operation                         |
//+------------------------------------------------------------------+
// Beginning of the period of the general population
input datetime Start = D'2015.01.01 00:00:00';  
// End of the period of the general population
input datetime End = D'2020.12.31 23:59:00';    
// Timeframe for data loading
input ENUM_TIMEFRAMES TimeFrame = PERIOD_M5;    
// Number of historical bars in one pattern
input int      BarsToLine = 40;                 
// File name for recording the training sample
input string   StudyFileName = "study_data.csv";
// File name for recording the test sample
input string   TestFileName  = "test_data.csv";
// Data normalization flag
input bool     NormalizeData = true;            
//+------------------------------------------------------------------+
//| Beginning of the script program                                  |
//+------------------------------------------------------------------+
void OnStart(void)
  {
//--- Connect indicators to the chart
   int h_ZZ = iCustom(_Symbol, TimeFrame, "Examples\\ZigZag.ex5", 48, 1, 47);
   int h_RSI = iRSI(_Symbol, TimeFrame, 12, PRICE_TYPICAL);
   int h_MACD = iMACD(_Symbol, TimeFrame, 12, 48, 12, PRICE_TYPICAL);
   double close[];
   if(CopyClose(_Symbol, TimeFrame, Start, End, close) <= 0)
      return;

After that, we check the validity of the obtained handles and load the historical indicator data into dynamic arrays. Note that for the ZigZag indicator, we load a bit more data. The reason lies in the specifics of this indicator: its buffer contains values only at the detected extrema and returns zero elsewhere. Therefore, for the last patterns analyzed, the target extremum may lie beyond the analyzed period.

//--- Load indicator data into dynamic arrays
   double zz[], macd_main[], macd_signal[], rsi[];
   datetime end_zz = End + PeriodSeconds(TimeFrame) * 500;
   if(h_ZZ == INVALID_HANDLE || 
      CopyBuffer(h_ZZ, 0, Start, end_zz, zz) <= 0)
     {
      PrintFormat("Error loading indicator %s data", "ZigZag");
      return;
     }
   if(h_RSI == INVALID_HANDLE || 
      CopyBuffer(h_RSI, 0, Start, End, rsi) <= 0)
     {
      PrintFormat("Error loading indicator %s data", "RSI");
      return;
     }
   if(h_MACD == INVALID_HANDLE || 
      CopyBuffer(h_MACD, MAIN_LINE, Start, End, macd_main) <= 0 ||
      CopyBuffer(h_MACD, SIGNAL_LINE, Start, End, macd_signal) <= 0)
     {
      PrintFormat("Error loading indicator %s data", "MACD");
      return;
     }

In addition to the selected indicators, we load the candlestick closing prices. We will use these prices to determine the direction of price movement towards the nearest extremum and the strength of the upcoming movement.

After loading the data, we organize the process of determining the target values at each step of the historical data. To do this, we create a loop that iterates through all values of the ZigZag indicator in reverse order; whenever a value differs from zero, we save it to the extremum variable. Simultaneously, we iterate through the closing prices and, by measuring the deviation of the closing price from the last recorded extremum, determine the direction and strength of the upcoming movement. We save the obtained values into the dynamic arrays target1 and target2.

   int total = ArraySize(close);
   double target1[], target2[], macd_delta[], test[];
   if(ArrayResize(target1, total) <= 0 || 
      ArrayResize(target2, total) <= 0 ||
      ArrayResize(test, total) <= 0 || 
      ArrayResize(macd_delta, total) <= 0)
      return;
//--- Calculate targets: direction and distance 
//--- to the nearest extremum
   double extremum = -1;
   for(int i = ArraySize(zz) - 2; i >= 0; i--)
     {
      if(zz[i + 1] > 0 && zz[i + 1] != EMPTY_VALUE)
         extremum = zz[i + 1];
      if(i >= total)
         continue;
      target2[i] = extremum - close[i];
      target1[i] = (target2[i] >= 0 ? 1 : -1);
      macd_delta[i] = macd_main[i] - macd_signal[i];
     }

Here, it's important to note that on a time chart, the extremum should always come after the analyzed closing price. Therefore, the closing price is taken from the previous bar compared to the last checked value of the ZigZag indicator.

In the same loop, we will determine the distance between the main and signal lines of the MACD indicator and store them in a separate dynamic array called macd_delta.

After calculating the targets and the distance between the MACD indicator lines, we normalize the data. Of course, we perform normalization only when the user requests it in the script parameters. The purpose of normalization is to transform the original data so that its values fall in the range of -1 to 1, centered at 0. A few preliminary considerations follow from the characteristics of the indicators themselves.

The RSI indicator is constructed so that its values are already bounded within the range from 0 to 100. Hence, we do not need to determine the minimum and maximum values of this indicator to normalize it. The normalization algorithm relies only on the constant 50, the midpoint of the range of possible indicator values:

RSI_norm = (RSI - 50) / 50

The values of the MACD indicator have no upper or lower bound, but they are centered around 0. By construction, the indicator reflects whether the price is above or below its moving average. The same applies to the calculated distance between the main and signal lines: the signal line can be either above or below the main line, and at the moment the lines cross, the distance between them is 0. Therefore, to normalize the data, we take the indicator value and divide it by the absolute value of the maximum deviation over the analyzed period.
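In plain C++ terms (a sketch of the same idea with a hypothetical `NormalizeByMaxAbs` helper, not the script's code), this normalization divides every value of a zero-centered series by the largest absolute deviation observed over the period:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Scale a zero-centered series into [-1, 1] by dividing each value
// by the maximum absolute deviation over the whole period.
std::vector<double> NormalizeByMaxAbs(const std::vector<double> &data)
  {
   double max_abs = 0.0;
   for(double v : data)
      max_abs = std::max(max_abs, std::fabs(v));
   std::vector<double> out(data.size());
   for(size_t i = 0; i < data.size(); i++)
      out[i] = (max_abs > 0.0 ? data[i] / max_abs : 0.0);
   return out;
  }
```

Note that `max_abs` itself is the normalization parameter that would have to be stored for live operation, as discussed above.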

Here, I want to emphasize once again the importance of data comparability across the training dataset, the testing dataset, and the data used in live operation. If we normalize the training and test sample data now, we will have to keep the normalization parameters of all three MACD series (the main line, the signal line, and the distance between them) for practical application.

After determining the normalization parameters, we set up a loop to iterate over the historical indicator values and adjust them accordingly.

Only the initial data is normalized, not the target values.

//--- Data normalization
   if(NormalizeData)
     {
      double main_norm = MathMax(MathAbs(macd_main[ArrayMinimum(macd_main)]),
                                         macd_main[ArrayMaximum(macd_main)]);
      double sign_norm = MathMax(MathAbs(macd_signal[ArrayMinimum(macd_signal)]),
                                         macd_signal[ArrayMaximum(macd_signal)]);
      double delt_norm = MathMax(MathAbs(macd_delta[ArrayMinimum(macd_delta)]),
                                         macd_delta[ArrayMaximum(macd_delta)]);
      for(int i = 0; i < total; i++)
        {
         rsi[i] = (rsi[i] - 50.0) / 50.0;
         macd_main[i] /= main_norm;
         macd_signal[i] /= sign_norm;
         macd_delta[i] /= delt_norm;
        }
     }

Certainly, sometimes it can be useful to normalize the target values to fit them into the range of a specific activation function. However, in such cases, similar to normalizing input data, it's crucial to preserve the normalization parameters for decoding the neural network's outputs in industrial applications. These considerations lie at the interface between the neural network and the main program, and the solution largely depends on the specific task.

After preparing the dataset, the next step is to split it into training and testing sets. Common practice is to randomly select records from the entire dataset for the test set and use the remaining data for training. Taking consecutive patterns for the test set, whether the first or the last in the dataset, is strongly discouraged: a small contiguous block of data is highly susceptible to local trends, and such a sample may not be representative enough to extrapolate the evaluation to the entire dataset. Randomly selecting records from the whole dataset, on the other hand, is more likely to yield a diverse set of patterns for the test set. Such a sample is less dependent on local trends and more representative for evaluating the neural network's performance on the global dataset. That said, there are cases where consecutive patterns are deliberately chosen for the test set, but these are specific instances related to the architecture of certain models.

To split the dataset into training and testing sets, we will create an array of flags called test. This array will have the same size as our global dataset. The values of its elements will indicate the usage direction of the pattern:

  • 0 means the training sample
  • 1 means the testing sample

For a binary split, an array of boolean values would also work. However, if we later need to add a validation dataset, we can simply use the value 2 for it, a flexibility an array of boolean values does not provide.

Our flag array is first initialized with zeros; in other words, by default every pattern belongs to the training dataset. We then determine the number of patterns for the test set and create a loop over that number, generating random values inside it. The random value generator should return an integer between 0 and the size of the general population. In my solution, I used the built-in MQL5 function MathRand to generate pseudo-random numbers. This function returns an integer in the range 0 to 32767. However, since the dataset is expected to contain over 33,000 elements, I multiplied two random numbers; such a product can take more than 1 billion distinct values. To scale the obtained random number to the size of our population, we first divide it by the square of 32767, normalizing it to the range 0 to 1, and then multiply by the number of elements in the general population. The resulting number gives us the ordinal number of a pattern for the test sample.

All we have to do is write 1 to the corresponding element of the flag array. However, there is still a chance of landing twice (or even more times) on the same element of the flag array. If we do not control for this, we are very likely to get a test sample smaller than expected. Therefore, before writing 1 to the selected element of the flag array, we first check its current state. If it already contains 1, we decrease the loop iteration count by 1 and generate the next random number. Thus, if we hit the same element, the loop iteration counter will not be incremented, ensuring that we obtain a testing dataset of the expected size as output.

//--- Generate randomly the data indices for the test sample
   ArrayInitialize(test, 0);
   int for_test = (int)((total - BarsToLine) * 0.2);
   for(int i = 0; i < for_test; i++)
     {
      int t = (int)((double)(MathRand() * MathRand()) / MathPow(32767.0, 2) * 
                    (total - 1 - BarsToLine)) + BarsToLine;
      if(test[t] == 1)
        {
         i--;
         continue;
        }
      test[t] = 1;
     }
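The index-scaling trick can also be checked in isolation. The C++ sketch below (with a hypothetical `Rand15` standing in for MathRand, which returns 0..32767) confirms that the product of two 15-bit random numbers, divided by 32767 squared, always maps into the intended index range:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>

// Stand-in for MQL5's MathRand(): a pseudo-random integer in 0..32767.
int Rand15() { return std::rand() % 32768; }

// Scale the product of two 15-bit randoms into [bars, total - 1],
// mirroring the index generation used for the test-sample flags.
// The product fits in an int: 32767 * 32767 < 2^31 - 1.
int RandomIndex(int total, int bars)
  {
   double r = (double)(Rand15() * Rand15()) / std::pow(32767.0, 2);
   return (int)(r * (total - 1 - bars)) + bars;
  }
```

The lower bound `bars` matters here: indices below the pattern length could never form a complete pattern, so they are excluded from the draw.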

This is the end of the preparatory work. The only thing left to do is to save the prepared data into the appropriate files. To write the data, we open two files for writing according to the names specified in the script parameters. An obvious thing to do would be to create binary files to record numeric data. They take up less disk space and are faster to work with. But since we are going to load data from applications written in other programming languages, in particular from Python scripts, the most universal approach is to use CSV files.

We open two CSV files for writing and immediately check the resulting handles for accessing the files. Erroneous handles will signal a file opening error. The corresponding message will be displayed in the terminal log.

//--- Open the training sample file for writing
   int Study = FileOpen(StudyFileName, FILE_WRITE | 
                                       FILE_CSV | 
                                       FILE_ANSI, ",", CP_UTF8);
   if(Study == INVALID_HANDLE)
     {
      PrintFormat("Error opening file %s: %d", StudyFileName, GetLastError());
      return;
     }
//--- Open the test sample file for writing
   int Test = FileOpen(TestFileName, FILE_WRITE | 
                                     FILE_CSV | 
                                     FILE_ANSI, ",", CP_UTF8);
   if(Test == INVALID_HANDLE)
     {
      PrintFormat("Error opening file %s: %d", TestFileName, GetLastError());
      return;
     }

After successfully opening the files, we set up a loop to iterate through all elements of the population. Note that the loop starts not from the zeroth element but from the element corresponding to the number of bars in a pattern: a complete pattern record requires data from several previous candles. We split the data into training and test samples at the stage of writing to file: by checking the value of the corresponding element in the flag array, we direct each pattern to the file handle of the appropriate dataset. The actual writing of a pattern to file is encapsulated in a separate function, which we will review a little later. To track progress, we output the percentage of completion in the comments on the chart.

Upon completion of the loop, we will clear the comments on the chart, close the files, and log information about the file names and paths to the journal.

//--- Write samples to files
   for(int i = BarsToLine - 1; i < total; i++)
     {
      Comment(StringFormat("%.2f%%", i * 100.0 / (double)(total - BarsToLine)));
      if(!WriteData(target1, target2, rsi, macd_main, macd_signal, macd_delta, i,
                                      BarsToLine, (test[i] == 1 ? Test : Study)))
        {
         PrintFormat("Error writing data: %d", GetLastError());
         break;
        }
     }
     }
//--- Close the files
   Comment("");
   FileFlush(Study);
   FileClose(Study);
   FileFlush(Test);
   FileClose(Test);
   PrintFormat("Study data saved to file %s\\MQL5\\Files\\%s",
               TerminalInfoString(TERMINAL_DATA_PATH), StudyFileName);
   PrintFormat("Test data saved to file %s\\MQL5\\Files\\%s",
               TerminalInfoString(TERMINAL_DATA_PATH), TestFileName);
  }

To write information about a pattern to a file, we create a function called WriteData. In the function parameters, we pass references to the arrays of source and target data, the index of the last bar of the pattern in the data arrays, the number of bars to analyze per pattern, and the handle of the file for writing data. We reference the last bar of the pattern rather than the first to bring pattern construction closer to the real conditions of neural network operation. When working with real-time price series, we always stand on the latest known bar. We analyze information from several recent bars, which already constitute history, and try to assess the most probable upcoming price movement. Similarly, the bar specified in the parameters represents the "current moment" for us: we take the specified number of bars before it, and together they form the pattern we analyze. Based on this pattern, our neural network should determine the probable price movement and its strength.

//+------------------------------------------------------------------+
//| Function for writing a pattern to a file                         |
//+------------------------------------------------------------------+
bool WriteData(double &target1[], // Buffer 1 of target values
               double &target2[], // Buffer 2 of target values
               double &data1[],   // Buffer 1 of historical data
               double &data2[],   // Buffer 2 of historical data
               double &data3[],   // Buffer 3 of historical data
               double &data4[],   // Buffer 4 of historical data
               int cur_bar,       // Current bar of the end of the pattern
               int bars,          // Number of historical bars 
                                  // in one pattern
               int handle)        // Handle of the file to be written
  {

Let's first collect the pattern information into a string variable, inserting a delimiter between element values. The delimiter must match the one specified when opening the CSV file. Collecting data into a string variable is a forced compromise: the FileWrite function for writing to a text file accepts at most 63 parameters per call, and each call terminates the record with an end-of-line character. This creates two problems:

  1. With four indicator values per bar, a single call of the FileWrite function can describe no more than 15 candlesticks.
  2. Since each call ends the line, all the data of a pattern must be written in a single call.

We also cannot use a loop to pass array values as function parameters: every element to be written would have to be listed explicitly. The use of a string variable solves both issues. In a simple loop, we can collect all the values into one text string without any limit on the number of included elements. While assembling the string, we insert a delimiter between values, thus simulating a line of a CSV file. We then write the fully assembled string to the file in a single call, so the function inserts the end-of-line character only once, and the entire pattern occupies a single line in our file.
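The string-assembly approach is easy to sketch outside MQL5 as well. A C++ analogue (with a hypothetical `BuildCsvRow` helper, standard library only) joins all the values of a pattern with the CSV delimiter into a single row:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Join all pattern values into one CSV row; the delimiter must match
// the one the CSV file was opened with.
std::string BuildCsvRow(const std::vector<double> &values, char delim = ',')
  {
   std::ostringstream row;
   for(size_t i = 0; i < values.size(); i++)
     {
      if(i > 0)
         row << delim;   // delimiter only between values, not before the first
      row << values[i];
     }
   return row.str();
  }
```

Writing the returned string with a single output call produces exactly one line per pattern, regardless of how many values the pattern contains.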

//--- check the file handle
   if(handle == INVALID_HANDLE)
     {
      Print("Invalid Handle");
      return false;
     }
//--- determine the index of the first record of the historical data of the pattern
   int start = cur_bar - bars + 1;
   if(start < 0)
     {
      Print("Too small current bar");
      return false;
     }

//--- Check the correctness of the index of the data and the data written to the file 
   int size1 = ArraySize(data1);
   int size2 = ArraySize(data2);
   int size3 = ArraySize(data3);
   int size4 = ArraySize(data4);
   int sizet1 = ArraySize(target1);
   int sizet2 = ArraySize(target2);
   string pattern = (string)(start < size1 ? data1[start] : 0.0) + "," +
                    (string)(start < size2 ? data2[start] : 0.0) + "," +
                    (string)(start < size3 ? data3[start] : 0.0) + "," +
                    (string)(start < size4 ? data4[start] : 0.0);
   for(int i = start + 1; i <= cur_bar; i++)
     {
      pattern = pattern + "," + (string)(i < size1 ? data1[i] : 0.0) + "," +
                                (string)(i < size2 ? data2[i] : 0.0) + "," +
                                (string)(i < size3 ? data3[i] : 0.0) + "," +
                                (string)(i < size4 ? data4[i] : 0.0);
     }
   return (FileWrite(handle, pattern, 
                    (double)(cur_bar < sizet1 ? target1[cur_bar] : 0),
                    (double)(cur_bar < sizet2 ? target2[cur_bar] : 0)) > 0);
  }

As a result, we obtain a structured CSV file in which a delimiter is placed between every two adjacent elements and each row represents a separate pattern for analyzing the data.

It should also be noted that, to prevent an array out-of-range error, we check the index values against the array sizes before accessing the data arrays. In case of an invalid index, we write 0 instead of the indicator value. During the operation of the neural network, all values of the input vector are multiplied by weights and summed. Multiplying any weight by 0 always returns 0, so zero-valued inputs have no direct effect on the neuron's output. Of course, there can be an indirect influence: there could be a situation where the contribution of a particular input would have been just enough to activate the neuron. However, this is the lesser evil, and we accept these risks.
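This neutrality of zero padding is easy to verify: in the weighted sum computed by a neuron, terms with zero inputs vanish, so padded elements leave the sum unchanged no matter what weights they meet. A minimal C++ sketch (the `WeightedSum` helper is hypothetical, not part of our library):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Weighted sum of a neuron's inputs; zero-valued inputs contribute
// nothing regardless of the associated weights.
double WeightedSum(const std::vector<double> &x, const std::vector<double> &w)
  {
   double sum = 0.0;
   for(size_t i = 0; i < x.size() && i < w.size(); i++)
      sum += x[i] * w[i];
   return sum;
  }
```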

Perhaps, it is worth mentioning that for future tests of our models, we will immediately create two sets of training data:

  • We will write the samples with non-normalized input data to the files study_data_not_norm.csv and test_data_not_norm.csv.
  • We will write the samples with normalized input data to the files study_data.csv and test_data.csv.

To create these datasets, we will use the script create_initial_data.mq5 described above, running it twice on the same historical data while changing the file names for data recording and the data normalization flag.