Taking Neural Networks to the next level - page 2


There is one more thing to be mentioned about the internal data processing within the hidden layers' "black box": the so-called dropout. As far as I know, the concept of dropout goes back to a publication in 2014 by Nitish Srivastava et al. (Dropout: A Simple Way To Prevent Neural Networks From Overfitting, Journal of Machine Learning Research 15 (2014) 1929-1958). Dropout means "dropping out", i.e. temporarily switching off a certain percentage of the neurons in a given iteration. Those neurons are randomly chosen and change with every iteration. While they are inactive, their connections to the neurons in the previous and following layer are inactive, too. They stay inactive until backpropagation for that iteration is completed. Then a new set of dropout neurons is randomly chosen and the cycle starts over again:

Image source: Nitish Srivastava et al., Journal of Machine Learning Research 15 (2014)

The idea of dropout is taken from evolution and sexual reproduction, by the way. During reproduction, half of the genes ("alleles") of the mother and half of those of the father are randomly recombined. In other words: each offspring carries a randomly chosen subset of the genes that are available in the total gene pool of the parent generation. An individual child will only inherit about 50% of the mother's and of the father's genes, not the other half, while its siblings might inherit some of those other genes, because they are simply a different "recombination". Therefore the genes (alleles) that a given child didn't inherit didn't disappear from the population's gene pool - they were just not all used for that individual child. Maybe you can see that changing the genes/alleles "in use" for a child is a little bit like switching on and off the neurons that the next layer gets information from. In evolution this trick of creating variation allows the population to evolve over time by "survival of the fittest". It plays a far bigger role than mutations alone (contrary to what many would think).

If we come back to neural networks, introducing such variations with the dropout trick (and therefore not using the exact same neurons and connections in every iteration) forces the network to learn to come to good solutions even under slightly changing circumstances, which is why the dropout method is a trick against overfitting. Remember: overfitting means being overly adapted to one (usually past) dataset and doing well with those data, whilst behaving worse on new / unseen data. The risk of overfitting is particularly present in large neural networks with many layers. A choice of about 20% dropout is usually a good starting point to work with.

The opposite of being overfit would be generalisation. A little bit of generalisation is a good thing when it comes to neural networks. We usually don't want them to only be able to deal with situations that are 100% alike, but also with different, unseen data that were not part of the training set. Beyond that, dropout also acts as an additional form of denoising, which applies to using dropout in autoencoders as well.

If we translate the procedure into program code, dropout is sometimes added in the form of dropout layers. Those layers don't have any activation function but only act as filters that mask a percentage of the neurons in the preceding layer. Personally, I don't like to consider dropout as an individual layer; I prefer to just add a dropout variable for each neuron, as a kind of on/off switch. When I cycle via for-loops through the layers and neurons in every feedforward pass, I just use a bool variable "dropout[layer][neuron]" that is randomly set to TRUE with the probability of my chosen dropout level, let's say 20%. If it is TRUE, I just skip that neuron (so no "add weighted inputs, plus bias, activate"), ignore all connections coming from that neuron, and do the same during backpropagation. In the next forward pass the dropout bool variables are then newly assigned.
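The per-neuron on/off switch described above could look roughly like this. This is a minimal sketch in Python rather than MQL5, and not the actual code of the CMLP class: the names `draw_dropout_mask` and `feed_forward`, the nested-list weight layout and the tanh activation are all assumptions made for illustration.

```python
import math
import random

def draw_dropout_mask(layer_sizes, dropout_level, rng):
    """Return dropout[layer][neuron]; True means the neuron is switched off."""
    mask = []
    last = len(layer_sizes) - 1
    for layer, size in enumerate(layer_sizes):
        hidden = 0 < layer < last  # assumption: never drop input or output neurons
        mask.append([hidden and rng.random() < dropout_level for _ in range(size)])
    return mask

def feed_forward(inputs, weights, biases, dropout):
    """weights[l][j][i] connects neuron i of layer l-1 to neuron j of layer l."""
    values = list(inputs)
    for l in range(1, len(weights)):
        layer = []
        for j in range(len(biases[l])):
            if dropout[l][j]:
                layer.append(0.0)  # skipped: no weighted sum, no bias, no activation
                continue
            total = biases[l][j]
            for i, x in enumerate(values):
                if not dropout[l - 1][i]:  # ignore connections from dropped neurons
                    total += weights[l][j][i] * x
            layer.append(math.tanh(total))
        values = layer
    return values

rng = random.Random(42)
sizes = [4, 6, 6, 2]
# Index 0 is a placeholder because the input layer has no incoming weights.
weights = [None] + [[[rng.uniform(-1.0, 1.0) for _ in range(sizes[l - 1])]
                     for _ in range(sizes[l])] for l in range(1, len(sizes))]
biases = [None] + [[0.0] * sizes[l] for l in range(1, len(sizes))]

dropout = draw_dropout_mask(sizes, 0.2, rng)  # redrawn before every forward pass
output = feed_forward([0.1, -0.2, 0.3, 0.4], weights, biases, dropout)
```

The same mask would have to be reused for the backward pass of that iteration. Note that many libraries additionally rescale the surviving activations during training ("inverted dropout"); that step is left out here because the post doesn't mention it.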


3. the post-processing

Let's talk about the scale of our desired labels compared to the outputs of the neurons in the last layer, i.e. the output layer. Considering a standard neuron, the calculation method for updating its value is always the same: inputs times weights --> plus bias --> activation function. Some fancier neurons like LSTM cells additionally have their so-called "gates" (activation gate, input gate, forget gate, output gate) with their individual weight matrices (and a bias and an activation function of their own), but the end result is comparable: in the end, a cell's output is always the result of some activation function. Like for any neuron, this is also true for the neurons in the output layer, which is why the range that the output values can lie within is dictated by the chosen activation function of the last layer.

This can be a problem, or at least we need to choose the model's parameters in a way that deals with it. If our "true" labels for example are in a range between 0 and 1000 and we have the sigmoid function (just as an example) as the chosen activation function of the last layer, this just doesn't match, because the sigmoid function only returns values between 0 and 1 (or more precisely: sigmoid(x) is between 0.5 and 1 if x is positive, and between 0 and 0.5 if x is negative). "Doesn't match" in this case means that the network will almost always put out a result close to 1 (because more is not possible with sigmoid), we will almost always end up with a gigantic error between output and label, and the network can't do anything about it.
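To make the mismatch concrete, here is a small Python sketch (an illustration, not part of the MQL5 project). The label value of 1000 is taken from the example above; the chosen pre-activation values are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

label = 1000.0
for x in (-5.0, 0.0, 5.0, 50.0):
    out = sigmoid(x)
    # However large x becomes, the output never exceeds 1,
    # so the error against the label stays at roughly 999.
    print(f"x={x:6.1f}  output={out:.6f}  error={label - out:.2f}")

best_case_error = label - sigmoid(50.0)
```

No amount of training can shrink that remaining error, because it comes from the range of the activation function and not from the weights.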

If you take another example, the ReLU function, results can range between 0 and positive infinity. If the labels are very small values, this is also a less than perfect match, although it could be made to work.

To summarize: we either need to choose an activation function (for the last layer) whose range of possible results matches the labels, or we need to rescale the labels before we feed the errors into the backpropagation process. How we scale the labels then essentially depends on the activation function. If we have tanh as the last activation function (for example), which can put out values between -1 and +1, then a simple min-max scaling method that squashes the labels between -1 and +1 might be the obvious idea, whereas for example normalisation (zero mean, 1 = one standard deviation) would be a bad idea, because the scaled labels could then exceed the output value range.

If we have scaled the labels to fit for backpropagation, then of course we have to do the opposite with the results of forward propagation in order to come up with usable results on the same scale as the labels.
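A minimal Python sketch of min-max scaling to the range -1..+1 and of the reverse step for the network's outputs. The function names `fit_minmax`, `scale` and `unscale` are made up for this example and are not taken from the MQL5 code.

```python
def fit_minmax(labels):
    """Remember the label range seen in the training data."""
    return min(labels), max(labels)

def scale(value, lo, hi):
    """Map [lo, hi] linearly onto [-1, +1], the output range of tanh."""
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def unscale(scaled, lo, hi):
    """Inverse of scale(): map a network output back to the label scale."""
    return (scaled + 1.0) * (hi - lo) / 2.0 + lo

labels = [0.0, 250.0, 400.0, 1000.0]
lo, hi = fit_minmax(labels)
scaled = [scale(v, lo, hi) for v in labels]      # used for the error calculation
restored = [unscale(s, lo, hi) for s in scaled]  # applied to forward-pass results
```

The same `lo` and `hi` have to be stored and reused later on, otherwise the outputs can't be mapped back consistently.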

If the labels consisted of a data series that we made stationary, like in our autoencoder example, we need to reverse this, too, of course.
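Assuming the series was made stationary by simple first-order differencing, reversing it is just a cumulative sum starting from a known price. A short Python sketch with hypothetical function names and made-up prices:

```python
def difference(prices):
    """First-order differencing: price change from one step to the next."""
    return [b - a for a, b in zip(prices, prices[1:])]

def undo_difference(first_price, diffs):
    """Rebuild the price series from a starting price and the differences."""
    prices = [first_price]
    for d in diffs:
        prices.append(prices[-1] + d)
    return prices

prices = [1.1000, 1.1003, 1.1001, 1.1006]
diffs = difference(prices)
rebuilt = undo_difference(prices[0], diffs)
```

If a different transformation was used to reach stationarity (log returns, for example), the reverse step has to mirror that transformation instead.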


We now have all the parts together in order to start with building the autoencoder model. After including my Multilayer Perceptron file and defining a class object "CMLP ae;" on the global scope, I can from there on use just "ae" in order to refer to the autoencoder model.

In the real code I'm setting all those numbers as input variables, but to avoid confusion from too many custom variable names, I replaced them with plain numbers here in order to better explain what I'm doing. Input data will be a stationary series of 360 prices with 10-second increments, so that they add up to the price information of one hour. Then I use a neural network with 13 layers. The middle "bottleneck" layer has 36 neurons, so that we reduce the number of datapoints (that will later be fed into the LSTM network) by 90%. This is just a starting point as a proof of concept; I might e.g. also end up with 5-second increments and 720 datapoints and maybe reduce to just 10 bottleneck neurons... we'll see.

Some might say that such a data density, with 5 or 10 second increments, is overkill - but that is exactly the point. I want to start with an overkill of data and then find a much simpler and denoised representation. That's exactly what the autoencoder is made for.

Something like 13 layers might also seem like overkill. BUT: if we have a big difference between the number of neurons in the input layer and in the bottleneck layer (which makes sense in order to profit from the whole autoencoder idea in the first place), then we either need many layers or the number of neurons will differ greatly from one layer to the next, which also isn't good. We need to scale down more moderately.
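One way to picture "scaling down moderately" is to shrink each layer by the same factor until the bottleneck is reached and then mirror the sizes for the decoder. The Python sketch below only illustrates that idea under this assumption; the helper `tapered_layer_sizes` is hypothetical, and this is not the layer layout that is reported later in this thread.

```python
def tapered_layer_sizes(n_inputs, n_bottleneck, n_layers):
    """Geometric taper from n_inputs to n_bottleneck and back (n_layers must be odd)."""
    steps = (n_layers - 1) // 2
    ratio = (n_bottleneck / n_inputs) ** (1.0 / steps)  # same shrink factor per layer
    encoder = [round(n_inputs * ratio ** k) for k in range(steps + 1)]
    return encoder + encoder[-2::-1]  # mirror the encoder, bottleneck only once

sizes = tapered_layer_sizes(360, 36, 13)
print(sizes)
```

With 13 layers each step shrinks the layer to roughly 68% of the previous one, whereas going from 360 to 36 in a three-layer network would be a single jump down to 10%.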

I chose all tanh as activation functions, 20% dropout in all hidden layers, a learning rate of 0.01, standard normalisation scaling for the input features and minmax scaling (+/- 1) for the labels:

   // build autoencoder model
   // -   1. add layers and assign activation functions
   for (int l=1;l<=11;l++){ae.actfunct[l]=f_tanh;}
   // -   2. feature scaling parameters
   // -   3. label scaling parameters
   // -   4. set learning rate
   // -   5. load weight&bias matrix file
   if (AEload){ae.load(AEfilename);}
   // -   6. alternatively: create/initialize new weight&bias matrix
   else {ae.weight_init_method=Xavier;ae.init();}
   // -   7. set dropout level for all hidden layers
   for (int l=1;l<ae.layers-1;l++){ae.dropout_level[l]=0.2;}
   // -   8. show loss function on chart

It may take me some testing and tweaking, then I'll keep you guys updated how it turned out with training on real data, followed by the next step: performing time series analysis with those data by feeding the bottleneck data into an LSTM network.

To be continued....

Enrique Dangeroux
If we can agree that a financial time series is a random walk, how are you going to prevent the LSTM from simply outputting a value close to the current time step as the future time step value?

I like your question. You're pointing to a problem that is often seen in time series analysis. There are many examples on the internet of seemingly miraculous price forecasting algorithms, but once you take a closer look, you will often find that the algorithm has only learned to make more or less a copy of the last time step.

This makes complete sense! If(!) everything is basically random, any network output that has an upwards or downwards bias towards the next timestep will lead to a higher average error than not having a bias at all and just sticking to the level of the last timestep. Therefore a network can learn not to be biased in any direction, and it finally becomes just a fancy big copy machine that reproduces the price with a lag of 1 timestep.
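The argument can be checked with a simulated random walk. This Python sketch uses synthetic Gaussian steps (an assumption, not real price data) and compares the "copy machine" forecast with forecasts that always lean up or down by a fixed amount.

```python
import random

rng = random.Random(0)
prices = [100.0]
for _ in range(10000):
    prices.append(prices[-1] + rng.gauss(0.0, 1.0))  # pure random walk

def mse_of_forecast(prices, drift):
    """Mean squared error of forecasting next price = current price + drift."""
    errors = [(prices[t + 1] - (prices[t] + drift)) ** 2
              for t in range(len(prices) - 1)]
    return sum(errors) / len(errors)

mse_copy = mse_of_forecast(prices, 0.0)   # just repeat the last value
mse_up = mse_of_forecast(prices, 0.5)     # always lean upwards
mse_down = mse_of_forecast(prices, -0.5)  # always lean downwards
print(mse_copy, mse_up, mse_down)
```

On such data the unbiased copy has the lowest error, so a network trained to minimise squared error has every reason to converge to it.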

This is avoided by making the series stationary via differencing or even second degree differencing (taking the difference of the difference). The stationarity hypothesis can then be tested with some kind of "unit root test" like the commonly used Augmented Dickey Fuller (ADF) test.
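What second degree differencing does can be seen on a toy series with a curved trend. This is only a Python illustration with made-up numbers; the unit root test itself is not shown, since it would normally come from a statistics library.

```python
def difference(series):
    return [b - a for a, b in zip(series, series[1:])]

# Toy series with a quadratic trend and no noise at all.
trend = [0.5 * t * t + 2.0 * t + 10.0 for t in range(8)]

first = difference(trend)    # still rising: a linear trend is left over
second = difference(first)   # constant: the trend is gone
print(first)
print(second)
```

Each round of differencing shortens the series by one value and removes one "degree" of trend. Whether the result is actually stationary still has to be tested on the real data.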

Enrique Dangeroux

Now, you need to difference a potentially trending time series like financials, but I do not see how differencing, or even differencing the difference, takes out the random walk. For example, if you transform the time series to 1 for a close higher than the open and -1 vice versa, the LSTM still reverts to outputting more or less the last observation.

What cost function are you considering?


If the outcome of this experiment should be that the model only shows that future and past prices are 100% uncorrelated and all price movements are entirely random then I'm totally okay with it. I'm not trying to sell an expert advisor. I'm just genuinely curious about the predictive potential of the combination LSTM+autoencoder in the forex market. I know that I'm by far not the first person to try that, but the publications that are available often work with stocks or stock indices and not Forex and I also want to experiment with different timeframes and LSTM architectures later on, so that's really something I need to do by myself and not just some article that I could study. Thanks for your opinion and your considerations!

About the random walk: I think we should distinguish between:

(1.) some precautionary measures (like the stationarity requirement) with data preparation and model architecture about how we don't get caught in some pitfalls of a potentially present random-walk


(2.) the evaluation methods of the outcome (prediction versus actual)

For the latter I have to admit that I haven't yet thought it through completely, but I will certainly not just look at the predictions and decide if I like them. Measures like mean absolute error (MAE) and the coefficient of determination (r_squared) might play a role; maybe a cross-validation test. I'm no expert on statistics and have to think further about how I'll do it (maybe you have a good idea?). What makes it a little more complicated is that I'm not just dealing with a univariate input timeline and a single output, but with multivariate input and output data.
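For reference, the two measures mentioned above can be computed as follows. This is a plain Python sketch with made-up numbers; for multivariate outputs one simple option (an assumption, not something decided in this thread) would be to compute them separately for each output variable.

```python
def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

mae = mean_absolute_error(actual, predicted)
r2 = r_squared(actual, predicted)
r2_baseline = r_squared(actual, [2.5] * 4)  # always predicting the mean
```

An r_squared of 0 means the model is no better than always predicting the mean of the actual values, and negative values mean it is worse, which makes it a useful sanity check next to MAE.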

As for the cost function, I'm just working with a standard MSE loss. I'm aware that it has the characteristic of exaggerating the impact of outliers. On the other hand, it's extremely easy to implement in code. As I wrote in the first post of this series, I'm still learning and by no means a machine learning expert, so there might be better ways to do it. As far as I'm aware, MSE is usually okay for regression problems (as is the case in this experiment), whereas cross-entropy loss (with softmax activation in the output layer) is more or less standard for classification problems.
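Both points, the simplicity and the sensitivity to outliers, show up in a few lines of Python. The numbers are invented for illustration and the function names are not taken from the MQL5 code.

```python
def mse(outputs, labels):
    return sum((o - y) ** 2 for o, y in zip(outputs, labels)) / len(labels)

def mse_gradient(outputs, labels):
    """Derivative of the MSE with respect to each output; this starts backpropagation."""
    n = len(labels)
    return [2.0 * (o - y) / n for o, y in zip(outputs, labels)]

labels = [0.0, 0.0, 0.0, 0.0]
small_errors = [0.1, -0.1, 0.1, -0.1]
one_outlier = [0.1, -0.1, 0.1, 2.0]

loss_small = mse(small_errors, labels)
loss_outlier = mse(one_outlier, labels)
grad_outlier = mse_gradient(one_outlier, labels)
```

A single output that is off by 2.0 instead of 0.1 raises the loss by a factor of about 100, because the errors are squared before they are averaged.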

[These things are certainly interesting to further discuss, but I also try to keep in mind that I started this thread in a way that anybody with no previous machine learning knowledge at all is able to keep pace.]


I read with great curiosity. Especially the construction of the net is a task in itself. I myself experiment in all directions. Also I will schedule a pure classification based on image data without any LSTM.


I can report about some progress on the project.

The predominant problem that I encountered was finding a good learning rate.

As I mentioned earlier, this can be difficult. If it's too low, the learning process takes forever. If it's too high, our calculations in the hidden layers produce exploding interim results. This doesn't cause my program to crash, because I used MQL5's "MathIsValidNumber()" function literally EVERYWHERE where exploding or vanishing numbers are a risk, without a single exception and always with an attached rule about what to do if an invalid number was indeed found. But this doesn't mean that the algorithm learns at a good speed, and it therefore doesn't save me from the tweaking.
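The guarding pattern might look like this in Python, where `math.isfinite()` plays the role of MQL5's MathIsValidNumber(). The helper names and the fallback values are assumptions; the post doesn't say which rule is applied when an invalid number shows up.

```python
import math

def safe_value(value, fallback=0.0):
    """Return value if it is a normal number, otherwise the fallback."""
    return value if math.isfinite(value) else fallback

def safe_exp(x):
    """math.exp() raises OverflowError for huge inputs; report infinity instead."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

exploded = safe_exp(1000.0)                    # far too large for a double
clipped = safe_value(exploded, fallback=1e30)  # hypothetical rule: cap the value
cleaned = safe_value(float("nan"))             # hypothetical rule: reset to zero
```

A check like this keeps the program running, but as described above it doesn't fix the underlying cause, which is usually a learning rate that is too high.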

Finding a good learning rate is quickly done via trial and error if it is a SMALL network. With 13 layers it's a different story.

Apart from that, the purpose of the first few layers is mainly to learn some vague properties, whereas the later layers are more for the details. This is why we want the first layers to be more stable and not change vastly with every iteration. We want slow learning for the first layers and fast learning for the last layers. This is a contradiction that cannot be resolved with one single global learning rate, which is why I have now implemented the new feature of independently adjustable learning rates for the individual layers. Until now I had also only implemented time decay for the learning rate, i.e. a continuous decline of the learning rates in order to allow for fine-tuned learning during the later stages of the learning process. I have now also added a "momentum" variable for the learning rate. These are all typical properties that are common in machine learning libraries like TensorFlow with Keras. From those I also took the idea of implementing some features for automated optimization of the learning speed (like an "algorithm for the algorithm"), specifically the RMSprop and ADADELTA algorithms (the formulas were found after a quick Google search). If you're new to machine learning, forget about these functions for now - let's just say I did some fine-tuning on the learning rate.
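For readers who do want to see the mechanics, here is a Python sketch that combines three of the ideas above: one learning rate per layer, time decay, and the RMSprop update rule. The class name, the decay formula and all parameter values are assumptions for illustration, momentum and ADADELTA are left out, and this is not the MQL5 implementation.

```python
import math

class LayerOptimizer:
    """RMSprop for the weights of one layer, with its own decaying learning rate."""

    def __init__(self, base_rate, decay=0.0, beta=0.9, epsilon=1e-7):
        self.base_rate = base_rate
        self.decay = decay
        self.beta = beta
        self.epsilon = epsilon
        self.step = 0
        self.avg_sq_grad = {}  # running average of squared gradients per weight

    def learning_rate(self):
        # Time decay (assumed form): the rate shrinks as training progresses.
        return self.base_rate / (1.0 + self.decay * self.step)

    def update(self, weights, gradients):
        rate = self.learning_rate()
        for i, g in enumerate(gradients):
            v = self.beta * self.avg_sq_grad.get(i, 0.0) + (1.0 - self.beta) * g * g
            self.avg_sq_grad[i] = v
            # RMSprop: divide the step by the recent gradient magnitude.
            weights[i] -= rate * g / (math.sqrt(v) + self.epsilon)
        self.step += 1
        return weights

# Slow learning for the first layer, faster learning for the last one.
optimizers = [LayerOptimizer(0.001, decay=0.01), LayerOptimizer(0.01, decay=0.01)]
layers = [[1.0, -2.0], [1.0, -2.0]]
for _ in range(200):
    for weights, opt in zip(layers, optimizers):
        grads = [2.0 * w for w in weights]  # toy problem: minimise w**2
        opt.update(weights, grads)
```

On this toy problem both layers move towards zero, but the layer with the larger base rate gets there much faster, which is the behaviour the per-layer rates are meant to provide.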

I also observed that the LeakyReLU activation function was most effective for the task. The results after ~1000 iterations with tanh were not half as good as with LeakyReLU almost at the beginning of the learning process.
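For reference, LeakyReLU and its derivative in Python. The slope of 0.01 for negative inputs is a commonly used default and an assumption here; the post doesn't state which value was used.

```python
def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, a small slope alpha for negative ones."""
    return x if x > 0.0 else alpha * x

def leaky_relu_derivative(x, alpha=0.01):
    """Gradient used during backpropagation; it never becomes exactly zero."""
    return 1.0 if x > 0.0 else alpha

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, leaky_relu(x), leaky_relu_derivative(x))
```

Unlike tanh, the function doesn't saturate for large positive inputs, so the gradient there stays at 1, which may be part of the reason for the faster learning observed above.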

This is a picture of a very early result: I stopped the algorithm within seconds after start-up, when only 32 backpropagation iterations were completed (remember: we're talking about the AUTOENCODER part here, the LSTM is the next step):


The vertical lines mark one-hour intervals. The pink(ish) lines (360 dots per hour interval = 1 price every ten seconds) are the autoencoder's attempt to rebuild the prices from the bottleneck neurons (13 layers, input layer with 360 neurons, bottleneck layer 36 neurons, all other hidden layers 720 neurons each, 360 output neurons, 20% dropout, RMSprop optimizer with beta=0.99, eta=1e-07, "Xavier" weight initialization, bias-init=1, normalized scaling of inputs and labels, output layer ident-function, all other layers (including input layer): LeakyReLU). Of course, this example is still lacking any details. It only detects the vague direction of the price and doesn't yet recognise real patterns - as can be expected after only 32(!!) iterations - but I think for such an early example it's quite good. I'm quite happy to see that the algorithm is doing something that makes sense. The direction is there! Now let's see how it's doing after a few million iterations....

Icham Aidibe

I pin the thread to follow the progression of this project. 

@Marco vd Heijden, neural networks & patterns: it seems to me you are also working on such a project, aren't you?