Taking Neural Networks to the next level - page 8

Brian Rumbles
Thanks for the code of all the statistical methods, I will be trying them out soon. And of course for all your new additions on this thread. I was reading above and thinking that maybe MAs or regressions could take as input only data that lies within a certain number of standard deviations of a relevant chosen mean; not sure if that would be smoother, or more or less useful.
Brian Rumbles
I also think it would be fair to ask everyone at this point to clarify the purpose of the MA or regression. Is it for trend prediction, future price prediction, or what?
Brian Rumbles:
....clarify the purpose for the ma or regression...

As I see it: various attempts to get the signal behind the noise // tools for referencing if the price is in transition to a higher or lower state.

Purpose / what we do with them..? Isn't that mainly up to us and the trading strategy?


Concerning the power of prediction of future prices (as my subjective interpretations):

1. Autoencoders: Please consider that the autoencoder part in my model only served for denoising and performance improvement. The predictive part was done by the LSTM-network only. Autoencoders don't extrapolate, so direct predictions are impossible with them. Prediction here is only possible via combination with other models. The reason why autoencoders don't extrapolate is that we feed present data as inputs and because labels and inputs are the same, the outputs also reflect the present state. If we fed future prices as labels (=during training), the same model would of course learn to extrapolate into the future, but with different inputs and labels it's by definition no longer an autoencoder, but a normal MLP (with built-in denoising, if we keep the bottleneck-architecture).

2. Moving Averages: can't be used for prediction because they lack a directional component (remember: extrapolation by applying some kind of gradient or trend line to an MA line is a different model and not part of the MA itself)

3. Polynomial regression: returns a coefficient matrix that can be applied to future timesteps, so the possibility of prediction by extrapolation is readily built into the model

4. Digital filters, FFT, Goertzel...: same as with polyn. regression; prediction by extrapolation is usually possible without adding anything to the model

Of course, no predictive model gives us "miraculous" insights for future prices. The additional problem that I have with Moving Averages is that they are wrong for the PRESENT already, before even thinking about forecasting (wrong in the sense of an unnecessary time axis error component that doesn't improve on the denoising purpose and has no other advantage).


I can now finally show you in a screenshot what I mean by polynomial regression and what I call the "trace line" in this case: the magenta line is a 100-period regression line (power: 15), and its continuation as a dotted line on the upper right is a 10-period extrapolation. The aqua line is the trace line, i.e. the non-repainting trace of the right end point of the regression line (without the optional extrapolation part). Of course, the regression can also be combined with channel lines, as known from Bollinger Bands.
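For readers who want to experiment outside MT5, here is a minimal Python sketch of the idea described above. It is not the indicator's actual code: the function names are made up for illustration, NumPy is assumed to be installed, and the default values simply mirror the numbers mentioned in the post (100 periods, power 15, 10-period extrapolation).

```python
import numpy as np
from numpy.polynomial import Polynomial


def fit_window(prices, degree):
    """Least-squares polynomial over one window; x is the bar index 0..n-1."""
    x = np.arange(len(prices), dtype=float)
    # Polynomial.fit rescales x internally, which keeps high degrees better conditioned
    return Polynomial.fit(x, np.asarray(prices, dtype=float), degree)


def regression_and_extrapolation(prices, period=100, degree=15, horizon=10):
    """Fit the last `period` prices and continue the curve `horizon` bars ahead."""
    poly = fit_window(prices[-period:], degree)
    fitted = poly(np.arange(period, dtype=float))
    extrapolated = poly(np.arange(period, period + horizon, dtype=float))
    return fitted, extrapolated


def trace_line(prices, period=100, degree=15):
    """Right end point of the regression, recomputed bar by bar.

    Each value only uses prices up to its own bar, so it does not repaint.
    """
    trace = np.full(len(prices), np.nan)
    for end in range(period, len(prices) + 1):
        poly = fit_window(prices[end - period:end], degree)
        trace[end - 1] = poly(period - 1.0)
    return trace
```

The first `period - 1` values of the trace stay empty (NaN) because no full window is available yet. The regression line itself changes with every new bar, while the trace line keeps its past values.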

I have other versions of this indicator that include a color encoding of the slope of the PRT line and optional smoothing (which can make sense if color/slope changes are used as entry/exit signals).

Below the beginning of the magenta line you see one little example where the vertical distance between price and polyn.regr. traceline (PRT) is increasing a bit as indication that the downtrend is slowing down; this is no unwanted horizontal shift (=time axis) like with moving averages, but vertical drift as a useful trading signal indicating that the price moves away from a previous reference.

polyn. regr. + trace line


PRT slope color encoding plus fixed distance (pips) channel:

 - don't get confused by those dotted red/light green lines, they belong to the trailing stop management

 - don't get confused by the otherwise similar color scheme, which has a different meaning this time: aqua+yellow+magenta is the PRT line (yellow for neutral slope, aqua for ascending, magenta for descending); the regression itself isn't shown, and the parallel channel is in gray

PRT color encoding

and "Bollinger Style" with standard deviation channel (but PRT instead of MA as central line):

PRT channel


Okay... it's probably now really time to get back to neural networks... (yeah, I know, the off-topic intermezzo was mostly my own fault...):

In the meantime I finished the code of a multicurrency EA version for neural network price forecasting, so that it's now ready for training, validation and testing.

As I mentioned, the predictions in previous attempts (single currency) were not very reliable on average for any random moment in time, so I wanted to concentrate more on finding high-probability setups only.

I chose a classifier model now instead of forecasting exact prices, because the method that I'll use this time closely follows an approach suggested by Dr. Marcos Lopez De Prado ("Quant of the year" 2019, author of the book "Advances in financial machine learning").

The network has 3 outputs that are labeled based on the "triple barrier method":

 - output 1: an upper price level (n pips, fixed distance) is hit first (=upper "horizontal barrier")

 - output 2: a lower price level (n pips, fixed distance) is hit first (=lower "horizontal barrier")

 - output 3: no price level is hit within a max. number of minutes (="vertical barrier")
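To make the labeling concrete, here is a small Python sketch of how one sample could be labeled. It is a simplification and not the EA's code: it works on a plain list of close prices, measures the vertical barrier in bars rather than minutes, and the function name and class numbering (0/1/2) are assumptions for illustration only.

```python
def triple_barrier_label(prices, start, barrier_distance, max_bars):
    """Label one sample: 0 = upper barrier first, 1 = lower barrier first, 2 = timeout.

    prices           -- sequence of close prices
    start            -- index of the bar at which the decision is made
    barrier_distance -- fixed distance of both horizontal barriers (price units)
    max_bars         -- position of the vertical barrier, counted in bars
    """
    entry = prices[start]
    upper = entry + barrier_distance
    lower = entry - barrier_distance
    last = min(start + max_bars, len(prices) - 1)
    for i in range(start + 1, last + 1):
        if prices[i] >= upper:
            return 0
        if prices[i] <= lower:
            return 1
    return 2  # neither horizontal barrier was touched in time
```

With High/Low data, both barriers can be touched within the same bar; a real implementation needs a rule for that case, which this sketch avoids by only looking at closes.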

The activation function of the output layer is "softmax", which has the nice quality that all three outputs together add up to 1.0, so that the individual outputs can be seen as probabilities within a distribution.

Because it is a classifier this time, the loss function that we want to minimize during training isn't MSE (mean squared error) but cross-entropy loss.
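As a quick illustration of these two building blocks, here is a plain Python sketch (standard library only). The function names are chosen for this example, and the loss is written for a single sample whose correct class is given as an index.

```python
import math


def softmax(logits):
    """Turn raw output values into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index, eps=1e-12):
    """Cross-entropy loss for one sample with a one-hot label."""
    return -math.log(max(probs[target_index], eps))
```

For example, `softmax([2.0, 1.0, 0.1])` gives three values that add up to 1 with the first one being the largest, and the loss is smallest when the correct class receives the highest probability.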

The network has a normal MLP architecture for now, but I might also give it a try and compare with LSTM cells.

As I mentioned earlier, MLPs are good for pattern recognition, LSTMs and related types of recurrent networks are better for long time dependencies. So both have advantages. A multilayered fully connected LSTM network combines the advantages of both and this is also the model that I had initially used with the autoencoder. Without the autoencoder (which gets a little complicated with multi-currency trading), computation performance will suffer, which is why I start with a normal MLP; this doesn't mean that it can't have many neurons/layers, but not having to backpropagate on top of that through lots of time-steps is gonna make the training part a lot faster. We'll see.

Nevertheless, we're not done yet with a standard MLP network. Further following the suggestion of Dr. M.Lopez De Prado, I'm taking the outputs and the correct labels and thereby obtain true positives / true negatives / false positives / false negatives and can make a second (!) MLP network learn (after training of the main network) with this "meta labeling", so that I can calculate things like accuracy (validity), precision (reliability), recall and F-Score. The objective is to use these values for selection of high probability setups only.
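The following Python sketch shows how such values could be derived from predictions and correct labels. It is only an illustration under assumptions: classes are plain integers, the metrics are computed one class at a time ("one versus rest"), and the meta-label is taken to be 1 when the primary network was right and 0 otherwise. All function names are hypothetical.

```python
def meta_labels(predicted, actual):
    """1 where the primary model was right, 0 where it was wrong."""
    return [1 if p == a else 0 for p, a in zip(predicted, actual)]


def confusion_counts(predicted, actual, positive):
    """Count TP, FP, TN, FN for one class treated as 'positive'."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 score (0.0 if undefined)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1
```

Precision is the value that matters most for picking setups: it answers how often the class actually occurred when the network predicted it.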

For the inputs of the primary/main network, I'm using n periods of High/Low/Close prices (1.) of the main chart symbol and (2.) of additional symbols that are conveniently passed as an input variable (=comma separated list). Instead of pure prices, I take log returns as a differencing method. The plan is to use at least all major pairs (EURUSD, USDJPY, GBPUSD, USDCAD, USDCHF) plus AUDUSD, as long as MT5 can handle this many price histories simultaneously... It is the job of the neural network to find correlations among the currency pairs by itself and thereby derive possible consequences for the next upcoming prices of the main chart symbol.
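For anyone unfamiliar with log returns, a minimal Python sketch (the function name is just for illustration):

```python
import math


def log_returns(prices):
    """Log return between consecutive prices: ln(p[t] / p[t-1])."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]
```

One practical property is that log returns add up over time: the sum of the returns over several bars equals the log return between the first and the last price, and the values are comparable across symbols with very different price levels (e.g. EURUSD and USDJPY).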

I also added the month, the day of the week and the time as input variables.

For those of you who think about developing neural networks by themselves (may it be MQL or e.g. Python..), let's think for a moment about how to best feed these variables into a network (and if you don't know it yet, maybe I can show a neat trick):

Let's take the hour of the day as an example: 23 is followed by 0... does this really make sense? The minutes 23:59 and 0:00 are direct neighbors, but their values are at the highest possible distance. We have no continuity and the network will have some issues trying to make something meaningful out of this huge step. So what can we do?

One very common method (in fact the standard method for this purpose) is called "one-hot" encoding, which means we don't take just one input for the hour of the day, but 24 (i.e. 0-23). If for example the hour is 15:xx, then input number 15 gets the value 1 and the other 23 of these inputs get the value 0. This method isn't that rare at all: it is the usual way to feed categorical variables (for example class labels in image recognition) into a network.

If we only encode the hour, we need those 24 inputs. If we also encode the minute of the hour we have 60 more. Then 12 for the month... All this is absolutely feasible, but there might be a more elegant way...:
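In Python, one-hot encoding of such a variable could look like this (a minimal sketch; the function name is only for illustration):

```python
def one_hot(value, size):
    """Return a list of `size` inputs with a single 1.0 at position `value`."""
    if not 0 <= value < size:
        raise ValueError("value must be in the range 0..size-1")
    encoded = [0.0] * size
    encoded[value] = 1.0
    return encoded
```

For 15:xx, `one_hot(15, 24)` returns 24 values with a 1.0 at index 15 and 0.0 everywhere else.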

Think of the hour hand of a clock (and let's say this clock has a 24h watchface instead of 12h): instead of taking the value of the hour, we might instead take the angle of the hour hand, then we get a 360 degrees circle. Still, between 359° and 0°, there is this huge gap that we want to avoid. So how do we achieve continuity? The magic trick: the sine and cosine wave function! They are continuous, no gaps between neighbor values. If we put this into code, the declaration of the inputs can then look something like this:

MqlDateTime timestruct;

et voilà.. we just used only 2 inputs for continuous time information that is precise down to the second, instead of 24+60+60=144 inputs for the one-hot encoding method;

sin(2*M_PI*mon/12) and cos(2*M_PI*mon/12)    do the same for the month; this method works for all kinds of such "cyclic" variables.
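For the Python users, here is a sketch of the same sine/cosine trick. It is not a translation of the EA code: the function names are made up, the time of day is taken as seconds since midnight on a 24h "clock face", and the month uses the mon/12 formula from above.

```python
import math
from datetime import datetime


def cyclic_encode(value, period):
    """Map a cyclic variable onto the unit circle as (sin, cos)."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def encode_time_of_day(t):
    """Two inputs for the time of day, precise down to the second."""
    seconds = t.hour * 3600 + t.minute * 60 + t.second
    return cyclic_encode(seconds, 24 * 3600)


def encode_month(t):
    """Two inputs for the month (1..12)."""
    return cyclic_encode(t.month, 12)
```

Both values are needed together: the sine alone is ambiguous (06:00 and 18:00 are the only times with a unique sine value), but the pair (sin, cos) identifies each point on the circle exactly once. Neighbors such as 23:59:59 and 00:00:00 end up right next to each other.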

Okay... now let's see if the multicurrency network version is training without any surprises and I'll come back later with some results...

Jean Francois Le Bas

I wouldn't think of such an MA as a "good trading MA", but just an MA with a lower denoising degree.

the smoothing is adaptive, that's the beauty of it, meaning when there is no reversal in sight, we use bigger periods that smooth out any noise. It's NOT a "mean reversal strategy", it's simply a "reversal strategy"

of course it's hard to determine if we need to use bigger periods or smaller ones.

In the end, what's interesting is that we can use a simple MA for this task, with adaptive period, and it will be VERY responsive. that's another beauty of it: it doesn't require high processing power to be computed (as regression does), because a simple MA does the job.

Jean Francois Le Bas:

I don't see how being very responsive is a good thing. If I want to know where the action is, I can take the price itself. MA[1] is also very responsive ;-)

A chosen precision compromise is not a problem, but the whole point of this denoising thing.

Under the assumption that price = signal + error, all those overtuned MAs tend to return something like MA = signal + 90%*error + time_related_error_component, instead of MA = signal. And yes, polynomial regression isn't perfect either, because it has a vertical error component in situations when the last prices (the action) are outliers to the underlying regression formula, so these are situations where detachment of the price from the indicator is in fact exaggerated by this vertical error component. But I wouldn't see this as a problem, because this phenomenon just exaggerates the basic purpose of any MA. I successfully use these "breakouts" / detachments away from the regression as entry signals in trading with real money.

Isn't it the whole point that we WANT the price to be able to detach itself from the moving average in order to see if something irregular is happening? Why else on earth do you think people use stuff like a 200 days MA?

Adaptive Moving Averages may reduce some error components by a variable degree instead of making use of them as signals of irregularities that could translate into $$$.

And I see another problem. Apart from time lag, the most criticized thing about Moving Averages (in general) is the whipsaws with false signals during ranging market periods. This is more or less impossible to avoid as long as you don't define a minimum price detachment from the indicator (like Bollinger Bands try to do). Now, by using adaptive Moving Average techniques, the indicator stays close to the price even when it leaves the range. This only brings the whipsaw problem to where the action happens, too, which is unnecessary (btw: the autoencoder approach can also be criticized on this).

I obviously don't know what exactly your algorithm is, though I wonder why you say "doesn't require high processing power to be computed" while the description of your indicator says " YOU NEED A HIGH-END CPU TO USE THIS INDICATOR, SPECIALLY IN ADAPTIVE AND REALTIME MODE" (with the capital bold letters as part of the quote)... whatever... I totally agree that moving averages usually need negligible processing power, which is a valid argument.

I also totally appreciate the mathematical beauty of an adaptive Moving Average like Perry Kaufman's AMA, but I just don't see the benefits for actual trading applications. This is my personal opinion and we'll probably just agree to disagree.


update: the first trial with multi-currency training (6 currency pairs in parallel) is looking good... CrossEntropyLoss with softmax activation is steadily declining....

(I paused for the screenshot after only a few iterations)

multi currency MLP


I'm wondering if the "triple barrier" classification method (--> Dr. M. Lopez De Prado) really is the best idea. During early training tests in the visual tester mode I could see that when the output for "vertical barrier gets hit first" got the highest value, the other two values often were not at all evenly distributed. For example:

- output for "vertical barrier first": 0.5

- output for "upper barrier first": 0.35

- output for "lower barrier first": 0.15

In such a situation, the obvious reaction would be to stay flat, because the vertical barrier gets the highest rating. On the other hand, there is an evident long bias, so that (given a good precision and accuracy rating) going long could in fact be the best action. Relatively high ratings for the "flat" class are an inevitable consequence if the vertical barrier is set too close to the decision time. I'm therefore thinking about removing the vertical barrier and just making a binary decision (that is supported by a secondary evaluation through meta-labeling). The "flat" option would still be advised if the value of the binary rating was too close to 0.5, or if accuracy and precision were below a threshold.
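To make the idea discussable, here is a rough Python sketch of such a decision rule. Everything in it is an assumption for illustration: the function names, the threshold values (0.1 distance from 0.5, 0.55 minimum accuracy and precision), and the helper that rescales the two directional outputs of the current 3-class network to see what the long bias would look like without the "flat" class.

```python
def directional_probability(p_upper, p_lower):
    """Share of the 'upper first' output among the two directional outputs."""
    return p_upper / (p_upper + p_lower)


def binary_decision(p_up, accuracy, precision,
                    min_edge=0.1, min_accuracy=0.55, min_precision=0.55):
    """Return 'long', 'short' or 'flat' for a binary up/down rating.

    p_up      -- network rating for 'upper barrier first' (0..1)
    accuracy  -- accuracy estimate from the meta-labeling step
    precision -- precision estimate from the meta-labeling step
    All thresholds are placeholder values, not tested settings.
    """
    if accuracy < min_accuracy or precision < min_precision:
        return "flat"  # the model is not reliable enough in general
    if abs(p_up - 0.5) < min_edge:
        return "flat"  # rating too close to 0.5, no clear bias
    return "long" if p_up > 0.5 else "short"
```

With the example outputs above, `directional_probability(0.35, 0.15)` gives about 0.7, which this rule would treat as a long setup if accuracy and precision are above their thresholds.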

Any thoughts on that are welcome!