I'm pinning this thread to follow the progress of this project.
@Marco vd Heijden, neural networks & patterns: it seems to me you are also working on such a project, aren't you?
That is correct @Icham Aidibe
I have done some work along these lines in the past.
But as you know, I was writing about a different approach, similar to pattern recognition.
Facial recognition, recognizing a cat in a picture, decoding human handwriting, voice to text: it's all basically the same networks,
just adapted to a different target application.
For me the most important part is that it has to be able to learn by itself, feeding only off win/lose feedback and an OHLC datafeed.
Yup, then I did well to call you over here
@Chris70 : Marco is your man !
Time for an update... In the meantime I
1. added the "ADAM" optimizer algorithm as another way to make the learning process more efficient (the Python guys will know what I'm
talking about); it's not really necessary, but it also gives me more options for future work with neural networks
2. implemented apoptosis and pruning functions; now what the hell is this guy talking about? Well... "apoptosis" is
the medical/biological term for "programmed cell death", which is a natural occurrence in the aging body. It also kills some tumor cells
before they develop into actual cancer (which other cells that escape such mechanisms might then do). But cells dying and being
replaced by fresh cells is also an important repair mechanism for the health of any tissue, and it plays its role during brain
development in our childhood --> and here is the connection to neural networks:
When we decide on the architecture of a network, the decision how many layers and how many neurons per layer we actually need isn't obvious
at all. Apart from that, an oversized network is also more likely to suffer from overfitting. So "apoptosis" basically means "killing"
redundant neurons after performing a scheduled ranking of the significance of each neuron. There are different methods how the ranking can
be done, e.g. by looking at the total sum of all (absolute) incoming and outgoing weights to/from an individual neuron. The general concept of
apoptosis goes back to the early 90s and it never was a big thing, but it might actually become one in the near future when we need to make neural networks
more efficient for mobile applications. The reason why it was not super successful until now is probably that reducing the number of
neurons comes at a price, i.e. a decline in accuracy. There is a compromise to be made between making the network smaller (and faster and less
memory demanding) on the one hand and accuracy on the other. Up to a point the impact isn't huge, though. I ran a little test with my autoencoder by making
it completely oversized at first: with 13 layers, I chose 720 neurons for EACH layer (= also all hidden layers) except for the bottleneck
layer (8676 neurons in total (!!!)... now this is obvious overkill, on purpose for the experiment). Then I made the algorithm "kill" 50%
of the neurons (scheduled over the first 10000 backpropagation iterations) and compared the results to another test run where I kept all
neurons alive: with the 50% "apoptosis" the mean squared error (MSE) was 2.5% higher. Not much of a difference...
"Pruning" is also a concept that is stolen directly from neuroanatomy, but this time it doesn't refer to the neuron cells, but to their connections,
namely those "axon" antennas. For the brain, especially during early development, it's "use it or lose it". Connections that are not used can
be retracted, which is part of the plasticity of the brain. Back to computers and neural networks: why not make a neural network more
efficient by deleting some irrelevant weights that have very low absolute values or rarely contribute to neuron activations
("deleting" = assigning "0" values to them in the forward pass and ignoring them in backpropagation, which is the more performance
demanding part)? That's exactly what "pruning" is about: we give our network a little haircut by making it remove some weights and thereby
making it more sparse. The effects are similar to "apoptosis": you do a little and get more performance, you do too much and you lose accuracy.
So both methods have their issues, but I think it's nice to know about the concepts, and if any of you are thinking about developing your own
networks in MQL, it's just one more thing worth considering.
3. [and then I also did some bugfixing and made changes to the file structure, i.e. how I save all the data]
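A minimal sketch of the two ideas from point 2, written in Python for brevity rather than MQL. It is not the code used in this project: it assumes a single hidden layer stored as plain weight matrices, and the names rank_neurons, apoptosis, prune_mask and apply_mask are made up for illustration. Neurons are ranked by the sum of their absolute incoming and outgoing weights (one of several possible rankings), and pruning is done with a 0/1 mask over individual weights.

```python
def rank_neurons(w_in, w_out):
    """Significance per hidden neuron = sum of |incoming| + |outgoing| weights.

    w_in[i][j]  : weight from input i to hidden neuron j
    w_out[j][k] : weight from hidden neuron j to output k
    """
    scores = []
    for j in range(len(w_out)):
        incoming = sum(abs(row[j]) for row in w_in)
        outgoing = sum(abs(w) for w in w_out[j])
        scores.append(incoming + outgoing)
    return scores


def apoptosis(w_in, w_out, kill_fraction):
    """Remove the lowest-ranked fraction of hidden neurons."""
    scores = rank_neurons(w_in, w_out)
    n_kill = int(len(scores) * kill_fraction)
    # neuron indices sorted by significance, least significant first
    order = sorted(range(len(scores)), key=lambda j: scores[j])
    survivors = sorted(order[n_kill:])
    new_w_in = [[row[j] for j in survivors] for row in w_in]
    new_w_out = [w_out[j] for j in survivors]
    return new_w_in, new_w_out, survivors


def prune_mask(weights, threshold):
    """0 for weights with absolute value below the threshold, 1 otherwise."""
    return [[0 if abs(w) < threshold else 1 for w in row] for row in weights]


def apply_mask(weights, mask):
    """Zero out the pruned weights; the same mask can be used to skip them in backpropagation."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]


# toy example (made-up numbers): 2 inputs, 4 hidden neurons, 1 output
w_in = [[0.9, 0.01, -0.5, 0.02],
        [-0.8, 0.02, 0.4, -0.01]]
w_out = [[1.2], [0.03], [-0.7], [0.01]]

small_in, small_out, survivors = apoptosis(w_in, w_out, kill_fraction=0.5)
mask = prune_mask(w_in, threshold=0.05)
sparse_in = apply_mask(w_in, mask)
```

In a real training loop the killing would be scheduled over many iterations, as described above, instead of being done in one step.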
Here is an example (EURUSD 1min chart) of how the trained autoencoder (here an example after ~19,000 iterations) now "sees" a chart.
For this example I used real tick data and 720 inputs with >=5 second increments, adding up to 1 hour of data per input, encoded to 36
numbers in the "bottleneck" neurons, so the encoder is reducing the data by 95%. I think with this example you can see what I meant by
"denoising" in an earlier post: the rebuilt price pattern follows the main(!) moves, but not every little spike. Compared to e.g. just using
a moving average, you can see that regression encoding has the benefit of having absolutely no lag. The resolution is not super detailed, but
that's exactly what we want - a simple representation of what the price is actually doing:
You seem to have a profound grasp of deep nets. I guess you have a biostatistics background. I have some questions:
--- Would the information bottleneck principle further improve your existing deep net algorithm? https://arxiv.org/abs/1503.02406
--- You said you added the "ADAM" optimizer algorithm. Will you share the full code in
the forum? Thanks.
My professional background is in a completely different field, so everything I know is self-taught and there are for sure many things I still
need to learn. But just like I learnt a lot by reading articles that are publicly available on the internet, I hope to be helping some
guys out there who are on a similar journey as I am. So I believe (/hope) that the content of this series is of some value for anybody working with
neural networks, even if I don't plan to give away the entire code. Please understand that a lot of work has been put into it. I think the
potential benefit of this series is more to help with the underlying theory, give some ideas and point out some problems that one might
stumble upon when developing neural networks. Apart from this, I'm doing this as an experiment, i.e. investigating the
practical usefulness of LSTM networks with autoencoders for price forecasting.
The principle behind the "ADAM" algorithm has been explained in many good articles, and the formulas are also publicly available. Here is a
good example about Optimizers, including Adam:
But please note that those optimizers are a nice gimmick that speeds up the learning process a little, but they are by no means necessary.
You'll do just fine with simple vanilla gradient descent!! There are only two things I highly recommend in this case:
(1) working with time decay for the learning rate in order to allow for fine-tuned learning at the later training stages; this can be done by
simply multiplying the learning rate with a factor "factor/iterations"
(2) adding some kind of momentum component in order to reduce the risk of getting stuck in local optima of the loss function, which would prevent the
overall error from declining further; the simplest way to do this is to perform the weight corrections using a running moving
average of the calculated necessary corrections (instead of the individual values).
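The two recommendations can be sketched roughly like this (Python for brevity, not the project's MQL code). The decay schedule below is only one reading of "factor/iterations": the rate stays at its base value until the iteration count exceeds decay_factor and then shrinks proportionally. The function name train and all parameter values are assumptions chosen for the toy example, and the moving average is an exponential one.

```python
def train(grad_fn, w, base_lr=0.1, decay_factor=100.0, beta=0.9, iterations=500):
    """Plain gradient descent on one weight, with learning-rate decay and momentum."""
    avg_grad = 0.0  # running (exponential) moving average of the gradients
    for it in range(1, iterations + 1):
        # (1) time decay: full rate early on, smaller steps in later training stages
        lr = base_lr * min(1.0, decay_factor / it)
        grad = grad_fn(w)
        # (2) momentum: correct the weight with the averaged gradient, not the raw one
        avg_grad = beta * avg_grad + (1.0 - beta) * grad
        w -= lr * avg_grad
    return w


# toy loss (w - 3)^2 with gradient 2 * (w - 3); the minimum is at w = 3
w_final = train(lambda w: 2.0 * (w - 3.0), w=0.0)
```

On this toy loss w_final ends up very close to 3. For a real network the same update would be applied to every weight and bias, each with its own running average.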
The bottleneck principle is the essential component of any neural network autoencoder, so of course I don't only make use of it in my
algorithm, but the whole autoencoder principle would be useless without the bottleneck part. I explained this in post N°4 of this thread.
Update with good news: now that the autoencoder part is complete and I therefore have some usable data, I was also able to couple the
autoencoder with the LSTM network, and all bugs seem to be fixed for now.
I did a few short test runs that confirmed that the LSTM network is performing the learning process correctly (decrease in the loss function
and output of some predicted patterns (= after re-encoding) that visually make a good first impression).
So now it's time for the training part of the LSTM network and trying to make some price predictions that are hopefully useful enough to be put
into some actual trading.
This series has made some progress, so for those who didn't read it all (and maybe as a quick recap summary), once again in a few words what the
program is doing:
1. I trained a neural network (autoencoder type MLP) to be able to encode chart patterns of 1-hour intervals into more simplistic and
2. this encoding process goes along with a reduction of the amount of data by 95% (720 prices with 5 second increment now represented by a
series of only 36 numbers)
3. every 36-number sequence represents a single time step input (=one new input every hour) for another neural network: a "recurrent"
neural network with a special type of memory cell neurons, the so called LSTM (for long short-term memory)
4. these LSTM networks are particularly suitable for time series analysis - in this case: the attempt at forex price prediction
5. with its memory capability and a lookback period of a given number of timesteps I will try to make a prediction 1 hour ahead
6. the "output" = the predictions of the LSTM network will also be in the form of a 36-number sequence --> those are not understandable at first
glance and therefore need to be DEcoded first by once again using the trained autoencoder, now "in reverse" (more precisely: feeding the
36-number sequence into the "bottleneck" part of the autoencoder and calculating from there to the output layer, instead of going from
the inputs to the bottleneck, like one would do for encoding)
7. the LSTM network will be trained to consider the future prices as correct output and will therefore try to do the same in
8. this is possible because during training historical data are used and the "future" is then already known, so the network learns how it must
behave to make the best predictions that the past data allow for
9. it is very well possible that there isn't much to be predicted at all, because the past might not indicate anything at all about future price
moves and maybe the "random walk" theory and the "efficient market hypothesis" are 100% correct - this is something that this experiment
will have a look at
10. IF there is something useful to predict, I will then translate this information into trading conditions
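To illustrate steps 1, 2 and 6 of the recap, here is a heavily simplified sketch in Python (not the project's MQL code). It assumes only one layer on each side of the bottleneck instead of the real multi-layer autoencoder, uses random untrained weights, and feeds a synthetic sine path instead of tick data. The names make_layer, forward, encode and decode are invented for this example. The point is only the data flow: 720 prices in, 36 numbers at the bottleneck, and decoding by starting the forward pass at the bottleneck.

```python
import math
import random

N_INPUTS = 720      # 1 hour of prices at 5-second increments
N_BOTTLENECK = 36   # length of the encoded sequence


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases; a trained network would load learned values."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(layer, inputs, activation):
    """Weighted sum plus bias per neuron, passed through the activation function."""
    weights, biases = layer
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


rng = random.Random(42)
encoder = make_layer(N_INPUTS, N_BOTTLENECK, rng)   # inputs -> bottleneck
decoder = make_layer(N_BOTTLENECK, N_INPUTS, rng)   # bottleneck -> outputs


def encode(prices):
    return forward(encoder, prices, math.tanh)


def decode(code):
    # start at the bottleneck and calculate to the output layer (linear output)
    return forward(decoder, code, lambda z: z)


# placeholder input: a synthetic, already scaled price path
prices = [math.sin(i / 60.0) for i in range(N_INPUTS)]
code = encode(prices)      # one 36-number time step, as the LSTM would receive it
rebuilt = decode(code)     # how a 36-number LSTM output is turned back into 720 prices
reduction = 1.0 - N_BOTTLENECK / N_INPUTS   # 0.95, i.e. the 95% mentioned above
```

With untrained weights the rebuilt path is of course meaningless; only after training does decode(encode(prices)) resemble the input.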
If you google for pictures of LSTM networks, you will often see representations where there is just one neuron cell shown per timestep. This
is confusing at first and often only serves the purpose of simplicity (although those very basic time series analysis networks with just one
neuron and one input per timestep certainly do exist and they may work just fine, depending on the task at hand). When dealing with neural
networks most of the variables actually are array elements, so almost everything is represented by vectors/matrices and not by single
numbers (=scalars). This is why in graphical representations of neural network architectures often a single neuron in reality stands for a
whole layer of neurons and it doesn't have just one input, but a vector of inputs, that each are associated with weights.
I'm mentioning all this because the LSTM network that this thread is about is also 3-dimensional: (1) several layers with (2) several neurons
each are stacked upon each other and all of those neurons are connected via (3) the time dimension to earlier versions of themselves.
All the calculations, both in the learning process and during forward testing, go through this entire 3-dimensional grid. It's
probably no surprise that this comes with some challenges. I'm always astonished how the algorithm calculates its way through
several thousand neurons per second. Yes: with MQL5 !! Say what you want about MQL, but nobody shall say that it can't be fast if we let it...
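The 3-dimensional grid can be pictured with a forward-pass-only sketch (Python for brevity, not the project's MQL code). It uses the standard LSTM cell equations with forget, input, candidate and output gates, zero biases and random weights. The class LSTMLayer, the function run_stack and the sizes (36 inputs, 36 units, 3 layers, 48 timesteps) are assumptions picked for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class LSTMLayer:
    """One layer of LSTM memory cells (forward pass only, biases omitted)."""

    def __init__(self, n_inputs, n_units, rng):
        self.n_units = n_units
        n_total = n_inputs + n_units  # current input plus previous hidden state
        scale = 1.0 / math.sqrt(n_total)
        # one weight matrix per gate: forget, input, candidate, output
        self.w = {g: [[rng.uniform(-scale, scale) for _ in range(n_total)]
                      for _ in range(n_units)] for g in ("f", "i", "c", "o")}

    def step(self, x, h_prev, c_prev):
        z = x + h_prev  # list concatenation: every neuron sees a vector of inputs

        def gate(name, act):
            return [act(sum(w * v for w, v in zip(row, z))) for row in self.w[name]]

        f = gate("f", sigmoid)
        i = gate("i", sigmoid)
        g = gate("c", math.tanh)
        o = gate("o", sigmoid)
        # new cell state: keep part of the old memory, add part of the candidate
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c


def run_stack(layers, sequence):
    """Outer loop = time dimension, inner loop = layer dimension."""
    states = [([0.0] * l.n_units, [0.0] * l.n_units) for l in layers]
    outputs = []
    for x in sequence:
        signal = x
        for idx, layer in enumerate(layers):
            h, c = layer.step(signal, *states[idx])
            states[idx] = (h, c)   # link to the "earlier version" of the same neurons
            signal = h             # output of this layer feeds the next layer
        outputs.append(signal)
    return outputs


rng = random.Random(0)
n_features, n_units, n_layers, n_steps = 36, 36, 3, 48
layers = [LSTMLayer(n_features if k == 0 else n_units, n_units, rng)
          for k in range(n_layers)]
sequence = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_steps)]
outputs = run_stack(layers, sequence)   # one 36-number vector per timestep
```

Training additionally has to walk the same grid backwards (backpropagation through time), which is where most of the computing effort goes.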
The challenge comes more with the absolute numbers, i.e. seeing some extremely high or low numbers that can no longer be represented as
"double" precision floating point numbers. LSTM type recurrent neural networks (RNNs) suffer much less from the "vanishing gradient"
problem than simple RNNs, but this doesn't mean that there are
no limitations. Whether exploding or vanishing numbers are produced depends on many things: the scaling of inputs and labels (=
targets = "real" values), the method of weight and bias initialization, the type of activation functions, the dimensions of the network
and the size in each dimension.
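Why long chains of timesteps run into the limits of double precision can be shown with a toy calculation (Python floats are IEEE doubles, like "double" in MQL). This is only an illustration, not the network's actual gradient computation: the same factor is multiplied in once per timestep, as a stand-in for a value that gets rescaled at every step back through time. The helper names repeated_product and is_usable are made up.

```python
import math


def repeated_product(factor, timesteps):
    """Multiply 1.0 by the same factor once per timestep."""
    value = 1.0
    for _ in range(timesteps):
        value *= factor
    return value


def is_usable(x):
    """True if x is finite and has not collapsed to zero."""
    return math.isfinite(x) and x != 0.0


vanished = repeated_product(0.1, 500)    # too small for a double: becomes 0.0
exploded = repeated_product(10.0, 500)   # too large for a double: becomes inf
stable = repeated_product(1.0, 500)      # stays at 1.0
not_a_number = exploded - exploded       # inf - inf gives NaN
```

A check like is_usable on weights and gradients after each iteration makes it easy to spot the moment a training run goes off the rails.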
In my early test runs (all with 36 neurons in all LSTM layers) I was able to get the training going with for example:
- 500 timesteps and just 1 LSTM layer
- 48 timesteps and 3 LSTM layers
- 24 timesteps and 5 LSTM layers
Every time I went significantly beyond these orders of magnitude, I was seeing NaN/Inf numbers and no quick decrease of the loss function.
I need some more fine-tuning before I decide which dimensions I will go with, but at least it totally seems like something useful to work with.
Before the actual test in any "scientific" experiment, of course the hypothesis ("significant predictions are possible") versus the null
hypothesis ("it's just a random walk") and the methodology need to be clearly defined.
Once I get the results, it should also be clear how to interpret them. These decisions should be made in
advance, not by just collecting the results and deciding if I like what I see.
I attached a file as an example of what my network test
result reports currently look like (forget about the actual numbers there, this is really just an example).
I certainly want to look at
- r squared (coefficient of determination)
- mean absolute error
- max. absolute deviation
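For reference, the three measures can be computed as follows (a Python sketch with made-up toy numbers; the function names are chosen for this example, and "labels" means the real values as used elsewhere in this thread):

```python
def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Not defined if all labels are identical (SS_tot would be zero).
    """
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(labels, predictions):
    return sum(abs(y - p) for y, p in zip(labels, predictions)) / len(labels)


def max_absolute_deviation(labels, predictions):
    return max(abs(y - p) for y, p in zip(labels, predictions))


labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 2.0, 2.5, 4.0]

r2 = r_squared(labels, predictions)                    # 0.9
mae = mean_absolute_error(labels, predictions)         # 0.25
max_dev = max_absolute_deviation(labels, predictions)  # 0.5
```

Note that r squared can become negative when the predictions are worse than simply predicting the mean of the labels.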
But especially with "r squared", the problem is that there is no cut-off that defines "good enough". It clearly shouldn't be zero, but I will
need something else in order to decide if my predictions are statistically significant.
Question for the statistics guys out there: does anybody have a good idea which measure I could use?
I should mention that
(1.) both predictions and labels of course won't be 'normally' distributed due to the "fat tails" problem with financial data and
(2.) I'm not dealing with a single prediction, but a whole series of price forecasts for the next hour. Of course I can (and will) extract the
predicted open/high/low/close from them, but it would be nice to have some kind of statistical tool to decide if the whole price sequence
might also be just random, or if instead I have actually predicted something useful
(3.) it shouldn't be super complicated to implement in MQL code, because I would very much prefer to work directly with the numbers that
the program returns, rather than depend on R / SPSS / MATLAB ...
Any thoughts / ideas very much appreciated !
go on man! you're a pioneer, buy/sell & show us something concrete expressed in USD
Focus on point 3. Then you have other means at your disposal, like writing an indicator, which does not require any statistical measure.
A statistical measure alone can be misleading; plotting a chart would reveal everything very quickly.
Here is a picture of a simple sine wave forecast, just to make sure the code I wrote works.
I work with no third party software, just MQL. I use "online training" method only. The picture shows the start (zero knowledge) and the
Here is a pic from another "indicator" trying to forecast EURUSD. Note that it shows the naive forecast.
My point being: a picture says more than a thousand measurements.
Regarding testing for random walk: you can use your LSTM. If it reverts to the naive forecast and you know your LSTM is capable of predicting, the culprit
will be your data, where no structure can be found. There are also other ways. Read this
Or just google random walk or random walk forecast.
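The comparison with the naive forecast can also be put into a single number. The sketch below (Python for brevity; all names and the toy series are assumptions) divides the model's mean absolute error by that of the naive forecast, i.e. "the next value equals the last known value". It is only a rough benchmark and not a significance test.

```python
def naive_forecast(series):
    """Persistence forecast: the prediction for step t is the value at step t-1."""
    return series[:-1]


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_mae(series, model_predictions):
    """Model MAE divided by naive-forecast MAE.

    model_predictions[k] is the model's forecast for series[k + 1].
    Below 1: the model beat the naive forecast on this sample.
    Around 1: the model is no better than repeating the last value.
    """
    actual = series[1:]
    return mae(actual, model_predictions) / mae(actual, naive_forecast(series))


series = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect = [2.0, 3.0, 4.0, 5.0]   # hits every next value
lagging = [1.0, 2.0, 3.0, 4.0]   # a "model" that has reverted to the naive forecast

score_perfect = relative_mae(series, perfect)   # 0.0
score_lagging = relative_mae(series, lagging)   # 1.0
```

A network whose score stays around 1 on data outside the training set has learned nothing beyond the naive forecast.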