Taking Neural Networks to the next level

 

Thanks for the insights you've shared; they are inspiring. I am also eager to know more about the autoencoder use: you say it can replace an MA and has no lag, so what is the catch?


If you are willing to share, could you also tell me more about what you think of as an outlier, or how you define it?

 

Let's change the "lag" question a little bit and ask ourselves: why is "lagging" a mandatory quality of any moving average, or why is it impossible for an MA not to lag?

Simply because, for a given point in time, a moving average marks a price LEVEL (think of a scalar instead of a vector). Yes, sure, we can look at MAs and observe bent lines with all kinds of upward and downward components, but this is only the result of many data points in a row. The individual data points don't carry any directional information (in other words: the "best answer" for an "extrapolation" of a moving average is to draw a horizontal line, i.e. one that stays at that level). Of course, you could extrapolate by applying a gradient to an MA curve (the human eye/brain easily sees such a tendency and will make this extrapolation for us), but then you have something else, beyond the basic idea of a moving average: some kind of trend line of a moving average is not the same as the moving average itself.

This is a problem. Consider for example a series of prices that just go up and up and up... Any 6-year-old could draw a "regression" line and see that the next data point in the sequence will probably be higher than the previous data points, because of the visible directional information. But a moving average can NEVER be higher than the past data points it is calculated from. This is why a moving average always trails behind where the action happens.

Any regressive approximation model, on the other hand, can easily extrapolate to higher/lower values (i.e. beyond the range of the input data) that perfectly fit the tendency of the previous sequence. This is not just true for autoencoders. You mentioned e.g. the Fast Fourier Transform; that should work just as well.

I mentioned polynomial regression (or its non-repainting trace line) and said that it lags less than e.g. a Hull Moving Average. Still, it does lag a little bit, and it does so especially when the chart is very volatile / "spiky", because it is delayed when it tries to go around such corners/spikes. Why is that so? Because of the possible shapes that polynomial functions can have in general: they are always smooth curves and can never have sharp edges/spikes. Fourier transforms will have the same "problem", by the way, if the underlying periodic spectrum lacks any spikes. An extremely complex non-linear function represented by hundreds of neurons in an autoencoder, on the other hand, can be as smooth or non-smooth as necessary. It can draw a price curve of any shape that is the best possible denoised approximation in that exact moment (and not 5 minutes ago).
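To make the "can never be higher than its inputs" point a bit more tangible, here is a minimal sketch (stand-alone helper functions made up for this illustration, not code from my EA; it assumes the array holds at least 'period' prices): it compares the last value of a simple moving average with the endpoint of a least-squares line fitted to the same window. On a steadily rising series the SMA endpoint sits below every recent price, while the regression endpoint keeps up with the last data point.

// minimal illustration: SMA endpoint vs. linear regression endpoint
// over the same lookback window (prices[0] = oldest, prices[n-1] = newest)
double sma_endpoint(const double &prices[],int period)
  {
   int n=ArraySize(prices);
   double sum=0;
   for(int i=n-period;i<n;i++) sum+=prices[i];
   return sum/period;
  }

double linreg_endpoint(const double &prices[],int period)
  {
   // least-squares line y = a + b*x over the last 'period' points,
   // evaluated at the newest bar (x = period-1)
   int n=ArraySize(prices);
   double sx=0,sy=0,sxx=0,sxy=0;
   for(int x=0;x<period;x++)
     {
      double y=prices[n-period+x];
      sx+=x; sy+=y; sxx+=x*x; sxy+=x*y;
     }
   double b=(period*sxy-sx*sy)/(period*sxx-sx*sx);
   double a=(sy-b*sx)/period;
   return a+b*(period-1);
  }

// example: on the series 1,2,3,...,20 the SMA(10) endpoint is 15.5,
// while the regression endpoint over the same 10 bars is 20, i.e. right at the last price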

-----

By the way: I still owe you guys an answer as to whether better forex predictions with neural networks are possible if we use multi-currency inputs; I didn't finish this attempt because of performance issues (memory problems during training when Metatrader tries to load historical data for many symbols), but I haven't given up on it yet. 

 
Brian Rumbles:

If you are willing to share, could you also tell me more about what you think of as an outlier, or how you define it?

I simply think of outliers as the tails of a probability distribution, with the probability of a given deviation from the mean of the distribution resulting from the cumulative distribution function (cdf).

The only thing we need to keep in mind is that "normal" (Gaussian) distributions are in many situations useless in the stock and currency markets, so if we want to know if a price move is extreme, just using quantiles derived from standard deviations is probably not the best idea. I'm not saying that I haven't often used standard deviations in the past, but it's a bad habit I have recently tried to get rid of.

I think a better-matching alternative is the Laplace distribution; the Student's t and Cauchy distributions can also be okay depending on the settings, but there are probably some math geeks out there with a more comprehensive answer on the subject.

I just took the formulas for the probability density functions and cumulative distribution functions from the Wikipedia articles.

 

It takes us a little away from the neural network topic, but because you asked...:

(note: the formulas are not exactly simple and mistakes can happen quickly, so if anybody finds an error, you're welcome to tell me)

// ===== DISTRIBUTION FUNCTIONS =====

//+------------------------------------------------------------------+
//|      probability density functions (pdf)                          |
//+------------------------------------------------------------------+
// Gaussian normal
double pdf_norm(double x_val,double mu=0, double sigma=1)
  {
   return (1/sqrt(2*M_PI*pow(sigma,2))) * exp(-0.5*pow((x_val-mu)/sigma,2));
  }
// Cauchy  
double pdf_cauchy(double x_val,double x_peak, double gamma)
  {
   return 1 / (M_PI*gamma*(1+pow((x_val-x_peak)/gamma,2)));
  }
// Laplace
double pdf_laplace(double x_val,double mu=0,double scale_factor=0.707106781)
  {
   // note: standard for the scale factor is sigma/sqrt(2)
   return exp(-fabs(x_val-mu)/scale_factor)/(2*scale_factor);
  }

// Pareto
double pdf_pareto(double x_val,double alpha, double tail_index)
  {
   // note: 'tail_index' is used here as the scale parameter (the minimum value x_m);
   //       alpha is the shape parameter (the actual tail index)
   if (x_val>=tail_index)
     {return (alpha*pow(tail_index,alpha))/pow(x_val,alpha+1);}
   else
     {return 0;}
  }
  
// Lomax (=Pareto type II)
double pdf_lomax(double x_val,double alpha=1,double tail_index=1)
  {
   // note: not defined for x<0 !!
   return (alpha/tail_index)*pow(1+MathMax(x_val,0)/tail_index,-(alpha+1));
  }
  
//+------------------------------------------------------------------+
//|      cumulative distribution functions (cdf)                     |
//+------------------------------------------------------------------+
// Gaussian normal
double cdf_norm(double x_val,double mu=0,double sigma=1)
  {
   return 0.5*(1+erf((x_val-mu)/(sigma*sqrt(2))));
  }
  
// Cauchy
double cdf_cauchy(double x_val,double x_peak=0,double gamma=1)
  {
   return 0.5 + atan((x_val-x_peak)/gamma)/M_PI;
  }
  
// Laplace
double cdf_laplace(double x_val,double mu=0,double scale_factor=0.707106781)
  {
   if (x_val<mu)
     {return 0.5*exp((x_val-mu)/scale_factor);}
   else
     {return 1-0.5*exp(-(x_val-mu)/scale_factor);}
  }
  
// Pareto
double cdf_pareto(double x_val,double alpha,double tail_index)
  {
   if (x_val>=tail_index)
     {return 1-pow(tail_index/x_val,alpha);}
   else
     {return 0;}
  }
  
// Lomax
double cdf_lomax(double x_val,double alpha,double tail_index)
  {
   // note: not defined for x<0!!
   return 1-pow(1+MathMax(0,x_val)/tail_index,-alpha);
  }
  
//+------------------------------------------------------------------+
//|      gaussian error function (erf)                               |
//+------------------------------------------------------------------+
// necessary auxiliary function for the gaussian cdf;
// approximation of the Q-function by Karagiannidis & Lioumpas (2007),
// converted via erf(x) = 1 - 2*Q(x*sqrt(2))
double erf(double x_val)
  {
   if(x_val==0) return 0;
   double A=1.98;
   double B=1.135;
   double z=fabs(x_val)*sqrt(2);
   double q=(1-exp(-A*z))*exp(-0.5*z*z)/(B*sqrt(2*M_PI)*z);
   return sign(x_val)*(1-2*q);
  }

//+------------------------------------------------------------------+
//|      signum function                                             |
//+------------------------------------------------------------------+
double sign(double x_val)
  {
   return (double)(x_val>=0)-(double)(x_val<0);
  }

//+------------------------------------------------------------------+
//|      random values for a given distribution                      |
//+------------------------------------------------------------------+
// reasoning: inverse transform sampling, i.e. x = cdf^-1(u) with u drawn
// from a uniform distribution on (0,1); only the normal distribution has
// no closed-form inverse cdf, so the Box-Muller transform is used there

double rand_uni01()
  {
   return ((double)rand()+1.0)/32769.0;                // uniform random value within the open interval (0,1)
  }

double rand_norm(double mu=0,double sigma=1)
  {
   double u1=rand_uni01();
   double u2=rand_uni01();
   return mu + sigma*sqrt(-2*log(u1))*cos(2*M_PI*u2);  // Box-Muller transform
  }

double rand_cauchy(double x_peak=0,double gamma=1)
  {
   return x_peak + gamma*tan(M_PI*(rand_uni01()-0.5));
  }

double rand_uni(double x_mean=0.5,double range=0.5)
  {
   return x_mean + range*(2*rand_uni01()-1);           // uniform within x_mean +/- range
  }

double rand_laplace(double mu=0,double scale_factor=0.707106781)
  {
   double u=rand_uni01()-0.5;
   return mu - scale_factor*sign(u)*log(1-2*fabs(u));
  }

double rand_pareto(double alpha=1,double tail_index=1)
  {
   return tail_index/pow(rand_uni01(),1/alpha);        // tail_index = minimum value / scale (see pdf_pareto)
  }

double rand_lomax(double alpha=1,double tail_index=1)
  {
   return tail_index*(pow(rand_uni01(),-1/alpha)-1);
  }
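And as a small usage example (again just my own illustration, the helper names tail_prob_laplace / tail_prob_norm are made up and not part of the functions above): a cdf can be turned into a two-sided tail probability, which is a direct measure of how "extreme" a standardized move is. For large standardized moves the Laplace assumption assigns a noticeably larger tail probability than the normal one, which is exactly the fat-tail point from above.

// two-sided tail probability of a standardized deviation x_val
// (how likely is a move at least this extreme?)
double tail_prob_laplace(double x_val,double mu=0,double scale_factor=0.707106781)
  {
   double p=cdf_laplace(x_val,mu,scale_factor);
   return 2*MathMin(p,1-p);
  }

double tail_prob_norm(double x_val,double mu=0,double sigma=1)
  {
   double p=cdf_norm(x_val,mu,sigma);
   return 2*MathMin(p,1-p);
  }

// example: flag a standardized move z_move as an outlier below a 1% threshold
// bool is_outlier = (tail_prob_laplace(z_move) < 0.01);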
 
Chris70:

Let's change the "lag" question a little bit and ask ourselves: why is "lagging" a mandatory quality of any moving average, or why is it impossible for an MA not to lag?

Simply because, for a given point in time, a moving average marks a price LEVEL (think of a scalar instead of a vector). Yes, sure, we can look at MAs and observe bent lines with all kinds of upward and downward components, but this is only the result of many data points in a row. The individual data points don't carry any directional information (in other words: the "best answer" for an "extrapolation" of a moving average is to draw a horizontal line, i.e. one that stays at that level). Of course, you could extrapolate by applying a gradient to an MA curve (the human eye/brain easily sees such a tendency and will make this extrapolation for us), but then you have something else, beyond the basic idea of a moving average: some kind of trend line of a moving average is not the same as the moving average itself.

This is a problem. Consider for example a series of prices that just go up and up and up... Any 6-year-old could draw a "regression" line and see that the next data point in the sequence will probably be higher than the previous data points, because of the visible directional information. But a moving average can NEVER be higher than the past data points it is calculated from. This is why a moving average always trails behind where the action happens.

Any regressive approximation model, on the other hand, can easily extrapolate to higher/lower values (i.e. beyond the range of the input data) that perfectly fit the tendency of the previous sequence. This is not just true for autoencoders. You mentioned e.g. the Fast Fourier Transform; that should work just as well.

I mentioned polynomial regression (or its non-repainting trace line) and said that it lags less than e.g. a Hull Moving Average. Still, it does lag a little bit, and it does so especially when the chart is very volatile / "spiky", because it is delayed when it tries to go around such corners/spikes. Why is that so? Because of the possible shapes that polynomial functions can have in general: they are always smooth curves and can never have sharp edges/spikes. Fourier transforms will have the same "problem", by the way, if the underlying periodic spectrum lacks any spikes. An extremely complex non-linear function represented by hundreds of neurons in an autoencoder, on the other hand, can be as smooth or non-smooth as necessary. It can draw a price curve of any shape that is the best possible denoised approximation in that exact moment (and not 5 minutes ago).

-----

By the way: I still owe you guys an answer as to whether better forex predictions with neural networks are possible if we use multi-currency inputs; I didn't finish this attempt because of performance issues (memory problems during training when Metatrader tries to load historical data for many symbols), but I haven't given up on it yet. 

Please show a curve so we believe you.

I got excellent results with my Adaptive SSA (better than regular SSA, which is already adaptive).


As for the large difference between sudden moves and the MA: what you need to do is split tall candles into smaller ones (for example, if a candle is taller than 300 points, then create a new candle). This solves the problems you can have with usual MAs, and you can catch the reversal much sooner than at the end of the bar.

 
Jean Francois Le Bas:

Please show a curve so we believe you.

I got excellent results with my Adaptive SSA (better than regular SSA, which is already adaptive).

Sorry, I can't show any curves from Metatrader screenshots right now because I'm running a backtest that lasts several days, but it's a simple mathematical fact that any moving average is not only "biased" away from the last data point in the sequence towards the mean of the preceding data points, it IS that mean - so there's just no other way.

You can apply whatever tricks you want, like WMA, TEMA... or an adaptive SSA that you sell for 225 $$$ (good luck with that... just in case you're only trying to hijack this thread for your advertisement (??)): nothing can change the fact that the mean of a series of non-identical numbers can, for example, never exceed or match the extremes of the sequence, whereas the reality of a "best fit" sometimes requires exactly that. Any improvement is only possible by adding more data, but that cancels out the reasoning behind a moving average (if you think about it, the only truly non-lagging MA is the one with a period length of 1, i.e. the price itself).

So in which sense are regression models better? Regression - to my understanding - is an iterative method for finding the best(!) fit that the amount of available data allows for, under the simultaneous requirement of the best compromise between best fit and best possible generalization (i.e. least overfitting).

Try to be better than best and you'll know why regression will always win against averaging techniques. Regression gives us the least (possible) error by definition. Or more precisely: the error asymptotically approaches the minimum error possible under the circumstances of the given amount of data and the potential restrictions of the applied regression formula. A simple linear regression is such a restricted example that still has some way to go before deserving the term "best possible". Neural networks can have infinite complexity, which is why there is no theoretical limit to the model quality (inverse relationship between complexity and restrictions), which in the end does in fact allow for this "best possible" model.

[Disclaimer: eliminating overfitting behaviour during validation is of course a bit of trial and error, but overfitting is nothing exclusive to regression models; it affects low-period MAs, too, and lagging and overfitting are two completely different problems, so it's not the subject here.]

Don't know about you, but next time you go on a date wouldn't you prefer "best possible match" over "moving average"...?   ;-)

You could subtract a moving average (or your SSA) from a regression of "best fit"; then you'll know exactly how much unnecessary residual error you have. Regression in practice of course also has residual error, because the best possible fit usually isn't the perfect fit, but I promise you that it will always be lower than with any averaging method.

Please logically falsify this before you ask for curves (I don't sell anything, by the way).
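If anybody wants to put numbers on the residual-error comparison, here is a rough sketch (only an illustration, nothing from my EA; residual_comparison is a made-up name and the array is assumed to hold at least 'period' prices): over one window it measures the root-mean-square deviation of the prices from their mean (which is what a moving average delivers for that window) and from the least-squares line. The line can never do worse, because a horizontal line at the mean is just a special case of a line.

// in-window residual comparison: prices vs. their mean (= the MA value for
// that window) and prices vs. the least-squares line over the same window
void residual_comparison(const double &prices[],int period,double &rmse_mean,double &rmse_linreg)
  {
   int n=ArraySize(prices);
   double sx=0,sy=0,sxx=0,sxy=0;
   for(int x=0;x<period;x++)
     {
      double y=prices[n-period+x];
      sx+=x; sy+=y; sxx+=x*x; sxy+=x*y;
     }
   double mean=sy/period;
   double b=(period*sxy-sx*sy)/(period*sxx-sx*sx);
   double a=(sy-b*sx)/period;
   double e_mean=0,e_line=0;
   for(int x=0;x<period;x++)
     {
      double y=prices[n-period+x];
      e_mean+=pow(y-mean,2);                  // squared residual around the mean
      e_line+=pow(y-(a+b*x),2);               // squared residual around the fitted line
     }
   rmse_mean  =sqrt(e_mean/period);
   rmse_linreg=sqrt(e_line/period);
  }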

 

I've just remembered... I experienced similar results with NeuroShell Trader software. I predicted the RSI indicator 1 bar into the future with NeuroShell's shallow neural network algorithm. The one-bar-ahead predicted RSI was slightly smoother than the original RSI, without lag, but the smoothing was nowhere near spectacular.

@Chris70 

My question is: how smooth is your autoencoder algorithm that serves as an alternative to a moving average? Can you post a close-up picture of your autoencoder, please?

I tried a sort of dimensionality reduction with Chaos Hunter software (particle swarm / evolutionary algorithms). I used dozens of different moving averages as inputs and applied some of Chaos Hunter's functions (polynomial, neural network, trigonometric functions, etc.). Only two moving averages produced some interesting results (Fourier type). If I can find the test pictures I will post one. The problem is that I am not a coder and I could not find the right coder to implement the formula in MQL4. And I am not asking anyone here to code it.

 

@nevar: thanks for the keyword "dimensionality reduction", because this is exactly what it is all about and many people get it wrong:

It's a no-brainer that our charts are normally 2-dimensional: a time axis and a price axis. If we calculate a moving average for a given period, the dimensionality isn't just reduced; the time axis is completely eliminated, because the order of a sequence of prices is irrelevant for their mean. We just put a bunch of prices into a bucket; we can mix them around as much as we like and their average will still be the same.

Now, what is the dimension of "lag"? Exactly: time! So it's not really astonishing that it's hard to get rid of lag after all the time-dimension information has been erased. It's impossible! We are averaging exclusively on the price axis!

Try instead to change the order of the data points in a 2-dimensional regression scenario and you'll get a completely different result.
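A tiny toy example of that (made-up numbers, and the helper names mean_of / slope_of are only for this illustration): shuffle the same set of prices and the mean does not move at all, while the slope of a least-squares fit changes completely, because regression actually uses the time axis.

// the mean ignores the order of the data points, a regression does not
double mean_of(const double &y[])
  {
   double s=0;
   for(int i=0;i<ArraySize(y);i++) s+=y[i];
   return s/ArraySize(y);
  }

double slope_of(const double &y[])
  {
   // slope of the least-squares line over x = 0,1,2,...
   int n=ArraySize(y);
   double sx=0,sy=0,sxx=0,sxy=0;
   for(int x=0;x<n;x++) {sx+=x; sy+=y[x]; sxx+=x*x; sxy+=x*y[x];}
   return (n*sxy-sx*sy)/(n*sxx-sx*sx);
  }

void OnStart()
  {
   double rising[5]  ={1,2,3,4,5};
   double shuffled[5]={4,1,5,2,3};
   Print("means:  ",mean_of(rising)," vs ",mean_of(shuffled));   // 3 vs 3
   Print("slopes: ",slope_of(rising)," vs ",slope_of(shuffled)); // 1 vs -0.1
  }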

What we really want is signal/noise separation, not the complete(!) removal of a dimension.

It doesn't really surprise me (and it perfectly supports my argument) that you saw better results with Fourier-type filters, simply because they have nothing to do with moving averages.

Again, sorry, no screenshots are possible for a couple of days. You may remember the pictures with those pink dots earlier in this thread; this is how autoencoders see a chart. Essentially, the smoothness depends very much on the complexity of the network and can be deliberately changed with the click of a button (by changing a few EA input settings). Anything is possible, from an extremely simple representation to a perfect copy of the original chart pattern (which isn't what we want, because we want to remove some noise). In order to be able to draw more complex and non-smooth shapes that are closer to a possibly also non-smooth underlying "real" (= denoised) signal, we essentially need enough hidden neurons plus a number of bottleneck neurons that isn't ridiculously low. These are just model choices, not limitations. It can be set up exactly as desired.

 
Chris70:

@nevar: thanks for the keyword "dimensionality reduction", because this is exactly what it is all about and many people get it wrong:

It's a no-brainer that our charts are normally 2-dimensional: a time axis and a price axis. If we calculate a moving average for a given period, the dimensionality isn't just reduced; the time axis is completely eliminated, because the order of a sequence of prices is irrelevant for their mean. We just put a bunch of prices into a bucket; we can mix them around as much as we like and their average will still be the same.

Now, what is the dimension of "lag"? Exactly: time! So it's not really astonishing that it's hard to get rid of lag after all the time-dimension information has been erased. It's impossible! We are averaging exclusively on the price axis!

Try instead to change the order of the data points in a 2-dimensional regression scenario and you'll get a completely different result.

What we really want is signal/noise separation, not the complete(!) removal of a dimension.

It doesn't really surprise me (and it perfectly supports my argument) that you saw better results with Fourier-type filters, simply because they have nothing to do with moving averages.

Again, sorry, no screenshots are possible for a couple of days. You may remember the pictures with those pink dots earlier in this thread; this is how autoencoders see a chart. Essentially, the smoothness depends very much on the complexity of the network and can be deliberately changed with the click of a button (by changing a few EA input settings). Anything is possible, from an extremely simple representation to a perfect copy of the original chart pattern (which isn't what we want, because we want to remove some noise). In order to be able to draw more complex and non-smooth shapes that are closer to a possibly also non-smooth underlying "real" (= denoised) signal, we essentially need enough hidden neurons plus a number of bottleneck neurons that isn't ridiculously low. These are just model choices, not limitations. It can be set up exactly as desired.

Thank you for your input. Sorry, I didn't mean to plug my product more than that, but I still think it's a good MA alternative. Yes, regression is good, but I think it's not optimal at the end points (or am I missing something?).

Of course MAs will ALWAYS lag, unless maybe you use fewer past bars (the more past bars, the more lag; I think that's just logical) while retaining a good amount of smoothing.

Also, I think another alternative to your no-lag MA would be to adapt the curve to the price: use smaller periods when there is higher volatility, and conversely when there is low activity. Of course there are millions of different ways to do that, but I think it's another good way to process time series. The curve is then not "smooth" at all, but it reacts to the price where we need it to, and that's all that matters (in trading, of course).
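For example, something like this (just a rough sketch of the idea, not from any actual product; adaptive_ema is a made-up name and the volatility proxy is only one of those million possible ways): an EMA whose smoothing factor is scaled by the ratio of the current absolute price change to its longer-run average, so the filter reacts faster when the market moves more.

// sketch of a volatility-adaptive EMA: the effective period shrinks when
// recent absolute price changes are large relative to their average,
// and grows when the market is quiet
double adaptive_ema(const double &prices[],int base_period,int vola_period)
  {
   int n=ArraySize(prices);
   double ema=prices[0];
   for(int i=1;i<n;i++)
     {
      // volatility proxy: current absolute change vs. its average over the last vola_period changes
      int start=i-vola_period+1; if(start<1) start=1;
      double avg_change=0; int cnt=0;
      for(int k=start;k<=i;k++){avg_change+=fabs(prices[k]-prices[k-1]); cnt++;}
      avg_change/=cnt;
      double ratio=(avg_change>0)? fabs(prices[i]-prices[i-1])/avg_change : 1.0;
      double alpha=2.0/(base_period+1)*ratio;           // base EMA factor, scaled up by current activity
      if(alpha>1.0) alpha=1.0;
      ema+=alpha*(prices[i]-ema);
     }
   return ema;                                          // value of the adaptive filter at the newest bar
  }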


What is funny is that many algorithms try to smooth the data very well (meaning they reduce the "peaks"), while a good trading "MA" tries to keep the peaks while smoothing everywhere else.

Do you think a NN can be trained to do that?


Jeff

 

@Jean Francois(/Jeff): Thanks for your reply. All this takes us further away from the Neural Network topic, but I think it's interesting.

If we try to figure out whether there are better alternatives to moving averages, we should first clarify the objective, i.e. what are moving averages made for? I think their purpose is to give us a simpler representation of the predominant bigger price moves, so that we can profit from any drifts away from these major moves, for example as an indication of a new trend, for mean-reversion strategies, etc. What I am saying is that we WANT a non-perfect representation. But we want this lower precision on the price axis only, NOT on the time axis.

Excluding irrelevant noise spikes, and therefore coming up with a representation that doesn't perfectly fit the original pattern, is not a problem; it is the whole purpose. What I'm saying here is nothing new and nothing you're not aware of, so the reason I'm pointing it out is just that, for the purpose of denoising / simplification, i.e. smoothing the price, there is simply no necessity to smooth the time axis, too. This is just not part of the objective. But moving averages do exactly that.

I'm no proponent of "no-lag MAs". Instead, I suggest not using moving averages at all. Yes, of course, we need some method of referencing a "typical" price at any given moment in time, because we need some measure to decide whether a price is relatively "cheap" or "expensive" in that moment. Don't get me wrong: I understand why people use moving averages. What they often don't see is that the error in the time dimension is simply unnecessary. There are many other methods of denoising that are better suited for the job, like digital filters and regression models, which preserve the time axis. Like you say: we want to "reduce the peaks", but there's absolutely no need for a shift by an error component on the time axis.

A moving average that keeps more peaks is also keeping more noise and therefore missing its original purpose, so I wouldn't think of such an MA as a "good trading MA", just an MA with a lower degree of denoising.
