Help write a linear regression - page 6

 

Probably the best approach in this case is to find the arithmetic mean of all X[i], subtract it from the values themselves, calculate the regression coefficients, and then correct them back. In principle, nothing prevents you from doing the same with Y[i]. I haven't tried to derive the formula, although it is obviously not difficult. There is evidently some trick for ill-conditioned matrices here.

P.S. You can also normalize the data series to about the same order of magnitude.

 
Mathemat wrote >>

Probably the best approach in this case is to find the arithmetic mean of all X[i], subtract it from the values themselves, calculate the regression coefficients, and then correct them back. In principle, nothing prevents you from doing the same with Y[i]. I haven't tried to derive the formula, although it is obviously not difficult. There is evidently some trick for ill-conditioned matrices here.

P.S. You can also normalize the data series to about the same order of magnitude.

I have done that, and I proposed it via the MO (the mean). It is a direct formula; no correction afterwards is needed.

Now the ACF can go into the code base as well. I re-checked it: everything agrees to 8 digits, and that limit is most likely due to the precision with which the ACF values calculated in MQL were written to the file.

 

Sergei, I stumbled over this myself a couple of years ago when I was writing my own LR. The way out is simple: follow Candid's recommendation. The only thing I would refine in it is to subtract not Time[Bars-1] but the time of the first value of X[]. First, this makes the procedure universal, since the shift of the X origin happens inside the procedure. Second, if there are a lot of bars on the chart (3 years is roughly 1000000 minutes, i.e. 60000000 sec), subtracting the time of the first bar on the chart will not fundamentally change the situation. Third, you will be able to go back to your original formulas without any MOs, which means you can drop the extra loop while keeping the accuracy.
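The single-loop variant described here might look like the sketch below. It is plain C++ rather than MQL4 so that it compiles on its own; the name LinearRegrShifted, the decision to shift Y as well as X, and the assert-based checks are assumptions made for illustration, not code from the thread.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the thread): fits y = A*x + B in one pass.
// The first element of X is used as the origin, so the accumulated sums
// stay small even when X holds timestamps in seconds.
void LinearRegrShifted(const double X[], const double Y[], std::size_t N,
                       double& A, double& B)
{
    assert(N >= 2);                          // a line needs at least two points
    const double n  = static_cast<double>(N);
    const double x0 = X[0];                  // origin for X: first value of X[]
    const double y0 = Y[0];                  // origin for Y (assumption: same idea)
    double su = 0.0, sv = 0.0, suu = 0.0, suv = 0.0;

    for (std::size_t i = 0; i < N; ++i) {
        const double u = X[i] - x0;          // shifted abscissa
        const double v = Y[i] - y0;          // shifted ordinate
        su  += u;
        sv  += v;
        suu += u * u;
        suv += u * v;
    }

    const double denom = n * suu - su * su;  // zero only if all X are equal
    assert(denom != 0.0);
    A = (n * suv - su * sv) / denom;         // slope does not depend on the origin
    B = y0 + (sv - A * su) / n - A * x0;     // intercept mapped back to the original axes
}
```

The function is called with the raw arrays; the shift happens entirely inside it, so the caller's X stays tied to time.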

One more thing. I noticed that your X[] holds the times of minute bars, so the X values are equidistant. That means you can get away from time altogether and use the bar number instead. If you make this change, everything will be computed both accurately and quickly; you can check. It is also preferable because your LR will then behave the same on M1 and D1 (imagine how different the X values on D1 would be if X were Time rather than the bar number).
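With the bar number as X, the sums over X no longer need a loop at all, because for k = 0..N-1 they have closed forms. A minimal sketch in plain C++ follows; the name LinearRegrByIndex and the numbering from 0 in array order are assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: fits y = A*k + B where k = 0..N-1 is the bar index.
void LinearRegrByIndex(const double Y[], std::size_t N, double& A, double& B)
{
    assert(N >= 2);
    const double n   = static_cast<double>(N);
    const double sx  = n * (n - 1.0) / 2.0;                    // sum of k
    const double sxx = (n - 1.0) * n * (2.0 * n - 1.0) / 6.0;  // sum of k^2
    double sy = 0.0, sxy = 0.0;

    for (std::size_t k = 0; k < N; ++k) {
        sy  += Y[k];
        sxy += static_cast<double>(k) * Y[k];
    }

    A = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope per bar
    B = (sy - A * sx) / n;                          // value of the line at k = 0
}
```

Here A is a slope per bar, so its sign depends on the direction in which the bars are numbered, and for equidistant bars it can be converted to a slope per second by dividing by the bar period in seconds.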

 
Yurixx wrote >>

Sergei...

Thanks. Tried it all.

I just posted my version of the calculation; maybe someone will find it useful. I don't need to shift anything, and I don't feel comfortable moving X to 0. I use this function to calculate the ACF, so it has to stay tied to time (there are some dependencies).

 

In general, there is no need to move X itself to point 0. It is enough to use, inside the LR function, an internal array shifted by X[1] instead of X itself. You can even do without the array: just subtract the value of X[1] at the moment the sums are calculated.

By the way, if you tried it, did it not help?

 
Yurixx wrote >>

In general, there is no need to move X itself to point 0. It is enough to use, inside the LR function, an internal array shifted by X[1] instead of X itself. You can even do without the array: just subtract the value of X[1] at the moment the sums are calculated.

By the way, if you tried it, did it work?

I tried it and it seems to work. But there is one nuance. If the algorithm gives such an error for an array of 6 numbers, we have no guarantee that the error will not accumulate even with an offset. The array I work with has 7200 values (minutes). That's why I went looking for this algorithm, and it works correctly. I had to give up the other one, because I no longer trust it.

//+------------------------------------------------------------------+
//| The formula I propose                                            |
//| Calculates the coefficients A and B in the equation              |
//| y(x)=A*x+B                                                       |
//| uses the formulas from https://forum.mql4.com/ru/10780/page5     |
//+------------------------------------------------------------------+

void LinearRegr(double X[], double Y[], int N, double& A, double& B)
{
    double mo_X = 0.0, mo_Y = 0.0, var_0 = 0.0, var_1 = 0.0;
    int i;   // declared once so that both loops can use it

    // first pass: means (MO) of X and Y
    for ( i = 0; i < N; i ++ )
      {
        mo_X +=X[i];
        mo_Y +=Y[i];
      }
    mo_X /=N;
    mo_Y /=N;

    // second pass: sums of centered products and centered squares
    for ( i = 0; i < N; i ++ )
      {
        var_0 +=(X[i]-mo_X)*(Y[i]-mo_Y);
        var_1 +=(X[i]-mo_X)*(X[i]-mo_X);
      }
    A = var_0 / var_1;      // slope; var_1 is zero only if all X[i] are equal
    B = mo_Y - A * mo_X;    // intercept
}

>> I don't need any shifts.

 

No problem, Sergei, use whatever you like. I just want to draw your attention to a small detail.

As you certainly understand, the MO (the mean) of any series lies between its max and min. The code you posted effectively moves the origin to [mo_X, mo_Y], and to do that you loop through all 7200 values. Then, while calculating the sums, you subtract the coordinates of that origin from the coordinates of the series. You might as well take any point [Xm, Ym] of the series as the origin and run only the second loop, replacing [mo_X, mo_Y] with [Xm, Ym].

Parameter A of a linear regression is invariant under a shift of the origin. The MO has nothing to do with it.

You can check this fact in 3 minutes on paper.

That's why the loop that calculates the MO is unnecessary. We just need to bring the values of X and Y to similar orders of magnitude.
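The check on paper can be written out as follows. This is a sketch that assumes the "original formula" is the usual least-squares form built from sums of products; the symbols X_0, Y_0, u_i and v_i are introduced here only for illustration.

```latex
\begin{aligned}
u_i &= X_i - X_0, \qquad v_i = Y_i - Y_0, \\
N\sum u_i v_i - \sum u_i \sum v_i
  &= N\sum X_i Y_i - \sum X_i \sum Y_i, \\
N\sum u_i^2 - \Bigl(\sum u_i\Bigr)^2
  &= N\sum X_i^2 - \Bigl(\sum X_i\Bigr)^2, \\
A &= \frac{N\sum u_i v_i - \sum u_i \sum v_i}
          {N\sum u_i^2 - \bigl(\sum u_i\bigr)^2}, \\
B &= Y_0 - A X_0 + \frac{1}{N}\Bigl(\sum v_i - A\sum u_i\Bigr).
\end{aligned}
```

All terms containing X_0 and Y_0 cancel when the brackets are expanded, so A is the same for any origin and only B has to be mapped back. Note that the shortened ratio var_0 / var_1 from the code above equals A only when the origin is the mean, because only then are the sums of u_i and v_i zero; with an arbitrary origin [Xm, Ym] those two terms have to be kept.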

 
Prival wrote >>

If the algorithm gives such an error for an array of 6 numbers, there is no guarantee that the error will not accumulate even with an offset.

The issue here is not how many numbers there are, but the fact that each of those 6 carries a constant (useless) additive term of 1216600000. It contains no information whatsoever, yet it takes up 10 significant digits. Call it 9, since the last 0 is uninformative: all 6 values share it as well. When squared, this rubbish occupies 17 significant digits of the mantissa, and the mantissa only has 15. So the lowest digits get dumped (down the toilet). Meanwhile, it is exactly these discarded digits that carry the needed information (part of the information about the variable component of X).
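The effect can be reproduced with a few lines of plain C++. The sketch below computes the denominator N*sum(x^2) - (sum x)^2 twice: on the raw values and after subtracting X[0]. The function names are hypothetical, and so is the test data suggested afterwards; only the offset 1216600000 comes from the thread.

```cpp
#include <cassert>
#include <cstddef>

// Denominator N*sum(x^2) - (sum x)^2 computed directly on the raw values.
double DenominatorRaw(const double X[], std::size_t N)
{
    assert(N >= 2);
    const double n = static_cast<double>(N);
    double sx = 0.0, sxx = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        sx  += X[i];
        sxx += X[i] * X[i];   // about 1.5e18 per term: the low digits are rounded away
    }
    return n * sxx - sx * sx; // difference of two numbers near 5e19
}

// The same quantity after removing the constant offset X[0] first.
double DenominatorShifted(const double X[], std::size_t N)
{
    assert(N >= 2);
    const double n = static_cast<double>(N);
    double su = 0.0, suu = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        const double u = X[i] - X[0];  // small number, exactly representable here
        su  += u;
        suu += u * u;
    }
    return n * suu - su * su;
}
```

For six timestamps 1216600000, 1216600060, ..., 1216600300 (one minute apart, an assumed spacing) the exact denominator is 378000. The shifted version reproduces it, while the raw version subtracts two values near 5e19, where adjacent doubles are 8192 apart, so the true difference cannot survive.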

 

So I didn't make this formula up. It's in the books. And from that formula, which uses squares, this one is derived (without squares). Just sit with a pencil. When I get to the scanner, I will post a page from Tikhonov V.I. "Statistical Radio Engineering" p.446.

 
Exactly: just sit down with a pencil and you will see that if you substitute Xi -> Xi-X0 and Yi -> Yi-Y0 in your original formula for the slope b, the new formula is equivalent to the original one, for any values of X0 and Y0. Therefore the sums of Xi and Yi (which is the MO calculation) can be moved inside the second loop, which halves the LR computation time. To get the accuracy, we should choose suitable X0 and Y0, preferably so that the orders of magnitude of the X and Y series end up close to each other.