Machine learning in trading: theory, models, practice and algo-trading - page 553

 
Elibrarius:

Missed)

Another point: if you take, for example, not 0 but 0.5, even with your method it will "float" from sample to sample.

Only a hard manual setting of the range for each input will help. But it is not clear how to determine it. Maybe, for example, run the data for a year and discard the 1-5% of outliers, then work with those bounds during the year. Although in a year they will change too.
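The "run a year of data and discard 1-5% of outliers" idea can be sketched like this (a minimal illustration; the function name and the trim fraction are mine, not from the thread):

```c
#include <stdlib.h>
#include <string.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Fix the working range from a long history (e.g. a year of data) by
   discarding a fraction of outliers on each tail - trim = 0.01 cuts 1%.
   The bounds are then held fixed while trading, as suggested above. */
void trimmed_range(const double *data, size_t n, double trim,
                   double *lo, double *hi)
{
    double *sorted = malloc(n * sizeof *sorted);
    memcpy(sorted, data, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);

    size_t k = (size_t)(n * trim + 0.5);   /* samples cut from each tail */
    *lo = sorted[k];
    *hi = sorted[n - 1 - k];
    free(sorted);
}
```

As the post notes, the bounds drift over time, so the range would have to be recomputed when the model is retrained.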


Yes, you first need to take the largest available sample and find the multiplier (for some reason I named it "multiplier" :)

And if a new sample suddenly contains a larger value - well, then you will have to divide by that one instead. But this rarely happens if we take increments with a small lag, for example close[0]/close[10]; with close[0]/close[100] it can already happen more often. But I think that is already an exception, especially if we retrain the NS periodically.

I don't have time to think about 0.5. :)
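The multiplier approach described above could look something like this (a hedged sketch; the function names are mine):

```c
#include <math.h>
#include <stddef.h>

/* Find the scaling multiplier from the largest absolute increment in the
   biggest available sample. If a later sample brings a larger value, the
   multiplier has to be recomputed - the "divide by it" case above. */
double find_multiplier(const double *increments, size_t n)
{
    double max_abs = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double a = fabs(increments[i]);
        if (a > max_abs)
            max_abs = a;
    }
    return max_abs > 0.0 ? 1.0 / max_abs : 1.0;
}

/* Scale one increment into [-1, 1] with the stored multiplier. */
double normalize_increment(double value, double multiplier)
{
    return value * multiplier;
}
```

The stored multiplier is what "floats" between samples if the maximum changes, which is the problem discussed above.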

 

You should only use predictors that HAVE a RELATION to the target variable. Whether that relation is linear or non-linear is irrelevant; what matters is the precisely worded "have a relation".

To clarify the meaning of "have a relation" again, I'll give an example that I've given several times on this thread.


Target: population, has two classes: men and women

We take one predictor: clothing. Has two values: pants and skirts.

With such classes and such a predictor, one value of the predictor will predict one class and the other value the other class, on the "pants - men", "skirts - women" principle. Ideally, we can build an error-free classification.

If we apply this system to the Scots, there will be an overlap: "skirts" (kilts) are worn by both sexes. This intersection produces an irreducible error that cannot be overcome.
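The pants/skirts example can be made concrete: for a discrete predictor, the minority-class count in each predictor cell is error that no classifier can remove. The counts below are invented purely for illustration:

```c
/* Irreducible error for a discrete predictor: in every predictor cell the
   best possible classifier picks the majority class, so the minority count
   is unavoidable. Counts are invented for the pants/skirts illustration. */
double irreducible_error(const int men[], const int women[], int n_values)
{
    int wrong = 0, total = 0;
    for (int v = 0; v < n_values; ++v) {
        wrong += men[v] < women[v] ? men[v] : women[v];
        total += men[v] + women[v];
    }
    return total > 0 ? (double)wrong / total : 0.0;
}
```

With hypothetical counts men = {90 pants, 10 skirts} and women = {0, 100}, the ten kilt-wearing men out of 200 people give a floor of 5% error that no amount of training can remove.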


It is obligatory to check the model on two files from different time periods: before January 1 and after January 1.

On the first file: train, test and "out-of-sample". If all three error values are approximately equal, then run on the second file; the error should not differ significantly (by more than 10%) from the first three.

SUCH A MODEL IS NOT OVERFITTED.


PS.

If you include predictors that have no relation to the target - "coffee grounds" - the error can be reduced drastically. For my part, I do not even consider models with an error of less than 10%. But it is always the same story: the error on the second file turns out to be a multiple of the error on the first. The model is OVERFITTED: it has picked up some values from the noise while trying to reduce the classification error, and such values probably will not occur in the future, or will occur at quite different times. Such a model has no predictive power.

 
SanSanych Fomenko:

On the first file: train, test and "out-of-sample". If all three error values are approximately equal, then run on the second file; the error should not differ significantly (by more than 10%) from the first three.

What is the difference between:

1) the "out-of-sample" section from the first file plus another "out-of-sample" file, and

2) a single, larger "out-of-sample" section that also includes the second file?

It seems to me the result will be the same. If under the first option everything is bad in the second file, then under the second option the same data will spoil everything as well.

 

Well, as a rule, when training an NS there are three sections: training, test and control. If the error on the control section is within normal limits, we consider that the model is not overfitted. IMHO.

 
Mihail Marchukajtes:

Well, as a rule, when training an NS there are three sections: training, test and control. If the error on the control section is within normal limits, we consider that the model is not overfitted. IMHO.

I agree, but it is unclear why SanSanych introduces a 4th section (the second control one). After all, one control section could simply be expanded to include both.
 
elibrarius:

What is the difference between:

1) the "out-of-sample" section from the first file plus another "out-of-sample" file, and

2) a single, larger "out-of-sample" section that also includes the second file?

It seems to me the results will be the same. If under the first option things are bad in the second file, then under the second option the same data will spoil everything as well.


The first file is divided into three parts at random, i.e. pieces of training, test, and control samples are mixed up by date. That does not happen in real trading.

But the second file is an imitation of trading: we always trade AFTER the training section. Contrary to your opinion, very often the results on the second file are very different from the results on the first: the model is overfitted and not suitable for use.
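The "trade AFTER the training section" setup amounts to a chronological split rather than a shuffled one; a trivial sketch (the function name and fractions are illustrative, not from the thread):

```c
#include <stddef.h>

/* Chronological split imitating real trading: train on the earliest bars,
   test on the next block, control on the last - never shuffled by date.
   The fractions are illustrative (e.g. 0.6 / 0.2, leaving 0.2 for control).
   Outputs the first bar index of the test and control sections. */
void chrono_split(size_t n_bars, double train_frac, double test_frac,
                  size_t *test_start, size_t *control_start)
{
    *test_start = (size_t)(n_bars * train_frac + 0.5);
    *control_start = (size_t)(n_bars * (train_frac + test_frac) + 0.5);
}
```

A random (date-shuffled) split lets pieces of the future leak into training, which is exactly what cannot happen in real trading.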

 
Grigoriy Chaunin:

https://github.com/RandomKori/Py36MT5 - here are the sources of a Python library for MT5. The only problem is with arrays: passing an array or receiving one does not work correctly. I debugged the DLL code in Visual Studio and it all works there. The question is whether it may be a terminal bug. How to work with the library is not documented; there is no point, since nobody needs it without arrays. Although maybe I messed something up in the pythom.mqh file. Help me figure it out - it will be useful for everyone.

It's a good idea and a useful MT5 library, but synchronizing it with a Python script file is quite a troublesome task.

I think it is better to synchronize MQL variables directly with Python variables through a local dictionary and to execute Python code fragments directly from string constants in the EA code.

I tried a test, compiling with bcc64 from the command line, and it works fine with Python 3.6:

#include <stdio.h>
#include "Python.h"              // CPython embedding API (note the capital P)
#pragma link "python36.lib"      // bcc64-specific pragma: link Python 3.6

int main(int argc, char **argv)
{
  Py_Initialize();
  PyObject* main_mod = PyImport_AddModule("__main__");  // borrowed reference
  PyObject* global = PyModule_GetDict(main_mod);        // borrowed reference
  PyObject* local = PyDict_New();

  // Synchronize C variables with Python ones and evaluate a = b * c
  int a, b = 2, c = 2;
  PyDict_SetItemString(local, "b", PyLong_FromLong(b));
  PyDict_SetItemString(local, "c", PyLong_FromLong(c));
  PyObject* result = PyRun_String("b * c", Py_eval_input, global, local);
  a = PyLong_AsLong(result);
  Py_XDECREF(result);
  printf("%d*%d=%d\n", b, c, a);

  // Import the Python sys module and read the interpreter version
  PyRun_SimpleString("import sys");
  PyObject* ver = PyRun_String("sys.version", Py_eval_input, global, local);
  printf("%s\n", PyUnicode_AsUTF8(ver));  // public API, not _PyUnicode_AsString
  Py_XDECREF(ver);

  Py_DECREF(local);
  Py_Finalize();
  return 0;
}

It would be nice to add this functionality to your library. I was going to write my own, but unfortunately for now I am busy with the P-net library for Python.

By the way, I wrote about this new neural network earlier in this thread; according to preliminary test results on the Fisher's Iris examples, it trains three orders of magnitude faster than a DNN in TensorFlow, with equal test results.

 
SanSanych Fomenko:

The first file is divided into three parts at random, i.e. pieces of the training, test and control samples are mixed up by date. This does not happen in real trading.

But the second file is an imitation of trading: we always trade AFTER the training section. Contrary to your opinion, very often the results on the second file are very different from the results on the first: the model is overfitted and not suitable for use.

I always take the first three parts in chronological sequence. And if the third one is bad, the model is overfitted.
 

Do not forget that any data redundancy delays putting the model into live use, and that directly affects the quality of the signals received afterwards...

I personally chose the following methodology: I reverse the model trained on Buy signals and test it on the same part of the market, but on Sell signals. This way I don't lose precious time and still get an adequate estimate of the model's capabilities. IMHO

 
SanSanych Fomenko:

The first file is divided into three parts at random, i.e. pieces of the training, test and control samples are mixed up by date. This does not happen in real trading.

But the second file is an imitation of trading: we always trade AFTER the training section. Contrary to your opinion, very often the results on the second file are very different from the results on the first: the model is overfitted and not suitable for use.


For forecasting systems, the order in which the data arrives is important. For classification it is NOT.

Reason: