
I deal with very specific things, so I am not inclined to discuss anything "in general".
Let me be more specific.
The caret wrapper gives access to about 140 machine learning packages; in my work I use two packages for random forests and one for support vector machines (SVM). As you can see, I am in no position to generalise about the machine learning tools available in R.
Moreover.
This is a thread about the rather limited Rattle shell, of which I have used only the randomForest package in this article.
Furthermore.
And from that package the article uses only a subset of its functions.
Below I will comment on your post within these limitations only, and I can back up my words with the program code and the results of running it.
So.
1. "All financial series are so-called time series, in which the order of the values is important." - Nobody denies this, and this order is not broken simply because it is a time series. Having trained the model on the prices P1, P2, P3...Pn, you do not change their order when testing out of sample or in actual use.
This is completely inconsistent with the packages mentioned. Your remark could be implemented with other packages, but it would require much more complex code. The approach used in the article (and the one most common in machine learning algorithms) is as follows:
The initial sample (dataset) is split into three parts: train (70%), test (15%) and validate (15%). The partitioning algorithm works like this: for train, 70% of the rows of the original dataset are selected at random rather than sequentially. From the remaining 30%, another 15% is again selected at random, and the remaining 15% is therefore also a random set of rows. There can be no question of preserving the sequence of bars.
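To make that splitting scheme concrete, here is a minimal sketch in Python with NumPy (no R code was posted in the thread, so this is an illustration of the same logic, not Rattle's actual implementation; the dataset size and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_rows = 1000                      # size of a hypothetical dataset
idx = rng.permutation(n_rows)      # shuffle row indices: the order of bars is NOT preserved

n_train = int(0.70 * n_rows)       # 70% for train
n_test = int(0.15 * n_rows)        # 15% for test

train_idx = idx[:n_train]                    # 70% random rows
test_idx = idx[n_train:n_train + n_test]     # 15% random rows
validate_idx = idx[n_train + n_test:]        # remaining 15% random rows

print(len(train_idx), len(test_idx), len(validate_idx))  # 700 150 150
```

The three index sets are disjoint and together cover every row, but each one is a random selection, which is exactly why bar order cannot survive the split.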
But that is not all.
Training on the train set uses only part of the data (about 66%); the remaining rows supply the out-of-bag (OOB) estimate. That is, the bars on which the OOB error was computed were different bars, but they were interleaved with the bars used for training. The article reports this estimate, and it always shows the best performance.
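The "about 66%" here matches the expected share of unique rows in a bootstrap sample, which is roughly 1 - 1/e, about 63%. A quick simulation (in Python for concreteness; this mimics how each tree in a random forest draws its training rows with replacement, it is not the randomForest package's code):

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 10_000
# Each tree trains on a bootstrap sample: n rows drawn WITH replacement
# from the n available rows.
in_bag = rng.integers(0, n, size=n)        # row indices drawn for one tree
unique_in_bag = np.unique(in_bag)

in_bag_fraction = len(unique_in_bag) / n   # expected ~ 1 - 1/e ~ 0.63
oob_fraction = 1 - in_bag_fraction         # rows left out -> used for the OOB estimate
print(round(in_bag_fraction, 2), round(oob_fraction, 2))
```

The rows that a given tree never saw form its out-of-bag set, and averaging predictions over those rows gives the OOB estimate the post describes.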
Rattle then lets you evaluate the trained model on two more datasets. By comparing the THREE estimates, conclusions can be drawn.
Once again: training models on time series requires special effort. If you want to use Rattle, the target variable and the corresponding predictors must tolerate random ordering of the bars in training and testing.
2. I agree with you on one thing: if the input is 100% rubbish predictors, we get 100% rubbish on the output. This is obvious and nobody disputes it. All I am saying is that there are algorithms for which culling the data does not matter: they give good out-of-sample results with any share of rubbish predictors short of 100%, because the rubbish is de facto not used. It is also important to distinguish algorithms for which dimensionality reduction is critical (for example, principal component analysis or autoencoders) from algorithms that are insensitive to data dimensionality.
They aren't insensitive. randomForest has a built-in algorithm for estimating the importance of predictors. That algorithm becomes completely useless if there is rubbish among the predictors. The randomForest package itself has an additional function for pre-screening junk predictors, but it is not available in Rattle.
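To illustrate what a built-in importance measure looks like, here is a sketch using scikit-learn's random forest in Python (an analogue of randomForest's importance, not the R package itself; the data, one informative predictor plus five pure-noise ones, is synthetic):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(seed=0)
n = 2000
signal = rng.normal(size=n)            # one genuinely informative predictor
junk = rng.normal(size=(n, 5))         # five pure-noise "rubbish" predictors
X = np.column_stack([signal, junk])
y = (signal + 0.3 * rng.normal(size=n) > 0).astype(int)  # target driven by the signal

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(model.feature_importances_.round(2))   # the first predictor should dominate
```

With a clear signal the importance ranking separates it from the noise; the poster's point is that this ranking degrades as the share of rubbish predictors grows, which is why pre-screening matters.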
"In the stock market, research on the relationship between economic causes and direction of movement is common, but none of this applies to intraday intervals." - Well, it does apply, including to intraday intervals: take the release of Non-Farm Payrolls, for example.
I don't know that resource. But I know the economy well, and I can confidently assert that hourly data on Gazprom do not depend on any economic data about other companies. Monthly results are another matter.
3. Of course I understand you, everyone earns as best they can, but have you ever implemented any machine learning algorithms yourself? I am convinced that to understand how an algorithm works, you need to write it yourself from scratch. Believe me, in that case you will discover things that are not written about in books. Even seemingly obvious elements that looked easy actually work differently than you thought :)
It's one thing to build a Mercedes and another to drive one. I prefer to drive, but everyone makes their own choice.
PS.
I have written a book that answers the questions you posed in more depth.
PPS
My personal experience is that most of the time, up to 70%, goes into selecting predictors: dull and tedious work. A kind of community has formed around the book's sales; no one has managed to find a fast and effective way to select predictors that does not produce overtrained models. Most readers have already moved beyond the book and are using better tools for the job.
With appreciation for your meaningful interest in the thread.
Thank you for your detailed reply.
But I would ask you to clarify in some places.
Earlier you wrote that the example of my program working on medical data was merely "illustrative", citing the fact that time series have a strict ordering.
"All financial series are so-called time series, in which the order of the values is important."
And then you write that in Random Forest, which you use: "There can be no question of any adherence to the sequence of bars."
And let me ask: how then does Random Forest work on time series if the algorithm initially uses the random subspace method and shuffles these time series?
"Once again, to learn models on time series requires special effort." - Then we are back at the beginning. Why spend time on such algorithms if it takes special effort to make them work on time series? We are speaking not from an academic standpoint, in the context of university research where such work is encouraged, but from a practical one.
"I can confidently state that the Gazprom sentiment is independent of any economic data on other companies." - does that mean that Gazprom shares on intraday intervals are not affected in any way by the RTS index, which includes Gazprom and other companies?
"It's one thing to make a Mercedes and another to drive it. I prefer to drive, but everyone makes their own choice." - In sophistry this is called a departure from the original thesis :) I was talking about algorithms and their implementation, and you quietly substituted a thesis about cars, which is outwardly related to the original thesis but not identical to it. I think there is a "small" gap between algorithms and cars. Don't mind me, I just like to notice such things :)
To summarise: what you write specifically about Rattle and Random Forest most likely corresponds to reality, and you clearly know the subject.
But one question remains open: why should a person, given a choice of two algorithms of equal quality (the first of which works perfectly well on time series without special effort and skills, and the second only with them), choose the latter? And would that choice be optimal?
I couldn't get past that sentence. Respect!
But one question is still open: why should a person, having a choice of two algorithms of equivalent classification quality (the first of which works perfectly well on time series without special efforts and skills, and the second - with such), make a choice in favour of the latter? And will it be optimal in this case?
Different tools solve different problems, which in turn are determined by the available material.
Rattle is a great tool to test an idea quickly. But it will be quite difficult to build a working model with it.
You can dig deeper into Rattle and pull from its log the ready-made R code for the randomForest package. If your target variable is, say, asset increments rather than trends, and you have managed to find predictors for it, then randomForest will be very useful. If you are going to predict trends, you will have to split the sample into chunks manually, preserving the sequence, which is difficult (though possible) in Rattle, and work directly with the randomForest package, which imposes no restrictions on how the input sample is formed. There is also a fairly extensive set of sample-generation tools for testing; these live in separate packages.
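The manual, sequence-preserving split mentioned above can be sketched as follows (in Python for concreteness; `ordered_split` is an illustrative name of my own, not a function from randomForest or Rattle):

```python
import numpy as np

def ordered_split(n_rows, train_frac=0.7):
    """Split row indices into train/test while preserving time order:
    the earliest train_frac of bars train the model, later bars test it."""
    cut = int(train_frac * n_rows)
    return np.arange(cut), np.arange(cut, n_rows)

train_idx, test_idx = ordered_split(100)
print(train_idx[-1], test_idx[0])   # 69 70 - no shuffling, test bars come strictly later
```

Unlike the random 70/15/15 partition described earlier in the thread, every test bar here comes strictly after every training bar, which is the property a trend-prediction setup needs.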
Well, and so on. The general conclusion is that there is no "optimal" tool.
Each of the packages has many subtleties that address particular problems in the raw data. On the whole, the process is anything but simple.
PS. When I wrote about Gazprom, I meant the relationship between the quotes and profit-type figures. And the index is an arithmetic construction, just another indicator... But that is a different problem. Although the use of machine learning in the stock market seems to me more promising than in forex.
Interesting article, thanks. This is the first time I have heard of R; it looks like a very useful thing. I have long wanted to build a neural network that can trade on historical data. I will try to export history from mt5 (OHLC, spread, volumes), feed it to Rattle and see what happens.
Rattle offers six models, one of which is a neural network. I recommend comparing its results with random forests, ada and SVM. I think you will be very surprised by the results.
This is exactly what Rattle is not designed for. You need to work directly in R. Here is one variant of such a solution: https://www.mql5.com/en/articles/1103.
Good luck