Machine learning in trading: theory, models, practice and algo-trading - page 2424

 
Transcendreamer:

Actually, the burden of proof is on the prosecution, so it is up to you to prove that the product is substandard (does not match what is claimed).

I'm just appealing to logic and common sense 🧐

Well, you're not a defendant either)))) More like an expert) Then give your expert opinion

 
YURY_PROFIT:

Well, you're not a defendant either.) More like an expert.) Then give your expert opinion.

I'm going to ask you for expert evidence 😉 because you're the accuser.

Or maybe you've already made a million, and it's not enough for you.

 
Igor Makanu:

What nonsense, there are users, there are producers of products/goods/services

You didn't build your own car, did you? You bought a ready-made car from a car manufacturer.

P.S.: or take it from a scientific point of view ... have you heard of the Pythagorean theorem? And where is yours?)))


There is a joke on the niche forums, something I just read: "the 3 elements came together - a bullshit photographer, a bullshit model and a bullshit cosplay".


so... the Market does let you publish new versions of a product, never mind that some product authors simply re-optimize their EAs on new historical data... "in general, the elements came together". One of those "elements", by the way - imho - is poor buyer education: few are able to test a product properly. But that is everywhere, including the buyers of the above-mentioned cars - marketing, so to speak.

Where did you see nonsense? That post, first of all, referred to intellectual work, and secondly, to someone who "has NOT learned to use" it.

To criticize written publications, you need to be on the SAME level as the author. What a ridiculous comparison you made with Pythagoras. What was the point of it at all?

A more appropriate example might be the following: You bought a quantum computer, but you can't learn how to use it, even after reading the detailed instructions.

I hope you understand what the fundamental difference is.

 
Hot Chilean guys - did you get the wrong thread?
 
You can tell the level right away from how certain topics (and links) are received, among other things. Two or three people actually know the subject; the rest are just here for the flood, as usual.
 
mytarmailS:

What's the difference between "play / stop playing" and "open / don't open" or "buy / don't buy"?

Nothing, in my opinion - the usual classification...


The start/stop of another robot is supposed to be easier than your own buy/sell...

There is less market noise (the noise is filtered by the controlled robot), and the solution is easier to find because there are fewer invariants.

 
Maxim Kuznetsov:

The start/stop of another robot is supposed to be easier than your own buy/sell...

There is less market noise (the noise is filtered by the controlled robot), the solution is easier to find - because there are fewer invariants

There is no difference: start/stop will be driven by other features, and those will contain noise too.
 
Maxim Kuznetsov:

The start/stop of another robot is supposed to be easier than your own buy/sell...

There is less market noise (the noise is filtered by the controlled robot), and the solution is easier to find because there are fewer invariants.

Dunno...

I'm still skeptical: when I filtered one network with another it turned out badly, but filtering some trading system is supposed to be fine?

 

So, I have completed the first phase of the research I announced earlier; let's try to understand what we actually got. I will write and think as I go - I do not yet know the final result, there is a lot of information, and how to analyze it properly is an open question in itself.

Let me start from the beginning. I took a sample from 2014 through the first half of 2021 (60% train, 20% test, 20% exam) with 5336 predictors and fixed all the parameters: tree depth 6, random-seed 100, learning rate 0.03 and 1000 iterations (trees) with automatic termination if there is no improvement on the control sample over the next 100 trees; the other settings are not important here. The variable parameters are the quantization type and the number of quantization boundaries: the number of boundaries grows in progression from 8 to 512, the quantization type takes 6 different variants, and the quantization tables are saved to a separate file.
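The author seems to be working with the command-line CatBoost build (hence the saved quantization tables); a rough Python-API equivalent of this 42-model grid could look like the sketch below. The split variable names and the exact list of six quantization types are my assumptions - CatBoost's feature_border_type happens to offer exactly six values.

```python
# Sketch: enumerate quantization settings with otherwise-fixed hyperparameters.
# X_train/y_train/X_test/y_test are assumed to exist already (60/20/20 split).
from catboost import CatBoostClassifier

border_counts = [8, 16, 32, 64, 128, 256, 512]              # "from 8 to 512 in progression"
border_types = ["Median", "Uniform", "UniformAndQuantiles",  # the 6 feature_border_type
                "MaxLogSum", "MinEntropy", "GreedyLogSum"]   # values CatBoost offers

results = []
for bc in border_counts:
    for bt in border_types:
        model = CatBoostClassifier(
            iterations=1000, depth=6, learning_rate=0.03, random_seed=100,
            border_count=bc, feature_border_type=bt,
            od_type="Iter", od_wait=100,   # stop after 100 trees without improvement
            verbose=False,
        )
        model.fit(X_train, y_train, eval_set=(X_test, y_test))
        results.append((bt, bc, model))    # 7 counts * 6 types = 42 models
```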

Having trained all the models, we obtain a table of 42 models ordered by the "Balans_Exam" column - the result on the independent (exam) sample.

The screenshot shows the table with the middle rows hidden; the five best and five worst models are visible, and the average value of each indicator has been calculated over the entire set.



As a result I selected two models - highlighted in green - they differ in the number of quanta - 8 and 128, respectively, and in the type of quantization - Median and UniformAndQuantiles.

Then I split the sample into 8 parts of 6 months each and trained models separately with the first and the second fixed quantization table. For each I used 5 training variants - let's call each one a project - with the random-seed parameter enumerated over 100 values from 8 to 800 in steps of 8 (a sketch follows the list):

  1. Train 1000 trees with no stopping control on the test subsample;
  2. Train up to 1000 trees with stopping control on the test subsample after 100 iterations without improvement;
  3. Train 100 trees without stopping control on the test subsample;
  4. Train 50 trees with no stop control on the test subsample;
  5. Train 5 trees without stopping control on the test subsample.
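As a rough sketch (the parameter names are the Python-API equivalents, and the variant-to-parameter mapping is my reading of the list above), the five variants differ only in the iteration budget and in whether early stopping on the test subsample is used:

```python
# Hypothetical mapping of the five training variants to CatBoost parameters.
from catboost import CatBoostClassifier

VARIANTS = {
    1: dict(iterations=1000, early_stop=False),
    2: dict(iterations=1000, early_stop=True),   # stop after 100 bad iterations
    3: dict(iterations=100,  early_stop=False),
    4: dict(iterations=50,   early_stop=False),
    5: dict(iterations=5,    early_stop=False),
}

def fit_variant(variant, X_tr, y_tr, X_te, y_te, seed):
    cfg = VARIANTS[variant]
    model = CatBoostClassifier(iterations=cfg["iterations"], depth=6,
                               learning_rate=0.03, random_seed=seed, verbose=False)
    if cfg["early_stop"]:
        model.fit(X_tr, y_tr, eval_set=(X_te, y_te), early_stopping_rounds=100)
    else:
        model.fit(X_tr, y_tr)
    return model

# the random-seed is enumerated over 100 values: 8, 16, ..., 800
SEEDS = range(8, 801, 8)
```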

After training completed, the resulting models were analyzed with the following CatBoost options for obtaining predictor statistics (sketched after the list):

  1. PredictionValuesChange;
  2. LossFunctionChange;
  3. InternalFeatureImportance.
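Of these, the first two can be pulled directly from the Python API (a minimal sketch below; the pool and model names are assumptions). InternalFeatureImportance, as far as I know, is written out by the command-line build rather than returned by this call.

```python
# Sketch: two of the listed importance types via the Python API.
from catboost import Pool

train_pool = Pool(X_train, y_train)   # X_train/y_train assumed to exist
pvc = model.get_feature_importance(type="PredictionValuesChange")
# LossFunctionChange needs a dataset on which to re-evaluate the loss:
lfc = model.get_feature_importance(data=train_pool, type="LossFunctionChange")
```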

Then, for each 1/8 part of the sample, I averaged the results over the seeds and combined them into an overall table ordered by the average value of the predictor importance across segments; separately I checked in how many segments each predictor came out significant and used ordering by that index as well. This procedure was done for each project and each type of model statistics (a pandas-style sketch is below).
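A sketch of that aggregation, assuming a long-format frame imp_df with columns segment, seed, predictor, importance (all names, and the crude importance > 0 significance check, are my assumptions):

```python
import pandas as pd

# 1) average the importance over seeds, separately for each 1/8 segment
per_segment = (imp_df
               .groupby(["segment", "predictor"], as_index=False)["importance"]
               .mean())

# 2) combine segments: mean importance plus a count of segments in which
#    the predictor came out significant (here crudely: importance > 0)
summary = (per_segment
           .groupby("predictor")
           .agg(mean_importance=("importance", "mean"),
                segments_significant=("importance", lambda s: int((s > 0).sum())))
           .sort_values(["mean_importance", "segments_significant"],
                        ascending=False))
```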

Below is an excerpt from the table for training variant 5 and model-analysis variant 1.

Then I created settings that exclude from training every predictor not among the first n predictors; if there were not enough predictors meeting the criteria, the settings file was not created. Settings were generated for each statistics variant and each project. The following limits on the number of predictors used for training were applied: 5/25/50/100/300/500/1000/2000/3000. In this way we obtained a set of settings (a sketch of building such exclusion lists follows).
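A sketch of how such exclusion settings could be built on top of the summary table from the previous step; one way to apply them is CatBoost's ignored_features option. The eligibility criterion and the names (including all_predictors, the full list of 5336 predictor names) are my assumptions:

```python
# Sketch: "keep only the top-N predictors" settings derived from the ranking.
TOP_N_OPTIONS = [5, 25, 50, 100, 300, 500, 1000, 2000, 3000]

ranked = summary.index.tolist()                          # ordered by mean importance
eligible = summary[summary["segments_significant"] > 0]  # predictors passing the check

settings = {}
for n in TOP_N_OPTIONS:
    if len(eligible) < n:
        continue                                         # not enough predictors: no file
    keep = set(ranked[:n])
    # everything not in the top N goes into the exclusion list
    settings[n] = [p for p in all_predictors if p not in keep]
```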

Next I ran training with the quantization table fixed, on the full sample (train 60% / test 20% / exam 20%), with a maximum of 1000 trees and stopping on the test sample. Training was performed for all settings and both quantization tables: 100 models per setting, with the random-seed enumerated over 100 values from 8 to 800 in steps of 8. In addition, separate training was performed for the two quantization tables with no predictor exclusion, again enumerating the random-seed over 100 values from 8 to 800 in steps of 8 (a sketch of this loop is below).
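A sketch of that loop, reusing the exclusion lists from the previous sketch. Here the fixed quantization table is only approximated by fixed border settings; the command-line build can instead load the previously saved borders file directly.

```python
# Sketch: for each exclusion setting, enumerate random_seed over 8..800 step 8.
from catboost import CatBoostClassifier

SEEDS = range(8, 801, 8)

def train_one(ignored, seed):
    model = CatBoostClassifier(
        iterations=1000, depth=6, learning_rate=0.03,
        border_count=8, feature_border_type="Median",   # first selected table
        ignored_features=ignored, random_seed=seed, verbose=False)
    model.fit(X_train, y_train, eval_set=(X_test, y_test),
              early_stopping_rounds=100)                # stop on the test sample
    return model

models = {(n, seed): train_one(ignored, seed)
          for n, ignored in settings.items()            # settings: exclusion-list dict
          for seed in SEEDS}
```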

Below is the table for the 8-boundary split by the Median method - the 5 best and 5 worst variants.

Below is the table for the 128-boundary split by the UniformAndQuantiles method - the 5 best and 5 worst variants.


The first conclusion that can be drawn is that the model has potential, which depends on the predictors used, and which predictors get used is affected by the random-seed. Thinking aloud, I would suggest that the goal of selecting settings/methods should not be the single best result, but the average result in profit or other indicators. I will note that the average financial result on the sample outside of training (the Balans_Exam column) is 2222.39 for the first variant and 1999.13 for the second.

Next, we build a table of the models' average metric values broken down by the settings used for their training.

Below is a table for the 8-boundary split by the Median method, for the different settings responsible for predictor exclusion - the 10 best variants by average values.


Below is a table for the 128-boundary split by the UniformAndQuantiles method, for the different settings responsible for predictor exclusion - the 10 best variants by average values.


To decipher what we have here in the "File_Name" column I suggest using the following table



Let's try to analyze this step by step, narrowing down the number of combinations under consideration.

The table below counts which "projects" make it into the top ten for the two quantization tables.

Here we see that both tables contain good representatives of the first project (Exp_000) and the fifth (Exp_004). Which is better, and which one to drop, is not clear, but the fact that both are in the top ten is reason to think hard. Perhaps statistics should be taken, with some weighting coefficients, over the entire table - I do not know, suggest options. I will note, however, that the Exp_004 variant is attractive because it takes the least time to prepare the data for creating the settings files, which is logical since there are only 5 trees. I think it is too early to draw final conclusions on the choice of the number of trees for the initial training - what do you think?

In the table below, for the top tens of the two quantization tables, let's look at the type of predictor analysis and the limit on the number of predictors used in the model.



From the table we can see that the first analysis method produced the greater number of hits, and from the whole table we can also see that in most of the settings the number of predictors used in the model does not exceed 50.

Now I propose to look at the results of the models themselves; let's take those project samples whose settings turned out to be in the majority: for the first quantization table - CB_Svod_Exp_000_x_000000002, and for the second - CB_Svod_Exp_004_x_000000002.


Below is a table for the 8-boundary split by the Median method with the predictor-selection setting CB_Svod_Exp_000_x_000000002 - the 5 best and 5 worst variants.



Below is a table for the 128-boundary split by the UniformAndQuantiles method with the predictor-selection setting CB_Svod_Exp_004_x_000000002 - the 5 best and 5 worst variants.

Below are summary tables for comparison: the first row contains data from the initial quantization table, the second row contains data after the random-seed selection, and the third row contains the results after the predictor-selection procedure:

1. Table for the 8-boundary split by the Median method



2. Table for the 128-boundary split by the UniformAndQuantiles method



The estimates from the two tables show a decrease in the results on the training and test samples and an improvement in the performance on the independent sample; in other words, the overfitting effect has decreased thanks to better-quality predictors and the reduction in their number.


What tentative conclusions can be drawn:

1. Simply feeding the sample to CatBoost does work, but manipulating the predictors can significantly improve the model, including the financial result.

2. It is not always necessary to use a large number of the predictors available in the sample to get a good result: it turns out that using only 1% of all predictors is enough to achieve good results, judging by the averages.

To develop this idea further, experiments on other samples are needed, and if the result repeats, we can think about reducing the number of combinations needed to find promising results. The goal is to develop a blind method for finding the best averages without peeking into the test and exam samples, which would increase the training sample by 40% while still identifying predictors with a stable response.

You can think about additional filtering of predictors at the time of evaluation by adding an adjustment factor for their usefulness/efficiency based on the resulting financial outcome.

Why am I looking at financial indicators? The point is that different events can occur in the market, and if the model preferentially selects the events with higher profit, then I like that behavior of the model - while still looking at the model's statistical metrics and at the balance graph itself.

I hope you found the post interesting, I look forward to your comments!

I have attached a file with all the tables - who is interested and who wants to think.

Files:
CB_Svod_Si_Q.zip  697 kb
 
Well, you could take 5-15 increments and the numbers would be just as good.

Or first weed out the predictors by correlation (a matter of seconds), and then take 5-15 of the remaining ones (if you can even get that many).

That's how econometrics saves you time.
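For what it's worth, such a correlation pre-filter could look roughly like this (a minimal pandas sketch; the 0.9 threshold is my own choice):

```python
# Sketch: drop every predictor that is highly correlated with one already kept.
import numpy as np
import pandas as pd

def correlation_filter(X: pd.DataFrame, threshold: float = 0.9) -> list:
    corr = X.corr().abs()
    # look only at the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return [c for c in X.columns if c not in to_drop]

survivors = correlation_filter(X_train)   # then take 5-15 of the survivors
```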
