Machine learning in trading: theory, models, practice and algo-trading - page 213

 

Another simulation example.

We build 20,000 linear models (1,000 observations everywhere, number of predictors from 1 to 20 (1,000 models for each number), plus one independent variable). The data are i.i.d., N(0,1).

The purpose of the simulation is to make sure that the F-statistic does not exceed a critical value when the MNC regression is built on independent data (that do not contain dependencies) that satisfy the requirements of the lin.model. So, it can be used as an indicator of model training.

############### simulate lm f-stats with random vars


rm(list=ls());gc()


library(data.table)

library(ggplot2)


start <- Sys.time()


set.seed(1)


x <- as.data.table(matrix(rnorm(21000000, 0, 1), ncol = 21))

x[, sampling:= sample(1000, nrow(x), replace = T)]


lm_models <- x[, 

 {

  lapply(c(1:20), function(x) summary(lm(data = .SD[, c(1:x, 21), with = F], formula = V21 ~ . -1))$'fstatistic'[[1]])

 }

 , by = sampling

]


lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))


crtitical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)


boxplot(data = lm_models_melted, value ~ variable); lines(crtitical_f_stats, type = 's', col = 'red')


Sys.time() - start


gc()

Code running time: 1.35 minutes.

 

Useful Code. Visualizes transaction sequences in three hypostases.

##########################


rm(list=ls());gc()


library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)


start <- Sys.time()


set.seed(1)


x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation

x[, variable:= rep(1:1000, times = 1000)]

x[, trade:= 1:.N, by = variable]


x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)

x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]


monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')

setorder(monte_trades, variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]



p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +

geom_line(size = 2, color = 'blue', alpha = 0.01) + 

geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +

ggtitle('Simulated Trade Sequences of Length 1000')


p2 <- ggplot(data = monte_trades_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') + 

scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +

coord_flip() +

ggtitle('Cumulative Profit Density')


p3 <- ggplot(data = RF_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') + 

geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +

geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +

ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')


grid.arrange(p1, p2, p3, ncol = 3)


Sys.time() - start


gc()

Works for about 45 seconds. Draws for about 1.5 minutes.

 
Alexey Burnakov:


Beautiful, thank you.
 
Alexey Burnakov:

The purpose of the simulation is to make sure that the F-statistic does not exceed the critical value when the MNA regression is built on independent data (not containing dependencies), satisfying the requirements of the linear model. So, it can be used as an indicator of model training.

I did not fully understand about fstatistic. The data here are random, but the model has learned something, so we can conclude that the model is fitting and retrained. Which means the model evaluation must be bad. That is, I was expecting a negative fstatistis, or some other indication that things are bad on the graph.
How do I correctly interpret the result from that example?
I understand that the first predictor can be considered more qualitative than the first + second. And 1+2 is better than 1+2+3. Is it like that? Is it reasonable to genetically select the set of predictors that will give the highest fstatistic?
 
Dr.Trader:
I didn't fully understand about the fstatistic. The data here is random, but the model has learned something, so you can conclude that the model is fitting and over-trained. Which means the model evaluation must be bad. That is, I was expecting a negative fstatistis, or some other indication that things are bad on the graph.
How do I correctly interpret the result from that example?
I understand that the first predictor can be considered more qualitative than the first + second. And 1+2 is better than 1+2+3. Is it like that? Is it reasonable to genetically select the set of predictors that will give the highest fstatistic?

Look at the F distribution table.http://www.socr.ucla.edu/applets.dir/f_table.html

The F-statistic is a value that depends on the degrees of freedom. It is always positive, since we have a one-sided distribution.

But the model doesn't learn anything, because the trained model must have a high F-statistic (greater than or equal to critical at a given alpha - as it sounds when testing the null hypothesis).

In all cases did not exceed the critical value at alpha = 0.01, and can be set to 0.0001, for example.

That said, I wanted to make sure (I didn't study this at university) that by adding noise variables the linear model would not show an increase in learning. As you can see...

F-Distribution Tables
  • Ivo Dinov: www.SOCR.ucla.edu
  • www.socr.ucla.edu
Statistics Online Computational Resource
 
Alexey Burnakov:

Useful Code. Visualizes transaction sequences in three hypostases.

Regarding the above code. Please write at least brief comments in the code. Especially when you use complex expressions. Not everyone knows and uses the "data.table" package, it is not superfluous to explain what dcast.data.table does, what is .N, .SD. You don't post the code to show how deep you are into the subject. In my opinion the published code should help other users (even with beginner's level) to understand the script.

It is wonderful that R allows you to program the action in several ways, but it is desirable that the readability of the code is not lost.

A few suggestions on the code:

- intermediate variables x, x.cast, x.cum are not needed in calculations and only take up memory. All calculations which don't require saving intermediate results, it's desirable to do them through pipe

For instance

#---variant-------------
rm(list=ls());gc()
library(data.table)
library(ggplot2)
library(gridExtra)
library(tseries)
#----
require(magrittr)
require(dplyr)
start <- Sys.time()
monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%
        .[, variable := rep(1:1000, times = 1000)]%>%
        .[, trade := 1:.N, by = variable] %>%
        dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%
        .[, as.list(cumsum(unlist(.SD))), by = variable]%>%
        melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>%
        setorder(., variable, trade)
monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])
quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]
RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]
Sys.time() - start
#Time difference of 2.247022 secs

Of course, it takes a very long time to build graphs.

Not a criticism.

Good luck

 
Dr.Trader:
I didn't fully understand about the fstatistic. The data here is random, but the model has learned something, so we can conclude that the model is fitting and over-trained. Which means the model evaluation must be bad. That is, I was expecting a negative fstatistis, or some other indication that things are bad on the graph.
How do I correctly interpret the result from that example?
I understand that the first predictor can be considered more qualitative than the first + second. And 1+2 is more qualitative than 1+2+3. Is it like that? Is it reasonable to genetically select the set of predictors that will give the highest fstatistic?

And here is an example where we assume that the fully trained model will include 20 variables, with their weight increasing (1 variable - weight 1, the 20th variable - weight 20). And let's see how the distribution of F statistics will change after successive predictors are added to the model:

############### simulate lm f-stats with non-random vars


rm(list=ls());gc()


library(data.table)

library(ggplot2)


start <- Sys.time()


set.seed(1)


x <- as.data.table(matrix(rnorm(20000000, 0, 1), ncol = 20))

x[, (paste0('coef', c(1:20))):= lapply(1:20, function(x) rnorm(.N, x, 1))]

x[, output:= Reduce(`+`, Map(function(x, y) (x * y), .SD[, (1:20), with = FALSE], .SD[, (21:40), with = FALSE])), .SDcols = c(1:40)]

x[, sampling:= sample(1000, nrow(x), replace = T)]


lm_models <- x[, 

 {

  lapply(c(1:20), function(x) summary(lm(data = .SD[, c(1:x, 41), with = F], formula = output ~ . -1))$'fstatistic'[[1]])

 }

 , by = sampling

]


lm_models_melted <- melt(lm_models, measure.vars = paste0('V', c(1:20)))


crtitical_f_stats <- qf(p = 0.99, df1 = c(1:20), df2 = 1000, lower.tail = TRUE, log.p = FALSE)


boxplot(data = lm_models_melted, value ~ variable, log = 'y'); lines(crtitical_f_stats, type = 's', col = 'red')


summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))


Sys.time() - start


gc()

Graph with logarithmic y-axis:

Apparently, yes...

> summary(lm(data = x[sample(1000000, 1000, replace = T), c(1:20, 41), with = F], formula = output ~ . -1))


Call:

lm(formula = output ~ . - 1, data = x[sample(1e+06, 1000, replace = T), 

    c(1:20, 41), with = F])


Residuals:

     Min       1Q   Median       3Q      Max 

-19.6146  -2.8252   0.0192   3.0659  15.8853 


Coefficients:

    Estimate Std. Error t value Pr(>|t|)    

V1    0.9528     0.1427   6.676  4.1e-11 ***

V2    1.7771     0.1382  12.859  < 2e-16 ***

V3    2.7344     0.1442  18.968  < 2e-16 ***

V4    4.0195     0.1419  28.325  < 2e-16 ***

V5    5.2817     0.1479  35.718  < 2e-16 ***

V6    6.2776     0.1509  41.594  < 2e-16 ***

V7    6.9771     0.1446  48.242  < 2e-16 ***

V8    7.9722     0.1469  54.260  < 2e-16 ***

V9    9.0349     0.1462  61.806  < 2e-16 ***

V10  10.1372     0.1496  67.766  < 2e-16 ***

V11  10.8783     0.1487  73.134  < 2e-16 ***

V12  11.9129     0.1446  82.386  < 2e-16 ***

V13  12.8079     0.1462  87.588  < 2e-16 ***

V14  14.2017     0.1487  95.490  < 2e-16 ***

V15  14.9080     0.1458 102.252  < 2e-16 ***

V16  15.9893     0.1428 111.958  < 2e-16 ***

V17  17.4997     0.1403 124.716  < 2e-16 ***

V18  17.8798     0.1448 123.470  < 2e-16 ***

V19  18.9317     0.1470 128.823  < 2e-16 ***

V20  20.1143     0.1466 137.191  < 2e-16 ***

---

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Residual standard error: 4.581 on 980 degrees of freedom

Multiple R-squared:  0.9932, Adjusted R-squared:  0.993 

F-statistic:  7123 on 20 and 980 DF,  p-value: < 2.2e-16

 
Vladimir Perervenko:

Thank you! I didn't know how to do that yet. But really, you should keep the calculations in your memory as much as possible. It will be faster. Good kung fu...

With your tweaks:

#---variant-------------

rm(list=ls());gc()

library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)

#----

require(magrittr)

require(dplyr)

start <- Sys.time()

monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%

.[, variable := rep(1:1000, times = 1000)]%>%

.[, trade := 1:.N, by = variable] %>%

dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%

.[, as.list(cumsum(unlist(.SD))), by = variable]%>%

melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>% 

setorder(., variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]


p1 <- ggplot(data = monte_trades, aes(x = trade, y = V1, group = variable)) +

geom_line(size = 2, color = 'blue', alpha = 0.01) + 

geom_line(data = quantile_trade, aes(x = trade, y = V1, group = 1), size = 2, alpha = 0.5, colour = 'blue') +

ggtitle('Simulated Trade Sequences of Length 1000')


p2 <- ggplot(data = monte_trades_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') + 

scale_x_continuous(limits = c(min(monte_trades$V1), max(monte_trades$V1))) +

coord_flip() +

ggtitle('Cumulative Profit Density')


p3 <- ggplot(data = RF_last, aes(V1)) +

geom_density(alpha = 0.1, size = 1, color = 'blue', fill = 'blue') + 

geom_vline(xintercept = mean(RF_last$V1), colour = "blue", linetype = 2, size = 1) +

geom_vline(xintercept = median(RF_last$V1), colour = "red", linetype = 2, size = 1) +

ggtitle('Recovery Factor Density + Mean (blue) and Median (red)')


grid.arrange(p1, p2, p3, ncol = 3)


Sys.time() - start

The running time is 47 seconds. That is, the code is prettier and more compact, but there is no difference in speed... Drawing, yes, is very long. 1000 lines with transparency - because of them...

 
Alexey Burnakov:

Thank you! I didn't know how to do that yet. But really, you should keep the calculations in your memory as much as possible. It'll be faster. Good kung fu...

The running time is 47 seconds. I mean, the code is nicer and more compact, but there's no difference in speed... Drawing, yes, is very long. 1000 lines with transparency - because of them...

My calculation takes

#time in seconds

# min lq mean median uq max neval

# 2.027561 2.253354 2.254134 2.275785 2.300051 2.610649 100

But this is not so important. It was about the readability of the code.

Good luck

PS. And parallelize the lm() calculations. This is exactly the case when you need

 
Vladimir Perervenko:

I have a calculation that takes

#-execution time in seconds

# min lq mean median uq max neval

# 2.027561 2.253354 2.254134 2.275785 2.300051 2.610649 100

But this is not so important. It was about the readability of the code.

Good luck

PS. And parallelize your lm() calculations. That's exactly what you need.

Nope. You are giving the time for part of the code before the graphs. I've indicated it, along with the graphs.

My time to graphs is 1.5 seconds. Your method is 1.15 seconds.

rm(list=ls());gc()


library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)


start <- Sys.time()


set.seed(1)


x <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) #random normal value with positive expectation

x[, variable:= rep(1:1000, times = 1000)]

x[, trade:= 1:.N, by = variable]


x.cast = dcast.data.table(x, variable ~ trade, value.var = 'V1', fun.aggregate = sum)

x_cum <- x.cast[, as.list(cumsum(unlist(.SD))), by = variable]


monte_trades <- melt(x_cum, measure.vars = names(x_cum)[-1], variable.name = "trade", value.name = 'V1')

setorder(monte_trades, variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]


Sys.time() - start

rm(list=ls());gc()

library(data.table)

library(ggplot2)

library(gridExtra)

library(tseries)

#----

require(magrittr)

require(dplyr)


start <- Sys.time()


monte_trades <- as.data.table(matrix(rnorm(1000000, 0.1, 1), ncol = 1)) %>%

.[, variable := rep(1:1000, times = 1000)]%>%

.[, trade := 1:.N, by = variable] %>%

dcast.data.table(., variable ~ trade, value.var = 'V1', fun.aggregate = sum)%>%

.[, as.list(cumsum(unlist(.SD))), by = variable]%>%

melt(., measure.vars = names(.)[-1], variable.name = "trade", value.name = 'V1')%>% 

setorder(., variable, trade)

monte_trades_last <- as.data.table(monte_trades[trade == '1000', V1])

quantile_trade <- monte_trades[, quantile(V1, probs = 0.05), by = trade]

RF_last <- monte_trades[, V1[.N] / maxdrawdown(V1)[[1]], by = variable]


Sys.time() - start

It turns out that your method is faster...

Reason: