Market prediction based on macroeconomic indicators

 
Vladimir:

3: Yields on securities of different maturities plus one more

Buy & hold annual percentage rate, 1974 - present: APR = 7.35%

Buy & sell strategy using economic indicators: APR = 13.18%.

This strategy gave a sell signal in December 2019. No buy signal has been given so far. Apparently the market will go down.

Buy and hold.

It would be interesting to see a forward test of such a model, but it is not possible here.

Right now, as far as I understand, everyone is waiting for the election.

 
Vladimir:

3: Yields on securities of different maturities plus one more

Buy & hold annual percentage rate, 1974 - present: APR = 7.35%

Buy & sell strategy using economic indicators: APR = 13.18%.

This strategy gave a sell signal in December 2019. So far no buy signal has been given. Apparently the market will go down.

Are we talking about a specific instrument or a general indicator?

 
Vladimir:

So, the task is to predict the S&P 500 index based on the available economic indicators.

Step 1: Find the indicators. The indicators are publicly available here: http://research.stlouisfed.org/fred2/ There are 240,000 of them. The most important is the growth of GDP. This indicator is calculated every quarter. Hence our step - 3 months. All indicators for a shorter period are recalculated for a 3-month period, the rest (annual) are discarded. We also discard indicators for all countries except the United States and indicators that do not have a deep history (at least 15 years). So, with painstaking work, we filter out a bunch of indicators and get about 10 thousand indicators. We formulate a more specific task of predicting the S&P 500 index one or two quarters ahead, having 10 thousand economic indicators available with a quarterly period. I do everything in MatLab, although it is possible in R.

Step 2: Convert all data to a stationary form by differencing and normalization. There are many methods here; the main requirement is that the original data can be restored from the transformed data. Without stationarity no model will work. The S&P 500 series before and after the transformation is shown below.
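A possible Python sketch of such a transformation, assuming plain first differences scaled to unit variance (the nonlinear normalization the author actually settles on is described in Step 5); the toy random-walk series is purely illustrative.

```python
import numpy as np
import pandas as pd

def to_stationary(series: pd.Series) -> pd.Series:
    """First-difference a series, then scale the differences to unit variance.

    The original series can be rebuilt from its first value plus the cumulative
    sum of the unscaled differences, which is the invertibility requirement
    mentioned above.
    """
    diff = series.diff().dropna()
    return diff / diff.std()

# Toy example: a random-walk "price" on a quarterly index stands in for the S&P 500.
idx = pd.date_range("1960-03-31", periods=220, freq="Q")
price = pd.Series(100 * np.exp(np.random.randn(len(idx)).cumsum() * 0.05), index=idx)
print(to_stationary(price).head())
```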

Step 3: Choose a model. It could be a neural network, multivariable linear regression, or multivariable polynomial regression. After testing linear and nonlinear models, we conclude that the data is so noisy that it makes no sense to introduce a nonlinear model: a plot of y(x), where y = S&P 500 and x = one of the 10 thousand indicators, is a nearly circular cloud. Thus we formulate the task even more specifically: predict the S&P 500 index one or two quarters ahead, having 10 thousand economic indicators with a quarterly period, using multivariable linear regression.
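For concreteness, here is a tiny Python sketch of the single-input linear model y_mod = a + b*x used throughout, fitted to synthetic, deliberately noisy data resembling the near-circular cloud described above; nothing in it is specific to the author's dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(120)              # one stationarized indicator, 30 years of quarters
y = 0.1 * x + rng.standard_normal(120)    # stationarized S&P 500: mostly noise

b, a = np.polyfit(x, y, 1)                # slope b and intercept a of y_mod = a + b*x
residual = y - (a + b * x)
print(f"a = {a:.3f}, b = {b:.3f}, residual std = {residual.std():.3f}")
```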

Step 4: Select the most important economic indicators out of the 10 thousand (reduce the dimensionality of the problem). This is the most important and most difficult step. Say we take the S&P 500 history as long as 30 years (120 quarters). To represent the S&P 500 as a linear combination of economic indicators, 120 indicators are enough to describe it exactly over those 30 years; moreover, those indicators can be absolutely arbitrary, since any 120 inputs can exactly fit 120 S&P 500 values. So the number of inputs must be reduced well below the number of function values being described; for example, we look for the 10-20 most important inputs. Such tasks of describing data with a small number of inputs selected from a huge dictionary of candidate bases are called sparse coding.

There are many methods for selecting predictor inputs. I tried them all. Here are the main two:

  1. We rank all 10 thousand inputs by their power to predict the S&P 500. Predictive power can be measured by the correlation coefficient or by mutual information.
  2. We go through all 10 thousand indicators one by one and choose the one whose linear model y_mod = a + b*x1 describes the S&P 500 with the smallest error. Then we select the second input by again enumerating the remaining 10 thousand minus 1 indicators so that it describes the residual y - y_mod = c + d*x2 with the least error, and so on. This method is called stepwise regression or matching pursuit (a sketch of the procedure follows this list).
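Here is a minimal Python sketch of the second method, i.e. greedy selection on residuals in the matching-pursuit / stepwise-regression spirit; the data is synthetic and the function name is illustrative.

```python
import numpy as np

def stepwise_select(X: np.ndarray, y: np.ndarray, n_inputs: int = 3) -> list:
    """Greedy input selection in the spirit of method 2 above.

    At each step, a one-input linear model is fitted to the current residual
    for every remaining candidate column of X, and the column giving the
    smallest sum of squared residuals is kept.
    """
    residual = y.astype(float).copy()
    chosen = []
    for _ in range(n_inputs):
        best_j, best_err, best_fit = None, np.inf, None
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            d, c = np.polyfit(X[:, j], residual, 1)   # residual ~ c + d * x_j
            fit = c + d * X[:, j]
            err = np.sum((residual - fit) ** 2)
            if err < best_err:
                best_j, best_err, best_fit = j, err, fit
        chosen.append(best_j)
        residual = residual - best_fit
    return chosen

# Toy data: 120 quarters, 50 candidate indicators, only column 7 is informative.
rng = np.random.default_rng(1)
X = rng.standard_normal((120, 50))
y = 0.5 * X[:, 7] + rng.standard_normal(120)
print(stepwise_select(X, y))   # column 7 should be selected first
```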

Here are the top 10 indicators with the highest correlation coefficient with the S&P 500:

Series ID           Lag   Corr    Mut info
'PPICRM'             2    0.315   0.102
'CWUR0000SEHE'       2    0.283   0.122
'CES1021000001'      1    0.263   0.095
'B115RC1Q027SBEA'    2    0.262   0.102
'CES1000000034'      1    0.261   0.105
'A371RD3Q086SBEA'    2    0.260   0.085
'B115RC1Q027SBEA'    1    0.256   0.102
'CUUR0000SAF111'     1    0.252   0.117
'CUUR0000SEHE'       2    0.251   0.098
'USMINE'             1    0.250   0.102

Here are the top 10 indicators with the most mutual information with the S&P 500:

Series ID           Lag   Corr    Mut info
'CPILEGSL'           3    0.061   0.136
'B701RC1Q027SBEA'    3    0.038   0.136
'CUSR0000SAS'        3    0.043   0.134
'GDPPOT'             3    0.003   0.134
'NGDPPOT'            5    0.102   0.134
'OTHSEC'             4    0.168   0.133
'LNU01300060'        3    0.046   0.132
'LRAC25TTUSM156N'    3    0.046   0.132
'LRAC25TTUSQ156N'    3    0.046   0.131
'CUSR0000SAS'        1    0.130   0.131

Lag is the delay of the input series with respect to the modeled S&P 500 series. As these tables show, different methods of choosing the most important inputs produce different sets of inputs. Since my ultimate goal is to minimize the model error, I chose the second selection method, i.e. enumerating all inputs and picking the one that gives the smallest error.

Step 5: Choose a method for calculating the error and the model coefficients. The simplest is the least-squares method, which is why linear regression based on it is so popular. The problem with least squares is that it is sensitive to outliers, i.e. outliers significantly affect the model coefficients. To reduce this sensitivity, the sum of the absolute values of the errors can be used instead of the sum of the squared errors, which leads to the method of least absolute deviations (least moduli), i.e. robust regression. Unlike linear regression, this method has no analytical solution for the model coefficients. Usually the absolute values are replaced by smooth, differentiable approximating functions and the solution is obtained numerically, which takes a long time. I tried both methods (linear regression and least moduli) and did not notice much advantage in least moduli. Instead, I took a detour. In the second step, when obtaining stationary data by differencing, I added a nonlinear normalization operation. That is, the original series x[1], x[2], ... x[i-1], x[i], ... is first converted into the difference series x[2]-x[1], ..., x[i]-x[i-1], ..., and then each difference is normalized by replacing it with sign(x[i]-x[i-1])*abs(x[i]-x[i-1])^u, where 0 < u < 1. For u = 1 we get classical least squares with its sensitivity to outliers. For u = 0 all values of the input series are replaced by binary values +/-1, so outliers essentially disappear. For u = 0.5 we get something close to the method of least moduli. The optimal value of u lies somewhere between 0.5 and 1.
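That normalization translates directly into code; a sketch follows, where the function name and the toy series are purely illustrative.

```python
import numpy as np

def signed_power(diff: np.ndarray, u: float) -> np.ndarray:
    """Apply sign(dx) * |dx|**u to a series of first differences.

    u = 1 reproduces plain differences (and hence classical least squares with
    its sensitivity to outliers); u close to 0 collapses everything to +/-1;
    u around 0.5 behaves roughly like the method of least moduli.
    """
    return np.sign(diff) * np.abs(diff) ** u

x = np.array([100.0, 101.0, 99.5, 120.0, 119.0])   # toy series with one outlier jump
print(signed_power(np.diff(x), 0.75))
```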

It should be noted that one popular method of converting data to a stationary form is to replace the values of the series with the difference of their logarithms, i.e. log(x[i]) - log(x[i-1]) or log(x[i]/x[i-1]). Such a transformation is dangerous in my case, since the dictionary of 10 thousand entries contains many series with zero and negative values. The logarithm also has the benefit of reducing the sensitivity of the least-squares method to outliers. Essentially, my transformation function sign(x)*|x|^u serves the same purpose as log(x), but without the problems associated with zero and negative values.

Step 6: Calculate the model prediction by substituting the fresh input data into the model with the same coefficients that were found by linear regression on the preceding history segment. Here it is important to keep in mind that the quarterly values of the economic indicators and of the S&P 500 arrive almost simultaneously (to within 3 months). Therefore, to predict the S&P 500 one quarter ahead, the model must be built between the current quarterly value of the S&P 500 and inputs delayed by at least 1 quarter (Lag >= 1). To predict the S&P 500 two quarters ahead, the model must be built between the current quarterly value of the S&P 500 and inputs delayed by at least 2 quarters (Lag >= 2), and so on. The accuracy of the predictions drops significantly for delays greater than 2.
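A small Python sketch of this alignment, assuming a single already-stationarized indicator x and target y of equal length; the indicator enters with a lag of at least one quarter, and the fitted coefficients are applied to the most recent admissible input value. The names and toy data are illustrative only.

```python
import numpy as np

def predict_next_quarter(y: np.ndarray, x: np.ndarray, lag: int = 1) -> float:
    """Fit y[t] = a + b * x[t - lag] on the known history, then predict the next quarter.

    With lag >= 1, only input values already published when the prediction is
    made are used, which is the alignment requirement described above.
    """
    b, a = np.polyfit(x[:-lag], y[lag:], 1)   # pair x[t - lag] with y[t]
    return a + b * x[-lag]                    # input that is `lag` quarters old at the next quarter

# Toy example where y really does follow the indicator with a one-quarter lag.
rng = np.random.default_rng(2)
x = rng.standard_normal(200)
y = np.empty(200)
y[0] = 0.0
y[1:] = 0.6 * x[:-1] + 0.2 * rng.standard_normal(199)
print(predict_next_quarter(y, x, lag=1))
```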

Step 7: Check the accuracy of the predictions on the previous history. The original technique described above (fitting each input to the previous history, choosing the input with the smallest fitting error, and computing the prediction from the fresh value of that input) produced a prediction RMS that was even worse than random or null predictions. I asked myself the question: why should an input that fits the past well have good predictive ability for the future? It makes sense to select model inputs based on their previous prediction errors rather than on the smallest regression error on the known data.

In the end, my model can be described step by step like this (a compact sketch in code follows the list):

  1. We download economic data from the St. Louis Fed (about 10 thousand indicators).
  2. We transform the data to a stationary form and normalize it.
  3. We choose a linear model of the S&P 500 index, solved analytically by least squares (linear regression).
  4. We choose the length of the history (1960 - Q2 2015) and divide it into a training segment (1960 - Q4 1999) and a test segment (Q1 2000 - Q2 2015).
  5. We start predictions from 1960 + N + 1, where N*4 is the initial number of known quarterly S&P 500 values.
  6. On the first N data points, a linear model y_mod = a + b*x is built for each economic indicator, where y_mod is the S&P 500 model and x is one of the economic indicators.
  7. We predict bar N + 1 with each model.
  8. We calculate the prediction error at bar N + 1 for each model and store these errors.
  9. We increase the number of known S&P 500 values by 1 (i.e. to N + 1) and repeat steps 6-9 until we reach the end of the training segment (Q4 1999). At this point we have stored the prediction errors from 1960 + N + 1 to Q4 1999 for each economic indicator.
  10. We start testing the model in the second period of history (Q1 2000 - Q2 2015).
  11. For each of the 10 thousand inputs, we compute the RMS of its prediction errors over 1960 - Q4 1999.
  12. Out of the 10 thousand inputs, we select the one with the lowest prediction RMS over 1960 - Q4 1999.
  13. We build a linear model y_mod = a + b*x for each economic indicator for 1960 - Q4 1999.
  14. We predict Q1 2000 with each model.
  15. The prediction of the input with the lowest prediction RMS over the previous interval (1960 - Q4 1999) is taken as our main prediction for Q1 2000.
  16. We calculate the prediction errors of all inputs for Q1 2000 and add them to the accumulated errors of the same inputs over the previous interval (1960 - Q4 1999).
  17. We move on to Q2 2000 and repeat steps 12-17 until we reach the end of the test segment (Q2 2015), with the unknown value of the S&P 500 whose prediction is our main goal.
  18. We accumulate the prediction errors for Q1 2000 - Q4 2014 made by the inputs with the lowest prediction RMS over the preceding segments. This error (err2) is our model's out-of-sample prediction error.
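Here is a compact Python sketch of this walk-forward loop under simplifying assumptions: a single fixed lag, accumulation of squared errors instead of the full RMS bookkeeping, and synthetic data in place of the FRED dictionary. All names are illustrative and this is not the author's MATLAB code.

```python
import numpy as np

def walk_forward_forecast(y: np.ndarray, X: np.ndarray, n_train: int, lag: int = 1) -> np.ndarray:
    """Compact sketch of the walk-forward scheme in the list above.

    At every quarter t >= n_train, each candidate input is refitted on the data
    known so far and used to predict y[t].  The forecast reported for quarter t
    comes from the input with the smallest accumulated squared prediction error
    before t, so there is no look-ahead; errors are updated only after the true
    value arrives.
    """
    T, K = len(y), X.shape[1]
    acc_err = np.zeros(K)                 # accumulated squared prediction errors per input
    forecasts = np.full(T, np.nan)
    for t in range(n_train, T):
        preds = np.empty(K)
        for k in range(K):
            b, a = np.polyfit(X[:t - lag, k], y[lag:t], 1)   # fit on history only
            preds[k] = a + b * X[t - lag, k]                 # predict quarter t
        best = int(np.argmin(acc_err))
        forecasts[t] = preds[best]
        acc_err += (preds - y[t]) ** 2
    return forecasts

# Toy run: 240 quarters, 30 candidate inputs, only column 3 is predictive at lag 1.
rng = np.random.default_rng(3)
X = rng.standard_normal((240, 30))
y = np.empty(240)
y[0] = 0.0
y[1:] = 0.7 * X[:-1, 3] + 0.3 * rng.standard_normal(239)
f = walk_forward_forecast(y, X, n_train=80)
mask = ~np.isnan(f)
print("out-of-sample RMS:", np.sqrt(np.mean((f[mask] - y[mask]) ** 2)))
```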

In short, the choice of the predictor depends on its RMS over previous S&P 500 predictions. There is no looking ahead. The predictor can change over time, but by the end of the test segment it basically stops changing. My model selected PPICRM with a 2-quarter lag as the first input for predicting Q2 2015. The linear regression of the S&P 500 on the selected PPICRM(2) input for 1960 - Q4 2014 is shown below. Black circles: the linear regression. Colored circles: historical data for 1960 - Q4 2014; the color of a circle indicates its time.


Stationary S&P 500 predictions (red line):

S&P 500 predictions in raw form (red line):

The graph shows that the model predicts the growth of the S&P 500 in the second quarter of 2015. Adding a second input increases the prediction error:

1 err1=0.900298 err2=0.938355 PPICRM (2)

2 err1=0.881910 err2=0.978233 PERMIT1 (4)

where err1 is the regression error; it naturally decreases when a second input is added. err2 is the root-mean-square prediction error divided by the error of a random prediction. That is, err2 >= 1 means my model's prediction is no better than random predictions, and err2 < 1 means it is better.
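For clarity, err2 can be computed as sketched below; since the text does not spell out how the random benchmark predictions are generated, the benchmark series is passed in explicitly here as an assumption.

```python
import numpy as np

def rms(e: np.ndarray) -> float:
    return float(np.sqrt(np.mean(e ** 2)))

def err2(y_true: np.ndarray, y_model: np.ndarray, y_benchmark: np.ndarray) -> float:
    """RMS error of the model divided by the RMS error of a benchmark prediction.

    Values >= 1 mean the model is no better than the benchmark, values < 1 mean
    it is better.  The way the 'random prediction' benchmark is generated is not
    spelled out in the text, so it is supplied explicitly.
    """
    return rms(y_true - y_model) / rms(y_true - y_benchmark)

# Toy check: a model correlated with the truth beats a purely random benchmark.
rng = np.random.default_rng(4)
y = rng.standard_normal(60)
print(err2(y, 0.5 * y, rng.standard_normal(60)))   # expected to be < 1
```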

PPICRM = Producer Price Index: Crude Materials for Further Processing

PERMIT1 = New Private Housing Units Authorized by Building Permits - In Structures with 1 Unit

The model described above can be rephrased in this way. We gather 10 thousand economists and ask them to predict the market for the quarter ahead. Each economist comes up with his own prediction. But instead of picking a prediction based on the number of textbooks they've written or the number of Nobel Prizes they've won in the past, we wait a few years collecting their predictions. After a significant number of predictions, we see which economist is more accurate and begin to believe his predictions until some other economist surpasses him in accuracy.

The analysis is impressive, but one question remains: did it ever occur to you that predicting 500 (!!!) mutually influencing instruments at once is somewhat more difficult than predicting one? Really, why was the S&P 500 chosen? It is formed by 500 independent and hard-to-predict companies. Elementary logic suggests that this approach should preferably be tested on a single issuer, which would increase confidence in the final result by about... 500 times. :)
 
Peter Konow:
The analysis is impressive, but one question remains: did it ever occur to you that predicting 500 (!!!) mutually influencing instruments at once is somewhat more difficult than predicting one? Really, why was the S&P 500 chosen? It is formed by 500 independent and hard-to-predict companies. Elementary logic suggests that this approach should preferably be tested on a single issuer, which would increase confidence in the final result by about... 500 times. :)
Wrong. Although the thread is called "predicting the market based on macroeconomic indicators", the indicators themselves carry no meaning in this analysis. They are just variables substituted into some formula, mathematically depersonalised and stripped of all external semantic and logical connections to the world. Dry numbers, arranged into abstract numerical series, serve as material for a model (a neural network), which predicts... no, not the market, but those very same numerical series.

From this point of view, it makes no difference how many instruments are in the index and it does not matter what their companies' policies are; what matters is only that all the variables and values are substituted into the formula. And so it is.

So everything is correct, only the thread should be called differently, because in essence it is not fundamental analysis but technical analysis.
 
Peter Konow:
Wrong. Although the thread is called "predicting the market based on macroeconomic indicators", the indicators themselves carry no meaning in this analysis. They are just variables substituted into some formula, mathematically depersonalised and stripped of all external semantic and logical connections to the world. Dry numbers, arranged into abstract numerical series, serve as material for a model (a neural network), which predicts... no, not the market, but those very same numerical series.

From this point of view, it makes no difference how many instruments are in the index and it does not matter what their companies' policies are; what matters is only that all the variables and values are substituted into the formula. And so it is.

So everything is correct, only the thread should be called differently, because in essence it is not fundamental analysis but technical analysis.

What you get is technical analysis on fundamental data.

Fundamental analysis is not that simple. There are many factors affecting prices that do not fall under economic indicators: elections, Brexit, all sorts of rumours and so on. They can affect the price more than all the economic indicators.

 
Off topic.
I have long been interested in questions that arise from reading the numerous traders' "streams of consciousness" on this forum:

1. Why do people easily and quickly forget about the object of study, getting lost in the "number series", coefficients, etc., and never get back to the point of origin, forever wandering in the wilds of mathematics?

2. What motivates them? Is it really just money?

I do not yet have an answer to the first question, but to the second one... there is an answer:

All their mathematical searches lead to a certain fantasy paradise of formulas and tables, where existence is collected, ordered and predictable in all its aspects. This is how clever people imagine the Grail.

The first post of this thread demonstrates neither the first nor the second step in this obscure direction...
 
Uladzimir Izerski:

What you get is technical analysis on fundamental data.

Fundamental analysis is not that simple. There are many factors affecting prices that do not fall under economic indicators: elections, Brexit, all sorts of rumours and so on. They can affect the price more than all the economic indicators.

Yes, that's right.
 

To Peter: I am not predicting the S&P 500 directly. The purpose of this work is to predict recessions in order to get out of the market before they occur and so improve on the profitability of the buy & hold strategy. Although the S&P 500 contains the stocks of 500 companies, it is driven by institutional investors who buy and sell the index itself (or its options), not its components. 13% a year doesn't seem like much, but it is enough for big money, where turnover is what matters. Bernie Madoff attracted his clients by promising them a modest 10% a year, which he failed to achieve.

To Uladzimir: I agree that price fluctuations depend on various social and political events: elections, Brexit, epidemics and so on. In the end it all comes down to supply and demand for products and services, unemployment, and other indicators of the economy. I don't care about day-to-day price fluctuations. Even a simple buy & hold strategy earns 7.4% a year. What I care about is avoiding long positions during recessions and improving the profitability of that strategy. By the way, another strategy is buying real estate, but in the US that yields only about 5% a year.

 
Peter Konow:
Wrong. Although the thread is called "predicting the market based on macroeconomic indicators", the indicators themselves carry no meaning in this analysis. They are just variables substituted into some formula, mathematically depersonalised and stripped of all external semantic and logical connections to the world. Dry numbers, arranged into abstract numerical series, serve as material for a model (a neural network), which predicts... no, not the market, but those very same numerical series.

From this point of view, it makes no difference how many instruments are in the index and it does not matter what their companies' policies are; what matters is only that all the variables and values are substituted into the formula. And so it is.

So everything is correct, only the thread should be called differently, because in essence it is not fundamental analysis but technical analysis.

So, what is the forecast for the S&P 500?

 
Vladimir:

I'm sorry, but all this for 5-13% a year??? It's not worth the effort.)
