Market prediction based on macroeconomic indicators

 

So, the task is to predict the S&P 500 index based on the available economic indicators.

Step 1: Find the indicators. The indicators are publicly available here: http://research.stlouisfed.org/fred2/ There are 240,000 of them. The most important is GDP growth, which is calculated every quarter, hence our step is 3 months. All indicators with a shorter period are recast to a 3-month period; the rest (annual) are discarded. We also discard indicators for all countries other than the United States and indicators without a deep history (at least 15 years). After this painstaking filtering we are left with about 10 thousand indicators. So the task becomes more specific: predict the S&P 500 index one or two quarters ahead, given 10 thousand economic indicators with a quarterly period. I do everything in MatLab, although it could also be done in R.
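For illustration only, here is a rough Python sketch of this filtering step (the original work is in MatLab; the series list, thresholds and the pandas_datareader FRED reader are assumptions of the sketch, not part of the original pipeline):

```python
# Illustrative sketch only; the candidate list below is a tiny hypothetical
# placeholder for the full FRED catalogue.
import pandas as pd
from pandas_datareader import data as pdr

candidate_ids = ["GDP", "PPICRM", "PERMIT1"]      # placeholder FRED series IDs
start, end = "1960-01-01", "2015-06-30"

quarterly = {}
for sid in candidate_ids:
    try:
        s = pdr.DataReader(sid, "fred", start, end)[sid]
    except Exception:
        continue                                  # skip series that fail to download
    if s.dropna().index.to_series().diff().median() > pd.Timedelta(days=100):
        continue                                  # discard annual (or sparser) series
    s = s.resample("Q").mean()                    # recast shorter-period series to a 3-month step
    if s.dropna().shape[0] >= 15 * 4:             # require at least 15 years of history
        quarterly[sid] = s

X = pd.DataFrame(quarterly)                       # dictionary of candidate inputs, one column per series
```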

Step 2: Convert all the data to stationary form by differencing and normalization. There are many ways to do this; the main requirement is that the original data can be restored from the transformed data. Without stationarity, no model will work. The S&P 500 series before and after the transformation is shown below.
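A minimal sketch of such an invertible transformation (plain differencing plus a linear scale normalization; the author's actual nonlinear normalization is described in Step 5):

```python
import numpy as np

def to_stationary(x):
    """First-difference a series and normalize by its standard deviation.
    Returns the transformed series plus what is needed to invert it."""
    d = np.diff(x)
    scale = d.std()
    return d / scale, x[0], scale

def from_stationary(d_norm, x0, scale):
    """Invert the transformation: rescale the differences and cumulatively sum back."""
    return np.concatenate([[x0], x0 + np.cumsum(d_norm * scale)])

x = np.array([100.0, 102.0, 101.0, 105.0])
d, x0, scale = to_stationary(x)
assert np.allclose(from_stationary(d, x0, scale), x)   # original data can be restored
```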

Step 3: Choose a model. It could be a neural network, multivariable linear regression, or multivariable polynomial regression. After testing linear and nonlinear models, we conclude that the data is so noisy that a nonlinear model makes no sense: the graph of y(x), where y = S&P 500 and x = one of the 10 thousand indicators, is a nearly circular cloud. Thus the task becomes even more specific: predict the S&P 500 index one or two quarters ahead, given 10 thousand quarterly economic indicators, using multivariable linear regression.
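A hedged sketch of how such a linear-vs-nonlinear check could be run on one noisy input (synthetic data and scikit-learn here, purely for illustration; this is not the author's code):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))                          # stand-in for one stationarized indicator
y = 0.3 * x[:, 0] + rng.normal(scale=1.0, size=200)    # very noisy, nearly circular cloud

linear = LinearRegression()
quadratic = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

print("linear    R^2:", cross_val_score(linear, x, y, cv=5).mean())
print("quadratic R^2:", cross_val_score(quadratic, x, y, cv=5).mean())
# With this much noise the quadratic terms add nothing out of sample.
```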

Step 4: Select the most important economic indicators from the 10 thousand (reduce the dimensionality of the problem). This is the most important and difficult step. Say we take 30 years of S&P 500 history (120 quarters). To represent the S&P 500 as a linear combination of economic indicators, 120 indicators are enough to describe the S&P 500 over those 30 years exactly. Moreover, the indicators can be absolutely anything: with 120 inputs and 120 S&P 500 values such an exact model can always be built. So the number of inputs must be reduced below the number of function values being described. For example, we look for the 10-20 most important inputs. Such tasks of describing data with a small number of inputs selected from a huge dictionary of candidate bases are called sparse coding.

There are many methods for selecting predictor inputs. I tried them all. Here are the main two:

  1. We rank all 10 thousand indicators by their power to predict the S&P 500. Predictive power can be measured by the correlation coefficient or by mutual information.
  2. We go through all 10 thousand indicators one by one and choose the one whose linear model y_mod = a + b*x1 describes the S&P 500 with the smallest error. Then we select the second input by going through the remaining 10 thousand - 1 indicators so that it describes the residual y - y_mod = c + d*x2 with the least error, and so on. This method is called stepwise regression or matching pursuit (a code sketch of this greedy selection is given after this list).
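A minimal Python sketch of method 2 (greedy stepwise selection / matching pursuit), assuming X is a matrix of stationarized, lagged indicators and y is the stationarized S&P 500 (names and shapes are assumptions of the sketch):

```python
import numpy as np

def stepwise_select(X, y, n_inputs=2):
    """Greedy selection: fit y_mod = a + b*x to y, then fit the residual, etc.
    X: (n_quarters, n_indicators), y: (n_quarters,)."""
    residual = y.astype(float).copy()
    chosen, models = [], []
    for _ in range(n_inputs):
        best = None
        for j in range(X.shape[1]):
            if j in chosen:
                continue
            A = np.column_stack([np.ones_like(residual), X[:, j]])
            coef, *_ = np.linalg.lstsq(A, residual, rcond=None)
            err = np.mean((residual - A @ coef) ** 2)
            if best is None or err < best[0]:
                best = (err, j, coef)
        _, j, coef = best
        chosen.append(j)
        models.append(coef)
        residual = residual - (coef[0] + coef[1] * X[:, j])   # fit the next input to what is left
    return chosen, models
```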

Here are the top 10 indicators with the highest correlation coefficient with the S&P 500:

Series ID           Lag   Corr    Mut info
'PPICRM'             2    0.315   0.102
'CWUR0000SEHE'       2    0.283   0.122
'CES1021000001'      1    0.263   0.095
'B115RC1Q027SBEA'    2    0.262   0.102
'CES1000000034'      1    0.261   0.105
'A371RD3Q086SBEA'    2    0.260   0.085
'B115RC1Q027SBEA'    1    0.256   0.102
'CUUR0000SAF111'     1    0.252   0.117
'CUUR0000SEHE'       2    0.251   0.098
'USMINE'             1    0.250   0.102

Here are the top 10 indicators with the most mutual information with the S&P 500:

Series ID           Lag   Corr    Mut info
'CPILEGSL'           3    0.061   0.136
'B701RC1Q027SBEA'    3    0.038   0.136
'CUSR0000SAS'        3    0.043   0.134
'GDPPOT'             3    0.003   0.134
'NGDPPOT'            5    0.102   0.134
'OTHSEC'             4    0.168   0.133
'LNU01300060'        3    0.046   0.132
'LRAC25TTUSM156N'    3    0.046   0.132
'LRAC25TTUSQ156N'    3    0.046   0.131
'CUSR0000SAS'        1    0.130   0.131

Lag is the delay of the input series relative to the modelled S&P 500 series. As can be seen from these tables, different methods of choosing the most important inputs produce different sets of inputs. Since my ultimate goal is to minimize the model error, I chose the second selection method, i.e. going through all inputs and selecting the one that gives the least error.
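For illustration, a Python sketch of how such a lagged correlation / mutual-information ranking could be computed (mutual_info_regression is just one possible estimator of mutual information; the function names, the lag range and the minimum-overlap threshold are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def rank_inputs(X, y, max_lag=5):
    """X: DataFrame of stationarized indicators, y: stationarized S&P 500 Series."""
    rows = []
    for sid in X.columns:
        for lag in range(1, max_lag + 1):
            pair = pd.concat([y, X[sid].shift(lag)], axis=1).dropna()
            if len(pair) < 40:                       # require a reasonably long overlap
                continue
            yy, xx = pair.iloc[:, 0].values, pair.iloc[:, 1].values
            corr = abs(np.corrcoef(yy, xx)[0, 1])
            mi = mutual_info_regression(xx.reshape(-1, 1), yy, random_state=0)[0]
            rows.append((sid, lag, corr, mi))
    return pd.DataFrame(rows, columns=["Series ID", "Lag", "Corr", "Mut info"])

# ranked = rank_inputs(X, y)
# ranked.sort_values("Corr", ascending=False).head(10)
# ranked.sort_values("Mut info", ascending=False).head(10)
```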

Step 5: Choose the method for computing the error and the model coefficients. The simplest is the least-squares (LS) method, which is why linear regression based on it is so popular. The problem with least squares is that it is sensitive to outliers, i.e. outliers significantly affect the model coefficients. To reduce this sensitivity, the sum of the absolute values of the errors can be used instead of the sum of squared errors, which leads to the least absolute deviations (LAD) method, also known as robust regression. Unlike least-squares linear regression, this method has no analytical solution for the model coefficients; usually the absolute values are replaced by smooth, differentiable approximations and the solution is found numerically, which takes a long time. I tried both methods (least squares and LAD) and did not notice much advantage in LAD.

Instead of LAD, I took a detour. In the second step, where the data is made stationary by differencing, I added a nonlinear normalization. That is, the original series x[1], x[2], ... x[i-1], x[i] ... is first converted into the difference series x[2]-x[1], ..., x[i]-x[i-1], ..., and then each difference is normalized by replacing it with sign(x[i]-x[i-1])*abs(x[i]-x[i-1])^u, where 0 < u < 1. For u=1 we get the classical least-squares method with its sensitivity to outliers. At u=0 all values of the input series are replaced by binary values +/-1, with practically no outliers. For u=0.5 we get something close to LAD. The optimal value of u lies somewhere between 0.5 and 1.
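A minimal sketch of this differencing-plus-power normalization and its inverse (u = 0.7 is an arbitrary example value, not the author's choice):

```python
import numpy as np

def robust_diff(x, u=0.7):
    """Difference a series and apply the sign(d) * |d|**u normalization (0 < u <= 1).
    u = 1 reproduces plain differences (and hence least squares' outlier sensitivity);
    u -> 0 compresses every difference toward +/-1."""
    d = np.diff(np.asarray(x, dtype=float))
    return np.sign(d) * np.abs(d) ** u

def invert_robust_diff(z, x0, u=0.7):
    """Recover the original series: undo the power, then cumulatively sum."""
    d = np.sign(z) * np.abs(z) ** (1.0 / u)
    return np.concatenate([[x0], x0 + np.cumsum(d)])

x = [100.0, 104.0, 90.0, 95.0]          # the jump down acts like an outlier
z = robust_diff(x, u=0.7)
assert np.allclose(invert_robust_diff(z, x[0], u=0.7), x)   # transformation is invertible
```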

It should be noted that one popular way of converting data to a stationary form is to replace the values of the series with the difference of their logarithms, i.e. log(x[i]) - log(x[i-1]) or log(x[i]/x[i-1]). Such a transformation is dangerous in my case, since the dictionary of 10 thousand entries contains many series with zero and negative values. The logarithm also has the benefit of reducing the sensitivity of the least-squares method to outliers. Essentially, my transformation function sign(x)*|x|^u serves the same purpose as log(x), but without the problems associated with zero and negative values.

Step 6: Calculate the model prediction by plugging in the fresh input data and computing the model output with the same coefficients that were found by linear regression on the preceding history. Here it is important to keep in mind that the quarterly values of the economic indicators and of the S&P 500 arrive almost simultaneously (to within 3 months). Therefore, to predict the S&P 500 for the next quarter, the model must be built between the current quarterly value of the S&P 500 and inputs delayed by at least 1 quarter (Lag>=1). To predict the S&P 500 two quarters ahead, the model must be built between the current quarterly value of the S&P 500 and inputs delayed by at least 2 quarters (Lag>=2), and so on. Prediction accuracy decreases significantly as the delay grows beyond 2.

Step 7: Check the accuracy of the predictions on the past history. The original technique described above (fitting each input to the past history, choosing the input that gives the smallest RMS error, and computing the prediction from that input's fresh value) produced a prediction RMS error that was even worse than random or null predictions. I asked myself: why should an input that fits the past well have good predictive power for the future? It makes more sense to select model inputs by their past prediction error rather than by the smallest regression error on the known data.

In the end, my model can be described step by step like this:

  1. We download the economic data from stlouisfed (about 10 thousand indicators).
  2. We transform the data to a stationary form and normalize it.
  3. We choose a linear model of the S&P 500 index, solved analytically by the least-squares method (linear regression).
  4. We choose the length of the history (1960 - Q2 2015) and divide it into a training segment (1960 - Q4 1999) and a test segment (Q1 2000 - Q2 2015).
  5. We start predictions from 1960 + N + 1, where N*4 is the initial number of known quarterly S&P 500 values.
  6. On the first N data points, a linear model y_mod = a + b*x is built for each economic indicator, where y_mod is the model of the S&P 500 and x is one of the economic indicators.
  7. We predict bar N + 1 with each model.
  8. We calculate each model's prediction error for bar N + 1 and remember these errors.
  9. We increase the number of known S&P 500 values by 1 (i.e. N + 1) and repeat steps 6-9 until we reach the end of the training segment (Q4 1999). At this point we have stored, for each economic indicator, its prediction errors from 1960 + N + 1 to Q4 1999.
  10. We start testing the model in the second period of history (Q1 2000 - Q2 2015).
  11. For each of the 10 thousand inputs, we calculate the RMS prediction error over 1960 - Q4 1999.
  12. Out of the 10 thousand inputs, we select the one with the lowest RMS prediction error over 1960 - Q4 1999.
  13. We build a linear model y_mod = a + b*x for each economic indicator for 1960 - Q4 1999.
  14. We predict Q1 2000 by each model.
  15. The prediction of the selected input with the lowest RMS of predictions for the previous time interval (1960 - Q4 1999) is chosen as our main prediction of Q1 2000.
  16. We calculate the prediction errors of all inputs for Q1 2000 and add them to the RMS of the same inputs for the previous time period (1960 - Q4 1999).
  17. Move on to Q2 2000 and repeat steps 12-17 until we reach the end of the test area (Q2 2015) with the unknown value of the S&P 500, the prediction of which is our main goal.
  18. We accumulate the prediction errors for Q1 2000 - Q4 2014 made by the inputs that had the lowest RMS prediction error over the preceding segments. This error (err2) is the model's out-of-sample prediction error. A condensed code sketch of this walk-forward procedure is given after the list.
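A condensed, illustrative Python sketch of the walk-forward procedure above (the array shapes, the use of squared-error sums instead of RMS for ranking, and the merging of the training and test periods into one loop are all simplifications of the sketch, not the author's MatLab code):

```python
import numpy as np

def walk_forward(X, y, n_init):
    """X: (T, K) stationarized inputs already shifted by their lag, y: (T,) stationarized S&P 500.
    At each bar, every input gets its own one-variable model fitted on history only;
    the input with the lowest accumulated prediction error so far supplies the main forecast."""
    T, K = X.shape
    sq_err = np.zeros(K)                  # accumulated squared prediction errors per input
    main_pred = np.full(T, np.nan)
    for t in range(n_init, T):
        preds = np.empty(K)
        for k in range(K):
            A = np.column_stack([np.ones(t), X[:t, k]])
            coef, *_ = np.linalg.lstsq(A, y[:t], rcond=None)   # fit on past bars only
            preds[k] = coef[0] + coef[1] * X[t, k]             # predict the next bar
        best = int(np.argmin(sq_err))     # lowest past prediction error (ties/first step: arbitrary);
                                          # argmin of the sum equals argmin of the RMS here
        main_pred[t] = preds[best]
        sq_err += (preds - y[t]) ** 2     # update every input's error record after the fact
    return main_pred
```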

In short, the choice of predictor depends on its RMS error on previous S&P 500 predictions. There is no look-ahead. The predictor can change over time, but towards the end of the test segment it basically stops changing. My model selected PPICRM with a 2-quarter lag as the first input for predicting Q2 2015. The linear regression of the S&P 500 on the selected PPICRM(2) input for 1960 - Q4 2014 is shown below. Black circles: the linear regression. Multi-colored circles: historical data for 1960 - Q4 2014. The color of a circle indicates the time.


Stationary S&P 500 predictions (red line):

S&P 500 predictions in raw form (red line):

The graph shows that the model predicts the growth of the S&P 500 in the second quarter of 2015. Adding a second input increases the prediction error:

1 err1=0.900298 err2=0.938355 PPICRM (2)

2 err1=0.881910 err2=0.978233 PERMIT1 (4)

where err1 is the regression error; naturally, it decreases when a second input is added. err2 is the RMS prediction error divided by the error of random predictions. That is, err2>=1 means that my model's prediction is no better than random predictions, and err2<1 means it is better than random predictions.
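The post does not spell out exactly how the random-prediction benchmark is computed; one hedged reading, assuming the benchmark is a null (zero) forecast of the stationarized series (Step 7 mentions "random or null predictions"), might be:

```python
import numpy as np

def err2(y_true, y_pred):
    """RMS prediction error divided by the error of a benchmark prediction.
    Assumption: the benchmark is a null (zero) forecast of the stationarized series."""
    rms_model = np.sqrt(np.mean((y_true - y_pred) ** 2))
    rms_null = np.sqrt(np.mean(y_true ** 2))
    return rms_model / rms_null          # < 1 means better than the benchmark
```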

PPICRM = Producer Price Index: Crude Materials for Further Processing

PERMIT1 = New Private Housing Units Authorized by Building Permits - In Structures with 1 Unit

The model described above can be rephrased in this way. We gather 10 thousand economists and ask them to predict the market for the quarter ahead. Each economist comes up with his own prediction. But instead of picking a prediction based on the number of textbooks they've written or the number of Nobel Prizes they've won in the past, we wait a few years collecting their predictions. After a significant number of predictions, we see which economist is more accurate and begin to believe his predictions until some other economist surpasses him in accuracy.

Federal Reserve Economic Data - FRED - St. Louis Fed (fred.stlouisfed.org): download, graph, and track 240,000 economic time series from 77 sources.
 
gpwr:
To be continued ...
Doesn't it bother you that the US government in general, and the Open Market Committee in particular, have repeatedly been suspected of falsifying and manipulating US labour-market and GDP macroeconomic statistics in order to influence the financial markets?
 
Demi:
Doesn't it bother you that the US government in general, and the Open Market Committee in particular, have repeatedly been suspected of falsifying and manipulating US labour-market and GDP macroeconomic statistics in order to influence the financial markets?
Yes, it bothers me. It also bothers me that published data is revised many times after its release. But on the other hand, traders react to the data the US government gives them, moving the market one way or the other, whether that data is falsified, incomplete or premature. So the technique of predicting the market from this data should, in principle, work.
 
gpwr:
Yes, it bothers me. It also bothers me that published data is revised many times after its release. But on the other hand, traders react to the data the US government gives them, moving the market one way or the other, whether that data is falsified, incomplete or premature. So the technique of predicting the market from this data should, in principle, work.

Are you only interested in the S&P or is it just taken as an example?

It's just that the S&P has a peculiar pattern of movement, not unlike the relationship between currencies.

 
Urain:

Are you only interested in the S&P or is it just taken as an example?

It's just that the S&P has a peculiar pattern of movement, not unlike the relationship between currencies.

It is taken as an example because of the ease of finding (publicly available) input data. Anything can be modelled this way: the Russian economy, exchange rates, etc. Market prices are most difficult to predict because there is a lot of noise in them. Predicting physical processes is much easier.
 
gpwr:
Yes, it bothers me. It also bothers me that published data is revised many times after its release. But on the other hand, traders react to the data the US government gives them, moving the market one way or the other, whether that data is falsified, incomplete or premature. So the technique of predicting the market from this data should, in principle, work.

OK, let's have a look. I've been doing that as well.

A persistent hint: a forward test is wanted.

 
gpwr:

So, the task is to predict the S&P 500 index based on the available economic indicators.

Quite an interesting topic. I tried to build indicators from such data: employment, new home construction, new home sales, etc. And you know, you can see with the naked eye that some of the data correlates with the stock market. But there seems to be no correlation with the currency market. I used basic US statistics.

Don't you think you have chosen too many types of data? In my opinion, you need to separate the unimportant data from the worthwhile data, the data that actually affect the market.

However, I'm not familiar with neural network analysis. I've started reading about it, but haven't found a clear explanation of how it works.

 
A regression algorithm will predict any indicator from any data, even if there is no obvious relationship between them
 
forexman77:

Quite an interesting topic. I tried to build indicators from such data: employment, new home construction, new home sales, etc. And you know, you can see with the naked eye that some of the data correlates with the stock market. But there seems to be no correlation with the currency market. I used basic US statistics.

Don't you think you have chosen too many types of data? In my opinion, you need to separate the unimportant data from the worthwhile data, the data that actually affect the market.


Whether the amount of input data is large or small is all relative.

Something else is more important.

All inputs fall into two categories:

  • those with an impact on the target variable
  • those that have no influence or little influence.

I deliberately use the word influence rather than correlation. Correlation is an empty tool here, because a correlation coefficient always takes some value; there is no NA ("not applicable") value, and that possibility is fundamental when determining the influence of the raw data on the target variable.

Variables that have no influence (or a low influence - note that this is a qualitative characteristic) are noise with respect to the target variable. The pitfall is that, starting from some amount that cannot be determined algorithmically, this noise "clogs up" the important variables, and then the "important" variables can no longer be extracted algorithmically from the aggregate.

Hence one has to look through the whole list of input variables manually and decide, intuitively or on some other grounds, that "this input variable is likely to have an influence and this one is likely not."

I know of several dozen algorithms for determining the importance of variables, which I have tried out on a set from my paper and book (up to 100 input variables). The result is exactly as described: I manually selected a shortlist and then filtered it with an algorithm to get the final list. The value of such a list is fundamental: models built on such a set of "influential" inputs (I used 3 different types of models) do NOT overfit, which is the main problem. Overfitting is the main consequence of using "noisy" input data.

PS.

Stationarity plays no role in my models, which are randomForest, ada, SVM.

 
gpwr:

...

.... No model will work without stationarity.

...

The requirement for stationarity is very rigid and completely unjustified.


And "non-stationary" models work just fine ;)

 
transcendreamer:
A regression algorithm will predict any indicator from any data, even if there is no obvious relationship between them
That can be said of any model, not just regression: neural networks, ARMA and others too. If there is no relationship between the inputs and the output, any model will still produce a prediction, it will just be inaccurate.