CS365A, Project Report: Predictive Model for Crude Oil Prices

Mentor: Prof. Arnab Bhattacharya, IIT Kanpur


Safal Pandita (15111040), Eeshit Dhaval Vaishnav (11784)
Group 5

Contents

1 Introduction and Motivation
  1.1 Factors affecting Oil Prices
    1.1.1 Organization of the Petroleum Exporting Countries (OPEC) Supply
    1.1.2 Non-OPEC Supply
    1.1.3 Stock Market
    1.1.4 The US Dollar value
2 The Objective
3 The Datasets
4 The Methods
5 The Results
  5.1 Further evaluation
6 Conclusions

1 Introduction and Motivation

Oil plays a vital role in the world economy as an essential raw material for many manufacturing and transportation processes. Almost two-thirds of global energy demand is met by oil. It is a highly volatile commodity and the most heavily and actively traded one on global markets, accounting for almost 10% of global trade volume. Predicting oil prices is therefore of great interest to economists, policy makers and commodity traders. Qualitative methods are used to estimate the impact of infrequent events like wars and natural calamities on oil prices, and they are especially relevant in today's scenario.

1.1 Factors affecting Oil Prices

Heuristically, the price of oil, like most economic quantities, is determined by supply and demand. An increase in demand results in an increase in the price of oil, and a decrease in demand leads to a decrease in price. Similarly, a cutback in oil supply results in a rise in oil prices. Supply and demand are themselves governed by various factors like geopolitics, the economy, government policy and weather. The following are a few of the several factors which affect the supply-demand balance:

1.1.1 Organization of the Petroleum Exporting Countries (OPEC) Supply

OPEC is an organization of 12 oil-exporting nations, namely Algeria, Angola, Ecuador, Iran, Iraq, Kuwait, Libya, Nigeria, Qatar, Saudi Arabia, the United Arab Emirates, and Venezuela. Its primary aim is to coordinate and unify the selling prices of oil of its member countries [1]. About 40% of the world's crude oil is supplied by OPEC members, and thus their actions significantly affect oil prices. This is very evident in the scenario today: not limiting the production of OPEC producers can result in heavy deflation of oil prices, which is exactly what is happening in the world right now [2]. It is also important to understand that oil prices depend not only on current demand and supply but also on projected future supply and demand. This follows intuitively from the fact that oil is a heavily traded commodity and many oil prospectors use predicted oil prices to make bets. OPEC members likewise adjust their production based on both current and future demand.

1.1.2 Non-OPEC Supply

Another factor, of course, is non-OPEC supply, which accounts for about 60% of the world's crude oil. OPEC and non-OPEC countries try to balance each other's supplies; when this mechanism fails, it can also contribute to a rise or fall in oil prices.

1. "OPEC: Home." http://www.opec.org/
2. "Energy and Financial Markets: What Drives Crude Oil Prices." http://www.eia.gov/finance/markets


1.1.3 Stock Market

For a heavily traded commodity like oil, the stock market can be used as an indicator of economic variables like oil prices. When economic conditions improve, one would expect demand to go up due to the rise in expendable income, which would of course result in an increase in oil prices. We use the Standard and Poor's 500 (S&P 500) index as the benchmark for the stock market in our project. (The S&P 500, the most widely used indicator of the US economy, is essentially a weighted index of the market capitalizations of 500 companies.)

1.1.4 The US Dollar value

Oil, like most globally traded commodities, is traded in US Dollars, so any change in the value of the dollar relative to other currencies will cause oil prices to shift. To incorporate this factor into our project, we use the US Dollar Index, a measure of the value of the US dollar relative to a basket of foreign currencies. Several studies support the idea of a negative correlation between the US Dollar Index and the oil price [3]. The reasoning is that a depreciation in the dollar exchange rate makes oil cheaper in countries outside the US, which increases demand and in turn causes oil prices to rise. This reasoning must be taken with a pinch of salt, and there are many caveats and disclaimers; we make no such assumption and train our model without any underlying hypothesis about the correlation between these two parameters.

2 The Objective

The objective is to carry out an exploratory and comparative analysis of quantitative models for predicting oil prices given macroeconomic and oil data for the past 30 years. For this purpose, we first select datasets that account for the factors mentioned above. Then, we use machine learning techniques to make predictions for future oil prices using the regression models we train on those datasets.

3 The Datasets

We used datasets that encompass the factors indicated above. OPEC and the US Department of Energy (DOE) provide daily prices for Brent Crude, WTI Crude and other world crude oils. Quandl is another resource we used extensively for our analysis, since it provides formatted datasets that are easy to load into our algorithms [4]. We use monthly oil prices and monthly values of the factors. The crude oil benchmark we used is West Texas Intermediate (WTI). (For the uninitiated, a crude oil benchmark is simply a reference price for buyers and sellers, and a spot price is the price at which a particular commodity can be bought or sold at a specified time and place.)

For completeness, we list the macroeconomic factors and spot prices we used for predicting oil prices. These are time-series datasets obtained primarily from Quandl. Importantly, we adjusted every one of the values below for inflation before using them in our feature vectors, using the standard Consumer Price Index (CPI) normalization.

• S&P 500 Index
• New York Stock Exchange Index (NYSE)
• US Dollar Index
• Gold Value

We obtained monthly values for all of these quantities starting from 1983. To justify our selection of factors, we show the heatmap in Figure 3.1. The heatmap indicates a strong correlation between gold values and oil values, and positive correlation between the NYSE and S&P 500 prices. As discussed above, we see a negative correlation between the US Dollar Index and oil prices. This framework of building correlation matrices while choosing factors for the regression models presented below is particularly useful for prospecting for predictive factors.

3. Zhang, Yue-Jun, et al. "Spillover effect of US dollar exchange rate on oil prices." Journal of Policy Modeling 30.6 (2008): 973-991.
4. Quandl - Find, Use and Share Numerical Data. https://www.quandl.com
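The inflation-adjustment and correlation-matrix steps described above can be sketched as follows. The numbers are made-up placeholders, not the actual Quandl series, and the column names are our own shorthand:

```python
import pandas as pd

# Made-up monthly values standing in for the Quandl series (WTI oil,
# gold, NYSE, S&P 500, US Dollar Index) plus the CPI used to deflate them.
raw = pd.DataFrame({
    "oil": [29.5, 30.1, 28.7, 27.9],
    "gold": [380.0, 385.2, 391.1, 388.4],
    "nyse": [95.2, 96.0, 97.3, 96.8],
    "sp500": [160.1, 162.3, 165.0, 164.2],
    "usd_index": [125.3, 124.1, 122.8, 123.5],
    "cpi": [99.6, 99.9, 100.2, 100.5],
})

# Deflate every series to the CPI of the last month, putting all prices
# in constant (inflation-adjusted) dollars.
base_cpi = raw["cpi"].iloc[-1]
real = raw.drop(columns="cpi").mul(base_cpi / raw["cpi"], axis=0)

# Pairwise Pearson correlations; this matrix is what the heatmap visualizes.
corr = real.corr()
print(corr.round(2))
```

The resulting matrix is symmetric with a unit diagonal, and each off-diagonal entry is the quantity plotted in one cell of the heatmap.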

Figure 3.1: Correlation matrix for oil price prediction factors

4 The Methods

As mentioned earlier, we first construct a feature vector for each month using the datasets above. We then tried a large number of regression models. The following list describes the regression methods we used for building our predictive model [5].

• LinearRegression: fits a linear model with coefficients w = (w_1, ..., w_p) to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.

• ARDRegression: fits the weights of a regression model using an Automatic Relevance Determination (ARD) prior. The weights of the regression model are assumed to follow Gaussian distributions. The parameters lambda (the precisions of the weight distributions) and alpha (the precision of the noise distribution) are also estimated, by an iterative procedure (evidence maximization) [6].

• BayesianRidge: Bayesian regression techniques can be used to include regularization parameters in the estimation procedure: the regularization parameter is not set in a hard sense but tuned to the data at hand. BayesianRidge estimates a probabilistic model of the regression problem. The prior for the parameter w is given by a spherical Gaussian [7].

• LARS: Least-angle regression (LARS) is a regression algorithm for high-dimensional data. If two variables are almost equally correlated with the response, their coefficients should increase at approximately the same rate. The algorithm thus behaves as intuition would expect, and is also more stable [8].

• LassoLars: a lasso model implemented using the LARS algorithm. Unlike the implementation based on coordinate descent, this yields the exact solution, which is piecewise linear as a function of the norm of its coefficients [9].

• KNeighborsRegressor: regression based on k-nearest neighbors. The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set [10].

• SVR, LinearSVR and NuSVR: these rely on defining a loss function that ignores errors situated within a certain distance of the true value, often called an epsilon-insensitive loss function. LinearSVR is a scalable linear support vector machine for regression implemented using liblinear; NuSVR is a support vector machine for regression implemented using libsvm, with a parameter to control the number of support vectors [11].

• ExtraTreesRegressor: a meta-estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting [12].

• GradientBoostingRegressor: gradient boosting for regression. GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage, a regression tree is fit on the negative gradient of the given loss function [13].

• KernelRidge: kernel ridge regression (KRR) combines ridge regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space [14].

5. The descriptions are taken verbatim from the explanations on scikit-learn, as we are using the libraries from Scikit-learn (BSD License), www.scikit-learn.org
6. scikit-learn.org/stable/modules/generated/sklearn.linear_model.ARDRegression.html
7. scikit-learn.org/stable/modules/generated/sklearn.linear_model.BayesianRidge.html
8. scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lars.html
9. scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLars.html
10. scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
11. http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html
12. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html
13. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
14. http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html
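A minimal sketch of how several of these scikit-learn regressors can be compared on a common chronological train/test split. The data here is synthetic; the loop mirrors, but is not, the report's actual evaluation harness:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import ARDRegression, BayesianRidge, LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the monthly feature matrix (rows = months,
# columns = oil, gold, NYSE, S&P 500, USD index).
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X @ np.array([0.8, 0.3, 0.1, 0.1, -0.2]) + rng.normal(scale=0.1, size=120)

# Chronological split: train on the first 100 months, test on the rest.
X_tr, X_te, y_tr, y_te = X[:100], X[100:], y[:100], y[100:]

models = {
    "LinearRegression": LinearRegression(),
    "ARDRegression": ARDRegression(),
    "BayesianRidge": BayesianRidge(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "KernelRidge": KernelRidge(alpha=1.0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    mae = np.mean(np.abs(pred - y_te))
    print(f"{name:26s} MAE = {mae:.3f}")
```

Each estimator shares the same fit/predict interface, which is what makes sweeping over a dozen regressors this cheap.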

Figure 5.1: Mean Relative Error (in %) vs Number of months. Different lines represent different regression algorithms

5 The Results

We start by feeding an input vector made up of all 5 parameters correlated with oil into 12 different regression algorithms. We use these models to predict up to 5 months into the future and measure their performance using 3 different error metrics, namely Mean Relative Error, Mean Absolute Error and RMS Error. The results are as follows:
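One way to set up the months-ahead prediction described above is to pair each month's feature vector with the oil price a fixed horizon later. `make_supervised` is a hypothetical helper, not code from the project:

```python
import numpy as np

def make_supervised(series_matrix, horizon):
    """Pair each month's feature vector with the oil price `horizon`
    months later. Column 0 is assumed to hold the oil price."""
    X = series_matrix[:-horizon]
    y = series_matrix[horizon:, 0]
    return X, y

# Toy data: 24 months x 5 factors (oil, gold, NYSE, SP500, USD index).
data = np.arange(24 * 5, dtype=float).reshape(24, 5)
X, y = make_supervised(data, horizon=3)
print(X.shape, y.shape)  # (21, 5) (21,)
```

Repeating this for horizons 1 through 5 produces one supervised dataset per forecast distance, which is how a single regression algorithm yields one error curve per metric.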

Mean Relative Error

MRE = (1/n) ∑_{i=1}^{n} |P*_i − P_i| / P_i

Mean Relative Error is the mean of the absolute difference between predicted values P*_i and original values P_i, divided by the original value; we report it as a percentage (multiplied by 100). This graph shows the mean relative error (%) when a particular regression algorithm is used to predict up to 5 months into the future. The input vector for this graph takes into account all our initial parameters, i.e. Oil Prices, Gold Prices, NYSE, SP500 and USD Index, to predict future oil prices.


Figure 5.2: Mean Absolute Error vs Number of months. Different lines represent different regression algorithms

Mean Absolute Error

MAE = (1/n) ∑_{i=1}^{n} |P*_i − P_i|

Mean Absolute Error is the mean of the absolute difference between predicted values and original values. This graph shows the mean absolute error when a particular regression algorithm is used to predict up to 5 months into the future. The input vector for this graph takes into account all our initial parameters, i.e. Oil Prices, Gold Prices, NYSE, SP500 and USD Index, to predict future oil prices.

RMS Error

RMSE = sqrt( (1/n) ∑_{i=1}^{n} (P*_i − P_i)² )

RMS Error is the square root of the mean of the squared differences between predicted values and original values.
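The three error metrics above can be transcribed directly into NumPy; the prices in the worked example are made up:

```python
import numpy as np

def mre(pred, actual):
    """Mean Relative Error, reported as a percentage."""
    return np.mean(np.abs(pred - actual) / actual) * 100

def mae(pred, actual):
    """Mean Absolute Error."""
    return np.mean(np.abs(pred - actual))

def rmse(pred, actual):
    """Root Mean Square Error."""
    return np.sqrt(np.mean((pred - actual) ** 2))

# Tiny worked example with made-up prices.
actual = np.array([100.0, 50.0, 25.0])
pred = np.array([110.0, 45.0, 25.0])
print(mre(pred, actual))   # (0.1 + 0.1 + 0.0)/3 * 100 ≈ 6.67
print(mae(pred, actual))   # (10 + 5 + 0)/3 = 5.0
print(rmse(pred, actual))  # sqrt((100 + 25 + 0)/3) ≈ 6.45
```

Note that MRE and MAE differ only by the normalization by the actual price, which is why MRE is the more comparable metric across periods of high and low oil prices.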


Figure 5.3: Root Mean Square Error vs Number of months. Different lines represent different regression algorithms

This graph shows the root mean square error when a particular regression algorithm is used to predict up to 5 months into the future. The input vector for this graph takes into account all our initial parameters, i.e. Oil Prices, Gold Prices, NYSE, SP500 and USD Index, to predict future oil prices.


Figure 5.4: Comparison of all predictive models, predicting one month into the future with all parameters in the input vector

Comparison of Error Metrics

Based on the above metrics, we decided to use Linear Regression and Bayesian Ridge for further evaluation.

5.1 Further evaluation

After finalising the two regression algorithms, we started tweaking our input vector, which initially consisted of all 5 factors. Five parameters can be combined into 2^5 = 32 different subsets. We tested these and selected the 5 subsets which gave the most consistent results with minimum errors. This gave us 5 input vectors for each of the two regression algorithms chosen above, i.e. 10 models to work with:

• Param: Oil
• Param: Oil, Gold
• Param: Oil, SP500, Gold
• Param: Oil, NYSE, Gold
• Param: Oil, NYSE, USD, Gold
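The 32 candidate input vectors are simply the power set of the five factors, which can be enumerated with `itertools.combinations`; the factor names here are shorthand labels:

```python
from itertools import combinations

factors = ["Oil", "Gold", "SP500", "NYSE", "USD"]

# Enumerate all 2**5 = 32 subsets of the five factors (including the
# empty set, which is of course discarded as an input vector).
subsets = [list(combo)
           for r in range(len(factors) + 1)
           for combo in combinations(factors, r)]
print(len(subsets))  # 32
```

Each non-empty subset defines one candidate input vector to train and score, and the five lowest-error subsets are the ones kept.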


Figure 5.5: Mean Relative Error (in %) vs Number of months. Different lines represent different regression algorithms

We used these 10 models to again predict up to 5 months into the future and measured their performance using the same 3 error metrics, namely Mean Relative Error, Mean Absolute Error and RMS Error. The input vector for each model consists of the factors in its name.

The results corresponding to these 10 models are:

Mean Relative Error

MRE = (1/n) ∑_{i=1}^{n} |P*_i − P_i| / P_i

Mean Relative Error is the mean of the absolute difference between predicted values and original values, divided by the original value; we report it as a percentage (multiplied by 100). This graph shows the mean relative error (%) for the 10 models that we selected to predict oil prices up to 5 months into the future.


Figure 5.6: Mean Absolute Error vs Number of months. Different lines represent different regression algorithms

Mean Absolute Error

MAE = (1/n) ∑_{i=1}^{n} |P*_i − P_i|

Mean Absolute Error is the mean of the absolute difference between predicted values and original values. This graph shows the mean absolute error for the 10 models that we selected to predict oil prices up to 5 months into the future.

RMS Error

RMSE = sqrt( (1/n) ∑_{i=1}^{n} (P*_i − P_i)² )

RMS Error is the square root of the mean of the squared differences between predicted values and original values.


Figure 5.7: Root Mean Square Error vs Number of months. Different lines represent different regression algorithms

This graph shows the root mean square error for the 10 models that we selected to predict oil prices up to 5 months into the future.


Comparison of Error Metrics

Figure 5.8: Comparison of our final predictive models, predicting three months into the future

We can see that Bayesian Ridge regression with the parameters Oil, NYSE and Gold gives slightly better results than all other models. We will use this model to predict future oil prices.
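A minimal sketch of fitting the chosen model, scikit-learn's BayesianRidge, on an Oil/NYSE/Gold input vector. The data here is synthetic, and the predictive standard deviation from `return_std` is a by-product of the Bayesian model, not something the report reports:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic stand-in for the winning input vector (Oil, NYSE, Gold);
# the real model is trained on the inflation-adjusted monthly series.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))  # columns: oil, NYSE, gold
y = X @ np.array([0.9, 0.2, 0.3]) + rng.normal(scale=0.05, size=100)

model = BayesianRidge().fit(X, y)

# Besides a point forecast, BayesianRidge exposes a predictive standard
# deviation, a useful extra when forecasting a volatile commodity.
pred, std = model.predict(X[-1:], return_std=True)
print(pred[0], std[0])
```

The spherical Gaussian prior on the weights is what regularizes the fit without a hand-tuned penalty, which is the property highlighted in the method description above.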

6 Conclusions

We conclude that the regression algorithms that work best under our evaluation metrics of Mean Relative Error, Mean Absolute Error and RMS Error are Linear Regression and Bayesian Ridge regression, trained on an inflation-adjusted monthly dataset of the past 30 years. Using these two algorithms, we tested various combinations of the input vector and found that the most useful parameters for crude oil prediction are Oil, NYSE and Gold. We have also used this model to predict oil prices 3 months into the future (Figure 5.9).

Figure 5.9: Predicted prices of oil

