DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016

Pricing Single Malt Whisky: A Regression Analysis

SANNE BJARTMAR HYLTA
EMMA LUNDQUIST

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ENGINEERING SCIENCES


Degree Project in Applied Mathematics and Industrial Economics (15 credits)
Degree Programme in Industrial Engineering and Management (300 credits)
Royal Institute of Technology, year 2016
Supervisors at KTH: Thomas Önskog, Jonatan Freilich
Examiner: Henrik Hult

TRITA-MAT-K 2016:05 ISRN-KTH/MAT/K--16/05--SE

Royal Institute of Technology
SCI School of Engineering Sciences
KTH SCI, SE-100 44 Stockholm, Sweden
URL: www.kth.se/sci

Abstract

This thesis examines the factors that affect the price of whisky. Multiple regression analysis is used to model the relationship between the identified covariates that are believed to impact the price of whisky. The optimal marketing strategy for whisky producers in the regions Islay and Campbeltown is discussed; this analysis is based on the Marketing Mix. Furthermore, a Porter's five forces analysis focusing on the regions Campbeltown and Islay is carried out. Finally, the findings are summarised in a marketing strategy recommendation for producers in the regions Campbeltown and Islay. The results from the regression analysis show that the covariates alcohol content and region affect price the most. The small regions Islay and Campbeltown, with few distilleries, have a strong positive impact on price, while whisky from unspecified regions in Scotland has a negative impact on price. The alcohol content has a positive, non-linear impact on price. The thesis concludes that the positive relationship between alcohol content and price is not due to the alcohol taxes in Sweden, but that customers are prepared to pay more for a whisky with higher alcohol content. In addition, it concludes that whisky from small regions with few distilleries commands a higher price. The origin and tradition of whisky have a significant impact on price and should thus be emphasised in the marketing strategy of these companies.


Sammanfattning (Swedish abstract)

This bachelor thesis examines the factors that affect the price of whisky. Multiple regression analysis is used to model the relationship between the identified variables believed to impact the price of whisky. Furthermore, the optimal marketing strategy for whisky producers in the regions Islay and Campbeltown is discussed. The analysis is based on a Marketing Mix analysis for whisky in Scotland, followed by Porter's five forces model with focus on the regions Islay and Campbeltown. Finally, the results are summarised in a recommended marketing strategy for producers in the regions Islay and Campbeltown. The results of the regression analysis show that the covariates alcohol content and region have the greatest impact on price. The small regions Islay and Campbeltown, with few distilleries, have a strong positive effect on price, whereas whisky from unspecified regions in Scotland has a negative effect. The alcohol content has a positive, non-linear effect on price. The thesis concludes that the positive relationship between alcohol and price cannot be explained by Sweden's alcohol tax, but that customers are prepared to pay more for a whisky with higher alcohol content. Furthermore, it is found that small regions with few distilleries result in a higher whisky price. The origin and tradition of the whisky have a large impact on price and should therefore be emphasised in the marketing.


Contents

1 Introduction
  1.1 Background
    1.1.1 Thesis Background
    1.1.2 Whisky in Scotland
    1.1.3 Single malt whisky production
  1.2 Problem definition
  1.3 Purpose and aim

2 Mathematical Background
  2.1 Multiple regression analysis
    2.1.1 Description
    2.1.2 Ordinary Least Squares
  2.2 Assumptions in Ordinary Least Squares method
    2.2.1 Homoscedasticity and no multicollinearity
    2.2.2 Normally distributed residuals
    2.2.3 Strict exogeneity
  2.3 Errors
    2.3.1 Multicollinearity
    2.3.2 Heteroscedasticity
    2.3.3 Endogeneity
  2.4 Model validation
    2.4.1 Hypothesis testing
    2.4.2 t-test and hypothesis testing
    2.4.3 R²
    2.4.4 F-test
    2.4.5 BIC - Bayesian information criterion
    2.4.6 AIC - Akaike information criterion
    2.4.7 Combining AIC and BIC

3 Method
  3.1 Data Collection
  3.2 Variables
    3.2.1 Response Variable
    3.2.2 Covariates
  3.3 Initial model

4 Results
  4.1 Initial model
  4.2 Initial model validation
    4.2.1 Residual diagnostics
    4.2.2 F-statistic and p-value
    4.2.3 R² and adjusted R²
    4.2.4 VIF-test
  4.3 Reducing the model
  4.4 Final model
  4.5 Final model validation
    4.5.1 Residual diagnostics
    4.5.2 F-statistic and p-value
    4.5.3 R² and adjusted R²
    4.5.4 VIF-test

5 Discussion and conclusion
  5.1 Analysis of covariates in the Final model
    5.1.1 Alcohol content
    5.1.2 Islay
    5.1.3 Campbeltown
    5.1.4 Other
  5.2 Discussion and conclusion of mathematical model

6 Whisky in Scotland from a marketing perspective
  6.1 Introduction
  6.2 Marketing Mix
    6.2.1 Product & Consumer wants and needs
    6.2.2 Price & Cost
    6.2.3 Promotion & Communication
    6.2.4 Distribution & Convenience
  6.3 Porter's Five Forces Analysis
    6.3.1 Theory
    6.3.2 Analysis of Whisky from Islay and Campbeltown
  6.4 Recommendation of marketing strategy for producers in Islay and Campbeltown

7 References

List of Figures

1 Map of Scotland and its whisky regions
2 Scale-Location plot for the Initial Model
3 Residuals versus fitted values for the Initial Model
4 Normal QQ-plot for the standardized residuals, Initial Model
5 Scale-Location plot for the Final Model
6 Residuals versus fitted values for the Final Model
7 Normal QQ-plot for the standardized residuals, Final Model
8 Porter's Five Forces [14]

List of Tables

1 Table of the response variable and covariates in the initial model
2 Regression results for the initial model
3 Continued, regression results for the initial model
4 Table of VIF-test values for the covariates in the initial model
5 Covariate data and statistics
6 Regression results for the final model
7 Continued, regression results for the final model
8 Table of VIF-test values for the covariates in the final model

1 Introduction

1.1 Background

1.1.1 Thesis Background

Malt whisky has historically been an important spirit with its many classes and types, especially in Scotland. Since the variation in grain, processing, maturation, region of origin and alcohol content is vast, the price of whisky differs significantly. Generally the price of a product reflects its quality and content, but when additional factors affect the price, the price model becomes more complex. This is the case for whisky, where it is difficult to derive the exact price of a given bottle. As whisky is considered such a traditional spirit, especially in Scotland, the brands and the distilleries' regions are important to certain consumers. Thus, a whisky's origin has a large impact on price.

1.1.2 Whisky in Scotland

Whisky is Scotland's national drink and is made of fermented grain mash. The fermentation, together with the distillation and the ageing in wooden barrels, are the most significant characteristics distinguishing different classes of whisky. Various grains such as barley, corn, wheat, buckwheat and rye can be used, and each gives a unique, characteristic taste. For single malt whisky the grain is malted barley, but there are also a number of different barley varieties, resulting in different tastes [8]. In this thesis we only analyse single malt whisky from Scotland, i.e. malt whisky from one distillery only. Since the thesis aims to analyse the impact of origin, blended whiskies from several distilleries cannot be included. Scotland is divided into the following six whisky regions:

1. Lowlands: The most southern region, with a light and neutral whisky.

2. Highlands: The most northern and biggest region, where the whisky is elegant and tasty with a little sweetness.

3. Speyside: Located in the northeastern corner of Scotland, sometimes counted as part of the Highlands, with a large number of distilleries. The taste of the whisky is sweet, fruity and complex.

4. Islands: Covering the islands in the north west, excluding the island of Islay. Sometimes considered part of the Highlands. The whisky is extremely varied with few similarities, but generally smoky with peaty undertones and marked salinity.

5. Islay: One of the southernmost islands, with nine active distilleries. The whisky is powerful with a smoky, peaty character.

6. Campbeltown: The area around Campbeltown is a historical whisky region with only three distilleries remaining. The characteristics of the whisky include a defined dryness with pungency, smoke and a solid salinity. [8]

Figure 1: Map of Scotland and its whisky regions

1.1.3 Single malt whisky production

The process of making single malt whisky is long and complex. It includes the following steps:

1. Malting: The barley is soaked in water to undergo germination, converting the starch into soluble sugars. The barley is then dried in a kiln, traditionally fired with peat, which influences the taste.

2. Mashing: The malted grain is crushed and mixed with hot water in a mash tun. The sugar in the malt dissolves and is drawn off. This liquid is called wort.

3. Fermentation: The fermentation process starts by adding yeast to the wort. The sugars are turned into alcohol, and at this stage the liquid is called wash.

4. Distillation: In Scotland the wash is traditionally distilled twice. The shape and material of the stills influence the taste, and stills are usually made of copper to remove sulfur-based compounds from the alcohol. First, the wash distillation produces a liquid called low wines, with a low level of alcohol. It is then re-distilled in a spirit still.

5. Maturation: The whisky is matured in oak casks, giving it its characteristic taste. The age of a whisky is the time between distillation and bottling; a whisky that stays bottled for many years still has the same age, but may gain rarity value. The type of cask has a great impact on the taste, since the whisky undergoes a number of processes in it. [8]

1.2 Problem definition

The research question of this thesis is: "What factors impact the price of a bottle of whisky, and which ones have the highest significance?". Many possible factors, such as origin, storage time, processing and alcohol percentage, are believed to influence the price. From a business point of view, pricing is extremely important for whisky producers aiming to maximize profit. Given the price drivers of single malt whisky, the additional research question is: "What marketing strategy should whisky producers use?".

1.3 Purpose and aim

This thesis aims to identify the most important factors affecting the price of single malt whisky in Scotland by creating a pricing model. It is aimed at producers, to give them a clear, scientific pricing model showing which parameters determine the price of single malt whisky in Scotland. This could be a tool for producers to improve their pricing strategy and to use when introducing new products. For example, there might be differences in the optimal strategy for different areas of Scotland, depending on the traditions around them and on how well known they are. In addition, the pricing model can be used by consumers with a great interest in whisky, as the model can provide them with additional information when comparing different bottles of whisky.


2 Mathematical Background

2.1 Multiple regression analysis

2.1.1 Description

Regression analysis is a common technique in mathematical statistics that examines and models the relationship between certain independent variables, called covariates, and the dependent variable they affect, called the response variable. The purpose is to find the function that best fits the observed data. Frequently used models in regression analysis are simple linear regression, multiple linear regression, polynomial regression, logistic regression and nonlinear regression. Regression analysis can be applied to many different fields, including engineering, economics, management, social sciences and biotechnology [1]. In this paper we examine a relationship between several variables, and thus multiple linear regression is used.

The response variable, y_i, in multiple regression analysis is a set of observations that depend on the covariates, x_{ij}, which are regarded as deterministic. The coefficients β_j are estimated to fit the regression. The residuals, ε_i, are random variables that are independent between observations. The definition of the linear regression model yields [3]:

    y_i = \sum_{j=0}^{k} x_{ij}\beta_j + \varepsilon_i, \qquad i = 1, 2, \ldots, n    (1)

This can also be expressed in matrix form:

    Y = X\beta + \varepsilon    (2)

where E[\varepsilon] = 0 and E[\varepsilon\varepsilon^T] = I\sigma^2, and

    Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad
    X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix}, \quad
    \beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad
    \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}

• β contains the unknown coefficients corresponding to the covariates X. The regression estimates these values from the data; the estimates of β explain what effect a certain covariate has on the response variable.

• E[ε] = 0 means that the residuals are not biased or affected by the dependent variable, and E[εε^T] = Iσ² states that the variance is the same for the different observations.

• ε = (ε_1, ..., ε_n)^T is the error term containing the residuals.

In our work and analysis a structural interpretation will be used, not prediction. The covariates should hence influence the dependent variable, but never the other way around. This assumption makes hypothesis testing possible.

2.1.2 Ordinary Least Squares

Ordinary Least Squares (OLS) is a method used to estimate the values of β, which is required when performing a regression analysis. The method yields the optimal estimate of the parameter β, provided that all the assumptions of the regression and of the OLS method are met [3]; these assumptions are discussed further in the following section. The optimal estimates of β_i and of the residuals ε_i are denoted β̂_i and ε̂_i. The Least Squares method minimizes the sum of squared residuals, i.e. the following expression [4]:

    \sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (Y - X\hat{\beta})^T (Y - X\hat{\beta})    (3)

The estimate β̂ is obtained by solving the following normal equation for β̂:

    X^T \hat{\varepsilon} = 0    (4)

Hence, the least squares estimate β̂ of β yields:

    \hat{\beta} = (X^T X)^{-1} X^T Y    (5)

Furthermore, the covariance matrix of β̂ can be derived as:

    \mathrm{Cov}(\hat{\beta}) = (X^T X)^{-1} \sigma^2    (6)

An unbiased estimate of σ² yields:

    \hat{\sigma}^2 = \frac{1}{n-k-1} |\hat{\varepsilon}|^2    (7)

where n defines the number of observations and k the number of covariates.
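To make the estimator concrete, the following is a minimal NumPy sketch of equations (3)-(7); the function name and variable names are ours (illustrative, not from the thesis), and X is assumed to be a design matrix whose first column is ones, as in equation (2).

    import numpy as np

    def ols_fit(X, y):
        """OLS via the normal equations: beta-hat (5), Cov (6), sigma2-hat (7).

        X: (n, k+1) design matrix with a leading column of ones.
        y: (n,) response vector.
        """
        n, p = X.shape                          # p = k + 1 estimated parameters
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y            # equation (5)
        resid = y - X @ beta_hat                # the minimizer of (3)
        sigma2_hat = (resid @ resid) / (n - p)  # equation (7): divisor n - k - 1
        cov_beta = sigma2_hat * XtX_inv         # equation (6)
        return beta_hat, cov_beta, resid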

2.2 Assumptions in Ordinary Least Squares method

The linear regression model is based on a number of basic assumptions that have to be met. These assumptions must hold, and play an important role, for the OLS method [3]. The major assumptions are stated below; they are also the assumptions that will be investigated in our models in this thesis:

2.2.1 Homoscedasticity and no multicollinearity

All random variables have the same finite variance and they are not correlated with each other:

    E[\varepsilon_i \varepsilon_j] = 0, \quad i \neq j    (8)

    E[\varepsilon_i \varepsilon_j] = \sigma^2, \quad i = j    (9)

2.2.2 Normally distributed residuals

It is assumed that the residuals are normally distributed conditional on the regressors:

    \varepsilon \mid X \sim N(0, \sigma^2 I_n)    (10)

2.2.3 Strict exogeneity

The covariates are not correlated with the residuals; hence the residuals in the regression should have conditional mean zero:

    E[\varepsilon \mid X] = 0    (11)

2.3 Errors

Three common problems that violate the OLS assumptions are presented below.

2.3.1 Multicollinearity

Multicollinearity occurs when two or more covariates are moderately or highly correlated. This means that they can be written as linear combinations of each other, and the effect is increased variance. There are several indicators and tests for multicollinearity. It can be detected using the Variance Inflation Factor (VIF), which provides an index that measures how much the variance of an estimated regression coefficient is inflated because of collinearity [5]. Consider the following linear model with k independent variables:

    Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon    (12)

We can calculate k different VIFs (one for each X_i) in three steps [5]:

1. Run an ordinary least squares regression with X_i as a function of all the other explanatory variables in the first equation. For i = 1 this yields

    X_1 = \alpha_2 X_2 + \ldots + \alpha_k X_k + c_0 + \varepsilon    (13)

where c_0 is a constant and ε is the error term.

2. Calculate the VIF for β̂_i with the following formula:

    \mathrm{VIF} = \frac{1}{1 - R_i^2}    (14)

where R_i² is the coefficient of determination of the regression equation in step one, with X_i on the left-hand side and all other predictor variables on the right-hand side.

3. Analyse the magnitude of multicollinearity by looking at VIF(β̂_i) or the tolerance, T = 1/VIF. A tolerance of less than 0.20 or 0.10, and/or a VIF of 5 or 10 and above, indicates a multicollinearity problem. To get rid of multicollinearity, simply keep one of the correlated variables and omit the other [4].
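As a sketch of the three steps above, the VIF of a single covariate can be computed by hand as follows; the helper function is ours (statsmodels also ships a ready-made variance_inflation_factor, should one prefer a library routine).

    import numpy as np

    def vif(X, i):
        """VIF for covariate column i of X, following steps 1 and 2."""
        y = X[:, i]
        Z = np.delete(X, i, axis=1)                    # the other covariates
        Z = np.column_stack([np.ones(len(y)), Z])      # constant c0 of (13)
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # step 1: auxiliary OLS
        resid = y - Z @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        return 1.0 / (1.0 - r2)                        # step 2: equation (14)

The tolerance of step 3 is then simply 1 / vif(X, i).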

2.3.2 Heteroscedasticity

In linear regression the error terms ε_i are assumed to be homoscedastic, which means that they all have the same variance σ². Heteroscedasticity occurs when this assumption does not hold, i.e. the error terms do not have constant variance. If a heteroscedastic equation is misspecified as homoscedastic, the standard deviations of the residuals vary and cause inconsistent standard deviations of the β-values; the t-test explained later in this chapter is then no longer valid. The model specification is [4]:

    y_i = \sum_{j=0}^{k} x_{ij}\beta_j + \varepsilon_i, \qquad i = 1, 2, \ldots, n    (15)

where E[ε_i] = 0 and E[ε_i²] = σ_i², and the ε_i are independent between observations. Heteroscedasticity can be caused by several factors, for example poor data-collection techniques making σ² fairly large, or skewness in the data distribution [4]. Heteroscedasticity can be detected graphically by studying a residual plot, where the residuals are plotted against the covariates: large variation in the size of the deviations from the x-axis, or unevenly distributed points, indicates heteroscedasticity [3].

Heteroscedasticity can also be detected through a quantile-quantile plot (QQ-plot), which compares two probability distributions by plotting their quantiles against each other. When the residuals are to be tested against a normal distribution, the residuals are plotted against the theoretical normal distribution. A straight line at 45 degrees in the graph indicates that the residuals are approximately normally distributed; points that deviate much from the straight line are not normally distributed, and consequently there might be a problem with heteroscedasticity [6].

White's consistent variance estimator can be implemented to reduce the impact of heteroscedasticity. It is a covariance matrix that can be used instead of the usual covariance matrix. The matrix scales with n/(n-k-1) and is defined as [4]:

    \mathrm{Cov}(\hat{\beta}) = \frac{n}{n-k-1} (X^T X)^{-1} X^T D(\hat{\varepsilon}^2) X (X^T X)^{-1}    (16)

where D(\hat{\varepsilon}^2) is the n \times n diagonal matrix:

    D(\hat{\varepsilon}^2) = \mathrm{diag}(\hat{\varepsilon}_1^2, \ldots, \hat{\varepsilon}_n^2)    (17)
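The two graphical checks just described, residuals versus fitted values and the normal QQ-plot, can be drawn with matplotlib and SciPy; this is an illustrative sketch assuming fitted values and residuals from an already estimated model.

    import matplotlib.pyplot as plt
    from scipy import stats

    def residual_diagnostics(fitted, resid):
        """Residuals-vs-fitted plot and normal QQ-plot: a funnel shape on the
        left, or points bending away from the line on the right, indicate
        heteroscedasticity or non-normal residuals."""
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.scatter(fitted, resid, s=10)
        ax1.axhline(0.0, linestyle="--", color="grey")
        ax1.set(xlabel="Fitted values", ylabel="Residuals")
        stats.probplot(resid, dist="norm", plot=ax2)   # adds the reference line
        fig.tight_layout()
        plt.show()

For the correction itself, statsmodels exposes White-type covariance matrices via cov_type="HC0" or "HC1" when fitting a model; HC1 applies a degrees-of-freedom scaling analogous to the n/(n-k-1) factor in equation (16).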

2.3.3 Endogeneity

Endogeneity is a violation of exogeneity, which is assumed in OLS. It occurs when the expected value of ε_i depends on one or more of the covariates, so that the assumption E[ε_i] = 0 is violated. Endogeneity can arise as a result of measurement errors or of autoregression with autocorrelated residuals [4]. To detect endogeneity, one can analyse the covariates to see whether any covariates that should be in the model have been omitted, or whether the residual is correlated with one or several of the covariates [4]. If the model's covariates are endogenous, another method such as instrumental variables (IV) or Two-Stage Least Squares (2SLS) is recommended, as these give consistent estimates [4].


2.4 Model validation

2.4.1 Hypothesis testing

Hypothesis testing is a statistical method used to examine whether a hypothesis can be rejected at a certain significance level. When using hypothesis testing a further assumption arises, namely that the residuals should be normally distributed, ε_i ∼ N(0, σ²) [1]. The procedure is divided into four steps [7]:

1. Determine a significance level α based on preferences. Formulate the null hypothesis H₀ and an alternative hypothesis H_α.

2. Identify the test statistic, T, that can be used to determine whether H₀ should be rejected or not.

3. Compute the p-value. The p-value gives the probability of a test statistic at least as extreme as the one observed when H₀ is true. The smaller the p-value, the stronger the evidence against H₀.

4. If the p-value ≤ α, the null hypothesis is rejected and the alternative hypothesis is accepted.

2.4.2 t-test and hypothesis testing

The t-test is a common test for identifying the significance of the parameters. When performing the t-test, only one linear constraint can be tested at a time. Examples of parameters that can be used in the test and in the linear constraints are one of the β-values or the correlation coefficient r. When performing the t-test for a β-value, β_i, the linear condition is reformulated and set equal to zero. This forms the null hypothesis [1]:

    H_0: \beta_i = 0, \qquad H_\alpha: \beta_i \neq 0    (18)

The degrees of freedom are set equal to the number of observations minus the number of parameters estimated in the regression. The null hypothesis is that the linear condition is true and that β_i is zero, meaning that the covariate x_i does not influence the response variable. The alternative hypothesis H_α is that the covariate x_i is an influencing factor on the response variable, and therefore that β_i is not zero. The t-value is obtained by dividing the β-value by the standard error of the same β-value [9]:


    t_i = \frac{\hat{\beta}_i}{\mathrm{S.E.}(\hat{\beta}_i)}    (19)

A |t| close to zero indicates that the tested covariate is not explanatory of the response variable, and that it might be excluded from the model. To determine whether the parameter can be excluded, calculate the p-value:

    p\text{-value} = P(T \geq t), \qquad T \in t(n-2)    (20)

Compare the p-value with the chosen significance level α: if p ≤ α, the null hypothesis can be rejected and the alternative hypothesis accepted [9].
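As a worked example of equations (19) and (20), the two-sided p-value of a single coefficient can be computed with SciPy; the degrees of freedom below are an illustrative placeholder, not a figure from the thesis.

    from scipy import stats

    def t_test(beta_hat, se, df):
        """t-value per (19) and two-sided p-value per (20) for H0: beta_i = 0."""
        t = beta_hat / se
        return t, 2.0 * stats.t.sf(abs(t), df)

    # The Islay coefficient from Table 3; df = 990 is a made-up placeholder.
    t, p = t_test(1.338e-1, 5.466e-2, df=990)   # t is about 2.448, as in Table 3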

2.4.3 R²

R² is a measure of goodness of fit and is defined as the square of the sample correlation coefficient between y and the least squares estimate ŷ. R² measures the proportion of variance in the response variable that can be explained by the covariates. In the linear case, which is used in this thesis, the least squares estimate is used, and the total variance in the response variable can be divided into two parts: the explained variance and the unexplained variance. The explained variance is defined as the sum of the squared deviations of the estimated values from their mean; the unexplained variance is defined as the sum of the squared residuals. R² is defined as the ratio of the explained variance to the total variance [4]:

    R^2 = \frac{\mathrm{Var}(X\hat{\beta})}{\mathrm{Var}(y)} = 1 - \frac{\mathrm{Var}(\hat{\varepsilon})}{\mathrm{Var}(y)}    (21)

R² always increases when more covariates are added, even though this does not necessarily mean that the goodness of fit is better. A high R² can be an indication of over-fitting, which means having too many covariates in the model. The adjusted R² is a modified version of R² that has been adjusted for the number of predictors in the model: it increases only if a new term improves the model more than would be expected by chance, and hence peaks at the optimal number of covariates. The adjusted R², R̄², is defined as [10]:

    \bar{R}^2 = 1 - (1 - R^2)\frac{n-1}{n-p}    (22)

where p is the total number of explanatory variables in the model (not including the constant term), and n is the sample size.

2.4.4

F-test

F-test is a statistical test in which the test statistic has a F -distribution under the null hypothesis. F s assigned probability distribution is used to calculate P r(X > F ), where X is a random variable. The F-test is used to test the hypothesis that a number of r of β-values are equal to zero, β1 = β2 = .. = βr = 0. It can be used for testing simple and multiple regression models significance. The test variable is defined as: n−k−1 F = r



 |e∗ |2 −1 |ˆ e|2

(23)

where e∗ is the residual from the restricted regression and eˆ is the residual from the unrestricted regression. The null hypothesis is rejected if the F-statistic is large. A rejected null hypothesis implies that the covariates corresponding to the β-values are significant. The F-test is significant under the assumption that the residuals are normal distributed. The test is asymptotic significant if the residuals are not normal distributed. [4] 2.4.5

BIC - Bayesian information criterion

The Bayesian information criterion can be used to compare different models. BIC is a function that minimize the squares of the residuals and assumes normal distributed residuals. The BIC-value can hence be used to determine which models has a better approximation of the true empirical data. A lower BIC-value indicates a more significant model.The formula for calculating BIC yields: [11]  2 |ˆ e| K ∗ ln(N ) BIC = ln + (24) N N Where N is the number of observations, K is number of covariates + the intercept and eˆ is the residuals from the unrestricted regression. A common approach is to analyse the BIC while removing covariates to find the model that minimizes the BIC. 2.4.6

AIC - Akaike Information criterion

The Akaike Information criterion is a measure of the relative quality of a model for a given set of data. Given a set of models for the data, AIC estimates the quality of each model relative to the other models. It also denotes the relative quantity of information lost when a given model, with estimated parameters, is compared to the true process that generated the data. An AIC-test can be used to test if one or more covariates would enter the equation. Similar to the BIC, one should choose the model that minimizes: [4]

16

AIC = N ∗ ln(|ˆ e|2 ) + 2K

(25)

Where N is the number of observations, K is number of covariates + the intercept and eˆ is the residuals from the unrestricted regression. 2.4.7

Combining AIC and BIC

In this thesis we will analyse the model from both an AIC and BIC perspective when determining our final model. We take this approach as BIC often gives a model with too few covariates and the AIC often choose a model with too many covariates. [11]

17

3

Method

3 Method

3.1 Data Collection

Variables Response Variable

The response variable for the regression analysis is the price of a 700 ml bottle of whisky, y. To eliminate heteroscedasticity the prices were logtransformed. Since the aim is to find what impact regions, alcohol and storage time have on the price the choice of response variable was obvious. 3.2.2

Covariates

The covariates in the regression are all factors believed to have an impact on the whisky price, including both qualitative and quantitative variables. To analyse qualitative factors the use of dummy variables is central. The dummy variable takes the value of one if the 18

statement is true and zero otherwise. Storage time [years] The storage time has a great impact on the taste of whisky and is therefore believed to affect the price as well. Whisky without known storage time have been excluded in this regression. Alcohol content [%] The alcohol content ranged from 40.00% to 65.10% in the set of whisky used in the regression and it is of importance to examine whether this has an effect on the price or not. The alcohol covariate was set to be the difference from 40.00%, to create positive numbers ranging from 0.00 to 25.10. Since 40.00% is the minimum level of alcohol this should be used as reference, as it is only of interest to see how much a change in alcohol content from this level affects the price. The new alcohol content covariate gives a clear result of the impact of one unit’s increase. Region [dummy variables] The covariate Region was cathegorized into following six regions: Lowlands, Highlands, Speyside, Islands, Islay and Campbeltown. For each whisky region in Scotland a dummy variable was created, taking the value of one if a whisky was produced in the region and zero otherwise. The benchmark used was Highlands since it is the biggest region containing many data points.

3.3

Initial model

The initial model included the response variable, all covariates described in section 5.2.2 and their corresponding coefficients. Table 1: Table of the response variable and covariates in the initial model Variable Description Unit Response Variable y Price SEK Covariates x1,i Alcohol Content % above 40.00% x2,i Storage Time Years x3,i Storage Time2 Years2 x4,i Lowlands Dummy; 0 or 1 x5,i Speyside Dummy; 0 or 1 x6,i Islands Dummy; 0 or 1 x7,i Islay Dummy; 0 or 1 x8,i Campbeltown Dummy; 0 or 1 19

The initial model was thus: log(yi ) = β0 +x1,i β1 +x2,i β2 +x3,i β3 +x4,i β4 +x5,i β5 +x6,i β6 +x7,i β7 +x8,i β8 +i ,

20

i = 1, 2, .., n (26)

4 4.1

Results Initial model

All covariates were included in the initial model except the benchmark ”Highlands”. Highland was assigned benchmark as this region had the highest quantity and most consistent data. This is a standard procedure when conducting a regression with dummy variables, in order to reduce the risk for linear dependency and multicollinearity. Table 2: Regression results for the initial model Model R2 Adjusted R2 F-statistics p-value −1 Initial Model 1.8663*10 1.7688*10−1 8.884 < 7.283 ∗ 10−13

Table 3: Continued, Regression results for the initial model Coefficients Intercept Alcohol Content Storage Time Storage Time Squared Lowlands Speyside Islands Islay Campbeltown Other

Estimated β 6.873 1.944*10−2 -7.124*10−4 3.877*10−5 1.055*10−1 2.677*10−2 −6.489 ∗ 10−2 1.338 ∗ 10−1 2.617*10−1 −2.812 ∗ 10−1

Standard Error 4.937*10−2 2.926*10−3 3.873*10−3 9.755*10−5 8.593*10−2 3.906*10−2 6.797 ∗ 10−2 5.466 ∗ 10−2 1.026*10−1 7.889 ∗ 10−2

21

t-value 139.211 6.645 -1.840*10−1 3.970*10−1 1.228 6.850*10−1 −9.550 ∗ 10−1 2.448 2.550 −3.565

p-value

Suggest Documents