Regression is one of the most useful statistical methods for data analysis. Multicollinearity poses a challenge to regression analysis by inflating the standard errors of the estimators, making the model less predictive and harder to interpret. Penalized regression, a shrinkage and variable selection technique, was developed to mitigate the problem of multicollinearity and to reduce the shortcomings in prediction accuracy of the ordinary least squares (OLS) regression technique. In this thesis, the focus is on a numerical study of three penalized methods: the least absolute shrinkage and selection operator (LASSO), the elastic net, and the newly introduced correlation adjusted elastic net (CAEN). A diabetes dataset that was shown to exhibit multicollinearity was obtained from previous literature to compare these techniques. Ten-fold cross validation (CV) within the glmnet package was used to search for the optimal λ. The whole path of results (in λ) for the LASSO, elastic net and CAEN models was calculated using the pathwise cyclic coordinate descent (CCD) algorithm in the glmnet package in R, a computationally efficient technique for solving these convex optimization problems. Regularization profile plots of the coefficient paths for the three methods are also shown. Predictive accuracy was assessed using the mean squared error (MSE), and the penalized regression models produced feasible and efficient models that captured the linear structure in the data better than the ordinary least squares model. It was observed that the correlation adjusted elastic net generates a less complex model with a minimum mean squared error (MSE).
1.1 Background of the study
In multiple linear regression analysis, variable selection becomes an important issue when a large number of predictor variables are introduced into a model to reduce possible modeling biases, or when there is serious concern about multicollinearity among the predictor variables. Regression is one of the most useful statistical methods for data analysis. However, many practical problems and computational issues, such as multicollinearity and high dimensionality, pose a challenge to regression analysis. To deal with these challenges, variable selection and shrinkage estimation are becoming important and popular. The traditional approaches of automatic selection (such as forward selection, backward elimination and stepwise selection) and best subset selection are often computationally expensive and may not necessarily produce the best model. The method of penalized least squares (PLS), which is equivalent to penalized maximum likelihood under normally distributed errors, helps to deal with multicollinearity by putting constraints on the values of the estimated parameters. A useful consequence is that the entries of the variance-covariance matrix are reduced significantly.
Suppose multicollinearity is detected and the predictor variables that cause it are identified. As discussed by Ryan (2009), multicollinearity may not be a problem if the goal is to use the linear regression model for prediction. However, multicollinearity is a problem if the model is used for description or control. Multicollinearity implies that the predictor variables form groups, and within each group the predictor variables are highly correlated. One solution is to remove one or more of the predictor variables within the same group, but deciding which ones to eliminate tends to be a difficult technical task. A major consequence of multicollinearity is that the parameter estimators and their variances tend to be large. Therefore, inference on the response is highly variable.
To deal with the challenges mentioned above, penalized regression approaches, also called shrinkage or regularization methods, have been developed. Although shrinking some of the regression coefficients toward zero may result in biased estimates, these estimates will have smaller variance. This can enhance prediction accuracy because of a smaller mean squared error (Hastie et al., 2009). Regression coefficients are shrunk by imposing a penalty on their size, which is done by adding a penalty function to the least-squares criterion. Moreover, some of these procedures, for example the least absolute shrinkage and selection operator (LASSO), enable variable selection such that only the important predictor variables stay in the model (Szymczak et al., 2009).
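For reference, the penalized least-squares criterion described above can be written out for two of the methods. The notation ($y_i$, $x_{ij}$, $\beta$, $\lambda$, $\alpha$) is assumed here and follows the parameterization commonly used with glmnet rather than anything fixed by the text; the CAEN penalty is left out because its exact form is not stated in this section.

```latex
% Assumed notation: n observations, p predictors, tuning parameters
% \lambda \ge 0 (overall penalty strength) and 0 \le \alpha \le 1 (mixing weight).
\begin{align*}
(\hat\beta_0, \hat\beta)
  &= \arg\min_{\beta_0,\,\beta}\;
     \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^{2}
     + \lambda\, P_\alpha(\beta), \\
P_\alpha(\beta)
  &= \alpha \sum_{j=1}^{p} |\beta_j|
     + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^{2}.
\end{align*}
% \alpha = 1 gives the LASSO penalty, 0 < \alpha < 1 the elastic net,
% and \alpha = 0 the ridge penalty; \lambda = 0 recovers ordinary least squares.
```

The absolute-value term is what can set coefficients exactly to zero (variable selection), while the quadratic term shrinks correlated coefficients toward each other.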
1.2 Statement of the Problem
When perfect multicollinearity exists in a model, the parameter estimates of the multiple linear regression model are not unique; under near-perfect multicollinearity they exist but are highly unstable. In practice, perfect collinearity occurs rarely; what we often have is near-perfect collinearity. Quite often, however, we face multicollinearity when there are strong linear relationships among two or more predictor variables. This happens when two or more predictor variables contribute more or less to the same characteristic of the subjects. In recent years, alternative methods have been introduced to deal with multicollinearity. In particular, penalization methods, also known as simultaneous shrinkage and variable selection methods, have become popular and useful. The purpose of this study is to assess the statistical performance of the LASSO, elastic net and the newly introduced correlation adjusted elastic net (CAEN) regression methods.
1.3 Research Motivation
The motivation for using penalized regression is that in the presence of perfect multicollinearity the ordinary least squares estimates are not unique, and under nearly perfect multicollinearity they are unstable. With penalized least squares, however, the estimates become unique when appropriate tuning parameters are chosen. Similarly, without penalization, the ordinary least squares estimators are subject to high variability when multicollinearity exists; with penalization, the variances of the estimators are controlled. Most of the comparisons done by other researchers were between the LASSO and the elastic net. This research compares the LASSO, the elastic net and the newly introduced correlation adjusted elastic net, and assesses the advantages of using these methods over the classical least squares technique. It attempts to accentuate some of the differences among them using numerical results.
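A small worked example may make the uniqueness claim concrete. It assumes, purely for illustration, a model without intercept, two identical predictors $x_{i1} = x_{i2} = x_i$ (perfect collinearity), and a ridge-type penalty with tuning parameter $\lambda_2 > 0$, such as the quadratic part of the elastic net. These symbols are introduced here and are not taken from the text above.

```latex
% Assumed setup: x_{i1} = x_{i2} = x_i for all i, no intercept.
\begin{align*}
\mathrm{RSS}(\beta_1,\beta_2)
  &= \sum_{i=1}^{n}\bigl(y_i - x_i(\beta_1+\beta_2)\bigr)^{2}, \\
\text{OLS:}\quad
\hat\beta_1+\hat\beta_2
  &= \frac{\sum_i x_i y_i}{\sum_i x_i^{2}}
  \quad\text{(only the sum is determined, so the solution is not unique)}, \\
\text{Penalized:}\quad
  &\min_{\beta_1,\beta_2}\; \mathrm{RSS}(\beta_1,\beta_2)
     + \lambda_2\bigl(\beta_1^{2}+\beta_2^{2}\bigr), \\
\hat\beta_1=\hat\beta_2
  &= \frac{\sum_i x_i y_i}{2\sum_i x_i^{2}+\lambda_2}
  \quad\text{(unique for every } \lambda_2>0\text{)}.
\end{align*}
```

For a fixed sum $\beta_1+\beta_2$, the quadratic term is smallest when the two coefficients are equal, which is what singles out one solution. A pure absolute-value penalty would not do so in this particular example, since $|\beta_1|+|\beta_2|$ is the same for every same-sign split of the sum.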
1.4 Aim and objectives of the study
The main aim of this research is to assess the performance and advantages of using LASSO, Elastic Net and CAEN methods over the classical regression methods. We hope to achieve this aim through the following objectives:
i. Applying penalized regression methods to address multicollinearity.
ii. Identifying the variables that possess the characteristics of multicollinearity using the Variance Inflation Factor, and
iii. Identifying the number of variables selected by each of the penalized regression methods and by the classical least squares method.
1.5 Significance of the study
The significance of this study lies in detecting variables that exhibit multicollinearity in a regression model and in showing why penalized methods are preferred over the classical least squares technique when multicollinearity is present. To this end, we explored and compared three penalized methods used to address multicollinearity. This work also aims to assist researchers in deciding which technique to use when they encounter multicollinearity.
1.6 Scope and limitations of the study
This research is limited to the use of the leave-one-out cross validation (LOOCV) criterion to determine the number of variables selected by each of the methods under study, and to the use of the mean squared error to assess the predictive accuracy of the methods. The research also gives an overview of each procedure in order to highlight the similarities and differences among the three penalized methods with respect to variable selection.
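The combination of leave-one-out cross validation and mean squared error described above can be sketched in a few lines. This is a minimal illustration in standard-library Python, not the glmnet implementation used in the study: the lasso is fitted by a plain cyclic coordinate descent with a fixed number of sweeps, predictors are centered but not standardized, and the data, the λ grid and all function names are made up for the example.

```python
# Sketch: lasso by cyclic coordinate descent, tuned by leave-one-out CV.
# All names and data below are illustrative assumptions.

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def fit_lasso(X, y, lam, sweeps=200):
    """Approximately minimise (1/2n) * RSS + lam * sum(|b_j|).

    Runs a fixed number of coordinate sweeps (no convergence check).
    The intercept is handled by centering and is not penalised.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    norms = [sum(v * v for v in col) / n for col in cols]
    resid = [v - y_mean for v in y]      # residual when every b_j is 0
    b = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            if norms[j] == 0.0:
                continue                 # constant column: leave b_j at 0
            # Covariance of predictor j with the partial residual.
            rho = sum(c * r for c, r in zip(cols[j], resid)) / n + norms[j] * b[j]
            new_bj = soft_threshold(rho, lam) / norms[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, cols[j])]
                b[j] = new_bj
    b0 = y_mean - sum(bj * m for bj, m in zip(b, x_mean))
    return b0, b


def predict(b0, b, row):
    return b0 + sum(bj * v for bj, v in zip(b, row))


def loocv_mse(X, y, lam):
    """Mean squared prediction error, leaving out one observation at a time."""
    errors = []
    for i in range(len(X)):
        b0, b = fit_lasso(X[:i] + X[i + 1:], y[:i] + y[i + 1:], lam)
        errors.append((y[i] - predict(b0, b, X[i])) ** 2)
    return sum(errors) / len(errors)


# Made-up data: the second predictor is almost twice the first.
X = [[1.0, 2.1, 5.0], [2.0, 3.9, 1.0], [3.0, 6.2, 4.0], [4.0, 7.8, 2.0],
     [5.0, 10.1, 6.0], [6.0, 12.2, 3.0], [7.0, 13.8, 8.0], [8.0, 16.1, 7.0]]
y = [3.2, 5.9, 9.1, 11.8, 15.2, 18.1, 20.8, 24.1]

lambdas = [0.01, 0.1, 0.5, 1.0, 2.0]
cv = {lam: loocv_mse(X, y, lam) for lam in lambdas}
best_lam = min(cv, key=cv.get)
b0, b = fit_lasso(X, y, best_lam)
print("LOOCV MSE by lambda:", cv)
print("selected lambda:", best_lam, "non-zero coefficients:", sum(v != 0.0 for v in b))
```

The number of non-zero coefficients at the selected λ is the "number of variables selected" in the sense used above; a production analysis would rely on a tested library and a finer λ grid.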
Multicollinearity is another important issue in multiple regression. Collinearity means that a linear relationship exists between two or more predictor variables, while multicollinearity refers to a situation in which two or more predictor variables are highly linearly correlated. The most extreme case is perfect collinearity (or multicollinearity), where the linear correlation between two predictor variables is either -1 or 1. This happens, for example, when two predictor variables $X_1$ and $X_2$ satisfy $X_2 = a + bX_1$ for two real numbers a and b with $b \neq 0$.
In the presence of perfect multicollinearity, parameter estimates of the population multiple linear regression model are not unique. In practice, perfect collinearity occurs rarely. Quite often, however, we face multicollinearity when there are strong linear relationships among two or more predictor variables. This happens when two or more predictor variables contribute more or less to the same characteristic of the subjects. For a matrix A, let $A^{T}$ be its transpose and $A^{-1}$ be its inverse matrix, if it exists. When predictor variables are highly linearly correlated, the most significant consequence is that the entries of $(X^{T}X)^{-1}$, where X denotes the design matrix, tend to be large, reflecting that the predictor variables contribute overlapping and redundant information. Other consequences of multicollinearity are that some predictor variables may not be statistically significant even though the model is significant overall, and that the usual interpretation of coefficient estimates fails. Furthermore, the parameter estimators are highly variable, because the estimated variance-covariance matrix has large diagonal entries. Several methods for detecting multicollinearity exist. These include checking for a significant change in a parameter estimate when its corresponding predictor variable is added to or removed from the model, checking for insignificance of individual estimators while the model is significant overall, calculating the Variance Inflation Factor (VIF), and carrying out formal multicollinearity tests. There are several remedies for multicollinearity. One method is to select a collection of predictor variables that are minimally correlated with each other. This avoids overfitting the regression model and can normally be done with statistical software. However, information from the other predictor variables is often lost. Furthermore, there is no clear way of selecting the collection of predictor variables that forms the best subset.
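The VIF mentioned among the detection methods above can be computed as the diagonal of the inverse of the predictors' correlation matrix, which equals $1/(1-R_j^2)$ for the regression of predictor j on the others. The following is a minimal sketch in standard-library Python under that identity; the helper names and the three small predictors are made up for illustration and are not the diabetes data used in the study.

```python
# Sketch: variance inflation factors from the inverse correlation matrix.
# Helper names and data are illustrative assumptions.
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])
    p = len(columns)
    return [[sum(a * b for a, b in zip(z[i], z[j])) / n for j in range(p)]
            for i in range(p)]


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        if abs(aug[pivot][c]) < 1e-12:
            raise ValueError("singular correlation matrix: perfect collinearity")
        aug[c], aug[pivot] = aug[pivot], aug[c]
        d = aug[c][c]
        aug[c] = [v / d for v in aug[c]]
        for r in range(p):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[p:] for row in aug]


def vif(columns):
    """VIF_j is the j-th diagonal entry of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Made-up predictors: x2 is nearly a linear function of x1,
# x3 is only moderately correlated with the other two.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
x3 = [5.0, 1.0, 4.0, 2.0, 6.0, 3.0, 8.0, 7.0]

vifs = vif([x1, x2, x3])
print([round(v, 1) for v in vifs])
```

A common rule of thumb treats a VIF above 10 as a sign of serious multicollinearity; in this made-up example the first two predictors exceed that threshold by a wide margin while the third does not.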
Since omitting predictor variables may result in a loss of information, another method is to include interaction terms in the model to account for high linear correlation among the predictor variables. There are several problems with this approach. One is that the form of the interaction is not unique and must be carefully determined. Another is that the model becomes much more complex and has many more terms, which reduces the degrees of freedom available for inference on the response and hence the power for predicting and estimating the response. In recent years, alternative methods have been introduced to deal with multicollinearity. In particular, penalization methods, also known as simultaneous shrinkage and variable selection methods, have become popular and useful.