ABSTRACT
Penalized regression techniques, which perform variable selection, have been developed to address the problem of multicollinearity and to reduce the flaws inherent in the prediction accuracy of the classical ordinary least squares (OLS) regression technique. In this dissertation, we focus on a numerical study of four penalized regression methods. A diabetes dataset was used to compare four well-known techniques, namely the Least Absolute Shrinkage and Selection Operator (LASSO), Smoothly Clipped Absolute Deviation (SCAD), Correlation Adjusted Elastic Net (CAEN) and Elastic Net (EN). The whole solution paths (in λ) for the LASSO, SCAD and CAEN models were computed using pathwise Cyclic Coordinate Descent (CCD) algorithms in glmnet in R. We used 10-fold cross-validation (CV) within glmnet to search for the optimal λ. Regularization profile plots of the coefficient paths for the three methods are also shown. Predictive accuracy was assessed using the mean squared error (MSE), and the penalized regression models produced feasible and efficient models that captured the linearity in the data better than the ordinary least squares model. Since many survival data analysis problems involve a large number of variables, SCAD can also be applied to survival data. After thorough analysis, it was observed that SCAD generated a less complex model with a smaller mean squared error (MSE) than the three other penalized regression methods compared, namely LASSO, EN and CAEN.
TABLE OF CONTENTS
Title Page …………………………………………………………………………………………………………………….2
Declaration …………………………………………………………………………………………………………………..3
Certification …………………………………………………………………………………………………………………4
Dedication ……………………………………………………………………………………………………………………5
Acknowledgements ……………………………………………………………………………………………………….6
Abstract ……………………………………………………………………………………………………………………….7
Table of Contents ………………………………………………………………………………………………………….8
List of Tables……………………………………………………………………………………………………………… 10
List of Figures ……………………………………………………………………………………………………………. 11
CHAPTER ONE…………………………………………………………………………………………………………. 12
INTRODUCTION ………………………………………………………………………………………………………. 12
1.1 Background of the Study ……………………………………………………………………………………… 12
1.2 Research Motivation …………………………………………………………………………………………… 13
1.3 Statement of the Problem …………………………………………………………………………………….. 13
1.4 Aim and Objectives of the Study …………………………………………………………………………… 14
1.5 Significance of the Study …………………………………………………………………………………….. 14
1.6 Scope and Limitations of the Study ……………………………………………………………………….. 14
CHAPTER TWO ………………………………………………………………………………………………………… 15
LITERATURE REVIEW …………………………………………………………………………………………….. 15
2.1 Introduction ………………………………………………………………………………………………………. 15
2.2 Classical Regression Methods ………………………………………………………………………………. 15
2.3 Penalized Regression ………………………………………………………………………………………….. 18
2.3.1 LASSO Regression ………………………………………………………………………………………. 19
2.3.2 Elastic Net Regression…………………………………………………………………………………… 22
2.3.3 Correlation Adjusted Elastic Net (CAEN) Regression ………………………………………… 23
2.3.4 Smoothly Clipped Absolute Deviation (SCAD) Regression …………………………………. 23
2.4 Application of Penalized Regression ……………………………………………………………………… 24
CHAPTER THREE …………………………………………………………………………………………………….. 26
METHODOLOGY ……………………………………………………………………………………………………… 26
3.1 Penalized Regression Techniques ………………………………………………………………………….. 26
3.1.1 LASSO Regression Approach ………………………………………………………………………… 28
3.1.2 Elastic Net Regression Approach ……………………………………………………………………. 30
3.1.3 Correlation Adjusted Elastic Net Approach ………………………………………………………. 31
3.1.4 SCAD Regression Approach ………………………………………………………………………….. 32
3.2 Ordinary Least Squares ……………………………………………………………………………………….. 33
3.3 Assumptions of Multiple Linear Regression ……………………………………………………………. 33
3.4. Variance Inflation Factor ……………………………………………………………………………………. 34
3.5 Mean Square Error ……………………………………………………………………………………………… 35
3.6 Choice of Tuning Parameters ……………………………………………………………………………….. 36
3.7 Source of data ……………………………………………………………………………………………………. 36
CHAPTER FOUR ………………………………………………………………………………………………………. 37
RESULTS AND DISCUSSION ……………………………………………………………………………………. 37
4.1 Introduction ………………………………………………………………………………………………………. 37
4.2 Determining the Ordinary Least Squares Regression ………………………………………………….. 37
4.3 Determining the Correlation among Independent Variables ………………………………………… 39
4.4 Results Based on LASSO Regression ……………………………………………………………………….. 44
4.5 Results Based on Elastic Net Regression …………………………………………………………………… 44
4.6 Results Based on Correlation Adjusted Elastic Net Regression …………………………………….. 48
4.7 Results Based on Smoothly Clipped Absolute Deviation Regression ……………………………. 51
CHAPTER FIVE ………………………………………………………………………………………………………….. 55
SUMMARY, CONCLUSION AND RECOMMENDATION …………………………………………….. 55
5.1 Summary ………………………………………………………………………………………………………… 55
5.2 Conclusion ………………………………………………………………………………………………………. 55
5.3 Recommendation ……………………………………………………………………………………………… 55
5.4 Suggestion for further study ……………………………………………………………………………….. 56
5.5 Contribution to knowledge ………………………………………………………………………………… 56
REFERENCES …………………………………………………………………………………………………………. 57
APPENDIX A ……………………………………………………………………………………………………………. 61
CHAPTER ONE
INTRODUCTION
1.1 Background of the Study
Variable selection is an important issue in multiple linear regression: in order to reduce possible bias, a large number of predictor variables is often introduced into a model, and this leads to a serious concern of multicollinearity among the predictors (Mathew and Yahaya, 2015). Multicollinearity and high dimensionality are two problems, along with the associated computational issues, that bring challenges to regression analysis. To deal with these challenges, variable selection and shrinkage estimation have become important and useful. The traditional automatic selection approaches (such as forward selection, backward elimination and stepwise selection) and best subset selection are computationally expensive and may not necessarily produce the best model. The penalized least squares (PLS) method deals with the multicollinearity problem by placing constraints on the values of the estimated parameters; as a result, the entries of the variance-covariance matrix are significantly reduced. When multicollinearity exists, predictor variables that are highly correlated form groups. One way the collinearity problem can be dealt with is to remove one or more of the predictor variables within the same group, but deciding which variable in a group should be eliminated tends to be difficult and complicated. The consequence of multicollinearity is that the parameter estimates and their variances or standard errors tend to be large, and prediction may be very inaccurate.
In situations where the data are correlated, or where the number of predictors is much larger than the sample size, penalized regression methods have been introduced to deal with these challenges because they produce more stable results. Penalized regression methods do not explicitly select the variables; instead, they minimize the residual sum of squares subject to a penalty on the size of the regression coefficients. This penalty causes the regression coefficients to shrink toward zero, which may result in biased estimates, although these coefficient estimates will have smaller variance. This can improve prediction accuracy because of the smaller mean squared error (Hastie et al., 2009). This is why penalized regression methods are also known as shrinkage or regularization methods. If the shrinkage is large enough, some regression coefficients are set exactly to zero; thus, penalized regression methods perform variable selection and coefficient estimation simultaneously. The Least Absolute Shrinkage and Selection Operator (LASSO) enables selection such that only the important variables stay in the model (Szymeezak et al., 2009).
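The shrinkage mechanism described above can be made concrete with a toy sketch of cyclic coordinate descent for the LASSO. This is only an illustration in pure Python on invented data, not the actual glmnet implementation used in this thesis (glmnet adds standardization, warm starts over a λ path, and covariance updates); the key point it shows is the soft-thresholding rule that sets small coefficients exactly to zero:

```python
def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0): the LASSO shrinkage rule.
    # This operator is what sets small coefficients exactly to zero.
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclic coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1.
    X is a list of rows; intercept and standardization are not handled."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # partial residuals with feature j's own contribution removed
            r = [y[i] - sum(X[i][k] * beta[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / z
    return beta

# Two orthogonal toy predictors: only the first is strongly related to y.
X = [[1, 0], [-1, 0], [0, 1], [0, -1]]
y = [2, -2, 0.1, -0.1]
print(lasso_cd(X, y, lam=0.2))  # the weak predictor's coefficient is shrunk to 0
```

With λ = 0 the routine recovers the ordinary least squares fit; with λ = 0.2 the coefficient of the weakly related predictor is set exactly to zero, which is the simultaneous selection-and-estimation behaviour discussed above.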
1.2 Research Motivation
The motivation for using penalized regression is that ordinary least squares estimates are not unique and are subject to high variability in the presence of multicollinearity. With penalization, however, the solution becomes unique when appropriate tuning parameters are chosen, and the variances of the estimators are controlled. Most of the comparisons done by Mathew and Yahaya (2015) were between the Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net (EN) and Correlation Adjusted Elastic Net (CAEN). This research attempts to compare LASSO, EN, CAEN and Smoothly Clipped Absolute Deviation (SCAD) regression.
1.3 Statement of the Problem
When multicollinearity exists in a model, the parameter estimates of the multiple linear regression model are not unique. Most often, we face the issue of multicollinearity when there are strong linear relationships between two or more predictors. In recent years, alternative methods known as shrinkage and variable selection methods, in particular penalized regression methods, have been introduced to deal with multicollinearity. This study deals with multicollinearity by considering different penalized regression methods.
1.4 Aim and Objectives of the Study
The aim of this study is to compare the performance of penalized regression techniques with classical regression methods in minimizing the effect of multicollinearity. We intend to achieve this aim through the following objectives:
i. Determine variables that possess multicollinearity using Variance Inflation Factor;
ii. Apply penalized regression techniques such as LASSO, CAEN, EN, and SCAD regression to eliminate multicollinearity; and
iii. Assess the adequacy of the fitted penalized regression models and the classical least squares.
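Objective (i) relies on the Variance Inflation Factor, VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. As a rough pure-Python illustration on invented data (not the diabetes dataset), the two-predictor case is shown below, where the auxiliary R² reduces to the squared Pearson correlation between the two predictors:

```python
def mean(v):
    return sum(v) / len(v)

def pearson_corr(x, y):
    # Pearson correlation coefficient between two equal-length samples
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF_j = 1 / (1 - R_j^2). With only two predictors, the R^2 from
    regressing one predictor on the other is their squared correlation."""
    r_squared = pearson_corr(x1, x2) ** 2
    return 1.0 / (1.0 - r_squared)

# Moderately correlated toy predictors: corr = 0.6, so VIF = 1/(1 - 0.36)
print(vif_two_predictors([1, 2, 3, 4], [2, 1, 4, 3]))
```

A VIF near 1 indicates little collinearity, while values above the common rule-of-thumb threshold of 10 flag predictors whose coefficient variances are badly inflated; with more than two predictors the auxiliary R² must come from a full multiple regression rather than a pairwise correlation.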
1.5 Significance of the Study
At the end, this study is expected to show the importance of variable selection through penalized regression as a prior step in removing unimportant factors or variables before model building, and also to assist researchers in deciding which technique to use when faced with the problem of multicollinearity.
1.6 Scope and Limitations of the Study
This study revolves around the use of Generalized Cross-Validation (GCV), as a good approximation of leave-one-out cross-validation (LOOCV), to determine the number of variables selected by each of the methods (LASSO, CAEN, EN and SCAD) under study, and also the use of the mean squared error and linear fits to determine the predictive accuracy of the methods. The research gives an insight into each procedure in an attempt to highlight the similarities and differences existing among the four penalized methods.