## ABSTRACT

In this thesis, we consider an alternative to ordinary least squares (OLS) regression based on the analytical solution in the Statgraphics software; this alternative is the quantile regression (QR) model. We also present a goodness-of-fit statistic, called the quantile regression coefficient of determination, as well as heteroskedasticity test statistics for the parameters. The procedure is illustrated and validated by a numerical example based on a publicly available dataset on fuel consumption, in miles per gallon, in highway driving. The results obtained from the analysis in this thesis suggest that OLS estimates can sometimes be misleading about the true relationship between the response variable and a covariate, as the effects can be very different for different subsections of the sample. Quantile regression therefore gives a better and more complete view of the relationship among random variables.

## TABLE OF CONTENTS

DECLARATION

CERTIFICATION

DEDICATION

ACKNOWLEDGEMENT

ABSTRACT

TABLE OF CONTENTS

List of Tables

List of Figures

Appendix

CHAPTER ONE

INTRODUCTION

1.0 Background of the Study

1.1 Limitations for the Conditional Mean

1.2 Motivation of the Study

1.3 Statement of the Problem

1.4 Scope of the Study

1.5 Significance of the Study

1.6 Aim and Objective(s) of the Study

1.7 Limitation of the Study

1.8 Statement of Hypotheses

CHAPTER TWO

LITERATURE REVIEW

2.0 Introduction

2.1 Multiple Linear Regression

2.2 Quantile Regression

CHAPTER THREE

METHODOLOGY

3.0 Introduction

3.1 Data Collection

3.2 Classical Linear Regression

3.3 Estimation of the Parameters in Linear Regression Models

3.4 Estimating $\sigma^2$

3.5 Properties of the Estimators

3.6 Coefficient of Multiple Determination $R^2$

3.7 Stepwise Selection

3.8 Quantile Regression

3.8.1 Computation of Quantile Regression

3.8.2 Least Absolute Deviation Regression Goodness of Fit

3.8.3 Quantile Regression Goodness of Fit

CHAPTER FOUR

RESULTS AND DISCUSSION

4.0 Introduction

4.1 Numerical Illustration and Discussion of the Result

4.2 Ordinary Least Squares Regression

4.3 Heteroskedasticity Test: White

4.4 Comparison of OLS and QR as the Number of Variables Increases

CHAPTER FIVE

SUMMARY, CONCLUSIONS AND RECOMMENDATIONS

5.0 Introduction

5.1 Summary

5.2 Conclusion

5.3 Recommendations

5.4 Contribution to Knowledge

5.5 Further Research

REFERENCES

APPENDIX A

## CHAPTER ONE

INTRODUCTION

1.0 Background of the Study

In most regression problems, interest lies in studying the relationship between two or more variables; indeed, studying the relationship between varying qualities or events is an important aspect of the philosophy of science. The purpose of regression analysis is to expose the relationship between a response variable and predictor variables. In real applications, the response variable cannot be predicted exactly from the predictor variables. Instead, the response for a fixed value of each predictor variable is a random variable. For this reason, we often summarize the behavior of the response for fixed values of the predictors using measures of central tendency. Typical measures of central tendency are the average value (mean), the middle value (median), and the most likely value (mode). Traditional regression analysis is focused on the mean; that is, we summarize the relationship between the response variable and predictor variables by describing the mean of the response for each fixed value of the predictors, using a function we refer to as the conditional mean of the response. The idea of modeling and fitting the conditional-mean function is at the core of a broad family of regression-modeling approaches, including the familiar simple linear-regression model, multiple regression, models with "heteroscedastic" errors using weighted least squares, and nonlinear regression models.

Conditional-mean models have certain attractive properties. Under ideal conditions, they are capable of providing a complete and parsimonious description of the relationship between the covariates and the response distribution. In addition, using conditional-mean models leads to estimators (least squares and maximum likelihood) that possess attractive statistical properties, are easy to calculate, and are straightforward to interpret. Such models have been generalized in various ways to allow for "heteroscedastic" errors so that, given the predictors, modeling of the conditional mean and conditional scale of the response can be carried out simultaneously. Conditional-mean modeling has been applied widely in the social sciences, particularly in the past half century, and regression modeling of the relationship between a continuous response and covariates via least squares and its generalizations is now seen as an essential tool. More recently, models for binary response data, such as logistic and probit models, and Poisson regression models for count data have become increasingly popular in social science research. These approaches fit naturally within the conditional-mean modeling framework. While quantitative social-science researchers have applied advanced methods to relax some basic modeling assumptions under the conditional-mean framework, this framework itself is seldom questioned.

The conditional-mean framework has inherent limitations.

1.1 Limitations for the Conditional Mean

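The conditional-mean model rests on the following assumptions about the errors $\varepsilon_i$:

1. The mean of the errors is zero, i.e. $E(\varepsilon_i) = 0$.

2. The errors are assumed to have equal variance, that is, $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all values of $X = x$.

3. The errors at different values of the regressors are independent; denoting the errors at different values by $\varepsilon_i$ and $\varepsilon_j$, that is, $\operatorname{Cov}(\varepsilon_i, \varepsilon_j) = 0$ for $i \neq j$.

4. The errors are normally distributed for all values of $x$, that is, $\varepsilon_i \sim N(0, \sigma^2)$.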

An alternative to conditional-mean modeling has roots that can be traced to the mid-18th century. This approach can be referred to as conditional-median modeling, or simply median regression. It addresses some of the issues mentioned above regarding the choice of a measure of central tendency. The method replaces least-squares estimation with least-absolute-distance estimation. While the least-squares method is simple to implement without high-powered computing capabilities, least-absolute-distance estimation demands significantly greater computing power. It was not until the late 1970s, when computing technology was combined with algorithmic developments such as linear programming, that median-regression modeling via least-absolute-distance estimation became practical. The median-regression model can be used to achieve the same goal as conditional-mean regression modeling, namely to represent the relationship between the central location of the response and a set of covariates. However, when the distribution is highly skewed, the mean can be challenging to interpret while the median remains highly informative. As a consequence, conditional-median modeling has the potential to be more useful.

The median is a special quantile, which describes the central location of a distribution. Conditional-median regression is a special case of quantile regression in which the conditional 0.5 quantile is modeled as a function of covariates. More generally, other quantiles can be used to describe non-central positions of a distribution. The quantile notion generalizes specific terms like quartile, quintile, decile, and percentile. The pth quantile denotes that value of the response below which the proportion of the population is p. Thus, quantiles can specify any position of a distribution. For example, 2.5% of the population lies below the 0.025th quantile. Koenker and Bassett (1978) introduced quantile regression, which models conditional quantiles as functions of predictors. The quantile-regression model is a natural extension of the linear-regression model. While the linear-regression model specifies the change in the conditional mean of the dependent variable associated with a change in the covariates, the quantile-regression model specifies changes in the conditional quantile. Since any quantile can be used, it is possible to model any predetermined position of the distribution.
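The definition of the pth quantile can be illustrated with a minimal sketch. The function name `empirical_quantile` and the population of integers 1 to 200 are hypothetical choices made only for illustration; the rule used (the smallest observed value with at least a share p of the data at or below it) is one common convention among several.

```python
def empirical_quantile(values, p):
    """Smallest observed value y such that the share of data <= y is at least p."""
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    for count, y in enumerate(ordered, start=1):
        if count / n >= p:
            return y
    return ordered[-1]


# Hypothetical population (an assumption for illustration): the integers 1..200.
population = list(range(1, 201))

median = empirical_quantile(population, 0.5)           # central location
lower_quartile = empirical_quantile(population, 0.25)  # a quartile
tail = empirical_quantile(population, 0.025)           # a non-central position

# Share of the population lying at or below the 0.025th quantile.
share_below_tail = sum(v <= tail for v in population) / len(population)
print(median, lower_quartile, tail, share_below_tail)
```

Here the share of the population at or below the 0.025th quantile is 2.5%, which matches the statement in the text.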


Quantile-regression models can be fitted by minimizing a generalized measure of distance using algorithms based on linear programming. As a result, quantile regression is now a practical tool for researchers. Software packages familiar to statistical scientists offer readily accessed commands for fitting quantile-regression models.
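For reference, one standard way of writing the quantile-regression fit as a linear program is sketched below. The notation is an assumption made for this sketch: $q$ is the quantile of interest, $x_i$ the covariate vector of observation $i$, $\beta$ the coefficient vector, and $u_i$, $v_i$ the positive and negative parts of the residual.

```latex
\begin{aligned}
\min_{\beta,\,u,\,v}\quad & \sum_{i=1}^{n} \bigl[\, q\,u_i + (1-q)\,v_i \,\bigr] \\
\text{subject to}\quad & y_i - x_i^{\top}\beta = u_i - v_i, \qquad i = 1,\dots,n, \\
& u_i \ge 0,\quad v_i \ge 0 .
\end{aligned}
```

With $q = 0.5$ the objective is half the sum of absolute residuals, so the program reduces to median (least-absolute-distance) regression.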

This research work aims to investigate the robustness of quantile regression as an alternative to least squares regression, especially as the number of regressors in the model increases, and to present it to a broad audience of social scientists who are interested in modeling both the location and shape of the distribution they wish to study. It is imperative for researchers to utilize the comparison in two parts.

1.2 Motivation of the Study

Standard linear regression techniques summarize the average relationship between a set of regressors and the outcome variable based on the conditional mean function $E(y \mid x)$. This provides only a partial view of the relationship, as we might be interested in describing the relationship at different points in the conditional distribution of $y$. Quantile regression provides that capability. Analogous to the conditional mean function of linear regression, we may consider the relationship between the regressors and outcome using the conditional median function $Q_q(y \mid x)$, where the median is the 50th percentile, or quantile $q$, of the empirical distribution. The quantile $q \in (0, 1)$ is that $y$ which splits the data into proportions $q$ below and $1 - q$ above: $F(y_q) = q$ and $y_q = F^{-1}(q)$; for the median, $q = 0.5$. If $\varepsilon_i$ is the model prediction error, OLS minimizes $\sum_i \varepsilon_i^2$. Median regression, also known as least-absolute-deviations (LAD) regression, minimizes $\sum_i |\varepsilon_i|$. Quantile regression minimizes a sum that gives asymmetric penalties $(1 - q)|\varepsilon_i|$ for overprediction and $q|\varepsilon_i|$ for underprediction.
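The three criteria can be compared in a small sketch. The function names and the five data values are hypothetical, and the model is reduced to a single constant prediction $c$ so that the prediction errors are $y_i - c$. The search is restricted to observed values, which is enough for the absolute and asymmetric losses because both are piecewise linear and convex in $c$.

```python
def squared_loss(errors):
    """OLS criterion: sum of squared prediction errors."""
    return sum(e * e for e in errors)


def absolute_loss(errors):
    """LAD criterion: sum of absolute prediction errors."""
    return sum(abs(e) for e in errors)


def check_loss(errors, q):
    """Asymmetric criterion: q*|e| for underprediction (e >= 0),
    (1 - q)*|e| for overprediction (e < 0)."""
    return sum(q * e if e >= 0 else (q - 1) * e for e in errors)


def best_constant(y, loss):
    """Observed value c that minimizes the loss of the errors y_i - c."""
    return min(sorted(y), key=lambda c: loss([yi - c for yi in y]))


# Hypothetical sample with one large value (an assumption for illustration).
y = [1, 2, 3, 4, 100]

mean_y = sum(y) / len(y)                      # minimizes the squared loss over all c
median_fit = best_constant(y, absolute_loss)  # minimizes the absolute loss
q25_fit = best_constant(y, lambda errors: check_loss(errors, 0.25))

print(mean_y, median_fit, q25_fit)
```

In this sample the mean is pulled toward the large value, while the minimizers of the absolute and asymmetric losses stay at the sample median and a lower quantile respectively.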

Although its computation requires linear programming methods, the quantile regression estimator is asymptotically normally distributed. Median regression is more robust to outliers than least squares regression, and is semiparametric as it avoids assumptions about the parametric distribution of the error process. Just as regression models conditional moments, such as predictions of the conditional mean function, we may use quantile regression to model conditional quantiles of the joint distribution of y and x.
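The robustness of median regression to an outlier can be sketched on made-up data. The approach below is an illustrative stand-in for a linear-programming solver, not the procedure used in this thesis: for a model with an intercept and one slope, an optimal quantile-regression line can be found among the lines passing through two observations, so for a small sample all such lines can simply be enumerated. The function names and data are hypothetical.

```python
from itertools import combinations


def ols_line(x, y):
    """Closed-form least squares intercept and slope for one regressor."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def check_loss(errors, q):
    """q*|e| for underprediction (e >= 0), (1 - q)*|e| for overprediction."""
    return sum(q * e if e >= 0 else (q - 1) * e for e in errors)


def quantile_line(x, y, q=0.5):
    """Brute-force quantile fit: try every line through two observations
    and keep the one with the smallest asymmetric loss."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line cannot be written as intercept + slope*x
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        errors = [yk - (intercept + slope * xk) for xk, yk in zip(x, y)]
        loss = check_loss(errors, q)
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    return best[1], best[2]


# Hypothetical data: y = 1 + 2x exactly, except for one outlying response.
x = list(range(10))
y = [1 + 2 * xi for xi in x]
y[-1] += 30

ols_intercept, ols_slope = ols_line(x, y)
lad_intercept, lad_slope = quantile_line(x, y, q=0.5)
print(ols_intercept, ols_slope)
print(lad_intercept, lad_slope)
```

On these data the median-regression line recovers the intercept 1 and slope 2 of the nine regular points, whereas the least squares slope is pulled well above 2 by the single outlier. The enumeration grows quadratically with the sample size, so it is only a teaching device.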

1.3 Statement of the Problem

Regression analysis is widely applicable across various kinds of research, especially when provisions are made to control for heteroskedasticity, which arises from violation of the OLS assumption of constant error variance, i.e. $\operatorname{Var}(\varepsilon_i) = \sigma^2$ for all $i$. What is the implication of heteroskedasticity? The ordinary least squares (OLS) estimators and regression predictions based on them remain unbiased and consistent. The OLS estimators are no longer the best linear unbiased estimators (BLUE) because they are no longer efficient. As a result, regression predictions will be inefficient as well. Because of the inconsistency of the covariance matrix of the estimated regression coefficients, the tests of hypotheses, that is, t-tests or F-tests, are no longer valid.
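The point about the covariance matrix can be made explicit with a short worked statement. The matrix notation is an assumption introduced for this sketch: $X$ is the design matrix, $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$ the OLS estimator, and $\Omega = \operatorname{diag}(\sigma_1^2, \dots, \sigma_n^2)$ the covariance matrix of the errors.

```latex
\begin{aligned}
\text{Equal variances } (\Omega = \sigma^2 I):\quad
  & \operatorname{Var}(\hat{\beta} \mid X) = \sigma^2 (X^{\top}X)^{-1}, \\
\text{Unequal variances}:\quad
  & \operatorname{Var}(\hat{\beta} \mid X)
    = (X^{\top}X)^{-1} X^{\top} \Omega X \,(X^{\top}X)^{-1}.
\end{aligned}
```

The usual t and F statistics are computed from the first expression, so when the variances differ across observations the reported standard errors no longer estimate the true sampling variability, even though $\hat{\beta}$ itself remains unbiased.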

The purpose of this study is to introduce a model that controls the problem of heteroskedasticity: the quantile regression model is considered as a robust alternative to ordinary least squares regression when the said assumptions fail to hold.

1.4 Scope of the study

The study seeks to investigate the behavior of the two regression procedures as the number of predictor (explanatory) variables increases, keeping in mind that quantile regression does not take into account the failure or otherwise of the existing assumptions. Discussion of the two techniques, various significance tests of the importance of independent variables, the reliability of the models, and the interpretation of results in the two techniques will be considered. The study also involves empirical analysis using the two techniques, and the results will be compared to establish the discrepancies between the two techniques, if any.

1.5 Significance of the study

The study is worth undertaking because QR has been found to give a more holistic view of the effect of the explanatory variables on the response variable at different quantiles. The main significance of the study is that it strongly proposes the use of QR, rather than the OLS regression used in the past, for examining heteroscedasticity, and thereby recommends QR to researchers studying heteroscedasticity. Most research involves estimation of the relationship between independent variables and the dependent variable on the average. With QR, the relationship between dependent and independent variables can be estimated and examined at each quantile of the dependent variable. In this research it is our purpose to revisit least absolute deviation (LAD) estimation in regression analysis, consider some of its theoretical properties, and consider its implementation from a computational mathematical programming point of view. We also consider goodness-of-fit statistics as well as approximate distributions of the associated test statistics for the parameters. Furthermore, we suggest a new goodness-of-fit statistic, called the quantile coefficient of determination, which is adapted to the metrics used in LAD estimation. Finally, some examples are provided to illustrate the behavior of the procedures in data that include outliers. Growing recognition of the need for a more flexible, more complete analysis is a driving force in the use of QR in the literature.

1.6 Aim and Objective(s) of the Study

The main aim of the study is to investigate quantile regression as an alternative to least squares regression, especially when the number of regressors increases. The specific objectives are:

1. To examine quantile regression and least squares regression.

2. To compare the models in terms of goodness-of-fit statistics.

3. To recommend a suitable model for regression analysis.


1.7 Limitation of the Study

The study would have covered both conditional and unconditional means and quantiles but, owing to time constraints, it focuses only on the conditional mean and conditional quantiles.

1.8 Statement of Hypotheses

t-test for significance of one coefficient

The t-test is used to determine whether the relationship between $y$ and $x_j$ is significant.

$H_0$: $\beta_j = 0$ (i.e. the coefficient $\beta_j$ is not significantly different from zero).

$H_1$: $\beta_j \neq 0$ (i.e. the coefficient is significantly different from zero).

F-test for overall significance of all coefficients

The F-test is used to determine whether the relationship between $y$ and all $x$ variables is significant.

$H_0$: $\beta_1 = \beta_2 = \dots = \beta_p = 0$.

$H_1$: at least one of the $\beta_j$'s is different from zero.
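As a minimal sketch of how the two test statistics are computed, consider a regression with a single regressor. The five data points and all variable names below are made up for illustration and are not taken from the thesis dataset. With one regressor, the overall F statistic equals the square of the slope's t statistic.

```python
import math

# Hypothetical data (an assumption for illustration only).
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]

sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # regression sum of squares

# t statistic for H0: slope = 0, with n - 2 residual degrees of freedom.
se_slope = math.sqrt(sse / (n - 2) / sxx)
t_stat = slope / se_slope

# F statistic for H0: all slopes are zero (one regressor, so 1 and n - 2 df).
f_stat = (ssr / 1) / (sse / (n - 2))

print(round(slope, 4), round(t_stat, 2), round(f_stat, 2))
```

For these data the t statistic is far above the tabulated two-sided 5% critical value for 3 degrees of freedom (about 3.182), so $H_0$ would be rejected at that level.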
