## ABSTRACT

Binary Logistic Regression and the Naïve Bayesian classifier are two common classification modelling techniques that allow one to predict the category to which a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. We studied the classification performance of these two linear classifiers under different feature (variable) selection criteria and dataset size conditions in a medical domain, using datasets (breast cancer and heart disease) obtained from the University of California, Irvine, online repository. The results indicated that Logistic Regression applied to the relatively large datasets without PCA (for variable selection) had the greatest accuracy (91.4%), while the Naïve Bayesian classifier with PCA (for variable/feature selection) topped the smaller-dataset classification with an accuracy of 90.2%. These two accuracies are close and high, which indicates that both classifiers are highly relevant for solving classification problems on datasets from this kind of domain.

## TABLE OF CONTENTS

TITLE PAGE

DECLARATION

CERTIFICATION

DEDICATION

ACKNOWLEDGEMENT

ABSTRACT

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER ONE: INTRODUCTION

1.1 Background to the Study

1.2 Statement of the Problem

1.3 Aim and Objectives of the Study

1.4 Significance of the Study

1.5 Motivation

1.6 Scope and Limitation of the Study

1.7 Definition of Terms

CHAPTER TWO: LITERATURE REVIEW

CHAPTER THREE: RESEARCH METHODOLOGY AND MATERIALS

3.1 Introduction

3.2 Source of Data

3.3 Method of data analysis

3.4 Principal component analysis (PCA)

3.5 Logistic Regression

3.5.1 Binary Logit model from the Logistic Function

3.6 Naïve Bayesian Classifier

3.6.1 Formulation of the model

3.6.2 Learning the model

3.7 Classification of New Data

3.8 Model's performance evaluation

3.8.1 Confusion Matrix

3.8.2 Comparing multiple models

CHAPTER FOUR: ANALYSIS AND DISCUSSION

4.1 Introduction

4.2 Model Building and evaluation for the Breast Cancer Dataset

4.2.1 Building and evaluating NB on the larger dataset, with no PCA

4.2.2 Building and evaluating NB on the smaller dataset, with no PCA

4.2.3 Building and evaluating LR on the larger dataset, with no PCA

4.2.4 Building and evaluating LR on the smaller dataset, with no PCA

4.2.5 Building and evaluating NB on the larger dataset, with PCA

4.2.6 Building and evaluating NB on the smaller dataset, with PCA

4.2.7 Building and evaluating LR on the larger dataset, with PCA

4.2.8 Building and evaluating LR on the smaller dataset, with PCA

4.3 Model Building and evaluation for the Heart Disease Dataset

4.3.1 Building and evaluating NB on the larger dataset, with no PCA

4.3.2 Building and evaluating NB on the smaller dataset, with no PCA

4.3.3 Building and evaluating LR on the larger dataset, with no PCA

4.3.4 Building and evaluating LR on the smaller dataset, with no PCA

4.3.5 Building and evaluating NB on the larger dataset, with PCA

4.3.6 Building and evaluating NB on the smaller dataset, with PCA

4.3.7 Building and evaluating LR on the larger dataset, with PCA

4.3.8 Building and evaluating LR on the smaller dataset, with PCA

4.4 Summary of Results

CHAPTER FIVE: SUMMARY, CONCLUSION AND RECOMMENDATION

5.1 Summary

5.2 Conclusion

5.3 Recommendation

5.4 Recommendation and Suggestion for future research

5.5 Contribution to knowledge

REFERENCES

APPENDIX I

APPENDIX II

## CHAPTER ONE

INTRODUCTION

1.1 Background to the Study

We start by considering the following problem: suppose you are a medical laboratory technologist with access to the health records of a patient who was admitted for heart disease diagnosis. The natural question that comes to mind is: does he or she have heart disease or not? Or this one: suppose you are a bank, and a person wants to take out a loan; will she default on the loan? Or this: how can your email server tell which emails are spam and which ones are actual mail? Intuitively, all of the above situations can be resolved by examining empirical data and extracting the factors that are important in each. For example, in the heart disease case, one might want to look through the hospitalization records of patients who have had heart disease and see whether the new patient resembles them in age, blood pressure, body temperature, diet and exercise habits, family history and other clinical measurements.

The above situations are examples of classification problems. Classification is a statistical method used to build predictive models to separate and classify new data points. In Machine Learning and Statistics, classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.

Feature (variable) selection is the process of identifying and removing as many irrelevant and redundant features as possible from a dataset (Yu and Liu, 2004). This reduces the dimensionality of the data and enables data mining algorithms to operate effectively. The fact that many features depend on one another often unduly influences the accuracy of models, so classification models are affected by the choice of features (variables). Purkayastha et al. (2014) stated that selecting the relevant features for classification is significant for a variety of reasons, such as simplification of performance, computational efficiency, and feature interpretability.


A model which performs classification is known as a classifier. A classifier is a function which maps an input variable X to a class C. Classifiers are broadly divided into linear and non-linear classifiers; linear classifiers are models built on a linear combination of the variables' values. Linear classifiers work well for practical problems such as medical diagnosis and document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to learn from the training dataset. There are numerous classifiers today, and the choice of which to use depends on a number of factors, for example simplicity, accuracy, and applicability to the domain and structure of the dataset under consideration (Kwon and Sim, 2013). In this research work, we focus on a discriminative-generative pair of binary linear classifiers, typified by Binary Logistic Regression (LR) and the Naïve Bayes Classifier (NBC).

Bayesian classification is a supervised learning method for classification. The Naïve Bayes classifier is a probabilistic classifier based on applying Bayes' theorem. It assumes that all features are independent of one another given the class. The Naïve Bayes classifier has three advantages. First, in some probability models, it can be trained effectively in a supervised learning environment. Second, the amount of training data needed to estimate the parameters required for classification need not be large. Third, despite its simple design, it operates well in various complicated situations (Yoo and Yang, 2015). Given a class C and a variable vector X, we use the training data to estimate the probabilities P(X|C) and P(C) for all values of X and C. New examples of X can then be classified using these estimated probability distributions together with Bayes' rule. This type of classifier is called a generative classifier, because we can view the distribution P(X|C) as describing how to generate random instances of X conditioned on the target class C.
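The generative procedure described above (estimate P(C) and P(X|C) from training data, then apply Bayes' rule) can be sketched in a few lines. This is a minimal illustration, not the implementation used in this study: the toy records, the feature names, and the use of add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Collect the counts needed to estimate P(C) and P(X_j | C)."""
    class_counts = Counter(labels)
    # feature_counts[(feature_index, class)][value] = number of training rows
    feature_counts = defaultdict(Counter)
    feature_values = defaultdict(set)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            feature_counts[(j, label)][value] += 1
            feature_values[j].add(value)
    return class_counts, feature_counts, feature_values


def posterior(row, model):
    """Return P(C | X) for every class using Bayes' rule and feature independence."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(C)
        for j, value in enumerate(row):
            # Add-one smoothed estimate of P(X_j = value | C) (smoothing is an assumption)
            k = len(feature_values[j])
            score *= (feature_counts[(j, c)][value] + 1) / (n_c + k)
        scores[c] = score
    norm = sum(scores.values())  # normalising constant, proportional to P(X)
    return {c: s / norm for c, s in scores.items()}


# Hypothetical toy records: (blood pressure, exercises regularly)
rows = [("high", "no"), ("high", "no"), ("high", "yes"),
        ("normal", "yes"), ("normal", "yes"), ("normal", "no")]
labels = ["disease", "disease", "disease",
          "no disease", "no disease", "no disease"]

model = train_naive_bayes(rows, labels)
probs = posterior(("high", "no"), model)
predicted = max(probs, key=probs.get)
print(predicted, probs)
```

The new record is assigned to the class with the largest posterior probability. Continuous clinical measurements would need a density estimate (for example a Gaussian per class) in place of the count tables used here.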


In contrast to Naïve Bayes, Logistic Regression is a model that uses the training data to directly estimate P(C|X), the probability that an instance with feature (variable) vector X belongs to class C. In this sense, Logistic Regression is often referred to as a discriminative classifier, because we can view the distribution P(C|X) as directly estimating the probability of the target C for any given instance of X.

The success of a classifier depends on the nature of the relationship between feature selection and classification accuracy. Researchers such as Kwon and Sim (2013) have tried to understand the nature of this relationship using some selected classifier models (algorithms). However, their explanation is too general and therefore not very informative. Ultimately, we would like to understand the performance of the Naïve Bayesian Classifier and Logistic Regression when used as binary classifiers on the same domain area (in our case, health problem diagnosis: presence or absence of an ailment) under certain interactions of dataset size and variable selection method.
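To make the discriminative view of Logistic Regression concrete, the sketch below estimates P(C|X) directly by passing a linear combination of the features through the logistic function. The one-feature data set, the learning rate, the number of epochs, and the choice of plain batch gradient descent are illustrative assumptions, not the fitting procedure used in this study.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Direct estimate of P(C = 1 | X = x) from a linear combination of the features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log-loss (an illustrative choice)."""
    n_features = len(rows[0])
    weights, bias = [0.0] * n_features, 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(x, weights, bias) - y
            grad_b += error
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical one-feature data: a standardized clinical measurement and a 0/1 label
rows = [[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic(rows, labels)
p_low = predict_proba([-1.2], weights, bias)
p_high = predict_proba([1.2], weights, bias)
print(round(p_low, 3), round(p_high, 3))
```

An instance is classified as belonging to the positive class when the estimated probability exceeds a cut-off, commonly 0.5. Unlike the generative approach, no model of how X itself is distributed is ever estimated.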

1.2 Statement of the Problem

As the need to analyze big data sets grows exponentially, the role that classification algorithms play in data mining also increases. Kwon and Sim (2013), who experimentally examined how dataset characteristics affect a model's performance, noted that it is still a complex issue to determine which algorithm is strong or weak in relation to which data set. The key problem when dealing with classification is not whether one model is superior to others, but under which conditions a particular method can significantly outperform others on a given application problem. The Naïve Bayesian Classifier and Logistic Regression have been reported to do well on a variety of datasets. This research seeks to find an optimal choice between these two classifiers.

1.3 Aim and Objectives of the Study

The aim of this research work is to study the classification performance of the Naïve Bayesian Classifier and Logistic Regression under different feature (variable) selection criteria and dataset size conditions on a domain area. The aim shall be achieved through the following objectives:

i. building a Naïve Bayesian classifier model for each of the pre-determined conditions;

ii. building a Logistic Regression classifier model for each of the pre-determined conditions;

iii. testing the models in objectives (i) and (ii) on some datasets in order to measure their respective classification accuracies;

iv. performing a test of independence in the interaction of feature selection criteria, dataset size and choice of classifier model (algorithm).

1.4 Significance of the Study

This study helps to understand the optimal performance of Logistic Regression and the Naïve Bayesian Classifier, both of which are linear statistical classification models that are fast becoming the choice of many researchers. In particular, the results help in making optimal decisions on the choice of model, and consequently in improving the performance of classification algorithms.

1.5 Motivation

Our motivation stems from the conclusion of Kwon and Sim (2013), who noted the complexity of determining which classifier model is strong or weak in relation to data sets from a specified domain of study, and concluded that the key to dealing with classification is identifying the conditions under which a particular method significantly outperforms the others on a given application problem. In view of that, we were motivated to study the effect of feature selection and dataset size on the accuracy of NBC and LR, limited to data sets from a medical domain.

1.6 Scope and Limitation of the Study

This research is limited to empirical data on medical records collected for breast cancer disease and heart disease which are suitable for classification. Data from non-medical domain areas are not considered.


1.7 Definition of Terms

a) Classification

Classification is a statistical method used to build predictive models to separate and classify new data points.

b) Classifier

The predictive model built to separate and classify new data points is known as a classifier.

c) Naïve Bayes

Naïve Bayes is a classifier based on applying Bayes' theorem with the basic assumption of independence between every pair of features.

d) Logistic Regression (LR)

Logistic regression, or the logit model, is a regression model whose dependent variable is categorical and takes only two values, such as pass or fail, win or lose, alive or dead, presence or absence of disease. Multinomial logistic regression handles cases in which the dependent variable has more than two categories.

e) Training set

A training set is a set of data used to discover potentially predictive relationships.

f) Testing set

A testing set is a set of data used to assess the strength and utility of a predictive relationship.

g) Confusion matrix

A confusion matrix is a table that is often used to describe the performance of a classification model on a set of data meant for testing (usually called the test data set) for which the true values are known.

| | Predicted (no disease) | Predicted (disease) |
|---|---|---|
| Actual (no disease) | TN | FP |
| Actual (disease) | FN | TP |

Table 1.1: Sample of a Confusion Matrix
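As a rough illustration of how the four cells of such a table are filled, the sketch below tallies TN, FP, FN and TP from paired lists of actual and predicted labels. The label strings and the example lists are hypothetical and are not taken from the study's datasets.

```python
def confusion_counts(actual, predicted, positive="disease"):
    """Count TN, FP, FN and TP for a binary problem from paired label lists."""
    counts = {"TN": 0, "FP": 0, "FN": 0, "TP": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            # Actual positive: correct if predicted positive, otherwise a miss
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Actual negative: a false alarm if predicted positive
            counts["FP" if p == positive else "TN"] += 1
    return counts


# Hypothetical test-set labels and classifier outputs
actual = ["disease", "disease", "no disease", "no disease", "disease", "no disease"]
predicted = ["disease", "no disease", "no disease", "disease", "disease", "no disease"]

counts = confusion_counts(actual, predicted)
print(counts)
```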

h) Accuracy

Accuracy is the percentage of correct predictions made. In other words, the accuracy is the proportion of true results (that is, both true positives and true negatives) among the total number of cases examined.
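In terms of the cells of the confusion matrix in Table 1.1, this definition can be written as the following proportion (multiply by 100 to express it as a percentage):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$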

i) Precision

$$\text{Precision} = \frac{TP}{TP + FP}$$

Precision gives the proportion of patients diagnosed by the classifier as having a disease who actually had it. It can be defined as the proportion of true positives in the set of subjects diagnosed as positive for the condition being tested.

j) Sensitivity (Recall)

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Sensitivity is the proportion of patients who actually had the disease and were diagnosed as having it. One should be careful not to confuse sensitivity with precision. Sensitivity gives the proportion of true positives in the set of subjects who have the condition in reality, for example the proportion of patients who actually had breast cancer and were diagnosed as having it by the classifier model.


k) Specificity (True negative rate)

Specificity (SP) is calculated by dividing the number of correct negative predictions (TN) by the total number of negative subjects (N). It may appear in other texts as the true negative rate (TNR); both terms mean the same thing. A specificity of 1.0 is considered the best, whereas 0.0 is considered the worst.

$$SP = TNR = \frac{TN}{TN + FP} = \frac{TN}{N}$$
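The four measures defined in this section can be computed together from the confusion-matrix counts. The sketch below is only an illustration; the counts passed in are hypothetical and do not come from the study's results.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, sensitivity and specificity from the four counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for a test set of 100 patients
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The function assumes that every denominator is non-zero; a test set with no positive predictions, for example, would need a guard before precision is computed.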

l) Principal Component Analysis

Principal Component Analysis (PCA) is a statistical technique used to examine the relationships that exist among a set of variables in order to identify the structural pattern of those variables. PCA, which is closely related to factor analysis, is a non-parametric analysis: its answer is unique and independent of any hypothesis about the distribution of the data.
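As a rough sketch of the idea, the first principal component is the direction along which the centred data vary most, that is, the leading eigenvector of the covariance matrix. The example below approximates it by power iteration on a tiny hypothetical data set; the data, the iteration count, and the use of power iteration (rather than a full eigendecomposition from a statistics package) are assumptions made for illustration.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of the columns of `rows` (a list of equal-length lists)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_principal_component(rows, iterations=100):
    """Approximate the leading eigenvector of the covariance matrix by power iteration."""
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient v^T C v is the variance captured along v
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    total = sum(cov[i][i] for i in range(d))
    return v, variance / total


# Hypothetical measurements in which the second variable equals 2 * first - 3
rows = [[2.0, 1.0], [3.0, 3.0], [4.0, 5.0]]
component, explained = first_principal_component(rows)
print(component, explained)
```

Because the two hypothetical variables are perfectly linearly related, a single component captures essentially all of the variance, which is the situation in which replacing the original variables by their leading components loses the least information.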

m) Multicollinearity

In statistics, multicollinearity (sometimes called collinearity) is a phenomenon whereby two or more predictor variables in a multiple regression model are highly correlated, which indicates that one may be linearly predicted from the others with a substantial degree of accuracy.
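One simple, partial check for this condition is to compute the pairwise Pearson correlation between predictors and flag pairs whose absolute correlation is high. The sketch below does this for hypothetical variables; the variable names, the values, and the 0.9 threshold are assumptions chosen for the example.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """List pairs of predictors whose absolute correlation exceeds the threshold."""
    names = list(columns)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_correlation(columns[names[i]], columns[names[j]])
            if abs(r) > threshold:
                pairs.append((names[i], names[j], r))
    return pairs


# Hypothetical predictors: "weight_lb" is an exact linear function of "weight_kg"
columns = {
    "weight_kg": [60.0, 72.0, 85.0, 90.0, 101.0],
    "weight_lb": [132.0, 158.4, 187.0, 198.0, 222.2],
    "age": [50.0, 31.0, 62.0, 28.0, 45.0],
}
flagged = highly_correlated_pairs(columns)
print(flagged)
```

Pairwise correlation only detects collinearity between two variables at a time; a predictor that is a linear combination of several others can go unflagged by this check.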
