ABSTRACT
Dimensionality reduction provides a compact representation of the original high-dimensional data in
which only the vital information is retained, so the reduced data requires no further preprocessing.
For this reason, it is an invaluable preprocessing step before the application of many
machine learning algorithms that perform poorly on high-dimensional data. In this thesis, the
perceptron classification algorithm – an eager learner – is applied to three two-class datasets
(Student, Weather and Ionosphere datasets). The k-Nearest Neighbors classification algorithm – a
lazy learner – is also applied to the same two-class datasets. Each dataset is then reduced using
fifteen different dimensionality reduction techniques. The perceptron and k-nearest neighbors
classification algorithms are applied to each reduced set, and the dimensionality reduction
techniques are compared, using confusion matrices, on how well they preserve the classification of
each dataset by the two algorithms. The investigation reveals that the dimensionality reduction
techniques implemented in this thesis preserve k-nearest neighbors classification considerably
better than they preserve the perceptron's classification of the original datasets. In general, the
techniques prove very effective at preserving the classification of both the lazy and eager learners
used in this investigation.
Keywords: classification, confusion matrix, dimensionality reduction, eager learner, k-nearest
neighbors, lazy learner, perceptron.
TABLE OF CONTENTS
COVER PAGE
Certification
Dedication
Acknowledgement
Abstract
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
CHAPTER ONE: INTRODUCTION
1.1 Background of the Study
1.2 Problem Statement
1.3 Aim and Objectives
1.4 Scope and Limitations
1.5 Thesis Structure
CHAPTER TWO: LITERATURE REVIEW
2.1 Dimensionality Reduction
2.2 Machine Learning
2.3 Dimensionality Reduction and Machine Learning
CHAPTER THREE: METHODOLOGY
3.1 Dimensionality Reduction Techniques
3.1.1 The New Random Approach
3.1.2 Modified New Random Approach
3.1.3 Singular Value Decomposition (SVD)
3.1.4 Principal Component Analysis (PCA)
3.1.5 The Variance Approach
3.1.6 The Combined Approach
3.1.7 The Direct Approach
3.1.8 Random Projection (RP)
3.1.9 Bottom-Up Approach
3.1.10 Top-Down Approach
3.1.11 The New Top-Down Approach
CHAPTER ONE
INTRODUCTION
1.1 Background of the Study
Data volumes and variety are increasing at an alarming rate, making any attempt to glean useful
information from these large data sets very tedious. Extracting or mining useful information and
hidden patterns from the data is becoming ever more important, but it can also be very challenging
[1]. Research in domains such as Biology, Astronomy, Engineering, Consumer Transactions and
Agriculture deals with extensive sets of observations daily.
Traditional statistical techniques encounter some challenges in analyzing these datasets due to their
large sizes. The biggest challenge is the number of variables (dimensions) associated with each
observation. However, not all dimensions are required to understand the phenomenon under
investigation in high-dimensional datasets; this means that reducing the dimension of the dataset
can improve the accuracy and efficiency of the analysis [2]. In other words, it is of great help if we
can map a set of n points in d-dimensional space into a p-dimensional space, where p << d, so
that the inherent properties of that set of points, such as their inter-point distances and their labels,
do not suffer great distortion. This process is known as dimensionality reduction [3].
A lot of methods exist for reducing the dimensionality of data, and they fall into two categories:
in the first category, each attribute in the reduced dataset is a linear combination of the
attributes of the original dataset. In the second category, the set of attributes in the reduced dataset
is a subset of the set of attributes in the original dataset [4]. Techniques belonging to the first
category include Random Projection (RP), Singular Value Decomposition (SVD), Principal
Component Analysis (PCA), and so on; while techniques in the second category include but are
not limited to the Combined Approach (CA), Direct Approach (DA), Variance Approach (Var),
New Top-Down Approach (NTDn), New Bottom-Up Approach (NBUp), New Top-Down
Approach (modified version) and New Bottom-Up Approach (modified version) [5].
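The difference between the two categories can be illustrated with a small sketch. The snippet below is purely illustrative rather than a thesis implementation: it assumes NumPy and scikit-learn with synthetic data, and the column-selection step is only a simplified stand-in for the Variance Approach described in Chapter 3.

    # Illustrative sketch of the two categories of dimensionality reduction.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))   # 100 observations with d = 10 attributes
    p = 3                            # target dimensionality

    # Category 1: every reduced attribute is a linear combination of the originals.
    X_pca = PCA(n_components=p).fit_transform(X)

    # Category 2: the reduced attributes are a subset of the original attributes
    # (here the p highest-variance columns, a simplified stand-in for the Variance Approach).
    top_cols = np.argsort(X.var(axis=0))[-p:]
    X_subset = X[:, top_cols]

    print(X_pca.shape, X_subset.shape)   # both reduced to (100, 3)

In the first case the reduced attributes no longer correspond to any original measurement, whereas in the second case each retained column keeps its original meaning.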
Dimensionality reduction provides a compact representation of the original high-dimensional data in
which only the vital information is retained, so the reduced data requires no further preprocessing
and can be used with many machine learning algorithms that perform poorly on high-dimensional
data [6]. The calculation of inter-point distances is essential for many machine learning
tasks, and it has been shown that as the dimensionality increases, "the distance of a sample
point to the nearest point becomes very similar to the distance of the sample point to the most
distant point", thereby deteriorating the performance of machine learning algorithms [7].
Therefore, dimensionality reduction is an invaluable preprocessing step before the application of
many machine learning algorithms.
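The distance-concentration effect quoted above can also be observed numerically. The short experiment below is only a sketch (it assumes NumPy and uses synthetic, uniformly distributed points): it measures the relative gap between the farthest and the nearest neighbour of a query point, which shrinks towards zero as the dimension grows.

    # Illustration of distance concentration in high dimensions.
    import numpy as np

    rng = np.random.default_rng(1)
    for d in (2, 10, 100, 1000):
        X = rng.uniform(size=(500, d))            # 500 random points in [0, 1]^d
        q = rng.uniform(size=d)                   # a query point
        dists = np.linalg.norm(X - q, axis=1)     # distance from the query to every point
        gap = (dists.max() - dists.min()) / dists.min()
        print(f"d = {d:4d}: relative gap between farthest and nearest = {gap:.3f}")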
Machine learning is a scientific field in which computer systems automatically and intelligently
learn how to perform a computation and improve on it through experience [8], [9]. Machine
learning algorithms are of two main types: supervised learning algorithms and unsupervised
learning algorithms, and they have been used to solve many complex real-world
problems [10], [11]. In unsupervised learning, the observations are categorized into groups
(clusters) based on the similarity between them; this categorization is otherwise known as
clustering [8]. Many clustering algorithms exist, among which k-means clustering is the most
popular for large numbers of observations [12].
Unlike clustering, classification is a supervised learning method in which the corresponding label
for any valid input is predicted based on a number of training examples referred to as the "training
set" [8], [12]. Classification algorithms can further be categorized into eager and lazy learners,
and this investigation considers one from each category. Eager learning algorithms attempt to
construct a general rule or create a generalization during the training phase which can further be
used in classifying unseen instances [13]. Examples of eager learners include decision trees,
support vector machines, and the perceptron.
The perceptron, an eager learner invented by Rosenblatt [14], is one of the earliest and simplest of
all classification algorithms. It is basically used to classify each point of a data set into either a
positive or a negative label (1 or -1, good or bad, hot or cold, man or woman, etc.) [15]. It is
interesting to note that, in its basic form, it is still as valid as when it was first published [16].
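For concreteness, a minimal sketch of the perceptron learning rule is given below. It assumes NumPy and labels in {+1, -1}, and it illustrates the basic algorithm referenced above rather than the exact implementation used in this thesis.

    # Minimal perceptron sketch: labels must be +1 or -1.
    import numpy as np

    def perceptron_train(X, y, epochs=100, lr=1.0):
        """X: (n, d) feature matrix, y: (n,) labels in {+1, -1}."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # point is misclassified
                    w += lr * yi * xi               # nudge the separating hyperplane
                    b += lr * yi
                    errors += 1
            if errors == 0:                         # converged on the training set
                break
        return w, b

    def perceptron_predict(X, w, b):
        return np.where(X @ w + b >= 0, 1, -1)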
On the other hand, a lazy learner delays any generalization or model construction until it is
presented with an unseen instance to be classified [17]. Because no processing is carried out until
an unseen instance arrives, a lazy learner requires a lot of memory to store all of the training
instances, and it must process them each time a new unseen instance is presented. An example of
a lazy learner is the k-nearest neighbors classifier [18]. In this algorithm, the label of any given
instance is predicted as the label most common among its k nearest neighbors, where k is a
user-defined positive integer, normally with a small value [15].
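A minimal sketch of the k-nearest neighbors rule is shown below (assuming NumPy and Euclidean distance; the function name is illustrative). It makes the lazy character of the algorithm explicit: all computation happens at prediction time, against the stored training set.

    # Minimal k-nearest neighbors sketch: majority vote among the k closest training points.
    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, X_query, k=3):
        y_train = np.asarray(y_train)
        predictions = []
        for q in np.atleast_2d(X_query):
            dists = np.linalg.norm(X_train - q, axis=1)   # distance to every training point
            nearest = np.argsort(dists)[:k]               # indices of the k closest points
            majority = Counter(y_train[nearest]).most_common(1)[0][0]
            predictions.append(majority)
        return np.array(predictions)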
1.2 Problem Statement
In data classification, the corresponding label (class) for any valid input is predicted based on a
number of training examples referred to as the "training set". This is achieved by building a
classifier model: a learning algorithm is applied to a training set made up of past examples that
have the same set of attributes as the unseen example [8], [12]. Before the training starts, the label
of each example in the "training set" is already known [19].
To build a classifier model, an eager learner attempts to construct a general rule in the training
phase which is subsequently used to classify unseen instances, while a lazy learner delays the
process until it is presented with an unseen instance [13]. The main disadvantage of eager learning
is the long time the learner takes to construct the classification model, although once the model is
constructed, an eager learner is very fast at classifying unseen instances; for a lazy learner, the
disadvantage is the amount of memory it consumes and the time it takes during classification [17].
This makes dimensionality reduction a very crucial preprocessing step: it facilitates classification
and compression of high-dimensional data, thereby conserving memory and providing a compact
representation of the original high-dimensional data [5].
Studies have been conducted on how dimensionality reduction techniques affect the performance
of classifiers [20]–[22]. However, very little attention has been given to the extent to which
these reduction techniques facilitate and preserve classification. Therefore, this thesis attempts to
advance the research by investigating the extent to which dimensionality reduction preserves the
classification of the Weather, Student and Ionosphere datasets obtained from the "UCI machine
learning repository", in order to fill the gap in the literature and provide steps for further research
in the area of machine learning.
1.3 Aim and Objectives
The aim of this research is to investigate the extent to which dimensionality reduction techniques
preserve classification.
The objectives of the research are as follows:
1. Implementation of fifteen dimensionality reduction techniques and using these techniques to
reduce the Weather and Student datasets, as well as the Ionosphere dataset obtained from the UCI
machine learning repository [23].
2. Implementation of the perceptron classification algorithm and using it to classify the data points
of a two-class dataset. It shall also be applied to the datasets reduced from this two-class dataset
using the techniques above, and comparisons will be made to determine the extent to which the
reduction methods preserve the classification of the original dataset using the perceptron.
3. Implementation of the k-Nearest Neighbors classification algorithm and comparing the
performance of the dimensionality reduction techniques on preserving the classification of a
dataset by the k-nearest neighbors and perceptron classification algorithms.
4. Using confusion matrices to show the extent to which each dimensionality reduction method
preserves the classification of the original datasets, and comparing the methods with one another.
1.4 Scope and Limitations
This project is limited to showing the extent to which each of the dimensionality reduction methods
implemented in this thesis preserves the classification of the original datasets by the perceptron
and k-Nearest Neighbors classification algorithms. Accuracy will be used as the performance
measure for showing the extent of the classification preservation, and this shall be obtained using
confusion matrices.
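For reference, the sketch below shows how accuracy is read off a two-class confusion matrix (assuming NumPy arrays of true and predicted labels; the helper names are illustrative and not the exact code used in later chapters).

    # Two-class confusion matrix and the accuracy derived from it.
    import numpy as np

    def confusion_matrix_2x2(y_true, y_pred, positive=1):
        tp = np.sum((y_true == positive) & (y_pred == positive))   # true positives
        tn = np.sum((y_true != positive) & (y_pred != positive))   # true negatives
        fp = np.sum((y_true != positive) & (y_pred == positive))   # false positives
        fn = np.sum((y_true == positive) & (y_pred != positive))   # false negatives
        return np.array([[tp, fn], [fp, tn]])

    def accuracy(cm):
        # accuracy = (TP + TN) / (TP + TN + FP + FN)
        return (cm[0, 0] + cm[1, 1]) / cm.sum()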
1.5 Thesis Structure
This thesis consists of five chapters. Chapter 1 introduces dimensionality reduction and discusses
its importance and applications to machine learning tasks. After presenting the problem to be
addressed, the aim of the research is stated and the objectives are outlined.
Chapter 2 presents a review of literature related to dimensionality reduction and machine learning
in general. Existing literature on single-layer neural networks is reviewed.
Chapter 3 describes the methodology used in this thesis. It discusses the methods in detail and
explains how they are applied in achieving the objectives of the thesis. The results obtained from
the methodology are presented and discussed fully in Chapter 4.
Chapter 5, which is the final chapter, provides a summary of the work and the results obtained in
this thesis, concludes the research and also gives a possible recommendation for further research.