ABSTRACT
In this project, we shall implement the hierarchical clustering algorithm and apply it to several data sets, including the weather data set, the student data set, and the patient data set. We shall then reduce these data sets using the following dimensionality reduction approaches: Random Projections (RP), Principal Component Analysis (PCA), Variance (Var), the New Random Approach (NRA), the Combined Approach (CA) and the Direct Approach (DA).
The Rand index and the adjusted Rand index (ARI) will be implemented to measure the extent to which a given dimensionality reduction method preserves the hierarchical clustering of a data set. Finally, the six reduction methods will be compared on runtime, inter-point distance preservation, variance preservation and preservation of the hierarchical clustering of the original data set.
TABLE OF CONTENTS
DECLARATION
ABSTRACT
ACKNOWLEDGEMENT
DEDICATION
LIST OF FIGURES
LIST OF TABLES
1 INTRODUCTION
2 HIERARCHICAL CLUSTERING
1.1 SNIPPET OF CLUSTERED DATA
3 DIMENSIONALITY REDUCTION TECHNIQUES
3.1.1 RANDOM PROJECTIONS (RP)
3.1.2 PRINCIPAL COMPONENT ANALYSIS (PCA)
3.1.3 NEW RANDOM APPROACH
3.1.4 VARIANCE
3.1.5 COMBINED APPROACH
3.1.6 DIRECT APPROACH
4 IMPLEMENTATION
4.1.1 RANDOM PROJECTION (RP)
4.1.2 PRINCIPAL COMPONENT ANALYSIS (PCA)
4.1.3 NEW RANDOM APPROACH
4.1.4 VARIANCE
4.1.5 DIRECT APPROACH
4.1.6 COMBINED APPROACH
5 RAND INDEX
6 CONCLUSION
7 REFERENCES
8 Appendix A MATLAB CODES USED FOR IMPLEMENTATION
CHAPTER ONE
INTRODUCTION
Given a data set containing n points in a high-dimensional space, it is often helpful to project it onto a lower-dimensional space without great distortion. This process is called dimensionality reduction. Essentially, dimensionality reduction reduces the number of variables under consideration so that the relevant information is retained while the amount of data shrinks.
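To make the idea of projecting "without great distortion" concrete, the following is a minimal Python sketch of a Gaussian random projection. It is not the MATLAB code from the project's appendix. The function names, the synthetic uniform data, and the choice of 200 original and 50 reduced dimensions are all assumptions made for illustration.

```python
import math
import random


def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    # Entries are drawn from N(0, 1/k), so squared distances are
    # preserved in expectation (an assumption of this sketch).
    scale = 1.0 / math.sqrt(k)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [
        [sum(p[i] * matrix[i][j] for i in range(d)) for j in range(k)]
        for p in points
    ]


def distance_ratios(original, reduced):
    """Ratio of reduced to original Euclidean distance for every pair of points."""
    ratios = []
    for a in range(len(original)):
        for b in range(a + 1, len(original)):
            before = math.dist(original[a], original[b])
            after = math.dist(reduced[a], reduced[b])
            ratios.append(after / before)
    return ratios


# Hypothetical data: 10 points drawn uniformly from a 200-dimensional unit cube.
data_rng = random.Random(1)
data = [[data_rng.uniform(0.0, 1.0) for _ in range(200)] for _ in range(10)]

reduced = random_projection(data, k=50)
ratios = distance_ratios(data, reduced)
print(f"min ratio {min(ratios):.2f}, max ratio {max(ratios):.2f}")
```

A ratio close to 1 for a pair of points means that their distance survived the reduction almost unchanged. The spread of the ratios is one simple way to quantify inter-point distance preservation.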
Dimensionality reduction helps to reduce the runtime of algorithms whose runtime depends on the dimension of the working space. It also broadens the choice of methods available for data processing, and it provides complexity control, which helps to avoid overfitting the training data.
Dimensionality reduction can be applied in several domains, including text data, image data, nearest neighbor search, and clustering and classification. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning. Classification, on the other hand, is a method of supervised learning. The task of the supervised learner is to predict the value of the target function for any valid input after having seen a number of training examples (i.e. pairs of input and target output). As mentioned above, this project focuses on the categorization of data using hierarchical clustering.
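As an illustration of clustering and of the Rand index named in the abstract, the following is a minimal Python sketch, not the project's MATLAB implementation. It assumes single-linkage agglomerative clustering (the linkage criterion used in the project is not stated in this section). The function names and the six sample points are hypothetical.

```python
import math
from itertools import combinations


def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering; returns one label per point."""
    # Start with every point in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(
                math.dist(points[i], points[j])
                for i in clusters[a]
                for j in clusters[b]
            )
            if best is None or d < best[0]:
                best = (d, a, b)
        # Merge the two closest clusters (b > a, so deleting b keeps a valid).
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical sample: two tight groups and one isolated point.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (9.0, 0.0)]
labels = agglomerative(points, n_clusters=3)
print(labels)

# The Rand index ignores label names, so a relabelled copy scores 1.0.
print(rand_index(labels, [1, 1, 1, 0, 0, 2]))
```

To measure how well a reduction method preserves the clustering, the same comparison can be made between the labels obtained from the original data and the labels obtained from the reduced data. A Rand index of 1 means that the two clusterings group every pair of points identically.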