This research developed a distributed big data fusion architecture for machine-to-machine communication using ensemble learning. It was implemented to mitigate the challenges that characterize the centralized big data fusion architecture commonly adopted through the Hadoop MapReduce platform, namely bandwidth consumption, latency, and high computational cost. A fog computing approach was adopted through the implementation of ensemble learning. Feature engineering was applied to extract information (pixel values, number of layers (nlayers), number of cells (ncell), number of rows (nrow), and coordinates) from the data, and the water body and vegetation indices (NDWI and NDVI) were calculated. The extracted information was used as the training dataset for both the centralized and the distributed architecture, with AdaBoost as the basis of comparison between the two. Performance evaluation was based on bandwidth consumption and latency, and results were presented in the form of confusion matrices. The developed distributed architecture achieved a 31.44-minute improvement in latency and a 1.9% improvement in accuracy over the centralized architecture. A further 5.8% improvement in accuracy and a 4.81-minute improvement in latency were recorded in the performance comparison of the base learner against the ensemble AdaBoost.
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS XVII
CHAPTER ONE: INTRODUCTION
1.1 Background of Study 1
1.2 Statement of Problem 3
1.3 Aim and Objectives 4
1.4 Significance of Research 4
1.5 Scope of Study 4
CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction 7
2.2 Review of Fundamental Concepts 7
2.2.1 Big data 7
2.2.2 Big data processing and management. 9
2.2.3 Big data fusion architecture 10
2.2.3.1 Centralized big data fusion architecture (CBDFA) 12
2.2.3.2 Distributed big data fusion architecture (DBDFA) 13
2.2.4 Hadoop Architecture and Design 14
2.2.4.1 Hadoop distributed file system 14
2.2.4.2 MapReduce 15
2.2.5 The U.S. Joint Directors of Laboratories data fusion model 16
2.2.6 Fog Computing 19
2.2.7 Fog Computing System Architecture 22
2.2.8 Machine-to-Machine Communication 22
2.2.9 Ensemble Learning System 24
2.2.9.1 Combining classifiers 25
2.2.9.2 Combining class labels 25
2.2.9.3 Voting-based fusion techniques 26
2.2.9.4 Majority voting 26
2.10 Algorithms 27
2.10.1 Boosting 27
2.10.2 Stack Generalization 29
2.10.3 K- Means 31
2.10.4 Haar-like feature 32
2.11 OpenCV 33
2.12 Weka 33
2.13 Classification 34
2.13.1 NDVI and NDWI 35
2.13.2 Dataset 36
2.13.3 System specification 37
2.13.4 Storage
2.3 Review of similar works 37
CHAPTER THREE: MATERIALS AND METHODOLOGY
3.1 Introduction 44
3.2 Materials 44
3.2.1 Software 44
3.2.2 Hardware 45
3.3 Methodology 45
3.3.1 Development of Centralized Big data Fusion Architecture 46
3.3.1.1 Data Acquisition 47
3.3.1.2 Pre-Processing 47
3.3.1.3 Data reduction 48
3.3.2 Feature engineering 49
3.3.2.1 Load data 50
3.3.2.2 Extract pixel values 51
3.3.2.3 Extract image coordinates and layers in X, Y table 51
3.3.2.4 Crop and extract pixel values of Region of Interest (RoI) 52
3.3.2.5 Extract NDVI and NDWI 52
3.3.3 Implementation of the U.S. Joint Directors of Laboratories Model 53
3.3.3.1 Object Alignment 53
3.3.3.2 Object Refinement 53
3.3.3.3 Object Identification and Detection using Haar-like feature (AdaBoost) Algorithm 54
3.4 Development of Distributed Big Data Fusion Architecture using Ensemble Learning. 55
3.4.1 Data Description: 56
3.4.2 Data Pre-processing: 56
3.5 Classification 57
3.5.1 WEKA implementation of Ensemble Learning (Adaboost) 57
3.6 Performance Evaluation 59
3.6.1 Bandwidth 59
3.6.2 Latency 59
3.6.3 Confusion Matrix 60
CHAPTER FOUR: RESULTS AND DISCUSSION
4.1 Introduction 61
4.2 Centralized Data Fusion Architecture 61
4.2.1 Data pre-processing 61
4.2.1.1 Data Visualization 62
4.2.1.2 Principal Component Analysis 62
4.2.1.3 Satellite Image Bands Extraction 63
4.2.1.4 Implementation of NDVI and NDWI 64
4.2.2 Classification 65
4.3 Analysis of centralized data fusion implementation 66
4.3.1 Adaboost (Haar) 66
4.3.2 Confusion matrix 69
4.4 Results from Distributed Big Data Fusion Architecture Using Ensemble Learning. 70
4.4.1 Data pre-processing 70
4.4.1.1 Statistical parameters 71
4.4.1.2 Data Visualization using WEKA 72
4.4.2 Implementation of Ensemble Learning Adaboost 73
CHAPTER FIVE: CONCLUSION, SUMMARY AND RECOMMENDATIONS
5.1 Conclusion 77
5.2 Significant contributions 77
5.3 Recommendations for further work 77
5.4 Limitation 78
1.1 Background of Study
Data fusion is a recognized framework used to express the union of data from various sources, in which the methods and instruments for the cooperation of data originating from different sources are imparted (Akita, 2002; Wald, 1998). This approach is centered on the synergy offered by the various sources. The overall idea is to increase the training data for a given learning problem using data obtained from other associated problems, so as to learn from multiple sources. The process is useful in situations where the information contained in a single source is only partially relevant to the problem of interest. With this knowledge, it is advantageous to extract complementary information from many data sources. Data fusion may, however, not be profitable when the different data sources available for a given problem are all incomplete and noisy. The fundamental goal of data fusion is to increase the accuracy of classification, thereby increasing the chances of obtaining valuable information (Wald, 1998). Wald (1998) defined data fusion as a formal framework for combining data obtained from different sources with the sole aim of obtaining information of superior quality, which does not necessarily equate to producing more relevant information. The main reasons necessitating the development of data fusion are the increasing requirement for pertinent information, robust procedures for managing data efficiently, and improved reliability and performance.
Data fusion finds relevance in science, engineering, biometrics, finance, governance, remote sensing and geographic data, and wireless sensor networks, amongst others. The process of combining information from several sources is categorized into low-level, intermediate-level, and high-level data fusion, the categorization depending on the stage at which the data are fused (Esteban et al., 2005). Low-level fusion, also known as raw data fusion, is the combination of raw data from multiple origins into new raw data that should be more explanatory, while intermediate-level, or feature-level, fusion combines data of different origins (Esteban et al., 2005). The intermediate level requires the extraction of different features from the raw data sources, which are then aggregated into a unified feature set; this level can therefore be referred to as fusion based on feature extraction, which is object-dependent. High-level fusion, by contrast, is an approach that depends on decisions, combining the results of different algorithms to reach a final fused decision. Several approaches to this level of fusion exist, such as statistical methods, majority voting, and ensemble methods, amongst others. Consequently, there are certain trade-offs associated with each level of fusion, chief of which is the choice of the most appropriate fusion level for the data available. Other identified trade-offs are highlighted in Table 1.1 (Esteban et al., 2005).
Table 1.1: Tradeoff for Different Levels of Data Fusion (Esteban et al., 2005)
Several data fusion architectures have been developed and implemented. In a multisensor system, the purpose is to extract information from the several available sources using an appropriate method such as Bayesian inference, ensemble learning, Dempster-Shafer evidence theory, fuzzy logic, or neural networks (Alsheikh, 2014).
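Among the methods named above, decision-level fusion by majority voting is the simplest to illustrate. The following sketch, with invented classifier outputs (the labels "water" and "vegetation" stand in for the classes used later in this work), shows how class labels from several classifiers can be fused into one decision:

```python
from collections import Counter

def majority_vote(predictions):
    """Fuse class labels from multiple classifiers by simple majority voting.

    Ties resolve to the first-seen label among the most common ones.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Hypothetical per-pixel labels from three independent classifiers
votes = ["water", "vegetation", "water"]
print(majority_vote(votes))  # -> water
```

Weighted voting and more elaborate combiners (e.g. boosting) generalize this idea by letting more reliable classifiers contribute more to the fused decision.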
Big Data Fusion (BDF) systems learn and update data models as they process data. This holds for both static and real-time data: at each level of processing, the models are updated, and discovered relations or patterns are applied almost immediately to the fusion task (Wheeler, 2014). Cloud computing requires data to be preloaded directly into the data centre before a user runs an application in the cloud. This architecture operates on the parallelism of federated systems/servers communicating and performing a given task (Abhishek & Benjamin, 2012; Calheiros et al., 2011). The demand for optimizing network resources economically, together with the technical cost of managing data and IT infrastructure remotely, necessitated the cloud computing technique. BDF has been implemented on cloud computing architectures, as reported in several works of literature (Abhishek & Benjamin, 2012; Calheiros et al., 2011). This accounts for large-scale data management, processing, storage, analytics, and fusion. Major cloud computing service providers, including Microsoft, Facebook, Google, and Amazon (Evans, 2011), manage millions of gigabytes of data generated daily by billions of network-enabled devices connected to the internet, thereby giving birth to the era of big data, the Internet of Things (Sneps-Sneppe & Namiot, 2012), and machine-to-machine (M2M) communication (X.-W. Chen & Lin, 2014; Evans, 2011).
1.2 Statement of Problem
The popular centralized big data fusion approach employs the Hadoop MapReduce architecture, which is characterized by large communication latency between network devices and high communication bandwidth consumption. This latency and bandwidth usage is not computationally cost-effective. It is therefore imperative to develop a distributed architecture that fuses big data in a manner that brings processing and fusion closer to the point where the data is generated, thereby mitigating the challenges observed in the centralized architecture.
1.3 Aim and Objectives
The aim of this research is to develop a distributed big data fusion architecture for machine-to-machine (M2M) communication using ensemble learning.
The objectives of this research are as follows:
1) To develop a centralized big data fusion architecture.
2) To develop a distributed big data fusion Architecture leveraging on Fog Computing technique using ensemble learning.
3) To evaluate the performance of the developed architectures using accuracy and latency as performance metrics by:
I. Comparing the performance of AdaBoost in the centralized data fusion architecture with AdaBoost in the distributed data fusion architecture.
II. Comparing the performance of AdaBoost in the distributed architecture with ensemble AdaBoost in the distributed data fusion architecture.
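The base-learner versus ensemble comparison in objective 3 can be sketched as follows. This is an illustrative sketch only: it uses scikit-learn's `AdaBoostClassifier` rather than the WEKA implementation used in this work, and `make_classification` generates synthetic data standing in for the extracted pixel-feature dataset; both choices are assumptions for the sake of a runnable example.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted pixel-feature training dataset
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stump = DecisionTreeClassifier(max_depth=1)                 # base learner
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)  # ensemble

# Report accuracy and training time for each classifier
for name, clf in [("base stump", stump), ("AdaBoost", boosted)]:
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy={clf.score(X_te, y_te):.3f}, fit time={elapsed:.2f}s")
```

In the same spirit, the thesis compares accuracy (via confusion matrices) and latency between the base learner and the ensemble, and between the centralized and distributed architectures.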
1.4 Significance of Research
The significance of this research is to ensure that data is fused in a distributed manner at the network edge using fog computing techniques, and to demonstrate the use of the Waikato Environment for Knowledge Analysis (WEKA) as a tool for data fusion through the use of ensemble methods.
1.5 Scope of the Study
The scope of this study is to use the WEKA data mining suite to implement a distributed data fusion architecture using one terabyte of spatial image data.