ABSTRACT
This research investigated image recognition frameworks on datasets of visible-light and infrared (IR) imagery using deep convolutional neural networks (CNNs), chosen because of their recent success on a variety of problems, including computer vision, where they often surpass state-of-the-art methods. Three deep-learning-based object recognition approaches were investigated on a fused version of the images in order to exploit the synergistic integration of the information obtained from different spectra of the same data, with a view to improving the overall classification accuracy. Firstly, a simple 3-layer experimental deep network was designed and trained on the datasets to perform recognition. In a second experiment, a pre-trained 16-layer convolutional neural network (imagenet-vgg-verydeep-16) was used to extract features from the datasets; these features were then used to train a logistic regression classifier to perform the recognition. Finally, an experiment was conducted in which another pre-trained model (imagenet-vgg-f) was fine-tuned to suit the datasets' classes and then retrained accordingly on the datasets using back propagation. This research adopted a simple and novel fusion strategy in which the IR and visible images were fused by concatenating the IR image as an additional fourth layer of the visible image, with a view to enhancing the performance of object recognition systems. Despite its simplicity and the limited size of the training data, the 3-layer network achieved classification accuracies of 86.8590% on the fused multimodal images, 85.0610% on the visible images and 67.9878% on the infrared images. These results represent respective improvements of 4.76%, 3.16% and 13.99% over those of the CNN architecture of Zhang et al. (2015). Also, an improvement of 4.06% on visible images was obtained over Zhang et al.'s Gnostic Field + CNN model. It was also observed that while the pre-trained model performed well, fine-tuning improved the performance to 100% classification accuracy. Results obtained were compared with those of Zhang et al. (2015) as a means of validation. This work was implemented using the MATLAB programming language, the MatConvNet library for CNNs and the LIBLINEAR library for large-scale linear classification.
TABLE OF CONTENTS
COVER PAGE
TITLE PAGE
DECLARATION
CERTIFICATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF APPENDICES
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS

CHAPTER 1: INTRODUCTION
1.1 BACKGROUND
1.1.1 Traditional Recognition Approaches
1.1.2 Multimodal Images
1.1.3 Fusion of IR and Visible Imagery
1.2 STATEMENT OF RESEARCH PROBLEM
1.3 MOTIVATION
1.4 AIM AND OBJECTIVES
1.5 METHODOLOGY
1.6 DISSERTATION ORGANIZATION

CHAPTER 2: LITERATURE REVIEW
2.1 INTRODUCTION
2.2 REVIEW OF FUNDAMENTAL CONCEPTS
2.2.1 Machine Learning
2.2.1.1 Machine learning categories
2.2.2 Computer Vision
2.2.2.1 Object recognition
2.2.3 Artificial Neural Networks
2.2.3.1 Structure of a biological neuron
2.2.3.2 Neural network topology
2.2.3.3 Activation function
2.2.3.4 Multi-layer perceptron
2.2.4 Training Algorithm: The Back Propagation Algorithm
2.2.5 Deep Learning
2.2.5.1 Deep neural networks (DNN)
2.2.5.2 Convolutional neural networks
2.2.5.3 Advantages of using deep CNN
2.3 REVIEW OF SIMILAR WORKS

CHAPTER 3: METHODOLOGY
3.1 INTRODUCTION
3.2 DATA COLLECTION
3.2.1 VAIS Dataset
3.2.2 RGB-NIR Scene Dataset
3.3 DESIGNING A SIMPLE EXPERIMENTAL NETWORK
3.4 USING PRE-TRAINED NETWORK FOR CLASSIFICATION
3.4.1 Extracting Features Using CNN Pre-trained on ImageNet (VGGNet)
3.4.2 VGGNet
3.4.3 Fine-tuning the Pre-trained CNN with Backpropagation
3.4.4 AlexNet/ImageNet
3.5 PERFORMANCE METRIC

CHAPTER 4: RESEARCH METHODOLOGY
4.1 INTRODUCTION
4.2 PRELIMINARY RESULTS
4.3 MODIFYING THE CNNs TO ACCEPT 4-LAYER IMAGES
4.3.1 Results for Modified 3-layer CNN
4.3.2 Results for Modified Pre-trained CNN
4.4 EXTENDING EXPERIMENTS ON RGB-NIR SCENE DATASET
4.4.1 Results for Training from Scratch Using RGB-NIR Scene Dataset
4.4.2 Results for Fine-tuning Using RGB-NIR Scene Dataset
4.5 VALIDATION

CHAPTER 5: CONCLUSION AND RECOMMENDATION
5.1 SUMMARY
5.2 SIGNIFICANT CONTRIBUTIONS
5.3 CONCLUSION
5.4 LIMITATIONS
5.5 RECOMMENDATIONS FOR FURTHER WORK

REFERENCES
APPENDICES
CHAPTER ONE
INTRODUCTION

1.1 BACKGROUND

In the field of computer vision, object recognition, which deals with the task of identifying or locating objects in images or video sequences, has attracted tremendous research interest over the past decades. This is due to its increasing usefulness in several everyday applications such as law enforcement, surveillance systems, information safety, computer security and secured financial transactions, amongst others (Kong et al., 2007). Humans naturally recognize a multitude of objects in images with almost no effort, regardless of variations in viewpoint, size or scale, and even when the images are rotated or translated. More interestingly, humans recognize objects even when they are partially obstructed from view. Performing these tasks with a computer still remains challenging in computer vision. Object recognition systems have a wide range of applications in areas related to surveillance, safety and security, access control, information security and identity fraud. These include optical character recognition (OCR), android eyes object recognition, image panoramas, image watermarking, global robot localization, face detection, manufacturing quality control, content-based image indexing, object counting and monitoring, visual positioning and tracking, automated vehicle parking systems and video stabilization. Hence, researchers over the decades have developed several techniques to solve this problem (Chang et al., 2008).

1.1.1 Traditional Recognition Approaches
Over the past decades, many approaches have been implemented to perform object recognition. In particular, traditional recognition approaches usually rely on a multitude of fixed, hand-designed (engineered) features, i.e. fixed kernels such as edges, the scale-invariant feature transform (SIFT) and histograms of oriented gradients (HOG), which are used to extract features from an image that are later fed to a simple trainable classifier (such as an SVM). Today, rather than relying on these hand-engineered features, recent research in end-to-end feature learning makes predictions using sequences of non-linear processing stages that learn representations of the data at multiple levels of abstraction, so that the resulting intermediate representations can be interpreted as feature hierarchies, with the whole system being jointly learned from data. This method is called deep learning, and it has proven to surpass the traditional approaches by far. Deep learning has revolutionized the general fields of pattern recognition and machine learning and basically exploits the idea of extracting features automatically and hierarchically (LeCun et al., 2015; Schmidhuber, 2015), as depicted in Figure 1.1.
Figure 1.1: A Block Diagram of Deep Learning Architecture
In particular, a specialized type of network called the convolutional neural network has demonstrated massive power as an end-to-end learning architecture (Krizhevsky et al., 2012). It is considered among the most powerful and classic supervised methods widely used today in pattern recognition and machine learning, especially for object recognition. It is well accepted in the scientific community and has already been deployed by renowned corporations such as Facebook and Google to solve image auto-tagging (Rosenberg, 2013) and face recognition problems (Taigman et al., 2014).

1.1.2 Multimodal Images
Multimodal images are important because visible images are sensitive to varying illumination conditions, which degrades the performance of systems built on these images, especially under poor illumination (Chang et al., 2008). To ameliorate this problem, the data are complemented by an infrared (IR) spectrum version of the image. This is possible because lighting conditions have little effect on IR images, as the images are formed from emissions that are intrinsic to the objects. Also, despite the massive performance shown by state-of-the-art techniques, several other factors affect the performance of an object recognition system, especially when visible images are involved, including illumination changes, pose variations and occlusions. Fortunately, IR sensors, unlike visible sensors, measure an object's energy radiation, which is less sensitive to illumination changes and is in fact measurable in the dark (Zhang et al., 2015). IR images fundamentally provide a different view of the scene, since illumination or visually reflective information does not generally affect IR measurements, and the structures, shapes and intensities seen in IR images are formed by a different phenomenon. For example, infrared images also contain gradients and textures which arise from the complicated heat transfer between internal and external heat sources and the physical object, rather than from visible illumination. Interestingly, it is possible to correlate features in the visible and IR spectra, since the structures and composition of the objects captured can easily be recognized in both (Hariharan et al., 2006; Ma et al., 2015; Zhang et al., 2015). The robustness of IR images to illumination changes allows many restrictions usually placed on visible-spectrum algorithms, restrictions which stem primarily from issues of incident illumination on an object, to be relaxed; however, IR imagery also presents new challenges of its own. In particular, IR images do not provide data at as high a resolution as visible images. An example is face detection in low-light environments where, instead of attempting to adjust the detection algorithm to cope with poor illumination, reliance is shifted to the IR imagery, which is guaranteed to perform consistently in poor or nominal illumination (Chang et al., 2008).
1.1.3 Fusion of Infrared and Visible Imagery
In order to exploit the synergistic integration of the information obtained from different spectra of the same data, with a view to improving the overall classification accuracy, fusion approaches have been proposed in the literature with the expectation that combining data from multiple sources yields a more informative representation that guarantees better recognition accuracy than the original data (Chang et al., 2008; Kong et al., 2007).
Generally, the information obtained from sensors contains redundancy and may not contribute to actual knowledge of the scene to be analyzed. Information obtained from different means can provide complementary details of the analyzed object. Therefore, the goal of information fusion approaches is to exploit these complementary portions, and developing algorithms capable of visualizing this information is of great importance (Singh et al., 2004). Several works have reviewed techniques for image fusion for the purpose of visualization, and the fusion of data from visible and IR sensors has led to improved target recognition rates (Zhou & Hecht-Nielsen, 1993). The choice of an appropriate fusion scheme depends on both the application and the data. In this research context, visible cameras capture the light reflected by the object, while infrared cameras measure the thermal or heat energy it emits. Since the surface reflectance of the object and its temperature are unrelated, the information contained in the visible and IR images is complementary and independent.
Pixel-by-pixel fusion does not generally preserve spatial information in the image, unlike multiple-resolution fusion, which allows features with varying spatial extent to be fused at their most salient resolutions. Hence, important features which appear at lower resolutions can be preserved by the fusion process. Basically, in multi-resolution fusion, a multiscale transform is performed on each source image, after which a composite multiscale representation is constructed based on some specific fusion rules. The fused image is obtained by taking the inverse of the multiscale transform (Piella, 2003). Popular multiscale techniques include the Fourier transform, the Laplacian pyramid and wavelets. High-frequency components are relatively independent of global illumination changes, while low frequencies take into account the spatial relationships that exist among the pixels and are less sensitive to small changes and noise.
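For concreteness, the following is a minimal MATLAB sketch of this multi-resolution idea using a Laplacian pyramid, with a max-absolute-value rule for the detail levels and averaging for the base level. The pyramid depth, fusion rules and file names are illustrative assumptions and not the fusion scheme adopted in this work.

```matlab
% Minimal Laplacian-pyramid fusion sketch (illustrative; not the scheme used here).
vis = im2double(rgb2gray(imread('visible.png')));   % hypothetical file names
ir  = im2double(imread('infrared.png'));            % assumed co-registered, same size
nLevels = 4;                                        % assumed pyramid depth

% Build Laplacian pyramids: each level stores the detail lost when halving the size.
pyrVis = cell(1, nLevels);  pyrIr = cell(1, nLevels);
for k = 1:nLevels
    visSmall = imresize(vis, 0.5);  irSmall = imresize(ir, 0.5);
    pyrVis{k} = vis - imresize(visSmall, [size(vis,1) size(vis,2)]);
    pyrIr{k}  = ir  - imresize(irSmall,  [size(ir,1)  size(ir,2)]);
    vis = visSmall;  ir = irSmall;
end

% Fusion rules: keep the larger-magnitude detail coefficient, average the base level.
fused = (vis + ir) / 2;                              % coarsest (base) level
for k = nLevels:-1:1
    useVis = abs(pyrVis{k}) >= abs(pyrIr{k});
    detail = pyrVis{k} .* useVis + pyrIr{k} .* ~useVis;
    fused  = imresize(fused, [size(detail,1) size(detail,2)]) + detail;
end
imshow(fused, []);
```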
In contrast to such multiscale schemes, the fusion strategy studied in this work adopts the simple idea of concatenating the infrared image, which is a single-channel image, as a fourth layer to the visible image, which is a three-channel RGB image. This is because each sensing modality has its own benefits and drawbacks, and neither modality alone can completely solve the problem of robust object recognition. The fusion process therefore attempts to maximize the advantage of each modality by intelligently fusing their individual information while at the same time overcoming the limitations of each modality when used alone (Kong et al., 2007; Ma et al., 2015).
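In MATLAB, the channel concatenation described above reduces to a single cat call along the third (channel) dimension. The file names and the im2single conversion below are illustrative assumptions; the visible and IR images are assumed to be co-registered and of the same spatial size.

```matlab
% Stack the single-channel IR image as a fourth channel of the visible RGB image.
rgb   = im2single(imread('visible.png'));    % H x W x 3 (hypothetical file name)
ir    = im2single(imread('infrared.png'));   % H x W, assumed co-registered with rgb
fused = cat(3, rgb, ir);                     % H x W x 4 array fed to the CNN
```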
1.2 STATEMENT OF RESEARCH PROBLEM

Images acquired using visible sensors are sensitive to factors such as varying illumination conditions, scattering and absorption, which degrade the performance of object recognition systems built on these images, especially under poor illumination conditions (Chang et al., 2008). Interestingly, lighting conditions have little or no effect on IR sensors, as the images are formed from emissions that are intrinsic to the objects, and IR sensors are in fact operable in the dark (Zhang et al., 2015). Hence, IR images fundamentally provide a different view of the scene, and the structures, shapes and intensities seen in IR images are formed by a different phenomenon. According to the literature (Hariharan et al., 2006; Ma et al., 2015; Zhang et al., 2015), it is possible to correlate features in the visible and IR spectra, since the structures and composition of the objects captured can easily be recognized in both. This research therefore exploited the capabilities of deep convolutional neural network (CNN) architectures as end-to-end learning architectures to recognize objects in multimodal images by synergistically integrating the information obtained in the visible and infrared spectra of the same data, with a view to playing a critical role in perception systems and substituting for standard computer vision approaches.

1.3 MOTIVATION
Computer vision applications are becoming more important by the day, and the technology is far from mature. For example, our roads, cities and general environment have very unpredictable dynamics, where multiple actors such as animals, pedestrians, vehicles and street furniture coexist. Hence, autonomous vehicles need a robust perception system in order to understand the environment correctly, interpret what is happening in their surroundings and consequently act upon it. These applications usually obtain information about their environment by means of on-board sensors. Today, affordable sensors such as multi-layer laser scanners and depth- and appearance-capturing cameras are available on the market and are being incorporated into mobile robots and vehicles. Consequently, new information sources are available for use in perception systems. On the software side, significant advances have been made in machine learning, computer vision and mobile robotics research in recent years. The area of deep learning, and specifically advances in convolutional neural networks (CNNs), has shown promising results. This follows the breakthrough work of Krizhevsky et al. (2012) on the ImageNet 2012 classification benchmark, which brought the state-of-the-art error rate from 26.1% down to 15.3% (Krizhevsky et al., 2012). Since then, many of the winning entries in the competition have been obtained through expansions and modifications of the so-called AlexNet. In fact, the ImageNet benchmark error rate has been reduced to 6.5% by using an even deeper network (Szegedy et al., 2015). Today, CNN-based methods surpass state-of-the-art solutions on a variety of problems such as document recognition, object classification and detection, and semantic segmentation and understanding, even beating human capabilities (He et al., 2015). Therefore, CNNs are of special interest, either for robust feature extraction replacing the traditional hand-crafted features, or as end-to-end trainable systems. This context motivated this research, in which the capabilities of deep CNN architectures are employed to recognize objects in multimodal images with a view to playing a critical role in perception systems and substituting for standard computer vision approaches.

1.4 AIM AND OBJECTIVES

The aim of this research is to perform object recognition in visible-light and infrared (IR) images using deep convolutional neural networks. In order to achieve this aim, the following objectives were pursued:
i) To construct a representative dataset (adopted from existing datasets).
ii) To design and train a simple experimental deep convolutional neural network architecture and investigate its performance on visible, IR and their fused multimodal versions.
iii) To train a logistic regression classifier using features extracted with the standard pre-trained 16-layer CNN (imagenet-vgg-verydeep-16) from the visible and IR images (a sketch of this feature-extraction pipeline is given after this list).
iv) To fine-tune a standard pre-trained model (imagenet-vgg-f) and retrain it accordingly, based on the dataset's classes, using the back propagation algorithm on visible, IR and their fused multimodal versions.
v) To compare the performance of the proposed frameworks with that of Zhang et al. (2015) on the VAIS dataset, as a means of validation, using classification accuracy as the performance metric.
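As a rough illustration of objective iii), the sketch below extracts deep features with MatConvNet's SimpleNN interface and trains an L2-regularized logistic regression classifier with LIBLINEAR's MATLAB interface. The image file names, labels, the choice of the penultimate activation as the feature vector and the preprocessing details are illustrative assumptions rather than the exact pipeline used in this work.

```matlab
% Feature extraction with a pre-trained VGG-16 model, then logistic regression.
net = load('imagenet-vgg-verydeep-16.mat');      % MatConvNet model file
% net = vl_simplenn_tidy(net);                   % may be required by newer MatConvNet

imgFiles = {'ship1.png', 'ship2.png'};           % hypothetical training images
labels   = [1; 2];                               % hypothetical class labels
feats    = [];

for i = 1:numel(imgFiles)
    % Resize to the network input size and subtract the training-set mean.
    im = single(imresize(imread(imgFiles{i}), net.meta.normalization.imageSize(1:2)));
    im = bsxfun(@minus, im, net.meta.normalization.averageImage);

    % Forward pass; use a late fully-connected activation as the image descriptor.
    res = vl_simplenn(net, im);
    f   = squeeze(res(end-2).x);                 % assumed penultimate (fc7-like) output
    feats(i, :) = double(f(:))';                 %#ok<SAGROW>
end

% L2-regularized logistic regression with LIBLINEAR ('-s 0'); features must be sparse.
model     = train(labels, sparse(feats), '-s 0');
predicted = predict(labels, sparse(feats), model);
```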
1.5 METHODOLOGY

In order to achieve the objectives highlighted in Section 1.4, the following research methodology was adopted:
i) Data collection of comprehensively labeled visible and infrared images of predefined categories. Two datasets were used for this research: the VAIS dataset (Zhang et al., 2015) for sea-vessel classification, consisting of 1623 visible images, 1242 IR images and 1088 visible/IR paired images, and the RGB-NIR Scene dataset.
ii) Design of a simple experimental deep CNN architecture for investigating the different recognition frameworks proposed. It has a total of 10 layers: 3 convolutional layers, 2 max-pooling layers, 2 dropout layers, 2 ReLU layers and a final loss layer (a sketch of one possible layer ordering is given after this list).
iii) Training a network on the datasets using the designed architecture to perform recognition tests. The back propagation algorithm was used to train the designed network.
iv) Training a logistic regression classifier using features extracted from the datasets with the standard pre-trained 16-layer CNN (imagenet-vgg-verydeep-16).
v) Fine-tuning a standard pre-trained model (imagenet-vgg-f) and retraining it accordingly, based on the dataset's classes, using the back propagation algorithm.
vi) Performing steps iii) and v) on the fused combination of visible and IR imagery and comparing the results with those obtained on the VAIS dataset, using the percentage classification accuracy as the metric.
vii) Evaluating the performance of the proposed frameworks against the results obtained by Zhang et al. (2015) on the VAIS dataset, using the percentage classification accuracy as the metric.
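The 10-layer experimental network mentioned in item ii) can be written in MatConvNet's SimpleNN format roughly as shown below. The layer ordering, filter sizes, channel counts, dropout rate, assumed 32x32x4 (fused RGB+IR) input and assumed six output classes are illustrative choices; the dissertation fixes only the layer counts.

```matlab
% Illustrative 10-layer SimpleNN definition (3 conv, 2 ReLU, 2 max-pool,
% 2 dropout, 1 softmax-loss); all hyperparameters are assumptions.
f = 0.01;                                % weight initialization scale
net.layers = {};
net.layers{end+1} = struct('type','conv', ...
    'weights', {{f*randn(5,5,4,32,'single'), zeros(1,32,'single')}}, ...
    'stride',1, 'pad',2);                % input assumed 32 x 32 x 4 (RGB + IR)
net.layers{end+1} = struct('type','relu');
net.layers{end+1} = struct('type','pool', 'method','max', ...
    'pool',[2 2], 'stride',2, 'pad',0);
net.layers{end+1} = struct('type','dropout', 'rate',0.5);
net.layers{end+1} = struct('type','conv', ...
    'weights', {{f*randn(5,5,32,64,'single'), zeros(1,64,'single')}}, ...
    'stride',1, 'pad',2);
net.layers{end+1} = struct('type','relu');
net.layers{end+1} = struct('type','pool', 'method','max', ...
    'pool',[2 2], 'stride',2, 'pad',0);
net.layers{end+1} = struct('type','dropout', 'rate',0.5);
net.layers{end+1} = struct('type','conv', ...  % final conv acts as the classifier layer
    'weights', {{f*randn(8,8,64,6,'single'), zeros(1,6,'single')}}, ...
    'stride',1, 'pad',0);                      % six output classes assumed
net.layers{end+1} = struct('type','softmaxloss');
```

Training such a network with back propagation can then be carried out with the SGD training scripts shipped with MatConvNet's examples (e.g. cnn_train) or an equivalent loop built around vl_simplenn.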
1.6 DISSERTATION ORGANIZATION

The general introduction has been presented in chapter one. The rest of the dissertation is structured as follows. Chapter two presents a detailed review of related literature and of relevant fundamental concepts: machine learning and its categories, computer vision, object recognition, artificial neural networks and deep learning. Chapter three gives an in-depth explanation of the design of the simple 3-layer CNN, describes how a pre-trained model can be used to extract features and eventually perform classification, and explains how the pre-trained CNN is fine-tuned and retrained using back propagation; the statistics of the datasets as well as the models and frameworks used in the research are also described there. Chapter four presents the analysis, performance and discussion of the results. Finally, the conclusion and recommendations for further work make up chapter five. The list of cited references follows chapter five, and the MATLAB code is provided in the appendices at the end of this dissertation.