ABSTRACT
A huge amount of data is generated daily. This data can be structured or unstructured. Most unstructured data comes from sources such as daily newspapers, social networks (for example, posts on Facebook and Twitter), and event reports (for example, an account of an accident).
One of the biggest challenges in Big Data analysis is the use of unstructured data. The corpus needs to be structured to permit analysis, and one approach to structuring unstructured data is annotation. Annotation can be fully automatic, semi-automatic, or fully manual (performed by a human).
Annotation has been of considerable help in different domains and sectors (machine learning, education, health, commerce, etc.). For example, in supervised machine learning, annotation is used to label the data for the training phase.
In this research we studied and analysed different annotation tools and techniques. The tools were tested and then compared on the features that matter most when choosing a tool.
Keywords: unstructured data, annotation, annotation tools, annotation techniques, ontology, semantic web, semantic annotation
TABLE OF CONTENTS
Abstract
Acknowledgement
Table of contents
List of figures
Chapter One
Introduction
1.1 Research and Background
1.2 Problem statement
1.3 Research aim and objective
1.3.1 Aim
1.3.2 Objective
1.4 Limitation of study
1.5 Research outline
Chapter Two
Literature review
2.1 What is annotation?
2.2 Types of annotations
2.2.1 Text annotation
2.2.2 Image annotation
2.2.3 Video annotation
2.3 Annotation techniques
2.3.1 Manual annotation
2.3.2 Automatic annotation
2.3.3 Semi-automatic annotation
2.3.4 Comparison of annotation techniques
2.4 Advantages of annotation
2.5 Metadata
2.6 Semantic Web
2.7 Semantic annotation
2.8 Ontology
2.9 Related work
Chapter Three
Discussion of annotation Tools
3.1. Machine learning and annotation
3.2. Categories of annotation tools
3.3 Textual data
3.4 Annotating text
3.5 Text annotation tools
3.5.1 Tagtog annotation tool
3.5.1.1 Functionalities of Tagtog annotation tool
3.5.1.1.1 Manual annotation in Tagtog annotation tool
3.5.1.1.2 Semi-automatic annotation in Tagtog annotation tool
3.5.1.1.3 Automatic annotation in Tagtog annotation tool
3.5.2 Annotea annotation tool
3.5.2.1 Functionality of Annotea
3.5.3 Diigo annotation tool
3.5.3.1 Functionality of Diigo annotation tool
3.5.3.2 Diigo features
3.5.4 GATE annotation tool
3.5.4.1 Functionality of GATE annotation tool
3.5.4.1.1 Manual annotation using GATE
3.5.4.1.2 Semi-automatic with GATE
3.5.4.1.3 Automatic annotation with GATE
3.5.5 Doccano annotation tool
3.5.5.1 Doccano features
3.5.5.2 Functionalities of Doccano
3.5.5.3 Functionalities of Doccano
3.5.5.3.1 Manual annotation using Doccano
3.5.5.3.2 Semi-automatic annotation using Doccano
3.5.5.3.3 Automatic annotation using Doccano
3.5.6 Prodigy annotation tool
3.5.6.1 Functionalities of Prodigy annotation tool
3.5.6.1.1 Entity recognition with Prodigy
3.5.6.1.2 Text classification
3.5.6.1.3 Computer vision
Chapter Four
Comparative analysis
4.1 Features used for comparison
4.2 Comparative table of annotation Tools
4.3 Advantages of the annotation tools
4.3.1 Advantages of Tagtog annotation tool
4.3.2 Advantages of Annotea
4.3.3 Advantages of Diigo
4.3.4 Advantages of GATE
4.3.5 Advantages of Doccano
4.3.6 Advantages of Prodigy
Chapter Five
Conclusion and Recommendations
5.1 Conclusion
5.2 Recommendations
References
CHAPTER ONE
Introduction
1.1 Research and Background
In recent years, big data has become a salient research topic in the IT industry. A huge volume of data is generated daily because a very large number of devices are connected to the internet. This data comes from different sources, such as the World Wide Web, e-commerce, and social media platforms like Facebook and Twitter. Big data can be of two types, structured and unstructured. Structured data is data that can be stored in a relational database; it is easy to query and can be analysed seamlessly (Kiefer, 2016). Unstructured data usually comes from email, pictures, documents, video files, audio and other sources. It is difficult to process unstructured data with relational databases, hence the challenge of managing such data. This research studies the technique of annotation; various annotation tools and techniques have been proposed to help in the analysis of big data.
The technique of annotation has played a considerable role in different sectors (education, health, commerce, etc.). For example, in machine learning, annotation tools can be used to label the data sets used for training.
Annotation has different definitions depending on the context of its use. Annotation is a way of interpreting a document (Okunoye, Oladejo, & Odumuyiwa, 2010). Annotation takes different forms, such as labelling an object, making a comment, or tagging images, audio and video. Annotating a document makes it more detailed and informative and makes it easier to query; hence it adds value to the document.
Annotation may be of two types: implicit and explicit (Okunoye et al., 2010). An implicit annotation is one whose meaning is assumed to be understood only by its maker. An explicit annotation, by contrast, is one whose meaning is assumed to be known to a group, a team, or users in the same field of study (Okunoye et al., 2010).
Annotation as object is defined as an intentional and topical value-adding note linked to an extant information object (Bodain & Robert, 2007). Annotation is also defined as “any object (annotation) that is associated with another object (document) by some relationship” (Brusilovsky, 2005). Brusilovsky's (2005) definition considers annotation not only as an object but also as an action that anchors the object to the document concerned. Annotation as action is defined as an act of interpreting a document (Robert, 2007). It is the process of creating an annotation as object and anchoring it to the document object (i.e. the information source being annotated).
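The distinction between annotation as object and annotation as action can be illustrated with a minimal sketch. It is not taken from any of the cited works: the class names, the fields and the choice of character offsets as the anchoring relationship are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """The information source being annotated (hypothetical structure)."""
    doc_id: str
    text: str


@dataclass
class Annotation:
    """Annotation as object: a note linked to a span of a document."""
    doc_id: str   # the document the note is anchored to
    start: int    # character offset where the anchored span begins
    end: int      # character offset just past the anchored span
    note: str     # the value-adding content
    author: str = "unknown"


def annotate(doc: Document, phrase: str, note: str, author: str = "unknown") -> Annotation:
    """Annotation as action: create the note and anchor it to the
    first occurrence of `phrase` in the document."""
    start = doc.text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in document {doc.doc_id}")
    return Annotation(doc.doc_id, start, start + len(phrase), note, author)


doc = Document("d1", "The bus collided with a truck near Lagos.")
ann = annotate(doc, "Lagos", "LOCATION: city in Nigeria", author="reader-1")
anchored_text = doc.text[ann.start:ann.end]  # the span the note points at
```

Here the Annotation instance plays the role of the object, while the call to annotate is the action that creates it and anchors it to the document.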
Annotation is also defined as a way of attaching extra information (metadata) to a database record to provide a better understanding of, and connection to, related information. Annotation can be manual, semi-automatic or automatic.
• Automatic annotation: uses computerized automated tools to annotate a document.
• Semi-automatic annotation: uses computerized automated tools to annotate a document but also requires human intervention.
• Manual annotation: the annotation is done entirely by a human annotator.
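The difference between the three modes can be sketched in a few lines. This is an illustration only: the lookup table (GAZETTEER) stands in for a real automated tool, and the function names and tags are hypothetical rather than drawn from any tool studied in this thesis.

```python
from typing import Callable, List, Tuple

Label = Tuple[str, str]  # (token, tag)

# Hypothetical lookup table standing in for an automated annotation tool.
GAZETTEER = {"Lagos": "LOCATION", "Facebook": "ORGANISATION"}


def automatic(text: str) -> List[Label]:
    """Automatic: the tool alone decides every label."""
    return [(tok, GAZETTEER[tok]) for tok in text.split() if tok in GAZETTEER]


def semi_automatic(text: str, review: Callable[[Label], bool]) -> List[Label]:
    """Semi-automatic: the tool proposes labels and a human
    accepts or rejects each proposal."""
    return [label for label in automatic(text) if review(label)]


def manual(human_labels: List[Label]) -> List[Label]:
    """Manual: every label is supplied by a human annotator."""
    return list(human_labels)


text = "Facebook opened an office in Lagos"
auto = automatic(text)
# The lambda stands in for a human reviewer who keeps only locations.
reviewed = semi_automatic(text, review=lambda label: label[1] == "LOCATION")
by_hand = manual([("office", "FACILITY")])
```

In this sketch the only thing that changes between modes is where the decision is made: entirely in the tool, in the tool with a human filter, or entirely by the human.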
1.2 Problem statement
Useful information can be obtained from unstructured data when it is processed and analysed. Many annotation tools and techniques can be used to structure unstructured data, but because so many are available, it may be difficult to choose a tool that suits the data at hand or the available operating environment. There is a need to study and analyse different annotation tools and techniques to determine their usefulness, usability, strengths and the type of data for which they are most suitable.
This thesis proposes a comparative study of different annotation tools and techniques to facilitate the choice of the best tool when faced with a problem solvable by annotation.
1.3 Research aim and objective
1.3.1 Aim
The aim is a comprehensive review of existing proposals in the field of annotation (theses, articles, software). The result should take the form of a comparative table of existing techniques and tools. The tools discovered should be tested.
1.3.2 Objective
• Study and analyse different existing annotation tools and techniques.
• Test the discovered tools and techniques.
• Compare the studied annotation tools and techniques.
• Draw a comprehensive conclusion and suggest the tools best suited to different contexts of use.
1.4 Limitation of study
This research study is limited to comparing a few existing annotation tools and techniques that can be used in processing unstructured data.
1.5 Research outline
The thesis is divided into five chapters. The remaining chapters cover the following topics and subtopics:
Chapter 2 discusses the basic concepts and reviews the literature related to annotation tools and techniques. Chapter 3 studies and analyses different annotation tools. Chapter 4 presents the comparative analysis and discussion. Lastly, Chapter 5 contains the summary, conclusion, recommendations, and future work.