Name: An Improved Job Scheduling Algorithm In Grid Computing Environment Using Fault Tolerance Mechanisms

Format: PDF and MS Word (DOC)
pages = 65


ABSTRACT

A grid scheduler schedules user jobs on the best available resource, in terms of resource characteristics, by optimizing time. Resource failure in a grid is no longer an exception but a regular event, and grid resources are increasingly being used by the scientific community to solve computationally intensive problems that typically run for days or even months. It is therefore essential that long-running applications are able to tolerate failures and avoid recomputing a task from scratch when a resource failure occurs, in order to satisfy the user's QoS requirements. An Improved Job Scheduling Algorithm in Grid Computing Environment Using Fault Tolerance Mechanisms is proposed. The technique employed uses the resource failure rate together with a checkpoint-based rollback recovery strategy. Checkpointing aims at reducing the amount of work that is lost upon failure of the system by intermittently saving the state of the system. In a comparison of the proposed approach with Moallem's ACO, the results show that the proposed algorithm achieved up to a 13% reduction in makespan, a 12% increase in throughput and a 12% improvement in ATA when the Gridlets are varied and the resources are kept constant. When the resources are varied and the Gridlets are kept constant, the proposed algorithm achieved an 18% reduction in makespan, an 18% increase in throughput and up to a 14% improvement in ATA.

DECLARATION
CERTIFICATION
DEDICATION
ACKNOWLEDGEMENT
LIST OF FIGURES
LIST OF TABLES
CHAPTER ONE: INTRODUCTION
1.1 Background of the Study
1.1.1 General Issues in Grid System: Principles
1.2 Problem Statement
1.3 Research Question
1.4 Research Motivation
1.5 Research Aim and Objectives
1.6 Research Methodology
1.7 Scope and Limitation
1.8 Contribution to Knowledge
1.9 Dissertation Organization
1.10 Definition of Terms
CHAPTER TWO: LITERATURE REVIEW
2.1 Introduction
2.2 The Grid System
2.2.1 Types of grid system
2.2.2 Application of grid computing
2.2.3 Challenges to grid computing
2.3 Resource Management System
2.4 Grid Resource Scheduling
2.5 Artificial Intelligence (AI)
2.5.1 Using swarm intelligence techniques
2.6 Fault Tolerance
2.6.1 Checkpointing
2.7 Checkpoint Interval
2.8 Architecture of Fault Tolerance in Grid Computing
2.9 Review of Some Proposed ACO Scheduling Algorithms
2.10 Current Fault Tolerant Scheduling Algorithms
2.11 Implementation Tools
2.11.1 Java programming language
2.11.2 GridSim toolkit
2.11.3 Reason for GridSim toolkit
CHAPTER THREE: MATERIALS AND METHODS
3.1 Introduction
3.2 The Proposed Scheduling ACO Algorithm
3.3 The Proposed Scheduling Strategy in Grid Environment
3.4 Decision on Next Resource to Visit
3.5 Decision Making
3.6 An Illustration of the Model
3.7 Architecture of the Proposed Fault Tolerant Technique
3.8 Checkpointing Tool Implementation
3.9 An Illustration of the Rollback Recovery Analysis
3.10 Feature of GridSim Resource Failure Modified
3.11 System Model
3.12 Application Model
3.13 Performance Evaluation Criteria
CHAPTER FOUR: ANALYSIS OF RESULTS
4.1 Introduction
4.2 Results
CHAPTER FIVE: SUMMARY, CONCLUSION AND RECOMMENDATION
5.1 Summary
5.2 Conclusion
5.3 Recommendation
REFERENCES

CHAPTER ONE

INTRODUCTION
1.1 Background of the Study
Computational approaches to problem solving have proven their worth in almost every field of human endeavor. Scientists in fields such as health, meteorology, astrophysics and many more need huge processing power to perform complex calculations in a reasonable amount of time. It might take decades to run a set of modeling experiments on a standard personal computer. Buying a supercomputer costs millions of dollars, plus thousands more each year to maintain it, not to mention the hefty electricity bill to keep the massive system running. Standard personal computers today have great processing power, yet the standard tasks for an average user's computer vary very little, usually including word processing, Internet browsing, spreadsheets and presentations. Because high performance computing resources are expensive and hard to access, the alternative was to use federated resources that could comprise computation, storage and network resources from multiple geographically distributed institutions (Foster et al., 2008). As most systems are idle for significant periods of time, it should be possible to harness their idle or unused resources and apply them to projects in need of such resources. The Grid paradigm then emerged, led by Ian Foster, Carl Kesselman, and Steve Tuecke (Foster et al., 2008; Rhodes, 2006), who have been called the "fathers of the Grid" (Haque et al., 2012). They got together to develop a toolkit to handle computation management, data movement, storage management and other infrastructure that could handle large grids without restricting themselves to specific hardware and requirements (Barboni, 2011).
The Grid emerged from the need to solve computational problems that cannot otherwise be solved by a single personal computer. Such computational problems include financial modeling, weather modeling and data visualization. This extremely high computing power is achieved by optimal utilization of distributed heterogeneous resources that are lying idle. This has enabled scientists to broaden their simulations and experiments to take into account more parameters (such as large values) than ever before. Imagine millions of computers owned by individuals and institutions from various countries across the world connected to form a single, huge supercomputer so as to utilize the resources, as depicted in Figure 1.1.
Figure 1.1: Grid Computing Environment
(http://eu-datagrid.web.cern.ch/eu-datagrid/images/images/grid-small-prov.jpg)
The term Grid is analogous to "electrical power grids", which provide consistent, pervasive, reliable, transparent access to utility power irrespective of the location of the source (Rhodes, 2006; Pritpal and Gurinderpal, 2013; Foster and Kesselman, 2002). Grid computing is concerned with coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations. The key concept is the ability to negotiate resource-sharing arrangements among a set of participating parties (providers and consumers) and then to use the resulting resource pool for some purpose (Foster, 2002).
The Grid enables the sharing, selection, and aggregation of a wide variety of geographically distributed resources, including supercomputers, storage systems, data sources, and specialized devices owned by different organizations, for solving large-scale resource-intensive problems in science, engineering, and commerce (Buyya, 2002).
1.1.1 General Issues in Grid System: Principles
Four main aspects characterize a grid (Rhodes, 2006; Buyya, 2002):
1. Multiple Administrative Domains and Autonomy: Grid resources are geographically distributed across multiple administrative domains, most often in different time zones, and owned by different organizations. The grid is obliged to honor the autonomy of resource owners together with their local resource management and usage policies.
2. Heterogeneity: A Grid involves a collection of resources that are heterogeneous in nature and comprise an enormous range of technologies.
3. Scalability: A Grid might grow from a small number of integrated resources to millions. This leads to the problem of potential performance degradation as the size of the Grid increases. Consequently, applications requiring numerous geographically distributed resources must be designed to be latency and bandwidth tolerant.
4. Dynamicity or adaptability: With so many resources in a Grid, the probability of some resource failing is high. Therefore resource failure should be considered the rule rather than the exception. Applications or resource managers must adapt their behavior dynamically and use the available resources and services efficiently and effectively.
Due to the dynamic nature of the grid, failures are more likely to occur in grid environments, thereby affecting the time needed to execute job applications and degrading the performance of the system. Compute-intensive grid applications often require long execution times in order to solve a single problem. The huge computing potential of grids usually remains unexploited due to their susceptibility to failures such as process failures, machine crashes, and network failures (Garg and Kumar, 2011). The failure of a resource running a user job has a huge effect on Grid performance. Hence, in order to ensure high system availability, handling job site failures is indispensable. In computational grid systems, incorporating fault tolerant algorithms in grid scheduling is advocated. The "Swarm Intelligence for Distributed Job Scheduling on the Grid" algorithm proposed by Moallem (2009) will be enhanced by incorporating a fault tolerance technique.
1.2 Problem Statement
Grid environments are dynamic in nature: resources are geographically distributed, heterogeneous, owned by different individuals or organizations with their own policies, have different access models, and have dynamically varying loads and availability. In a Grid scheduling system, the availability and performance of grid resources are unpredictable, as requests from within an administrative domain may gain priority over requests from outside, and resources may join or become unavailable at any time. This results in performance fluctuation, which makes it difficult to apply the scheduling methods of parallel computing in a Grid environment (Caminero et al., 2007). These resources are not dedicated to grid applications, owing to site autonomy and competition among applications for resources; in addition, the capabilities of the computing resources vary, and jobs often arrive dynamically. As the number of components in a distributed system increases, the component failure rate also increases (Townend and Xu, 2003). The errors likely to occur in a grid environment may be intensified by the fact that many Grid applications perform long tasks that may require several days of computation, if not more. In addition, the cost and difficulty of containing and recovering from faults in Grid applications is higher than that of normal applications (Gupta, 2011). Some Grid schedulers in computational grids nevertheless assume that resources are always available and 100% reliable, and that no resource fails while processing a group of tasks. Failure and unavailability of resources are inevitable in real Grid environments and will certainly affect the user's Quality of Service (QoS) requirements. Working with such assumptions will lead to the simulated model outperforming the real grid system. Most existing fault management mechanisms react only after a failure has occurred and are application dependent. For example, if a machine fails during job execution, the job is submitted to another machine and restarted from the beginning. This is too expensive for computationally intensive jobs that require huge computational resources (Arora et al., 2013). To achieve the promising potential of computational grids, an effective and efficient scheduling system is fundamentally important. Research is being carried out on incorporating fault tolerance in computational grid environments, and some researchers have implemented fault tolerance in scheduling algorithms. This research incorporates fault tolerance into the Ant Colony Optimization (ACO) scheduling algorithm of Moallem (2009).
1.3 Research Question
This research work attempts to answer these questions:
1. What scheduling mechanism and programming tools will be used for efficiency?
2. What fault tolerance technique will be employed to solve the problem discussed in the problem statement?
3. How efficient is the technique used?
4. How will the proposed technique be validated to prove its efficiency?
1.4 Research Motivation
Most scheduling algorithms assume that resources are fully reliable and that no resource fails while processing a group of tasks. A Grid environment involves multiple resources which are heterogeneous in nature and geographically distributed. These resources include computing systems, specialized devices and distributed applications, and the probability of some of them failing is high. If a grid resource fails during execution of a job, the job is rescheduled on another resource, which starts executing the job from scratch. Some of these resources satisfy the user's resource criteria (such as processor speed, computational cost and storage capacity) but have a tendency toward faults. In such a scenario, the scheduler goes ahead and selects the same resource for the simple reason that the grid resource is assured to meet the user's requirements for the job. This leads to more time being consumed by the user's job than expected, which is a setback for Grid providers and users. Researchers are making efforts to see that fault tolerance is incorporated into scheduling algorithms in grid environments.
1.5 Research Aim and Objectives
The aim of this dissertation is to improve a job scheduling algorithm in a grid computing environment using fault tolerance mechanisms. The research objectives therefore include proposing a fault tolerant scheduling algorithm that will:
1. Enable job execution in spite of resource failure in the context of grid computing environment.
2. Improve user Quality of Service (QoS) requirements (e.g. the deadline by which a job must complete its execution).
3. Reduce the selection probability of resources with more fault occurrence history.
1.6 Research Methodology
The following are the methods that were adopted for this research work:
1. Incorporate the fault tolerance mechanism into the scheduling algorithm proposed in Moallem (2009).
2. Implement and simulate the proposed scheduling algorithm using GridSim Toolkit-5.2.
3. Evaluate the performance of the proposed scheduling algorithm against that of Moallem (2009).
1.7 Scope and Limitation
The scope of this research work is to incorporate a fault tolerance mechanism, namely the checkpoint technique, into the scheduling algorithm proposed by Moallem (2009). The checkpoint strategy saves the process state of each job executing on a resource at a particular interval of time. It provides an efficient way to guarantee the completion of jobs according to the QoS required by the user. As a limitation, this work assumes that the storage device is dedicated and reliable, and that a resource does not fail while a checkpoint is being taken or while recovery is in progress.
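The effect of checkpoint-based rollback recovery on job completion time can be illustrated with a minimal sketch. It is not the GridSim implementation used in this work: the function names, the single-failure model and the zero recovery cost are assumptions made purely for illustration, and the sketch adopts the stated limitation that no failure occurs during checkpointing or recovery.

```python
import math

def time_without_checkpoint(work, fail_at):
    """Completion time when the job restarts from scratch after one failure.

    work    -- total work of the job, in time units (hypothetical parameter)
    fail_at -- amount of work completed when the failure hits (fail_at < work)
    """
    # All progress made before the failure is lost and must be redone.
    return fail_at + work

def time_with_checkpoint(work, fail_at, interval, overhead):
    """Completion time with periodic checkpoints and rollback recovery.

    interval -- work executed between two consecutive checkpoints
    overhead -- extra time spent saving each checkpoint
    Assumes one failure, zero recovery cost, and no failure while a
    checkpoint is being saved (the limitation stated in the text).
    """
    # Progress saved by the most recent checkpoint before the failure.
    last_saved = math.floor(fail_at / interval) * interval
    # Only the work done since that checkpoint has to be repeated.
    lost = fail_at - last_saved
    # Checkpoints are taken at interval, 2*interval, ... strictly before the end.
    n_checkpoints = math.ceil(work / interval) - 1
    return work + lost + n_checkpoints * overhead

if __name__ == "__main__":
    print(time_without_checkpoint(100.0, 75.0))          # restart from scratch
    print(time_with_checkpoint(100.0, 75.0, 20.0, 1.0))  # rollback to checkpoint
```

With these illustrative numbers, the job that rolls back to its last checkpoint repeats only 15 units of work and pays 4 units of checkpoint overhead, whereas the job without checkpointing repeats all 75 units completed before the failure.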
1.8 Contribution to Knowledge
The following are the contributions made:
1. The need for a job scheduling strategy with a fault tolerance mechanism for a computational grid environment is advocated. An architecture for incorporating a fault tolerance based job scheduling model into the algorithm of Moallem (2009) is proposed. This enables jobs to restart from the last checkpoint after a resource failure rather than from scratch, as in Moallem's algorithm.
2. The proposed scheduling strategy uses a job checkpointing recovery mechanism to checkpoint the application state at regular intervals so that jobs can restart from the last checkpoint upon resource failure. This enables the grid to complete the jobs within a shorter period of time.
3. The proposed scheduling algorithm considers not just the load information of the resources, as in Moallem's algorithm, but also the fault rate of each resource when selecting an optimum resource for executing a job.
4. The proposed fault tolerant scheduling is implemented using the GridSim Toolkit simulator, a Java-based discrete-event Grid simulation toolkit. The GridSim toolkit is one of the most widely used grid simulation tools.
5. Performance evaluations were carried out by comparing the proposed scheduling algorithm with that of Moallem (2009), which has no fault tolerance mechanism. The proposed algorithm improves the user job execution time and system throughput.
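The idea of weighing both load and fault rate when choosing a resource can be sketched as follows. The weighting formula, the function name and the input layout are assumptions for illustration only; this excerpt does not give the selection rule actually used by the proposed algorithm or by Moallem (2009).

```python
def selection_probabilities(resources):
    """Return a selection probability per resource.

    resources -- dict mapping a resource name to a tuple
                 (load, failures, jobs_submitted); all hypothetical inputs.
    The weight (1 - failure_rate) / (1 + load) is an illustrative choice,
    not the formula from the dissertation: it simply falls as either the
    load or the observed failure rate rises.
    """
    weights = {}
    for name, (load, failures, submitted) in resources.items():
        # Observed failure rate; a resource with no history is treated as reliable.
        failure_rate = failures / submitted if submitted else 0.0
        weights[name] = (1.0 - failure_rate) / (1.0 + load)
    total = sum(weights.values())
    if total == 0.0:
        # Every resource has failed every job so far: fall back to uniform choice.
        return {name: 1.0 / len(weights) for name in weights}
    return {name: w / total for name, w in weights.items()}

if __name__ == "__main__":
    grid = {
        "R1": (2.0, 0, 10),  # same load as R2, no failures recorded
        "R2": (2.0, 5, 10),  # same load, half of its jobs failed
    }
    print(selection_probabilities(grid))
```

In this example the two resources carry the same load, so a load-only rule would treat them equally, while the fault-aware weight makes the resource with the failure history half as likely to be chosen.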
1.9 Dissertation Organization
The rest of the dissertation is organized as follows. Chapter 2 reviews related scheduling algorithms and fault tolerance mechanisms in grid environments. Chapter 3 discusses the GridSim simulator and the methodology used in the proposed swarm intelligence job scheduling with fault tolerance, an enhancement of Moallem (2009), with details of the scheduling model used and the architecture of the scheduling algorithm. Chapter 4 covers the design and implementation of a prototype system and the performance evaluation of the proposed system against Moallem (2009). Chapter 5 concludes the dissertation by summarizing the research and outlining future work.
1.10 Definition of Terms
ACO: Ant Colony Optimization; simulates the behavior of ants to find desirable solutions to difficult combinatorial optimization problems.
Checkpoint latency: the duration of time required to save the checkpoint.
Checkpoint overhead: the increase in the execution time of an application caused by checkpointing.
Computing resources: resources which allow users to execute the required application.
Error: the part of the system's state that may lead to a failure.
Failure: an event that occurs when the delivered service deviates from its correct behavior. A service is said to have failed if it could not comply with the promised functional specifications.
Fault: the adjudged or hypothesized cause of an error. A fault is an unwanted but possible constellation of system states.
Gridlet: a package that contains all the information related to the job and its execution management details, such as the job length expressed in MI (million instructions), disk I/O operations, the size of the input and output, and the job originator.
Grid Information System (GIS): contains information about the available grid resources with their computing capacity and the cost at which they offer services to grid users.
Grid Resource Broker (GRB): maps jobs received from users to the grid resources.
Grid Scheduler: schedules jobs to grid resources.
Grid scheduling: the process of making scheduling decisions involving resources over multiple administrative domains. Global scheduling is the problem of deciding where to execute the job. Global scheduling is not responsible for the final mapping of the job's processes onto CPUs; this is done by the machine's local scheduler, i.e., the operating system.
Heterogeneity: resources are said to be heterogeneous when they do not all have exactly the same characteristics. For instance, the set of processors in a grid is heterogeneous if they do not all have the same processing capabilities or the same amount of memory.
Heuristics: experience-based techniques for problem solving, learning, and discovery.
Job: the user's program or application which is executed in the Grid. The job is also the subject of the scheduling process.
Job Migration: sometimes a job cannot be completely executed on a particular machine. When a task fails, it can be migrated to another machine.
Local scheduling: assigns the process(es) of the job to the CPU(s) of a single machine.
Makespan: the time at which the application execution is completed. It is measured from the time the first job is sent to the grid until the last job comes out of the grid.
Meta-heuristic: a computational method that optimizes a problem by iteratively trying to improve a candidate solution.
Proactive fault tolerance: failure is considered before a job is scheduled, and the job is dispatched in the hope that it does not fail.
Process: also referred to as a task; an instance of a program in execution.
Post-active fault tolerance: the mechanism handles job failures after they have occurred.
Quality of Service (QoS): the ability to provide different priorities to different jobs and users, or to guarantee a certain level of performance to a job.
Re-execution: a task whose execution is not reliable enough (for instance, because its execution speed is below a threshold) is executed twice (or more) to enhance its reliability.
Resource: any hardware or software component such as a cluster, a supercomputer or a storage repository. Resources perform an operation, such as memory for data storage, a processor for data processing, bandwidth for data transfer and so on.
Resource Characteristics: an object that stores the properties of a Grid resource: architecture, OS, list of machines, allocation policy (time-shared or space-shared), and time zone.
Resource Scheduling: the process of mapping resources based on the requirements and characteristics specified in a query.
Resource selection: the problem of deciding which, among all available resources, should be used to solve a given problem.
Schedulers: are responsible for scheduling and manage the various tasks, such as resource discovery and resource selection, required for effective application execution.
Scheduling: the problem of deciding at what time (when), and on which resource (where), to execute each of the (atomic) tasks that must be executed on the given platform.
Site: an autonomous unit composed of one or multiple resources. It is also termed a node.
System-centric: a traditional scheduling approach, commonly applied in single administrative domains, that attempts to optimize system-wide measures of performance. It aims to enhance system throughput and utilization, and thereby to complete execution at the earliest possible time.
Task: the smallest unit that can be scheduled by the scheduler and allotted to a resource. A task has parameters such as priority, memory requirement and deadline.
Task scheduling: the assignment of tasks to a group of resources that are distributed in multiple domains.
Task Resubmission: a job may fail whenever one of its tasks fails. In this case the failed task is resubmitted at runtime, either to the same or to a different resource, for execution.
Throughput: the number of processes completed per unit time.
Turnaround Time: the mean time from submission to completion of a process.
User-centric: concentrates on users' requirements by delivering maximum utility to the users of the system based on their QoS requirements. For example, a guarantee of a certain QoS based on the attributes that the user finds important, such as the deadline by which the job has to be completed.
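The definitions of makespan, throughput and turnaround time given above can be made concrete with a small sketch. The function name and the input layout are hypothetical, and measuring throughput over the makespan is an assumption of this example; the exact evaluation formulas belong to Chapter 3 and are not reproduced in this excerpt.

```python
def schedule_metrics(jobs):
    """Compute (makespan, throughput, mean turnaround time).

    jobs -- list of (submit_time, finish_time) pairs, one per completed job,
            in the same time unit (hypothetical input layout).
    """
    first_submit = min(submit for submit, _ in jobs)
    last_finish = max(finish for _, finish in jobs)
    # Makespan: from the first job entering the grid to the last job leaving it.
    makespan = last_finish - first_submit
    # Throughput: completed jobs per unit time (assumed here to be per makespan).
    throughput = len(jobs) / makespan
    # Turnaround time: mean time from submission to completion.
    mean_turnaround = sum(finish - submit for submit, finish in jobs) / len(jobs)
    return makespan, throughput, mean_turnaround

if __name__ == "__main__":
    completed = [(0.0, 10.0), (0.0, 20.0), (5.0, 25.0)]
    makespan, throughput, turnaround = schedule_metrics(completed)
    print(makespan, throughput, turnaround)
```

A lower makespan and turnaround time and a higher throughput indicate a better schedule, which is the direction of the improvements reported in the abstract.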

