ABSTRACT: The basic idea of grid computing is to create large and powerful
virtual computers out of a collection of heterogeneous, distributed resources.
Grid computing is becoming a mainstream technology for large-scale distributed
resource sharing and system integration. Grid applications often involve large
amounts of data and/or computing resources that require secure resource sharing
across organizational boundaries. Today, highly secure or virtual grids are in
great demand, in which any resource from any cluster can be shared even when
faults exist in the system. This paper gives a method to improve resource
utilization with maximum efficiency and throughput even when faults occur in
the system. It also increases system throughput by carrying out check pointing
log entries at the same time as job execution, or by reducing the execution
time.

Literature Survey:

The scientific communities
were starting to look seriously at Grid computing as a solution to resource
federation problems. For example, high energy physicists designing the Large
Hadron Collider (LHC) realized that they needed to federate computing systems
at hundreds of sites if they were to analyze the many petabytes of data to be
produced by LHC experiments. The book "The Grid: Blueprint for a New Computing
Infrastructure" also had a catalyzing effect. The Grid era offers possibilities
that transcend simply bigger, faster, and better. Grid computing started long
ago, at a time when application portability remained a major challenge, that
is, when many processor architectures competed for the


Grid computing:

Grid computing is a collection of computers from multiple locations working to
reach a common goal. The grid can be thought of as a distributed system with
non-interactive workloads that involve a large number of files. Grid computing
is a special type of parallel computing that depends on complete computers
connected to a network (private, public or the internet) by network
producing hardware, compared to
the lower efficiency of designing and constructing a small number of custom

In grid computing, the computers on the same network can work on a task
together and thus function as a supercomputer. Typically, a grid works on
various tasks within a network, but it is also capable of working on
specialized applications. It is designed to solve problems that are too big for
a supercomputer while maintaining the ability to process smaller problems.

A grid is connected by parallel nodes that form a computer
cluster, which runs on an operating system. The cluster varies in size from a
small workstation to several networks. The technology is applied to a range of
applications, such as mathematical, scientific or educational tasks, using
several computing resources.


Grids provide protocols and services at five different layers, as identified in
the Grid protocol architecture. In general, the higher layers are focused on
the user, whereas the lower layers are more focused on computers and networks.

At the Fabric layer, the grid provides access to different resource types such
as compute, storage and network resources and code repositories. Grids usually
rely on existing fabric components, for instance, local resource managers.
Fabric components implement the local, resource-specific operations that occur
on specific resources (whether physical or logical) as a result of sharing
operations at higher levels. Richer Fabric functionality enables more
sophisticated sharing operations.

The Connectivity layer defines core communication and authentication protocols
for easy and secure network transactions. The GSI (Grid Security
Infrastructure) protocol underlies every Grid transaction. Communication
protocols enable the exchange of data between Fabric layer resources.
Authentication protocols build on communication services to provide
cryptographically secure mechanisms for verifying the identity of users and
resources.

The Resource layer builds on Connectivity layer communication and
authentication protocols to define protocols for the publication, discovery,
secure negotiation, initiation, monitoring, control, accounting, and payment of
sharing operations on individual resources. Resource layer implementations of
these protocols call Fabric layer functions to access and control local
resources.

While the Resource layer is focused on interactions with a single resource,
the next layer in the architecture contains protocols and services (and APIs
and SDKs) that are not associated with any one specific resource but rather are
global in nature and capture interactions across collections of resources. For
this reason, the next layer of the architecture is referred to as the
Collective layer.

The Application layer is the highest layer of the structure. Applications are
constructed by calling upon services defined at any layer. At each layer, we
have well-defined protocols that provide access to some useful service:
resource management, data access, resource discovery, and so forth. At each
layer, APIs may also be defined whose implementations exchange protocol
messages with the appropriate service(s) to perform desired





Issues of grid computing when failure occurs:

Since grid environments are extremely heterogeneous and
dynamic, with components joining and leaving the system all the time, more
faults are likely to occur in grid environment.

1. When a fault occurs at a grid resource and the user’s deadline can no longer
be met as a result, the job is rescheduled on another resource. As the job is
re-executed, it consumes more

2. There are resources that fulfill the deadline constraint but have a tendency
towards faults in computational grid environments. The grid scheduler selects
such a resource merely because it promises to meet the user’s requirements for
the grid jobs. This eventually compromises the user’s QoS parameters in order
to complete the job.

3. Even if there is a fault in the system, a running task needs to finish by
its deadline. A task that is not completed before its deadline has no value.
Hence, deadline is the major issue in real

4. Scalability is about the ability to handle a growing amount of work, and the
capability of a system to increase total throughput under an increased load
when resources are added.

Hence, fault tolerance in grid computing is important, as the dependability of
grid resources cannot be guaranteed. It is needed to enable the grid to
continue its work when one or more resources fail. A fault tolerant system must
therefore be included to detect errors and recover from them, thus avoiding
failure of the grid.


Job replication and job check pointing are the two techniques most often used
to accomplish fault tolerance in grid computing.

Job replication:

Job replication is based on the assumption that the probability of a single
resource failure is much higher than that of a simultaneous failure of multiple
resources. It runs redundant copies of the same job on different resources, so
the grid can continue to provide a service despite the failure of a grid
resource carrying out one of the copies, without affecting performance. The job
is replicated on multiple servers; in grid computing, each such service is
capable of receiving jobs, executing them, performing checksum operations on
them, and sending the result back to the client.
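
The replication idea described above can be illustrated with a minimal Python
sketch. It is only an illustration under stated assumptions, not the method of
any particular grid middleware: grid resources are simulated as local callables
run in threads, and the names make_resource, run_replicated and ResourceFailure
are hypothetical. The same job is submitted to every resource, and the first
successful result is returned even if other replicas fail.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


class ResourceFailure(Exception):
    """Raised by a (simulated) grid resource that fails while running a job."""


def make_resource(name, fails):
    """Hypothetical resource: runs the job, or raises if it is marked faulty."""
    def execute(job, *args):
        if fails:
            raise ResourceFailure(f"{name} failed")
        return job(*args)
    return execute


def run_replicated(job, args, resources):
    """Submit the same job to every resource; return the first successful result."""
    if not resources:
        raise ValueError("at least one resource is required")
    errors = []
    with ThreadPoolExecutor(max_workers=len(resources)) as pool:
        futures = [pool.submit(resource, job, *args) for resource in resources]
        for future in as_completed(futures):
            try:
                return future.result()
            except ResourceFailure as exc:
                errors.append(exc)  # this replica failed, wait for another one
    raise ResourceFailure(f"all {len(resources)} replicas failed: {errors}")


if __name__ == "__main__":
    resources = [
        make_resource("r1", fails=True),
        make_resource("r2", fails=False),
        make_resource("r3", fails=True),
    ]
    print(run_replicated(sum, ([1, 2, 3],), resources))
```

In this sketch the job only fails when every replica fails, which mirrors the
assumption that a simultaneous failure of multiple resources is much less
likely than a single failure.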

Fig 1. Distributed system with multiple clients and server

Data replication is commonly used to enhance availability in Grid-like
environments where failures are more likely to occur. Components are replicated
on different machines, and if any component or machine fails, the application
can be transferred to and run on another machine that has the required
components. The main disadvantage of the job replication technique is the
additional resources used in executing the same job. This can cause grid
over-provisioning and can lead to long delays for other jobs waiting for these
resources to become free. Also, most existing replication-based techniques are
static: the number of replicas of the original job is decided before execution
and is fixed. Static job replication leads to excessive utilization of
resources and to excess load on the grid.

On the other hand, adaptive job replication can alleviate the extra load that
results from using a fixed number of replicas. Adaptive job replication
techniques determine the number of replicas according to the failure history of
the primary resource allocated to execute the job. Thus, the number of replicas
differs for each job according to the past failure behavior of each resource: a
bad failure history means a large number of replicas, and a good failure
history means a small number of replicas.
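
One possible way to turn a failure history into a replica count is sketched
below in Python. The rule used here is an assumption made for illustration and
is not taken from the text: the failure probability p of the primary resource
is estimated as the fraction of past jobs that failed, every replica is assumed
to fail independently with the same probability, and the smallest n with
p^n <= target_loss is chosen. The names failure_rate, replica_count,
target_loss and max_replicas are hypothetical.

```python
import math


def failure_rate(history):
    """Fraction of past jobs that failed; history is a list of bools (True = failed)."""
    if not history:
        return 0.5  # assumed neutral guess when nothing is known about the resource
    return sum(history) / len(history)


def replica_count(history, target_loss=0.01, max_replicas=5):
    """Smallest n with p**n <= target_loss, clamped to the range 1..max_replicas.

    Assumes replicas fail independently with the primary's failure rate p.
    """
    p = failure_rate(history)
    if p <= 0.0:
        return 1             # never failed so far: a single copy is enough
    if p >= 1.0:
        return max_replicas  # always failed so far: use the allowed maximum
    n = math.ceil(math.log(target_loss) / math.log(p))
    return max(1, min(max_replicas, n))


if __name__ == "__main__":
    good = [False] * 19 + [True]      # 1 failure in 20 past jobs
    bad = [True] * 5 + [False] * 5    # 5 failures in 10 past jobs
    print(replica_count(good), replica_count(bad))
```

A resource with a good history gets few replicas and a resource with a bad
history gets more, up to the cap that keeps the extra load on the grid bounded.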

Job check-pointing:

Check pointing is the ability to save the state of a running job to stable
storage. In case of a fault, this saved state can be used to resume execution
of the application from the point in the computation where the last checkpoint
was registered, instead of restarting the application from the very beginning.
This can reduce the execution time to a large extent. Each check pointing
interval starts when a checkpoint is established and ends when the next
checkpoint is established. A short check pointing interval leads to a large
number of redundant checkpoints, which delay job processing by consuming
computational and network resources. On the other hand, when a check pointing
interval is too long, a substantial amount of work has to be redone in case of
a resource failure. So, calculating the optimal length of the check pointing
interval is the main challenge when using check pointing. Hence, the decision
about the size of the check pointing interval and the check pointing technique
is a complicated task and should be based on knowledge about the application as
well as the system.
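
The save-and-resume behavior described above can be sketched in Python as
follows. This is an application-level illustration under simplifying
assumptions: "stable storage" is a local file, the job is a toy sum of squares,
and the names save_checkpoint, load_checkpoint, run_job and fail_at are
hypothetical. The interval parameter controls how many steps pass between
checkpoints.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the job state to storage atomically (temporary file, then rename)."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"next_i": 0, "total": 0}


def run_job(path, n, interval, fail_at=None):
    """Sum i*i for i < n, saving a checkpoint every `interval` steps."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated resource failure")
        state["total"] += i * i
        state["next_i"] = i + 1
        if state["next_i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    try:
        run_job(path, n=10, interval=5, fail_at=7)  # fails after the step-5 checkpoint
    except RuntimeError:
        pass
    print(run_job(path, n=10, interval=5))  # resumes from step 5, not from step 0
```

With an interval of 5 and a failure at step 7, only steps 5 and 6 are redone on
restart. A smaller interval would redo less work but write more checkpoints,
which is the trade-off the paragraph describes.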

The efficiency of check pointing depends on:

Checkpoint overhead in terms of time and resources consumed.

Checkpoint length, which plays a major role.

Compatibility and portability of checkpoints.
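
To make the overhead and interval trade-off concrete, the sketch below uses
Young's first-order approximation, a well-known rule of thumb that is brought
in here for illustration and is not the method proposed in this text: interval
is approximately sqrt(2 * C * MTBF), where C is the time to write one
checkpoint and MTBF is the mean time between failures. Estimating MTBF as the
average gap between recorded failure timestamps is a further assumption, and
the function names are hypothetical.

```python
import math


def mean_time_between_failures(failure_times):
    """Average gap between consecutive failure timestamps (same unit as input)."""
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two failure timestamps")
    return sum(gaps) / len(gaps)


def young_interval(checkpoint_cost, mtbf):
    """Young's approximation of the check pointing interval: sqrt(2 * C * MTBF)."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical failure log in seconds since the resource joined the grid.
    mtbf = mean_time_between_failures([0, 9000, 21000, 30000])
    print(young_interval(checkpoint_cost=2.0, mtbf=mtbf))
```

Because the interval grows with the square root of both quantities, a resource
that fails more often (smaller MTBF) gets more frequent checkpoints, while a
more expensive checkpoint pushes the interval out.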

Researchers have considered various types of check pointing optimization, e.g.,
full check pointing, incremental check pointing, unconditional check pointing,
dynamic check pointing, and synchronous and asynchronous check pointing. A
checkpoint may be system level, application level, or mixed level, depending on
its characteristics. Check pointing is also categorized on the basis of
in-transit or orphan messages: uncoordinated check pointing, coordinated check
pointing, and communication-induced check pointing. Check pointing can also be
classified by who instruments the application to do the actual capturing and
re-establishing of the application execution state: manual code insertion,
pre-compiler check pointing, and post-compiler check pointing. A checkpoint may
be local or global on the basis of its scope: a checkpoint for a single process
is a local checkpoint, and a checkpoint applied to a set of processes is called
a global checkpoint. Check pointing has some demerits; for example, it causes
execution time overhead even if there are no crashes.

Result and Discussion:

The response time of a check pointing technique is worse than that of the job
replication technique. This is due to the extra time needed to migrate the job
to another resource when a resource fails. Job replication, on the other hand,
does not need to migrate jobs between resources, and the first returned
response is used. However, the networking and computing resources required by
job replication techniques are much higher than those of check pointing
techniques. Check pointing has another cost: writing checkpoint data to stable
storage whenever a checkpoint is taken. This cost is proportional to the size
of the checkpoint data. Thus, we can use a check pointing strategy for
resource-constrained grids and the job replication technique for real-time
applications. However, determining the number of replicas and the number and
intervals of checkpoints is still a big challenge.


Fault tolerance is an important problem in all distributed environments. The
proposed work achieves fault tolerance by dynamically adapting the checkpoint
frequency and the number of replicas based on the history of failure
information and job execution time, which reduces checkpoint overhead and
increases throughput. In addition, new fault detection methods,
client-transparent fault tolerance architectures, on-demand fault tolerant
techniques, economic fault tolerant models, optimal failure prediction systems,
multiple-fault tolerant models and self-adaptive fault tolerance frameworks
have been proposed to make the grid environment more dependable and
trustworthy.


