IJMEIT// Vol. 2 Issue 3 //March 2014 //Page No: 172-179//e-ISSN: 2348-196x
Secure Load Rebalancing Algorithm for Distributed File Systems in Cloud
Raihanath A S¹, Anu V R²
¹P G Student, Computer Science, Ilahia College of Engineering & Technology (ICET), Kerala, India
²Head of the Department, CSE & IT, Ilahia College of Engineering & Technology (ICET), Kerala, India
Abstract
Distributed file systems are a key technology for cloud computing applications. In such a file system, each node has storage as well as computing functionality. A file is partitioned into a number of chunks allocated to distinct nodes so that data processing can be performed in parallel. Specifically, in this study, we suggest offloading the load rebalancing task to the storage nodes by having the storage nodes balance their loads spontaneously. This eliminates the dependence on central nodes. The storage nodes are organized as a network based on distributed hash tables (DHTs); discovering a file needs only a rapid key lookup operation, given that a unique handle is allocated to each file chunk. DHTs enable nodes to self-organize and self-repair while constantly offering lookup functionality under node dynamism, simplifying system provision and management. We construct a sophisticated load rebalancing algorithm combined with AES encryption. The simulation results show that our proposed scheme outperforms the existing distributed approach in terms of security parameters.
Keywords: Cloud Computing, Load Rebalancing, Distributed File System, AES Encryption.
1. INTRODUCTION
Cloud computing relies on sharing of resources to achieve coherence and economies of scale, similar to a utility delivered over a network. In clouds, clients can dynamically allocate their resources on demand without sophisticated deployment. The MapReduce[2] programming paradigm, distributed file systems, and virtualization are key enabling technologies for cloud computing applications. These techniques emphasize scalability, so clouds can be large in scale, and their constituent entities can arbitrarily fail and join while the system maintains reliability. Distributed file systems are a key technology for cloud computing applications. In such a file system, each node performs computing and storage functions. Data is divided into chunks and stored in distinct nodes, so that a MapReduce
application runs in a parallel way. Consider a set of web servers, each with a set of websites. As information is collected about the usage of each website on each web server, it might become apparent that the load is not uniformly distributed across the web servers. An obvious solution would be to reassign websites to web servers so as to minimize the maximum load on a server. A cloud divides a file into a large number of fixed-size chunks and assigns them to different servers. Each server node calculates the usage of each unique website by searching its local file chunks.

Load balancing is a technique to enhance resources by utilizing parallelism and balancing the systems dynamically, to improve throughput, and to reduce response time through an appropriate distribution of the application. Cloud balancing provides an organization with the ability to distribute application requests across any number of application deployments located in data centers and through cloud-computing providers. Cloud balancing takes a broader view of application delivery and applies specified thresholds and service level agreements (SLAs) to every request. The use of cloud balancing can result in the majority of users being served by application deployments in different providers' environments, even though the local application deployment or internal, private cloud might have more than enough capacity to serve that user.

In cloud computing, heterogeneous resources are available in different places and distributed geographically. The users' resource requirements in the clouds vary depending on their goals, time constraints, priorities, and budgets. Allocating their tasks to the appropriate resources in the clouds so that performance requirements are satisfied and costs are minimized is an extraordinarily complicated problem. Allocating the resources to the proper users so that the utilization of resources and the profits generated are maximized is also an extremely complex problem. From a computational perspective, it is impractical to build a centralized resource allocation mechanism in such a large-scale distributed environment.

In the case of a distributed file system, the load of a node is proportional to the number of file chunks it holds. The file chunks are not distributed uniformly among the nodes because of arbitrary file creation, deletion, and update. In GFS (Google File System) and HDFS (Hadoop Distributed File System)[9], central nodes manage the metadata information of the file system to balance the loads of the storage nodes. The centralized approach is quite simple compared with the distributed method; the distributed method is very challenging in cloud computing. There is a chance of performance degradation of the central node when the number of nodes or file accesses increases linearly. This leaves the node in a blocked state, and further operations cannot be handled by the central node. To overcome this problem, HDFS[9] released the concept of multiple name nodes. The workload changes at a given time for each operation; due to the lack of a proper migration scheme for these name nodes, any of the nodes may suffer degradation of performance.

In this paper, we suggest offloading the load rebalancing task to storage nodes by having the
storage nodes balance their loads spontaneously. This reduces the dependence on central nodes. The storage nodes are structured as a network based on distributed hash tables (DHTs); discovering a file chunk can simply refer to a rapid key lookup in DHTs, given that a unique handle is assigned to each file chunk. DHTs enable nodes to self-organize and self-repair while constantly offering lookup functionality under node dynamism, simplifying system provision and management. In this paper, we devise algorithms that are specific to the load rebalancing problem and obtain better results in a secure manner.

Section 2 introduces related works for this paper. Section 3 contains some basic load rebalancing concepts. Section 4 describes the load rebalancing algorithm. Section 5 shows the tests we have performed and the results obtained. Finally, Section 6 summarizes the conclusion.

2. RELATED WORKS
A. Rowstron and P. Druschel[1] present the design and evaluation of Pastry, a scalable, distributed object location and routing scheme for wide-area peer-to-peer applications in large overlay networks. Pastry is a protocol that performs application-level routing and object location in a potentially very large overlay network of nodes connected via the Internet. It can be used to support a wide range of peer-to-peer applications like global data storage, global data sharing, and naming.

P. Ganesan and M. Bawa[5] investigate how load balancing is necessary in such scenarios to eliminate skew. They presented asymptotically optimal online load-balancing algorithms that guarantee a constant imbalance ratio. The data movement cost per tuple insert or delete is constant, and was shown to be close to 1 in experiments. They showed how to adapt their algorithms to dynamic P2P environments, and architected a new P2P system that can support efficient range queries.

A. R. Bharambe[6] presents the design of Mercury, a scalable protocol for supporting multi-attribute range-based searches. Mercury differs from previous range-based query systems in that it supports multiple attributes as well as performing explicit load balancing. To guarantee client routing and load balancing, Mercury uses novel light-weight sampling mechanisms for uniformly sampling random nodes in a highly dynamic overlay network. Their evaluation shows that Mercury is able to achieve its goals of logarithmic-hop routing and near-uniform load balancing.

D. R. Karger[3] has given provably efficient load balancing protocols for distributed data storage in P2P systems. The algorithms are simple and easy to implement, so an obvious next research step is a practical evaluation of these schemes. In addition, several concrete open problems follow from this work. First, it might be possible to further improve the consistent hashing scheme. Second, the range search data structure does not easily generalize to more than one ordering: for example, when storing music files, one might want to index them by both artist and song title, allowing lookups according to two orderings.
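The two-orderings point can be made concrete. Below is a minimal, illustrative sketch (the class and method names are ours, not from any of the cited systems): one sorted list is kept per attribute, so inclusive range lookups work under either ordering.

```python
from bisect import bisect_left, insort

class MultiOrderIndex:
    """Illustrative only: one sorted list per attribute, so inclusive
    range lookups work by artist or by song title."""
    _HI = chr(0x10FFFF)  # sorts after any ordinary string

    def __init__(self):
        self.by_artist = []  # (artist, title) pairs
        self.by_title = []   # (title, artist) pairs

    def add(self, artist, title):
        insort(self.by_artist, (artist, title))
        insort(self.by_title, (title, artist))

    def range(self, order, lo, hi):
        """order is 'artist' or 'title'; bounds are inclusive."""
        data = self.by_artist if order == "artist" else self.by_title
        i = bisect_left(data, (lo,))
        j = bisect_left(data, (hi, self._HI))
        return data[i:j]
```

Maintaining k orderings costs k insertions per item; a distributed version would have to partition each ordering across nodes, which is exactly the load-balancing difficulty these works address.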
Ion Stoica and Robert Morris[3] introduce the Chord protocol for solving this challenging problem in a decentralized manner. It offers a powerful primitive: given a key, it determines the node responsible for storing the key's value, and does so efficiently. In the steady state, in an N-node network, each node maintains routing information for only O(log N) other nodes, and resolves all lookups via O(log N) messages to other nodes. P. Brighten Godfrey and Ion Stoica have proposed a scheme to assign IDs to virtual servers, called Low Cost Virtual Server Selection, that yields a simple DHT protocol, called Y0, for which the node degree does not increase significantly with the number of virtual servers. Y0 adapts to heterogeneous node capacities, can achieve an arbitrarily good load balance, moves little load, and can compute a node's IDs as O(log n) hashes of its IP address for security purposes. The techniques behind Y0 generalize to arbitrary overlay topologies while providing some flexibility in neighbor selection, even if the underlying topology did not.

3. LOAD REBALANCING
Load balancing is a computer networking method to distribute workload across multiple computers or a computer cluster, network links, central processing units, disk drives, or other resources, to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Using multiple components with load balancing, instead of a single component, may increase reliability through redundancy. The load balancing service is usually provided by dedicated software or hardware, such as a multilayer switch or a Domain Name System server.

Load balancing is one of the central issues in cloud computing. It is a mechanism that distributes the dynamic local workload evenly across all the nodes in the whole cloud to avoid a situation where some nodes are heavily loaded while others are idle or doing little work. It helps to achieve a high user satisfaction and resource utilization ratio, hence improving the overall performance and resource utility of the system. It also ensures that every computing resource is distributed efficiently and fairly. It further prevents bottlenecks of the system which may occur due to load imbalance. When one or more components of any service fail, load balancing helps in continuation of the service by implementing fail-over, i.e., provisioning and de-provisioning instances of applications without failure. The goal of load balancing is to improve performance by balancing the load among the various resources (network links, central processing units, disk drives) to achieve maximum throughput, maximum resource utilization, minimum response time, and to avoid overload. To distribute load across different systems, different load balancing algorithms are used. In general, load balancing algorithms follow two major classifications: depending on how the load is distributed and how processes are allocated to nodes (the system load), and depending on the information status of the nodes (the system topology). Scalability, resource utilization, performance, response time, and overhead are the metrics used for load balancing in the cloud.
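As background for the DHT-based scheme developed in the next section, the Chord-style primitive from the related work — given a key, determine the node responsible for it — can be sketched with consistent hashing. This is a centralized toy for illustration only (all names are ours); real Chord distributes the lookup over O(log N) finger-table hops rather than scanning a sorted list.

```python
import hashlib
from bisect import bisect_left

def h(name: str, bits: int = 32) -> int:
    """Hash a name onto the identifier ring (illustrative 32-bit ring)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (1 << bits)

class Ring:
    """Toy key-to-node primitive: a key is stored on its successor, the
    first node whose ID is >= the key's ID, wrapping around the ring."""
    def __init__(self, nodes):
        self.ids = sorted(h(n) for n in nodes)
        self.by_id = {h(n): n for n in nodes}

    def lookup(self, key: str) -> str:
        i = bisect_left(self.ids, h(key)) % len(self.ids)  # wrap around
        return self.by_id[self.ids[i]]

ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.lookup("file-chunk-17"))  # deterministic: same key, same node
```

Because the mapping depends only on the hash, any node can resolve any key without a central directory, which is what makes the decentralized approaches above feasible.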
Fig -1: Architecture diagram

Fig. 1 shows the diagrammatic representation of a DFS in a cloud environment. The load of the main server is split among sub-servers; the system thus moves from the centralized to the distributed scheme. Clients access data from the sub-servers, which minimizes the workload of the main server.

4. LOAD REBALANCING ALGORITHM
In order to balance the requests for resources, it is important to recognize a few major goals of load balancing algorithms. Cost effectiveness is the primary aim: to achieve an overall improvement in system performance at a reasonable cost. The distributed system in which the algorithm is implemented may change in size or topology, so the algorithm must be scalable and flexible enough to allow such changes to be handled easily. Prioritization of resources or jobs needs to be done beforehand through the algorithm itself, to give better service to important or high-priority jobs while still providing equal service to all jobs regardless of their origin. The load rebalancing algorithm with AES encryption is shown below.

1. Initialize the server and its sub-servers.
2. Establish connections between the server and sub-servers using the IP address or port number.
3. Upload the file to be shared to the server.
4. The server encrypts the data with AES encryption.
5. Split the file into multiple chunks.
6. Calculate each sub-server's memory.
7. Divide the total number of chunks by the total number of sub-servers.
8. Upload each chunk into the sub-servers based on their memory capacity.
9. If a sub-server's capacity is insufficient, transfer the excess chunks to the next sub-server.
10. Append an index value to each chunk.
11. When the client requests a file, the chunks are received from the different sub-servers based on their index values.
12. The client collects all the chunks; the file is then decrypted and viewed by the client.

The data is encrypted at the server side using the AES encryption algorithm, as shown in Fig. 2. The sub-servers are not allowed to view the data, which is in an encrypted format. The client has the privilege to decrypt the data.
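Steps 4-12 above can be sketched as follows. This is a minimal illustration under two stated assumptions: sub-server capacities are measured in chunks, and a SHA-256 keystream XOR stands in for AES in step 4 so the sketch stays dependency-free (a real deployment would use an actual AES library; this placeholder is NOT secure). All function and parameter names are ours.

```python
import hashlib
from itertools import count

CHUNK = 3 * 1024  # 3 KB chunks, matching the paper's experimental setting

def keystream_xor(data: bytes, key: bytes) -> bytes:
    """Placeholder cipher for step 4: XOR with a SHA-256 keystream.
    XOR is its own inverse, so the same call decrypts. NOT secure."""
    out = bytearray()
    counter = count()
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + next(counter).to_bytes(8, "big")).digest())
    return bytes(a ^ b for a, b in zip(data, out))

def distribute(data: bytes, key: bytes, capacities: dict) -> dict:
    """Steps 4-10: encrypt, split into chunks, index each chunk, and place
    chunks on sub-servers by capacity, spilling excess to the next one.
    Assumes total capacity (in chunks) is sufficient."""
    enc = keystream_xor(data, key)
    chunks = [enc[i:i + CHUNK] for i in range(0, len(enc), CHUNK)]
    placement = {s: [] for s in capacities}
    servers, si = list(capacities), 0
    for idx, chunk in enumerate(chunks):  # step 10: index value per chunk
        while len(placement[servers[si]]) >= capacities[servers[si]]:
            si += 1  # step 9: this sub-server is full, move to the next
        placement[servers[si]].append((idx, chunk))
    return placement

def retrieve(placement: dict, key: bytes) -> bytes:
    """Steps 11-12: collect chunks from all sub-servers, reorder by
    index value, and decrypt on the client side."""
    indexed = sorted(p for chunks in placement.values() for p in chunks)
    return keystream_xor(b"".join(c for _, c in indexed), key)
```

Note the design point the paper makes: only the client holds `key`, so the sub-servers store ciphertext chunks they cannot read.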
Fig -2: Encrypted data transfer among server and peers

4.1 Chunk creation
A file is partitioned into a number of chunks allocated to distinct nodes so that MapReduce tasks can be performed in parallel over the nodes. The load of a node is typically proportional to the number of file chunks the node possesses. Because the files in a cloud can be arbitrarily created, deleted, and appended, and nodes can be upgraded, replaced, and added to the file system, the file chunks are not distributed as uniformly as possible among the nodes. Our objective is to allocate the chunks of files as uniformly as possible among the nodes such that no node manages an excessive number of chunks.

4.2 DHT formulation
The storage nodes are structured as a network based on distributed hash tables (DHTs); e.g., discovering a file chunk can simply refer to a rapid key lookup in DHTs, given that a unique handle (or identifier) is assigned to each file chunk. DHTs enable nodes to self-organize and self-repair while constantly offering lookup functionality under node dynamism, simplifying system provision and management. The chunk servers in our proposal are organized as a DHT network. Typical DHTs guarantee that if a node leaves, then its locally hosted chunks are reliably migrated to its successor; if a node joins, then it takes over from its successor the chunks whose IDs immediately precede the joining node.

4.3 Replica Management
In distributed file systems (e.g., Google GFS and Hadoop HDFS), a constant number of replicas of each file chunk are maintained on distinct nodes to improve file availability with respect to node failures and departures. Our current load balancing algorithm does not treat replicas distinctly. It is unlikely that two or more replicas are placed on an identical node, because of the random nature of our load rebalancing algorithm. More specifically, each underloaded node samples a number of nodes, each selected with a probability of 1/n, to share their loads (where n is the total number of storage nodes).

Fig -3: Used space of each peer
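The two behaviors described in Sections 4.2 and 4.3 — successor migration on node leave/join, and underloaded nodes sampling peers uniformly (probability 1/n each) — can be sketched as follows. Node and chunk IDs are illustrative small integers, and the names are ours, not an implementation of the paper's system.

```python
import random

class ChunkRing:
    """Toy DHT of storage nodes on an integer identifier ring. Each chunk
    ID is managed by its successor: the first node ID >= the chunk ID,
    wrapping around the ring."""
    def __init__(self, node_ids):
        self.nodes = {nid: [] for nid in node_ids}  # node ID -> hosted chunk IDs

    def successor(self, cid):
        ids = sorted(self.nodes)
        return next((n for n in ids if n >= cid), ids[0])

    def store(self, cid):
        self.nodes[self.successor(cid)].append(cid)

    def leave(self, nid):
        # 4.2: a leaving node's locally hosted chunks migrate to its successor.
        for cid in self.nodes.pop(nid):
            self.store(cid)

    def join(self, nid):
        # 4.2: a joining node takes over, from its successor, the chunks
        # whose IDs now map to the joining node.
        self.nodes[nid] = []
        ids = sorted(self.nodes)
        succ = next((n for n in ids if n > nid), ids[0])
        if succ != nid:
            self.nodes[nid] = [c for c in self.nodes[succ] if self.successor(c) == nid]
            self.nodes[succ] = [c for c in self.nodes[succ] if self.successor(c) != nid]

def share_load(ring, threshold):
    """4.3 sketch: each underloaded node samples a peer uniformly at
    random (probability 1/n per node) and takes one chunk from it if
    that peer is loaded above the threshold."""
    ids = list(ring.nodes)
    for nid in ids:
        if len(ring.nodes[nid]) < threshold:
            peer = random.choice(ids)  # uniform sampling: 1/n each
            if peer != nid and len(ring.nodes[peer]) > threshold:
                ring.nodes[nid].append(ring.nodes[peer].pop())
```

Note that `join` and `leave` conserve the chunk set, which is the reliability guarantee the section attributes to typical DHTs.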
5. EXPERIMENTAL RESULTS
The performance of our algorithm is evaluated through computer simulations. Our implementation is demonstrated on a small-scale cluster environment (Fig. 1). The load of the main server is split into sub-servers; clients access data from the sub-servers, which minimizes the workload of the main server. In the experimental set-up we divide the server load into four peers. The total capacity of the server is 100 GB, split into 4 peers (sub-servers). The load of the server is dynamically allocated to the four peers by running the load rebalancing algorithm. Fig. 3 shows the used space of each peer, and the uploading and downloading speeds are shown in Fig. 4, from which the performance of each peer can be evaluated. For each experimental run, we quantify the time elapsed to complete the load-balancing algorithms, including the HDFS load balancer and our proposal. We perform 20 runs for a given M and average the time required for executing a load balancing algorithm. In our proposal the data is divided into chunks of 3 KB each; if the file is 366 KB, the algorithm splits it into 122 chunks.

Fig -4: Performance evaluation

6. CONCLUSIONS
Cloud computing has widely been adopted by the industry, though there are many existing issues, like load balancing, virtual machine migration, server consolidation, and energy management, which have not been fully addressed. Central to these is the issue of load balancing, which is required to distribute the excess dynamic local workload evenly to all the nodes in the whole cloud to achieve high user satisfaction, resource utilization, and reliability. It also ensures that every computing resource is distributed efficiently and fairly. This paper presents a load rebalancing algorithm with AES encryption. Simulation results show that our proposed load rebalancing algorithm outperforms the existing one in terms of speed, resource utilization, and reliability.

REFERENCES
[1] A. Rowstron and P. Druschel, "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems," Proc. IFIP/ACM Int'l Conf. Distributed Systems Platforms, Heidelberg, pp. 161-172, Nov. 2001.
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. Sixth Symp. Operating System Design and Implementation (OSDI '04), pp. 137-150, Dec. 2004.
[3] I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications,"
IEEE/ACM Trans. Networking, vol. 11, no. 1, pp. 17-21, Feb. 2003.
[4] Hung-Chang Hsiao, Hsueh-Yi Chung, Haiying Shen, and Yu-Chang Chao, "Load Rebalancing for Distributed File Systems in Clouds," IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 5, pp. 951-962, May 2013.
[5] P. Ganesan, M. Bawa, and H. Garcia-Molina, "Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems."
[6] A. R. Bharambe, "Mercury: Supporting Scalable Multi-Attribute Range Queries."
[7] U. Karthik Kumar, "A Dynamic Load Balancing Algorithm in Computational Grid Using Fair Scheduling," International Journal of Computer Science Issues, vol. 8, issue 5, no. 1, Sept. 2011.
[8] G. Aggarwal, R. Motwani, and A. Zhu, "The Load Rebalancing Problem."
[9] Hadoop Distributed File System, "Rebalancing Blocks," http://developer.yahoo.com/hadoop/tutorial/module2.html#rebalancing, 2012.