Secure Load Rebalancing Algorithm for Distributed File Systems in Cloud

Author: Melissa Johnson
IJMEIT// Vol. 2 Issue 3 //March 2014 //Page No: 172-179//e-ISSN: 2348-196x

Raihanath A S (1), Anu V R (2)

(1) P G Student, Computer Science, Ilahia College of Engineering & Technology (ICET), Kerala, India

(2) Head of the Department, CSE & IT, Ilahia College of Engineering & Technology (ICET), Kerala, India

Abstract

Distributed file systems are a key technology for cloud computing applications. In such a file system, each node provides both storage and computing functions. A file is partitioned into a number of chunks allocated to distinct nodes so that data processing can be performed in parallel. In this study, we suggest offloading the load rebalancing task to the storage nodes by having them balance their loads spontaneously, which eliminates the dependence on central nodes. The storage nodes are organized as a network based on distributed hash tables (DHTs). Discovering a file chunk requires only a rapid key lookup, given that a unique handle is allocated to each file chunk. DHTs enable nodes to self-organize and self-repair while constantly offering lookup functionality under node dynamism, simplifying system provisioning and management. We construct a load rebalancing algorithm combined with AES encryption. The simulation results show that our proposed scheme outperforms the existing distributed approach in terms of security parameters.

Keywords: Cloud Computing, Load Rebalancing, Distributed File System, AES encryption.

1. INTRODUCTION

Cloud computing relies on sharing of resources to achieve coherence and economies of scale, similar to a utility delivered over a network. In clouds, clients can dynamically allocate their resources on demand without sophisticated deployment. The MapReduce [2] programming paradigm, distributed file systems, and virtualization are key enabling technologies in cloud computing applications. These techniques emphasize scalability, so clouds can be large in scale, and their constituent entities can arbitrarily fail and join while the system remains reliable.

Distributed file systems are a key technology for cloud computing applications. In such a file system, each node performs both computing and storage functions. Data is divided into chunks stored on distinct nodes so that a MapReduce application can run in parallel. Consider a set of web servers, each with a set of websites. As information is collected about the usage of each website on each web server, it might become apparent that the load is not uniformly distributed across the web servers. An obvious solution would be to reassign websites to web servers so as to minimize the maximum load on a server. A cloud divides the file into a large number of fixed-size chunks and assigns them to different servers. Each server node calculates the usage of each unique website by searching its local file chunks.

Load balancing is a technique for making better use of resources, exploiting parallelism, improving throughput, and reducing response time through an appropriate distribution of the application. Cloud balancing provides an organization with the ability to distribute application requests across any number of application deployments located in data centers and through cloud-computing providers. Cloud balancing takes a broader view of application delivery and applies specified thresholds and service level agreements (SLAs) to every request. The use of cloud balancing can result in the majority of users being served by application deployments in the cloud providers' environments, even though the local application deployment or internal, private cloud might have more than enough capacity to serve that user.

In cloud computing, heterogeneous resources with different systems in different places are dynamically available and distributed geographically. The users' resource requirements in the clouds vary depending on their goals, time constraints, priorities and budgets. Allocating their tasks to the appropriate resources in the clouds so that performance requirements are satisfied and costs are kept in check is an extraordinarily complicated problem. Allocating the resources to the proper users so that the utilization of resources and the profits generated are maximized is also an extremely complex problem. From a computational perspective, it is impractical to build a centralized resource allocation mechanism in such a large-scale distributed environment.

In a distributed file system, the load of a node is proportional to the number of file chunks it holds. The file chunks are not distributed uniformly among the nodes because files are arbitrarily created, deleted, and updated. GFS (Google File System) and HDFS (Hadoop Distributed File System) [9] rely on central nodes to manage the metadata information of the file system and to balance the loads of the storage nodes. The centralized approach is quite simple compared with the distributed method, which is very challenging in cloud computing. However, the performance of the central node may degrade as the number of nodes or file accesses increases linearly. This leaves the node in a blocked state, and further operations cannot be handled by the central node. To overcome this problem, HDFS [9] introduced the concept of multiple name nodes. The workload changes over time for each operation. Because there is no proper migration scheme for these name nodes, any of them may suffer degraded performance.

In this paper, we suggest offloading the load rebalancing task to the storage nodes by having the storage nodes balance their loads spontaneously. This reduces the dependence on the central nodes. The storage nodes are structured as a network based on distributed hash tables (DHTs); discovering a file chunk then amounts to a rapid key lookup in the DHT, given that a unique handle is assigned to each file chunk. DHTs enable nodes to self-organize and self-repair while constantly offering lookup functionality under node dynamism, simplifying system provisioning and management. In this paper, we devise algorithms that are specific to the load rebalancing problem and obtain better results in a secure manner.

Section 2 introduces related work. Section 3 covers basic load rebalancing concepts. Section 4 describes the load rebalancing algorithm. Section 5 presents the tests we performed and the results obtained. Finally, Section 6 concludes the paper.

2. RELATED WORKS

A. Rowstron and P. Druschel [1] introduced a conceptually similar scheme aimed especially at large overlay networks. Their paper presents the design and evaluation of Pastry, a scalable, distributed object location and routing scheme for wide-area peer-to-peer applications. Pastry is a protocol that performs application-level routing and object location in a potentially very large overlay network of nodes connected via the Internet. It can be used to support a wide range of peer-to-peer applications like global data storage, global data sharing, and naming.

P. Ganesan and M. Bawa [5] observe that load balancing is necessary to eliminate skew in range-partitioned data. They present asymptotically optimal online load-balancing algorithms that guarantee a constant imbalance ratio. The data movement cost per tuple insert or delete is constant, and was shown to be close to 1 in experiments. They show how to adapt their algorithms to dynamic P2P environments, and architect a new P2P system that can support efficient range queries.

Ashwin R. Bharambe [6] presents the design of Mercury, a scalable protocol for supporting multi-attribute range-based searches. Mercury differs from previous range-based query systems in that it supports multiple attributes as well as performs explicit load balancing. To guarantee efficient routing and load balancing, Mercury uses novel light-weight sampling mechanisms for uniformly sampling random nodes in a highly dynamic overlay network. The evaluation shows that Mercury is able to achieve its goals of logarithmic-hop routing and near-uniform load balancing.

David R. Karger [3] gives provably efficient load balancing protocols for distributed data storage in P2P systems. The algorithm is simple and easy to implement, so an obvious next research step should be a practical evaluation of these schemes. In addition, several concrete open problems follow from this work. First, it might be possible to further improve the consistent hashing scheme. Second, the range search data structure does not easily generalize to more than one order. For example, when storing music files, one might want to index them by both artist and song title, allowing lookups according to two orderings.

Ion Stoica and Robert Morris [3] introduce the Chord protocol for solving the challenging lookup problem in a decentralized manner. It offers a powerful primitive: given a key, it determines the node responsible for storing the key's value, and does so efficiently. In the steady state, in an N-node network, each node maintains routing information for only O(log N) other nodes, and resolves all lookups via O(log N) messages to other nodes.

P. Brighten Godfrey and Ion Stoica have proposed a scheme to assign IDs to virtual servers, called Low Cost Virtual Server Selection, that yields a simple DHT protocol, called Y0, for which node degree does not increase significantly with the number of virtual servers. Y0 adapts to heterogeneous node capacities, can achieve an arbitrarily good load balance, moves little load, and can compute a node's IDs as O(log n) hashes of its IP address for security purposes. The techniques behind Y0 generalize to arbitrary overlay topologies while providing some flexibility in neighbor selection, even if the underlying topology did not.

3. LOAD REBALANCING

Load balancing is a computer networking method to distribute workload across multiple computers or a computer cluster, network links, central processing units, disk drives, or other resources, to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Using multiple components with load balancing, instead of a single component, may increase reliability through redundancy. The load balancing service is usually provided by dedicated software or hardware, such as a multilayer switch or a Domain Name System server.

Load balancing is one of the central issues in cloud computing. It is a mechanism that distributes the dynamic local workload evenly across all the nodes in the whole cloud to avoid a situation where some nodes are heavily loaded while others are idle or doing little work. It helps to achieve high user satisfaction and a high resource utilization ratio, hence improving the overall performance and resource utility of the system. It also ensures that every computing resource is distributed efficiently and fairly. It further prevents bottlenecks of the system which may occur due to load imbalance. When one or more components of any service fail, load balancing helps in continuation of the service by implementing failover, i.e. provisioning and de-provisioning instances of applications without interruption.

The goal of load balancing is to improve performance by balancing the load among these various resources (network links, central processing units, disk drives) to achieve optimal resource utilization, maximum throughput, and minimum response time, and to avoid overload. To distribute load on different systems, different load balancing algorithms are used. In general, load balancing algorithms are classified in two major ways: by how the load is distributed and how processes are allocated to nodes (the system load), and by the information status of the nodes (the system topology). Scalability, resource utilization, performance, response time and overhead are the metrics used for load balancing in cloud.
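As a rough illustration of the goal stated in Section 3 (spreading workload evenly so that no node is heavily loaded while others sit idle), the following Python sketch repeatedly shifts load from the most loaded node to the least loaded one. It is a generic greedy illustration, not the algorithm proposed in this paper; the node names, the chunk counts, and the functions `imbalance_ratio` and `even_out` are hypothetical.

```python
def imbalance_ratio(loads: dict) -> float:
    """Ratio of the heaviest node's load to the mean load (1.0 means perfectly even)."""
    mean = sum(loads.values()) / len(loads)
    return max(loads.values()) / mean if mean else 1.0


def even_out(loads: dict):
    """Greedy sketch: move half of the gap from the heaviest node to the lightest one.

    Returns the balanced loads and the list of (source, destination, amount) moves.
    Stops when the heaviest and lightest nodes differ by at most one unit.
    """
    loads = dict(loads)  # work on a copy so the caller's mapping is untouched
    moves = []
    while True:
        heavy = max(loads, key=loads.get)
        light = min(loads, key=loads.get)
        amount = (loads[heavy] - loads[light]) // 2
        if amount == 0:
            return loads, moves
        loads[heavy] -= amount
        loads[light] += amount
        moves.append((heavy, light, amount))


# Hypothetical chunk counts for four nodes (illustrative values only).
before = {"A": 90, "B": 10, "C": 20, "D": 2}
after, moves = even_out(before)
print(imbalance_ratio(before), imbalance_ratio(after))
```

The ratio of maximum load to mean load is one simple way to quantify how far a system is from an even distribution; other metrics listed in this section, such as response time and overhead, are not modeled here.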

Fig -1: Architecture diagram

Fig. 1 shows a diagrammatic representation of a DFS in a cloud environment. The load of the main server is split among sub-servers, moving the system from a centralized to a distributed scheme. Clients access data from the sub-servers, which minimizes the workload of the main server.

4. LOAD REBALANCING ALGORITHM

To balance the requests for resources, it is important to recognize a few major goals of load balancing algorithms. Cost effectiveness: the primary aim is to achieve an overall improvement in system performance at a reasonable cost. The distributed system in which the algorithm is implemented may change in size or topology, so the algorithm must be scalable and flexible enough to allow such changes to be handled easily. Prioritization of the resources or jobs needs to be done beforehand through the algorithm itself, so that important or high-priority jobs receive better service rather than all jobs receiving equal service regardless of their origin. The load rebalancing algorithm with AES encryption is shown below.

1. Initialize the server and its sub-servers.
2. Establish connections between the server and the sub-servers using the IP address or port number.
3. Upload the file that should be shared to the server.
4. The server encrypts the data with AES encryption.
5. Split the file into multiple chunks.
6. Calculate the memory capacity of each sub-server.
7. Divide the total number of chunks by the total number of sub-servers.
8. Upload each chunk to the sub-servers based on their memory capacity.
9. If a sub-server's capacity is insufficient, transfer the excess chunks to the next sub-servers.
10. Append an index value to each chunk.
11. When the client requests a file, its chunks are received from the different sub-servers based on the index value.
12. The client collects all the chunks, then decrypts and views the file.

The data is encrypted at the server side using the AES encryption algorithm, as shown in Fig. 2. The sub-servers are not allowed to view the data, which is stored in an encrypted format. Only the client has the privilege to decrypt the data.
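Steps 5 to 12 of the algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the AES step is omitted and the payload is treated as already-encrypted opaque bytes; sub-server capacity is counted in chunks rather than bytes; the 3 KB chunk size is the one reported in Section 5; and the names `SubServer`, `split_into_chunks`, `distribute`, and `reassemble` are hypothetical.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 3 * 1024  # 3 KB per chunk, the size reported in Section 5


@dataclass
class SubServer:
    name: str
    capacity: int                               # assumed: capacity counted in chunks
    chunks: dict = field(default_factory=dict)  # index value -> chunk bytes

    def free_slots(self) -> int:
        return self.capacity - len(self.chunks)


def split_into_chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Step 5: split the (already encrypted) payload into fixed-size chunks."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def distribute(chunks: list, servers: list) -> None:
    """Steps 6-10: give every sub-server an even share and push excess to the next ones."""
    if len(chunks) > sum(s.free_slots() for s in servers):
        raise ValueError("not enough total capacity for this file")
    share = -(-len(chunks) // len(servers))  # step 7, rounded up
    pending = list(enumerate(chunks))        # step 10: pair each chunk with its index
    quota = 0
    for server in servers * 2:               # second lap places excess that wrapped around
        quota += share                       # unplaced quota carries over to this server
        take = min(quota, server.free_slots(), len(pending))
        for index, chunk in pending[:take]:
            server.chunks[index] = chunk
        pending = pending[take:]
        quota -= take


def reassemble(servers: list) -> bytes:
    """Steps 11-12: gather chunks from all sub-servers and join them in index order."""
    collected = {}
    for server in servers:
        collected.update(server.chunks)
    return b"".join(collected[i] for i in sorted(collected))


# A 366 KB payload and four peers; the capacities are illustrative values only.
payload = bytes(i % 251 for i in range(366 * 1024))
peers = [SubServer("peer1", 40), SubServer("peer2", 20),
         SubServer("peer3", 40), SubServer("peer4", 40)]
distribute(split_into_chunks(payload), peers)
print([len(p.chunks) for p in peers])
```

In this sketch the second peer cannot hold its even share, so its excess is carried forward to the following peers, mirroring step 9. Decryption on the client side would happen after `reassemble`, which is outside the scope of the sketch.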

Fig -2: Encrypted data transfer among server and peer

4.1 Chunk creation

A file is partitioned into a number of chunks allocated to distinct nodes so that MapReduce tasks can be performed in parallel over the nodes. The load of a node is typically proportional to the number of file chunks the node possesses. Because the files in a cloud can be arbitrarily created, deleted, and appended, and nodes can be upgraded, replaced and added in the file system, the file chunks are not distributed as uniformly as possible among the nodes. Our objective is to allocate the chunks of files as uniformly as possible among the nodes such that no node manages an excessive number of chunks.

4.2 DHT formulation

The storage nodes are structured as a network based on distributed hash tables (DHTs); for example, discovering a file chunk amounts to a rapid key lookup in the DHT, given that a unique handle (or identifier) is assigned to each file chunk. DHTs enable nodes to self-organize and self-repair while constantly offering lookup functionality under node dynamism, simplifying system provisioning and management. The chunk servers in our proposal are organized as a DHT network. Typical DHTs guarantee that if a node leaves, its locally hosted chunks are reliably migrated to its successor; if a node joins, it takes over from its successor the chunks whose IDs immediately precede the joining node.

4.3 Replica Management

In distributed file systems (e.g., Google GFS and Hadoop HDFS), a constant number of replicas of each file chunk are maintained on distinct nodes to improve file availability with respect to node failures and departures. Our current load balancing algorithm does not treat replicas distinctly. It is unlikely that two or more replicas are placed on an identical node because of the random nature of our load rebalancing algorithm. More specifically, each underloaded node samples a number of nodes, each selected with a probability of 1/n, to share their loads (where n is the total number of storage nodes).

Fig -3: Used space of each peer
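The DHT behavior described in Section 4.2 (chunk handles mapped to nodes by key lookup, chunks migrating to the successor when a node leaves, and a joining node taking over the chunks that precede it) can be sketched in Python as follows. This is a simplified single-process model under stated assumptions, not the system evaluated in this paper: SHA-1 is assumed as the hash function, lookups scan a sorted list instead of using per-node routing tables, and the class `ChunkRing` and its method names are hypothetical.

```python
import bisect
import hashlib


def ring_id(key: str) -> int:
    """Map a node name or chunk handle to a point on the identifier ring (SHA-1 assumed)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class ChunkRing:
    def __init__(self) -> None:
        self.node_ids = []  # sorted node identifiers
        self.store = {}     # node id -> {chunk id: chunk handle}

    def successor(self, ident: int) -> int:
        """First node whose ID is >= ident, wrapping around the ring."""
        pos = bisect.bisect_left(self.node_ids, ident)
        return self.node_ids[pos % len(self.node_ids)]

    def put(self, handle: str) -> int:
        """Store a chunk handle on the node responsible for its ID (needs at least one node)."""
        cid = ring_id(handle)
        self.store[self.successor(cid)][cid] = handle
        return cid

    def lookup(self, handle: str) -> int:
        """Key lookup: return the ID of the node that hosts this chunk handle."""
        return self.successor(ring_id(handle))

    def join(self, name: str) -> int:
        """A joining node takes over, from its successor, the chunks that now map to it."""
        node = ring_id(name)
        if node in self.store:
            raise ValueError("node already on the ring")
        old_owner = self.successor(node) if self.node_ids else None
        bisect.insort(self.node_ids, node)
        self.store[node] = {}
        if old_owner is not None:
            moving = [c for c in self.store[old_owner] if self.successor(c) == node]
            for cid in moving:
                self.store[node][cid] = self.store[old_owner].pop(cid)
        return node

    def leave(self, node: int) -> None:
        """A departing node's chunks are migrated to its successor."""
        if len(self.node_ids) == 1 and self.store[node]:
            raise RuntimeError("the last node cannot leave while it still hosts chunks")
        self.node_ids.remove(node)
        orphaned = self.store.pop(node)
        if orphaned:
            self.store[self.successor(node)].update(orphaned)


ring = ChunkRing()
first = ring.join("node-0")
ring.join("node-1")
ring.put("file.dat#chunk-0")
ring.leave(first)
print(ring.lookup("file.dat#chunk-0") in ring.store)
```

A real DHT such as Chord resolves `successor` with O(log N) messages between nodes; the sorted list here only stands in for that routing so that the migration rules on join and leave are easy to follow.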

5. EXPERIMENTAL RESULTS

The performance of our algorithm is evaluated through computer simulations. Our implementation is demonstrated in a small-scale cluster environment (Fig. 1). The load of the main server is split among sub-servers, and clients access data from the sub-servers, which minimizes the workload of the main server. In the experimental setup, the total capacity of the server is 100 GB, and its load is divided among four peers (sub-servers). The load of the server is dynamically allocated to the four peers by running the load rebalancing algorithm. Fig. 3 shows the usage of each peer. The uploading and downloading speeds are shown in Fig. 4, from which the performance of each peer can be evaluated. For each experimental run, we quantify the time elapsed to complete the load-balancing algorithms, including the HDFS load balancer and our proposal. We perform 20 runs for a given M and average the time required for executing a load balancing algorithm. In our proposal, the data is divided into chunks of 3 KB each; a 366 KB file, for example, is split into 122 chunks.

Fig -4: Performance evaluation

6. CONCLUSIONS

Cloud computing has been widely adopted by the industry, though there are many existing issues like load balancing, virtual machine migration, server consolidation, and energy management which have not been fully addressed. Central to these issues is load balancing, which is required to distribute the excess dynamic local workload evenly to all the nodes in the whole cloud to achieve high user satisfaction, resource utilization and reliability. It also ensures that every computing resource is distributed efficiently and fairly. This paper presents a load rebalancing algorithm with AES encryption. Simulation results show that our proposed load rebalancing algorithm outperforms the existing one in terms of speed, resource utilization and reliability.

REFERENCES

[1] A. Rowstron and P. Druschel, "Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems," Proc. IFIP/ACM Int'l Conf. Distributed Systems Platforms Heidelberg, pp. 161-172, Nov. 2001.

[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. Sixth Symp. Operating System Design and Implementation (OSDI '04), pp. 137-150, Dec. 2004.

[3] I. Stoica, R. Morris, D. Liben-Nowell, D.R. Karger, M.F. Kaashoek, F. Dabek, and H. Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Protocol for Internet Applications," IEEE/ACM Trans. Networking, vol. 11, no. 1, pp. 17-21, Feb. 2003.

[4] Hung-Chang Hsiao, Hsueh-Yi Chung, Haiying Shen, and Yu-Chang Chao, "Load Rebalancing for Distributed File Systems in Clouds," IEEE Trans. Parallel and Distributed Systems, vol. 24, no. 5, pp. 951-962, May 2013.

[5] Prasanna Ganesan, Mayank Bawa, Hector, "Online Balancing of Range-Partitioned Data with Applications to Peer-to-Peer Systems".

[6] Ashwin R. Bharambe, "Mercury: Supporting Scalable Multi-Attribute Range Queries".

[7] U. Karthik Kumar, "A Dynamic Load Balancing Algorithm in Computational Grid Using Fair Scheduling," International Journal of Computer Science Issues, vol. 8, issue 5, no. 1, September 2011.

[8] Gagan Aggarwal, Rajeev Motwani, An Zhu, "The Load Rebalancing Problem".

[9] Hadoop Distributed File System, "Rebalancing Blocks," http://developer.yahoo.com/hadoop/tutorial/module2.html#rebalancing, 2012.

[10] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. Sixth Symp. Operating System Design and Implementation (OSDI '04), pp. 137-150, Dec. 2004.
