Application for Data De-duplication Algorithm Based on Mobile Devices

JOURNAL OF NETWORKS, VOL. 8, NO. 11, NOVEMBER 2013
Ge Xingchen
School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai, China
Email: [email protected]

Deng Ning
Lenovo Corporate Research & Development, Beijing, China
Email: [email protected]

Yin Jian*
School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai, China
*Corresponding author. Email: [email protected]

Abstract—With the rapid development of mobile devices and the rise of personal cloud services, the business of Cloud Storage and Cloud Synchronization has grown rapidly in recent years. As a result, it puts pressure on network storage space and network bandwidth, especially in the Mobile Internet field. Data De-duplication algorithms reduce data redundancy by deleting identical files or identical data chunks in data storage systems, so that network storage space is saved and network bandwidth utilization is improved. Algorithms that take Fixed-size Partition as the chunking strategy have been applied in Cloud Storage and Cloud Synchronization systems in the Mobile Internet field. Although this method is simple and efficient, it is so sensitive to inserting or deleting data that the Data De-duplication Rate is lower than expected. An application of a Data De-duplication algorithm based on Content-Defined Chunking is proposed to provide a basis for Cloud Storage and Cloud Synchronization through mobile devices. In this approach, a file is divided into chunks of different sizes based on its content instead of a pre-fixed size. Key steps of the Data De-duplication algorithm were optimized to accommodate mobile devices. Experimental results showed that the approach brought a relatively higher Data De-duplication Rate and lower machine overhead.

Index Terms—Data De-duplication; CDC; Mobile Devices; Chunking

I. INTRODUCTION

The rapid development of mobile devices has fundamentally changed the way people obtain information. People can now acquire and process information on handheld or mobile devices, tasks that used to be carried out only on PCs. Personal cloud storage applications represented by Dropbox have become the main solution for cross-platform data sharing and collaborative work. As the number of personal computing terminals grows larger and larger, it has become a hot issue in both academic and industrial circles how

data can be shared seamlessly and cooperatively among multiple devices. With the development of the Mobile Internet and bulk data handling techniques, Mobile Internet users can access Internet services at any time and place through handheld or mobile devices. Because of the low bandwidth and high cost of the networks through which the Mobile Internet is accessed, and the relatively small physical memory of mobile terminals, uploading and downloading file data between Cloud Servers and mobile terminals place a heavy burden on network bandwidth, network storage and the physical memory of terminal hardware. Research shows that 80%~90% of data in network storage systems and filing systems is redundant [1]. Therefore, how to reduce redundant data so as to make the best use of network storage space and network bandwidth has become one of the hot topics in the field of Cloud Storage and Cloud Synchronization, especially when these services are based on mobile devices, which are much more sensitive to power consumption, device performance, network storage space and network bandwidth. Data De-duplication technology, which removes duplicate data and keeps only one copy in order to eliminate data redundancy, has been applied to data backup and data archiving systems. For the same reason, it has been used in Cloud Storage and Cloud Synchronization based on mobile devices. For example, Data De-duplication algorithms that use the Fixed-size Partition (FSP) method as the file chunking strategy have been applied in Personal Cloud Storage applications represented by Dropbox. Surveys show that FSP Data De-duplication is a simple and effective way to reduce storage space and improve the utilization of network bandwidth. However, with the FSP algorithm a file is divided into chunks of a pre-fixed size and the cut-off points of a file are decided before file segmentation. So when data is added to or deleted from a file, the cut-off points and the data chunks following the modified


points will be changed, which makes it impossible to delete the duplicate data in these chunks. Because of the FSP algorithm's sensitivity to adding or deleting data in a file, the efficiency of Data De-duplication is visibly reduced. To solve this problem, Content-Defined Chunking (CDC) is proposed for the file segmentation stage. The algorithm calculates data fingerprints from the content of the file, and then the file is divided into chunks of different sizes according to the data fingerprints. In theory, only a very small number of data chunks near the modification point are affected when data is inserted into or deleted from a file. The experimental data prove that the CDC-based algorithm can increase the efficiency of Data De-duplication without introducing considerable machine overhead. The paper analyzes key techniques of the CDC Data De-duplication algorithm based on the concerns of industry:
- measure the candidate algorithms for calculating data fingerprints;
- analyze the influence of file data segmentation granularity;
- propose some optimizations for Data De-duplication based on the references.
The research optimizes the CDC Data De-duplication algorithm and analyzes its performance, so as to provide a basis for applying the CDC Data De-duplication algorithm to Cloud Storage and Cloud Synchronization based on mobile devices.

II. RELEVANT KNOWLEDGE

Data De-duplication algorithms detect identical data objects in data streams based on the degree of redundancy of the data itself. Only one copy of each data object is transmitted and stored, and a pointer to that copy replaces the other duplicates [2, 18]. The difficulties lie in the following aspects: segmentation of the file, extraction of the eigenvalues of data chunks and calculation of file fingerprints. A Data De-duplication algorithm that uses the chunk as the basic object of data operations consists of the following steps.
Segment files. Segmentation of data chunks is the basis of the Data De-duplication algorithm. According to the granularity of data operations, Data De-duplication can be classified into whole-file level and data-chunk level. Whole File Detection (WFD) Data De-duplication uses a whole file as a chunk, while data-chunk-level Data De-duplication has three file segmentation methods: Fixed-size Partition, Content-Defined Chunking and the Sliding Block technique.
Selection and calculation of the data eigenvalue. For each chunk, a unique value is calculated to represent it. Hash functions, whose collision resistance and encryption properties are usually good, are generally adopted to calculate the eigenvalue of each data chunk. MD-5 or SHA-1 is commonly used to obtain the eigenvalues, both of which show good collision resistance and encryption [3].
Data detection and data processing. Compare the eigenvalue of a new chunk with those in the chunk index and judge



whether the same one already exists. If it does, there is no need to store the chunk; when reading the file, the copy of the needed chunk can be retrieved through the eigenvalue and the mapping to the chunk's physical storage location in the index. If, on the contrary, no identical eigenvalue exists, the chunk represented by the eigenvalue is stored, and the eigenvalue together with the mapping to the chunk's physical storage location is added to the chunk index.

A. WFD Data De-duplication Algorithm
The Data De-duplication algorithm based on the Whole File Detection technique regards a file as a single chunk when processing data redundancy. The hash value of a file is calculated and compared with the hash values in the index. If the same hash value exists, the file represented by that value is not stored; otherwise, the system stores the new file. This algorithm can detect identical files in a storage system at high speed, but it cannot detect identical data within a file, leaving its de-duplication rate much lower than that of data-chunk-level de-duplication. In particular, modifying a single byte forces the storage system to store a new file [3].

B. FSP Data De-duplication Algorithm
The Data De-duplication algorithm based on the FSP technique pre-defines a fixed chunk size, and a file is segmented into a sequence of chunks of that size. Then the MD-5 or SHA-1 algorithm is used to calculate the data fingerprint of each chunk, which is compared with the fingerprints in the index. If the same fingerprint exists, the chunk represented by that fingerprint is not stored; otherwise, the chunk is stored. Compared with the WFD algorithm, this method raises the Data De-duplication Rate and has been applied in many areas. The steps of the FSP algorithm are shown in Fig. 1.
The FSP technique has a major limitation: a file is divided into chunks of a fixed size, with the cut-off points decided before file segmentation. So when data is added to or deleted from a file, the data chunks after the modification point change, which changes the fingerprints of these chunks and makes it impossible to detect duplicate data in them.

Figure 1. Data de-duplication algorithm based on FSP
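For concreteness, the FSP flow just described can be sketched as follows. This is a minimal illustration, not the implementation of any particular product: the 4 KB chunk size and the in-memory HashMap standing in for the real chunk store and index are illustrative assumptions.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Minimal FSP de-duplication sketch: fixed 4 KB chunks, MD-5 eigenvalues,
// an in-memory index in place of a real chunk store.
public class FspDedup {
    static final int CHUNK_SIZE = 4096;                       // pre-fixed chunk size (illustrative)
    static final Map<String, byte[]> index = new HashMap<>(); // eigenvalue -> stored chunk

    public static long deduplicate(byte[] data) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        long newBytes = 0;
        for (int off = 0; off < data.length; off += CHUNK_SIZE) {
            byte[] chunk = Arrays.copyOfRange(data, off, Math.min(off + CHUNK_SIZE, data.length));
            String eigenvalue = toHex(md5.digest(chunk));      // fingerprint of this chunk
            if (!index.containsKey(eigenvalue)) {              // unseen chunk: store it
                index.put(eigenvalue, chunk);
                newBytes += chunk.length;
            }                                                  // duplicate chunk: keep a pointer only
        }
        return newBytes;                                       // bytes that actually had to be stored
    }

    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] file = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("new bytes stored: " + deduplicate(file));
    }
}
```

Because the cut-off points are fixed byte offsets, an insertion near the start of the file shifts every later chunk, which is exactly the weakness discussed above.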

C. CDC Data De-duplication Algorithm
Fig. 2 shows the basic steps of the CDC algorithm. First, a fixed-size sliding block is moved backward byte by byte from the beginning of the file, and a fingerprint is calculated at each position (the efficient and well-randomized Rabin hash algorithm is usually used [4, 5]). If the fingerprint equals a pre-set



threshold value, the position of the block is regarded as a chunk boundary. This process is repeated until the whole file has been divided into chunks. Then the MD-5 algorithm is used to calculate the eigenvalue of every chunk. Finally, the index is queried and the chunk is stored or discarded in the same way as in the FSP algorithm.

Figure 2. Data de-duplication algorithm based on CDC
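The boundary-detection loop just described can be sketched as follows. This is a simplified illustration: a Rabin-Karp style rolling hash stands in for the true Rabin fingerprint, and the window width, mask, threshold and chunk-size bounds are assumed values rather than the parameters used in the paper's experiments.

```java
import java.util.ArrayList;
import java.util.List;

// Content-defined chunking sketch. A boundary is declared when the low 12 bits of
// the window fingerprint equal a preset value, giving chunks of roughly 4 KB on
// average. All constants are illustrative.
public class CdcChunker {
    static final int WINDOW = 48;        // width of the sliding block in bytes
    static final int BASE = 257;         // rolling-hash base
    static final int MASK = 0xFFF;       // 12 low bits -> expected average chunk ~4 KB
    static final int TARGET = 0x78;      // pre-set threshold the fingerprint must equal
    static final int MIN_CHUNK = 1024;   // lower bound on chunk size
    static final int MAX_CHUNK = 16384;  // upper bound on chunk size

    /** Returns the cut-off points (exclusive end offsets) of the chunks of data. */
    public static List<Integer> chunkBoundaries(byte[] data) {
        int pow = 1;                                        // BASE^(WINDOW-1), used to drop
        for (int i = 1; i < WINDOW; i++) pow *= BASE;       // the byte leaving the window

        List<Integer> cuts = new ArrayList<>();
        int start = 0, hash = 0;
        for (int i = 0; i < data.length; i++) {
            if (i - start >= WINDOW) {
                hash -= (data[i - WINDOW] & 0xFF) * pow;    // oldest byte slides out
            }
            hash = hash * BASE + (data[i] & 0xFF);          // new byte slides in
            int len = i - start + 1;
            boolean hit = len >= WINDOW && (hash & MASK) == TARGET;
            if ((hit && len >= MIN_CHUNK) || len >= MAX_CHUNK) {
                cuts.add(i + 1);                            // content-defined cut-off point
                start = i + 1;
                hash = 0;                                   // restart fingerprint for next chunk
            }
        }
        if (start < data.length) cuts.add(data.length);     // tail chunk
        return cuts;
    }
}
```

Because the cut-off points depend only on the bytes inside the sliding window, inserting or deleting data re-aligns the boundaries within at most a few chunks of the edit, which is exactly the property the FSP scheme lacks.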

The Data De-duplication algorithm based on CDC adopts a content-based chunking method, so chunk sizes are not fixed. Because of the correlation with file content, modifying a small amount of data only affects a few data chunks near the modification point, while in theory the subsequent data chunks do not change. Compared with the FSP algorithm, it can therefore detect more data redundancy. However, the CDC algorithm has its own limitations as well. Too many chunks cause memory overhead for storing the metadata information and the other duplicates produced in the process of Data De-duplication, while too few chunks make the Data De-duplication Rate much lower than expected. Therefore, a variety of factors should be taken into consideration when determining the partition granularity of data chunking.

D. Data De-duplication Algorithm Based on the Sliding Block Technique

Figure 3. Data de-duplication algorithm based on the sliding block technique

Combining characteristics of the FSP and CDC algorithms, the Data De-duplication algorithm based on the Sliding Block technique uses a fixed chunk size. The technique effectively removes duplicates in a file by detecting data redundancy in each fixed-size data chunk. A checksum algorithm is used to calculate a weak checksum: the Sliding Block technique calculates a checksum over a fixed-size block and compares the calculated weak checksum with the

stored checksums. If they are identical, the duplicated data is confirmed by calculating a strong checksum with MD-5 or SHA-1. If redundancy is found, the block slides over the redundant region, and the new chunk, the redundant data chunk and the data fragment before the redundancy are recorded [4]. If no redundancy is detected by the time the sliding distance reaches the pre-set chunk size, the hash value of the chunk is calculated and the chunk is stored. The Sliding Block technique combines the advantages of the FSP and CDC algorithms, but it is prone to producing fragments and causes high computational overhead. The steps of the Data De-duplication algorithm based on the Sliding Block technique are shown in Fig. 3.

E. Comparison of Data De-duplication Algorithms
Data De-duplication based on the WFD algorithm cannot detect redundancy within a file, which leads to a lower de-duplication rate. Thus, data-chunk-level de-duplication algorithms are used under normal circumstances. The characteristics of the three data-chunk-level algorithms are summed up according to IBM's research [7]. The Data De-duplication algorithm based on Sliding Block possesses the highest de-duplication rate, followed by the CDC algorithm, and last the FSP algorithm. However, according to Ref. [6], the de-duplication rate of the Sliding Block method is higher than CDC's when the chunking granularity is small, while the CDC algorithm possesses the higher de-duplication rate otherwise. In terms of machine overhead, the Sliding Block algorithm produces the most metadata information, followed by the FSP Data De-duplication algorithm; the CDC algorithm generates the smallest extra system overhead for metadata.

F. Optimizations for the Application of the CDC Algorithm
I/O optimization. Since mobile devices and computers differ in I/O performance, the granularity setting will differ between them. Many factors affect I/O throughput, such as hardware and the code of upper-layer applications, so the machine overhead is calculated here without I/O performance. Some studies on optimizing I/O performance can be put into practice in applications of the Data De-duplication algorithm based on mobile devices [2].
Data security. Besides the I/O overhead, effective methods are required to ensure the correctness of the data stored in Cloud Storage and Cloud Synchronization systems based on mobile devices. Ref. [8] proposed a secure Cloud Storage system for mobile devices with provable data possession. Furthermore, Ref. [9] proposed a method to choose for each chunk a replication level that is a function of the amount of data that would be lost if that chunk were lost. The experiment in Ref. [9] showed that the technique achieved significantly higher robustness than a conventional approach while requiring about half the storage space.
Retrieving data fingerprints. At present, the fingerprints of the data chunks stored in large storage systems are huge in number, and retrieving fingerprints efficiently is difficult. An


efficient data fingerprint retrieval algorithm was proposed in Ref. [10].
Optimization based on metadata information. Ref. [11] proposed a method to segment files based on metadata information in order to maximally reduce the duplicates produced in the process of Data De-duplication. Ref. [12] proposed a new method that achieves comparable duplicate elimination while using chunks of larger average size, for distributed storage systems that require multiple chunk fragments and whose metadata overhead per stored chunk is high.
Data De-duplication in data streams. Data De-duplication is usually performed on the physical media of the source and destination sides, which are usually thought of as servers and clients. Ref. [13] presented algorithms to detect duplicates in data streams.

G. Applications of the Data De-duplication Algorithm
The Data De-duplication algorithm changed the storage mode and brought huge benefits to industry. Nowadays, Data De-duplication has been applied in many fields, such as data backup, disaster recovery and data archiving systems [2]. Data De-duplication based on FSP is the most widely used among them, for example in the Venti storage system and some virtual machine systems. Applications of CDC Data De-duplication are relatively few in the market, and most studies of CDC are based on computers. In the CDC algorithm, rabin-hash is usually used to calculate the data fingerprints, and MD-5 or SHA-1 is used to calculate the data chunk eigenvalues. Data De-duplication mostly uses the FSP algorithm on mobile devices and the CDC algorithm on computers. After a comprehensive analysis of the advantages and disadvantages of each Data De-duplication algorithm, we propose that CDC detection technology can be applied to applications on mobile devices. This study adopts CDC detection technology as the way to segment files in Data De-duplication based on mobile devices.

III. IMPLEMENTATION STRATEGIES

The target of this study is to find the optimal implementation strategy of the CDC algorithm on mobile devices. Nowadays, the Mobile Internet has a series of restrictions in aspects such as mobile terminals, access networks, application services, security and privacy protection [14]. Considering the characteristics of mobile terminals and of the Data De-duplication algorithm, some key techniques and implementation strategies of the CDC algorithm on mobile devices are introduced as follows.

A. Selection of Experimental Samples
The experiment performed operations such as inserting or deleting a small amount of data in files of frequently used types, based on the situation of industry. Then the Data De-duplication Rate and the time of data chunking, fingerprint calculation and eigenvalue calculation were calculated. There are some limitations on the file types that can be edited on mobile devices. According to the data in some Cloud Servers, the experiment used files of txt, doc or docx, ppt or pptx and xls or xlsx types as samples. We chose at least ten files of each type and then calculated the average of the experimental results.

B. Selection of Data Fingerprint Calculation Algorithms
File segmentation based on file content is a key technique of the CDC Data De-duplication algorithm, and file content is represented by data fingerprints in the implementation of the CDC algorithm. An appropriate fingerprint calculation algorithm brings high correlation between a data chunk and its content, so that in theory only one or a few data chunks near the modification point change when data is inserted into or deleted from a file, and the Data De-duplication Rate can be increased.
The experiment selected simple-hash, RS-hash, JS-hash, PJW-hash, ELF-hash, BKDR-hash, SDBM-hash, DJB-hash, AP-hash, CRC-hash and rabin-hash as candidates for calculating data chunk fingerprints. First, these candidates were screened with two indicators: the hash distribution and the average length of the hash buckets, calculated by Eq. (1) and Eq. (2). In Eq. (1), bucket_usage is the ratio of the number of used hash buckets to the total number of hash buckets and represents the hash distribution. In Eq. (2), avg_bucket_length is the average length of a hash bucket. Based on these two equations we estimated the candidate hash functions of the experiment; the test showed that the candidate fingerprint calculation algorithms are roughly equal in these two aspects, so the experiment continued with all candidates in order to find the best one for calculating the data fingerprint.

bucket_usage = k / n,  0 ≤ k ≤ n                                   (1)

avg_bucket_length = ( Σ_{i=1..n} length_i ) / n                    (2)
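As a concrete illustration of this screening step, the sketch below hashes a set of sample keys into n buckets with any candidate function and reports the two indicators of Eq. (1) and Eq. (2). The driver itself, its parameters and the use of Java's ToIntFunction are assumptions made for illustration, not part of the paper's test harness.

```java
import java.util.function.ToIntFunction;

// Screening indicators from Eq. (1) and Eq. (2): bucket usage and average bucket
// length for one candidate hash function over a set of sample keys.
public class BucketStats {
    public static void report(String name, ToIntFunction<byte[]> hash, byte[][] keys, int n) {
        int[] length = new int[n];                      // length of each hash bucket
        for (byte[] key : keys) {
            int bucket = Math.floorMod(hash.applyAsInt(key), n);
            length[bucket]++;
        }
        int k = 0, total = 0;                           // k = number of used buckets
        for (int len : length) {
            if (len > 0) k++;
            total += len;
        }
        double bucketUsage = (double) k / n;            // Eq. (1): k / n
        double avgBucketLength = (double) total / n;    // Eq. (2): sum of lengths over n
        System.out.printf("%s: bucket_usage=%.3f avg_bucket_length=%.3f%n",
                name, bucketUsage, avgBucketLength);
    }
}
```

The same driver can be called once per candidate (simple-hash, BKDR-hash, rabin-hash and so on) so that the indicators are directly comparable.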

These candidate algorithms were then used in the experiment one by one, and the resulting Data De-duplication Rates and machine overheads were contrasted to provide a basis for selecting the chunk fingerprint calculation algorithm.

C. Setting of File Segmentation Granularity
In storage systems, both the file data itself and the metadata information are stored. A small granularity brings a high Data De-duplication Rate, but more metadata information and other duplicates produced in the process of Data De-duplication need to be stored. Considering that some mobile devices have only gigabytes or dozens of gigabytes of storage, a comprehensive



analysis should be made to provide a basis for the granularity setting. The granularity is influenced by the pre-set threshold value and the data chunk boundaries [15]. Ref. [15] proposed a method to set the granularity based on a histogram of chunk boundaries built from abundant statistical data; the histogram shows the distribution pattern of the chunk boundaries for multiple pre-set thresholds. In addition, file size and self-adaptation to file content should be taken into consideration when setting the granularity.

IV. ALGORITHM ANALYSIS

A. Metric of Algorithm
The purpose of the Data De-duplication algorithm is to reduce data redundancy so that network storage space and network bandwidth are saved. When it is put into practical applications, machine overhead is an important factor that must also be considered. As a result, both the Data De-duplication Rate and the machine overhead are used to analyze the algorithm.
Metric of the Data De-duplication Rate. The ratio of redundancy size to file size is regarded as the Data De-duplication Rate (DDR) in the experiment. Eq. (3) is used for a preliminary calculation of the DDR when the experiment deletes or inserts data in files of various types. In the equation, ChunkSize is the total size of the new chunks that have been stored in the process of Data De-duplication, as given by Eq. (4), and n is the number of sample files. The difference between FileSize and ChunkSize is used as an approximation of the size of the redundancy.

DDR = ( Σ_{i=1..n} (FileSize − ChunkSize) / FileSize ) / n         (3)

ChunkSize = Σ_{i=1..n} bytes_i                                     (4)

The value calculated by Eq. (3) is associated with the size of the modified data, so the ratio of the modification size to the total size of the new chunks stored in the Data De-duplication process can be used to further measure the results of Eq. (3). In Eq. (5), ModifySize is the size of the modified data in a file, ChunkSize is the total size of the new chunks stored in the Data De-duplication process and n is the number of sample files.

DDR = ( Σ_{i=1..n} ModifySize / ChunkSize ) / n                    (5)

Metric of machine overhead. The purpose of the experiment was to optimize the CDC algorithm on mobile devices through a comparison of the candidate key techniques of the CDC algorithm, combined with the situation of industry. As there are many factors affecting I/O, the experiment took no account of the I/O overhead. We therefore did not take the time of reading or writing files and the overhead of metadata storage into consideration, and approximated the machine overhead by the total time of fingerprint calculation, file segmentation and eigenvalue calculation. Eq. (6) is used to calculate this total time, where n is the number of sample files.

time = ( Σ_{i=1..k} (fingertime + segtime + eigentime) ) / n       (6)
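A small sketch of how these three metrics could be computed from per-file measurements is shown below. The array-based inputs (per-file sizes, modified bytes and the three timing components) are assumptions about how the measurements are collected, not the authors' code.

```java
// Metrics of Eq. (3), Eq. (5) and Eq. (6) for one run over the sample files.
public class Metrics {
    /** Eq. (3): mean of (FileSize - ChunkSize) / FileSize over the n sample files. */
    public static double ddrBySize(long[] fileSize, long[] chunkSize) {
        double sum = 0;
        for (int i = 0; i < fileSize.length; i++) {
            sum += (double) (fileSize[i] - chunkSize[i]) / fileSize[i];
        }
        return sum / fileSize.length;
    }

    /** Eq. (5): mean ratio of modified bytes to newly stored bytes. */
    public static double ddrByModification(long[] modifySize, long[] chunkSize) {
        double sum = 0;
        for (int i = 0; i < modifySize.length; i++) {
            sum += (double) modifySize[i] / chunkSize[i];
        }
        return sum / modifySize.length;
    }

    /** Eq. (6): total fingerprint + segmentation + eigenvalue time, averaged over n files. */
    public static double machineOverhead(double[] fingerTime, double[] segTime,
                                         double[] eigenTime, int n) {
        double total = 0;                                // sum over the k measured runs
        for (int i = 0; i < fingerTime.length; i++) {
            total += fingerTime[i] + segTime[i] + eigenTime[i];
        }
        return total / n;                                // averaged over the n sample files
    }
}
```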

B. Algorithm Implementation
Experimental environment. Android 4.0 was adopted as the test platform and the SQLite3 database was used to create the index of data chunks in the experiment. The chunk index consists of the chunk eigenvalue calculated by the MD-5 algorithm, the mapping to the chunk's storage location and the mapping to the files the chunk belongs to. Furthermore, the experiment was based on C-language libraries for MD-5 and the data fingerprint hash functions, and the Java JNI interface was used to package these C functions into the Android program.
MD-5 algorithm implementation. The MD-5 algorithm takes a message of arbitrary length as input and produces a 128-bit data fingerprint as output [16]. Four 32-bit buffers are used to compute the data fingerprint; they are initialized to the hexadecimal values 0x67452301, 0xefcdab89, 0x98badcfe and 0x10325476. The MD-5 value is calculated by Eq. (7) together with four auxiliary bitwise functions applied in sequence. Each bitwise function takes 32-bit words as input and produces a 32-bit word as output. Finally, the outputs of these bitwise functions are combined by Eq. (7) into the fingerprint that represents the file content.

MD5 = F + (G << 32) + (H << 64) + (I << 96)                        (7)

F(X,Y,Z) = (X & Y) | ((~X) & Z)                                    (8)

G(X,Y,Z) = (X & Z) | (Y & (~Z))                                    (9)

H(X,Y,Z) = X ^ Y ^ Z                                               (10)

I(X,Y,Z) = Y ^ (X | (~Z))                                          (11)
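The four auxiliary functions translate directly into Java. The sketch below writes them out as in Eqs. (8)-(11) and, for the chunk eigenvalue itself, falls back on java.security.MessageDigest rather than the paper's C/JNI implementation, which is an assumption made purely for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Auxiliary bitwise functions of MD-5 (Eqs. (8)-(11)) and a helper that returns the
// 128-bit chunk eigenvalue via the standard library digest.
public class Md5Eigenvalue {
    static int F(int x, int y, int z) { return (x & y) | (~x & z); }   // Eq. (8)
    static int G(int x, int y, int z) { return (x & z) | (y & ~z); }   // Eq. (9)
    static int H(int x, int y, int z) { return x ^ y ^ z; }            // Eq. (10)
    static int I(int x, int y, int z) { return y ^ (x | ~z); }         // Eq. (11)

    // Initial values of the four 32-bit buffers used by MD-5.
    static final int A0 = 0x67452301, B0 = 0xefcdab89, C0 = 0x98badcfe, D0 = 0x10325476;

    /** 128-bit eigenvalue of one data chunk, as a 32-character hex string. */
    static String eigenvalue(byte[] chunk) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(chunk);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

In the experimental setup described above, an eigenvalue of this form is the key stored in the SQLite3 chunk index together with the chunk-location and file mappings.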

V. EXPERIMENT RESULTS

A. Methods of Fingerprint Calculation
The comparison of Data De-duplication Rates influenced by the fingerprint calculation algorithms is shown in Fig. 4 and Fig. 5. We can see that the optimal fingerprint calculation algorithm differs even among files of the same type, and some experimental data showed that files of different types may have similar Data De-duplication Rates under similar conditions.


So we can conclude that the selection of the fingerprint calculation algorithm has little to do with the file type. Through a comprehensive analysis of the sample files, we can also conclude that plain text files have a higher Data De-duplication Rate than structured files such as form documents; this conclusion was mentioned in Ref. [3] as well.

TABLE I. CDC ALGORITHM BASED ON MOBILE DEVICES
Input: File A before modification; File A after modification (File B)
Output: Data De-duplication Rate of File B; time consumption
(1) Calculate the data fingerprints and segment the file based on the fingerprints.
(2) Calculate the eigenvalue of each chunk using the MD-5 hash and output the time for the machine overhead.
(3) Retrieve the index of chunks and calculate the Data De-duplication Rate based on Eq. (3) and Eq. (5).
(4) Process the data chunks: store a chunk only if its eigenvalue does not already exist in the index.

To highlight the difference between the Data De-duplication Rates obtained with different fingerprint calculation algorithms, we calculated the ratios among the results. In the same way, the total time based on simple-hash is adopted as the basic unit for representing machine overhead, and the ratios of the other machine overheads to the machine overhead of simple-hash are calculated. The comparison of machine overheads influenced by the fingerprint calculation algorithms is shown in Fig. 6 and Fig. 7.


Figure 4. Comparison of data de-duplication rate influenced by fingerprint calculation

Figure 5. Comparison of data de-duplication rate influenced by fingerprint calculation

Figure 6. Comparison of time overhead influenced by fingerprint calculation

Figure 7. Comparison of time overhead influenced by fingerprint calculation

Rabin-hash is usually applied in CDC Data De-duplication on PCs. However, because it needs a longer pre-calculation of the plain text and the pattern strings in the process of text matching, it may have a higher time complexity in practical applications. This can be seen from the experimental data: according to the results shown in Fig. 6 and Fig. 7, for all files the CDC algorithm takes much longer with rabin-hash than with the other fingerprint calculation algorithms. Since mobile devices usually have much more sensitive requirements on power consumption and device performance, rabin-hash may not be suitable for CDC Data De-duplication on mobile devices.
According to the experimental data, we can conclude that BKDR-hash performs well on both the Data De-duplication Rate and the machine overhead. BKDR-hash calculates the hash value with Eq. (12), which involves all characters of the string being hashed. In the equation, Key is the string to be calculated, KeySize is the size of the string, value represents a hexadecimal number that is often given by the hash table size, and seed is a pre-defined value, set to 131 here. The BKDR-hash algorithm can therefore be used in applications of CDC Data De-duplication on mobile devices.

hash = ( Σ_{i=0..KeySize−1} Key(KeySize − i − 1) * seed^i ) & value        (12)
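Eq. (12) is the familiar Horner-form BKDR hash; a minimal Java version is shown below. The 0x7FFFFFFF mask chosen for value is an illustrative assumption, since the paper only states that value is usually related to the hash table size.

```java
// BKDR-hash as in Eq. (12): a Horner-style loop with seed = 131, masked with
// `value` at the end.
public class BkdrHash {
    static final int SEED = 131;             // pre-defined seed from the paper
    static final int VALUE = 0x7FFFFFFF;     // mask ("value" in Eq. (12)), illustrative

    static int hash(byte[] key) {
        int h = 0;
        for (byte b : key) {
            h = h * SEED + (b & 0xFF);       // accumulates Key(KeySize - i - 1) * seed^i
        }
        return h & VALUE;
    }
}
```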

B. Chunk Granularity Setting
In the experiment, the chunk granularities are set from the pre-defined threshold and from upper and lower bounds on the chunk size. Here we adopt three chunk granularities with three pairs of upper and lower bounds to segment a file into chunks, based on some



conclusions drawn from the server data and from the sample files themselves.
The influence of chunk granularity on the Data De-duplication Rate is shown in Fig. 8. Among the sample files, the txt, doc and pptx types were plain text files and the xls type was a form file. We can see from the experimental data that, for plain text files, the smaller the granularity is, the higher the Data De-duplication Rate will be. For structured files, however, the Data De-duplication Rate has little to do with the chunk granularity. We can conclude from the experimental data that it is influenced by the correlation of the file content.

Figure 8. Comparison of data de-duplication rate influenced by chunk granularity

VI. CONCLUSIONS

Applications of the Data De-duplication algorithm on mobile devices should take the sensitive conditions of mobile terminals into consideration, such as power consumption, device performance, the relatively small physical storage and the network access to the Mobile Internet. The experiment proves that some key technologies in PC-based CDC Data De-duplication may not be suitable for applications on mobile devices under the existing hardware conditions. The experiment also proves that the CDC algorithm can increase the efficiency of Data De-duplication without considerable machine overhead. In combination with the practical situation of industry and the experimental data, several questions still need to be considered in applications of the CDC algorithm on mobile devices:
- find a metric to balance the machine overhead and the Data De-duplication Rate;
- find a method to optimize the I/O efficiency on mobile devices;
- find a way to classify files according to their contents, so that the algorithm can be optimized for every category;
- find a better way of verifying the correctness of the data that comes from mobile devices and is stored in Cloud Storage systems.
In the future, a technique called Frequency Based Chunking [17] can also be tried in applications of the Data De-duplication algorithm on mobile devices. The Frequency Based Chunking algorithm utilizes the frequency information of chunks in the data stream to improve the Data De-duplication efficiency, especially when the metadata overhead of the data chunks is taken into consideration. Furthermore, both Ref. [3] and Ref. [19] proposed new Data De-duplication techniques based on different object partitioning methods. We can also try to put the two algorithms proposed in Ref. [3] and Ref. [19] into applications of the Data De-duplication algorithm on mobile devices in the future, so that the Data De-duplication Rates and machine overheads can be tested to provide a basis for these applications.

ACKNOWLEDGMENT

The authors wish to thank the editors and reviewers of this paper and the colleagues in Lenovo Corporate Research & Development who helped to implement the applications of the CDC Data De-duplication algorithm on mobile devices. This work was supported in part by a grant from Lenovo Corporate Research & Development in Beijing.

REFERENCES

[1] McKnight J, Asaro T, Babineau B. Digital archiving: end-user survey and market forecast 2006-2010 [EB/OL]. [2006-03-18].
[2] Fu Yinjin, Xiao Nong, Liu Fang. Research and Development on Key Techniques of Data Deduplication. Journal of Computer Research and Development, 2012, 49(1), pp. 12-20.
[3] Fang Yang, YuAn Tan. A Method of Object-based De-duplication. Journal of Networks, Vol. 6, No. 12, December 2011.
[4] Ao Li, Shu Jiwu, Li Mingqiang. Data De-duplication Techniques. Journal of Software, Vol. 21, No. 5, May 2010, pp. 916-929.
[5] Rabin M O. Fingerprinting by random polynomials. Technical Report, CRCT TR-15-81, Harvard University, 1981.
[6] Jain N, Dahlin M, Tewari R. TAPER: Tiered approach for eliminating redundancy in replica synchronization. In: Proc. of the 4th USENIX Conf. on File and Storage Technologies (FAST 2005). Berkeley: USENIX Association, 2005, pp. 281-294.
[7] Denehy T E, Hsu W W. Duplicate management for reference data. IBM Research Report, RJ 10305 (A0310017), IBM Research.
[8] Jian Yang, Haihang Wang, Jian Wang, Chengxiang Tan, Dingguo Yu. Provable Data Possession of Resource-constrained Mobile Devices in Cloud Computing. Journal of Networks, Vol. 6, No. 7, July 2011.
[9] Deepavali Bhagwat, Kristal Pollack, Darrell D. E. Long, Thomas Schwarz, S. J., Ethan L. Miller, et al. Providing high reliability in a minimum redundancy archival storage system. In: Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2006), 14th IEEE International Symposium on. IEEE, 2006, pp. 413-421.
[10] Bin Zhou, Rongbo Zhu, Ying Zhang, Linhui Cheng. An Efficient Data Fingerprint Query Algorithm Based on Two-Leveled Bloom Filter. Journal of Multimedia, Vol. 8, No. 2, April 2013.
[11] Chuanyi Liu, Yingping Lu, Chunhui Shi, Guanlin Lu, David H. C. Du, Dong-Sheng Wang. ADMAD: Application-Driven Metadata Aware De-duplication Archival Storage System. In: Storage Network Architecture and Parallel I/Os (SNAPI '08), Fifth IEEE International Workshop on. IEEE, 2008, pp. 29-35.
[12] Kruus E, Ungureanu C, Dubnicki C. Bimodal content defined chunking for backup streams. In: Proceedings of the 8th USENIX Conference on File and Storage Technologies. USENIX Association, 2010, pp. 18-18.
[13] Bera S K, Dutta S, Narang A, Souvik Bhattacherjee. Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams. arXiv preprint arXiv:1212.3964, 2012.
[14] Luo Junzhou, Wu Wenjia, Yang Ming. Mobile Internet: Terminal Devices, Networks and Services. Chinese Journal of Computers, Vol. 34, No. 11, November 2011.
[15] Zhou Jingli, Nie Xuejun, Qin Leihua, Liu Ke, Zhu Jianfeng, Wang Yu. Optimization for Data De-duplication Algorithm Based on Storage Environment Aware. Journal of Computer Science, Vol. 38, No. 2, February 2011.
[16] Rivest R. The MD5 message-digest algorithm. MIT Laboratory for Computer Science and RSA Data Security, Inc., April 1992.
[17] Lu G, Jin Y, Du D H C. Frequency based chunking for data de-duplication. In: Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS 2010), IEEE International Symposium on. IEEE, 2010, pp. 287-296.
[18] Balachandran S, Constantinescu C. Sequence of Hashes Compression in Data De-duplication. In: Data Compression Conference (DCC 2008). IEEE, 2008, pp. 505-505.
[19] Bobbarjung D R, Jagannathan S, Dubnicki C. Improving duplicate elimination in storage systems. ACM Transactions on Storage (TOS), 2006, 2(4), pp. 424-448.

Ge Xingchen is a graduate student in computer application at Shandong University in Weihai, China. She completed her bachelor degree at Shandong University as well. During her graduate studies her interests have focused on data storage and data transmission. She now works on cloud synchronization and peer-to-peer data transmission.

Deng Ning is a staff researcher at Lenovo Research and Development. He received his doctoral degree from Beijing Institute of Technology in 2012. The main research topics of his doctoral thesis included computer architecture, multi-core systems, on-chip scratchpad memory management and parallel computing. Since joining Lenovo, he has conducted research on the personal cloud and data synchronization project.

Yin Jian is an associate professor of computer application at Shandong University. He received his master degree in engineering from Harbin Institute of Technology and is now studying for a doctoral degree in computer application at Shandong University. He focuses on multimedia data mining. He has been teaching in universities for the past 26 years. Since becoming head of the department of computer science at Shandong University in 2004, he has dedicated himself to the field of computer media data research. In recent years he has published many articles in various publications at home and abroad.
