
EPAS: A Sampling Based Similarity Identification Algorithm for the Cloud

Yongtao Zhou, Yuhui Deng, Member, IEEE, Junjie Xie, and Laurence T. Yang, Member, IEEE

Y. Deng is with the Department of Computer Science, Jinan University, Guangzhou 510632, China, and also with the State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China. E-mail: [email protected]. Y. Zhou and J. Xie are with the Department of Computer Science, Jinan University, Guangzhou, P.R. China 510632. E-mail: [email protected], [email protected]. L. T. Yang is with the Department of Computer Science, St. Francis Xavier University, Antigonish, NS B2G 2W5, Canada. E-mail: [email protected].

Abstract—The explosive growth of data brings new challenges to data storage and management in the cloud environment. These data usually have to be processed in a timely fashion in the cloud, so any increased latency may cause a massive loss to the enterprises. Similarity detection plays a very important role in data management. Many typical algorithms such as Shingle, Simhash, Traits and the Traditional Sampling Algorithm (TSA) are extensively used. The Shingle, Simhash and Traits algorithms read the entire source file to calculate the corresponding similarity characteristic value, thus requiring many CPU cycles and much memory space and incurring tremendous disk accesses. In addition, the overhead increases with the growth of the data set volume and results in a long delay. Instead of reading the entire file, TSA samples some data blocks to calculate the fingerprints as the similarity characteristic value. The overhead of TSA is fixed and negligible. However, a slight modification of a source file shifts the bit positions of the file content. Therefore, a failure of similarity identification is inevitable under slight modifications. This paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the cloud by taking the file length modulo a parameter. EPAS concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shifts incurred by the modifications. Meanwhile, an improved metric is proposed to measure the similarity between different files and make the detection probability close to the actual probability. Furthermore, this paper describes a query algorithm to reduce the time overhead of similarity detection. Our experimental results demonstrate that EPAS significantly outperforms the existing well known algorithms in terms of time overhead, CPU and memory occupation. Moreover, EPAS makes a more preferable tradeoff between precision and recall than the other similarity detection algorithms. Therefore, it is an effective approach to similarity identification for the cloud.

Index Terms—Similarity detection; Sampling; Shingle; Position-aware; Cloud.


1 INTRODUCTION

With the explosive growth of data, the risk and cost of data management are increasing significantly. In order to address this problem, more and more users and enterprises transfer their data to the cloud and access the data via the Internet [1]. However, this approach often results in a large volume of redundant data in the cloud. According to an IDC report, around 75% of the data across the world are redundant [2]. ESG indicates that over 90% of the redundant data are in backup and archiving systems [3]. The reason behind this is that multiple users tend to store similar files in the cloud. Unfortunately, the redundant data not only consume significant IT resources and energy but also occupy expensive network bandwidth. Therefore, data deduplication is urgently required to alleviate these problems in the cloud.

Data deduplication calculates a unique fingerprint for every data block by using hash algorithms such as MD5 and SHA-1. The calculated fingerprint is then compared against the existing fingerprints in a database that is dedicated to storing fingerprints. If the fingerprint is already in the database, the data block does not need to be stored again; a pointer to the first instance is inserted in place of the duplicated data block. By doing so, data deduplication is able to eliminate the redundant data by storing only one data copy, thus reducing the cost of data management [4] and network bandwidth [5].

However, data deduplication suffers from a disk bottleneck in the process of fingerprint lookup. This is because massive requests going to the disk drives generate a large volume of random disk accesses, which significantly decrease the throughput of deduplication and increase the system latency [6], [7]. In the cloud, any increased latency may result in a massive loss to the enterprises [8]. For example, according to [9], [10], every 100 ms of increased latency would reduce Amazon's sales by 1%, and an extra 0.5 seconds in search page display time would cut Google's revenue by 20%. On the contrary, any decreased latency brings huge benefits to the enterprises. Hamilton et al. [11] note that a speedup of only 5 seconds at Shopzilla brought about an increase in page views of 25% and in revenue of 10%, together with a reduction of hardware by 50% and an increase in traffic from Google by 120%. Therefore, reducing the latency in the cloud environment is very important for the enterprises that store their data in the cloud.

Similarity based data deduplication strategies [12], [13], [14], [15], [16] are very important methods for eliminating redundant data. Their essence is that similar files or segments can share identical data blocks. Therefore, they only store similarity characteristic values instead of storing all fingerprint indexes in memory. This method can significantly reduce the number of disk accesses when performing data deduplication. Since the effectiveness of similarity detection has a significant impact on the performance and deduplication ratio of this kind of deduplication approach, improving the performance and accuracy of similarity detection is very critical.

Typical similarity detection algorithms such as Shingle [17], Simhash [18] and Traits [15] are widely used at present. When calculating the eigenvalues of similar files, these algorithms have to read the entire source files, thus requiring many CPU cycles and much memory space. Furthermore, the latency multiplies with the growth of the data set volume. When applying these similarity detection algorithms to the cloud, the high resource occupation and latency would significantly decrease the satisfaction of cloud users. Although the community has made great strides in identifying data similarity, there are still some challenges brought by the explosive growth of data and the complicated characteristics of data stored in the cloud. We summarize the challenges as follows:

Challenge 1: Reducing the resource occupation of similarity detection. Similarity identification algorithms are both I/O-bound and CPU-bound tasks. Calculating the eigenvalues of similar files requires many CPU cycles and much memory space and incurs large quantities of disk accesses. Since the generated disk accesses are usually random, they incur significant performance degradation. Additionally, the computing overhead increases with the growth of the data set volume. All of the above issues imply a significant delay in the cloud.

Challenge 2: Reducing the time overhead of similarity query. Similarity identification algorithms usually require a large amount of time to perform detection, which incurs long delays, especially with large data sets. This makes it difficult to apply these algorithms to applications requiring real time responses and high throughput.

Challenge 3: Achieving both efficiency and accuracy. It is a challenge to achieve both the efficiency and the accuracy of similarity detection with low overhead. Similarity identification algorithms have to make a tradeoff between efficiency and accuracy.

The Traditional Sampling Algorithm (TSA) [19] does not read entire files, but samples some data blocks to calculate the fingerprints as similarity characteristic values. TSA is simple and has a fixed overhead. However, a slight modification will incur a failure of similarity identification due to the shifted bit positions. In order to solve this problem, this paper proposes an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity for the cloud. This approach takes the file length modulo a parameter to amend the shifted sampling positions. Furthermore, it concurrently samples data blocks from the head and the tail of the modulated file to avoid the position shifts incurred by the modifications. The major contributions of this paper include:

1) We propose a new method, EPAS, to reduce the resource occupation when performing similarity detection. Instead of reading entire files, EPAS samples several data blocks to calculate the fingerprints as similarity characteristic values, and amends the file length to avoid shifted sampling positions by taking the file length modulo a parameter. Additionally, EPAS samples data blocks from the head and the tail of the modulated file respectively, which specifically prevents a slight modification at the file head from shifting the sampled file content. Therefore, EPAS performs well in detecting similarity by employing these two methods with a fixed overhead.

2) We design an effective query algorithm for EPAS to locate the files that are similar to a given file, thus reducing the time overhead of the similarity query. Experimental results demonstrate that the designed query method achieves a low query time with low overhead when compared with Simhash, Traits and PAS.

3) We propose a new metric to judge the similarity between two files, thus achieving both efficiency and accuracy. The essence of similarity detection is to catch how many identical data blocks or attributes two files share. An optimal method should make the detection probability close to the actual similarity degree of the two files. Furthermore, in order to ensure the efficiency and accuracy of EPAS, we employ precision and recall to select a threshold value that makes a better tradeoff between them. Our experimental results show that this metric can promote the efficiency of similarity detection and acquire a detection probability close to the actual similarity degree of two files.

The remainder of this paper is organized as follows. We present related work in Section 2. In Section 3, we describe some background knowledge. Section 4 illustrates the theory of the new similarity metric. Section 5 introduces the basic idea of the EPAS algorithm. Section 6 shows the evaluation results of the EPAS algorithm. Section 7 concludes this paper and presents future work.

2 RELATED WORK

Over the past decade, a lot of research effort has been invested in identifying data similarity. The existing work can be classified into five categories.

The first one is similar web page detection for web search engines. Detecting and removing similar web pages can save network bandwidth, reduce storage consumption, and improve the quality of the web search engine index. Broder et al. [17], [20] proposed a similar web page detection technique called the Shingle algorithm, which utilizes set operations to detect similarity. Shingle is a typical sampling based approach employed to identify similar web pages. In order to reduce the size of the shingle set, Broder presented the Modm and Mins sampling methods. This algorithm is currently applied to the AltaVista web search engine. Manku et al. [21] applied a Simhash algorithm to detect similarity in web documents belonging to a multi-billion page repository. The Simhash algorithm runs in production at the Google web search engine, combined with the Google File System [22] and MapReduce [23] to achieve batch queries. Elsayed et al. [24] presented a MapReduce algorithm for computing pairwise document similarity in large document collections.

The second one is similar file detection in storage systems. In storage systems, data similarity detection and encoding play a crucial role in improving resource utilization. Forman [25] presented an approach for finding similar files and applied the method to document repositories. This approach brings a great reduction in storage space consumption. Ouyang [26] presented a large-scale file compression technique based on clustering by using the Shingle similarity detection technique, with Min-wise [27] sampling to reduce the overhead of the Shingle algorithm. Han et al. [28] presented a fuzzy file block matching technique, which was first proposed for opportunistic use of content addressable storage. The fuzzy file block matching technique employs Shingle to represent the fuzzy hashes of file blocks for similarity detection, and uses the Mins sampling method to decrease the overhead of the shingling algorithm.

The third one is plagiarism detection. Digital information can be easily copied and retransmitted, which makes it easy to violate the owner's copyright. For the purpose of protecting copyright and other related rights, we need plagiarism detection. Baker [29] described a program called dup which can be used to locate instances of duplication or near duplication in software. Shivakumar [30] presented data structures for finding the overlap between documents and implemented these data structures in SCAM.

The fourth one is remote file backup. Traditional remote file backup approaches take high bandwidth and consume a lot of resources. Applying similarity detection to remote file backup can greatly reduce the bandwidth consumption. Teodosiu et al. [15] proposed the Traits algorithm to find the client files which are similar to a given server file, and implemented this algorithm in DFSR. Experimental results suggest that these optimizations may help reduce the bandwidth required to transfer file updates across a network. Muthitacharoen et al. [5] presented LBFS, which exploits the similarity between files or versions of the same file to save bandwidth. Cox et al. [31] presented a similarity-based mechanism for locating a single source file to perform peer-to-peer backup, and implemented a system prototype called Pastiche.

The fifth one is similarity detection for specific domains. Hua et al. [32] explored and exploited data similarity to support efficient data placement for the cloud. They designed a novel multi-core-enabled and locality-sensitive hashing scheme that can accurately capture the differentiated similarity across data. Biswas et al. [33] proposed a cache architecture called Mergeable. Mergeable detects data similarities and merges cache blocks so as to decrease cache storage requirements. Experimental evaluation suggested that Mergeable reduces off-chip memory accesses and overall power usage. Vernica et al. [34] proposed a three-stage approach for end-to-end set-similarity joins in parallel using the popular MapReduce framework. Deng et al. [35] proposed a MapReduce-based framework, Massjoin, for scalable string similarity joins. The approach supports both set-based similarity functions and character-based similarity functions.

Most of the above work focuses on a specific application scenario, and the computation or similarity detection overhead increases with the growth of the data volume. In addition, the similarity detection metric may not be able to accurately measure the similarity between two files. Therefore, this paper proposes the EPAS algorithm and a new similarity detection metric to identify file similarity for the cloud. The analysis (see Sections 4 and 5.3) and experimental results illustrate that the proposed similarity metric catches the similarity between two files more accurately than the traditional metric. Furthermore, the overhead of EPAS is fixed and minimized in contrast to previous work.

3 BACKGROUND

As introduced in Section 2, much research has investigated how to accurately identify file similarity. Shingle [17], Simhash [18] and Traits [15] are three widely used algorithms for performing the identification. Therefore, we detail these three algorithms in this section and compare them with the proposed EPAS in Section 6.

3.1 Shingle

Broder et al. [17], [20] employed a mathematical method to transform the similarity calculation problem into a set operation problem. This approach is usually called the Shingle algorithm. Shingle mainly consists of three steps. Firstly, it employs information retrieval methods to handle the file contents and ignore some irrelevant information (e.g., timestamps, spaces). Secondly, the processed file is treated as a canonical sequence of tokens, where a token could be a letter, a word or a line. A contiguous subsequence of tokens in the file is called a shingle. Thirdly, the file is associated with a set of contiguous subsequences of tokens S(D, ω), where D represents the file and ω is a parameter defined as the size of the contiguous subsequences. For example, given a document D = (a, rose, is, a, rose, is, a, rose), when ω equals 4 we obtain the 4-shingle multiset {(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is), (a, rose, is, a), (rose, is, a, rose)}. We get S(D, ω) by marking each element of the ω-shingle multiset with its occurrence number. The set of file D is then {(a, rose, is, a, 1), (rose, is, a, rose, 1), (is, a, rose, is, 1), (a, rose, is, a, 2), (rose, is, a, rose, 2)}. Given a file A and a file B, following the partition method of the Shingle algorithm, the similarity of the two files is defined as

    r_ω(A, B) = |S(A, ω) ∩ S(B, ω)| / |S(A, ω) ∪ S(B, ω)|,    (1)

where ω is the shingle length and r_ω(A, B) lies between 0 and 1. If the value of r_ω(A, B) is close to 1, file A and file B are very similar, while if the value of r_ω(A, B) is close to 0, file A and file B are essentially different. However, if r_ω(A, B) equals 1, it does not mean that file A and file B are identical; it only implies that file B is a permutation of file A. For instance, (a, c, a, b, a) is a permutation of (a, b, a, c, a). As an example, given a file A and a file B, we have the two sequences A = (a, rose, is, a, rose, is, a, rose) and B = (a, rose, is, a, flower, which, is, a, rose). According to the Shingle algorithm and formula (1), if ω equals 1, 2, 3, then r_ω(A, B) will be 70%, 50%, and 30%, respectively.

In order to reduce the size of the shingle set W, Broder [17], [20] proposed the MIN_s and MOD_m sampling methods. MIN_s selects the smallest s elements from W, as shown in equation (2). As demonstrated in equation (3), MOD_m chooses the elements that are 0 mod m.

    MIN_s(W) = the set of the smallest s elements in W, if |W| > s;
               W, otherwise.    (2)

    MOD_m(W) = the set of elements of W that are 0 mod m.    (3)

The Shingle algorithm is applied to identify similar web pages. It runs on the AltaVista web search engine. However, the overhead of the Shingle algorithm increases with the growth of the file size [17], [20]. Therefore, it is not really favorable for large files. Two important algorithms, Simhash [18] and Traits [15], were therefore proposed to handle the problems incurred by the Shingle algorithm.
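
The following minimal Python sketch illustrates the construction of S(D, ω) and equation (1). It assumes whitespace tokenization and the occurrence-number marking described above; the canonicalization step and the MIN_s/MOD_m sampling are omitted, so this is only an illustrative sketch rather than the production Shingle algorithm. It reproduces the 70%/50%/30% values of the rose example.

    def shingles(tokens, w):
        # Build S(D, w): every w-token window, tagged with its occurrence number.
        seen, out = {}, set()
        for i in range(len(tokens) - w + 1):
            sh = tuple(tokens[i:i + w])
            seen[sh] = seen.get(sh, 0) + 1
            out.add((sh, seen[sh]))          # e.g. ((a, rose, is, a), 1)
        return out

    def resemblance(a_tokens, b_tokens, w):
        # Equation (1): |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|.
        sa, sb = shingles(a_tokens, w), shingles(b_tokens, w)
        return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0

    A = "a rose is a rose is a rose".split()
    B = "a rose is a flower which is a rose".split()
    for w in (1, 2, 3):
        print(w, resemblance(A, B, w))       # 0.7, 0.5, 0.3 -- the 70%/50%/30% example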

3.2 Simhash

Charikar proposed the Simhash [21] algorithm. Manku et al. [18] applied the Simhash algorithm to identify similarity in web documents belonging to a multi-billion page repository. Simhash is a member of the family of locality sensitive hashes [36]. It differs from traditional hash functions, whose signature values are discrete and uniformly distributed: with a traditional hash function, if two files differ in just a single bit, their hash signatures are almost completely different. On the contrary, Simhash has the property that the fingerprints of similar files differ only in a small number of bit positions. It maps a file into an m-bit fingerprint. After transforming the source files into m-bit fingerprints by using the Simhash algorithm, we can detect the similarity by calculating the Hamming distance between two fingerprints. Fig. 1 shows the computing process of the m-bit Simhash fingerprint. It can be described as follows:

1) Employ a chunking algorithm to split the file into a set of data blocks C_1, C_2, ..., C_n.
2) Define an m-dimension vector V, with every dimension initialized to zero.
3) Calculate an m-bit signature of every data block using traditional hash functions. If the i-th bit of a signature is 1, the i-th dimension of V is increased by 1; otherwise, it is decreased by 1.
4) Generate an m-bit Simhash fingerprint f according to each dimension of vector V. If the i-th dimension of V is a positive number, the i-th bit of f is 1; otherwise, it is 0.

Fig. 1: Process of calculating Simhash fingerprint.

After figuring out the Simhash fingerprints of the files, we can determine their similarity by working out the Hamming distance between them.
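
A minimal sketch of the four steps above is given below. The fixed 64-byte chunking and the use of MD5 (so m = 128) are assumptions made for illustration only; the paper does not tie Simhash to a particular chunking algorithm or per-block hash.

    import hashlib

    def simhash(data, chunk_size=64, m=128):
        v = [0] * m                                        # step 2): m-dimension vector V
        for i in range(0, len(data), chunk_size):          # step 1): split into data blocks
            sig = int.from_bytes(hashlib.md5(data[i:i + chunk_size]).digest(), "big")
            for bit in range(m):                           # step 3): vote +1/-1 per signature bit
                v[bit] += 1 if (sig >> bit) & 1 else -1
        f = 0
        for bit in range(m):                               # step 4): positive dimension -> bit 1
            if v[bit] > 0:
                f |= 1 << bit
        return f

    def hamming(a, b):
        return bin(a ^ b).count("1")

    x = b"the quick brown fox jumps over the lazy dog. " * 200
    y = bytearray(x)
    y[3] ^= 0xFF                                           # one modified byte -> one modified chunk
    y = bytes(y)
    z = bytes(reversed(x))                                 # unrelated content
    print(hamming(simhash(x), simhash(y)),                 # few differing bits for similar input
          hamming(simhash(x), simhash(z)))                 # roughly m/2 for unrelated input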

3.3 Traits

Teodosiu et al. [15] proposed a novel similarity detection algorithm, Traits, to find the files on the client that are most similar to a given file on the server. Fig. 2 shows the computing process of the similarity characteristic value of Traits. It can be described as follows:

1) Calculate t fingerprint image sets {IS_1, ..., IS_t} of the signature set {Sig_1, ..., Sig_n} by using t different hash functions H_1, ..., H_t. The size of each fingerprint image set is n:

    IS_1 = {H_1(Sig_1), ..., H_1(Sig_n)},
    ...
    IS_t = {H_t(Sig_1), ..., H_t(Sig_n)}.

2) Select the minimal element of each image set as the pre-traits PT_1, ..., PT_t:

    PT_1 = Sig_j where H_1(Sig_j) = min(IS_1),
    ...
    PT_t = Sig_j where H_t(Sig_j) = min(IS_t).

3) Get the traits T_1, ..., T_t by selecting b different bits from each pre-trait PT_1, ..., PT_t. For example:

    T_1 = select_{0...b-1}(PT_1),
    ...
    T_t = select_{(t-1)b...tb-1}(PT_t).

Fig. 2: Process of calculating Traits.

Given a file F_S, after calculating the Traits value Traits(F_S), we can determine the similarity of the files in a file set Files_C by using the following equations:

    TraitsSim(F_S, F_C) = |{i | T_i(F_C) = T_i(F_S)}|,
    SimTraits(F_S, Files_C, n, s) = {F_C1, F_C2, ..., F_Ck},

where the set {F_C1, F_C2, ..., F_Ck} (containing 0 ≤ k ≤ n files) is the subset of Files_C such that each file F_Ci satisfies (TraitsSim(F_S, F_Ci)/t) ≥ s, and for all other files x in Files_C, TraitsSim(F_S, x) ≤ TraitsSim(F_S, F_Ci).
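
The sketch below follows the three steps with illustrative assumptions: fixed-size chunking for the block signatures, t = 4 hash functions simulated by salting MD5, and b = 8 bits per trait. It is meant only to make the image-set / pre-trait / trait pipeline concrete, not to match the parameters used in DFSR.

    import hashlib

    def block_signatures(data, chunk_size=256):
        # Sig_1..Sig_n of the file's blocks (fixed-size chunking for simplicity).
        return [hashlib.md5(data[i:i + chunk_size]).digest()
                for i in range(0, len(data), chunk_size)]

    def traits(data, t=4, b=8):
        sigs = block_signatures(data)
        result = []
        for i in range(t):
            # step 1): IS_i = {H_i(Sig_1), ..., H_i(Sig_n)}; H_i is MD5 salted with i here
            images = [hashlib.md5(bytes([i]) + s).digest() for s in sigs]
            # step 2): pre-trait PT_i = the signature whose image is minimal in IS_i
            pt = sigs[min(range(len(sigs)), key=images.__getitem__)]
            # step 3): trait T_i = b bits selected starting at position i*b of the pre-trait
            result.append((int.from_bytes(pt, "big") >> (i * b)) & ((1 << b) - 1))
        return result

    def traits_sim(ta, tb):
        return sum(1 for x, y in zip(ta, tb) if x == y)    # TraitsSim: matching traits

    fa = b"some file content " * 400
    fb = bytearray(fa)
    fb[10] ^= 0xFF
    fb = bytes(fb)
    print(traits_sim(traits(fa), traits(fb)))              # usually t or t-1 for a one-byte edit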

4 REDEFINING SIMILARITY

The symbols used in the following sections are summarized in Table 1.

TABLE 1: Symbols used in the following sections.
Symbol   | Meaning
FileSize | File size
δ        | The threshold of similarity
N        | Number of sampling data blocks
K        | Number of data blocks one file has
Lenc     | The length of sampling data blocks
x        | The portion of matching chunks of two files
LenR     | The distance between two sampled data blocks
T        | Sampling position impact factor of PAS and EPAS

Similarity means that the majority of data blocks in the similar files are identical. The essence of similarity detection is to catch how many identical data blocks or attributes two files have. However, there is no general method to measure the similarity between two files. An optimal method should make the detection probability close to the actual similarity degree of the two files. This means that the detection probability should be close to the portion of identical data blocks the two files actually share.

The widely used method is to transform the similarity detection problem into a set operation problem, described as equation (4) [17], [20]. The value of Sim(A, B) lies between 0 and 1. If it is close to 1, then file A and file B are very similar. If it is close to 0, then file A and file B are essentially different. After selecting a threshold δ of the similarity, we can determine that file A is similar to file B if Sim(A, B) ≥ δ is satisfied. Equation (4) can be transformed into the mathematical form described in equation (5). If we use a hash set (O(1) lookup time) to store Sig_A(N, Lenc), the computational time complexities of |Sig_A(N, Lenc) ∩ Sig_B(N, Lenc)| and |Sig_A(N, Lenc) ∪ Sig_B(N, Lenc)| are both O(N). Therefore, the computational time complexity of equation (4) is O(N).

    Sim(A, B) = |Sig_A(N, Lenc) ∩ Sig_B(N, Lenc)| / |Sig_A(N, Lenc) ∪ Sig_B(N, Lenc)|.    (4)

    y = x / (2N − x),  (0 ≤ x ≤ N, 0 < N).    (5)

However, this method may not be able to effectively detect the similarity between two files. As illustrated in Table 2, if file A and file B share 90% (i.e., (9/10)K) of their data blocks, the achieved Sim(A, B) is 0.818. If the shared data blocks decrease to 50% (i.e., (5/10)K), the calculated Sim(A, B) will be reduced to 0.333. This indicates that the detection probability of equation (4) is far from the actual similarity degree of the two files.

TABLE 2: The detection probability of equations (4) and (6) under different portions of identical blocks.
The percentage of identical data blocks | (1/10)K | (3/10)K | (5/10)K | (7/10)K | (9/10)K
Equation (4)                            | 0.052   | 0.176   | 0.333   | 0.538   | 0.818
Equation (6)                            | 0.1     | 0.3     | 0.5     | 0.7     | 0.9

    Sim(A, B) = |Sig_A(N, Lenc) ∩ Sig_B(N, Lenc)| / N.    (6)

    y = x / N,  (0 ≤ x ≤ N, 0 < N).    (7)

Therefore, we propose a new method, illustrated in equation (6), to measure the similarity of EPAS. Equation (7) represents the mathematical form of equation (6). If file A and file B share 90% (i.e., (9/10)K) of their data blocks, the calculated Sim(A, B) is 0.9, which is the same as the actual similarity degree of the two files, as demonstrated in Table 2. This also holds for other percentages of shared data blocks between two files. It indicates that although the computational time complexity of equation (6) is O(N) (the same as equation (4)), it can accurately catch the file similarity. In order to explain the reason, we suppose that file A and file B both have K data blocks and x is the number of identical chunks of the two files. The portion of matching chunks of file A and file B is then given by equation (7). The difference between equation (5) and equation (7) is presented as equation (8) by calculus (substituting u = 2K − x). Equation (8) shows that the difference increases with the growth of K. As illustrated in Fig. 3, the gap between equation (5) and equation (7) grows when K varies from 10 to 20. This means that equation (4) is not suitable for similarity detection, especially when K is very large, because the detection probability of equation (4) moves gradually further away from the actual portion of similarity of the two files.

    ∫_0^K (x/K − x/(2K − x)) dx = x^2/(2K) |_0^K − ∫_K^{2K} (2K − u)/u du
                                = K/2 − (2K ln u − u) |_K^{2K}
                                = K(3/2 − 2 ln 2) ≈ 0.114 K.    (8)

Fig. 3: Comparison of equation (6) against equation (4) for K = 10 and K = 20 (x-axis: the number of identical blocks; y-axis: the similarity probability of two files).

Furthermore, we can transform equation (6) into the following form:

    |Sig_A(N, Lenc) ∩ Sig_B(N, Lenc)| ≥ Nδ.

When |Sig_A(N, Lenc) ∩ Sig_B(N, Lenc)| ≥ Nδ is satisfied, file A is similar to file B. Therefore, we can simplify the process of similarity detection in comparison with the original equation (4).
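
The contrast between the two metrics can be reproduced with a few lines of Python; here x is the number of identical sampled blocks and N = 10, as in Table 2. This is only a numeric illustration of equations (5) and (7).

    def sim_eq4(x, n):
        return x / (2 * n - x)        # equation (5): Jaccard form of equation (4)

    def sim_eq6(x, n):
        return x / n                  # equation (7): matching fraction of equation (6)

    n = 10
    for x in (1, 3, 5, 7, 9):
        print(f"x={x}: eq(4)={sim_eq4(x, n):.3f}  eq(6)={sim_eq6(x, n):.1f}")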

5 SAMPLING BASED SIMILARITY IDENTIFICATION

In order to effectively detect file similarity with low overhead and high accuracy, we propose and introduce EPAS in this section.

5.1 Traditional Sampling Algorithm (TSA)

TSA is described in Algorithm 1 using pseudo-code. We sample N data blocks from file A and feed each data block of size Lenc into a hash function. We then obtain N fingerprint values that are collected as a fingerprint set Sig_A(N, Lenc).

Fig. 4: The sampling positions of TSA, PAS and EPAS: (a) TSA, (b) PAS, (c) EPAS.

Algorithm 1: TSA Algorithm
    Data: fd, N, Lenc, T
    begin
        LenR = (FileSize - Lenc*N)/(N - 1)
        for i ← 1 to N do
            offset = (i - 1)*(Lenc + LenR)
            lseek(fd, offset, SEEK_SET)
            read(fd, buf, Lenc)
            Md5(buf, Lenc, Md5Val)
            put(Md5Val, SigA)

By analogy, we obtain a fingerprint set Sig_B(N, Lenc) of file B. The degree of similarity between file A and file B can then be described by equation (6). TSA is simple, but it is very sensitive to file modifications. A small modification causes the sampling positions to shift, thus resulting in a failure. Suppose we have a file A of 56 KB. We sample 6 data blocks, each of 1 KB. According to Algorithm 1, file A has N = 6, Lenc = 1 KB, FileSize = 56 KB, and LenR = 10 KB. If we add 5 KB of data to file A to form file B, then file B will have N = 6, Lenc = 1 KB, FileSize = 61 KB, and LenR = 11 KB in terms of Algorithm 1. Adding 5 KB of data to file A covers three situations: the addition is at the beginning, in the middle, or at the end of file A. Files B1, B2 and B3 in Fig. 4(a) represent these three different situations. We can see that the above file modifications cause the sampling positions to shift and result in an inaccuracy of similarity detection. Table 3 summarises these positions. Furthermore, the last row for each of files B1, B2 and B3 (indicated with Optimal) shows where the sampling positions should be located given the file modifications. As shown in Fig. 4(a) and Table 3, the six sampling positions of file A are 0 KB, 11 KB, 22 KB, 33 KB, 44 KB, and 55 KB, respectively. However, due to the added 5 KB of data, the six sampling positions of files B1, B2 and B3 are shifted to 0 KB, 12 KB, 24 KB, 36 KB, 48 KB, and 60 KB, respectively. The detection probabilities Sim(A, B) for files B1, B2 and B3 are 0.167, 0.333 and 0.167, respectively. Table 3 shows that, in contrast to the optimal algorithm, when applying TSA to files B1, B2 and B3, the algorithm only catches one (the last sampled data block), two (the first and the last sampled data blocks) and one (the first sampled data block) sampling positions, respectively. Although Sim(A, B) is far from the actual value, TSA is very simple and incurs much less overhead in contrast to the Shingle, Simhash and Traits algorithms.

TABLE 3: The sampling positions of file A and file B with the TSA, PAS, EPAS and Optimal algorithms.
File A            | 0KB | 11KB | 22KB | 33KB | 44KB | 55KB | Sim(A, B)
File B1, TSA      | 0KB | 12KB | 24KB | 36KB | 48KB | 60KB | 0.167
File B1, PAS      | 0KB | 11KB | 22KB | 33KB | 44KB | 60KB | 0.167
File B1, EPAS     | 0KB | 11KB | 22KB | 38KB | 49KB | 60KB | 0.5
File B1, Optimal  | 5KB | 16KB | 27KB | 38KB | 49KB | 60KB | 1
File B2, TSA      | 0KB | 12KB | 24KB | 36KB | 48KB | 60KB | 0.333
File B2, PAS      | 0KB | 11KB | 22KB | 33KB | 44KB | 60KB | 0.667
File B2, EPAS     | 0KB | 11KB | 22KB | 38KB | 49KB | 60KB | 1
File B2, Optimal  | 0KB | 11KB | 22KB | 38KB | 49KB | 60KB | 1
File B3, TSA      | 0KB | 12KB | 24KB | 36KB | 48KB | 60KB | 0.167
File B3, PAS      | 0KB | 11KB | 22KB | 33KB | 44KB | 60KB | 0.83
File B3, EPAS     | 0KB | 11KB | 22KB | 38KB | 49KB | 60KB | 0.5
File B3, Optimal  | 0KB | 11KB | 22KB | 33KB | 44KB | 55KB | 1
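
A compact Python rendering of Algorithm 1 together with equation (6) is sketched below; it operates on in-memory byte strings instead of file descriptors, and the 56 KB file with 5 KB prepended mirrors the file A / file B1 example above. The random file content is an assumption made so that distinct blocks hash differently.

    import hashlib, random

    def tsa_fingerprints(data, n=6, lenc=1024):
        # Algorithm 1: hash n blocks of lenc bytes taken at evenly spaced offsets.
        len_r = (len(data) - lenc * n) // (n - 1)          # LenR: gap between sampled blocks
        return {hashlib.md5(data[i * (lenc + len_r): i * (lenc + len_r) + lenc]).hexdigest()
                for i in range(n)}

    def sim(sig_a, sig_b, n):
        return len(sig_a & sig_b) / n                      # equation (6)

    random.seed(0)
    file_a = bytes(random.getrandbits(8) for _ in range(56 * 1024))   # a 56 KB "file A"
    file_b1 = bytes(5 * 1024) + file_a                                # 5 KB added at the head
    print(sim(tsa_fingerprints(file_a), tsa_fingerprints(file_b1), 6))  # 0.167, as in Table 3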

5.2 Position-Aware Similarity Algorithm (PAS)

TSA is simple and effective. However, as explained in Section 5.1, even a single bit modification can result in a failure of similarity detection. Therefore, PAS is proposed to solve this problem (a preliminary short version of this paper [37] appears in the Proceedings of the 14th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2014).

Theorem 1. Given a positive integer p, for any integer n there exists an equation n = kp + r, where k and r are the quotient and remainder of n divided by p, k and r are both integers, and 0 ≤ r < p.

Suppose n = 150 and p = 100. We have 150 = 1×100 + 50 in terms of Theorem 1, where r and k equal 50 and 1, respectively. k always equals 1 for −50 < r ≤ 50. Then we have k × p ≡ p. Therefore, a change of r will not influence k × p. We can apply this simple method to the identification of file similarity, as illustrated in Algorithm 2. In order to detect similarity accurately, we sample two data blocks at the beginning and at the end of the file, respectively. The remaining sampling positions are calculated by using Algorithm 2. Meanwhile, choosing an appropriate parameter T in Algorithm 2 ensures that the shifting of sampling positions resulting from slight file modifications can be avoided.

Algorithm 2: PAS Algorithm
    Data: fd, N, Lenc, T
    begin
        FileSize = (FileSize/T)*T
        LenR = (FileSize - Lenc*N)/(N - 1)
        LenR = LenR > 0 ? LenR : 0
        for i ← 1 to N - 1 do
            offset = (i - 1)*(Lenc + LenR)
            lseek(fd, offset, SEEK_SET)
            read(fd, buf, Lenc)
            Md5(buf, Lenc, Md5Val)
            put(Md5Val, SigA)
        lseek(fd, -Lenc, SEEK_END)
        read(fd, buf, Lenc)
        Md5(buf, Lenc, Md5Val)
        put(Md5Val, SigA)

We take the same example used in Section 5.1 to illustrate the basic idea of PAS, taking T as 28 KB. This is because, in order to avoid the shifting of sampling positions incurred by adding the 5 KB of data, T should be bigger than 5 KB and smaller than the size of file A. Other values of T are also applicable; for example, T could be 6 KB, 7 KB, 8 KB, 9 KB and so on. File A then has N = 6, FileSize = 56 KB, LenR = 10 KB, and T = 28 KB. According to Algorithm 2, file B will have N = 6, FileSize = 61 KB, LenR = 10 KB, and T = 28 KB. From Fig. 4(b) and Table 3, we can see that the sampling positions of file B (including files B1, B2 and B3) are 0 KB, 11 KB, 22 KB, 33 KB, 44 KB and 60 KB, respectively. In contrast to the sampling positions of file A, the only difference is the last sampled data block at position 60 KB. This is because we fix two sampling positions at the beginning and at the end of the corresponding files. However, the contents of those sampled blocks may have been shifted by the file modifications. The Sim(A, B) values of files B1, B2 and B3 are 0.167, 0.667 and 0.83, respectively. Table 3 indicates that PAS catches more real sampling positions than TSA if we compare the sampling positions achieved by PAS against the positions given by the optimal algorithm. According to the above analysis, we can conclude that the PAS algorithm avoids the shifting of sampling positions caused by slight file modifications in the middle and at the end of source files. In addition, when applied to a data deduplication system, PAS is also superior to other similarity detection algorithms [16].
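
The sketch below renders Algorithm 2 in Python over in-memory bytes, with T = 28 KB and the same synthetic 56 KB file as before; the insertion of 5 KB in the middle mirrors the file B2 case. It is a sketch of the idea, not of the on-disk implementation.

    import hashlib, random

    def pas_fingerprints(data, n=6, lenc=1024, t=28 * 1024):
        # Algorithm 2: modulo the file length by T, sample N-1 blocks from the head
        # and one block at the very end of the file.
        size = (len(data) // t) * t                     # FileSize = (FileSize / T) * T
        len_r = max((size - lenc * n) // (n - 1), 0)
        sig = {hashlib.md5(data[i * (lenc + len_r): i * (lenc + len_r) + lenc]).hexdigest()
               for i in range(n - 1)}
        sig.add(hashlib.md5(data[-lenc:]).hexdigest())  # the fixed sample at the file tail
        return sig

    def sim(sig_a, sig_b, n):
        return len(sig_a & sig_b) / n                   # equation (6)

    random.seed(0)
    file_a = bytes(random.getrandbits(8) for _ in range(56 * 1024))
    file_b2 = file_a[:28 * 1024] + bytes(5 * 1024) + file_a[28 * 1024:]   # 5 KB added in the middle
    print(sim(pas_fingerprints(file_a), pas_fingerprints(file_b2), 6))    # 0.667, as in Table 3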

5.3 Enhanced Position-Aware Similarity Algorithm (EPAS)

Although the PAS algorithm keeps the advantage of TSA that the overhead is fixed and low, a slight modification at the head of a file will still cause a failure of similarity detection. File B1 in Fig. 4(b) and Table 3 shows this scenario. Therefore, EPAS is proposed to alleviate this problem by enhancing PAS. EPAS maintains the advantages of both PAS and TSA. We describe it in Algorithm 3.

Algorithm 3: EPAS Algorithm
    Data: fd, N, Lenc, T
    begin
        FileSize = (FileSize/T)*T
        LenR = (FileSize - Lenc*N)/(N - 1)
        LenR = LenR > 0 ? LenR : 0
        for i ← 1 to N/2 do                    // Sampling from file head
            offset = (i - 1)*(Lenc + LenR)
            lseek(fd, offset, SEEK_SET)
            read(fd, buf, Lenc)
            Md5(buf, Lenc, Md5Val)
            put(Md5Val, SigA)
        for i ← 1 to N - N/2 do                // Sampling from file tail
            offset = (i - 1)*(Lenc + LenR)
            lseek(fd, -(offset + Lenc), SEEK_END)
            read(fd, buf, Lenc)
            Md5(buf, Lenc, Md5Val)
            put(Md5Val, SigA)

EPAS samples N/2 data blocks from the head and N − N/2 data blocks from the tail of the modulated file. It then maps these data blocks into fingerprints by utilizing hash functions and obtains the similarity characteristic value. N is a fixed constant and much smaller than the file size. Since EPAS only samples and calculates N data block fingerprints, the time and computation complexity of EPAS is O(1).

We employ the same example used in Section 5.1 to illustrate the basic idea of EPAS, as shown in Fig. 4(c) and Table 3. We again sample six data blocks of 1 KB length and take T as 28 KB. According to Algorithm 3, file A has FileSize = 56 KB, N = 6, T = 28 KB, LenR = 10 KB, and Lenc = 1 KB. File B has FileSize = 61 KB, N = 6, T = 28 KB, LenR = 10 KB, and Lenc = 1 KB. The six sampling positions of file A are 0 KB, 11 KB, 22 KB, 33 KB, 44 KB and 55 KB, respectively. The six sampling positions of file B (including files B1, B2 and B3) are 0 KB, 11 KB, 22 KB, 38 KB, 49 KB and 60 KB, respectively. The Sim(A, B) values of files B1, B2 and B3 are 0.5, 1 and 0.5, respectively. The numbers of identical sampled data blocks of file A with files B1, B2 and B3 are three (the first, the second and the third sampled data blocks), six (all sampled data blocks) and three (the fourth, the fifth and the sixth sampled data blocks), respectively. This sampling method keeps EPAS effective in the different situations represented by files B1, B2 and B3.

File B1 represents the situation of a slight modification at the head of file A. The first three data blocks sampled from the head of file B1 are not identical to the corresponding data blocks of file A, even though the corresponding sampling positions are the same. This is mainly because the modification causes the content of file B1 to shift. Conversely, even though the last three sampling positions of file B1 and file A are different, the three data blocks sampled from the tail of file B1 are identical to those of file A, because they are consistent with the sampling positions achieved by the optimal algorithm. File B2 illustrates the situation of a small modification in the middle of file A. The contents of the six sampled data blocks of file B2 are equivalent to those of file A, since a small modification in the middle of file B2 does not impact the sampled data blocks when applying EPAS. File B3 represents a modification at the tail of file A. The three data blocks sampled from the head of file B3 are identical to those of file A, whereas the contents of the last three data blocks are different from those of file A.

According to the three different modification situations of files B1, B2 and B3, we can draw the conclusion that EPAS has desirable performance in detecting the similarity between two files. EPAS ensures that at least one half of the sampled data blocks are identical in the different modification situations. Therefore, we have Sim(A, B) ≥ 50% according to equation (6), in contrast to Sim(A, B) ≥ 33% via equation (4). This makes EPAS more effective and accurate. The similarity algorithms including Shingle, Simhash and Traits need to split files into chunks by using Rabin hash functions. These algorithms neglect the sequence of the file contents. However, an important point about similar files is that the order of their content is roughly the same. For example, (a, c, a, b, a) is similar to (a, b, a, c, a) under Shingle when ω is set to 2, although the order of (a, c, a, b, a) is different from that of (a, b, a, c, a). In contrast, slightly modifying the order of the file content has a significant effect on the performance of EPAS, since EPAS detects similarity based on sampling.
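
Algorithm 3 can be sketched the same way; the head/tail sampling over the length-modulated file reproduces the EPAS value for file B1 in Table 3. As before, the in-memory byte strings and random content are assumptions made for illustration.

    import hashlib, random

    def epas_fingerprints(data, n=6, lenc=1024, t=28 * 1024):
        # Algorithm 3: sample N/2 blocks from the head and N - N/2 blocks from the tail.
        size = (len(data) // t) * t                     # FileSize = (FileSize / T) * T
        len_r = max((size - lenc * n) // (n - 1), 0)
        sig = set()
        for i in range(n // 2):                         # sampling from the file head
            off = i * (lenc + len_r)
            sig.add(hashlib.md5(data[off:off + lenc]).hexdigest())
        for i in range(n - n // 2):                     # sampling from the file tail
            end = len(data) - i * (lenc + len_r)
            sig.add(hashlib.md5(data[end - lenc:end]).hexdigest())
        return sig

    def sim(sig_a, sig_b, n):
        return len(sig_a & sig_b) / n                   # equation (6)

    random.seed(0)
    file_a = bytes(random.getrandbits(8) for _ in range(56 * 1024))
    file_b1 = bytes(5 * 1024) + file_a                  # 5 KB added at the head
    print(sim(epas_fingerprints(file_a), epas_fingerprints(file_b1), 6))   # 0.5, as in Table 3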

5.4 Querying algorithm for EPAS

After calculating the similarity characteristic value with Algorithm 3, we can determine that file A is similar to file B if Sim(A, B) ≥ δ is satisfied in terms of equation (6). Given a file A and a data set Set(K) (containing K files), we need to find the files in Set(K) that are similar to file A. A simple approach is to traverse the files in Set(K) for every query. The time complexity of this method is O(K), and it increases with the growth of Set(K). However, traversing the whole Set(K) cannot accommodate the cloud environment, which requires low latency and quick response [38]. In order to solve this problem, we build a mapping relationship from each fingerprint to the IDs of the corresponding files (fingerprint → ID). Fig. 5 shows the process of querying the files in Set(K) that are similar to a given file A, where the ID of file A is 10. We suppose that N equals 3 and the similarity characteristic value of file A is (99, 67, 34). The process can be described as follows:

Fig. 5: The query algorithm of EPAS.

1) For each fingerprint value in the set (99, 67, 34), we find the file IDs that have an identical fingerprint value to that of file A. In this step, we need N queries.
2) Combining the results of the previous queries, we record the occurrence number of every ID. For example, we get the result {(99 → 1), (67 → 3, 4), (34 → 1, 3)}. Combining these results, we obtain the IDs and the corresponding occurrence numbers {(1:2), (3:2), (4:1)}.
3) For each occurrence number of each ID, we divide it by N. This follows equation (6) to calculate the similarity probability. For simplicity, we only need to judge whether the occurrence number of each ID is greater than N·δ. In this example, we take δ as 0.1. It is then very easy to find that files 1, 3 and 4 are similar to file A (2 > 0.3, 2 > 0.3, 1 > 0.3).

Furthermore, we can use a key-value database to build this mapping relationship. In Section 6, we employ the key-value database Tokyo Cabinet [39] to build the mapping relationship. The query response time of Tokyo Cabinet is very short. In addition, the memory overhead of Tokyo Cabinet is fixed, mainly because Tokyo Cabinet does not cache any data in memory. Therefore, Tokyo Cabinet is very suitable for a cloud environment requiring low latency and quick response.
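
A minimal in-memory rendering of this query procedure is sketched below; a Python dict stands in for the Tokyo Cabinet key-value store used in the paper, and the file IDs and fingerprints are the ones from the example above.

    from collections import defaultdict

    class SimilarityIndex:
        # Inverted index: fingerprint -> IDs of the files that contain it (Fig. 5).
        def __init__(self, n, delta):
            self.n, self.delta = n, delta
            self.index = defaultdict(set)

        def insert(self, file_id, fingerprints):
            for fp in fingerprints:
                self.index[fp].add(file_id)

        def query(self, fingerprints):
            hits = defaultdict(int)                    # file ID -> occurrence number
            for fp in fingerprints:                    # step 1): N point lookups
                for file_id in self.index.get(fp, ()):
                    hits[file_id] += 1                 # step 2): combine the results
            # step 3): keep IDs whose occurrence count exceeds N * delta (equation (6))
            return [fid for fid, c in hits.items() if c > self.n * self.delta]

    idx = SimilarityIndex(n=3, delta=0.1)
    idx.insert(1, {99, 34})
    idx.insert(3, {67, 34})
    idx.insert(4, {67})
    print(sorted(idx.query({99, 67, 34})))             # -> [1, 3, 4], as in the example above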

6 EVALUATION

6.1 Evaluation environment

The experiments in this paper are performed on an Ubuntu operating system (kernel version 2.6.32). The hardware consists of 1 GB of memory and a 2.0 GHz Intel(R) Pentium(R) CPU. We adopt Tokyo Cabinet (1.4.48) [39] to store the EPAS fingerprint sets. In order to measure the performance of the EPAS algorithm, we employ two data sets, D1 and D2, to perform the evaluation.

Data set D1 is collected from a Linux server in our research lab and a personal cloud DropBox. D1 has 2756 files with a total size of 11.5 GB. Table 4 summarizes the profile of D1. It shows that the top five most popular files are those with the suffixes .h, .pdf, .jpg, .c and .mp3. Table 4 also indicates that the files with the suffix .pdf consume the highest portion of the storage capacity. Fig. 6 shows the distribution of file sizes. It implies that the highest portion of file sizes ranges from 0 KB to 4 KB. The file size distribution in Fig. 6 is consistent with the investigations of Agrawal et al. [40] and Meyer et al. [41]. Therefore, we believe that data set D1 is very representative. Similarly, we build another data set D2 to determine the optimal parameters of EPAS. The files in D2 consist of original files and augmented files that are modified at the beginning, in the middle, and at the end of the corresponding original files. D2 is made up of 14 txt files. The total size is 128 MB.

Fig. 6: The file size distribution of data set D1.

TABLE 4: The profile of data set D1.
        Popularity          Storage Space
Rank    Ext.   %Occur       Ext.   %Storage
1       h      55.30        pdf    77.52
2       pdf    14.70        mkv    4.38
3       jpg    5.34         rar    4.24
4       c      4.28         mp3    4.01
5       mp3    3.48         zip    2.39
Total   –      83.1         –      92.54

6.2 Parameters selection

Since the parameters T, Lenc, N and the threshold δ have a great impact on the performance of EPAS, it is meaningful to determine the optimal parameters. We compare the detection probability of EPAS with the actual portion of matching chunks in data set D2. Because the actual portion of matching chunks is the upper bound of the similarity between two files, the optimal parameters should bring the detection probability of EPAS closest to the actual portion of matching chunks. In order to obtain the actual portion of matching chunks, we separate files into variable-size chunks by using a content defined chunking algorithm, which can effectively avoid misaligned file content [42], [5]. These chunks are then mapped into fingerprints with hash functions to obtain a fingerprint set. Applying this method to file A and file B, we have two fingerprint sets Finger(A) and Finger(B). The actual portion of matching chunk fingerprints of file A and file B is described by equation (9), where Match(A, B) lies between 0 and 1. This is consistent with equation (4). If Match(A, B) reaches 1, it indicates that most chunks of file A and file B match, and vice versa.

    Match(A, B) = |Finger(A) ∩ Finger(B)| / |Finger(A) ∪ Finger(B)|.    (9)
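
For illustration, the sketch below computes Match(A, B) over content-defined chunks. The breakpoint rule (an MD5 over a 16-byte sliding window tested against a bit mask) is a deliberately simplified stand-in for the content defined chunking algorithm of [42], [5], so the chunk boundaries will not match those used in the paper's experiments; it only shows why content-defined boundaries keep Match(A, B) high after a small insertion.

    import hashlib, random

    def cdc_chunks(data, window=16, mask=(1 << 10) - 1, min_len=128):
        # A boundary is declared whenever a hash of the last `window` bytes matches the mask,
        # so boundaries depend only on local content and realign after an insertion.
        chunks, start = [], 0
        for i in range(window, len(data)):
            h = int.from_bytes(hashlib.md5(data[i - window:i]).digest()[:4], "big")
            if i - start >= min_len and (h & mask) == 0:
                chunks.append(data[start:i])
                start = i
        chunks.append(data[start:])
        return chunks

    def match(a, b):
        fa = {hashlib.md5(c).hexdigest() for c in cdc_chunks(a)}
        fb = {hashlib.md5(c).hexdigest() for c in cdc_chunks(b)}
        return len(fa & fb) / len(fa | fb)               # equation (9)

    random.seed(1)
    A = bytes(random.getrandbits(8) for _ in range(64 * 1024))
    B = A[:30000] + b"patch" + A[30000:]                 # small insertion in the middle
    print(round(match(A, B), 2))                         # most chunks still match: close to 1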

Comparing Match(A, B) in equation (9) with Sim(A, B) in equation (6), we can determine the optimal parameters. For example, if Match(A, B) = Sim(A, B), the EPAS algorithm catches the real similarity of file A and file B.

6.2.1 Sampling position impact factor T

TABLE 5: The impact of T on the detection probability (Lenc = 32 B, N = 8, T = 2 KB, 8 KB, 32 KB, 128 KB, 512 KB). ρ represents the value of Match(A, B) in equation (9).
ρ     | 2 KB | 8 KB | 32 KB | 128 KB | 512 KB
0     | 0    | 0    | 0     | 0      | 0
0.48  | 0.25 | 0.25 | 0.63  | 0.63   | 0.63
0.54  | 0.25 | 0.25 | 0.38  | 0.38   | 0.38
0.56  | 0.25 | 0.25 | 0.38  | 0.38   | 0.38
0.65  | 0.25 | 0.25 | 0.38  | 0.38   | 0.38
0.73  | 0.25 | 0.25 | 0.38  | 0.38   | 0.38
0.82  | 0.25 | 0.75 | 0.75  | 0.75   | 0.75
0.82  | 0.25 | 0.25 | 0.25  | 0.38   | 0.38
0.89  | 0.25 | 0.63 | 0.63  | 0.63   | 0.63
0.89  | 0.13 | 0.5  | 0.5   | 0.5    | 0.5
0.95  | 0.25 | 0.88 | 0.88  | 0.88   | 0.88
0.97  | 0.13 | 0.5  | 0.5   | 0.5    | 0.5
1     | 1    | 1    | 1     | 1      | 1

Table 5 illustrates the impact of T on the detection probability, where Lenc equals 32 B, N equals 8, and T is assigned 2 KB, 8 KB, 32 KB, 128 KB and 512 KB, respectively. It is very interesting to observe that when T is set to 2 KB and 8 KB, the corresponding values stay far from the portion of actually matching chunk fingerprints Match(A, B). At this point, the values of the actual matching chunk fingerprints Match(A, B) range from 0.48 to 0.97, while the values of the detection probability Sim(A, B) float around 0.25. In this situation, the failure ratio of detection is rather high, because even though most data blocks of the two files are identical, the detection probability is still very low. When T is set to 32 KB, 128 KB and 512 KB, the corresponding values are very close to Match(A, B). However, the detection probability is extremely low when Match(A, B) equals 0.8 and T is set to 32 KB. This proves that 32 KB is not the optimal parameter for T. From Table 5, we also observe a fluctuation of the detection probability Sim(A, B). According to Algorithm 3, the detection probability of EPAS is at least 0.5 when the modifications are at the head or the tail of the corresponding files, yet EPAS somewhat underestimates the actual portion of identical data blocks contained in two files, which incurs some fluctuation of the detection probability Sim(A, B). This also supports that EPAS can effectively detect similarity in comparison with PAS. Therefore, we believe that EPAS can ensure both efficiency and accuracy. In the following experiments, we take T = 512 KB. According to Algorithm 3, T has no impact on the overhead of EPAS, so we do not discuss the overhead any further in this subsection.

6.2.2 Size of sampling data blocks

TABLE 6: The impact of Lenc on the detection probability (T = 512 KB, N = 8, Lenc = 8 B, 32 B, 128 B, 512 B, 2 KB, 8 KB). ρ represents the value of Match(A, B) in equation (9).
ρ     | 8 B  | 32 B | 128 B | 512 B | 2 KB | 8 KB
0     | 0    | 0    | 0     | 0     | 0    | 0
0.48  | 0.63 | 0.63 | 0.63  | 0.63  | 0.63 | 0.63
0.54  | 0.38 | 0.38 | 0.38  | 0.38  | 0.38 | 0.38
0.56  | 0.38 | 0.38 | 0.38  | 0.38  | 0.38 | 0.38
0.65  | 0.38 | 0.38 | 0.38  | 0.38  | 0.38 | 0.38
0.73  | 0.38 | 0.38 | 0.38  | 0.38  | 0.25 | 0.25
0.82  | 0.75 | 0.75 | 0.75  | 0.75  | 0.75 | 0.75
0.82  | 0.38 | 0.38 | 0.38  | 0.38  | 0.38 | 0.25
0.89  | 0.63 | 0.63 | 0.63  | 0.63  | 0.63 | 0.63
0.89  | 0.5  | 0.5  | 0.5   | 0.5   | 0.5  | 0.5
0.95  | 0.88 | 0.88 | 0.88  | 0.88  | 0.88 | 0.88
0.97  | 0.5  | 0.5  | 0.5   | 0.5   | 0.5  | 0.5
1     | 1    | 1    | 1     | 1     | 1    | 1

Table 6 demonstrates the impact of the sampling data block length on the detection probability, where T equals 512 KB, N equals 8, and Lenc is set to 8 B, 32 B, 128 B, 512 B, 2 KB and 8 KB, respectively. Similarly, when Lenc is defined as 2 KB and 8 KB, the corresponding values of Sim(A, B) move away from Match(A, B). This is because the probability of reading the modified data blocks increases with the growth of Lenc, thus resulting in different fingerprints. When Lenc is set to 8 B, 32 B, 128 B and 512 B, the corresponding values of Sim(A, B) are identical. Fig. 7 depicts that the time overhead grows with the increase of Lenc as Lenc ranges over 8 B, 32 B, 128 B, 512 B, 2 KB and 8 KB. Although we can observe a slight fluctuation of the time overhead, the main trend is very clear. According to Algorithm 3, there are several reasons behind the fluctuations in Fig. 7. The first one is that the MD5 hash function used in the experiment influences the computation overhead. The second one is the read-ahead strategy of the file system and the process scheduling of the operating system. Taking everything in Table 6 and Fig. 7 into consideration, we adopt Lenc = 32 B in the subsequent experiments. In this case, the time overhead is minimal and the detection probability is close to the actual matching fingerprint value Match(A, B).

Fig. 7: The time overhead with different Lenc (T = 512 KB, N = 8, Lenc = 32 B, 128 B, 512 B, 2 KB, 8 KB).

6.2.3 Number of sampled data blocks

TABLE 7: The impact of N on the detection probability (T = 512 KB, Lenc = 32 B, N = 4, 6, 8, 10, 12, 14, 16, 18). ρ represents the value of Match(A, B) in equation (9).
ρ     | 4    | 6    | 8    | 10  | 12   | 14   | 16   | 18
0     | 0    | 0    | 0    | 0   | 0    | 0    | 0    | 0
0.48  | 0.75 | 0.67 | 0.63 | 0.6 | 0.67 | 0.57 | 0.56 | 0.56
0.54  | 0.5  | 0.5  | 0.38 | 0.3 | 0.33 | 0.29 | 0.31 | 0.28
0.56  | 0.5  | 0.33 | 0.38 | 0.3 | 0.25 | 0.26 | 0.25 | 0.22
0.65  | 0.5  | 0.5  | 0.38 | 0.4 | 0.33 | 0.36 | 0.38 | 0.33
0.73  | 0.5  | 0.5  | 0.38 | 0.4 | 0.33 | 0.36 | 0.31 | 0.33
0.82  | 0.75 | 0.67 | 0.75 | 0.7 | 0.75 | 0.71 | 0.69 | 0.72
0.82  | 0.5  | 0.33 | 0.38 | 0.3 | 0.25 | 0.21 | 0.25 | 0.22
0.89  | 0.75 | 0.67 | 0.63 | 0.6 | 0.67 | 0.64 | 0.63 | 0.61
0.89  | 0.5  | 0.5  | 0.5  | 0.5 | 0.5  | 0.5  | 0.5  | 0.5
0.95  | 0.75 | 0.83 | 0.88 | 0.8 | 0.83 | 0.86 | 0.81 | 0.83
0.97  | 0.5  | 0.5  | 0.5  | 0.5 | 0.5  | 0.5  | 0.5  | 0.5
1     | 1    | 1    | 1    | 1   | 1    | 1    | 1    | 1

Table 7 describes the impact of the number of sampled data blocks N on the detection probability, where T equals 512 KB, Lenc equals 32 B, and N is set to 4, 6, 8, 10, 12, 14, 16 and 18, respectively. When N is set to 8, 10, 12, 14, 16 and 18, the corresponding values move away from Match(A, B). In addition, the detection error of Algorithm 3 grows with the increase of N. When N is defined as 4 and 6, the values of Sim(A, B) become very close to Match(A, B), which means that we can take 4 or 6 as the optimal parameter. A basic trend of Fig. 8 is that the time overhead grows with the increase of N. Since Table 7 indicates the detection accuracy of EPAS, we have to strike a balance between the detection accuracy and the time overhead. N = 4 achieves this balance in terms of our investigation.

Fig. 8: The time overhead with different N (T = 512 KB, Lenc = 32 B, N = 4, 6, 8, 10, 12, 14, 16, 18).

6.3 Threshold δ of EPAS algorithm

Consider a file A and a file B. Sim(A, B) ≥ δ indicates that file A is similar to file B, where δ is the threshold of similarity. We employ precision and recall, introduced in [43], [44], to select an optimal threshold δ. Precision and recall are defined in equations (10) and (11), respectively, where u represents a file set, A denotes the file for which similarity is detected within the file set u, and Query(A, u) is the file set reported by the similarity detection algorithm as similar to file A within the file set u. Matchall(A, u) indicates the file set that is actually similar to file A within the file set u, and |Matchall(A, u) ∩ Query(A, u)| is the number of detected files that are actually similar to file A.

    Precision = |Matchall(A, u) ∩ Query(A, u)| / |Query(A, u)|.    (10)

    Recall = |Matchall(A, u) ∩ Query(A, u)| / |Matchall(A, u)|.    (11)
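
A small helper that evaluates equations (10) and (11) for a query result is sketched below; the file sets are represented as Python sets of file IDs, and the example numbers are hypothetical rather than taken from the experiments.

    def precision_recall(matchall, query):
        # matchall: files actually similar to A in u; query: files reported as similar.
        hit = len(matchall & query)
        precision = hit / len(query) if query else 1.0
        recall = hit / len(matchall) if matchall else 1.0
        return precision, recall

    truly_similar = {1, 2, 3, 4, 5, 6, 7, 8}      # 8 truly similar files (hypothetical)
    reported = {1, 2, 3, 4, 5, 42}                # 6 reported, 5 of them correct
    print(precision_recall(truly_similar, reported))   # (0.833..., 0.625)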

As formulas (10) and (11) indicate, precision is the fraction of detected instances that are actually similar, while recall is the fraction of actually similar instances that are retrieved; their values both lie between 0 and 1. If the precision value is close to 1, most detected instances are actually similar. On the contrary, if the precision value is close to 0, most detected instances are not similar. By analogy, if the recall value is close to 1, we detect most of the actually similar instances; if the recall value is close to 0, we detect only very few similar instances. According to the above analysis, we would like both precision and recall to be close to 1. Unfortunately, it is very hard to achieve this goal. If we want to detect more actually similar files, we have to relax the limit of the threshold value δ. However, reducing the threshold value δ brings more instances that are actually not similar into the detection results, which decreases the precision value. Expecting most detection results to be actually similar means that we need to tighten the limit of the threshold value δ. This reduces the number of actually similar instances detected, thus decreasing the recall value. Therefore, we have to make a tradeoff between precision and recall.

Fig. 9 shows the impact of the similarity threshold δ on the precision and recall of the EPAS algorithm, where T equals 512 KB, Lenc is 32 B and N is defined as 4. We can observe that with the growth of δ, the precision increases while the recall decreases. According to Fig. 9, we determine that the optimal similarity threshold value is 0.5, for which both the precision and the recall achieve a relatively high value. Therefore, if file A and file B satisfy Sim(A, B) ≥ 0.5, we treat them as similar.

Fig. 9: The precision and recall of the EPAS algorithm.

6.4 EPAS algorithm evaluation

In this section, we evaluate the time overhead, query time, memory and CPU utilization, and precision and recall of EPAS against the well-known similarity detection algorithms Shingle, Simhash and Traits. The Lenc, N, and δ are set to 32 B, 4, and 0.5, respectively. According to the work in [18], the Hamming distance is selected as 3, and the number of stored tables is set to 4. All the measurements in this section are performed with data set D1. In addition, we also compare EPAS with PAS. In order to reduce the storage consumption, the EPAS algorithm uses 8 bits to store a fingerprint. Therefore, it takes 32 bits for each file. However, the redundant tables of Simhash need 256 bits to store the fingerprints of each file.

6.4.1 Time overhead


TABLE 8: The precision and recall of Simhash, Traits, PAS and EPAS with data set D2.

            Precision   Recall
Simhash     0.66        0.666
Traits      0.888       0.888
PAS         1.0         0.333
EPAS        0.75        1.0

Fig. 13: CPU and memory utilization of EPAS, Simhash, PAS and Traits with data set D1 (CPU and memory overhead in percent versus time in seconds).

6.4.2 Query time

Fig. 12 demonstrates the query time of Simhash, Traits, PAS and EPAS. The query time of PAS is the largest among them: when querying a similarity characteristic value, PAS needs to traverse the whole set, which causes multiple disk accesses and a long latency to read the corresponding data from the disk drive. The query time of EPAS lies between those of Simhash and Traits. In order to enhance the query performance, Simhash uses multiple redundant tables to store the similarity characteristic values, at the cost of consuming more memory. In contrast to Simhash, EPAS and Traits only store one copy of the similarity characteristic values in memory. Compared with EPAS, Traits needs to compare multiple pre-traits PT1...PTt and therefore consumes more time to accomplish similarity detection. Instead of traversing the whole set, EPAS builds a mapping from fingerprints to the IDs of the corresponding files. This reduces the time overhead of querying and comparing, and thus enhances the query performance of EPAS.
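A minimal sketch of this lookup idea is given below: an in-memory index maps every sampled fingerprint to the IDs of the files that contain it, so a query only inspects candidate files instead of traversing the whole set. The class and method names are our own illustration, not the paper's implementation.

```python
from collections import defaultdict

class FingerprintIndex:
    """Map every fingerprint value to the set of file IDs that contain it."""

    def __init__(self):
        self.index = defaultdict(set)

    def add_file(self, file_id, fingerprints):
        for fp in fingerprints:
            self.index[fp].add(file_id)

    def candidates(self, fingerprints):
        """Return the IDs of files sharing at least one fingerprint with the query."""
        result = set()
        for fp in fingerprints:
            result |= self.index.get(fp, set())
        return result

# Usage: index two files, then look up the candidates for a query file.
idx = FingerprintIndex()
idx.add_file("A", [0x3f, 0x91, 0x07, 0xc2])
idx.add_file("B", [0x3f, 0x18, 0x55, 0xaa])
print(idx.candidates([0x3f, 0x91, 0x00, 0x00]))   # prints {'A', 'B'}
```

With such an index, the expensive pairwise comparison only has to be performed against the returned candidates.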

6.4.3 CPU and Memory utilization

Fig. 13 illustrates the CPU and memory utilization of Simhash, Traits, PAS and EPAS. The CPU utilization of Traits, Simhash, PAS and EPAS is about 80%, 44%, 13% and 15%, respectively, which indicates that Traits and Simhash are more computation intensive than PAS and EPAS. In particular, Traits needs more CPU cycles to calculate fingerprints than Simhash does. According to Algorithms 2 and 3, the CPU overhead of PAS and EPAS is roughly identical. From Fig. 13, we can also find that the memory utilization of Traits slightly exceeds those of the other algorithms and increases as time goes by. There are two reasons for this phenomenon. The first is that Traits uses a lot of memory space to store the temporary fingerprint sets when calculating the similarity characteristic value. The second is that the similarity characteristic values of Traits are stored in memory. Hence, the memory utilization of EPAS and PAS is much lower than that of Traits. Both PAS and EPAS adopt Tokyo Cabinet to store the fingerprint sets, and Tokyo Cabinet maps data files into memory as much as possible, so PAS and EPAS take more memory space than Simhash. However, the memory utilization of PAS and EPAS is relatively stable. This is because Tokyo Cabinet does not store any data in cache, while Simhash stores all fingerprints in memory and keeps redundant fingerprints.
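The paper does not state how the utilization curves in Fig. 13 were collected. As one possible way to gather such traces, the sketch below samples the CPU and memory utilization of a process at a fixed interval with the psutil package; this tooling choice is purely our assumption.

```python
import time
import psutil

def sample_utilization(duration_s=60, interval_s=1):
    """Record (elapsed seconds, CPU percent, memory percent) for this process."""
    proc = psutil.Process()
    proc.cpu_percent(None)                  # prime the CPU counter
    trace, start = [], time.time()
    while time.time() - start < duration_s:
        time.sleep(interval_s)
        trace.append((round(time.time() - start),
                      proc.cpu_percent(None),
                      proc.memory_percent()))
    return trace

# Run the similarity-detection workload in another process or thread and call
# sample_utilization() alongside it; the returned trace can be plotted as in Fig. 13.
```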

6.4.4 Precision and Recall

Table 8 indicates the precision and recall of Simhash, Traits, PAS and EPAS with data set D2. From Table 8, we can find that Traits makes a good tradeoff between precision and recall, which means that Traits can find most of the actually similar instances while keeping the accuracy. However, the computing overhead of Traits is rather large: as described in Section 3, Traits needs to compute t groups of fingerprint sets, which takes a large amount of memory space and many CPU cycles. The precision and recall of Simhash are lower than those of Traits and EPAS. This is mainly because Simhash employs a fixed-size partition algorithm, which makes it more sensitive to file modifications: a single-bit modification makes the fingerprints of the two corresponding files completely distinct. Although PAS maintains a high precision, its recall is very low, which means PAS cannot find most of the actually similar instances; this is because PAS is incapable of detecting slight modifications in the file head. Although EPAS is able to find most of the similar files, some false alarms still exist. It is noteworthy that the overhead of EPAS is the smallest among these algorithms. Therefore, we believe that EPAS is a practical and applicable solution for file similarity detection.
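To summarize the tradeoff in Table 8 with one number, the snippet below computes the F1 score (the harmonic mean of precision and recall) for each algorithm from the values reported above. F1 is only our own summary measure here, not a metric used in the paper.

```python
# Precision and recall taken from Table 8 (data set D2).
results = {"Simhash": (0.66, 0.666), "Traits": (0.888, 0.888),
           "PAS": (1.0, 0.333), "EPAS": (0.75, 1.0)}

for name, (precision, recall) in results.items():
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:8s} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```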

7 CONCLUSION AND FUTURE WORK

In this paper, we propose an Enhanced Position-Aware Sampling algorithm (EPAS) to identify file similarity in the cloud environment. Comprehensive experiments are performed to select the optimal parameters for EPAS, and the corresponding analysis and discussion of the parameter selection are presented. The evaluation of precision and recall demonstrates that EPAS is very effective in detecting file similarity in contrast to Shingle, Simhash, Traits and PAS. The experimental results also show that the time overhead, CPU and memory occupation of EPAS are much lower than those of the other algorithms.


Therefore, we believe that EPAS can be applied to the cloud environment to reduce latency while achieving both efficiency and accuracy. Sampling based similarity identification opens up several directions for future work. The first is to use a content-based chunking algorithm to sample data blocks, since this approach can avoid the content shifting incurred by data modification. The second is to employ file metadata to optimize similarity detection, because the file size and type contained in the metadata of similar files are normally very close.
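As a pointer to what the first direction involves, the sketch below shows a very small content-defined chunker: a boundary is declared wherever a hash of the last few bytes matches a bit mask, so chunk boundaries move together with the content rather than with absolute byte offsets. The window size, mask and hash are arbitrary illustrative choices, not a design proposed in the paper.

```python
def content_defined_chunks(data: bytes, window=16, mask=0x3F,
                           min_size=64, max_size=4096):
    """Split data where a hash of the last `window` bytes matches `mask`,
    so boundaries depend on local content rather than on absolute offsets."""
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size >= min_size:
            recent = data[max(i - window + 1, 0):i + 1]     # last `window` bytes
            h = 0
            for b in recent:                                # tiny polynomial hash
                h = (h * 31 + b) & 0xFFFFFFFF
            if (h & mask) == mask or size >= max_size:
                chunks.append(data[start:i + 1])
                start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Inserting bytes near the head of a file only disturbs the chunks around the
# edit; later boundaries re-align, so their fingerprints remain comparable.
```

Production chunkers usually replace the per-position rehash with an incrementally updated rolling hash (for example a Rabin fingerprint) so that each byte is processed in constant time.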

ACKNOWLEDGMENTS

We would like to thank the anonymous reviewers for helping us refine this paper. Their constructive comments and suggestions are very helpful. This work is supported by the NSF of China under Grant (No. 61572232, and No. 61272073), the key program of NSF of Guangdong Province (No. S2013020012865), the Fundamental Research Funds for the Central Universities, the Open Research Fund of Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences (CARCH201401), and the Science and Technology Planning Project of Guangdong Province (No. 2013B090200021). The corresponding author is Yuhui Deng.

REFERENCES

[1] Q. Zhang, Z. Chen, A. Lv, L. Zhao, F. Liu, and J. Zou, "A universal storage architecture for big data in cloud environment," in Green Computing and Communications (GreenCom), 2013 IEEE and Internet of Things (iThings/CPSCom), IEEE International Conference on and IEEE Cyber, Physical and Social Computing. IEEE, 2013, pp. 476–480.
[2] J. Gantz and D. Reinsel, "The digital universe decade-are you ready," IDC iView, 2010.
[3] H. Biggar, "Experiencing data de-duplication: Improving efficiency and reducing capacity requirements," The Enterprise Strategy Group, 2007.
[4] F. Guo and P. Efstathopoulos, "Building a high-performance deduplication system," in Proceedings of the 2011 USENIX conference on USENIX annual technical conference. USENIX Association, 2011, pp. 25–25.
[5] A. Muthitacharoen, B. Chen, and D. Mazieres, "A low-bandwidth network file system," in ACM SIGOPS Operating Systems Review, vol. 35, no. 5. ACM, 2001, pp. 174–187.
[6] B. Zhu, K. Li, and R. H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system," in FAST, vol. 8, 2008, pp. 1–14.
[7] Y. Deng, "What is the future of disk drives, death or rebirth?" ACM Computing Surveys (CSUR), vol. 43, no. 3, p. 23, 2011.
[8] C. Wu, X. Lin, D. Yu, W. Xu, and L. Li, "End-to-end delay minimization for scientific workflows in clouds under budget constraint," IEEE Transactions on Cloud Computing (TCC), vol. 3, pp. 169–181, 2014.
[9] G. Linden, "Make data useful," http://home.blarg.net/~glinden/StanfordDataMining.2006-11-29.ppt, 2006.
[10] R. Kohavi, R. M. Henne, and D. Sommerfield, "Practical guide to controlled experiments on the web: listen to your customers not to the hippo," in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007, pp. 959–967.
[11] J. Hamilton, "The cost of latency," Perspectives Blog, 2009.
[12] D. Bhagwat, K. Eshghi, D. D. Long, and M. Lillibridge, "Extreme binning: Scalable, parallel deduplication for chunk-based file backup," in Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS'09. IEEE International Symposium on. IEEE, 2009, pp. 1–9.

[13] W. Xia, H. Jiang, D. Feng, and Y. Hua, "Silo: a similarity-locality based near-exact deduplication scheme with low ram overhead and high throughput," in Proceedings of the 2011 USENIX conference on USENIX annual technical conference. USENIX Association, 2011, pp. 26–28.
[14] Y. Fu, H. Jiang, and N. Xiao, "A scalable inline cluster deduplication framework for big data protection," in Middleware 2012. Springer, 2012, pp. 354–373.
[15] D. Teodosiu, N. Bjorner, Y. Gurevich, M. Manasse, and J. Porkka, "Optimizing file replication over limited bandwidth networks using remote differential compression," Microsoft Research TR-2006-157, 2006.
[16] Y. Zhou, Y. Deng, and J. Xie, "Leverage similarity and locality to enhance fingerprint prefetching of data deduplication," in Proceedings of The 20th IEEE International Conference on Parallel and Distributed Systems. Springer, 2014.
[17] A. Z. Broder, "On the resemblance and containment of documents," in Compression and Complexity of Sequences 1997. Proceedings. IEEE, 1997, pp. 21–29.
[18] G. S. Manku, A. Jain, and A. Das Sarma, "Detecting near-duplicates for web crawling," in Proceedings of the 16th international conference on World Wide Web. ACM, 2007, pp. 141–150.
[19] L. Song, Y. Deng, and J. Xie, "Exploiting fingerprint prefetching to improve the performance of data deduplication," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications. IEEE, 2013.
[20] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig, "Syntactic clustering of the web," Computer Networks and ISDN Systems, vol. 29, no. 8, pp. 1157–1166, 1997.
[21] M. S. Charikar, "Similarity estimation techniques from rounding algorithms," in Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM, 2002, pp. 380–388.
[22] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The google file system," in ACM SIGOPS Operating Systems Review, vol. 37, no. 5. ACM, 2003, pp. 29–43.
[23] J. Dean and S. Ghemawat, "Mapreduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[24] T. Elsayed, J. Lin, and D. W. Oard, "Pairwise document similarity in large collections with mapreduce," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. Association for Computational Linguistics, 2008, pp. 265–268.
[25] G. Forman, K. Eshghi, and S. Chiocchetti, "Finding similar files in large document repositories," in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005, pp. 394–400.
[26] Z. Ouyang, N. Memon, T. Suel, and D. Trendafilov, "Cluster-based delta compression of a collection of files," in Web Information Systems Engineering, 2002. WISE 2002. Proceedings of the Third International Conference on. IEEE, 2002, pp. 257–266.
[27] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, "Min-wise independent permutations," Journal of Computer and System Sciences, vol. 60, no. 3, pp. 630–659, 2000.
[28] B. Han and P. J. Keleher, "Implementation and performance evaluation of fuzzy file block matching," in USENIX Annual Technical Conference, 2007, pp. 199–204.
[29] B. S. Baker, "On finding duplication and near-duplication in large software systems," in Reverse Engineering, 1995. Proceedings of 2nd Working Conference on. IEEE, 1995, pp. 86–95.
[30] N. Shivakumar and H. Garcia-Molina, "Building a scalable and accurate copy detection mechanism," in Proceedings of the first ACM international conference on Digital libraries. ACM, 1996, pp. 160–168.
[31] L. P. Cox, C. D. Murray, and B. D. Noble, "Pastiche: Making backup cheap and easy," ACM SIGOPS Operating Systems Review, vol. 36, no. SI, pp. 285–298, 2002.
[32] Y. Hua, X. Liu, and D. Feng, "Data similarity-aware computation infrastructure for the cloud," IEEE Transactions on Computers, p. 1, 2013.
[33] S. Biswas, D. Franklin, A. Savage, R. Dixon, T. Sherwood, and F. T. Chong, "Multi-execution: multicore caching for data-similar executions," in ACM SIGARCH Computer Architecture News, vol. 37, no. 3. ACM, 2009, pp. 164–173.
[34] R. Vernica, M. J. Carey, and C. Li, "Efficient parallel set-similarity joins using mapreduce," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data. ACM, 2010, pp. 495–506.

[35] D. Deng, G. Li, S. Hao, J. Wang, and J. Feng, "Massjoin: A mapreduce-based method for scalable string similarity joins," in Data Engineering (ICDE), 2014 IEEE 30th International Conference on. IEEE, 2014, pp. 340–351.
[36] P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998, pp. 604–613.
[37] Y. Zhou, Y. Deng, X. Chen, and J. Xie, "Identifying file similarity in large data sets by modulo file length," in Algorithms and Architectures for Parallel Processing. Springer, 2014, pp. 136–149.
[38] X. Yun, G. Wu, G. Zhang, K. Li, and S. Wang, "Fastraq: A fast approach to range-aggregate queries in big data environments," IEEE Transactions on Cloud Computing (TCC), vol. 3, pp. 206–218, 2014.
[39] F. Labs, "Tokyo cabinet," http://fallabs.com/tokyocabinet/.
[40] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch, "A five-year study of file-system metadata," ACM Transactions on Storage (TOS), vol. 3, no. 3, p. 9, 2007.
[41] D. T. Meyer and W. J. Bolosky, "A study of practical deduplication," ACM Transactions on Storage (TOS), vol. 7, no. 4, p. 14, 2012.
[42] D. R. Bobbarjung, S. Jagannathan, and C. Dubnicki, "Improving duplicate elimination in storage systems," ACM Transactions on Storage (TOS), vol. 2, no. 4, pp. 424–448, 2006.
[43] M. K. Buckland and F. C. Gey, "The relationship between recall and precision," JASIS, vol. 45, no. 1, pp. 12–19, 1994.
[44] D. M. Powers, "Evaluation: from precision, recall and f-measure to roc, informedness, markedness & correlation," Journal of Machine Learning Technologies, vol. 2, no. 1, pp. 37–63, 2011.

Yongtao Zhou is a research student at the Computer Science Department of Jinan University. His current research interests cover data deduplication, file system, key-value database, cloud storage, Linux kernel, etc.

Yuhui Deng is a professor at the Computer Science Department of Jinan University. Before joining Jinan University, Dr. Yuhui Deng worked at EMC Corporation as a senior research scientist from 2008 to 2009. He worked as a research officer at Cranfield University in the United Kingdom from 2005 to 2008. He received his Ph.D. degree in computer science from Huazhong University of Science and Technology in 2004. His research interests cover green computing, cloud computing, information storage, computer architecture, performance evaluation, etc.

Junjie Xie is a research student at the Computer Science Department of Jinan University. His current research interests cover network interconnection, data center architecture and cloud computing.


Laurence T. Yang received the B.E. degree in computer science and technology from Tsinghua University, China, and the Ph.D. degree in computer science from University of Victoria, Canada. He is a professor with the School of Computer Science and Technology, Huazhong University of Science and Technology, China, as well as with the Department of Computer Science, St. Francis Xavier University, Canada. His research interests include parallel and distributed computing, embedded and ubiquitous/pervasive computing, big data. He has published more than 200 papers in various refereed journals (about 40% in IEEE/ACM Transactions/Journals and the others mostly in Elsevier, Springer, and Wiley Journals). His research has been supported by the National Sciences and Engineering Research Council of Canada (NSERC) and the Canada Foundation for Innovation.
