A Fusion Approach for Anomaly Detection in Hard Disk Drives

Yu Wang

Eden W. M. Ma

Center for Prognostics and System Health Management City University of Hong Kong Hong Kong

Center for Prognostics and System Health Management City University of Hong Kong Hong Kong

KL Tsui

Michael Pecht

Dept. of Systems Engineering and Engineering Management City University of Hong Kong Hong Kong

Center for Advanced Life Cycle Engineering (CALCE) University of Maryland, College Park Maryland, USA

Abstract: As the amount of information stored in hard disk drives (HDDs) continuously increases, data safety is becoming more and more important. Among the safety technologies, anomaly detection is crucial for helping users back up their data and prevent data loss. A fusion approach is proposed to monitor HDD health status based on Mahalanobis distance (MD) and Box-Cox transformation. A quality control technique, the Shewhart control chart, is applied to the transformed MD values to detect anomalies in HDDs. A case study was then conducted to verify the validity of the proposed approach. The results showed that the proposed approach is effective for detecting anomalies.

Keywords: hard disk drive; SMART; Mahalanobis distance; Box-Cox transformation; anomaly prediction

I. INTRODUCTION

A hard disk drive (HDD) is one of the essential components in a computer [1]. HDD shipments reached 652.4 million units in 2010 [2]. The storage capacity per platter has also increased to 1 TB [3], which brings the largest capacity of a 3.5-inch HDD to upwards of 4 TB [4]. Due to the large amount of data stored on HDDs, concerns about HDD reliability have arisen not only among manufacturers but also among end users. Manufacturers developed self-monitoring, analysis, and reporting technology (SMART) to predict impending failures. SMART attributes vary among manufacturers, but the typical characteristics monitored in HDDs can be summarized as track-seek retries, read errors, write faults, reallocated sectors, head fly height, and environmental temperature. The threshold value of each attribute is defined by the manufacturer. If any attribute exceeds its threshold value, the drive is determined to have experienced failure [5] and has to be returned to the factory for replacement under warranty. Therefore, manufacturers prefer to accept lower failure detection rates in order to reduce the false alarm rates of their algorithms, so as to reduce the chance of failed-disk replacement. As a result, the SMART algorithm currently used in HDDs cannot predict failures well: the failure detection rate (FDR) is 3–10% [6].

On the academic side, several algorithms have been developed in the past decade to increase the failure detection rate. Hughes et al. found that most SMART attributes were nonparametrically distributed when they performed failure prediction [6, 7]. Murray et al. compared different machine learning methods and found that support vector machines (SVMs) had the highest failure detection rate among all tested methods (multiple-instance naive Bayes, support vector machines, autoclass, and rank-sum test) [8]. Naive Bayes expectation-maximization was developed by Hamerly et al. [9] to predict HDD failure using the original Quantum data. Instead of using all attributes from the original dataset, three attributes were selected to get better prediction performance. However, most of these approaches focused on building a binary classifier that assigned a label, healthy or failed, to each observation in the drives, so the predicted outputs have only two statuses. Thus, these approaches cannot show how far a drive has deviated from its normal condition before it fails completely.

The contributions of this work are as follows. A fusion approach for monitoring the health status of HDDs based on Mahalanobis distance (MD) and Box-Cox transformation is proposed. The degree of deviation of an HDD is expressed by the MD and Box-Cox transformation. After the transformed MD values are obtained, a Shewhart control chart is constructed to define the threshold used to detect anomalies. An optimization step is also developed to reduce the false alarm rate of HDDs. Furthermore, a case study was conducted to verify the proposed approach.

The rest of this paper is organized as follows. Section II briefly introduces SMART. Section III presents the proposed method and the relevant algorithms. Section IV presents a case study and discusses the corresponding results. Section V draws conclusions from the discussions.

The work described in this paper was fully supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China (CityU8/CRF/09).

978-1-4577-1911-0/12/$26.00 ©2012 IEEE

MU3279

2012 Prognostics & System Health Management Conference (PHM-2012 Beijing)

II. SELF-MONITORING, ANALYSIS, AND REPORTING TECHNOLOGY (SMART)

SMART collects the attributes that correspond to counts or physical units to determine an HDD’s health state. The attributes are reliability-prediction parameters that are determined by field return evaluation and design consideration points. The attributes vary among different manufacturers and different drive models. The manufacturers gather information from the field to predict the reliability of drives and further enhance new reliability strategies at the same time [10].

Some SMART software programs provide only a basic evaluation. For example, "threshold not exceeded" and "threshold exceeded" are represented as "drive OK" and "drive fail," respectively. Manufacturers define the threshold values confidentially [5].

Unfortunately, the "threshold method" of SMART does not always perform well in operation. Murray studied 191 failed drives and found that the failure detection rate of the "threshold method" of SMART was only 3–10% [6]. This was because that particular manufacturer was very concerned with the false alarm rate [6–8]. The thresholds were set very high to avoid false alarms, which led to a low failure detection rate.

III. PROPOSED APPROACH FOR HDD ANOMALIES

This paper proposes a fusion approach based on Mahalanobis distance (MD) and Box-Cox transformation to detect anomalies in HDDs. Failure modes, mechanisms, and effects analysis (FMMEA) was conducted on the HDD system to determine the critical parameters. MD was then used to compress the monitored parameters from multiple attributes into a single index. Box-Cox transformation was utilized to transform the MD values into normally distributed variables. Then the Shewhart quality control chart was applied to detect the anomalies in HDDs. The flowchart of the proposed approach is shown in Fig. 1.

A. Failure Modes, Mechanisms, and Effects Analysis (FMMEA)

FMMEA is a systematic method for analyzing the physics of failure of a system [11]. Failure modes refer to the type of failure. Failure causes refer to the factors that induced failure, including the specific process of production and design as well as environmental conditions. Potential failure mechanisms are determined by the available mechanisms corresponding to the physical, electrical, chemical, and mechanical stresses that can induce failure. A failure effect is the influence of the failure on the product or system [12]. FMMEA was conducted on HDDs in [13], where failure modes related to the head disk interface and head stack assembly were classified as critical reliability issues.

B. Health Monitoring

In this study, MD and Box-Cox transformation were used to monitor HDD health status. The parameters obtained from the FMMEA were monitored and then transformed to normally distributed variables. MD takes advantage of the correlations between variables to identify different patterns [14].

[Figure 1: flowchart. Data preparation (FMMEA, data collection) feeds health monitoring. Healthy system: training data, normalize data, calculate healthy MD values, transform MD values to normal distribution by Box-Cox transformation, calculate mean and standard deviation, threshold. Test system: test data, normalize data, calculate test MD values, transform MD values by Box-Cox transformation, develop decision (healthy or unhealthy) against the threshold.]

Figure 1. Flowchart of the proposed method.

The data set can be denoted as $X$. The columns of $X$ are attributes denoted as $X_j$, where $j = 1, 2, \ldots, m$, and $m$ is the number of attributes. The rows of $X$ are observations of all attributes denoted by $X_i$, where $i = 1, 2, \ldots, n$, and $n$ is the number of observations. The value of the $i$th observation in the $j$th attribute is denoted by $x_{ij}$, where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$. To eliminate the scale effect for different parameters, each individual attribute in each data vector is normalized using Eqn. (1), wherein the mean and standard deviation are based on the healthy data.

$$z_{ij} = \frac{x_{ij} - \bar{X}_j}{S_j} \tag{1}$$

$$\bar{X}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij} \tag{2}$$

$$S_j = \sqrt{\frac{\sum_{i=1}^{n} (x_{ij} - \bar{X}_j)^2}{n - 1}} \tag{3}$$

where $i = 1, 2, \ldots, n$ and $j = 1, 2, \ldots, m$.
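As an illustration, the normalization in Eqns. (1) to (3) can be sketched in Python using only the standard library. The function names `fit_scaler` and `normalize` and the toy data are hypothetical and are not taken from the case study; the only point the sketch makes is that the mean and standard deviation come from the healthy data and are then reused for the test data.

```python
import statistics

def fit_scaler(healthy_rows):
    """Return per-attribute (mean, sample std) computed from healthy data only."""
    columns = list(zip(*healthy_rows))
    means = [statistics.fmean(col) for col in columns]   # Eq. (2)
    stds = [statistics.stdev(col) for col in columns]    # Eq. (3), n - 1 denominator
    return means, stds

def normalize(rows, means, stds):
    """Apply Eq. (1) to every observation using the healthy-data statistics."""
    return [[(x - mu) / s for x, mu, s in zip(row, means, stds)] for row in rows]

# Toy data (assumed for illustration): 3 healthy observations, 2 attributes.
healthy = [[1.0, 10.0], [2.0, 14.0], [3.0, 18.0]]
means, stds = fit_scaler(healthy)
z_healthy = normalize(healthy, means, stds)
z_test = normalize([[4.0, 10.0]], means, stds)  # test data reuse healthy statistics
```

An attribute that is constant over the healthy data would give a zero standard deviation, so such attributes would have to be dropped before this step.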


The value of each observation’s MD is calculated for healthy items using the following formula:

$$MD_i = \frac{1}{m} z_i C^{-1} z_i^T \tag{4}$$

where $z_i = [z_{i1}, z_{i2}, \ldots, z_{im}]$, $z_i^T$ is the transpose of $z_i$, and $C$ is the covariance matrix calculated as:

$$C = \frac{1}{n - 1} \sum_{i=1}^{n} z_i^T z_i \tag{5}$$
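A minimal Python sketch of Eqns. (4) and (5) is given below, under the assumption that the rows are already normalized by Eqn. (1). The helper names (`covariance`, `solve`, `mahalanobis`) and the toy rows are hypothetical; the linear solve is a plain Gaussian elimination chosen only to keep the example free of external libraries.

```python
def covariance(z_rows):
    """Eq. (5): C = (1 / (n - 1)) * sum_i z_i^T z_i for normalized rows z_i."""
    n, m = len(z_rows), len(z_rows[0])
    return [[sum(z[a] * z[b] for z in z_rows) / (n - 1) for b in range(m)]
            for a in range(m)]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    m = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[r][m] / aug[r][r] for r in range(m)]

def mahalanobis(z, cov):
    """Eq. (4): MD = (1 / m) * z C^{-1} z^T."""
    w = solve(cov, z)  # w = C^{-1} z^T, so the inverse is never formed explicitly
    return sum(a * b for a, b in zip(z, w)) / len(z)

# Toy rows (assumed for illustration), standing in for normalized healthy data.
z_rows = [[1.0, 0.5], [-1.0, 0.5], [0.0, -1.0]]
cov = covariance(z_rows)
healthy_md = [mahalanobis(z, cov) for z in z_rows]
test_md = mahalanobis([2.0, 0.0], cov)
```

With the 1/m scaling in Eqn. (4), the MD values of the healthy rows sum to n - 1, so their average is close to one for large n; a test observation far from the healthy pattern gives a larger value.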

Box-Cox transformation was utilized to transform the MD values into normally distributed variables. It takes the following form [15]:

$$z(y, \lambda) = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \text{if } \lambda \neq 0; \\ \ln y, & \text{if } \lambda = 0; \end{cases} \tag{6}$$

where $y$ is the original vector, corresponding to the MD values, and $z(y, \lambda)$ is the transformed vector. $\lambda$ is the transformation parameter calculated by maximizing the logarithmic likelihood function:

$$f(y, \lambda) = -\frac{N}{2} \ln\left[ \sum_{i=1}^{N} \frac{(z(y_i, \lambda) - \bar{z}(y, \lambda))^2}{N} \right] + (\lambda - 1) \sum_{i=1}^{N} \ln(y_i) \tag{7}$$

where $N$ is the sample size of the training data, and $\bar{z}(y, \lambda)$ is the mean value of $z(y, \lambda)$:

$$\bar{z}(y, \lambda) = \frac{1}{N} \sum_{i=1}^{N} z(y_i, \lambda) \tag{8}$$
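The following Python sketch illustrates Eqns. (6) to (8). The paper does not state how the maximization of Eqn. (7) was carried out, so the grid search in `best_lambda`, its search range, and the sample values are assumptions made purely for illustration. The inputs must be strictly positive, which holds for nonzero MD values.

```python
import math

def box_cox(y, lam):
    """Eq. (6): Box-Cox transform of one positive value."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def log_likelihood(ys, lam):
    """Eq. (7): log-likelihood of lambda for positive samples ys."""
    n = len(ys)
    zs = [box_cox(y, lam) for y in ys]
    z_bar = sum(zs) / n  # Eq. (8)
    variance = sum((z - z_bar) ** 2 for z in zs) / n
    return -0.5 * n * math.log(variance) + (lam - 1.0) * sum(math.log(y) for y in ys)

def best_lambda(ys, lo=-2.0, hi=2.0, steps=400):
    """Pick the grid value of lambda that maximizes Eq. (7) (assumed search strategy)."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda lam: log_likelihood(ys, lam))

# Toy samples (assumed): their logarithms are symmetric about zero,
# so a lambda near 0 (the log transform) is expected to fit best.
ys = [math.exp(v) for v in (-1.5, -0.5, 0.0, 0.5, 1.5)]
lam_hat = best_lambda(ys)
transformed = [box_cox(y, lam_hat) for y in ys]
```

In practice, the value of lambda fitted on the training (healthy) MD values would be reused unchanged when transforming the test MD values.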

C. Threshold Determination and Anomaly Detection

To determine the threshold for identifying anomalies, the Shewhart control chart [16] was introduced. It is used to detect shifts in the mean or variance of a quality characteristic in a process, under the assumption that the observed process data come from a near-normal distribution. Table I summarizes the decision metrics in the Shewhart control chart.

TABLE I. DECISION METRICS

Probability (%)    Lower bound    Upper bound
68.3               μ-σ            μ+σ
95.5               μ-2σ           μ+2σ
99.7               μ-3σ           μ+3σ

Note: μ is the mean, and σ is the standard deviation.

The different metrics correspond to different probabilities. The third upper metric, μ+3σ, is commonly used to identify "out-of-control" points with a 99.7% confidence level. After Box-Cox transformation, the transformed MD values of healthy drives were used to build the control chart. μ and σ were calculated from these transformed MD values, and the threshold μ+3σ can be calculated accordingly. During the test process, the mean, standard deviation, and correlation coefficient matrix of healthy HDDs are used to calculate the MD value of each observation in the testing drives. Then the transformed MD values of the tested drives are compared with the decision metrics. If the transformed data exceed the threshold, the observations are determined to be anomalies.

IV. RESULTS AND DISCUSSIONS

In this section, a case study was performed to investigate the performance of the proposed approach.

A. Data Set

The data set used in this paper is available on the website of the Center for Magnetic Recording Research (CMRR) at the University of California, San Diego [17]. It included 369 drives of one model; 178 drives were labeled as good (healthy) and 191 drives were labeled as failed. The good drives were from a reliability test by the manufacturer. The failed drives were field returns to the manufacturer. Each drive contained up to 300 samples (observations), and the time interval between two samples was 2 hours. Only the last 600 hours of data could be recorded; if the time exceeded 600 hours, the earlier data were overwritten. Some failed drives may have fewer than 300 samples because they did not survive 600 hours of operation. Each sample contains 60 performance-monitoring attributes and additional attributes such as the drive’s serial number and total power-on hours. Every performance attribute represents the number of times a certain event happened during two hours. Not all the attributes were monitored in every drive. The neglected attributes were set to constants [6].

B. Training Setup

The critical parameters were obtained by FMMEA. The detailed information can be found in [13]. 60% of the healthy HDDs were used as the training data set. The correlation matrix C and the transformation parameter λ, as well as the threshold for anomaly detection, were calculated in this stage. All the failed drives and the remaining 40% of the healthy drives were used as testing data.


Figure 2. Box-Cox transformation of training data.


The threshold in this paper is set to μ+3σ by referring to the quality control standard after the data were transformed by Box-Cox transformation. The transformed data of the training drives are plotted in Fig. 2.
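The threshold rule can be illustrated with a short Python sketch. The function names and the numeric values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the paper does not say which estimator was used for σ.

```python
import statistics

def shewhart_threshold(healthy_values, k=3.0):
    """Upper control limit mu + k * sigma from transformed healthy MD values."""
    mu = statistics.fmean(healthy_values)
    sigma = statistics.stdev(healthy_values)  # assumption: sample standard deviation
    return mu + k * sigma

def flag_anomalies(test_values, threshold):
    """Mark each transformed test observation that exceeds the threshold."""
    return [value > threshold for value in test_values]

# Made-up transformed MD values, not taken from the data set.
healthy_transformed = [-1.0, 0.0, 1.0]
threshold = shewhart_threshold(healthy_transformed)
flags = flag_anomalies([0.5, 2.9, 3.1], threshold)
```

Only the upper limit is checked here because, as in the text, anomalies are the points above μ+3σ.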

C. Anomaly Detection

During the testing process, the MD values of the testing HDDs were transformed and then compared with the threshold (Fig. 3). The transformed MD values measured the degree of degradation of the drives. Based on the predefined threshold, most failed drives and healthy drives could be distinguished. The points above the threshold were determined to be anomalies. Failed drives were found to contain many more anomalies than healthy drives.

Figure 3. Box-Cox transformation of testing data.

D. Preventing False Alarm

To define a drive’s status as healthy or failed, this section takes the anomaly count in a drive into account. In particular, to prevent false alarms, which are critical to manufacturers, this section applies two rules to test how many anomalies are needed to trigger an alarm to users. The first rule is that μ+3σ is used as the threshold to detect anomalies. The second rule is that if n anomalies occur in a drive, the drive is determined to be failed. Different values of n were tested. Fig. 4 shows the FDRs and false alarm rates (FARs) at different values of n.

Then, the results for different values of n were evaluated based on a simple criterion:

$$\text{Score} = \frac{\text{Failure Detection Rate}}{\text{False Alarm Rate}} \tag{9}$$

With this criterion, we can minimize the false alarm rate and maximize the failure detection rate. Fig. 5 shows the scores obtained using different values of n. The highest score was obtained when n = 5. In this case, the false alarm rate and failure detection rate are 2.9% and 90.58%, respectively.

[Figure 5: plot of Score (vertical axis, 0 to 40) against the number of points greater than the threshold (horizontal axis, 0 to 10).]

Figure 5. The effects of the number of points on the criterion.
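The two rules and the criterion in Eqn. (9) can be sketched in Python as follows. The function names and the toy flags are hypothetical, and reading the second rule as "at least n anomalies" is an assumption; the handling of a zero false alarm rate is likewise a choice made only so that the example does not divide by zero.

```python
def drive_failed(flags, n):
    """Second rule (assumed reading): a drive is failed once it has at least n anomalies."""
    return sum(flags) >= n

def rates(failed_drives, healthy_drives, n):
    """Return (FDR, FAR) in percent for a given anomaly-count rule n."""
    detected = sum(drive_failed(f, n) for f in failed_drives)
    false_alarms = sum(drive_failed(f, n) for f in healthy_drives)
    return (100.0 * detected / len(failed_drives),
            100.0 * false_alarms / len(healthy_drives))

def score(fdr, far):
    """Eq. (9); an undefined ratio (FAR = 0) is reported as infinity."""
    return float("inf") if far == 0 else fdr / far

# Toy per-observation anomaly flags for two failed and two healthy drives (made up).
failed_drives = [[True, True, False], [True, False, False]]
healthy_drives = [[False, False, False], [True, False, False]]
fdr_1, far_1 = rates(failed_drives, healthy_drives, n=1)
fdr_2, far_2 = rates(failed_drives, healthy_drives, n=2)

# Score for the rates reported in the text at n = 5.
reported_score = score(90.58, 2.9)
```

Applied to the rates reported above (FDR 90.58%, FAR 2.9%), Eqn. (9) gives a score of roughly 31.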

[Figure 4: plot of failure detection rate (%) (left axis, 0 to 100) and false alarm rate (%) (right axis, 0 to 10) against the number of points greater than the threshold (horizontal axis, 0 to 10).]

Figure 4. The effects of the number of points on the false alarm rate and failure detection rate.

V. CONCLUSIONS

An approach for detecting anomalies in HDDs based on Mahalanobis distance and Box-Cox transformation was proposed in this paper. Mahalanobis distance was used to compress the monitored parameters into one index. Then Box-Cox transformation was applied to transform the MD values into normally distributed variables. To determine the threshold, the Shewhart control chart was constructed from the transformed MD values. Finally, a case study was conducted based on a SMART data set to verify the proposed approach. The proposed approach provides a promising way to lower manufacturers’ costs by reducing the false alarms of HDDs and thereby reducing warranty issues. At the same time, the failure detection rate (FDR) can be maintained at a high level (90.58%), which can help to guarantee the safety of users’ data by providing warnings of impending failure.

REFERENCES

[1] How to Replace an Internal Hard Drive in a Desktop PC, http://www.ehow.com/how_4777881_replace-hard-drive-desktoppc.html, viewed on October 10, 2010.
[2] Seagate to control 40% of HDD market with Samsung acquisition, says IHS iSuppli, http://www.digitimes.com/news/a20110504PR202.html, viewed on May 10, 2011.
[3] Seagate Introduces 1TB Per Platter HDD, http://www.informationweek.com/news/hardware/desktop/229402732, viewed on June 23, 2011.
[4] Samsung Showed Off Prototype of 4TB Hard Disk Drive, http://www.xbitlabs.com/news/storage/display/20110308081634_Samsung_Shows_Off_Prototype_of_4TB_Hard_Disk_Drive.html, viewed on June 20, 2011.
[5] Self-Monitoring, Analysis, and Reporting Technology, http://en.wikipedia.org/wiki/S.M.A.R.T, viewed on November 10, 2010.
[6] G. F. Hughes, J. F. Murray, K. Kreutz-Delgado, and C. Elkan, "Improved disk-drive failure warnings," IEEE Transactions on Reliability, vol. 51(3), pp. 350–357, September 2002.
[7] J. F. Murray, G. F. Hughes, and K. Kreutz-Delgado, "Hard drive failure prediction using non-parametric statistical methods," in Proceedings of ICANN/ICONIP, June 2003.
[8] J. F. Murray, G. F. Hughes, and K. Kreutz-Delgado, "Machine learning methods for predicting failures in hard drives: A multiple instance application," J. Mach. Learn. Res., vol. 6, pp. 783–816, 2005.
[9] G. Hamerly and C. Elkan, "Bayesian approaches to failure prediction for disk drives," in Proceedings of the Eighteenth International Conference on Machine Learning (ICML’01), June 2001.
[10] Get S.M.A.R.T for Reliability, Seagate Technology Paper, http://www.seagate.com/docs/pdf/whitepaper/enhanced_smart.pdf, viewed on November 10, 2010.
[11] S. Ganesan, V. Eveloy, D. Das, and M. Pecht, "Identification and utilization of failure mechanisms to enhance FMEA and FMECA," in Proceedings of the IEEE workshop on accelerated stress testing & reliability (ASTR), Austin, Texas, October 2005.
[12] M. Pecht, Prognostics and Health Management of Electronics, Wiley-Interscience, New York, NY, 2008.
[13] Y. Wang, Q. Miao, and M. Pecht, "Health monitoring of hard disk drive based on Mahalanobis distance," in Proceeding of IEEE 2011 Prognostics and Health Management Conference (PHM-2011), Shenzhen, China, 23–25 May 2011.
[14] G. Taguchi and R. Jugulum, The Mahalanobis-Taguchi Strategy: A Pattern Technology System, Wiley Press, May 2002.
[15] G. E. P. Box and D. R. Cox, "An analysis of transformations," Journal of the Royal Statistical Society, Series B, vol. 26, pp. 211–243, 1964.
[16] L. S. Nelson, "Technical Aids," Journal of Quality Technology, vol. 16, no. 4, pp. 238–239, October 1984.
[17] S.M.A.R.T. Data Set, http://cmrr.ucsd.edu/people/hughes/smart/dataset/harddrive1.zip, viewed on November 20, 2010.