DESIGN AND IMPLEMENTATION OF A RELIABLE RISK ASSESSMENT TECHNIQUE FOR SOFTWARE PROJECTS

by Onur Kutlubay B.S. in Computer Engineering, Boğaziçi University, 2000

Submitted to the Department of Computer Engineering in partial fulfillment of the requirements for CMPE 690

Boğaziçi University 2004


DESIGN AND IMPLEMENTATION OF A RELIABLE RISK ASSESSMENT TECHNIQUE FOR SOFTWARE PROJECTS

APPROVED BY:

Dr. Ayşe B. Bener (Research Supervisor)
Prof. Ethem Alpaydın
Prof. Emin Anarım


ACKNOWLEDGEMENTS

First, I would like to thank my supervisor Dr. Ayşe Başar Bener for her guidance. This thesis would not have been possible without her encouragement and enthusiastic support. A great deal of progress on my research was made possible by the contributions of fellow researchers, especially Kemal Kaplan, Doğu Gül, Mehmet Balman and Olcay Taner Yıldız. Doğu and Mehmet provided help in the decision tree experiments carried out during the experimental evaluation and also in my literature research. Olcay also provided help in the decision tree experiments, especially in evaluating the results and verifying the experiments. Kemal dedicated his hours to the experiments done in this research. Thank you to Koray Helvacı for his reviews and proofreading efforts. I also thank all my colleagues in the Computer Networks Research Laboratory for their continuing suggestions and contributions to this thesis. My sincere appreciation goes out to all of my committee members: Prof. Ethem Alpaydın and Prof. Emin Anarım. I would like to thank the NASA IV & V Facility Metrics Data Program team members for their efforts in making such a data repository publicly available. Finally, I am deeply grateful to my family. They always give me endless love and support, which has helped me to overcome the various challenges along the way.


ABSTRACT

DESIGN AND IMPLEMENTATION OF A RELIABLE RISK ASSESSMENT TECHNIQUE FOR SOFTWARE PROJECTS

Discovering defects is one of the main struggles of the software development process. As the size of the software grows, preserving the quality and reliability of the software becomes more expensive to accomplish. Such software projects need more sophisticated testing and evaluation procedures. Software metrics supply considerable information about the quality of the code being generated. A detailed analysis of the metric data gathered from a software project both gives clues about the error-generating potential of specific code segments and helps reveal possible errors. The aim of this research is to establish a technique for identifying software defects using machine learning methods. The experiments in this research are done with the software metric data repository of the NASA IV & V Facility Metrics Data Program (MDP). The repository contains software metric data and error data at the function/method level. Throughout the tests of our model, the data set is first normalized and cleaned of correlated and irrelevant values, and then machine learning techniques are applied for error prediction. Our learning process incorporates k-nearest neighborhood, k-means, artificial neural networks and decision tree approaches. From the results of our experiments, we conclude that the proposed technique produces successful results in terms of defect prediction. Accordingly, the problem of predicting the magnitude of the impact that defects cause is divided into two subproblems: predicting whether there is a defect, and then predicting its magnitude value. Hence, improving the performance of the first prediction yields much better overall outcomes, since the second type of prediction is already successful using machine learning techniques.


ÖZET

DESIGN AND DEVELOPMENT OF A RELIABLE RISK ASSESSMENT TECHNIQUE FOR SOFTWARE PROJECTS

Finding and removing defects in software projects is one of the most challenging tasks of the software development process. As project sizes grow, keeping the quality and reliability of the produced software at an acceptable level requires far more resources. Such large projects need more comprehensive testing and evaluation procedures. Software metrics provide important information about the code being produced. A detailed analysis of the metric data obtained from a software project can give important clues both about the error-generating potential of code segments and about detecting those possible errors. The aim of this research is to develop a technique that identifies software defects using machine learning methods. The experiments in this research were carried out using databases of different projects from the NASA IV & V Facility Metrics Data Program. These databases contain metric data and defect data at the function/method level. While testing the proposed model, the data set was first normalized, and then highly correlated and irrelevant data were cleaned out. In the next step, machine learning techniques were applied to the resulting data set in order to predict defects. The learning process was carried out using k-nearest neighborhood, k-means, artificial neural network and decision tree methods. The experiments show that the proposed model produces successful results in defect prediction. In addition, the problem of predicting the magnitude of the impact that defects may cause is divided into two separate problems: determining whether there is a defect, and predicting that magnitude on the data items that are thought to contain defects. Thus, the success of the second type of prediction, which has been demonstrated by the experiments, can be increased further by improving the first type of prediction.


TABLE OF CONTENTS

ACKNOWLEDGEMENTS  iii
ABSTRACT  iv
ÖZET  v
TABLE OF CONTENTS  vi
LIST OF FIGURES  viii
LIST OF TABLES  ix
LIST OF ABBREVIATIONS  x
1. INTRODUCTION  1
1.1. Motivation  2
1.2. Outline  3
2. BACKGROUND  4
2.1. Software Measurement  4
2.2. Risk Issues  5
2.3. Software Metrics  6
2.3.1. Classification of Software Metrics  8
2.3.2. Metrics Programs  11
2.4. Principal Component Analysis  12
2.5. Machine Learning  13
2.5.1. K-Nearest Neighborhood Method  14
2.5.2. K-Means Method  15
2.5.3. Artificial Neural Networks, Multilayer Perceptron Method  17
2.5.4. Decision Trees and Regression Trees  18
2.6. Metrics and Software Risk Assessment  20
2.7. Applications of Machine Learning in Software Defect Prediction  21
2.8. Software Metric Data  22
3. PROBLEM STATEMENT  24
3.1. Software Defect Prediction  25
3.2. Evaluation of Experiment Results  26
3.2.1. Evaluation of Classification Experiments' Results  26
3.2.2. Evaluation of Regression Experiments' Results  26
4. METHODOLOGY  28
4.1. Initial System Design  28
4.2. Improved Learning System  29
5. EXPERIMENT RESULTS  32
5.1. Regression Results Using Complete Dataset  32
5.1.1. Multi Layer Perceptron  33
5.1.2. Decision Tree  34
5.2. Regression Results Using Dataset Containing Defected Items  36
5.2.1. Multi Layer Perceptron  36
5.2.2. Decision Tree  37
5.3. Classification Results  39
5.3.1. Principal Component Analysis  39
5.3.2. K-Nearest Neighborhood  41
5.3.3. K-Means  43
5.3.4. Multi Layer Perceptron  45
5.3.5. Decision Tree  47
5.3.6. Comparison of the Classification Methods  48
5.4. Testing of the Proposed Model  48
6. CONCLUSION  50
APPENDIX A: THE METRIC DATA REPOSITORY SPECIFICATIONS  52
APPENDIX B: SOURCE CODES  53
B.1. Shuffling Function  53
B.2. kNN Function  53
B.3. k-means Function  55
B.4. MLP Function  56
B.5. Decision Tree Function  58
APPENDIX C: CLASSIFICATION TREE (PRUNE LEVEL=10)  61
REFERENCES  62


LIST OF FIGURES

Figure 2.1. GQM Model for Software Metrics Program [34]  12
Figure 2.2. PCA, (a) Original sample dataset, (b) Transformed dataset  13
Figure 2.3. An 8-nearest neighborhood decision  15
Figure 2.4. K-means algorithm run example  16
Figure 3.1. The predictor system  24
Figure 4.1. The initial learning system  28
Figure 4.2. The improved learning system  29
Figure 5.1. Experiment results for MLP with respect to different number of hidden units  33
Figure 5.2. Experiment results for MLP using entire data set  34
Figure 5.3. Experiment results for decision tree with respect to different levels of pruning  35
Figure 5.4. Experiment results for decision tree using entire data set  35
Figure 5.5. Experiment results for MLP with respect to different number of hidden units  36
Figure 5.6. Experiment results for MLP using data set containing only defected items  37
Figure 5.7. Experiment results for decision tree with respect to different levels of pruning  38
Figure 5.8. Experiment results for decision tree using dataset with only defected items  38
Figure 5.9. Two dimensional projection of the first two components resulting from PCA  40
Figure 5.10. Performance graph of kNN algorithm in classification  42
Figure 5.11. Performance graph of k-means algorithm in classification  44
Figure C.1. Resulting classification tree with pruning level 10  61


LIST OF TABLES

Table 3.1. Performance evaluation of classification experiments  26
Table 5.1. Cumulative Eigenvalue percentages of the principal components  41
Table 5.2. kNN experiment results  42
Table 5.3. kNN experiment results with PCA  43
Table 5.4. k-means experiment results  44
Table 5.5. k-means experiment results with PCA  45
Table 5.6. MLP experiment results  46
Table 5.7. MLP experiment results with PCA  46
Table 5.8. Decision tree experiment results  47
Table 5.9. Comparison of the performance results  48
Table A.1. The features in the metric data repository  52


LIST OF ABBREVIATIONS

ANN    Artificial Neural Networks
BBN    Bayesian Belief Network
EW     Error Derivative of Weights
GQM    Goal/Question/Metric
kNN    k-Nearest Neighborhood
LOC    Lines of Code
MAE    Mean Absolute Error
MDP    Metrics Data Program
MLP    Multi Layer Perceptron
MSE    Mean Squared Error
PCA    Principal Component Analysis


1. INTRODUCTION

A survey carried out by the Standish Group in the mid-1990s, which included data from about 8,000 commercial projects, indicated that 53 percent of the projects were either over budget, behind schedule, or had fewer functions or features than originally specified, and that 31 percent of these projects were cancelled. On average, a cost increase of 189 percent and a schedule slippage of 222 percent were calculated at the time the projects were completed or cancelled [1]. These statistics show the importance of measuring the software early in its life cycle and taking the necessary precautions before such results come out.

For the software projects carried out in industry, an extensive metrics program is usually seen as unnecessary, and practitioners start to stress a metrics program only when things are bad or when there is a need to satisfy some external assessment body. On the academic side, little attention is devoted to the decision support power of software measurement. The results of these measurements are usually evaluated with naive methods like regression and correlation between values. However, models for assessing software risk in terms of predicting defects in a specific module or function have also been proposed in previous research [2, 3, 4]. Some recent models also utilize machine learning techniques for defect prediction [5, 6, 7]. The main drawback of using machine learning in software defect prediction is the scarcity of data: most companies do not share their software metric data with other organizations, so a useful database with a large amount of data cannot be formed. However, there are publicly available, well-established tools for extracting metrics such as size, McCabe's cyclomatic complexity and Halstead's program vocabulary. These tools help automate the data collection process in software projects.

A well-established metrics program yields better estimations of cost and schedule. The analysis of measured metrics is also a good indicator of possible defects in the software being developed. Testing is the most popular method for defect detection in most software projects. However, as the sizes of projects grow in terms of both lines of code and effort spent, the task of testing gets more difficult and computationally expensive, even with the use of sophisticated testing and evaluation procedures. Nevertheless, defects that are identified in previous segments of programs can be clustered according to their various properties and, most importantly, according to their severity.

If the relationship between the software metrics measured at a certain state and the properties of the defects can be formulated, it becomes possible to predict similar defects in other parts of the code being developed. The software metric data give us the values of specific variables that measure a specific module/function or the entire software. When combined with the weighted error/defect data, this data set becomes the input of a machine learning system. A learning system is defined as a system that is said to learn from experience with respect to some class of tasks and a performance measure, such that its performance at these tasks improves with experience [8]. To design a learning system, the data set in this work is divided into two parts: the training data set and the testing data set. Predictor functions are defined and trained using Multi Layer Perceptron and Decision Tree algorithms, and the results are evaluated with the testing data set.

1.1. Motivation

The motivation behind this thesis is to explore the possibility of implementing a generic technique for predicting software defects early in the project life cycle by looking at the metric data available in that specific context. Practitioners collect code metrics during software development processes, but the decision support power of the collected data is wasted in most organizations. When an extensive metrics program is utilized, defect data are collected as well. These defect data, combined with the data of other features, become a well-suited repository for using machine learning techniques to predict future defects. According to our observations, when data collection is done at the module/function level, most of the values in the defect repository are zero, meaning that the corresponding code segments are non-defective. Since the defect-related data are sparse in these kinds of data sets, machine learning techniques tend to underestimate the defect probabilities. Our research proposes a novel framework for making more accurate estimations of defectedness.

1.2. Outline

In the next chapter, background information is given about the topics related to this research. First, software measurement and software risk issues are discussed. Next, brief information about software metric theory is given. Principal component analysis (PCA) and the machine learning methods we use are presented in this chapter. Previous research on software risk assessment and on the usage of machine learning methods in defect prediction is also reviewed. The details about the software metric data used in this research conclude the chapter. In Chapter 3, the main goal of this research is presented and the software defect prediction paradigm is explained. The methods for evaluating the results of the experiments are also explained in detail. Our methodology is explained in Chapter 4. In this chapter, the main idea behind our model and a system overview are given, and the implementation of the stages in our model is detailed. At the end of this chapter, the organization of the data set and the features of the learning system are explained. Chapter 5 depicts the results of the experiments done in our research, together with an evaluation of the proposed model. Finally, in Chapter 6, a general evaluation of the proposed model is given, along with future work to improve its performance. The implementation of the framework from some different perspectives is also discussed.


2. BACKGROUND

Identifying and mitigating risks early in software development helps decrease long-term costs, especially unexpected ones, and prevents later disasters. Risk is defined as the possibility of loss. Given that a software project involves various parties such as the customer, the developer and the user, the concept of risk varies depending on the subject audience [12]. Risks like budget overruns and schedule misses mostly concern the customer and the developer, whereas risks like wrong functionality and performance shortfalls are more important for users. In this research our aim is to analyse the risks originating from the source code. It is clear that when these types of risks are realized, they have effects both on the customer-developer side and on the user side.

2.1. Software Measurement

A measure provides a quantitative indication of the amount or size of some attribute of a product or process. Effective management of any process requires quantification, measurement and modeling. As the tools and methodologies in software development advance, the demand for more robust and more reliable products increases. Technical staff and managers in the software industry are confronted with evolving technologies and higher levels of competition, both in the market and in the recruitment process. Meanwhile, they have to be concerned about endless requirements, too many modifications, unsatisfactory testing and training efforts, and tight budgets and schedules. Measurement by itself cannot solve these problems, but it brings a better understanding of the situation. Moreover, a comprehensive measurement of quality attributes can provide the foundation for product and process improvement activities [9].

The majority of software projects do not include effective measurement processes. If we do not measure, there is no way of determining whether we are improving. Managers in the software industry are concerned with common issues like developing meaningful project estimates, producing higher-quality systems and completing the product on time. By using measurement to establish a baseline, each of these issues becomes more manageable.


The most important aim of measurement is understanding the software and the software engineering processes in order to derive models for these processes and to examine the relationships between specific parameters. Increased understanding leads to better management of software projects and to improvements in the software engineering process [10]. According to their ability to help address specific business goals, software management efforts fall into three main groups: project management, process management and product engineering [9]. Project management is mostly concerned with setting and meeting achievable commitments regarding cost, schedule and quality; the key project management issues are creating plans and tracking the status and progress of the product. Process management is concerned with ensuring that the defined processes are being followed and with making the necessary improvements to these processes. Finally, the main objective of product engineering is ensuring acceptable values for product-related attributes like reliability, performance and stability. Software measurement is intrinsic to all of these management perspectives. Previous research presents models for software measurement methods from the process management viewpoint [9, 11]. Our research is mostly related to the other two groups, hence more detailed information regarding those perspectives will be presented throughout the text.

2.2. Risk Issues

Risk can be defined as the probability of failing to achieve objectives like functionality, performance, cost or schedule, together with the consequence of failing. There are numerous risk contributors in software projects. These contributors can be aggregated into six major groups regarding their characteristics [12]. The project-level risk group includes issues like excessive, unrealistic or unstable requirements and underestimation of project complexity. Another risk group, based on project attributes, includes issues like performance shortfalls and unrealistic cost or schedule. The management risk group consists of ineffective project management issues. The engineering risk group includes issues like ineffective integration, quality control, or systems engineering. Yet another risk group, arising from the work environment, includes issues like immature or untried design, process or technologies, inadequate work plans, inappropriate methods or tool selection and poor training. The last group consists mainly of the issues other than the ones defined in the above groups; examples of issues related to this group are inadequate or excessive documentation, legal or contractual issues, or unanticipated maintenance and support costs.

Many software projects fail to meet acceptable outcomes within the desired schedule and budget. Most of these failures can be avoided by the use of proper risk assessment and mitigation methods. Several risk assessment and management methods have been proposed in previous research [13, 14, 15]. Devoting serious effort to risk management can help software managers assess problem situations and formulate proactive solutions [15]. The most extensive attempts in software risk management have been made with a focus on the cost concept and on risk exposure, which is formulated as potential loss times probability of loss. There are examples of cost based models in previous research as well [16, 17].

The importance of risk management has also been recognized by some software engineering and quality standards. While these standards may not be the driving force in making risk management more common in industry, they represent an improving level of more systematic risk management. Among several standards issued by international authorities, the most important ones are the IEEE Standard 1074 for Developing Software Life Cycle Processes [18], the IEEE Standard 1058.1-1987 for project management plans [19], the ISO 9000-3 guideline (ISO 1991b) for applying the ISO 9001 standard [20] and the Capability Maturity Model (CMM) of the Software Engineering Institute [21]. All of the mentioned standards except ISO 9000-3 include the software risk management process as a mandatory practice and define the methodologies as well. Although ISO 9000-3 does not address risk explicitly, risk management would clearly contribute to its overall objectives.

2.3. Software Metrics

A metric is a quantitative measure of the degree to which a system, component, or process possesses a given attribute. The most important early work on software metrics was carried out by Maurice Halstead. Halstead put forward his famous code metrics such as program vocabulary, program length, program volume and effort [22]. Halstead's metrics have lost a significant amount of reputation, since they can only be measured late in the software project, after coding has been carried out, and since some of the assumptions that Halstead made were based on theories in cognitive psychology which are now regarded as somewhat suspicious.

Nevertheless, Halstead's work was very influential, since most researchers started to direct their attention to the software measurement issue after him. The next important development in this area came when Thomas McCabe showed that the control structure of a program, or module, could be metricated, and that the resulting number could be used to characterise the complexity of the control flow [23]. Some of the metrics that have emerged from this work are cyclomatic complexity, essential complexity and design complexity; the latter two are essentially extensions of the first. Cyclomatic complexity (v) can be calculated with Equation 2.1, given a flow graph G with e edges and n nodes, where each node is a block of code and each edge represents a possible transfer of control between blocks.

v(G) = e - n + 2    (2.1)
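As a minimal illustration of Equation 2.1 (this sketch is not taken from the thesis; the flow graph representation is an assumption), cyclomatic complexity can be computed directly from a control flow graph given as an adjacency list:

    # Cyclomatic complexity v(G) = e - n + 2 for a control flow graph.
    def cyclomatic_complexity(flow_graph):
        """flow_graph maps each node (basic block) to the list of its successor nodes."""
        n = len(flow_graph)                                   # number of nodes
        e = sum(len(succs) for succs in flow_graph.values())  # number of edges
        return e - n + 2

    # Example: an if/else construct; the entry block branches to two blocks that join at the exit.
    if_else = {
        "entry": ["then", "else"],
        "then": ["exit"],
        "else": ["exit"],
        "exit": [],
    }
    print(cyclomatic_complexity(if_else))  # 4 edges - 4 nodes + 2 = 2

A single two-way decision gives v(G) = 2, which matches the usual interpretation of cyclomatic complexity as one plus the number of decision points.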

Today's knowledge of software metrics started to accumulate at a growing pace at the beginning of the eighties, specifically with research carried out by Kafura et al. on system design metrics. System design metrics can be extracted from a system design and can be used to predict factors such as maintainability. Other than describing product or process parameters, ideal metrics should facilitate the development of models that are capable of predicting those parameters [24]. Good metrics should be simple and easily definable, so that evaluating the metric becomes clear. A reliable metric must also be objective. Another important property of a useful metric is the ease of obtaining the metric value, i.e., at a reasonable cost. Valid metrics are more useful in that they measure what they are intended to measure. Robustness is also an important indicator of a valuable metric: the metric must be relatively insensitive to insignificant changes in the process or product. In addition, for maximum utility in analytic studies and statistical analyses, metrics should have data values that belong to appropriate measurement scales [25].

There are many different models for software quality, but in almost all models reliability is one of the criteria, attributes or characteristics that is incorporated. IEEE 610.12-1990 defines reliability as "The ability of a system or component to perform its required functions under stated conditions for a specified period of time". Software reliability consists of three activities [26]:

1. Error prevention
2. Fault detection and removal
3. Measurements to maximize reliability, specifically measures that support the first two activities

There has been extensive work in measuring reliability using mean time between failures and mean time to failure [27]. Successful models have also been proposed for predicting error rates and reliability [27, 28]. In this research we also deal with fault detection issues, so in the big picture our research aims at improving software quality.

2.3.1. Classification of Software Metrics

Software projects have several phases in order to progress in a disciplined manner. These phases have different characteristics in terms of the tasks being done and the outcomes. Hence, the measurement viewpoints and the software metrics used in these phases also vary. The most common phases in software development and the corresponding metric types are [24]:

1. Management: Cost, schedule, progress, computer resource utilization
2. Requirements: Completeness, traceability, consistency, stability
3. Design: Size, complexity, modularity, coupling, cohesiveness
4. Code: Fault density, problem report analysis, standards compliance
5. Test: Coverage, sufficiency, failure rate, mean time to failure

These phase-related metric types are utilized with respect to the measurement aspects at the point where the assessment is being done. Another metric classification is by the means of how the measurement is performed. Software metrics are classified into two groups according to their measurement types. Direct measurement of an attribute or an entity involves no other attribute or entity. Examples of direct measures that are commonly used in software engineering are the length of source code, measured in lines of code; the duration of the testing process, measured by elapsed time in hours; and the number of defects discovered, measured by counting defects. In indirect measurement, the metric value of an attribute or an entity is calculated indirectly in terms of other attributes or entities. Examples of indirect measures commonly used in software engineering are module defect density, which is obtained by dividing the number of defects by the module size, and requirements stability, which can be calculated by dividing the number of initial requirements by the total number of requirements; a small illustration is given below.
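As a toy illustration of the indirect measures just mentioned (the numbers are hypothetical and not taken from the thesis data), module defect density and requirements stability are derived from direct counts:

    # Hypothetical direct measures for a single module
    defects_found = 3          # counted defects
    module_size_kloc = 1.2     # thousand lines of code
    initial_requirements = 40
    total_requirements = 55    # after change requests were added

    # Indirect measures derived from the direct ones
    defect_density = defects_found / module_size_kloc          # defects per KLOC
    requirements_stability = initial_requirements / total_requirements

    print(f"defect density: {defect_density:.2f} defects/KLOC")
    print(f"requirements stability: {requirements_stability:.2f}")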


Software metrics are also classified into groups according to their measurement scales. There are five major types of metrics in terms of the scales used in measuring them [29]. The nominal scale is a primitive form of measurement: entities are grouped into different categories based on the value of the attribute being measured, with no ordering among the attribute values but simply an attachment of labels. The ordinal scale is an empirical relation system that consists of classes that are ordered with respect to the attribute; in this scale group, the metric values have a ranking relation among each other without any quantitative comparison. The interval scale carries even more information than the nominal and ordinal scales: it preserves order, as in an ordinal scale, and it also preserves differences, but not ratios. The difference between two classes can be computed, but computing ratios is not sensible in this scale group. The ratio scale preserves ordering, the size of intervals between entities and ratios between entities. There is a zero element representing the absence of the attribute in the entity; the measurement mapping must start at zero and increase in equal intervals, known as units, and all arithmetic can be meaningfully applied to the classes in the range of the mapping. Yet another scale is the absolute scale, where the measurement is made simply by counting the number of elements in an entity set. Apart from the scale types depicted above, there are some rarely used scale types like the logarithmic scale. These classifications are important in software measurement because they help in understanding and visualizing the structure of the established metric program, and they form a framework for applying the right methods effectively depending on the assessed situation. There are also other classifications with respect to different viewpoints in the previous research [24].

Although the described classifications paint a meaningful picture of how to use metrics in measurement processes, software metrics may also be broadly classified as either product metrics or process metrics. Product metrics consist of all attributes of the product from requirements to the installed system. Process metrics, on the other hand, are measures of the software development process, such as development time or experience of the staff. In this research we are focusing on some of the product metrics. Some of the widely used product metrics, which are also the basis of our experiments in this research, are:

1. Size metrics attempt to quantify software size. Lines of code (LOC) is the most widely used metric for program size. The LOC metric is easily definable; however, there are a number of different definitions for LOC. These differences originate from the treatment of blank lines, comment lines and non-executable statements, as well as from how reused lines of code are counted. The most common definition of LOC is to count any line that is not blank or a comment [30]. Another important problem of the LOC metric is that its value cannot be measured until the coding process is completed. There are some other size metrics, like function points and system bang, that overcome this problem so that they can be measured earlier in the development process.

2. Complexity metrics measure program complexity. McCabe's measure is the commonly accepted complexity measure [23]. Cyclomatic complexity and extensions to it are widely used in the industry, while the information flow metric [31] and Halstead's metrics [22] are often studied as possible measures of software complexity.

3. Halstead's product metrics propose a unified set of metrics that apply to several aspects of programs, as well as to the overall software production effort [22]. Halstead defines program vocabulary as the sum of the number of unique operators and the number of unique operands in the program. Likewise, program length is calculated by summing the total number of operators and the total number of operands in the program. Another Halstead measure is program volume, which is calculated as in Equation 2.2, where V is program volume, N is program length and n is program vocabulary (a small computational sketch of these size and Halstead measures follows this list).

V = N * log2(n)    (2.2)

4. Quality metrics are the collection of metrics which measure properties like correctness, efficiency, portability, maintainability, reliability, etc. Software quality is a characteristic that can be measured at every phase of the software development cycle; defect metrics, reliability metrics and maintainability metrics are examples of metrics at some of these phases [32]. Our research is highly concerned with defect metrics. There is no effective procedure for counting the defects in a program, so counting the number of design changes, the number of errors detected by code inspections, the number of errors detected in program tests or the number of code changes required may be alternative measures. By establishing relationships between these metrics and other metrics that might be available earlier in the development cycle, the outcomes may be useful predictors of program quality.
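To make the size and Halstead definitions above concrete, the following is a minimal sketch, not taken from the thesis's own tooling; how real metric extractors tokenize operators and operands is an assumption here, and the example token lists are invented:

    import math

    def loc(source: str) -> int:
        """Count lines that are neither blank nor pure comments (the common LOC definition)."""
        return sum(1 for line in source.splitlines()
                   if line.strip() and not line.strip().startswith("#"))

    def halstead(operators, operands):
        """Compute Halstead vocabulary, length and volume (Equation 2.2) from token lists."""
        n1, n2 = len(set(operators)), len(set(operands))   # unique operators / operands
        N1, N2 = len(operators), len(operands)             # total operators / operands
        n = n1 + n2                                        # program vocabulary
        N = N1 + N2                                        # program length
        V = N * math.log2(n) if n > 1 else 0.0             # program volume
        return n, N, V

    source = "# add two numbers\nc = a + b\nprint(c)\n"
    ops = ["=", "+", "print", "()"]   # assumed operator tokenization
    opnds = ["c", "a", "b", "c"]      # assumed operand tokenization
    print(loc(source), halstead(ops, opnds))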

2.3.2. Metrics Programs

Due to the complexity of software products and processes, no single metric can lead to accurate predictions; predicting is a multi-variable problem. Software quality models use artificial intelligence and statistical techniques to predict a reliability measure that developers and managers are familiar with. The application of comprehensive data collection and metric analysis throughout the life cycle of a software project allows risks to be assessed better during the development phase. A metrics program can be developed that will help all levels of management to evaluate program progress [33].

A metrics program is simply a set of predefined activities for identifying, measuring and evaluating specific attributes of a software process or product [33]. Once it is decided to implement a metrics program, the next step is deciding how to do it. How a metrics program is developed determines its success or failure. Successful metrics programs generally begin by focusing on the problem. The Goal/Question/Metric (GQM) approach provides one of the most popular frameworks for developing a healthy metrics program [34]. The data utilized in our research were also accumulated within a metrics program based on the GQM framework. The framework was developed as a mechanism for formalizing the tasks of characterization, planning, construction, analysis, learning and feedback. The paradigm does not provide specific goals, but rather a framework for stating goals and refining them into questions, providing a specification for the data needed to help achieve the goals [34]. The main issues that this approach is concerned with are depicted in Figure 2.1.


Figure 2.1. GQM Model for Software Metrics Program [34]

The GQM paradigm consists of three steps. In the first step, the model tries to generate a set of goals based upon the needs of the organization. After generating the goals, the next step is deriving a set of questions whose answers will lead to accomplishing the goals. The last step is developing a set of metrics which provide the information needed to answer the questions.

2.4. Principal Component Analysis

Principal Component Analysis (PCA) is a multivariate procedure which rotates the data such that maximum variability is projected onto the axes, as depicted in Figure 2.2. In Figure 2.2, PCA is applied to a simple set of 2-D data to determine the principal axes. Although the technique is used with many dimensions, two-dimensional data makes it simpler to visualize. The red line in part (a) represents the direction of the first principal component and the green line the second. As seen from the figure, the first principal component lies along the line of greatest variation, and the second lies perpendicular to it. Where there are more than two dimensions, the third component will be both perpendicular to the first two and along the line of next greatest variation. By multiplying the original dataset by the principal components, the data is rotated so that the principal components lie along the axes, as seen in part (b).
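A minimal sketch of the rotation just described is given below; it obtains the principal components from the eigenvectors of the covariance matrix of the mean-centered data and projects the data onto the leading components. This is only an illustrative implementation, not the one used in the thesis experiments, and the toy data are invented:

    import numpy as np

    def pca(X, n_components):
        """Project X (rows = items, columns = metrics) onto its first principal components."""
        Xc = X - X.mean(axis=0)                      # center each metric
        cov = np.cov(Xc, rowvar=False)               # covariance matrix of the features
        eigvals, eigvecs = np.linalg.eigh(cov)       # eigen-decomposition (ascending order)
        order = np.argsort(eigvals)[::-1]            # sort components by explained variance
        components = eigvecs[:, order[:n_components]]
        explained = eigvals[order] / eigvals.sum()   # fraction of variance per component
        return Xc @ components, explained

    # Correlated 2-D toy data: the first component captures almost all of the variance.
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    X = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=200)])
    Z, explained = pca(X, n_components=1)
    print(Z.shape, explained[:2])

The cumulative version of the `explained` values corresponds to the cumulative eigenvalue percentages reported for the principal components later in Table 5.1.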


Figure 2.2. PCA, (a) Original sample dataset, (b) Transformed dataset

Essentially, a set of correlated variables is transformed into a set of uncorrelated variables which are ordered by decreasing variability. The uncorrelated variables are linear combinations of the original variables, and the last of these variables can be removed with minimum loss of real data. The main use of PCA is to reduce the dimensionality of a data set while retaining as much information as possible [35]. As a result, by selecting a number of the principal components, the data set is interpreted with a smaller number of dimensions.

2.5. Machine Learning

A learning system is a computer program that is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E [8]. For example, one of the experiment types in this research aims to establish a learning system for metric data such that the system predicts whether a function/module is defected or not. In such a setting, a system that learns to predict defectedness might improve its performance, as measured by its ability to guess defected functions/modules correctly at the class of tasks involving such predictions, through experience obtained by analyzing the characteristics of past metric data. This is not the only prediction type we are trying to achieve; our research also involves other classification experiments, besides regression experiments that try to predict a specific defect density value in the tested data set.

A typical learning system has four critical decision points in its design phase [8]. Since the definition implies that learning is improving with experience at some task, identifying the experience and the exact feature to be learned, finding a representation for the experience and deciding on the specific algorithm for learning underlie the decisions to be taken in the design process of the learning system. The answers given to these questions bring out possible considerations about the system. The learning system may rely on direct (supervised), indirect (reinforcement) or unsupervised training experience, and the system may be teacher controlled or learner controlled. The target function is then defined, e.g., predicting whether a module is defected or predicting the defect density value. After defining the target function, a representation of the target function can be derived, e.g., a polynomial, a linear function of features, an artificial neural network, etc. Finally, the learning algorithm can be determined, e.g., gradient descent, linear programming, etc. Among the many machine learning methods available in the literature, our research utilizes some of the most widely used methods that also best suit our input data set and intended outcomes. These machine learning methods are summarized in the following sections.

2.5.1. K-Nearest Neighborhood Method

The k-Nearest Neighborhood (kNN) method is a non-parametric classification method [8]. For a data record t to be classified, its k nearest neighbors are retrieved, and these form a neighborhood of t. Majority voting among the data records in the neighborhood is usually used to decide the classification of t, with or without distance-based weighting [8]. An example nearest neighborhood decision is shown in Figure 2.3.


Figure 2.3. An 8-nearest neighborhood decision

To apply kNN we need to choose an appropriate value for k, and the success of classification depends very much on this value. In a sense, the kNN method is biased by k. There are many ways of choosing the k value, but a simple one is to run the algorithm many times with different k values and choose the one with the best performance [8]. Using relatively large k values decreases the chance that the decision will be influenced by noisy training data records, but large values of k also reduce the sensitivity of the method. The distance metric used in the nearest neighborhood method can be simple Euclidean distance. This distance measure is often modified by scaling the features so that the spread of attribute values along each dimension is approximately the same.
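A minimal sketch of the majority-voting kNN classifier described above is shown below. The thesis's own kNN function is listed in Appendix B.2; this version is only illustrative, and the tiny metric vectors and labels are invented:

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x, k):
        """Classify a single metric vector x by majority vote among its k nearest neighbors."""
        dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training row
        nearest = np.argsort(dists)[:k]               # indices of the k closest records
        votes = Counter(y_train[nearest])
        return votes.most_common(1)[0][0]             # majority class (0 = non-defective, 1 = defective)

    # Toy data: two metric features per module, label 1 means defective.
    X_train = np.array([[1.0, 2.0], [1.2, 1.8], [6.0, 7.0], [5.5, 6.5]])
    y_train = np.array([0, 0, 1, 1])
    print(knn_predict(X_train, y_train, np.array([5.8, 6.8]), k=3))  # -> 1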

2.5.2. K-Means Method

K-means clustering is a commonly used partition clustering method. In k-means clustering, the target function is the average squared distance of the data items from their nearest cluster center point [36]. The assumption this method makes about the data set is that the records may be grouped in several regions within the n-dimensional space, and by finding the center points of these possible clusters it is possible to make predictions about records with respect to their closeness to a particular cluster center. The k-means algorithm initially takes a number of items from the data set equal to the final required number of clusters; in this step, these initial items are chosen such that the points are mutually farthest apart. Next, it examines each item in the data set and assigns it to one of the clusters depending on the minimum distance. The center position is recalculated every time an item is added to the cluster, and this continues until all the items are grouped into the final required number of clusters. Figure 2.4 depicts how the means move into the centers of the clusters through the iterative nature of the algorithm.

Figure 2.4. K-means algorithm run example

Although it can be proved that the procedure will always terminate, the k-means algorithm does not necessarily find the optimal configuration corresponding to the global objective function minimum [8]. The algorithm is also significantly sensitive to the initial, randomly selected cluster centers. The k-means algorithm can be run multiple times to reduce this effect.
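The sketch below shows the standard batch variant of k-means (Lloyd's algorithm: assign all points, then recompute all means). The thesis text describes an incremental center update and the thesis's own k-means function is listed in Appendix B.3, so this is only an illustrative approximation with invented toy data and random initial centers rather than the farthest-apart initialization described above:

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Cluster rows of X into k groups by iteratively reassigning points and recomputing means."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # random initial centers
        for _ in range(n_iter):
            # assign every item to its nearest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each center as the mean of its assigned items
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):                # converged
                break
            centers = new_centers
        return labels, centers

    X = np.vstack([np.random.default_rng(1).normal(0, 0.3, (20, 2)),
                   np.random.default_rng(2).normal(3, 0.3, (20, 2))])
    labels, centers = kmeans(X, k=2)
    print(centers)

Because the result depends on the initial centers, the run-it-multiple-times advice above applies directly: keeping the clustering with the smallest average squared distance to the centers mitigates bad initializations.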

2.5.3. Artificial Neural Networks, Multilayer Perceptron Method

Networks of non-linear elements, interconnected through adjustable weights, play a significant role in machine learning. These networks are called artificial neural networks (ANN) because the non-linear elements have as their inputs a weighted sum of the outputs of other elements, much like biological networks of neurons do [8]. An artificial neuron is a unit with many inputs and one output. The neuron has two main operation modes: training mode and using mode. In the training mode, the neuron can be trained to fire (or not) for particular input patterns. In the using mode, when a taught input pattern is detected at the input, its associated output becomes the current output. If the input pattern does not belong to the taught list of input patterns, the firing rule is used to determine whether to fire or not. A more complicated neuron has weighted values for the inputs, so that the effect each input has on decision making depends on the weight of that particular input. The weight of an input is a number which, when multiplied with the input, gives the weighted input. These weighted inputs are then added together and, if they exceed a pre-set threshold value, the neuron fires; in any other case the neuron does not fire. Perceptrons are also neurons with weighted inputs that have some additional, fixed pre-processing abilities, so that there are association units within the neuron whose task is to extract specific, localised features from the inputs [35].

Artificial neurons have the ability to adapt to a particular situation by changing their weights and/or thresholds. Various algorithms exist to let the neuron 'adapt'; the most used ones are the Delta rule, which is used in feed-forward networks, and back error propagation, which is used in feedback networks [35]. A feed-forward ANN allows signals to travel one way only, from input to output. There is no feedback (loops), i.e., the output of any layer does not affect that same layer. Feed-forward ANNs tend to be straightforward networks that associate inputs with outputs; they are extensively used in pattern recognition. This type of organisation is also referred to as bottom-up or top-down. Feedback networks can have signals travelling in both directions by introducing loops in the network. The algorithm we use in this research creates feed-forward networks.

Feed-forward neural networks provide a general framework for representing non-linear functional mappings between a set of input variables and a set of output variables. This is achieved by representing the nonlinear function of many variables in terms of compositions of nonlinear functions of a single variable, which are called unit activation functions.

Neural networks share a similar topological structure. Some of the neurons interface with the real world to receive inputs, other neurons provide the real world with the network's outputs, and all the rest of the neurons are hidden from view. There are usually a number of hidden layers between the input and output layers. The most convenient method for determining the best number of hidden layers is trial and error. If the number of hidden neurons is increased too much, the network will overfit, that is, it will have problems generalizing [35]. The training set of data will be memorized, making the network useless on new data sets.

In order to train a neural network to perform some task, the weights of each unit must be adjusted in such a way that the error between the desired output and the actual output is reduced. This process requires that the neural network compute the error derivative of the weights (EW); in other words, it must calculate how the error changes as each weight is increased or decreased slightly. The back propagation algorithm is the most widely used method for determining the EW [35].

Throughout the experiments in this research, the multi layer perceptron (MLP) method is used in the ANN experiments. Multilayer perceptrons are feed-forward neural networks trained with the standard back propagation algorithm. We used a linear output unit activation function, so that the output activity is proportional to the total weighted output. The back propagation networks were trained using scaled conjugate gradient optimization in our experiments.
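A minimal sketch of a one-hidden-layer MLP with a linear output unit, trained by plain batch gradient-descent back propagation, is given below. The thesis's own networks used scaled conjugate gradient optimization and are listed in Appendix B.4, so this is only illustrative, and the toy regression data are invented:

    import numpy as np

    class TinyMLP:
        """One hidden tanh layer, linear output, trained with batch gradient-descent backprop (MSE loss)."""
        def __init__(self, n_in, n_hidden, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0, 0.5, (n_in, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0, 0.5, (n_hidden, 1))
            self.b2 = np.zeros(1)

        def forward(self, X):
            self.h = np.tanh(X @ self.W1 + self.b1)   # hidden activations
            return self.h @ self.W2 + self.b2         # linear output unit

        def train(self, X, y, lr=0.05, epochs=2000):
            y = y.reshape(-1, 1)
            for _ in range(epochs):
                out = self.forward(X)
                d_out = 2 * (out - y) / len(X)                  # dMSE/d(output)
                dW2 = self.h.T @ d_out                          # error derivatives of weights (EW)
                d_h = (d_out @ self.W2.T) * (1 - self.h ** 2)   # backpropagate through tanh
                dW1 = X.T @ d_h
                self.W2 -= lr * dW2; self.b2 -= lr * d_out.sum(axis=0)
                self.W1 -= lr * dW1; self.b1 -= lr * d_h.sum(axis=0)

    # Toy regression: predict a noisy defect-density-like target from two metric features.
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, (200, 2))
    y = 0.5 * X[:, 0] - 0.3 * X[:, 1] ** 2 + 0.05 * rng.normal(size=200)
    net = TinyMLP(n_in=2, n_hidden=8)
    net.train(X, y)
    print(float(np.mean((net.forward(X).ravel() - y) ** 2)))  # training MSE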

2.5.4. Decision Trees and Regression Trees

Decision trees are one of the most popular approaches for both classification and regression type predictions [37]. They are generated based on specific rules. A decision tree is a classifier in a tree structure: internal nodes are tests (on input patterns) and the leaf nodes are categories (of patterns). The tree is computed with respect to the existing attributes. Decisions at each node are based on an attribute, with a branch for each possible outcome of that attribute. A decision tree can be regarded as a sequence of questions which leads to a final outcome. Each question depends on the previous question, and this leads to a branching in the decision tree. While generating the decision tree, the main goal is to minimize the average number of questions in each case; this increases the performance of prediction [8].

The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples. A good quantitative measure of the worth of an attribute is a statistical property called information gain, which measures how well a given attribute separates the training examples according to their target classification [37]. This measure is used to select among the candidate attributes at each step while growing the tree. In order to define information gain precisely, a measure commonly used in information theory, called entropy, is defined. The entropy value determines the level of uncertainty, and the degree of uncertainty is related to the success rate of predicting the result.

A decision tree can grow in such a way that each branch of the tree is just deep enough to perfectly classify the training examples [37]. While this is sometimes a reasonable outcome, it can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function. In either of these cases, the algorithm can produce trees that over-fit the training examples. Over-fitting is a significant practical difficulty for decision tree learning and many other learning methods. There are several approaches to avoiding over-fitting in decision tree learning. These can be grouped into two classes [37]: one group of approaches tries to stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data; another group allows the tree to over-fit the data and then post-prunes it. Although the first of these approaches might seem more direct, the second approach of post-pruning over-fit trees has been found to be more successful in practice [37], due to the difficulty in the first approach of estimating precisely when to stop growing the tree. With pruning, the output variable variance in the validation data is minimized by selecting a simpler tree than the one obtained when the tree-building algorithm stopped, but one that is equally accurate for predicting or classifying "new" observations. Regression trees are used for regression type predictions; they can be considered a variant of decision trees, designed to approximate real-valued functions instead of being used for classification tasks.
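The entropy and information gain measures mentioned above can be written out explicitly. The following sketch uses the standard textbook definitions rather than the thesis's Appendix B.5 code, and the module labels and the "LOC > 100" split threshold are invented for illustration:

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels, in bits."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(labels, split_groups):
        """Entropy reduction achieved by partitioning `labels` into `split_groups`."""
        total = len(labels)
        remainder = sum(len(g) / total * entropy(g) for g in split_groups)
        return entropy(labels) - remainder

    # Ten modules labelled defective (1) or not (0), split on an assumed "LOC > 100" test.
    labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
    large = [1, 1, 1, 0]         # modules with LOC > 100
    small = [0, 0, 0, 0, 1, 0]   # modules with LOC <= 100
    print(round(information_gain(labels, [large, small]), 3))  # about 0.257 bits

The tree-growing algorithm evaluates this gain for every candidate attribute (or threshold) at a node and tests the one with the highest gain, which is the attribute-selection step described above.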


2.6. Metrics and Software Risk Assessment

Software metrics are mostly used for the purposes of product quality and process efficiency analysis and risk assessment for software projects. Software metrics have many benefits, and one of the most significant is that they provide information for defect prediction. Metric analysis allows project managers to assess software risks. Currently there are numerous metrics for assessing software risks. Early research on software metrics focused mostly on the McCabe, Halstead and LOC metrics. Among many software metrics, these three categories contain the most widely used ones. In this work, we also decided to use a model mainly based on these metrics.

Metrics usually have definitions in terms of polynomial equations when they are not directly measured but derived from other metrics. Researchers have used the neural network approach to generate new metrics instead of using metrics that are based on certain polynomial equations [38]. This is introduced as an alternative method to overcome the challenge of deriving a polynomial that provides the desired characteristics. Modeling metrics with methods different from the traditional ones, especially by applying neural approaches, is much more significant in terms of developing new reusable metrics. Other causal methods have also been introduced for software measurement, stressing the power of these methods' cause-effect relationships and incorporating this power into predictive models [39].

Most software metrics activities are carried out for the purposes of risk analysis in software projects. Bayesian belief networks have also been used to make risk assessments in previous research [2]. The traditional classification and regression based approaches are criticized in terms of their decision support power, and Bayesian networks are suggested as a causal model that does not require new metrics other than basic metrics such as LOC, Halstead and McCabe in the learning process. It is argued that some metrics do not give the right prediction about the software's operational stage. For instance, there is no similar relation between the number of faults for the pre- and post-release versions of the software and the cyclomatic complexity. To overcome this problem, a Bayesian Belief Network (BBN) is used for defect modeling. In another study, the approach is to categorize metrics with respect to the models developed. The model is based on the fact that "software metrics alone are difficult to evaluate". Metrics are applied on three models, namely the "Complexity", "Risk" and "Test Targeting" models. Different results are obtained with respect to these models and each is evaluated distinctly [3]. It is shown that some metrics depict common features on software risk. Instead of using all the metrics adopted, a basic combination that represents a cluster can be used. PCA can be utilized to provide a means of normalizing and orthogonalizing the input data, and the ANN can be used for risk determination/classification [4].

2.7. Applications of Machine Learning in Software Defect Prediction

Defect prediction models can be classified according to the metrics used and the process step in the software life cycle. Most of the defect models use basic metrics such as the complexity and size of the software [40]. Testing metrics that are produced in the test phase are also used to estimate the sequence of defects [41]. Another approach is to investigate the quality of the design and implementation processes, on the premise that the quality of the design process is the best predictor of product quality [42, 43]. The main idea behind the prediction models is to estimate the reliability of the system, and to investigate the effect of the design and testing processes on the number of defects. Previous studies show that the metrics in all steps of the life cycle of a software project (design, implementation, testing, etc.) should be utilized and connected with specific dependencies [5]. Concentrating only on a specific metric or process level is not enough for a satisfactory prediction model.

Machine learning algorithms have been proven to be practical for poorly understood problem domains that have changing conditions with respect to many values and regularities. Since software problems can be formulated as learning processes and classified according to the characteristics of the defect, regular machine learning algorithms are applicable to prepare a probability distribution and analyze errors [5, 6]. Decision trees, ANN, BBN and clustering techniques such as kNN are examples of the most commonly used techniques for software defect prediction problems [6, 7, 8].

Machine learning algorithms can be used over program executions to detect the number of faulty runs, which helps locate the underlying defects. In this approach, executions are clustered according to their procedural and functional properties [44]. Machine learning is also used to generate models of program properties that are known to cause errors. Support vector and decision tree learning tools are implemented to classify and investigate the most relevant subsets of program properties [45]. The underlying intuition is that most of the properties leading to faulty conditions can be classified within a few groups. The technique consists of two steps: training and classification. Fault-relevant properties are utilized to generate a model, and this precomputed function selects the properties that are most likely to cause errors and defects in the software. Clustering over function call profiles is also used to determine which features enable a model to distinguish failures from non-failures [46]. Dynamic invariant detection is used to detect likely invariants from a test suite and to investigate violations that usually indicate an erroneous state. This method is also used to determine counterexamples and to find properties which lead to correct results for all conditions [47].

2.8. Software Metric Data

The data set used in this research is provided by the NASA IV&V Metrics Data Program – Metric Data Repository [48]. The data repository contains software metrics and associated error data at the function/method level. The repository stores and organizes the data which has been collected and validated by the Metrics Data Program (MDP). The association between the error data and the metrics data in the repository provides the opportunity to investigate the relationship of metrics, or combinations of metrics, to the software. The data that is made available to general users has been sanitized and authorized for publication through the MDP website by officials representing the projects from which the data originated. The database uses unique numeric identifiers to describe the individual error records and product entries. This level of abstraction allows data associations to be made without having to reveal specific information about the originating data.

The repository contains detailed metric data in terms of product metrics, object oriented class metrics, requirement metrics and defect/product association metrics. We specifically concentrate on product metrics and related defect metrics. The data portion that feeds the experiments in this research contains the mentioned metric data for the JM1 and CM1 projects. Some of the product metrics included in the data set are: McCabe metrics (Cyclomatic Complexity and Design Complexity); Halstead metrics (Halstead Content, Halstead Difficulty, Halstead Effort, Halstead Error Estimate, Halstead Length, Halstead Level, Halstead Programming Time and Halstead Volume); LOC metrics (Lines of Total Code, LOC Blank, Branch Count, LOC Comments, Number of Operands, Number of Unique Operands and Number of Unique Operators); and defect metrics (Error Count, Error Density, Number of Defects, with severity and priority information).

After constructing our data repository, we cleaned the data set against marginal values, which might have led our experiments to faulty results. For each type of feature in the database, the records containing feature values outside a range of ten standard deviations from the mean are deleted from the database. The details of this operation are given in the methodology section. Our analysis depends on machine learning techniques, so for this purpose we divided the data set into two groups: the training set and the testing set. These two groups used for the training and testing experiments are extracted randomly from the overall data set for each experiment by using a simple shuffle algorithm. This method provides us with randomly generated data sets, which are believed to contain evenly distributed numbers of defect data.
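A minimal sketch of this cleaning and splitting step, assuming the metric data is held in a numeric matrix dataAll with one row per function/module (the variable names are hypothetical; only the myShuffle routine is taken from Appendix B):

% Remove rows containing a value farther than ten standard deviations
% from the mean of its feature (marginal values)
mu = mean(dataAll);
sd = std(dataAll);
keep = all(abs(dataAll - repmat(mu, size(dataAll,1), 1)) <= ...
           repmat(10*sd, size(dataAll,1), 1), 2);
dataAll = dataAll(keep, :);

% Shuffle the rows and split into training and testing sets
dataAll  = myShuffle(dataAll);
trainSet = dataAll(1:6000, :);
testSet  = dataAll(6001:8000, :);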


3. PROBLEM STATEMENT

In this research our aim is to design and implement the learning system shown in Figure 3.1 to achieve correct predictions on test data using a data set containing defected and non-defected software functions/modules.

Figure 3.1. The predictor system

The prediction system consists of four main components:

1. Input data: The input data contains metric data about the software. In our research this data consists of code metric data and defect data. The defect data values are also the outputs that the system aims to predict. The system is not specifically designed for defect prediction; if any other quantitative data is supplied as the target value in the input data, the system should be capable of predicting the corresponding values too.

2. Re-organization of the input data: The functions in this component process the input data to make it available for an effective learning system. In our experiments, this component involves the cleaning, normalization and principal component analysis functions. The cleaning function aims to remove marginal values in the data set. For this purpose, the function eliminates the data records containing values outside a range of ten standard deviations from the mean in any of the features. The PCA function is used for normalizing and orthogonalizing the data set, thus eliminating the negative effects of correlations within the data set. The normalization function aims to bring all feature values into similar scales and is applied only when PCA is not used. The important point about this component is that its functions are applied based on the decisions made for the other component settings; i.e., the PCA function is used in experiments that utilize only the kNN and k-means algorithms as training methods, and neither normalization nor PCA is used in the decision tree experiments because they have undesired effects on performance there.

3. Training: The training component may involve any learning algorithm, such that the desired learning method can be plugged into the system by matching the input interface. In our research we have utilized the kNN, k-means, MLP and decision tree methods.

4. Output: The outputs of the system are the predicted values with respect to the target features. In this research the output of the learning experiments is the defect density value of a specific function/module. Since the output depends on the supplied input as mentioned before, its type can be altered with respect to the desired outcome. Therefore, in our research we defined boolean outputs (defected or non-defected) for some of the experiments which aim at classification.

3.1. Software Defect Prediction

The learning system in our research is concerned with predicting the defectedness of a given function/module. Two types of research can be conducted on the code based metrics in terms of defect prediction. The first one is predicting whether a given code segment is defected or not. The second one is predicting the magnitude of the possible defect, if any, with respect to various viewpoints such as density, severity or priority. Estimating the defect causing potential of a given software project has a very critical value for the reliability of the project. Our work in this research is primarily focused on the second type of predictions, but it also includes some major experiments on the first type of predictions to increase the performance of the ultimate goal.

3.2. Evaluation of Experiment Results

Given a training data set, a learning system can be set up. This system would come out with either a boolean value that indicates whether a test item is defected or non-defected, or a score that indicates how much a test item is defected. For both types of outcomes, we should have performance measures to evaluate the success of the system. Our evaluation mechanisms for both types of outcomes are described in the following sections.

3.2.1. Evaluation of Classification Experiments' Results

In this type of experiment the system predicts the defectedness of the test data. From such a prediction, it can be deduced that the system may be mistaken in two ways.

Table 3.1. Performance evaluation of classification experiments

Table 3.1 depicts the possible outcomes of a classification type prediction model. The model is correct for the outcomes residing in the first and the fourth cells because the prediction is realized, but when the predicted value does not match the actual value, that is counted as an error for the prediction model. For the sake of better evaluation we classified these errors into two groups. Type 1 errors occur when the predictor claims that there exists a defect but in fact there is not, and type 2 errors occur when the predictor misses a real defect. The percentage of correct predictions and the values of both type 1 and type 2 errors indicate the performance of the experiment.

3.2.2. Evaluation of Regression Experiments' Results

After predicting the score point, the results can be evaluated with respect to popular performance functions. The two most common options here are the Mean Absolute Error (mae) and the Mean Squared Error (mse). The mse is most commonly used in function approximation, so in this research we used mse for evaluating the performance of the regression experiments. The mse equals the mean of the squares of the deviations from the target, i.e.,

mse = (1/m) * sum_{i=1..m} (xi - T)^2                                (3.1)

where xi = ith value of a group of m values, and T = target or intended value for the product variable of interest.
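As a small illustration of both evaluation schemes, the following sketch (hypothetical variable names, not taken from the thesis appendix) computes the mse of equation (3.1) for a regression run and the type 1 / type 2 error counts for a classification run:

% Regression evaluation: predicted and target are vectors of equal length
mseVal = mean((predicted - target).^2);

% Classification evaluation: predictedClass and actualClass are 0/1 vectors
% (1 = defected, 0 = non-defected)
type1 = sum(predictedClass == 1 & actualClass == 0);   % claimed defect that does not exist
type2 = sum(predictedClass == 0 & actualClass == 1);   % missed real defect
correctRate = mean(predictedClass == actualClass);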


4. METHODOLOGY

Designing a learning system for defect prediction in software projects requires experimenting with different learning approaches as well as developing different solution settings with respect to the organization of the supplied data, the target functions and the parameters. Our goal in this research is to predict the defect density value subject to the other metric values of a given data record. We first designed an initial solution for the defect prediction problem, treating the problem as the prediction of the desired values using regression based techniques. Then, to improve the performance of the system, we divided the problem and developed separate solutions for the resulting two sub-problems. We also incorporated some other statistical methods to improve the success of the designed solution.

4.1. Initial System Design

Our aim in this research is predicting the defect density value of the functions/modules with respect to training data. Defect density is the total number of defects per 1,000 lines of code. For this purpose we have designed a learning system utilizing the ANN and decision tree machine learning methods for regression purposes. The architecture of this system is shown in Figure 4.1.

Figure 4.1. The initial learning system

The experiments done for the evaluation of this system are carried out using the entire data set, and the results show that the performance of both algorithms is not in an acceptable range, as detailed in the results section. The data set includes mostly non-defected modules, so there happens to be a bias towards underestimating the defect possibility in the prediction process. It is also apparent that any other input data set will have the same characteristic, since real life software projects are likely to have many more non-defected modules than defected ones.

4.2. Improved Learning System

Since the prediction power of the regression algorithms is not successful enough when they are used on the entire data set, we trained the system with the data set containing only defected items. Our purpose was to see the performance when the bias mentioned in the previous section is abolished. The results show that both the ANN and the decision tree methods perform well with this new data set.

Figure 4.2. The improved learning system

Therefore we divided the problem into two sub-problems and updated our system with respect to these problems as shown in Figure 4.2. The improved system has two steps aiming to solve the two problems:

1. Which items in the dataset are defected?
2. What are the defect density values for the items that are said to be defected?

In the first step, in order to generate a learning system that predicts the existence of possible errors, we used four widely used classification algorithms. kNN, k-means, ANN and decision tree approaches are utilized to classify the data set according to the defectedness criterion. The classification process produces two clusters into which the testing data set is fit. In these experiments the classification is done with respect to a threshold value, which is close to zero but is calculated internally by the experiments. This threshold point is the value where the performance of the classification algorithm is maximized. The first cluster contains the values less than this threshold, indicating that there is no defect, and the second cluster contains the values greater than the threshold, indicating that there is a defect. This threshold value may vary with respect to the input data set used, but it can be calculated throughout the experiments for any data set. The performance of this classification process is measured by the count of correct predictions compared to incorrect ones.

The first step uses the entire data set as input, and we observed that the ranges of the values in the data set vary widely. Some of the learning algorithms we intend to use are based on calculations of Euclidean distances in an n-dimensional space, so when calculating these distances, the scales of each dimension should affect the resulting value with the same ratio. For this purpose we first normalized the data set in the kNN and k-means experiments; as depicted in Figure 4.2, the normalization task is optional in the learning system. Again in the first step we observed that the data set contains highly correlated data. Additional experiments are therefore done in which the PCA method is incorporated to decrease the dimensionality of the data set while eliminating the correlations among the features. The results regarding the effects of using PCA before training are detailed in the results chapter.

The second step in this research is very similar to the initially proposed system. ANN and decision trees are utilized for regression purposes in this step. The only difference in this improved model is that this step uses the data values that are labeled as defected in the first step experiments.

In both steps, the system puts forward predictions using the necessary learning methods, and our model implies that the performance of the methods is calculated in each step and the predictions of the best performing method are selected as the outcome of the learning system. Our model predicts the possibly defected modules in a given data set, and it also gives an estimate of the defect density in each module that is predicted as defected. So the model helps concentrate the efforts on specific suspected parts of the code, so that a significant amount of time and resources can be saved in the software quality process.
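The normalization applied before the kNN and k-means experiments in the first step can be sketched as follows. A simple z-score scaling is shown; the exact scaling is an assumption, since the text only states that all feature values are brought into similar scales, and the variable names are hypothetical. Training-set statistics are applied to the test set so that the two sets remain comparable.

mu = mean(trainSet);
sd = std(trainSet);
sd(sd == 0) = 1;     % avoid division by zero for constant features
normTrain = (trainSet - repmat(mu, size(trainSet,1), 1)) ./ ...
            repmat(sd, size(trainSet,1), 1);
normTest  = (testSet  - repmat(mu, size(testSet,1), 1)) ./ ...
            repmat(sd, size(testSet,1), 1);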


5. EXPERIMENT RESULTS

The experiments and calculations in this research are done with the Matlab 6.5 Release 13 tool. Some standard functions embedded within the tool are utilized, and we have also developed functions of our own based on widely accepted algorithms. The source codes of the functions developed by us, and the necessary references to Matlab's or third party functions, are detailed in Appendix B.

The data set used in this research contains defect density data, which corresponds to the total number of defects per 1,000 lines of code. In this research we have used the software metric data set with this defect density data to predict the defect density value for a given project or module. As mentioned earlier, our analysis depends on machine learning techniques; therefore, we divided the data set into two groups, the training set and the testing set. These two groups used for the training and testing experiments are extracted randomly from the overall data set for each experiment by using a simple shuffle algorithm. This method provides us with randomly generated data sets, which are believed to contain evenly distributed numbers of defect data. In this research all experiments are done with 6,000 training data and 2,000 testing data. The experiments for each algorithm are repeated 30 times, and the mean values of the results are taken into account in the evaluation process. To assure the randomness of the data, the 6,000 training data and the 2,000 testing data are collected from a repository which is regenerated from the original data set by using the previously mentioned shuffling algorithm, which is based on random displacement of the rows in the data set.

5.1. Regression Results Using Complete Dataset

In designing the experiment set of the MLP algorithm, a neural network is generated using a linear function as the output unit activation function. Different numbers of hidden units are used in network generation to see their performances and to select the best, the alpha value is set to 0.01, and the experiments are done with 200 training cycles. In the experiment set of the decision tree algorithms, the Treefit and Treeprune functions are used consecutively. The method of the Treefit function is altered for classification and regression purposes respectively. Different pruning levels are tested to see their corresponding performances.

5.1.1. Multi Layer Perceptron

In the first type of experiments the ANN method does not bring out successful results. The average variance of the data sets which are generated randomly by the use of the shuffling algorithm is 1,402.21, and the mean mse value for the ANN experiments is 1,295.96 for 32 hidden units, where the performance is maximized. Figure 5.1 shows the mean values and standard deviations of the experiments done with different numbers of hidden units. The figure shows that selecting the hidden unit number as 32 gives the best results for regression experiments using MLP.

Figure 5.1. Experiment results for MLP with respect to different numbers of hidden units

This value is far from being acceptable, since the method fails to approximate the defect density values. Figure 5.2 depicts the scatter graph of the predicted values and the real values. According to this graph, it is clear that the method makes faulty predictions over the non-defected values. The points lying on the y-axis show that there is an unacceptable number of faulty predictions for non-defected values. Apart from failing to predict the non-defected ones, the method is also biased towards smaller approximations in its predictions for defected items, because a vast number of predictions lie under the line which depicts the correct predictions.

Figure 5.2. Experiment results for MLP using entire data set
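The MLP regression runs described in this subsection can be sketched as follows, assuming the Netlab toolbox of Nabney and Bishop (the library referenced for the knn routine in Appendix B); whether Netlab was used for the MLP experiments is an assumption, and the data variables below are hypothetical. The functions mlp, netopt and mlpfwd belong to Netlab.

% Build an MLP with 32 hidden units, a linear output unit and alpha = 0.01
% used as a weight-decay prior, then train for 200 cycles with scaled
% conjugate gradients and predict the defect density of the test items.
nin = size(trainIn, 2);
net = mlp(nin, 32, 1, 'linear', 0.01);
options     = zeros(1,18);   % default optimisation options vector
options(1)  = 1;             % display error values during training
options(14) = 200;           % number of training cycles
[net, options] = netopt(net, options, trainIn, trainTarget, 'scg');
predicted = mlpfwd(net, testIn);
mseVal    = mean((predicted - testTarget).^2);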

5.1.2. Decision Tree

The decision tree method similarly brings out unsuccessful results when the input is the complete data set, which contains both defected and non-defected items, with non-defected ones much more numerous. The average variance of the data sets is 1,353.27 and the mean mse value for the decision tree experiments is 1,316.42 when the resulting tree is pruned with 10 levels. Figure 5.3 depicts the experiment results with respect to different levels of pruning. Increasing the pruning level up to a value of 10 results in an increase in the performance of the regression experiments, but after that value the performance drops drastically. The standard deviations of the experiments are also high, which means that the results are not very stable for this type of experiment.
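A minimal sketch of this decision tree regression run, assuming the Statistics Toolbox functions Treefit and Treeprune named above, together with treeval for prediction (treeval is an assumption, since the text lists only Treefit and Treeprune); the data variables are hypothetical:

% Grow a regression tree on the training metrics, prune it by 10 levels
% and evaluate the pruned tree on the test items.
tree = treefit(trainIn, trainTarget, 'method', 'regression');
tree = treeprune(tree, 'level', 10);
predicted = treeval(tree, testIn);
mseVal    = mean((predicted - testTarget).^2);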


Figure 5.3. Experiment results for decision tree with respect to different levels of pruning

This result is slightly worse than the ANN result. Figure 5.4 shows the predictions done by the decision tree method and the real values.

Figure 5.4. Experiment results for decision tree using entire data set

Like the ANN method, the decision tree method also misses predicting the non-defected values. Moreover, the decision tree method makes many more non-defected predictions for items whose real values show that they are defected. The effect of the input data set, described above as a bias towards the zero value, is not as strong as in the ANN case.

5.2. Regression Results Using Dataset Containing Defected Items

The second type of experiments is done with input data sets which contain only defected items. The results for both the ANN and decision tree methods are more successful than in the first type of experiments. These experiments are done to see and verify the success of the proposed two step model. They constitute the second step of the proposed model, and the problem is reduced to improving the performance of the first step, since these experiments prove that the second step is successful.

5.2.1. Multi Layer Perceptron

The average variance of the data sets used in the ANN experiments is 1,637.41 and the mean mse value is 262.61 when the number of hidden units is 32. Figure 5.5 shows the results of the experiments done using different numbers of hidden units.

Figure 5.5. Experiment results for MLP with respect to different numbers of hidden units


The performance of the training experiments is maximized when the number of hidden units is 32. The performance climbs slowly until this value and drops more quickly after 32. This result shows that the network starts to memorize the data set rather than learning it. According to these results the MLP algorithm approximates the error density values well when only defected items reside in the input data set. It also shows that the dense non-defected data affects the prediction capability of the algorithm in a negative manner. Figure 5.6 shows the predicted values and the real values after the ANN experiments are run. The algorithm estimates the defect density value better for smaller values, as seen from the graph, where the scatter deviates more from the line that depicts the correct predictions for higher values of defect density.

Figure 5.6. Experiment results for MLP using data set containing only defected items

5.2.2. Decision Tree

The average variance of the data sets in the decision tree experiments is 1,656.23 and the mean mse value is 237.68 when the pruning level is 10. Figure 5.7 shows the means and standard deviations of the mse values with respect to different levels of pruning.


Figure 5.7. Experiment results for decision tree with respect to different levels of pruning

Figure 5.8. Experiment results for decision tree using dataset with only defected items

The performance of the decision tree experiments is maximized when the pruning level is 10 for the experiments done with the dataset containing only the defected items.


Like the ANN experiments, the decision tree method is also successful in predicting the defect density values when only defected items are included in the input data set. Figure 5.8 depicts the experiment results; according to this figure the decision tree algorithm gives more accurate results than the ANN method for almost half of the samples. However, the spread of the erroneous predictions shows that their deviations are larger than the ANN's. Like the ANN method, the decision tree method also shows increasing deviations from the real values as the defect density values increase.

5.3. Classification Results

In the third type of experiments the problem is reduced to only predicting whether a module is defected or not. For this purpose the kNN, k-means, ANN and decision tree algorithms are used to classify the testing data set into two clusters. The value that divides the output data set into two clusters is calculated dynamically: it is selected among various candidate values according to their performance in clustering the data set correctly. After several experiment runs, the performance of the clustering algorithm is measured with respect to these values and the best one is selected as the point which generates the two clusters; lower values are non-defected and the others are defected. In addition to these experiments, the kNN, k-means and ANN experiments are also carried out with the transformed dataset resulting from the PCA algorithm. The effects of normalizing and orthogonalizing the dataset using PCA are investigated in terms of performance. The results of the experiments with and without PCA are depicted together for each of the methods in the following sections.

5.3.1. Principal Component Analysis

The PCA results of a sample run show that, when the multivariate data is projected onto a two dimensional coordinate system with respect to the first two principal components, the data cannot be easily separated according to the defect density values.


Figure 5.9. Two dimensional projection of the first two components resulting from PCA

The non-clustering nature of the input data shows that predicting the defected ones is a fairly difficult task. Figure 5.9 depicts the characteristics of the transformed data set projected onto a two dimensional system. The zero data points and the one data points, which represent the non-defected and the defected items respectively, do not cluster separately according to the figure. The PCA results also show that a small number of the transformed components represent more than 98% of the total variation in the original data set, as listed in Table 5.1. This is sufficient for the data set not to lose significant information; besides, it can be represented with only the first ten principal components, whereas the total number of features in the original data set was 20. The classification experiments are done using the features that result from the PCA as well as using all features without doing PCA. The results of these two types of experiments are compared to decide whether utilizing PCA increases the overall classification performance or not.
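The PCA step can be sketched with the princomp function already referenced in Appendix B; the 98% cut-off below reflects the observation above, and the variable names are hypothetical:

% Project the (normalized) metric data onto its principal components and keep
% the smallest number of components covering at least 98% of the variance.
[pc, scores, latent] = princomp(normTrain);
explained = 100 * cumsum(latent) / sum(latent);
nComp     = min(find(explained >= 98));
reducedTrain = scores(:, 1:nComp);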

Table 5.1. Cumulative Eigenvalue percentages of the principal components

5.3.2. K-Nearest Neighborhood

The results for 30 consecutive runs of the kNN algorithm on a data set consisting of 6,000 training and 2,000 testing data are summarized in Table 5.2, and the performance of these experiments is depicted in Figure 5.10. Although the performance of the kNN algorithm increases as k grows, the algorithm is inefficient in defect prediction; in particular, the kNN algorithm performs poorly in terms of type 2 errors. The algorithm is not successful in predicting the really defected items.

Table 5.2. kNN experiment results

The kNN algorithm is relatively more successful at making correct predictions with greater k values, but it lacks performance in terms of predicting the really defected items.

Figure 5.10. Performance graph of kNN algorithm in classification

The overall performance of the algorithm is low because it succeeds in predicting only 123 out of 425 real defects at its best. When k is greater, type 1 errors are low. This is also consistent with the fact that the algorithm puts forward fewer defect predictions as k increases. For a k value of 10, the algorithm makes 145 defect predictions and its overall success is 80.4 percent. This value is even less than the percentage of the non-defected items in the dataset, so an algorithm that simply labels all tested values as non-defected shows better performance than the kNN algorithm. Given that the algorithm makes far fewer defect predictions than it should, it is straightforward to understand why it does not have high type 1 errors but has more type 2 errors. It is also evident that the success rates of smaller k's on defected items are better, since the number of defect predictions made is higher; however, these experiments with smaller k values have worse overall performance because their type 1 errors are too high.

Applying PCA before the kNN algorithm changes the performance results, but these changes are not significant; hence the performance of the kNN experiments with PCA is still not good enough in terms of defect prediction. The results are shown in Table 5.3.

Table 5.3. kNN experiment results with PCA

The kNN experiments with PCA show a performance value of 81.70% at best, when the k value is 10. The same characteristics exist as in the case without PCA: the total number of defect predictions decreases continuously as k increases, and so does the successful prediction rate on defected items.

5.3.3. K-Means

The results for 30 consecutive runs of the k-means algorithm on a data set consisting of 6,000 training and 2,000 testing data are summarized in Table 5.4, and the corresponding performance values are depicted in Figure 5.11. The k-means algorithm performs slightly better than the kNN algorithm in predicting errors; in particular, it does not make as many type 1 errors as kNN does. The results show characteristics similar to those of the kNN algorithm, since type 1 error counts are again very low with respect to type 2 error counts. The performance of k-means does not vary much with respect to the value of k, especially when the value of k is 5 or greater.

Table 5.4. k-means experiment results

The predictions of the k-means algorithm are also not in acceptable ranges but once again, the algorithm does not label non-defected items as defected by mistake.

Figure 5.11. Performance graph of k-means algorithm in classification

The overall performance of this algorithm is also low, because it succeeds in predicting only 115 out of 413 real defects in its best case. Type 1 errors are low, which shows that the k-means algorithm also puts forward fewer predictions than expected. For a k value of 8, where the overall performance is at its maximum, the algorithm makes 149 defect predictions, which is also the maximum number of predictions among the experiments done using different k values. The success rate of the algorithm at a k value of 8 is 83.65%. This value can be considered low when compared with the total percentage of the non-defected items. Despite the fact that the algorithm performs well in predicting non-defected items, with a percentage of 97.91, it lacks performance in predicting the defected items, where the success rate is 25.36%. This is due to the fact that the algorithm makes only 149 defect predictions where it is supposed to make 418 predictions.

Applying PCA before the k-means algorithm has a positive effect on the performance results, but these changes are again not significant. The performance of the k-means experiments with PCA is not in an acceptable range in terms of defect prediction. The results are shown in Table 5.5. The k-means experiments with PCA show a performance value of 84.70% at best, when the k value is 8. As in the case without PCA, the total number of defect predictions is lower than expected.

Table 5.5. k-means experiment results with PCA

5.3.4. Multi Layer Perceptron

In the ANN experiments the clustering algorithm is partly successful in predicting the defected items. The mean percentage of correct predictions is 88.35% for the ANN experiments. The mean percentage of correct defected predictions is 54.44%, whereas the mean percentage of correct non-defected predictions is 97.28%. These results show that the method is very successful in finding out the non-defected items. The performance results of the MLP experiments are shown in Table 5.6.

Table 5.6. MLP experiment results

The performance of the MLP algorithm is maximized when the threshold point which separates the two clusters is 5. This value is found by trying different values between 1 and 10. The results in Table 5.6 are generated by classifying the dataset with respect to having a defect density value greater than 5 or less than 5. The MLP algorithm is better than both of the algorithms used before. Nevertheless, the results show characteristics similar to those of the kNN and k-means algorithms, since type 2 error counts are again high with respect to type 1 error counts. This is due to the fact that this algorithm also makes fewer predictions than expected. The algorithm succeeds in predicting more than half of the defected items.

Table 5.7. MLP experiment results with PCA

The overall success rate of the MLP algorithm makes it a better method for prediction, but it still has to be improved in terms of making more correct predictions. When PCA is applied to the dataset before the MLP method is used, the results are better, as shown in Table 5.7. The average success rate of the MLP algorithm in predicting defected and non-defected items in the dataset is 90.95%. The algorithm manages to find two out of every three really defected items. The algorithm also rarely fails to predict non-defected items, where its success rate is 98.10%. Despite these successful results, the MLP algorithm still makes fewer defect predictions than the real defected item count, so the total number of type 2 errors is still high with respect to type 1 errors.

5.3.5. Decision Tree

The decision tree method is more successful than all of the methods applied before in this type of experiment. The mean percentage of correct predictions is 91.75% for the decision tree experiments. The mean percentage of correct defected predictions is 79.38%, and the mean percentage of correct non-defected predictions is 95.06%. According to these results it can be concluded that the experiments using the decision tree method for classification are more successful than the experiments using the other methods. The performance results of the decision tree method are shown in Table 5.8.

Table 5.8. Decision tree experiment results

The decision tree method differs from the other methods mostly in the number of defect predictions it makes. The number of defect predictions is almost equal to the total number of defects in the dataset. Type 1 error counts are higher than those of the other three algorithms, but this is a result of making more predictions, so it does not mean worse performance; the success rate on defected items is a better way of quantifying these results. The decision tree algorithm finds almost four out of every five defects in the dataset. It is clear from the results that the decision tree algorithm is not as successful as the MLP algorithm in terms of finding the non-defected items. This is also a result of the higher type 1 errors. The other three algorithms all had very low type 1 errors and very high type 2 errors, whereas the decision tree algorithm has similar values for type 1 and type 2 errors. Its type 2 error counts are lower than the other methods', whereas it makes more type 1 errors than the others.

5.3.6. Comparison of the Classification Methods

The decision tree algorithm is selected as the best classifier among the four methods tested in the classification step. It is also worth noting that the experiments carried out are dependent on the dataset used, so one of the other algorithms might be selected for classification in a different problem domain. Table 5.9 shows the classification performance results of all four algorithms, taking their best performances into account.

Table 5.9. Comparison of the performance results

According to Table 5.9 the decision tree algorithm is the best of the four in terms of overall performance, as stated earlier. The decision tree algorithm is also the best at identifying the defected items in the dataset, but the k-means and MLP algorithms seem to be better at predicting the non-defected items. This result arises from the fact that those algorithms make fewer defect predictions, and making fewer predictions leads to fewer mistakes.

5.4. Testing of the Proposed Model

The last group of experiments is done by using the decision tree method, which was selected as the best method for both the classification and the regression tasks.


In the first step of these experiments, the entire dataset is classified with respect to being defected by using the decision tree algorithm with a pruning level of 10. The items which are classified as defected are collected in a new dataset as a result of this first step. Then, in the second step of these experiments, this new dataset is used for regression purposes to estimate the defect density values of the items. These experiments are also done by using the decision tree algorithm. The average variance of the data sets in the decision tree experiments is 1,521.06 and the mean mse value is 341.35. These results show that our model achieves success in terms of estimating the defect density values. The mean mse value is much less than the mean mse value of the first type of experiments, even though the average variances of the datasets are not very different. The mean mse of the experiment is higher than the mean mse values of the second type of regression experiments, which is also an expected result. But again, the model is comparatively successful, because the mean mse value does not increase too much despite the fact that the classification task generates some error prior to the regression runs. This shows that if the performance of the classification step increases by enhancing the techniques used, the overall performance of the model will move towards the values acquired from the regression experiments done with datasets which contain only defected items.
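The two-step run described above can be sketched as follows. For brevity the first step is shown here as thresholding the output of a regression tree at a value thr (the threshold is selected internally where the classification performance is maximized), the treeval function is assumed for prediction alongside the Treefit and Treeprune functions named earlier, and the variable names are hypothetical.

% Step 1: predict which test items are defected.
ctree   = treefit(trainIn, trainDensity, 'method', 'regression');
ctree   = treeprune(ctree, 'level', 10);
predLab = treeval(ctree, testIn) > thr;            % 1 = predicted defected

% Step 2: estimate defect density, training only on the defected items.
defIdx  = trainDensity > 0;
rtree   = treefit(trainIn(defIdx,:), trainDensity(defIdx), 'method', 'regression');
rtree   = treeprune(rtree, 'level', 10);
predDensity = zeros(size(predLab));
predDensity(predLab) = treeval(rtree, testIn(predLab,:));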


6. CONCLUSION

In this research, we designed and implemented a new defect prediction system based on machine learning methods. According to the evaluation experiments, the MLP and decision tree results for regression (function approximation to the defect density values) have many more wrong defect predictions when applied to the entire data set. Since most modules in the input data have zero defects (80% of the whole data), the applied machine learning methods fail to predict values within the expected performance. The data set is already 80% non-defected, so even an algorithm that claims every test item is non-defected, without trying to learn at all, is guaranteed 80% success. Therefore the logic behind this learning methodology fails, and a different methodology which can manage such data sets of software metrics is required. Instead of predicting the defect density value of a given module directly, first trying to find whether a module is defected, and then estimating the magnitude of the defect, seems to be an enhanced technique for such data sets.

Metric values of modules that have zero defects, and likewise of defected modules, are very similar to each other, so it is much easier to learn the defectedness probability. Moreover, it is also much easier to learn the magnitude of the defects while training on the modules that are known to be defected. The training set of software metrics has mostly modules with zero or very small defect densities. So, the defect density values can be classified into two clusters as defected and non-defected sets. This partitioning enhances the performance of the learning process and enables the regression methods to work only on training data consisting of modules that are predicted as defected in the first processing. Clustering as defected and non-defected based on a threshold value enhances the learning and estimation in the classification process. This threshold value is set automatically within the learning process; it is an equilibrium point where the learning performance is at its maximum.

In our specific experiment dataset we observed that the decision tree algorithm performs better than the kNN, k-means and MLP algorithms, both in classifying the items in the dataset with respect to being defected and in estimating the defect density of the items that are thought to be defected. The decision tree algorithm also generates rules in the classification process. These rules are used for deciding which branches to select towards the leaf nodes in the tree, and the effects of all features in the dataset can be observed by looking at these rules.

As a result, the model can be summarized in two steps: the first predicts whether a module is defected or not using clustering techniques, and the second predicts the magnitude of the defect. By this approach, along with predicting which modules are defected, the model generates estimations of the defect magnitudes. Software practitioners may use these estimation values in making decisions about the resources and effort in software quality processes such as testing. Our model constitutes a sound risk assessment technique for software projects, based on the code metrics data about the project.

As future work, different machine learning algorithms or improved versions of the used machine learning algorithms may be included in the experiments. The algorithms used in our evaluation experiments are the simplest forms of some widely used methods. This model can also be applied to other risk assessment procedures, which can be supplied as input to the system. Certainly these risk issues should have quantitative representations to be considered as input for our system.


APPENDIX A: THE METRIC DATA REPOSITORY SPECIFICATIONS

The metric data repository contains more than 2,700 error reports spanning 8 years of measurement, related to approximately 14,000 modules/functions. The error reports in the repository are associated with 2,400 different modules/functions, so in the metric database their outcomes are accumulated as metrics such as defect density and defect severity, together with the respective product/module metrics. The descriptions of the 22 metric features in the database are given in Table A.1.

Table A.1. The features in the metric data repository


APPENDIX B: SOURCE CODES

The experiments in this research are done using the Matlab tool. The source codes that were developed for running the experiments are listed in the following sections. Some of Matlab's intrinsic functions and other third party functions are utilized in the codes. The references to these third party codes are given as comment lines in the source code.

B.1. Shuffling Function

function outarg = myShuffle(dataAll)
%shuffle
for s = 1:size(dataAll,1)
    p = fix(size(dataAll,1)*rand) + 1;
    a = dataAll(p,:);
    dataAll(p,:) = dataAll(s,:);
    dataAll(s,:) = a;
end;
outarg = dataAll;
return;

B.2. kNN Function

function outarg = myKNN(dataAll, trainSize, testSize, ...
                        classLabelColumn, featureRange, PCA, PCADim)

k_of_knn = 10;
dataAll = myShuffle(dataAll);
% Get training class information
classTr = dataAll(1:trainSize, classLabelColumn);
% Get training part of the data
dataTr = dataAll(1:trainSize, featureRange);
% Get test class information
classTe = dataAll(trainSize+1:trainSize+testSize, classLabelColumn);
% Get test part of the data
dataTe = dataAll(trainSize+1:trainSize+testSize, featureRange);

if (PCA == 1)
    % princomp function : J. Edward Jackson, A User's Guide to Principal
    % Components, John Wiley & Sons, Inc. 1991 pp. 1-25.
    % [pc, score, latent, tsquare] = princomp(x);

    % Perform PCA over training data
    [pc, zscores, pcvars] = princomp(dataTr);
    % Get first PCADim from Principal Components
    dataTr = zscores(:,1:PCADim);
    % Perform PCA over test data
    [pc, zscores, pcvars] = princomp(dataTe);
    % Get first PCADim from Principal Components
    dataTe = zscores(:,1:PCADim);
end;

% knn : Christopher M Bishop, Ian T Nabney (1996, 1997)
% knn(xtrain, xtest, t, kmax)
y = knn(dataTr, dataTe, horzcat(~classTr,classTr), k_of_knn);
y = y - 1;
comp = (classTe)*ones(1,k_of_knn);

fprintf(1,'\n\nTotal error for knn \n');
fprintf(1,'%4d', testSize - sum(y==comp));
fprintf(1,'\n\nTotal type 1 error for knn \n');
fprintf(1,'%4d', sum(y>comp));
fprintf(1,'\n\nTotal type 2 error for knn \n');
fprintf(1,'%4d', sum(y<comp));

outarg = y;   % predicted class labels for k = 1..k_of_knn (assumed return value)
return;

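A possible call of this function, with purely illustrative column indices (here a binary defected/non-defected label is assumed to occupy the last column of a 23-column data matrix and the metric features to occupy columns 1 to 22), using 6,000 training and 2,000 testing records, PCA enabled and ten principal components retained:

outarg = myKNN(dataAll, 6000, 2000, 23, 1:22, 1, 10);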
