Software-Effort Estimation: An Exploratory Study of Expert Performance

Steven S. Vicinanza

Energy Management Associates, Inc., 100 Northcreek, Atlanta, Georgia 30327

Tridas Mukhopadhyay

Center for Management of Technology, Graduate School of Industrial Administration, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

Michael J. Prietula

Center for Management of Technology, Graduate School of Industrial Administration, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213

An exploratory study was conducted (a) to examine whether experienced software managers could generate accurate estimates of effort required for proposed software projects and (b) to document the strategies they bring to bear in their estimations. Five experienced software project managers served as expert subjects for the study. Each manager was first asked to sort a set of 37 commonly-used estimation parameters according to the importance of their effect on effort estimation. Once this task was completed, the manager was then presented with data from ten actual software projects, one at a time, and asked to estimate the effort (in worker-months) required to complete the projects. The project sizes ranged from 39,000 to 450,000 lines of code and varied from 23 to 1,107 worker-months to complete. All managers were tested individually. The results were compared to those of two popular analytical models—Function Points and COCOMO. Results show that the managers made more accurate estimates than the uncalibrated analytical models. Additionally, a process-tracing analysis revealed that the managers used two dissimilar types of strategies to solve the estimation problems—algorithmic and analogical. Four managers invoked algorithmic strategies, which relied on the selection of a base productivity rate as an anchor that was further adjusted to compensate for productivity factors impacting the project. The fifth manager invoked analogical strategies, which did not rely on a base productivity rate as an anchor, but centered around the analysis of the Function Point data to assist in retrieving information regarding a similar, previously-managed project. The manager using the latter, analogical reasoning approach produced the most accurate estimates.

Software-effort estimation—Expert reasoning—Analogical reasoning

1047-7047/91/0204/0243/$01.25 Copyright © 1991, The Institute of Management Sciences


1. Introduction

Attempts to develop models for estimating software-development effort (and hence cost) and assessing the impact of productivity factors have been the focus of much research (Kemerer 1991). In a recent paper, Côté, Bourque, Oligny and Rivard (1988) identified over 20 software-effort models in the literature. The most common approach taken by researchers in this arena involves using historical project data to develop algorithmic models that estimate development effort as a function of a number of project attributes. The existing algorithmic models, however, fail to produce accurate estimates of the development effort (e.g., Kemerer 1987).

There are at least four reasons why the algorithmic models do not perform well. First, there is the problem of software sizing. Obtaining consistent size estimates (in lines of code or Function Points) and interpreting such estimates require high skill levels and good judgment. Second, there is the issue of historical data. Most researchers have used (and require) a large volume of site-specific historical data to build or calibrate (i.e., tune) their models. When such models are transported to other sites, used at other times, or applied to other sets of projects, these models lose their validity and the resultant estimates become inaccurate. Third is the problem of data analysis. The dominant method of data analysis has been linear regression. A regression analysis may expose a statistical relationship between development factors and effort, but it provides little understanding of why or under what conditions the relationship exists. Furthermore, the assumptions underlying this approach (e.g., factor independence) are often questionable. Finally, there is the problem of task complexity. Many factors can affect the effort required. Different models capture different sets of factors. For example, in an analysis of five models, Wrigley and Dexter (1987) identified 74 factors indicating "the diversity and ambiguity surrounding these models." Despite the large number of models developed in recent years, the software-estimation task remains a difficult problem.

Consequently, the existing software-effort estimation models are not widely used (Zelkowitz, Yeh, Hamlet, Gannon and Basili 1984). Human judgment remains the dominant method of estimation (Wrigley and Dexter 1987). However, virtually no research has attempted to document the performance or explicate the knowledge held by software-development managers. Do these managers, with many years of experience in estimating, scheduling, and running software projects, have any knowledge to offer researchers that might lead to improved models of the software-estimation task?

The information processing theory of cognitive psychology suggests two reasons to search for and document human expertise in the software-effort estimation task. First, expertise is seen as consistent, skilled reasoning to solve task-related problems in the domain (Newell and Simon 1972). Consider what comprises the psychological construct of human expertise. Within a cognitive architecture restricted in specific capacities (e.g., working memory, attentional capacity, access to long-term memory, subcomponent seriality, speed), human beings must selectively (and continually) assimilate, reorganize, integrate, and generate knowledge in ways that facilitate its availability in solving task-relevant problems (Newell 1990).


To the extent that a domain can support the development of expertise, we would expect that managers who have sufficient experience and attendant responsibility in estimating software project costs would possess skill in this task. Thus the documentation of skilled reasoning would serve as an "existence proof" of a task environment which affords the development of task-specific knowledge.¹

A second reason to study expertise in software-effort estimation is based on the theory of how expert adaptation occurs. The source of expert skill resides in the knowledge brought to bear on a problem, and this knowledge reflects aspects of the task environment significant in solving the types of problems recurring in that environment. The flexibility admitted by the cognitive mechanism permits different adaptations to different task environments (Newell 1990). By studying how experts have adapted to the task, we can gain insight into the nature of reasoning strategies emergent (and useful) in response to task demands. Furthermore, as the nature of the skill arises in response to task demands, it serves as a barometer to the task environment, thus permitting insight into the software-estimation task environment itself.

Although the benefits of skilled human judgment in the software-estimation task have been discussed (Boehm 1981), little empirical evidence exists on how well, or even how, such experts perform this task. If skilled software managers are able to estimate development effort relatively accurately, a promising research strategy would be to understand the knowledge such managers bring to this problem solution. Indeed, a long-term research goal would be to develop a knowledge-based model of the estimation problem and test its empirical validity. The objective of this study is to examine the feasibility of this research strategy. In particular, our first goal is to determine the level of performance of expert estimators given a set of software-effort estimation tasks. Our second goal is to examine the estimation processes of these managers to explicate the strategies they invoke to achieve their performance. In summary, our research seeks to address the following two questions:

• How accurate are human effort estimators?
• How are these estimators accurate (or inaccurate)?

We report on a study that examined in detail how five software-development managers independently estimated the effort (in worker-months) required to complete ten software projects. Our results show that experienced managers can make more accurate estimates than existing (uncalibrated) algorithmic models on the same data set. Additionally, examination of verbal protocols indicates that these managers relied on two dominant strategies in approaching the estimation task and that one strategy was consistently associated with more accurate estimates. Though the generality of our results is necessarily limited by the constraints and form of the specific study, it appears that experienced software managers can develop specific and useful knowledge resulting in accurate effort estimations.

¹ It is important to note that not all task domains support the development of human expertise as we have defined it. The necessary and sufficient components required for human adaptations may not hold because of problem size, a dearth of fundamental knowledge, or a sparse or random event space. For example, it is unlikely that we can develop "lottery-playing" expertise, as a true game is based on random events, or "weather-predicting" expertise, as dominant causal factors are (as of yet) not well understood.


The paper is organized as follows. First, we describe the research method for the study. Next, we present and discuss the results in terms of the two goals of the study—explicating the performance levels and estimation processes of the managers. Finally, we conclude by discussing the implications of the results for the software-development estimation problem.

2. Method
Examining human expertise in software-effort estimation poses a considerable research challenge for at least two reasons. First, there is very little prior work in this area. Second, software-effort estimation is a relatively difficult problem. To overcome these problems, we needed firsthand knowledge of the estimation process as it actually is performed. Accordingly, we made contacts with five organizations whose primary business involved software development. Discussions with managers in these organizations facilitated the design of this study.

Within the organizations we contacted, a committee usually developed an initial estimate of software effort. Later, after a detailed design was completed, a specific budget was established using module-by-module estimates (developed by the manager) and involving some negotiation between the project team and management. Replicating this process for our study, however, was not attempted for two reasons. First, the end result of this process is not necessarily an accurate estimate, but a workable estimate acceptable to all parties involved. Second, this process does not allow us to draw any conclusions regarding the ability of individual managers to estimate development effort.

As we have noted, the primary goals for this study were to determine the extent to which (presumably) highly-skilled effort estimators could generate accurate estimates and to document (at an abstract level) the strategies they use in generating those estimates. Thus what we sought was a procedure that would be sufficient to begin to explicate the knowledge underlying the estimation task. The approach taken was pragmatic and simple. Software-development managers reviewed data from actual software-development projects and generated estimates of effort. All data items were explicitly defined, and all were explained to the managers before the experiment began. The managers reported no difficulty in understanding the presented data items. By incorporating a common stimulus data set, similarities and differences could be assessed across subjects. The stimuli themselves reflected a modification in the task as generally performed by the managers. Specifically, there was a difference in available data, a compression in time, and an imposition of structure (an actual estimation task is usually unstructured). However, pilot studies, self-reporting by the managers, and their eventual performance indicated that, though not identical, the task was apparently sufficient to engage the knowledge that managers bring to bear for effort estimation. Thus the sacrifice of some external validity for control seemed warranted for the goals of the study. Furthermore, structuring the task in this manner permitted a first step in developing a knowledge-based model of effort estimation. Should one or more managers achieve high levels of performance, the protocols obtained from the study could serve as a basis for a knowledge-based model of effort estimation.

2.1. Subjects
Three main criteria were used for subject selection. First, all managers had to have a minimum of ten years of experience.


TABLE 1
Subject Experience Profile

                                           S1    S2    S3    S4    S5
Total Yrs. Experience                      19    10    31    11    27
  Systems                                   0     1    19     9    22
  Data Processing                           0     9     0     2     1
  Scientific/Engineering                   19     0    12     0     1
  Other                                     0     0     0     0     3
Project Scheduling Experience (years)      10    10    12     6    18
Projects Worked On                         11    12    25    12    35
Projects Managed                            5    12    20     7    25

This would help ensure that the subjects were thoroughly familiar with the software-development process. This criterion is consistent with the general time frame for expert skill development (Bloom 1985). A second criterion used to screen subjects was the type of experience: project management and cost-estimation responsibility. The final criterion was that the subjects have a reputation among their peers as individuals who had consistently developed accurate estimates.

Subjects were obtained from five firms whose primary products were software and software-related services. As software development is the most critical operation within these firms, it is reasonable to assume that therein would be found skilled effort estimators. Senior development managers at these firms were asked to identify individuals whose job responsibilities included cost estimation and who were consistently accurate at the task. As a result, five software-development professionals were invited to participate in the study, with all five agreeing and completing the tasks.

Table 1 presents the background data for each subject. Their experience in software development ranged from 10 to 31 years, with an average of about 20 years of experience. Each subject's experience reflected a different development environment. Subject 1 was a consultant to the defense software industry, who worked primarily with embedded software systems, such as real-time, jet-fighter control software. Subject 2 had a commercial data processing background. Subject 3 worked primarily with systems software (compilers in particular). Subject 4 had worked with mainframe systems software and some commercial data processing. Subject 5 was a consultant who had worked in many different environments but who had the most experience with systems software. Except for Subject 1, all subjects had some experience with the environment from which the stimulus projects were developed (commercial data processing), with Subject 2 having the most experience related to the stimulus projects.

2.2. Materials
A subset (the first ten projects) of the actual project data Kemerer (1987) used in his validation of four popular cost-estimation models was adapted for use in this study. This data, collected from project managers at a management consulting firm specializing in data processing applications, comprised a set of project attributes and the actual development effort associated with each of the ten projects.


Further details about the project data can be found in Kemerer (1987). The projects ranged in size from 39,000 to 450,000 lines of code (100 to 2,300 Function Points) and took from 23 to 1,107 worker-months of effort to complete. The COCOMO (Boehm 1981) and Function Point (Albrecht and Gaffney 1983) inputs for these projects were obtained and transcribed on index cards. The project factors about which information was available to subjects comprised information on program attributes (e.g., size of database, amount of intersystem communication, number and complexity of files generated and used, reliability requirements), environmental attributes (e.g., hardware system, main-memory constraints), personnel attributes (e.g., average experience of the development team, capability of the project analysts, capability of the project programmers), and project attributes (e.g., use of modern programming techniques, level of software-development tool usage). The entire set of project factors is presented in the Appendix.

The project data provided did not include a functional specification of the project. These data were omitted for three reasons. First, as neither COCOMO nor Function Points admit this data, a more equitable basis for human-to-analytical-model comparison could be made. Second, function descriptions are inherently ambiguous, and it would be quite difficult to control exactly what such semantically-loaded terms meant to each manager.² It was clear that the subjects would not have equivalent familiarity with all applications and their functions in our sample. Finally, as the managers had to interpret each project "by attributes," all decisions, inferences, and judgments had to be made solely on the parameters they deemed relevant (from the set available). Thus our goal was to restrict the type of estimation problem to a close variant of the actual problem performed, while imposing controls to facilitate data acquisition on the processes of deliberation.
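For readers unfamiliar with the general form of such algorithmic models, the following is a minimal sketch of the basic COCOMO effort equation (Boehm 1981), using its widely published nominal coefficients. It is an illustration only; the study supplied the full COCOMO and Function Point inputs (including the cost-driver ratings listed in the Appendix), which this simplified sketch omits.

```python
# Basic COCOMO (Boehm 1981): nominal effort in worker-months as a function
# of size. Intermediate COCOMO additionally multiplies by an effort-
# adjustment factor built from cost-driver ratings (ACAP, PCAP, TOOL, ...).
COCOMO_MODES = {
    # mode: (a, b) in effort = a * KLOC ** b
    "organic":      (2.4, 1.05),
    "semidetached": (3.0, 1.12),
    "embedded":     (3.6, 1.20),
}

def basic_cocomo_effort(kloc: float, mode: str = "organic") -> float:
    """Nominal effort (worker-months) for a project of `kloc` thousand LoC."""
    a, b = COCOMO_MODES[mode]
    return a * kloc ** b

# E.g., an uncalibrated estimate for a 100 KLOC business application.
print(f"{basic_cocomo_effort(100, 'organic'):.0f} worker-months")
```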

2.3. Procedure
All subjects were run individually in the subject's own firm. A quiet room was selected, and interruptions were minimized. All subjects received the same tasks in the same order. Each subject completed two types of tasks: a sorting task followed by ten problem-solving (i.e., software-effort estimation) tasks.

Subjects were first asked to sort a set of index cards (where each card listed a cost factor for which data were available) by order of the importance of the factor's effect on the development effort. The objectives of this task were to familiarize the subjects with the type of project data that were available for use in estimation and to provide insight into how they might perceive the relative importance of, and the clustering relations among, the various factors (Chi, Feltovich and Glaser 1981).

Upon completing the sorting task, each subject was asked to estimate the actual number of worker-months consumed by each of the ten projects. The worker-month estimate included the total number of actual worker-months "expended by exempt staff members (i.e., not including secretarial labor) on the project through implementation" (Kemerer 1987). The experimenter resolved any ambiguities or questions the subject had regarding this directive, and the protocol revealed no evidence of estimations different from those presented in the instructions.³

² For example, an inventory control system can be very complex to very simple depending on the organization (General Motors versus a convenience store), users (number, management versus clerical), items (number, percentage out-sourced, production/supply), ordering policies (number, reorder point versus MRP-based), vendors (number, electronic or manual communications), storage locations (number, internal or external), cycle times, and accounting procedures.


Subjects were provided with a list of developmental factors for which information was available (the same factors used in the sorting task). They were instructed to ask for each project data item, which was provided on index cards by the experimenter, as needed. Subjects were also instructed to verbalize their thoughts as they developed the estimate and to take as much time as needed to develop the estimate. While time to complete the estimate was not a factor of analysis in this study, on average, each project required about 30 minutes to estimate. Estimates were obtained for each project in this fashion. All subjects were given a break after the fifth project-estimation task. All subjects received the same ten estimation problems in the same order. To avoid biasing subjects' responses and to inhibit between-trial learning effects, feedback regarding the accuracy of the estimates was not provided to the subjects until all project estimates were obtained. The final numerical estimates for each project were noted by the experimenter. Verbal protocols were audio-tape recorded for subsequent analysis. Neither fatigue nor learning effects over the ten problems were detected across trials.⁴

2.4. Data Analysis
To compare and evaluate the accuracy of the subjects' effort estimates, the absolute value of the percentage error was calculated. This metric, the magnitude of relative error (MRE), was suggested by both Thebaut (1983) and Conte et al. (1986) and used by Kemerer (1987) for cost-model validation. The MRE is calculated as the absolute percentage error of the estimate with respect to the actual amount of development effort required to complete the project:

    MRE = | WM_est − WM_act | / WM_act

where WM_est is the estimate of worker-months and WM_act is the actual effort consumed by the project. Although subjects may be sensitive to the influence of various productivity factors, it is anticipated that they may consistently overestimate or underestimate development effort if the assumed base-productivity rate is very different from that of the environment in which the projects were developed. To measure subjects' sensitivity to development factors, we examined the correlation between the estimates and the actuals. If there is a strong correlation, a strong (linear) relationship between the estimate and actual effort is assumed. In the event that there is a large error in the estimates (as determined by the MRE), a high correlation between actual results and estimated results indicates that the subject is sensitive to the productivity factors and is accurately accounting for their effects on development effort (as detected by the correlation). However, they are "miscalibrated," estimating from a base-productivity rate different from that in which the projects were developed (as detected by the MRE).

³ Note that the project data and the subjects (managers) both came from organizations whose primary business involved contractual (external) software development.

⁴ These effects were tested by regressing the error in task estimates (the MRE described in the next section) against the project sequence solved by the managers. The resulting correlations ranged from r² = 0.07 (S3) to r² = 0.23 (S2), and none were significant.


FIGURE 1. MRE Box and Whisker Diagram by Estimator. [Figure not reproduced: box-and-whisker summaries of MRE, indicating extreme values, standard deviation, and mean, for COCOMO, Function Points, and S1-S5.]

Finally, to determine the individual estimation strategies of the subjects, an analysis of the estimators' thinking-aloud verbal protocols was performed (see Ericsson and Simon 1984). The objective of this analysis was not to build a detailed model of the process, but to focus on the general forms of reasoning that the subjects invoked to work from the initial problem statement (the project data) to the solution (the estimate). Each audio-taped session was transcribed into a computer file with each line of the file corresponding to a single "task assertion" as a phrase or phrase fragment (Ericsson and Simon 1984).
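To make the two accuracy measures described above concrete, here is a minimal sketch of how MRE and the estimate-actual correlation can be computed. The estimate and actual values in it are invented for illustration and are not data from the study.

```python
import math

def mre(wm_est: float, wm_act: float) -> float:
    """Magnitude of relative error: |estimate - actual| / actual."""
    return abs(wm_est - wm_act) / wm_act

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented numbers purely for illustration (worker-months).
actual    = [23.0, 100.0, 287.0, 1107.0]
estimated = [30.0,  90.0, 400.0,  900.0]

mean_mre = sum(mre(e, a) for e, a in zip(estimated, actual)) / len(actual)
r = pearson_r(estimated, actual)
print(f"mean MRE = {mean_mre:.1%},  r^2 = {r * r:.2f}")
```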

3. Results
The results are presented in two parts. First, we discuss how well the managers did, both in absolute and relative terms, with respect to each other and to two analytical approaches (COCOMO and Function Points). Second, we focus on explicating the general strategies the managers seem to bring to bear in generating their estimates.

3.1. Accuracy and Consistency of Estimates
Figure 1 summarizes the MRE results for COCOMO, Function Points, and each manager as a version of a box-and-whisker diagram (Tukey 1977) indicating the extreme values (whiskers), the standard deviation (box), and the mean (line). COCOMO and Function Points estimates (Kemerer 1987) are included for comparison. A Friedman analysis of variance by ranks indicated significant differences existed in estimator accuracy (F_r = 45.94, p < 0.001). To examine between-estimator differences in estimation, a post-hoc analysis of the Friedman test results was performed, controlling for the familywise error rate (Siegel and Castellan 1988).
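For readers who wish to reproduce this style of analysis, a minimal sketch follows, using SciPy's Friedman test. The MRE values shown are invented placeholders rather than the study's data, and the post-hoc step is only indicated in a comment.

```python
from scipy.stats import friedmanchisquare

# MRE scores (%) per estimator across the same ten projects; the numbers
# below are invented placeholders, not values from this study.
mre = {
    "COCOMO": [600, 700, 900, 500, 800, 750, 650, 900, 700, 850],
    "FtnPts": [ 90, 120,  80, 150, 100, 110,  95, 130,  85, 115],
    "S1":     [900, 1100, 1000, 1200, 950, 1050, 980, 1150, 1020, 990],
    "S2":     [ 20,  35,  40,  25,  45,  30,  15,  50,  28,  32],
}

stat, p = friedmanchisquare(*mre.values())
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
# A significant result justifies post-hoc pairwise comparisons with a
# familywise error-rate correction (e.g., Bonferroni-adjusted sign tests).
```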


FIGURE 2. Magnitude of Relative Error for All Estimators. [Figure not reproduced: MRE plotted against estimation problem (1-10) for COCOMO, Function Points, and S1-S5.]

Two sets of differences were uncovered. The first difference indicates that S1 performed significantly worse than the rest of the managers and Function Points; however, the difference between S1 and COCOMO was not significant. The second difference found COCOMO performing significantly worse than S2 and S5. No other significant performance differences were found.

S2's estimates had a mean MRE of 32% with a standard deviation of 22%. The Function Points estimates had a mean MRE over three times as high and a standard deviation over five times as high. The (relatively) large within-subject MRE variance and the reduced power of post-hoc analysis make it impossible to discern further statistical differences within these clusters. Thus the apparent outstanding performance evidenced by the MRE scores of S2 in Figure 1 could not be verified statistically over all other estimators.

Figure 2 presents the MRE scores for COCOMO, Function Points, and the managers by project (in the order they were solved). From this figure, it can be seen that COCOMO and S1 form a profile of MRE values that are much higher than the remaining group. In this analysis, S1 has the highest MRE scores for 90% of the projects, with COCOMO having the overall highest MRE score. The remaining estimators show less error and variance. Note that S2's profile has the lowest MRE for the greatest number of projects.


TABLE 2
Correlation Between Actual and Estimated Effort

Estimator     r²     Mean MRE (%)
S2           0.96         32.0
S4           0.90        140.6
S3           0.87        146.1
S5           0.77         65.5
COCOMO       0.71        758.2
S1           0.67       1106.7
Ftn Points   0.63        107.4

As was noted, we examined the correlation (r) between an estimator's predicted effort and the actual amount of effort taken. Table 2 presents the correlations as well as the mean MRE scores for comparative purposes. Estimators are listed in decreasing magnitude of r². Recall that the MRE value is an estimate of error in prediction, and the correlation reflects possible sensitivity to productivity factors. For example, the function-point model appears to be well calibrated to this environment (as suggested by the low MRE); however, the comparatively low r² of this model reveals that it does not account well for the impact of various productivity factors (see Figure 1). On the other hand, the estimates of most of the managers correlated well with the actuals, indicating that they are sensitive to cost drivers in this development environment. The highest r², that for subject S2, was 0.96. This result compares with 0.71 for the COCOMO model and 0.63 for Function Points. Using Fisher's Z transformation to test these correlations, we find that S2's estimates are significantly better correlated with the actual data than either Function Points or COCOMO (p < 0.05). The lowest r² for the managers was for S1, at 0.67, which was statistically lower than S2 but not different from either Function Points or COCOMO.
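The paper does not detail this test, so the sketch below shows one standard way of comparing two correlations with Fisher's Z transformation, treating them as independent correlations over the ten projects (an assumption on our part) and recovering r from the r² values of Table 2.

```python
import math

def fisher_z(r: float) -> float:
    """Fisher's variance-stabilizing transformation of a correlation."""
    return math.atanh(r)  # 0.5 * ln((1 + r) / (1 - r))

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    """Z statistic for the difference between two (independent) correlations."""
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

# r values recovered from the r-squared column of Table 2 (10 projects each).
r_s2, r_fp = math.sqrt(0.96), math.sqrt(0.63)
z = compare_correlations(r_s2, 10, r_fp, 10)
print(f"z = {z:.2f}")  # |z| > 1.96 corresponds to p < 0.05 (two-tailed)
```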

3.2. Analysis of Manager Strategies
The analysis of the reasoning processes involved two tasks: the initial sorting task and the estimation (problem-solving) tasks. The results of the sorting task were examined to determine the extent of the agreement or disagreement among managers. The Spearman Rank Correlation Coefficient was calculated for managers' rankings. One manager (S3) was unwilling to commit to a ranking and was not included in this analysis. There was very little agreement among managers. Only two of the six correlations were statistically significant, and one of these was a negative correlation.⁵

⁵ The significant correlations were: S1 with S2 (r_s = −0.38, p < 0.05) and S4 with S5 (r_s = 0.67, p < 0.001). Each manager expressed reservations about ordering the productivity factors. The common explanation was that a factor would become important only when its value was out of the ordinary. In addition, the impact of one factor could be dependent on the value of another factor. Such joint effects seemed to preclude the assignment of an a priori strict linear ordering to the factors without specific values.


FIGURE 3. Abstraction of Algorithmic-Estimation Strategy. [Flowchart not reproduced: select a base-productivity rate; for each factor whose value is outside the "normal" range, determine the percentage change in the productivity rate that results from that factor value and adjust the rate accordingly; when no factors remain, divide LoC by the final productivity rate to yield the effort estimate; if the estimate is not "reasonable," adjust the estimate.]

An initial analysis of the tape-recorded verbal protocols of the estimation tasks focused on explicating a way to simply, but unambiguously, differentiate between the most-successful and the least-successful managers as determined by the average MRE. The results indicated that these two managers seemed to differ in terms of the fundamental way they arrived at estimates.

The least accurate manager, S1, appeared to employ an algorithmic strategy that was similar in many respects to the methods by which models, such as COCOMO, estimate development effort.⁶ Figure 3 shows a flowchart abstraction of this strategy. This subject started with a specific, base-productivity rate as an anchor. That rate was then adjusted throughout the session (up or down) to compensate for productivity factors that impact the project effort being estimated. For example, a base rate of 100 lines of code (LoC) per week might be modified to 200 LoC per week if the programmers on the project were rated as highly skilled. The number of LoC in the application was then divided by this rate to come up with an initial overall estimate. This overall estimate was then further adjusted depending on how well it appeared to represent the entire project. If S1 felt that an estimate seemed too low for the entire project, that estimate was adjusted upward. This final adjustment served as a generic "sanity check."

⁶ The notion of similarity does not imply equivalence to the analytical models. Interviews and protocol evidence indicate that none of the subjects attempted to apply such approaches to the problem. The similarity merely reflects the dominant role of the productivity rate in their estimation process.
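To make this anchor-and-adjust style concrete, here is a minimal sketch of the algorithmic strategy. The base rate, adjustment multipliers, factor names, and the final "sanity check" threshold are illustrative assumptions of ours, not values extracted from S1's protocols.

```python
def algorithmic_estimate(loc: int,
                         base_rate_loc_per_month: float,
                         adjustments: dict[str, float]) -> float:
    """Anchor-and-adjust estimation in the style described for S1.

    Start from a base productivity rate, scale it for each factor judged
    to be outside the normal range, then divide size by the final rate.
    All numbers used here are illustrative, not taken from the study.
    """
    rate = base_rate_loc_per_month
    for factor, multiplier in adjustments.items():
        rate *= multiplier  # e.g., highly skilled programmers -> rate goes up

    estimate = loc / rate  # worker-months

    # Generic "sanity check": nudge an implausibly low estimate upward.
    if estimate < 10:
        estimate *= 1.5
    return estimate

# Hypothetical use: 253,000 LoC, a 400 LoC/worker-month anchor,
# skilled programmers (+25%), tight memory constraints (-20%).
print(algorithmic_estimate(253_000, 400, {"PCAP": 1.25, "STOR": 0.80}))
```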


FIGURE 4. Abstraction of Analogical-Estimation Strategy. [Flowchart not reproduced: examine the problem features to see whether they suggest a past project; if so, activate that past project as the referent; for each current-project feature whose value is at variance with the referent project, adjust the estimate based on the results of "running" the referent project with that feature value; ignore features that do not differ.]

The most accurate manager, S2, employed an analogical strategy that was quite different from the algorithmic (see Figure 4). Rather than relying on some base-productivity rate to generate the anchor, S2 carefully analyzed the Function Points data in an attempt to understand the type of application to be developed. LoC data were used primarily to verify that the guess about application type was reasonable. Once the application type was determined, S2 formed an analogy between the application to be estimated and a similar application previously managed. The amount of effort that the recalled project required became the reference for subsequent adjustments. If the recalled project was not in sufficient correspondence, the estimate was adjusted to compensate for the difference. By adjusting the estimate for only those productivity factors that differed from one project to the other, S2 was able to produce very accurate estimates.
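For contrast, here is a comparably minimal sketch of the analogical (case-based) strategy. The similarity measure, the stored cases, and the per-feature adjustment rules are our own illustrative assumptions rather than a model of S2's actual knowledge.

```python
# A past case pairs a feature profile (Function Point style counts) with the
# effort it actually consumed. All data below are invented for illustration.
PastCase = tuple[dict[str, float], float]

def analogical_estimate(new_project: dict[str, float],
                        memory: list[PastCase],
                        adjust_per_unit: dict[str, float]) -> float:
    """Case-based estimation in the style described for S2.

    Retrieve the most similar past project, anchor on its actual effort,
    then adjust only for the features that differ from the referent.
    """
    def distance(case_features: dict[str, float]) -> float:
        return sum(abs(case_features[f] - new_project.get(f, 0.0))
                   for f in case_features)

    referent_features, referent_effort = min(memory, key=lambda c: distance(c[0]))

    estimate = referent_effort
    for feature, new_value in new_project.items():
        diff = new_value - referent_features.get(feature, 0.0)
        estimate += diff * adjust_per_unit.get(feature, 0.0)
    return estimate

# Hypothetical memory of two past projects and a new project to estimate.
memory = [({"inputs": 100, "outputs": 60, "files": 20}, 90.0),
          ({"inputs": 300, "outputs": 200, "files": 75}, 400.0)]
new_project = {"inputs": 110, "outputs": 60, "files": 40}
print(analogical_estimate(new_project, memory,
                          adjust_per_unit={"inputs": 0.2, "outputs": 0.3, "files": 1.0}))
```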


TABLE 3
Coding Scheme for Estimator Strategies

Algorithmic
  A1. Mention productivity rate.
      ". . . I'm saying the group may be . . . be in the 300 lines of code [per day] range . . ."
      "I would say that the maximum for this kind of thing would be 100 [lines of code per day] . . ."
  A2. Effort estimation based on productivity rate.
      "I'm gonna go down to 10 . . . 10 lines of code a day . . . I'm very pessimistic about this, which means that you've got 450,000 lines . . . 10 each . . . 45,000 code days . . . so I'm saying likely is 2,250 [months] . . . and I'd say worst is 3,000 and best is 2,000."
      "253,000 lines of code divided by 20 [lines of code per day] is 12,650 code days . . . So if you divide that by 20 [work days per month] . . . that's 633 code months."

Analogical
  B1. Mention past case (project).
      "Now with this information I'm going to try and use (project mentioned) as a metaphor and try to see how this compared with (project mentioned)."
      ". . . try to think of a development project in the past of equivalent size . . . our purchasing system . . . then start adjusting from that."
  B2. Effort estimation based on past case.
      ". . . if I use (project mentioned) as a base, I start with 90 man-months . . . roughly the same input and output types . . . but more inquiries . . . So, I'll reduce that by 20 man-months."
      "OK, so let's compare the quantitative information [with a recalled case] . . . about the same inquiry . . . about twice the number of files . . . I'll double the figure [effort estimation]."

Further examination of the two managers' protocols revealed that the two approaches seemed to be distinguished by four specific (deliberation) events appearing in the two transcripts which reflected the fundamentally different strategies used for generating estimates. These events were used to define specific coding categories and formed a coding scheme that captured these events. Algorithmic strategies were characterized by the mention of (1) a productivity rate, and (2) an effort estimation based on a productivity rate as project properties were reviewed.⁷ Analogical strategies were characterized by the mention of (1) a past project, and (2) an effort estimation based on the past project. Table 3 presents examples of protocol fragments and corresponding coding categories.

To test the coding scheme's ability to discriminate between categories and the subsequent categorization of the subjects' strategies, all references identifying the subjects were removed from the entire set of problem protocol transcripts (50).

⁷ The units of this rate vary by manager: S1 (LoC/month), S3 (LoC/hour), S4 (LoC/day), and S5 (LoC/week).


Vicinanza • Mukhopadhyay • Prietula were removed from the entire set of problem protocol transcripts (50). A program was written that would randomly select one ofthe problem protocols at a time and display the lines of the protocol with each line uniquely numbered. The program permitted each line ofthe displayed protocol to be selected and, if relevant, the rater could link the line to one or more of the coding categories (AI, A2, BI and B2) identified in Table 3. Two raters independently coded all protocol files to identify the occurrence ofthe four coding categories listed in Table 3. Each problem episode was classified according to the particular coding categories identified. If the coding categories were A1 and A2 (or BI and B2), then the problem was classified as algorithmic (or analogical). If there were a mixture of categories within a problem, then it was scored as mixed. A kappa statistic calculation (Siegel and Castellan 1988) yielded a high rate of agreement between the two raters (z = 9.05. a < 0.001). The discrepancies were resolved by a third rater.^ The results ofthe final coding for each ofthe ten protocols for each manager were the following: (1) managers SI, S3, S4 and S5 were scored as using algorithmic strategies on ail of the problems, (2) manager S2 was scored as using analogical strategies on nine of the problems and a mixed analogical-algorithmic strategy on one ofthe problems.' 4. Discussion The results of this study show that human estimators, when provided with the same information used by two algorithmic models, estimate development effort with at least as much accuracy and in some cases with more accuracy than the models, as measured by the MRE. Furthermore, the correlation analysis provides evidence that human estimators are more sensitive than the analytical models to the environmental and project factors that affect development productivity. There were, however, differences in the accuracy of the subjects as measured by both MRE and correlation coefficients. What were the basic methods that these estimators employ and what might account for the difference in estimation accuracy? An examination ofthe subjects* backgrounds and ofthe tape-recorded verbal protocols provides insights about the strategies employed by the human estimators. One clear distinguishing factor among the subjects is the amount of experience each had with the class of applications used as the project data. All applications were in business data processing, and most were written in COBOL. The subject who was least successful on this task, S1, worked exclusively on applications far removed from * A total of seven discrepancies occurred. In each case, one coder identified the mentioning of a productivity rate (Code Al in Table 3) while the other coder did not. For example, one such protocol fragment was: "I will use the same number 100 as Ihe starting point." The third coder examined the ten protocols in the order in which they were solved (the other two coders saw all protocols in a random order) and determined if the remote reference (the number "100" in this example) referred to a productivity rate or not. ' It is unclear why the algorithmic strategy occurred. However, from the protocol we can speculate that S2 could not recall (or did not experience) a project that "sufficiently matched" the parameters of interest in the presented project—which, in fact, was the only project based on a fourth generation language. 
As a consequence, S2 had to rely on a default, and a presumably weaker form of reasoning based on less knowledge. Regardless, S2 had an MRE or42.9% on that problem, which was better than three ofthe four other managers.


This subject's experience was in real-time, embedded military systems. Such systems differ in almost every respect from typical data processing programs as they are usually very complex, with memory and real-time processing constraints, developed under a myriad of military specifications and reporting standards, and often dependent on volatile hardware configurations. According to S1, 100 lines of delivered source code per worker-month is considered standard in that development environment. However, the average productivity rate for the applications in this study is over ten times greater. It is not surprising that S1, with no experience in traditional data processing work or COBOL, consistently overestimated the effort required to complete such projects. The algorithmic strategy seems quite sensitive to anchoring effects (Tversky and Kahneman 1974), and an initial, inappropriate productivity anchor can lead to substantial error because subsequent adjustments are inadequate. It is possible that S1 could have used an analogical strategy if a presented case was sufficiently similar to one previously encountered; however, the data at hand demonstrated that S1 did not engage such a method.

The other subjects who used an algorithmic strategy (S3, S4, S5) performed better than S1. The reason appears to be in the formulation of the initial productivity rate—the anchor driving the strategy. From the sorting task, it was evident that the values of the factors were necessary to produce a context for evaluation. Experience has provided the subjects with a set of expectations for the parameter values. After producing an initial productivity estimate, they review the remaining parameter values and adjust accordingly. As their experience is based on task environments similar to the ones in the study, they generated more accurate initial (anchoring) estimates than S1.

On the other hand, S2's analogical strategy was both different from and more effective than the algorithmic strategies followed by the other subjects. S2's method is not to anchor and adjust around a productivity factor, but rather to anchor and adjust around a specific, previously encountered project. This strategy may more effectively incorporate important nuances and interactions as the context of the prior project guides the selection and valuation of parameters. Such a strategy, however, makes large demands on experience. S2 had the most, and the most similar, experience, spending nine years estimating data processing applications. To determine the most relevant prior project as an anchor, it is necessary to interpret and review the relevant project parameters which determine similarity; that is, analogy to past projects must be made on the basis of a similarity metric which accounts for the important parameter values. Recall that the problems given to the subjects did not include the project type; rather, they were described only by parameter values. Therefore, S2's strategy would be most accommodated by the experimental task. Whether the other subjects could have invoked a similar analogical strategy based on their experience is not clear. What is certain is that S2's strategy did make explicit use of analogy based on prior project experience, and the adjustments made were within the context of that experience.

The reliance on prior projects places important and different demands on memory. The exact nature of the role of memory in complex problem solving, however, is still a focus of research (Baddeley 1990). S2's strategy was heavily dependent on specific, episodic memories as cases.


When the experiences that underlie the development of skilled reasoning are derived from varying, but similar, situations, recalling relevant prior experiences can greatly reduce effort in problem solving. Rather than rely on more general knowledge or inferential procedures, the problem solver can exploit the similarities (or differences) between the current and past situations. Individual experiences, and not general inferential mechanisms, drive the reasoning process. The machinery is most probably a combination of existing mechanisms such as analogy, induction, and pattern matching that have been adapted in the context of using a rich set of specific experiences.

Finally, the question of why these two approaches exist may be addressed. As noted, skilled development in a domain is based on adaptation to the task over long periods of time. General mechanisms of problem solving can be made more specific (or more appropriate) with knowledge directly related to what the task environment will afford. With this view, the analogical approach of S2 may simply be farther along the experiential path being trod by the algorithmic estimators. Lacking a sufficient set of task-specific experiences, the estimators incorporating the algorithmic strategy rely on a method that is plausible and consistent with what they know. When cases cannot drive the strategy, an algorithm is used which incrementally adjusts the estimate in a progressive-deepening strategy common to problem solving in many domains (Newell and Simon 1972).

5. Conclusions
In this paper we have examined (1) whether a selected group of experienced software managers were able to generate accurate effort estimates of projects described by a collection of attributes, and (2) how they were able to do this. The specific results of this study are dependent on the choice of subjects as well as the task. If none of the subjects proved to be accurate in their estimates of the experimental tasks, it would not be possible to determine whether it was due to the nonexistence of expertise in general, a lack of expertise among the subjects in the experiment, or the lack of a suitable experimental task. If, on the other hand, even a single subject is consistently accurate, the existence of expertise will have been strongly suggested—to the extent that our task was ecologically valid.¹⁰

How valid was the experimental task? Overall, our explicit goal was to match the parameters of the data set used by Kemerer (1987). One could argue that our subjects were presented with less information than usual, as we only provided a strict set of parameters, permitted no dialogue for acquisition of additional data, and eliminated some key information (in particular, the type of application). Thus the resulting behaviors of the subjects were an underestimate of their true levels of accuracy. On the other hand, one could also argue that we presented the subjects with the key information—an estimate of size data (i.e., LoC). This latter piece of information is obviously not available for incompleted projects, and the resulting behaviors reflected overestimates of their true levels of accuracy.¹¹

¹⁰ Note that there is a difference between the ecological validity of the task employed in the experimental setting (i.e., its "similarity" to the real-world task), and the psychological validity of the data obtained from the experimental task (i.e., its accuracy in reflecting deliberation events). The latter issue is addressed by incorporating the appropriate methods of cognitive science which reveal accurate psychological data. The former issue is addressed in a number of ways which link psychological data to parameters of interest in the target task environment.


This study does not resolve that issue.

Is there expertise in software-effort estimation? In our study, and to the extent that our task captures the fundamental characteristics of the actual estimation task, the managers estimated effort as well as, and in most cases better than, two well-established algorithmic models. The correlations of the subjects' estimates with actual worker-months as well as the MRE results were good. In particular, one subject was remarkably accurate and highly sensitive to productivity factors. Furthermore, the analysis of the reasoning strategies yielded two somewhat different, but nonetheless reasonably logical, methods of solving the estimation problems. Taken collectively, we can presume there is evidence suggesting valuable task adaptations that can effectively bring knowledge to bear to formulate accurate estimates. In sum, the results of this exploratory study strongly suggest the existence of expertise in software-effort estimation.

We expect some highly-rated (by peers and senior managers) and experienced (in project management and cost estimation) software managers to possess expertise in this task. Our expectation is also based on the increasing maturity of the software-development process over the last two decades. There have been increasing levels of logic and discipline in the development process due to the rapid growth of the software engineering discipline. When highly-skilled individuals with proven records (such as our subjects) are exposed to and required to manage a reasonably coherent process such as software development, we would expect some of these individuals to acquire a deep understanding about the effort required for different projects. In other words, such experience allows task-specific adaptations based on reorganizing knowledge and cognitive processes crucial in the development of expertise (Prietula and Feltovich 1991) in software estimation. Thus, for example, we would expect that such individuals would not only be able to distinguish a large project (in terms of effort) from a small project (and estimate the effort required), but also be able to grasp the effects of various productivity factors on the effort required.

Can we use the expertise to improve software-effort estimation? It has been suggested that a knowledge-based approach to the estimation problem may help improve the accuracy of existing models (see Ramsey and Basili 1989). In support of this, our study indicates the particular form of reasoning that might be pursued is an analogical-reasoning approach. The performance of the subject using the analogical approach was superior both in terms of MRE and r². However, it should be noted that even though some of the experts (algorithmic) may not have been as well calibrated (evidenced by the MRE), they did display sensitivity to productivity factors (evidenced by the high r²).

The ability of the subjects to fathom the effects of the productivity factors addresses a commonly-voiced, but unsubstantiated, claim that cost factors vary across organizations and that experts at one organization could not estimate costs at another. Our study shows that experts with experience in different organizations are sensitive to cost factors across organizational boundaries—this is evidenced by the high correlation between managers' estimates and actual effort.

¹¹ In fact, early LoC estimates generally are made (especially for several analytical models), but have explicit (or implicit) associated distributions of certainty. This begs an interesting question concerning the effect of confidence of the LoC estimate on the subsequent estimation behaviors of the humans (and models).


This result implies that there might be some cost drivers that have a standard effect, independent of the organization. Even the worst estimator, S1, added information beyond what was available in LoC alone.¹² Size is a major cost driver, but other factors have a significant influence. Kemerer's data illustrate this point: the productivity rate ranged from 400 LoC per worker-month to 2,500 LoC per worker-month. It seems that size alone cannot account for overall software effort. The experts were able to examine the data and account for the factors that influenced these productivity changes—in spite of their different backgrounds. We could speculate that the managers' estimates may have been miscalibrated because of the influences of their different environments and not because of a lack of ability to account for cost-factor effects. This possibility leads us to conjecture that productivity factors could be general and transcend the boundaries of organizations—unlike the base productivity rate, which seems to be more site specific.

How common and complex is the reasoning approach used by S2? Our study was exploratory and involved a small sample, but sufficient to detect and document possible expertise in the task. The task itself was somewhat artificial, but the data were real and the estimates were accurate. A major finding of the study revealed a limitation: estimation using analogical reasoning yielded the most accurate estimates—but we had only one subject incorporating this type of strategy. As a consequence, we are pursuing the answers to several questions generated by the findings. For example, what role does specific experiential knowledge play in the retrieval of prior project information and the subsequent adjustments of effort? Can good adjustment rules compensate for an inadequate set of project experiences? How general are adjustment rules across projects? Are there multiple forms of similarity functions in accessing prior projects? Is the knowledge acquired from this study sufficient to generate a computer model of effort estimation? Can analytical and algorithmic methods be woven together? What role does learning play in estimation? No doubt, many others can be generated.

Finally, as long as large software development projects are difficult to understand, predict and control, software-effort estimation models will be a major focus for research. In addition, the most prevalent (and costly) form of effort estimation may reside in maintenance projects and not new software projects (Lientz and Swanson 1980). Our research is now addressing this complex and important estimation problem. However, rather than take a "macro" approach to modeling software maintenance as a production process (e.g., Banker, Datar and Kemerer 1991), our approach provides a micro perspective which may offer a new source of complementary information—human expertise.*

¹² The estimates generated by S1 have a higher correlation (r² = 0.70) with actual effort than the correlation produced by regressing LoC with actual effort (r² = 0.42).

Acknowledgements. We would like to thank Dr. Herbert Simon for his input to this research and Dr. Chris F. Kemerer of MIT's Sloan School of Management for his comments on this paper and for permitting us access to the project data used in this study. The comments of the anonymous reviewers were also helpful. This research was supported in part by NSF Computer and Computation Research grant CCR-8921287 and the Center for the Management of Technology, GSIA.

* Ron Weber, Associate Editor. This paper was received on June 2, 1989, and has been with the authors 18 months for 2 revisions.


Appendix: Project Factors Available for Estimation

Program Attributes
• CPLX—The amount of complex processing (e.g., extensive mathematical or logical processing) performed by the program.
• COMM—Amount of intersystem communication required of the program.
• DATA—Data Base Size, the size of the application's data base.
• DIST—Amount of distributed data or processing.
• EFIL—Number and complexity of files shared or passed between programs.
• EFFI—Efficiency, the importance of the end-user efficiency of the program's on-line functions.
• FLEX—Degree to which the program was specifically designed and developed to facilitate change.
• INQT—External Inquiry Type, the number and complexity of input/output combinations where an input causes an immediate output.
• INST—The degree to which a program conversion or installation plan was provided and tested during program development.
• LOC—Size of the completed program in lines of source code.
• LANG—Programming language used for the project.
• MULT—The degree to which the program was specifically designed to be installed at multiple sites in multiple organizations.
• NFIL—Number and complexity of files generated, used, and maintained.
• OENT—Amount of on-line data entry required of the program.
• OPER—Amount of startup, backup, and recovery procedures provided.
• OUPD—Amount of on-line data update required of the program.
• PERF—The importance of program performance measured by either response time or throughput.
• RELY—Reliability requirements, the effect of a software failure.
• TRNS—The importance of the program's transaction rate.
• UINP—User Input, the number and complexity of user inputs to the program.
• UOUT—User Output, the number and complexity of the reports generated.
• REUS—The degree to which the application and the code has been specifically designed for reusability in other applications or at other sites.

Program Environment Attributes
• HARD—Target hardware system.
• HEAV—The loading of the equipment on which the program will run.
• STOR—Main storage constraints, the percentage of available memory used.
• TIME—Execution time constraints, the percentage of available execution time the application will utilize on the production computer.
• VIRT—Amount of volatility in the underlying hardware/software system upon which the application relies.

Personnel Attributes
• ACAP—Capability of the analysts on the project rated as a team.
• AEXP—Average experience of the development team.
• LEXP—Amount of experience the team had with the programming language.
• PCAP—Capability of the programmers as a team.
• VEXP—Amount of experience with the hardware/OS architecture for which the application was developed.
• SSIZ—Staff size, average number of people working on the project.

Project Attributes
• MODP—Use of modern programming practices (e.g., top-down design, structured programming techniques, design and code walk-throughs).
• SCED—Development schedule constraints, the ratio of the required development schedule to a reasonable or normal schedule.
• TURN—Development environment turnaround time.
• TOOL—Level of software-development tool usage on the project.


References
Albrecht, A. J. and J. Gaffney, "Software Function, Source Lines of Code, and Development Effort Prediction," IEEE Transactions on Software Engineering, 9, 6 (1983), 639-648.
Baddeley, A., Human Memory: Theory and Practice, Allyn and Bacon, Boston, MA, 1990.
Banker, R., S. Datar and C. Kemerer, "A Model to Evaluate Variables Impacting the Productivity of Software Maintenance Projects," Management Science, 37 (1991), 1-18.
Bloom, B. (Ed.), Developing Talent in Young People, Ballantine, New York, 1985.
Boehm, B. W., Software Engineering Economics, Prentice-Hall, Englewood Cliffs, NJ, 1981.
Chi, M., P. Feltovich and R. Glaser, "Categorization and Representation of Physics Problems by Experts and Novices," Cognitive Science, 5 (1981), 121-152.
Conte, S., H. Dunsmore and V. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, Menlo Park, CA, 1986.
Côté, V., P. Bourque, S. Oligny and N. Rivard, "Software Metrics: An Overview of Recent Results," The Journal of Systems and Software, 8 (1988), 121-131.
Ericsson, K. A. and H. A. Simon, Protocol Analysis: Verbal Reports as Data, MIT Press, Cambridge, MA, 1984.
Kemerer, C. F., "Software Cost Estimation Models," Chapter 25 in Software Engineers Reference Handbook, Butterworth, Surrey, U.K., 1991.
Kemerer, C. F., "An Empirical Validation of Software Cost Estimation Models," Communications of the ACM, 30 (1987), 416-429.
Lientz, B. and E. Swanson, Software Maintenance Management, Addison-Wesley, Reading, MA, 1980.
Newell, A., Unified Theories of Cognition, Harvard University Press, Boston, MA, 1990.
Newell, A. and H. Simon, Human Problem Solving, Prentice-Hall, Englewood Cliffs, NJ, 1972.
Prietula, M. and P. Feltovich, "Expertise as Task Adaptation," Working Paper, Center for the Management of Technology, Graduate School of Industrial Administration, Carnegie Mellon University, 1991.
Ramsey, C. and V. Basili, "An Evaluation of Expert Systems for Software Engineering Management," IEEE Transactions on Software Engineering, 15 (1989), 747-759.
Siegel, S. and N. Castellan, Nonparametric Statistics for the Behavioral Sciences (2nd Ed.), McGraw-Hill, New York, 1988.
Thebaut, S. M., "Model Evaluation in Software Metrics Research," in J. Gentle (Ed.), Computer Science and Statistics: Proceedings of the Fifteenth Symposium on the Interface, North-Holland, New York, 1983.
Tukey, J., Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.
Tversky, A. and D. Kahneman, "Judgment under Uncertainty: Heuristics and Biases," Science, 185 (1974), 1124-1131.
Wrigley, C. D. and A. S. Dexter, "Software Development Estimation Models: A Review and Critique," Proceedings of the ASAC Conference, University of Toronto, 1987.
Zelkowitz, M., R. Yeh, R. Hamlet, J. Gannon and V. Basili, "Software Engineering Practices in the US and Japan," IEEE Computer, (June 1984), 57-66.
