Patterns of Evolution in Open Source Projects: A Categorization Schema and Implications 1




Sherae Daniel Katz School of Business University of Pittsburgh

Katherine Stewart Robert H. Smith School of Business University of Maryland

David Darcy Irish Management Institute

March, 2009

WORKING PAPER: PLEASE DO NOT CITE OR DISTRIBUTE WITHOUT AUTHORS’ CONSENT

1

The authors would like to acknowledge several sources of support for this work. Constructive feedback was received during presentations at the First Interdisciplinary Symposium on Statistical Challenges in e-Commerce, Michigan State and Lero, University of Limerick. Several individuals provided useful feedback including Brian Butler, Pratyush Sharma, Wolfgang Jank, Galit Shmueli and Pankaj Setia. The authors are grateful for the research assistance provided by Chang-Han Jong, Vincent Kan and Julie Inlow. This research was partially supported by the National Science Foundation award IIS-0347376. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily represent the views of the National Science Foundation.


ABSTRACT

Open Source Software (OSS) is growing both in terms of its adoption and as a method of development. OSS projects represent new means of collaborating to build, distribute, and support software. Prior research on OSS has tended to focus on either the “big” projects such as Linux and Apache, or on “the rest,” such as the population (or a sample) of SourceForge projects. Very few studies have made distinctions among “the rest” beyond dead and not dead, or better performing and worse performing. This study provides a finer-grained view of the landscape of OSS development, focusing on different patterns of evolution and success. Statistical analysis of longitudinal data is combined with qualitative analysis of data from projects’ public websites to elaborate a categorization schema that differentiates among 6 types of OSS projects: User-Centered, Controlled, Counter-cultural, Personal, Abandoned, and Intractable. A major contribution of this work is in describing how each of these categories is associated with a unique pattern of software evolution and varying levels and kinds of success in attracting development activity and user interest. Observation of these categories leads to several interesting implications for research and practice, including the proposition that, counter to much prior work on software complexity, controlling complexity is neither necessary nor sufficient for success in OSS projects.


1 Introduction

Open source software (OSS) has emerged as a viable alternative to software developed in proprietary settings, generating great interest among managers and researchers (Hahn et al. 2008; Jackson 2008). Two factors appear to have fueled the intense interest in OSS development. One is a belief that OSS projects may attain very high quality, for example as indicated by low levels of complexity in the source code (O'Reilly 1999; Raymond 2001). A second is fascination with unique aspects of the development practices which allow for wide-ranging voluntary contributions (MacCormack et al. 2006; Mockus et al. 2002; von Hippel and von Krogh 2003). Many OSS enthusiasts believe OSS development practices lead to high quality software because of the many developers who can observe and participate, finding and correcting bugs, and adding new features. This belief is based on the notion that a multitude of geographically dispersed developers work on OSS projects (DiBona et al. 1999). However, given that many OSS projects have only a single developer (van Antwerp and Madey 2008), this basis for expectations of high quality may not apply uniformly across all OSS projects. Another common notion is that OSS is free so it has a lower total cost of ownership (Wheeler 2007). However, when including maintenance costs and a potential lack of support due to diverted interest from developers, the savings are less clear (Wheeler 2007).

Part of the discrepancy between common beliefs and the reality for many OSS projects lies in the fact that although the hundreds of thousands of OSS projects have some common characteristics because they are governed by similar licenses, they can also be distinct in many ways. Hence the application of broad generalizations may not be appropriate, and discriminating among different types of OSS projects could be useful. The starting point for this research is the assertion that OSS projects can no longer be thought of as a single monolithic phenomenon; the landscape is more nuanced and differentiation is necessary.

OSS projects are sometimes divided into two classes: the ultra-successful, including projects such as Apache and Linux, and the rest. Some research has further distinguished among “the rest” those that are “active” versus those that are “inactive” (Crowston et al. 2006). When making these distinctions many studies focus on the source code, i.e. its size and/or how much it changes over time (Lee and Cole 2003). For example, Linux is deemed successful partially based on a huge code base and the fact that it continues to receive code contributions. The ability to attract code contributions is important because, with rare exception, software must evolve in order to meet dynamic needs of stakeholders including the developers themselves, the larger development community and, ultimately, users (Belady and Lehman 1976; Lehman and Ramil 2001).

However, while Linux has a large and growing code base, there are other unique aspects to its code evolution and also its project characteristics. Linux has increased in complexity yet continues to be successful in terms of activity and other measures (Godfrey and Tu 2002). This seems to contradict prior work suggesting that software complexity makes it difficult to modify software and limits many forms of success (Lehman 1980). Thus the developers of at least one OSS project have been able to leverage the unique characteristics of OSS project development to overcome common limitations of software complexity. The seeming contradiction of continually increasing software complexity coupled with ongoing success in Linux raises the question of whether this is an anomaly or a frequently occurring pattern in OSS.

The complex and potentially unique interplay between software complexity and success for OSS projects has captured the attention of researchers. For example, Haefliger et al.
(2008) study the relationship between code structure and reuse, MacCormack et al. (2006) focus on the ability of OSS management and organizational structure to affect code complexity, and Baldwin and Clark (2006) discuss the impact of code complexity on attracting developer interest. These studies, along with case studies focused on the pattern of software complexity evolution in Linux (Godfrey and Tu 2002), together distinguish software complexity as a notable characteristic of OSS projects. Code complexity is especially significant in OSS projects because code is at once the product, a reflection of the development process, and a guide for potential reuse where there may not be documentation or other helpful organizational structures.

Based on the prior work that suggests the importance of software complexity evolution, we focus on the evolution of software complexity to develop a more fine-grained schema of OSS projects. The schema identified here can help address many issues of concern to researchers. By using software complexity evolution as a lens to categorize OSS projects, we can start to understand the paradox of Linux because we identify and describe a category of projects that increase in complexity over time yet are associated with successful outcomes. We also identify and describe other categories of projects that decrease in complexity and have varied outcomes. For example, we discover that 3 categories, representing 47% of the sample, do not follow the commonly observed pattern of increasing complexity noted in most prior research on closed development projects (Belady and Lehman 1976). This is evidence that OSS projects may be distinct from traditional projects in terms of how complexity evolves, and could be subject to alternative laws. Further, we are surprised to find that there does not seem to be a strong association between managing (or increasing) complexity and success (or failure) on other dimensions.

This paper also makes contributions beyond the realm of research focusing on software evolution.
Through focusing on internet enabled communities that strive to innovate we seek to provide some guidance for theoretical inquiries such as those aimed at understanding community based innovation (von Hippel and von Krogh 2003) or dynamics on globally distributed teams (Maznevski and Chudoba 2000). In particular, this study will offer guidance to researchers who attempt to draw generalizable theoretical conclusions from the study of OSS projects by enabling them to better estimate the boundaries to which they can project their findings. In summary, the main contribution of this research derives from developing a classification schema that illuminates some prior findings (i.e. the Linux paradox), raises some interesting theoretical questions (i.e. the role of complexity for communities seeking to innovate), and may help guide sample selection and assessments of generalizability in future OSS research.

In addition, the schema provides a tool that can be helpful for managers faced with OSS adoption decisions or the challenge of utilizing OSS practices described in prior research (Watson et al. 2008). Such managers may get quickly mired in a variety of OSS projects. Using the schema developed in this paper, a manager interested in adopting OSS practices can identify practices that have worked for projects that are similar to his project instead of attempting to borrow practices from projects that are substantially different from his own. Specifically, our schema suggests the kinds of projects where success is associated with complexity and those where complexity is not associated with success, so that a manager can better estimate the impact of complexity for his project.

To situate this study in the prior work on OSS and software evolution, the next section provides a brief description of OSS, a short discussion of software complexity, and a summary of research concerning complexity evolution in OSS projects. Next, the research design is presented, including details about the use of functional data analysis (FDA) and the qualitative analysis of information on project websites. The results section combines analysis of software complexity evolution, quantitative project characteristics, and qualitative project information to develop and describe the 6 project categories. In the discussion we consider several implications of our findings. Finally, we highlight the limitations of the study before discussing implications for research and practice.

2 Background

2.1 OSS Project Characteristics and the Importance of Complexity

OSS projects generally engage a set of practices that distinguish them from other software development projects, and many of these practices relate to the treatment of source code. The treatment of source code in OSS projects is guided by licenses approved by the Open Source Initiative (see www.opensource.org). The most prominent characteristic of approved licenses is that the source code be freely available for use, modification, and redistribution. These licensing rules can yield opportunities and challenges for OSS projects. By maintaining code availability, OSS licenses create an opportunity for projects to engage geographically distributed volunteer developers who may join and exit projects as they wish (Shah 2006; von Krogh et al. 2003). However, engaging geographically distributed volunteers can increase coordination challenges and emphasize the importance of factors that ease contributions. For example, Kuk (2006) highlights the important role of knowledge sharing to overcome these challenges.

When geographically distributed developers may come and go without formal integration of new contributors or turnover processes when a contributor leaves, factors that facilitate the ease with which a developer can contribute are especially important. The ease of contribution is also important because the developers are often volunteers and may have limited time to work on the OSS project. If they cannot understand and contribute quickly, they may not contribute. Hence the complexity of the source code may be an important determinant of a new member’s ability to understand the code and contribute effectively (Baldwin and Clark 2006; Lehman and Ramil 2001). Thus, the unique characteristics of OSS projects make them an especially interesting context in which to observe patterns of interplay between complexity evolution and other project characteristics such as ongoing development activity.

2.2 Software Complexity: Size and Structure

There have been quite a few studies on the evolution of source code in OSS. In contrast to the current work, these have generally been case studies of the largest, most successful projects. We follow studies of Linux and Mozilla in their focus on size and structure as critical measures of software complexity (Godfrey and Tu 2002; MacCormack et al. 2006; Paulson et al. 2004). While these measures are often correlated, they are theoretically and empirically distinct (Darcy et al. 2005; Eick et al. 2001).

Size is typically measured as source lines of code, and it reflects the quantity of code. As such it may reflect the amount of functionality or feature richness of an application (Lehman and Ramil 2001) and in that regard, larger size may be seen as desirable. Size can reflect how much developer activity the project has been able to attract, and prior OSS research has considered this a signal of project success (Stewart and Gosain 2006). OSS projects that are rich in functionality may be able to attract a user community. This is important as prior literature discusses the positive impacts that users can have in OSS projects (Lakhani and von Hippel 2003). Hence overall, increasing size may be seen as positive. However, software with larger size tends to have higher complexity, a higher number of developers needed to complete the tasks related to the software and a higher number of errors (Kitchenham et al. 1990; Triantafyllos et al. 1996; Wang and Shao 2003). As discussed above, higher complexity can be detrimental to OSS projects as developers are volunteers and may not take the time to overcome the complexity to make contributions. Size is a measure with high accessibility to project managers, but other measures that capture complexity, irrespective of size, may be more helpful for focusing maintenance efforts within a project (Briand et al. 2000).

Structural complexity captures the way that code is organized rather than how much code exists. As such, rather than reflecting the amount of functionality in an application, it may more likely reflect management efforts and decisions associated with design (MacCormack et al. 2006). Structural complexity is viewed as “the organization of program elements within a program” (Gorla and Ramakrishnan 1997; MacCormack et al. 2006). The term element can refer to a procedure, function, method, module, class, etc. When solving a problem through software, several of these elements are typically created. As design, implementation and maintenance decisions are made, the particular content of these elements and the relationships between these elements lead to the structural complexity of the software. That is, the structural complexity of a program depends on the complexity of individual elements and the complexity of the associations among these elements. The structure of elements created by developers can have several implications, including affecting the ease with which the code can be altered as bugs are fixed and features are added, and the reliability of the application (Ulrich 1995). In this regard high structural complexity is often viewed as an undesirable trait.

Structural complexity is reflected in both cohesion and coupling. Cohesion is conceptualized as the ‘togetherness’ of the sub-elements within an element (Bieman and Ott 1994). Higher cohesion indicates lower complexity. Coupling focuses on the associations among elements, and is conceptualized as the ‘relatedness’ of elements to other elements (Hutchens and Basili 1985). As coupling increases, so does complexity. An experimental study found coupling and cohesion to be interdependent (Darcy et al. 2005). Based on such findings, it can be argued that a single measure of structural complexity may be calculated using coupling and cohesion, and such a measure adequately captures overall structural complexity because coupling and cohesion are the fundamental underlying dimensions of structural complexity (Darcy et al. 2005). The single coupling and cohesion measure of structural complexity is particularly relevant with regard to outcomes of managerial interest such as developer effort (Darcy et al. 2005).

While size and structural complexity have sometimes been considered interchangeable as indicators of complexity, they have important differences as overviewed above. Specifically, increasing size can indicate added features, generally a positive trait, whereas increasing structural complexity may be a negative trait, leading to difficulty in maintenance. Increases in size (and features) do not have to increase structural complexity. We focus on both size and structural complexity to understand if and how projects can increase their functionality while minimizing the difficulty developers face when trying to contribute.

2.3 OSS and Complexity Evolution

Exploration of the evolution of complexity in OSS projects has yielded some consistency and some contradictions. Godfrey and Tu (2002) observed the size of 96 Linux kernel versions to grow at a super-linear rate. Yu et al. (2004) examined the evolution of a larger set of 400 versions of the Linux kernel and found that both size and common coupling with non-kernel modules increased, but at different rates. Specifically they found that size increased in a linear fashion, while common coupling increased at an exponential rate. In contrast, Paulson et al. (2004) found that size and other measures of complexity increased at the same rate in the Linux kernel. So, while prior research suggests that measures of complexity in Linux have increased over time, it is not clear exactly what pattern those increases have followed.

The increasing complexity in Linux is only partially consistent with the seminal work known as Lehman’s Laws of Software Evolution, which were based on observations made during the development of the IBM 360 operating system (Belady and Lehman 1976; Lehman 1980). Lehman’s Laws suggest that complexity will increase over time unless it is actively controlled, and a common hypothesis is that increasing complexity will obstruct project performance while decreasing complexity facilitates it (Baldwin and Clark 2006). Evidence supporting this hypothesis has been found in the Mozilla project, where developers were able to purposefully decrease complexity and then experienced greater success (MacCormack et al. 2006). In contrast, Linux’s success, even in the face of increasing complexity, seems to suggest that increases in complexity may not be detrimental to OSS projects, or that there are other factors that ameliorate the impact on a project’s performance (e.g. Godfrey and Tu 2002).

In summary, complexity is expected to influence the outcomes of OSS projects, and there is a growing body of research that details how complexity, in terms of size and structure, evolves in some of the largest OSS projects. While this research is insightful, it provokes questions about the impact of complexity for OSS projects and what patterns of evolution may occur in more typical projects.
This study adds to this body of research by exploring evolution of a broader sample of OSS projects to uncover and interpret patterns.
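The distinction drawn above between linear and exponential growth can be made concrete with a small sketch. It is not taken from any of the cited studies: the two series are invented, the function names are hypothetical, and comparing R² on raw versus log-transformed values is only a rough heuristic for telling the two shapes apart.

```python
import math

def r_squared(xs, ys):
    """R^2 of an ordinary least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def better_fit(versions, values):
    """Label a series 'linear' or 'exponential'.

    An exponential series becomes a straight line after taking logs,
    so we compare the fit on the raw values with the fit on log values.
    """
    linear = r_squared(versions, values)
    exponential = r_squared(versions, [math.log(v) for v in values])
    return "exponential" if exponential > linear else "linear"

# Invented data for ten successive versions (not measurements of any project).
versions = list(range(1, 11))
linear_series = [1000 + 250 * v for v in versions]       # e.g. size growing linearly
exponential_series = [1000 * 1.3 ** v for v in versions]  # e.g. coupling growing exponentially

print(better_fit(versions, linear_series))
print(better_fit(versions, exponential_series))
```

Real release histories are noisier and unevenly spaced in time, which is part of why the cited studies reach different conclusions about the same project.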

3 Research Design

Research was conducted in four stages. In the first stage, sampling criteria were defined, OSS projects fitting the criteria were identified, the source code from all available releases of those projects was downloaded, the data was cleaned, and the measures of interest were calculated for every release of every project in the sample. In the second stage, functional data analysis (FDA) was applied to uncover patterns of evolution in size and in structural complexity and to categorize projects accordingly. Further statistical testing was conducted to increase our confidence that the resulting 6 group categorization schema was valid. In the third stage of the research, we collected and analyzed additional archival data to enhance our understanding of differences across the project categories. This included both quantitative data (e.g., numbers of downloads for the projects), and qualitative data (e.g., types of information provided on the project websites). Finally, in the fourth stage we combined all of the analyses to write comprehensive descriptions of the 6 types of projects uncovered in this research.

3.1 Sample

The sample of projects was drawn from SourceForge (www.SourceForge.net). SourceForge provides open source developers a centralized place to manage their development and includes communication tools, version control processes, and repositories for source code. It is one of the largest open source repositories, estimated to host over 168,000 projects (Madey and Christley 2008). Drawing a sample from this site allowed this study to build on prior open source research by focusing on a larger and more diverse set of projects compared to previous case study work focused on the largest projects such as Linux, Mozilla or Apache.

SourceForge hosts many different kinds of software projects, utilizing many different programming languages. Since our focus is on the evolution of code structure, we attempted to limit variability due to the nature of the underlying problem being addressed across projects by selecting from the two largest problem domains listed on SourceForge, “Internet” and “Networking.” To avoid variability due to the project programming language, projects were selected that were built with C++. These constraints were instituted to enhance the likelihood that differences in evolutionary patterns observed may be due to factors of greater managerial interest, such as the development practices used on a project. Some projects represented multiple parallel development streams that might not be able to be disentangled, and because some variables were measured at the level of the SourceForge project, those measures could not be assigned to a single subproject. For this reason those projects that appeared to be a sub-project of a larger project or that appeared to be an umbrella project for multiple smaller efforts were eliminated. In order to focus on projects that represent the open source movement, only projects that use OSI approved licenses were examined.

In order to make sure we could observe some evolution, these criteria were applied to projects that had a minimum of 2 releases. Projects were tracked for 1 year and those that had at least 2 software releases by the end of that year were retained in the sample. In order to observe comparable periods, we included the first one year of history for each project in our sample. A one year history was chosen as being sufficiently long for projects to display considerable evolution, and likely long enough to capture the active life of most projects (Stewart et al. 2006b). Applying the selection criteria generated a total of 76 projects (listed in the appendix) and 492 software releases for analysis.

Two screening procedures were used to ensure that the releases for a given project represented a single development stream. The first procedure identified projects for which the size appeared to vacillate drastically over time.
For every project that showed such vacillation, the source files, public forums, web pages, and/or discussion groups associated with the project were reviewed in order to determine whether the set of source files obtained represented a single development stream. Source files that were determined to have been erroneously included were eliminated. The second screening procedure focused on separating multiple development streams in projects (Godfrey and Tu 2002). This was done by examination of the naming conventions used in each project to distinguish when parallel development efforts were represented. For example, some projects included versions of the code related to different human languages (e.g., a French release and a Spanish release), and some had versions for different operating systems (e.g., Mac and Windows). Where a project was identified as having multiple parallel development streams, the releases associated with the development stream that was most active (i.e., the one that contained the largest number of releases) were retained.
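The rule for choosing among parallel development streams can be sketched as follows. This is an illustrative sketch only: the stream labels and release names are invented, the function name is hypothetical, and it assumes each release has already been assigned to a stream by reading the project's naming conventions by hand. How ties were broken is not stated in the text; here the stream seen first wins.

```python
from collections import defaultdict

def most_active_stream(releases):
    """Return (stream, releases) for the stream with the most releases.

    `releases` is a list of (stream_label, release_name) pairs.
    """
    streams = defaultdict(list)
    for stream, name in releases:
        streams[stream].append(name)
    # max() returns the first maximal key, so ties go to the stream seen first
    # (an assumption of this sketch, not a rule from the paper).
    best = max(streams, key=lambda s: len(streams[s]))
    return best, streams[best]

# Invented example: one project with a Windows stream and a Mac stream.
releases = [
    ("windows", "tool-1.0-win"),
    ("mac", "tool-1.0-mac"),
    ("windows", "tool-1.1-win"),
    ("windows", "tool-1.2-win"),
    ("mac", "tool-1.1-mac"),
]
stream, kept = most_active_stream(releases)
print(stream, kept)
```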

3.2 Measurement

Size, coupling and cohesion for each release were calculated using Scientific Toolworks’ Understand (version 1.4) analysis tool. Size was calculated by summing the total number of lines of code in the release. Coupling and cohesion were individually assessed for each class in a particular release of the project. To calculate a release level coupling measure, the class level coupling measures were averaged across all classes for the given release of the project. The same procedure was used to assess the release level measure of cohesion. As increasing cohesion represents decreasing complexity, a reversed measure was used. Following Darcy et al. (2005), the release level measures of coupling and cohesion were multiplied to represent an overall measure of structural complexity.

In addition to the measures of complexity, other descriptive statistics were also calculated to enhance our understanding of the projects and their relationship to practices that have been suggested to be common for OSS projects. As one common notion is that OSS projects release “early and often” (Raymond 2001), we explored measures that would help us evaluate the degree to which our projects release early and often. These were the total number of releases for a project, the average number of days between releases (release frequency), and the active life. The active life for a project represents the time in number of days between the first and the last release for a project (within the one year observation period).

To explore potential relationships between evolution during the first year and subsequent project success, we revisited each project in the sample approximately three years after our initial sampling and observed the date of the most recent software release. This allowed us to determine whether the project had survived past the first year (as indicated by further development activity).
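The release level measures described in this section reduce to a few lines of arithmetic, sketched below. The numbers are invented, the function names are hypothetical, and the per-class inputs are assumed to be already available (for example, exported from a metrics tool) with cohesion already reversed so that larger values mean more complexity; how the reversal was done is not specified here.

```python
from datetime import date

def structural_complexity(classes):
    """Release-level structural complexity.

    `classes` is a list of (coupling, reversed_cohesion) pairs, one per class.
    Each measure is averaged over the classes, then the two means are multiplied.
    """
    n = len(classes)
    mean_coupling = sum(coupling for coupling, _ in classes) / n
    mean_reversed_cohesion = sum(rc for _, rc in classes) / n
    return mean_coupling * mean_reversed_cohesion

def release_stats(release_dates):
    """Release count, active life, and mean days between releases.

    Assumes at least two releases, as required by the sampling criteria.
    """
    dates = sorted(release_dates)
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return {
        "releases": len(dates),
        "active_life_days": (dates[-1] - dates[0]).days,
        "avg_days_between": sum(gaps) / len(gaps),
    }

# Invented per-class measures for one release: (coupling, reversed cohesion).
classes = [(2.0, 0.5), (4.0, 0.25), (6.0, 0.75)]
complexity = structural_complexity(classes)

# Invented release dates within a one year observation window.
stats = release_stats([date(2004, 1, 1), date(2004, 1, 31), date(2004, 3, 1)])
print(complexity, stats)
```

With these inputs the mean coupling is 4.0 and the mean reversed cohesion is 0.5, giving a structural complexity of 2.0; the three releases span 60 days with 30 days between releases on average.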


3.3 Uncovering Complexity Evolution Patterns and Categories

FDA is a relatively new technique that enables an efficient representation of many types of data series, such as time series data, in a compact and analytically powerful functional form (more information can be found at http://functionaldata.org/ or in Ramsay and Silverman 2002). A functional form is a mathematical description of a curve that can be analyzed using mathematical techniques such as cluster analysis. We used FDA to enable the creation of comparable functional forms representing the complexity evolution pattern for a single OSS project. Projects, represented by their functional forms, are the unit of analysis for this study.

FDA is used in this study because when OSS data representing the evolution of complexity are viewed across multiple projects, they are very ill behaved (Kemerer and Slaughter 1999; Stewart et al. 2006b). Otherwise stated, the time series for a given project may look very different from that of another project, and comparisons across projects are difficult to make using traditional time series analysis. The time series were different across projects based on: the variance in the timing of initial, intermediate, and final software releases across projects; variance in the number of releases across projects; and variance in the absolute levels of complexity, which we found to obscure patterns of change over time.

In employing FDA we resolved several challenges to analysis. Specifically, we followed Stewart et al. (2006) in addressing these issues, and also in our selection of parameters for creating the functional forms. Briefly, to eliminate difficulties associated with the staggered starting points of projects, we aligned them according to their first software release. To address the challenge posed by the varied number and timing of software releases, we assigned each project a level of complexity for every day in the one year following the initial release. This value was equal to the complexity calculated for the most recent software release for that project.

Following Jank and Shmueli (2008), K-medoids clustering was used to identify different patterns of evolution for size and for structural complexity. Because there is no theoretical reason to expect a particular number of clusters, we began with a single cluster and then clusters were added until no new insight was gained by adding an additional cluster (Kluyver and Whitlark 1986). After creating clusters, the General Linear Model procedure (GLM) in SPSS 16.0 was used to confirm that the clustering resulted in clusters that were significantly different from one another. We then examined the assignment of projects to categories representing a size cluster and a structural complexity cluster. For categories with few projects, we explored collapsing them and performed statistical analyses to investigate the result of combining some categories.
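The alignment, daily fill-in, and clustering steps described in this section can be sketched on toy data as follows. This is a simplified illustration rather than the procedure actually run for the study: it skips the smoothing into functional forms that FDA performs, it divides each series by its first value as one assumed way of removing differences in absolute level, it uses an exhaustive search for the medoids that is only practical for small samples, and all project names and release histories are invented.

```python
import bisect
import math
from itertools import combinations

def daily_curve(releases, horizon=365):
    """Turn (release_day, complexity) pairs into one value per day.

    Days are re-based so the first release falls on day 0, and each day
    carries the value of the most recent release (a step function).
    Dividing by the first value is an ASSUMPTION of this sketch.
    """
    releases = sorted(releases)
    start, first_value = releases[0]
    days = [day - start for day, _ in releases]
    values = [value / first_value for _, value in releases]
    return [values[bisect.bisect_right(days, day) - 1] for day in range(horizon)]

def distance(a, b):
    """Euclidean distance between two equally long daily curves."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_medoids(curves, k):
    """Exhaustive k-medoids: try every set of k curves as medoids and keep
    the set with the smallest total distance to the nearest medoid."""
    n = len(curves)
    dist = [[distance(curves[i], curves[j]) for j in range(n)] for i in range(n)]
    best_cost, best_medoids = None, None
    for medoids in combinations(range(n), k):
        cost = sum(min(dist[i][m] for m in medoids) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_cost, best_medoids = cost, medoids
    # Each curve is labelled with the index of its nearest medoid.
    labels = [min(best_medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return best_medoids, labels, best_cost

# Invented release histories: (day of release, size in lines of code).
projects = {
    "grow_a": [(0, 100), (90, 150), (200, 220)],
    "grow_b": [(30, 50), (150, 80), (280, 110)],
    "flat_a": [(0, 300), (180, 303)],
    "flat_b": [(10, 40), (70, 40)],
}
names = list(projects)
curves = [daily_curve(projects[name]) for name in names]

# Start with one cluster and add clusters, watching how the total distance falls.
costs = {k: k_medoids(curves, k)[2] for k in (1, 2, 3)}
medoids, labels, _ = k_medoids(curves, 2)
for name, label in zip(names, labels):
    print(name, "-> cluster around", names[label])
```

In this toy case the two growing projects land in one cluster and the two flat projects in the other, even though their absolute sizes and release dates differ. The falling total distance is only a numerical aid; deciding when an extra cluster stops adding insight remains a judgment call, as the text notes.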

3.4 Archival Data Analyses

After generating categories based on complexity evolution patterns, we sought greater understanding of the differences among categories by conducting an analysis of data available from the SourceForge archives. This included both quantitative and qualitative data. Quantitative data were drawn from the SourceForge Research Data Archive, an archival database of OSS projects provided by SourceForge to the University of Notre Dame (van Antwerp and Madey 2008). SourceForge provided resources to manage the development process, such as tracking bug reports and downloads, managing CVS activity, and space to create a website. We analyzed the quantitative data for each project as of 2006. This represented a minimum of 2 years after the end of the one-year observation period used to assess the evolutionary patterns, allowing us to observe some indicators of the later success of projects that could be associated with their early evolution. At this point one project had been deleted from the SourceForge archive, so the archival data analysis covers 75 projects. Following prior work that has identified different types of success in OSS (Crowston et al. 2006; Stewart et al. 2006a), we focused on multiple indicators. These included survival (releases after the first year), development success (CVS commits, number of developers), and user interest (downloads).

To build on the quantitative data analyses, qualitative data were collected during May 2008 by examining the web space SourceForge provided for each project (e.g., http://pocketwarrior.SourceForge.net/). We used a grounded approach, conducting iterative rounds of coding to inductively determine relevant categories of information presented on the websites (King et al. 1994; Straus 1987). In the first round of coding, one author reviewed all websites, determined whether each website was in use, and created a summary of the information found on each site. In the second round, this information was reviewed by the other two authors, who also explored several project websites in order to arrive at an initial set of information categories found on the websites. In the final round of coding, each website was reviewed again to apply the inductively generated coding schema and code which types of information were present on each site. At this point all three authors independently reviewed the data for all categories of projects in order to assess commonalities and differences across them. Finally, the three authors discussed these independent assessments to converge on a common understanding of the categories, leveraging both the qualitative and quantitative data.

4 Results

Descriptive statistics are presented first. We then describe the clusters based on the evolutionary patterns of size, followed by a summary of the clusters based on the evolutionary patterns of structural complexity. Next, we unite these two sets of clusters to create six categories encompassing patterns of both size and structural complexity evolution. Finally, we present results of the analyses of the additional archival data to describe each of the six categories in more detail.

Descriptive statistics based on the software releases for the sample of projects can be found in the second column of Table 1. The average active life across the 76 projects is 161 days. Projects with multiple releases all on a single day had an active life of 0. On average, the projects released 6.47 versions and released them every 36.86 days. The size of the first release was approximately 4,427 lines of code and the final release contained approximately 6,520 lines of code.
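As an illustration of how release-based descriptive statistics of this kind could be derived, the sketch below computes an active life and a mean release interval for one project. The definitions are assumptions rather than the study's stated formulas: active life is taken as the number of days between the first and last observed release, and the mean interval as active life divided by the number of gaps between releases. The function name `release_stats` and the dates are hypothetical.

```python
from datetime import date

def release_stats(release_dates):
    """Return (active_life_days, n_releases, mean_interval_days) for one project.

    Assumed definitions: active life is the span from first to last release;
    the mean interval is that span divided by the number of gaps.
    """
    dates = sorted(release_dates)
    n = len(dates)
    active_life = (dates[-1] - dates[0]).days
    mean_interval = active_life / (n - 1) if n > 1 else 0.0
    return active_life, n, mean_interval

# Three releases spread over two months (hypothetical dates).
print(release_stats([date(2003, 1, 1), date(2003, 1, 31), date(2003, 3, 2)]))

# All releases on a single day give an active life of 0.
print(release_stats([date(2003, 5, 5), date(2003, 5, 5)]))
```

Sample-level figures would then be averages of these per-project values, which is why the mean interval across projects need not equal the mean active life divided by the mean number of gaps.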

4.1 Patterns of Evolution in Size

The functional forms representing the evolution of size for all projects are shown in Figure 1. The functional form in bold represents the average of all the projects, and the 95% confidence interval around that average is indicated by bold dashed lines. Though the individual projects are difficult to distinguish, the mean curve shows an upward trend. After clustering the functional forms and analyzing solutions for one to four clusters, we found that the three-cluster solution yielded the most explanatory power and provided the most interpretable results for the evolution of size. The thin dotted curve in Figure 2 represents the overall mean and each of the other lines represents one of the three cluster means.

The largest cluster of projects (n = 29) is represented by the solid bold line showing the largest increase in size over time. The curve shows that these projects begin with a relatively slow increase for approximately the first 50 days, then have a period of rapid increase until approximately day 200, after which the rate of increase again slows down. We refer to the projects in this cluster as “high growth” projects. The next largest cluster (n = 28) is represented by the dashed curve showing an increase during approximately the first 100 days, followed by a lack of change during the remainder of the observation period. We refer to this as the “low growth” cluster. The smallest cluster (n = 19) is represented by the relatively flat dotted line. Initially the flat line was surprising, given that we sampled only projects with multiple releases. Examining the curves in this cluster revealed that most of these projects included both increases and decreases during the observed year, resulting in a cluster average that appears relatively flat. We thus refer to this as the “fluctuating size” cluster. The clusters are visually distinguished by changes in size and by active life.
To confirm that the clustering does represent actual differences on these dimensions, we used the general linear model (GLM) procedure. The GLM showed that the three clusters are distinguished mainly by the percentage change in size (log transformed to minimize deviations from normality) and the active life across clusters. Both effects were significant at p