How to Merge Program Texts

WUU YANG
National Chiao-Tung University


Software usually exists in multiple versions, all of which must evolve in parallel. We propose a program-merging system that helps the programmer manage the evolution of all the versions; such a system is much needed in software development and maintenance environments. The system can either combine two programs or combine changes to a base program. It consists of three stages: a syntax-based comparator, a synchronous printer, and a merging editor. Based on the differences between the two programs that the syntax-based comparator identifies, the synchronous printer combines the texts of the two programs. Since there may be conflicts between the two programs, the merging editor provides the user with commands to resolve them. Another distinctive feature of the merging system is the generator approach to producing syntactic program comparators for new programming languages.

Additional Key Words and Phrases: merging editor, program evolution, synchronous printer, syntax-based comparator


1. INTRODUCTION

Software systems usually evolve along divergent lines of development and maintenance. We need tools that help the programmer manage the evolution of the software versions. Configuration management and version control systems solve part of the problem. However, there is no effective and convenient tool that helps the programmer directly merge the source code of a system. A tool that helps programmers merge program texts is, thus, very useful.

This work was supported in part by National Science Council, Taiwan, R.O.C. under grant NSC 82-0113-E-009-265-T. Author's address: Computer and Information Science Department, National Chiao-Tung University, HsinChu, Taiwan, R.O.C. Copyright © 1993 by Wuu Yang. All rights reserved.


The UNIX utility sdiff is such a merging tool. It takes two files as arguments, compares them line by line, and lists the two files side by side. Special characters such as <, >, and | are used to indicate the differences. Users can issue simple commands to reconcile the differences manually. In this paper, we introduce a more advanced merging system that is characterized by three aspects: (1) it compares programs in a way that corresponds more closely to the syntactic structure of the programs; (2) it utilizes advanced display hardware to highlight the differences in colors; (3) it is built on top of an existing popular editor so that the merging system can benefit the most people while requiring the least amount of learning effort.

Figure 1 depicts a sample editing session. Suppose that a user wants to merge two versions of a program, A and B. A comparison utility compares the two versions and produces an intermediate merged file. The merging editor displays the intermediate merged file as shown in Figure 1 (c). Since it is not possible to display colors in the printed document, we use single and double boxes to indicate that the differences are highlighted in colors. Text in the single boxes is highlighted in one color (e.g. green) on the display, which indicates that the text appears only in the first version. Similarly, text in the double boxes is highlighted in a different color (e.g. yellow) to indicate that the text appears only in the second version. Text not enclosed in any box, which is displayed in the normal color (e.g. black), is the common part of both versions. (We will discuss Figure 1 (d) in a later section.)

One of the fundamental components of a merging system is a utility that can identify the differences between two programs. The differences may be captured by comparing the programs line by line, resulting in what we call textual difference. Textual difference is a crude comparison; a user needs to spend more time understanding the differences and performing the merging task. On the other hand, since the program text has a rigid syntactic structure, it is possible to compare the programs based on that structure to produce a more accurate comparison. We call this second approach syntactic difference. Syntactic difference is more meaningful because the syntax of the programming language is used during comparison.
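To make the contrast concrete, the following Python sketch compares the first two lines of the programs in Figure 1 twice: once line by line and once token by token. It is only an illustration under stated assumptions. The token-level comparison merely stands in for a syntactic comparison (no parse tree is built), the tokenizer is a crude approximation of a C lexer, and the names changed_units and tokens are hypothetical. The comparison itself uses difflib.SequenceMatcher from the Python standard library.

```python
import difflib
import re


def changed_units(old, new):
    """Return (deleted, inserted) units found by a sequence comparison."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deleted.extend(old[i1:i2])
            inserted.extend(new[j1:j2])
    return deleted, inserted


def tokens(text):
    """Crude C-like tokenizer (an assumption): names, numbers, punctuation."""
    return re.findall(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S", text)


program_a = "int find(kind, key)\nint kind; char *key;\n"
program_b = "char find(kind, key)\nint kind; char **key;\n"

# Textual difference: every line that differs at all is reported whole.
line_diff = changed_units(program_a.splitlines(), program_b.splitlines())

# Token-level difference: only the tokens that actually differ are reported.
token_diff = changed_units(tokens(program_a), tokens(program_b))
```

In this example the line-based comparison reports both lines of each program as changed, whereas the token-based comparison reports only that int was replaced by char and that one * was inserted.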

A syntactic comparison tool has been implemented and reported in [Yang1991a].

The differences between two versions of a program sometimes conflict with each other. To resolve these conflicts, we need a friendly and versatile merging editor. The user interface of sdiff is quite simple. Advanced hardware, such as color displays, makes it possible to build a more intuitive user interface and to incorporate a new set of merging commands in the editor.


(a) Program A

    int find(kind, key)
    int kind; char *key;
    {
    float utime, stime;
    float total;
    FILE *fpi;
    total = utime * 3.0 + stime;
    } /* end of find */

(b) Program B

    char find(kind, key)
    int kind; char **key;
    {
    float utime;
    float difference;
    FILE *fpi, *fpo;
    difference = utime - stime;
    } /* end of find */

(c) Intermediate merged file

    int char find(kind, key)
    int kind; char * * key;
    {
    float utime , stime ;
    float total difference ;
    FILE *fpi , *fpo ;
    total difference = utime * 3.0 + - stime;
    } /* end of find */

(d) After the fifth line is split

    int char find(kind, key)
    int kind; char * * key;
    {
    float utime , stime ;
    float total; float difference;
    FILE *fpi , *fpo ;
    total difference = utime * 3.0 + - stime;
    } /* end of find */

Figure 1. An example illustrating the merging editor.

In this paper we present the design of a syntax-based merging system. The system consists of three stages: a syntax-based comparator, a synchronous printer, and a merging editor. The syntax-based comparator and the synchronous printer together form a syntactic comparison utility, which identifies the differences and produces an intermediate merged file. The merging editor provides a set of commands to manipulate the intermediate merged file.

The merging system proposed in this paper still works on a file-by-file basis. Furthermore, the system understands only the syntax of the programming languages. An ideal merging system should merge entire software systems at once, rather than individual modules, and should have as much semantic knowledge of the programming languages as possible [Horwitz1989a]; nevertheless, we think that our merging system is a practical first step toward that goal.

The remainder of this paper consists of four sections. Section 2 discusses the syntactic program-comparison utility. Section 3 presents the merging editor and several of its design alternatives. Section 4 discusses three-way merging, an interesting extension to two-way merging. The final section describes related work.

2. A SYNTACTIC PROGRAM-COMPARISON UTILITY

The program-comparison utility consists of three steps: first, a parser transforms the program texts into parse trees; second, a tree-matching algorithm matches the nodes of the trees; finally, a synchronous printer traverses the parse trees and produces an intermediate merged file that is used for editing.

The parse tree is produced by a parser, which may be generated by yacc [Johnson1979a]. The parse trees are slightly different from the traditional parse trees [Aho1986a] in that a homogeneous sequence of a structure, such as a sequence of statements, is represented as a flattened tree rather than a skewed tree. The flattened-tree representation for a homogeneous sequence of a structure is designed to fit the tree-matching algorithm, which is able to detect the insertion and deletion of one or more elements in a homogeneous sequence.

The text of a program can be classified into three categories: (1) grammatical elements, such as a variable name or a semicolon; (2) extra-grammatical elements, such as comments, preprocessor commands in a C program, and pragmas in an Ada program; and (3) glues, which are the white spaces between the grammatical and extra-grammatical elements. The grammatical elements constitute the program. Comments are an indispensable part of the program text; C's preprocessor commands and Ada's pragmas affect the meaning of the program.
The glues make the syntactic structure of the program explicit. When we compare two programs, only the grammatical and extra-grammatical elements are compared. Glues are not compared because the syntactic structure of the programs is represented by the nonterminal nodes in the parse trees. In the intermediate merged file, glues may be generated as in a pretty printer or they may be copied from the input programs.

The nodes of the parse tree are divided into two disjoint sets: the terminal nodes and the nonterminal nodes. A terminal node contains a grammatical element or an extra-grammatical element. A nonterminal node contains a nonterminal symbol that denotes a syntactic structure in the program, such as an expression. Two nodes can match only if they contain identical or similar grammatical elements, extra-grammatical elements, or nonterminal symbols.

The tree-matching algorithm is based on a dynamic programming scheme. The algorithm attempts to match the maximum number of nodes in the two trees under the restriction that the parent-child relationship and the ordering among siblings are respected. That is, the following two conditions are satisfied:

(1) Two nodes can match only if their parents match.
(2) Suppose v1 matches v2, w1 matches w2, v1 and w1 are siblings, and v2 and w2 are siblings. Then v1 comes before w1 if and only if v2 comes before w2.

Note that nodes on different levels of the trees cannot match and that only insertions and deletions of nodes or subtrees can be detected by the matching algorithm. No movement of nodes or subtrees can be detected. The two conditions significantly reduce the running time of the tree-matching algorithm.

A node of a tree that does not have a corresponding node in the second tree is considered to be absent from the second tree and hence is considered to be a difference between the trees. A node of a tree that has a corresponding node in the second tree but that contains a different symbol than the corresponding node is considered to have been changed; this node is also considered to be a difference.

After the matching algorithm establishes a correspondence between the two trees, the synchronous printer traverses the trees and produces an intermediate merged file. The synchronous printer, like a traditional pretty printer [Oppen1980a, Rubin1983a], walks through the trees, prints out the grammatical and extra-grammatical symbols as the nodes containing the symbols are visited, and inserts appropriate glues. By "synchronous", we mean that two trees are traversed at the same time. The traversals are arranged in such a way that corresponding nodes are always visited at the same time. Special markers in the merged file are used to indicate differences.
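The matching conditions and the synchronous traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation reported in [Yang1991a]: nodes match only when their symbols are identical (the "similar" case is omitted, so a changed token simply appears once for each program), glue generation is left out, and the names Node, match, leaves, merge_tokens, and synchronous_print are hypothetical. Zone 0 denotes common text, zone 1 text found only in the first tree, and zone 2 text found only in the second tree.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # eq=False: nodes are distinguished by identity
class Node:
    symbol: str  # token text (leaf) or nonterminal name
    children: list = field(default_factory=list)


def match(a, b):
    """Return (number of matched nodes, list of matched (a, b) pairs)."""
    if a.symbol != b.symbol:
        return 0, []
    m, n = len(a.children), len(b.children)
    # best[i][j]: best result for the first i children of a and first j of b.
    best = [[(0, [])] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w, sub = match(a.children[i - 1], b.children[j - 1])
            score, pairs = best[i - 1][j - 1]
            best[i][j] = max(
                best[i - 1][j], best[i][j - 1], (score + w, pairs + sub),
                key=lambda cell: cell[0],
            )
    score, pairs = best[m][n]
    return score + 1, [(a, b)] + pairs


def leaves(node):
    """Yield the token symbols of a subtree from left to right."""
    if not node.children:
        yield node.symbol
    for child in node.children:
        yield from leaves(child)


def merge_tokens(a, b, matched):
    """Walk two matched nodes together, yielding (zone, token) pairs."""
    if not a.children and not b.children:
        yield 0, a.symbol
        return
    i = j = 0
    while i < len(a.children) or j < len(b.children):
        if i < len(a.children) and id(a.children[i]) not in matched:
            yield from ((1, t) for t in leaves(a.children[i]))
            i += 1
        elif j < len(b.children) and id(b.children[j]) not in matched:
            yield from ((2, t) for t in leaves(b.children[j]))
            j += 1
        else:  # both children are matched, necessarily with each other
            yield from merge_tokens(a.children[i], b.children[j], matched)
            i += 1
            j += 1


def synchronous_print(a, b):
    """Return the merged token stream of two trees as (zone, token) pairs."""
    _, pairs = match(a, b)
    if not pairs:  # the roots differ, so nothing is common
        return [(1, t) for t in leaves(a)] + [(2, t) for t in leaves(b)]
    matched = {id(node) for pair in pairs for node in pair}
    return list(merge_tokens(a, b, matched))


# Two flattened statement sequences; the second inserts a call and changes b to c.
tree_a = Node("stmts", [
    Node("assign", [Node("x"), Node("="), Node("y")]),
    Node("assign", [Node("a"), Node("="), Node("b")]),
])
tree_b = Node("stmts", [
    Node("assign", [Node("x"), Node("="), Node("y")]),
    Node("call", [Node("f")]),
    Node("assign", [Node("a"), Node("="), Node("c")]),
])
merged = synchronous_print(tree_a, tree_b)
```

Because children are compared only under matched parents and in sibling order, the dynamic program at each node is the familiar longest-common-subsequence table, weighted by the size of the match below each pair of children.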
The merging editor can recognize the markers and highlight the differences with colors.

In addition to the grammatical and extra-grammatical symbols, the synchronous printer also needs to produce glues between the symbols. There are three ways to generate glues. First, the glues may be produced according to a fixed format. This approach is used in simple pretty printers. Second, the glues may be produced in a customizable format. The user may customize the output format by inserting formatting commands in the grammar. The formatting commands are translated into appropriate actions for the synchronous printer. The third approach is to copy glues from the input programs. The merged file produced by the third approach looks most similar to the input files. After experimenting with all three methods, we decided to adopt the third approach. The decision is based on the belief that users may prefer to work on a merged file that is as similar to the original programs as possible.

The syntactic comparison utility is characterized by two aspects. First, it produces a fine-grained comparison. Diff produces a line-by-line comparison of two files. In contrast, the unit of syntactic comparison is the individual token. The fine-grained comparison pinpoints the differences more accurately. The second characteristic of the syntactic comparison utility is that the differences correspond to the syntactic structure of the programs more closely. For example, suppose that a user wants to compare the following two program fragments:

    while (p) {            while (p) {
      x = y + z;             x = y + z;
      a = b + c;           }
    }                      while (q) {
                             a = b + c;
                           }

The while statement on the left can match either the first while statement or the second while statement on the right, but not both. Although the shortest edit that changes the left fragment into the right fragment would be the insertion of two lines (the third and the fourth lines of the right fragment), our goal is not to find the minimum editing distance. Rather, we attempt to find the minimum syntactic distance (or equivalently, the maximum syntactic similarity). In the above example, we will match the while statement on the left to the first while statement on the right and also match the assignments to x. The assignments to a will not be matched. The difference between the two program fragments is that the assignment statement to a is deleted from the left fragment and the second while loop is added.

One weakness of the syntactic comparison utility is that it can only be applied to a specific programming language. To remedy this weakness, we have re-designed the syntactic comparison utility so that the language-dependent modules (the lexical analyzer and the parser) and the language-independent modules (the matcher and the synchronous printer) are carefully separated. To produce a syntactic comparison utility for a new programming language, a user simply supplies the lexical analyzer and the parser that build the parse trees in a form that is consistent with the one used in the matcher and the synchronous printer. To provide further help, we have built an auxiliary utility that transforms a plain yacc [Johnson1979a] specification of a parser into one with the appropriate statements to build the parse trees. A new syntactic comparison utility can, thus, be generated automatically from a lexical analyzer and a plain yacc specification. Using this generator approach, we have successfully built a syntactic comparison utility for yacc specification files (a yacc specification can itself be regarded as a program in a programming language).

3. THE MERGING EDITOR

The merging editor works on the intermediate merged file produced by the synchronous printer. The intermediate merged file contains special markers to indicate the differences between the two programs. There are two kinds of markers: tokens that appear in the first program but not the second are enclosed in one kind of marker; tokens that appear in the second program but not the first are enclosed in the second kind of marker. The markers come in pairs, just like pairs of parentheses; the character string enclosed in a pair of markers is called a "zone." In Figure 1 (c), the text enclosed in a single or double box forms a zone. Zones do not overlap with one another. Depending on the kinds of enclosing markers, zones are highlighted in different colors. In the intermediate merged file, a zone usually consists of a grammatical symbol or an extra-grammatical symbol. After the intermediate merged file is processed by the merging editor, zones may be deleted and adjacent zones of the same color may be combined into a larger zone.

The merging editor recognizes the zones and displays the text in the appropriate colors. The text in the intermediate merged file that is not in any zone is displayed in the normal color, say, black. The text enclosed in the first kind of marker is displayed in a distinct color, say, green. The text enclosed in the second kind of marker is displayed in a different color, say, yellow. It is also possible to use different background colors, rather than foreground colors, to highlight the differences. We use four colors in the display: three foreground colors and one background color. Based on our experiments, we believe that using too many colors causes confusion.
This is one reason why we did not attempt to detect the movement of program elements, as is done in [Tichy1984a]: doing so would require more colors.

We have also experimented with different fonts rather than colors. The advantage of using fonts to highlight the differences is that they can be used with monochrome displays, which are still more popular than color displays. The disadvantages of using fonts are that (1) fonts are not as easy to recognize as colors and (2) different fonts usually have different sizes. In a text editor such as emacs, displaying fonts of different sizes in the same window seems confusing to the user. In contrast, it is easier to see differences highlighted with foreground or background colors.
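The zone structure of the intermediate merged file can be sketched as follows. The paper does not specify the marker characters, so the syntax used here ({{1: ... :1}} for text found only in the first program and {{2: ... :2}} for text found only in the second) is purely an assumption, as are the names parse_zones and project. The sketch splits a merged text into non-overlapping segments and shows that the common text plus the zones of one kind reproduce the corresponding input program.

```python
import re

# Hypothetical marker syntax; the actual markers are not given in the paper.
ZONE = re.compile(r"\{\{([12]):(.*?):\1\}\}", re.S)


def parse_zones(merged):
    """Split merged text into (zone, text) segments; zone 0 is common text."""
    segments, pos = [], 0
    for m in ZONE.finditer(merged):
        if m.start() > pos:
            segments.append((0, merged[pos:m.start()]))
        segments.append((int(m.group(1)), m.group(2)))
        pos = m.end()
    if pos < len(merged):
        segments.append((0, merged[pos:]))
    return segments


def project(segments, version):
    """Recover one input program: common text plus that version's zones."""
    return "".join(text for zone, text in segments if zone in (0, version))


merged_line = "{{1:int:1}}{{2:char:2}} find(kind, key)"
segments = parse_zones(merged_line)
first_program = project(segments, 1)
second_program = project(segments, 2)
```

An editor built on this representation would map zone 0 to the normal color and zones 1 and 2 to the two highlight colors.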


The merging editor can be either a stand-alone tool with only the merging capabilities or an extension to an existing popular editor. We adopt the latter approach for three reasons: (1) during merging, a user needs not only the commands for merging but also the commands for editing; most, if not all, editing commands are necessary; (2) users are reluctant to switch to a new editor unless the new editor is much better than the old one; and (3) the added merging capabilities do not justify the cost of building a new powerful editor from scratch. Therefore, to benefit the most people at the least cost, we believe that the best approach is to add the merging capabilities to an existing popular editor.

Because the comparison utility is based on the syntax of the programming languages, the first candidates as the basis of the merging editor are naturally the language-based editors. However, due to the lack of a language-based editor that is extensible, popular, and available to the public, we chose the emacs text editor [Stallman1981a] instead of a language-based editor. (Actually, the merging editor is built on top of the epoch editor, which is "emacs on steroid" [Kaplan1992a].) Besides its extensibility, popularity, and availability, the emacs editor is independent of programming languages. The same editor can be used to merge programs written in any programming language.

The merging commands are classified into the following seven categories:

(1) read or write files;
(2) eliminate highlighted zones;
(3) turn off highlighting of some zones; i.e., these zones will be displayed in the normal color;
(4) highlight a sequence of characters;
(5) move the cursor around;
(6) split a sequence of characters; and
(7) add new zones. We may add text to the new zones. The added text is highlighted in the color of the zone.

All other normal editing commands, such as searching, work as if there were no highlighting.

The intermediate merged file reflects the matching determined by the matcher. When the matching becomes too fragmented, a user may want to revert the matching. For instance, it is preferable not to match two assignment statements when all that matches is just the assignment sign. The split command is used to revert the undesirable matching.

An example of splitting a line is shown in Figure 1. In program A, a variable total of type float is declared. In program B, another variable difference of the same type is declared. The matching algorithm (erroneously) concludes that the user changed the name of the variable. In reality, they are two distinct variables that should not match. In this case, we use the split command to revert the unwanted matching. After splitting the fifth line, we have the situation shown in Figure 1 (d).
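The effect of the split command on the fifth line of Figure 1 (c) can be sketched as follows. The representation is an assumption made for illustration: a line is a list of (zone, text) segments, where zone 0 is common text, zone 1 is text only in program A, and zone 2 is text only in program B. The function name split is likewise hypothetical, and the glue between the two resulting zones is omitted.

```python
def split(segments):
    """Undo the matching in a run of segments.

    The result has one zone holding program A's text for the run, followed
    by one zone holding program B's text; no common text remains.
    """
    first = "".join(text for zone, text in segments if zone in (0, 1))
    second = "".join(text for zone, text in segments if zone in (0, 2))
    return [(1, first), (2, second)]


# Fifth line of Figure 1 (c): "float" and ";" are matched, the names are not.
fifth_line = [(0, "float "), (1, "total"), (2, "difference"), (0, ";")]
after_split = split(fifth_line)
```

After the split, the declaration of total lies entirely in a zone of the first kind and the declaration of difference entirely in a zone of the second kind, which corresponds to Figure 1 (d).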


The merging editor can effectively edit two versions of a file at the same time. When text is entered in a green (or yellow) zone, it will appear in the first (or second, respectively) input file. When a green (or yellow) zone is eliminated, its text is removed from the first (or second, respectively) file. The commands for adding new zones, turning highlighting on or off, and saving files also help a user to edit individual files.

After editing a program for an hour, a user might want to know what has been changed. The user can invoke the syntactic comparison utility to compare the latest version in the buffer with the original version in the disk file. The merging editor can display the differences.

4. THREE-WAY MERGING

Three-way merging is an extension to two-way merging. The intermediate merged file produced by two-way merging is a combination of the two input programs. Notice that in two-way merging, the programs that are to be merged are usually derived from the same base program. Three-way merging is useful when two users change separate copies of the same source file independently and later their changes, not their programs, need to be merged together. The intermediate merged file produced by three-way merging is the base program plus the changes made by the two users. The difference between two-way merging and three-way merging, which is illustrated in Figure 2, is that two-way merging simply merges the two programs while three-way merging merges the two programs with respect to a common base program.

(a) Two-way merging: A and B are merged into A+B.
(b) Three-way merging: A and B, both derived from Base, are merged into Base + ∆A + ∆B.

Figure 2. Two-way merging and three-way merging.

Let us call the original file Base and the changed copies A and B. A and B are derived independently from Base. A comparison between Base and A shows the changes made by the first user; a comparison between Base and B shows the changes made by the second user. The objective of three-way merging is to merge the changes together.

When we perform three-way merging, correspondences are made between Base and A and between Base and B. The composition of the two correspondences is a correspondence between A and B. After a correspondence between A and B is established indirectly through comparisons with Base, the synchronous printer generates an intermediate merged file from A and B.

5. RELATED WORK

PEDIT [Kruskal1984a] is similar to the merging editor discussed in this paper. PEDIT stores multiple versions of data (not necessarily programs) in the same file, in which each line is tagged with a boolean expression to indicate which version(s) the line belongs to. Since PEDIT does not include a comparison capability, it cannot merge two separate files. In contrast, the merging editor assumes that different versions of a program are stored in different files and performs a comparison to establish the correspondence between the files before any editing operations are applied. The second difference is that PEDIT displays one file at a time, although it can swap the displays of different files. By contrast, the merging editor, in principle, can also manipulate many versions of a program during the same editing session (an inexpensive color display can easily display 256 colors). All the files are displayed in the same window; there is no need to swap displays. However, the user tends to be confused when there are too many colors on the display.

Horwitz [Horwitz1990a] proposes a set of methods for identifying the semantic and textual differences between two versions of a program.
Her method requires a preprocessing step that can (conservatively) determine the equivalence classes of program components that have equivalent execution behaviors. Based on the equivalence classes, Horwitz’s method proceeds to pair components under various optimization criteria, such as maximizing the number of pairs that have equivalent behaviors and texts, maximizing the number of pairs as well as the number of dependence edges between components, etc. In contrast, the comparison and merging system in this paper is based solely on programs’ texts and syntactic structures.


The UNIX utility diff3 also performs a three-way comparison, similar in function to the three-way merging discussed in Section 4. The difference is that diff3 performs a line-by-line comparison. The revision control system RCS [Tichy1982a] contains a facility for merging files. This facility is based on diff.

Fraser and Myers [Fraser1987a] propose an editor for revision control. The editor automatically builds revision trees and uses the same editor command language to edit both text and revision trees. This editor differs from the merging editor discussed in this paper in that it is aimed at helping users manage revisions, whereas the main function of the merging editor is to merge several revisions.

REFERENCES

Aho1986a. Aho, A.V., R. Sethi, and J.D. Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, Reading, MA (1986).

Fraser1987a. Fraser, C.W. and E.W. Myers, An editor for revision control, ACM Trans. Programming Languages and Systems 9(2) pp. 277-295 (April 1987).

Horwitz1989a. Horwitz, S., J. Prins, and T. Reps, Integrating non-interfering versions of programs, ACM Trans. Programming Languages and Systems 11(3) pp. 345-387 (July 1989).

Horwitz1990a. Horwitz, S., Identifying the semantic and textual differences between two versions of a program, Proceedings of the SIGPLAN 90 Conference on Programming Language Design and Implementation, (White Plains, New York, June 20-22, 1990), ACM SIGPLAN Notices 25(6) pp. 234-245 (June 1990).

Johnson1979a. Johnson, S.C., YACC: Yet another compiler compiler, CSTR 35, Bell Labs., Murray Hill, N.J. (1979).

Kaplan1992a. Kaplan, S., C. Love, A.M. Carroll, and D.M. LaLiberte, Epoch User Manual, Computer Sciences Dept., Univ. of Illinois, Urbana, IL (March 1992).

Kruskal1984a. Kruskal, V., Managing multi-version programs with an editor, IBM J. Res. Develop. 28(1) pp. 74-81 (January 1984).

Oppen1980a. Oppen, D.C., Prettyprinting, ACM Trans. Programming Languages and Systems 2(4) pp. 465-483 (October 1980).

Rubin1983a. Rubin, L.F., Syntax-directed pretty printing: A first step towards a syntax-directed editor, IEEE Trans. Software Engineering SE-9(2) pp. 119-127 (March 1983).

Stallman1981a. Stallman, R.M., EMACS: The extensible, customizable self-documenting display editor, Proceedings of the 6th SIGPLAN/SIGOA Symposium on Text Manipulation, (Portland, Oregon, June 8-10, 1981), ACM SIGPLAN Notices 16(6) pp. 147-156 (June 1981).

Tichy1982a. Tichy, W.F., Design, implementation, and evaluation of a revision control system, pp. 58-67 in Proceedings of the 6th International Conference on Software Engineering, (Tokyo, Japan, September 13-16, 1982), IEEE, New York (1982).

Tichy1984a. Tichy, W.F., The string-to-string correction problem with block moves, ACM Trans. Computer Systems 2(4) pp. 309-321 (November 1984).

Yang1991a. Yang, W., Identifying syntactic differences between two programs, Software: Practice and Experience 21(7) pp. 739-755 (July 1991).