Recovering Intent of Code from RPG Legacy Source

International Journal of Software Engineering and Its Applications Vol.8, No.3 (2014), pp.291-304 http://dx.doi.org/10.14257/ijseia.2014.8.3.27 Recov...
Author: Michael Spencer

Kochaporn Suntiparakoo and Yachai Limpiyakorn
Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, Thailand
[email protected], [email protected]

Abstract

Legacy software can be characterized as old software that continues to provide core services to an organization. Applications written in RPG can be considered legacy software. RPG originated as a report-building program developed by IBM. Many business applications are written in RPG, and they are often critical to the operations of enterprises. After decades of use, these RPG legacy systems can be hard to maintain, improve, and expand, since there is a general lack of understanding of the systems. The supporting documentation may also be out of date because of the many changes made to the software. This paper thus presents a reverse engineering method for recovering the intent of code from RPG legacy source. Metadata is gathered from the input RPG source by detecting and handling the program controls and operations. The metadata, stored in a directed graph, is then mapped to the DOT markup language format for flowchart rendering with the visualization tool Graphviz. The prototype implemented in this work would facilitate the understanding of RPG legacy code during the software maintenance process.

Keywords: reverse engineering, legacy system, RPG language, software maintenance, metadata.

1. Introduction

Legacy software can be characterized as old software that is still performing a useful job. Through years of use, users have become familiar with the look and feel of the system and are reluctant to change. This archaic code has been developed and maintained for many years and has become an integral part of the business environment. A replacement system may not fulfill the business requirements, and the investment may be prohibitive.

RPG (Report Program Generator) [1] is a programming language developed by IBM in 1959. The language has since evolved from RPG to RPG II, RPG III, RPG/400, and RPG IV, the latest version, which provides a modern programming environment. RPG originated as a report-building program used in DEC and IBM minicomputer operating systems, and evolved into a fully procedural programming language. Software developed with the RPG language (except RPG IV) can be considered legacy software. Nevertheless, many business applications are written in RPG, and they are often critical to the operations of enterprises, for example software used in commercial banking and production line control.

After decades of use, these RPG legacy systems can be hard to maintain, improve, and expand, since there is a general lack of understanding of the system. The developers who were experts on it have retired or forgotten what they knew about it. The situation can be worsened by lost or outdated documentation. One study reported that organizations have spent 20% to 70% of computing effort on

ISSN: 1738-9984 IJSEIA Copyright ⓒ 2014 SERSC


maintenance tasks [2]. In the literature, Vasudevan et al. [3] presented an image processing approach to extracting flowchart information from digital imagery. The extracted components are output as machine-readable metadata in XML format. This paper presents a method of recovering the intent of code from RPG legacy source and presenting it as a flowchart, a schematic representation that illustrates the program controls and operations. The flowcharts can then be used as program specification documents to support software maintenance activities.

The remainder of this paper is organized as follows. Section 2 briefly introduces the RPG/400 programming language. Section 3 describes the proposed method of recovering the intent of code from RPG/400 source. Section 4 demonstrates the creation of flowcharts from an example input source using the prototype implemented in this work. Section 5 presents the conclusion and future research work.

2. RPG (Report Program Generator) [1]

2.1. RPG Specifications

The RPG/400 programming language is the focus of this work. RPG is a structured programming language. Programmers must pay attention to the column position of the code when writing RPG statements. An RPG/400 program is composed of seven specifications, which must appear in the following sequence:

- Control Specification (H) provides information about the program.
- File Description Specification (F) defines all files in the program.
- Extension Specification (E) describes arrays and tables.
- Line Counter Specification (L) indicates the length of overflow lines.
- Input Specification (I) describes data structures, named constants, records, and fields in the input files, and indicates how the records and fields are used by the program.
- Calculation Specification (C) describes the program computations and indicates the order in which they are done. Calculation Specifications can control certain input and output operations.
- Output Specification (O) describes the records and fields, and indicates when they are to be written by the program.

RPG programs typically start with the File Description Specification, listing all files being written to, read from, or updated. This is followed by the Extension Specification, containing program elements such as data structures and dimensional arrays, and then by the Calculation Specification, which is the computation part, including record matching to generate reports from data files. Finally, the Output Specification may follow to determine the layout of other files or reports.

The Calculation Specification is the major part that contains the intent of code of RPG applications. It indicates the operations to be carried out on the data, as shown in Figure 1. The purpose of the example program is to iteratively read the status (F0RSTS) from file BH002P and use it to update the status (ATCSTS) in file BN001P whenever a record with an identical model (F0RTMD), lot (F0RLOT), and unit (F0RUNT), which together comprise the file key, is found. If no such record is found, the values of status, model, lot, and unit are inserted as a new record into file BN001P.

There are two general rules for writing calculation entries: 1) each operation is specified on one line, and 2) calculations must be grouped in the following order: detail calculations, total calculations, and subroutines. The grammar of Calculation Specification statements is described in Table 1.
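As an illustration of how the specification types can be told apart mechanically, the following minimal Python sketch groups source lines by their form-type letter. It assumes that the letter sits in position 6 of every specification line (as Table 1 states for calculation specifications) and that an asterisk in position 7 marks a comment line. The function name `chunk_by_specification` and the sample lines are hypothetical and are not taken from the prototype.

```python
from collections import defaultdict

# Form-type letters of the seven RPG/400 specifications, in their required order.
SPEC_ORDER = "HFELICO"


def chunk_by_specification(lines):
    """Group source lines by the form-type letter in position 6 (index 5)."""
    chunks = defaultdict(list)
    for line in lines:
        padded = line.rstrip("\n").ljust(80)
        form_type = padded[5].upper()
        # Skip comment lines (asterisk in position 7) and unrecognised lines.
        if padded[6] == "*" or form_type not in SPEC_ORDER:
            continue
        chunks[form_type].append(padded)
    return dict(chunks)


# Illustrative input only: one file description line, one comment, one calculation line.
sample = [
    "     FBH002P  IF  E           K        DISK",
    "     C*  a comment line",
    "     C           KEY01     KLIST",
]
chunks = chunk_by_specification(sample)
print(sorted(chunks))
```

Grouping by a single column keeps the sketch independent of the contents of each specification; separating subroutines inside the calculation chunk would need an additional pass.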


Figure 1. Example RPG Source

2.2. Control Structure

The controls in RPG/400 consist of Sequential, Conditional Branching, and Repeating based on a certain condition. The following operations and structures are available within each control:

- Sequential operation
- Conditional Branching
  - If else Structure
  - Select Structure
  - GOTO operation
  - Execute subroutine
  - CABXX (Compare and Branch) operation
  - CASXX (Conditionally Invoke Subroutine) operation
- Repeating based on a certain condition
  - Do operation
  - Do while operation
  - Do until operation


Table 1. Grammar of Calculation Specification Statements

Position | Argument Type | Description
6 | Form Type | A ‘C’ must appear in position 6 to identify this line as a calculation specification statement.
7 | Comment | An ‘*’ must appear in position 7 to identify this line as a comment line.
9-17 | Indicator | Positions 10 and 11, 13 and 14, and 16 and 17 contain indicators that are tested to determine whether a particular calculation is to be processed. A blank in positions 9, 12, and 15 designates that the associated indicator must be on for the calculation to be done. An N in positions 9, 12, and 15 designates that the associated indicator must be off for the calculation to be done.
18-27 | Factor 1 | The entries that are valid for factor 1 depend on the operation code specified in positions 28 through 32.
28-32 | Operation Code | Operation to be executed using factor 1, factor 2, and the result field entries. Note: Example operation codes are listed in Table 2.
33-42 | Factor 2 | The entries that are valid for factor 2 depend on the operation code specified in positions 28 through 32. For the file operation codes, factor 2 names a file or record format to be used.
43-48 | Result Field | The result field names the field that contains the result of the calculation operation specified in positions 28 through 32.
49-51 | Field Length | Specifies the length of the result field. This entry is optional, but can be used to define a field not defined elsewhere in the program.
52 | Decimal Position | Position 52 indicates the number of positions to the right of the decimal point in a numeric result field. If the numeric result field contains no decimal positions, enter a ‘0’ (zero). This position must be blank if the result field is character data.
53 | Operation Extender | The operation extenders are single-character entries that provide additional attributes to the operations that they accompany.
54-59 | Resulting Indicators | These positions can be used, for example, to test the value of a result field after the completion of an operation, or to indicate an end-of-file, error, or record-not-found condition. The resulting indicator positions have different uses, depending on the operation code specified.
60-74 | Comment | In-line comment that is displayed in the printout.
75-80 | Comment | In-line comment that is not displayed in the printout.
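Because the calculation specification is column-oriented, a statement can be split into its arguments by slicing fixed positions. The following Python sketch uses the column ranges listed in Table 1; the names `C_SPEC_FIELDS` and `parse_c_spec`, and the sample statement, are assumptions made for illustration and do not describe the prototype's actual implementation.

```python
# Column ranges (1-based, inclusive) as listed in Table 1.
C_SPEC_FIELDS = {
    "indicators": (9, 17),
    "factor1": (18, 27),
    "opcode": (28, 32),
    "factor2": (33, 42),
    "result": (43, 48),
    "length": (49, 51),
    "decimals": (52, 52),
    "extender": (53, 53),
    "resulting_indicators": (54, 59),
    "comment": (60, 74),
}


def parse_c_spec(line):
    """Return the arguments of a calculation line, or None if it is not one."""
    padded = line.rstrip("\n").ljust(80)
    # Position 6 must hold 'C'; an asterisk in position 7 marks a comment line.
    if padded[5].upper() != "C" or padded[6] == "*":
        return None
    return {
        name: padded[start - 1:end].strip()
        for name, (start, end) in C_SPEC_FIELDS.items()
    }


# Hypothetical statement, assembled column by column to keep positions exact.
sample = (
    "     C"                # positions 1-6: form type
    + " " * 11              # positions 7-17: no comment flag, no conditioning indicators
    + "KEY01".ljust(10)     # positions 18-27: factor 1
    + "CHAIN".ljust(5)      # positions 28-32: operation code
    + "BN001R".ljust(10)    # positions 33-42: factor 2
    + " " * 11              # positions 43-53: result, length, decimals, extender
    + "91"                  # positions 54-55: first resulting indicator
)
parsed = parse_c_spec(sample)
print(parsed["opcode"], parsed["factor1"], parsed["factor2"])
```

Padding each line to 80 characters before slicing avoids index errors on short lines, which are common when trailing blanks have been trimmed from the source file.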

2.3. Indicator

A major strength of RPG is known as the program cycle, that is, execution within an implied conditional branching that uses “indicators”, a set of logical variables numbered 01–99 for user-defined purposes. An indicator, or switch, has two states: the value ‘0’ denotes ‘off’, and the value ‘1’ denotes ‘on’. Whether an operation is performed is determined by the resulting state of the indicator. Figure 2 illustrates an example of using an indicator to check the result of the operation “CHAIN”, which finds a record in a file. If the search value is found in the PRODB file, indicator 91 will be ‘0’; otherwise, it will be ‘1’. The operation “Z-ADD” will then be executed if the state of indicator 91 is ‘1’.
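The behaviour described for indicator 91 can be mimicked in a few lines of Python. This is only a sketch of the semantics, under the assumption that the file is modelled as a dictionary keyed by the search value; the helper `chain`, the record contents, and the variable `qty` are hypothetical.

```python
# Indicators 01-99 modelled as booleans; index 0 is unused.
indicators = [False] * 100


def chain(records, key, not_found_indicator):
    """Look up `key`; switch the indicator on when no record is found."""
    record = records.get(key)
    indicators[not_found_indicator] = record is None
    return record


prodb = {"A100": {"QTY": 5}}  # hypothetical contents of file PRODB
qty = None

chain(prodb, "B200", 91)      # key is absent, so indicator 91 goes on
if indicators[91]:            # operation conditioned on indicator 91
    qty = 0                   # stands in for the Z-ADD operation
```

The point of the sketch is that the condition is not written next to the lookup itself: the lookup sets a numbered switch, and a later statement tests it, which is why recovering intent requires tracking indicator state.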


Figure 2. Example of the Use of an “Indicator” in RPG/400

3. Research Methodology

Figure 3 illustrates the process of recovering the intent of code from RPG/400 source and presenting it with flowcharts. The methodology consists of six main steps, as described in the following subsections.

[Figure 3 elements: input RPG Source (.TXT); steps 1. Chunking, 2. Code analysis, 3. Generate program information, 4. Generate intent of code, 5. Transform to DOT language, 6. Render with visualization program; outputs High-level flowchart, Program information, Detailed flowchart]

Figure 3. Method of Recovering Intent of Code from RPG Source and Presenting with Flowcharts

3.1. Chunking

Initially, the input file is partitioned into seven parts based on the types of specifications. The Calculation Specification can be large, so its subroutines can be separated into further chunks. The File Description Specification and the Calculation Specification are the focus of this work.

3.2. Code Analysis

In this step, the details of program controls and operations are detected and transformed from the RPG source into a directed graph containing the metadata for constructing the flowcharts. Metadata is data about data that facilitates the discovery of relevant information [4]. Here it is machine-readable information extracted from the source code, which is analyzed and mapped to flowcharts that visualize the intent of the code. The method consists of two main tasks as follows.

3.2.1. Detect and Manage Operation Codes: The operation code in positions 28-32 is detected in order to create a node of the directed graph. Each node contains an ID (a unique number), a node label, and a type. The conversion of an operation code to a node label falls into two cases: 1) operation code without resulting indicator(s), and 2) operation code with resulting indicator(s). For an operation code without resulting indicator(s), the mapping of program operation codes to the human-understandable text used as the node label is described in Table 2. If the statement contains resulting indicator(s) in positions 54-59, the mapping of the operation code to the node label varies with the status of the resulting indicator(s), as described in Table 3. Example lines containing an operation code with resulting indicators include lines 15, 17, and 30 in Figure 1. A 2-dimensional array of indicators 0-99 is used for handling the mapping of an operation code to a node label that varies with the status of the resulting indicator(s). Examples are lines 16 and 18 in Figure 1, which are associated with node id 5 and node id 7 in Figure 4, respectively. Figure 4 illustrates the output directed graph associated with the RPG source shown in Figure 1.

3.2.2. Detect and Manage Controls: Operation codes whose type is a control structure, as listed in Section 2.2, are converted as follows.

In case of Sequential operations: a new node is created with an edge directed from the last node to the new node, and the new node becomes the last node.

In case of Conditional Branching: the start of the operation is pushed onto a stack, and when the end of the operation is found, the elements on the stack are repeatedly popped until the associated start operation is out.
For example, when the operation IFEQ is found, a new node is created with an edge directed from the last node to the new node, and the new node becomes the last node. Next, the operation IFEQ is pushed onto the stack. When the matching operation ENDIF is found, the elements on the stack are repeatedly popped until IFEQ is out. An edge labeled ‘NO’ is then created from the IFEQ node to the ENDIF node, and the ENDIF node becomes the last node.

In case of Repeat operations: the use of the stack to manage the construction of the directed graph is similar to the case of Conditional Branching. For example, when the operation DOWEQ is found, a new node is created with an edge directed from the last node to the new node, and the new node becomes the last node. Next, the operation DOWEQ is pushed onto the stack. When the matching operation ENDDO is found, the elements on the stack are repeatedly popped until DOWEQ is out. An edge is then created from the last node to the DOWEQ node (forming the loop), and the DOWEQ node becomes the last node.

3.3. Generate Program Information

This step generates program information from the File Description Specification and the Calculation Specification. Program information contains the program name, input/output parameters, and working files. The input/output parameters are taken from the Calculation Specification: output parameters are those fields that are updated, and the remaining parameters are considered input parameters. The working files include the input files, output files, updated files, workstation files, and report files declared in the File Description Specification.
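The stack-based handling of controls described in Section 3.2.2 can be sketched in Python as follows. This is a simplified illustration, not the prototype's code: the class `FlowGraph` and its methods are hypothetical, only IFxx/ENDIF and DOWxx/ENDDO are covered, and two details are assumptions rather than statements from the text, namely that the last node of an IF body is also linked to the ENDIF node and that the edge leaving a loop node after ENDDO is labeled "No".

```python
class FlowGraph:
    """Directed graph built from (operation code, node label) pairs."""

    def __init__(self):
        self.nodes = {}        # node id -> (label, shape)
        self.edges = []        # (source id, target id, edge label)
        self.stack = []        # open control nodes: (opcode, node id)
        self.last = None       # id of the last node
        self.next_label = ""   # label for the next edge leaving `last`

    def _link(self, src, dst, label=""):
        if src is not None:
            self.edges.append((src, dst, label))

    def _append(self, label, shape="rect"):
        # Sequential rule: new node, edge from last node, new node becomes last.
        node_id = len(self.nodes) + 1
        self.nodes[node_id] = (label, shape)
        self._link(self.last, node_id, self.next_label)
        self.last, self.next_label = node_id, ""
        return node_id

    def _pop_until(self, prefix):
        # Pop repeatedly until the matching start operation is out.
        while self.stack:
            opcode, node_id = self.stack.pop()
            if opcode.startswith(prefix):
                return node_id
        raise ValueError(f"no open {prefix} block")

    def add(self, opcode, label=""):
        if opcode.startswith(("IF", "DOW")):
            node_id = self._append(label, "diamond")
            self.stack.append((opcode, node_id))
            self.next_label = "Yes"
        elif opcode == "ENDIF":
            if_node = self._pop_until("IF")
            end_node = self._append(label)         # assumption: body also links to ENDIF
            self._link(if_node, end_node, "No")
        elif opcode == "ENDDO":
            loop_node = self._pop_until("DOW")
            self._link(self.last, loop_node)       # edge back to the loop condition
            self.last, self.next_label = loop_node, "No"  # assumption: exit edge is "No"
        else:
            self._append(label)


g = FlowGraph()
for opcode, label in [
    ("READ", "Read file BH002R"),
    ("DOWEQ", "If read file BH002R not end of file"),
    ("CHAIN", "Search file BN001R by key KEY01"),
    ("IFEQ", "If search found"),
    ("UPDAT", "Update file BN001R"),
    ("ENDIF", "End if"),
    ("READ", "Read file BH002R"),
    ("ENDDO", ""),
    ("SETON", "End program"),
]:
    g.add(opcode, label)
print(g.edges)
```

Keeping the open control nodes on a stack means nested structures close in the right order without any lookahead, which fits a single pass over the calculation lines.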


Table 2. Excerpt of Mapping between Operation Code and Node Label

Operation Code | Node label | Steps of creating directed graph
ADD (Add) | Case Factor 1 = Blank: [Result Field] = [Result Field] + [Factor 2]. Case Factor 1 ≠ Blank: [Result Field] = [Factor 1] + [Factor 2] | 1. Create new node; 2. Create edge from last node to new node; 3. Set node to last node
CHAIN (Random Retrieval from a File) | Search File [Factor 2] by key [Factor 1] | 1. Create new node; 2. Create edge from last node to new node; 3. Set node to last node; 4. Check resulting indicators; 5. Update indicator array
DIV (Divide) | [Result Field] = [Factor 1] / [Factor 2] | 1. Create new node; 2. Create edge from last node to new node; 3. Set node to last node
DO (Do) | Create 2 nodes of directed graph for supporting operation DO. Node 1: Case Factor 1 = Blank: [Factor 3] = 1. Case Factor 1 ≠ Blank: [Factor 3] = [Factor 1]. Node 2 (condition node): ( [Factor 3] … [Factor 2] ) | 1. Create new node1; 2. Create edge from last node to new node1; 3. Set node1 to last node; 4. Create new node2; 5. Create edge from last node to new node2; 6. Set node2 to last node; 7. Push new node2 to stack
DOWEQ | … | 1. Create new node; 2. Push new node to stack; 3. Create edge from last node to new node; 4. Set node to last node. Note: edge label from DOWEQ node to next node is “YES”
DOUEQ | … | 1. Create new node; 2. Push new node to stack; 3. Create edge from last node to new node; 4. Set node to last node. Note: edge label from DOUEQ node to next node is “NO”
… | … | 1. Set IF node from stack to last node
… | … | 1. Pop DO node from stack; 2. Create edge from last node to DO node
… | … | 1. Create new node; 2. Pop IF node from stack; 3. Create edge directed from IF node to new node. Edge label is ‘No’
… | … | 1. Create new node; 2. Pop node from stack; 3. Create edge from the popped node to new node; 4. Repeat 2 and 3 until the SELEC node is found
… | … | 1. Create new node; 2. Push new node to stack; 3. Create edge from last node to new node; 4. Set node to last node
… | … | 1. Create new node; 2. Push new node to stack; 3. Create edge from last node to new node; 4. Set node to last node
IFLT (If less than) | IF ( [Factor 1] < [Factor 2] ) | 1. Create new node; 2. Push new node to stack; 3. Create edge from last node to new node; 4. Set node to last node
KLIST (Define a Composite Key) | Define a composite key [Factor 1] | 1. Create new node; 2. Create edge from last node to new node; 3. Set node to last node
PLIST (Identify a Parameter List) | Identify a parameter list [Factor 1] | 1. Create new node; 2. Create edge from last node to new node; 3. Set node to last node
READ (Read a Record) | Read file [Factor 2] | 1. Create new node; 2. Create edge from last node to new node; 3. Set node to last node; 4. Check resulting indicators; 5. Update indicator array
SELEC (Begin a select group) | … | 1. Create new node; 2. Push new node to stack; 3. Set node to last node
Z-ADD (Zero and Add) | … | 1. Create new node; 2. Create edge from last node to new node; 3. Set node to last node

Node1.ID -> Node2.ID [label = "Edge.label"];
Node1.ID [label = "Node.label", shape = Node.type.dot]

Figure 5. Format of DOT Markup Language Used to Map with Metadata

3.6. Render with Visualization Program

The output markup language file (.dot) is then fed into the visualization tool Graphviz [5] for flowchart rendering.
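The mapping from the directed graph to the DOT format of Figure 5 can be illustrated with a short Python sketch. It assumes the graph is held as a dictionary of nodes (ID to label and shape) and a list of edges (source, target, label); the function `to_dot` and this in-memory layout are hypothetical and are not taken from the prototype.

```python
def to_dot(nodes, edges, graph_name="G"):
    """Emit edge statements first, then node statements, in the Figure 5 format."""

    def quote(text):
        # Escape embedded double quotes so the DOT string stays well formed.
        return text.replace('"', '\\"')

    lines = [f"digraph {graph_name} {{"]
    for src, dst, label in edges:
        lines.append(f'  {src} -> {dst} [label = "{quote(label)}"];')
    for node_id, (label, shape) in nodes.items():
        lines.append(f'  {node_id} [label = "{quote(label)}", shape = {shape}];')
    lines.append("}")
    return "\n".join(lines)


# Small illustrative graph using labels that appear in the example program.
nodes = {
    1: ("START", "ellipse"),
    2: ("Read file BH002R", "rect"),
    3: ("If read file BH002R not end of file", "diamond"),
}
edges = [(1, 2, ""), (2, 3, "")]

dot_text = to_dot(nodes, edges)
print(dot_text)
```

If the text is saved to a file such as `flowchart.dot` (a placeholder name), the Graphviz command `dot -Tpng flowchart.dot -o flowchart.png` renders it as an image.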

4. Prototype

To support the automation of the processes described above, a prototype has been developed using Eclipse Kepler 4.3 [6]. A case is demonstrated with the input RPG source, UPDBN01, illustrated in Figure 1. The metadata contained in the directed graph resulting from code analysis is shown in Figure 4. Next, the DOT markup language file is created by traversing the directed graph. In particular, 1) all edges of the directed graph are detected to generate the relations, as shown in Figure 6, lines 25-42, and 2) all nodes of the directed graph are detected to generate the attributes of each node, as shown in Figure 6, lines 43-59. The DOT markup language file (Figure 6) is then rendered using the visualization program Graphviz to obtain the detailed flowchart (Figure 7), and the high-level flowchart and program information (Figure 8).


 1  digraph G{
 2  subgraph cluster0{
 3  label = "Program information"
 4  a[ label = "Program name : UPDBN01\lInput parameter : None\l
 5  Output parameter : None\lDeclaration of 2 files\l
 6  Input file : BH002P\lUpdated file : BN001P\l",shape = rect ];
 7  }
 8  subgraph cluster1{
 9  label = "Concept";
10  b ->c [label = " " ];
11  d ->e [label = " " ];
12  f ->g [label = " " ];
13  h ->i [label = " " ];
14  j ->k [label = " " ];
15  b[ label = "File definition",shape = rect ];
16  c[ label = "Key definition",shape = rect ];
17  d[ label = "Read file BH002P until end of file",shape = rect ];
18  e[ label ="found KEY01 in file BN001P move value Update file BN001P",shape = rect];
19  f[ label ="not found KEY01 in file BN001P move value Write file BN001P",shape= rect];
20  g[ label = "End program",shape = rect ];
22  }
23  subgraph cluster3{
24  label = "Detail Flowchart";
25  1 -> 2[label = " " ];
26  2 -> 3[label = " " ];
27  3 -> 4[label = " " ];
28  4 -> 5[label = " " ];
29  5 -> 6[label = " Yes " ];
30  5 -> 17[label = " No " ];
31  6 -> 7[label = " " ];
32  7 -> 8[label = " Yes " ];
33  7 -> 10[label = " No " ];
34  8 -> 9[label = " " ];
35  9 -> 15[label = " " ];
36  10 -> 11[label = " " ];
37  11 -> 12[label = " " ];
38  12 -> 13[label = " " ];
39  13 -> 14[label = " " ];
40  14 -> 15[label = " " ];
41  15 -> 16[label = " " ];
42  16 -> 5[label = " " ];
43  1[ label = "START" ];
44  2[ label = "Define Key list of KEY01\n F0RTMD\n F0RLOT\n F0RUNT" ,shape = rect ];
45  3[ label = "Set lower of file BH002R" ,shape = rect ];
46  4[ label = "Read file BH002R" ,shape = rect ];
47  5[ label = "If read file BH002R not end of file" ,shape = diamond ];
48  6[ label = "Search file BN001R By key KEY01" ,shape = rect ];
49  7[ label = "If search file BN001R by key KEY01 found" ,shape = diamond ];
50  8[ label = "ATCSTS = F0RSTS" ,shape = rect ];
51  9[ label = "Update file BN001R", shape = rect ];
52  10[ label = "ATCTMD = F0RTMD" ,shape = rect ];
53  11[ label = "ATCLOT = F0RLOT" ,shape = rect ];
54  12[ label = "ATCUNT = F0RUNT" ,shape = rect ];
55  13[ label = "ATCSTS = F0RSTS" ,shape = rect ];
56  14[ label = "Write file BN001R" ,shape = rect ];
57  15[ label = "End if" ,shape = rect ];
58  16[ label = "Read file BH002R" ,shape = rect ];
59  17[ label = "End Program" ];
60  { rank = sink; "17";}
61  }
62  }

Figure 6. DOT Markup File for Rendering Flowcharts and Program Information


Figure 7. Detailed Flowchart Rendered from Graphviz


Figure 8. Visualization of High-level Flowchart and Program Information

5. Conclusion

Legacy software is any application based on older technologies and hardware that continues to provide core services to an organization. Legacy systems are frequently large and difficult to modify, and the costs of redesigning or replacing them may be prohibitive. For economic reasons, organizations thus usually opt to keep their outdated systems rather than modernize them. Users may also prefer an evolutionary rather than a revolutionary approach to modernizing their software. Because many changes have been made to the software through years of use, the supporting documentation may not be current.

This paper thus presents an approach to automating the construction of flowcharts as design blueprints from legacy source written in RPG/400. The recovery of the intent of code starts with chunking the source code based on the types of specifications. The subroutines contained in the Calculation Specification can be partitioned into separate chunks of code. Next, code analysis is carried out to create the metadata of program controls and operations, which is stored in a directed graph. The graph is then traversed to map the metadata to the DOT markup language format, which supports flowchart rendering with the visualization program Graphviz. The implemented prototype system provides two types of flowcharts: a detailed flowchart and a high-level flowchart.

Currently, the recovery of the intent of code focuses only on the File Description Specification and the Calculation Specification. Future enhancement toward better comprehension and completeness would cover other specifications, such as the Extension Specification and the Input Specification. The Extension Specification will provide information about the arrays and tables declared in the program. The Input Specification will promote the understanding of the data structures and named constants defined in the program.

References

[1] International Business Machines Corporation, http://www.ibm.com/us/en/, (2014).
[2] B. P. Lientz and E. B. Swanson, Editor, Addison-Wesley, Boston, (1980).
[3] B. G. Vasudevan, S. Dhanapanichkul and R. Balakrishnan, “Flowchart knowledge extraction on image processing”, IEEE World Congress on Computational Intelligence, (2008), pp. 4075-4082.
[4] National Information Standards Organization, “Understanding Metadata”, NISO Press, Bethesda, (2001).
[5] Graph Visualization Software Document, http://www.graphviz.org/Documentation.php, (2014).
[6] Eclipse Kepler 4.3, http://www.eclipse.org/kepler/, (2014).

Author

Kochaporn Suntiparakoo received her bachelor's degree in Computer Engineering from Mahidol University in 2010. Since graduation, she has been working as an RPG programmer at ISUZU Motors Company (Thailand) Ltd. She is currently pursuing a Master's degree in Software Engineering at the Department of Computer Engineering, Chulalongkorn University, Bangkok 10330, Thailand.
