An Approach to Data Description and Conversion

University of Pennsylvania

ScholarlyCommons Technical Reports (CIS)

Department of Computer & Information Science

December 1971

An Approach to Data Description and Conversion
Diane P. Smith, University of Pennsylvania

Recommended Citation: Diane P. Smith, "An Approach to Data Description and Conversion", December 1971.

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-72-20. This paper is posted at ScholarlyCommons: http://repository.upenn.edu/cis_reports/831

Abstract

Currently, the structure of stored data is determined implicitly by the software which accesses and processes it. This data structuring technology has given rise to two outstanding problems in data processing. First, there is the communication of the exact structure of data to users and machines, and secondly, the interchange of the data itself. This work contributes to overcoming these problems by developing a technique for describing the structure of data explicitly and independently of machines and software. This aim is reflected in the following objectives:

1) To understand data structures by developing a model which not only characterizes current data organizational techniques, but also provides a framework within which new data structures can be defined.
2) To use this model to develop a language which can explicitly describe the organization of data.
3) To use this model to study how data can be converted from one structure to another, with a view towards developing a method for describing data conversions.

The model unifies the diverse area of data structures by including the record, file and storage organizations of data. Furthermore, the model clearly separates at each level the conceptual part, which is the logical structure imposed by a user, from the implementation part, which is the method by which the logical structure is encoded as a binary representation. This separation leads to a straightforward mapping of a file onto storage. From an analysis of the state-of-the-art in data organization, it is shown that the model can express not only the data structures of current systems, but also certain useful generalizations which might well be produced by future systems.

The model treats records as hierarchies of data items. These hierarchies are expressed by production systems based on a generalized notion of attribute-value pairs. Files are treated as graphs whose nodes are records. The connections between the nodes are expressed using a powerful production system which generates criteria for determining when any two records are to be linked. The structure of storage is generalized as a hierarchy since this structure is common to all storage media. The mapping of files onto storage is expressed in terms of rules for distributing the records of the file within the slots provided by the storage structure.

The language, called Generalized Data Description Language (GDDL), is a realization of the model, and thus possesses all its capabilities. In particular, the language can describe the implementation of any aspect of a file as being dependent on any other aspect. The language is presented in an appendix in the form of a user's manual.

Data conversion is studied in terms of transforming data in one structure to another, where both structures are expressed in the model. This study shows that to fully specify a conversion the relationship between the components of the two structures must be specified. In certain cases, such as the reorganization of a file, this relationship can be very elaborate. A method is developed for specifying such relationships, and a corresponding capability is built into GDDL. Thus, GDDL has the ability not only to fully describe data structures, but also to specify data conversion.


University of Pennsylvania
THE MOORE SCHOOL OF ELECTRICAL ENGINEERING

TECHNICAL REPORT

AN APPROACH TO DATA DESCRIPTION AND CONVERSION

by

Diane Pirog Smith

Project Supervisor: Noah S. Prywes

December 1971

Prepared for the Office of Naval Research, Information Systems, Arlington, Va. 22217, under Contract N00014-67-A-0216-0007, Project No. 049-272

Reproduction in whole or in part is permitted for any purpose of the United States Government.

Moore School Report No. 72-20

DOCUMENT CONTROL DATA - R & D

Originating activity: University of Pennsylvania, The Moore School of Electrical Engineering, Philadelphia, Pa. 19104
Report security classification: UNCLASSIFIED
Report title: An Approach to Data Description and Conversion
Descriptive notes: Technical Report
Author: Diane Pirog Smith
Report date: December 1971
Total no. of pages: 328
No. of refs: 20
Contract or grant no.: N00014-67-A-0216-0007
Project no.: NR 049-272
Originator's report number: Moore School Report No. 72-20
Distribution statement: Reproduction in whole or in part is permitted for any purpose of the United States Government.
Sponsoring military activity: Office of Naval Research, Information Systems

I would like to express my gratitude to my two supervisors: Dr. David K. Hsiao, who first introduced me to this area of research and who provided invaluable help and careful criticism, and Dr. Grace Murray Hopper, whose conviction of the importance of the topic provided the encouragement I needed and whose vast experience in the area helped me to recognize many of the crucial aspects of the problem. I would also like to thank Dr. Noah S. Prywes and Dr. James Emery for their support and guidance.

The Ford Foundation and the U.S. Army Electronics Command, Avionics Project, supported me at various times during my graduate studies. I am particularly grateful to the Information Systems Branch of the Office of Naval Research for supporting this research under contract N00014-67-A-0216-0007.


TABLE OF CONTENTS

CHAPTER 1  INTRODUCTION
  1.1 Background and Objectives
  1.2 The Development of the Model, the Design of the Language, and the Study of Conversion
  1.3 Organization of the Report

CHAPTER 2  EXISTING DATA STRUCTURES AND DATA DESCRIPTION LANGUAGES
  2.1 Introduction
  2.2 Data Structures in Machine Languages
  2.3 Data Structures in Early Operating Systems
  2.4 Data Structures in Assembly Languages
  2.5 Data Structures in Early Higher-Level Programming Languages
  2.6 Data Structures in Third-Generation Operating Systems
  2.7 Data Structures in Current Versions of Higher-Level Programming Languages
  2.8 Data Structures in Data Base Management Systems
  2.9 The Data Description Language of the CODASYL Data Base Task Group
  2.10 Summary

CHAPTER 3
  3.1 Introduction
  3.2 A Model of Record Structures
    3.2.1 The Model of Data Items
      3.2.1.1 The Concept of Data Items
      3.2.1.2 Encoding Values
      3.2.1.3 Encoding Attributes
    3.2.2 The Model of Records
      3.2.2.1 The Conceptual Record Structure
      3.2.2.2 Encoding the Record Structure
    3.2.3 The Specification of the Encoding Characteristics
  3.3 Interpretation of Common Data Processing Concepts in Terms of the Model of Record Structures
  3.4 An Application of the Model of Record Structures
  3.5 The Completeness and Generality of the Model
  3.6 The Relationship Between the Model and GDDL
  3.7 Demonstrations of GDDL's Completeness

CHAPTER 4  FILE DESCRIPTION
  4.1 Introduction
  4.2 A Model of File Structures
    4.2.1 The Conceptual File Structure
    4.2.2 Encoding the File Structure
  4.3 Applications of the Model of File Structures
  4.4 The Completeness and Generality of the Model
  4.5 The Relationship Between the Model and GDDL
  4.6 Demonstrations of GDDL's Completeness

CHAPTER 5  STORAGE DESCRIPTION
  5.1 Introduction
  5.2 A Model of Storage Structures
    5.2.1 The Conceptual Structure of Storage
    5.2.2 Encoding Storage Items and Storage Structure
    5.2.3 Record Positioning and Pointer Interpretation Rules
  5.3 An Application of the Model of Storage Structures
  5.4 The Completeness and Generality of the Model
  5.5 The Relationship Between the Model and GDDL
  5.6 Medium Dependent Encoding Characteristics
  5.7 Demonstrations of GDDL's Completeness

CHAPTER 6  DATA CONVERSION
  6.1 Introduction
  6.2 The Concept of the Association List
  6.3 A Model of the Association List
  6.4 Applications of the Model of the Association List
  6.5 The Relationship between the Model and GDDL
  6.6 The Conversion Process

CHAPTER 7  CONCLUDING REMARKS

APPENDIX A  REFERENCE MANUAL FOR GDDL
APPENDIX B  EXAMPLES OF GDDL DESCRIPTIONS
APPENDIX C  RELATIONSHIP OF GDDL TO COBOL

LIST OF FIGURES

Figure 1-1. The Components of a Data Structure and their Interrelationships
Figure 2-1. IBM 7040 Data Description Statements
  2-1, a. The IBM 7040 $FILE Statement
  2-1, b. The IBM 7040 $LABEL Statement
Figure 2-2. The ANSI COBOL Statement for Describing a Data Item or a Group in a COBOL Record
Figure 2-3. The ANSI COBOL Statement for Describing a COBOL File
Figure 2-4. The ANSI COBOL Statement for Describing the Storage Convention of a COBOL File
Figure 2-5. Enhanced COBOL Description Statements
  2-5, a. The COBOL Statement for Declaring Data Types
  2-5, b. The COBOL Statement for Specifying Repetition
Figure 4-1. Implementation of Access Paths
  4-1, a. By Sequencing
  4-1, b. By Embedding Pointers
  4-1, c. By Using Tables of Pointers
Figure 4-2. Bit String Representation of File Sequentially Encoded
Figure 4-3. Bit String Representation of File Encoded by Embedded Pointers
Figure 4-4. File Linked by Embedded Pointers
Figure 4-5. Bit String Representation of File Encoded by a Pointer Table
Figure 4-6. File Linked by a Pointer Table
Figure 5-1. Formatted Tape
Figure 5-2. SSSI for Disk File
Figure 5-3. Bit String Representation of Tape File X
Figure 6-1. Simplified Conversion Process
Figure 6-2. An Example of Source Record Selection for the Formation of Target Records
Figure 6-3. The Use of Descriptions and the Association List in Data Conversion
  6-3, a. The Extraction of Data Items from Source Files
  6-3, b. The Formation of Target Data Items from Source Data Items
  6-3, c. Creation of Target Files from Target Data Items
Figure 7-1. The Trichotomy of Information Processing

LIST OF TABLES

Table 2-1. Summary of Data Representation Characteristics
Table 3-1. The Relationship Between the Model and GDDL
Table 4-1. Characteristics for each Encoding Method
Table 4-2. The Relationship Between the Model and GDDL
Table 5-1. Characteristics Required for Encoding
Table 5-2. The Relationship Between the Model and GDDL

BIBLIOGRAPHY

Birkhoff, G., Lattice Theory, American Mathematical Society, 1948.

(Ch 1968) Chapin, N., "A Deeper Look at Data," Proceedings 1968 ACM National Conference, 1968, pp. 631-638.

(CO 1971) CODASYL Data Base Task Group, Data Base Task Group Report to the CODASYL Programming Language Committee, April 1971.

(CO 1969) CODASYL Systems Committee Technical Report, Generalized Data Base Management Systems.

(CO 1970) Codd, E.F., "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, Volume 13, Number 6, June 1970, pp. 377-387.

(Ga 1970) Galler, B.A. and Perlis, A.J., A View of Programming Languages, Addison-Wesley, 1970.

(Hs 1970) Hsiao, D. and Harary, F., "A Formal System for Information Retrieval from Files," Communications of the ACM, Vol. 13, No. 2, February 1970, pp. 67-73.

(Hs 1971) Hsiao, D., "A Generalized Record Organization," IEEE Transactions on Computers, December 1971.

(IBM 1965) IBM System/360 Operating System, PL/I Language Specifications, File No. S360-29, Form C28-6571-4, 1965.

(La 1968) Lancaster, F.W., Evaluation of the MEDLARS Demand Search Service, U.S. Department of Health, Education and Welfare, Public Health Service, National Library of Medicine, Bethesda, Maryland, January 1968.

(Ma 1971) Manola, Frank, "An Extended Data Management Facility for a General Purpose Time Sharing System," M.Sc. Thesis, The Moore School of Electrical Engineering, University of Pennsylvania, 1971.

(Ma 1969) Marden, E., "Statement of Need for a Data Descriptive Language," Statement prepared for USA Standards X3 Ad Hoc Committee, 1969.

(Me 1967) Mealy, G., "Another Look at Data," FJCC, 1967, pp. 525-534.

(Ra 1971) Ramirez, J., and Solow, H., "The Design and Implementation of the DDL Processor," The Moore School of Electrical Engineering, University of Pennsylvania, work in progress.

(RCA 1969) RCA Information Systems, COBOL Reference Manual, 70-00-607, May 1969.

(RCA 1970) RCA Time Sharing Operating System, Data Management System Reference Manual, DJ-001-2-00, June 1970.

(Sa 1969) Sammet, Jean E., Programming Languages: History and Fundamentals, Prentice-Hall, 1969.

(St 1967) Standish, T.A., "A Data Definition Facility for Programming Languages," Carnegie Institute of Technology, 1967.

(SSDL 1970) Storage Structure Definition Language Task Group, "Storage Structure Definition Language, SSDL," Record of the 1970 ACM SIGFIDET Workshop on Data Description and Access, Rice University, Houston, 1971.

(US 1968) U.S. Navy Programming Languages Group, Fundamentals of COBOL, NAVSO P-3063, 1968.

CHAPTER 1

INTRODUCTION

1.1 Background and Objectives

Computer technology is a field which has experienced a rapid and uneven evolution. This evolution has seen computer users develop techniques and conventions appropriate only to their own needs and data processing environments. This has led to the inability of different user groups to communicate information about, and to exchange, algorithms and data effectively. The problem of user and machine dependent algorithms has received considerable attention, resulting in the development of widely accepted and largely machine independent programming languages such as ALGOL. However, the severity of the problems of user and machine dependent data organization has only been realized comparatively recently*, and as yet little has been done to alleviate this situation.

* "It has been estimated that the lack of an adequate data description language is costing the Department of Defense alone millions of dollars annually because of the inability to exchange data effectively." (Ma 1969, pg. 1)

Traditionally data is organized either by developing special software or by specifying its structure in existing programming languages, operating systems or data management systems. In either case, the exact data organization can only be understood by analyzing and interpreting several complex and interacting programs written in a variety of languages. For example, to understand the data structures produced by a particular COBOL program, it is necessary to analyze and interpret the following programs:

(i) the COBOL program itself,
(ii) the COBOL compiler, and
(iii) the data management system of the machine being used.

This effort is necessary because the factors which determine the organization of data are implicit in the programs and software used to process and structure the data. Consequently, such practices in data organization have hampered not only the communication of data structures but also the interchange of the data itself. When data is to be interchanged, it is necessary to know first whether the existing organization is compatible with the new software which is to use it, and secondly, how the organization can be converted to make it compatible when this is not the case. The implicit nature of data organization can make this an onerous task.

A solution to these problems of communication and data interchange is to make the organization of data explicit and its understanding independent of machines and software systems. This can be achieved by developing a language for explicitly specifying data structures which is separate from the languages used to process that data. To understand a data structure, it is then only necessary to interpret a specification which is expressly intended to communicate data structure information, rather than to interpret a program one of whose side effects is the structuring of data.
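As a rough illustration of what an explicit description buys, the following Python sketch keeps the layout of a record in a small table that is separate from the code that reads the data. The table format, the field names and the helper `decode_record` are assumptions made for this example only; they are not GDDL syntax.

```python
# Hypothetical explicit description of a record layout (not GDDL syntax).
# Each entry gives a field name, an encoding kind, and a width in bytes.
EMPLOYEE_DESCRIPTION = [
    ("name", "text", 10),    # 10 ASCII characters, blank padded
    ("age", "uint", 1),      # 1-byte unsigned integer
    ("salary", "uint", 4),   # 4-byte unsigned integer, big-endian
]


def decode_record(description, raw):
    """Decode one record from bytes using only the explicit description."""
    record = {}
    offset = 0
    for field, kind, width in description:
        chunk = raw[offset:offset + width]
        if kind == "text":
            record[field] = chunk.decode("ascii").rstrip()
        elif kind == "uint":
            record[field] = int.from_bytes(chunk, "big")
        else:
            raise ValueError(f"unknown field kind: {kind}")
        offset += width
    return record


# Sample data laid out according to the description above.
raw = b"SMITH     " + bytes([30]) + (12000).to_bytes(4, "big")
print(decode_record(EMPLOYEE_DESCRIPTION, raw))
```

A reader who wants to know how the bytes are organized only has to read `EMPLOYEE_DESCRIPTION`; the decoding routine is generic and contains no knowledge of this particular record.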

Such a data description language (ddl) would have many applications. One important application is to provide a means of communicating data structures among users. For example, using a ddl a creator of a data base can describe precisely to an applications programmer the exact structure of the data that the programmer wants to use. Just as ALGOL is now used to communicate algorithms so can a ddl be used to communicate data structures.

Not only can a ddl be used to communicate with users, but by constructing a ddl interpreter, the ddl can be used to communicate with machines. Using such an interpreter, a computer could use the information contained in any file when it is provided with a ddl description for that file. Users would then be free to structure their data in whatever manner they deem appropriate, without being constrained by the data structure specification facilities available in operating systems and programming languages. Thus, a ddl could be used in establishing automatically the structure of data bases. A data base creator would provide a ddl description and his data to the interpreter which would structure the data according to the description.

Furthermore, we could apply a ddl to the problem of mechanizing the conversion of data from a current structure to a new structure. It would only be necessary to input to a converter the data, a ddl description of its current structure, a ddl description of its new structure and a ddl description of the relationship between elements in one structure and the other. By interpreting these descriptions the converter could output the data in its new structure. Thus, the user is released from writing special conversion programs.
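The four inputs to such a converter can be sketched in a few lines of Python. This is a toy under stated assumptions: records are delimited text lines, each description is a small dictionary, and the relationship between the two structures is a simple mapping from target field to source field. The names `parse`, `emit` and `convert` and the description format are hypothetical and are not taken from GDDL.

```python
def parse(line, description):
    """Split one stored record into named fields, as the description dictates."""
    values = line.split(description["delimiter"])
    if len(values) != len(description["fields"]):
        raise ValueError("record does not match its description")
    return dict(zip(description["fields"], values))


def emit(record, description):
    """Encode named fields as one stored record, as the description dictates."""
    return description["delimiter"].join(
        record[field] for field in description["fields"]
    )


def convert(data, source_desc, target_desc, association):
    """Convert data from the source structure to the target structure.

    `association` maps each target field to the source field that supplies it.
    """
    converted = []
    for line in data:
        source_record = parse(line, source_desc)
        target_record = {
            field: source_record[association[field]]
            for field in target_desc["fields"]
        }
        converted.append(emit(target_record, target_desc))
    return converted


# Hypothetical descriptions of the current and the new structure.
source_desc = {"delimiter": ",", "fields": ["name", "age", "city"]}
target_desc = {"delimiter": "|", "fields": ["city", "employee"]}
association = {"city": "city", "employee": "name"}

data = ["SMITH,30,PHILADELPHIA", "JONES,41,ARLINGTON"]
print(convert(data, source_desc, target_desc, association))
```

The `convert` routine contains nothing specific to either file. Changing the source or target structure means changing a description, not writing a new conversion program.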

In this way files could be interfaced across programming language, operating system, data management system and hardware barriers.

A further application is in the design and operation of data and data base management systems. For example, a ddl can be used to create new data structures which can then be tested for effective storage utilization and other efficiency considerations.

At this point we should make clear what we mean by the term "data structure". We use the term to refer to the structure of data as it is to appear on a storage medium, including both the conceptual organization imposed by the user and the implementation of this conceptual organization. Some research groups, particularly those in programming languages (St 1967, Ga 1970), often use data structure to refer to not only the structure of data (as we use the term) but also the access method by which this data is used. To these groups a pushdown, for example, is a data structure, whereas we would say that a pushdown is a data structure together with an access method which controls storage and retrieval on a last in - first out basis. An access method is a program which is designed to store and retrieve data from a data structure. It follows from our discussion above that we need to separate out data structures from the programs which use them, so we can describe the data structures independently and explicitly. Furthermore, any appropriate access method can be designed once the data structure has been specified.
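The distinction between a data structure and an access method can be made concrete with a small Python sketch. Here the data structure is taken to be a plain sequence of stored items, and two different access methods are layered on top of it. The class names are illustrative assumptions, not terms from the report.

```python
class LastInFirstOut:
    """Access method giving pushdown behaviour over a sequence of cells."""

    def __init__(self, cells):
        self.cells = cells          # the data structure: a plain sequence

    def store(self, item):
        self.cells.append(item)

    def retrieve(self):
        return self.cells.pop()     # newest item comes out first


class FirstInFirstOut:
    """A different access method over the same kind of data structure."""

    def __init__(self, cells):
        self.cells = cells

    def store(self, item):
        self.cells.append(item)

    def retrieve(self):
        return self.cells.pop(0)    # oldest item comes out first


pushdown = LastInFirstOut([])
queue = FirstInFirstOut([])
for item in ("A", "B", "C"):
    pushdown.store(item)
    queue.store(item)

print(pushdown.cells == ["A", "B", "C"], queue.cells == ["A", "B", "C"])
print(pushdown.retrieve(), queue.retrieve())
```

After the three stores both sequences hold the same items in the same order, so a description of the stored data would be identical in the two cases. Only the access methods differ, which is why the structure can be described on its own and the access method chosen afterwards.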

With this background in mind, we state three objectives for this dissertation:

1) To understand data structures by developing a model which not only characterizes current data organizational techniques, but also provides a framework within which new data structures can be defined.
2) To use this model to develop a language which can explicitly describe the organization of data.
3) To use this model to study how data can be converted from one structure to another, with a view towards developing a method for describing such conversions.

It is anticipated that data description languages will contribute as much as programming languages towards the evolution of information processing. Just as the current state of programming languages is the accumulation of many efforts, it is expected that much research and development will be needed to fully understand the power and applicability of data description languages. The development of the ddl in this dissertation is perhaps analogous to the development of the first programming language. Different programming languages usually have different models of algorithms on which they are based. For example, ALGOL is based on recursive procedures with arithmetic operations, whereas LISP is based on the lambda-calculus and string manipulations. Similarly, we provide our own model of data organization on which our data description language is based.

There are other studies in progress which relate to the design of a ddl, specifically the studies being made by the CODASYL Storage Structure Description Language Task Group (SSDL 1970). However, this group so far has mainly addressed itself to techniques for mapping records onto storage, which is just a subset of the problem we have tackled here. The language given here is the first one to be completely developed and specified. In addition, we are the first to study and propose a general solution for the problem of using data descriptions for converting data from one structure to another.

1.2 The Development of the Model, the Design of the Language, and the Study of Conversion

We will now discuss the development of the model and its use in the design of the ddl (called GDDL for Generalized Data Description Language) which is presented in this report. The development of data description from its first primitive forms in machine languages to its current forms in data management systems has been based on ad hoc changes triggered by user needs and new technology. This has led to a wide variety of methods for describing data, without any general concept or comprehensive model.

For example, COBOL (US 1968) is based on highly developed record concepts, whereas L6 (~a 1969) is based on certain aspects of list structures, and in operating system design, systems programmers have built up a body of expertise on storage structures and file implementation techniques. However, the common concepts underlying these and other aspects of data structures have not been extracted and formulated into a comprehensive model.

Therefore, a thorough study of the data description elements in software systems and programming languages was undertaken, with a view towards extracting those common elements to include in a comprehensive model of data structures.

This model of data structures is divided into three largely independent levels, namely, the record, file and storage levels, and each level is further subdivided into a conceptual part and implementation part. The conceptual part is the logical structure which is imposed on the data. The implementation part is the way in which this structure is to be represented or encoded. The components of this subdivision of data structures are illustrated in Figure 1-1.

[Diagram not recoverable from the source. Legible labels: column headings CONCEPTUAL PART, IMPLEMENTATION PART, RESULTING BIT STRING REPRESENTATION (B.S.R.) and MAPPING; conceptual entries Data Item, Logical Record Structure, Logical File Structure and Logical Storage Structure; implementation entries Encoding Record Structure, Encoding File Structure and Encoding Storage Structure; results B.S.R. of Data Item, B.S.R. of File and B.S.R. of Storage; mapping output File in Storage Format.]

Figure 1-1. The Components of a Data Structure and their Interrelationships

These subdivisions provide a valuable vantage point for understanding data structures. Let us look first at the implications of the division into conceptual and implementation parts. The nature of the conceptual part is quite distinct from the implementation part, even though most systems do not make this distinction. The conceptual part is the machine-independent structure which is imposed on the data by the user. He conceives of the data as being organized in this fashion, and this is the form in which his programs expect to find the data. The implementation part, which is machine-dependent, is the way in which the logical structure is encoded as a bit string representation which can be stored on a storage medium. In our model we will see that specifications which relate to the conceptual part have the nature of production systems, whereas, specifications which relate to the implementation part have the nature of certain characteristics of character strings like length or character code. In addition, this subdivision yields a valuable insight which has not been noted in other work.

This insight is based on the observation that if a person intends to organize certain entities into a structure, he may want that organization to depend on any property of those entities which is available to him. In particular, if a person wants to organize records into a file, he may specify this organization in terms of any available properties of those records. These properties can include the values of data items in records, the logical structure of the records and the implementation of the record structure. Thus we can see that to describe file organization we have to provide more than the capability of just specifying abstract graphical structures.

Now we look at the implications of dividing the model into record, file and storage levels.

The concept of a record is common to all data storage and retrieval systems, yet it is usually overlooked in theoretical studies of data structures. The structure of records is an important consideration in that it is the basic organization of data items which is treated as an entity for storage and retrieval. Thus far a hierarchic organization for records has proven adequate, as it provides a structure which is relatively easy to encode and decode without the need for extended scanning operations. In this work, therefore, we only allow hierarchic structures at the record level. In our model this hierarchic organization is generalized in that it allows for levels of the hierarchy to occur optionally or to repeat a number of times. This conceptual structure of records has not been modelled explicitly before, although it is essentially the logical organization of records which is implicit in COBOL. COBOL, however, is quite restrictive on the ways in which the implementation of records may be specified. In this work we allow each implementation characteristic to be specified either directly or dependent on other characteristics.

Records are the elements which are organized into files.

There is great flexibility in distributing the overall organization of a set of data items between the record and file levels. On one hand, we can specify a record to consist of a single data item, and, in effect, specify the overall organization of the data at the file level. In fact we can specify hierarchies at the file level and thus all the conceptual structure for records can in principle be moved to the file level. However, while the conceptual structure of the data might remain the same, the use of the data for storage and retrieval has been changed. On the other hand, we can specify a record to be a complex hierarchic structure and possibly make the file structure simple. The distribution of structure between the file and record levels depends on the intended use of the data. Therefore, by distinguishing record structure from file structure we are able to include these aspects of data structures in our model.

Our concept of a file structure is more general than others because, as previously mentioned, we allow the specification of graphical structures which depend on data and record properties. This requires a more elaborate specification method than the usual methods based on pure graph-theory.

The specification of the structure and encoding of records, and the specification of how these records are structured and implemented as a file determine a bit string representation of the file.

This is the bit string which is actually mapped onto a storage structure.

Our division of storage structure into conceptual and implementation parts is the key to both simplifying the mapping of the bit string representation of a file onto a storage structure, and also simplifying the specification of storage structures by extracting the structure common to storage media independent of physical considerations. The conceptual structure of storage is based on generalized hierarchies which are common to all storage media. The implementation of these hierarchies is based on encoding characteristics which are also independent of the storage media. To bind a storage structure to a particular medium, we have only to relate the levels of the hierarchy to the actual physical levels of a storage medium.

It is over such a storage structure that the bit string representation of a file is distributed. A result of our subdivision of data structures has been to make the actual mapping of data onto a storage medium comparatively straightforward. It is only necessary to deconcatenate the bit string representation of the file at appropriate points, and insert these component strings without disturbing their order into the slots already provided by the storage structure.

These are the insights and advantages which are obtained by subdividing our model in the above way. From the study of data description elements in software systems and programming languages we can ensure that we at least included the data description capabilities of every current system that was considered.

As each of the classes of software in the study includes the most sophisticated representative of that class, it is likely that we have in fact included the capabilities of all current systems. From this model the requirements for a data description language are immediately apparent. This allows GDDL itself to be very closely related to the model.
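The mapping of a file onto storage described above can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions and is not part of GDDL: the names `encode_record`, `file_string` and `deconcatenate`, the fixed field widths, and the fixed slot size are all hypothetical, and character strings stand in for bit strings. It orders records by the value of a data item (a file organization that depends on a record property), concatenates their encodings, and then cuts the result into consecutive storage slots without disturbing its order.

```python
from typing import Dict, List

# Hypothetical record layout: (data item name, fixed width in characters).
LAYOUT = [("key", 4), ("name", 6)]


def encode_record(record: Dict[str, str]) -> str:
    """Encode one record as a fixed-length string (its implementation part)."""
    return "".join(record[name].ljust(width)[:width] for name, width in LAYOUT)


def file_string(records: List[Dict[str, str]]) -> str:
    """Order records by the value of the 'key' item and concatenate their encodings."""
    ordered = sorted(records, key=lambda r: r["key"])
    return "".join(encode_record(r) for r in ordered)


def deconcatenate(stream: str, slot_size: int) -> List[str]:
    """Cut the string into consecutive slots of slot_size, preserving order."""
    return [stream[i:i + slot_size] for i in range(0, len(stream), slot_size)]


records = [
    {"key": "0002", "name": "SMITH"},
    {"key": "0001", "name": "JONES"},
]
stream = file_string(records)
slots = deconcatenate(stream, slot_size=8)
```

Because the slots are filled in order, concatenating them again recovers the original string, which is what keeps the mapping step simple in this sketch.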

When the data description capability of the language had been designed, the problem of using descriptions to convert data from one structure to another was studied. Using ddl's for data conversion is one application that has been widely suggested, but never actually investigated. With our model of data structure, we could study the conversion process itself. In this study it will be shown that additional information is required to completely describe a conversion. This additional information specifies a relationship, which can be quite elaborate, between names in one description and names in the other. To model this relationship the concept of an association list was developed.
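As a rough illustration of the idea, an association list can be pictured as a list of entries, each pairing a name in the source description with a name in the target description. The Python sketch below is an assumption made for illustration only: the entry shape, the optional per-entry transformation, and all names (`ASSOCIATIONS`, `convert`, `EMP-NO`, and so on) are hypothetical, and the association lists developed later in this report may be considerably more elaborate.

```python
from typing import Callable, Dict, List, Tuple

# Each entry associates a name in the source description with a name in the
# target description, plus a function applied to the value on the way across.
Association = Tuple[str, str, Callable[[str], str]]

ASSOCIATIONS: List[Association] = [
    ("EMP-NO", "employee_id", lambda v: v.lstrip("0")),
    ("EMP-NAME", "name", str.title),
]


def convert(record: Dict[str, str], associations: List[Association]) -> Dict[str, str]:
    """Build a target record by following each (source, target, transform) entry."""
    return {target: transform(record[source])
            for source, target, transform in associations}


converted = convert({"EMP-NO": "00042", "EMP-NAME": "JOHN DOE"}, ASSOCIATIONS)
```

The two data descriptions alone would not say that `EMP-NO` corresponds to `employee_id`; that correspondence is exactly the additional information the list supplies.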

GDDL capabilities for describing data conversion relationships are incorporated directly from the association list concept.

1.3 Organization of the Report

The GDDL language itself is presented in Appendix A in the form of a self-contained reference manual. The body of this report therefore is concerned with presenting the model and its relationship to the language. It also shows that GDDL can describe any data organization that can be obtained with current systems. Further, because the model allows generalizations of current data description capabilities, GDDL can describe data organizations that are beyond these present capabilities but might well be incorporated into future systems. The generality of GDDL relative to current systems is discussed in terms of the model.

Chapter 2 presents the study of the development of data description in programming languages and software systems. The table at the end of this study (Table 2-1) provides the basis for showing that the model and thence GDDL include all current data structure capabilities. This study is quite long and the details are not essential for understanding the remaining chapters. The reader is therefore advised to skip to Chapter 3 should the detail become too oppressive.

Chapters 3, 4 and 5 develop the record, file and storage levels of the model respectively. Each chapter shows the relationship between the model and the GDDL language at that level. The material in these chapters provides an excellent way of visualizing the structure of GDDL and its description capabilities.

Chapter 6 discusses the ways of using data descriptions to convert data from one structure to another. The concept of an association list is introduced and it is shown how an association list can be used to complete the specification of data conversion.

Chapter 7 summarizes the contributions of this report and suggests directions for future research.

Appendix B contains examples of GDDL descriptions of some real-world files and of data conversion from one structure to another. These examples are chosen to further demonstrate the ability of GDDL to describe current data organizations.

Appendix C contains a proof that GDDL can indeed describe all the COBOL record features. COBOL is the prototype for the most advanced record level data representations. It is shown that each COBOL record description clause can be expressed in GDDL.

CHAPTER 2

EXISTING DATA STRUCTURES AND DATA DESCRIPTION LANGUAGES

2.1 Introduction

The object of this chapter is to provide an analysis of data structures in contemporary computer software with a view towards obtaining a comprehensive summary of data structure characteristics. This summary provides the basis for demonstrating in later chapters that the GDDL is complete.

The software systems covered by this analysis are:

(i) machine languages,
(ii) early operating systems,
(iii) assembly languages,
(iv) early higher-level programming languages,
(v) current operating systems,
(vi) current higher-level programming languages,
(vii) data base management systems, and
(viii) the CODASYL Data Description Language.

The characteristics of each of these systems are analyzed in a separate section of this chapter. The final section combines the results of these analyses into a table.

2.2 Data Structures in Machine Languages

In machine languages, there are four ways that data structure characteristics are specified:

1) hardware specifications for conventions such as the code for representing characters, the base for representing numbers, and the length of the smallest addressable unit of storage. These conventions are fixed for a given computer but may vary from machine to machine. To use a particular machine, a system programmer has to know these conventions. Thus, descriptions in the form of specifications in manuals are usually provided.

2) machine language instructions that specify the data type (e.g., character or number), the scale of numbers (e.g., fixed point or floating point), and the precision of numbers (e.g., single or double). These descriptive elements are implicit in data manipulation instructions rather than explicit as declarations. They are illustrated by the following examples.

a) To specify that a character string is to be placed in the accumulator of the computer, the machine language instruction CAL (Clear and Add Logical word) would be used instead of the instruction CLA for placing a number in the accumulator.

b) To specify that a floating point number is to be added to the accumulator, the instruction FAD (Floating Add) would be used instead of the fixed point instruction ADD.

c) To specify double precision for addition, the instruction DFAD (Double Precision Floating Add) would be used instead of the single precision instruction ADD.

3) machine language instructions that specify locations of data items. These descriptive elements are also implicit in data manipulation instructions rather than explicit as declarations. For example, the STO (Store) instruction both declares that a particular location is to be used for storage and specifies that a data item is to be stored in that location.

4) machine language instructions that specify which devices are to be used for input and output, and how data would be organized on the device medium. These descriptive elements are also implicit in data manipulation instructions rather than explicit as declarations. They are illustrated by the following examples.

a) To specify that a particular I/O device is to be used for output, the machine language instruction WRS (Write Select) is used to prepare the appropriate channel.

b) To specify that a particular block of data items is to be copied onto an output medium, the instruction RCH (Reset and Load Channel) is used to send to the channel a channel command word which gives the size of the block of data to be copied and its location.

c) To specify that the last block of data has been reached on a magnetic tape, the instruction WEF (Write End-of-File) is used to write an end-of-file gap followed by a tape mark on the tape.
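The point that type, scale and precision are implicit in the instruction, and not recorded with the data, can be illustrated with a small sketch. The Python fragment below is only an analogy under stated assumptions: it uses the 32-bit floating point layout offered by Python's `struct` module, not the word format of the machines discussed here, and the variable names are hypothetical. The same stored bits yield different values depending solely on which operation reads them.

```python
import struct

# One 32-bit pattern held in "storage"; nothing in the bits records its type.
word = struct.pack(">f", 1.0)

as_float = struct.unpack(">f", word)[0]    # read back as a floating point number
as_integer = struct.unpack(">i", word)[0]  # same bits read as a fixed point integer
as_hex = word.hex()                        # same bits shown as hexadecimal digits
```

Choosing the format code in `unpack` plays the role that choosing between, say, a fixed point and a floating point instruction plays in machine language.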

The characteristics of data structures* provided by machine languages can be grouped into two categories. One includes the characteristics of individual data items, and the other the characteristics of storage media.

* At the end of each section of this chapter a list of the characteristics of the system under discussion will be presented. Whenever a new characteristic (not appearing in previous sections) is introduced, it will be underlined.

1. The characteristics of individual data items consist of:
   (i) the hardware provided character code,
   (ii) length,
   (iii) data type:
      a) character string,
      b) numbers:
         1) binary base,
         2) sign - radix or diminished radix complement (depending on the hardware),
         3) fixed or floating-point scale.

2. The characteristics of storage media consist of:
   (i) block size,
   (ii) end-of-file labels, and
   (iii) device assignment.

We note that machine instructions are seldom used or made available to describe explicitly the structuring of sets of data items. Such structures are created and maintained by machine language programs.

2.3 Data Structures in Early Operating Systems

With the development of Operating Systems (OS's), more complex data structures on storage devices were provided directly to the programmer. They are described by statements of the OS job control language (JCL). Previously, these file and storage structures had to be implemented as part of user-written machine language programs. Examples of such statements are the $FILE and $LABEL statements provided by the IBM 7040 JCL. These are illustrated in Figure 2-1.

The $FILE statement is used to describe the characteristics of the file structure and the positioning of the records on magnetic tape, the structure of the tape's physical blocks and the tape unit.

1. The file structure and implementation characteristics consist of:
   (i) ordering the records in their input sequence, and
   (ii) implementing this structure by sequential storage.

2. The record positioning characteristic is the record to tape block ratio; that is, the number of records per tape block.

3. The storage structure and implementation characteristics are:
   (i) tape naming,
   (ii) tape block size,
   (iii) labels:
      a) header and trailer labels for tape reels and files,
      b) count fields for tape blocks,
   (iv) fixed ordering of tape blocks and labels on the tape,
   (v) fixed occurrence of all blocks and labels specified,
   (vi) repetition of reels - given as number of reels.

4. The device characteristic is read/write density.

The remaining parameters of the statement are used to describe buffers and actual processing.

The $LABEL statement is used to describe the information in a label. Labels are used to implement storage structures.

[Syntax diagrams not recoverable from the source. Legible fragments of the $FILE statement: deck name, 'file name', [primary unit], [secondary unit], PRINT, SCRTCH. Legible fragments of the $LABEL statement: bracketed number fields and an identification field.]

Figure 2-1 a) The IBM 7040 $FILE Statement

Figure 2-1 b) The IBM 7040 $LABEL Statement

Figure 2-1. IBM 7040 Data Description Statements

2.4 Data Structures in Assembly Languages

Assembly languages were primarily designed to enhance data handling and, to a lesser degree, to provide mnemonic machine instructions. The data-oriented pseudo-instructions provided by assembly languages significantly increase the variety of data structures made directly available to the user. Thus, many complex data structures that had previously been created and maintained by user programs can now be declared explicitly.

In Assembly Languages, elements and statements which deal with data structures are typified as follows:

1) Symbolic names assigned to data items. These names may be used to access the data items directly without referring to the address of the data items. For example, in the IBM 7040 Macro-Assembly Language MAP, the statement

    DINT DEC 13

results in the name DINT being assigned to the location in which a decimal number 13 is stored.

2) Pseudo-instructions that declare data types. For example, in IBM 7040 MAP, data items may be declared to be octal, OCT; decimal, DEC; binary coded information, BCI; and variable field data, VFD. This is illustrated by the following examples:

a) To specify that a data item named DINT is to be interpreted as the decimal integer 13, the following MAP statement is used:

    DINT DEC 13

b) To specify that a data item named ENTRY is to contain the character C in the first 6 bits of the data item, the following MAP statement is used:

    ENTRY VFD H6/C

3) Pseudo-instructions that describe the structure of data items. For example, in IBM 7040 MAP, to specify that a block of 6 consecutive storage locations is to be reserved for storing data items, the following statement is used:

    BSS 6

4) Pseudo-instructions that describe input/output characteristics of particular media. For example, in IBM 7040 MAP such statements are of the form:

    name FILE option, ..., option
    LABEL option, ..., option

where the options for the FILE statement and LABEL statement are the same as the options for the IBM 7040 Job Control Language $FILE and $LABEL described in the previous section.
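The effect of symbolic naming and of pseudo-instructions such as DEC, OCT and BSS can be pictured as building a table from names to storage locations. The Python sketch below is a toy model under stated assumptions, not a description of the MAP assembler: the function `assemble`, the statement triples, and the names TABLE and NEXT are hypothetical, and each data item is assumed to occupy exactly one location.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-assembler input: (name, operation, operand).
STATEMENTS: List[Tuple[str, str, str]] = [
    ("DINT", "DEC", "13"),   # one location holding the decimal integer 13
    ("TABLE", "BSS", "6"),   # reserve a block of 6 consecutive locations
    ("NEXT", "OCT", "17"),   # one location holding octal 17
]


def assemble(statements, origin=0):
    """Assign a location to each name and record initial values in storage."""
    symbols: Dict[str, int] = {}
    storage: Dict[int, int] = {}
    location = origin
    for name, op, operand in statements:
        symbols[name] = location
        if op == "BSS":
            location += int(operand)          # reserved, left uninitialised
        elif op == "DEC":
            storage[location] = int(operand, 10)
            location += 1
        elif op == "OCT":
            storage[location] = int(operand, 8)
            location += 1
        else:
            raise ValueError(f"unsupported operation: {op}")
    return symbols, storage


symbols, storage = assemble(STATEMENTS)
```

A program can then refer to `symbols["DINT"]` instead of a numeric address, and the fixed order of the statements fixes the order of the items in storage.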

Thus, the following characteristics of individual data items, sets of data items and storage media are made accessible to programmers in Assembly Language.

1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) length,
   (iv) data type:
      a) character string,
      b) numbers:
         1) binary, decimal or octal base,
         2) character sign for decimal numbers and radix and diminished radix complement for binary numbers,
         3) fixed or floating point scale,
   (v) data items identified by position.

2. The characteristics of sets of data items consist of:
   (i) fixed order,
   (ii) fixed occurrence, and
   (iii) sets of data items identified by their position.

3. Assembly languages depend on their underlying operating system for storage structure.

2.5 Data Structures in Early Higher-Level Programming Languages

In developing higher-level languages such as FORTRAN and COBOL, appropriate data structures were provided. For example, FORTRAN, which was designed for scientific computing, provides array accessing for handling homogeneous data (i.e., data of the same type). The data description statements of ANSI FORTRAN have four forms:

1) Declaration statements that describe the structure of individual data items. In FORTRAN, characteristics such as scale and precision are treated as additional data types. For example, in FORTRAN IV, the following "type" declarations are provided:

    INTEGER
    DOUBLE PRECISION
    REAL
    LOGICAL
    COMPLEX
    EXTERNAL

where LOGICAL data items are the values T (or TRUE) and F (or FALSE), and EXTERNAL data items are data items which are defined externally to the FORTRAN program. To specify the type of a data item, the name of the item is listed after the type in a declaration statement, e.g.,

    INTEGER CVAL, A, B

2) The declaration statement which describes the structure of sets of data items (groups). In ANSI FORTRAN, individual data items can be grouped together in hierarchic structures which are interpreted by the processor as arrays. For example, the tree illustrated below can be interpreted as a 2 x 3 array:

[Tree diagram not recoverable from the source.]

That is, the pairs of data items <a11,a12>, <a21,a22> and <a31,a32> are interpreted as rows. Arrays are limited to a maximum of three dimensions. The DIMENSION statement is used to describe such groupings.

The statement has the following format:

    DIMENSION array name (n1,n2), ..., array name (n1,n2,n3)

where:

array name is the name used to refer to the array, and n1, n2, n3 are the number of elements in each of the dimensions of the array allowed in ANSI FORTRAN.

For example, the statement:

    DIMENSION A(2,3)

describes a 2 x 3 array called A. Data items in the vectors are accessed by array indexing.

3) The FORMAT statement which describes input and output data structures. The statement is used to describe data type and length for each data item in a record to be input or output. For example, in ANSI FORTRAN, the statement has the following format:

    FORMAT (data item specification, ..., data item specification)

where a data item specification consists of two parts: a data type and a data length part. These types are:

    F  real with no exponent
    E  real with exponent
    D  real with double precision exponent
    I  integer
    L  logical (character string T or F)
    A  character string
    H  hollerith (character string used for output only)

Length is given as number of characters per data item. For example, A6 describes a data item which is a string of 6 characters. For real data items, in addition to length, the number of digits to the right of the decimal point is specified. For example, F8.2 describes a data item which is a real number with a maximum length of 8 characters and which has 2 digits following the decimal point.

4) Input/Output statements that describe the order of the data items to be input or output, and the device to be used. The statements have the following format:

    (device number, format statement number) data name, ..., data name

where:

device number refers to a specific device,
format statement number refers to the format statement describing the data items being input or output,
data name refers to the data item or group (array values) being input or output.

Thus, the following characteristics of data structures are made accessible to programmers by the data description statements of FORTRAN:

1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) fixed lengths as specified by the user,
   (iv) data type:
      a) character string,
      b) number:
         1) binary or decimal base,
         2) radix or diminished radix complement depending on hardware for binary numbers, character sign or no sign for decimal numbers,
         3) fixed or floating point scale,
   (v) data items identified by their position.

2. The characteristics of records consist of:
   (i) array accessing (balanced trees),
   (ii) fixed ordering,
   (iii) fixed occurrences,
   (iv) groups of data items identified by their position.

3. FORTRAN depends on its underlying Operating System for its storage structure.
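The way a FORMAT data item specification combines a type letter with a length (and, for reals, a count of decimal digits) can be mimicked in a few lines. The Python sketch below is an approximation for illustration only: the function `format_item` is hypothetical, it handles just the F, I and A letters, and its treatment of values that do not fit the field (asterisks for numbers, truncation for strings) is an assumption of this sketch.

```python
import re


def format_item(spec: str, value) -> str:
    """Render one value according to a FORTRAN-like data item specification."""
    match = re.fullmatch(r"([FIA])(\d+)(?:\.(\d+))?", spec)
    if match is None:
        raise ValueError(f"unsupported specification: {spec}")
    kind = match.group(1)
    width = int(match.group(2))
    places = int(match.group(3) or 0)
    if kind == "A":
        # Character string: right-justify, keep at most `width` characters.
        return f"{value:>{width}}"[:width]
    if kind == "F":
        text = f"{value:{width}.{places}f}"   # real with no exponent
    else:
        text = f"{value:{width}d}"            # integer
    # A number that does not fit is shown as a field of asterisks.
    return text if len(text) <= width else "*" * width


line = "".join(
    format_item(spec, value)
    for spec, value in [("I3", 42), ("F8.2", 3.14159), ("A6", "SAMPLE")]
)
```

Each specification fixes the length of its item, so the position of every item in the output record follows from the order of the specifications alone.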

Because the COBOL language was designed for handling large quantities of data, more importance was given to the data description statements of the language than in FORTRAN. These statements are written in separate sections of a COBOL program. The Data Division is the section for describing the data items, records, files, working storage and program constants. Another section, called the Environment Division, is for describing the storage media. In it, information concerning file selection is given, and the equipment configuration (tape station, printer, etc.) is described.

1) In COBOL's Data Division there is one statement for describing the organization of data items in records and one statement for describing the organization of records into files:

a) Each data item or group of data items that is to appear in a record is described by a statement of the form illustrated in Figure 2-2. This statement is used to describe:

   i) the level at which the data item or group of data items is to occur in the hierarchic record,
   ii) the data type (e.g., character string = DISPLAY, numeric string = COMP),
   iii) the length of the data item,
   iv) the number of times the data item or group of data items is to occur in each record,
   v) the alignment of the data item in respect to word boundaries and to fixed length strings of character positions.

    level-number {data-name-1 | FILLER} [; REDEFINES data-name-2]
        [; PICTURE IS character-string]
        [; [USAGE IS] {COMPUTATIONAL | DISPLAY}]
        [; SYNCHRONIZED {LEFT | RIGHT}]
        [; {JUST | JUSTIFIED} RIGHT]
        [; VALUE IS literal]
        [; BLANK WHEN ZERO].

Figure 2-2. The ANSI COBOL Statement for Describing a Data Item or a Group in a COBOL Record (US 1968)

b) The organization of COBOL records in a COBOL file is described by a statement of the form illustrated in Figure 2-3. This statement is used to describe:

    i) the size of storage blocks,
    ii) the size of the records stored in the blocks,
    iii) any labels to appear on the storage tape,
    iv) the names of records appearing in the file.

    file-name
        [; BLOCK CONTAINS [integer-1 TO] integer-2 {RECORDS | CHARACTERS}]
        [; RECORD CONTAINS [integer-3 TO] integer-4 CHARACTERS]
        ; LABEL RECORDS ARE STANDARD
        ; DATA RECORDS ARE data-name-1 [, data-name-2] ... .

Figure 2-3. The ANSI COBOL Statement for Describing a COBOL File (US 1968)

2) In COBOL's Environment Division, there is one section that is used to describe input and output conventions. In it, equipment assignments and certain physical characteristics of each file to be used by the program are described by a statement of the form illustrated in Figure 2-4. This statement is used to describe the device on which the file is stored.

    FILE-CONTROL.
        SELECT [OPTIONAL] file-name
        ASSIGN TO [integer-1] implementor-name-1 [, implementor-name-2] ...
        [FOR MULTIPLE ...]
        [... integer-2 ALTERNATE {AREA | AREAS}] ... .

Figure 2-4. The ANSI COBOL Statement for Describing the Storage Convention of a COBOL File (US 1968)
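The level numbers used in the Data Division determine how data items nest into groups within a hierarchic record. The following Python sketch illustrates that idea only; it is not COBOL syntax and is not taken from the report. The sample entries and the helper names `build_hierarchy` and `outline` are hypothetical, and the sketch assumes that an entry belongs to the nearest preceding entry with a smaller level number.

```python
def build_hierarchy(entries):
    """Nest (level, name) entries into a tree using their level numbers.

    Assumption: an entry is a child of the nearest preceding entry
    whose level number is strictly smaller.
    """
    root = {"name": "ROOT", "level": 0, "children": []}
    stack = [root]  # path from the root to the most recent entry
    for level, name in entries:
        node = {"name": name, "level": level, "children": []}
        # Pop until the top of the stack has a smaller level number.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root["children"]


def outline(nodes, depth=0):
    """Return the hierarchy as a list of indented lines."""
    lines = []
    for node in nodes:
        lines.append("  " * depth + node["name"])
        lines.extend(outline(node["children"], depth + 1))
    return lines


# Hypothetical record description: an EMPLOYEE record with a NAME group.
entries = [
    (1, "EMPLOYEE"),
    (2, "NAME"),
    (3, "FIRST-NAME"),
    (3, "LAST-NAME"),
    (2, "SALARY"),
]
tree = build_hierarchy(entries)
print("\n".join(outline(tree)))
```

Only relative level numbers matter in this sketch, so the same helper would accept entries numbered 01, 05, 10 as readily as 1, 2, 3.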

Thus, the data structures that are made accessible to programmers by COBOL can be characterized in the following way.

1. The characteristics of individual data items consist of:

(i) symbolic naming,
(ii) the hardware provided character code,
(iii) fixed lengths as specified by the user,
(iv) data types:
    a) character string,
    b) number:
        1) binary or decimal base,
        2) sign - radix or diminished radix complement (depending on the hardware) for binary numbers, and character sign or no sign for decimal numbers,
        3) fixed or floating point scale,
(v) value alignment (justification) with blank or zero padding,
(vi) value string alignment (synchronization) with respect to computer words with blank or zero padding,
(vii) data items identified by their position.

2. The characteristics of records consist of:
(i) hierarchic structure,
(ii) fixed order,
(iii) fixed occurrences,
(iv) fixed repetition ordered as input,
(v) groups of data items identified by their position.

3. COBOL depends on its underlying Operating System for its storage structures.

2.6 Data Structures in Third-Generation Operating Systems

In their current stage of development, Operating Systems (OS's) are providing more file and storage structure options than early OS's. The creation and maintenance of these structures are treated as a set of services separate from those involved in scheduling programs. The part of an OS which supports these services is referred to as the data management system (DMS) of the operating system. Among these services are the moving of data between storage devices and main memory, and the accessing of data in DMS maintained structures. Additional JCL statements, known as DMS statements, are provided to evoke DMS services.

In general, DMS's provide their users with a number of file and storage structures. To store data in such structures, the user proceeds as follows:

(i) he names the particular structure in a DMS statement,
(ii) he lists the parameters which select those options provided by the DMS (if any), and
(iii) he enters his data.

The data management service so evoked moves the data from the input device to the appropriate storage devices and stores it in the described structures.

For example, the DMS II of the RCA SPECTRA 70/46 TSOS (RCA 1971) provides its users with five structures and related input/output conventions. Collectively, these structures and conventions are called access methods. They are:

1) PAM (Primitive Access Method). This method provides only a particular record format (fixed in length) and storage on either direct access devices or on single reel, standard blocked tape. PAM creates and accesses files only in random order. The user must himself handle the blocking and deblocking of records.

2) SAM (Sequential Access Method). This method provides either fixed length, variable length or undefined record formats (where records with undefined formats are stored one to a block). SAM creates and accesses files in sequential order only. It performs all blocking, deblocking and buffering for the user.

3) ISAM (Index Sequential Access Method). It provides either fixed or variable length record formats and storage on direct-access devices only. Records are maintained by means of a directory whose entries point to the records to reflect the correct sequence. In other words, records may not be in sequential order physically. The field whose values determine the sequence is called the key. Thus, ISAM can access files in a sequential or non-sequential order. In terms of storage structure, an ISAM file is made up of data blocks (2048 bytes) and directory blocks. Data blocks contain the user's records which are ordered initially according to the values of the key field. Directory blocks contain pointers to data blocks. ISAM performs all blocking, deblocking and buffering for the user.

4) BTAM (Basic Tape Access Method). This method provides either fixed length or undefined record formats (where records are stored one per block) and storage on tape only. BTAM is used to provide efficient accessing of tape blocks.

5) EAM (Evanescent Access Method). It provides fixed length record formats and storage on direct-access devices only. EAM creates and accesses temporary files only in a random order. Because they are temporary, EAM files have no labels and require no cataloguing or security checks.

Data structures in these five access methods are similar in several respects. In fact, only three structures are provided for records:

1) Fixed length - in which each record contains exactly the same number of bytes. Standard format is known to all DMS access methods.

2) Variable length - in which each record may contain a different number of bytes. In each variable length record, the first two bytes of the record contain the characters "11" and the second two bytes contain the length of the record.

3) Undefined - in which records are identical in length to the input/output buffers defined for the access method.

There are three ways of organizing records into files:

1) random organization,
2) sequential, and
3) indexed sequential.
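The indexed sequential organization can be illustrated with a small Python sketch: records are held in key order in data blocks, and a directory is consulted to find the one block that may hold a given key. This is a simplified illustration under stated assumptions, not the RCA implementation. The class name `IndexedSequentialFile`, the two-records-per-block capacity, and the choice of storing the highest key of each block in the directory are all hypothetical, and the sketch builds the file once without handling later insertions.

```python
from bisect import bisect_left


class IndexedSequentialFile:
    """Toy indexed sequential file: sorted data blocks plus a directory."""

    def __init__(self, records, records_per_block=3):
        # Records are (key, value) pairs, ordered initially by key.
        ordered = sorted(records, key=lambda r: r[0])
        self.blocks = [
            ordered[i:i + records_per_block]
            for i in range(0, len(ordered), records_per_block)
        ]
        # Directory entry i holds the highest key stored in block i.
        self.directory = [block[-1][0] for block in self.blocks]

    def read_sequential(self):
        """Yield every record in key order."""
        for block in self.blocks:
            yield from block

    def read_by_key(self, key):
        """Use the directory to pick one block, then search only that block."""
        i = bisect_left(self.directory, key)
        if i == len(self.blocks):
            return None  # key is larger than every stored key
        for k, value in self.blocks[i]:
            if k == key:
                return value
        return None


f = IndexedSequentialFile(
    [(40, "D"), (10, "A"), (30, "C"), (20, "B"), (50, "E")],
    records_per_block=2,
)
print(list(f.read_sequential()))
print(f.read_by_key(30))
```

The same file thus supports both access patterns named in the text: `read_sequential` walks the blocks in order, while `read_by_key` touches only the directory and a single data block.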

For storage, records may be blocked and unblocked automatically, devices may be tape or direct-access, and blocks may be standard (2064 bytes) or nonstandard (< 4096 bytes). Control codes such as tapemarks, count fields, etc. are handled automatically and may not be specified by the user.
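Blocking and deblocking, which the access methods perform automatically, can be sketched as follows. The Python fragment below is an illustration only: the function names, the 32-byte block size, the 10-byte record length, and the zero-byte padding are assumptions chosen for the example, and the sketch keeps every record whole rather than splitting records between blocks.

```python
def block_records(records, record_length, block_size, pad=b"\x00"):
    """Pack fixed-length records whole into fixed-size blocks."""
    if any(len(r) != record_length for r in records):
        raise ValueError("every record must have the fixed record length")
    per_block = block_size // record_length  # the record-to-block ratio
    if per_block == 0:
        raise ValueError("record does not fit in one block")
    blocks = []
    for i in range(0, len(records), per_block):
        data = b"".join(records[i:i + per_block])
        blocks.append(data.ljust(block_size, pad))  # pad unused space
    return blocks


def deblock_records(blocks, record_length, record_count):
    """Recover the original records from the blocks."""
    records = []
    for block in blocks:
        per_block = len(block) // record_length
        for j in range(per_block):
            if len(records) == record_count:
                return records
            records.append(block[j * record_length:(j + 1) * record_length])
    return records


# Five hypothetical 10-byte records packed into 32-byte blocks.
records = [f"REC-{n:04d}--".encode("ascii") for n in range(1, 6)]
blocks = block_records(records, record_length=10, block_size=32)
print(len(blocks), [len(b) for b in blocks])
print(deblock_records(blocks, 10, len(records)) == records)
```

With these numbers three records fit in each block, so five records occupy two blocks and the second block is partly padding.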

Thus, the following characteristics of file and storage structures are made accessible to programmers by the DMS data description statements:

1. The characteristics for organizing records into files and implementing the structure consist of:
(i) structuring records by input sequence,
(ii) structuring records by value (key),
(iii) implementing structures
    a) by sequential positioning, and
    b) by pointers:
        1) stored in tables or embedded in records,
        2) given as absolute address or relative to some origin.

2. The characteristics for positioning records in device blocks consist of:
(i) the record-to-block ratio, and
(ii) the distribution of records such that records either are maintained whole or are split between blocks.

3. The characteristics for organizing storage blocks and implementing this structure consist of:
(i) block naming,
(ii) formatting for the following supported devices: magnetic tape, magnetic disk, cards, and printer,
(iii) block length specification for supported devices,
(iv) labels for supported devices,
(v) fixed order of device formats,
(vi) fixed occurrences of device formats,
(vii) repetition of formats for tape reels, disk levels, cards and printer pages.

2.7 Data Structures in Current Versions of Higher-Level Programming Languages

Current higher-level programming languages have been developed to take advantage of the data management services provided by operating systems and to satisfy user requirements for more complex working structures.

For example, RCA SPECTRA 70/46 ANSI COBOL (RCA 1969) has statements to evoke SAM and ISAM and their related data structures. The COBOL Data Division has been enhanced: new internal formats have been added, repeating groups can be ordered, and repetition numbers can vary for different record occurrences. The clauses used to specify these options are illustrated in Figure 2-5.

    [USAGE IS ...]

Figure 2-5, a. The COBOL Statement for Declaring Data Types

    OCCURS [integer-1 TO] integer-2 TIMES [DEPENDING ON data-name-1]
        [{ASCENDING | DESCENDING} KEY IS data-name-2 [, data-name-3] ...] ...
        [INDEXED BY index-name-1 [, index-name-2] ...]

Figure 2-5, b. The COBOL Statement for Specifying Repetition

Figure 2-5. Enhanced COBOL Description Statements

PL/I is an example of a higher-level programming language that was designed to incorporate a larger number of record structures than other languages available at the time of its conception. It provided array accessing, hierarchic structuring and string processing for data items and groups of data items.

PL/I provides a rich set of characteristics for structuring and implementing data items (IBM 1965):

(i) symbolic naming,
(ii) the hardware provided character code,
(iii) fixed and varying lengths as specified by the user,
(iv) data types:
    a) character string,
    b) number:
        1) binary or decimal base,
        2) sign - radix or diminished radix complement (depending on the hardware) for binary numbers, and character sign or no sign for decimal numbers,
        3) fixed or floating point scale,
(v) value alignment with zero or blank pad characters,
(vi) data items identified by position.

These data descriptive elements are combined in declaration statements of the form:

    i) DECLARE data item name {... (n) [VARYING] | PICTURE picture string}
    ii) DECLARE data item name FIXED ...

To group data items into hierarchic structures and structures accessible by array indexing, PL/I provides the following elements:

1) a clause which is used to specify the dimensions for array accessing. It has the form (m1, ..., mn) for an n dimensional array, where the ith dimension has mi elements. This clause is used in a DECLARE statement:

    DECLARE data name (m1, ..., mn) ...

2) a clause which is used to describe hierarchic relationships between data items. It has the same form as the level number clause in COBOL. It is used in a DECLARE statement:

    DECLARE level number data item,
            level number data item,
            ...

Such hierarchic structures may also be accessed by array indexing.
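Array accessing with a dimension clause (m1, ..., mn) implies some rule for mapping a tuple of subscripts to a position in linear storage. The report does not state which rule is used; the Python sketch below assumes row-major order (last subscript varying fastest) and subscripts that start at 1. The function name `linear_offset` is hypothetical.

```python
def linear_offset(dims, index):
    """Map a 1-based index tuple to a 0-based offset in row-major order."""
    if len(dims) != len(index):
        raise ValueError("index must have one subscript per dimension")
    offset = 0
    for size, i in zip(dims, index):
        if not 1 <= i <= size:
            raise IndexError("subscript out of range")
        # Shift the running offset by this dimension, then add the subscript.
        offset = offset * size + (i - 1)
    return offset


dims = (2, 3, 4)  # a hypothetical 3-dimensional array with 24 elements
print(linear_offset(dims, (1, 1, 1)))
print(linear_offset(dims, (1, 2, 1)))
print(linear_offset(dims, (2, 3, 4)))
```

Under these assumptions the first element maps to offset 0 and the last to offset 23, so the 24 elements occupy one contiguous run of storage.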

For file and storage structures, PL/I provides statements which are used to invoke the DMS access methods of its underlying operating system.

The characteristics of data structures that are made accessible to the programmer by the data description elements of many current higher-level languages are summarized in Section 2.10.

2.8 Data Structures in Data Base Management Systems

Data Base Management Systems are an outgrowth of Information Storage and Retrieval (ISR) systems. ISR systems are designed to manage large quantities of a particular type of data. For example, one early system, MEDLARS, was created to manage documents for the National Library of Medicine (~ 1968).

In these systems, since only one type of information was to be used, only one type of file structure was required. Also, input and output routines were specialized to handle the file structure most effectively. As a whole, ISR systems were individually tailored for applications such as text-handling and record-keeping.

The development of more generalized text-handling and record-keeping systems led to today's generalized Data Base Management Systems (DBMS's) (CO 1969).

Every DBMS has a language. The data description statements of the language specify the structure of data maintained by the DBMS. In general, the data description statements form the largest part of a DBMS's language.

For example, in the MARK IV DBMS developed by Informatics Inc. (CO 1969), raw data must be input in the format in which it is to be stored. MARK IV formats can be characterized in the following way.

1. The characteristics of individual data items consist of:
(i) symbolic naming,
(ii) the hardware provided character code,
(iii) fixed lengths as specified by the user,
(iv) data types:
    a) character string,
    b) number:
        1) binary or decimal base,
        2) sign - radix or diminished radix complement (depending on the hardware), and character signs or no signs for decimal numbers,
        3) fixed or floating point scale,
(v) data items identified by their position.

2. The characteristics of records consist of:
(i) hierarchic structure,
(ii) fixed order,
(iii) fixed occurrences,
(iv) fixed or varying repetitions ordered as input,
(v) groups of data items identified by their position.

3. MARK IV depends on its underlying Operating System for its storage structures.

MARK IV's language is a tabular language. Forms are provided in which a user selects options provided by the system.

MARK IV is a self-contained DBMS. It is not embedded in any higher-level programming languages. DBMS's which are embedded in some higher-level languages are called host-language DBMS's. They are designed to enhance their host language. This development combines the record structures provided by the host languages with the file and storage structures provided by the DBMS. COBOL and the Honeywell-General Electric Co.'s Integrated Data Store (IDS) together form an example of this type of system (CO 1969).

In COBOL-IDS, COBOL structures are used at the data item and record level. These structures are described by the standard COBOL statements. The enhancement comes at the file level. IDS adds the capability to describe network relationships among records. IDS networks can be viewed as interconnecting ring structures. The interconnections are maintained by embedded pointers. Each record in IDS may participate in more than one ring. Thus, a single record may be associated with many other records. In each IDS ring there is one record which is treated as a master record. It contains control information. The remaining records in the ring are called detail records. Any record may be master in one ring and detail in another. The data description statements used to describe the characteristics of these rings are in the form of additional clauses in the COBOL record statement. Each ring relationship is defined at level 98 in a record description. In IDS terminology a ring is called a CHAIN. The clause for declaring a record to be a chain master has the form:

    98 chain-name CHAIN MASTER.

The clause for declaring a record to be a chain detail has the form:

    98 chain-name CHAIN DETAIL
        [; SELECT UNIQUE MASTER]
        [; MATCH-KEY IS data-name]
        [; CHAIN-ORDER IS SORTED]
        [; {ASCENDING | DESCENDING} SORT-KEY IS data-name]
        [; RANDOMIZE ON data-name]
        [; DUPLICATES NOT ALLOWED].

This clause specifies the chain in which the record is to be a detail, the order in which detail records are to occur (if they are to be ordered), and the field from which a hashed address of the record is to be derived (if this is desired).

MARK IV and COBOL-IDS represent two different classes of DBMS. However, they are both implemented as application programs and are not parts of the operating systems. Many system resources are thus unavailable to the user. Furthermore, privacy protection and access control, which are vital to DBMS users, are difficult to enforce. Therefore, a different approach to building a DBMS was taken by the designers of the Extended Data Management Facility (EDMF) implemented at the Moore School of Electrical Engineering at the University of Pennsylvania (Ma 1971). The EDMF was implemented as a part of the RCA SPECTRA 70/46 Time Sharing Operating System (TSOS). Statements of the EDMF are in the form of either TSOS commands, macro-calls which may be used by the regular applications programmer in assembly language programs, or built-in functions for the FORTRAN and COBOL languages.

The set of record and file structures provided by the EDMF is one of the most extensive that has been implemented. EDMF provides record structures which are beyond the COBOL structures (HS 1971). It provides the following characteristics.

1. The characteristics of individual data items consist of:
(i) symbolic naming,
(ii) the hardware provided character code,
(iii) fixed or variable lengths as specified by the user,
(iv) data types:
    a) character string,
    b) number:
        1) decimal or binary base,
        2) sign - radix or diminished radix complement (depending on the hardware) for binary numbers, and character sign or no sign for decimal numbers,
(v) value alignment - left for character strings and right for numbers, with zero or blank pad characters,
(vi) data items identified by position and by attribute names used as delimiters.

2. The characteristics of records consist of:
(i) hierarchic structure,
(ii) fixed order,
(iii) fixed or optional occurrences of data items and groups,
(iv) fixed and variable repetition of data items and groups ordered as input,
(v) groups identified by position and by using attribute names as markers.

At the file level, the EDMF allows records to be linked together into lists, when the records contain the same data items (called keywords). A record may be linked into any number of lists. Pointers to the heads of the lists are stored in directories (tables) in ascending lexicographical order. By setting limits on list lengths, files may be implemented completely with pointers embedded in records or with tables of pointers or some combination of the two. This is under the user's control, and allows him to organize his data in a wide range of structures, including inverted, multilist, and indexed random organization (HS 1970). EDMF seems to be the only existing DBMS to allow the user this kind of control over the implementation of his file.

Each one of the above DBMS's was designed to enhance various characteristics at either the data item, record or file level, or at all three. The level and degree of enhancement vary from DBMS to DBMS. A summary will be provided in Section 2.10 of the most advanced DBMS features.

2.9 The Data Description Language of the CODASYL Data Base Task Group*

The CODASYL Data Base Task Group (DBTG) was organized to unify work done on current DBMS data description languages. The goal of the DBTG is to produce a single data description language (DDL) in which all current data structures at the data item, record and file levels can be described.

* CODASYL (Conference on Data Systems Languages) is a group originally formed to create a business-oriented language. It produced COBOL and has now extended its interests to DBMS's.

This DDL (CO 1971) includes:

1) the COBOL Data Division which allows the user to specify record formats. Unlike the EDMF, the CODASYL DDL does not allow varying length data items, varying repetitions, or optional occurrences of data items.

2) statements describing network structures. The concept of a SET has been developed to describe file structures. A SET is a sequentially ordered set of records. Each SET has one "owner" record and several "member" records. The concept of "owner" record is similar to that of "master" record in IDS. Member records of SET's are ordered in either of two ways:

    (a) Records may be ordered by ascending or descending sequences based on specific keys.
    (b) Records may be ordered in relation to existing members of the SET as they are input. That is, when a new record is input, it can be automatically placed as the last or first record of the SET.

The SET concept is similar to the IDS chain.

3) statements describing file implementation. The CODASYL DDL, at the file level, allows the user to specify whether a SET of records is to be implemented either with embedded pointers or with tables of pointers. However, these cannot be combined as in the EDMF, and the user has no control over the pointers or table structure.
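The SET concept and its two implementation choices can be sketched in Python. The fragment below is a hedged illustration, not the CODASYL DDL: the class names `Record`, `EmbeddedSet` and `TableSet`, and the ordering labels "SORTED", "FIRST" and "LAST", are hypothetical names chosen for the example. `EmbeddedSet` keeps member order in a pointer stored inside each record, while `TableSet` keeps the same order in a separate table of pointers; because each record here has a single embedded pointer, a record can belong to only one `EmbeddedSet` at a time.

```python
from bisect import bisect_right


class Record:
    def __init__(self, key, data):
        self.key = key
        self.data = data
        self.next = None  # embedded pointer to the next member, if any


class EmbeddedSet:
    """A SET whose member order is held by pointers embedded in the records."""

    def __init__(self, owner, order="LAST"):
        self.owner = owner
        self.order = order  # "FIRST", "LAST", or "SORTED" (ascending key)
        self.first = None   # the owner's pointer to the first member

    def _goes_after(self, record, member):
        if self.order == "FIRST":
            return False
        if self.order == "LAST":
            return True
        return record.key >= member.key  # "SORTED"

    def insert(self, record):
        prev, cur = None, self.first
        # Walk the chain until the insertion point for this ordering.
        while cur is not None and self._goes_after(record, cur):
            prev, cur = cur, cur.next
        record.next = cur
        if prev is None:
            self.first = record
        else:
            prev.next = record

    def members(self):
        cur = self.first
        while cur is not None:
            yield cur
            cur = cur.next


class TableSet:
    """The same SET, with member order held in a table of pointers."""

    def __init__(self, owner, order="LAST"):
        self.owner = owner
        self.order = order
        self.table = []  # references to member records, in SET order

    def insert(self, record):
        if self.order == "FIRST":
            self.table.insert(0, record)
        elif self.order == "LAST":
            self.table.append(record)
        else:  # "SORTED": ascending by key, placed after equal keys
            keys = [r.key for r in self.table]
            self.table.insert(bisect_right(keys, record.key), record)

    def members(self):
        return iter(self.table)


def keys_in_order(set_class, order):
    """Insert members with keys 30, 10, 20 and report the resulting order."""
    s = set_class(Record(0, "owner"), order)
    for key in (30, 10, 20):
        s.insert(Record(key, f"member {key}"))
    return [m.key for m in s.members()]


for order in ("SORTED", "FIRST", "LAST"):
    print(order, keys_in_order(EmbeddedSet, order), keys_in_order(TableSet, order))
```

Both classes present the same logical SET to the caller; only the place where the ordering information lives differs, which is the distinction the file implementation statements express.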

In summary, the following characteristics of data structures are made available to the user by the CODASYL DDL.

1. The characteristics of individual data items consist of:
(i) symbolic naming,
(ii) fixed lengths as specified by the user,
(iii) data types:
    a) character string,
    b) number:
        1) binary or decimal base,
        2) sign - radix or diminished radix complement (depending on hardware) for binary numbers, and character sign or no sign for decimal numbers,
        3) fixed or floating point scale,
(iv) value alignment with blank or zero padding,
(v) data items identified by their position.

2. The characteristics of records consist of:
(i) hierarchic structure,
(ii) fixed order,
(iii) fixed occurrences,
(iv) fixed and dependent repetitions ordered as input,
(v) groups identified by their position.

3. The structure and implementation characteristics of files consist of:
(i) structuring by input sequence,
(ii) structuring by criteria on keys (values):
    a) criteria comparisons: ≤, ≥, =,
    b) conjunctions of criteria,
(iii) implementation:
    a) by embedded pointers,
    b) by tables of pointers.

4. The CODASYL DDL will depend on its implementation for storage structures.

The CODASYL DDL is an attempt to create a common front-end language for describing data structures to DBMS's. There is therefore a degree of overlap between the CODASYL DDL and GDDL developed herein. Before this overlap is discussed, it should be pointed out again that GDDL is designed to be a language for completely describing data structures and for data conversion. The CODASYL DDL is not intended to specify data conversion. Furthermore, GDDL provides the capability of describing storage structures, whereas CODASYL DDL does not. At the record level, CODASYL DDL is based on COBOL and we show in Appendix C that GDDL has more descriptive power than COBOL at the record level. This additional power is obtained by providing more general capabilities for specifying record implementation. At the file level, CODASYL DDL is designed to describe just those file structures existing in current systems. GDDL is designed to provide much greater descriptive power at the file level. The power is provided by generalizing current file structuring technology essentially by allowing the dependency of file structure on data values, record structure, and record implementation to be described.

2.10 Summary

Two trends have appeared in the handling of data by software systems. First, the data structures provided have become increasingly elaborate, and secondly, the user has been given more and more explicit control over setting up the data structures required.

The earliest systems provided the user with certain structural options at the data item level. These options were, however, provided implicitly through a selection of machine instructions. Successive systems provided more capabilities at the record level, and allowed these to be declared explicitly. It was first in operating systems that structuring facilities were offered at the file level. Typically, the structures provided were limited to a few options which frequently included sequential and indexed sequential structures. With the development of DBMS's, users were given more control over the implementation and structure of both records and files. However, they still have no control or even knowledge of the storage structures used.

The DDL presented herein takes these two trends towards their logical conclusion. First, the DDL can describe a more general class of data structures than that provided by current data processing technology. Secondly, the DDL allows every aspect of a data structure at each level to be described explicitly.

Those aspects of data structures which have been identified in the preceding sections have been summarized in Table 2-1. This table is organized to provide a convenient means of evaluating the DDL and its underlying model in later chapters.

Table 2-1. Summary of Data Structure Characteristics

[Table body not recoverable from the scan. Legible row headings: Record Characteristics (Structure Characteristics, Implementation Characteristics, Symbolic Naming); File Characteristics (Structuring by Input Sequence, Criteria on Values (Keys), Criteria on Paths, Conjunction of Criteria, Implementing by Sequential Storage, pointers Embedded in Record, pointers Stored in Table, Path Length Upper Bound Limit); Storage Characteristics (Block Naming, Reel Formatting, Tape and Disk Bytes/Block, Record Whole or Split). An asterisk marks entries that apply to EDMF only.]

CHAPTER 3

RECORD DESCRIPTION

3.1 Introduction

In this chapter we begin our task of showing how the organization of data can be explicitly described. We present the model for record structure that is the foundation for the design of GDDL's record description features. We show that the model is complete for record description in the sense that record structures of Table 2-1 can be described in the model. We also discuss how the model can describe certain generalizations of present record structures. Then we show that the record description statements of GDDL are based on this model. In this way we show that GDDL is also complete and generalized in the above senses. We further demonstrate the completeness of GDDL by noting that the COBOL record description features are properly contained in GDDL and by providing a set of examples which illustrate the ability of GDDL to describe existing record organizations.

3.2 A Model of Record Structures

We begin this section by providing an intuitive introduction to the model.

The smallest meaningful piece of information we will call a "data item". Data items are the components which are organized into records.

Conceptually, a data item is a string of characters, which provide a value for the data item, together with an identification of the type or class of information to which the value belongs. This type or class of information we call the attribute of the data item. When a data item is represented on a storage medium, there must be rules which determine how this data item is implemented as a bit string.

When a user is organizing data items for storage and retrieval from a computer medium, he identifies a particular level of organization which is to be stored and retrieved as a single unit when the data is being used. This level of data item organization we call the record level. A convenient way to conceptualize the organization of data items at the record level is as a hierarchy. It is certainly the case that existing software systems (e.g., COBOL, MARK IV, IDS, EDMF, and the CODASYL DDL) provide hierarchies for organizing data items into records.

The records are themselves finally represented on a storage medium as a bit string. So again there must be rules for specifying how a particular organization conceived by a user is to be represented as a bit string. There are then the following components to this process of data organization:

for data items:
(1) the conceptual structure of data items,
(2) the encoding of this structure into a bit string, and
(3) the resulting bit string representation;

for records:
(1) the conceptual structure of the records,
(2) the encoding of the record structure into a bit string, and
(3) the resulting bit string representation.

We therefore have to model each of these components. The conceptual structure of data items and records is modelled in terms of the ideas of attribute and value by generalizing the work of ( ~ 1967), e ( ~ 1968), h (lin 1970), and (HS 1971). The bit string is simply a sequence of 0's and 1's. The encoding of the conceptual structure is modelled directly in terms of characteristics for encoding attributes and values as bit strings.

The complete model will be presented in two steps. First the model of data items will be described and then the model of records.

3.2.1 The Model of Data Items

3.2.1.1 The Concept of Data Items

The concept of a data item can be described in terms of two primitives, attribute and value, and a definition of data item based on these primitives.

Intuitively, an attribute is a quality, such as size or weight, that is ascribed to an object. For each attribute, there is a set of measures or quantities, known as values. A single value to be associated with the attribute is selected from this set. For example, a measure for the attribute weight is selected from the set of real numbers.

Definition 3-1. A data item is an ordered pair of the form < a, v > where a is an attribute and v is a value.

For example, the pairs < name, JONES >, < age, 32 >, < sex, M >, < school, NEWTOWN HIGH SCHOOL >, < school, UNIVERSITY OF PENNSYLVANIA > are data items.

In representing a data item on a computer medium (such as cards, tape, etc.) both the attribute and the value must be encoded. We shall consider the rules for each kind of encoding separately.

3.2.1.2 Encoding Values

A value is encoded if it is transformed into a bit string according to the following encoding rule. Such a string will be called a value string. The rule for encoding a value is simply a detailed specification of the six characteristics listed below:

1. Character codes. Strings of binary digits are used to encode characters such as letters, numbers and punctuation signs. Character codes have been standardized to the extent that all new computers use either of two codes: USASCII (or ASCII) and EBCDIC. However, it is not sufficient to be able to specify either ASCII or EBCDIC, as there are other codes which are in use on earlier computers. Also, users of large data bases employ what are, in effect, new character codes to compress data. Thus, to be completely general, it must be possible to describe any character code. One way to describe a character code is to list for each character the code in terms of its bit string representation.

Associated with a character code is a sort order. To describe the sort order, the characters of the code can be listed in the correct order. When values are to be translated from one character code to a second character code, it is necessary to indicate for every character in the first code its image in the second code. This can be specified by listing the characters of the second code in the same sort order as the first code.

An example of encoding the characters of a value in EBCDIC is presented below. For the data item < name, JONES >, we have

J -> 11010001
O -> 11010110
N -> 11010101
E -> 11000101
S -> 11100010

2. Length. The length of a value string is the number of bits in the string. For example, the value string of the attribute name in the previous example may be specified to be of length 64 bits, where unused bits may be filled arbitrarily.

3. Length uniformity. If the value strings for an attribute are always of uniform length, then the lengths of the value strings can be described simply by giving the length. However, if the lengths of the value strings for an attribute are not uniform, then either the length of each value string must be given and stored as a data item, or the value string must be delimited by special characters. Thus, value strings may be specified as being either uniform or varying.

4. Value alignment. When the lengths of the value strings for an attribute are to be uniform, the number of characters needed to represent the value may be less than the allotted length. In such cases, it is necessary to specify whether the value is aligned to the right or to the left and to specify the characters to be used to pad out the unused positions. For example, consider the data item < name, JONES >. The value length of the attribute may have been specified as 64 bits and the character code as EBCDIC. Specifying that the value is to be aligned to the left with blank characters used for padding results in the following encoding of the value JONES:

J -> 11010001
O -> 11010110
N -> 11010101
E -> 11000101
S -> 11100010
(blank) -> 01000000
(blank) -> 01000000
(blank) -> 01000000

5. Data type. Value strings may be interpreted either as characters or as numbers. Numbers are either signed or unsigned strings of digits. Signs may be denoted by the plus or minus, by radix complement, or by diminished radix complement. Numbers may be organized either as fixed point, or as floating point numbers with the number of significant digits and the length of the mantissa specified.

6. Value criteria. Numeric and set-theoretic criteria may be used to define the set of acceptable values for a given attribute. For example, values of the attribute age may be restricted to numbers between 21 and 65 for a given set of data items. Or values of the attribute city may be restricted to a particular set of city names.

3.2.1.3 Encoding Attributes

We have seen how the value of a data item is encoded.
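As an illustration of value characteristics 1 through 4, the following Python sketch encodes a value as a uniform-length value string. It is only a sketch under stated assumptions: Python's built-in cp037 codec is used as a stand-in for EBCDIC (it agrees with the codes shown above for the letters, digits and blank used here), every character is assumed to occupy 8 bits, and the function name encode_value and its parameters are hypothetical names that belong neither to the model nor to GDDL.

```python
def encode_value(value, length_bits, align="left", pad=" ", codec="cp037"):
    """Encode `value` as a uniform-length value string of '0'/'1' characters.

    Assumptions (not from the report): cp037 stands in for EBCDIC, and
    every character occupies exactly 8 bits.
    """
    if length_bits % 8 != 0:
        raise ValueError("this sketch only handles whole 8-bit characters")
    width = length_bits // 8
    if len(value) > width:
        raise ValueError("value does not fit in the allotted length")
    # Characteristic 4: align the value and pad the unused positions.
    if align == "left":
        padded = value.ljust(width, pad)
    else:
        padded = value.rjust(width, pad)
    # Characteristic 1: map each character to its bit string in the code.
    return "".join(format(byte, "08b") for byte in padded.encode(codec))


if __name__ == "__main__":
    bits = encode_value("JONES", 64)
    # Print one 8-bit character per line, as in the alignment example.
    for i in range(0, len(bits), 8):
        print(bits[i:i + 8])
```

Choosing a different codec or pad character changes only the arguments, which mirrors the way the model treats each characteristic as an independent part of the encoding rule.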

To encode the entire data item we must now provide a way of identifying the attribute to which that value belongs. This can be achieved in two ways.

The first way is to directly encode the attribute as a bit or character string, and then position this string relative to the value. This way of encoding an attribute can be made to fulfill a second role. We saw in the discussion of length uniformity in the section on encoding values that if a value is specified as having varying length, then it must be delimited by characters which signify the end of the value string. The attribute encoding can serve as such a delimiter for the value string. We will call the string which directly encodes an attribute an attribute marker. The following characteristic is used to specify an attribute marker:

7. Attribute marker. Attribute markers can be either character or bit strings which are positioned directly in front of or directly behind a value string.
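The double role of an attribute marker, identifying the attribute and delimiting a varying-length value, can be sketched in Python as follows. This is a minimal illustration that works on character strings rather than bit strings; the marker strings "/N", "/A" and "/S" and both function names are hypothetical, and the sketch assumes that no value contains a marker.

```python
import re

# Hypothetical attribute markers, placed directly in front of each value.
MARKERS = {"name": "/N", "age": "/A", "sex": "/S"}


def encode_with_markers(items, markers=MARKERS):
    """Concatenate data items, prefixing each value with its attribute marker."""
    return "".join(markers[attribute] + value for attribute, value in items)


def decode_with_markers(text, markers=MARKERS):
    """Recover the data items; each marker also delimits the preceding value."""
    by_marker = {marker: attribute for attribute, marker in markers.items()}
    pattern = "(" + "|".join(re.escape(m) for m in by_marker) + ")"
    # re.split with a capturing group keeps the markers in the result:
    # ['', marker, value, marker, value, ...]
    parts = re.split(pattern, text)
    return [(by_marker[parts[i]], parts[i + 1]) for i in range(1, len(parts), 2)]


if __name__ == "__main__":
    encoded = encode_with_markers([("name", "JONES"), ("age", "32"), ("sex", "M")])
    print(encoded)
    print(decode_with_markers(encoded))
```

Because every value carries its own marker, the data items can be stored in any order and with values of any length, at the cost of the extra marker characters.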

The second way in which the attribute of a particular value can be identified is by knowing that it always occurs in a certain position relative to other values. That is, if a set of data items is organized in such a way that the position of the value corresponding to a given attribute can be identified, then the attribute has been indirectly encoded by positioning. As the encoding of attributes by positioning depends on the organization of sets of data items, this way of encoding attributes will be discussed in the next section.

3.2.2 The Model of Records

3.2.2.1 The Conceptual Record Structure

In this section we want to model the conceptual structure of records. First, however, we must pin down exactly what we mean by a record itself. Then, we can go on to obtain the structure of such records.

In the data processing field, a user of COBOL conceives of a record differently than, say, a user of MARK IV. In the definition of records below, we attempt to give an exact formalization of the notion of record which is independent of any particular software system.

Definition 3-2. A record is a set of data items which are structured according to the following rules:

record -> group
group -> < attribute, {compound value} >
compound value -> compound value, compound value
compound value -> group
compound value -> data item

We use the symbols < > to denote an ordered set and the symbols { } to denote an unordered set.

For example, the data items < name, JONES >, < age, 32 >, and < sex, M > can be organized into the following record:

< person, {< name, JONES >, < age, 32 >, < sex, M >} >

As another example, the data items < name, JONES >, < name, MARY >, < age, 6 >, < name, JOHN >, < age, 10 > can be organized into the record:

< family, {< name, JONES >,
           < child, {< name, MARY >, < age, 6 >} >,
           < child, {< name, JOHN >, < age, 10 >} >} >

In this case < child, {< name, MARY >, < age, 6 >} > and < child, {< name, JOHN >, < age, 10 >} > are groups.

It should be noted that a data item is simply an attribute-value pair, whereas a group is an attribute-compound value pair. When it is necessary to distinguish the attributes associated with compound values from the attributes associated with values, we will refer to them as group attributes and data item attributes respectively. In the example above, "name" and "age" are data item attributes whereas "family" and "child" are group attributes. Compound values are actually groups or data items. The groups forming a group are called subordinate groups.
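The distinction between group attributes and data item attributes can be made concrete with a small Python sketch. The representation is an assumption made for illustration only: a data item is a pair (attribute, value) whose value is a string, and a group is a pair (attribute, list of compound values). A Python list is ordered, whereas the braces of Definition 3-2 denote an unordered set, so the list order here carries no meaning. The function name classify_attributes is hypothetical.

```python
# The family record of the example above, written as nested pairs.
family = ("family", [
    ("name", "JONES"),
    ("child", [("name", "MARY"), ("age", "6")]),
    ("child", [("name", "JOHN"), ("age", "10")]),
])


def classify_attributes(compound_value):
    """Return (group attributes, data item attributes) of a record or group."""
    group_attrs, item_attrs = set(), set()

    def walk(cv):
        attribute, content = cv
        if isinstance(content, str):      # attribute-value pair: a data item
            item_attrs.add(attribute)
        else:                             # attribute-compound value pair: a group
            group_attrs.add(attribute)
            for subordinate in content:
                walk(subordinate)

    walk(compound_value)
    return group_attrs, item_attrs


if __name__ == "__main__":
    print(classify_attributes(family))
```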

We note that as a consequence of the above definition the structure of a record is a hierarchy which has an attribute associated with each part of the hierarchy. We can thus abstract a notion of record structure based on these attributes which is independent of the values. This is done in Definition 3-3.

Definition 3-3. A record structure is a relationship over data item attributes produced according to the following structure productions:

1. record structure -> structure
2. structure -> < group attribute, {substructure} >
3. substructure -> substructure, substructure
4. substructure -> structure
5. substructure -> data item attribute
6. substructure -> null

For example, the data item attributes "name" and "age" may be related by structures obtained from the following structure productions:

family record structure -> structure F1
structure F1 -> < family, {substructure F1F1} >
substructure F1F1 -> substructure F1, substructure F2
substructure F1 -> name
substructure F2 -> substructure F2, substructure F2
substructure F2 -> null
substructure F2 -> structure F2
structure F2 -> < child, {substructure F2F1} >
substructure F2F1 -> substructure F1, substructure F2F2
substructure F2F2 -> age

Two particular structures of these attributes are:

(i) < family, {name} >
(ii) < family, {name, < child, {name, age} >, < child, {name, age} >} >

Note:

i) Production 3 in Definition 3-3 allows a particular substructure to repeat an arbitrary number of times (e.g., in the above example substructure F2 -> substructure F2, substructure F2).

ii) Production 6 allows the occurrence of a particular substructure to be optional (e.g., in the above example substructure F2 -> null).

If we are given a structure, then we can obtain records from it simply by substituting a data item for each data item attribute in the structure. For example, if we make the following substitutions in the structures above:

< name, JONES > for name
< name, MARY > for name
< age, 6 > for age
< name, JOHN > for name
< age, 10 > for age

we obtain the following records:

i) < family, {< name, JONES >} >

ii) < family, {< name, JONES >,
               < child, {< name, MARY >, < age, 6 >} >,
               < child, {< name, JOHN >, < age, 10 >} >} >
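The substitution step can be sketched in Python. The sketch assumes a representation that is not part of the report: a data item attribute is a string, a structure is a pair (group attribute, list of substructures), and values are supplied in the left-to-right order in which the data item attributes occur. The function name substitute is hypothetical.

```python
def substitute(structure, values):
    """Replace each data item attribute in `structure` by a data item."""
    values = iter(values)

    def build(node):
        if isinstance(node, str):             # a data item attribute
            return (node, next(values))       # becomes <attribute, value>
        attribute, substructures = node       # <group attribute, {substructure}>
        return (attribute, [build(s) for s in substructures])

    return build(structure)


# Structures (i) and (ii) from the example above.
structure_i = ("family", ["name"])
structure_ii = ("family", ["name",
                           ("child", ["name", "age"]),
                           ("child", ["name", "age"])])

if __name__ == "__main__":
    print(substitute(structure_i, ["JONES"]))
    print(substitute(structure_ii, ["JONES", "MARY", "6", "JOHN", "10"]))
```

The same structure yields a different record for every different sequence of values, which is why a structure can be described once, independently of the data.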

In a previous section we saw how data items were encoded. Now we must consider how the structure of a record is encoded.

3.2.2.2 Encoding the Record Structure

The structure of a record is a relationship over the data item attributes in the record specified by structure productions. These productions actually produce a hierarchic structure which has the data item attributes on the lowest levels and each higher level identified by a group attribute. Therefore, to encode the structure of a record it is only necessary to ensure that the attribute which is associated with each compound value can be identified.

We have seen that the attribute of a data item can be identified by putting a marker adjacent to its value, or, when the data item appears in a group, the attribute can be identified by the position of its value relative to the values of other attributes.

The attribute associated with a compound value can be identified in similar ways. Markers can be placed adjacent to the compound value using the same "attribute marker" characteristic as before. Alternatively, the attribute for a compound value can be identified by the position in which the compound value occurs relative to the compound values of other attributes.

We will now discuss what characteristics must be specified to identify an attribute from the position of the compound value or value. For convenience in this discussion, we will just use the term compound value to refer to both compound values and values.

The attribute associated with a compound value can be identified if the compound value occurs in a particular order with respect to the compound values of other attributes in the same substructure. In this case, the order can be specified by listing the attributes of the compound values in the appropriate order. Further, if one of the attributes in this list corresponds to a substructure which is optional, then it must be specified that this attribute may not appear. Also, if one of the attributes in the list corresponds to a substructure which repeats, then the number of repetitions must be given.

The characteristics required to identify the attribute of a compound value (or value) from the position of the compound value (or value) are given below:

8. Order. The order of compound values can be specified by listing their attributes in the appropriate order. If the attributes are allowed to appear in any order, then the encoding must be done by markers.

9. Occurrence. The occurrence of an attribute may be either mandatory or optional within a substructure.

10. Repetition number. The repetition number is the number of times an attribute may occur consecutively in a substructure.

11. Repetition uniformity. If the number of times an attribute repeats is always the same (i.e., the repetition of the attribute is uniform), then the repetition number can be specified simply by giving the number directly. However, if the repetition of the attribute is not uniform, then either the repetition number must be encoded and stored as a data item, or the encoding of the values or compound values for the attribute must be delimited.

12. Repetition order. When the same attribute repeats, then the encodings of the values or compound values for it may either be stored directly in any order or in some order described by criteria on the values.

13. Criteria. Numeric and set-theoretic criteria may be used to define the set of acceptable values or compound values for each attribute.

3.2.3 The Specification of the Encoding Characteristics

In the previous sections we have seen that records are encoded by specifying certain characteristics. We will allow each characteristic to be specified either:

1) directly, by specifying explicitly the characteristic, or

2) indirectly, by specifying a function which must be computed to determine the characteristic. The function may be defined over the values of data items or over other characteristics using the usual arithmetic operators.

For example, the length characteristic can be specified directly as a number of bits, or it can be specified indirectly as perhaps

(i) being equal to the value of some particular data item, or
(ii) being equal to the number of repetitions of some particular attribute.
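A characteristic specified indirectly can be sketched in Python with a positionally encoded family record in which the repetition number of the child group is not uniform and is therefore stored as a data item in the record. The layout is purely hypothetical: an 8-character name, a 2-character count (given the made-up attribute "children"), and then one 8-character name and one 2-character age per child, all in character form rather than as bits.

```python
def decode_family(text):
    """Decode a positionally encoded family record (hypothetical layout)."""
    pos = 0

    def take(width):
        # Attributes are identified by position, so fields are read in order.
        nonlocal pos
        field = text[pos:pos + width]
        pos += width
        return field.rstrip()                 # drop blank padding on the right

    name = take(8)                            # length specified directly
    count = take(2)                           # data item holding the repetition number
    children = []
    for _ in range(int(count)):               # repetition number specified indirectly
        child_name = take(8)
        child_age = take(2)
        children.append(("child", [("name", child_name), ("age", child_age)]))
    return ("family", [("name", name), ("children", count)] + children)


if __name__ == "__main__":
    encoded = "JONES   " + "02" + "MARY    " + "06" + "JOHN    " + "10"
    print(decode_family(encoded))
```

Storing the count in the record is one of the two options named in characteristic 11; the other, delimiting the repeated values, would replace the count field with an end marker.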

3.3 Interpretation of Common Data Processing Concepts in Terms of the Model of Record Structures

A set of structure productions together with a specification of the rules for encoding the structures determines a particular type of record, or record type. Two records are of the same record type if and only if they can both be obtained from the same structure productions and they both have the same encoding characteristics. Note that the term record is sometimes used in data processing literature to refer to what we call a record type.

Note that the production rules of Definition 3-2 make it possible to distinguish easily between a data item and a record consisting of a single data item, even though they both contain a single value. For example, < name, JONES > is a data item, whereas < person, {< name, JONES >} > is a record. This distinction reflects the fact that a data item in itself is only a basic unit of information in some data organization, whereas a data item structured as a record is in addition the basic unit which is stored or retrieved when that data organization is used.

Two groups are of the same group type if and only if they can both be obtained from the same structure productions and they both have the same encoding characteristics.

A data item corresponds to the intuitive idea of a field. Two fields are of the same field type if and only if they both have the same attribute and are both encoded in the same way.

In early versions of COBOL and in some DMS's only one type of record is allowed per file. In these systems there was therefore no need to refer to particular types of records. However, the model allows for the appearance of more than one type of record in a file. Therefore, some means of referring to particular types of records must be provided. Similarly, it will be useful to be able to refer to particular types of groups and fields. We will use the attribute A of a record (group, field) < A, ... > to name the type of that record (group, field). Thus, a record < person, {...} > is of type person, and a field < age, 10 > is of type age. To ensure that this way of referring to types of records (groups, fields) is unambiguous, we must make the following convention:

Within a file, a given attribute is associated with only one structure and only one set of encoding characteristics.

In particular this requires:

(1) A given attribute can occur in only one production of the form:

    structure -> < attribute, {substructure} >

(2) If A occurs in a production of the form:

    structure -> < A, {substructure} >

    then A cannot occur in the substructure.

We will see in Section 3.5 that this convention ensures that the structure productions produce only hierarchic organizations.
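The two requirements of the convention can be checked mechanically. The Python sketch below is an approximation under stated assumptions: it inspects a single expanded structure, written as nested pairs (group attribute, list of substructures) with data item attributes as strings, rather than the productions themselves. For structures with optional or repeating substructures this is a stricter test than the convention, since it demands that every occurrence of a group attribute carry an identical substructure. The function names are hypothetical.

```python
def attributes_in(substructures):
    """All attributes occurring at any depth inside a list of substructures."""
    found = set()
    for s in substructures:
        if isinstance(s, str):                # data item attribute
            found.add(s)
        else:                                 # nested structure
            found.add(s[0])
            found |= attributes_in(s[1])
    return found


def satisfies_convention(structure, seen=None):
    """Check requirements (1) and (2) on one expanded structure."""
    seen = {} if seen is None else seen
    attribute, subs = structure
    # (1) a group attribute is associated with only one substructure
    if seen.setdefault(attribute, subs) != subs:
        return False
    # (2) the attribute must not occur inside its own substructure
    if attribute in attributes_in(subs):
        return False
    return all(satisfies_convention(s, seen)
               for s in subs if not isinstance(s, str))


if __name__ == "__main__":
    family = ("family", ["name",
                         ("child", ["name", "age"]),
                         ("child", ["name", "age"])])
    print(satisfies_convention(family))
```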

3.4 An Application of the Model of Record Structures

An example of using the model to completely encode a set of data items in a given structure as a bit string is given below. Consider the data items < name, JONES >, < age, 32 >, and < sex, M > and the structure specified by the structure productions:

person record structure -> structure P1
structure P1 -> < person, {substructure P1P1} >
substructure P1P1 -> substructure P11, substructure P1P2
substructure P1P2 -> substructure P12, substructure P13
substructure P11 -> name
substructure P12 -> age
substructure P13 -> sex

The following record is obtained from these structure productions:

< person, {< name, JONES >, < age, 32 >, < sex, M >} >

The bit string representation of this record is produced using the following encoding characteristics:

(1) The character code for the values of name, age and sex is EBCDIC.
(2) The length of values of name is 64 bits, of age is 16 bits, and of sex is 8 bits.
(3) The lengths of values of name, age and sex are uniform.
(4) The values of name are left aligned and padded with blanks.
(5) The values of name, age and sex are to be interpreted as character strings.
(6) There are no restrictions defined by criteria on the values of name, age and sex.
(7) No attribute markers are used with value strings of name, age and sex.

The structure is encoded according to the following characteristics:

(8) The attributes name, age and sex appear in the order in which they are named by the structure productions.
(9) An occurrence of each attribute is mandatory.
(10) Each attribute occurs once in a structure.
(11) The repetition for each attribute is uniform.
(12) Since there may be only one occurrence of the attributes name, age and sex, the repetition order criterion does not apply.
(13) There are no restrictions defined by criteria on the compound values of person.

Applying these encoding characteristics, the following record representation results:

1101000111010110110101011100010111100010010000000100000001000000111100111111001011010100

For every different set of data items which are substituted in the structure obtained from the above set of structure productions, a different bit string is produced by these encoding characteristics.
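The encoding of the person record can be reproduced with a short Python sketch. As before, this is an illustration under assumptions rather than a definitive implementation: Python's cp037 codec stands in for EBCDIC, each character is taken to occupy 8 bits, the record is written as a nested pair whose list order is the fixed order of characteristic (8), and the names LENGTHS and encode_record are hypothetical.

```python
# Characteristic (2): the length in bits of the value string of each attribute.
LENGTHS = {"name": 64, "age": 16, "sex": 8}


def encode_record(record, lengths=LENGTHS, codec="cp037"):
    """Encode a one-level record positionally, without attribute markers."""
    _group_attribute, data_items = record
    bits = []
    for attribute, value in data_items:           # (8) fixed order, (7) no markers
        width = lengths[attribute] // 8           # (3) uniform length, in characters
        padded = value.ljust(width)               # (4) left aligned, blank padded
        encoded = padded.encode(codec)            # (1) character code
        bits.append("".join(format(byte, "08b") for byte in encoded))
    return "".join(bits)


person = ("person", [("name", "JONES"), ("age", "32"), ("sex", "M")])

if __name__ == "__main__":
    representation = encode_record(person)
    print(len(representation))
    print(representation)
```

The values of age and sex fill their allotted lengths exactly, so the alignment applied to them here has no effect on the result.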

3.5 The Completeness and Generality of the Model

To be complete, the model must incorporate in itself all of the characteristics of record structures derived in Table 2-1. This is done for the data item characteristics as follows: Symbolic naming appears in the model as the concept of an attribute. The implementation characteristics for data items appear in the model directly as encoding characteristics.

The characteristics relating to the structure of records are incorporated in the model as follows: The structuring characteristics of records appear in the model as the concept of record structure. The implementation characteristics are incorporated directly as encoding characteristics. Thus, the model includes each of the record level characteristics appearing in Table 2-1. In this sense, the model is complete.

We further note that the structure productions and the convention of Section 3.3 impose a partial ordering on the attributes of a structure. This is proved as follows:

Theorem: The structure productions and the convention of Section 3.3 impose a partial ordering over the attributes of a record structure.

Proof: A partial ordering is a relation which is

1) reflexive,
2) antisymmetric,
3) transitive.

Let us define ≥ to be a relation over attributes as follows: for attributes a and b, a ≥ b if and only if a = b, or < a, { ... b ... } > is a structure, where b may appear in any depth of { } or < > brackets. We will now show ≥ is a partial ordering.

1) By definition ≥ is reflexive.

2) Assume that a ≥ b and b ≥ a and that a ≠ b. This means < a, { ... b ... } > and < b, { ... a ... } > are structures. But by (1) of the convention, the attribute b can only be associated with one substructure, which must therefore be { ... a ... }. Thus, < a, { ... b ... } > is actually < a, { ... < b, { ... a ... } > ... } >. This is not allowed by (2) of the convention. Thus, a ≥ b and b ≥ a implies a = b. Hence ≥ is antisymmetric.

3) Assume a ≥ b and b ≥ c. If a = b and/or b = c, then a ≥ c. If < a, { ... b ... } > and < b, { ... c ... } > are structures, then by (1) of the convention < a, { ... b ... } > is actually < a, { ... < b, { ... c ... } > ... } >. Thus, a ≥ c, and ≥ is transitive.

Therefore, ≥ is a partial ordering.
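The relation used in the proof can be computed for a concrete structure and its three properties checked directly. The following Python sketch is an illustration only: it assumes structures written as nested pairs (group attribute, list of substructures) with data item attributes as strings, and the function names relation and is_partial_order are hypothetical.

```python
from itertools import product


def relation(structure):
    """Pairs (a, b) with a >= b: a == b, or b occurs at any depth inside a."""
    pairs = set()

    def walk(node):
        # Returns the set of attributes at or below `node`.
        if isinstance(node, str):
            pairs.add((node, node))
            return {node}
        attribute, subs = node
        below = set()
        for s in subs:
            below |= walk(s)
        pairs.add((attribute, attribute))
        pairs.update((attribute, b) for b in below)
        return below | {attribute}

    walk(structure)
    return pairs


def is_partial_order(pairs):
    """Check reflexivity, antisymmetry and transitivity of a finite relation."""
    elements = {x for pair in pairs for x in pair}
    reflexive = all((x, x) in pairs for x in elements)
    antisymmetric = all(a == b or (b, a) not in pairs for a, b in pairs)
    transitive = all((a, d) in pairs
                     for (a, b), (c, d) in product(pairs, repeat=2) if b == c)
    return reflexive and antisymmetric and transitive


if __name__ == "__main__":
    family = ("family", ["name", ("child", ["name", "age"])])
    print(is_partial_order(relation(family)))
```

A structure that breaks requirement (2) of the convention, such as an attribute nested inside its own substructure, fails the antisymmetry test, which is the case excluded in step 2) of the proof.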

Mathematically, any hierarchy can be realized by a partial ordering (131 1948). From the above proof, it follows that the structure productions and conventions can realize any hierarchic record structure.

The characteristics of Table 2-1 are incorporated in more generalized forms in the model to allow for the description of variations of existing data structures. This generality is provided in the following ways:

1) The model provides a more generalized way to describe the order of data items and groups. As we have seen in Table 2-1, current systems only provide for the specification of fixed ordering. However, the ordering characteristic of the model allows order to be specified as fixed or arbitrary relative to the groups. For example, consider the following group

< x, {< y, {< ..., a >, < t, b >} >, < u, c >, < v, {< ..., d >, < s, e >} >} >

with the following order characteristic: the ordering for the compound values of attribute x is fixed, and the ordering for the compound values of attributes y and v is arbitrary. This results in the following valid orderings of the values a, b, c, d, e:

abcde, bacde, abced, and baced.

Such variable orderings are not permitted in current systems.

2) The model provides a more generalized way to specify the encoding characteristics than is required to describe the characteristics of Table 2-1. In Table 2-1, we saw that the characteristics length and repetition could be specified as depending on some single other data item. In the model, all characteristics can be specified as depending on other data items, other characteristics and functions of these. This greatly increases the variety of encodings which can be specified.

In these ways, the model allows generalizations of current data representations at the record level to be specified.
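The ordering example in item 1) can be checked by enumeration. The Python sketch below is a simplification made for illustration: data items are reduced to their values a through e, a group is a pair (attribute, list of members), and the set passed as `arbitrary` names the group attributes whose compound values may appear in any order. The function name orderings is hypothetical.

```python
from itertools import permutations, product


def orderings(node, arbitrary):
    """All value sequences permitted for `node`.

    `node` is a value (a string) or a group (attribute, [members]);
    `arbitrary` is the set of group attributes whose order is arbitrary.
    """
    if isinstance(node, str):
        return {node}
    attribute, members = node
    if attribute in arbitrary:
        arrangements = permutations(members)      # any order of the members
    else:
        arrangements = [tuple(members)]           # fixed order only
    results = set()
    for arrangement in arrangements:
        choices = [orderings(m, arbitrary) for m in arrangement]
        for choice in product(*choices):
            results.add("".join(choice))
    return results


# Group x holds group y (values a, b), the value c, and group v (values d, e).
group = ("x", [("y", ["a", "b"]), "c", ("v", ["d", "e"])])

if __name__ == "__main__":
    print(sorted(orderings(group, {"y", "v"})))
```

With x fixed and y and v arbitrary the enumeration yields exactly the four orderings listed above; making x arbitrary as well would add the orderings in which c and the two groups change places.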

3.6 The Relationship Between the Model and GDDL

GDDL has been explicitly designed in terms of the model. A GDDL statement consists of an identifying name and a string of parameters. The FIELD and GROUP statements are used to describe the conceptual organization of data items and groups. Each encoding characteristic of data items and the structure of records can be specified by one or more parameters in GDDL statements. The parameters and statements for these characteristics are listed in Table 3-1 given below:

Value Characteristics         Statements and Parameters              Section in Appendix A

Character Code                FIELD statement parameter (ii)         1.1
                              CHAR statement                         1.4.1
                              SET statement                          2.1.2.1
Length                        FIELD statement parameter (iii),       1.1
                              parameter (iv)
Length Uniformity             FIELD statement parameter (v)          1.1
Value Alignment               FIELD statement parameter (ix)         1.1
Data Type                     FIELD statement parameter (vi)         1.1
Value Criteria                GROUP statement parameter (iii)f       1.2
                              Criterion statements                   2.1

Attribute Characteristics     Statements and Parameters              Section in Appendix A

Attribute Markers             CONCODE statement                      1.4.3
Order                         GROUP statement parameters (ii)        1.2
                              and (iii)a
Occurrence                    GROUP statement parameter (iii)b       1.2
Repetition Number             GROUP statement parameter (iii)c       1.2
Repetition Uniformity         GROUP statement parameter (iii)d       1.2
Repetition Order              GROUP statement parameter (iii)e       1.2
Criteria                      GROUP statement parameter (iii)f       1.2
                              Criterion statements                   2.1

Specification of              Statements and Parameters              Section in Appendix A
Characteristics

Direct                        By listed parameters
Indirect                      Parameter statements                   1.4.4

Table 3-1. The Relationship Between the Model and GDDL

Insight into the relationship between the model and GDDL can best be obtained by comparing the format of the GDDL FIELD and GROUP statements with the definitions of field type and group type (see Section 3.3 and Definitions 3-1 and 3-3).

The FIELD statement has the following format:

    FIELD ( field name, encoding characteristics )

This corresponds to the specification of a field type in the following way. The attribute corresponds to the field name, and encoding characteristics appear directly. Thus, we see that the FIELD statement specifies data items.

The GROUP statement has the following format:

    GROUP ( group name, ... ; (list), ... , (list) ... ) .

This corresponds to the specification of a group type in the following way. Compare the structure productions of Definition 3-3 with this format. The production of the type:

    structure → < attribute, { substructure } >

corresponds to the format of the GROUP statement, with the attribute corresponding to the group name, and with all the substructures that can be obtained using the remaining types of productions corresponding to (list), ... , (list). The encoding characteristics for each substructure are included in each list. Thus, we see that the GROUP statement specifies the structure for groups.

To specify that a particular group is to be treated as a record, the RECORD statement is used (see Section 1.3 in Appendix A).
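The correspondence between field and group types and the FIELD and GROUP statements can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class names `FieldType` and `GroupType`, the helper functions, and in particular the `key=value` rendering of encoding characteristics are hypothetical and are not GDDL syntax. Only the outer shape `FIELD ( field name, encoding characteristics )` is taken from the text.

```python
from dataclasses import dataclass, field


@dataclass
class FieldType:
    attribute: str                       # corresponds to the field name
    characteristics: dict = field(default_factory=dict)


@dataclass
class GroupType:
    attribute: str                       # corresponds to the group name
    substructures: list = field(default_factory=list)


def field_statement(f):
    """Render a FIELD-like statement; the parameter rendering is a placeholder."""
    params = ", ".join(f"{k}={v}" for k, v in f.characteristics.items())
    return f"FIELD ( {f.attribute}, {params} )"


def leaf_attributes(node):
    """List the attributes of all data items reachable from a type."""
    if isinstance(node, FieldType):
        return [node.attribute]
    return [a for sub in node.substructures for a in leaf_attributes(sub)]


person = GroupType("person", [
    FieldType("name", {"length": 20}),   # hypothetical characteristic
    FieldType("age", {"length": 3}),     # hypothetical characteristic
])

print(field_statement(person.substructures[0]))
print(leaf_attributes(person))
```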

From the above table, we note that every characteristic of the model is included in GDDL. Since the complete set of characteristics can encode the structure and values of data items, GDDL therefore has the same capability. This, in effect, completes the argument that GDDL can specify any record level structures which can be described in the model.

3.7 Demonstrations of GDDL's Completeness

In the previous section we showed that GDDL is complete for record description by showing that the model on which it was based is complete. We now provide several practical examples of its completeness.

The first of these examples is a demonstration that GDDL contains the COBOL record description features as a proper subset. COBOL was chosen because it is the prototype for almost every DBMS DDL and for the CODASYL DDL effort. It has the most highly developed record description capabilities currently available. The demonstration is given in Appendix C, part 1. In Appendix C, part 2, three examples are given of record characteristics describable in GDDL but not in COBOL.

The remaining examples demonstrate the use of GDDL in describing real-world records. These record descriptions are part of larger examples of complete conversions of data from one structure to another. They are given in Appendix B.

CHAPTER 4

FILE DESCRIPTION

4.1 Introduction

This chapter is devoted to the study and description of organizations of records called files. We develop a model of file structures which is a very general extension of current concepts of files as analyzed in Chapter 2. This model leads to the technique for describing file structures that is incorporated in GDDL. This technique is illustrated in a series of examples which show that GDDL can describe several well-known file structures.

4.2 A Model of File Structures

In Chapter 3, we developed a model of records. In this chapter, we are concerned with the record as a basic unit of storage and retrieval. When large numbers of records are to be stored and retrieved, a problem of efficient utilization arises. For example, store time is conserved if data need not be rearranged each time a new record is stored. And search time is conserved if records can be so arranged that each record is stored physically next to the record that is needed next. Then, when the first record to be used is found, succeeding records can be directly accessed in the order of usage. However, when access to two or more records from a single record is required, a sequential ordering of records does not in itself provide the most efficient utilization.

A user, then, should conceive of the records as being connected together in some way by access paths. These paths make a record at one point on a path accessible to records which occur at points previous to it on the path. They represent connections among the records in question that the user wants to exploit for storage and retrieval. We call such an organization of records the conceptual file structure. When this structure is implemented on a storage medium, it must be represented in some way by a string of bits.

As seen in Chapter 2, there are currently three ways in which the access paths of a file structure are implemented. If there is to be an access path from a record (say, A) to another record (say, B), it may be implemented by:

(1) sequencing position - the bit string representation for B is concatenated after the bit string representation for A (see Figure 4-1, a);

(2) embedding pointers in the records - a pointer to B (i.e., an encoding of the position that the bit string representation of B occupies in the record sequence) is included as a field in A (see Figure 4-1, b);

(3) arranging pointers in tables - a pointer to B is concatenated after the pointer to A in a sequence of pointers (called a table) which is maintained separately from the records themselves.

Ultimately, a pointer to B will give the physical address of the bit string representation of B when it is stored on a storage medium. How the actual bit string for a pointer can be obtained is discussed in Chapter 5, after we have considered the organization of storage media.
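The three implementations of an access path from A to B can be sketched in Python as follows. This is an illustration only: list positions stand in for positions in the record sequence, and the dictionary layout and function names are hypothetical rather than anything defined in the report.

```python
# Two records, A and B, with an access path from A to B.
A, B = "record-A", "record-B"

# (1) Sequencing: B is stored immediately after A.
sequenced = [A, B]

def next_by_sequence(store, position):
    return store[position + 1]

# (2) Embedded pointers: each record carries the position of its successor.
embedded = [
    {"data": B, "next": None},   # position 0
    {"data": A, "next": 0},      # position 1, points to B
]

def next_by_embedded_pointer(store, position):
    return store[store[position]["next"]]["data"]

# (3) Pointer table kept separately from the records themselves.
records = [B, A]                 # records in arbitrary storage order
table = [1, 0]                   # pointer to A, followed by pointer to B

def next_by_table(records, table, entry):
    return records[table[entry + 1]]

print(next_by_sequence(sequenced, 0))
print(next_by_embedded_pointer(embedded, 1))
print(next_by_table(records, table, 0))
```

In all three cases the record reached from A is B; only the place where the path information is kept differs.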

Figure 4-1. Implementation of Access Paths: (a) By Sequencing; (b) By Embedding Pointers; (c) By Using Tables of Pointers. In the figure, bsr means: bit string representation of.

We saw in Chapter 3 how the records themselves are encoded as bit strings. Now we must consider the rules for encoding the file structure into a bit string. If the file structure is to be implemented by sequencing, the rules must determine the sequence in which the bit strings representing the records occur. If the file structure is to be implemented by pointers, the rules must determine how the pointers are encoded into bit strings, where these bit strings must be positioned in relation to the bit strings of the records, and the sequence in which the bit strings of the records must occur. These rules will then determine a bit string which represents the file structure.

There are thus three components of this process:

(1) the conceptual file structure,

(2) the final bit string, and

(3) rules for encoding the conceptual file structure of records as a bit string.

We therefore have to model each of these components. The modelling of the conceptual structure is influenced by (Co 1970). The rules for encoding are modelled after the work of (HS 1970). The bit string is simply a sequence of 0's and 1's.

First, the conceptual file structure will be described. And secondly, the rules for encoding the file structure will be specified.

4.2.1 The Conceptual File Structure

We noted in the previous section that the file structure determines which records are connected by access paths. In other words, it determines a relation (called a file relation) among records on the basis of access paths. Consider two records which we will call A and B, such that either

(i) the bit string representation of B is concatenated after the bit string representation of A, or

(ii) there is a pointer from A to B.

Then we say that there is a direct access path from A to B. Relative to this path we call record A the head of the path and record B the tail of the path. This terminology allows us to refer to records connected by access paths without naming the specific records.

Definition 4-1. The file relation determined by access paths through a set of records consists of the set of ordered pairs < head record, tail record > for each direct access path.

As examples, consider that we are given a set of records, S = {r_1, ..., r_n}, where r_i is a record for 1 ≤ i ≤ n.

(1) The access paths of the list structure:

    r_1 → r_2 → ... → r_{n-1} → r_n

give the relation R_1 = {< r_1, r_2 >, < r_2, r_3 >, ..., < r_{n-1}, r_n >}.

(2) The access paths of the tree structure give the relation R_2 = {< r_1, r_2 >, < r_1, r_3 >, < r_2, r_4 >, < r_2, r_5 >, ...}.

(3) The access paths of the ring structure give the relation R_3 = {< r_1, r_2 >, ..., < r_{n-1}, r_n >, < r_n, r_1 >}.
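The three example relations can be written down directly as sets of ordered pairs. The Python sketch below is illustrative: records are represented by their indices, the function names are hypothetical, and the tree relation contains only the four pairs that are legible in the example.

```python
def list_relation(n):
    """Pairs <r_i, r_{i+1}> for a list structure over r_1 .. r_n."""
    return {(i, i + 1) for i in range(1, n)}


def ring_relation(n):
    """A list structure closed by the extra pair <r_n, r_1>."""
    return list_relation(n) | {(n, 1)}


# Only the pairs that can be read from the tree example.
tree_relation = {(1, 2), (1, 3), (2, 4), (2, 5)}


def tails_of(relation, head):
    """All records directly accessible from the given head record."""
    return sorted(tail for h, tail in relation if h == head)


print(sorted(list_relation(4)))
print(sorted(ring_relation(4)))
print(tails_of(tree_relation, 1))
```

The list and the ring differ by a single pair, which is exactly the access path that closes the ring.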

It will be convenient to introduce the following terminology:

a) If the pair of records < r_i, r_j > is in a file relation R, then we say that there is a path of length 1 from r_i to r_j for relation R. Therefore, a direct access path has length one.

b) If the pair of records < r_i, r_j > is not in a file relation R, we say there is a path of length 0 from r_i to r_j for relation R.

c) If the pairs of records < r_1, r_2 >, < r_2, r_3 >, ..., < r_n, r_{n+1} > are in a file relation R, then we say that there is a path of length n from r_1 to r_{n+1} for relation R.

To model the conceptual file structure we must have a way to specify any file relation that a user may require. In general, there may be an arbitrarily large number of records that can be included in a file structure. Therefore, it is not practical for a user to state the file relation extensively by listing all the pairs of records. Instead, he can specify criteria over the records which will determine when two such records are to be in the relation. Thus, for two records A and B, < A, B > is a member of a file relation if and only if A and B satisfy the criteria for the relation. Such criteria can describe explicitly the conditions which must be met for two records to be connected by a direct access path. We provide below a set of production rules for specifying criteria.

At this point it is worth noting that in Chapter 3, we were only concerned with hierarchic organizations and so simple production rules were all that was necessary to specify record structures. However, to organize records into files, a far wider variety of organizations is required and, therefore, a more elaborate way of specifying them is needed.
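The idea of determining a file relation by a criterion, rather than by listing pairs, can be sketched in Python together with the path-length terminology. This is a minimal sketch under assumptions: records are dictionaries with hypothetical `key` and `seq` fields, the criterion is an ordinary Python function rather than a GDDL criterion, and `has_path` covers only lengths of 1 or more (cases a and c of the terminology).

```python
def file_relation(records, criterion):
    """Set of (head key, tail key) pairs for which criterion(head, tail) holds."""
    return {
        (head["key"], tail["key"])
        for head in records
        for tail in records
        if criterion(head, tail)
    }


def has_path(relation, start, end, length):
    """True if a chain of `length` pairs in the relation leads from start to end."""
    if length < 1:
        raise ValueError("length must be at least 1")
    frontier = {start}
    for _ in range(length):
        # Advance one direct access path from every record reached so far.
        frontier = {tail for head, tail in relation if head in frontier}
    return end in frontier


records = [{"key": k, "seq": i} for i, k in enumerate(["r1", "r2", "r3", "r4"], start=1)]

# Hypothetical criterion: the tail is the next record in sequence after the head.
def next_in_sequence(head, tail):
    return tail["seq"] == head["seq"] + 1

relation = file_relation(records, next_in_sequence)
print(sorted(relation))
print(has_path(relation, "r1", "r4", 3))
```

The criterion stays the same however many records are added, which is the practical advantage over an explicit list of pairs.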

Definition 4-2. A file structure is a file relation determined by criteria obtained from the following production system:

Criterion Production System:

Primitives: attribute, bit string, character string, characteristic, integer, arithmetic relations (=, ≤, etc.), arithmetic operators (+, -, etc.), set membership relation (∈)

Rules to produce the names of records, fields, characteristics and paths:

    index → (integer)

    record-modifier → HEAD
                    → X integer

    attribute-form → attribute
                   → attribute index

    record-attribute → attribute record-modifier
                     → attribute

    attribute-modifier → attribute-form
                       → attribute-form OF attribute-modifier
                       → attribute record-modifier

    record-reference → record-attribute
                     → record-attribute criterion

    field-reference → attribute-modifier

    characteristic-reference → characteristic

    path-reference → PATH ( record-reference, record-reference, criterion )

Rules to produce set-theoretic criterion:

    constant → character string
             → bit string

    set-member → field-reference
               → constant
               → set-member, set-member

    set → {set-member}

    set-criterion → field-reference ∈ set
                  → characteristic-reference ∈ set

Rules to produce arithmetic criterion:

    term → VALUE ( field-reference )
         → PARAMETER ( characteristic-reference )
         → LENGTH ( path-reference )
         → constant

    relation-symbol → = | ≤ | > | ≥ | ...

    arithmetic-operator → + | - | × | ...

    arithmetic-expression → term
                          → (arithmetic-expression) arithmetic-operator (arithmetic-expression)

    arithmetic-criterion
implies that one target record is to be formed for each such value of A_S (i.e., if there are n such values of A_S, then n target records are formed);

< A_T; A_S(i); ... > implies that only one target record is to be formed and the remaining values of A_S are to be discarded (i.e., are not to be used as values for A_T in other target records);

2) when a target attribute A_T repeats an unlimited number of times, then specifying < A_T; A_S; ... > implies that A_T will repeat exactly as many times as A_S repeats;

3) when a target attribute A_T repeats either a fixed or bounded number of times, say m, then specifying

a) < A_T; A_S; ... > implies that whenever the number of A_T repetitions is less than the number of A_S repetitions, then target records are to be formed such that each value of A_S appears in some target record;

b) < A_T(m); A_S(m); ... > implies that whenever the number of A_T repetitions is less than the number of A_S repetitions, then only one target record is to be formed with the i-th value of A_S as the i-th value of A_T, and the remaining source values of A_S are to be discarded.
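The two behaviours in rule 3 can be sketched in Python. This is a reading of the rule under assumptions rather than the report's algorithm: source values are taken in order, case a) is modelled by filling successive target records with up to m values each, and case b) by keeping the first m values and discarding the rest. The function name and the `keep_all` flag are hypothetical.

```python
def form_targets(source_values, m, keep_all):
    """Distribute repeating source values over target records.

    m        -- maximum number of repetitions of the target attribute
    keep_all -- True models case a) (every source value appears in some
                target record); False models case b) (one target record,
                remaining source values discarded)
    """
    if keep_all:
        return [source_values[i:i + m] for i in range(0, len(source_values), m)]
    return [source_values[:m]]


values = ["a", "b", "c", "d", "e"]
print(form_targets(values, 2, keep_all=True))
print(form_targets(values, 2, keep_all=False))
```

With five source values and m = 2, case a) yields three target records and case b) yields one.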

6.4 Applications of the Model of the Association List

Example 1. Extraction of a New File from an Existing File

Consider a source file F1 whose records are described in the following way:

i) The structures of the records are described by the set of productions P1:

    record structure → structure R1
    structure R1 → < person, {substructure R1R1} >
    substructure R1R1 → substructure R11, substructure R1R2
    substructure R1R2 → substructure R12, substructure R1R3
    substructure R1R3 → substructure R13, substructure R14
    substructure R11 → name
    substructure R12 → age
    substructure R13 → sex
    substructure R14 → null
    substructure R14 → substructure R14, substructure R14
    substructure R14 → structure R14
    structure R14 → < book, {substructure R14R1} >
    substructure R14R1 → substructure R141, substructure R14R2
    substructure R14R2 → substructure R142, substructure R143
    substructure R141 → title
    substructure R142 → pages
    substructure R143 → date

ii) The encoding of the records is specified by a set of characteristics C1 (the exact specification of these characteristics is not required for the purpose of this example).

Consider a target file F2 whose records are described in the following way:

i) The structures of the records are described by the set of productions P2:

    record structure → structure R2
    structure R2 →

< author, F3; name, F1; criterion >

in a new target record. To find values for the target attribute 'author' all records are checked to see if they contain a value of 'title' equal to the value of 'title' obtained for the target attribute 'title' (in this case SCIENCE II). The two records shown above contain such a value. Therefore, they are used as sources for the values of the target attribute 'author'. In this way, the values JONES and ROE are obtained. Finally, the values for 'date' and 'pages' are obtained from the same group 'book' in the same record which was the source of the value SCIENCE II. In this way, the target record for SCIENCE II is formed.
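The extraction described in this example can be sketched in Python. It is an illustration under assumptions, not the report's conversion procedure: person records are nested dictionaries following the source productions (name, age, sex, repeating book group with title, pages, date), and every data value other than JONES, ROE and SCIENCE II is a made-up placeholder. Grouping by title in a single pass gives the same result as checking all records for each title.

```python
def extract_books(persons):
    """Form one target record per distinct title, collecting its authors."""
    books = {}
    for person in persons:
        for book in person["book"]:
            # 'pages' and 'date' come from the first 'book' group seen for the title.
            target = books.setdefault(book["title"], {
                "title": book["title"],
                "author": [],
                "pages": book["pages"],
                "date": book["date"],
            })
            target["author"].append(person["name"])
    return list(books.values())


# Placeholder data; only the names and the shared title come from the example.
persons = [
    {"name": "JONES", "age": 45, "sex": "M",
     "book": [{"title": "SCIENCE II", "pages": 300, "date": 1969}]},
    {"name": "ROE", "age": 38, "sex": "F",
     "book": [{"title": "SCIENCE II", "pages": 300, "date": 1969},
              {"title": "ALGEBRA", "pages": 120, "date": 1965}]},
]

for record in extract_books(persons):
    print(record)
```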

6.5 The Relationship Between the Model and GDDL

The model of an association list defined in the previous section provides a means for explicitly stating how target data items are formed from source data items during conversion. GDDL's ability to describe data conversion has been defined in terms of this model and thus provides similar capabilities. We will now show how the model and GDDL are related.

GDDL's ASSOCIATE statement (see Appendix A, Section 2.3.1.1) is an exact image of the association list six-tuples. Target and source file names appear as part of the target and source names (parameters i) and ii)). The SOURCE (attribute-modifier, criterion) naming scheme appears explicitly as GDDL's SOURCE statement (see Appendix A, Section 2.3.1.3). Thus, we conclude that GDDL can specify any association list that can be defined using the model.

6.6 The Conversion Process

The association list completes the information needed to describe explicitly how data is to be converted from one organization to another. In this section we will see how and where each component of the description for the source and target files together with the association list is used during the conversion process.

In Figure 6-1, we showed that the conversion process consists of essentially three parts. First, the source file is broken down into its component data items using the source description; then the target data items are formed using values obtained from source data items; and lastly the target data items are structured and encoded according to the target description.
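The three parts of the conversion process can be sketched as a small Python pipeline. This is a simplified illustration, not the process of Figure 6-3: records are fixed-width character strings instead of bit strings, and the layouts, attribute names and the dictionary form of the association are all hypothetical.

```python
def decode(text, layout):
    """Part 1: break a source record into data items using its description."""
    items, position = {}, 0
    for attribute, width in layout:
        items[attribute] = text[position:position + width].strip()
        position += width
    return items


def form_target_items(source_items, association):
    """Part 2: form target data items from values of source data items."""
    return {target: source_items[source] for target, source in association.items()}


def encode(items, layout):
    """Part 3: structure and encode target items per the target description."""
    return "".join(items[attribute].ljust(width) for attribute, width in layout)


source_layout = [("name", 6), ("age", 3)]        # hypothetical source description
target_layout = [("years", 4), ("surname", 8)]   # hypothetical target description
association = {"surname": "name", "years": "age"}

source_record = "JONES 45 "
target_record = encode(
    form_target_items(decode(source_record, source_layout), association),
    target_layout,
)
print(repr(target_record))
```

Keeping the three functions separate mirrors the separation of source description, association list and target description.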

Figure 6-3, which is a detailed treatment of the conversion process, essentially reflects these same three stages in the instance of conversion from several source files to several target files. Figure 6-3(a) shows how source descriptions are used to read the source files from the storage media and break the bit string representation down into data items, and how the association list controls the process. Figure 6-3(b) shows how the target data items are formed, and Figure 6-3(c) shows how these data items are organized into a target file and written onto the storage media.

Figure 6-3 is not an algorithm for converting data. It only shows the order in which description components are used for extracting a single data item from a source file, and for converting the value of this data item into part of the target file. In conversion proper, when large numbers of data items must be extracted, much of the processing for each data item will be done in parallel with that for other data items for efficiency considerations.

Let us follow the conversion process using Figure 6-3. We will assume that the process is underway and several records for a particular target file have already been constructed. Some of the data items for the next target record have already been formed and we will now follow the formation of the next data item. The target record structure determines the attribute for this next data item.

We must now begin at the top of Figure 6-3(a). The association list (1) identifies which source file contains the attribute whose value will be combined with the target attribute. The storage structure description (2) for that source file is used to determine which blocks must be read (i.e., which blocks contain records of the file). The storage encoding characteristics (3) are needed to read these blocks off the storage medium and to remove any labels. Once the bit string representation of the file is obtained, the association list (4) identifies which source record is needed. To locate and extract the bit string representation of the record, the criterion used for sequencing the records (5) and the file encoding