University of Pennsylvania
ScholarlyCommons Technical Reports (CIS)
Department of Computer & Information Science
December 1971
An Approach to Data Description and Conversion
Diane P. Smith, University of Pennsylvania

Recommended Citation: Diane P. Smith, "An Approach to Data Description and Conversion", December 1971.

University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-72-20. This paper is posted at ScholarlyCommons. http://repository.upenn.edu/cis_reports/831
Abstract

Currently, the structure of stored data is determined implicitly by the software which accesses and processes it. This data structuring technology has given rise to two outstanding problems in data processing. First, there is the communication of the exact structure of data to users and machines, and secondly, the interchange of the data itself. This work contributes to overcoming these problems by developing a technique for describing the structure of data explicitly and independently of machines and software. This aim is reflected in the following objectives:

1) To understand data structures by developing a model which not only characterizes current data organizational techniques, but also provides a framework within which new data structures can be defined.
2) To use this model to develop a language which can explicitly describe the organization of data.
3) To use this model to study how data can be converted from one structure to another, with a view towards developing a method for describing data conversions.

This model unifies the diverse area of data structures by including the record, file and storage organizations of data. Furthermore, the model clearly separates at each level the conceptual part, which is the logical structure imposed by a user, from the implementation part, which is the method by which the logical structure is encoded as a binary representation. This separation leads to a straightforward mapping of a file onto storage. From an analysis of the state-of-the-art in data organization, it is shown that the model can express not only the data structures of current systems, but also certain useful generalizations which might well be produced by future systems.

The model treats records as hierarchies of data items. These hierarchies are expressed by production systems based on a generalized notion of attribute-value pairs. Files are treated as graphs whose nodes are records. The connections between the nodes are expressed using a powerful production system which generates criteria for determining when any two records are to be linked. The structure of storage is generalized as a hierarchy since this structure is common to all storage media. The mapping of files onto storage is expressed in terms of rules for distributing the records of the file within the slots provided by the storage structure.

The language, called Generalized Data Description Language (GDDL), is a realization of the model, and thus possesses all its capabilities. In particular, the language can describe the implementation of any aspect of a file as being dependent on any other aspect. The language is presented in an appendix in the form of a user's manual.

Data conversion is studied in terms of transforming data in one structure to another, where both structures are expressed in the model. This study shows that to fully specify a conversion the relationship between the components of the two structures must be specified. In certain cases, such as the reorganization of a file, this relationship can be very elaborate. A method is developed for specifying such relationships, and a corresponding capability is built into GDDL. Thus, GDDL has the ability not only to fully describe data structures, but also to specify data conversion.
University of Pennsylvania
THE MOORE SCHOOL OF ELECTRICAL ENGINEERING
TECHNICAL REPORT

AN APPROACH TO DATA DESCRIPTION AND CONVERSION
by
Diane Pirog Smith
Project Supervisor Noah S. Prywes
December 1971

Prepared for the Office of Naval Research, Information Systems, Arlington, Va. 22217, under Contract N00014-67-A-0216-0007, Project No. 049-272

Reproduction in whole or in part is permitted for any purpose of the United States Government.

Moore School Report No. 72-20
DOCUMENT CONTROL DATA - R & D
(Security classification of title, body of abstract and indexing annotation must be entered when the overall report is classified)
1. ORIGINATING ACTIVITY (Corporate author): University of Pennsylvania, The Moore School of Electrical Engineering, Philadelphia, Pa. 19104
2a. REPORT SECURITY CLASSIFICATION: UNCLASSIFIED
3. REPORT TITLE: AN APPROACH TO DATA DESCRIPTION AND CONVERSION
4. DESCRIPTIVE NOTES (Type of report and inclusive dates): Technical Report
5. AUTHOR(S) (First name, middle initial, last name): Diane Pirog Smith
6. REPORT DATE: December 1971
7a. TOTAL NO. OF PAGES: 328
7b. NO. OF REFS: 20
8a. CONTRACT OR GRANT NO.: N00014-67-A-0216-0007
b. PROJECT NO.: NR 049-272
9a. ORIGINATOR'S REPORT NUMBER(S): Moore School Report No. 72-20
10. DISTRIBUTION STATEMENT: Reproduction in whole or in part is permitted for any purpose of the United States Government.
12. SPONSORING MILITARY ACTIVITY: Office of Naval Research, Information Systems
DD FORM 1473
I would like to express my gratitude to my two supervisors: Dr. David K. Hsiao who first introduced me to this area of research and who provided invaluable help and careful criticism, and Dr. Grace Murray Hopper whose conviction of the importance of the topic provided the encouragement I needed and whose vast experience in the area helped me to recognize many of the crucial aspects of the problem.

I would also like to thank Dr. Noah S. Prywes and Dr. James Emery for their support and guidance. The Ford Foundation and the U.S. Army Electronics Command, Avionics Project, supported me at various times during my graduate studies. I am particularly grateful to the Information Systems Branch of the Office of Naval Research for supporting this research under contract N00014-67-A-0216-0007.
TABLE OF CONTENTS

CHAPTER 1 INTRODUCTION
1.1 Background and Objectives
1.2 The Development of the Model, the Design of the Language, and the Study of Conversion
1.3 Organization of the Report

CHAPTER 2 EXISTING DATA STRUCTURES AND DATA DESCRIPTION LANGUAGES
2.1 Introduction
2.2 Data Structures in Machine Languages
2.3 Data Structures in Early Operating Systems
2.4 Data Structures in Assembly Languages
2.5 Data Structures in Early Higher-Level Programming Languages
2.6 Data Structures in Third-Generation Operating Systems
2.7 Data Structures in Current Versions of Higher-Level Programming Languages
2.8 Data Structures in Data Base Management Systems
2.9 The Data Description Language of the CODASYL Data Base Task Group
2.10 Summary

CHAPTER 3
3.1 Introduction
3.2 A Model of Record Structures
3.2.1 The Model of Data Items
3.2.1.1 The Concept of Data Items
3.2.1.2 Encoding Values
3.2.1.3 Encoding Attributes
3.2.2 The Model of Records
3.2.2.1 The Conceptual Record Structure
3.2.2.2 Encoding the Record Structure
3.2.3 The Specification of the Encoding Characteristics
3.3 Interpretation of Common Data Processing Concepts in Terms of the Model of Record Structures
3.4 An Application of the Model of Record Structures
3.5 The Completeness and Generality of the Model
3.6 The Relationship Between the Model and GDDL
3.7 Demonstrations of GDDL's Completeness

CHAPTER 4 FILE DESCRIPTION
4.1 Introduction
4.2 A Model of File Structures
4.2.1 The Conceptual File Structure
4.2.2 Encoding the File Structure
4.3 Applications of the Model of File Structures
4.4 The Completeness and Generality of the Model
4.5 The Relationship Between the Model and GDDL
4.6 Demonstrations of GDDL's Completeness

CHAPTER 5 STORAGE DESCRIPTION
5.1 Introduction
5.2 A Model of Storage Structures
5.2.1 The Conceptual Structure of Storage
5.2.2 Encoding Storage Items and Storage Structure
5.2.3 Record Positioning and Pointer Interpretation Rules
5.3 An Application of the Model of Storage Structures
5.4 The Completeness and Generality of the Model
5.5 The Relationship Between the Model and GDDL
5.6 Medium Dependent Encoding Characteristics
5.7 Demonstrations of GDDL's Completeness

CHAPTER 6 DATA CONVERSION
6.1 Introduction
6.2 The Concept of the Association List
6.3 A Model of the Association List
6.4 Applications of the Model of the Association List
6.5 The Relationship between the Model and GDDL
6.6 The Conversion Process

CHAPTER 7 CONCLUDING REMARKS

APPENDIX A REFERENCE MANUAL FOR GDDL
APPENDIX B EXAMPLES OF GDDL DESCRIPTIONS
APPENDIX C RELATIONSHIP OF GDDL TO COBOL
LIST OF FIGURES

Figure 1-1. The Components of a Data Structure and their Interrelationships
Figure 2-1. IBM 7040 Data Description Statements
  2-1, a. The IBM 7040 $FILE Statement
  2-1, b. The IBM 7040 $LABEL Statement
Figure 2-2. The ANSI COBOL Statement for Describing a Data Item or a Group in a COBOL Record
Figure 2-3. The ANSI COBOL Statement for Describing a COBOL File
Figure 2-4. The ANSI COBOL Statement for Describing the Storage Convention of a COBOL File
Figure 2-5. Enhanced COBOL Description Statements
  2-5, a. The COBOL Statement for Declaring Data Types
  2-5, b. The COBOL Statement for Specifying Repetition
Figure 4-1. Implementation of Access Paths
  4-1, a. By Sequencing
  4-1, b. By Embedding Pointers
  4-1, c. By Using Tables of Pointers
Figure 4-2. Bit String Representation of File Sequentially Encoded
Figure 4-3. Bit String Representation of File Encoded by Embedded Pointers
Figure 4-4. File Linked by Embedded Pointers
Figure 4-5. Bit String Representation of File Encoded by a Pointer Table
Figure 4-6. File Linked by a Pointer Table
Figure 5-1. Formatted Tape
Figure 5-2. SSSI for Disk File
Figure 5-3. Bit String Representation of Tape File X
Figure 6-1. Simplified Conversion Process
Figure 6-2. An Example of Source Record Selection for the Formation of Target Records
Figure 6-3. The Use of Descriptions and the Association List in Data Conversion
  6-3, a. The Extraction of Data Items from Source Files
  6-3, b. The Formation of Target Data Items from Source Data Items
  6-3, c. Creation of Target Files from Target Data Items
Figure 7-1. The Trichotomy of Information Processing
LIST OF TABLES

Table 2-1 Summary of Data Representation Characteristics
Table 3-1 The Relationship Between the Model and GDDL
Table 4-1 Characteristics for each Encoding Method
Table 4-2 The Relationship Between the Model and GDDL
Table 5-1 Characteristics Required for Encoding
Table 5-2 The Relationship Between the Model and GDDL
BIBLIOGRAPHY

Birkhoff, G., Lattice Theory, American Mathematical Society, 1948.

(Ch 1968) Chapin, N., "A Deeper Look at Data," Proceedings 1968 ACM National Conference, 1968, pp. 631-638.

(CO 1971) CODASYL Data Base Task Group, Data Base Task Group Report to the CODASYL Programming Language Committee, April 1971.

(CO 1969) CODASYL Systems Committee Technical Report, Generalized Data Base Management Systems,

(Co 1970) Codd, E.F., "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, Volume 13, Number 6, June 1970, pp. 377-387.

(Ga 1970) Galler, B.A. and Perlis, A.J., A View of Programming Languages, Addison-Wesley, 1970.

(Hs 1970) Hsiao, D. and Harary, F., "A Formal System for Information Retrieval from Files," Communications of the ACM, Vol. 13, No. 2, February 1970, pp. 67-73.

(Hs 1971) Hsiao, D., "A Generalized Record Organization," IEEE Transactions on Computers, December 1971.

(IBM 1965) IBM System/360 Operating System, PL/I Language Specifications, File No. S360-29, Form C28-6571-4, 1965.

(La 1968) Lancaster, F.W., Evaluation of the MEDLARS Demand Search Service, U.S. Department of Health, Education and Welfare, Public Health Service, National Library of Medicine, Bethesda, Maryland, January 1968.

(Ma 1971) Manola, Frank, "An Extended Data Management Facility for a General Purpose Time Sharing System," M.Sc. Thesis, The Moore School of Electrical Engineering, University of Pennsylvania, 1971.

(Ma 1969) Marden, E., "Statement of Need for a Data Descriptive Language," Statement prepared for USA Standards X3 Ad Hoc Committee, 1969.

(Me 1967) Mealy, G., "Another Look at Data," FJCC, 1967, pp. 525-534.

(Ra 1971) Ramirez, J., and Solow, H., "The Design and Implementation of the DDL Processor," The Moore School of Electrical Engineering, University of Pennsylvania, work in progress.

(RCA, 1969) RCA Information Systems, COBOL Reference Manual, 70-00-607, May 1969.

(RCA, 1970) RCA Time Sharing Operating System, Data Management System Reference Manual, DJ-001-2-00, June 1970.

(Sa 1969) Sammet, Jean E., Programming Languages: History and Fundamentals, Prentice-Hall, 1969.

(St 1967) Standish, T.A., "A Data Definition Facility for Programming Languages," Carnegie Institute of Technology, 1967.

(SSDL 1970) Storage Structure Definition Language Task Group, "Storage Structure Definition Language, SSDL," Record of the 1970 ACM SIGFIDET Workshop on Data Description and Access, Rice University, Houston, 1971.

(US 1968) U.S. Navy Programming Languages Group, Fundamentals of COBOL, NAVSO P-3063, 1968.
CHAPTER 1
INTRODUCTION

1.1 Background and Objectives

Computer technology is a field which has experienced a rapid and uneven evolution. This evolution has seen computer users develop techniques and conventions appropriate only to their own needs and data processing environments. This has led to the inability of different user groups to communicate information about, and to exchange, algorithms and data effectively. The problem of user and machine dependent algorithms has received considerable attention, resulting in the development of widely accepted and largely machine independent programming languages such as ALGOL. However, the severity of the problems of user and machine dependent data organization has only been realized comparatively recently*, and as yet little has been done to alleviate this situation.

* "It has been estimated that the lack of an adequate data description language is costing the Department of Defense alone millions of dollars annually because of the inability to exchange data effectively." (Ma 1969, pg. 1)

Traditionally data is organized either by developing special software or by specifying its structure in existing programming languages, operating systems or data management systems. In either case, the exact data organization can only be understood by analyzing and interpreting several complex and interacting programs written in a variety of languages. For example, to understand the data structures produced by a particular COBOL program, it is necessary to analyze and interpret the following programs:

(i) the COBOL program itself,
(ii) the COBOL compiler, and
(iii) the data management system of the machine being used.

This effort is necessary because the factors which determine the organization of data are implicit in the programs and software used to process and structure the data. Consequently, such practices in data organization have hampered not only the communication of data structures but also the interchange of the data itself. When data is to be interchanged, it is necessary to know first whether the existing organization is compatible with the new software which is to use it, and secondly, how the organization can be converted to make it compatible when this is not the case. The implicit nature of data organization can make this an onerous task.

A solution to these problems of communication and data interchange is to make the organization of data explicit and its understanding independent of machines and software systems. This can be achieved by developing a language for explicitly specifying data structures which is separate from the languages used to process that data. To understand a data structure, it is then only necessary to interpret a specification which is expressly intended to communicate data structure information, rather than to interpret a program one of whose side effects is the structuring of data.
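The idea of a specification that is kept separate from the processing program can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the description format, the field names, and the `decode_record` helper are hypothetical and are not GDDL syntax. The point is only that one generic routine can read any record once it is handed an explicit description of that record.

```python
# Hypothetical, explicit description of a record layout (not GDDL syntax).
# Each entry names a data item, its kind, and its length in bytes.
EMPLOYEE_DESCRIPTION = [
    {"name": "id", "kind": "int", "length": 4},      # big-endian unsigned integer
    {"name": "name", "kind": "text", "length": 10},  # ASCII, padded with spaces
    {"name": "dept", "kind": "text", "length": 3},   # ASCII, padded with spaces
]


def decode_record(description, raw):
    """Decode raw bytes using only the explicit description passed in."""
    record = {}
    offset = 0
    for item in description:
        chunk = raw[offset:offset + item["length"]]
        if len(chunk) != item["length"]:
            raise ValueError(f"raw data too short for item {item['name']!r}")
        if item["kind"] == "int":
            record[item["name"]] = int.from_bytes(chunk, "big")
        elif item["kind"] == "text":
            record[item["name"]] = chunk.decode("ascii").rstrip()
        else:
            raise ValueError(f"unknown kind {item['kind']!r}")
        offset += item["length"]
    return record


raw = (42).to_bytes(4, "big") + b"SMITH     " + b"CIS"
print(decode_record(EMPLOYEE_DESCRIPTION, raw))
```

Changing the layout in this sketch means editing the description only; the decoding routine stays untouched, which is the separation the paragraph above argues for.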
Such a data description language (ddl) would have many applications. One important application is to provide a means of communicating data structures among users. For example, using a ddl a creator of a data base can describe precisely to an applications programmer the exact structure of the data that the programmer wants to use. Just as ALGOL is now used to communicate algorithms so can a ddl be used to communicate data structures.

Not only can a ddl be used to communicate with users, but by constructing a ddl interpreter, the ddl can be used to communicate with machines.
Using such an interpreter, a computer could use the information contained in any file when it is provided with a ddl description for that file. Users would then be free to structure their data in whatever manner they deem appropriate, without being constrained by the data structure specification facilities available in operating systems and programming languages. Thus, a ddl could be used in establishing automatically the structure of data bases. A data base creator would provide a ddl description and his data to the interpreter which would structure the data according to the description.

Furthermore, we could apply a ddl to the problem of mechanizing the conversion of data from a current structure to a new structure. It would only be necessary to input to a converter the data, a ddl description of its current structure, a ddl description of its new structure and a ddl description of the relationship between elements in one structure and the other. By interpreting these descriptions the converter could output the data in its new structure. Thus, the user is released from writing special conversion programs. In this way files could be interfaced across programming language, operating system, data management system and hardware barriers.

A further application is in the design and operation of data and data base management systems.
For example, a ddl can be used t o create
new data structures which can then be t e s t e d f o r e f f e c t i v e storage u t i l i z a t i o n and other efficiency considerations. A t t h i s point we should make clear what we mean by t h e term "data
structure".
We use t h e term t o r e f e r t o the structure of data a s it i s
t o appear on a storage medium, including both t h e conceptual organizat i o n imposed by the user and the implementation of t h i s conceptual organization. uages
Some research groups, p a r t i c u l a r l y those i n programming lang-
(st
1967, ~a 1970), often use data structure t o r e f e r t o not only
the structure of data (as we use t h e term) but a l s o t h e access method by which t h i s data i s used.
To these groups a pushdown, f o r example,
i s a data structure, whereas we would say t h a t a pushdown i s a data
s t r u c t u r e together with an access method which controls storage and r e t r i e v a l on a l a s t i n
- f i r s t out basis.
An access method i s a pro-
gram which i s designed t o store and r e t r i e v e data from a data structure. It follows from our diccussion above t h a t we need t o separate out data
structures from the programs which uce them, so we can describe the data structures independently and e x p l i c i t l y .
Furthermore, any appro-
p r i a t e access method can be designed once the data structure has been specified.
With this background in mind, we state three objectives for this dissertation:

1) To understand data structures by developing a model which not only characterizes current data organizational techniques, but also provides a framework within which new data structures can be defined.
2) To use this model to develop a language which can explicitly describe the organization of data.
3) To use this model to study how data can be converted from one structure to another, with a view towards developing a method for describing such conversions.

It is anticipated that data description languages will contribute as much as programming languages towards the evolution of information processing. Just as the current state of programming languages is the accumulation of many efforts, it is expected that much research and development will be needed to fully understand the power and applicability of data description languages. The development of the ddl in this dissertation is perhaps analogous to the development of the first programming language. Different programming languages usually have different models of algorithms on which they are based. For example, ALGOL is based on recursive procedures with arithmetic operations, whereas LISP is based on the lambda-calculus and string manipulations. Similarly, we provide our own model of data organization on which our data description language is based.
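The third objective, and the converter discussed earlier in this section, can be sketched in Python under stated assumptions. The tuple-based description format, the `ASSOCIATION` mapping, and the `decode`, `encode`, and `convert` helpers are all hypothetical; they are not the dissertation's model or GDDL. The sketch only shows the four inputs named in the text: the data, a description of the current structure, a description of the new structure, and a description of the relationship between their elements.

```python
# Hypothetical descriptions: (item name, kind, length in bytes).
SOURCE = [("id", "int", 4), ("name", "text", 10)]
TARGET = [("surname", "text", 8), ("number", "int", 2)]

# Hypothetical relationship: target item -> source item it is formed from.
ASSOCIATION = {"surname": "name", "number": "id"}


def decode(description, raw):
    """Read raw bytes into named values according to a description."""
    record, offset = {}, 0
    for name, kind, length in description:
        chunk = raw[offset:offset + length]
        if len(chunk) != length:
            raise ValueError(f"raw data too short for item {name!r}")
        if kind == "int":
            record[name] = int.from_bytes(chunk, "big")
        else:
            record[name] = chunk.decode("ascii").rstrip()
        offset += length
    return record


def encode(description, record):
    """Write named values out as bytes according to a description."""
    out = bytearray()
    for name, kind, length in description:
        value = record[name]
        if kind == "int":
            out += value.to_bytes(length, "big")
        else:
            data = value.encode("ascii")
            if len(data) > length:
                raise ValueError(f"value for item {name!r} does not fit")
            out += data.ljust(length)  # pad with spaces
    return bytes(out)


def convert(raw, source, target, association):
    """Interpret the three descriptions to restructure the data."""
    source_record = decode(source, raw)
    target_record = {t: source_record[s] for t, s in association.items()}
    return encode(target, target_record)


raw = (42).to_bytes(4, "big") + b"SMITH     "
print(convert(raw, SOURCE, TARGET, ASSOCIATION))
```

No conversion program specific to these two layouts is written here; a different pair of layouts would need only different descriptions and a different association.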
There a r e other studies i n progress which r e l a t e t o t h e design of a ddl, s p e c i f i c a l l y the studies being made by the COREXI, Storage Struct u r e Description Language Task Group (SSDL 1970)
.
However, t h i s group
so f a r has mainly addressed i t s e l f t o techniques f o r mapping records onto storage, which i s j u s t a subset of the problem we have tackled here. The language given here i s the f i r s t one t o be completely developed and specified.
I n addition, we a r e the f i r s t t o study and propose a general
solution f o r the problem of using data descriptions f o r converting data from one structure t o another. 1.2
The Development of the Model, the Design of t h e Language, and the Study of Conversion We w i l l now discuss the development of the model and i t s use i n
the design of the ddl (called GDDL f o r Generalized Data Description ~anguage)which i s presented i n t h i s report. The development of data description from i t s f i r s t primitive forms i n machine languages t o i t s current forms i n data management systems has been based on ad hoc changes triggered by user needs and new technology. This has l e d t o a (ride variety of methods f o r describing data, without any general concept o r comprehensive model.
For example, COBOL (US 1968) is based on highly developed record concepts, whereas L6 (~a 1969) is based on certain aspects of list structures, and in operating system design, systems programmers have built up a body of expertise on storage structures and file implementation techniques. However, the common concepts underlying these and other aspects of data structures have not been extracted and formulated into a comprehensive model.
Therefore, a thorough study of the data description elements in software systems and programming languages was undertaken, with a view towards extracting those common elements to include in a comprehensive model of data structures.

This model of data structures is divided into three largely independent levels, namely, the record, file and storage levels, and each level is further subdivided into a conceptual part and implementation part. The conceptual part is the logical structure which is imposed on the data. The implementation part is the way in which this structure is to be represented or encoded.
The components of this subdivision of data structures are illustrated in Figure 1-1.

[Figure 1-1. The Components of a Data Structure and their Interrelationships. Columns: CONCEPTUAL PART, IMPLEMENTATION PART, RESULTING BIT STRING REPRESENTATION (B.S.R.). Rows: data item, record, file and storage, each with a logical structure, an encoding of that structure, and a resulting B.S.R. A MAPPING of the B.S.R. of the file onto the B.S.R. of storage yields the file in storage format.]

These subdivisions provide a valuable vantage point for understanding data structures. Let us look first at the implications of the division into conceptual and implementation parts. The nature of the conceptual part is quite distinct from the implementation part, even though most systems do not make this distinction. The conceptual part is the machine-independent structure which is imposed on the data by the user. He conceives of the data as being organized in this fashion, and this is the form in which his programs expect to find the data. The implementation part, which is machine-dependent, is the way in which the logical structure is encoded as a bit string representation which can be stored on a storage medium. In our model we will see that specifications which relate to the conceptual part have the nature of production systems, whereas, specifications which relate to the implementation part have the nature of certain characteristics of character strings like length or character code. In addition, this subdivision yields a valuable insight which has not been noted in other work.
This insight is based on the observation
that if a person intends to organize certain entities into a structure, he may want that organization to depend on any property of those entities which is available to him. In particular, if a person wants to organize records into a file, he may specify this organization in terms of any available properties of those records. These properties can include the values of data items in records, the logical structure of the records and the implementation of the record structure. Thus we can see that to describe file organization we have to provide more than the capability of just specifying abstract graphical structures.

Now we look at the implications of dividing the model into record, file and storage levels.
The concept of a record is common to all data storage and retrieval systems, yet it is usually overlooked in theoretical studies of data structures. The structure of records is an important consideration in that it is the basic organization of data items which is treated as an entity for storage and retrieval. Thus far a hierarchic organization for records has proven adequate, as it provides a structure which is relatively easy to encode and decode without the need for extended scanning operations. In this work, therefore, we only allow hierarchic structures at the record level. In our model this hierarchic organization is generalized in that it allows for levels of the hierarchy to occur optionally or to repeat a number of times. This conceptual structure of records has not been modelled explicitly before, although it is essentially the logical organization of records which is implicit in COBOL. COBOL, however, is quite restrictive on the ways in which the implementation of records may be specified. In this work we allow each implementation characteristic to be specified either directly or dependent on other characteristics.

Records are the elements which are organized into files.
There is great flexibility in distributing the overall organization of a set of data items between the record and file levels. On one hand, we can specify a record to consist of a single data item, and, in effect, specify the overall organization of the data at the file level. In fact we can specify hierarchies at the file level and thus all the conceptual structure for records can in principle be moved to the file level. However, while the conceptual structure of the data might remain the same, the use of the data for storage and retrieval has been changed. On the other hand, we can specify a record to be a complex hierarchic structure and possibly make the file structure simple. The distribution of structure between the file and record levels depends on the intended use of the data. Therefore, by distinguishing record structure from file structure we are able to include these aspects of data structures in our model.

Our concept of a file structure is more general than others because, as previously mentioned, we allow the specification of graphical structures which depend on data and record properties. This requires a more elaborate specification method than the usual methods based on pure graph-theory.

The specification of the structure and encoding of records, and the specification of how these records are structured and implemented as a file determine a bit string representation of the file.
This is the bit string which is actually mapped onto a storage structure.

Our division of storage structure into conceptual and implementation parts is the key to both simplifying the mapping of the bit string representation of a file onto a storage structure, and also simplifying the specification of storage structures by extracting the structure common to storage media independent of physical considerations. The conceptual structure of storage is based on generalized hierarchies which are common to all storage media. The implementation of these hierarchies is based on encoding characteristics which are also independent of the storage media. To bind a storage structure to a particular medium, we have only to relate the levels of the hierarchy to the actual physical levels of a storage medium.

It is over such a storage structure that the bit string representation of a file is distributed. A result of our subdivision of data structures has been to make the actual mapping of data onto a storage medium comparatively straightforward. It is only necessary to deconcatenate the bit string representation of the file at appropriate points, and insert these component strings without disturbing their order into the slots already provided by the storage structure.

These are the insights and advantages which are obtained by subdividing our model in the above way.
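
The mapping step described above (deconcatenate the bit string representation of the file, then place the pieces in order into the slots of the storage structure) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `map_onto_storage`, the representation of the bit string as a Python `str` of '0' and '1' characters, and the slot sizes are all hypothetical and are not part of GDDL.

```python
from typing import List


def map_onto_storage(bit_string: str, slot_sizes: List[int]) -> List[str]:
    """Cut bit_string at the slot boundaries and fill the slots in order.

    slot_sizes is a hypothetical stand-in for the slots provided by a
    storage structure; the last slot used may be only partly filled.
    """
    slots = []
    position = 0
    for size in slot_sizes:
        if position >= len(bit_string):
            break  # the whole file has been placed
        slots.append(bit_string[position:position + size])
        position += size
    if position < len(bit_string):
        raise ValueError("storage structure has too few slots for the file")
    return slots


# Hypothetical 16-bit file representation mapped onto three 6-bit slots.
file_bsr = "1011001110001111"
blocks = map_onto_storage(file_bsr, [6, 6, 6])
```

Because the component strings keep their order, concatenating the slots again recovers the original bit string, which is the property the text relies on.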
From the study of data description elements in software systems and programming languages we can ensure that we at least included the data description capabilities of every current system that was considered. As each of the classes of software in the study includes the most sophisticated representative of that class, it is likely that we have in fact included the capabilities of all current systems. From this model the requirements for a data description language are immediately apparent. This allows GDDL itself to be very closely related to the model.
When the data description capability of the language had been designed, the problem of using descriptions to convert data from one structure to another was studied. Using ddl's for data conversion is one application that has been widely suggested, but never actually investigated. With our model of data structure, we could study the conversion process itself. In this study it will be shown that additional information is required to completely describe a conversion. This additional information specifies a relationship, which can be quite elaborate, between names in one description and names in the other. To model this relationship the concept of an association list was developed. GDDL capabilities for describing data conversion relationships are incorporated directly from the association list concept.

1.3 Organization of the Report

The GDDL language itself is presented in Appendix A in the form of a self-contained reference manual. The body of this report therefore is concerned with presenting the model and its relationship to the language. It also shows that GDDL can describe any data organization that can be obtained with current systems. Further, because the model allows generalizations of current data description capabilities, GDDL can describe data organizations that are beyond these present capabilities but might well be incorporated into future systems. The generality of GDDL relative to current systems is discussed in terms of the model.

Chapter 2 presents the study of the development of data description in programming languages and software systems.
The table at the end of this study (Table 2-1) provides the basis for showing that the model and thence GDDL include all current data structure capabilities. This study is quite long and the details are not essential for understanding the remaining chapters. The reader is therefore advised to skip to Chapter 3 should the detail become too oppressive.

Chapters 3, 4 and 5 develop the record, file and storage levels of the model respectively. Each chapter shows the relationship between the model and the GDDL language at that level. The material in these chapters provides an excellent way of visualizing the structure of GDDL and its description capabilities.

Chapter 6 discusses the ways of using data descriptions to convert data from one structure to another. The concept of an association list is introduced and it is shown how an association list can be used to complete the specification of data conversion.

Chapter 7 summarizes the contributions of this report and suggests directions for future research.

Appendix B contains examples of GDDL descriptions of some real-world files and of data conversion from one structure to another. These examples are chosen to further demonstrate the ability of GDDL to describe current data organizations.

Appendix C contains a proof that GDDL can indeed describe all the COBOL record features. COBOL is the prototype for the most advanced record level data representations. It is shown that each COBOL record description clause can be expressed in GDDL.

CHAPTER 2

EXISTING DATA STRUCTURES AND DATA DESCRIPTION LANGUAGES

2.1 Introduction

The object of this chapter is to provide an analysis of data structures in contemporary computer software with a view towards obtaining a comprehensive summary of data structure characteristics. This summary provides the basis for demonstrating in later chapters that the GDDL is complete.

The software systems covered by this analysis are:
   (i) machine languages,
   (ii) early operating systems,
   (iii) assembly languages,
   (iv) early higher-level programming languages,
   (v) current operating systems,
   (vi) current higher-level programming languages,
   (vii) data base management systems, and
   (viii) the CODASYL Data Description Language.

The characteristics of each of these systems are analyzed in a separate section of this chapter. The final section combines the results of these analyses into a table.

2.2 Data Structures in Machine Languages

In machine languages, there are four ways that data structure characteristics are specified:
1) hardware specifications for conventions such as the code for representing characters, the base for representing numbers, and the length of the smallest addressable unit of storage. These conventions are fixed for a given computer but may vary from machine to machine. To use a particular machine, a system programmer has to know these conventions. Thus, descriptions in the form of specifications in manuals are usually provided.

2) machine language instructions that specify the data type (e.g., character or number), the scale of numbers (e.g., fixed point or floating point), and the precision of numbers (e.g., single or double). These descriptive elements are implicit in data manipulation instructions rather than explicit as declarations. They are illustrated by the following examples.
   a) To specify that a character string is to be placed in the accumulator of the computer, the machine language instruction CAL (Clear and Add Logical word) would be used instead of the instruction CLA for placing a number in the accumulator.
   b) To specify that a floating point number is to be added to the accumulator, the instruction FAD (Floating Add) would be used instead of the fixed point instruction ADD.
   c) To specify double precision for addition, the instruction DFAD (Double Precision Floating Add) would be used instead of the single precision instruction ADD.

3) machine language instructions that specify locations of data items. These descriptive elements are also implicit in data manipulation instructions rather than explicit as declarations. For example, the STO (Store) instruction both declares that a particular location is to be used for storage and specifies that a data item is to be stored in that location.

4) machine language instructions that specify which devices are to be used for input and output, and how data would be organized on the device medium. These descriptive elements are also implicit in data manipulation instructions rather than explicit as declarations. They are illustrated by the following examples.
   a) To specify that a particular I/O device is to be used for output, the machine language instruction WRS (Write Select) is used to prepare the appropriate channel.
   b) To specify that a particular block of data items is to be copied onto an output medium, the instruction RCH (Reset and Load Channel) is used to send to the channel a channel command word which gives the size of the block of data to be copied and its location.
   c) To specify that the last block of data has been reached on a magnetic tape, the instruction WEF (Write End-of-File) is used to write an end-of-file gap followed by a tape mark on the tape.
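
The point that data characteristics are implicit in the choice of instruction, rather than declared, can be made concrete with a small lookup table. The sketch below is illustrative only: it records just the characteristics the examples above attribute to each operation code, leaves anything the text does not state as `None`, and uses hypothetical Python names (`Implied`, `describe_operand`) that belong to no real assembler or machine.

```python
from typing import NamedTuple, Optional


class Implied(NamedTuple):
    """Characteristics of the operand implied by an operation code."""
    data_type: str
    scale: Optional[str]
    precision: Optional[str]


# Only what the examples in the text state; None means "not stated there".
IMPLIED_BY_OPCODE = {
    "CAL": Implied("character string", None, None),
    "CLA": Implied("number", None, None),
    "ADD": Implied("number", "fixed point", "single"),
    "FAD": Implied("number", "floating point", None),
    "DFAD": Implied("number", "floating point", "double"),
}


def describe_operand(opcode: str) -> Implied:
    """Return the data description that choosing this opcode implies."""
    return IMPLIED_BY_OPCODE[opcode.upper()]


dfad = describe_operand("DFAD")
cal = describe_operand("cal")
```

No declaration of the operand appears anywhere; the description exists only as a consequence of which instruction the programmer chose, which is the situation an explicit data description language is meant to replace.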
The characteristics of data structures* provided by machine languages can be grouped into two categories. One includes the characteristics of individual data items, and the other the characteristics of storage media.

* At the end of each section of this chapter a list of the characteristics of the system under discussion will be presented. Whenever a new characteristic (not appearing in previous sections) is introduced, it will be underlined.

1. The characteristics of individual data items consist of:
   (i) the hardware provided character code,
   (ii) length,
   (iii) data type:
      a) character string,
      b) numbers:
         1) binary base,
         2) sign - radix or diminished radix complement (depending on the hardware),
         3) fixed or floating-point scale.
2. The characteristics of storage media consist of:
   (i) block size,
   (ii) end-of-file labels, and
   (iii) device assignment.

We note that machine instructions are seldom used or made available to describe explicitly the structuring of sets of data items. Such structures are created and maintained by machine language programs.

2.3 Data Structures in Early Operating Systems

With the development of Operating Systems (OS's), more complex data structures on storage devices were provided directly to the programmer. They are described by statements of the OS job control language (JCL). Previously, these file and storage structures had to be implemented as part of user-written machine language programs. Examples of such statements are the $FILE and $LABEL statements provided by the IBM 7040 JCL. These are illustrated in Figure 2-1.
The $FILE statement is used to describe the characteristics of the file structure and the positioning of the records on magnetic tape, the structure of the tape's physical blocks and the tape unit.
1. The file structure and implementation characteristics consist of:
   (i) ordering the records in their input sequence, and
   (ii) implementing this structure by sequential storage.
2. The record positioning characteristic is the record to tape block ratio; that is, the number of records per tape block.
3. The storage structure and implementation characteristics are:
   (i) tape naming,
   (ii) tape block size,
   (iii) labels:
      a) header and trailer labels for tape reels and files,
      b) count fields for tape blocks,
   (iv) fixed ordering of tape blocks and labels on the tape,
   (v) fixed occurrence of all blocks and labels specified,
   (vi) repetition of reels - given as number of reels.
4. The device characteristic is read/write density.

The remaining parameters of the statement are used to describe buffers and actual processing.

The $LABEL statement is used to describe the information in a label. Labels are used to implement storage structures.
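
The record positioning characteristic (the record to tape block ratio) and the count field of a tape block can be illustrated with a short Python sketch. It assumes records arrive as a Python list in input sequence and represents a block as a dictionary; the function name `block_records` and the keys `count` and `records` are hypothetical and do not correspond to IBM 7040 JCL parameters.

```python
from typing import Dict, List


def block_records(records: List[str], records_per_block: int) -> List[Dict]:
    """Group records, in input sequence, into tape blocks.

    Each block carries a count field giving the number of records it holds;
    the last block may hold fewer than records_per_block records.
    """
    if records_per_block < 1:
        raise ValueError("records_per_block must be at least 1")
    blocks = []
    for start in range(0, len(records), records_per_block):
        chunk = records[start:start + records_per_block]
        blocks.append({"count": len(chunk), "records": chunk})
    return blocks


# Five hypothetical records stored two per block.
tape_blocks = block_records(["R1", "R2", "R3", "R4", "R5"], 2)
block_counts = [block["count"] for block in tape_blocks]
```

The records keep their input order across blocks, matching the sequential file structure that the $FILE statement describes.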

[Figure 2-1. IBM 7040 Data Description Statements: a) the IBM 7040 $FILE statement; b) the IBM 7040 $LABEL statement.]

2.4 Data Structures in Assembly Languages

Assembly languages were primarily designed to enhance data handling and to a lesser degree, to provide mnemonic machine instructions. The data-oriented pseudo-instructions provided by assembly languages significantly increase the variety of data structures made directly available to the user. Thus, many complex data structures that had previously been created and maintained by user programs, can now be declared explicitly.

In Assembly Languages, elements and statements which deal with data structures are typified as follows:

1) Symbolic names assigned to data items. These names may be used to access the data items directly without referring to the address of the data items. For example, in the IBM 7040 Macro-Assembly Language MAP, the statement
      DINT   DEC   13
results in the name DINT being assigned to the location in which a decimal number 13 is stored.

2) Pseudo-instructions that declare data types. For example, in IBM 7040 MAP, data items may be declared to be octal, OCT; decimal, DEC; binary coded information, BCI; and variable field data, VFD. This is illustrated by the following examples:
   a) To specify that a data item named DINT is to be interpreted as the decimal integer 13, the following MAP statement is used:
      DINT   DEC   13
   b) To specify that a data item named ENTRY is to contain the character C in the first 6 bits of the data item, the following MAP statement is used:
      ENTRY  VFD   H6/C

3) Pseudo-instructions that describe the structure of data items. For example, in IBM 7040 MAP, to specify that a block of 6 consecutive storage locations are to be reserved for storing data items, the following statement is used:
      BSS   6

4) Pseudo-instructions that describe input/output characteristics of particular media. For example, in IBM 7040 MAP such statements are of the form:
      name   FILE    option, ..., option
             LABEL   option, ..., option
where the options for the FILE statement and LABEL statement are the same as the options for the IBM 7040 Job Control Language $FILE and $LABEL described in the previous section.

Thus, the following characteristics of individual data items, sets of data items and storage media are made accessible to programmers in Assembly Language.
1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) length,
   (iv) data type:
      a) character string,
      b) numbers:
         1) binary, decimal or octal base,
         2) character sign for decimal numbers and radix and diminished radix complement for binary numbers,
         3) fixed or floating point scale,
   (v) data items identified by position.
2. The characteristics of sets of data items consist of:
   (i) fixed order,
   (ii) fixed occurrence, and
   (iii) sets of data items identified by their position.
3. Assembly languages depend on their underlying operating system for storage structure.
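
The variable field pseudo-instruction places values of stated bit widths side by side in a storage word. The Python sketch below shows that packing, under assumptions that should be read as such: a 36-bit word is assumed for the machine, the 6-bit code used for the character C (`0o23`) is a hypothetical placeholder standing in for "the hardware provided character code", and `pack_fields` is an invented helper, not part of MAP.

```python
from typing import List, Tuple

WORD_BITS = 36  # assumed word length for this illustration


def pack_fields(fields: List[Tuple[int, int]], word_bits: int = WORD_BITS) -> int:
    """Pack (width, value) fields left to right into one word.

    Bits to the right of the last field are left as zero.
    """
    word = 0
    used = 0
    for width, value in fields:
        if value < 0 or value >= (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
        used += width
        if used > word_bits:
            raise ValueError("fields exceed the word length")
        word = (word << width) | value
    return word << (word_bits - used)


HYPOTHETICAL_CODE_FOR_C = 0o23  # placeholder 6-bit character code

# Analogue of placing the character C in the first 6 bits of a data item.
entry_word = pack_fields([(6, HYPOTHETICAL_CODE_FOR_C)])
entry_octal = format(entry_word, "012o")
```

The same helper accepts several fields, for example `pack_fields([(6, 1), (12, 5), (18, 0)])`, which mirrors how a variable field statement lays out more than one subfield in a single word.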
2.5 Data Structures in Early Higher-Level Programming Languages

In developing higher-level languages such as FORTRAN and COBOL, appropriate data structures were provided. For example, FORTRAN, which was designed for scientific computing, provides array accessing for handling homogeneous data (i.e., data of the same type). The data description statements of ANSI FORTRAN have four forms:

1) Declaration statements that describe the structure of individual data items. In FORTRAN, characteristics such as scale and precision are treated as additional data types. For example, in FORTRAN IV, the following "type" declarations are provided:
      INTEGER
      DOUBLE PRECISION
      REAL
      LOGICAL
      COMPLEX
      EXTERNAL
where LOGICAL data items are the values T (or TRUE) and F (or FALSE), and EXTERNAL data items are data items which are defined externally to the FORTRAN program. To specify the type of a data item, the name of the item is listed after the type in a declaration statement, e.g.,
      INTEGER CVAL, A, B

2) The declaration statement which describes the structure of sets of data items (groups). In ANSI FORTRAN, individual data items can be grouped together in hierarchic structures which are interpreted by the processor as arrays. For example, the tree illustrated below can be interpreted as a 2 x 3 array:
That is, the pairs of data items <a11,a12>, <a21,a22> and <a31,a32> are interpreted as rows. Arrays are limited to a maximum of three dimensions. The DIMENSION statement is used to describe such groupings. The statement has the following format:
      DIMENSION array name (n1,n2), ..., array name (n1,n2,n3)
where:
   array name is the name used to refer to the array, and
   n1, n2, n3 are the number of elements in each of the dimensions of the array allowed in ANSI FORTRAN.
For example, the statement:
      DIMENSION A(2,3)
describes a 2 x 3 array called A. Data items in the vectors are accessed by array indexing.

3) The FORMAT statement which describes input and output data structures. The statement is used to describe data type and length for each data item in a record to be input or output. For example, in ANSI FORTRAN, the statement has the following format:
      FORMAT (data item specification, ..., data item specification)
where a data item specification consists of two parts: a data type and a data length part. These types are:
      F   real with no exponent
      E   real with exponent
      D   real with double precision exponent
      I   integer
      L   logical (character string T or F)
      A   character string
      H   hollerith (character string used for output only)
Length is given as number of characters per data item. For example, A6 describes a data item which is a string of 6 characters. For real data items, in addition to length, the number of digits to the right of the decimal point is specified. For example, F8.2 describes a data item which is a real number with a maximum length of 8 characters and which has 2 digits following the decimal point.

4) Input/Output statements that describe the order of the data items to be input or output, and the device to be used. The statements have the following format:
      (device number, format statement number) data name, ..., data name
where:
   device number refers to a specific device,
   format statement number refers to the format statement describing the data items being input or output,
   data name refers to the data item or group (array values) being input or output.

Thus, the following characteristics of data structures are made accessible to programmers by the data description statements of FORTRAN:
1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) fixed lengths as specified by the user,
   (iv) data type:
      a) character string,
      b) number:
         1) binary or decimal base,
         2) radix or diminished radix complement depending on hardware for binary numbers, character sign or no sign for decimal numbers,
         3) fixed or floating point scale,
   (v) data items identified by their position.
2. The characteristics of records consist of:
   (i) array accessing (balanced trees),
   (ii) fixed ordering,
   (iii) fixed occurrences,
   (iv) groups of data items identified by their position.
3. FORTRAN depends on its underlying Operating System for its storage structure.
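
The two-part data item specification of the FORMAT statement (a type letter followed by a length, with a count of decimal digits for real items) can be read mechanically. The following Python sketch does so for the simple forms quoted above, such as A6 and F8.2. It is an illustration and not a FORTRAN implementation: repeat counts, scale factors and the H (hollerith) form are deliberately left out, and the names `ItemSpec` and `parse_item_spec` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ItemSpec(NamedTuple):
    data_type: str
    length: int              # number of characters per data item
    decimals: Optional[int]  # digits right of the decimal point, if given


TYPE_NAMES = {
    "F": "real with no exponent",
    "E": "real with exponent",
    "D": "real with double precision exponent",
    "I": "integer",
    "L": "logical",
    "A": "character string",
}

_SPEC = re.compile(r"^([FEDILA])(\d+)(?:\.(\d+))?$")


def parse_item_spec(spec: str) -> ItemSpec:
    """Split a specification such as 'F8.2' into type, length and decimals."""
    match = _SPEC.match(spec.strip().upper())
    if match is None:
        raise ValueError(f"unsupported data item specification: {spec!r}")
    letter, length, decimals = match.groups()
    return ItemSpec(
        TYPE_NAMES[letter],
        int(length),
        int(decimals) if decimals is not None else None,
    )


real_item = parse_item_spec("F8.2")
text_item = parse_item_spec("A6")
```

A whole FORMAT list can be handled by applying `parse_item_spec` to each comma-separated entry, which yields exactly the per-item type and length description that the characteristics list above attributes to FORTRAN.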

Because the COBOL language was designed for handling large quantities of data, more importance was given to the data description statements of the language than in FORTRAN. These statements are written in separate sections of a COBOL program. The Data Division is the section for describing the data items, records, files, working storage and program constants. Another section, called the Environment Division, is for describing the storage media. In it, information concerning file selection is given, and the equipment configuration (tape station, printer, etc.) is described.

1) In COBOL's Data Division there is one statement for describing the organization of data items in records and one statement for describing the organization of records into files:
   a) Each data item or group of data items that is to appear in a record is described by a statement of the form illustrated in Figure 2-2. This statement is used to describe:
      i) the level at which the data item or group of data items is to occur in the hierarchic record,
      ii) the data type (e.g., character string = DISPLAY, numeric string = COMP),
      iii) the length of the data item,
      iv) the number of times the data item or group of data items is to occur in each record,
      v) the alignment of the data item in respect to word boundaries and to fixed length strings of character positions.

[Figure 2-2. The ANSI COBOL Statement for Describing a Data Item or a Group in a COBOL Record (US 1968). Clauses shown: level-number, data-name-1, REDEFINES data-name-2, PICTURE character-string, COMPUTATIONAL, DISPLAY, SYNCHRONIZED, JUSTIFIED RIGHT, VALUE IS literal, BLANK WHEN ZERO.]

   b) The organization of COBOL records in a COBOL file is described by a statement of the form illustrated in Figure 2-3. This statement is used to describe:
      i) the size of storage blocks,
      ii) the size of the records stored in the blocks,
      iii) any labels to appear on the storage tape,
      iv) the names of records appearing in the file.

[Figure 2-3. The ANSI COBOL Statement for Describing a COBOL File (US 1968). Clauses shown: file-name, BLOCK CONTAINS [integer-1 TO] integer-2 RECORDS or CHARACTERS, RECORD CONTAINS integer-3 ... integer-4 CHARACTERS, ... RECORDS ARE STANDARD, data-name-1 [, data-name-2].]

2) In COBOL's Environment Division, there is one section that is used to describe input and output conventions. In it, equipment assignments and certain physical characteristics of each file to be used by the program are described by a statement of the form illustrated in Figure 2-4. This statement is used to describe the device on which the file is stored.

[Figure 2-4. The ANSI COBOL Statement for Describing the Storage Convention of a COBOL File (US 1968). Clauses shown: FILE-CONTROL, SELECT [OPTIONAL] file-name, ASSIGN TO [integer-1] implementor-name-1 [, implementor-name-2], FOR MULTIPLE ..., integer-2 ALTERNATE AREA or AREAS.]

Thus, the data structures that are made accessible to programmers by COBOL can be characterized in the following way.

1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) fixed lengths as specified by the user,
   (iv) data types:
        a) character string,
        b) number:
           1) binary or decimal base,
           2) sign: radix or diminished radix complement (depending on the hardware) for binary numbers, and character sign or no sign for decimal numbers,
           3) fixed or floating point scale,
   (v) value alignment (justification) with blank or zero padding,
   (vi) value string alignment (synchronization) with respect to computer words with blank or zero padding,
   (vii) data items identified by their position.

2. The characteristics of records consist of:
   (i) hierarchic structure,
   (ii) fixed order,
   (iii) fixed occurrences,
   (iv) fixed repetition ordered as input,
   (v) groups of data items identified by their position.

3. COBOL depends on its underlying Operating System for its storage structures.
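
As an illustration only, the record characteristics listed above (hierarchic structure, fixed order, fixed lengths, fixed repetition) can be sketched in Python. The class `Item`, the functions `build_tree` and `total_length`, and the EMPLOYEE record are hypothetical and come neither from COBOL nor from this report. The sketch assumes that lengths are counted in characters and that every repetition count is fixed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One entry of a COBOL-like record description (hypothetical model)."""
    level: int          # level number: position in the hierarchy
    name: str           # symbolic name
    length: int = 0     # fixed length in characters (elementary items only)
    occurs: int = 1     # fixed number of repetitions
    children: List["Item"] = field(default_factory=list)


def build_tree(entries):
    """Turn a flat list of level-numbered entries into a hierarchy."""
    root, *rest = entries
    stack = [root]
    for item in rest:
        # The parent is the nearest earlier entry with a smaller level number.
        while stack and stack[-1].level >= item.level:
            stack.pop()
        if not stack:
            raise ValueError(f"{item.name}: no enclosing group")
        stack[-1].children.append(item)
        stack.append(item)
    return root


def total_length(item):
    """Characters occupied by an item, including all of its repetitions."""
    if item.children:
        inner = sum(total_length(child) for child in item.children)
    else:
        inner = item.length
    return inner * item.occurs


entries = [
    Item(1, "EMPLOYEE"),
    Item(2, "NAME", length=20),
    Item(2, "DEPENDENT", occurs=3),
    Item(3, "DEP-NAME", length=20),
    Item(3, "DEP-AGE", length=2),
]
record = build_tree(entries)
print(total_length(record))  # 20 + 3 * (20 + 2) = 86
```

Because order, occurrences, and lengths are all fixed, the position of every data item follows from the description alone, which is why items and groups can be "identified by their position".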
2.6 Data Structures in Third-Generation Operating Systems

In their current stage of development, Operating Systems (OS's) are providing more file and storage structure options than early OS's. The creation and maintenance of these structures are treated as a set of services separate from those involved in scheduling programs. The part of an OS which supports these services is referred to as the data management system (DMS) of the operating system. Among these services are the moving of data between storage devices and main memory, and the accessing of data in DMS maintained structures. Additional JCL statements, known as DMS statements, are provided to evoke DMS services.

In general, DMS's provide their users with a number of file and storage structures. To store data in such structures, the user proceeds as follows:
   (i) he names the particular structure in a DMS statement,
   (ii) he lists the parameters which select those options provided by the DMS (if any), and
   (iii) he enters his data.
The data management service so evoked moves the data from the input device to the appropriate storage devices and stores it in the described structures.

For example, the DMS II of the RCA SPECTRA 70/46 TSOS (RCA 1971) provides its users with five structures and related input/output conventions. Collectively, these structures and conventions are called access methods. They are:

1) PAM (Primitive Access Method). This method provides only a particular record format (fixed in length) and storage on either direct access devices or on single reel, standard blocked tape. PAM creates and accesses files only in random order. The user must himself handle the blocking and deblocking of records.

2) SAM (Sequential Access Method). This method provides either fixed length, variable length or undefined record formats (where records with undefined formats are stored one to a block). SAM creates and accesses files in sequential order only. It performs all blocking, deblocking and buffering for the user.

3) ISAM (Index Sequential Access Method). It provides either fixed or variable length record formats and storage on direct-access devices only. Records are maintained by means of a directory whose entries point to the records to reflect the correct sequence. In other words, records may not be in sequential order physically. The field whose values determine the sequence is called the key. Thus, ISAM can access files in a sequential or non-sequential order. In terms of storage structure, an ISAM file is made up of data blocks (2048 bytes) and directory blocks. Data blocks contain the user's records which are ordered initially according to the values of the key field. Directory blocks contain pointers to data blocks. ISAM performs all blocking, deblocking and buffering for the user.

4) BTAM (Basic Tape Access Method). This method provides either fixed length or undefined record formats (where records are stored one per block) and storage on tape only. BTAM is used to provide efficient accessing of tape blocks.

5) EAM (Evanescent Access Method). It provides fixed length record formats and storage on direct-access devices only. EAM creates and accesses temporary files only in a random order. Because they are temporary, EAM files have no labels and require no cataloguing or security checks.

Data structures in these five access methods are similar in several respects. In fact, only three structures are provided for records:

1) Fixed length - in which each record contains exactly the same number of bytes. Standard format is known to all DMS access methods.

2) Variable length - in which each record may contain a different number of bytes. In each variable length record, the first two bytes of the record contain the characters "11" and the second two bytes contain the length of the record.

3) Undefined - in which records are identical in length to the input/output buffers defined for the access method.

There are three ways of organizing records into files: 1) random organization, 2) sequential, and 3) indexed sequential.

For storage, records may be blocked and unblocked automatically, devices may be tape or direct-access, and blocks may be standard (2064 bytes) or nonstandard (< 4096 bytes). Control codes such as tapemarks, count fields, etc. are handled automatically and may not be specified by the user.
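
The indexed sequential organization described above can be sketched roughly as follows. This is a simplification under stated assumptions, not the RCA DMS implementation: the class `IndexedSequentialFile` and its methods are hypothetical names, the directory is a single in-memory sorted list rather than a set of directory blocks, and blocking is ignored.

```python
import bisect


class IndexedSequentialFile:
    """Toy indexed sequential file (hypothetical model).

    Records stay where they were written; a sorted directory of
    (key, position) entries reflects the key sequence.
    """

    def __init__(self):
        self.records = []    # physical storage, in arrival order
        self.directory = []  # sorted list of (key, position) pairs

    def write(self, key, record):
        position = len(self.records)
        self.records.append(record)
        # Keep the directory ordered by key, whatever the arrival order.
        bisect.insort(self.directory, (key, position))

    def read(self, key):
        """Non-sequential access: locate the record through the directory."""
        i = bisect.bisect_left(self.directory, (key, -1))
        if i < len(self.directory) and self.directory[i][0] == key:
            return self.records[self.directory[i][1]]
        return None

    def scan(self):
        """Sequential access: follow the directory in key order."""
        return [self.records[pos] for _, pos in self.directory]


f = IndexedSequentialFile()
f.write("SMITH", "rec-smith")
f.write("ADAMS", "rec-adams")
f.write("JONES", "rec-jones")

print(f.records)        # physical order: arrival order
print(f.scan())         # logical order: by key
print(f.read("JONES"))  # direct access by key
```

The physical order of `records` never changes; only the directory carries the sequence, which is the sense in which records "may not be in sequential order physically".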
Thus, the following characteristics of file and storage structures are made accessible to programmers by the DMS data description statements.

1. The characteristics for organizing records into files and implementing the structure consist of:
   (i) structuring records by input sequence,
   (ii) structuring records by value (key),
   (iii) implementing structures by
        a) sequential positioning, and
        b) by pointers:
           1) stored in tables or embedded in records,
           2) given as absolute address or relative to some origin.

2. The characteristics for positioning records in device blocks consist of:
   (i) the record-to-block ratio, and
   (ii) the distribution of records such that records either are maintained whole or are split between blocks.

3. The characteristics for organizing storage blocks and implementing this structure consist of:
   (i) block naming,
   (ii) formatting for the following supported devices: magnetic tape, magnetic disk, cards, and printer,
   (iii) block length specification for supported devices,
   (iv) labels for supported devices,
   (v) fixed order of device formats,
   (vi) fixed occurrences of device formats,
   (vii) repetition of formats for tape reels, disk levels, cards and printer pages.

2.7 Data Structures in Current Versions of Higher-Level Programming Languages

Current higher-level programming languages have been developed to take advantage of the data management services provided by operating systems and to satisfy user requirements for more complex working structures. For example, RCA SPECTRA 70/46 ANSI COBOL (RCA 1969) has statements to evoke SAM and ISAM and their related data structures.

The COBOL Data Division has been enhanced: new internal formats have been added, repeating groups can be ordered, and repetition numbers can vary for different record occurrences. The clauses used to specify these options are illustrated in Figure 2-5.
   [USAGE IS ...]

Figure 2-5, a. The COBOL Statement for Declaring Data Types

   OCCURS [integer-1 TO] integer-2 TIMES [DEPENDING ON data-name-1]
      [{ASCENDING | DESCENDING} KEY IS data-name-2 [, data-name-3] ...] ...
      [INDEXED BY index-name-1 [, index-name-2] ...]

Figure 2-5, b. The COBOL Statement for Specifying Repetition

Figure 2-5. Enhanced COBOL Description Statements
PL/I is an example of a higher-level programming language that was designed to incorporate a larger number of record structures than other languages available at the time of its conception. It provided array accessing, hierarchic structuring and string processing for data items and groups of data items.

PL/I provides a rich set of characteristics for structuring and implementing data items (IBM 1965):
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) fixed and varying lengths as specified by the user,
   (iv) data types:
        a) character string,
        b) number:
           1) binary or decimal base,
           2) sign: radix or diminished radix complement (depending on the hardware) for binary numbers, and character sign or no sign for decimal numbers,
           3) fixed or floating point scale,
   (v) value alignment with zero or blank pad characters,
   (vi) data items identified by position.
These data descriptive elements are combined in declaration statements of the form:
   i) DECLARE data item name { ... (n) [VARYING] | PICTURE picture string }
   ii) DECLARE data item name FIXED ...
To group data items into hierarchic structures and structures accessible by array indexing, PL/I provides the following elements:

1) a clause which is used to specify the dimensions for array accessing. It has the form:
      (m1, ... , mn)
   for an n dimensional array, where the ith dimension has mi elements. This clause is used in a DECLARE statement:
      DECLARE data name (m1, ... , mn) ...

2) a clause which is used to describe hierarchic relationships between data items. It has the same form as the level number clause in COBOL. It is used in a DECLARE statement:
      DECLARE level number data item ...
              level number data item ...
   Such hierarchic structures may also be accessed by array indexing.

For file and storage structures, PL/I provides statements which are used to invoke the DMS access methods of its underlying operating system.

The characteristics of data structures that are made accessible to the programmer by the data description elements of many current higher-level languages are summarized in Section 2.10.
2.8 Data Structures in Data Base Management Systems

Data Base Management Systems are an outgrowth of Information Storage and Retrieval (ISR) systems. ISR systems are designed to manage large quantities of a particular type of data. For example, one early system, MEDLARS, was created to manage documents for the National Library of Medicine (~a 1968). In these systems, since only one type of information was to be used, only one type of file structure was required. Also, input and output routines were specialized to handle the file structure most effectively. As a whole, ISR systems were individually tailored for applications such as text-handling and record-keeping.

The development of more generalized text-handling and record-keeping systems led to today's generalized Data Base Management Systems (DBMS's) (CO 1969).

Every DBMS has a language. The data description statements of the language specify the structure of data maintained by the DBMS. In general, the data description statements form the largest part of a DBMS's language. For example, in the MARK IV DBMS developed by Informatics Inc. (CO 1969), raw data must be input in the format in which it is to be stored. MARK IV formats can be characterized in the following way.

1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) fixed lengths as specified by the user,
   (iv) data types:
        a) character string,
        b) number:
           1) binary or decimal base,
           2) sign: radix or diminished radix complement (depending on the hardware) for binary numbers, and character signs or no signs for decimal numbers,
           3) fixed or floating point scale,
   (v) data items identified by their position.

2. The characteristics of records consist of:
   (i) hierarchic structure,
   (ii) fixed order,
   (iii) fixed occurrences,
   (iv) fixed or varying repetitions ordered as input,
   (v) groups of data items identified by their position.

3. MARK IV depends on its underlying Operating System for its storage structures.

MARK IV's language is a tabular language. Forms are provided in which a user selects options provided by the system.

MARK IV is a self-contained DBMS. It is not embedded in any higher-level programming language. DBMS's which are embedded in some higher-level language are called host-language DBMS's. They are designed to enhance their host language. This development combines the record structures provided by the host languages with the file and storage structures provided by the DBMS. COBOL and the Honeywell-General Electric Co.'s Integrated Data Store (IDS) together form an example of this type of system (CO 1969).
In COBOL-IDS, COBOL structures are used at the data item and record level. These structures are described by the standard COBOL statements. The enhancement comes at the file level. IDS adds the capability to describe network relationships among records. IDS networks can be viewed as interconnecting ring structures. The interconnections are maintained by embedded pointers. Each record in IDS may participate in more than one ring. Thus, a single record may be associated with many other records. In each IDS ring there is one record which is treated as a master record. It contains control information. The remaining records in the ring are called detail records. Any record may be master in one ring and detail in another.

The data description statements used to describe the characteristics of these rings are in the form of additional clauses in the COBOL record statement. Each ring relationship is defined at level 98 in a record description. In IDS terminology a ring is called a CHAIN. The clause for declaring a record to be a chain master has the form:

   98 chain-name CHAIN MASTER.

The clause for declaring a record to be a chain detail has the form:

   98 chain-name CHAIN DETAIL
      [; SELECT UNIQUE MASTER]
      [; MATCH-KEY IS data-name]
      [; CHAIN-ORDER IS SORTED]
      [; {ASCENDING | DESCENDING} SORT-KEY IS data-name]
      [; RANDOMIZE ON data-name]
      [; DUPLICATES NOT ALLOWED].
This clause specifies the chain in which the record is to be a detail, the order in which detail records are to occur (if they are to be ordered), and the field from which a hashed address of the record is to be derived (if this is desired).

MARK IV and COBOL-IDS represent two different classes of DBMS. However, they are both implemented as application programs and are not parts of the operating systems. Many system resources are thus unavailable to the user. Furthermore, privacy protection and access control, which are vital to DBMS users, are difficult to enforce. Therefore, a different approach to building a DBMS was taken by the designers of the Extended Data Management Facility (EDMF) implemented at the Moore School of Electrical Engineering at the University of Pennsylvania (Ma 1971). The EDMF was implemented as a part of the RCA SPECTRA 70/46 Time Sharing Operating System (TSOS). Statements of the EDMF are in the form of either TSOS commands, macro-calls which may be used by the regular applications programmer in assembly language programs, or built-in functions for the FORTRAN and COBOL languages.

The set of record and file structures provided by the EDMF is one of the most extensive that has been implemented. EDMF provides record structures which are beyond the COBOL structures (HS 1971). It provides the following characteristics.
1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) the hardware provided character code,
   (iii) fixed or variable lengths as specified by the user,
   (iv) data types:
        a) character string,
        b) number:
           1) decimal or binary base,
           2) sign: radix or diminished radix complement (depending on the hardware) for binary numbers, and character sign or no sign for decimal numbers,
   (v) value alignment: left for character strings and right for numbers, with zero or blank pad characters,
   (vi) data items identified by position and by attribute names used as delimiters.

2. The characteristics of records consist of:
   (i) hierarchic structure,
   (ii) fixed order,
   (iii) fixed or optional occurrences of data items and groups,
   (iv) fixed and variable repetition of data items and groups ordered as input,
   (v) groups identified by position and by using attribute names as markers.
At the file level, the EDMF allows records to be linked together into lists, when the records contain the same data items (called keywords). A record may be linked into any number of lists. Pointers to the heads of the lists are stored in directories (tables) in ascending lexicographical order. By setting limits on list lengths, files may be implemented completely with pointers embedded in records or with tables of pointers or some combination of the two. This is under the user's control, and allows him to organize his data in a wide range of structures, including inverted, multilist, and indexed random organization (HS 1970). EDMF seems to be the only existing DBMS to allow the user this kind of control over the implementation of his file.

Each one of the above DBMS's was designed to enhance various characteristics at either the data item, record or file level, or at all three. The level and degree of enhancement vary from DBMS to DBMS. A summary will be provided in Section 2.10 of the most advanced DBMS features.

2.9 The Data Description Language of the CODASYL Data Base Task Group*

* CODASYL (Conference on Data Systems Languages) is a group originally formed to create a business-oriented language. It produced COBOL and has now extended its interests to DBMS's.

The CODASYL Data Base Task Group (DBTG) was organized to unify work done on current DBMS data description languages. The goal of the DBTG is to produce a single data description language (DDL) in which all current data structures at the data item, record and file levels can be described. This DDL (CO 1971) includes:
1) the COBOL Data Division, which allows the user to specify record formats. Unlike the EDMF, the CODASYL DDL does not allow varying length data items, varying repetitions, or optional occurrences of data items.

2) statements describing network structures. The concept of a SET has been developed to describe file structures. A SET is a sequentially ordered set of records. Each SET has one "owner" record and several "member" records. The concept of "owner" record is similar to that of "master" record in IDS. Member records of SET's are ordered in either of two ways:
   (a) Records may be ordered by ascending or descending sequences based on specific keys.
   (b) Records may be ordered in relation to existing members of the SET as they are input. That is, when a new record is input, it can be automatically placed as the last or first record of the SET.
   The SET concept is similar to the IDS chain.

3) statements describing file implementation. The CODASYL DDL, at the file level, allows the user to specify whether a SET of records is to be implemented either with embedded pointers or with tables of pointers. However, these cannot be combined as in the EDMF, and the user has no control over the pointers or table structure.
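
The two member-ordering options of a SET can be illustrated with a small Python sketch. It is not DBTG syntax: the class `RecordSet`, the option names "SORTED", "FIRST" and "LAST", and the dictionary records are all assumptions made for this example, and members are held in a plain list rather than by embedded pointers or pointer tables.

```python
class RecordSet:
    """Toy SET: one owner record plus an ordered list of member records."""

    def __init__(self, owner, order="LAST", key=None, descending=False):
        # "SORTED", "FIRST" and "LAST" are names invented for this sketch.
        if order not in ("SORTED", "FIRST", "LAST"):
            raise ValueError("unknown ordering option")
        if order == "SORTED" and key is None:
            raise ValueError("a sorted SET needs a key")
        self.owner = owner
        self.order = order
        self.key = key
        self.descending = descending
        self.members = []

    def insert(self, record):
        if self.order == "FIRST":
            # (b) placed relative to existing members: new record goes first
            self.members.insert(0, record)
        elif self.order == "LAST":
            # (b) placed relative to existing members: new record goes last
            self.members.append(record)
        else:
            # (a) ascending or descending sequence on a specific key
            self.members.append(record)
            self.members.sort(key=lambda r: r[self.key],
                              reverse=self.descending)


by_name = RecordSet({"dept": "SALES"}, order="SORTED", key="name")
for name in ["SMITH", "ADAMS", "JONES"]:
    by_name.insert({"name": name})
print([m["name"] for m in by_name.members])

newest_first = RecordSet({"dept": "SALES"}, order="FIRST")
for name in ["SMITH", "ADAMS"]:
    newest_first.insert({"name": name})
print([m["name"] for m in newest_first.members])
```

The same record objects could be inserted into several `RecordSet` instances, which mirrors the way a record may belong to more than one IDS chain.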
In summary, the following characteristics of data structures are made available to the user by the CODASYL DDL.

1. The characteristics of individual data items consist of:
   (i) symbolic naming,
   (ii) fixed lengths as specified by the user,
   (iii) data types:
        a) character string,
        b) number:
           1) binary or decimal base,
           2) sign: radix or diminished radix complement (depending on hardware) for binary numbers, and character sign or no sign for decimal numbers,
           3) fixed or floating point scale,
   (iv) value alignment with blank or zero padding,
   (v) data items identified by their position.

2. The characteristics of records consist of:
   (i) hierarchic structure,
   (ii) fixed order,
   (iii) fixed occurrences,
   (iv) fixed and dependent repetitions ordered as input,
   (v) groups identified by their position.

3. The structure and implementation characteristics of files consist of:
   (i) structuring by input sequence,
   (ii) structuring by criteria on keys (values):
        a) criteria comparisons: ≤, ≥, =,
        b) conjunctions of criteria,
   (iii) implementation:
        a) by embedded pointers,
        b) by tables of pointers.

4. The CODASYL DDL will depend on its implementation for storage structures.
The CODASYL DDL is an attempt to create a common front-end language for describing data structures to DBMS's. There is therefore a degree of overlap between the CODASYL DDL and GDDL developed herein. Before this overlap is discussed, it should be pointed out again that GDDL is designed to be a language for completely describing data structures and for data conversion. The CODASYL DDL is not intended to specify data conversion. Furthermore, GDDL provides the capability of describing storage structures, whereas CODASYL DDL does not. At the record level, CODASYL DDL is based on COBOL and we show in Appendix C that GDDL has more descriptive power than COBOL at the record level. This additional power is obtained by providing more general capabilities for specifying record implementation. At the file level, CODASYL DDL is designed to describe just those file structures existing in current systems. GDDL is designed to provide much greater descriptive power at the file level. The power is provided by generalizing current file structuring technology, essentially by allowing the dependency of file structure on data values, record structure, and record implementation to be described.

2.10 Summary

Two trends have appeared in the handling of data by software systems. First, the data structures provided have become increasingly elaborate, and secondly, the user has been given more and more explicit control over setting up the data structures required.

The earliest systems provided the user with certain structural options at the data item level. These options were, however, provided implicitly through a selection of machine instructions. Successive systems provided more capabilities at the record level, and allowed these to be declared explicitly. It was first in operating systems that structuring facilities were offered at the file level. Typically, the structures provided were limited to a few options which frequently included sequential and indexed sequential structures. With the development of DBMS's, users were given more control over the implementation and structure of both records and files. However, they still have no control or even knowledge of the storage structures used.

The ddl presented herein takes these two trends towards their logical conclusion. First, the ddl can describe a more general class of data structures than that provided by current data processing technology. Secondly, the ddl allows every aspect of a data structure at each level to be described explicitly.

Those aspects of data structures which have been identified in the preceding sections have been summarized in Table 2-1. This table is organized to provide a convenient means of evaluating the ddl and its underlying model in later chapters.
Table 2-1. Summary of Data Structure Characteristics

(Table body illegible in this copy. Legible row headings:
 Record Characteristics: Structure Characteristics; Implementation Characteristics, including Symbolic Naming.
 File Characteristics: Structuring by Input Sequence; Criteria on Values (Keys); Criteria on Paths; Conjunction of Criteria; Implementing by Sequential Storage; pointers Embedded in Record; pointers Stored in Table; Path Length Upper Bound Limit.
 Storage Characteristics: Block Naming; Formatting of Reel, Tape, Disk; Bytes/Block; Record Split or Whole.
 Footnote: * EDMF only.)
CHAPTER 3
RECORD DESCRIPTION

3.1 Introduction

In this chapter we begin our task of showing how the organization of data can be explicitly described. We present the model for record structure that is the foundation for the design of GDDL's record description features. We show that the model is complete for record description in the sense that record structures of Table 2-1 can be described in the model. We also discuss how the model can describe certain generalizations of present record structures. Then we show that the record description statements of GDDL are based on this model. In this way we show that GDDL is also complete and generalized in the above senses. We further demonstrate the completeness of GDDL by noting that the COBOL record description features are properly contained in GDDL and by providing a set of examples which illustrate the ability of GDDL to describe existing record organizations.

3.2 A Model of Record Structures

We begin this section by providing an intuitive introduction to the model.

The smallest meaningful piece of information we will call a "data item". Data items are the components which are organized into records.

Conceptually, a data item is a string of characters, which provide a value for the data item, together with an identification of the type or class of information to which the value belongs. This type or class of information we call the attribute of the data item. When a data item is represented on a storage medium, there must be rules which determine how this data item is implemented as a bit string.

When a user is organizing data items for storage and retrieval from a computer medium, he identifies a particular level of organization which is to be stored and retrieved as a single unit when the data is being used. This level of data item organization we call the record level. A convenient way to conceptualize the organization of data items at the record level is as a hierarchy. It is certainly the case that existing software systems (e.g., COBOL, MARK IV, IDS, EDMF, and the CODASYL DDL) provide hierarchies for organizing data items into records. The records are themselves finally represented on a storage medium as a bit string. So again there must be rules for specifying how a particular organization conceived by a user is to be represented as a bit string. There are then the following components to this process of data organization:

for data items:
   (1) the conceptual structure of data items,
   (2) the encoding of this structure into a bit string, and
   (3) the resulting bit string representation;

for records:
   (1) the conceptual structure of the records,
   (2) the encoding of the record structure into a bit string, and
   (3) the resulting bit string representation.

We therefore have to model each of these components. The conceptual structure of data items and records is modelled in terms of the ideas of attribute and value by generalizing the work of (~e 1967), (~h 1968), (lin 1970), and (HS 1971). The bit string is simply a sequence of 0's and 1's. The encoding of the conceptual structure is modelled directly in terms of characteristics for encoding attributes and values as bit strings. The complete model will be presented in two steps. First the model of data items will be described and then the model of records.
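
The three components listed above for data items (conceptual structure, encoding, resulting bit string) can be made concrete with a short Python sketch. It is only an illustration under assumptions that are not part of the model: the names `DataItem` and `encode_value` are hypothetical, the value is stored in a fixed length field, left-justified with blank padding, each character takes eight bits, and only the value (not the attribute) is encoded.

```python
from typing import NamedTuple


class DataItem(NamedTuple):
    """Conceptual structure: an attribute together with a value."""
    attribute: str
    value: str


def encode_value(value, length, codec="ascii", pad=" "):
    """Encoding: map a value to a fixed length string of 0's and 1's.

    Assumes left justification, padding with `pad`, and one 8-bit byte
    per character in the chosen character code.
    """
    if len(value) > length:
        raise ValueError("value longer than the declared length")
    padded = value.ljust(length, pad)
    data = padded.encode(codec)
    return "".join(format(byte, "08b") for byte in data)


item = DataItem("name", "JONES")

# Resulting bit string representation: 6 characters * 8 bits each.
bits = encode_value(item.value, length=6)
print(len(bits))   # 48
print(bits[:8])    # bits of the first character, "J", in ASCII

# The same conceptual value under a different character code
# ("cp037" is one of Python's EBCDIC codecs).
print(encode_value("J", 1, codec="cp037"))
```

The conceptual item is the same in both calls; only the encoding rule changes, and with it the bit string.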
3.2.1 The Model of Data Items
3.2.1.1 The Concept of Data Items
The concept of a data item can be described in terms of two primitives, attribute and value, and a definition of data item based on these primitives.
Intuitively, an attribute is a quality, such as size or weight, that is ascribed to an object. For each attribute, there is a set of measures or quantities, known as values. A single value to be associated with the attribute is selected from this set. For example, a measure for the attribute weight is selected from the set of real numbers.
Definition 3-1. A data item is an ordered pair of the form < a, v > where a is an attribute and v is a value.
For example, the pairs < name, JONES >, < age, 32 >, < sex, M >, < school, NEWTOWN HIGH SCHOOL >, < school, UNIVERSITY OF PENNSYLVANIA > are data items.
In representing a data item on a computer medium (such as cards, tape, etc.) both the attribute and the value must be encoded. We shall consider the rules for each kind of encoding separately.
3.2.1.2 Encoding Values
A value is encoded if it is transformed into a bit string according to the following encoding rule. Such a string will be called a value string. The rule for encoding a value is simply a detailed specification of the six characteristics listed below:
1. Character codes. Strings of binary digits are used to encode characters such as letters, numbers and punctuation signs. Character codes have been standardized to the extent that all new computers use either of two codes: USASCII (or ASCII) and EBCDIC. However, it is not sufficient to be able to specify either ASCII or EBCDIC as there are other codes which are in use on earlier computers. Also, users of large data bases employ what are, in effect, new character codes to compress data. Thus, to be completely general, it must be possible to describe any character code. One way to describe a character code is to list for each character the code in terms of its bit string representation.
Associated with a character code is a sort order. To describe the sort order, the characters of the code can be listed in the correct order. When values are to be translated from one character code to a second character code, it is necessary to indicate for every character in the first code its image in the second code. This can be specified by listing the characters of the second code in the same sort order as the first code.
An example of encoding the characters of a value in EBCDIC is presented below. For the data item < name, JONES >, we have
J -> 11010001
O -> 11010110
N -> 11010101
E -> 11000101
S -> 11100010
2. Length. The length of a value string is the number of bits in the string. For example, the value string of the attribute name in the previous example may be specified to be of length 64 bits, where unused bits may be filled arbitrarily.
3. Length uniformity. If the value strings for an attribute are always of uniform length, then the lengths of the value strings can be described simply by giving the length. However, if the lengths of the value strings for an attribute are not uniform, then either the length of each value string must be given and stored as a data item, or the value string must be delimited by special characters. Thus, value strings may be specified as being either uniform or varying.
4. Value alignment. When the lengths of the value strings for an attribute are to be uniform, the number of characters needed to represent the value may be less than the allotted length. In such cases, it is necessary to specify whether the value is aligned to the right or to the left and to specify the characters to be used to pad out the unused positions.
For example, consider the data item < name, JONES >. The value length of the attribute may have been specified as 64 bits and the character code as EBCDIC. To specify that the value is to be aligned to the left, with blank characters used for padding, results in the following encoding of the value JONES (ƀ denotes the blank character):
J -> 11010001
O -> 11010110
N -> 11010101
E -> 11000101
S -> 11100010
ƀ -> 01000000
ƀ -> 01000000
ƀ -> 01000000
5. Data type. Value strings may be interpreted either as characters or as numbers. Numbers are either signed or unsigned strings of digits. Signs may be denoted by the plus or minus, by radix complement, or by diminished radix complement. Numbers may be organized either as fixed point, or as floating point numbers with the number of significant digits and the length of the mantissa specified.
6. Value criteria. Numeric and set-theoretic criteria may be used to define the set of acceptable values for a given attribute. For example, values of the attribute age may be restricted to numbers between 21 and 65 for a given set of data items. Or values of the attribute city may be restricted to a particular set of city names.
3.2.1.3 Encoding Attributes
We have seen how the value of a data item is encoded.
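As a recap of the value characteristics, the following is a minimal Python sketch, not part of the original report, of how character code, length, uniformity and alignment might combine to produce a value string. The function name `encode_value` and its parameters are hypothetical. The sketch assumes that Python's `cp037` code page stands in for EBCDIC (it reproduces the bit patterns shown above for J, O, N, E, S and blank) and that lengths are whole multiples of 8 bits.

```python
def encode_value(value, length_bits, codec="cp037", align="left", pad=" "):
    """Encode a value as a uniform-length bit string (hypothetical helper)."""
    if length_bits % 8 != 0:
        raise ValueError("this sketch only handles whole bytes")
    raw = value.encode(codec)                 # characteristic 1: character code
    pad_byte = pad.encode(codec)              # padding character in the same code
    width = length_bits // 8                  # characteristic 2: length
    if len(raw) > width:
        raise ValueError("value does not fit in the allotted length")
    fill = pad_byte * (width - len(raw))      # characteristic 3: uniform length
    data = raw + fill if align == "left" else fill + raw   # characteristic 4
    return "".join(f"{byte:08b}" for byte in data)


bits = encode_value("JONES", 64)
# Print the value string one character (8 bits) per entry.
print([bits[i:i + 8] for i in range(0, len(bits), 8)])
```

Data type and value criteria (characteristics 5 and 6) are left out of the sketch; they would govern how the characters are interpreted and which values are accepted before encoding.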
To encode the entire data item we must now provide a way of identifying the attribute to which that value belongs. This can be achieved in two ways.
The first way is to directly encode the attribute as a bit or character string, and then position this string relative to the value. This way of encoding an attribute can be made to fulfill a second role. We saw in the discussion of length uniformity in the section on encoding values that if a value is specified as having varying length, then it must be delimited by characters which signify the end of the value string. The attribute encoding can serve as such a delimiter for the value string. We will call the string which directly encodes an attribute an attribute marker. The following characteristic is used to specify an attribute marker:
7. Attribute marker. Attribute markers can be either character or bit strings which are positioned directly in front of or directly behind a value string.
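The double role of an attribute marker (identifying the attribute and delimiting a varying-length value) can be illustrated with a small sketch. It is an illustration only: it works on characters rather than bits, the marker characters `#`, `@` and `%` are invented for the example, and it assumes single-character markers placed in front of each value that never occur inside a value.

```python
def encode_with_markers(items, markers):
    """Place each attribute's marker directly in front of its value string."""
    return "".join(markers[attribute] + value for attribute, value in items)


def decode_with_markers(text, markers):
    """Recover <attribute, value> pairs; each marker also ends the previous value."""
    by_marker = {marker: attribute for attribute, marker in markers.items()}
    items, attribute, value = [], None, ""
    for ch in text:
        if ch in by_marker:
            if attribute is not None:
                items.append((attribute, value))
            attribute, value = by_marker[ch], ""
        else:
            value += ch
    if attribute is not None:
        items.append((attribute, value))
    return items


# Hypothetical markers for three attributes.
markers = {"name": "#", "age": "@", "sex": "%"}
items = [("name", "JONES"), ("age", "32"), ("sex", "M")]
encoded = encode_with_markers(items, markers)
decoded = decode_with_markers(encoded, markers)
```

Because every value is introduced by its marker, the data items could be stored in any order and with any length and still be decoded.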
The second way in which the attribute of a particular value can be identified is by knowing that it always occurs in a certain position relative to other values. That is, if a set of data items is organized in such a way that the position of the value corresponding to a given attribute can be identified, then the attribute has been indirectly encoded by positioning. As this encoding of attributes by positioning depends on the organization of sets of data items, this way of encoding attributes will be discussed in the next section.
3.2.2 The Model of Records
3.2.2.1 The Conceptual Record Structure
In this section we want to model the conceptual structure of records. First, however, we must pin down exactly what we mean by a record itself. Then, we can go on to obtain the structure of such records.
In the data processing field, a user of COBOL conceives of a record differently than, say, a user of MARK IV. In the definition of records below, we attempt to give an exact formalization of the notion of record which is independent of any particular software system.
Definition 3-2. A record is a set of data items which are structured according to the following rules:
record -> group
group -> < attribute, {compound value} >
compound value -> compound value, compound value
compound value -> group
compound value -> data item
We use the symbols < > to denote an ordered set and the symbols { } to denote an unordered set.
For example, the data items < name, JONES >, < age, 32 >, and < sex, M > can be organized into the following record:
< person, {< name, JONES >, < age, 32 >, < sex, M >} >
As another example, the data items < name, JONES >, < name, MARY >, < age, 6 >, < name, JOHN >, < age, 10 > can be organized into the record:
< family, {< name, JONES >,
           < child, {< name, MARY >, < age, 6 >} >,
           < child, {< name, JOHN >, < age, 10 >} >} >
In this case < child, {< name, MARY >, < age, 6 >} > and < child, {< name, JOHN >, < age, 10 >} > are groups.
It should be noted that a data item is simply an attribute-value pair, whereas a group is an attribute-compound value pair. When it is necessary to distinguish the attributes associated with compound values from the attributes associated with values, we will refer to them as group attributes and data item attributes respectively. In the example above, "name" and "age" are data item attributes whereas "family" and "child" are group attributes. Compound values are actually groups or data items. The groups forming a group are called subordinate groups.
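Definition 3-2 maps naturally onto a recursive data type. The sketch below is one possible Python rendering, assuming hypothetical class names `DataItem` and `Group`; it is not notation from the report. A tuple is used for the compound value for simplicity, although the model treats { } as an unordered set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    attribute: str
    value: str


@dataclass(frozen=True)
class Group:
    attribute: str
    members: tuple  # compound value: data items and/or subordinate groups


def group_attributes(group):
    """Collect the attributes associated with compound values."""
    names = {group.attribute}
    for member in group.members:
        if isinstance(member, Group):
            names |= group_attributes(member)
    return names


def data_item_attributes(group):
    """Collect the attributes associated with plain values."""
    names = set()
    for member in group.members:
        if isinstance(member, Group):
            names |= data_item_attributes(member)
        else:
            names.add(member.attribute)
    return names


# The family record from the example above.
family = Group("family", (
    DataItem("name", "JONES"),
    Group("child", (DataItem("name", "MARY"), DataItem("age", "6"))),
    Group("child", (DataItem("name", "JOHN"), DataItem("age", "10"))),
))
```

Applied to the family record, the two helper functions separate the group attributes (family, child) from the data item attributes (name, age), as in the text.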
We note that as a consequence of the above definition the structure of a record is a hierarchy which has an attribute associated with each part of the hierarchy. We can thus abstract a notion of record structure based on these attributes which is independent of the values. This is done in Definition 3-3.
Definition 3-3. A record structure is a relationship over data item attributes produced according to the following structure productions:
1. record structure -> structure
2. structure -> < group attribute, {substructure} >
3. substructure -> substructure, substructure
4. substructure -> structure
5. substructure -> data item attribute
6. substructure -> null
For example, the data item attributes "name" and "age" may be related by structures obtained from the following structure productions:
family record structure -> structure F1
structure F1 -> < family, {substructure F1F1} >
substructure F1F1 -> substructure F1, substructure F2
substructure F1 -> name
substructure F2 -> substructure F2, substructure F2
substructure F2 -> null
substructure F2 -> structure F2
structure F2 -> < child, {substructure F2F1} >
substructure F2F1 -> substructure F21, substructure F22
substructure F21 -> name
substructure F22 -> age
Two particular structures of these attributes are:
(i) < family, {name} >
(ii) < family, {name, < child, {name, age} >, < child, {name, age} >} >
Note:
i) Production 3 in Definition 3-3 allows a particular substructure to repeat an arbitrary number of times (e.g., in the above example substructure F2 -> substructure F2, substructure F2).
ii) Production 6 allows the occurrence of a particular substructure to be optional (e.g., in the above example substructure F2 -> null).
If we are given a structure, then we can obtain records from it simply by substituting a data item for each data item attribute in the structure.
For example, if we make the following substitutions in the structures above:
< name, JONES > for name
< name, MARY > for name
< age, 6 > for age
< name, JOHN > for name
< age, 10 > for age
we obtain the following records:
i) < family, {< name, JONES >} >
ii) < family, {< name, JONES >,
              < child, {< name, MARY >, < age, 6 >} >,
              < child, {< name, JOHN >, < age, 10 >} >} >
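The relationship between a structure and the records obtained from it can be sketched in both directions. In the illustration below, which uses a representation chosen for this example rather than anything prescribed by the report, a group is a pair (attribute, list of members), a data item is a pair (attribute, value string), and a data item attribute inside a structure is a bare string. The function names are hypothetical.

```python
def structure_of(record):
    """Abstract a record to its structure by dropping the values."""
    attribute, members = record
    return (attribute, [
        structure_of(m) if isinstance(m[1], list) else m[0]
        for m in members
    ])


def substitute(structure, items):
    """Obtain a record by substituting data items, in order, for attributes."""
    remaining = iter(items)

    def fill(node):
        attribute, members = node
        filled = []
        for m in members:
            if isinstance(m, tuple):          # a subordinate structure
                filled.append(fill(m))
            else:                             # a data item attribute
                item_attribute, value = next(remaining)
                if item_attribute != m:
                    raise ValueError(f"expected {m}, got {item_attribute}")
                filled.append((item_attribute, value))
        return (attribute, filled)

    return fill(structure)


# Structure (ii) and the substitutions listed above.
structure_ii = ("family", ["name", ("child", ["name", "age"]),
                           ("child", ["name", "age"])])
items = [("name", "JONES"), ("name", "MARY"), ("age", "6"),
         ("name", "JOHN"), ("age", "10")]
record_ii = substitute(structure_ii, items)
```

Abstracting the resulting record with `structure_of` gives back structure (ii), which is the sense in which a structure is independent of the values.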
In a previous section we saw how data items were encoded. Now we must consider how the structure of a record is encoded.
3.2.2.2 Encoding the Record Structure
The structure of a record is a relationship over the data item attributes in the record specified by structure productions. These productions actually produce a hierarchic structure which has the data item attributes on the lowest levels and each higher level identified by a group attribute. Therefore, to encode the structure of a record it is only necessary to ensure that the attribute which is associated with each compound value can be identified.
We have seen that the attribute of a data item can be identified by putting a marker adjacent to its value, or, when the data item appears in a group, the attribute can be identified by the position of its value relative to the values of other attributes.
The attribute associated with a compound value can be identified in similar ways. Markers can be placed adjacent to the compound value using the same "attribute marker" characteristic as before. Alternatively, the attribute for a compound value can be identified by the position in which the compound value occurs relative to the compound values of other attributes.
We will now discuss what characteristics must be specified to identify an attribute from the position of the compound value or value. For convenience in this discussion, we will just use the term compound value to refer to both compound values and values.
The attribute associated with a compound value can be identified if the compound value occurs in a particular order with respect to the compound values of other attributes in the same substructure. In this case, the order can be specified by listing the attributes of the compound values in the appropriate order. Further, if one of the attributes in this list corresponds to a substructure which is optional, then it must be specified that this attribute may not appear. Also, if one of the attributes in the list corresponds to a substructure which repeats, then the number of repetitions must be given.
The characteristics required to identify the attribute of a compound value (or value) from the position of the compound value (or value) are given below:
8. Order. The order of compound values can be specified by listing their attributes in the appropriate order. If the attributes are allowed to appear in any order, then the encoding must be done by markers.
9. Occurrence. The occurrence of an attribute may be either mandatory or optional within a substructure.
10. Repetition number. The repetition number is the number of times an attribute may occur consecutively in a substructure.
11. Repetition uniformity. If the number of times an attribute repeats is always the same (i.e., the repetition of the attribute is uniform), then the repetition number can be specified simply by giving the number directly. However, if the repetition of the attribute is not uniform, then either the repetition number must be encoded and stored as a data item, or the encoding of the values or compound values for the attribute must be delimited.
12. Repetition order. When the same attribute repeats, then the encoding of the values or compound values for it may either be stored directly in any order or in some order described by criteria on the values.
13. Criteria. Numeric and set-theoretic criteria may be used to define the set of acceptable values or compound values for each attribute.
3.2.3 The Specification of the Encoding Characteristics
In the previous sections we have seen that records are encoded by specifying certain characteristics. We will allow each characteristic to be specified either:
1) directly - by specifying explicitly the characteristic, or
2) indirectly - by specifying a function which must be computed to determine the characteristic. The function may be defined over the values of data items or over other characteristics using the usual arithmetic operators.
For example, the length characteristic can be specified directly as a number of bits, or it can be specified indirectly as perhaps
(i) being equal to the value of some particular data item, or
(ii) being equal to the number of repetitions of some particular attribute.
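The distinction between direct and indirect specification can be pictured as the difference between a constant and a function. The following sketch is illustrative only; the helper `resolve` and the data item `name_chars` (a stored character count) are assumptions introduced for the example.

```python
def resolve(characteristic, values):
    """Return a characteristic given directly (a number) or indirectly (a function)."""
    return characteristic(values) if callable(characteristic) else characteristic


# Direct: the length of "name" is always 64 bits.
direct_length = 64


# Indirect: the length is computed from the value of another data item,
# here a hypothetical character count, using ordinary arithmetic.
def indirect_length(values):
    return 8 * values["name_chars"]


values = {"name_chars": 5}
print(resolve(direct_length, values), resolve(indirect_length, values))
```

A decoder written against `resolve` does not need to know which form was used, which is what lets every characteristic be specified either way.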
3.3 Interpretation of Common Data Processing Concepts in Terms of the Model of Record Structures
A set of structure productions together with a specification of the rules for encoding the structures determines a particular type of record, or record type. Two records are of the same record type if and only if they can both be obtained from the same structure productions and they both have the same encoding characteristics. Note that the term record is sometimes used in data processing literature to refer to what we call a record type.
Note that the production rules of Definition 3-2 make it possible to distinguish easily between a data item and a record consisting of a single data item, even though they both contain a single value. For example, < name, JONES > is a data item, whereas < person, {< name, JONES >} > is a record. This distinction reflects the fact that a data item in itself is only a basic unit of information in some data organization, whereas a data item structured as a record is in addition the basic unit which is stored or retrieved when that data organization is used.
Two groups are of the same group type if and only if they can both be obtained from the same structure productions and they both have the same encoding characteristics.
A data item corresponds to the intuitive idea of a field. Two fields are of the same field type if and only if they both have the same attribute and are both encoded in the same way.
In early versions of COBOL and in some DBMS's only one type of record is allowed per file. In these systems there was therefore no need to refer to particular types of records. However, the model allows for the appearance of more than one type of record in a file. Therefore, some means of referring to particular types of records must be provided. Similarly, it will be useful to be able to refer to particular types of groups and fields. We will use the attribute A of a record (group, field) < A, ... > to name the type of that record (group, field). Thus, a record < person, {...} > is of type person, and a field < age, 10 > is of type age. To ensure that this way of referring to types of records (groups, fields) is unambiguous, we must make the following convention:
Within a file, a given attribute is associated with only one structure and only one set of encoding characteristics.
In particular this requires:
(1) A given attribute can occur in only one production of the form:
structure -> < attribute, {substructure} >
(2) If A occurs in a production of the form:
structure -> < A, {substructure} >
then A cannot occur in the substructure.
We will see in Section 3.5 that this convention ensures that the structure productions produce only hierarchic organizations.
3.4 An Application of the Model of Record Structures
An example of using the model to completely encode a set of data items in a given structure as a bit string is given below:
Consider the data items < name, JONES >, < age, 32 >, and < sex, M > and the structure specified by the structure productions:
person record structure -> structure P1
structure P1 -> < person, {substructure P1P1} >
substructure P1P1 -> substructure P11, substructure P1P2
substructure P1P2 -> substructure P12, substructure P13
substructure P11 -> name
substructure P12 -> age
substructure P13 -> sex
The following record is obtained from these structure productions:
< person, {< name, JONES >, < age, 32 >, < sex, M >} >
The bit string representation of this record is produced using the following encoding characteristics:
(1) The character code for the values of name, age and sex is EBCDIC.
(2) The length of values of name is 64 bits, of age is 16 bits, and of sex is 8 bits.
(3) The lengths of values of name, age and sex are uniform.
(4) The values of name are left aligned and padded with blanks.
(5) The values of name, age and sex are to be interpreted as character strings.
(6) There are no restrictions defined by criteria on the values of name, age and sex.
(7) No attribute markers are used with value strings of name, age and sex.
The structure is encoded according to the following characteristics:
(8) The attributes name, age and sex appear in the order in which they are named by the structure productions.
(9) An occurrence of each attribute is mandatory.
(10) Each attribute occurs once in a structure.
(11) The repetition for each attribute is uniform.
(12) Since there may be only one occurrence of the attributes name, age and sex, the repetition order criterion does not apply.
(13) There are no restrictions defined by criteria on the compound values of person.
Applying these encoding characteristics, the following record representation results:
1101000111010110110101011100010111100010010000000100000001000000
111100111111001011010100
For every different set of data items which are substituted in the structure obtained from the above set of structure productions, a different bit string is produced by these encoding characteristics.
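The worked example of this section can be reproduced with a short sketch. It is an illustration under stated assumptions, not the report's own procedure: Python's `cp037` code page stands in for EBCDIC, the table `FIELDS` and the two function names are hypothetical, and attributes are identified purely by position (characteristics 8 to 12), with no markers.

```python
# Order and uniform lengths (in bits) of the value strings, per (2), (3) and (8).
FIELDS = [("name", 64), ("age", 16), ("sex", 8)]


def encode_record(record, fields=FIELDS, codec="cp037"):
    """Encode a flat record as one bit string, left aligned and blank padded."""
    blank = " ".encode(codec)
    bits = []
    for attribute, length_bits in fields:
        raw = record[attribute].encode(codec)
        width = length_bits // 8
        if len(raw) > width:
            raise ValueError(f"value of {attribute} exceeds {length_bits} bits")
        raw = raw.ljust(width, blank)
        bits.append("".join(f"{byte:08b}" for byte in raw))
    return "".join(bits)


def decode_record(bits, fields=FIELDS, codec="cp037"):
    """Recover the data items from their positions in the bit string."""
    record, position = {}, 0
    for attribute, length_bits in fields:
        chunk = bits[position:position + length_bits]
        raw = bytes(int(chunk[i:i + 8], 2) for i in range(0, length_bits, 8))
        record[attribute] = raw.decode(codec).rstrip(" ")  # drop padding blanks
        position += length_bits
    return record


person = {"name": "JONES", "age": "32", "sex": "M"}
encoded = encode_record(person)
print(encoded[:64])
print(encoded[64:])
```

Decoding relies only on the order and lengths listed in `FIELDS`, which is the sense in which the attributes are encoded indirectly by positioning. One limitation of the sketch is that stripping the padding would also remove blanks that genuinely end a value.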
3.5 The Completeness and Generality of the Model
To be complete, the model must incorporate in itself all of the characteristics of record structures derived in Table 2-1. This is done for the data item characteristics as follows:
Symbolic naming appears in the model as the concept of an attribute.
The implementation characteristics for data items appear in the model directly as encoding characteristics.
The characteristics relating to the structure of records are incorporated in the model as follows:
The structuring characteristics of records appear in the model as the concept of record structure.
The implementation characteristics are incorporated directly as encoding characteristics.
Thus, the model includes each of the record level characteristics appearing in Table 2-1. In this sense, the model is complete.
We further note that the structure productions and the convention of Section 3.3 impose a partial ordering on the attributes of a structure. This is proved as follows:
Theorem: The structure productions and the convention of Section 3.3 impose a partial ordering over the attributes of a record structure.
Proof: A partial ordering is a relation which is 1) reflexive, 2) antisymmetric, and 3) transitive. Let us define ≥ to be a relation over attributes as follows: for attributes a and b, a ≥ b if and only if a = b, or < a, {... b ...} > is a structure, where b may appear in any depth of { } or < > brackets. We will now show ≥ is a partial ordering.
1) By definition ≥ is reflexive.
2) Assume that a ≥ b and b ≥ a and that a ≠ b. This means < a, {... b ...} > and < b, {... a ...} > are structures. But by (1) of the convention, the attribute b can only be associated with one substructure, which must therefore be {... a ...}. Thus, < a, {... b ...} > is actually < a, {... < b, {... a ...} > ...} >. This is not allowed by (2) of the convention. Thus, a ≥ b and b ≥ a implies a = b. Hence ≥ is antisymmetric.
3) Assume a ≥ b and b ≥ c. If a = b and/or b = c, then a ≥ c. If < a, {... b ...} > and < b, {... c ...} > are structures, then by (1) of the convention < a, {... b ...} > is actually < a, {... < b, {... c ...} > ...} >. Thus, a ≥ c, and ≥ is transitive.
Therefore, ≥ is a partial ordering.
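The theorem can be checked mechanically on small structures. The sketch below is an illustration, not part of the proof: a structure is written as (group attribute, list of members), a data item attribute is a bare string, and the function names are hypothetical. It builds the relation "a = b, or b appears at any depth inside the structure of a" and tests the three properties by brute force.

```python
def attributes_below(structure):
    """Map each group attribute to every attribute at any depth inside it."""
    below = {}

    def walk(node):
        attribute, members = node
        found = set()
        for m in members:
            if isinstance(m, tuple):      # subordinate structure
                found.add(m[0])
                found |= walk(m)
            else:                         # data item attribute
                found.add(m)
        below.setdefault(attribute, set()).update(found)
        return found

    walk(structure)
    return below


def is_partial_order(structure):
    below = attributes_below(structure)
    attrs = set(below) | set().union(*below.values())

    def geq(a, b):
        return a == b or b in below.get(a, set())

    reflexive = all(geq(a, a) for a in attrs)
    antisymmetric = all(a == b or not (geq(a, b) and geq(b, a))
                        for a in attrs for b in attrs)
    transitive = all(geq(a, c) or not (geq(a, b) and geq(b, c))
                     for a in attrs for b in attrs for c in attrs)
    return reflexive and antisymmetric and transitive


family = ("family", ["name", ("child", ["name", "age"]),
                     ("child", ["name", "age"])])
# Violates part (2) of the convention: a occurs inside its own substructure.
recursive = ("a", [("b", [("a", ["x"])])])
print(is_partial_order(family), is_partial_order(recursive))
```

The second structure shows why the convention is needed: without it, two distinct attributes can each contain the other and antisymmetry fails.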
Mathematically, any hierarchy can be realized by a partial ordering (131 1948). From the above proof, it follows that the structure productions and conventions can realize any hierarchic record structure.
The characteristics of Table 2-1 are incorporated in more generalized forms in the model to allow for the description of variations of existing data structures. This generality is provided in the following ways:
1) The model provides a more generalized way to describe the order of data items and groups. As we have seen in Table 2-1, current systems only provide for the specification of fixed ordering. However, the ordering characteristic of the model allows order to be specified as fixed or as arbitrary relative to the groups. Consider the following group
< x, {< y, {< ..., a >, < t, b >} >, < u, c >, < v, {< ..., d >, < s, e >} >} >
with the following order characteristic: the ordering for the compound values of attribute x is fixed, and the ordering for the compound values of attributes y and v is arbitrary. This results in the following valid orderings of the values a, b, c, d, e: abcde, bacde, abced, and baced. Such variable orderings are not permitted in current systems.
2) The model provides a more generalized way to specify the encoding characteristics than is required to describe the characteristics of Table 2-1. In Table 2-1, we saw that the characteristics length and repetition could be specified as depending on some single other data item. In the model, all characteristics can be specified as depending on other data items, other characteristics and functions of these. This greatly increases the variety of encodings which can be specified.
In these ways, the model allows generalizations of current data representations at the record level to be specified.
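The ordering example in item 1) above can be verified by enumeration. In the sketch below, which is an illustration with a representation chosen for the example, each data item is reduced to its value (a one-character string) and each group is a pair (order characteristic, members), where the order characteristic is either "fixed" or "arbitrary".

```python
from itertools import permutations, product


def orderings(node):
    """Yield every value sequence permitted by the order characteristics."""
    if isinstance(node, str):            # a data item, represented by its value
        yield node
        return
    order, members = node
    arrangements = [members] if order == "fixed" else permutations(members)
    for arrangement in arrangements:
        # Combine every permitted ordering of each member, kept in this arrangement.
        for parts in product(*(list(orderings(m)) for m in arrangement)):
            yield "".join(parts)


# Group x is fixed; its subordinate groups y = {a, b} and v = {d, e} are arbitrary.
group_x = ("fixed", [("arbitrary", ["a", "b"]), "c", ("arbitrary", ["d", "e"])])
valid = sorted(orderings(group_x))
print(valid)
```

The enumeration yields exactly the four orderings named in the text; making x arbitrary as well would raise the count to 24.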
3.6 The Relationship Between the Model and GDDL
GDDL has been explicitly designed in terms of the model. A GDDL statement consists of an identifying name and a string of parameters. The FIELD and GROUP statements are used to describe the conceptual organization of data items and groups. Each encoding characteristic of data items and the structure of records can be specified by one or more parameters in GDDL statements. The parameters and statements for these characteristics are listed in Table 3-1 given below:

Value Characteristics       Statements and Parameters                      Section in Appendix A
Character Code              FIELD statement, parameter (ii)                1.1
                            CHAR statement                                 1.4.1
                            SET statement                                  2.1.2.1
Length                      FIELD statement, parameters (iii) and (iv)     1.1
Length Uniformity           FIELD statement, parameter (v)                 1.1
Value Alignment             FIELD statement, parameter (ix)                1.1
Data Type                   FIELD statement, parameter (vi)                1.1
Value Criteria              GROUP statement, parameter (iii)f              1.2
                            Criterion statements                           2.1

Attribute Characteristics   Statements and Parameters                      Section in Appendix A
Attribute Markers           CONCODE statement                              1.4.3
Order                       GROUP statement, parameters (ii) and (iii)a    1.2
Occurrence                  GROUP statement, parameter (iii)b              1.2
Repetition Number           GROUP statement, parameter (iii)c              1.2
Repetition Uniformity       GROUP statement, parameter (iii)d              1.2
Repetition Order            GROUP statement, parameter (iii)e              1.2
Criteria                    GROUP statement, parameter (iii)f              1.2
                            Criterion statements                           2.1

Specification of
Characteristics             Statements and Parameters                      Section in Appendix A
Direct                      By listed parameters                           2.1
Indirect                    Parameter statements                           1.4.4

Table 3-1. The Relationship Between the Model and GDDL

Insight into the relationship between the model and GDDL can best be obtained by comparing the format of the GDDL FIELD and GROUP statements with the definitions of field type and group type (see Section 3.3 and Definitions 3-1 and 3-3).
The FIELD statement has the following format:
FIELD ( field name, encoding characteristics )
This corresponds to the specification of a field type in the following way. The attribute corresponds to the field name, and encoding characteristics appear directly. Thus, we see that the FIELD statement specifies data items.
The GROUP statement has the following format:
GROUP ( group name, ... ; (list), ... , (list) ... )
This corresponds to the specification of a group type in the following way. Compare the structure productions of Definition 3-3 with this format. The production of the type:
structure -> < attribute, { substructure } >
corresponds to the format of the GROUP statement, with the attribute corresponding to the group name, and with all the substructures that can be obtained using the remaining types of productions corresponding to (list), ... , (list). The encoding characteristics for each substructure are included in each list. Thus, we see that the GROUP statement specifies the structure for groups. To specify that a particular group is to be treated as a record, the RECORD statement is used (see Section 1.3 in Appendix A).
From the above table, we note that every characteristic of the model is included in GDDL. Since the complete set of characteristics can encode the structure and values of data items, GDDL therefore has the same capability. This, in effect, completes the argument that GDDL can specify any record level structures which can be described in the model.
3.7 Demonstrations of GDDL's Completeness
In the previous section we showed that GDDL is complete for record description by showing that the model on which it was based is complete. We now provide several practical examples of its completeness.
The first of these examples is a demonstration that GDDL contains the COBOL record description features as a proper subset. COBOL was chosen because it is the prototype for almost every DBMS DDL and for the CODASYL DDL effort. It has the most highly developed record description capabilities currently available. The demonstration is given in Appendix C, part 1. In Appendix C, part 2, three examples are given of record characteristics describable in GDDL but not in COBOL.
The remaining examples demonstrate the use of GDDL in describing real-world records. These record descriptions are part of larger examples of complete conversions of data from one structure to another. They are given in Appendix B.
CHAPTER 4
FILE DESCRIPTION
4.1 Introduction
This chapter is devoted to the study and description of organizations of records called files. We develop a model of file structures which is a very general extension of current concepts of files as analyzed in Chapter 2. This model leads to the technique for describing file structures that is incorporated in GDDL. This technique is illustrated in a series of examples which show that GDDL can describe several well-known file structures.
4.2 A Model of File Structures
In Chapter 3, we developed a model of records. In this chapter, we are concerned with the record as a basic unit of storage and retrieval. When large numbers of records are to be stored and retrieved, a problem of efficient utilization arises. For example, store time is conserved if data need not be rearranged each time a new record is stored. And search time is conserved if records can be so arranged that each record is stored physically next to the record that is needed next. Then, when the first record to be used is found, succeeding records can be directly accessed in the order of usage. However, when access to two or more records from a single record is required, a sequential ordering of records does not in itself provide the most efficient utilization.
A user, then, should conceive of the records as being connected together in some way by access paths.
These paths make a record at one point on a path accessible to records which occur at points previous to it on the path. They represent connections among the records in question that the user wants to exploit for storage and retrieval. We call such an organization of records the conceptual file structure. When this structure is implemented on a storage medium, it must be represented in some way by a string of bits.
As seen in Chapter 2, there are currently three ways in which the access paths of a file structure are implemented. If there is to be an access path from a record (say, A) to another record (say, B), it may be implemented by:
(1) sequencing position - the bit string representation for B is concatenated after the bit string representation for A (see Figure 4-1, a);
(2) embedding pointers in the records - a pointer to B (i.e., an encoding of the position that the bit string representation of B occupies in the record sequence) is included as a field in A (see Figure 4-1, b);
(3) arranging pointers in tables - a pointer to B is concatenated after the pointer to A in a sequence of pointers (called a table) which is maintained separately from the records themselves.
Ultimately, a pointer to B will give the physical address of the bit string representation of B when it is stored on a storage medium. How the actual bit string for a pointer can be obtained is discussed in Chapter 5, after we have considered the organization of storage media.
[Figure 4-1. Implementation of Access Paths: (a) By Sequencing; (b) By Embedding Pointers; (c) By Using Tables of Pointers. In the figure, bsr means bit string representation.]
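The three implementations of an access path can be illustrated with a minimal Python sketch. It is not part of the original report: byte strings stand in for bit string representations, and all function names and the use of -1 as a "no successor" marker are assumptions made for illustration.

```python
# Illustrative sketch only; names and conventions here are hypothetical.

def by_sequencing(records):
    """(1) The path A -> B is implied by B's representation following A's."""
    return b"".join(records)

def by_embedded_pointers(records, successor):
    """(2) Each stored record carries a pointer field: the position of its
    successor in the record sequence, or -1 (an assumed marker) if none."""
    return [(rec, successor.get(i, -1)) for i, rec in enumerate(records)]

def by_pointer_table(records, order):
    """(3) Pointers are kept in a table held separately from the records;
    the path runs from each table entry to the next."""
    return list(records), list(order)

def follow_embedded(stored, start):
    """Walk the embedded pointers from `start` (assumes no cycles)."""
    out, i = [], start
    while i != -1:
        rec, nxt = stored[i]
        out.append(rec)
        i = nxt
    return out

def follow_table(records, table):
    """Visit the records in the order given by the pointer table."""
    return [records[i] for i in table]
```

With records `[b"A", b"B", b"C"]`, sequencing fixes the order A, B, C, whereas either pointer scheme can express the order A, C, B without moving any record.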
We saw in Chapter 3 how the records themselves are encoded as bit strings. Now we must consider the rules for encoding the file structure into a bit string. If the file structure is to be implemented by sequencing, the rules must determine the sequence in which the bit strings representing the records occur. If the file structure is to be implemented by pointers, the rules must determine how the pointers are encoded into bit strings, where these bit strings must be positioned in relation to the bit strings of the records, and the sequence in which the bit strings of the records must occur. These rules will then determine a bit string which represents the file structure.
There are thus three components of this process:
(1) the conceptual file structure,
(2) the final bit string, and
(3) rules for encoding the conceptual file structure of records as a bit string.
We therefore have to model each of these components. The modelling of the conceptual structure is influenced by (Co 1970). The rules for encoding are modelled after the work of (HS 1970). The bit string is simply a sequence of 0's and 1's.
First, the conceptual file structure will be described. And secondly, the rules for encoding the file structure will be specified.
4.2.1 The Conceptual File Structure
We noted in the previous section that the file structure determines which records are connected by access paths. In other words, it determines a relation (called a file relation) among records on the basis of access paths. Consider two records which we will call A and B, such that either
(i) the bit string representation of B is concatenated after the bit string representation of A, or
(ii) there is a pointer from A to B.
Then we say that there is a direct access path from A to B. Relative to this path we call record A the head of the path and record B the tail of the path. This terminology allows us to refer to records connected by access paths without naming the specific records.
Definition 4-1. The file relation determined by access paths through a set of records consists of the set of ordered pairs < head record, tail record > for each direct access path.
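Definition 4-1 can be made concrete with a short Python sketch. This is an illustration under stated assumptions rather than anything given in the report: records are represented by their names, a stored sequence is a list, and pointers are a dictionary from a record to the records it points to. All function names are hypothetical.

```python
# Illustrative sketch only; the representation of records is an assumption.

def relation_from_sequence(sequence):
    """Condition (i): one pair <head, tail> per pair of adjacent records."""
    return {(sequence[i], sequence[i + 1]) for i in range(len(sequence) - 1)}

def relation_from_pointers(pointers):
    """Condition (ii): one pair <head, tail> per pointer from head to tail."""
    return {(head, tail) for head, tails in pointers.items() for tail in tails}

def file_relation(sequence=(), pointers=None):
    """The file relation: the set of all direct access paths."""
    return relation_from_sequence(list(sequence)) | relation_from_pointers(pointers or {})
```

For records stored in the order A, B, C with one extra pointer from A to C, `file_relation(["A", "B", "C"], {"A": ["C"]})` contains the three pairs (A, B), (B, C) and (A, C).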
As examples, consider that we are given a set of records, $S = \{r_1, \ldots, r_n\}$, where $r_i$ is a record for $1 \le i \le n$.
(1) The access paths of the list structure:
$r_1 \rightarrow r_2 \rightarrow \cdots \rightarrow r_{n-1} \rightarrow r_n$
give the relation $R_1 = \{\langle r_1, r_2 \rangle, \langle r_2, r_3 \rangle, \ldots, \langle r_{n-1}, r_n \rangle\}$.
(2) The access paths of the tree structure:
[Figure: tree structure]
give the relation $R_2 = \{\langle r_1, r_2 \rangle, \langle r_1, r_3 \rangle, \langle r_2, r_4 \rangle, \langle r_2, r_5 \rangle, \ldots, \langle r_{n-p}, r_n \rangle\}$.
(3) The access paths of the ring structure:
[Figure: ring structure]
give the relation $R_3 = \{\langle r_1, r_2 \rangle, \ldots, \langle r_{n-1}, r_n \rangle, \langle r_n, r_1 \rangle\}$.
It will be convenient to introduce the following terminology:
a) If the pair of records $\langle r_i, r_j \rangle$ is in a file relation $R$, then we say that there is a path of length 1 from $r_i$ to $r_j$ for relation $R$. Therefore, a direct access path has length one.
b) If the pair of records $\langle r_i, r_j \rangle$ is not in a file relation $R$, we say there is a path of length 0 from $r_i$ to $r_j$ for relation $R$.
c) If the pairs of records $\langle r_1, r_2 \rangle, \langle r_2, r_3 \rangle, \ldots, \langle r_n, r_{n+1} \rangle$ are in a file relation $R$, then we say that there is a path of length $n$ from $r_1$ to $r_{n+1}$ for relation $R$.
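The path terminology can be checked mechanically. The following Python sketch is an illustration only (the function names are hypothetical and records are represented by their names); it builds the list and ring relations for a small set of records and tests for a path of a given length, following clauses a) to c) literally.

```python
# Illustrative sketch only; names are hypothetical.

def list_relation(records):
    """Pairs <r_i, r_(i+1)> of a list structure."""
    return {(records[i], records[i + 1]) for i in range(len(records) - 1)}

def ring_relation(records):
    """A list structure closed by the pair <r_n, r_1>."""
    rel = list_relation(records)
    if records:
        rel.add((records[-1], records[0]))
    return rel

def has_path_of_length(relation, start, end, n):
    """True if `relation` has a path of length n from `start` to `end`.

    n == 0 follows clause b): the pair itself is not in the relation.
    n >= 1 follows clauses a) and c): a chain of n pairs links the two.
    """
    if n == 0:
        return (start, end) not in relation
    frontier = {start}
    for _ in range(n):
        # Advance one direct access path from every record reached so far.
        frontier = {tail for (head, tail) in relation if head in frontier}
    return end in frontier
```

In a ring of three records, for example, there is a path of length 3 from the first record back to itself, which a list structure over the same records does not provide.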
To model the conceptual file structure we must have a way to specify any file relation that a user may require. In general, there may be an arbitrarily large number of records that can be included in a file structure. Therefore, it is not practical for a user to state the file relation extensively by listing all the pairs of records. Instead, he can specify criteria over the records which will determine when two such records are to be in the relation. Thus, for two records A and B, < A, B > is a member of a file relation if and only if A and B satisfy the criteria for the relation. Such criteria can describe explicitly the conditions which must be met for two records to be connected by a direct access path. We provide below a set of production rules for specifying criteria.
At this point it is worth noting that in Chapter 3, we were only concerned with hierarchic organizations and so simple production rules were all that was necessary to specify record structures. However, to organize records into files, a far wider variety of organizations is required and, therefore, a more elaborate way of specifying them is needed.
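The idea of stating a file relation by criteria rather than by listing pairs can be sketched in Python. This is an illustration under assumptions, not the report's notation: records are dictionaries, the criterion is an ordinary predicate over a head and a tail record, and the field names `id` and `key` are hypothetical.

```python
# Illustrative sketch only; record layout and field names are assumptions.

def relation_by_criterion(records, criterion):
    """<A, B> is in the relation if and only if criterion(A, B) holds."""
    return {(a["id"], b["id"]) for a in records for b in records if criterion(a, b)}

def next_key(head, tail):
    """Hypothetical criterion: the tail's key is the head's key plus 10."""
    return tail["key"] == head["key"] + 10
```

Applied to three records with keys 10, 20 and 30, the criterion `next_key` yields the two pairs of a list structure without either pair being written out by the user.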
Definition 4-2. A file structure is a file relation determined by criteria obtained from the following production system:
Criterion Production System:
Primitives: attribute, bit string, character string, characteristic, integer, arithmetic relations (=, ≤, etc.), arithmetic operators (+, -, etc.), set membership relation (∈)
Rules to produce the names of records, fields, characteristics and paths:
index → (integer)
record-modifier → HEAD
                → X integer
attribute-form → attribute
               → attribute index
record-attribute → attribute record-modifier
                 → attribute
attribute-modifier → attribute-form
                   → attribute-form OF attribute-modifier
                   → attribute record-modifier
record-reference → record-attribute
                 → record-attribute criterion
field-reference → attribute-modifier
characteristic-reference → characteristic
path-reference → PATH ( record-reference, record-reference, criterion )
Rules to produce set-theoretic criterion:
constant → character string
         → bit string
set-member → field-reference
           → constant
           → set-member, set-member
set → {set-member}
set-criterion → field-reference ∈ set
              → characteristic-reference ∈ set
Rules to produce arithmetic criterion:
term → VALUE ( field-reference )
     → PARAMETER ( characteristic-reference )
     → LENGTH ( path-reference )
     → constant
relation-symbol → one of the arithmetic relations (=, ≤, >, ≥, etc.)
arithmetic-operator → one of the arithmetic operators (+, -, ×, etc.)
arithmetic-expression → term
                      → (arithmetic-expression) arithmetic-operator (arithmetic-expression)
arithmetic-criterion
implies that one target record is to be formed for each such value of $A_S$ (i.e., if there are n such values of $A_S$, then n target records are formed);
$\langle A_T;\ A_S(i);\ \ldots \rangle$ implies that only one target record is to be formed and the remaining values of $A_S$ are to be discarded (i.e., are not to be used as values for $A_T$ in other target records);
2) when a target attribute $A_T$ repeats an unlimited number of times, then specifying $\langle A_T;\ A_S;\ \ldots \rangle$ implies that $A_T$ will repeat exactly as many times as $A_S$ repeats;
3) when a target attribute $A_T$ repeats either a fixed or bounded number of times, say m, then specifying
a) $\langle A_T;\ A_S;\ \ldots \rangle$ implies that whenever the number of $A_T$ repetitions is less than the number of $A_S$ repetitions, then target records are to be formed such that each value of $A_S$ appears in some target record;
b) $\langle A_T(m);\ A_S(m);\ \ldots \rangle$ implies that whenever the number of $A_T$ repetitions is less than the number of $A_S$ repetitions, then only one target record is to be formed with the ith value of $A_S$ as the ith value of $A_T$, and the remaining source values of $A_S$ are to be discarded.
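The difference between cases 3a and 3b can be illustrated with a small Python sketch. It is a reading of the two rules under assumptions, not the report's algorithm: source values are a list, m is the bound on target repetitions, and case 3a is realised by filling target records m values at a time, which is only one way of ensuring that every source value appears in some target record. Both function names are hypothetical.

```python
# Illustrative sketch only; the chunking strategy for case 3a is an assumption.

def spread_over_records(source_values, m):
    """Case 3a: form as many target records as needed so that every source
    value appears in some target record, at most m values per record."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return [source_values[i:i + m] for i in range(0, len(source_values), m)]

def truncate_to_one_record(source_values, m):
    """Case 3b: form one target record from the first m source values and
    discard the remaining source values."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return [source_values[:m]]
```

With five source values and m = 2, the first function forms three target records, while the second forms one target record and discards three values.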
6.4 Applications of the Model of the Association List
Example 1. Extraction of a New File from an Existing File
Consider a source file F1 whose records are described in the following way:
i) The structures of the records are described by the set of productions P1:
record structure → structure R1
structure R1 → < person, {substructure R1R1} >
substructure R1R1 → substructure R11, substructure R1R2
substructure R1R2 → substructure R12, substructure R1R3
substructure R1R3 → substructure R13, substructure R14
substructure R11 → name
substructure R12 → age
substructure R13 → sex
substructure R14 → null
substructure R14 → substructure R14, substructure R14
substructure R14 → structure R14
structure R14 → < book, {substructure R14R1} >
substructure R14R1 → substructure R141, substructure R14R2
substructure R14R2 → substructure R142, substructure R143
substructure R141 → title
substructure R142 → pages
substructure R143 → date
ii) The encoding of the records is specified by a set of characteristics C1 (the exact specification of these characteristics is not required for the purpose of this example).
Consider a target file F2 whose records are described in the following way:
i) The structures of the records are described by the set of productions P2:
record structure → structure R2
structure R2 →
< author, F3; name, F1; criterion >
in a new target record. To find values for the target attribute 'author', all records are checked to see if they contain a value of 'title' equal to the value of 'title' obtained for the target attribute 'title' (in this case SCIENCE II). The two records shown above contain such a value. Therefore, they are used as sources for the values of the target attribute 'author'. In this way, the values JONES and ROE are obtained. Finally, the values for 'date' and 'pages' are obtained from the same group 'book' in the same record which was the source of the value SCIENCE II. In this way, the target record for SCIENCE II is formed.
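The extraction described in this example can be followed in a short Python sketch. It is an illustration only. The nested dictionaries mirror the person and book structure of P1, and the values JONES, ROE and SCIENCE II come from the text, but every other value (ages, sex codes, page counts, dates, the second title) and the function name are hypothetical.

```python
# Illustrative sketch only; sample values other than JONES, ROE and
# SCIENCE II are invented for the example.

source_file = [
    {"name": "JONES", "age": 40, "sex": "M",
     "book": [{"title": "SCIENCE II", "pages": 300, "date": 1969}]},
    {"name": "ROE", "age": 35, "sex": "F",
     "book": [{"title": "SCIENCE II", "pages": 300, "date": 1969},
              {"title": "ALGEBRA", "pages": 120, "date": 1965}]},
]

def extract_books(people):
    """Form one target record per distinct title found in the source file."""
    targets, seen = [], set()
    for person in people:
        for book in person["book"]:
            title = book["title"]
            if title in seen:
                continue
            seen.add(title)
            # Check all source records for a 'title' equal to this one;
            # each such record supplies a value for 'author'.
            authors = [p["name"] for p in people
                       if any(b["title"] == title for b in p["book"])]
            # 'date' and 'pages' come from the same 'book' group that
            # supplied the title.
            targets.append({"title": title, "author": authors,
                            "date": book["date"], "pages": book["pages"]})
    return targets
```

For this sample file the first target record carries the title SCIENCE II with the authors JONES and ROE.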
6.5 The Relationship Between the Model and GDDL
The model of an association list defined in the previous section provides a means for explicitly stating how target data items are formed from source data items during conversion. GDDL's ability to describe data conversion has been defined in terms of this model and thus provides similar capabilities. We will now show how the model and GDDL are related.
GDDL's ASSOCIATE statement (see Appendix A, Section 2.3.1.1) is an exact image of the association list six-tuples. Target and source file names appear as part of the target and source names (parameters i) and ii)). The SOURCE (attribute-modifier, criterion) naming scheme appears explicitly as GDDL's SOURCE statement (see Appendix A, Section 2.3.1.3). Thus, we conclude that GDDL can specify any association list that can be defined using the model.
6.6 The Conversion Process
The association list completes the information needed to describe explicitly how data is to be converted from one organization to another. In this section we will see how and where each component of the description for the source and target files together with the association list is used during the conversion process.
In Figure 6-1, we showed that the conversion process consists of essentially three parts. First, the source file is broken down into its component data items using the source description; second, the target data items are formed using values obtained from source data items; and lastly, the target data items are structured and encoded according to the target description.
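The three parts of the conversion process can be sketched as a small Python pipeline. This is a toy illustration and not the process of Figure 6-3. The source description is reduced to a fixed-width field layout, the association list to pairs of target and source attribute names, and the target description to a field order with a separator. All of these simplifications, and all function names, are assumptions.

```python
# Illustrative sketch only; the three "descriptions" are drastically simplified.

def decode_source(raw, layout):
    """Part 1: break the stored record into data items using the source
    description (here: a list of (attribute, width) pairs)."""
    items, pos = {}, 0
    for attribute, width in layout:
        items[attribute] = raw[pos:pos + width].strip()
        pos += width
    return items

def form_target_items(source_items, association):
    """Part 2: form target data items from source values (here: a list of
    (target attribute, source attribute) pairs)."""
    return {target: source_items[source] for target, source in association}

def encode_target(target_items, order, separator=","):
    """Part 3: structure and encode the target data items according to the
    target description (here: a field order and a separator)."""
    return separator.join(target_items[attribute] for attribute in order)

def convert(raw, layout, association, order):
    """Run the three parts in sequence for one record."""
    return encode_target(form_target_items(decode_source(raw, layout), association), order)
```

For example, the record `"JONES     40M"` with layout name (10), age (2), sex (1), association author from name and years from age, and target order years then author, is converted to `"40,JONES"`.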
Figure 6-3, which is a detailed treatment of the conversion process, essentially reflects these same three stages in the instance of conversion from several source files to several target files. Figure 6-3(a) shows how source descriptions are used to read the source files from the storage media and break the bit string representation down into data items, and how the association list controls the process. Figure 6-3(b) shows how the target data items are formed, and Figure 6-3(c) shows how these data items are organized into a target file and written onto the storage media.
Figure 6-3 is not an algorithm for converting data. It only shows the order in which description components are used for extracting a single data item from a source file, and for converting the value of this data item into part of the target file. In conversion proper, when large numbers of data items must be extracted, much of the processing for each data item will be done in parallel with that for other data items for efficiency considerations.
Let us follow the conversion process using Figure 6-3. We will assume that the process is underway and several records for a particular target file have already been constructed. Some of the data items for the next target record have already been formed and we will now follow the formation of the next data item. The target record structure determines the attribute for this next data item.
We must now begin at the top of Figure 6-3(a). The association list (1) identifies which source file contains the attribute whose value will be combined with the target attribute. The storage structure description (2) for that source file is used to determine which blocks must be read (i.e., which blocks contain records of the file).
The storage encoding characteristics (3) are needed to read these blocks off the storage medium and to remove any labels. Once the bit string representation of the file is obtained, the association list (4) identifies which source record is needed. To locate and extract the bit string representation of the record, the criterion used for sequencing the records (5) and the file encoding