UNIT I INTRODUCTION TO DBMS


File Systems Organization - Sequential, Pointer, Indexed, Direct - Purpose of Database System - Database System Terminologies - Database Characteristics - Data Models - Types of Data Models - Components of DBMS - Relational Algebra. LOGICAL DATABASE DESIGN: Relational DBMS - Codd's Rule - Entity-Relationship Model - Extended ER - Normalization - Functional Dependencies, Anomaly - 1NF to 5NF - Domain Key Normal Form - Denormalization.

1. File Systems Organization -Sequential, Pointer, Indexed, Direct

A file is a collection or set (ordered or unordered) of data elements stored on storage media. A system software module is responsible for managing (reading, processing, deleting, etc.) a file.

Logical structure of a file
Field: The smallest (fixed) indivisible logical unit of a file; a field holds a part of some data value.
Record: A set of logically related fields. The cardinality of this set may be fixed or variable, i.e., a record size may be fixed or variable.
A file, therefore, is a collection of logically related records.

A record can be of fixed or variable size; most commonly, fixed-size records are used. The file structure has been illustrated here in tabular form, but note that a file is not just a table: its records can be organized in many different ways, and each way defines a unique organization.

Operations on a file:
Create: Create a new file.
Write: Write to a file.
Read: Read a part of a file.
Rewind: Go to the beginning of the file.
Delete: Remove a file from the system.
Close: Close a file (at the end of manipulation).
Open: Open a file for processing.
Modify: Modify a field value.

One of the main objectives of file organization is to speed up file retrieval, that is, to reduce I/O time. This can be done by improving disk access speed, but that has a limit because of hardware constraints. The other way is to store file records in such a way that they can be accessed concurrently or in parallel. A hardware solution to this is the RAID structure: a RAID is an array of several disks which can be accessed in parallel. No single file organization is the most efficient for all types of data access, so a database management system uses a number of different file organizations for storing the same set of data. There are basically three categories of file organization: (a) sequential organization, (b) organization using an index, and (c) random (direct) organization. We begin with the sequential category; files of this category are called sequential files. For each organization we will (a) understand the file organization, (b) understand how a set of operations is performed, and (c) analyze the time required to perform each operation. This will help us identify which file organization is suitable for which kind of data processing.

Sequential file: In this organization, records are written consecutively when the file is created. Records in a sequential file can be stored in two ways; each way identifies a file organization.
• Pile file: Records are placed one after another as they arrive (no sorting of any kind).
• Sorted file: Records are placed in ascending or descending order of the values of the primary key.

Pile file: The total time to fetch (read) a record from a pile file depends on the seek time (s), the rotational latency (r), the block transfer time (btt), and the number of blocks in the file (B). We also need the key field value (K) of the record. In a pile file we do not know the address of a record, so we search sequentially. The possibilities are: (a) the desired record may be in the first block, (b) it may be in some middle block, or (c) it may be in the last block. The fetching (searching) process is shown in the flow diagram (Figure 1), and the example illustrates the time involved.
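As a rough sketch of the time involved (a standard average-case approximation, not a formula quoted from the figure): about half of the B blocks must be scanned, on average, before the record with key K is found, so the expected fetch time is approximately

T(fetch) ≈ s + r + (B/2) × btt

while a full scan costing about s + r + B × btt is needed to conclude that the record is not present.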

Figure 1. Search in a sequential file organization.

File reorganization: During file reorganization, all records that are marked as deleted are removed, and all inserted records are moved to their correct place (sorting). The reorganization steps are:
• read the entire file (all blocks) into RAM;
• remove all the deleted records;
• write all the modified blocks to a different place on the disk.

Inserting a record: To insert a record, it is placed at the end of the file. There is no need to sort the file (in ascending or descending order). However, the file may contain duplicate records.

Deleting or modifying a record: This requires fetching the block containing the record, finding the record in the block and marking it deleted, then writing the modified block back to the disk.

Sorted Sequential File: In a sorted file, first the record is inserted at the end of the file and then moved to its correct location (ascending or descending). Records are stored in order of the values of the key field.

A sequential file organization

A sequential file usually has an overflow area. This area avoids sorting the file after every deletion, insertion and/or modification. All records added after the file was first populated go to this overflow area. The overflow area is not itself sorted; it is a pile file with fixed-size records. At some convenient time the entire file is sorted, and the records from the overflow area move to their correct positions in the main file.

Retrieval: Records are retrieved in a consecutive order. The order of record storage determines order of retrieval. During retrieval several required operations (partial result output etc.) can be performed simultaneously.

Insert, delete and modify (update): Sequential files are usually not updated in place; each operation generates a new version of the file. The new version contains the up-to-date information, and the old version is usually saved for recovery. The new version then becomes the old version, and the next update to the file uses it to generate the next new version. The intensity or frequency of use of a sequential file is characterized by a parameter called the "hit ratio", defined as:

Hit ratio = (number of records accessed in a run) / (total number of records in the file)


Desirable: a high hit ratio value. This means a large proportion of the records is accessed to respond to a query or run, which is what sequential processing handles well. Interactive transactions have a very low hit ratio.

Advantages of sequential file • Good for batch transactions. • Simple to implement • Good for report generation, statistical computation and inventory control.

Disadvantages • Not good for interactive transactions • High overheads in file processing for simple queries.

Index File Organization

Index organization tries to reduce access time; it may not reduce the storage requirement of a file. Some important terms of this organization are:

Index: An index is similar to a pointer with an additional property: this additional property allows the index to identify the record it is pointing to. For example, an index to a record of employees points to and identifies an employee record. This means that an index has some semantics associated with it. Formally, an index maps the key space to the record space. An index for a file may be created on the primary or secondary keys.

Index types: primary, nondense, and dense.

Primary index: An ordered file of index records. This file contains index records of fixed length. Each index record has two fields:
• one field holds the primary key of the data file record;


• the other holds a pointer to the disk block where the corresponding record of the data file is stored.

Nondense index: the number of entries in the index file is less than the number of records in the data file. Dense index: the index file has one entry for every record of the data file.

(i) SELECT Operation: σ<selection condition>(R)

The SELECT operation is used to select a subset of the tuples of a relation R that satisfy a selection condition. The symbol σ (sigma) denotes the select operator; it is a unary operator because it operates on a single relation. For example:

σ(DNO=4 AND SALARY>25000) OR (DNO=5 AND SALARY>30000)(EMPLOYEE)
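For comparison, the same selection can be written in SQL (assuming an EMPLOYEE table with DNO and SALARY columns):

SELECT *
FROM EMPLOYEE
WHERE (DNO = 4 AND SALARY > 25000) OR (DNO = 5 AND SALARY > 30000);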

(ii) PROJECT Operation: π<attribute list>(R)

The PROJECT operation is used to select certain columns from the table and discard the other columns. The symbol π (pi) represents the project operator. It is a unary operator because it operates on one relation.

In π<attribute list>(R), <attribute list> is a list of attributes drawn from the attributes of relation R.

The result of the PROJECT operation has only the attributes specified in <attribute list>, in the same order as they appear in the list.



The degree of the relation resulting from PROJECT is equal to the number of attributes in <attribute list>.



If the attribute list contains only non-key attributes, duplicate tuples are likely to appear; the PROJECT operation removes any duplicates, so the result is a set of tuples. This is known as duplicate elimination.



The number of tuples in the resultant relation is always less than or equal to the number of tuples in R

< list1> ( < list2> (R) ) = < list1> (R) 

PROJECT operation is not commutative.

Eg. LNAME, FNAME, SALARY(EMPLOYEE)


(iii) RENAME Operation: ρ

The rename operator is used to rename either the relation name or the attribute names, or both. The ρ (rho) operator denotes the rename operation. It is a unary operator. Its three forms are:

(i) ρS(B1,B2,…,Bn)(R) – renames both the relation and its attributes
(ii) ρS(R) – renames the relation only
(iii) ρ(B1,B2,…,Bn)(R) – renames the attributes only

The attributes are renamed in the same order as they appear in R.

Eg. ρFN,LN,SAL(πFNAME,LNAME,SALARY(σDNO=5(EMPLOYEE)))

2. RELATIONAL ALGEBRA – SET OPERATORS:

Set theoretic operations are binary operations that are applied to two sets of tuples.

Union compatibility: The two relations on which these operations are applied must have the same type of tuples; this condition is called union compatibility. Two relations R(A1,…,An) and S(B1,…,Bn) are said to be union compatible if they have the same degree n (the same number of attributes) and dom(Ai) = dom(Bi), i.e., each corresponding pair of attributes has the same domain.

Set operators:



• Union (∪)
• Intersection (∩)
• Set Difference (−)
• Cartesian Product (×)

Union, intersection and set difference are defined only on union-compatible relations.

(i) Union: The result of the operation, denoted by R ∪ S, is a relation that includes all tuples that are either in R, or in S, or in both R and S. Duplicate tuples are eliminated.

(ii) Intersection: The result of the operation, denoted by R ∩ S, is a relation that includes all tuples that are in both R and S.

Both union and intersection are commutative operations, can be treated as n-ary operations applicable to any number of relations, and are associative:
R ∪ S = S ∪ R and R ∩ S = S ∩ R
R ∪ (S ∪ T) = (R ∪ S) ∪ T and R ∩ (S ∩ T) = (R ∩ S) ∩ T

(iii) Set difference: The result of this operation, denoted by R − S, is a relation that includes all tuples that are in R but not in S. The minus operation is not commutative: R − S ≠ S − R.
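In SQL these operators map to UNION, INTERSECT and EXCEPT (some systems use MINUS instead of EXCEPT); a sketch using the STUDENT and INSTRUCTOR tables shown below:

SELECT FN, LN FROM STUDENT
UNION
SELECT FNAME, LNAME FROM INSTRUCTOR;

SELECT FN, LN FROM STUDENT
INTERSECT
SELECT FNAME, LNAME FROM INSTRUCTOR;

SELECT FN, LN FROM STUDENT
EXCEPT
SELECT FNAME, LNAME FROM INSTRUCTOR;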

STUDENT
FN       LN
Suresh   Rao
Bala     Ganesh
Ravi     Sankar
Mohan    Varma

INSTRUCTOR
FNAME    LNAME
Kishore  Das
Sushil   Sharma
Suresh   Rao
Rakesh   Agarwal
Bala     Ganesh
Ashwin   Kumar

STUDENT ∪ INSTRUCTOR
FN       LN
Suresh   Rao
Bala     Ganesh
Ravi     Sankar
Mohan    Varma
Kishore  Das
Sushil   Sharma
Rakesh   Agarwal
Ashwin   Kumar

STUDENT ∩ INSTRUCTOR
FN       LN
Suresh   Rao
Bala     Ganesh

STUDENT − INSTRUCTOR
FN       LN
Ravi     Sankar
Mohan    Varma

INSTRUCTOR − STUDENT
FN       LN
Kishore  Das
Sushil   Sharma
Rakesh   Agarwal
Ashwin   Kumar

(iv) Cartesian Product (or Cross Product) Operation: It is denoted by X. It is a binary set operation, but the relations on which it is applied need not be union compatible. This operation is used to combine tuples from two relations in a combinatorial fashion. The Cartesian product creates tuples with combined attributes of two relations. R(A1, A2,.. An) X S(B1,B2.. Bm) is a relation Q with degree n+m attributes Q(A1,.. An,B1,….Bm). The resultant relation Q has one tuple for each combination of tuples – one from R and one from S.

If R has nR tuples and S has nS tuples then R X S will have nR * nS Tuples.


The CARTESIAN PRODUCT applied by itself generally produces meaningless tuples. It is useful when followed by a selection that matches values of attributes coming from the component relations. For example:

FEMALE_EMPS ← σSEX='F'(EMPLOYEE)
EMPNAMES ← πFNAME, LNAME, SSN(FEMALE_EMPS)
EMP_DEPENDENTS ← EMPNAMES × DEPENDENT
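A rough SQL rendering of this sequence (assuming EMPLOYEE and DEPENDENT tables with the columns shown; in practice a condition such as SSN = ESSN would also be added to keep only meaningful combinations):

SELECT E.FNAME, E.LNAME, E.SSN, D.*
FROM EMPLOYEE E CROSS JOIN DEPENDENT D   -- Cartesian product
WHERE E.SEX = 'F';                       -- selection on the combined tuples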

3. BINARY RELATIONAL ALGEBRA OPERATIONS

(i) JOIN Operation:

The JOIN operation, denoted by ⋈, is used to combine related tuples from two relations into single tuples. It allows us to process relationships among relations.

R ⋈<join condition> S

The join condition is of the form <condition> AND <condition> AND …, and each condition is of the form Ai θ Bj, where Ai is an attribute of R, Bj is an attribute of S, and Ai and Bj must have the same

domain, and θ is one of the comparison operators {=, <, ≤, >, ≥, ≠}. A join with such a general join condition is called a theta join. The result of the join is a relation Q with n + m attributes, Q(A1,…,An, B1,…,Bm). Q has one combined tuple (a tuple from R concatenated with a tuple from S) whenever the combination satisfies the join condition. The join condition specified on attributes from the two relations R and S is evaluated for each combination of tuples, and each combination for which the condition evaluates to true is included in the result.

Equi join: A join that involves only equality comparisons (only the = operator) in the join condition is known as an equi join.

DEPT_MGR ← DEPARTMENT ⋈MGRSSN=SSN EMPLOYEE

Natural join: Denoted by *. An equi join in which the join attributes have the same name in both relations; the attribute involved in the join condition is known as the join attribute.

DEPT_LOCS ← DEPARTMENT * DEPT_LOCATIONS

Outer join: In an ordinary JOIN, tuples without a matching tuple, and tuples with null values for the join attributes, are eliminated from the result. Outer joins can be used to keep all the tuples of R, or of S, or of both relations in the result of the join even if they do not have matching tuples in the other relation; they preserve tuples whether or not they match the join condition.

Left outer join: Keeps every tuple of the left relation R in R ⟕ S even if no matching tuple is found in S; the attributes of S in such a result tuple are filled with null values. It is denoted by ⟕.

For example, to find the names of all employees, along with the names of the departments they manage (if any), we could use:

T1 ← EMPLOYEE ⟕SSN=MGRSSN DEPARTMENT
RESULT ← πFNAME,MINIT,LNAME,DNAME(T1)
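An SQL sketch of the same query (assuming EMPLOYEE and DEPARTMENT tables with SSN and MGRSSN columns):

SELECT E.FNAME, E.MINIT, E.LNAME, D.DNAME
FROM EMPLOYEE E LEFT OUTER JOIN DEPARTMENT D
     ON E.SSN = D.MGRSSN;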

Right outer join: Keeps every tuple of the right relation S in R ⟖ S even if no matching tuple is found in R; the attributes of R in such a result tuple are filled with null values. It is denoted by ⟖.

Full outer join: Keeps all tuples of both relations R and S in R ⟗ S, even if no matching tuples are found, filling in null values where needed. It is denoted by ⟗.

The main difference between Cartesian product and join: in a join, only the combinations of tuples satisfying the join condition appear in the result, whereas in a Cartesian product all combinations of tuples are included.

(ii) DIVISION operator (÷): The division operator is applied to two relations as R(Z) ÷ S(X), where X ⊆ Z. Let Y = Z − X be the set of attributes of R that are not in S. The result of the division is a relation T(Y): for a tuple t to appear in the result T, the values in t must appear in R in combination with every tuple in S.

Relations R(A, B) and S(B), and the division R ÷ S:

R:  A  B        S:  B        R ÷ S:  A
    α  1            1                α
    α  2            2
    α  3
    β  1
    γ  1

Relations r(A, B, C, D, E) and s(D, E), and the division r ÷ s:

r:  A  B  C  D  E        s:  D  E        r ÷ s:  A  B  C
    α  a  α  a  1            a  1                α  a  γ
    α  a  γ  a  1            b  1                γ  a  γ
    α  a  γ  b  1
    β  a  γ  a  1
    β  a  γ  b  3
    γ  a  γ  a  1
    γ  a  γ  b  1
    γ  a  β  b  1
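Division has no direct SQL operator; a common formulation (sketched here for the first example, assuming tables named R and S) uses a double NOT EXISTS to keep only those A values that are paired in R with every B value of S:

SELECT DISTINCT r1.A
FROM R r1
WHERE NOT EXISTS (
    SELECT * FROM S s               -- a B value of S ...
    WHERE NOT EXISTS (
        SELECT * FROM R r2          -- ... that is not paired with this A in R
        WHERE r2.A = r1.A AND r2.B = s.B));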

Null Values
• It is possible for tuples to have a null value, denoted by null, for some of their attributes.
• null signifies an unknown value or that a value does not exist.
• The result of any arithmetic expression involving null is null.
• Aggregate functions simply ignore null values (as in SQL).
• For duplicate elimination and grouping, null is treated like any other value, and two nulls are assumed to be the same (as in SQL).

4. Aggregate Functions and Grouping

The relational algebra allows us to specify mathematical aggregate functions on collections of values from the database. Examples of such functions include retrieving the average or total salary of all employees or the number of employee tuples. Common functions applied to collections of numeric values include SUM, AVERAGE, MAXIMUM, and MINIMUM. The COUNT function is used for counting tuples or values. Another common use is grouping the tuples of a relation by the value of some of their attributes and then applying an aggregate function independently to each group. An example would be to group employee tuples by DNO, so that each group includes the tuples for employees working in the same department. We can then list each DNO value along with, say, the average salary of employees within that department. The operation is written as:

<grouping attributes> ℱ <function list>(R)

where <grouping attributes> is a list of attributes of the relation specified by R, and <function list> is a list of (<function> <attribute>) pairs. In each such pair, <function> is one of the allowed functions—such as SUM, AVERAGE, MAXIMUM, MINIMUM, COUNT—and <attribute> is an attribute of the relation specified by R. The resulting relation has the grouping attributes plus one attribute for each element in the function list.

1. ℱMAXIMUM Salary(EMPLOYEE) retrieves the maximum salary value from the EMPLOYEE relation:

   MAXIMUM_Salary
   55000

2. Retrieve each department number, the number of employees in the department, and their average salary:

   DNO ℱ COUNT SSN, AVERAGE Salary(EMPLOYEE)

3. Without the grouping attributes, the functions are applied to all tuples together:

   ℱ COUNT SSN, AVERAGE Salary(EMPLOYEE)
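The second example corresponds directly to an SQL GROUP BY query (assuming the EMPLOYEE table used earlier):

SELECT DNO, COUNT(SSN) AS NO_OF_EMPLOYEES, AVG(SALARY) AS AVERAGE_SAL
FROM EMPLOYEE
GROUP BY DNO;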

Note: The results of aggregate functions are still relations, not numbers.

7. Logical Database Design: Relational DBMS - Codd's Rule

A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model as introduced by E. F. Codd. Most popular commercial and open-source databases currently in use are based on the relational model. A short definition of an RDBMS is a DBMS in which data is stored in the form of tables and the relationships among the data are also stored in the form of tables. E. F. Codd, the famous mathematician, introduced 12 rules for the relational model for databases, commonly known as Codd's rules. The rules mainly define what is required for a DBMS to be considered relational, i.e., an RDBMS. There is also one more rule, Rule 0, which specifies that the relational model should use the relational way to manage the database. The rules and their descriptions are as follows:

Rule 0: Foundation Rule
A relational database management system should be capable of using its relational facilities (exclusively) to manage the database.

Rule 1: Information Rule

All information in the database is to be represented in one and only one way: by values in column positions within rows of tables.

Rule 2: Guaranteed Access Rule
All data must be accessible without ambiguity; that is, each and every datum (atomic value) is guaranteed to be logically accessible by a combination of table name, primary key value and column name.

Rule 3: Systematic Treatment of Null Values
Null values (distinct from the empty character string or a string of blank characters, and distinct from zero or any other number) are supported in a fully relational DBMS for representing missing information in a systematic way, independent of data type.

Rule 4: Dynamic On-line Catalog Based on the Relational Model
The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data. Authorized users can access the database structure using a common language, i.e., SQL.

Rule 5: Comprehensive Data Sublanguage Rule
A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings, and that is comprehensive in supporting all of the following:
a. data definition
b. view definition
c. data manipulation (interactive and by program)
d. integrity constraints
e. authorization
f. transaction boundaries (begin, commit, and rollback).

Rule 6: View Updating Rule

All views that are theoretically updateable are also updateable by the system.

Rule 7: High-level Insert, Update, and Delete
The system must fully support insert, update and delete operations, and it must be able to perform these operations on multiple rows simultaneously.

Rule 8: Physical Data Independence
Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.

Rule 9: Logical Data Independence
Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables.

Rule 10: Integrity Independence
Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.

Rule 11: Distribution Independence
The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.

Rule 12: Nonsubversion Rule
If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.

ENTITY-RELATIONSHIP MODEL

• The basic object that the ER model represents is an entity, which is a "thing" in the real world with an independent existence.
• An entity may be an object with a physical existence—a particular person, car, house, or employee—or it may be an object with a conceptual existence—a company, a job, or a university course.
• Each entity has attributes—the particular properties that describe it.
• For example, an employee entity may be described by the employee's name, age, address, salary, and job.
• A particular entity will have a value for each of its attributes.
• The attribute values that describe each entity become a major part of the data stored in the database.

Types of attributes:
• Simple versus Composite
• Single-Valued versus Multivalued
• Stored versus Derived

Composite Versus Simple (Atomic) Attributes

Composite attributes can be divided into smaller subparts, which represent more basic attributes with independent meanings. For example, the Address attribute of the employee entity can be subdivided into StreetAddress, City, State, and Zip, with the values "2311 Kirby," "Houston," "Texas," and "77001." Attributes that are not divisible are called simple or atomic attributes. Composite attributes can form a hierarchy; for example, StreetAddress can be subdivided into three simple attributes, Number, Street, and ApartmentNumber. The value of a composite attribute is the concatenation of the values of its constituent simple attributes.

Composite attributes are useful to model situations in which a user sometimes refers to the composite attribute as a unit but at other times refers specifically to its components. If the composite attribute is referenced only as a whole, there is no need to subdivide it into component attributes. For example, if there is no need to refer to the individual components of an address (Zip, Street, and so on), then the whole address is designated as a simple attribute. Singlevalued Versus Multivalued Attributes Most attributes have a single value for a particular entity; such attributes are called singlevalued. For example, Age is a single-valued attribute of person. In some cases an attribute can have a set of values for the same entity—for example, a Colors attribute for a car, or a CollegeDegrees attribute for a person. Cars with one color have a single value, whereas two-tone cars have two values for Colors. Similarly, one person may not have a college degree, another person may have one, and a third person may have two or more degrees; so different persons can have different numbers of values for theCollegeDegrees attribute. Such attributes are called multivalued. A multivalued attribute may have lower and upper bounds on the number of values allowed for each individual entity. For example, the Colors attribute of a car may have between one and three values, if we assume that a car can have at most three colors. Stored Versus Derived Attributes In some cases two (or more) attribute values are related—for example, the Age and BirthDate attributes of a person. For a particular person entity, the value of Age can be determined from the current (today‘s) date and the value of that person‘s BirthDate. The Age attribute is hence called a derived attribute and is said to be derivable from the BirthDate attribute, which is called a stored attribute. Some attribute values can be derived from related entities; for example, an attribute NumberOfEmployees of a department entity can be derived by counting the number of employees related to (working for) that department.

Entity Types, Entity Sets, Keys, and Value Sets

Entity Types and Entity Sets
o A database usually contains groups of entities that are similar. For example, a company employing hundreds of employees may want to store similar information concerning each of the employees. These employee entities share the same attributes, but each entity has its own value(s) for each attribute.

o An entity type defines a collection (or set) of entities that have the same attributes. Each entity type in the database is described by its name and attributes. o Figure shows two entity types, named EMPLOYEE and COMPANY, and a list of attributes for each. A few individual entities of each type are also illustrated, along with the values of their attributes. o The collection of all entities of a particular entity type in the database at any point in time is called an entity set; the entity set is usually referred to using the same name as the entity type. For example, EMPLOYEE refers to both a type of entity as well as the current set of all employee entities in the database. o An entity type is represented in ER diagrams as a rectangular box enclosing the entity type name. o Attribute names are enclosed in ovals and are attached to their entity type by straight lines. o Composite attributes are attached to their component attributes by straight lines. o Multivalued attributes are displayed in double ovals. o An entity type describes the schema or intension for a set of entities that share the same structure. o The collection of entities of a particular entity type are grouped into an entity set, which is also called the extension of the entity type. Key Attributes of an Entity Type o An important constraint on the entities of an entity type is the key or uniqueness constraint on attributes. o An entity type usually has an attribute whose values are distinct for each individual entity in the collection. Such an attribute is called a key attribute, and its values can be used to identify each entity uniquely. o For example, the Name attribute is a key of the COMPANY entity type in Figure, because no two companies are allowed to have the same name. For the PERSON entity type, a typical key attribute is SocialSecurityNumber. o Sometimes, several attributes together form a key, meaning that the combination of the attribute values must be distinct for each entity. If a set of attributes possesses this property, we can define a composite attribute that becomes a key attribute of the entity type. o In ER diagrammatic notation, each key attribute has its name underlined inside the oval, 37 http://www.francisxavier.ac.in

Value Sets (Domains) of Attributes o Each simple attribute of an entity type is associated with a value set (or domain of values), which specifies the set of values that may be assigned to that attribute for each individual entity. If the range of ages allowed for employees is between 16 and 70, we can specify the value set of the Age attribute of EMPLOYEE to be the set of integer numbers between 16 and 70.

Extended Entity-Relationship (EER) Model The extended entity-relationship (EER) model is a language for definition of structuring (and functionality) of database or information systems. It uses inductive development of structuring. Basic attributes are assigned to base data types. Complex attributes can be constructed by applying constructors such as tuple, list or set constructors to attributes that have already been constructed. Entity types conceptualize structuring of things of reality through attributes. Relationship types associate types that have already been constructed into an association type. The types may be restricted by integrity constraints and by specification of identification of objects defined on the corresponding type. Typical integrity constraint of the extended entityrelationship model is participation, lookup and general cardinality constraints. Entity, cluster and relationship classes contain a finite set of objects defined on these types. The types of an EER schema are typically depicted by an EER diagram.

The extended entity-relationship model is mainly used as a language for conceptualization of the structure of information systems applications. Conceptualization of database or information systems aims in a representation of the logical and physical structure of an information system in a given database management system (or for a database paradigm), so that it contains all the information required by the user and required for the efficient behavior of the whole information system for all users. Furthermore, conceptualization may target to specify the database application processes and the user interaction. Description of structuring is currently the main use of the extended ER model. The diagram representation of EER schemata uses rectangles and diamonds for the entity and relationship types.


Functional Dependencies, Anomaly - 1NF to 5NF:

12.1 Functional Dependencies

A functional dependency is a constraint between two sets of attributes from the database. Let R be the relational schema R = {A1, A2, …, An}. A functional dependency, denoted by X → Y, between two sets of attributes X and Y that are subsets of R specifies a constraint on the possible tuples that can form a relation state r of R.

X → Y is read as: X functionally determines Y, or there is a functional dependency from X to Y, or Y is functionally dependent on X.

X is the L.H.S. of the F.D. and Y is the R.H.S. of the F.D.

Definition 1: X functionally determines Y in a relational schema R iff, whenever two tuples of r(R) agree on their X value, they must necessarily agree on their Y value. An F.D. is a property of the semantics or meaning of the attributes.

Definition 2: For any two tuples t1 and t2 in r, if t1[X] = t2[X], then t1[Y] = t2[Y]; i.e., the values of the Y component of a tuple in r depend on and are determined by the values of the X component.

Eg. SSN → ENAME, i.e., SSN uniquely determines ENAME.
Pnumber → {Pname, Plocation}
{SSN, Pnumber} → Hours

Relation states r(R) that satisfy the functional dependency constraints are called legal extensions or legal relation states.

Diagrammatic notation: An arrow is drawn from the L.H.S. to the R.H.S. of an f.d., for example:

SSN → ENAME
DNO → {DNAME, MGRSSN}

Inference rules: Let F be the set of f.d.s specified on R. Numerous other functional dependencies also hold in every relation state that satisfies F. The set of all such dependencies inferred from F is called the closure of F, denoted by F+. For example, given

SSN → {Ename, Bdate, Address, Dno}
DNO → {Dname, MGRSSN}

the inferred f.d.s in F+ include

SSN → {Dname, MGRSSN}
SSN → Ename


An F.D. X′ → Y′ is inferred from a set of dependencies F specified on R if X′ → Y′ holds in every relation state r(R) that satisfies F; we write F ⊨ X′ → Y′, i.e., X′ → Y′ is inferred from F.

IR1 (Reflexive): If X ⊇ Y, then X → Y.
IR2 (Augmentation): {X → Y} ⊨ XZ → YZ.
IR3 (Transitive): {X → Y, Y → Z} ⊨ X → Z.
IR4 (Decomposition): {X → YZ} ⊨ X → Y.
IR5 (Union): {X → Y, X → Z} ⊨ X → YZ.
IR6 (Pseudotransitive): {X → Y, WY → Z} ⊨ WX → Z.

Closure of X under F: For each set of attributes X appearing in the L.H.S. of an f.d. in F, determine the set of all attributes functionally dependent on X. Thus for each such X we determine X+, the set of attributes functionally determined by X based on F; X+ is called the closure of X under F.

Equivalence of sets of f.d.s: A set of F.D.s E is covered by a set of f.d.s F (F covers E) if every F.D. in E can be inferred from F (i.e., is in F+). Two sets of F.D.s E and F are said to be equivalent if E+ = F+, i.e., every F.D. in E can be inferred from F and vice versa; E covers F and F covers E.

Minimal set of functional dependencies: A set of functional dependencies F is said to be minimal if
1) every dependency in F has a single attribute in its R.H.S.;
2) we cannot replace any f.d. X → A in F with Y → A, where Y ⊂ X, and still have a set of dependencies equivalent to F;
3) we cannot remove any dependency from F and still have a set of dependencies equivalent to F.

A minimal set is a standard or canonical form with no redundancies.

Minimal cover: A minimal cover of a set of f.d.s F is a minimal set of dependencies Fmin that is equivalent to F.

Trivial and nontrivial functional dependencies: A functional dependency X → Y is trivial if Y ⊆ X; otherwise it is nontrivial.
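As a small worked illustration using the dependencies listed earlier: if F contains SSN → {Ename, Bdate, Address, Dno} and Dno → {Dname, MGRSSN}, then {SSN}+ = {SSN, Ename, Bdate, Address, Dno, Dname, MGRSSN}: the first f.d. contributes Ename, Bdate, Address and Dno directly, and IR3 (transitivity) then adds Dname and MGRSSN. SSN → SSN itself is trivial, since {SSN} ⊆ {SSN}.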

12.2 Normal Forms:

Normalization was proposed by Codd (1972).
• It takes a relational schema through a series of tests to check whether it satisfies a certain normal form.
• It proceeds in a top-down fashion: design by analysis.
• Codd defined first, second and third normal forms, and a stronger definition of 3NF known as BCNF (Boyce-Codd Normal Form).
• All these normal forms are based on the f.d.s among the attributes of a relation.

Normalization: It is the process of analyzing the given relational schemas based on their FDs and primary keys to achieve the desirable properties of minimizing redundancy and minimizing insertion and update anomalies. Relational schemas that do not satisfy the normal form tests are decomposed into smaller relational schemas that meet the tests and possess the desirable properties. The normalization procedure provides:
1) a framework for analyzing relational schemas based on keys and f.d.s;
2) a series of normal form tests carried out on each relational schema so that the database is normalized to the desired degree.


The normal form of a relation is the highest normal form condition that it meets, and it hence indicates the degree to which the relation has been normalized. Normalization through decomposition must meet:
1) Lossless-join (non-additive join) property: the spurious tuples problem does not occur with respect to the schemas created after decomposition.
2) Dependency preservation property: each f.d. is represented in some individual relation after decomposition.

The process of storing the join of higher normal form relations as a base relation in a lower normal form is known as denormalization.

Superkey: A superkey S of R is a set of attributes with the property that no two distinct tuples t1 and t2 in r(R) will have t1[S] = t2[S]. Any set of attributes that is able to identify any row in the database uniquely is known as a superkey.

Candidate key: A candidate key K = {A1, …, Am} is a minimal superkey: the removal of any attribute Ai from K causes K to no longer be a superkey. Eg. {SSN, Ename} is a superkey and {SSN} is a candidate key.

Primary key: The primary key is a chosen candidate key, able to identify a specific row of the database in a unique manner. An attribute of a relation schema R that is a member of some candidate key of R is called a prime attribute; otherwise it is a non-prime attribute.

First Normal Form:

Statement of First Normal Form: The domain of an attribute must include only atomic (simple, indivisible) values, and the value of an attribute in a tuple must be a single value from the domain of that attribute.

1NF disallows sets of values and tuples of values, i.e., it disallows a relation within a relation or relations as attribute values (it disallows composite and multivalued attributes).

Eg. Consider the DEPARTMENT relational schema whose primary key is DNUMBER, and in particular its DLOCATIONS attribute: each department can have a number of locations.

DEPARTMENT(DNAME, DNUMBER, DMGRSSN, DLOCATIONS)   (NOT IN 1NF)

1) If the domain of DLOCATIONS contains atomic values but some tuples have a set of these values, then DLOCATIONS is not functionally dependent on the primary key DNUMBER:

DNAME     DNUMBER  DMGRSSN  DLOCATIONS
Research  5        334      Bellaire
Research  5        334      Alaska
Research  5        334      Newyork

2) If the domain of DLOCATIONS contains sets of values, it is non-atomic; however, DNUMBER → DLOCATIONS then holds:

DNAME     DNUMBER  DMGRSSN  DLOCATIONS
Research  5        334      {Bellaire, Alaska, Newyork}

Normalization to achieve 1NF: There are three ways to achieve first normal form.

1. Remove the attribute DLOCATIONS and place it in a separate relation along with the primary key DNUMBER of the department. The primary key of the new relation is the combination {DNUMBER, DLOCATIONS}, so a distinct tuple exists for each location:

DEPARTMENT(DNAME, DNUMBER, DMGRSSN)
DEPT_LOCATIONS(DNUMBER, DLOCATIONS)

2. Expand the key so that there is a separate tuple for each location of a department; this introduces redundancy into the relation.

3. If the maximum number of values for DLOCATIONS is known, e.g. 3, replace DLOCATIONS by three atomic attributes LOC1, LOC2, LOC3; this introduces more null values into the relation.

1NF also disallows multivalued attributes that are themselves composite; these are called nested relations.

Test for First Normal Form: A relation should not have non-atomic attributes or nested relations.
Remedy: Form new relations for each non-atomic attribute or nested relation.

Second Normal Form:

2NF is based on full functional dependency.

Full functional dependency: An FD X → Y is a full functional dependency if removal of any attribute A from X means that X − {A} does not functionally determine Y. Eg. {SSN, PNUMBER} → HOURS.

Partial functional dependency: An FD X → Y is a partial dependency if some attribute A ∈ X can be removed from X and the dependency still holds, i.e., for some A ∈ X, X − {A} → Y.

Eg. {SSN, Pnumber} → Ename

Statement of Second normal form: A relational schema R is in 2NF if every nonprime attribute A is fully functionally dependent on the primary key of R.


If a relational schema R is not in 2NF, it can be normalized into a number of 2NF relations in which the nonprime attributes are associated only with the part of the primary key on which they are fully functionally dependent.

Eg. Consider a relation with attributes (SSN, PNumber, Hours, Ename, Pname, PLocation) and primary key {SSN, PNumber}, which is NOT in 2NF:

FD1: {SSN, PNumber} → Hours
FD2: SSN → Ename
FD3: PNumber → {Pname, PLocation}

Ename, Pname and PLocation violate 2NF: FD2 and FD3 are partial dependencies on the primary key.

Normalization (decomposition) to achieve 2NF:

(SSN, Ename)                 – FD2
(SSN, PNumber, Hours)        – FD1
(PNumber, Pname, PLocation)  – FD3
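A possible SQL rendering of this 2NF decomposition (the table names and data types are illustrative assumptions, not taken from the text):

CREATE TABLE EMP (
    SSN     CHAR(9) PRIMARY KEY,
    ENAME   VARCHAR(50)
);

CREATE TABLE WORKS_ON (
    SSN     CHAR(9),
    PNUMBER INT,
    HOURS   DECIMAL(4,1),
    PRIMARY KEY (SSN, PNUMBER)
);

CREATE TABLE PROJECT (
    PNUMBER   INT PRIMARY KEY,
    PNAME     VARCHAR(50),
    PLOCATION VARCHAR(50)
);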

Test for Second Normal Form: For relations where the primary key contains multiple attributes, no non-key (non-prime) attribute should be functionally dependent on only a part of the primary key.
Remedy: Decompose and set up a new relation for each partial key together with its dependent attribute(s). Make sure to keep a relation with the original primary key and any attributes that are fully functionally dependent on it.

Third Normal Form:

It is based on transitive dependency.

Transitive dependency: A functional dependency X → Y in R is a transitive dependency if there is a set of attributes Z (that is neither a candidate key nor a subset of any key of R) such that both X → Z and Z → Y hold.

Statement of Third Normal Form: According to Codd's definition, a relational schema R is in 3NF if and only if it satisfies 2NF and every nonprime attribute is non-transitively dependent on the primary key.

General definition of 3NF: A relational schema R is in 3NF if, whenever a nontrivial functional dependency X → A holds in R, either a) X is a superkey of R, or b) A is a prime attribute of R. This definition states that a table is in 3NF if and only if, for each of its functional dependencies X → A, at least one of the following conditions holds:
• X contains A (that is, X → A is a trivial functional dependency), or
• X is a superkey, or
• A is a prime attribute (i.e., A is contained within a candidate key).

Eg. EMPLOYEE(Ename, SSN, Bdate, Address, Dno, Dname, Dmgrssn)

SSN → Dmgrssn is transitive through Dno (SSN → Dno and Dno → Dmgrssn). Decomposition into 3NF:

(Ename, SSN, Bdate, Address, Dno)
(Dno, Dname, Dmgrssn)

Test for Third Normal Form:

A relation should not have a non-key attribute functionally determined by another non-key attribute; i.e., there should be no transitive dependency of a non-key attribute on the primary key.
Remedy: Decompose and set up a relation that includes the non-key attribute(s) that functionally determine other non-key attribute(s).

Boyce-Codd normal form (BCNF):

BCNF is stricter than 3NF: every relation in BCNF is also in 3NF. A relation R is in BCNF if, whenever a non-trivial dependency X → A holds in R, X is a superkey of R. The condition of 3NF that allows A to be prime is absent from BCNF. Equivalently, a relation is in BCNF if and only if every determinant is a candidate key.

ClientInterview
clientNo  interviewDate  interviewTime  staffNo  roomNo
CR76      13-May-02      10.30          SG5      G101
CR76      13-May-02      12.00          SG5      G101
CR74      13-May-02      12.00          SG37     G102
CR56      1-Jul-02       10.30          SG5      G102

FD1: clientNo, interviewDate → interviewTime, staffNo, roomNo   (primary key)
FD2: staffNo, interviewDate, interviewTime → clientNo           (candidate key)
FD3: roomNo, interviewDate, interviewTime → clientNo, staffNo   (candidate key)
FD4: staffNo, interviewDate → roomNo                            (not a candidate key)

As a consequence of FD4, the ClientInterview relation may suffer from update anomalies. For example, two tuples have to be updated if the roomNo needs to be changed for staffNo SG5 on 13-May-02. The relation is decomposed into two new relations called Interview and StaffRoom, as shown below:

Interview(clientNo, interviewDate, interviewTime, staffNo)

StaffRoom(staffNo, interviewDate, roomNo)

Interview
clientNo  interviewDate  interviewTime  staffNo
CR76      13-May-02      10.30          SG5
CR76      13-May-02      12.00          SG5
CR74      13-May-02      12.00          SG37
CR56      1-Jul-02       10.30          SG5

StaffRoom
staffNo  interviewDate  roomNo
SG5      13-May-02      G101
SG37     13-May-02      G102
SG5      1-Jul-02       G102
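A possible SQL definition of the decomposed relations (the data types are illustrative assumptions):

CREATE TABLE Interview (
    clientNo      CHAR(4),
    interviewDate DATE,
    interviewTime CHAR(5),
    staffNo       CHAR(4),
    PRIMARY KEY (clientNo, interviewDate)
);

CREATE TABLE StaffRoom (
    staffNo       CHAR(4),
    interviewDate DATE,
    roomNo        CHAR(4),
    PRIMARY KEY (staffNo, interviewDate)
);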

Domain-Key Normal Form (DKNF) There is no hard and fast rule about defining normal forms only up to 5NF. Historically, the process of normalization and the process of discovering undesirable dependencies was carried through 5NF as a meaningful design activity, but it has been possible to define stricter normal forms that take into account additional types of dependencies and constraints. The idea behind domain-key normal form (DKNF) is to specify (theoretically, at least) the "ultimate normal form" that takes into account all possible types of dependencies and constraints. A relation is said to be in DKNF if all constraints and dependencies that should hold on the relation can be enforced simply by enforcing the domain constraints and key constraints on the relation. For a relation in DKNF, it becomes very straightforward to enforce all database constraints by simply checking that each attribute value in a tuple is of the appropriate domain and that every key constraint is enforced. Denormalization

The ultimate goal during normalization is to separate the logically related attributes into tables to minimize redundancy, and thereby avoid the update anomalies that lead to an extra processing overhead to maintain consistency in the database.


The above ideals are sometimes sacrificed in favor of faster execution of frequently occurring queries and transactions. This process of storing the logical database design (which may be in BCNF or 4NF) in a weaker normal form, say 2NF or 1NF, is called denormalization. Typically, the designer adds to a table attributes that are needed for answering queries or producing reports, so that a join with the table that originally contained those attributes is avoided. This reintroduces a partial functional dependency or a transitive dependency into the table, thereby creating the associated redundancy problems. Other forms of denormalization consist of storing extra tables to maintain original functional dependencies that are lost during a BCNF decomposition.

Differentiate 3NF and BCNF: The difference between 3NF and BCNF is that for a functional dependency A → B, 3NF allows this dependency in a relation if B is a prime (candidate-key) attribute even when A is not a candidate key, whereas BCNF insists that for this dependency to remain in a relation, A must be a candidate key. Note: A non-prime attribute of R is an attribute that does not belong to any candidate key of R.

Denormalization

Denormalization doesn't mean not normalizing. It's a step in the process. First we normalize, then we realize that we now have hundreds or thousands of small tables and that the performance cost of all those joins will be prohibitive, and then carefully apply some denormalization techniques so that the application will return results before the werewolves have overrun the city. Since normalization is about reducing redundancy, denormalizing is about deliberately adding redundancy to improve performance. Before beginning, consider whether or not it's necessary.
● Is the system's performance unacceptable with fully normalized data? Mock up a client and do some testing.
● If the performance is unacceptable, will denormalizing make it acceptable? Figure out

where your bottlenecks are.
● If you denormalize to clear those bottlenecks, will the system and its data still be reliable?
Unless the answers to all three are "yes," denormalization should be avoided. Unfortunately for me, the Council of Light has some pretty old hardware and can't or won't upgrade to newer and more efficient SQL software. I'd better get to work.

Denormalization Strategies

Materialized Views

If we're lucky, we won't need to denormalize our logical data design at all. We can let the database management system store additional information on disk, and it's the responsibility of the DBMS software to keep the redundant information consistent. Oracle does this with materialized views, which are pretty much what they sound like—SQL views, but made material on the disk. Like regular views, they change when the underlying data does. Microsoft SQL has something called indexed views, which I gather are similar although I've never used them. Other SQL databases have standard hacks to simulate materialized views; a little Googling will tell you whether or not yours does.

Database Constraints

The more common approach is to denormalize some carefully chosen parts of the database design, adding redundancies to speed performance. Danger lies ahead. It is now the database designer's responsibility to create database constraints specifying how redundant pieces of data will be kept consistent. These constraints introduce a tradeoff. They do speed up reads, just as we wish, but they slow down writes. If the database's regular usage is write-heavy, such as the Council of Light's


message forum, then this may actually be worse than the non-denormalized version.

Mirror Tables

If a particular table is too active, with constant querying and updating leading to slowdowns and timeouts and deadlocks, consider a miniature and internal version of "Double your database fun." You can have a background and foreground copy of the same table. The background table takes only writes, the foreground table is for reading, and synchronizing the two is a low-priority process that happens when less is going on.

Split Tables

The original table could also be maintained as a master, in which case the split tables are special cases of mirror tables (above) which happen to contain a subset instead of a complete copy. If few people want to query the original table, I could maintain it as a view which joins the split tables and treat the split tables as masters. That's a horizontal split, pulling rows out into different tables. For other needs we might do a vertical split, keeping all the same rows/keys but with each table having separate sets of columns of information. This is only worthwhile when there are distinct kinds of queries which are, in effect, already treating the table as more than one table.

Combined Tables

Instead of splitting one table into several, we might combine several into one. If tables have a one-to-one relationship, it might speed performance to combine them into one table even if that isn't part of the logical database design. We might even combine tables with a one-to-many relationship, although that would significantly complicate our update process and would likely work out to be a net loss. Combined tables differ from joined tables (above) in that they already share some data in some relationship, while any two tables which are frequently joined are potential candidates for joined tables. They also differ in that a joined table usually populates itself from the co-existing

normalized tables, while a combined table will usually be the master copy.

Index Tables

These are nice for searching hierarchies. Remember how Normal 5 taught us that weapon composition is actually a hierarchy of materials rather than a simple list? Well, now we're stuck with searching that hierarchy by searching all the leaves under particular parent nodes and combining the results with a bunch of union statements.

UNIT II SQL & QUERY OPTIMIZATION

SQL Standards - Data Types - Database Objects - DDL - DML - DCL - TCL - Embedded SQL - Static Vs Dynamic SQL - QUERY OPTIMIZATION: Query Processing and Optimization - Heuristics and Cost Estimates in Query Optimization.

SQL - Structured Query Language

Standards: SQL has evolved from IBM's Sequel (Structured English QUEry Language).

Advantages of SQL:
• SQL is a standard relational-database language.
• SQL is a comprehensive database language; it has statements for data definition, query and update. Hence it is both a DDL and a DML.
• It has facilities for defining views on the database, specifying security and authorization, defining integrity constraints, and specifying transaction controls.
• It also has rules for embedding SQL statements into a general-purpose programming language such as C or Pascal.

Parts of SQL: The SQL language has several parts:
• Data Definition Language – defines relation schemas, deletes relations and modifies relation schemas.
• Interactive Data Manipulation Language – based on both relational algebra and relational calculus; it includes commands to insert, update, select and delete tuples in the database.
• View Definition – defines views.
• Transaction Control – specifies the beginning and end of transactions.

• Embedded SQL and Dynamic SQL – SQL statements can be embedded in a general-purpose programming language.
• Integrity – SQL DDL commands specify integrity constraints that the data stored in the database must satisfy; updates that violate these constraints are disallowed.
• Authorization – specifies access rights to relations and views.

Basic Structure of SQL:

The basic structure of an SQL expression consists of three clauses: select, from and where.
• The SELECT clause corresponds to the project operator of the relational algebra; it lists the attributes desired in the result of a query.
• The FROM clause corresponds to the Cartesian product operation of the relational algebra; it lists the relations used in the evaluation of the expression.
• The WHERE clause corresponds to the selection predicate of the relational algebra; it consists of a predicate involving attributes of the relations that appear in the from clause.

A SQL query is of the form:

SELECT A1, A2, …, An FROM r1, r2, …, rm WHERE P

or

SELECT <attribute list> FROM <table list> WHERE <condition>;

where:
• <attribute list> is a list of attribute names whose values are to be retrieved by the query.
• <table list> is a list of the relation names required to process the query.
• <condition> is a conditional (Boolean) expression that identifies the tuples to be retrieved by the query.

The SQL query is equivalent to the relational algebra expression

πA1, A2, …, An(σP(r1 × r2 × … × rm))

SQL forms the Cartesian product of the relations named in the from clause, performs a relational algebra selection using the where-clause predicate, and then projects the result onto the attributes of the select clause.
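For example, a concrete query of this form over a hypothetical EMPLOYEE(FNAME, LNAME, DNO, SALARY) table:

SELECT FNAME, LNAME
FROM EMPLOYEE
WHERE DNO = 5 AND SALARY > 30000;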

2.2 DDL:

SQL uses the terms table, row, and column for relation, tuple, and attribute, respectively. The SQL2 commands for data definition are CREATE, ALTER, TRUNCATE and DROP. The SQL DDL allows specification of not only a set of relations, but also information about each relation, including:
• the schema for each relation
• the domain values associated with each attribute
• the integrity constraints
• the set of indices to be maintained for each relation
• the security and authorization information for each relation
• the physical storage structure of each relation on disk

Schema and Catalog Concepts in SQL2 An SQL schema is identified by a schema name, and includes an authorization identifier to indicate the user or account who owns the schema, as well as descriptors for each element in the schema. Schema elements include the tables, constraints, views, domains, and other constructs (such as authorization grants) that describe the schema. A schema is created via the CREATE SCHEMA statement, which can include all the schema elements‘ definitions. Alternatively, the schema can be assigned a name and authorization identifier, and the elements can be defined later. For example, the following statement creates a schema, owned by the user with authorization identifier JSMITH: CREATE SCHEMA AUTHORIZATION JSMITH; A catalog always contains a special schema called INFORMATION_SCHEMA, which provides information on all the element descriptors of all the schemas in the catalog to authorized users. Integrity constraints such as referential integrity can be defined between relations only if they exist in schemas within the same catalog. Schemas within the same catalog can also share certain elements, such as domain definitions. The CREATE TABLE Command and SQL2 Data Types and Constraints The CREATE TABLE command is used to specify a new relation by giving it a name and specifying its attributes and constraints. The attributes are specified first, and each attribute is given a name, a data type to specify its domain of values, and any attribute constraints such as NOT NULL. Create table r (A1 D1, A2 D2,…………….An Dn, (integrity constraint1), (integrity constraint2) ……(integrity constraint k))


where r is the name of the relation, each Ai is the name of an attribute in the schema of relation r, and Di is the domain (data type) of the values of that attribute.

DML (Data Manipulation Language): It is a language that provides a set of operations to support the basic data manipulation operations on the data held in the database. Commands used are:
• Insert
• Delete
• Select
• Update

DCL (Data Control Language):
• Grant
• Revoke

Data/Domain types in SQL

The SQL standard supports a variety of built-in domain types, including:

char(n)      – fixed-length character string of length n
varchar(n)   – variable-length character string with maximum length n
int          – integer
smallint     – small integer, a subset of integer
number(n)    – a number with n digits
number(p,d)  – a fixed-point number with p digits, of which d are to the right of the decimal point
real         – floating-point numbers
float(n)     – a floating-point number with precision of at least n digits
date         – a calendar date containing day, month and four-digit year
time         – the time of day in hours, minutes and seconds
timestamp    – a combination of date and time
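Putting the CREATE TABLE syntax and these types together, a sketch of a table definition (the table and column names and sizes are illustrative) might be:

CREATE TABLE EMPLOYEE (
    SSN    CHAR(9) NOT NULL,
    FNAME  VARCHAR(30),
    LNAME  VARCHAR(30),
    SALARY NUMBER(8,2),
    BDATE  DATE,
    DNO    INT,
    PRIMARY KEY (SSN)
);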

Database Objects

View

A view is a subset or part of a database; it is a personalized model of the database. A view can hide data that a user does not need to see, which simplifies the usage of the system and enhances security. A user who is not allowed to access a relation directly may be allowed to access part of that relation through a view. A view may also be called a virtual table.

Provide a mechanism to hide certain data from the view of certain users. To create a view we use the command: create view v as where: is any legal expression The view name is represented by v create view as select from ; Update of a View 

Create a view of all loan data in loan relation, hiding the amount attribute 

create view branch-loan as select branch-name, loan-number from loan



Add a new tuple to branch-loan 

insert into branch-loan values (‗Perryridge‘, ‗L-307‘)



This insertion must be represented by the insertion of the tuple 

(‗L-307‘, ‗Perryridge‘, null)



into the loan relation



Updates on more complex views are difficult or impossible to translate, and hence are disallowed.



Most SQL implementations allow updates only on simple views (without aggregates) defined on a single relation
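As a small illustration of this restriction (the view names are assumed; loan, branch-name and amount follow the naming style of the examples above), the first view below is a simple single-relation view through which most systems accept updates, while the second contains an aggregate and is therefore normally read-only:

-- simple view on one relation: usually updatable
create view perryridge-loan as
  select loan-number, amount from loan where branch-name = 'Perryridge';

-- view with an aggregate: updates through it are normally disallowed
create view branch-total as
  select branch-name, sum(amount) as total-amount from loan group by branch-name;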

Advantages of Views:
1. Provide automatic security for hidden data.
2. Different views of the same data for different users.
3. Provide logical data independence.
4. Provide the principle of interchangeability and the principle of database relativity.

Sequence
A sequence is a database object created by a user that can be shared by multiple users.
Uses:
• To create primary key values, which must be unique for each row.
• Automatically generates unique numbers.
• Speeds up the efficiency of accessing sequence values when they are cached in memory.


Syntax
• To create a sequence:
  CREATE SEQUENCE sequence_name
    [INCREMENT BY n]
    [START WITH n]
    [{MAXVALUE n | NOMAXVALUE}]
    [{MINVALUE n | NOMINVALUE}]
    [{CYCLE | NOCYCLE}]
    [{CACHE n | NOCACHE}];
• To use a sequence:
  insert into table_name values (sequence_name.NEXTVAL, value1, value2, value3, ...);
• Modifying a sequence: if the MAXVALUE limit of the sequence is reached, no additional values are allocated and an error is raised indicating that the sequence exceeds the MAXVALUE. To continue to use the sequence, it can be modified using the ALTER SEQUENCE statement:
  ALTER SEQUENCE sequence_name
    [INCREMENT BY n]
    [{MAXVALUE n | NOMAXVALUE}]
    [{MINVALUE n | NOMINVALUE}]
    [{CYCLE | NOCYCLE}]
    [{CACHE n | NOCACHE}];
• Removing a sequence: to remove a sequence from the data dictionary, use the DROP SEQUENCE statement. Once removed, the sequence can no longer be referenced.
  DROP SEQUENCE sequence_name;
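A short worked example (Oracle-style syntax; the dept table, its columns, and the sequence name are assumed for illustration only):

-- create a sequence that starts at 1000 and goes up by 1
CREATE SEQUENCE dept_seq START WITH 1000 INCREMENT BY 1 MAXVALUE 9999 NOCYCLE NOCACHE;

-- use it to generate a primary key value
INSERT INTO dept VALUES (dept_seq.NEXTVAL, 'Finance', 'Chennai');

-- CURRVAL returns the value most recently generated by NEXTVAL in this session
SELECT dept_seq.CURRVAL FROM dual;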

Synonyms: direct references (alternative names) for database objects. They are used to provide public access to an object, to mask the real name or owner of an object, and so on. A user may create a private synonym that is available only to that user.
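For illustration, Oracle-style synonym statements (the object, schema and synonym names are assumed examples):

CREATE SYNONYM emp FOR scott.employee;              -- private synonym for the current user
CREATE PUBLIC SYNONYM employee FOR scott.employee;  -- public synonym, visible to all users
DROP SYNONYM emp;                                   -- remove the private synonym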

Embedded SQL - Static vs Dynamic SQL
Embedded SQL is a method of combining the power of a programming language and the database. Embedded SQL statements are processed by a special SQL precompiler: the embedded SQL statements are parsed by the embedded SQL preprocessor, and the output from the preprocessor is then compiled by the host compiler. This allows programmers to embed SQL statements in programs written in any number of languages, such as the C programming language family, COBOL, FORTRAN and Java.
The SQL standard defines embeddings of SQL in a variety of programming languages such as C, Java, and COBOL. A language into which SQL queries are embedded is referred to as a host language, and the SQL structures permitted in the host language comprise embedded SQL. The basic form of these languages follows that of the System R embedding of SQL into PL/I.
The EXEC SQL statement is used to identify an embedded SQL request to the preprocessor:
EXEC SQL <embedded SQL statement> END_EXEC
Note: this varies by language (for example, the Java embedding uses # SQL { .... };).
From within a host language, find the names and cities of customers with more than the variable amount dollars in some account:
• Specify the query in SQL and declare a cursor for it:
  EXEC SQL declare c cursor for <the SQL query> END_EXEC
• The open statement causes the query to be evaluated:
  EXEC SQL open c END_EXEC
• The fetch statement causes the values of one tuple of the query result to be placed in host-language variables:
  EXEC SQL fetch c into :cn, :cc END_EXEC
  Repeated calls to fetch get successive tuples of the query result. A variable called SQLSTATE in the SQL communication area (SQLCA) gets set to '02000' to indicate that no more data is available.
• The close statement causes the database system to delete the temporary relation that holds the result of the query:
  EXEC SQL close c END_EXEC

Dynamic SQL
In static SQL, the SQL statements in the program do not change each time the program is run. Dynamic SQL allows programs to construct and submit SQL queries at run time. The program constructs an SQL statement in a buffer, just as it does for the EXECUTE IMMEDIATE statement. Instead of host variables, a question mark (?) can be substituted for a constant.

The program passes the SQL statement to the DBMS with a PREPARE statement, which requests that the DBMS parse, validate, and optimize the statement and generate an execution plan for it. The program can then use the EXECUTE statement repeatedly, supplying different parameter values each time the dynamic statement is executed.
Example of the use of dynamic SQL from within a C program:
char *sqlprog = "update account set balance = balance * 1.05 where account_number = ?";
EXEC SQL prepare dynprog from :sqlprog;
char account[10] = "A-101";
EXEC SQL execute dynprog using :account;
The dynamic SQL program contains a ?, which is a placeholder for a value that is supplied when the SQL program is executed.

Difference between Static and Dynamic SQL:
1. Static (embedded) SQL: how the database will be accessed is predetermined (hard coded) in the embedded SQL statement. Dynamic (interactive) SQL: how the database will be accessed is determined at run time.
2. Static SQL is faster and more efficient; dynamic SQL is slower and less efficient.
3. In static SQL the statements are compiled at compile time; in dynamic SQL they are compiled at run time.
4. In static SQL, parsing, validation, optimization, and generation of the application plan are done at compile time; in dynamic SQL they are done at run time.
5. Static SQL is generally used for situations where data is distributed uniformly; dynamic SQL where data is distributed non-uniformly.
6. In static SQL the EXECUTE IMMEDIATE, EXECUTE and PREPARE statements are not used; in dynamic SQL they are used.
7. Static SQL is less flexible; dynamic SQL is more flexible.

Query Processing and Optimization
A query expressed in a high-level query language such as SQL must first be scanned, parsed, and validated.


The scanner identifies the language tokens (such as SQL keywords, attribute names and relation names) in the text of the query, whereas the parser checks the query syntax to determine whether it is formulated according to the syntax rules (rules of grammar) of the query language. The query must also be validated, by checking that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried. An internal representation of the query is then created, usually as a tree data structure called a query tree; it is also possible to represent the query using a graph data structure called a query graph.
The DBMS must then devise an execution strategy for retrieving the result of the query from the database files. A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization. Finding the optimal strategy is usually too time-consuming except for the simplest of queries, and may require information on how the files are implemented and even on the contents of the files, information that may not be fully available in the DBMS catalog. Hence, planning of an execution strategy may be a more accurate description than query optimization.
Steps in Query Processing

The steps involved in processing a query appear in the figure. The basic steps are:
1. Parsing and translation
2. Optimization
3. Evaluation
Given a query, there are generally a variety of methods for computing the answer. For example, we have seen that, in SQL, a query could be expressed in several different ways. Each SQL query can itself be translated into a relational-algebra expression in one of several ways. Furthermore, the relational-algebra representation of a query specifies only partially how to


evaluate a query; there are usually several ways to evaluate relational-algebra expressions. As an illustration, consider the query select balance from account where balance < 2500
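As a small worked illustration (using the σ and Π notation already introduced, and only the relation and attribute named in the query itself), this query can be translated into relational algebra in at least two equivalent ways:

Πbalance(σbalance<2500(account))    or    σbalance<2500(Πbalance(account))

Both expressions produce the same result, but they can lead to very different evaluation costs, which is precisely what the optimizer must weigh.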

Heuristics and cost Estimates in Query Optimization

A drawback of cost-based optimization is the cost of optimization itself. Although the cost of query processing can be reduced by clever optimizations, cost-based optimization is still expensive. Hence, many systems use heuristics to reduce the number of choices that must be made in a cost-based fashion. Some systems even choose to use only heuristics, and do not use cost-based optimization at all.
An example of a heuristic rule is the following rule for transforming relational-algebra queries: perform selection operations as early as possible. A heuristic optimizer would use this rule without finding out whether the cost is reduced by this transformation. In the first transformation the selection operation was pushed into a join. We say that the preceding rule is a heuristic because it usually, but not always, helps to reduce the cost.
For an example of where it can result in an increase in cost, consider an expression σθ(r ⋈ s), where the condition θ refers only to attributes in s. The selection can certainly be performed before the join. However, if r is extremely small compared to s, and if there is an index on the join attributes of s, but no index on the attributes used by θ, then it is probably a bad idea to perform the selection early. Performing the selection early (that is, directly on s) would require doing a scan of all tuples in s. It is probably cheaper, in this case, to compute the join by using the index, and then to reject tuples that fail the selection.
The projection operation, like the selection operation, reduces the size of relations. Thus, whenever we need to generate a temporary relation, it is advantageous to apply immediately any projections that are possible. This advantage suggests a companion to the "perform selections early" heuristic: perform projections early.
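The "perform selections early" idea can be sketched in SQL (account, depositor and their attributes are assumed example relations): the two queries below are equivalent, but in the second one the selection on balance is applied before the join, so the join operates on a smaller intermediate relation.

select d.customer_name
from depositor d, account a
where d.account_number = a.account_number and a.balance < 2500;

-- equivalent form with the selection "pushed down" below the join
select d.customer_name
from depositor d,
     (select account_number from account where balance < 2500) a
where d.account_number = a.account_number;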


It is usually better to perform selections earlier than projections, since selections have the potential to reduce the sizes of relations greatly, and selections enable the use of indices to access tuples. An example similar to the one used for the selection heuristic should convince you that this heuristic does not always reduce the cost. A heuristic optimization algorithm will reorder the components of an initial query tree to achieve improved query execution. Heuristics can be understood by visualizing a query expression as a tree, as illustrated in Figure

1. Deconstruct conjunctive selections into a sequence of single selection operations. This step, based on equivalence rule 1, facilitates moving selection operations down the query tree.
2. Move selection operations down the query tree for the earliest possible execution. This step uses the commutativity and distributivity properties of the selection operation noted in equivalence rules 2, 7.a, 7.b, and 11. For instance, this step transforms σθ(r ⋈ s) into either σθ(r) ⋈ s or r ⋈ σθ(s) whenever possible. Performing value-based selections as early as possible reduces the cost of sorting and merging intermediate results. The degree of reordering permitted for a particular selection is determined by the attributes involved in that selection condition.
3. Determine which selection operations and join operations will produce the smallest relations, that is, will produce the relations with the least number of tuples. Using associativity of the ⋈ operation, rearrange the tree so that the leaf-node relations with these restrictive selections are executed first. This step considers the selectivity of a selection or join condition. Recall that the most restrictive selection (the condition with the smallest selectivity) retrieves the fewest records. This step relies on the associativity of binary operations given in equivalence rule 6.

4. Replace with join operations those Cartesian product operations that are followed by a selection condition (rule 4.a). The Cartesian product operation is often expensive to implement, since r1 × r2 includes a record for each combination of records from r1 and r2. The selection may significantly reduce the number of records, making the join much less expensive than the Cartesian product.
5. Deconstruct and move as far down the tree as possible lists of projection attributes, creating new projections where needed. This step draws on the properties of the projection operation given in equivalence rules 3, 8.a, 8.b, and 12.
6. Identify those subtrees whose operations can be pipelined, and execute them using pipelining.
In summary, the heuristics listed here reorder an initial query-tree representation in such a way that the operations that reduce the size of intermediate results are applied first; early selection reduces the number of tuples, and early projection reduces the number of attributes. The heuristic transformations also restructure the tree so that the system performs the most restrictive selection and join operations before other similar operations.
Heuristic optimization further maps the heuristically transformed query expression into alternative sequences of operations to produce a set of candidate evaluation plans. An evaluation plan includes not only the relational operations to be performed, but also the indices to be used, the order in which tuples are to be accessed, and the order in which the operations are to be performed. The access-plan-selection phase of a heuristic optimizer chooses the most efficient strategy for each operation.

UNIT III TRANSACTION PROCESSING AND CONCURRENCY CONTROL

Introduction - Properties of Transaction - Serializability - Concurrency Control - Locking Mechanisms - Two Phase Commit Protocol - Deadlock.

3.1 INTRODUCTION
A transaction is a unit of program execution that accesses and possibly updates various data items. A transaction consists of a collection of operations used to perform a particular task. Each transaction begins with a BEGIN TRANSACTION statement and ends with an END TRANSACTION statement.
Transaction States
• Active: the initial state; the transaction stays in this state while it is executing.
• Partially committed: a transaction is in this state when it has executed the final statement.
• Failed: a transaction is in this state once the normal execution of the transaction can no longer proceed.
• Aborted: a transaction is said to be aborted when the transaction has rolled back and the database has been restored to the consistent state prior to the start of the transaction.
• Committed: a transaction is in the committed state once it has been successfully executed and the database is transformed into a new consistent state.

Fig States of a transaction

A transaction starts in the active state. A transaction contains a group of statements that form a logical unit of work. When the transaction has finished executing the last statement, it enters the partially committed state. At this point the transaction has completed execution, but it is still possible that it may have to be aborted, because the actual output may still be in main memory and a hardware failure can still prevent successful completion. The database system then writes enough information to disk; when the last of this information is written, the transaction enters the committed state.
A transaction enters the failed state once the system determines that the transaction can no longer proceed with its normal execution. This could be due to hardware failures or logical errors. Such a transaction must be rolled back; when the rollback is complete, the transaction enters the aborted state. When a transaction aborts, the system has two options as follows:

• Restart the transaction • Kill the transaction.

3.2 PROPERTIES OF TRANSACTION

The four basic properties that all transactions should possess are called the ACID properties.
• Atomicity: the 'all or nothing' property. A transaction is an indivisible unit that is either performed in its entirety or is not performed at all. It is the responsibility of the recovery subsystem of the DBMS to ensure atomicity.
• Consistency: a transaction must transform the database from one consistent state to another consistent state. It is the responsibility of both the DBMS and the application developers to ensure consistency. The DBMS can ensure consistency by enforcing all the constraints that have been specified on the database schema, such as integrity and enterprise constraints; however, this in itself is insufficient to ensure consistency.
Example: if a transaction is intended to transfer money from one bank account to another, but the programmer makes an error in the transaction logic and debits one account while crediting the wrong account, then the database ends up in an inconsistent state.
• Isolation: transactions execute independently of one another, i.e. the partial effects of incomplete transactions should not be visible to other transactions. It is the responsibility of the concurrency control subsystem to ensure isolation.
• Durability: the effects of a successfully completed transaction are permanently recorded in the database and must not be lost because of a subsequent failure. It is the responsibility of the recovery subsystem to ensure durability.
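A minimal SQL sketch of such a transfer as one atomic unit (the account table, its columns, the transferred amount and the account numbers are assumed for illustration; the exact transaction syntax varies between systems):

begin transaction;
  update account set balance = balance - 500 where account_number = 'A-101';  -- debit
  update account set balance = balance + 500 where account_number = 'A-215';  -- credit
commit;  -- if either update fails, a rollback restores the previous consistent state

Atomicity guarantees that either both updates take effect or neither does, and consistency is preserved because the total balance is unchanged.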

3.3 SERIALIZABILITY

Basic assumption: each transaction preserves database consistency; thus serial execution of a set of transactions preserves database consistency. A (possibly concurrent) schedule is serializable if it is equivalent to a serial schedule. Different forms of schedule equivalence give rise to the notions of:
1. Conflict serializability
2. View serializability

3.3.1 Conflict Serializability
Instructions Ii and Ij of transactions Ti and Tj respectively conflict if and only if there exists some item Q accessed by both Ii and Ij, and at least one of these instructions wrote Q.

1. Ii = read(Q), Ij = read(Q): Ii and Ij do not conflict.
2. Ii = read(Q), Ij = write(Q): they conflict.
3. Ii = write(Q), Ij = read(Q): they conflict.
4. Ii = write(Q), Ij = write(Q): they conflict.

Intuitively, a conflict between Ii and Ij forces a temporal order between them. If Ii and Ij are consecutive in a schedule and they do not conflict, their results would remain the same even if they had been interchanged in the schedule. If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, we say that S and S' are conflict equivalent. A schedule S is conflict serializable if it is conflict equivalent to a serial schedule.

Example of a schedule that is not conflict serializable:

It is not possible to swap instructions in the above schedule to obtain either the serial schedule <T3, T4> or the serial schedule <T4, T3>. Schedule 3 below, however, can be transformed into Schedule 1, a serial schedule where T2 follows T1, by a series of swaps of non-conflicting instructions; therefore Schedule 3 is conflict serializable.
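A minimal illustration of such a swap (the data items A and B are just example names): if T1's write(A) is immediately followed in a schedule by T2's read(B), the two instructions access different data items, so they do not conflict and can be interchanged without changing the result. Repeating such swaps until all of T1's instructions precede all of T2's yields the serial schedule <T1, T2>, showing that the original schedule was conflict serializable.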

3.3.2 View Serializability
Let S and S' be two schedules with the same set of transactions. S and S' are view equivalent if the following three conditions are met:
1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S', also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and that value was produced by transaction Tj (if any), then transaction Ti must in schedule S' also read the value of Q that was produced by transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S'.
A schedule S is view serializable if it is view equivalent to a serial schedule. Every conflict serializable schedule is also view serializable. The following schedule is view serializable but not conflict serializable.

Every view serializable schedule that is not conflict serializable has blind writes.

Precedence graph: It is a directed graph where the vertices are the transactions (names). We draw an arc from Ti to Tj if the two transactions conflict, and Ti accessed the data item on which the conflict arose earlier. We may label the arc by the item that was accessed.

Example Schedule (Schedule A)
Fig Precedence Graph for Schedule A

3.4 CONCURRENCY CONTROL Multiple transactions are allowed to run concurrently in the system.

Advantages of concurrent executions

o Increased processor and disk utilization, leading to better transaction throughput: one transaction can be using the CPU while another is reading from or writing to the disk.
o Reduced average response time for transactions: short transactions need not wait behind long ones.
Concurrency control schemes: these are mechanisms to control the interaction among the concurrent transactions in order to prevent them from destroying the consistency of the database.

Schedules: schedules are sequences that indicate the chronological order in which instructions of concurrent transactions are executed. A schedule for a set of transactions must consist of all instructions of those transactions, and must preserve the order in which the instructions appear in each individual transaction.
Recoverable schedule: if a transaction Tj reads a data item previously written by transaction Ti, the commit operation of Ti appears before the commit operation of Tj.

3.5 LOCKING MECHANISMS
Concurrency control protocols, which ensure serializability of transactions, are most desirable. Concurrency control protocols can be broadly divided into two categories:

• Lock based protocols
• Time stamp based protocols

Lock based protocols
Database systems that are equipped with lock-based protocols use a mechanism by which a transaction cannot read or write data until it acquires an appropriate lock on it first. Locks are of two kinds:
• Binary locks: a lock on a data item can be in two states; it is either locked or unlocked.
• Shared/exclusive locks: this type of locking mechanism differentiates locks based on their use. If a lock is acquired on a data item in order to perform a write operation, it is an exclusive lock, because allowing more than one transaction to write to the same data item would lead the database into an inconsistent state. Read locks are shared, because no data value is being changed.

Two Phase Locking - 2PL

This locking protocol divides a transaction's execution into three parts. In the first part, when the transaction starts executing, it seeks grants for the locks it needs as it executes. In the second part the transaction has acquired all its locks and no further lock is required; the transaction keeps executing its operations. As soon as the transaction releases its first lock, the third part starts: in this phase the transaction cannot demand any new lock, it only releases the acquired locks.
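A sketch of one transaction obeying 2PL, written in the read/write and lock notation used in this unit (the data items A and B are just example names):

lock-X(A); read(A); write(A);   -- growing phase: locks are acquired
lock-S(B); read(B);
unlock(A); unlock(B);           -- shrinking phase: no new lock is requested after the first unlock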

Fig Two Phase Locking
Two phase locking has two phases: a growing phase, in which all locks are acquired by the transaction, and a shrinking phase, in which the locks held by the transaction are released. A transaction may first acquire a shared (read) lock on an item and later upgrade it to an exclusive (write) lock; such upgrades count as lock acquisitions and so belong to the growing phase.

Strict Two Phase Locking
The first phase of Strict-2PL is the same as in 2PL. After acquiring all its locks in the first phase, the transaction continues to execute normally. But in contrast to 2PL, Strict-2PL does not release a lock as soon as it is no longer required; it holds all locks until the commit point and releases them all at once when it commits.

Fig Strict Two Phase Locking

Strict-2PL does not have cascading aborts, as 2PL does.

Time stamp based protocols
The most commonly used concurrency protocol is the time-stamp based protocol. This protocol uses either system time or a logical counter as the time-stamp. Lock based protocols manage the order between conflicting pairs among transactions at the time of execution, whereas time-stamp based protocols start working as soon as a transaction is created. Every transaction has a time-stamp associated with it, and the ordering is determined by the age of the transaction. A transaction created at clock time 0002 would be older than all transactions that come after it; for example, any transaction 'y' entering the system at 0004 is two seconds younger, and priority may be given to the older one. In addition, every data item is given its latest read and write time-stamps.

Time-stamp ordering protocol
The timestamp-ordering protocol ensures serializability among transactions in their conflicting read and write operations. It is the responsibility of the protocol system that the conflicting pair of tasks be executed according to the timestamp values of the transactions.
• The time-stamp of transaction Ti is denoted TS(Ti).
• The read time-stamp of data item X is denoted R-timestamp(X).
• The write time-stamp of data item X is denoted W-timestamp(X).

The timestamp ordering protocol works as follows:
• If a transaction Ti issues a read(X) operation:
  o If TS(Ti) < W-timestamp(X): the operation is rejected.
  o If TS(Ti) >= W-timestamp(X): the operation is executed, and all data-item timestamps are updated.
• If a transaction Ti issues a write(X) operation:
  o If TS(Ti) < R-timestamp(X): the operation is rejected.
  o If TS(Ti) < W-timestamp(X): the operation is rejected and Ti is rolled back.
  o Otherwise, the operation is executed.

Thomas' Write rule: in the case where TS(Ti) < W-timestamp(X), the basic protocol rejects the write operation and rolls Ti back. The timestamp ordering rules can be modified to make the schedule view serializable: instead of rolling Ti back, the write operation itself is simply ignored.

TWO PHASE COMMIT PROTOCOL The two phase commit protocol is a distributed algorithm which lets all sites in a distributed system agree to commit a transaction. The protocol results in either all nodes committing the transaction or aborting, even in the case of site failures and message losses. The two phases of the algorithm are broken into the COMMIT-REQUEST phase, where the COORDINATOR attempts to prepare all the COHORTS, and the COMMIT phase, where the COORDINATOR completes the transactions at all COHORTS. The protocol works in the following manner: One node is designated the coordinator, which is the master site, and the rest of the nodes in the network are called cohorts. Other assumptions of the protocol include stable storage at each site and use of a write ahead log by each node. Also, the protocol assumes that no node crashes forever, and eventually any two nodes can communicate with each other. The latter is not a big deal since network communication can typically be rerouted.


Basic Algorithm
During phase 1, the coordinator initially sends a query-to-commit message to all cohorts and then waits for all cohorts to report back with an agreement message. The cohorts, if the transaction was successful, write an entry to the undo log and an entry to the redo log, and then reply with an agree message, or with an abort message if the transaction failed at that cohort node.
During phase 2, if the coordinator receives an agree message from all cohorts, it writes a commit record into its log and sends a commit message to all the cohorts; if all agreement messages do not come back, the coordinator sends an abort message. Next the coordinator waits for the acknowledgements from the cohorts. When acknowledgements are received from all cohorts, the coordinator writes a complete record to its log. Note that the coordinator will wait forever for all the acknowledgements to come back. If a cohort receives a commit message, it releases all the locks and resources held during the transaction and sends an acknowledgement to the coordinator. If the message is abort, the cohort undoes the transaction with the undo log, releases the resources and locks held during the transaction, and then sends an acknowledgement.

Disadvantages
The greatest disadvantage of the two phase commit protocol is that it is a blocking protocol. A node will block while it is waiting for a message; this means that other processes competing for resource locks held by the blocked processes will have to wait for the locks to be released. A single node will continue to wait even if all other sites have failed. If the coordinator fails permanently, some cohorts will never resolve their transactions, with the effect that resources are tied up forever.

3.6 DEADLOCK
When dealing with locks two problems can arise, the first of which is deadlock. Deadlock refers to a situation where two or more processes are each waiting for another to release a resource, or more than two processes are waiting for resources in a circular chain. Deadlock is a common problem in multiprocessing where many processes share a specific type of mutually exclusive resource. Some computers, usually those intended for the time-sharing and/or real-time markets, are often equipped with a hardware lock (hard lock), which guarantees exclusive access to processes, forcing serialization. Deadlocks are particularly disconcerting because there is no general solution to avoid them.


Livelock: livelock is a special case of resource starvation. A livelock is similar to a deadlock, except that the states of the processes involved constantly change with regard to one another while never progressing. The general definition only states that a specific process is not progressing. For example, the system keeps selecting the same transaction for rollback, causing the transaction to never finish executing. Another livelock situation can come about when the system is deciding which transaction gets a lock and which waits in a conflict situation.

3.6.1 Deadlock Prevention
To prevent any deadlock situation in the system, the DBMS aggressively inspects all the operations which transactions are about to execute. The DBMS inspects the operations and analyzes whether they can create a deadlock situation. If it finds that a deadlock situation might occur, that transaction is never allowed to be executed. There are deadlock prevention schemes that use the time-stamp ordering mechanism of transactions in order to pre-decide a deadlock situation.
Wait-Die Scheme: in this scheme, if a transaction requests a lock on a resource (data item) that is already held with a conflicting lock by some other transaction, one of two possibilities may occur:

• If TS(Ti) < TS(Tj), that is, Ti (which is requesting the conflicting lock) is older than Tj, Ti is allowed to wait until the data item is available.
• If TS(Ti) > TS(Tj), that is, Ti is younger than Tj, Ti dies. Ti is restarted later with a random delay but with the same timestamp.
This scheme allows the older transaction to wait but kills the younger one.
Wound-Wait Scheme: in this scheme, if a transaction requests a lock on a resource (data item) that is already held with a conflicting lock by some other transaction, one of two possibilities may occur:

• If TS(Ti) < TS(Tj), that is, Ti (which is requesting the conflicting lock) is older than Tj, Ti forces Tj to be rolled back, that is, Ti wounds Tj. Tj is restarted later with a random delay but with the same timestamp.
• If TS(Ti) > TS(Tj), that is, Ti is younger than Tj, Ti is forced to wait until the resource is available.
This scheme allows the younger transaction to wait, but when an older transaction requests an item held by a younger one, the older transaction forces the younger one to abort and release the item. In both schemes, the transaction that enters the system later is the one that is aborted.

3.6.2 Deadlock Avoidance
Aborting a transaction is not always a practical approach. Instead, deadlock avoidance mechanisms can be used to detect any deadlock situation in advance. Methods like the "wait-for graph" are available, but they suit systems where transactions are lightweight and hold fewer instances of resources; in a bulky system, deadlock prevention techniques may work well.
Wait-for Graph
This is a simple method available to track whether any deadlock situation may arise. For each transaction entering the system, a node is created. When transaction Ti requests a lock on an item, say X, which is held by some other transaction Tj, a directed edge is created from Ti to Tj. If Tj releases item X, the edge between them is dropped and Ti locks the data item. The system maintains this wait-for graph for every transaction waiting for data items held by others, and keeps checking whether there is any cycle in the graph.

Fig Wait-for Graph
Two approaches can be used. The first is not to allow any request for an item that is already locked by some other transaction; this is not always feasible and may cause starvation, where a transaction waits indefinitely for a data item and can never acquire it. The second option is to roll back one of the transactions.


It is not feasible to always roll back the younger transaction, as it may be more important than the older one. With the help of some relative algorithm a transaction is chosen to be aborted; this transaction is called the victim, and the process is known as victim selection.

UNIT IV TRENDS IN DATABASE TECHNOLOGY

Overview of Physical Storage Media - Magnetic Disks - RAID - Tertiary Storage - File Organization - Organization of Records in Files - Indexing and Hashing - Ordered Indices - B+ Tree Index Files - B Tree Index Files - Static Hashing - Dynamic Hashing - Introduction to Distributed Databases - Client Server Technology - Multidimensional and Parallel Databases - Spatial and Multimedia Databases - Mobile and Web Databases - Data Warehouse - Mining - Data Marts

4.1 OVERVIEW OF PHYSICAL STORAGE MEDIA
Classification of Physical Storage Media
• Based on the speed with which data can be accessed
• Based on cost per unit of data
• Based on reliability
• Based on life of storage
Volatile storage: loses its contents when power is switched off.
Non-volatile storage: contents persist even when power is switched off; includes secondary and tertiary storage, as well as battery-backed-up main memory.

Storage Hierarchy
Primary storage: fastest media but volatile (cache, main memory).
Secondary storage: the next level in the hierarchy; non-volatile, moderately fast access time. Also called on-line storage. E.g. flash memory, magnetic disks.

Tertiary storage: lowest level in hierarchy, non-volatile, slow access time. Also called off-line storage. E.g. magnetic tape, optical storage.

Fig Storage-device hierarchy

Cache
• The fastest and most costly form of storage
• Volatile
• Managed by the computer system hardware

Main memory
• Fast access (10s to 100s of nanoseconds; 1 nanosecond = 10^-9 seconds)
• Generally too small (or too expensive) to store the entire database
• Capacities of up to a few gigabytes widely used currently
• Capacities have gone up and per-byte costs have decreased steadily and rapidly
• Volatile: contents of main memory are usually lost if a power failure or system crash occurs

Flash memory
• Data survives power failure
• Data can be written at a location only once, but the location can be erased and written to again
• Can support only a limited number of write/erase cycles
• Erasing has to be done on an entire bank of memory
• Reads are roughly as fast as main memory, but writes are slow (a few microseconds) and erase is slower
• Cost per unit of storage roughly similar to main memory
• Widely used in embedded devices such as digital cameras; also known as EEPROM

Magnetic disk
• Data is stored on a spinning disk, and read/written magnetically
• Primary medium for the long-term storage of data; typically stores the entire database
• Data must be moved from disk to main memory for access, and written back for storage
• Much slower access than main memory
• Direct access: possible to read data on disk in any order, unlike magnetic tape (hard disks vs. floppy disks)
• Capacities range up to roughly 100 GB currently; much larger capacity and lower cost/byte than main memory or flash memory
• Growing constantly and rapidly with technology improvements
• Survives power failures and system crashes; disk failure can destroy data, but is very rare

Optical storage
• Non-volatile; data is read optically from a spinning disk using a laser
• CD-ROM (640 MB) and DVD (4.7 to 17 GB) are the most popular forms
• Write-once, read-many (WORM) optical disks are used for archival storage (CD-R and DVD-R)
• Multiple-write versions also available (CD-RW, DVD-RW, and DVD-RAM)
• Reads and writes are slower than with magnetic disk
• Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, are available for storing large volumes of data

Optical Disks
• Compact disk read only memory (CD-ROM)
• Disks can be loaded into or removed from a drive
• High storage capacity (640 MB per disk)
• High seek times of about 100 msec (the optical read head is heavier and slower)
• Higher latency (3000 RPM) and lower data-transfer rates (3-6 MB/s) compared to magnetic disks

Digital Video Disk (DVD)
• DVD-5 holds 4.7 GB, and DVD-9 holds 8.5 GB
• DVD-10 and DVD-18 are double-sided formats with capacities of 9.4 GB and 17 GB
• Other characteristics similar to CD-ROM
• Record-once versions (CD-R and DVD-R): data can only be written once and cannot be erased; high capacity and long lifetime; used for archival storage
• Multi-write versions (CD-RW, DVD-RW and DVD-RAM) also available

Tape storage
• Non-volatile; used primarily for backup (to recover from disk failure) and for archival data
• Sequential access: much slower than disk
• Very high capacity (40 to 300 GB tapes available)
• Tape can be removed from the drive, so storage costs are much cheaper than disk, but drives are expensive
• Tape jukeboxes available for storing massive amounts of data: hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)

Magnetic Tapes
• Hold large volumes of data and provide high transfer rates
• A few GB for the DAT (Digital Audio Tape) format, 10-40 GB with the DLT (Digital Linear Tape) format, 100 GB+ with the Ultrium format, and 330 GB with the Ampex helical scan format
• Transfer rates from a few to 10s of MB/s
• Currently the cheapest storage medium; tapes are cheap, but the cost of drives is very high
• Very slow access time in comparison to magnetic disks and optical disks; limited to sequential access
• Used mainly for backup, for storage of infrequently used information, and as an offline medium for transferring information from one system to another
• Tape jukeboxes used for very large capacity storage (terabytes, 10^12 bytes, to petabytes, 10^15 bytes)

4.2 MAGNETIC DISKS

Fig Magnetic disk

Physical Characteristics of Magnetic Hard Disks
• The surface of a platter is divided into circular tracks; there are over 17,000 tracks per platter on typical hard disks.
• Each track is divided into sectors. A sector is the smallest unit of data that can be read or written.
• Sector size is typically 512 bytes. Typical sectors per track: 200 (on inner tracks) to 400 (on outer tracks).
• The read-write head is positioned very close to the platter surface (almost touching it) and reads or writes magnetically encoded information.
• To read/write a sector, the disk arm swings to position the head on the right track; the platter spins continually, and data is read/written as the sector passes under the head.
• Head-disk assemblies: multiple disk platters on a single spindle (typically 2 to 4), with one head per platter, mounted on a common arm.
• Cylinder i consists of the ith track of all the platters.

Performance Measures of Disks
• Access time: the time it takes from when a read or write request is issued to when data transfer begins.
• Seek time: the time it takes to reposition the arm over the correct track. Average seek time is 1/2 the worst-case seek time; 4 to 10 milliseconds on typical disks.
• Rotational latency: the time it takes for the sector to be accessed to appear under the head. Average latency is 1/2 of the worst-case latency; 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.).
• Data-transfer rate: the rate at which data can be retrieved from or stored to the disk; 4 to 8 MB per second is typical. Multiple disks may share a controller, so the rate that the controller can handle is also important.
• Mean Time To Failure (MTTF): the average time the disk is expected to run continuously without any failure, typically 3 to 5 years. The probability of failure of new disks is quite low, corresponding to a "theoretical MTTF" of 30,000 to 1,200,000 hours for a new disk.

4.3 RAID
RAID: Redundant Arrays of Independent Disks. Disk organization techniques that manage a large number of disks, providing the view of a single disk of:
• high capacity and high speed, by using multiple disks in parallel, and
• high reliability, by storing data redundantly, so that data can be recovered even if a disk fails.

The chance that some disk out of a set of N disks will fail is much higher than the chance that a specific single disk will fail. E.g., a system with 100 disks, each with an MTTF of 100,000 hours (approx. 11 years), will have a system MTTF of 1000 hours (approx. 41 days). Techniques for using redundancy to avoid data loss are therefore critical with large numbers of disks.
RAID was originally a cost-effective alternative to large, expensive disks; the I in RAID originally stood for "inexpensive". Today RAIDs are used for their higher reliability and bandwidth, and the I is interpreted as "independent".

Improvement of Reliability via Redundancy
• Redundancy: store extra information that can be used to rebuild information lost in a disk failure.
• E.g. mirroring (or shadowing): duplicate every disk. A logical disk consists of two physical disks; every write is carried out on both disks, while reads can take place from either disk. If one disk in a pair fails, the data is still available on the other.
• Data loss would occur only if a disk fails and its mirror disk also fails before the system is repaired. The probability of this combined event is very small, except for dependent failure modes such as fire, building collapse, or electrical power surges.
• Mean time to data loss depends on the mean time to failure and the mean time to repair; roughly, it is MTTF^2 / (2 x mean time to repair). E.g. an MTTF of 100,000 hours and a mean time to repair of 10 hours give a mean time to data loss of 100,000^2 / (2 x 10) = 500 x 10^6 hours (about 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes).

Improvement in Performance via Parallelism
• Two main goals of parallelism in a disk system:
  1. Load-balance multiple small accesses to increase throughput.
  2. Parallelize large accesses to reduce response time.
• Improve the transfer rate by striping data across multiple disks.
• Bit-level striping: split the bits of each byte across multiple disks. In an array of eight disks, write bit i of each byte to disk i; each access can then read data at eight times the rate of a single disk, but the seek/access time is worse than for a single disk, so bit-level striping is not used much any more.
• Block-level striping: with n disks, block i of a file goes to disk (i mod n) + 1. Requests for different blocks can run in parallel if the blocks reside on different disks, and a request for a long sequence of blocks can utilize all disks in parallel.

RAID Levels
Schemes to provide redundancy at lower cost by using disk striping combined with parity bits. Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics.
• RAID Level 0: block striping; non-redundant. Used in high-performance applications where data loss is not critical.
• RAID Level 1: mirrored disks with block striping. Offers the best write performance; popular for applications such as storing log files in a database system.
• RAID Level 2: memory-style error-correcting codes (ECC).
• RAID Level 3: bit-interleaved parity. A single parity bit is enough for error correction, not just detection, since we know which disk has failed. When writing data, the corresponding parity bits must also be computed and written to a parity-bit disk; to recover the data in a damaged disk, compute the XOR of the bits from the other disks (including the parity-bit disk). Faster data transfer than with a single disk, but fewer I/Os per second since every disk has to participate in every I/O. Subsumes Level 2 (provides all its benefits at lower cost).
• RAID Level 4: block-interleaved parity; uses block-level striping, and keeps a parity block on a separate disk for the corresponding blocks from the N other disks. When writing a data block, the corresponding block of parity bits must also be computed and written to the parity disk; to find the value of a damaged block, compute the XOR of the bits from the corresponding blocks (including the parity block) on the other disks. Provides higher I/O rates for independent block reads than Level 3, since a block read goes to a single disk and blocks stored on different disks can be read in parallel; provides higher transfer rates for reads of multiple blocks than no striping. Before writing a block, the parity data must be computed, either by using the old parity block, the old value of the current block and the new value of the current block (2 block reads + 2 block writes), or by recomputing the parity value using the new values of the blocks corresponding to the parity block (more efficient for writing large amounts of data sequentially). The parity block becomes a bottleneck for independent block writes, since every block write also writes to the parity disk.
• RAID Level 5: block-interleaved distributed parity; partitions data and parity among all N + 1 disks, rather than storing data on N disks and parity on 1 disk. E.g. with 5 disks, the parity block for the nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks. Higher I/O rates than Level 4: block writes occur in parallel if the blocks and their parity blocks are on different disks. Subsumes Level 4: provides the same benefits, but avoids the bottleneck of the parity disk.
• RAID Level 6: P+Q redundancy scheme; similar to Level 5, but stores extra redundant information to guard against multiple disk failures. Better reliability than Level 5 at a higher cost; not used.
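A tiny worked illustration of the parity idea (the bit values are made up): if three data disks hold the bits 1, 0 and 1 at some position, the parity disk stores 1 XOR 0 XOR 1 = 0. If the second disk fails, its bit can be recovered as the XOR of the surviving bits and the parity: 1 XOR 1 XOR 0 = 0, which is the lost value.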


Choice of RAID Level
Factors in choosing a RAID level:
• Monetary cost
• Performance: number of I/O operations per second, and bandwidth during normal operation
• Performance during failure
• Performance during rebuild of a failed disk, including the time taken to rebuild the failed disk
Guidelines:
• RAID 0 is used only when data safety is not important, e.g. when data can be recovered quickly from other sources.
• Levels 2 and 4 are never used since they are subsumed by 3 and 5.
• Level 3 is not used anymore since bit striping forces single block reads to access all disks, wasting disk arm movement, which block striping (Level 5) avoids.
• Level 6 is rarely used since Levels 1 and 5 offer adequate safety for almost all applications.
• So the competition is between Levels 1 and 5 only. Level 1 provides much better write performance than Level 5: Level 5 requires at least 2 block reads and 2 block writes to write a single block, whereas Level 1 only requires 2 block writes. Level 1 is therefore preferred for high-update environments such as log disks.
• Level 1 has a higher storage cost than Level 5. However, disk drive capacities are increasing rapidly (50%/year) whereas disk access times have decreased much less (about 3x in 10 years), and I/O requirements have increased greatly, e.g. for web servers. When enough disks have been bought to satisfy the required rate of I/O, they often have spare storage capacity, so there is often no extra monetary cost for Level 1.
• Level 5 is preferred for applications with a low update rate and large amounts of data; Level 1 is preferred for all other applications.

Hardware Issues
• Software RAID: RAID implementations done entirely in software, with no special hardware support.
• Hardware RAID: RAID implementations with special hardware.
  o Use non-volatile RAM to record writes that are being executed. Beware: a power failure during a write can result in a corrupted disk, e.g. failure after writing one block but before writing the second in a mirrored system. Such corrupted data must be detected when power is restored; recovery from corruption is similar to recovery from a failed disk. NV-RAM helps to efficiently detect potentially corrupted blocks; otherwise all blocks of the disk must be read and compared with the mirror/parity blocks.
• Hot swapping: replacement of a disk while the system is running, without powering down. Supported by some hardware RAID systems; reduces time to recovery and improves availability greatly.
• Many systems maintain spare disks which are kept online and used as replacements for failed disks immediately on detection of failure, reducing time to recovery greatly.
• Many hardware RAID systems ensure that a single point of failure will not stop the functioning of the system by using redundant power supplies with battery backup, and multiple controllers and multiple interconnections to guard against controller/interconnection failures.

4.4 TERTIARY STORAGE Tertiary storage: lowest level in hierarchy, non-volatile, slow access time. Also called off-line storage. E.g. magnetic tape, optical storage.

Optical storage: non-volatile; data is read optically from a spinning disk using a laser. CD-ROM (640 MB) and DVD (4.7 to 17 GB) are the most popular forms. Write-once, read-many (WORM) optical disks are used for archival storage (CD-R and DVD-R); multiple-write versions (CD-RW, DVD-RW, and DVD-RAM) are also available. Reads and writes are slower than with magnetic disk. Juke-box systems, with large numbers of removable disks, a few drives, and a mechanism for automatic loading/unloading of disks, are available for storing large volumes of data.

Tape storage: non-volatile; used primarily for backup (to recover from disk failure) and for archival data. Access is sequential and much slower than disk, but capacities are very high (40 to 300 GB tapes), and because a tape can be removed from the drive, storage costs are much cheaper than disk although the drives are expensive. Tape jukeboxes are available for storing massive amounts of data, from hundreds of terabytes up to a petabyte. The detailed characteristics of optical disks, DVDs and magnetic tapes are as listed in Section 4.1.

Optical Disks
• CD-ROM has become a popular medium for distributing software, multimedia data, and other electronically published information.
• Capacity of a CD-ROM: about 500 MB. Disks are cheap to mass-produce, and so are the drives.
• CD-ROM has a much longer seek time (250 msec) and lower rotation speed (400 rpm) than a hard disk, leading to higher latency and a lower data-transfer rate (about 150 KB/sec). Drives that spin at multiples of the standard audio CD speed are available.
• Recently, a new optical format, the digital video disk (DVD), has become standard. These disks hold between 4.7 and 17 GB of data.
• WORM (write-once, read-many) disks are popular for archival storage of data since they have a high capacity (about 500 MB), a longer lifetime than hard disks, and can be removed from the drive, which makes them good for audit trails (hard to tamper with).

4.5 FILE ORGANIZATION The database is stored as a collection of files. Each file is a sequence of records. A record is a sequence of fields.  One approach: o assume record size is fixed. o each file has records of one particular type only. o different files are used for different relations.  This case is easiest to implement; will consider variable length records later.

4.5.1 Fixed-Length Records
• Simple approach:

o Store record i starting from byte n × (i - 1), where n is the size of each record. Record access is simple, but records may cross blocks.
o Modification: do not allow records to cross block boundaries.
• Deletion of record i: alternatives are
  o move records i + 1, ..., n to i, ..., n - 1
  o move record n to i
  o do not move records, but link all free records on a free list
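A small worked illustration of the byte-offset formula (a record size of 40 bytes is just an example): with n = 40, record 1 starts at byte 40 × (1 - 1) = 0, record 2 at byte 40, and record 5 at byte 40 × 4 = 160, so any record can be located directly from its number.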

Free Lists
o Store the address of the first deleted record in the file header.
o Use this first record to store the address of the second deleted record, and so on.
o These stored addresses can be thought of as pointers, since they "point" to the location of a record.
o A more space-efficient representation: reuse the space for the normal attributes of free records to store the pointers (no pointers are stored in in-use records).

4.5.2 Variable-Length Records
Variable-length records arise in database systems in several ways:
• Storage of multiple record types in a file.
• Record types that allow variable lengths for one or more fields.
• Record types that allow repeating fields (used in some older data models).
Byte string representation
• Attach an end-of-record control character to the end of each record.
• Difficulty with deletion.
• Difficulty with growth.

Variable-Length Records: Slotted Page Structure

Slotted page header contains:
o the number of record entries
o the end of free space in the block
o the location and size of each record
• Records can be moved around within a page to keep them contiguous with no empty space between them; the entry in the header must then be updated.
• Pointers should not point directly to the record; instead they should point to the entry for the record in the header.
• Fixed-length representation:
  o reserved space
  o pointers
• Reserved space: can use fixed-length records of a known maximum length; unused space in shorter records is filled with a null or end-of-record symbol.


4.5.3 Pointer Method
• A variable-length record is represented by a list of fixed-length records, chained together via pointers.
• Can be used even if the maximum record length is not known.
• Disadvantage of the pointer structure: space is wasted in all records except the first in a chain.
• The solution is to allow two kinds of block in the file:
  o Anchor block: contains the first records of chains.
  o Overflow block: contains records other than those that are the first records of chains.

4.6 ORGANIZATION OF RECORDS IN FILES
• Heap: a record can be placed anywhere in the file where there is space.
• Sequential: store records in sequential order, based on the value of the search key of each record.

• Hashing: a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed.
Records of each relation may be stored in a separate file. In a clustering file organization, records of several different relations can be stored in the same file; the motivation is to store related records on the same block to minimize I/O.

4.6.1 Sequential File Organization
• Suitable for applications that require sequential processing of the entire file.
• The records in the file are ordered by a search key.
• Deletion: use pointer chains.
• Insertion: locate the position where the record is to be inserted; if there is free space, insert there; if there is no free space, insert the record in an overflow block. In either case, the pointer chain must be updated.
• The file needs to be reorganized from time to time to restore sequential order.

4.6.2 Clustering File Organization o Simple file structure stores each relation in a separate file. o Can instead store several relations in one file using a clustering file organization. o E.g., clustering organization of customer and depositor:


o bad for queries involving only customer
o results in variable size records
4.6.3 Mapping of Objects to Files
 Mapping objects to files is similar to mapping tuples to files in a relational system; object data can be stored using file structures.
 Objects in O-O databases may lack uniformity and may be very large; such objects have to be managed differently from records in a relational system.
o Set fields with a small number of elements may be implemented using data structures such as linked lists.
o Set fields with a larger number of elements may be implemented as separate relations in the database.
o Set fields can also be eliminated at the storage level by normalization.
 Similar to conversion of multivalued attributes of E-R diagrams to relations.
 Objects are identified by an object identifier (OID); the storage system needs a mechanism to locate an object given its OID (this action is called dereferencing).
o Logical identifiers do not directly specify an object's physical location; an index that maps an OID to the object's actual location must be maintained.
o Physical identifiers encode the location of the object so the object can be found directly. Physical OIDs typically have the following parts:
1. a volume or file identifier
2. a page identifier within the volume or file
3. an offset within the page

4.7 INDEXING AND HASHING

INDEXING
 Indexing mechanisms are used to speed up access to desired data.
o E.g., author catalog in a library.
 Search key – attribute or set of attributes used to look up records in a file.
 An index file consists of records (called index entries) of the form (search-key, pointer).
 Index files are typically much smaller than the original file.

Two basic kinds of indices:
o Ordered indices: search keys are stored in sorted order.
o Hash indices: search keys are distributed uniformly across "buckets" using a "hash function".
 Index Evaluation Metrics
o Access types supported efficiently, e.g.:
o records with a specified value in the attribute, or
o records with an attribute value falling in a specified range of values.
o Access time
o Insertion time
o Deletion time
o Space overhead
ORDERED INDICES
 In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in a library.
 Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. Also called a clustering index. The search key of a primary index is usually, but not necessarily, the primary key.
 Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called a non-clustering index.
 Index-sequential file: an ordered sequential file with a primary index.
Dense Index Files
o Dense index: an index record appears for every search-key value in the file.

Sparse Index Files
 Sparse index: contains index records for only some search-key values; applicable when records are sequentially ordered on the search key.
 To locate a record with search-key value K:
 Find the index record with the largest search-key value not exceeding K.
 Search the file sequentially starting at the record to which the index record points.
 Less space and less maintenance overhead for insertions and deletions, but generally slower than a dense index for locating records.
 Good tradeoff: a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
Example of Sparse Index Files
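A minimal sketch of a sparse-index lookup under the assumptions that the index holds one (least search-key, block number) entry per block and that a hypothetical read_block(block_no) callback returns the records of a block; it is not code from the notes.

    import bisect

    # one entry per block: (least search-key in the block, block number), kept sorted
    sparse_index = [("Brighton", 0), ("Downtown", 1), ("Perryridge", 2)]

    def lookup(search_key, read_block):
        keys = [k for k, _ in sparse_index]
        # index entry with the largest search-key value not exceeding search_key
        pos = bisect.bisect_right(keys, search_key) - 1
        if pos < 0:
            return None                       # smaller than every indexed value
        _, block_no = sparse_index[pos]
        # then search the pointed-to block sequentially
        for record in read_block(block_no):   # read_block is a hypothetical callback
            if record["branch_name"] == search_key:
                return record
        return None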

Multilevel Index
 If the primary index does not fit in memory, access becomes expensive.

 To reduce number of disk accesses to index records, treat primary index kept on disk as a sequential file and construct a sparse index on it. o outer index – a sparse index of primary index o inner index – the primary index file  If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.  Indices at all levels must be updated on insertion or deletion from the file.

Index Update: Deletion  If deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also. Single-level index deletion: o Dense indices – deletion of search-key is similar to file record deletion. o Sparse indices – if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key


order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
 Index Update: Insertion
o Single-level index insertion:
o Perform a lookup using the search-key value appearing in the record to be inserted.
o Dense indices – if the search-key value does not appear in the index, insert it.
o Sparse indices – if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In this case, the first search-key value appearing in the new block is inserted into the index.
o Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.
Secondary Indices
o Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition.
o Example 1: In the account database stored sequentially by account number, we may want to find all accounts in a particular branch.
o Example 2: As above, but where we want to find all accounts with a specified balance or range of balances.
o We can have a secondary index with an index record for each search-key value; the index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
Secondary Index on balance field of account

Primary and Secondary Indices
 Secondary indices have to be dense.

 Indices offer substantial benefits when searching for records.
 When a file is modified, every index on the file must be updated; updating indices imposes overhead on database modification.
 A sequential scan using the primary index is efficient, but a sequential scan using a secondary index is expensive: each record access may fetch a new block from disk.
 Bitmap Indices
 Bitmap indices are a special type of index designed for efficient querying on multiple keys.
 Records in a relation are assumed to be numbered sequentially from, say, 0.
 Given a number n, it must be easy to retrieve record n; this is particularly easy if records are of fixed size.
 Applicable on attributes that take on a relatively small number of distinct values.
 E.g. gender, country, state, ...
 E.g. income level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
 A bitmap is simply an array of bits. In its simplest form, a bitmap index on an attribute has one bitmap for each value of the attribute.
 A bitmap has as many bits as there are records.
 In the bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and 0 otherwise.
 Bitmap indices are useful for queries on multiple attributes, but not particularly useful for single-attribute queries.
 Queries are answered using bitmap operations:
 Intersection (and)
 Union (or)
 Complementation (not)
 Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result bitmap.
 Males with income level L1: 10010 AND 10100 = 10000
 The required tuples can then be retrieved; counting the number of matching tuples is even faster.
 Bitmap indices are generally very small compared with the relation size: if the number of distinct attribute values is 8, the bitmaps occupy only about 1% of the relation size. A minimal sketch of these bitmap operations follows below.
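The sketch referred to above, using plain Python integers as bitmaps over a hypothetical five-record relation; the existence bitmap used for complementation is the one discussed just below.

    # Sketch: bitmaps as integers, one bit per record.
    NUM_RECORDS = 5

    gender_m  = 0b10010      # bit = 1 where the record has gender = m
    income_L1 = 0b10100      # bit = 1 where the record has income level L1
    existence = 0b11111      # existence bitmap: 1 where a valid record is stored

    males_with_L1 = gender_m & income_L1           # 0b10000, matching the example above
    count = bin(males_with_L1).count("1")          # counting matches is just a popcount

    # complementation must be masked with the existence bitmap
    not_L1 = ~income_L1 & existence                # 0b01011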

 Deletion needs to be handled properly:
 an existence bitmap notes whether there is a valid record at each record location (needed for complementation);
 bitmaps should be kept for all values, even the null value.
B+-Tree Index Files
o B+-tree indices are an alternative to indexed-sequential files.
o Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.
o Advantage of B+-tree index files: the tree automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.
o Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.
o The advantages of B+-trees outweigh the disadvantages, and they are used extensively.
 A B+-tree is a rooted tree satisfying the following properties:
 All paths from the root to a leaf are of the same length.
 Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
 A leaf node has between ⌈(n–1)/2⌉ and n–1 values.
 Special cases:
 If the root is not a leaf, it has at least 2 children.
 If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
Typical node

o Ki are the search-key values o Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).  The search-keys in a node are ordered K1 < K2 < K3 < . . . < Kn–1. Leaf Nodes in B+-Trees o Properties of a leaf node:  For i = 1, 2, . . ., n–1, pointer Pi either points to a file


record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki. The bucket structure is needed only if the search key does not form a primary key.
 If Li, Lj are leaf nodes and i < j, Li's search-key values are less than Lj's search-key values.
 Pn points to the next leaf node in search-key order.

Non-Leaf Nodes in B+-Trees
 Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
o All the search keys in the subtree to which P1 points are less than K1.
o For 2 ≤ i ≤ m – 1, all the search keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki.
o All the search keys in the subtree to which Pm points are greater than or equal to Km–1.


 Example of a B+-tree

B+-tree for account file (n = 3)

B+-tree for account file (n = 5)
o Leaf nodes must have between 2 and 4 values (⌈(n–1)/2⌉ and n–1, with n = 5).
o Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).
o The root must have at least 2 children.
 Observations about B+-trees
o Since the inter-node connections are made by pointers, "logically" close blocks need not be "physically" close.
o The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
o The B+-tree contains a relatively small number of levels (logarithmic in the size of the main file), so searches can be conducted efficiently.
o Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time.
Queries on B+-Trees
 Find all records with a search-key value of k.
o Start with the root node:
 Examine the node for the smallest search-key value > k.
 If such a value exists, assume it is Kj. Then follow Pj to the child node.
 Otherwise k ≥ Km–1, where there are m pointers in the node. Then follow Pm to the child node.

o If the node reached by following the pointer above is not a leaf node, repeat the above procedure on that node, and follow the corresponding pointer.
o Eventually a leaf node is reached. If for some i, key Ki = k, follow pointer Pi to the desired record or bucket. Else no record with search-key value k exists.
 In processing a query, a path is traversed in the tree from the root to some leaf node.
 If there are K search-key values in the file, the path is no longer than ⌈log⌈n/2⌉(K)⌉.
 A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around 100 (40 bytes per index entry).
 With 1 million search-key values and n = 100, at most log50(1,000,000) = 4 nodes are accessed in a lookup.
 Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup.
o The above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds.
 Updates on B+-Trees: Insertion
o Find the leaf node in which the search-key value would appear.
o If the search-key value is already there in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.
o If the search-key value is not there, then add the record to the main file and create a bucket if necessary. Then:
 If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node.
 Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed next.
o Splitting a node:
 Take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
 Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split. If the parent is full, split it and propagate the split further up. The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the root node may be split, increasing the height of the tree by 1.
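A minimal sketch of the lookup procedure described above, using simplified in-memory nodes; the Node class and its fields are assumptions for illustration, not the notes' own definitions.

    # Sketch: B+-tree lookup of search-key value k.
    class Node:
        def __init__(self, keys, pointers, is_leaf, next_leaf=None):
            self.keys = keys            # K1 < K2 < ... (sorted)
            self.pointers = pointers    # children (non-leaf) or record pointers (leaf)
            self.is_leaf = is_leaf
            self.next_leaf = next_leaf  # Pn pointer chaining leaves in key order

    def find(root, k):
        node = root
        while not node.is_leaf:
            # smallest key > k decides which child pointer to follow;
            # if no such key exists, follow the last pointer Pm
            i = 0
            while i < len(node.keys) and node.keys[i] <= k:
                i += 1
            node = node.pointers[i]
        # at a leaf: return the record pointer if the key is present
        for key, ptr in zip(node.keys, node.pointers):
            if key == k:
                return ptr
        return None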


Result of splitting node containing Brighton and Downtown on inserting Clearview

B+-Tree before and after insertion of "Clearview"
 Updates on B+-Trees: Deletion
 Find the record to be deleted, and remove it from the main file and from the bucket (if present).
 Remove the (search-key value, pointer) pair from the leaf node if there is no bucket or if the bucket has become empty.
 If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then:
 Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.


 Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.
 Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then:
 Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
 Update the corresponding search-key value in the parent of the node.
 The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found. If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
Examples of B+-Tree Deletion

Before and after deleting "Downtown"
o The removal of the leaf node containing "Downtown" did not result in its parent having too few pointers, so the cascaded deletions stopped with the deleted leaf node's parent.


Deletion of "Perryridge" from the result of the previous example
o The node with "Perryridge" becomes underfull (actually empty, in this special case) and is merged with its sibling.
o As a result, the "Perryridge" node's parent became underfull and was merged with its sibling (and an entry was deleted from their parent).
o The root node then had only one child, and was deleted; its child became the new root node.
B+-Tree File Organization
o The index file degradation problem is solved by using B+-tree indices. The data file degradation problem is solved by using a B+-tree file organization.
o The leaf nodes in a B+-tree file organization store records, instead of pointers.
o Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a non-leaf node.
o Leaf nodes are still required to be half full. Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.


Example of B+-tree File Organization
o Good space utilization is important, since records use more space than pointers.
o To improve space utilization, involve more sibling nodes in redistribution during splits and merges.
 Involving 2 siblings in redistribution (to avoid a split or merge where possible) results in each node having at least ⌊2n/3⌋ entries.

B-Tree Index Files
 Similar to a B+-tree, but a B-tree allows search-key values to appear only once; this eliminates redundant storage of search keys.
 Search keys in non-leaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a non-leaf node must be included.
 Generalized B-tree leaf node; in a non-leaf node, the pointers Bi are the bucket or file-record pointers.

 B-Tree Index File Example


B-tree (above) and B+-tree (below) on same data

o Advantages of B-tree indices:
 May use fewer tree nodes than a corresponding B+-tree.
 Sometimes possible to find a search-key value before reaching a leaf node.
o Disadvantages of B-tree indices:
 Only a small fraction of all search-key values are found early.
 Non-leaf nodes are larger, so fan-out is reduced; thus B-trees typically have greater depth than the corresponding B+-tree.
 Insertion and deletion are more complicated than in B+-trees.
 Implementation is harder than for B+-trees.
o Typically, the advantages of B-trees do not outweigh the disadvantages.

HASHING
In hashing, a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed.


Static Hashing
 A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
 In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
 A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
 The hash function is used to locate records for access, insertion, and deletion.
 Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
Example of Hash File Organization
Hash file organization of the account file, using branch-name as the key:
o There are 10 buckets.
o The binary representation of the ith character is assumed to be the integer i.
o The hash function returns the sum of the binary representations of the characters modulo 10.
o E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
o Hash file organization of the account file, using branch-name as the key
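A minimal sketch reproducing this example: each letter is mapped to its position in the alphabet (the "integer i" for the i-th character), the positions are summed, and the sum is taken modulo the number of buckets (10). The function name and the use of Python are illustrative only.

    # Sketch: the branch-name hash function from the example above.
    def h(branch_name, buckets=10):
        total = sum(ord(c.lower()) - ord('a') + 1 for c in branch_name if c.isalpha())
        return total % buckets

    print(h("Perryridge"))   # 5
    print(h("Round Hill"))   # 3
    print(h("Brighton"))     # 3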


Hash Functions
o The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
o An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
o An ideal hash function is also random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
o Typical hash functions perform computation on the internal binary representation of the search key.
o For example, for a string search key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned.
Handling of Bucket Overflows
o Bucket overflow can occur because of:
 Insufficient buckets
 Skew in the distribution of records. This can occur for two reasons: multiple records have the same search-key value, or the chosen hash function produces a non-uniform distribution of key values.
o Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.
o Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
o The above scheme is called closed hashing.
o An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.


Hash Indices
o Hashing can be used not only for file organization, but also for index-structure creation.
o A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
o Strictly speaking, hash indices are always secondary indices:
o if the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary.
o However, we use the term hash index to refer to both secondary index structures and hash-organized files.
Example of Hash Index

Demerits of Static Hashing
o In static hashing, the function h maps search-key values to a fixed set B of bucket addresses.
o Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.
o If the file size at some point in the future is anticipated and the number of buckets is allocated accordingly, a significant amount of space will be wasted initially.

o If the database shrinks, again space will be wasted.
o One option is periodic reorganization of the file with a new hash function, but this is very expensive.
o These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.
Dynamic Hashing
 Good for a database that grows and shrinks in size; allows the hash function to be modified dynamically.
 Extendable hashing – one form of dynamic hashing:
 The hash function generates values over a large range, typically b-bit integers, with b = 32.
 At any time, only a prefix of the hash value is used to index into a table of bucket addresses.
 Let the length of the prefix be i bits, 0 ≤ i ≤ 32. The bucket address table size = 2^i. Initially i = 0.
 The value of i grows and shrinks as the size of the database grows and shrinks.
 Multiple entries in the bucket address table may point to the same bucket; thus the actual number of buckets is at most 2^i.
 The number of buckets also changes dynamically due to coalescing and splitting of buckets.


General Extendable Hash Structure

In this structure, i2 = i3 = i, whereas i1 = i – 1

Use of Extendable Hash Structure
o Each bucket j stores a value ij; all the entries that point to the same bucket have the same values in the first ij bits.
o To locate the bucket containing search-key Kj:
1. Compute h(Kj) = X.
2. Use the first i high-order bits of X as a displacement into the bucket address table, and follow the pointer to the appropriate bucket.
o To insert a record with search-key value Kj:
o follow the same procedure as look-up and locate the bucket, say j;
o if there is room in bucket j, insert the record in the bucket;
o else the bucket must be split and the insertion re-attempted (overflow buckets are used instead in some cases).
Updates in Extendable Hash Structure
o To split bucket j when inserting a record with search-key value Kj:
o If i > ij (more than one pointer to bucket j):
o allocate a new bucket z, and set ij and iz to the old ij + 1;

o make the second half of the bucket address table entries pointing to j point to z;
o remove and reinsert each record in bucket j;
o recompute the new bucket for Kj and insert the record in that bucket (further splitting is required if the bucket is still full).
o If i = ij (only one pointer to bucket j):
o increment i and double the size of the bucket address table;
o replace each entry in the table by two entries that point to the same bucket;
o recompute the new bucket address table entry for Kj. Now i > ij, so use the first case above.
o When inserting a value, if the bucket is full after several splits (that is, i reaches some limit), create an overflow bucket instead of splitting the bucket address table further.
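A minimal sketch of extendable hashing along the lines described above, using the first i high-order bits of a 32-bit hash as the directory index; the bucket capacity, the use of SHA-1, and all names are assumptions for illustration, and overflow buckets for pathological cases are omitted.

    import hashlib

    BUCKET_CAPACITY = 2

    class Bucket:
        def __init__(self, local_depth):
            self.local_depth = local_depth    # i_j in the notes
            self.items = {}

    class ExtendableHash:
        def __init__(self):
            self.global_depth = 0             # i in the notes
            self.table = [Bucket(0)]          # bucket address table of size 2**i

        def _hash_bits(self, key):
            h = int.from_bytes(hashlib.sha1(str(key).encode()).digest()[:4], 'big')
            return h >> (32 - self.global_depth)    # first i high-order bits

        def insert(self, key, value):
            bucket = self.table[self._hash_bits(key)]
            if key in bucket.items or len(bucket.items) < BUCKET_CAPACITY:
                bucket.items[key] = value
                return
            if bucket.local_depth == self.global_depth:
                # only one pointer to this bucket: double the bucket address table
                self.global_depth += 1
                new_table = []
                for b in self.table:
                    new_table += [b, b]       # each old entry becomes two entries
                self.table = new_table
            # more than one pointer now: split the bucket
            bucket.local_depth += 1
            new_bucket = Bucket(bucket.local_depth)
            # entries sharing the old prefix form a contiguous run; give the
            # upper half (new prefix bit = 1) to the new bucket
            run = [j for j, b in enumerate(self.table) if b is bucket]
            for j in run[len(run) // 2:]:
                self.table[j] = new_bucket
            # remove and reinsert each record of the split bucket, then retry
            old_items = bucket.items
            bucket.items = {}
            for k, v in old_items.items():
                self.insert(k, v)
            self.insert(key, value)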

o To delete a key value:
o locate it in its bucket and remove it;
o the bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table);
o coalescing of buckets can be done (a bucket can coalesce only with a "buddy" bucket having the same value of ij and the same ij – 1 prefix, if it is present);
o decreasing the bucket address table size is also possible.

o Note: decreasing the bucket address table size is an expensive operation and should be done only if the number of buckets becomes much smaller than the size of the table.
Use of Extendable Hash Structure: Example


Initial hash structure, bucket size = 2
Hash structure after insertion of one Brighton and two Downtown records

Hash structure after insertion of Mianus record


Hash structure after insertion of three Perryridge records
Hash structure after insertion of Redwood and Round Hill records

Extendable Hashing vs. Other Schemes
 Benefits of extendable hashing:
o Hash performance does not degrade with growth of the file.
o Minimal space overhead.
 Disadvantages of extendable hashing:
o Extra level of indirection to find the desired record.
o The bucket address table may itself become very big (larger than memory); a tree structure may be needed to locate the desired record in the structure.
o Changing the size of the bucket address table is an expensive operation.

 Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows.

4.8 INTRODUCTION TO DISTRIBUTED DATABASES AND CLIENT/SERVER TECHNOLOGY

DISTRIBUTED DATABASE SYSTEM

 A distributed database system consists of loosely coupled sites that share no physical component.
 Database systems that run on each site are independent of each other.
 Transactions may access data at one or more sites.

In a homogeneous distributed database:
 All sites have identical software.
 Sites are aware of each other and agree to cooperate in processing user requests.
 Each site surrenders part of its autonomy in terms of the right to change schemas or software.
 The system appears to the user as a single system.

In a heterogeneous distributed database:
 Different sites may use different schemas and software.
 Difference in schema is a major problem for query processing.
 Difference in software is a major problem for transaction processing.
 Sites may not be aware of each other and may provide only limited facilities for cooperation in transaction processing.

DISTRIBUTED DATA STORAGE
 Data Replication
 Data Fragmentation
Data Replication
A relation or fragment of a relation is replicated if it is stored redundantly in two or more sites. Full replication of a relation is the case where the relation is stored at all sites. Fully redundant databases are those in which every site contains a copy of the entire database.
Advantages of Replication
 Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
 Parallelism: queries on r may be processed by several nodes in parallel.
 Reduced data transfer: relation r is available locally at each site containing a replica of r.

Disadvantages of Replication
 Increased cost of updates: each replica of relation r must be updated.
 Increased complexity of concurrency control: concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented. One solution: choose one copy as the primary copy and apply concurrency control operations on the primary copy.
Data Fragmentation
Division of relation r into fragments r1, r2, ..., rn which contain sufficient information to reconstruct relation r.
 Horizontal fragmentation: each tuple of r is assigned to one or more fragments.
 Vertical fragmentation: the schema for relation r is split into several smaller schemas. All schemas must contain a common candidate key (or superkey) to ensure the lossless-join property. A special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key.
 Example: relation account with the schema Account = (account_number, branch_name, balance). A minimal sketch of both kinds of fragmentation follows after this list.
Data transparency: the degree to which a system user may remain unaware of the details of how and where the data items are stored in a distributed system. Consider transparency issues in relation to:
 Fragmentation transparency
 Replication transparency
 Location transparency
Naming of data items – criteria:
 Every data item must have a system-wide unique name.
 It should be possible to find the location of data items efficiently.
 It should be possible to change the location of data items transparently.
 Each site should be able to create new data items autonomously.
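The sketch referred to above, using a hypothetical two-tuple account relation represented as Python dictionaries; it only illustrates the idea of horizontal and vertical fragments, not a real distributed storage layer.

    # Sketch: horizontal fragmentation = selection per site,
    #         vertical fragmentation  = projections sharing the key account_number.
    account = [
        {"account_number": "A-101", "branch_name": "Hillside", "balance": 500},
        {"account_number": "A-305", "branch_name": "Valleyview", "balance": 350},
    ]

    # horizontal fragments: each tuple is assigned to one fragment (by branch)
    hillside_fragment   = [t for t in account if t["branch_name"] == "Hillside"]
    valleyview_fragment = [t for t in account if t["branch_name"] == "Valleyview"]

    # vertical fragments: each keeps the candidate key so the relation can be rejoined
    deposit_fragment = [{"account_number": t["account_number"],
                         "balance": t["balance"]} for t in account]
    branch_fragment  = [{"account_number": t["account_number"],
                         "branch_name": t["branch_name"]} for t in account]

    # lossless reconstruction: union of horizontal fragments, join of vertical ones
    reconstructed = hillside_fragment + valleyview_fragment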

CLIENT/SERVER TECHNOLOGY
Client/server technology is a means for separating the functions of an application into two or more distinct parts. Client/server describes the relationship between two computer programs in which one program, the client, makes a service request from another program, the server, which fulfills the request. The client presents and manipulates data on the desktop computer. The server acts like a mainframe to store and retrieve protected data. It is a network architecture in which each

computer or process on the network is either a client or a server. Servers are powerful computers or processes dedicated to managing disk drives (file servers), printers (print servers), or network traffic (network servers). Clients are PCs or workstations on which users run applications. Clients rely on servers for resources, such as files, devices, and even processing power.
Client - Server Technology
A client/server model has the following three distinct components, each focusing on a specific job:

 Database server
 Client application
 Network

Database Server
A server (the "back end") manages resources such as the database efficiently and optimally among the various clients that simultaneously request the server for the same resources. The database server mainly concentrates on the following tasks:
 Managing a single database of information among many concurrent users.
 Controlling database access and other security requirements.
 Protecting the database of information with backup and recovery features.
 Centrally enforcing global data integrity rules across all client applications.
Client Application
A client application (the "front end") is the part of the system that users apply to interact with data. The client application in a client/server model focuses on the following jobs:
 Presenting an interface between the user and the resource to complete the job.
 Managing presentation logic, performing application logic, and validating data entry.
 Managing the request traffic of receiving and sending information from the database server.
Network
The third component of a client/server system is the network. The communication software is the vehicle that transmits data between the clients and the server in a client/server system. Both the client and the server run communication software that allows them to talk across the network.
Three-Tier Technology
Client-server technology is also called 3-tier technology, as illustrated in the figure below. Client/server is an important idea in a network; however, it can be used by programs within a

single computer. In a network, the client/ server model provides a convenient way to interconnect programs that are distributed efficiently across different locations. Computer transactions using the client/ server model are very common. For example, to check your bank account from your computer, a client program in your computer forwards your request to a server program at the bank. That program may in turn forward the request to its own client program that sends a request to a database server at another bank computer to retrieve your account balance.

4.9 MULTIDIMENSIONAL AND PARALLEL DATABASES
MULTIDIMENSIONAL DATABASES
A multidimensional database is a specific type of database that has been optimized for data warehousing and OLAP (online analytical processing). A multidimensional database is structured by a combination of data from various sources that work amongst databases simultaneously and that offer networks, hierarchies, arrays, and other data-formatting methods. In a multidimensional database, the data is presented to its users through multidimensional arrays, and each individual value of data is contained within a cell which can be accessed by multiple indexes. A multidimensional database uses the concept of a data cube (also referred to as a hypercube) to represent the dimensions of data currently available to its user(s). The multidimensional database concept is designed to assist with decision support systems.

This detailed organization of the data allows for advanced and complex query generation while providing outstanding performance in certain cases when compared to traditional relational structures and databases. This type of database is usually structured in an order that optimizes OLAP and data warehouse applications.


Fig. Multidimensional Database
Dimensions and Members
This section introduces the concepts of outlines, dimensions, and members within a multidimensional database. If you understand dimensions and members, you are well on your way to understanding the power of a multidimensional database. A dimension represents the highest consolidation level in the database outline. The database outline presents dimensions and members in a tree structure to indicate a consolidation relationship. Standard dimensions represent the core components of a business plan and often relate to departmental functions. Typical standard dimensions are Time, Accounts, Product Line, Market, and Division. Dimensions change less frequently than members. Attribute dimensions are associated with standard dimensions. Members are the individual components of a dimension. For example, Product A, Product B, and Product C might be members of the Product dimension. Each member has a unique name. Essbase can store the data associated with a member (referred to as a stored member in this chapter), or it can dynamically calculate the data when a user retrieves it.
PARALLEL DATABASES

Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes

parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism. The time required to retrieve relations from disk is reduced by partitioning the relations across multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
 Round-robin: send the ith tuple inserted in the relation to disk i mod n.
 Hash partitioning: choose one or more attributes as the partitioning attributes. Choose a hash function h with range 0…n – 1. Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
 Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn–2] is chosen. Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0 and tuples with v ≥ vn–2 go to disk n – 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
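A minimal sketch of the three partitioning techniques for a hypothetical n = 3 disks, including the range-partitioning vector [5, 11] used in the example above; function names are illustrative only.

    import bisect

    n = 3                                  # number of disks

    def round_robin(i):                    # i-th tuple inserted into the relation
        return i % n

    def hash_partition(value):
        return hash(value) % n             # any hash function with range 0..n-1

    def range_partition(value, vector=(5, 11)):
        # v < 5 -> disk 0, 5 <= v < 11 -> disk 1, v >= 11 -> disk 2
        return bisect.bisect_right(vector, value)

    assert range_partition(2) == 0
    assert range_partition(8) == 1
    assert range_partition(20) == 2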

INTERQUERY PARALLELISM
 Queries/transactions execute in parallel with one another.
 Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
 Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
 More complicated to implement on shared-disk or shared-nothing architectures:
 Locking and logging must be coordinated by passing messages between processors.
 Data in a local buffer may have been updated at another processor.

Cache-coherency has to be maintained — reads and writes of data in buffer must find latest version of data.

INTRAQUERY PARALLELISM
 Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
 Two complementary forms of intraquery parallelism:
 Intraoperation parallelism – parallelize the execution of each individual operation in the query.
 Interoperation parallelism – execute the different operations in a query expression in parallel.
 The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

4.9 MOBILE DATABASES
4.9.1 Mobile Computing Architecture
The general architecture of a mobile platform:


It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
 Fixed hosts are general-purpose computers configured to manage mobile units.
 Base stations function as gateways to the fixed network for the mobile units.
Wireless Communications
 The wireless medium has bandwidth significantly lower than that of a wired network.
 The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
 Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
 Other characteristics distinguish wireless connectivity options:
 interference,
 locality of access,
 range,
 support for packet switching,
 seamless roaming throughout a geographical region.

 Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
 Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.
Client/Network Relationships
 Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
 To manage the mobility, the entire domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
 Mobile units move unrestricted throughout the cells of a domain, while

maintaining information access contiguity.
 The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
 Wireless communications, however, make other architectures possible.
 In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
 In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
 Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
 MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
 Transaction processing and data consistency control become more difficult since there is no central control in this architecture.
 Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
 Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

4.9.2 Characteristics of Mobile Environments
 The characteristics of mobile computing include:
 Communication latency
 Intermittent connectivity
 Limited battery life
 Changing client location
 The server may not be able to reach a client.
 A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

 In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
 Proxies for unreachable components are added to the architecture.
 For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
 Mobile computing poses challenges for servers as well as clients.
 The latency involved in wireless communication makes scalability a problem.
 Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
 One way servers relieve this problem is by broadcasting data whenever possible.
 A server can simply broadcast data periodically.
 Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
 Client mobility also poses many data management challenges.
 Servers must keep track of client locations in order to efficiently route messages to them.

 Client data should be stored in the network location that minimizes the traffic necessary to access it.
 The act of moving between cells must be transparent to the client.
 The server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
 Client mobility also allows new applications that are location-based.
WEB DATABASES
A web database is a system for storing information that can then be accessed via a website. For example, an online community may have a database that stores the username, password, and other details of all its members. The most commonly used database system for the internet is MySQL due to its integration with PHP, one of the most widely used server-side programming languages.

At its most simple level, a web database is a set of one or more tables that contain data. Each table has different fields for storing information of various types. These tables can then be linked together in order to manipulate data in useful or interesting ways. In many cases, a table will use a primary key, which must be unique for each entry and allows for unambiguous selection of data. A web database can be used for a range of different purposes. Each field in a table has to have a defined data type. For example, numbers, strings, and dates can all be inserted into a web database. Proper database design involves choosing the correct data type for each field in order to reduce memory consumption and increase the speed of access. Although for small databases this often isn't so important, big web databases can grow to millions of entries and need to be well designed to work effectively.
SPATIAL AND MULTIMEDIA DATABASES
SPATIAL DATABASE
Types of Spatial Data
 Point data: points in a multidimensional space.
 E.g., raster data such as satellite imagery, where each pixel stores a measured value.
 E.g., feature vectors extracted from text.
 Region data:
 Objects have spatial extent with location and boundary.
 The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.
Types of Spatial Queries
 Spatial range queries: find all cities within 50 miles of Madison. The query has an associated region (location, boundary); the answer includes overlapping or contained data regions.
 Nearest-neighbor queries: find the 10 cities nearest to Madison; results must be ordered by proximity.

 Spatial join queries: find all cities near a lake. Expensive; the join condition involves regions and proximity.
Applications of Spatial Data
 Geographic Information Systems (GIS): e.g., ESRI's ArcInfo; OpenGIS Consortium geospatial information. All classes of spatial queries and data are common.
 Computer-Aided Design/Manufacturing: store spatial objects such as the surface of an airplane fuselage. Range queries and spatial join queries are common.
 Multimedia databases: images, video, text, etc. stored and retrieved by content. Data are first converted to feature-vector form (high dimensionality); nearest-neighbor queries are the most common.

Single-Dimensional Indexes
B+-trees are fundamentally single-dimensional indexes. When we create a composite search-key B+-tree, e.g., an index on (age, sal), we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal.
Multidimensional Indexes
A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space. Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries:
Motivation for Multidimensional Indexes
 Spatial queries (GIS, CAD): find all hotels within a radius of 5 miles from the conference venue; find the city with population 500,000 or more that is nearest to Kalamazoo, MI; find all cities that lie on the Nile in Egypt; find all parts that touch the fuselage (in a plane design).

 Similarity queries (content-based retrieval): given a face, find the five most similar faces.
 Multidimensional range queries: 50 < age < 55 AND 80K < sal < 90K.
Drawbacks
 An index based on spatial location is needed. One-dimensional indexes don't support multidimensional searching efficiently. Hash indexes only support point queries; we want to support range queries as well. Inserts and deletes must be supported gracefully.
 Ideally, we also want to support non-point data (e.g., lines, shapes).
 The R-tree meets these requirements, and variants are widely used today.
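A minimal sketch, on hypothetical employee data, of the multidimensional range query mentioned above evaluated by a full scan, which is exactly the work a multidimensional index such as an R-tree is intended to avoid.

    # Sketch: the range query 50 < age < 55 AND 80K < sal < 90K by linear scan.
    employees = [
        {"name": "Ann", "age": 53, "sal": 85_000},
        {"name": "Bob", "age": 41, "sal": 95_000},
        {"name": "Cho", "age": 54, "sal": 82_000},
    ]

    result = [e for e in employees
              if 50 < e["age"] < 55 and 80_000 < e["sal"] < 90_000]
    # -> Ann and Cho; a one-dimensional index on age alone would still have to
    #    examine every employee in that age range regardless of salary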

MULTIMEDIA DATABASES
 To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database rather than storing them outside the database in a file system.
 The database must handle large object representation.
 Similarity-based retrieval must be provided by special index structures.
 Guaranteed steady retrieval rates must be provided for continuous-media data.
Multimedia Data Formats
 Store and transmit multimedia data in compressed form: JPEG and GIF are the most widely used formats for image data; the MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
 MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
 MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
 Several alternatives for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
 The most important types are video and audio data.
 Characterized by high data volumes and real-time information-delivery requirements:
 Data must be delivered sufficiently fast that there are no gaps in the audio or video.
 Data must be delivered at a rate that does not cause overflow of system buffers.
 Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.


Video Servers
 Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
 Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
 Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
 Head-end terminals – used to view multimedia data: PCs or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
 Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs for registering a new trademark.
 Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
 Handwritten data: identify a handwritten data item or command stored in the database.

DATA WAREHOUSING AND MINING
Data Warehouse:
 Large organizations have complex internal organizations and have data stored at different locations, on different operational (transaction-processing) systems, under different schemas.
 Data sources often store only current data, not historical data.
 Corporate decision making requires a unified view of all organizational data, including historical data.
 A data warehouse is a repository (archive) of information gathered from

multiple sources, stored under a unified schema, at a single site.
 Greatly simplifies querying and permits the study of historical trends.
 Shifts the decision-support query load away from transaction-processing systems.
Components of a Data Warehouse
When and how to gather data
 Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night).
 Destination-driven architecture: the warehouse periodically requests new information from data sources.
 Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive.

 It is usually OK to have slightly out-of-date data at the warehouse.
 Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
 Schema integration
Data cleansing
 E.g. correct mistakes in addresses

 E.g. misspellings, zip code errors.
 Merge address lists from different sources and purge duplicates.
 Keep only one address record per household ("householding").
How to propagate updates
 The warehouse schema may be a (materialized) view of the schema from the data sources.
 Efficient techniques for update of materialized views are needed.
What data to summarize
 Raw data may be too large to store on-line.
 Aggregate values (totals/subtotals) often suffice.
 Queries on raw data can often be transformed by the query optimizer to use aggregate values.

WAREHOUSE SCHEMA
 Typically warehouse data is multidimensional, with very large fact tables.
 Examples of dimensions: item-id, date/time of sale, store where the sale was made, customer identifier.
 Examples of measures: number of items sold, price of items.
 Dimension values are usually encoded using small integers and mapped to full values via dimension tables.
 The resulting schema is called a star schema; a minimal sketch follows below.
 More complicated schema structures:
 Snowflake schema: multiple levels of dimension tables.
 Constellation: multiple fact tables.
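The sketch referred to above: a star schema over hypothetical sales data, with a fact table whose dimension values are small integer codes mapped to full values via dimension tables; all table and column names are assumptions for illustration.

    # Sketch: star schema = one fact table surrounded by dimension tables.
    item_dim  = {1: "blue jeans", 2: "t-shirt"}           # dimension tables
    store_dim = {1: "Chennai", 2: "Madurai"}

    # fact table: dimension codes plus measures (units sold, price)
    sales_fact = [
        {"item_id": 1, "store_id": 2, "date": "2015-03-01", "units": 4, "price": 1200},
        {"item_id": 2, "store_id": 1, "date": "2015-03-01", "units": 9, "price": 350},
    ]

    # a typical warehouse query: total units sold per store name
    totals = {}
    for row in sales_fact:
        store = store_dim[row["store_id"]]                # join fact -> dimension
        totals[store] = totals.get(store, 0) + row["units"]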


Data Mining
 Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
 Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
 It differs from machine learning in that it deals with large volumes of data stored primarily on disk.
 Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars".
 Other types of knowledge are represented by equations, or by prediction functions.
 Some manual intervention is usually required:
 pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.
Applications of Data Mining
 Prediction based on past history:
 Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
 Predict whether a customer is likely to switch brand loyalty.
 Predict whether a customer is likely to respond to "junk mail".


 Predict whether a pattern of phone calling card usage is likely to be fraudulent.
 Some examples of prediction mechanisms:
 Classification: given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
 Regression formulae: given a set of parameter-value to function-result mappings for an unknown function, predict the function result for a new parameter value.
 Descriptive Patterns

Associations
 Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buy the others too (a minimal sketch of the underlying support/confidence computation follows after this list).
 Other similar applications: camera accessories, clothes, etc.
 Associations may also be used as a first step in detecting causation:
 e.g. association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters
 E.g. typhoid cases were clustered in an area surrounding a contaminated well.
 Detection of clusters remains important in detecting epidemics.
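The sketch referred to above: computing support and confidence for the association "customers who buy book A also buy book B" over a hypothetical set of transactions; the data and thresholds are illustrative only.

    # Sketch: support and confidence of an association rule A -> B.
    transactions = [
        {"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}, {"A", "B"},
    ]

    def support(itemset):
        # fraction of transactions containing every item in the set
        return sum(1 for t in transactions if itemset <= t) / len(transactions)

    def confidence(lhs, rhs):
        # of the transactions containing lhs, the fraction that also contain rhs
        return support(lhs | rhs) / support(lhs)

    print(support({"A", "B"}))        # 0.6  -- 3 of 5 transactions
    print(confidence({"A"}, {"B"}))   # 0.75 -- 3 of the 4 transactions with A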

DATA MART
A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse that is usually oriented to a specific business line or team. Data marts are small slices of the data warehouse. Whereas data warehouses have an enterprise-wide depth, the information in data marts pertains to a single department. In some deployments, each department or business unit is considered the owner of

its data mart including all the hardware, software and data. This enables each department to use, manipulate and develop their data any way they see fit, without altering information inside other data marts or the data warehouse. In other deployments where conformed dimensions are used, this business unit ownership will not hold true for shared dimensions like customer, product, etc. Organizations build data warehouses and data marts because the information in the database is not organized in a way that makes it easy to find what they need. Also, complicated queries might take a long time to answer what people want to know, since the database systems are designed to process millions of transactions per day. Transactional databases are designed to be updated; data warehouses or marts, however, are read-only. Data warehouses are designed to access large groups of related records. Data marts improve end-user response time by allowing users to have access to the specific type of data they need to view most often, by providing the data in a way that supports the collective view of a group of users. A data mart is basically a condensed and more focused version of a data warehouse that reflects the regulations and process specifications of each business unit within an organization. Each data mart is dedicated to a specific business function or region. This subset of data may span across many or all of an enterprise's functional subject areas. It is common for multiple data marts to be used in order to serve the needs of each individual business unit (different data marts can be used to obtain specific information for various enterprise departments, such as accounting, marketing, sales, etc.).


UNIT V ADVANCED TOPICS
DATABASE SECURITY: Data Classification - Threats and risks - Database access Control - Types of Privileges - Cryptography - Statistical Databases - Distributed Databases - Architecture - Transaction Processing - Data Warehousing and Mining - Classification - Association rules - Clustering - Information Retrieval - Relevance ranking - Crawling and Indexing the Web - Object Oriented Databases - XML Databases.
5.1 DATABASE SECURITY: Data Classification
Database security concerns the use of a broad range of information security controls to protect databases (potentially including the data, the database applications or stored functions, the database systems, the database servers and the associated network links) against compromises of their confidentiality, integrity and availability. It involves various types or categories of controls, such as technical, procedural/administrative and physical. Database security is a specialist topic within the broader realms of computer security, information security and risk management.
Security risks to database systems include, for example:

Unauthorized or unintended activity or misuse by authorized database users, database administrators, or network/systems managers, or by unauthorized users or hackers (e.g. inappropriate access to sensitive data, metadata or functions within databases, or inappropriate changes to the database programs, structures or security configurations);



Malware infections causing incidents such as unauthorized access, leakage or disclosure of personal or proprietary data, deletion of or damage to the data or programs, interruption or denial of authorized access to the database, attacks on other systems and the unanticipated failure of database services;



Overloads, performance constraints and capacity issues resulting in the inability of authorized users to use databases as intended;




Physical damage to database servers caused by computer room fires or floods, overheating, lightning, accidental liquid spills, static discharge, electronic breakdowns/equipment failures and obsolescence; 

Design flaws and programming bugs in databases and the associated programs and systems, creating various security vulnerabilities (e.g. unauthorized privilege escalation), data loss/corruption, performance degradation etc.;



Data corruption and/or loss caused by the entry of invalid data or commands, mistakes in database or system administration processes, sabotage/criminal damage etc.



Types of Security –

Legal and ethical issues



Policy issues



System-related issues

The need to identify multiple security levels.
Threats to databases:
- Loss of integrity
- Loss of availability
- Loss of confidentiality
5.2 Threats and risks
Types of threats to database security
1. Privilege abuse: When database users are provided with privileges that exceed their day-to-day job requirements, these privileges may be abused intentionally or unintentionally. Take, for instance, a database administrator in a financial institution. What will happen if he turns off audit trails or creates bogus accounts? He will be able to transfer money from one account to another, thereby abusing the excessive privilege intentionally. Having seen how privilege can be abused intentionally, let us see how privilege can be abused unintentionally. A company provides a ―work from home‖ option to its employees

and the employee takes a backup of sensitive data to work on from his home. This not only violates the security policies of the organization, but may also result in a data security breach if the system at home is compromised.
2. Operating System vulnerabilities: Vulnerabilities in underlying operating systems like Windows, UNIX, Linux, etc., and in the services that are related to the databases could lead to unauthorized access. This may lead to a Denial of Service (DoS) attack. This could be prevented by updating the operating system related security patches as and when they become available.
3. Database rootkits: A database rootkit is a program or a procedure that is hidden inside the database and that provides administrator-level privileges to gain access to the data in the database. These rootkits may even turn off alerts triggered by Intrusion Prevention Systems (IPS). It is possible to install a rootkit only after compromising the underlying operating system. This can be avoided by periodic audit trails; otherwise the presence of the database rootkit may go undetected.
4. Weak authentication: Weak authentication models allow attackers to employ strategies such as social engineering and brute force to obtain database login credentials and assume the identity of legitimate database users.
5. Weak audit trails: A weak audit logging mechanism in a database server represents a critical risk to an organization, especially in retail, financial, healthcare, and other industries with stringent regulatory compliance. Regulations such as PCI, SOX, and HIPAA demand extensive logging of actions to reproduce an event at a later point of time in case of an incident. Logging of sensitive or unusual transactions happening in a database must be done in an automated manner for resolving incidents. Audit trails act as the last line of database defense. Audit trails can detect the existence of a violation and help trace the violation back to a particular point of time and a particular user.
5.3 Database Access Control

To protect databases against these types of threats, four kinds of countermeasures can be implemented: access control, inference control, flow control, and encryption.

A DBMS typically includes a database security and authorization subsystem that is responsible for ensuring the security of portions of a database against unauthorized access. There are two types of database security mechanisms:  Discretionary security mechanisms 

Mandatory security mechanisms

 The security mechanism of a DBMS must include provisions for restricting access to the database as a whole; this function is called access control and is handled by creating user accounts and passwords to control the login process by the DBMS.
Discretionary Access Control Based on Granting and Revoking Privileges
The typical method of enforcing discretionary access control in a database system is based on the granting and revoking of privileges.
5.3.1 Types of Discretionary Privileges
 The account level: At this level, the DBA specifies the particular privileges that each account holds independently of the relations in the database.
 The relation (or table) level: At this level, the DBA can control the privilege to access each individual relation or view in the database.
The privileges at the account level apply to the capabilities provided to the account itself and can include the CREATE SCHEMA or CREATE TABLE privilege, to create a schema or base relation; the CREATE VIEW privilege; the ALTER privilege, to apply schema changes such as adding or removing attributes from relations; the DROP privilege, to delete relations or views; the MODIFY privilege, to insert, delete, or update tuples; and the SELECT privilege, to retrieve information from the database by using a SELECT query. The second level of privileges applies to the relation level, whether they are base relations or virtual (view) relations. The granting and revoking of privileges generally follows an authorization model for discretionary privileges known as the access matrix model, where the rows of a matrix M represent subjects (users, accounts, programs) and the columns represent


objects (relations, records, columns, views, operations). Each position M(i,j) in the matrix represents the types of privileges (read, write, update) that subject i holds on object j. To control the granting and revoking of relation privileges, each relation R in a database is assigned an owner account, which is typically the account that was used when the relation was created in the first place. The owner of a relation is given all privileges on that relation. In SQL2, the DBA can assign an owner to a whole schema by creating the schema and associating the appropriate authorization identifier with that schema, using the CREATE SCHEMA command. The owner account holder can pass privileges on any of the owned relations to other users by granting privileges to their accounts. In SQL the following types of privileges can be granted on each individual relation R:
 SELECT (retrieval or read) privilege on R: Gives the account retrieval privilege. In SQL this gives the account the privilege to use the SELECT statement to retrieve tuples from R.
 MODIFY privileges on R: This gives the account the capability to modify tuples of R. In SQL this privilege is further divided into UPDATE, DELETE, and INSERT privileges to apply the corresponding SQL command to R. In addition, both the INSERT and UPDATE privileges can specify that only certain attributes can be updated by the account.
 REFERENCES privilege on R: This gives the account the capability to reference relation R when specifying integrity constraints. The privilege can also be restricted to specific attributes of R.
Notice that to create a view, the account must have SELECT privilege on all relations involved in the view definition.
5.3.2 Specifying Privileges Using Views
The mechanism of views is an important discretionary authorization mechanism in its own right. For example, if the owner A of a relation R wants another account B to be able to retrieve only some fields of R, then A can create a view V of R that includes only those

attributes and then grant SELECT on V to B. The same applies to limiting B to retrieving only certain tuples of R; a view V‘ can be created by defining the view by means of a query that selects only those tuples from R that A wants to allow B to access.
5.3.3 Revoking Privileges
In some cases it is desirable to grant a privilege to a user temporarily. For example, the owner of a relation may want to grant the SELECT privilege to a user for a specific task and then revoke that privilege once the task is completed. Hence, a mechanism for revoking privileges is needed. In SQL, a REVOKE command is included for the purpose of canceling privileges.
5.3.4 Mandatory Access Control and Role-Based Access Control for Multilevel Security
The discretionary access control technique of granting and revoking privileges on relations has traditionally been the main security mechanism for relational database systems. This is an all-or-nothing method: a user either has or does not have a certain privilege. In many applications, an additional security policy is needed that classifies data and users based on security classes. This approach, known as mandatory access control, would typically be combined with the discretionary access control mechanisms. Typical security classes are top secret (TS), secret (S), confidential (C), and unclassified (U), where TS is the highest level and U the lowest: TS ≥ S ≥ C ≥ U. The commonly used model for multilevel security, known as the Bell-LaPadula model, classifies each subject (user, account, program) and object (relation, tuple, column, view, operation) into one of the security classifications TS, S, C, or U. We refer to the clearance (classification) of a subject S as class(S) and to the classification of an object O as class(O). Two restrictions are enforced on data access based on the subject/object classifications:
1. A subject S is not allowed read access to an object O unless class(S) ≥ class(O). This is known as the simple security property.
2. A subject S is not allowed to write an object O unless class(S) ≤ class(O). This is known as the star property (or * property).


The value of the TC attribute in each tuple t – which is the highest of all attribute classification values within t – provides a general classification for the tuple itself, whereas each Ci provides a finer security classification for each attribute value within the tuple. The apparent key of a multilevel relation is the set of attributes that would have formed the primary key in a regular (single-level) relation. A multilevel relation will appear to contain different data to subjects (users) with different clearance levels. In some cases, it is possible to store a single tuple in the relation at a higher classification level and produce the corresponding tuples at a lower-level classification through a process known as filtering. In other cases, it is necessary to store two or more tuples at different classification levels with the same value for the apparent key. This leads to the concept of polyinstantiation, where several tuples can have the same apparent key value but have different attribute values for users at different classification levels. In general, the entity integrity rule for multilevel relations states that all attributes that are members of the apparent key must not be null and must have the same security classification within each individual tuple. In addition, all other attribute values in the tuple must have a security classification greater than or equal to that of the apparent key. This constraint ensures that a user can see the key if the user is permitted to see any part of the tuple at all. Other integrity rules, called null integrity and interinstance integrity, informally ensure that if a tuple value at some security level can be filtered (derived) from a higher-classified tuple, then it is sufficient to store the higher-classified tuple in the multilevel relation.
5.3.5 Role-Based Access Control
Role-based access control (RBAC) emerged rapidly in the 1990s as a proven technology for managing and enforcing security in large-scale enterprise-wide systems. Its basic notion is that permissions are associated with roles, and users are assigned to appropriate roles. Roles can be created using the CREATE ROLE and DESTROY ROLE commands. The GRANT

and REVOKE commands discussed under DAC can then be used to assign privileges to roles and revoke privileges from roles.
 RBAC appears to be a viable alternative to traditional discretionary and mandatory access controls; it ensures that only authorized users are given access to certain data or resources.
 Many DBMSs have allowed the concept of roles, where privileges can be assigned to roles.
 A role hierarchy in RBAC is a natural way of organizing roles to reflect the organization‘s lines of authority and responsibility.
 Another important consideration in RBAC systems is the possible temporal constraints that may exist on roles, such as the time and duration of role activations, and timed triggering of a role by an activation of another role.
 Using an RBAC model is a highly desirable goal for addressing the key security requirements of Web-based applications. In contrast, discretionary access control (DAC) and mandatory access control (MAC) models lack the capabilities needed to support the security requirements of emerging enterprises and Web-based applications.
5.4 Types of Privileges
Privileges: Privileges define the access rights provided to a user on a database object. There are two types of privileges:
1) System privileges - These allow the user to CREATE, ALTER, or DROP database objects.
2) Object privileges - These allow the user to EXECUTE, SELECT, INSERT, UPDATE, or DELETE data from database objects to which the privileges apply.
A few CREATE system privileges are listed below:


System Privileges and their descriptions:
• CREATE object - allows users to create the specified object in their own schema.
• CREATE ANY object - allows users to create the specified object in any schema.
The above rules also apply for the ALTER and DROP system privileges.
A few of the object privileges are listed below:
• INSERT - allows users to insert rows into a table.
• SELECT - allows users to select data from a database object.
• UPDATE - allows users to update data in a table.
• EXECUTE - allows users to execute a stored procedure or a function.
Roles: Roles are a collection of privileges or access rights. When there are many users in a database it becomes difficult to grant or revoke privileges for each user individually. Therefore, if you define roles, you can grant or revoke privileges to roles, thereby automatically granting or revoking those privileges for the users assigned to them. You can either create roles or use the system roles pre-defined by Oracle. Some of the privileges granted to the system roles are as given below:
• CONNECT - CREATE TABLE, CREATE VIEW, CREATE SYNONYM, CREATE SEQUENCE, CREATE SESSION, etc.
• RESOURCE - CREATE PROCEDURE, CREATE SEQUENCE, CREATE TABLE, CREATE TRIGGER, etc. The primary usage of the RESOURCE role is to restrict access to database objects.
• DBA - ALL SYSTEM PRIVILEGES
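These privilege and role mechanisms can be written directly in SQL. The sketch below is illustrative only; the table, view, user and role names (employee, emp_public, usera, userb, clerk_role) are hypothetical, and exact syntax varies slightly between DBMSs.

-- Owner grants restricted access through a view, bundles privileges into a
-- role, and later revokes a privilege (all names here are hypothetical).
CREATE VIEW emp_public AS
    SELECT emp_name, dept FROM employee;          -- hides salary and other columns

GRANT SELECT ON emp_public TO usera;              -- object privilege on the view
GRANT INSERT, UPDATE ON employee TO userb WITH GRANT OPTION;

CREATE ROLE clerk_role;                           -- role-based administration
GRANT SELECT, INSERT ON employee TO clerk_role;
GRANT clerk_role TO usera;

REVOKE INSERT, UPDATE ON employee FROM userb;     -- withdraw a temporary privilege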


5.5 Cryptography A DBMS can use encryption to protect information in certain situations where the normal security mechanisms of the DBMS are not adequate. For example, an intruder may steal tapes containing some data or tap a communication line. By storing and transmitting data in an encrypted form, the DBMS ensures that such stolen data is not intelligible to the intruder. Thus, encryption is a technique to provide privacy of data.

In encryption, the message to be encrypted is known as plaintext. The plaintext is transformed by a function that is parameterized by a key. The output of the encryption process is known as the ciphertext. The ciphertext is then transmitted over the network. The process of converting the plaintext to ciphertext is called encryption, and the process of converting the ciphertext to plaintext is called decryption. Encryption is performed at the transmitting end and decryption is performed at the receiving end. For the encryption process we need the encryption key, and for the decryption process we need the decryption key, as shown in the figure. Without knowledge of the decryption key an intruder cannot break the ciphertext back into plaintext. This process is also called cryptography. The basic idea behind encryption is to apply an encryption algorithm, which may be accessible to the intruder, to the original data and a user-specified or DBA-specified encryption key, which is kept secret. The output of the algorithm is the encrypted version of the data. There is also a decryption algorithm, which takes the encrypted data and the decryption key as input and then returns the original data. Without the correct decryption key, the decryption algorithm produces gibberish. The encryption and decryption keys may be the same or different, but there must be a relation between the two, and that relation must be kept secret.

5.5.1 Techniques used for Encryption
The following techniques are used for the encryption process:
• Substitution Ciphers
• Transposition Ciphers
Substitution Ciphers: In a substitution cipher each letter or group of letters is replaced by another letter or group of letters to mask them. For example: a is replaced with D, b with E, c with F and z with C. In this way attack becomes DWWDFN. Substitution ciphers are not very secure because an intruder can easily guess the substitution characters.
Transposition Ciphers: Substitution ciphers preserve the order of the plaintext symbols but mask them. The transposition cipher, in contrast, reorders the letters but does not mask them. For this process a key is used. For example: iliveinqadian may be coded as divienaniqnli. Transposition ciphers are more secure as compared to substitution ciphers.
5.5.2 Algorithms for Encryption Process
There are two commonly used algorithms for the encryption process. These are:
• Data Encryption Standard (DES)
• Public Key Encryption
Data Encryption Standard (DES): It uses both a substitution of characters and a rearrangement of their order on the basis of an encryption key. The main weakness of this approach is that authorized users must be told the encryption key, and the mechanism for communicating this information is vulnerable to clever intruders.
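The two basic building blocks just described, substitution and rearrangement (transposition), can be sketched in a few lines of Python. Both are toy ciphers given purely for intuition; the shift value and column count are arbitrary choices, and neither scheme is secure.

# Toy substitution (Caesar shift) and columnar transposition ciphers.
# These only illustrate the two ideas above; neither is secure.

def substitute(plaintext, shift=3):
    # replace each letter by the letter 'shift' positions later: a -> D, b -> E ...
    out = []
    for ch in plaintext.lower():
        if ch.isalpha():
            out.append(chr((ord(ch) - ord('a') + shift) % 26 + ord('A')))
        else:
            out.append(ch)
    return "".join(out)

def transpose(plaintext, columns=4):
    # write the text row by row into 'columns' columns, then read column by column
    rows = [plaintext[i:i + columns] for i in range(0, len(plaintext), columns)]
    return "".join("".join(row[c] for row in rows if c < len(row))
                   for c in range(columns))

print(substitute("attack"))        # -> DWWDFN, as in the example above
print(transpose("iliveinqadian"))  # letters reordered but not masked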


Public Key Encryption
Another approach to encryption, called public-key encryption, has become increasingly popular in recent years. The encryption scheme proposed by Rivest, Shamir, and Adleman, called RSA, is a well-known example of public-key encryption. Each authorized user has a public encryption key, known to everyone, and a private decryption key (used by the decryption algorithm), chosen by the user and known only to him or her. The encryption and decryption algorithms themselves are assumed to be publicly known. Consider a user called Suneet. Anyone can send Suneet a secret message by encrypting the message using Suneet's publicly known encryption key. Only Suneet can decrypt this secret message, because the decryption algorithm requires Suneet's decryption key, known only to Suneet. Since users choose their own decryption keys, the weakness of DES is avoided.
The main issue for public-key encryption is how encryption and decryption keys are chosen. Technically, public-key encryption algorithms rely on the existence of one-way functions, which are functions whose inverse is computationally very hard to determine. The RSA algorithm, for example, is based on the observation that although checking whether a given number is prime is easy, determining the prime factors of a non-prime number is extremely hard. (Determining the prime factors of a number with over 100 digits can take years of CPU time on the fastest available computers today.)
We now sketch the intuition behind the RSA algorithm, assuming that the data to be encrypted is an integer I. To choose an encryption key and a decryption key, our friend Suneet creates a public key by computing the product of two large prime numbers P1 and P2. The private key consists of the pair (P1, P2); the decryption algorithm cannot be used unless P1 and P2 are known individually. So we publish the product P1*P2, but an unauthorized user would need to be able to factor P1*P2 to steal data. By choosing P1 and P2 to be sufficiently large (over 100 digits), we can make it very difficult (or nearly impossible) for an intruder to factorize the product.
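The intuition above can be made concrete with a toy RSA computation. The sketch below uses deliberately tiny primes (61 and 53) so the arithmetic is visible; real RSA uses primes of hundreds of digits plus padding schemes, so this is illustrative only.

# Toy RSA sketch with tiny primes (insecure; for intuition only).

def egcd(a, b):
    # extended Euclid: returns (g, x, y) with a*x + b*y = g
    if b == 0:
        return a, 1, 0
    g, x, y = egcd(b, a % b)
    return g, y, x - (a // b) * y

def modinv(a, m):
    g, x, _ = egcd(a, m)
    if g != 1:
        raise ValueError("no modular inverse")
    return x % m

p1, p2 = 61, 53                # the two secret primes (toy values)
n = p1 * p2                    # published modulus
phi = (p1 - 1) * (p2 - 1)
e = 17                         # public exponent, coprime to phi
d = modinv(e, phi)             # private exponent; derivable only if p1, p2 are known

plaintext = 42                 # the integer I to protect
ciphertext = pow(plaintext, e, n)
recovered = pow(ciphertext, d, n)
print(ciphertext, recovered)   # recovered == 42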


Although this technique is secure, it is also computationally expensive. A hybrid scheme used for secure communication is to exchange DES keys via a public-key encryption scheme, and then use DES encryption on the data transmitted subsequently.
5.5.3 Disadvantages of Encryption
Encryption has the following problems: 

Key management (i.e. keeping keys secret) is a problem. Even in public-key encryption the decryption key must be kept secret.



Even in a system that supports encryption, data must often be processed in plaintext form. Thus sensitive data may still be accessible to transaction programs.



Encrypting data gives rise to serious technical problems at the level of physical storage organization. For example indexing over data, which is stored in encrypted form, can be very difficult.

5.6 STATISTICAL DATABASE
A statistical database is a database used for statistical analysis purposes. It is an OLAP (online analytical processing) system, rather than an OLTP (online transaction processing) system. Modern decision-support databases and classical statistical databases are often closer to the relational model than to the multidimensional model commonly used in OLAP systems today.


Statistical databases typically contain parameter data and the measured data for these parameters. For example, parameter data consists of the different values for varying conditions in an experiment (e.g., temperature, time). The measured data (or variables) are the measurements taken in the experiment under these varying conditions. Many statistical databases are sparse, with many null or zero values. It is not uncommon for a statistical database to be 40% to 50% sparse. There are two options for dealing with the sparseness: (1) leave the null values in there and use compression techniques to squeeze them out, or (2) remove the entries that only have null values.
5.7 DISTRIBUTED DATABASE
A distributed database is a database in which the storage devices are not all attached to a common processing unit such as the CPU; it is controlled by a distributed database management system (together sometimes called a distributed database system).
5.7.1 Architecture of DBMS


5.7.2 Component Architecture of Distributed Database


5.8 TRANSACTION PROCESSING
In computer science, transaction processing is information processing that is divided into individual, indivisible operations, called transactions. Each transaction must succeed or fail as a complete unit; it can never be only partially complete. Transaction processing systems consist of computer hardware and software hosting a transaction-oriented application that performs the routine transactions necessary to conduct business. Examples include systems that manage sales order entry, airline reservations, payroll, employee records, manufacturing, and shipping. Transaction processing is designed to maintain the integrity of a system (typically a database or some modern file systems) in a known, consistent state, by ensuring that interdependent operations on the system are either all completed successfully or all canceled successfully.
5.8.1 ACID Properties
A transaction may contain several low-level tasks, and a transaction is a very small unit of any program. A transaction in a database system must maintain some properties in order to ensure the accuracy of its completeness and data integrity. These properties are referred to as ACID properties and are mentioned below: 

Atomicity: Although a transaction involves several low-level operations, this property states that a transaction must be treated as an atomic unit, that is, either all of its operations are executed or none. There must be no state in the database where the transaction is left partially completed. States should be defined either before the execution of the transaction or after the execution/abortion/failure of the transaction.



Consistency: This property states that after the transaction is finished, the database must remain in a consistent state. There must not be any possibility that some data is incorrectly affected by the execution of the transaction. If the database was in a consistent state before the execution of the transaction, it must remain in a consistent state after the execution of the transaction.



Durability: This property states that in any case all updates made on the database will persist, even if the system fails and restarts. If a transaction writes or updates some data in the database and commits, that data will always be there in the database. If the transaction

commits but data is not written on the disk and the system fails, that data will be updated once the system comes up. 

Isolation: In a database system where more than one transaction is being executed simultaneously and in parallel, the property of isolation states that all the transactions will be carried out and executed as if each is the only transaction in the system. No transaction will affect the existence of any other transaction. A small sketch of a transaction written in SQL is shown below.
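The following is a minimal SQL sketch of a funds-transfer transaction; the account table and column names are hypothetical. Either the COMMIT makes both updates durable, or a ROLLBACK (issued by the application or forced by a failure) undoes the partial work, which is exactly the atomicity guarantee described above.

BEGIN TRANSACTION;

UPDATE account SET balance = balance - 100 WHERE account_number = 'A-101';
UPDATE account SET balance = balance + 100 WHERE account_number = 'A-102';

COMMIT;          -- both updates become permanent (durability)
-- ROLLBACK;     -- issued instead of COMMIT if anything goes wrong (atomicity)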

5.9 DATA WAREHOUSING
Data sources often store only current data, not historical data. Corporate decision making requires a unified view of all organizational data, including historical data. A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site. It greatly simplifies querying, permits the study of historical trends, and shifts the decision-support query load away from transaction processing systems.

DESIGN ISSUES

When and how to gather data: In a source-driven architecture, data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night). In a destination-driven architecture, the warehouse periodically requests new information from the data sources. Keeping the warehouse exactly synchronized with the data sources (e.g. using two-phase commit) is too expensive.

What schema to use: schema integration.
Data cleansing: e.g. correct mistakes in addresses (misspellings, zip code errors); merge address lists from different sources and purge duplicates.
How to propagate updates: the warehouse schema may be a (materialized) view of the schema from the data sources.
What data to summarize: raw data may be too large to store on-line; aggregate values (totals/subtotals) often suffice, and queries on raw data can often be transformed by the query optimizer to use aggregate values.
Dimension values are usually encoded using small integers and mapped to full values via dimension tables. The resultant schema is called a star schema. More complicated schema structures:
- Snowflake schema: multiple levels of dimension tables
- Constellation: multiple fact tables
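As an illustration of the star schema idea, here is a small SQL sketch; the fact and dimension table names and columns (sales_fact, date_dim, item_dim, store_dim) are invented for the example.

-- A minimal star-schema sketch: one central fact table references small
-- dimension tables via encoded keys (all names here are hypothetical).
CREATE TABLE date_dim (
    date_id   INTEGER PRIMARY KEY,
    day       INTEGER,
    month     INTEGER,
    year      INTEGER
);

CREATE TABLE item_dim (
    item_id   INTEGER PRIMARY KEY,
    item_name VARCHAR(50),
    category  VARCHAR(30)
);

CREATE TABLE store_dim (
    store_id  INTEGER PRIMARY KEY,
    city      VARCHAR(30),
    region    VARCHAR(30)
);

-- Fact table: measures (quantity, amount) plus foreign keys to the dimensions.
CREATE TABLE sales_fact (
    date_id   INTEGER REFERENCES date_dim(date_id),
    item_id   INTEGER REFERENCES item_dim(item_id),
    store_id  INTEGER REFERENCES store_dim(store_id),
    quantity  INTEGER,
    amount    DECIMAL(10,2)
);

-- A typical decision-support query aggregates the fact table by dimensions.
SELECT d.year, i.category, SUM(f.amount) AS total_sales
FROM sales_fact f
JOIN date_dim d ON f.date_id = d.date_id
JOIN item_dim i ON f.item_id = i.item_id
GROUP BY d.year, i.category;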


5.10 DATA MINING

Data mining is the process of semi-automatically analyzing large databases to find useful patterns.
Prediction based on past history: predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history; predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
• Classification: given a new item whose class is unknown, predict to which class it belongs.
• Regression formulae: given a set of mappings for an unknown function, predict the function result for a new parameter value.
Descriptive patterns:
• Associations: find books that are often bought by ―similar‖ customers. If a new such customer buys one such book, suggest the others too. Associations may be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer.
• Clusters: e.g. typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.
5.11 Classification:

Classification rules help assign new objects to classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk? Classification rules for the above example could use a variety of data, such as educational level, salary, age, etc.
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent

∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good

Rules are not necessarily exact: there may be some misclassifications. Classification rules can be shown compactly as a decision tree.

CONSTRUCTION OF DECISION TREES

Training set: a data sample in which the classification is already known. Decision trees are generated greedily, top down. Each internal node of the tree partitions the data into groups based on a partitioning attribute and a partitioning condition for the node. At a leaf node, all (or most) of the items at the node belong to the same class, or all attributes have been considered and no further partitioning is possible.


The purity of a set S of training instances can be measured quantitatively in several ways. Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = pi. The Gini measure of purity is defined as
Gini(S) = 1 − (p1² + p2² + … + pk²)
When all instances are in a single class, the Gini value is 0. It reaches its maximum (of 1 − 1/k) if each class contains the same number of instances.
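A direct transcription of this formula into Python (with made-up class labels) looks like the following.

# A small sketch of the Gini purity measure described above:
# Gini(S) = 1 - sum of p_i squared over the k classes in S.
from collections import Counter

def gini(labels):
    """labels: class label of each training instance in the set S."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["good", "good", "good"]))        # 0.0  (pure set)
print(gini(["good", "bad", "good", "bad"]))  # 0.5  (maximum for k = 2)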

DECISION TREE CONSTRUCTION ALGORITHM

Procedure GrowTree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > p or |S| < s) then return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr;
    for i = 1, 2, …, r
        Partition(Si);

NAÏVE BAYESIAN CLASSIFIERS
Bayesian classifiers require
• computation of p(d | cj)
• precomputation of p(cj)
By Bayes' theorem, p(cj | d) is proportional to p(d | cj) * p(cj); p(d) can be ignored since it is the same for all classes. To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)

Each of the p (di | cj ) can be estimated from a histogram on di values for each class cj

 the histogram is computed from the training instances

Histograms on multiple attributes are more expensive to compute and store
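A minimal Python sketch of this estimation scheme follows; the tiny training set, attribute values and class labels are made up for illustration, and the per-attribute counters play the role of the p(di | cj) histograms described above.

# A minimal naïve Bayesian classifier sketch for categorical attributes
# (training data, attribute values and classes are hypothetical).
from collections import Counter, defaultdict

training = [  # (attribute tuple, class)
    (("masters", "high"), "excellent"),
    (("bachelors", "medium"), "good"),
    (("masters", "medium"), "good"),
    (("bachelors", "high"), "excellent"),
]

classes = Counter(c for _, c in training)            # counts for p(cj)
hist = defaultdict(Counter)                          # histograms for p(di | cj)
for attrs, c in training:
    for i, v in enumerate(attrs):
        hist[(c, i)][v] += 1

def classify(attrs):
    best_class, best_score = None, -1.0
    for c, n_c in classes.items():
        score = n_c / len(training)                  # p(cj)
        for i, v in enumerate(attrs):
            score *= hist[(c, i)][v] / n_c           # p(di | cj) from histogram
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(("masters", "high")))                 # -> "excellent"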

REGRESSION

Regression deals with the prediction of a value, rather than a class. Given values for a set of variables, X1, X2, …, Xn, we wish to predict the value of a variable Y. One way is to infer coefficients a0, a1, a2, …, an such that
Y = a0 + a1 * X1 + a2 * X2 + … + an * Xn

Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting. The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial. Regression aims to find coefficients that give the best possible fit.
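Such coefficients can be obtained by least squares; the short Python sketch below uses NumPy on a made-up data set in which Y happens to equal 1 + X1 + 2*X2 exactly.

# A small linear-regression sketch using least squares (the data is made up).
import numpy as np

# Observed values of X1, X2 and the variable Y we want to predict.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
Y = np.array([6.0, 5.0, 12.0, 11.0])

# Add a column of ones so the intercept a0 is fitted as well.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)

a0, a1, a2 = coeffs
print(a0, a1, a2)                    # fitted coefficients
print(A @ coeffs)                    # predicted Y = a0 + a1*X1 + a2*X2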

5.12 ASSOCIATION RULES
Retail shops are often interested in associations between different items that people buy. Someone who buys bread is quite likely also to buy milk. A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts. Association information can be used in several ways. E.g. when a customer buys a particular book, an online shop may suggest associated books.
Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks. The left hand side is the antecedent, the right hand side the consequent. An association rule must have an associated population; the population consists of a set of instances. E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence. Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. Confidence is a measure of how often the consequent is true when the antecedent is true. E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
FINDING ASSOCIATION RULES
We are generally only interested in association rules with reasonably high support (e.g. support of 2% or greater). Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set find its support (i.e. count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A generate the rule A − {b} ⇒ b for each b ∈ A. Support of rule = support(A). Confidence of rule = support(A) / support(A − {b}).
Determine the support of itemsets via a single pass on the set of transactions; large itemsets are the sets with a high count at the end of the pass. If memory is not enough to hold all counts for all itemsets, use multiple passes, considering only some itemsets in each pass. Optimization: once an itemset is eliminated because its count (support) is too small, none of its supersets needs to be considered.
The a priori technique to find large itemsets:
Pass 1: count the support of all sets with just 1 item. Eliminate those items with low support.
Pass i: candidates are every set of i items such that all of its (i−1)-item subsets are large. Count the support of all candidates. Stop if there are no candidates.
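The support/confidence computation and the first two a-priori passes can be sketched in Python as follows; the toy transactions and the 0.4 minimum-support threshold are invented for the example.

# A compact sketch of support, confidence and the a-priori idea
# (the toy transactions below are made up).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "milk", "cereal"},
    {"bread", "screwdriver"},
    {"milk", "cereal"},
    {"bread", "milk", "cereal"},
]
N = len(transactions)

def support(itemset):
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / N

min_support = 0.4

# Pass 1: large 1-itemsets.
items = {i for t in transactions for i in t}
large = [frozenset([i]) for i in items if support([i]) >= min_support]

# Pass 2: candidate 2-itemsets whose 1-item subsets are all large.
candidates = {a | b for a, b in combinations(large, 2) if len(a | b) == 2}
large2 = [c for c in candidates if support(c) >= min_support]

# Generate rules A - {b} => b from each large itemset.
for A in large2:
    for b in A:
        antecedent = A - {b}
        conf = support(A) / support(antecedent)
        print(set(antecedent), "=>", b,
              "support", round(support(A), 2), "confidence", round(conf, 2))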

Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g. if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both. We are interested in positive as well as negative correlations between sets of items. Positive correlation: co-occurrence is higher than predicted. Negative correlation: co-occurrence is lower than predicted. Sequence associations / correlations: e.g. whenever bonds go up, stock prices go down in 2 days. Deviations from temporal patterns: e.g. deviation from a steady growth.

E.g. sales of winter wear go down in summer; this is not surprising, as it is part of a known pattern. Look for deviations from the value predicted using the past pattern.
5.13 CLUSTERING
Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster. This can be formalized using distance metrics in several ways: group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized (the centroid is the point defined by taking the average of coordinates in each dimension); another metric minimizes the average distance between every pair of points in a cluster.
Hierarchical clustering: an example is biological classification; other examples include Internet directory systems (e.g. Yahoo). Agglomerative clustering algorithms build small clusters, then cluster small clusters into bigger clusters, and so on. Divisive clustering algorithms start with all items in a single cluster, and repeatedly refine (break) clusters into smaller ones.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g. the Birch clustering algorithm.
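The centroid-based formulation above is essentially what the classical k-means procedure computes; here is a minimal Python sketch on a handful of made-up 2-D points.

# A minimal k-means sketch of the centroid idea described above
# (the 2-D points and k = 2 are made up for illustration).
import random

def kmeans(points, k, iterations=10):
    random.seed(0)                                    # reproducible toy run
    centroids = random.sample(points, k)
    for _ in range(iterations):
        # Assign each point to the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Recompute each centroid as the average of its cluster's coordinates.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(xs) / len(cluster) for xs in zip(*cluster))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)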

5.14 INFORMATION RETRIEVAL

Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on metadata or on full-text indexing. Typically it refers to the automatic (rather than manual) retrieval of documents by an Information Retrieval System (IRS). ―Document‖ is the generic term for an information holder (book, chapter, article, webpage, etc).
• Systemic approach – Goal (for a known information need): return as many relevant documents as possible and as few non-relevant documents as possible.
• Cognitive approach – Goal (in an interactive information-seeking environment, with a given IRS): support the user‘s exploration of the problem domain and the task completion.

5.15 RELEVANCE RANKING

Relevancy ranking is the process of sorting the document results so that those documents which are most likely to be relevant to your query are shown at the top. Relevance ranking is usually best for searches that are not ―either/or‖ types of searches. For example, in most traditional title searches, the result is either the library has the book, or it does not. The relevancy program would either show the entry for the book, or an alphabetical list that has a statement in the appropriate place that says, ―Your search would be here.‖ This is a very good place for this concrete, well-known sorting method.
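The paragraph above describes relevance ranking only informally; one widely used scoring scheme (not covered in the text itself) is TF-IDF weighting, sketched below on a made-up three-document corpus.

# A minimal TF-IDF relevance-ranking sketch (corpus and query are made up).
import math
from collections import Counter

documents = {
    "d1": "database security and access control",
    "d2": "web crawler indexes web pages for a search engine",
    "d3": "relevance ranking sorts search results for a query",
}

def tf_idf_score(query, doc_text, all_docs):
    words = doc_text.split()
    tf = Counter(words)
    score = 0.0
    for term in query.split():
        df = sum(term in d.split() for d in all_docs.values())
        if df == 0 or term not in tf:
            continue
        idf = math.log(len(all_docs) / df)          # rarer terms weigh more
        score += (tf[term] / len(words)) * idf
    return score

query = "search query ranking"
ranked = sorted(documents,
                key=lambda d: tf_idf_score(query, documents[d], documents),
                reverse=True)
print(ranked)   # documents sorted from most to least relevant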

5.16 CRAWLING AND INDEXING THE WEB

A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or (in the FOAF software context) a Web scutter. Web search engines and some other sites use Web crawling or spidering software to update their own web content or their indexes of other sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine that indexes the downloaded pages so that users can search them

much more quickly. Web indexing (or Internet indexing) refers to various methods for indexing the contents of a website or of the Internet as a whole. Individual websites or intranets may use a back-of-the-book index, while search engines usually use keywords and metadata to provide a more useful vocabulary for Internet or onsite searching. Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

5.17 OBJECT ORIENTED DATABASE

Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an extension of the OOPL. Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs, therefore, support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types. Object data modeling An object consists of three parts: structure (attribute, and relationship to other objects like

aggregation, and association), behavior (a set of operations) and characteristics of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.
Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects, in an RDBMS, are usually other tables linked by keys to the base table. Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level. For example, generalization/specialization is the ―is a‖ relationship, which is supported in OODB through the class hierarchy. An ArtBook is a Book, therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example: Publisher_object.insert(―Rose‖, 123, …) is a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change the state of an object by modifying its attribute values, or to query the values of selected attributes. The method that responds to the message in the example is the method insert defined in the Publisher class.
The main differences between relational database design and object oriented database design include:
• Many-to-many relationships must be removed before entities can be translated into relations. Many-to-many relationships can be implemented directly in an object-oriented database.
• Operations are not represented in the relational data model. Operations are one of the main components in an object-oriented database.
• In the relational data model relationships are implemented by primary and foreign keys. In the object model objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

5.18 XML Database

XML: Extensible Markup Language. It is defined by the WWW Consortium (W3C). It is derived from SGML (Standard Generalized Markup Language), but simpler to use than SGML. Documents have tags giving extra information about sections of the document, e.g. <title> XML </title> or <slide> Introduction … </slide>. XML is extensible, unlike HTML: users can add new tags, and separately specify how the tag should be handled for display. The ability to specify new tags, and to create nested tag structures, makes XML a great way to exchange data, not just documents. Much of the use of XML has been in data exchange applications, not as a replacement for HTML. Tags make data (relatively) self-documenting. Example:

<bank>
   <account>
      <account-number> A-101 </account-number>
      <branch-name> Downtown </branch-name>
      <balance> 500 </balance>
   </account>
   <depositor>
      <account-number> A-101 </account-number>
      <customer-name> Johnson </customer-name>
   </depositor>
</bank>

Data interchange is critical in today‘s networked world

Examples: Banking: funds transfer Order processing (especially inter-company orders) Scientific data –

Chemistry: ChemML, …

– Genetics: BSML (Bio-Sequence Markup Language), Paper flow of information between organizations is being replaced by electronic flow of information. Each application area has its own set of standards for representing 141 http://www.francisxavier.ac.in

information. XML has become the basis for all new-generation data interchange formats. Earlier-generation formats were based on plain text with line headers indicating the meaning of fields (similar in concept to email headers); they did not allow for nested structures, had no standard ―type‖ language, and were tied too closely to low-level document structure (lines, spaces, etc.). Each XML-based standard defines what the valid elements are, using XML type specification languages to specify the syntax (DTD – Document Type Descriptors – and XML Schema) plus textual descriptions of the semantics. XML allows new tags to be defined as required; however, this may be constrained by DTDs. A wide variety of tools is available for parsing, browsing and querying XML documents/data. XML is inefficient, since tags, which in effect represent schema information, are repeated, but it is better than relational tuples as a data-exchange format: unlike relational tuples, XML data is self-documenting due to the presence of tags, it has a non-rigid format (tags can be added), it allows nested structures, and it enjoys wide acceptance, not only in database systems, but also in browsers, tools, and applications.

STRUCTURE OF XML
Tag: label for a section of data.
Element: section of data beginning with <tagname> and ending with a matching </tagname>. Elements must be properly nested.
Proper nesting: <account> … <balance> … </balance> … </account>
Improper nesting: <account> … <balance> … </account> … </balance>

Formally: every start tag must have a unique matching end tag that is in the context of the same parent element. Every document must have a single top-level element.
Example:
<bank-1>
   <customer>
      <customer-name> Hayes </customer-name>
      <customer-street> Main </customer-street>
      <customer-city> Harrison </customer-city>
      <account>
         <account-number> A-102 </account-number>
         <branch-name> Perryridge </branch-name>
         <balance> 400 </balance>
      </account>
   </customer>
   …
</bank-1>
