Retrievals & Projections

Slide 1 Retrievals & Projections Objectives of the Lecture : •To consider retrieval actions from a DB; •To consider using relational algebra for defi...
Author: Shona Bates
4 downloads 1 Views 403KB Size
Slide 1

Retrievals & Projections Objectives of the Lecture : •To consider retrieval actions from a DB; •To consider using relational algebra for defining relations; •To consider the Project operator and its use in SQL retrievals.

Slide 2

It is helpful to conceptually split the handling of a DB into 2 categories : Actions, which change the structure or state of the DB in some way. Definitions, which are needed to define precisely what or how the actions should be carried out. In one sense they can be thought of as parameters to the actions, but they are typically on a grander scale than we normally think of as regards parameters.

Slide 3

Actions on a DB  

Actions are expressed in SQL statements. The keyword or key phrase that starts an SQL statement determines what the statement‟s action is.

Examples : Create (Table) Alter (Table) Insert (Into), Delete (From) Update, Commit and Rollback. 



Data retrieval is another action. It is the raison d’etre of any DB. There would be little point in a DB whose data could not be retrieved. The SQL keyword that starts a retrieval statement is Select.

All SQL actions are expressed in statements that begin with a keyword that defines the particular action, and ends with a semi-colon. The examples are all action keywords that have been seen already, to create relations and populate them with data. There are more action keywords in SQL, some beyond the scope of this course. SQL actually has a number of actions that begin with Create, and the second keyword, e.g. Table, indicates what kind of thing is to be created.

Slide 4

Retrieval Actions 

The SQL Select statement must comprise the following two phrases : Select ……. From ……. ; Optionally it may contain other phrases.

Retrievals are sometimes known as Queries, because the data retrieved can be thought of as the answer to a question. Example :“Which employees are married ?” Data from the DB provides the answer. 



On this course, the DBMS is set up so as to always retrieve data to our computer screen, so that we can see it. (However SQL Select does permit retrievals to other locations).

Thus SQL retrievals have already been seen on earlier parts of this course, where whole relations were retrieved in order to inspect their contents. Retrieving data from the DB to an application program is a common option. In fact the computer screen on which SQL statements are entered and responses and query results displayed is in fact run by an application program that sends messages to the DBMS and receives messages from the DBMS. However such an application program is typically supplied by the DBMS vendor (as is the case with Oracle) and hence from that viewpoint could be considered part of the DBMS. Other common cases are where the DBMS retrieves data from the DB and sends it to a traditional application program (e.g. an accounting program, a payroll program), or to a file, printer, or a web browser.

Slide 5

Defining Relations for Actions 







Relational algebra or relational calculus languages are used to define a relation that is to be acted on. A defined relation may be :  a part of a DB relation,  A merger of 2 or more DB relations,  part of a „merged‟ relation. Relational algebra / calculus language can provide the great power and flexibility, together with conceptual simplicity, needed to define „relations-to-be-acted-on‟. SQL is a mixture of relational algebra and calculus plus other „ad hocery‟ needlessly complex.

Definitions can become very sophisticated; e.g. parts of relations being merged together, a part taken of that merged relation and then merged with another relation, and so on. That is why the language should be as conceptually simple as possible . It enables the database programmer to focus entirely on the logical definition of the desired relation, and not have to consider how to express that definition in the database language – the definition is directly expressed in the language. Relational calculus is based on predicate calculus and is therefore not a favoured language for most DB users. Relational algebra is more procedural in approach, although at the same conceptual level as relational calculus; most DB users prefer the algebraic approach, and its procedurality is supportive of deriving complex queries. Insertions, deletions and amendments use the same concepts for defining the relations involved in their actions as do retrievals, although typically to a lesser degree of complexity. Because SQL is a mixture of things, it has become needlessly complex. .

Slide 6

Relational Algebra (1) 

 



The SQL language is based on a mixture of relational algebra, relational calculus, and ad hoc peculiarities of its own. Relational algebra is simpler than SQL. Therefore for simplicity, this course follows the tradition of using relational algebra to define the relation to be retrieved. It is then expressed in SQL for execution. Relational algebra consists of :  a number of relational operators; a monadic operator operates on one operand, i.e.one relation, to produce a single relation as a result; a dyadic operator operates on two operands to produce a single result relation.  A way of combining relational operators together to form a relational expression.

SQL has been criticised for the mixture of aspects in its design, and especially for its ad hoc aspects, which make it unnecessarily complicated and not so expressive as it might be. Nevertheless it has become, in Stonebraker‟s famous phrase, the “global dataspeak” of relational databases; hence it is of great practical importance and the language taught on this course. In principle, we could use relational algebra for defining the relations involved in inserts, deletes and amendments, as well as retrievals, but SQL limitations remove the need for such power. As we have already seen, we can carry out all such updates in SQL without considerations of relational algebra (or calculus for that matter). Most relational DBMSs only use the power of algebra or calculus for retrieval, where it is important to maximise the ability to make any kind of query on the DB that can be desired.

Slide 7

Relational Algebra (2) Relational algebra is based on the same concepts as arithmetic algebra. 

Intuitively useful operators : Arithmetic Examples : plus minus



Relational Project – pick out attributes Restrict – pick out tuples

“Closure Under the Algebra” In 3-((5 6)+2) ( 5 6 ) results in a number, 30; so the next calculation is (30 + 2), giving 32; the last calculation is (3 - 32), giving -29. Each operator generates another number, i.e. closure. Similarly each relational operator generates another relation. In both cases, arbitrarily complex formulae can be built up.

It is the closure under the relational algebra that allows a query to retrieve one relation which may nevertheless contain the result of a very complex and sophisticated relational expression using many named relations in the DB.

Slide 8

Designing Retrievals 





Designing a retrieval means defining the relation to be retrieved, since it is only necessary to prefix the SQL definition with Select for it to be retrieved. As each relational algebra operator encapsulates an intuitively useful concept, with a universally accepted name for referencing it, the following method is used to learn how to write queries :  Learn a number of individual algebra operators and how to express them in SQL.  Learn how to combine operators to create powerful expressions, and how to write the expressions in SQL. SQL uses a standard fixed format for writing expressions. This is straightforward for simple and reasonable queries, but can be constraining for complex queries.

The RAQUEL notation is used to express the relational algebra, because of its inherent simplicity and lack of mathematical symbols.

The first relational algebra operator to be considered is Project.

Slide 9

Example of Projection (1) R B 5 3 8 9 5 1

D 4 7 5 6 8 2

A 4 6 4 3 4 6

B 5 3 8 9 5 1

C 1 2 1 7 2 2

D 4 7 5 6 8 2

R Project[ B, D ]

Project is a monadic operator, i.e. it takes one relation as an operand. It also takes a set of attribute names as a parameter. It produces another relation as a result, which consists of just the attributes named in the parameter. In the example above, the parameters were attribute names B and D, and so the result is a relation consisting solely of attributes B and D.

Slide 10

Example of Projection (2) R A 4 6 34 4

C 1 2 71 2

A 4 6 4 3 4 6

B 5 3 8 9 5 1

C 1 2 1 7 2 2

D 4 7 5 6 8 2

R Project[ A, C ]

The example above is similar to the previous one, in that the result relation consists of two attributes, this time A and C. However, unlike the previous example, the tuples in the result achieved by just picking out the two attributes are not a relation; because the result contains duplicate tuples. There are two (4,1) tuples and two (6,2) tuples. Thus the result is a bag of tuples, not a set of tuples. Yet every relational algebraic operator must return another relation as a result. Hence there is a second step to the projection operation, namely the removal of all duplicate tuples, in order to give a true relation as a result. Clearly if no duplicate tuples arise from projecting out the desired attributes, then the second step can be omitted. This was the case with the first example.

Optionally, by animating the PowerPoint file ProjectAnim.ppt (obtainable from the Blackboard system), a demonstration of the Project operator can be seen.

Slide 11

Definition of ‘Project’ 1. Creates a new relation containing only the specified attributes.

2. Any would-be duplicate tuples are removed, to ensure the result is a set of tuples.

If the specified attribute(s) include a candidate key, then there can be no duplicate tuples to remove.

3. The result‟s attribute names are those of the operand‟s attribute names specified in the parameter to the Project operation.

The above summarises the complete Project operation. Note the necessity for giving names to the attributes of the result. This is often forgotten, but is essential if we need to apply another relational operator to the results of the projection, and that operator needs to know the attribute names of its operand. The performance of the Project operation can be optimised by checking to see if the result includes at least one candidate key; if it does, then the duplicate tuple removal step can be omitted since the presence of the key(s) precludes any duplicate tuples from existing.

Slide 12

Characteristics of Projection 



Don‟t need to know about any constraints on the operand relation in order to be able to do a projection. Boundary Cases :  If all the attributes are specified, then result operand 



If no attributes are specified, then result nullary relation i.e. result has no attributes.

In principle, the unwanted attributes can be specified instead, i.e. those to be removed. Example : in relation R( A, B, C, D ) R Project[ ~ B, D ] R Project[ A, C ]

As with all the algebraic operators, what the Project operator does is independent of any integrity constraints on the operand. The only effect of the constraints is that they may affect the data occurring in the operand, consequently affecting the data occurring in the result. Project operates consistently even at the limit of its boundary conditions. Unfortunately SQL is not so consistent - see later. ~ means Not. Although it is only a convenience to be able to specify those attributes to be removed instead of those to be retained, it can be useful in practice.

Slide 13

SQL : Projection Principles : 

Put the operand relation‟s name in the From phrase.



Put the attributes to be projected out in the Select phrase.



This SQL statement also retrieves the newly created relation from the DB.

Examples : SQL equivalent of the 2 example projections on relation „R‟ :Select B, D Select A, C From R ; From R ;

Thus the Select keyword performs two functions : it signifies that the SQL statement is a retrieval action; it signifies that what follows it is a set of attribute names that are to be projected out of the result. From always signifies the relation(s) to be used, i.e. the operand(s) of the complete SQL statement. Note that the two SQL examples are written in the same way, despite the fact that one requires duplicate tuples to be removed and the other doesn‟t; but see the next slide.

Slide 14

SQL : Duplicate Rows 

SQL does not remove duplicate rows automatically.

Problem ! 

SQL must be instructed to remove duplicates, by inserting the Distinct keyword between the Select keyword and the list of attribute names.

Examples :Select Distinct B,D From R; Select Distinct A, C From R;

As there were no duplicate tuples, Distinct is strictly unnecessary; but it is still a correct statement. This does a real projection; the earlier version did not.

It is not incorrect to include the keyword Distinct when it is unnecessary (although it might impair performance), but it is essential to include it when there are duplicate rows in the operand SQL table, otherwise they will not be removed, which could cause some undesirable results to be obtained. The dilemma of whether to include Distinct is particularly acute when one doesn‟t know enough about the operand table to know whether Distinct is needed or not. Suppose SQL is comparing 2 rows to see if they are identical, and suppose they are identical except for the fact that in one or more columns both rows contain nulls. Logically the result of comparing 2 nulls is the truth value maybe. As there is nothing there to be compared, the result cannot be true or false. This result is used elsewhere in SQL. However in the context of comparing 2 rows, SQL assumes that the comparison yields true, thereby making the rows identical (which is intuitively what one would expect) and not different as the truth value maybe implies; and SQL gets rid of one of the rows.

Slide 15

SQL : Removing Duplicate Rows 



The reason why SQL is not designed to remove duplicate tuples automatically is because, originally, this was thought to need too much computing power. This issue is now contentious : 1. Computers are now much more powerful. 2. Much more serious performance optimisation problems can be created by retaining duplicates than by removing them.

Advice :  Always initially insert Distinct, unless a candidate key in the result ensures no problems of duplicates.  IF performance cannot then be made fast enough THEN decide whether performance or lack of duplicates is more important, and choose the preferred option.

The unfortunate logical flaw in SQL arises from efficiency concerns of 30 years ago (!). It is not always easy to remove duplicate tuples. A common way is to physically sort the table rows and then check adjacent rows to see if they‟re identical. However the sorting can be time-consuming. However given today‟s more effective optimisation possibilities, which unfortunately only work for sets of rows and not for bags of rows, and which therefore cannot be used today in SQL, it is sometimes considered now to have been a counterproductive shortterm expediency to have built this into SQL, at least at the logical level Considering the relative power and cheapness of modern computers compared to their earlier counterparts, and the relative data volumes involved, sorting and removing duplicates is much less of a problem these days anyway.

Slide 16

SQL : Projection Characteristics 





To project out all the attributes in a relation, use * in the Select phrase. Example :Select * From R ; The result is the same as relation „R‟. (Thus a retrieval of a whole table is in fact a retrieval of a projection of all the attributes). Nullary relations. These are not permitted in SQL. If nothing is put between the Select and From keywords, a syntax error will be issued. SQL does not allow the specification of unwanted attributes in the Select phrase.

In practice, not being able to have nullary relations does not create much of a problem. However Hugh Darwen1 has pointed out the value of using nullary relations to represent true and false, as a consequence of which, there are some useful queries which could be written but are impossible in SQL. If one wants to remove 1 - 2 attributes from a relation which has many attributes, say 10 or more, having to write out all the attributes that one does want is tedious and can be error-prone.

1

Hugh Darwen, The Nullologist in Relationland in C. J. Date & Hugh Darwen, Relational Database Writings 1989 - 1991 (Addison-Wesley, 1992).