Slide 1

The Principles of Joining Relations Objectives of the Lecture : •To consider the algebraic Cartesian Product operator; •To consider the algebraic Generalised Join operator; •To consider the algebraic Natural Join operator.

Slide 2

Why Join Relations Together ? 









Sometimes the answer to a query requires data from two or more relations. Since a query retrieves only one relation, it is necessary to combine those two or more relations into one relation, which is then retrieved. There are several kinds of Join operators that will combine two relations into one. If three relations need to be joined together, then we first Join two of them together, and then Join the result to the third relation; and so on for four and more relations. If the result of a join has more data in it than we want, we can prune it down to the required size using Project and/or Restrict operations before retrieving the final result.

Of course, we could just retrieve the data we want from the individual relations and then manually relate the data from them, transcribing it into some suitable form to display the answer to the query. But why should we do this work if we can get the computer to do it ? Especially as the volume of data can be quite large, and the work therefore timeconsuming, tedious and error-prone. It might be thought that only being able to retrieve one relation is an undesirable limitation. However the answer to a query is by its very nature a set of related data, and that is exactly what a relation is. Thus deciding how to join 2 or more relations together is a part of designing the relation that will hold the answer to our query. If we have manually picked related data out of 2 or more relations, then we have de facto created a single relation as our result, even if we did not write it down in tabular form. (Remember that a relation is not the same thing as its representation, and therefore does not have to be displayed as a table, although admittedly current SQL DBMSs invariably only allow a tabular representation - unlike say spreadsheets that can display worksheet data as graphs, histograms and/or piecharts).

Slide 3

‘Cartesian Product’ The simplest & crudest way to join 2 relations into one is to apply a Cartesian Product operator to them. For 2 relations R and S, this can be expressed as : R CProd S 





Cartesian Product creates a new relation whose tuples are formed by merging every tuple of the first operand with every tuple of the second operand, i.e. all possible combinations of tuples. The operands must have no attribute names in common, as otherwise the result would have duplicate attribute names. The result is rarely useful ! This is because the answer to a query usually requires related data, and a Cartesian Product does not relate the tuples of its operands, but simply puts all combinations of them together.

Relational algebra contains a Cartesian Product operator. It is a primitive operator included for theoretical reasons. The far more useful join operators that we will consider in this lecture are not primitive operators . They are the equivalent of the (possible Projection of the) Restriction of a Cartesian Product, all three of which are primitive operations, i.e. cannot be implemented by an expression involving any more basic or more fundamental operator(s).

Slide 4

Size of Result of Cartesian Product m+n attributes

m attributes

n attributes

T tuples

Always ( T U ) tuples

U tuples

While being rarely useful, Cartesian Product is always huge ! This is another reason for not using it. It is particularly its cardinality that makes it large, because this is the product of the operands‟ cardinality. For example: Operand cardinalities of 5 and 10 tuples yield a result cardinality of 50 tuples. Operand cardinalities of 25 and 50 tuples yield a result cardinality of 1250 tuples. Operand cardinalities of 100 and 500 tuples yield a result cardinality of 50,000 tuples. Operand cardinalities of 500 and 5,000 tuples yield a result cardinality of 2,500,000 tuples. The result size increases very rapidly !

Slide 5

Example of a Generalised Join (1) R A 4 3 3

B 6 4 4

C 7 5 7

D 7 4 7

E 6 5 6

A 4 3

S B 6 4

C 5 7 2

D 4 7 2

E 5 6 1

R Gen[ B < C ] S

The Generalised Join operator still considers every tuple in one operand with every tuple in the other operand, just as Cartesian Product does. However, whereas Cartesian Product merges every possible combination of 2 tuples together regardless and copies them into the result, Generalised Join applies the condition provided as a parameter to each pair of tuples in turn, and only if the condition evaluates to true does it copy the merged tuples into the result relation. In the example above, attribute B in relation R‟s tuple is compared with attribute C in relation S‟s tuple, and only if B in R is less than C in S are the 2 tuples merged and copied into the result.

Slide 6

Definition of ‘Generalised Join’ 





Creates a new relation which has all the attributes from both the operands; the attributes retain their original names. The result is created by :  looking at all possible combinations of tuples from the two operands.  using the comparison given as a parameter to compare each possible pair of tuples. IF the comparison is true THEN the two tuples are merged into one tuple and copied into the result ELSE the two tuples are ignored  The new relation contains only the merged tuples. The operator is also known as Theta Join.

The comparison provides a means of matching together tuples in the two different relations, so that only matched tuples, i.e. related tuples, are copied into the result. This is far more useful than Cartesian Product because it is normally related data that we want. By “merging two tuples into one” is meant that all the attributes from both tuples are used to form one new tuple. Since the 2 original tuples cannot have any attribute names in common, there cannot be any problem (such as 2 duplicate attribute names) in doing this. The name Theta Join stems from mathematical terminology. Any valid comparison can be used in the join condition provided as a parameter, and the formal mathematical abbreviation for “any kind of valid comparison” is “theta comparison”, theta standing for any comparator. A Generalised Join is equivalent to a Restriction of a Cartesian Product : R Gen[ condition ] S R CProd S Restrict[ condition ]

Slide 7

Join Conditions 







The same attribute name cannot appear in both operands. The two attributes compared in a condition must :  have the same data type so that it is possible to compare them,  be in different relations, otherwise a tuple from one operand cannot be related to a tuple in the other. The condition can use any possible comparison that is permissible for values of that type. Comparisons can be combined together with Boolean Operators (i.e. AND, OR, and NOT) to form one composite condition.

Because the result contains all the attributes from both operands, the operands cannot have attribute names in common to prevent the contradiction of the result relation having duplicate attribute names while simultaneously being a set of attributes. The rules stipulating the nature of the condition are to ensure that it functions effectively in order to match tuples together. The condition can be as simple or as complex as desired, which means that it provides an extremely powerful and flexible means of matching tuples. Self-evidently a comparison must always compare two attributes, one from each operand, in order to try to match the tuples, unlike in a Restriction condition where typically a comparison only involves one operand. However it is a not uncommon error to put a Restriction condition into the parameter instead of a Join condition. Note that in RAQUEL, the attribute named on the left hand side of a comparison must appear in the left hand operand, and conversely the attribute named on the right hand side of a comparison must appear in the right hand operand.

Slide 8

Example of a Generalised Join (2) R A 4 3

B 6 4

C 2 2

D 2 2

E 1 1

A 4 3

S B 6 4

C 5 7 2

D 4 7 2

E 5 6 1

R Gen[ A > E And B D ] S

In the example above, every pair of tuples is matched by seeing whether attribute A in relation R is larger than attribute E in relation S And whether attribute B in relation R is different to attribute D in relation S.

Slide 9

Size of Result of Generalised Join m+n attributes

m attributes

n attributes

T tuples Between zero and ( T U ) tuples

Attribute(s) to be compared.

U tuples

The result of a Generalised Join is as big as the result of a Cartesian Product in terms of the degree of the result. However in terms of cardinality, the result of a Generalised Join is usually much smaller than that of a Cartesian Product. Nevertheless, without knowing something about the contents of the operands and the nature of the condition, it is very difficult to predict the exact cardinality of the result. It can vary between zero tuples - if the condition never succeeds in matching any pairs of tuples - to the full Cartesian Product size if the condition matches every combination of tuples. In this sense the cardinality boundary conditions of a Generalised Join are analogous to those of the Restrict operation.

Slide 10

Example : Attribute Name Constraint R Gen[ B < B ] S A 4 3 3

B 6 4 4

B 7 5 7

D 7 4 7

A 6 5 6

Duplicate attribute names

An important practical problem.

R A 4 3

S B 6 4

C B 5 7 2

D 4 7 2

A 5 6 1

Duplicate attribute names

How can we differentiate between these duplicate attribute names in the result, say to do a projection on it ?

In RAQUEL, we can use the position of the attribute name B to distinguish which attribute B is which in the Generalised Join parameter. However, once the result is created, we cannot distinguish one attribute B in it from another. This is a problem in the above example because the two B attributes have different data in them, and so we may need to distinguish between them.

Slide 11

Example of an Equi Join (1) Note : Values in attributes duplicated. CAR RegNo K1235 ABC

Owner ENo EName M-S E3

E3 1

Smith 5

S 2

EMPLOYEE ENo EName M-S

RegNo

Owner

K1235 ABC

E3

E3 1

Smith 5

E5 2

Mitchell 6 M 4 Robson 7

D 6

S 2

2 6 STA E5 W811

E5 2

Mitchell 6 M 4

2 6 STA E5 W811

JON 71

E1

E1 3

Robson 7

D 6

JON 71

E1

E1 3

V7718 PQ

E6

E6 4

Blake 8

M 8

V7718 PQ

E6

E6 4

Blake 8

M 8

E8

Jones

W

CAR Gen[ Owner = ENo ] EMPLOYEE

In the example above, a single „=‟ comparison is used, because we want to match an employee with a car owner; i.e. we need the same employee number for both. In this case, although the attribute names are different and so there is no problem of distinguishing one attribute from another, the problem of duplicate data arises. It makes the result unnecessarily big, and may cause confusion. The two attributes are guaranteed to have duplicate data because an „=‟ comparison was used to match the tuples; only tuples whose Owner and ENo attribute values are the same will be copied into the result.

Slide 12

Definition of ‘Equi Join’ 

A special case of a Generalised Join where the only comparison  is an „=‟ comparison,  or several „=‟ comparisons Often needed when comparing a Anded together. composite Foreign and Candidate Key. Example : suppose employees/ car owners were identified by a combination of department number and employee number.



„=‟ comparisons are very frequent in practice. This is because of all the comparators, only „=‟ identifies that 2 values are the same & hence refer to the same real world “thing”.

When tuples are to be matched, it is normally because we want tuples that refer to the same real world “thing” - employee, car, etc - and therefore „=‟ has to be used to bring this about, as opposed to any other possible comparator. This is why for an equi join, if there are multiple attribute comparisons, they are „=‟ comparisons Anded together, in order that the „composite attribute‟, so to speak, refers to the same real world “thing” in both tuples. It is because equi joins are so common in practice that the problem of duplicate data in the result relation can be non-trivial. In a well-designed DB, it is frequently but not exclusively the case that the attributes to be compared comprise a candidate key and its corresponding foreign key. We have already seen that a foreign key must by definition contain a subset of its candidate key‟s values, and so the two keys are an obvious possibility for joining on.

Slide 13

Example of Equi Join (2) (and duplicate attribute data)

Error !

Duplicate attribute names !

SNo PNo Qty SNo SName S1 1 P1 1 10 1 S1 1 Smith 5

SHIP SNo PNo Qty S1 1 P1 1 10 1

SUPP SNo SName S1 1 Smith 5

S1 2

P2 2

12 2

S1 21 Smith 5

S1 2

P2 2

12 2

S2 2 Adam 6

S2 3

P1 3

63

S2 23 Adam 6

S2 3

P1 3

63

S3 3 Brown 7

S2 3

P2 3

8

S2 23 Adam 6

S2 3

P2 3

8

SHIP Gen[ SNo = SNo ] SUPP

The above example shows an equi join between a shipments relation SHIP and a suppliers relation SUPP. Each tuple of the result gives the information about a particular shipment and the supplier who sent the shipment. SNo is a candidate key in SUPP and a foreign key in SHIP, whose only candidate key is (SNo, PNo). The result shows both problems, that of duplicate attribute names and that of duplicate data. The duplicate attribute names appear in the result because the attributes containing the same data in the two relations have the same name, not surprisingly.

Slide 14

The Two Equi Join Problems The problems are : 1. Duplicate attribute names in the result; because attributes containing the same data in different relations usually have the same names. 2. Duplicate data in the result; due to the „=„ comparison. 

Yet the need for equals comparisons is very common.

Solution One  Rename the duplicate attribute(s) in one operand to something unique. This solves the duplicate name problem.  Then do an equi join.  Use a Projection operation to remove the duplicate attribute data. Solution Two Use a Natural Join operator.

When designing a DB, it is good practice, and common sense anyway, to call the same kind of data in different relations by the same attribute name. It is for this reason that comparable data in different relations usually has the same name. The Natural Join operator is effectively a non-primitive operator that carries out „Solution One‟ and so is effectively a short-hand for it. Although Natural Join is a rather specialised operator compared to the Generalised Join, in fact it is used for at least 95% of joins (because of the common need for „=‟ comparisons), and Generalised Join is used comparatively infrequently.

Slide 15

Definition of ‘Natural Join’ A special case of an Equi Join where :  all the attribute(s) to be compared must have the same name(s) and the same data type(s),  the duplicate attribute(s) are automatically removed from the result by the operator. This is much the most useful join operator in practice. Note : If there are duplicate attribute names in the operands, but they are not to be compared and used in the join, a Natural Join operation will not do.

Slide 16

Example of Natural Join Problem solved, with elegance ! PNo Qty SNo SName 1 Smith 5 P1 1 10 1 S1

SHIP SNo PNo Qty S1 1 P1 1 10 1

SUPP SNo SName S1 1 Smith 5

P2 2

12 2

S1 21 Smith 5

S1 2

P2 2

12 2

S2 2 Adam 6

P1 3

63

S2 23 Adam 6

S2 3

P1 3

63

S3 3 Brown 7

P2 3

8

S2 32 Adam 6

S2 3

P2 3

8

SHIP Join[ Sno ] SUPP Note : only attribute name(s) here, as the equals comparison is assumed.

The Natural Join can be seen to do what is most frequently required in a way that is simple and straightforward for the user, hence the name “natural”.

Slide 17

Types of Join Operator 





Thus in principle there are two different join operators : Natural Join R Join[ attribute name(s) ] S Generalised Join R Gen[ condition ] S Should there should be an equi join operator as well ? In practice, it is rare to need an equi join operation instead of a Natural Join. As an equi join is trivially expressed by a Generalised Join, there is no point in creating a special equi join operator. The main significance of the equi join is as a means of conceptually deriving the Natural Join from the Generalised Join. (One should also be aware of the equi join, since it appears in the literature in various guises).

Slide 18

Use of Different Join Operators Natural Join. Use where all the following are true :  All the comparison(s) are „=‟.  Multiple „=‟ comparisons are all Anded together.  All the attributes being compared (and no others) have the same names and data types.  Duplicate attribute data is to be removed. Generalised Join. Use where a Natural Join cannot be used, in particular where at least one of the following is true :  Not all the comparison(s) are „=‟.  One or more of the Boolean operators Not or Or are used.  One or more of the attribute(s) being compared have different names, although they must have the same data types to be comparable.

The two join operators cover all possible cases. However attributes of duplicate data but different names can occur with a Generalised Join, which may then need to be removed, and can be removed with Project.