Estimating Nested Selectivity in Object-Oriented Databases

Wan-Sup Cho¹, Wook-Shin Han², Ki-Hyung Hong³, and Kyu-Young Whang²

¹ Dept. of MIS, Chungbuk National University
² Dept. of Electrical Engineering & Computer Science, Division of Computer Science, and Advanced Information Technology Research Center (AITrc), Korea Advanced Institute of Science and Technology (KAIST)
³ Dept. of Computer Science, Sungshin Women's University
E-mail: [email protected], fwshan,[email protected], [email protected]

Abstract

A search condition in object-oriented queries consists of nested predicates, each of which is a predicate on a path expression. In this paper, we present a new selectivity estimation technique for nested predicates. Selectivity of a nested predicate, nested selectivity, is defined as the ratio of the number of qualified objects of the starting class in the path expression to the total number of objects of the class. The new technique takes into account the effects of direct representation of many-to-many relationships. Many-to-many relationships frequently occur in object-oriented databases, but have not been properly handled in conventional selectivity estimation techniques. For many-to-many relationships, we generalize the block-hit function originally proposed by S. B. Yao, allowing the cases where one object belongs to more than one block. The most significant advantage of our technique is that the accuracy of the estimation is greatly improved with only a small additional overhead. We present an efficient method for obtaining the statistical information that is needed for our estimation technique. We analyze the accuracy of our estimation technique and compare the result with those of conventional ones. The experimental results show that there is a significant deviation in the estimates obtained by conventional techniques, confirming the advantage of our technique.

This work has been supported by Korea Science and Engineering Foundation (KOSEF) through Advanced Information Technology Research Center (AITrc).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee.

CIKM 2000, McLean, VA USA © ACM 2000 1-58113-320-0/00/11 . . .$5.00

1 Introduction

Object-oriented database management systems (OODBMSs) are adequate for supporting new database applications such as engineering databases, multimedia databases, and geographical information systems [5, 9]. Although a number of OODBMSs have been developed, much less research [2, 3, 6, 8] has been done on query optimization for them than for relational systems. A query optimizer automatically generates a set of feasible evaluation plans for processing a given query, and selects the one with the minimum evaluation cost. The query optimizer is an essential subsystem in a DBMS for efficient processing of high-level queries.

An object-oriented query often consists of conditions on path expressions called nested predicates. Among many definitions [1, 6, 7, 9] of the path expression, we adopt the extended path expression defined by Kifer et al. [7]. The extended path expression allows the selector to limit the domains of the attribute in the path expression. Figure 1 shows an example database. The following two conditions are examples of nested predicates that include extended path expressions:

P1 : Student.workfor[Project].budget SOME > 100K
P2 : Student.own[Car].color ∋ "Red"

[Figure 1: An example object-oriented database. Labels in the figure: budget, Project, Employee, Engineer, Equipment, type, use, workfor, own, Student, Car, Vehicle, color, Truck. Legend: user-defined class, system-defined class, class/subclass link, attribute/domain link.]

Here, workfor and own are multi-valued attributes. The nested predicate P1 becomes true when a student participates in at least one project that has a budget of more than $100,000. In P2, the selector [Car] restricts the domain of the attribute own to the class Car¹. Therefore, P2 becomes true when a student has at least one red-colored car.

¹ Note that the domain of the attribute own consists of the classes Vehicle, Car, and Truck.

We define nested selectivity for the nested predicate $P : C_1.a_1[C_2].a_2[C_3] \ldots [C_n].a_n \;\theta\; \mathit{value}$ as the ratio of the number of objects in $C_1$ that satisfy $P$ to the total number of objects in $C_1$. Here, $\theta$ is a comparison operator. Nested selectivity for the predicate $P$ is denoted by $ns(P)$. Note that the nested selectivity is a generalized form of the conventional definition of selectivity [5, 10, 11, 15]. If $n = 1$ in the predicate $P$, that is, if $P = C_1.a_1 \;\theta\; \mathit{value}$, $ns(P)$ becomes the selectivity of the conventional definition.

In this paper, we present a technique for nested selectivity estimation that takes into account a particular feature of the object-oriented data model: many-to-many relationships. We model the effect of many-to-many relationships among classes. In the object-oriented database, a many-to-many relationship between two adjacent classes in a path expression is often represented by a multi-valued attribute and shared references to objects. For example, the many-to-many relationship between Employee and Project in Figure 1 is represented by the multi-valued attribute workfor (i.e., one employee works for multiple projects) and shared references to the project objects by multiple employees (i.e., one project has multiple participants). Since this kind of situation occurs frequently in object-oriented databases, many-to-many relationships must be considered in the selectivity estimation.

Previous literature [1, 8, 11, 13] deals only with one-to-many (including one-to-one) relationships. In relational databases, since a many-to-many relationship in the real world is often represented by two one-to-many value relationships with a separate relationship table, most literature excludes direct representation of many-to-many relationships. In object-oriented databases, since previous research [1, 8] assumes only single-valued attributes, many-to-many relationships, caused by multi-valued attributes, are not properly handled. The mathematical analysis shows that the proposed technique provides a significant gain in accuracy with almost no additional overhead when many-to-many relationships are involved in the query.

To compare the accuracy of our estimation technique and the conventional ones, we construct test databases with various data distributions using UniSQL ORDBMS [22] and measure the nested selectivities on the databases. The experimental results show that the estimates by the proposed technique closely match the nested selectivities measured from the constructed databases: the average relative error in the estimated nested selectivity is smaller than 0.036. In contrast, there are significantly large errors in conventional ones: the average relative error is larger than 0.640.

The organization of the paper is as follows. In Section 2 we review previous research on selectivity estimation. In Section 3 we present terminology and assumptions that will be used throughout this paper. We then propose a new nested selectivity estimation technique in which the effects of many-to-many relationships are reflected. We also present an efficient method for obtaining the statistical information that is used in selectivity estimation. In Section 4, we compare our technique with conventional ones, showing the significance of the role of many-to-many relationships. Finally, Section 5 concludes the paper.
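To make the definition of nested selectivity concrete, the following minimal Python sketch computes ns(P1) and ns(P2) by brute force on a tiny database. The objects, budgets, and colors are hypothetical and are not taken from the paper; only the predicates P1 and P2 and the definition of ns(P) come from the text. The helper name nested_selectivity is likewise an assumption made for illustration.

```python
# Hypothetical toy data: Oids are plain strings and all values are made up.
projects = {"p1": {"budget": 150_000}, "p2": {"budget": 80_000}}
cars = {"c1": {"color": "Red"}, "c2": {"color": "Blue"}}
trucks = {"t1": {"color": "Red"}}

# Multi-valued attributes hold lists of Oids.
students = {
    "s1": {"workfor": ["p1", "p2"], "own": ["c1"]},
    "s2": {"workfor": ["p2"], "own": ["c2", "t1"]},
    "s3": {"workfor": ["p2"], "own": []},
    "s4": {"workfor": ["p1"], "own": ["c1", "c2"]},
}


def nested_selectivity(objects, qualifies):
    """Fraction of starting-class objects that satisfy the nested predicate."""
    qualified = sum(1 for obj in objects.values() if qualifies(obj))
    return qualified / len(objects)


def p1(student):
    # P1: Student.workfor[Project].budget SOME > 100K
    return any(projects[oid]["budget"] > 100_000
               for oid in student["workfor"] if oid in projects)


def p2(student):
    # P2: Student.own[Car].color contains "Red"; the selector [Car]
    # restricts the domain of own to Car, so trucks are ignored.
    return any(cars[oid]["color"] == "Red"
               for oid in student["own"] if oid in cars)


ns_p1 = nested_selectivity(students, p1)
ns_p2 = nested_selectivity(students, p2)
print(ns_p1, ns_p2)
```

In this toy data, student s2 owns a red truck but no red car, so s2 does not qualify for P2; this is the effect of the selector in the extended path expression.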

2 Related Work

In this section, we review traditional selectivity estimation techniques and point out their limitations in the presence of many-to-many relationships.

2.1 Selectivity Estimation in OODBs

Although most literature on object-oriented query optimization claims the necessity of selectivity estimation for nested predicates [6, 8], only a few works provide concrete estimation formulas [1, 8]. Even these works do not accommodate the effects of many-to-many relationships in selectivity estimation.

Kim et al. [8] estimate the nested selectivity as the selectivity of the simple predicate on the ending attribute of the path expression. For example, the nested selectivity of the predicate

P3 : Employee.own[Car].color ∋ "Red"

is estimated as the selectivity of the following simple predicate P4

P4 : Car.color ∋ "Red"

which in turn is estimated by using the technique in Selinger et al. [11]. For instance, if the fraction of red-colored cars is 1/5, Kim et al. [8] estimate that 1/5 of the employees have a red-colored car irrespective of the cardinality of the multi-valued attribute own. That is, ns(P3) = ns(P4) = 1/5.

Bertino et al. [1] estimate the number of employees who own at least one red-colored car by multiplying the sharing degrees² of the attributes own and color, assuming that they are single-valued attributes and that the predicate has the equality operator. For example, Figure 2 shows a situation where the sharing degrees of the attributes own and color are 4 and 50, respectively: i.e., each car is shared by four employees and each color by fifty cars. Here, the number of employees who own at least one red-colored car is estimated as 200 (4 × 50 = 200). Thus, ns(P3) = 200/(number of objects in Employee). Note that the estimation technique is devised for many-to-one relationships as shown in Figure 2; thus it cannot be applied to many-to-many relationships.

² The sharing degree of the attribute own is the average number of objects of the class Employee with the same value for the attribute own, that is, how many employees share a car on average.

[Figure 2: Selectivity estimation using sharing degrees. The figure shows Employee, Car, and the value "Red", linked by the attributes own (sharing degree 4) and color (sharing degree 50).]

2.2 Selectivity Estimation in Relational Databases

In relational databases, various selectivity estimation techniques [10, 11, 13, 14, 15] have been proposed for relational operators such as selections, joins, and projections. Selectivity for a search condition (a boolean combination of relational operators) has been estimated by multiplying the selectivities of the relational operators in the search condition under the assumption of attribute independence [5, 10, 11]. Among these, the estimation technique by Whang et al. [13, 14] can be applied to the estimation of nested selectivity. However, previous research did not accommodate the effect of direct representation of many-to-many relationships, which occur frequently in object-oriented databases.

Whang et al. [13] used the concept of nested selectivity for estimating the cost of joins. For example, consider the relational query Employee $\bowtie_{id=owner}$ Car AND Car.color = "Red". Let dv(Car.owner) be the number of distinct values in Car.owner. Let RedCar be the set of red-colored cars, and let dv(RedCar.owner) denote the number of distinct values in RedCar.owner. Then, the ratio of Employee objects that will be joined with the RedCar objects³ can be estimated by dv(RedCar.owner)/dv(Car.owner) under the assumption of uniform distribution. How, then, can dv(RedCar.owner) be estimated? If Employee and Car have a one-to-many relationship, i.e., an employee may have multiple cars, Whang et al. [13] use the block-hit function [16] in the estimation of dv(RedCar.owner). In Section 3.2, we present the estimation technique based on the block-hit function in detail. However, note that many-to-many relationships are not accommodated in this estimation technique.

³ This is called the coupling factor [13].

3 A New Selectivity Estimation Technique

In this section, we present a nested selectivity estimation technique that accommodates the effect of many-to-many relationships. Section 3.1 presents terminology and assumptions that will be used throughout this paper. Section 3.2 presents a new selectivity estimation technique for many-to-many relationships.

3.1 Terminology and Assumptions

Figure 3 shows the nested predicate $P : C_1.a_1[C_2].a_2 \ldots [C_n].a_n[C_{n+1}] \;\theta\; \mathit{value}$ that will be used as an example predicate in Section 3. Here, the domain of the attribute $a_n$, denoted by $C_{n+1}$, is a primitive class such as character or integer type. The attribute $a_i$, $i = 1, 2, \ldots, n$, may have multiple values from its domain $C_{i+1}$. The ending attribute $a_n$ is a simple attribute, i.e., its domain is a primitive class (e.g., integer, string, etc.), and the remaining attributes $a_i$, $1 \le i < n$, are complex attributes, i.e., the domain of $a_i$ is a non-primitive class. Here, the nested selectivity $ns(P)$ is the ratio of the number of qualified $C_1$ objects (i.e., the objects of $C_1$ that satisfy the predicate $P$) to the total number of $C_1$ objects.

[Figure 3: Nested predicate $P : C_1.a_1 \ldots [C_n].a_n \;\theta\; \mathit{value}$. The figure shows the chain of classes $C_1, C_2, \ldots, C_i, \ldots, C_n, C_{n+1}$ linked by the attributes $a_1, a_2, \ldots, a_i, \ldots, a_n$.]

In Figure 3, we define the partial nested predicate for $P$ as the nested predicate on a right subpath expression of the left operand of $P$. For example, for $i \le n$, the nested predicate $C_i.a_i[C_{i+1}].a_{i+1}[C_{i+2}] \ldots a_n[C_{n+1}] \;\theta\; \mathit{value}$, denoted by $P^i$, is a partial nested predicate of $P$.

We define terminology for a class and its attributes as in Table 1. The attributes of the class $C_i$ are classified into two groups: simple attributes and complex attributes. In Table 1, we denote a simple attribute as $a'_i$ and a complex attribute as $a_i$.

Table 1: Terminology.

$n(C_i)$: number of objects in the class $C_i$
$n(C_i.a_i[C_{i+1}])$: number of references (Oids), including duplicates, in $C_i.a_i$ that refer to $C_{i+1}$ objects
$dv(C_i.a_i[C_{i+1}])$: number of $C_{i+1}$ objects referred to by $C_i.a_i$
$fan(C_i.a_i[C_{i+1}])$: average number of $C_{i+1}$ objects referred to by an object of $C_i$
$share(C_i.a_i[C_{i+1}])$: average number of $C_i$ objects that refer to the same object of $C_{i+1}$
$dv(C_i.a'_i)$: number of distinct values for the attribute $a'_i$
$max(C_i.a'_i)$: maximum value appearing in the attribute $a'_i$
$min(C_i.a'_i)$: minimum value appearing in the attribute $a'_i$

We make the following assumptions:

(1) An Oid of an object consists of the pair <class-id, instance-id>, where class-id is the identifier of the class to which the object belongs, and instance-id is the identifier of the object (instance) either within the class or within the entire database [1, 9].

(2) For an extended path expression including designation of the selector, we extend the definition of uniform distribution [4, 11] as follows [3]. If the domain of a complex attribute, say $a_i$, consists of classes $D_1, \ldots, D_n$, $n \ge 1$, then the Oids of each $D_k$ that appear in $a_i$ of $C_i$ are uniformly distributed in the class $C_i$, $i \ge 1$. For example, in Figure 1, the Oids of Car objects that appear in the attribute own are uniformly distributed in the attribute Employee.own (or Student.own, or Engineer.own). Similarly, the Oids of Vehicle and Truck objects are also uniformly distributed in this attribute. If $n = 1$, that is, the inheritance graph consists of a single class, this assumption is reduced to the conventional definition [4, 11]. This assumption does not violate the commonly accepted definition of uniform distribution [4, 11], but extends the definition according to the class hierarchy.

(3) We assume that the attributes in the restriction predicates and the other attributes are independent [4]. Two attributes (say A and B) are independent if the conditional probability of obtaining an A value given a B value is equal to the probability of obtaining the A value.

3.2 Nested Selectivity

In this section, we present a nested selectivity estimation technique that includes many-to-many relationships under the assumption of total participation⁴. We first review the notion of the cardinality ratio of a relationship and then derive an estimation formula that covers various cardinality ratios.

For a relationship between two classes, the cardinality ratio specifies the number of relationship instances in which an object can participate [5]. The workfor binary relationship between Employee and Project in Figure 1 has a many-to-many cardinality ratio, meaning that each employee can be related to (participate in) multiple projects, and each project can be related to (have) multiple employees. The cardinality ratio of a relationship must be one of one-to-one (1:1), one-to-many (1:M), many-to-one (N:1), and many-to-many (N:M). For the nested predicate $P$ in Figure 3, the cardinality ratio of the relationship between $C_i$ and $C_{i+1}$ becomes $share(C_i.a_i[C_{i+1}]) : fan(C_i.a_i[C_{i+1}])$ [1, 2, 3]. Under the assumption of total participation, fan and share can be estimated as follows:

$$fan(C_i.a_i[C_{i+1}]) = \frac{n(C_i.a_i[C_{i+1}])}{n(C_i)} \qquad (1)$$

$$share(C_i.a_i[C_{i+1}]) = \frac{n(C_i.a_i[C_{i+1}])}{n(C_{i+1})} \qquad (2)$$

Since $n(C_i.a_i[C_{i+1}])$ is the total number of references, including duplicates, in $C_i.a_i$ that refer to $C_{i+1}$ objects, each $C_i$ object has $n(C_i.a_i[C_{i+1}])/n(C_i)$ references to the $C_{i+1}$ objects, and each $C_{i+1}$ object is referred to by $n(C_i.a_i[C_{i+1}])/n(C_{i+1})$ objects of $C_i$ on average. In Eq. (2), $n(C_{i+1})$ is equivalent to $dv(C_i.a_i[C_{i+1}])$ since all the objects of $C_{i+1}$ are referred to by $C_i.a_i$ under the total participation assumption.

Example 3.1 Figure 4 shows possible cardinality ratios between $C_i$ and $C_{i+1}$. The gray-colored nodes represent objects, and the links represent references via attribute $a_i$. In Figure 4(a), since the number of references to the objects of $C_{i+1}$ is four, $n(C_i.a_i[C_{i+1}]) = 4$. From Eqs. (1) and (2), the cardinality ratio for Figure 4(a) is 2:1 ($share(C_i.a_i[C_{i+1}]) = 4/2 = 2$, $fan(C_i.a_i[C_{i+1}]) = 4/4 = 1$). Similarly, the cardinality ratios for Figure 4(b) and (c) are 1:3 and 2:3, respectively.

[Figure 4: Cardinality ratio between $C_i$ and $C_{i+1}$. Panels: (a) 2:1, with $C_i$ objects a1 to a4 and $C_{i+1}$ objects b1 and b2; (b) 1:3, with $C_i$ objects a1 to a4 and $C_{i+1}$ objects b1 to b12; (c) 2:3, with $C_i$ objects a1 to a4 and $C_{i+1}$ objects b1 to b6.]
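Eqs. (1) and (2) amount to two divisions over simple counts. The following Python sketch recomputes the cardinality ratios stated in Example 3.1 for Figure 4. The reference lists are an assumed encoding of the figure: the exact links cannot be recovered from the text, so only the counts (four objects in C_i, and 2, 12, or 6 objects in C_{i+1}) are meant to match. The function name fan_and_share is hypothetical.

```python
from fractions import Fraction


def fan_and_share(refs):
    """refs maps each C_i Oid to the list of C_{i+1} Oids it references.

    Assumes total participation: every C_{i+1} object is referenced at
    least once, so n(C_{i+1}) equals the number of distinct targets.
    """
    n_ci = len(refs)
    n_refs = sum(len(targets) for targets in refs.values())
    n_ci1 = len({t for targets in refs.values() for t in targets})
    fan = Fraction(n_refs, n_ci)      # Eq. (1)
    share = Fraction(n_refs, n_ci1)   # Eq. (2)
    return fan, share


# Assumed link layouts consistent with the counts in Example 3.1.
fig4a = {"a1": ["b1"], "a2": ["b1"], "a3": ["b2"], "a4": ["b2"]}
fig4b = {f"a{i}": [f"b{3 * i - 2}", f"b{3 * i - 1}", f"b{3 * i}"]
         for i in range(1, 5)}
fig4c = {
    "a1": ["b1", "b2", "b3"],
    "a2": ["b1", "b2", "b3"],
    "a3": ["b4", "b5", "b6"],
    "a4": ["b4", "b5", "b6"],
}

for name, refs in [("(a)", fig4a), ("(b)", fig4b), ("(c)", fig4c)]:
    fan, share = fan_and_share(refs)
    print(name, f"share : fan = {share} : {fan}")
```

Exact fractions are used so that non-integer averages, which are typical for real data, are not rounded before they are used in later formulas.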

⁴ Total participation means that all objects of a class are related to the objects of another class [3].

In Figure 3, we estimate the nested selectivity $ns(P)$ according to the following steps. Note that the order in which the estimation is done is the reverse of that in the path expression $C_1.a_1[C_2].a_2 \ldots [C_n].a_n$.

1. For $i = n+1, n, \ldots, 1$, estimate the number of objects in $C_i$ that satisfy the partial nested predicate $P^i : C_i.a_i[C_{i+1}].a_{i+1}[C_{i+2}] \ldots a_n[C_{n+1}] \;\theta\; \mathit{value}$. We denote the set of $C_i$ objects that satisfy $P^i$ as $C_i^q$ and call it the set of qualified objects of $C_i$. And $n(C_i^q)$ denotes the cardinality of $C_i^q$. For each partial nested predicate $P^i$, $i = 2, 3, \ldots, n$, the nested selectivity $ns(P^i)$ becomes $n(C_i^q)/n(C_i)$.

2. After $n(C_1^q)$ is obtained, $ns(P)$ becomes $n(C_1^q)/n(C_1)$.

In step 1, we estimate $n(C_i^q)$, $i = n+1, n, \ldots, 1$, as follows:

(1) $n(C_{n+1}^q)$

$n(C_{n+1}^q)$ is the number of values in the attribute $a_n$ that satisfy the condition $a_n \;\theta\; \mathit{value}$. If $\theta$ is $=$ or $\ni$, $n(C_{n+1}^q)$ becomes 1 since only the object value in $C_{n+1}$ satisfies the condition. If $\theta$ is $>$, we estimate $n(C_{n+1}^q)$ under the uniform distribution assumption [4] as in Selinger et al. [11]. Hence, $n(C_{n+1}^q) = 1$ if $\theta$ is $=$ or $\ni$, and

$$n(C_{n+1}^q) = dv(C_n.a_n) \times \frac{max(C_n.a_n) - \mathit{value}}{max(C_n.a_n) - min(C_n.a_n)}$$

if $\theta$ is $>$. For other operators such as [...]

[...] blocks hit (blocks with at least one record selected) is given by

$$b(m, p, k) = m \times \left[ 1 - \frac{\binom{n-p}{k}}{\binom{n}{k}} \right] \qquad (3)$$

Here $n = m \times p$ is the total number of records. Eq. (3) is originally intended to estimate the number of block accesses when a certain number of records are randomly selected. However, it can also be applied to the estimation of the number of distinct values in an attribute after selection operations have been evaluated [13]. In Figure 4(b), where the cardinality ratio is 1:N, we can make a correspondence between the estimation of the number of blocks hit and $n(C_i^q)$ as follows:

$n(C_i)$ corresponds to $m$ (number of blocks)
$fan(C_i.a_i[C_{i+1}])$ corresponds to $p$ (blocking factor)
$n(C_{i+1}^q)$ corresponds to $k$ (number of records to be selected)
$n(C_i^q)$ corresponds to the expected number of blocks hit

From this correspondence, $n(C_i^q)$ can be estimated as follows:

$$n(C_i^q) = b\big(n(C_i),\; fan(C_i.a_i[C_{i+1}]),\; n(C_{i+1}^q)\big) \qquad (4)$$

[...]

[...] 1 (N:M cases), there is a significant difference between them. Since previous techniques assume share = 1 and reflect only the effect of fan (i.e., the 1:N cases)⁷, errors may occur. On the other hand, Eq. (6) reflects two effects: one of fan and the other of share. When both share and fan are greater than 1 (i.e., N:M cases where N > 1 and M > 1), the selectivity estimation technique should reflect these two effects. In Eq. (6), to reflect these two effects, we modify the number of blocks $n(C_i)$ in Eq. (4) to $n(C_i)/share(C_i.a_i[C_{i+1}])$, then apply the block-hit function proposed by Yao [16], and finally obtain the result by multiplying the result of this function by $share(C_i.a_i[C_{i+1}])$.
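The block-hit function of Eq. (3) and the modification described in the preceding paragraph can be sketched in Python as follows. The first function is Yao's formula with n = m × p records in total. The second follows only the verbal description of Eq. (6) given above (divide the number of blocks by share, apply the block-hit function, multiply the result by share); because the displayed form of Eq. (6) is not reproduced here, that function is an assumption rather than the authors' exact formula. Both function names are hypothetical, and share is assumed to divide n(C_i) evenly so that the binomial coefficients take integer arguments.

```python
from math import comb


def block_hits(m, p, k):
    """Yao's block-hit function, Eq. (3).

    Expected number of blocks hit when k of the n = m * p records are
    selected at random without replacement (requires 0 <= k <= n).
    """
    n = m * p
    return m * (1 - comb(n - p, k) / comb(n, k))


def qualified_objects(n_ci, fan, share, k):
    """Assumed form of the N:M estimate described in the text.

    Uses n(C_i) / share blocks with blocking factor fan, applies
    block_hits for k qualified C_{i+1} objects, then multiplies by share.
    """
    if n_ci % share != 0:
        raise ValueError("this sketch needs share to divide n(C_i)")
    return share * block_hits(n_ci // share, fan, k)


# Figure 4(b), the 1:3 case: 4 objects in C_i, fan = 3, share = 1.
one_to_many = qualified_objects(4, 3, 1, k=1)
# Figure 4(c), the 2:3 case: 4 objects in C_i, fan = 3, share = 2.
many_to_many = qualified_objects(4, 3, 2, k=1)
print(one_to_many, many_to_many)
```

With a single qualified C_{i+1} object, the sketch yields one qualified C_i object in the 1:3 case and two in the 2:3 case, which agrees with each C_{i+1} object being shared by two C_i objects in Figure 4(c).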

⁷ Whang et al. [22] called this effect the coupling effect and proposed a join cost formula in which the coupling effect is taken into account. The coupling effect occurs in joins where the mapping cardinality is 1:N.

5 Conclusions

We have presented a new technique for estimating the selectivity of nested predicates (nested selectivity) in object-oriented databases. Nested selectivity is used in performance analysis of object-oriented database systems, query optimization, and physical database design. Unlike conventional ones, our technique takes into account the effects of direct representation of many-to-many relationships, which frequently occur in object-oriented databases. For many-to-many relationships, we have generalized the block-hit function by S. B. Yao to allow each object to be stored in two or more blocks. We have also presented an efficient method for obtaining the statistical information used in our technique by utilizing inherent features of object-oriented databases.

We have analyzed the effect of many-to-many relationships on the estimation of nested selectivity and compared it with those of conventional techniques. The most significant advantage of our technique is that the accuracy of the estimation is greatly improved with only a small additional overhead. We have tested the accuracy of our estimation technique through extensive experiments using test databases with various data distributions. The results show that our estimates very closely match the actual measurements, while conventional techniques show large deviations, confirming the advantage of our technique.

References

[1] Bertino, E., "Index Configuration in Object-Oriented Databases," The Int'l Journal on Very Large Data Bases, Vol. 3, No. 3, pp. 355-399, July 1994.
[2] Bertino, E., "On Modelling Cost Functions for Object-Oriented Databases," IEEE Trans. on Knowledge and Data Engineering, Vol. 9, No. 3, 1997.
[3] Cho, W. S. et al., "A New Method for Estimating the Number of Objects Satisfying an Object-Oriented Query Involving Partial Participation of Classes," Information Systems, Vol. 21, No. 3, 1996.
[4] Christodoulakis, S., "Implications of Certain Assumptions in Database Performance Evaluation," ACM Trans. on Database Systems, Vol. 9, No. 2, pp. 163-186, June 1984.
[5] Elmasri, R. and Navathe, S., Fundamentals of Database Systems, Benjamin/Cummings, 2nd Ed., 1994.
[6] Kemper, A. et al., "Optimizing Disjunctive Queries with Expensive Predicates," In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Minneapolis, Minnesota, pp. 336-347, May 1994.
[7] Kifer, M. et al., "Querying Object-Oriented Databases," In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, San Diego, CA, pp. 393-402, May 1992.
[8] Kim, K. C. et al., "Acyclic Query Processing in Object-Oriented Databases," In Proc. 7th Int'l Conf. Entity-Relationship Approach, 1988.
[9] Kim, W., Introduction to Object-Oriented Databases, MIT Press, MA, 1990.
[10] Mannino, M. V. et al., "Statistical Profile Estimation in Database Systems," ACM Computing Surveys, Vol. 20, No. 3, pp. 191-221, Sept. 1988.
[11] Selinger, P. G. et al., "Access Path Selection in a Relational Database Management System," In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, pp. 23-34, May 1979.
[12] Shekita, E. and Carey, M., "A Performance Evaluation of Pointer-Based Joins," In Proc. Int'l Conf. on Management of Data, ACM SIGMOD, Atlantic City, New Jersey, pp. 364-374, May 1990.
[13] Whang, K. Y. et al., "Separability - An Approach to Physical Database Design," IEEE Trans. on Computers, Vol. C-33, No. 3, pp. 209-222, Mar. 1984.
[14] Whang, K. Y. et al., "A Linear-Time Probabilistic Counting Algorithm for Database Applications," ACM Trans. on Database Systems, Vol. 15, No. 2, pp. 208-229, Oct. 1990.
[15] Whang, K. Y. et al., "Dynamic Maintenance of Data Distribution for Selectivity Estimation," Int'l Journal on Very Large Data Bases, Vol. 3, No. 1, pp. 29-52, Jan. 1994.
[16] Yao, S. B., "Approximating Block Accesses in Database Organizations," Comm. of the ACM, Vol. 20, No. 4, pp. 260-261, 1977.