Data Warehouse Modeler: A CASE Tool for Warehouse Design

Data Warehouse Modeler: A CASE Tool for Warehouse Design Les Miller Department of Computer Science College of Liberal Arts and Sciences 210 Atanasoff ...
Author: Silas Hoover
1 downloads 0 Views 48KB Size
Data Warehouse Modeler: A CASE Tool for Warehouse Design Les Miller Department of Computer Science College of Liberal Arts and Sciences 210 Atanasoff Iowa State University Ames, IA 50011 Tel: 515-294-9936 E-mail: [email protected]

Abstract Data warehouses have become an instant phenomenon in many large organizations that deal with a massive amount of information. Drawing on the experiences from the systems development field, we surmise that an effective CASE tool will enhance the success of warehouse implementations. Thus, we present a CASE tool designed to generate the SQL queries necessary to build a warehouse from a set of operational relational databases. The warehouse designer simply specifies a list of attribute names that will appear in the warehouse, conditions if any are desired, and a description of the operational databases. The tool returns the queries needed to populate the warehouse table.

Sree Nilakanta Department of Management College of Business 300 Carver Hall Iowa State University Ames, IA 50011 Tel: 515 - 294-8113; Fax: 515 - 294-2534 E-mail: [email protected]

answer the ad hoc information requirements of decision makers. Figure 1.1 shows a three tier warehouse architecture proposed by McFadden and Watson[1996].

User Interface

Warehouse Generator Query

Warehouse Design

Warehouse Modeler

1. Introduction Organizations today record and track every single detail of their transaction history hoping that valuable business insights can be unearthed. With millions of transactions captured in their systems, organizations are experiencing an explosion of data. Early attempts to use the burgeoning transaction database (now commonly called the online transaction processing system or OLTP) for decision support has been a forgettable mistake for many organizations. Recent interest in data extraction and on line analytical processing of extracted and aggregated data has given rise to a new creed of information system, namely the data warehouse system. Data warehouses now appear to offer the much heralded database support for decision support. Several different definitions are bandied about to describe what a data warehouse is. McFadden and Watson[1996] defines data warehouse to be a collection of integrated, subjectoriented databases, designed to support decision enhancing activities, where each unit of data is relevant to some moment in time. The data in the warehouse is not updated in real time but refreshed periodically. Business managers, who use the warehouse data for business analyses access the data for querying purposes only. Since the warehouse database is physically segregated from the operational databases, and is a collection of summaries and details, it provides a faster forum to

Meta Data

Figure 1.1 Three Tier Architecture for Data Warehouse Desiging and building a data warehouse is an evolutionary process resembling any IS application development project albeit with some special characteristics. Unlike other IS development efforts where well defined methodolgies and CASE tools exist, no such methodologies or tools have come to market to support data warehouse design. Clearly there are similarities between data warehouse design and other IS designs. For example, both share the need to start with adequate planning, elicit user requirements, and proceed to modeling, implementation, testing, training, and maintenance. The details, however, differ.

1060-3425/98 $10.00 (c) 1998 IEEE

Modeling a data warehouse is the most critical component in the warehouse design process. The process must include logical data modeling, physical data modeling, and meta data management. Though similarities might exist between database modeling and data warehouse modeling, there also exists significant differences. A data warehouse model must satisfy rapid information retrieval and ad hoc query formulations. Star schema and dimensional schema are often employed here. In addition to the logical data model the warehouse design must maintain a meta database consisting of information that will aid in administering the warehouse. Essentially the meta data supports the user interface to the warehouse [Mattison 1996]. A great deal of research on warehouses is focused on maintaining materialized views. The Stanford Data Warehouse Project (WHIPS) [Hammer et. al 1995] has primarily looked at methods to update a relational data warehouse given that the operational databases are relational. Quass et. al [1996], Wiener et. al [1996], and Zhuge et. al [1995] investigate algorithms for such view maintenance. Gupta [1997] has recently studied the selection of views to materialize. Similarly, Baekgaard and Roussopoulos [1997] at the University of Maryland have explored data warehouse view refreshment. If recent announcements in the trade press and practitioner conferences are any indication, there is a growing need for tools that will facilitate the design and development of warehouses than any other effort. Lessons learned from the systems development domain should imply that the availability and use of well conceieved CASE tools can improve the success of data warehouse systems. But, there are no such CASE tools that a warehouse designer can use today. In this paper we present a CASE tool designed to generate the SQL queries necessary to build a warehouse from a set of operational relational databases. The warehouse designer simply specifies a list of attribute names that will appear in the warehouse, conditions if any are desired, and a description of the operational databases. The tool returns the queries needed to populate the warehouse table. In the next section we discuss the basic database model and how such a model may be used with existing operational databases using current theories of relational databases and query generation. The next section will describe our CASE tool, the warehouse modeler. We also indicate some of the future versions that this tool might take. We conclude with a section on the current limitations and propose different directions for the tool development.

2. Model We assume that the reader is familiar with the basic concepts of relational databases [Maier 1983]. If N is the set of attributes, then we define a database scheme R = {R1, R2, ..., Rk} to be a set of subsets of N and ∪ Ri = N. i=1 to m

Let R = {R1, R2, ..., Rk} be a set of relation schemes over N. A relation r(N) satisfies the join dependency (jd) *[ R1, R2, ..., Rk]], if and only if r = πR1(r)* ... *πRk(r). A relation r(N) satisfies the embedded join dependency (ejd)*[ R1, R2, ..., Rm]], if and only if πS(r) satisfies *[ R1, R2, ..., Rm], where S is the union of the Ri (1 ≤ i≤ m) and S ⊆ N. The set of relations that satisfy the dependencies in a dependency set D is denoted by sat(D). If there exists a database scheme R = {R1, R2, ..., Rk] such that sat({*[R]}) defines the set of legal universal relations, then *[ R] will be called the univeral relation join dependency(UR/JD). The UR/JD assumption [Fagin et.al 1982, Ullman 1982] is essentially an assumption that such a join dependency exists for each database. A hypergraph is a couple H = (N, E), where N is a set of vertices and E is a set of edges which are non-empty subsets of N . There is a natural correspondence between database schemas and hypergraphs. We define the underlying hypergraph for a database scheme to be HR(N, R). A hypergraph H = (N, E) is said to be γ acyclic [Fagin 1983] if it contains no γ-cycle which is a sequence of the form (E1, x1, E2, x2, ..., Ek, xk, Ek+1) (i) x1, x2, ..., xk are distinct vertices in N; (ii) E1, E2, ..., Ek are distinct edges and Ek+1 = E1; (iii) k ≥ 3; (iv) xi is in Ei ∩ Ei+1, 1 ≤ i ≤ k; (v) if 1 ≤ i ≤ k, then xi is in no Ej except Ei and Ei+1. A database scheme is γ-acyclic exactly when the underlying hypergraph is γ-acyclic. The hypergraph can be used to model the full join dependency[Fagin et. al 1982]. A hinge [Gyssens and Paradaens 1984] is a subhypergraph with particular properties. In [Miller et. al 1990], we showed that an ejd is represented by either a hinge or a trivial subhypergraph. The following paragraphs formalize the concept. A hypergraph H is reduced if no hyperedge of H is properly contained in another hyperedge of H. H is connected if every pair of its hyperedges is connected by some path of hyperedges. If H is a reduced connected hypergraph with the vertex set N and the edge set E, then E' is a complete subset of E if and only if E' ⊂ E and for each Ei in E if Ei ⊆ attr(E') (attr(E') represents the union of the edges in E'), then Ei belongs to E'. E' is said to be a trivial subset of E if |E'| ≤ 1 or E = E'. Let E' be a complete subset of E and E1, E2 ε E - E'. Then we say E1 and E2 are connected with respect to E' if and only if they have common vertices not belonging to E'. Let E' be a nontrivial complete subset of E and j1, j2, ..., jp be connected components of E - E' with respect to E'. Then E' has the bridge-property if and only if for every i = 1, 2, ..., p there exists Ei ε E' such that (attr(E') ∩ Ni) ⊆ Ei, where Ni = attr(ji). Ei is called a separating edge of E' corresponding to ji. A nontrivial complete subset of E' of E with the bridge property is called a hinge of H.

1060-3425/98 $10.00 (c) 1998 IEEE

3. Case Tool In this section we discuss the current prototype in detail and briefly describe our plans for the next version of the case tool. The discussion of the current prototype is in two parts. First, we examine the user interface design of the current prototype. The second part of our discussion focuses on the heart of the case tool - the warehouse modeler. Figure 3.1 illustrates the block diagram of the system. Legacy System

Data Acquis ition Softwa re

Repository

Database serv er

EIS/DSS serv er

Data warehouse

Multidim . databases

EIS Client

3.2. Warehouse Modeler

DSS Client

There are two critical issues that must be considered when generating the mappings. The first issue is the determination of a join sequence. Once we have the join sequence, it must be evaluated to determine the quality of the join. The primary issue in the evaluation of the join is to decide if the join will produce extraneous data (i.e., lossy join). To generate the join sequence we start with a hypergraph model of the relational databases. We assume that common data attributes have the same name in each of the data relations in which they occur. Note that this assumption is not a problem, since views can be used to rename attributes that violate the assumption. The relational data in the operational databases is modeled by a hypergraph, where the attributes of the relations are the nodes and relation schemas are represented by the hypergraph edges. It has been shown that any connected hypergraph in this model is equivalent to a complete join dependency [Fagin et. al 1982]. To generate our mappings we make use of an extension of the query translation given by Owrang and Miller [Owrang and Miller 1988]. To employ the query translation approach we assume that the source space is modeled by a hypergraph consisting of a single edge containing all of the attributes used in the warehouse. Given the set of attributes that define a warehouse table, we can model the warehouse mapping as a hypergraph query consisting of the desired attributes (S), a select edge (the where portion of the query) and a project edge (the select portion of the query). The hypergraph H0 for a sample operational database is shown in Figure 3.2. To simplify our algorithm we make use of the complete intersection graph IR to represent the operational databases(s). The complete intersection graph for the sample operational databases is given in Figure 3.3. Note that we omit the edges with the empty set to simplify IR. The procedure to find a join sequence can then be described as follows: (1) Create a search tree from IR called an Adjusted Breadth First Search (ABFS) tree. The root node of the ABFS tree must contain at least one attribute in S and is not adjacent only to nodes containing a super set of the S attributes found in the root node.

EIS Client ...

External Data

is partitioned based on the data in the operational databases used. Each database is then processed by the portion of the partitioned query targeted at the database. The results of the individual queries are then joined to form the desired data for the warehouse. Note that if the hypergraph that depicts the operational relational database tables is disconnected, then the join is necessarily lossy and two (or more) queries will have to be generated. In the next subsection we look at the warehouse modeler (Figure 3.1).

Figure 3.1 Warehouse Modeler Block Diagram 3.1. User Interface The current prototype has been implemented on an HP/UX machine. The user interface in this system has been a low priority item with most of our energy focused on the correctness of our warehouse modeler. As a result, the system uses a command line format for the user interface. User friendliness will be a significant design issue in the next version of this tool. The current system requires two types of information from the user. Before using the case tool, the user must ensure that the design information on the operational relational database(s) has been stored in the case tool’s file system. The minimal information required is the set of relation schemas. The functional dependencies are optional, since the case tool will produce lossless joins either way. The primary advantage of including the functional dependencies is that the joins required for losslessness will in general require fewer relations to be joined when the functional dependencies are included. When the information on the operational relational database(s) has been stored, the user can start the warehouse query generation tool. The input required by the tool is the list of attributes in the warehouse table (or cube) and any conditions that the warehouse designer wishes to impose on the data being placed in the warehouse. The interface creates a sql query from the list of attributes and conditions. The query has the general form: select list_of_attributes_in_warehouse where conditions The query is denoted as the warehouse query. The warehouse query is then passed to the warehouse modeler where it is translated into the sql query necessary to generate the warehouse from the operational databases(s). The query returns the new query to the user via the user interface. In the event that the query spans two or more operational databases, the query

1060-3425/98 $10.00 (c) 1998 IEEE

e1

e2 C

Y

node (Ei) being added to the path with the path label of its parent. The unused nodes, which are adjacent to the current end point of the path E1 , and contain at least one attribute in S that is not in the path label associated with E1 will be added to the current search tree as a child of E1. Otherwise, the nodes are stored in the adjustment set ( A ) with a pointer to the position where they could be added to the search tree. The expansion is terminated when the union of the path labels in the search tree is equal to S. If the termination condition does not hold and the search tree cannot be expanded further, a node from A is chosen to restart the process. Example 3.1 illustrates the result for the root node CS and the complete intersection graph given in Figure 3.2.

e3 S

S#

e4 Q

P# D# e5 e6

Create the search tree for the root CS where S = {S#, S, M, P#}.

Figure 3.2 A Sample Operational Relational Database

CS •

(2) Choose the join path (the shortest path that includes all atributes of S ) of this ABFS tree generated by step (1). (3) Compare the length of the join path formed in (2) with current shortest join path and choose shortest of the two as the current shortest join path.

• {S#P#S} S#P#QS

• S#C {S#S} • S#P#D# {S#SP#}

D#M

• D#M {S#SP#M}

D# S#

S#P#D#

S#C

Adjustment set A with potential parent in parentheses A = {S#Y(S#P#QS)} Example 3.1

S#

S#Y

{S}

S#

S#

S#P#QS

C

S#

S

CS

Figure 3.3 The Complete Intersection Graph for the Sample Operational Database given in Figure 3.2 In the rest of this subsection we look at generation of the join path. The complete intersection graph is used in the algorithm to determine which nodes (relations) are adjacent to the current endpoint of the path. The set of attributes S in the source query is used to determine when to stop the expansion of the ABFS tree. There is a path label associated with each node in the ABFS tree. It is used to prune or delay the expansion of subtrees of the current search tree. The path label of a node is obtained by taking the union of the set of attributes that appear in S and the

Note that to expand S#P#QS in Example 3.1 with either S#Y or S#P#D# is of no value (i.e., we add no new S attribute to the path label) in this tree, but having S#P#D# in the adjustment set we are able to use it in the expansion of S#C. Once the ABFS tree of a root node is created, the shortest path of this tree (join path) with nodes N1, N2, ..., Nk is chosen, such that Ni = S. The algorithm presented in [Owrang and Miller 1988] is employed to use the path labels associated with each node in the ABFS tree to find this join path. Owrang and Miller [1988] assumed that the database was γ acyclic. As a result, finding the shortest path meant finding the join sequence that didn’t include any extra relations. In the arbitrary database environment it is necessary to extend the algorithm to determine whether there are distinct paths (distinct join sequences) in the ABFS tree set. The difference can be seen in Figure 3.4. For this database a warehouse design involving can be generated by two distinct join sequences (path in the ABFS tree set). Note that the underlying hypergraph is not γ acyclic.

1060-3425/98 $10.00 (c) 1998 IEEE

Savings Account

Bank

Customer

Checking Account

of a hypergraph defines an embedded join dependency for the corresponding database, any join sequence that has the hinge property is necessarily lossless. In some sense this was what Owrang & Miller [1988] were using in the query translation paper. In γ acyclic hypergraphs any connected subhypergraph forms a hinge. If the hinge test fails and functional dependencies are available, the test given in Miller, et. al[1990] is used. In this test the chase algorithm is applied to the relations in the join sequence. The test uses the set of functional dependencies and the join dependencies defined by the operational databases to test for lossless join. Given the hypergraph shown in Figure 3.5, Example 3.2 illustrates the second level test. In Figure 3.5 a hypergraph for a cyclic database is given. When the sub hypergraph defined by the hyperedges S1, S2 & S3 is tested for the hinge property it fails. But if the functional dependency set includes D -> F and F -> H, we can show that {[X] [S1, S2, S3, E1, E2, E3], D -> F, F > H} implies [X][S1, S2, S3]. Therefore, the join sequence is lossless.

Figure 3.4 Simple Bank Database The extended algorithm detects the two distinct paths and queries the warehouse designer which (or both) of the paths the warehouse generation query should use. When both are chosen a union operation is generated in the warehouse query. An extension that is currently being considered for future prototypes is to give the warehouse designer the option of replacing joins with outer joins. The advantage of allowing the use of outer join is that dangling tuples can be added to the warehouse. Another issue that surfaces when the operational databases are not γ acyclic is the question of the correctness of the join operation. The most critical issue is testing whether or not the set of joins used to generate the warehouse data are from a lossless join. In the next subsection our mechanism for testing is examined. Lossless Join Testing The question of testing for losslessness can be addressed at two different levels. Maier & Ullman [1983] use functional dependencies and multivalued dependencies to create maximal objects that denote sets of relations that can be joined losslessly. This approach does force losslessness, but has two problems. First, this approach is rather restrictive. There are sets of relations that can be losslessly joined that will fail this test. Second, it is unlikely that an organization’s database administrator will have more than functional dependencies (if that). We have adopted an approach based on a series of tests that are dynamic in the sense that the tests are applied after the join sequence has been generated. The approach operates at two levels. First, the join sequence determined in the mapping phase is tested for losslessness by testing whether or not the subhypergraph defined by the relations in the join sequence forms a hinge with respect to the database hypergraph. Since it has been shown by Gyssens & Paredaens [1984] that any hinge

Example 3.2

S2

S1 A E1

B

CD

E

S3 F

G

H I

M

E3 K

L

J

E2 Figure 3.5 A Cyclic Hypergraph Example Our current prototype stops here. If the join is still not known to be lossless the prototype reports the issue to the warehouse designer. The next version of the case tool that is currently being developed will make use of algorithms to build a hinge from the current join sequence (if no functional dependencies are given) or an fd-hinge (when functional dependencies are available). The fd-hinge, first given in Miller[1992], includes functional dependencies in the expansion process to minimize the number of relations that have to be added to the join sequence to ensure the lossless join property. 3.3 Next version The next version of our case tool is currently being designed. Since the warehouse modeler was the focus of the current system, the major changes will be in the user interface.

1060-3425/98 $10.00 (c) 1998 IEEE

The only real changes in the warehouse modeler will be the incorporation of fd-hinges into the lossless join test and building the capability for the user to incorporate more than one path into a join sequence. The use of outer join is also being considered. The biggest change for the user interface is that the tool will be implemented on a PC running Windows 95 for the next version. As a result, the tool will be windows based and the user will be able to click on the attributes desired in the warehouse. The user will be able to generate the conditions by either clicking on the appropriate symbols or by typing portions or all of the conditions. When the attribute list and the conditions are complete, the user will click on the send warehouse query button. The operations query will be returned to the user as part of the window and be stored in a file. The result should be a more user friendly environment for the case tool.

4. Conclusions and future work The design of a case tool for generating the query for creating a warehouse table (or cube) has been presented. The current prototype has focused on the expectation that the operational data are in relational databases. A second version of this system is currently under development. The new version of the system will provide a more user friendly interface and some extensions within the warehouse modeler software. We have also started work on a case tool based on the Zeus view mechanism [Yen et. al 1997, Yen & Miller 1995, Yen, Miller and Wong 1994, Yen, Miller and Pakzad 1994]. The tool will be able to make use of views defined over an extension of the Zeus views to integrate the various data models and legacy systems into the warehouse design process.

5. References [Baekgaard and Roussopoulos 1997] L. Baekgaard and N. Roussopoulos. “Efficient Refreshment of Data Warehouse Views.” Technical Report, Department of Computer Science, University of Maryland, URL: http://www.cs.umd.edu/TRs/ authors/Nick\_Rouss opoulos-no-abs.html., 1997 [Fagin 1983] Fagin, R. Degrees of acyclicity for hypergraphs and relational database schemes. JACM. Vol. 30. No. 3. Pages 514-550, 1983.. [Fagin et. al. 1982] Fagin, R., A.O. Mendelzon, J. Ullman. A simplified universal relation assumption and its properties. ACM TODS. Vol. 7. No. 3. Pages 343-360, 1982. [Gupta 1997] H. Gupta. Selection of Views to Materialize in a Data Warehouse. Proceedings of the International Conference on Database Theory, Athens, Greece, January 1997.

[Gyssens and Paredaens. 1984] Gyssens, M. and J. Paredaens. A decomposition methodolgy for cyclic databases. Advances in Database Theory. Vol. 2. Pages 85-122, 1984. [Hammer 1995] J. Hammer, H. Garcia-Molina, J. Widom, W. Labio, Y. Zhuge. “The Stanford Data Warehousing Project.” In IEEE Data Engineering Bulletin, June 1995. [Maier 1983] Maier, D. The Theory of Relational Databases. Computer Science Press. Rockville, Maryland, 1983. [Maier and Ullman 1983] Maier, D. and J. Ullman. Maximal object and the semantics of universal relational databases. ACM TODS. Vol. 8. No. 1. Pages 1-14, 1983. [Mattison 1996] R. Mattison. Data Warehousing:Strategies, Technologies and Techniques, McGraw Hill, New York, 1996. [McFadden and Watson 1996] McFadden, Fred and Hugh J. Watson, “The World of Data Warehousing: Issues and Opportunities,” Journal of Data Warehousing, vol. 1, no. 1, pp. 61-71, 1996. [Miller et. al. 1990] Miller, L. L., J. Leuchner, S. Kothari and K. Liu. Testing arbitrary subhypergraphs for the lossless join property. Journal of Information Science. Vol. 51. Pages 95-110, 1990. [Owrang and Miller 1988] Owrang, O. M. and L. L. Miller. Query translation based on the hypergraph model. The Computer Journal. Vol. 31. No. 2. Pages 155-164, 1988. [Quass et. al. 1996] D. Quass, A. Gupta, I. S. Mumick, and J. Widom “Making Views Self-Maintainable for Data Warehousing.” Proceedings of the Conference on Parallel and Distributed Information Systems, Miami Beach, FL, December 1996. [Ullman 1982] Ullman, J. The universal relation strikes back. ACM PODS. Los Angles. Pages 10-22, 1982. [Wiener 1996] J. L. Wiener, H. Gupta, W. J. Labio, Y. Zhuge, H. Garcia-Molina, J. Widom. “A System Prototype for Warehouse View Maintenance.” Proceedings of the ACM Workshop on Materialized Views: Techniques and Applications, Montreal, Canada, June 7, 1996, pp. 26-33. [Yen and Miller 1995] Yen, C.H. and L.L. Miller.. An extensible view system for multidatabase integration and interoperation. Integrated Computer-Aided Engineering. Vol. 2. No. 2. Pages 97-123, 1995

1060-3425/98 $10.00 (c) 1998 IEEE

[Yen et.al. 1997] Yen, C.H., L.L. Miller, A. Sirjani and J. Tenner. Extending the object- relational interface to support an extensible view system for multidatabase integration and interoperation. International Journal of Computer Systems Science and Engineering. To appear, 1997. [Yen, Miller and Wong 1994] Yen, C.H,, L.L. Miller and J. Wong. The architectural design of the zeus multimedia system. IASTED Parallel and Distributed Computing and Systems Conference. Pages 247-252, 1994. [Zhuge 1995] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom, “View Maintenance in a Warehousing Environment.” Proceedings of the ACM SIGMOD Conference, San Jose, California, May 1995.

1060-3425/98 $10.00 (c) 1998 IEEE