Data Warehouses: a Modelling Perspective 1

In: Systems Development Methods for the Next Century II (Eds. W.G.Wojtkowski, S. Wrycza, J. Zupančič), Proc. of 7th Int. Conf. on Information Systems,...
Author: Deborah Scott
2 downloads 0 Views 728KB Size
In: Systems Development Methods for the Next Century II (Eds. W.G.Wojtkowski, S. Wrycza, J. Zupančič), Proc. of 7th Int. Conf. on Information Systems, Bled, Slovenia, 1998. Plenum Press, 1999

Data Warehouses: a Modelling Perspective1 Jaroslav Pokorný Department of Software Engineering Charles University, Faculty of Mathematics and Physics Malostranské nam. 25 118 00 Prague - Czech Republic Phone: 00420/2/21914256 Fax: 00420/2/532742 e-mail: [email protected]

Abstract: The paper is focused on the conceptual and the multidimensional level of data warehousing. A detailed survey of various approaches to the multidimensional modelling is presented. Basic notions such as a dimension, a fact, a star schema, and a constellation schema are introduced including their formal definitions. Moreover, cardinality and query constraints are defined and shown to be useful in a multidimensional design. Due to the hierarchical structure of dimensions the role of the associated E-R schema is emphasized. It can bear a part of semantics needed for understanding dimensional tables. A more general variant – the multidimensional modelling with explicit hierarchies is then introduced. Finally, an outline of the methodology for designing a data warehouse is proposed.

Key words: OLAP, data warehouse, multidimensional modelling, star schema, constellation schema, dimension table, fact table, E-R modelling.

1. Introduction A significant progress in decision support has appeared in last years. Data warehousing as a collection of decision support technologies ChD97 has increasingly become a focus of both the academia and the database industry. What is a data warehouse (DW)? A broad definition of a data warehouse can be the following Gu97: a data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis. In more database-oriented terms, a DW is a database suitable for direct querying and analysis. Another architectures allow a two-level approach with a DW and smaller logical data marts oriented to specific areas of enterprise. Even architecture without DW and containing only data marts is possible. A more user-oriented variant how to approach the data stored in DW is through OLAP (On-Line Analytical Processing). An OLAP database can be conceived as a database derived from DW and materialized under other special software (often called DW/OLAP). Another, not too feasible possibility understands the OLAP database as a front-end to one or more OLTP (On-Line Transaction Processing) databases (so called OLTP/OLAP). We do not prefer the latter possibility further in the paper. It is a common use of DW or OLAP that involves real time access and analysis of multidimensional data. For example, in sales DW, time of sale, salesperson, and sales district might be some of the 1

This work has been partially supported by GACR project 201/98/0532.

dimensions of interest. Values of sales, e.g. quantity, cost, revenue, etc. depend on these dimensions and are required in decision making processes. Since these dimensions involve hierarchies, e.g. date  month  quarter  year, aggregation of information is a key element in the usefulness of data multidimensionality. We agree with RB96, that DWs as well as the OLAP approach require a new query-centric view of the data. Multidimensional modelling is one approach how to do it. While it is universally recognized that the design of an operational database is based on three levels of abstraction, i.e. the conceptual, database, and physical one, respectively, the most important aspect of DW modelling is a support of possibilities how to analyse the data. Obviously, we could begin with the well-known E-R model which provides a standard documentation of operational systems whose data will be selected, cleaned, and pre-proceed before its transmission into DW. Unfortunately, this model has some drawbacks: 

it does not provide a sufficient information for the analytical processing that may be performed with DW,



it describes conceptual schemes that are optimized rather for data entry and are not too appropriate for query-centric view of the data.

Many practical approaches as well as theoretical papers advocate a variant of DW modelling called multidimensional. Actually, data in DW is modelled multidimensionally to be viewed multidimensionally. An example of the relational representation of multidimensional data viewing from the car business subject area is shown in Figure 1. In jargon DW and OLAP modelling, the schema is called “star schema”, since it has one relation, a fact table, which represents the multidimensional space in the centre and the other relations which represent dimensions around it. Notice that logical connection of tables in the star schema is nothing else than well-known referential integrity used in SQL databases. SalesOrg_ID

Representative

Type

Type

Description

Office

SalesOrg_ID

Number_of_ seats

Colour_ID

Colour_ID

Quantity

Cost

Revenue

Name

Profit

Class

Figure 1: Tabels organized as a star schema In the literature, we could recognize other approaches to multidimensionality, e.g. star joins and data cubes. They are conceptually similar, but not the same. Often it is not clear how to include these abstractions, or models, said in more data modelling terms, into logical design techniques usually used for the enterprise information systems. For example, relations in Figure 1 are, in fact, four independent tables and the “starness” should be conceived rather on a more conceptual level. Thus, the first group of problems concerns how to distinguish more precisely between the conceptual and the representation view on stars. We will follow the approach given by McGuff McG96 in which the multidimensional modelling for DW makes a separate design stage. It is placed between the business (or conceptual) modelling and the representation (or database) modelling. There is an interesting question what role the multidimensional model plays among particular stages of DW modelling. Thus, we divide all aspects of the star architecture into for stages:

1



conceptual modelling



multidimensional modelling



representation modelling



physical modelling

The conceptual view will cover the aspects of data, which express its associations to real world objects, the multidimensional view will describe a DW by dimensions and fact tables. The representation of facts and dimensions will concern their database description. Finally, in the physical modelling we are interesting in indexing and other implementation aspects of the database representation of DW. This paper looks at the DW modelling mainly from the point of view of the first two modelling stages. We introduce precise definitions of the basic modelling constructs in the particular abstraction levels and describe their mutual relationships. The remainder of the paper is organized as follows. A detailed survey of various approaches to the multidimensional modelling is presented in Section 2. We try here to distinguish differences between the conceptual and the multidimensional modelling. Section 3 introduces basic notions of the multidimensional modelling such as a dimension, a fact, a star schema, and a constellation schema. Their formal definitions follow. Then dimensions are discussed and their hierarchical structure is emphasized. Fact constellations and snowflakes are shown as a partial solution of the problem how to model hierarchies. A general variant – the multidimensional modelling with explicit hierarchies is then introduced. In summary, an outline of the methodology for DW designing is presented.

2. DW modelling – a motivation example In each approach to information system design an importance of the conceptual stage is emphasised. In practice, well-known modelling methodologies with the most common E-R diagrams as the leading one in OLTP environment are inappropriate in DW environment Ki97. An E-R schema assumes that programs process the data and conceptual structures represent only the relationships between objects. The goal is to offer a conceptual description of normalized data stored in a database. On the other hand, the calculation of derived data is crucial in DW. Various aggregations are required and this calculated data could be of interest in the DW modelling. Thus, we need other approaches to data modelling of DW. Obviously, adequate methods must be propagated into all modelling levels. A general approach to the DW modelling is based on so called multidimensional (or shortly dimensional) modelling Bu96, LW96, McG96, Ra95, RB96, ChD97, Ki97, [Po97]. There are two basic approaches to the multidimensional modelling:  its structures are based on tables (dimension and fact tables) arranged into star schemes,  its structures are based on hypercubes (cubes, multidimensional arrays) Examples of both approaches are in Figure 2 and Figure 3, respectively. The latter shows data in cells arranged by dimension of the data. There is a close relationship between both approaches. In GLS96, a tabular data model is developed which provides a unified apparatus to star-like multidimensional databases and to hypercubes. We will focus rather on the first approach because of its better graphical visualization. For example, it is more convenient to view ndimensionality, n  3, trough pictures like Figure 2 instead of any generalization of 3-ary cubes.

2

SALES ORGANIZATION

MODEL Type

SalesOrg_ID

Description Number of seats Class

Representative Office

SALES Type SalesOrg_ID Colour_ID

COLOUR Colour_ID

Quantity Cost Revenue Profit

Name Figure 2: Star schema as a result of the multidimensional modelling COLOUR MODEL SALES ORGANIZATION

Figure 3: Hypercube (Multi)dimensional modelling (DM) is a logical design whose basic components are facts, dimensions, and attributes. In an informal highlighting the semantics of these notions, we follow Ki96, Ki97. Each dimensional schema is composed of one table with a multi-part key, called the fact table, and a set of tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table. Dimension attributes are used as the source of most of the interesting constraints in DW queries. Fact attributes are a focus of interest for the enterprise. They are modelled by non-key attributes in the fact table. Each their instance is “measured” by values of a tuple of dimension values. The tables of both types may be considered in a similar way as the tables in SQL, i.e., each table has rows whose components are elements of associated column domains. As usually, each table is a subset of the Cartesian product of its column domains. In Ra95 sparsity is included into DM. In the sales example, there may in fact be actual sales for only small fraction of the Cartesian product of Model, Colour, and Sales organisation. For example, 98 sparsity is possible. A usual solution is to represent only valid combinations of dimension values. Said in SQL terms, we should not record rows with all facts set on NULL in the (relational) table. Although important for physical modelling, the sparsity does not seem to be relevant in the level of dimensional modelling. We will not consider it further. The relationship of both DM approaches to logically connected tables in Figure 1 is intermediate. We can implement multidimensional structures via relations of any RDBMS or its accommodated variant (see, e.g., Informix MetaCube, Oracle Express Server). However, it is not the only variant. The representation of multidimensional structures can be also direct, see e.g. products such as Arbor’

3

Essbase, Hyperion OLAP, Kenan’s Acumate ES etc. Similarly, supposing relational representation, we can find approaches in which physical structures or access methods are different from usual relational strategies. For example, RDBMSs as ORACLE or Sybase support star queries in last versions of their relational engines. Finally, we show multidimensionality in the formalism of E-R diagrams, i.e. in a more conceptual way. Based on the original Chen’s notation, Figure 4 shows a conceptual view to our example. In the terminology of the E-R model, the notion of star schema is used for special kinds of n-ary relationship types. The E-R schema in Figure 4 has also the characteristic „star-like“ structure and it is possible to call it the star schema. Returning to tables on the multidimensional level of modelling (see Figure 2) we see they are connected with lines, which express many-to-many cardinalities of relationships (see Figure 4) that are modelled by the fact table. The emphasis on many-to-many is essential Ki97. Obviously, for a given dimensional table and a fact table, the cardinality is one-to-many. Remind that the E-R model is not the only possibility for the conceptual view on multidimensional structures. In the HIT model [Po89] facts can be expressed as HIT-attributes. For example “Revenue from the sales of a given model type in a given colour in a given sales organisation” (abbr. RSMC) is an example of such attribute. Its schema is depicted in Figure 5. Type

Description

SalesOrg_ID

Nuber of seats Classs MODEL M

N

SALES ORGANIZATION Representative

SALES

Quantity COLOUR

P

Colour_ID Name

Office

Cost Revenue s Profit

Figure 4: Conceptual view on the star schema in the E-R model Model Colour of

in in

from the sales

Sales organization

Revenue

Figure 5: Conceptual view on the star schema in the HIT model

4

In the terminology of the HIT data model, Model, Colour, and Sales organization denote entity sorts, the Revenue is a descriptive sort. The attribute RSMCS is a partial function from Model  Colour  Sales into the set of Revenue values. The HIT data model offers a very transparent semantics of facts. Facts are not isolated attributes (understand: names of columns in the fact table). Facts can be expressed in the HIT data model as the propositions, which are contained in HIT-attributes. For example, the sentence “The revenue from the sales of the Mini Van model type in yellow colour in the Clyde’s Autohouse is 20.000$“ denotes such a proposition. It is generated from the RSMCS attribute. Accommodating to the “space” terminology used in DM, we could name above discussed HITstructures comets. A conceptual star schema is expressed by a set of comet schemes.

3. Multidimensional modelling formally Now we introduce a formal view on the multidimensional modelling. In the first step we will formalize the basic DM notions, i.e. stars and constellations. Since the attributes of a dimension may be related via a hierarchy of relationships we obtain strongly denormalized dimension tables, the conceptual entity types of which could be also denormalized. This results into various solutions. Their common point is the necessity to model explicit hierarchies on the conceptual level. We introduce explicit dimension hierarchies also in the multidimensional level.

3.1 Stars and constellations Following [Po97] we will approach now the multidimensional modelling in a more formal way. We use upper letters D, F, ... for table schemes and D, F, ... for tables. A star schema could be given by a triple D, F, CC, where D is a set of dimension table schemes Di with attributes Ai, i=1,...,n, F is a fact table schema, and CC is a set of cardinality constraints. One attribute of each table Di is called the key of Di and is denoted as KDi. The key of F table is Ui=1..n KDi. Other (non-key) attributes of F are called facts. The cardinality constraint CCi for F and Di, i=1...,n, is defined as follows. Let F and Di be a fact and a dimension table, respectively. Then the cardinality constraint is satisfied by these tables, when for each row u from F there is only one row v in Di, such that u.KDi = v.KDi Informally, rows of the fact and any dimension tables are in many-to-one relationship. In more precise notation this fact could be expressed with the help of min-max pairs as F:(1,1), Di:(0,n) i.e., some rows from Di could be associated with no row from F. Thus, dimensions are independent on facts, facts can not exist without dimensions. Cardinality constraints also imply an expected observation that each KD is a foreign key in F. In contrary to the SQL, the value of any foreign key in Fmust not be NULL. Now we can define a multidimensional database. Let S be a star schema. A multidimensional database over S is a set of tables Di , i=1,...,n, and F that satisfy all cardinality constraints from CC. Single-star schema environment is regarded as easy to understand, but it is also a little limited. In the OLAP environment we can distinguish another approach in which more star schemes are defined on the conceptual level. We obtain so-called multi-star or, better, constellation schema. An important observation concerns dimensions in a constellation schema. Dimensions are shared, i.e. one dimension table schema can be common for more fact table schemes. More formally, a constellation schema is given by a triple D, F, CC, where D is a set of dimension table schemes, F is a set of fact table schemes, and CC is a set of cardinality constraints. For each F  F there exists a subset D´  D and CC´  CC such that D´, F, CC´ is a star schema. The notion of multidimensional database over a constellation schema is a natural extension of this notion specified in its previous definition. It is remarkable, that we need no additional integrity constraints that usually play an 5

important role in operational databases. We can suppose that DW data designed and put into the associated database come from consistent data sources and had been cleaned during the DW building process. Cardinality constraints CC are the only meaningful constraints used for DM. They express basic logical connections among tables in a multidimensional database. The approach to DW described above is insufficient in the aspect of distinguishing operational databases from DWs. We have emphasised the query-centric view of the enterprise data. Thus, we need model constraints that make it possible to restrict the set of possible queries over DW. The reason could be for example an impossibility to aggregate some fact instances for some attributes. Suppose the SALES fact table in Figure 6. The value of the attribute Number_of_customers for a given model, a colour, and a sales organization provides the number of the customers for the mentioned measures. Aggregation of these numbers for, e.g., all products cannot be consistently performed since some customers could be counted more times. In fact, some customers could buy more products in one purchase, for example. In other words, the attribute Number_of_customers is not additive along the product dimension. Remind that facts are usually numeric quantities that can be summed when a group of fact rows is selected. Since aggregation is mostly an additive process, it is appropriate if facts are limited to additive, numeric values. By the attribute additivity we mean a possibility to sum attribute values along all dimensions. Due the fact, non-key attributes in fact tables are considered as additive along all dimensions by default. This results in the idea to add the information about exceptions from this rule into star and constellation schemes. In GMR98, semi-additive and not-additive attibutes are distinguished. A fact attribute is called semi-additive if it is not additive along one or more dimensions. Non-additive attributes are additive along no dimensions. SALES ORGANIZATION PRODUCT P_number

SalesOrg_ID

Manufacturer Type

Representative Office

SALES P_number SalesOrg_ID Colour_ID

COLOUR

Number_of_ customers

Colour_ID

Cost

Figure 6.: Aditivity of attributes

Name

Non-additivity does not mean impossibility to aggregate at all. For example we could apply the Average operator to the Number_of_customers values along the dimension PRODUCT. Formally, we would extend our definition of star schema by the set QC (Query Constraints) whose elements are triples (A, D, Agg), where A is a non-key attribute of F, D is dimension, and Agg is a subset of a set of aggregation function considered in DW modelling. Other kind of query constraints could be so called grouping relations introduced in LW96. The schema of a grouping relation contains attributes selected from dimension tables. These pre-defined groupings explicitly restrict a set of possible groupings that can appear in user queries.

3.2 Dimensions In an informal approach often cited in the literature, dimensions are described by attributes some of which are descriptive, e.g. Description, Name, within the others may be included business-oriented,

6

enterprise-specific dimension hierarchies, e.g. PRODUCT: Item  Class. An important dimension is TIME structured as date  month  quarter  year. Other attribute hierarchy is SALES_ORGANIZATION: office  district  region. Attributes in a dimension hierarchy (see Figure 8) are called members of the hierarchy. In more rigorous approach, particular members of each such hierarchy are classes of entities. On the conceptual level, these hierarchies are chains of entity types where the cardinality of relationship for each two neighbours is 1:A (see Figure 7). An extension of a hierarchy can be defined e.g. as a set of trees. Nodes of each tree will bear data about particular entities. Certain dimensions could be modelled via part_of relationship construct. Some hierarchies can be multiple in one dimension. The TIME dimension provides a good example. In fact, the hierarchy date  week is a separate hierarchy from the one mentioned above. A typical usage of hierarchies is in various possibilities of aggregation. For example, beginning with office we can roll up to geographically higher wholes and aggregate the associated facts. Region

Region 1 N

District

District 1 N

Office

Office

Figure 7: A dimension as a chain of entity types

Figure 8. A dimension as a hierarchy

Hierarchies often collapse into one dimension table, which contains all members as particular attributes (Figure 9) or the whole hierarchy extension is conceived as a domain of one attribute (Figure 10). As a consequence of this approach, we can obtain highly denormalized dimension tables. In fact, attribute hierarchies define naturally sets of functional dependencies. Consequently, dimension tables are not in 3NF. This does no problems because the assumption behind the star schema is that the associated database is static, i.e. no updates are performed on-line. Despite the advantage to place into the same fact table data from different aggregation levels, storing a hierarchy into one dimension table can pose some problems in DM. We would like to keep aggregate data in the fact table e.g. such as sales dollars for a region, for a given model, and for a given colour. Thus, the question is how to construct the key of the dimension table SALES_ORGANIZATION. Suppose, the table schema contains all attributes in the hierarchy, i.e. office, district, region. Then these attributes have to participate in the table primary key. N. Raden Ra96 solves the problem by introducing generated (artificial) keys that cover all possibilities and by determining a special attribute Level. Figure 9 explains this problem. The generated key is SO_key.

SO_Key 234 235 236

SalesOrg_ID STO3276 STO3189 STY5478

Office BUIC1 BM1 AUD4

Representative Jones Hover Archwood

District Idaho Florida Idaho

7

Region North South North

R_representative Smith Navara Smith

Level Office Office Office

237 238 239 240 241

STQ6781 NULL NULL NULL NULL

AUD8 NULL NULL NULL NULL

Seaman NULL NULL NULL NULL

Florida Florida Idaho NULL NULL

South South North North North

Navara Navara Smith Smith Smith

Office District District Region Region

Figure 9: Dimension table with a hidden hierarchy and generated keys Dimension key STO3276 STO3189 STY5478 STQ6781 Idaho Florida North South

District Idaho Florida Idaho Florida NULL NULL NULL NULL

Region North South North South North South NULL NULL

Level 0 0 0 0 1 1 2 2

Figure 10: Dimension table with a hidden hierarchy and self-identified dimension keys We can observe that in both approaches the number of tables is reduced. On the other hand, the level field is often a source of operational errors. Beyond all doubt, it seems that the conceptual schema to dimensional tables organized in the way in Figures 9 a 10 is necessary.

3.3 Fact constellations, snowflakes Often emphasized property of star schemes is that they are built for simplicity and speed. We have argued, that the level indicator can limit their flexibility. Moreover, summary data in the fact table can yield poorer performance, and dimension tables are huge. More structured approaches make it possible  to split the fact table into more fact tables according to a dimension hierarchy,  or, to build hierarchies as paths of separate tables. The former means that for each star schema so-called fact constellation schema Ra96 is constructed. In fact, only one star schema along one (or more) selected hierarchy (roll up) is extended. The basic fact table of this schema contains data aggregated by the lowest member of the hierarchy. For example, for the Sales_organization it is Office. Obviously, the generated key is not necessary in this approach. Then, particular new fact tables can be built, i.e. SALES_D and SALES_R for aggregation by district and region, respectively. Figure 11 shows adding the SALES_D fact table to the original star schema. Obviously, the Level attribute is not necessary. Similarly it is possible to extend the schema in other dimensions. But, when we need to aggregate data, e.g., by district and class, it is necessary to build other fact table.

SALES ORGANIZATION SalesOrg_ID SALES Type SalesOrg_ID

Representative Office District Region R_Representative

SALES_D 8

Type District

Figure 11: The fact constellation schema We can observe that these structures create a proper subset of constellation schemes specified in Section 3.1. The main disadvantage of this approach is a more complicated design because many variants for particular kinds of aggregation must be considered and selected. Moreover, dimension tables are still large. Another alternative to the star schema is to denormalize the dimension tables according to its associated hierarchy. The fact table is split into different fact tables as before. Keys of such tables point to the smaller dimension tables. By „snowflaking“ we mean an explosion of the original star schema into more star schemes each of them describes facts on another level of dimension hierarchies. Figure 12 shows a part of the snowflake schema derived from snowflaking the dimension SALES ORGANIZATION.

3.4 Dimensional modelling with explicit hierarchies In MK96, snowflakes are replaced by explicitly expressed hierarchies. Separate fact tables are again built for different kinds of aggregations accordingly to connections with appropriate members of the dimension hierarchies. An example is given in Figure 12. Notice that SALE_C_R does not contain the fact Quantity that probably offers not useful information in this context. If we create all possible aggregates, the total number of associated tables grows rapidly. For example, for two dimension hierarchies with the number of respective p and q of hierarchy members, we could obtain p  q fact tables. This is not practical. Thus, a certain caution must be kept in deciding how to design the conceptual schema. At first glance, it seems reasonable to try to define constellation schemes with explicit hierarchies. This approach has a number of advantages from the conceptual modelling point of view. First, dimensions are structurally visible in the schema. Second, different fact tables are explicitly assigned to those dimensions, which are for given facts relevant. Other aggregations can be calculated (roll-up).

OFFICE

DISTRICT

SalesOrg_ID SALES Type SalesORG_ID Colour_ID

Representative Office District Region R_Representative

District SALES_D Type District

9

Colour_ID

Region R_Representative

REGION Region SALES_R

R_Representative

Type Region Colour_ID Quantity Cost Revenue Profit

Figure 12: Part of the snowflake schema REGION DISTRICT

SALES

R_Representative

SALE_C_R

OFFICE

District

Region

SalesOrg_ID Representative Office District

Region

Class Cost Revenue Profit

CLASS

Colour_ID SalesOrg_ID Type Quantity Cost Revenue Profit

Region

Class MODEL

Number of seats

Type Description Class Figure 13: Explicit hierarchies of dimensions

A constellation schema with explicit dimension hierarchies is the tuple D, F, H, CC, QC, where D is a set of dimension table schemes Di with attributes Ai, i=1,...,n, F is a set of fact table schemes, H (dimension hierarchies) is a subset D  D, CC is a set of cardinality constraints, and CC is a set of query constraints.  Dimension tables are structured into dimension hierarchies. A dimension hierarchy is a sequence Di1,...,Dik, k 1, where (Dij, Di(j+1)  H, j=1,...,k-1, or D, D D, such that the following conditions hold:

10

(a) all dimension table schemes in the sequence are different, (b) there are no two dimension tables schemes D´ and D´´, such that (D´,Di1) and (Dik,D´´) are in H, (c) if (Dj, Dk)  H, KDk is the key of Dk, then KDk is also an attribute of Dj, (d) each element of D and H participates at least in one dimension hierarchy, (e) if D is a dimension hierarchy, then D is not a member an any couple from H.  For each fact table F from F, there are subsets DF  D, CCF  CC, and QCF  QC such that DF, F, CCF, QCF  is a star schema.  The set CC is the union of two sets of integrity constraints ICD and ICF where (f) ICD is the set of cardinality constraints CCij defined for each pair (Di, Dj) in H, (g) ICF = UF  F CCF 

QC = UF  F QCF

A multidimensional database over a constellation scheme with explicit hierarchies S is a set of dimensional and fact tables that satisfy all cardinality constraints from CC. We can observe from the condition (c) that KDj in Di is again a foreign key in the same sense as in the connection a dimension table to a fact table. The condition (a) implies acyclicity of each hierarchy, the condition (b) guarantees its maximum length. With (d) we can model „isolated“ single dimensions. Notice that this definition extends slightly the similar definition in [Po97].

4. Conclusions We have discussed various approaches to the multidimensional modelling. Summarizing the discussion, we recommend the following methodology for designing DW: 1. Defining E-R schema for basic business entities and their relationships, 2. revealing relationships for fact tables and dimension hierarchies, 3. transforming the E-R schema into a DM schema, 4. extending DM schema with additional fact attributes (e.g. explicit aggregations), 5. defining query constraints. These steps cover first two levels of DW modelling. To implement a multidimensional database via RDBMs means to find a lossless transformation of the DM schema into a relational schema. The solution of this problem is given in [Po97]. Our future work will be devoted to developing a sufficient set of operations for querying dimension and fact tables. An inspiration can be found in the previous research works concerning an extension of the relational algebra by aggregation functions Kl82, or functional languages [Po96]. Other approach is used in LW96.

Literature Bu96 Buytendijk, F.A.: Multidimensional Data Analysis for OLAP. April 1996. ChD97 An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record 26(1), March, 1997. GLS96 Gyssens, M., Lakshmanan, V.S., Subramanian, I.N.: Tables As a Paradigm for Querying and Restructuring. Proc. ACM Symp. on Principles of Database Systems, Montreal, 1996.

11

GMR98 Golfarelli, M., Maio, D., Rizzi, S.: Conceptual Design of Data Warehouses from E/R Schemes. Proc. Fo the Hawaii Int. Conference on Systém Sciences, 1998, Kona, Hawaii, 1998. Gu97 Gupta, V.: An Introduction to data Warehousing. System Services Corporation, 1977. Ki96 Kimball, R.: The Data Warehouse Toolkit: Practical Techniques for Building Dimensional Data Warehouses. John Willey, 1996. Ki97 Kimball, R.: A Dimensional Manifesto. DBMS, August 1997. Kl82 Klug, A.: Equivalence of Relational Algebra and Relational Query Languages having Aggregate Functions. Jacm, Vol. 29, No. 3, 1982, pp. 699-717. LW96 Li, Ch., Wang, X.S.: A Data model for Supporting On-Line Analytical Processing. Proc. Of Conf. On Integration and Knowledge Management, November, 1996, pp.81-88. McG96 McGuff, F.: Data Modeling for Data Warehouses. http://members.aol.com/fmcguff/dwmodel, 1996. MK96 Meredith, M.E., Khader, A.: Designing Large Warehouses. Database Programming  Design, June 1996. [Po89] Pokorný, J.: A function: unifying mechanism for entity-oriented database models. Entity-Relationship Approach, C.Batini (Ed.), Elsevier Science Publishers B.V. (North-Holland),1989. pp.165-181. [Po96] Pokorný, J.: Conceptual modeling of statistical data. Proc. of Int. Conf. DEXA' 96, Zurich, 1996, pp. 377382. [Po97] Pokorný, J.: Conceptual modelling in OLAP. Submitted to ECIS ' 98 Conf., 1997. Ra95 Raden, A: Modeling the Data Warehouse. Archer Decision Sciences, Inc., 1995. Ra96 Raden N.: Star schema 101. http://www.netmar.com/nraden/str101_q.htm. RB96 Red Brick: Star Schemas and STARjoinTM TechnologyTM. White paper, 1996.

12

Suggest Documents