Data Mart Conceptual Modeling and Design

Data Warehouse/Data Mart Conceptual Modeling and Design 1. Methodological Framework • Conceptual Design & Logical Design • Top-Down Versus Botton-Up ...
Author: John Matthews
3 downloads 0 Views 4MB Size
Data Warehouse/Data Mart Conceptual Modeling and Design

1. Methodological Framework • Conceptual Design & Logical Design • Top-Down Versus Botton-Up Approach

(4)

• Design Phases and schemata derivations 2. Conceptual Modelling: The Dimensionnal Fact Model (DFM)

Bernard ESPINASSE Professeur à Aix-Marseille Université (AMU) Ecole Polytechnique Universitaire de Marseille

• Fact schema • Dimension hierarchies • Additive, semi-additive and non-additive attributes

September 2013

• Overlapping compatible fact schemata

Methodological Framework Conceptual Modelling: the Dimensionnal Fact Model (DFM) Conceptual Design: from Relational schema to DFM

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

1

• Representing query patterns on a fact schema 3. Conceptual Design : From Relationnal schema to DFM of Data Mart • Building the attribute tree, pruning and grafting the attribute tree • Defining dimensions, measures and granularity of data

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

2

• Books • Golfarelli M., Rizzi S., “Data Warehouse Design : Modern Principles and Methodologies”, McGrawHill, 2009. • Kimball R., Ross, M., “Entrepôts de données : guide pratique de modélisation dimensionnelle”, 2°édition, Ed. Vuibert, 2003. • S. Rizzi. “Conceptual modeling solutions for the data warehouse”. In Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications, J. Wang (Ed.), Information Science Reference, pp. 208-227, 2008. • M. Golfarelli, D. Maio, S. Rizzi. “Conceptual Design of Data Warehouses from E/R Schemes”. Proceedings 31st Hawaii International Conference on System Sciences (HICSS-31), vol. VII, Kona, Hawaii, pp. 334-343, 1998.

• Conceptual Design & Logical Design • Life-Cycle • Top-Down, Botton-Up and Mixed Strategies • Design Phases • Schemata derivations for DMs design

• Courses • Course of M. Golfarelli M. and S. Rizzi, University of Bologna • Courses of M. Böhlen and J. Gamper J., Free University of Bolzano

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

3

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

4

• Entite-Relation models are not very useful in modeling DWs

• Building a DW is a very complex task, which requires an accurate planning aimed at devising satisfactory answers to organizational and architectural questions

• Is now universally recognized that a DW is based on a multidimensional view of data : ! But there is still no agreement on HOW to implement its conceptual design !

• A large number of organizations lack experience and skills that are required to meet the challenges involved in DW projects

• Most of the time, DW design is at the logical level : a multidimensional model (star/snowflake schema) is directly designed : ! But a star schema is nothing but a relational schema: it contains only the definition of a set of relations and integrity constraints !

• Major cause of DW failures lies in the absence of a global view of the design process, of a design methodology

• A better approach: ! 1) design first a conceptual model : Conceptual Design ! 2) which is then translated into a logical model : Logical Design

• Tree main strategies for DW design:

Appl.

Appl.

DB2

DB3

Appl.

DB1

DB4

Botton-Up Approach

Data Marts

DM3

DM2

DM1

Appl.

Appl.

Appl.

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

Mixed Approach

Top-Down Approach

DW

! Mixed strategy Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

6

Analyze global business needs, plan how to develop a DW, design it, and implement it as a whole with its DMs

Bottom-Up Approach:

Trans..

Global Data Warehouse

! Botton-Up strategy

Top-Down Approach: 1. Design of DW 2. Design of DMs

Existing databases and systems (OLTP) Appl.

! Top-Down strategy

5

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

Appl.

• Design Methodologies are necessary to minimizing the risks for failure

1. Design of DMs 2. Integration of DMs in DW 3. Maybe no physical DW

Mixed Approach: 1. Design of DW for DM1 2. Design of DM2 and integration with DW 3. Design of DM3 and integration with DW 4. ... 7

(+) Stengths: ! Promising: it is based on a global picture of the goal to achieve, and in principle it ensures consistent, well integrated DW (-) Weakness: ! High-cost estimates with long-term implementations discourage company managers from embarking on these kind of projects. ! Analyzing and integrating all relevant sources at the same time is a very difficult task: they are all available and stable at the same time. ! Extremely difficult to forecast the specific needs of every department involved in a project, which leads to specific DMs ! As no working DW system is going to be delivered in the short term, users cannot check for this project to be useful, so they lose trust and interest in it. Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

8

Top-Down and Bottom-Up strategies should be mixed :

• DW is incrementally built and several DM are iteratively created • Each DM is based on a set of facts that are linked to a specific department and that can be interesting for a user group (+) Stengths: ! Leads to concrete results in a short time ! Does not require huge investments ! Enables designers to investigate one area at a time Gives managers a quick feedback about the actual benefits of the system being built (-) Weakness: Keeps the interest for the project constantly high may determine a partial vision of the business domain.

• When planning a DW, a bottom-up strategy should be followed • One Data Mart (DM) at a time is identified and prototyped according to a top-down strategy by building a conceptual schema for each fact of interest • The first DM (DM1) to prototype : ! is the one playing the most strategic role for the enterprise ! should be a backbone for the whole DW ! should lean on available and consistent data sources

=> Mixed strategy …

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

9

10

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

!"#"$%"&#$'()*+,$-.")() Phase 1: Goal setting and planning

Phase 2: Infrastructure Design

Phase 3: Design and developpement of Data Marts

Phase 1 : Goal setting and planning od the DW • set system goals, borders, and size • select an approach for design and implementation • estimate costs and benefits • analyze risks and expectations • examine the skills of the working team

Each Data Mart (DM) will be designed according these steps: db administrator

!"#$%&'()(*+,-,' ().'-)/&0$(/-") 1&2#-$&3&)/ ()(*+,-, 4")%&5/#(* .&,-0)

Phase 2 : Infrastructure design of the DW • analyze and compare the possible architectural solutions • assess the available technologies and tools • create a preliminary plan of the whole DW system

designer business user

6"$7*"(. ().'.(/('8"*#3& 9"0-%(* .&,-0) :;9 .&,-0)

Phase 3 : Design and development of DMs • Every iteration causes a new DM and new applications to be created and progressively added to the DW system

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

type; type -> category; category -> department … dimensions fact

day holiday

year

quarter

month

date

quantity receips unitPrice numberOfCustomer

• dimension attributes in the nodes along each sub-path of the hierarchy starting from the dimension define progressively coarser granularities.

salesManager salesDistrict

SALE

store

storeCity

state

country

week

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

19

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

20

non-dimension attributes contains additional information about an attribute of the hierarchy: it cannot be used for aggregation ! Ex : size : aggregating sales according to the size of the product would not make sense! department

size

category

marketingGroup

non dimension attribute

Optional arcs (marked by a dash) express optional relationships between pairs of attributes (useful for logical design) Ex : diet, promotion. The diet attribute takes a value (such as cholesterol-free, gluten-free, or sugar-free) only for food products; for the other products, it is undefined. department

size

brandCity

brandCity

category

marketingGroup

type

optional arc

brand

type brand

product product

diet salesManager

day holiday

salesManager

day holiday

year

quarter

month

date

quantity receips unitPrice numberOfCustomer

salesDistrict

SALE quantity receips unitPrice numberOfCustomer

year

store

quarter

salesDistrict

SALE

month

date

storeCity

state

country

telephone address

country

state

discount

endDate cost

21

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

Cross-dimensional attribute is a dimensionnal or descriptive attribute whose value is defined by the combination of 2 or more dimensional attributes, possibly belonging to different hierarchies. Ex : if a product Value Added Tax (VAT) depends both on the product category and on the country where the product is sold, you can use a cross-dimensional attribute to represent it: VAT

department category

marketingGroup

storeCity

promotion startDate

week

size

store

week

advertising

22

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

A convergence takes place when 2 dimensional attributes within a hierarchy are connected by 2 or more alternative paths of many-to-one associations (Graphically, use of arrows). Ex : in store dimension, store are grouped into sales districts and no inclusive relationship exists between districts and states, but each district is part of only one country: Store -> salesDistrict -> country or Store -> storeCity -> state -> country convergence

brandCity

cross-dimensionnal attributes

type brand product

diet

holiday

year

quarter

month

salesManager day

salesManager

day

date

quantity receips unitPrice numberOfCustomer

holiday

store

storeCity

state

year

country

month

date

quantity receips unitPrice numberOfCustomer

store

storeCity

state

country

week

week

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

quarter

salesDistrict

SALE

salesDistrict

SALE

23

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

24

Shared hierarchies exist when entire portion of hierarchies are frequently replicated 2 or more time in fact schemata In particular in time hierarchies, 2 or more date-type dimensions with different meaning can easily exist in a same fact, and need to build a month-year hierarchy on each one of them => an abreviation is introduced Ex: calling and called phone numbers … callingNumberType callingNumberDistrict

hour

CALL callingNumber

calledNumberDistrict

Ex : in a fact schema modeling the sales of books, whose dimensions are date and book. It would certainly be interesting to aggregate and select sales on the basis of book authors. However, it would not be accurate to model author as a dimensional child attribute of book because many different authors can write many books. Then, the relationship between books and authors is modeled as a multiple arc:

number duration

day date

calledNumber

calledNumberType

Multiple arc models a many-to-many association between the 2 dimensional attributes it connects (Graphically, denoted by doubling of the arc)

month

genre

holiday

year

SALE quantity receips unitPrice numberOfCustomer

shared hierarchy type

calling

hour

CALL O telNumber district

year

quarter

month

number duration

called

date

book

author

week date

month

multiple arc

year

roles

25

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

3 Types of measure : ! Flow measure: refer to a timeframe (ex: number of products sold in a day) ! Level measure: evaluated at particular time (ex: number of products in inventory) ! Unit measure: evaluated at particular time but are expressed in relative terms (ex: product unit price, discount percentage) ! Suitable operators for aggregation: Flow measures Level measures Unit measures

Temporal hierarchies SUM, AVG, MIN, MAX AVG, MIN, MAX AVG, MIN, MAX

Nontemporal hierarchies SUM, AVG, MIN, MAX SUM, AVG, MIN, MAX AVG, MIN, MAX

• In the DFM, along all the dimensions, by default measures are additive (operator SUM) • Non-additive measure can be explicitely specify with its operator(s) used for aggregation – other that SUM (Ex: AVG and MIN for inventory level) department category type

weight

brand

packaging

ItemPerPallet product

non additive measure

! additive along a dimension when can be used the SUM aggregation operator ! non-additive along a dimension if the aggregation operator is not SUM (ex: inventory level) ! a non-additive measure is non-aggregable is no operator exists (ex: unitPrice product) 27

address

AVG, MIN INVENTORY

3 Natures of measure :

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

26

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

level incomingQuantity year

quarter

month

date

warehouse

city

country

week

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

28



common to F and G if at least one dimension attribute is shared. • each hierarchy in H includes all and only the dimension attributes included in the corresponding hierarchies of both F and G. • Different facts are represented in different fact schemata

each hierarchy in H includes all and only the

attribut

semester and year by performing a simple calculation. dimension attributes included in the corresponding dimens hierarchies of both F and G. designe Thus, given the two compatible schemes in in the Figure attributefact dependencies two 3, schemes are not calculate the percentage of non-European employees f solution each city, job job and year. increas attribute quarter could inconflicting: principle be added to the time MAX In some cases, aggregation along a dimension can probabl dimension in the resulting• scheme. thefactother hand, the set On of the attributes in Hthe is the union of the carried out at different abstraction levels even t timeifhie sets that, in F andby G; adopting this corresponding dimension attributes were not explici designer must keep Consider in mind EMPLOYEES store month the 2 fact schemata : Re • the dimensions in H are the intersection of those in shown. For instance, given a month attribute 3.3. within solution, the time for extracting data by quarter will number of emp. F and G , assuming that a given dimension isyear time hierarchy, fact instances cancity be aggregated byschem quart state AVG max. salary • F represents all employees of an enterprise job DW may • Queries the user formulates on the require comparing fact attributes increase significantly; thus,common the best would to Fsolution and G if at least one dimension semester and year by MAX performing a simple calculatio MAX MAX The • attribute G non-European employees. taken from distinct, though related, schemata (drill across in OLAP) isthe shared. Thus, given the (a) two compatible fact schemes in Figure probably be to add explicitly theonly quarter attribute to the • each hierarchy in H includes all and only the attribute quarter could in principle be added toqueries the tim slice-an fact scheme. continent • 2 fact schemata are said compatible if they share at least one dimension time hierarchy in the employeedimension attributes included in the corresponding dimension in the resulting scheme. On the other hand, t attribute

month

EMPLOYEES

store

hierarchiesF of both F and G.

3.3. Representing query patterns on job a fact

• 2 compatible schemata F andnumber G mayofbeemp. overlapped to create a resulting schema scheme year city state AVG max. salary H

MAX

attribut

G designer must keep in mind that, by adopting th abstrac solution, for extracting data by quarter w nation job the time introdu increase significantly; thus, the best solution wou select a [2]. to t probably be to add explicitly the quarter attribute On a NON-EUROPEAN time hierarchy in the employee fact scheme.

quarter

MAX MAX • Without conflict between attribute dependencies in the 2 schemata:

query p

EMPLOYEES

EMPLOYEES store number of emp. year 3.3.AVG The basic OLAP operatorsmonth for formulating typical Representing query patterns on a factthe dim city state placed (a) number of across emp. scheme year queries on DWs are roll-up, drill down, drill and city state fact ins AVG • the set of the fact attributes in H is the union of the sets in F and G max. salary age contain sex MAX slice-and-dice; they are used, respectively, to aggregate fact MAX range continent of those in F and G, assuming The basic OLAP operators for formulatingis typic invol • the dimensions in H are the intersection (a) (b) attributes in order to view data at a higher level of not beas queries on DWs are roll-up, drill down, drill across that a given dimension is common to F and G if at least one dimension The slice-and-dice; job they are used, respectively, to aggregate fa abstraction, disaggregate fact attributes incontinent order to attribute is shared attributes in order MAX to view data at a higher combin level nation job introduce further detail, relate andGcompare distinct facts, F and are compatible, they share the time, job and store dimensions abstraction, disaggregate fact attributes in comput order pattern • each hierarchy in H includes all and only the dimension attributesselect and project facts so as to reduce their nation job dimensionality introduce further detail, relate and compare distinct sold fac and ALL EMPLOYEES year select and project facts so as to reduce their dimensional included in the corresponding hierarchies of both F and G. each ty [2]. number of emp. [2]. the ratio city state AVG max salary of emp. On a fact scheme, a query may be represented by a a fact scheme, a query may be represented by NON-EUROPEAN MAXOnnumb. NON-EUROPEAN of non-eur. emp. MAX query pattern, which consists in a set of markers placed EMPLOYEES quarter AVG query pattern, which consists in aquarter set of markers placed on EMPLOYEES the dimension attributes. One or 30 more markers can (c) Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design 29 Bernard ESPINASSE - Data Warehouse Conceptual and Design number of emp. modeling year city state the dimension attributes. One or more AVG markers can be number of emp. year placed within each hierarchy, to indicate at what level city state Figure 3. Scheme overlapping. AVG fact instances must be aggregated. A dimension may al placed within each hierarchy, to indicate at what level(s) age contain nofact markers, toinindicate none Consider the two schemes Figures that 3.a and 3.b:of its attribut sex fact instances must be aggregated. A dimension may also range is involvedallinemployees the query. the first represents of Non-dimension an enterprise, theattributes ne age (b)attributes contain no markers, to indicate that none of its notthe benon-European shown on theemployees. query pattern. second only Although these sex range schemes are aimed at extracting information, they may be a The data shown different as a result of a query is involved in the query. Non-dimension job attributes need month are compatible; in fact they shareattributes, the time, job and store MAX combination of fact and/or the result of a (b) G is H: not be shown on the query pattern. • Schema resulting from overlapping F and dimensions. The be scheme resulting fromatoverlapping is computation made on out them. Figure 4 shows the que • In some cases, aggregation along a dimension can carried shown inpattern Figure representing 3.c; it can bethe used, for instance, The data shown as adifferent result ofabstraction a query may be even any if the corresponding following query:to"total quant levels dimension attributes job sold and average returns per unit sold for each week and f ALL MAX combination of fact attributes, theEMPLOYEES result of any were year notand/or explicitly shown. each type of product". The average returns per unit sold number of emp.the query computation made on them. Figure 4 shows the ratio between the total returns and the quantity sold. city state AVG max salary of emp. MAX pattern representing the following quantity • Ex: aquery: month attribute within a time hierarchy, fact instances can be numb. of"total non-eur. emp. MAX AVG category sold and average returns per unit sold for each and forsemester and year by performingtype ALL EMPLOYEES aggregated byweek quarter, a simple year (c) manufacturer each type of product". The average returns per unit sold is calculation. number of emp. Scheme overlapping. product the ratio between the total returnsFigure and the3.quantity sold. city state AVG

max salary of emp. MAX numb. of non-eur. emp. MAX AVG (c)

• Thus,thegiven the F and G fact3.a schemata, attribute quarter could in sales manager Consider two fact schemes in Figures and 3.b: the first representsbe alladded employees an time enterprise, the principle to ofthe dimension in the resulting schema H category second only the non-European employees. Although these SALE schemes are aimed at extracting different information, they manufacturer type qty sold month week that, by adopting On the other hand, in mind store city stat are•compatible; in fact they sharethe the designer time, job andmust store keep returns/qty sold dimensions. The scheme resulting fromextracting overlapping is this solution, the time for data by quarter will increase product shown in Figure 3.c; it can be used, for instance, to Figure 4. Query pattern. significantly

• H can be used, for instance, to calculate the percentage of non-European 3. year. Scheme overlapping. employees for each Figure city, job and

Consider the two fact schemes in Figures 3.a and 3.b: the first represents all employees of an enterprise, the second only the non-European employees. Although these schemes are aimed at extracting different information, they are compatible; in fact they share the time, job and store Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design dimensions. The scheme resulting from overlapping is shown in Figure 3.c; it can be used, for instance, to

• thus, thesales bestmanager solution would probably be to add explicitly the quarter attribute to the time hierarchy in the employee fact schema. SALE month week 31

qty sold store city state returns/qty sold - Data Warehouse Conceptual modeling and Design Bernard ESPINASSE

Figure 4. Query pattern.

32

As a guideline, most measures in a fact scheme should be additive. An example of additive measure in the sale scheme is qty sold: the quantity sold for a given sales manager is the sum of the quantities sold for all the stores managed by that sales manager. A measure may be non-additive on one or more dimensions. Examples of this are all the measures expressing a level, such as an inventory level, a temperature, etc. An inventory level is non-additive on time, but it is additive on the other dimensions. A temperature measure is non-additive on all the dimensions, since adding up two temperatures hardly makes sense. However, this kind of non-additive measures can still be Fact schema INVENTORY : Fact aggregated by using operators such as average, maximum, minimum; Figure 5 shows an manager example where both operators AVG and MIN can be used for aggregation; measure qty expresses, for each product, the number of copies present within each warehouse during department category each week.

type

product

invoice number order date

brand units perseason pallet

year quarter month

season

year month week

warehouse city

brand diet

qty stateshipped .....

deal

state

type carrier address

allowance

corporate

address

customer

customer SHIPMENT ship to

city

state

ship from address

address

ship mode

AVG, MIN

type

brand invoice number product diet order date corporate address

INVENTORY qty

department category

SHIPMENT ship to year quarter month datecity qty shipped ship from .....

date

address

season

manager

weight package size

type

weight package size

category weight package size package type product

schema SHIPMENT:

contact person

contact person ship mode

deal

type

terms carrier address incentive

allowance

terms incentive

• Building the attribute tree • Pruning and grafting the attribute tree • Defining dimensions • Defining measures (fact attributes) • Defining the granularity of data (hierarchies).

(a)

(a) Fact schema overlaping INVENTORY and SHIPMENT: category

Fig. 5. The INVENTORY fact scheme.

category

For other measures, aggregation is inherently impossible for conceptual reasons. type weight brand Consider the measure number of customers in the sale example, estimated package for a size given product product, day and store by counting the number of purchase tickets for that product printed season on that day in that store. Since the same ticket may include other products, adding or season averaging the number of customers for two or more products would lead to an inconsistent month SHIPMENT year result. Thus, number of customers is non-aggregable on the product dimension (while it! year INVENTORY is additive on the time and the stores dimensions). In this case, the reason for nonqty shipped aggregability is that the relationship between purchase tickets and products is many-toinventory qty AVG, ..... many instead of many-to-one: measure number of customers cannot be consistently MIN

weight package size

type brand

product

month

SHIPMENT ! INVENTORY

qty shipped inventory qty AVG, ..... MIN

(b) (b) Fig. 8. The SHIPMENT scheme (a) and its overlap with INVENTORY (b). Fig. 8. The SHIPMENT scheme (a) and its overlap with INVENTORY (b).



The measures in f are the union of those in f' and f". Thus, the fact on which f is

• The Conceptual measures in f are the union of those in and f". Thus, which f is embracing both f' and f". centred may bef'considered as athe sortfact of on "macro-fact" Bernard ESPINASSE - Data Warehouse modeling and Design

33

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

centred may be considered •as aEach sort ofhierarchy "macro-fact" f' andonly f". the attributes included in the corresponding in embracing f includesboth all and • Each hierarchy in f includes all and only the included in thefunctional corresponding hierarchies ofattributes both f' and f". The dependencies expressed by the interhierarchies of both f' and f".attribute The functional expressed by the interlinks independencies f' and f" are preserved. attribute links in f' and f" are • preserved. The domain of each dimension attribute in f is the intersection of the domains of the • The domain of each dimension attribute in f isattributes the intersection of f". the domains of the corresponding in f' and corresponding attributes in f' and f".

34

• Facts correspond to events occurring dynamically

The step to derive DF schemata from Relational schema is :

• Within an Relational schema, a fact is represented by a table:

• 1. Finding and defining facts from Relational schema

• Tables representing frequently updated archives are good candidates to define facts

For each fact :

• Tables representing nearly-static archives or representing structural properties of the domain (such as STORE and CITY), are not candidates to define facts

• 2. Building the Attribute Tree from Relational schema • 3. Building the Fact Schema from Attribute Tree

• Each fact identified on the Relational schema becomes the root of an attribute tree, that become a fact schema.

Note that the step to derive DF schemata from E/R schema is very similar: the main difference concerns the algorithm used to build the attribute tree

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

35

Ex : In the case the more important fact is a product sale is represented by the SALES table

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

36

Relational schema of the DVD rental BD:

For each fact defined, the attribute tree is built as follow : • Each node of the attribute tree corresponds to one or more Relational schema attributes

• CARDS (cardNumber, expiry) • CUSTOMERS (cardNumber:CARDS, name, gender, address, telephone, personalDocument) • MOVIES (moviesCode, title, category, director, lengh, mainActor) • COPIES (positionOnShelf, movieCode:MOVIES) • RENTALS (positionOnShelf:COPIES, cardNumber:CARDS, date, time)

The table RENTALS is the only candidate for expressing facts, the attribute tree associated is:

• The root of the attribute tree corresponds to the primary key of F

expiry

• For each node v, the corresponding attribute functionally determines all the attributes that correspond to the descendants of v (functionnal dependencies)

title

name cardNumber (CUSTOMER)

positionOnShelf (RENTALS)

movieCode category

telephone cardNumber (CARDS)

gender

positionOnShelf (COPIES)

lengh

address

director

personalDocument

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

37

Relational schema of the Flight BD:

date

city

country

Attribute Tree 2 (FLIGHTS_INSTANCES) city

fromAirport

carrier fromAirport flightNumber (FLIGHTS_INSTANCES)

airline

flightNumber (FLIGHTS)

departureTime

FLIGHTS FLIGHT_INSTANCES TICKETS CHECK_IN

flightNumber (FLIGHTS)

departureTime toAirport

name

toAirport

date

name city

39

name

country

carrier

The tables that are candidates for expressing facts are :

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

38

name

airline

• • • •

mainActor

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

Attribute Tree 1 (FLIGHTS)

FLIGHTS (flightNumber, airline, fromAirport:AIRPORTS) FLIGHT_INSTANCES (FlightNumber:FLIGHTS, date) AIRPORTS (IATAcode, name, city, country) TICKETS (ticketNumber, flightNumber:FLIGHT_INSTANCES), seat, fate, passengersFirstName, passengersSurname, passengersGender) • CHECK-IN (ticketNumber:TICKETS, CheckInTime, numberOfBags) • • • •

time

country

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

city

country

40

Attribute Tree 3 (TICKETS):

Attribute Tree 4 (CHECK_IN): city

city

name fare

name

country fare country

checkInTime

carrier

fromAirport flightNumber (FLIGHTS)

fromAirport

ticketNumber (TICKETS)

airline

flightNumber (FLIGHTS)

ticketNumber (TICKETS)

airline

flightNumber (FLIGHTS_INSTANCES)

departureTime flightNumber (FLIGHTS_INSTANCES)

departureTime

checkInTime

carrier

ticketNumber (CHECK_IN)

ticketNumber (CHECK_IN)

numberOfBags

passengerGender toAirport

passengerGender toAirport

passagerLastName

date

passagerFirstName

passagerLastName

date

numberOfBags

passagerFirstName

name city

name city

country

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

41

42

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

3.1: Pruning and grafting the attribute tree:

For each fact:

expiry

• 3.1. Pruning and grafting the attribute tree:

title

name cardNumber (CUSTOMER)

• We can retain or graft any nodes corresponding to composite keys

positionOnShelf (RENTALS)

movieCode category

telephone cardNumber (CARDS)

gender

• We can modify, add, or delete a fuctional dependency

positionOnShelf (COPIES)

lengh

address

director

personalDocument

• We can add one or more fuctional dependencies if a non-mormalized table exists in the relational schema

gender

• 3.2. Defining Fact Schema with its dimensions (fact dimensions)

date

customer

time

positionOnShelf (RENTALS)

cardNumber (CUSTOMER)

• 3.3. Defining Fact Schema measures (fact attributes)

mainActor

title category

lengh

date

• 3.4. Defining Fact Schema granularity of data (dimension hierarchies).

director mainActor

The step to derive DF schemata from E/R schema is very similar: the main difference concerns the algorithm used to build the attribute tree

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

country

Facts TICKETS and CHECK_IN are the best choices because existing functional dependencies permit to include a maximum of attributs in trees 3 and 4.

43

• • • •

movieCode and Title are inverted cardNumber(CARDS) and name (renamed customer) are inverted positionOnShelf(COPIES) and cardNumber(CARDS) are grafted time, expiry, telephone, address, personalDocument, movieCode and cardNumber(CUSTOMERS) are pruned

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

44

Fact schema “RENTAL”:

Relational schema of the DVD rental BD:

gender

positionOnShelf (RENTALS)

customer

• •

title category

cardNumber (CUSTOMER)

lengh

• • •

director

date

mainActor

dimensions

CARDS (cardNumber, expiry) CUSTOMERS (cardNumber:CARDS, name, gender, address, telephone, personalDocument) MOVIES (moviesCode, title, category, director, lengh, mainActor) COPIES (positionOnShelf, movieCode:MOVIES) RENTALS (positionOnShelf:COPIES, cardNumber:CARDS, date, time)

SQL measure glossaries for fact schema “RENTAL”: fact

number = SELECT COUNT (*) FROM RENTALS R INNER JOINT COPIES C ON R.positionOnShelf = C.positionOnShelf, COPIES C INNER JOINT MOVIES F RENTALS R INNER JOINT CUSTOMERS C ON R.cardNumber = C.cardNumber GROUP BY F.title, R.date, C.name;

date gender

customer

RENTAL

title

category

number

measure

lengh director mainActor

45

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

Pruning and grafting the attribute tree:

46

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

Fact schema “TICKET ISSUE”:

country carrier

date

fare

check-in

city

country fromAirport ticketNumber (TICKETS)

airline

city

numberOfBags

Airport

flightNumber

departureTime

from to

date

check-in

numberOfFlights numberOfBags receipts

airline

seat

toAirport

TICKET ISSUE

flightNumber

passengerGender

departureTime

passengerGender

city

arrivalTime

carrier

country

• country is now the child of city • checkIn is now a bolean added on the tree when number node was grafted: ist value is TRUE only for tickets whose passengers have checked in.

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

47

Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design

48

FLIGHTS (flightNumber, airline, fromAirport:AIRPORTS) FLIGHT_INSTANCES (FlightNumber:FLIGHTS, date) AIRPORTS (IATAcode, name, city, country) TICKETS (ticketNumber, flightNumber:FLIGHT_INSTANCES), seat, fate, passengersFirstName, passengersSurname, passengersGender) • CHECK-IN (ticketNumber:TICKETS, CheckInTime, numberOfBags) • • • •

The following relationnal logical schema describes an operational database for car rentals : • • • •

SQL measure glossaries for fact schema “TICKET ISSUE”: numberOfFlight = SELECT COUNT (*) FROM TICKETS T INNER JOINT FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date GROUP BY T.passengerNumber, I.date, T.flightNumber; numberOfBags = SELECT SUM (C.numberOfBag) FROM TICKETS T INNER JOINT FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date TICKETS T INNER JOINT CHECK_IN C ON T.ticketNumber = C.ticketNumber GROUP BY T.ticketNumber, I.date, T.flightNumber; receipts = SELECT SUM (T.fare) FROM TICKETS T INNER JOINT FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date GROUP BY T.passengerGender, I.date, T.flightNumber;

• • • •

RENTAL_OFFICES (OfficeName, City, Area, State, Country) CARS (LicensePlate, Category, Model, Brand, Fuel, RegistrationDate) HAVE_OPTIONAL (LicensePlate:CARS, Optional) RENTALS (LicensePlate: CARS, PickupDate, DropoffDate, PickupPlace:RENTALJDFFICES, DropoffPlace :RENTAL_OFFICES, Miles) DRIVERS (LicenseNumber, LicenseExpiration, DriverName, Birthdate) DRIVE (LicenseNumber: DRIVERS,(LicensePlate, PickupDate):RENTALS) INSURANCES (Risk,(LicensePlate, PickupDate):RENTALS, Cost) PAYMENTS ((LicensePlate, PickupDate):RENTALS, Amount, Discount, PaymentMode) !5.,

3#$%&5-

;&,&.

Some hidden functional dependencies hold: City->State->Country->Area, and Model->Brand. Inspect and3(&normalize the source schema, then choose 7$.1 899(*.:,". a fact of interest and design its fact schema. 05#=6785'.)&'()>=7047)..*?);6+@)7 $*+)

26785'.)

&67'(*03

&'() !"#$% =0*3(. !*,(%-.()+/01)

26785'.)

$*89)(/01)

/0.( :,,*8)/01)