Data Warehouse/Data Mart Conceptual Modeling and Design
1. Methodological Framework • Conceptual Design & Logical Design • Top-Down Versus Botton-Up Approach
(4)
• Design Phases and schemata derivations 2. Conceptual Modelling: The Dimensionnal Fact Model (DFM)
Bernard ESPINASSE Professeur à Aix-Marseille Université (AMU) Ecole Polytechnique Universitaire de Marseille
• Fact schema • Dimension hierarchies • Additive, semi-additive and non-additive attributes
September 2013
• Overlapping compatible fact schemata
Methodological Framework Conceptual Modelling: the Dimensionnal Fact Model (DFM) Conceptual Design: from Relational schema to DFM
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
1
• Representing query patterns on a fact schema 3. Conceptual Design : From Relationnal schema to DFM of Data Mart • Building the attribute tree, pruning and grafting the attribute tree • Defining dimensions, measures and granularity of data
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
2
• Books • Golfarelli M., Rizzi S., “Data Warehouse Design : Modern Principles and Methodologies”, McGrawHill, 2009. • Kimball R., Ross, M., “Entrepôts de données : guide pratique de modélisation dimensionnelle”, 2°édition, Ed. Vuibert, 2003. • S. Rizzi. “Conceptual modeling solutions for the data warehouse”. In Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications, J. Wang (Ed.), Information Science Reference, pp. 208-227, 2008. • M. Golfarelli, D. Maio, S. Rizzi. “Conceptual Design of Data Warehouses from E/R Schemes”. Proceedings 31st Hawaii International Conference on System Sciences (HICSS-31), vol. VII, Kona, Hawaii, pp. 334-343, 1998.
• Conceptual Design & Logical Design • Life-Cycle • Top-Down, Botton-Up and Mixed Strategies • Design Phases • Schemata derivations for DMs design
• Courses • Course of M. Golfarelli M. and S. Rizzi, University of Bologna • Courses of M. Böhlen and J. Gamper J., Free University of Bolzano
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
3
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
4
• Entite-Relation models are not very useful in modeling DWs
• Building a DW is a very complex task, which requires an accurate planning aimed at devising satisfactory answers to organizational and architectural questions
• Is now universally recognized that a DW is based on a multidimensional view of data : ! But there is still no agreement on HOW to implement its conceptual design !
• A large number of organizations lack experience and skills that are required to meet the challenges involved in DW projects
• Most of the time, DW design is at the logical level : a multidimensional model (star/snowflake schema) is directly designed : ! But a star schema is nothing but a relational schema: it contains only the definition of a set of relations and integrity constraints !
• Major cause of DW failures lies in the absence of a global view of the design process, of a design methodology
• A better approach: ! 1) design first a conceptual model : Conceptual Design ! 2) which is then translated into a logical model : Logical Design
• Tree main strategies for DW design:
Appl.
Appl.
DB2
DB3
Appl.
DB1
DB4
Botton-Up Approach
Data Marts
DM3
DM2
DM1
Appl.
Appl.
Appl.
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
Mixed Approach
Top-Down Approach
DW
! Mixed strategy Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
6
Analyze global business needs, plan how to develop a DW, design it, and implement it as a whole with its DMs
Bottom-Up Approach:
Trans..
Global Data Warehouse
! Botton-Up strategy
Top-Down Approach: 1. Design of DW 2. Design of DMs
Existing databases and systems (OLTP) Appl.
! Top-Down strategy
5
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
Appl.
• Design Methodologies are necessary to minimizing the risks for failure
1. Design of DMs 2. Integration of DMs in DW 3. Maybe no physical DW
Mixed Approach: 1. Design of DW for DM1 2. Design of DM2 and integration with DW 3. Design of DM3 and integration with DW 4. ... 7
(+) Stengths: ! Promising: it is based on a global picture of the goal to achieve, and in principle it ensures consistent, well integrated DW (-) Weakness: ! High-cost estimates with long-term implementations discourage company managers from embarking on these kind of projects. ! Analyzing and integrating all relevant sources at the same time is a very difficult task: they are all available and stable at the same time. ! Extremely difficult to forecast the specific needs of every department involved in a project, which leads to specific DMs ! As no working DW system is going to be delivered in the short term, users cannot check for this project to be useful, so they lose trust and interest in it. Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
8
Top-Down and Bottom-Up strategies should be mixed :
• DW is incrementally built and several DM are iteratively created • Each DM is based on a set of facts that are linked to a specific department and that can be interesting for a user group (+) Stengths: ! Leads to concrete results in a short time ! Does not require huge investments ! Enables designers to investigate one area at a time Gives managers a quick feedback about the actual benefits of the system being built (-) Weakness: Keeps the interest for the project constantly high may determine a partial vision of the business domain.
• When planning a DW, a bottom-up strategy should be followed • One Data Mart (DM) at a time is identified and prototyped according to a top-down strategy by building a conceptual schema for each fact of interest • The first DM (DM1) to prototype : ! is the one playing the most strategic role for the enterprise ! should be a backbone for the whole DW ! should lean on available and consistent data sources
=> Mixed strategy …
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
9
10
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
!"#"$%"$'()*+,$-.")() Phase 1: Goal setting and planning
Phase 2: Infrastructure Design
Phase 3: Design and developpement of Data Marts
Phase 1 : Goal setting and planning od the DW • set system goals, borders, and size • select an approach for design and implementation • estimate costs and benefits • analyze risks and expectations • examine the skills of the working team
Each Data Mart (DM) will be designed according these steps: db administrator
!"#$%&'()(*+,-,' ().'-)/&0$(/-") 1&2#-$&3&)/ ()(*+,-, 4")%&5/#(* .&,-0)
Phase 2 : Infrastructure design of the DW • analyze and compare the possible architectural solutions • assess the available technologies and tools • create a preliminary plan of the whole DW system
designer business user
6"$7*"(. ().'.(/('8"*#3& 9"0-%(* .&,-0) :;9 .&,-0)
Phase 3 : Design and development of DMs • Every iteration causes a new DM and new applications to be created and progressively added to the DW system
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
type; type -> category; category -> department … dimensions fact
day holiday
year
quarter
month
date
quantity receips unitPrice numberOfCustomer
• dimension attributes in the nodes along each sub-path of the hierarchy starting from the dimension define progressively coarser granularities.
salesManager salesDistrict
SALE
store
storeCity
state
country
week
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
19
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
20
non-dimension attributes contains additional information about an attribute of the hierarchy: it cannot be used for aggregation ! Ex : size : aggregating sales according to the size of the product would not make sense! department
size
category
marketingGroup
non dimension attribute
Optional arcs (marked by a dash) express optional relationships between pairs of attributes (useful for logical design) Ex : diet, promotion. The diet attribute takes a value (such as cholesterol-free, gluten-free, or sugar-free) only for food products; for the other products, it is undefined. department
size
brandCity
brandCity
category
marketingGroup
type
optional arc
brand
type brand
product product
diet salesManager
day holiday
salesManager
day holiday
year
quarter
month
date
quantity receips unitPrice numberOfCustomer
salesDistrict
SALE quantity receips unitPrice numberOfCustomer
year
store
quarter
salesDistrict
SALE
month
date
storeCity
state
country
telephone address
country
state
discount
endDate cost
21
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
Cross-dimensional attribute is a dimensionnal or descriptive attribute whose value is defined by the combination of 2 or more dimensional attributes, possibly belonging to different hierarchies. Ex : if a product Value Added Tax (VAT) depends both on the product category and on the country where the product is sold, you can use a cross-dimensional attribute to represent it: VAT
department category
marketingGroup
storeCity
promotion startDate
week
size
store
week
advertising
22
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
A convergence takes place when 2 dimensional attributes within a hierarchy are connected by 2 or more alternative paths of many-to-one associations (Graphically, use of arrows). Ex : in store dimension, store are grouped into sales districts and no inclusive relationship exists between districts and states, but each district is part of only one country: Store -> salesDistrict -> country or Store -> storeCity -> state -> country convergence
brandCity
cross-dimensionnal attributes
type brand product
diet
holiday
year
quarter
month
salesManager day
salesManager
day
date
quantity receips unitPrice numberOfCustomer
holiday
store
storeCity
state
year
country
month
date
quantity receips unitPrice numberOfCustomer
store
storeCity
state
country
week
week
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
quarter
salesDistrict
SALE
salesDistrict
SALE
23
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
24
Shared hierarchies exist when entire portion of hierarchies are frequently replicated 2 or more time in fact schemata In particular in time hierarchies, 2 or more date-type dimensions with different meaning can easily exist in a same fact, and need to build a month-year hierarchy on each one of them => an abreviation is introduced Ex: calling and called phone numbers … callingNumberType callingNumberDistrict
hour
CALL callingNumber
calledNumberDistrict
Ex : in a fact schema modeling the sales of books, whose dimensions are date and book. It would certainly be interesting to aggregate and select sales on the basis of book authors. However, it would not be accurate to model author as a dimensional child attribute of book because many different authors can write many books. Then, the relationship between books and authors is modeled as a multiple arc:
number duration
day date
calledNumber
calledNumberType
Multiple arc models a many-to-many association between the 2 dimensional attributes it connects (Graphically, denoted by doubling of the arc)
month
genre
holiday
year
SALE quantity receips unitPrice numberOfCustomer
shared hierarchy type
calling
hour
CALL O telNumber district
year
quarter
month
number duration
called
date
book
author
week date
month
multiple arc
year
roles
25
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
3 Types of measure : ! Flow measure: refer to a timeframe (ex: number of products sold in a day) ! Level measure: evaluated at particular time (ex: number of products in inventory) ! Unit measure: evaluated at particular time but are expressed in relative terms (ex: product unit price, discount percentage) ! Suitable operators for aggregation: Flow measures Level measures Unit measures
Temporal hierarchies SUM, AVG, MIN, MAX AVG, MIN, MAX AVG, MIN, MAX
Nontemporal hierarchies SUM, AVG, MIN, MAX SUM, AVG, MIN, MAX AVG, MIN, MAX
• In the DFM, along all the dimensions, by default measures are additive (operator SUM) • Non-additive measure can be explicitely specify with its operator(s) used for aggregation – other that SUM (Ex: AVG and MIN for inventory level) department category type
weight
brand
packaging
ItemPerPallet product
non additive measure
! additive along a dimension when can be used the SUM aggregation operator ! non-additive along a dimension if the aggregation operator is not SUM (ex: inventory level) ! a non-additive measure is non-aggregable is no operator exists (ex: unitPrice product) 27
address
AVG, MIN INVENTORY
3 Natures of measure :
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
26
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
level incomingQuantity year
quarter
month
date
warehouse
city
country
week
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
28
•
common to F and G if at least one dimension attribute is shared. • each hierarchy in H includes all and only the dimension attributes included in the corresponding hierarchies of both F and G. • Different facts are represented in different fact schemata
each hierarchy in H includes all and only the
attribut
semester and year by performing a simple calculation. dimension attributes included in the corresponding dimens hierarchies of both F and G. designe Thus, given the two compatible schemes in in the Figure attributefact dependencies two 3, schemes are not calculate the percentage of non-European employees f solution each city, job job and year. increas attribute quarter could inconflicting: principle be added to the time MAX In some cases, aggregation along a dimension can probabl dimension in the resulting• scheme. thefactother hand, the set On of the attributes in Hthe is the union of the carried out at different abstraction levels even t timeifhie sets that, in F andby G; adopting this corresponding dimension attributes were not explici designer must keep Consider in mind EMPLOYEES store month the 2 fact schemata : Re • the dimensions in H are the intersection of those in shown. For instance, given a month attribute 3.3. within solution, the time for extracting data by quarter will number of emp. F and G , assuming that a given dimension isyear time hierarchy, fact instances cancity be aggregated byschem quart state AVG max. salary • F represents all employees of an enterprise job DW may • Queries the user formulates on the require comparing fact attributes increase significantly; thus,common the best would to Fsolution and G if at least one dimension semester and year by MAX performing a simple calculatio MAX MAX The • attribute G non-European employees. taken from distinct, though related, schemata (drill across in OLAP) isthe shared. Thus, given the (a) two compatible fact schemes in Figure probably be to add explicitly theonly quarter attribute to the • each hierarchy in H includes all and only the attribute quarter could in principle be added toqueries the tim slice-an fact scheme. continent • 2 fact schemata are said compatible if they share at least one dimension time hierarchy in the employeedimension attributes included in the corresponding dimension in the resulting scheme. On the other hand, t attribute
month
EMPLOYEES
store
hierarchiesF of both F and G.
3.3. Representing query patterns on job a fact
• 2 compatible schemata F andnumber G mayofbeemp. overlapped to create a resulting schema scheme year city state AVG max. salary H
MAX
attribut
G designer must keep in mind that, by adopting th abstrac solution, for extracting data by quarter w nation job the time introdu increase significantly; thus, the best solution wou select a [2]. to t probably be to add explicitly the quarter attribute On a NON-EUROPEAN time hierarchy in the employee fact scheme.
quarter
MAX MAX • Without conflict between attribute dependencies in the 2 schemata:
query p
EMPLOYEES
EMPLOYEES store number of emp. year 3.3.AVG The basic OLAP operatorsmonth for formulating typical Representing query patterns on a factthe dim city state placed (a) number of across emp. scheme year queries on DWs are roll-up, drill down, drill and city state fact ins AVG • the set of the fact attributes in H is the union of the sets in F and G max. salary age contain sex MAX slice-and-dice; they are used, respectively, to aggregate fact MAX range continent of those in F and G, assuming The basic OLAP operators for formulatingis typic invol • the dimensions in H are the intersection (a) (b) attributes in order to view data at a higher level of not beas queries on DWs are roll-up, drill down, drill across that a given dimension is common to F and G if at least one dimension The slice-and-dice; job they are used, respectively, to aggregate fa abstraction, disaggregate fact attributes incontinent order to attribute is shared attributes in order MAX to view data at a higher combin level nation job introduce further detail, relate andGcompare distinct facts, F and are compatible, they share the time, job and store dimensions abstraction, disaggregate fact attributes in comput order pattern • each hierarchy in H includes all and only the dimension attributesselect and project facts so as to reduce their nation job dimensionality introduce further detail, relate and compare distinct sold fac and ALL EMPLOYEES year select and project facts so as to reduce their dimensional included in the corresponding hierarchies of both F and G. each ty [2]. number of emp. [2]. the ratio city state AVG max salary of emp. On a fact scheme, a query may be represented by a a fact scheme, a query may be represented by NON-EUROPEAN MAXOnnumb. NON-EUROPEAN of non-eur. emp. MAX query pattern, which consists in a set of markers placed EMPLOYEES quarter AVG query pattern, which consists in aquarter set of markers placed on EMPLOYEES the dimension attributes. One or 30 more markers can (c) Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design 29 Bernard ESPINASSE - Data Warehouse Conceptual and Design number of emp. modeling year city state the dimension attributes. One or more AVG markers can be number of emp. year placed within each hierarchy, to indicate at what level city state Figure 3. Scheme overlapping. AVG fact instances must be aggregated. A dimension may al placed within each hierarchy, to indicate at what level(s) age contain nofact markers, toinindicate none Consider the two schemes Figures that 3.a and 3.b:of its attribut sex fact instances must be aggregated. A dimension may also range is involvedallinemployees the query. the first represents of Non-dimension an enterprise, theattributes ne age (b)attributes contain no markers, to indicate that none of its notthe benon-European shown on theemployees. query pattern. second only Although these sex range schemes are aimed at extracting information, they may be a The data shown different as a result of a query is involved in the query. Non-dimension job attributes need month are compatible; in fact they shareattributes, the time, job and store MAX combination of fact and/or the result of a (b) G is H: not be shown on the query pattern. • Schema resulting from overlapping F and dimensions. The be scheme resulting fromatoverlapping is computation made on out them. Figure 4 shows the que • In some cases, aggregation along a dimension can carried shown inpattern Figure representing 3.c; it can bethe used, for instance, The data shown as adifferent result ofabstraction a query may be even any if the corresponding following query:to"total quant levels dimension attributes job sold and average returns per unit sold for each week and f ALL MAX combination of fact attributes, theEMPLOYEES result of any were year notand/or explicitly shown. each type of product". The average returns per unit sold number of emp.the query computation made on them. Figure 4 shows the ratio between the total returns and the quantity sold. city state AVG max salary of emp. MAX pattern representing the following quantity • Ex: aquery: month attribute within a time hierarchy, fact instances can be numb. of"total non-eur. emp. MAX AVG category sold and average returns per unit sold for each and forsemester and year by performingtype ALL EMPLOYEES aggregated byweek quarter, a simple year (c) manufacturer each type of product". The average returns per unit sold is calculation. number of emp. Scheme overlapping. product the ratio between the total returnsFigure and the3.quantity sold. city state AVG
max salary of emp. MAX numb. of non-eur. emp. MAX AVG (c)
• Thus,thegiven the F and G fact3.a schemata, attribute quarter could in sales manager Consider two fact schemes in Figures and 3.b: the first representsbe alladded employees an time enterprise, the principle to ofthe dimension in the resulting schema H category second only the non-European employees. Although these SALE schemes are aimed at extracting different information, they manufacturer type qty sold month week that, by adopting On the other hand, in mind store city stat are•compatible; in fact they sharethe the designer time, job andmust store keep returns/qty sold dimensions. The scheme resulting fromextracting overlapping is this solution, the time for data by quarter will increase product shown in Figure 3.c; it can be used, for instance, to Figure 4. Query pattern. significantly
• H can be used, for instance, to calculate the percentage of non-European 3. year. Scheme overlapping. employees for each Figure city, job and
Consider the two fact schemes in Figures 3.a and 3.b: the first represents all employees of an enterprise, the second only the non-European employees. Although these schemes are aimed at extracting different information, they are compatible; in fact they share the time, job and store Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design dimensions. The scheme resulting from overlapping is shown in Figure 3.c; it can be used, for instance, to
• thus, thesales bestmanager solution would probably be to add explicitly the quarter attribute to the time hierarchy in the employee fact schema. SALE month week 31
qty sold store city state returns/qty sold - Data Warehouse Conceptual modeling and Design Bernard ESPINASSE
Figure 4. Query pattern.
32
As a guideline, most measures in a fact scheme should be additive. An example of additive measure in the sale scheme is qty sold: the quantity sold for a given sales manager is the sum of the quantities sold for all the stores managed by that sales manager. A measure may be non-additive on one or more dimensions. Examples of this are all the measures expressing a level, such as an inventory level, a temperature, etc. An inventory level is non-additive on time, but it is additive on the other dimensions. A temperature measure is non-additive on all the dimensions, since adding up two temperatures hardly makes sense. However, this kind of non-additive measures can still be Fact schema INVENTORY : Fact aggregated by using operators such as average, maximum, minimum; Figure 5 shows an manager example where both operators AVG and MIN can be used for aggregation; measure qty expresses, for each product, the number of copies present within each warehouse during department category each week.
type
product
invoice number order date
brand units perseason pallet
year quarter month
season
year month week
warehouse city
brand diet
qty stateshipped .....
deal
state
type carrier address
allowance
corporate
address
customer
customer SHIPMENT ship to
city
state
ship from address
address
ship mode
AVG, MIN
type
brand invoice number product diet order date corporate address
INVENTORY qty
department category
SHIPMENT ship to year quarter month datecity qty shipped ship from .....
date
address
season
manager
weight package size
type
weight package size
category weight package size package type product
schema SHIPMENT:
contact person
contact person ship mode
deal
type
terms carrier address incentive
allowance
terms incentive
• Building the attribute tree • Pruning and grafting the attribute tree • Defining dimensions • Defining measures (fact attributes) • Defining the granularity of data (hierarchies).
(a)
(a) Fact schema overlaping INVENTORY and SHIPMENT: category
Fig. 5. The INVENTORY fact scheme.
category
For other measures, aggregation is inherently impossible for conceptual reasons. type weight brand Consider the measure number of customers in the sale example, estimated package for a size given product product, day and store by counting the number of purchase tickets for that product printed season on that day in that store. Since the same ticket may include other products, adding or season averaging the number of customers for two or more products would lead to an inconsistent month SHIPMENT year result. Thus, number of customers is non-aggregable on the product dimension (while it! year INVENTORY is additive on the time and the stores dimensions). In this case, the reason for nonqty shipped aggregability is that the relationship between purchase tickets and products is many-toinventory qty AVG, ..... many instead of many-to-one: measure number of customers cannot be consistently MIN
weight package size
type brand
product
month
SHIPMENT ! INVENTORY
qty shipped inventory qty AVG, ..... MIN
(b) (b) Fig. 8. The SHIPMENT scheme (a) and its overlap with INVENTORY (b). Fig. 8. The SHIPMENT scheme (a) and its overlap with INVENTORY (b).
•
The measures in f are the union of those in f' and f". Thus, the fact on which f is
• The Conceptual measures in f are the union of those in and f". Thus, which f is embracing both f' and f". centred may bef'considered as athe sortfact of on "macro-fact" Bernard ESPINASSE - Data Warehouse modeling and Design
33
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
centred may be considered •as aEach sort ofhierarchy "macro-fact" f' andonly f". the attributes included in the corresponding in embracing f includesboth all and • Each hierarchy in f includes all and only the included in thefunctional corresponding hierarchies ofattributes both f' and f". The dependencies expressed by the interhierarchies of both f' and f".attribute The functional expressed by the interlinks independencies f' and f" are preserved. attribute links in f' and f" are • preserved. The domain of each dimension attribute in f is the intersection of the domains of the • The domain of each dimension attribute in f isattributes the intersection of f". the domains of the corresponding in f' and corresponding attributes in f' and f".
34
• Facts correspond to events occurring dynamically
The step to derive DF schemata from Relational schema is :
• Within an Relational schema, a fact is represented by a table:
• 1. Finding and defining facts from Relational schema
• Tables representing frequently updated archives are good candidates to define facts
For each fact :
• Tables representing nearly-static archives or representing structural properties of the domain (such as STORE and CITY), are not candidates to define facts
• 2. Building the Attribute Tree from Relational schema • 3. Building the Fact Schema from Attribute Tree
• Each fact identified on the Relational schema becomes the root of an attribute tree, that become a fact schema.
Note that the step to derive DF schemata from E/R schema is very similar: the main difference concerns the algorithm used to build the attribute tree
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
35
Ex : In the case the more important fact is a product sale is represented by the SALES table
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
36
Relational schema of the DVD rental BD:
For each fact defined, the attribute tree is built as follow : • Each node of the attribute tree corresponds to one or more Relational schema attributes
• CARDS (cardNumber, expiry) • CUSTOMERS (cardNumber:CARDS, name, gender, address, telephone, personalDocument) • MOVIES (moviesCode, title, category, director, lengh, mainActor) • COPIES (positionOnShelf, movieCode:MOVIES) • RENTALS (positionOnShelf:COPIES, cardNumber:CARDS, date, time)
The table RENTALS is the only candidate for expressing facts, the attribute tree associated is:
• The root of the attribute tree corresponds to the primary key of F
expiry
• For each node v, the corresponding attribute functionally determines all the attributes that correspond to the descendants of v (functionnal dependencies)
title
name cardNumber (CUSTOMER)
positionOnShelf (RENTALS)
movieCode category
telephone cardNumber (CARDS)
gender
positionOnShelf (COPIES)
lengh
address
director
personalDocument
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
37
Relational schema of the Flight BD:
date
city
country
Attribute Tree 2 (FLIGHTS_INSTANCES) city
fromAirport
carrier fromAirport flightNumber (FLIGHTS_INSTANCES)
airline
flightNumber (FLIGHTS)
departureTime
FLIGHTS FLIGHT_INSTANCES TICKETS CHECK_IN
flightNumber (FLIGHTS)
departureTime toAirport
name
toAirport
date
name city
39
name
country
carrier
The tables that are candidates for expressing facts are :
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
38
name
airline
• • • •
mainActor
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
Attribute Tree 1 (FLIGHTS)
FLIGHTS (flightNumber, airline, fromAirport:AIRPORTS) FLIGHT_INSTANCES (FlightNumber:FLIGHTS, date) AIRPORTS (IATAcode, name, city, country) TICKETS (ticketNumber, flightNumber:FLIGHT_INSTANCES), seat, fate, passengersFirstName, passengersSurname, passengersGender) • CHECK-IN (ticketNumber:TICKETS, CheckInTime, numberOfBags) • • • •
time
country
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
city
country
40
Attribute Tree 3 (TICKETS):
Attribute Tree 4 (CHECK_IN): city
city
name fare
name
country fare country
checkInTime
carrier
fromAirport flightNumber (FLIGHTS)
fromAirport
ticketNumber (TICKETS)
airline
flightNumber (FLIGHTS)
ticketNumber (TICKETS)
airline
flightNumber (FLIGHTS_INSTANCES)
departureTime flightNumber (FLIGHTS_INSTANCES)
departureTime
checkInTime
carrier
ticketNumber (CHECK_IN)
ticketNumber (CHECK_IN)
numberOfBags
passengerGender toAirport
passengerGender toAirport
passagerLastName
date
passagerFirstName
passagerLastName
date
numberOfBags
passagerFirstName
name city
name city
country
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
41
42
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
3.1: Pruning and grafting the attribute tree:
For each fact:
expiry
• 3.1. Pruning and grafting the attribute tree:
title
name cardNumber (CUSTOMER)
• We can retain or graft any nodes corresponding to composite keys
positionOnShelf (RENTALS)
movieCode category
telephone cardNumber (CARDS)
gender
• We can modify, add, or delete a fuctional dependency
positionOnShelf (COPIES)
lengh
address
director
personalDocument
• We can add one or more fuctional dependencies if a non-mormalized table exists in the relational schema
gender
• 3.2. Defining Fact Schema with its dimensions (fact dimensions)
date
customer
time
positionOnShelf (RENTALS)
cardNumber (CUSTOMER)
• 3.3. Defining Fact Schema measures (fact attributes)
mainActor
title category
lengh
date
• 3.4. Defining Fact Schema granularity of data (dimension hierarchies).
director mainActor
The step to derive DF schemata from E/R schema is very similar: the main difference concerns the algorithm used to build the attribute tree
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
country
Facts TICKETS and CHECK_IN are the best choices because existing functional dependencies permit to include a maximum of attributs in trees 3 and 4.
43
• • • •
movieCode and Title are inverted cardNumber(CARDS) and name (renamed customer) are inverted positionOnShelf(COPIES) and cardNumber(CARDS) are grafted time, expiry, telephone, address, personalDocument, movieCode and cardNumber(CUSTOMERS) are pruned
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
44
Fact schema “RENTAL”:
Relational schema of the DVD rental BD:
gender
positionOnShelf (RENTALS)
customer
• •
title category
cardNumber (CUSTOMER)
lengh
• • •
director
date
mainActor
dimensions
CARDS (cardNumber, expiry) CUSTOMERS (cardNumber:CARDS, name, gender, address, telephone, personalDocument) MOVIES (moviesCode, title, category, director, lengh, mainActor) COPIES (positionOnShelf, movieCode:MOVIES) RENTALS (positionOnShelf:COPIES, cardNumber:CARDS, date, time)
SQL measure glossaries for fact schema “RENTAL”: fact
number = SELECT COUNT (*) FROM RENTALS R INNER JOINT COPIES C ON R.positionOnShelf = C.positionOnShelf, COPIES C INNER JOINT MOVIES F RENTALS R INNER JOINT CUSTOMERS C ON R.cardNumber = C.cardNumber GROUP BY F.title, R.date, C.name;
date gender
customer
RENTAL
title
category
number
measure
lengh director mainActor
45
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
Pruning and grafting the attribute tree:
46
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
Fact schema “TICKET ISSUE”:
country carrier
date
fare
check-in
city
country fromAirport ticketNumber (TICKETS)
airline
city
numberOfBags
Airport
flightNumber
departureTime
from to
date
check-in
numberOfFlights numberOfBags receipts
airline
seat
toAirport
TICKET ISSUE
flightNumber
passengerGender
departureTime
passengerGender
city
arrivalTime
carrier
country
• country is now the child of city • checkIn is now a bolean added on the tree when number node was grafted: ist value is TRUE only for tickets whose passengers have checked in.
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
47
Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design
48
FLIGHTS (flightNumber, airline, fromAirport:AIRPORTS) FLIGHT_INSTANCES (FlightNumber:FLIGHTS, date) AIRPORTS (IATAcode, name, city, country) TICKETS (ticketNumber, flightNumber:FLIGHT_INSTANCES), seat, fate, passengersFirstName, passengersSurname, passengersGender) • CHECK-IN (ticketNumber:TICKETS, CheckInTime, numberOfBags) • • • •
The following relationnal logical schema describes an operational database for car rentals : • • • •
SQL measure glossaries for fact schema “TICKET ISSUE”: numberOfFlight = SELECT COUNT (*) FROM TICKETS T INNER JOINT FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date GROUP BY T.passengerNumber, I.date, T.flightNumber; numberOfBags = SELECT SUM (C.numberOfBag) FROM TICKETS T INNER JOINT FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date TICKETS T INNER JOINT CHECK_IN C ON T.ticketNumber = C.ticketNumber GROUP BY T.ticketNumber, I.date, T.flightNumber; receipts = SELECT SUM (T.fare) FROM TICKETS T INNER JOINT FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date GROUP BY T.passengerGender, I.date, T.flightNumber;
• • • •
RENTAL_OFFICES (OfficeName, City, Area, State, Country) CARS (LicensePlate, Category, Model, Brand, Fuel, RegistrationDate) HAVE_OPTIONAL (LicensePlate:CARS, Optional) RENTALS (LicensePlate: CARS, PickupDate, DropoffDate, PickupPlace:RENTALJDFFICES, DropoffPlace :RENTAL_OFFICES, Miles) DRIVERS (LicenseNumber, LicenseExpiration, DriverName, Birthdate) DRIVE (LicenseNumber: DRIVERS,(LicensePlate, PickupDate):RENTALS) INSURANCES (Risk,(LicensePlate, PickupDate):RENTALS, Cost) PAYMENTS ((LicensePlate, PickupDate):RENTALS, Amount, Discount, PaymentMode) !5.,
3#$%&5-
;&,&.
Some hidden functional dependencies hold: City->State->Country->Area, and Model->Brand. Inspect and3(&normalize the source schema, then choose 7$.1 899(*.:,". a fact of interest and design its fact schema. 05#=6785'.)&'()>=7047)..*?);6+@)7 $*+)
26785'.)
&67'(*03
&'() !"#$% =0*3(. !*,(%-.()+/01)
26785'.)
$*89)(/01)
/0.( :,,*8)/01)