IMS5028
Customer Data Warehouse Design
Data modelling
Data warehouse development – – – –
Requirements identification Logical design, data modelling Data extract, transform and load (ETL) Warehouse architecture, technology and tools – Physical database design – Delivery systems – Operational policies
2
Designing a data warehouse – data design
• There are two main approaches to data modelling for data warehouse design – entity relationship modelling and normalisation – dimensional modelling – What works for CRM? (Todman)
3
1
IMS5028
The design of databases using a traditional E-R approach • Entities and relationships • Normalisation 3NF, 4NF
4
Entity relationship schema
Customer Type
groups
Customer
within
contains
makes
Product Type
groups
Product
in
Sale
Region
located at
Store
within
Period
5
Why do we normalise data? • Normalisation is a process for converting complex data structures into simple, stable data structures
• Normalised data models are: • robust and stable • have minimum redundancy
6
2
IMS5028
Why do we normalise data? • Normalised data (3NF): – No repeating groups – No partial dependences – No transitive dependences
• The goal is to protect database integrity by avoiding anomalies (update, delete, create)
7
Dimensional Modeling (star schema) • It is a logical design technique that seeks to present data in a standard framework that is intuitive and allows for highperformance access
8
Dimensional Modelling v s E- R modelling • the purpose of dimensional modelling is to structure data for easy and efficient retrieval and analysis • E-R modelling creates a single model of data required to support organisation’s processes, Whereas DM creates individual models for different areas of business/decision interest eg. • model for sales info • model for Inventory info
9
3
IMS5028
Dimensional Model – Star Schema • Components of dimensional model: –
Fact Tables : contain measurements of business eg. Sales, purchase order, shipment • Most data warehouses have very large fact tables(up to 50 billion records and approximately 1 to 5 terabytes)
–
Dimension Tables : store the descriptions of the dimensions of the business eg. Product, customer, vendor, store • smaller than fact tables
10
Dimensional Modeling • Each dimension table has a single primary key that corresponds exactly to one of the components of the multipart key in the fact table. • A fact table always expresses a many to many relationship (the key is composed of foreign keys • The most useful facts in a fact table are numeric and additive ( typically values are added up)
11
Components of a star schema
(from McFadden)
12
4
IMS5028
A Sample Star Schema Customer Customer key Customer name Customer type Customer credit code Salesperson number Sales territory Standard industry code
Salesperson Salesperson key Salesperson name Sales region Sales branch
Time Product sales fact Product key Customer key Salesperson key Time key Sales units Gross sales amount Sales discount amount Net sales amount Sales commission amount
Time key Day Month Quarter Year
Product Product key Product name Product unit price Product quantity
(adapted fromZikmund et al 2003) 13
Dimensions • The salesperson dimension allows user to analyse sales by salesperson, region, branch • The product dimension allows user to analyse purchasing patterns by product and groupings of product • The customer dimension allows us to analyse purchasing customer purchasing patterns • example query: how many white Toyota cars were sold in February by Melbourne stores to customers living in Balarat • The time dimension?
14
Dimension Tables and normalisation • Dimension tables in star schema are denormalised resulting in: • less tables • simpler for users to navigate • reduced number of complex multi-table joins • Compare the following two models
15
5
IMS5028
Entity relationship schema (3NF)
Customer Type
groups
Customer
within
Region
contains
makes
Product Type
groups
Product
in
located at
Sale
Store
within
Period
16
Corresponding Star schema Customer Customer key Name Customer type ……
Product Product key Product type Weight …..
Sale Time key Store key Customer key Product key Dollar sales Unit sales …
Denormalised data: region details are stored in the store table, product type details are stored in the product table Store Store key Address Region …..
Time Time key Day Month …
17
Snowflake schema • Snowflake schema –all the tables are normalised • Star schemas are preferable to snowflake – fewer joins for info. retrieval
18
6
IMS5028
Corresponding snowflake schema Customer Type
Customer
Product Type
Product
Sale
Store
Region
Time 19
DM vs. E-R modeling debate (Kimball’s view) • OLTP systems are volatile – high rates of update transactions • In normalised models the goal is to reduce data redundancy and prevent update anomalies • Data in a data warehouse does not need to be normalised because it is periodically refreshed not updated by user transactions
20
Steps in the dimensional design process (Kimbal’s approach) Business requirements Dimensional modelling steps: 1. 2. 3. 4.
Choose a business process Choose the grain of the fact table Choose the dimensions Choose the measured facts (Kimball,2002)
21
7
IMS5028
Steps in the dimensional design process • Example processes: orders, shipments, sales, inventory, marketing • Example measured facts (business measurements): – Sales fact table: unit sales, sales dollar amount, cost…. – Shipment fact table: qty shipped , invoice dollar amount, storage cost… – The most useful facts are numeric and additive
22
Fact table: choosing the Grain • The grain – specifies what an individual row in a fact table represents – – – –
daily item sales customer transaction line item on order daily inventory level of a stock item
23
Granularity
• Granularity refers to the level of detail captured in the data warehouse • Choosing the grain is a critical step in design of DW • Example: each individual customer transaction v s daily sales
24
8
IMS5028
example: retail trading (adopted from Kimball and Groth)
Requirement 1: • Management wants to analyse what products are selling, in which stores, on what days, under what promotion conditions
25
Choosing the grain example: grain 1- daily sales
Time Time key TIME ATTRIBUTES TBD
Product
Sales Fact
Product key PRODUCT ATTRIBUTES TBD
Time key Product key Store key Promotion key
Promotion Promotion key PROMOTION ATTRIBUTES TBD
Dollarsales Daily Unit sales Dollarcosts Customer count
Store Store key STORE ATTRIBUTES TBD
from Kimball (1996), p29 26
Example: grain 2- individual POS transaction
Time Time key TIME ATTRIBUTES TBD
Product
Sales Fact
Product key PRODUCT ATTRIBUTES TBD
Time key Product key Store key Promotion key
Promotion Promotion key PROMOTION ATTRIBUTES TBD
POS trans.No Sales Qty Dollar sales …… ….
Store Store key STORE ATTRIBUTES TBD
from Kimball (2002), p37 27
9
IMS5028
example: retail trading
Requirement2 : • Management wants to analyse what type of products and what brand are buying different types of customers
28
Can we add Customer here? Customer Customer key Name Customer type ……
Time Time key TIME ATTRIBUTES TBD
Product
Sales Fact
Product key PRODUCT ATTRIBUTES TBD
Time key Product key Store key Promotion key
Promotion Promotion key PROMOTION ATTRIBUTES TBD
POS trans.No Sales Qty Dollar sales Dollar cost Gross profit
What is missing in the fact table?
Store Store key STORE ATTRIBUTES TBD
from Kimball (2002), p37 29
The Importance of Granularity • Choosing the grain is a critical step in design of DW • The grain determines the dimensionality of the database • It affects the level of detail of a query • Trade off - volume of data against the level of detail
30
10
IMS5028
The effect of granularity (example)
High level of detail Individual phone call
Low level of detail Summary of phone calls
200 calls per month
40000 bytes 200 bytes
How many times Bob call John last week?
Not enough detail to answer
The questions that can be answered depend on the level of granularity (adapted from Inmon, 1996)
31
Example: Product dimension
Product Product key SKU description SKU number Product type brand subcategory category department package type unit of measure
Time Sales Fact
Time key
Time key Product key Store key Promotion key Dollarsales Unit sales Dollarcosts Customer count
Store Store key
Promotion Promotion key
32
Example: time dimension
Time Time key day of week day no in month day no overall week no in year week no overall month month no overall quarter Fiscal period holiday flag weekday flag last day month flag
Product Sales Fact
Product key
Time key Product key Store key Promotion key Dollarsales Unit sales Dollarcosts Customer count
Store Store key
Promotion Promotion key
season event
Adopted from Kimball 33
11
IMS5028
Example: store dimension
Store Store key store name store number store street address store postcode store district
Time key Product key Store key
store region store manager store telephone floor plan type first opened date
Promotion key Dollarsales Unit sales Dollarcosts Customer count
Time Sales Fact
Time key
Product Product key
Promotion
store sqft
Promotion key
Adopted from Kimball 34
Example: promotion dimension
Promotion Promotion key promotion name price reduction type ad type display type
Time key Product key
ad media name display provider promo cost promo begin date promo end date
Store key Promotion key Dollarsales Unit sales Dollarcosts Customer count
Time Time key
Sales Fact
Product Product key
Store Store key
35
Example:
customer dimension
customer customer key customer name customer status address city state Post code phone age type car Income level marital status gender
Time Sales Fact
Time key
Customer key Product key Store key
Store
Promotion key Time key POS trans Sales qty Dollarsales
Store key
Promotion Promotion key
Product Product key
36
12
IMS5028
Shared dimensions • Dimension tables can be shared between subject areas If a dimension table connects to more than one fact table: • represent this dimension table in both schemas • the dimension tables are referred to as conformed between the two dimensional models • conformed dimensions àconsistent information 37
Shared dimensions, example Time Customer shipments warehouse Product orders
Production facts
38
Slowly changing dimensions
• Many dimensions (like Product and Customer) evolve slowly over time – People change names, addresses etc. – Sales forces change names of districts, regions etc.
• Three standard approaches are: – Overwrite old values – Create an additional dimension record – Create a current value field
39
13
IMS5028
Aggregates • Most data warehouses have very large fact tables (up to 50 billion records and approximately 1 to 5 terabytes) • Aggregates (pre -stored summaries) are the most effective way of improving data warehouse performance • An aggregate is a fact table record representing a summarisation of base level fact table records
40
Aggregate fact tables
Sales Fact
Product Dimension
Category Aggregate Sales Fact Time key Categorykey Store key Promotion key Dollarsales Unit sales
Category Dimension Categorykey Category Department
Dollarcosts Customer count
41
Dimensional Modelling example
• http://freedatawarehouse.com/tutorials/ dmtutorial/Dimensional%20Modeling%20T utorial.aspx
42
14
IMS5028
• Read DW examples from the Groth text – Retail – Telecommunications – Healthcare
43
References
• Zikmund R., McLeod R., Gilbert F. Customer Relationship Management, Integrating Marketing Strategy and Information Technology, Wiley,2003 • Groth R., Data Mining, Prentice Hall, 2000, chapter 9 • Kimball r., Ross M., The Data Warehouse Toolkit, first and second edition, Wiley, 2002 • “Modern Data Base Management”, by Mc Fadden e t a l, Chapter on Data Warehouse
44
What works for CRM? • The challenges and issues in the development of customer-centric data warehouse
45
15