Data warehouse development

IMS5028 Customer Data Warehouse Design Data modelling Data warehouse development – – – – Requirements identification Logical design, data modellin...
Author: Kathlyn Shelton
3 downloads 4 Views 310KB Size
IMS5028

Customer Data Warehouse Design

Data modelling

Data warehouse development – – – –

Requirements identification Logical design, data modelling Data extract, transform and load (ETL) Warehouse architecture, technology and tools – Physical database design – Delivery systems – Operational policies

2

Designing a data warehouse – data design

• There are two main approaches to data modelling for data warehouse design – entity relationship modelling and normalisation – dimensional modelling – What works for CRM? (Todman)

3

1

IMS5028

The design of databases using a traditional E-R approach • Entities and relationships • Normalisation 3NF, 4NF

4

Entity relationship schema

Customer Type

groups

Customer

within

contains

makes

Product Type

groups

Product

in

Sale

Region

located at

Store

within

Period

5

Why do we normalise data? • Normalisation is a process for converting complex data structures into simple, stable data structures

• Normalised data models are: • robust and stable • have minimum redundancy

6

2

IMS5028

Why do we normalise data? • Normalised data (3NF): – No repeating groups – No partial dependences – No transitive dependences

• The goal is to protect database integrity by avoiding anomalies (update, delete, create)

7

Dimensional Modeling (star schema) • It is a logical design technique that seeks to present data in a standard framework that is intuitive and allows for highperformance access

8

Dimensional Modelling v s E- R modelling • the purpose of dimensional modelling is to structure data for easy and efficient retrieval and analysis • E-R modelling creates a single model of data required to support organisation’s processes, Whereas DM creates individual models for different areas of business/decision interest eg. • model for sales info • model for Inventory info

9

3

IMS5028

Dimensional Model – Star Schema • Components of dimensional model: –

Fact Tables : contain measurements of business eg. Sales, purchase order, shipment • Most data warehouses have very large fact tables(up to 50 billion records and approximately 1 to 5 terabytes)



Dimension Tables : store the descriptions of the dimensions of the business eg. Product, customer, vendor, store • smaller than fact tables

10

Dimensional Modeling • Each dimension table has a single primary key that corresponds exactly to one of the components of the multipart key in the fact table. • A fact table always expresses a many to many relationship (the key is composed of foreign keys • The most useful facts in a fact table are numeric and additive ( typically values are added up)

11

Components of a star schema

(from McFadden)

12

4

IMS5028

A Sample Star Schema Customer Customer key Customer name Customer type Customer credit code Salesperson number Sales territory Standard industry code

Salesperson Salesperson key Salesperson name Sales region Sales branch

Time Product sales fact Product key Customer key Salesperson key Time key Sales units Gross sales amount Sales discount amount Net sales amount Sales commission amount

Time key Day Month Quarter Year

Product Product key Product name Product unit price Product quantity

(adapted fromZikmund et al 2003) 13

Dimensions • The salesperson dimension allows user to analyse sales by salesperson, region, branch • The product dimension allows user to analyse purchasing patterns by product and groupings of product • The customer dimension allows us to analyse purchasing customer purchasing patterns • example query: how many white Toyota cars were sold in February by Melbourne stores to customers living in Balarat • The time dimension?

14

Dimension Tables and normalisation • Dimension tables in star schema are denormalised resulting in: • less tables • simpler for users to navigate • reduced number of complex multi-table joins • Compare the following two models

15

5

IMS5028

Entity relationship schema (3NF)

Customer Type

groups

Customer

within

Region

contains

makes

Product Type

groups

Product

in

located at

Sale

Store

within

Period

16

Corresponding Star schema Customer Customer key Name Customer type ……

Product Product key Product type Weight …..

Sale Time key Store key Customer key Product key Dollar sales Unit sales …

Denormalised data: region details are stored in the store table, product type details are stored in the product table Store Store key Address Region …..

Time Time key Day Month …

17

Snowflake schema • Snowflake schema –all the tables are normalised • Star schemas are preferable to snowflake – fewer joins for info. retrieval

18

6

IMS5028

Corresponding snowflake schema Customer Type

Customer

Product Type

Product

Sale

Store

Region

Time 19

DM vs. E-R modeling debate (Kimball’s view) • OLTP systems are volatile – high rates of update transactions • In normalised models the goal is to reduce data redundancy and prevent update anomalies • Data in a data warehouse does not need to be normalised because it is periodically refreshed not updated by user transactions

20

Steps in the dimensional design process (Kimbal’s approach) Business requirements Dimensional modelling steps: 1. 2. 3. 4.

Choose a business process Choose the grain of the fact table Choose the dimensions Choose the measured facts (Kimball,2002)

21

7

IMS5028

Steps in the dimensional design process • Example processes: orders, shipments, sales, inventory, marketing • Example measured facts (business measurements): – Sales fact table: unit sales, sales dollar amount, cost…. – Shipment fact table: qty shipped , invoice dollar amount, storage cost… – The most useful facts are numeric and additive

22

Fact table: choosing the Grain • The grain – specifies what an individual row in a fact table represents – – – –

daily item sales customer transaction line item on order daily inventory level of a stock item

23

Granularity

• Granularity refers to the level of detail captured in the data warehouse • Choosing the grain is a critical step in design of DW • Example: each individual customer transaction v s daily sales

24

8

IMS5028

example: retail trading (adopted from Kimball and Groth)

Requirement 1: • Management wants to analyse what products are selling, in which stores, on what days, under what promotion conditions

25

Choosing the grain example: grain 1- daily sales

Time Time key TIME ATTRIBUTES TBD

Product

Sales Fact

Product key PRODUCT ATTRIBUTES TBD

Time key Product key Store key Promotion key

Promotion Promotion key PROMOTION ATTRIBUTES TBD

Dollarsales Daily Unit sales Dollarcosts Customer count

Store Store key STORE ATTRIBUTES TBD

from Kimball (1996), p29 26

Example: grain 2- individual POS transaction

Time Time key TIME ATTRIBUTES TBD

Product

Sales Fact

Product key PRODUCT ATTRIBUTES TBD

Time key Product key Store key Promotion key

Promotion Promotion key PROMOTION ATTRIBUTES TBD

POS trans.No Sales Qty Dollar sales …… ….

Store Store key STORE ATTRIBUTES TBD

from Kimball (2002), p37 27

9

IMS5028

example: retail trading

Requirement2 : • Management wants to analyse what type of products and what brand are buying different types of customers

28

Can we add Customer here? Customer Customer key Name Customer type ……

Time Time key TIME ATTRIBUTES TBD

Product

Sales Fact

Product key PRODUCT ATTRIBUTES TBD

Time key Product key Store key Promotion key

Promotion Promotion key PROMOTION ATTRIBUTES TBD

POS trans.No Sales Qty Dollar sales Dollar cost Gross profit

What is missing in the fact table?

Store Store key STORE ATTRIBUTES TBD

from Kimball (2002), p37 29

The Importance of Granularity • Choosing the grain is a critical step in design of DW • The grain determines the dimensionality of the database • It affects the level of detail of a query • Trade off - volume of data against the level of detail

30

10

IMS5028

The effect of granularity (example)

High level of detail Individual phone call

Low level of detail Summary of phone calls

200 calls per month

40000 bytes 200 bytes

How many times Bob call John last week?

Not enough detail to answer

The questions that can be answered depend on the level of granularity (adapted from Inmon, 1996)

31

Example: Product dimension

Product Product key SKU description SKU number Product type brand subcategory category department package type unit of measure

Time Sales Fact

Time key

Time key Product key Store key Promotion key Dollarsales Unit sales Dollarcosts Customer count

Store Store key

Promotion Promotion key

32

Example: time dimension

Time Time key day of week day no in month day no overall week no in year week no overall month month no overall quarter Fiscal period holiday flag weekday flag last day month flag

Product Sales Fact

Product key

Time key Product key Store key Promotion key Dollarsales Unit sales Dollarcosts Customer count

Store Store key

Promotion Promotion key

season event

Adopted from Kimball 33

11

IMS5028

Example: store dimension

Store Store key store name store number store street address store postcode store district

Time key Product key Store key

store region store manager store telephone floor plan type first opened date

Promotion key Dollarsales Unit sales Dollarcosts Customer count

Time Sales Fact

Time key

Product Product key

Promotion

store sqft

Promotion key

Adopted from Kimball 34

Example: promotion dimension

Promotion Promotion key promotion name price reduction type ad type display type

Time key Product key

ad media name display provider promo cost promo begin date promo end date

Store key Promotion key Dollarsales Unit sales Dollarcosts Customer count

Time Time key

Sales Fact

Product Product key

Store Store key

35

Example:

customer dimension

customer customer key customer name customer status address city state Post code phone age type car Income level marital status gender

Time Sales Fact

Time key

Customer key Product key Store key

Store

Promotion key Time key POS trans Sales qty Dollarsales

Store key

Promotion Promotion key

Product Product key

36

12

IMS5028

Shared dimensions • Dimension tables can be shared between subject areas If a dimension table connects to more than one fact table: • represent this dimension table in both schemas • the dimension tables are referred to as conformed between the two dimensional models • conformed dimensions àconsistent information 37

Shared dimensions, example Time Customer shipments warehouse Product orders

Production facts

38

Slowly changing dimensions

• Many dimensions (like Product and Customer) evolve slowly over time – People change names, addresses etc. – Sales forces change names of districts, regions etc.

• Three standard approaches are: – Overwrite old values – Create an additional dimension record – Create a current value field

39

13

IMS5028

Aggregates • Most data warehouses have very large fact tables (up to 50 billion records and approximately 1 to 5 terabytes) • Aggregates (pre -stored summaries) are the most effective way of improving data warehouse performance • An aggregate is a fact table record representing a summarisation of base level fact table records

40

Aggregate fact tables

Sales Fact

Product Dimension

Category Aggregate Sales Fact Time key Categorykey Store key Promotion key Dollarsales Unit sales

Category Dimension Categorykey Category Department

Dollarcosts Customer count

41

Dimensional Modelling example

• http://freedatawarehouse.com/tutorials/ dmtutorial/Dimensional%20Modeling%20T utorial.aspx

42

14

IMS5028

• Read DW examples from the Groth text – Retail – Telecommunications – Healthcare

43

References

• Zikmund R., McLeod R., Gilbert F. Customer Relationship Management, Integrating Marketing Strategy and Information Technology, Wiley,2003 • Groth R., Data Mining, Prentice Hall, 2000, chapter 9 • Kimball r., Ross M., The Data Warehouse Toolkit, first and second edition, Wiley, 2002 • “Modern Data Base Management”, by Mc Fadden e t a l, Chapter on Data Warehouse

44

What works for CRM? • The challenges and issues in the development of customer-centric data warehouse

45

15