Data Warehouse – Part 01 Based on Chapter 06 The Data Warehouse in DataMining: A Tutorial Tutorial-Based Based Primer by Roiger and Geatz
1
What’s the Problem with Data?
2
http://www.techrepublic.com/whitepapers/surviving-the-data-explosion-through-datareduction/1125783?tag=content;siu-container
Why No Just Use Operational Dbs? Operational
Decision Support Systems
Transactional
For example, systems that
OLTP
Transaction-oriented, i.e.,
designed g for Quick processing of an
individual transaction e.g. a purchase h
3
support decisions through data mining Subject-oriented j
Review – Process for Building an O Operational ti l Db First Step: data modeling – create the entity relationship
diagram (ERD) The data model documents the structure of the data There is no consideration for use in the data model
4
ERDs
5
http://www.sqlservercentral.com/articles/ Miscellaneous/designadatabaseusinganentityre lationshipdiagram/1159/
http://www.umsl.edu/~sauterv/analysis/e r/er_intro.html
Entity - Relationship Entity
Relationship
Concept
Between two entities
Represents a class of
One-to-one One to one
persons, places, things May have attributes Some combination of attributes can uniquely identify each instance of an entity Key 6
Husband-to-wife in US
culture and societyy (at anyy one time) One-to-many Father-to-child
Many-to-many Student-to-teacher Student to teacher
Sample of Credit Card Promotion Data (f (from Table T bl 2 2.3) 3)
7
Income Range
Magazine Watch Promo Promo
Life Ins Promo
CC Ins
Sex
Age
40-50K
Yes
No
No
No
Male
45
30-40K
Yes
Yes
Yes
No
Female
40
40 0 40-50K
No
No
No
No
Male l
42
30-40K
Yes
Yes
Yes
Yes
Male
43
50-60K
Yes
No
Yes
No
Female
38
20-30K
No
No
No
No
Female
55
30-40K
Yes
No
Yes
Yes
Male
35
20-30K 20 30K
No
Yes
No
No
Male
27
30-40K
Yes
No
No
No
Male
43
30-40K
Yes
Yes
Yes
No
Female
41
e.g., What is the cardinality of the relationship customer-to-promotion?
Review – Process for Building an O Operational ti l Db First Step: data modeling – create the entity relationship
diagram (ERD) The data model documents the structure of the data There is no consideration for use in the data model
Second Step: normalization (db normalization not
mathematical normalization))
8
Normalization Reduces duplication of data within tables Result is more tables with fewer columns per table
Effective Normalization Improves data integrity/validity by reducing data redundancy Faster sorting of data Queries run efficiently
Can Have Too Much Normalization Too many relationships Too many slim, small tables To retrieve one piece of information requires access to many
tables bl through h h many joins Compromises performance Compromised maintenance
Normalization A formal process p
12
First Normal Form (1NF) Eliminate any repeating groups of information (a row-
column intersection contains only one value, not a list) No duplicate rows (a primary key can be assigned)
(Problem) (P bl ) T Table: bl Employee E l Employee_ID Last_Name 100 Patel 110 Washington 120 Cortez
Children Babaraj, Salleh, Sara Martha, Ted Sam, Jorge
Second Normal Form (2NF) 1NF plus Each column within the table must depend on the whole
primary key
(Problem) (P bl ) T Table: bl Course_Details C D t il Prefix Course Credits CIS 3320 3 CIS 4380 3 CHEM 3505 5 MIS 3320 3
College Technology gy Technology NSM Business
Third Normal Form (3NF) 2NF plus No column is dependant on any other column within the
table that is not defined as a key. No N data d is derived d d from f other h ddata within h the h table. bl (Problem) Table: Course_Section S i Section I Instructor 12345 100-4 12467 101-6 15083 100-4 16078 210-8 56701 101-66 101 12554 100-12
Office Offi M 355 B 424 M 355 B 434 B 424 M 201 A
Phone Number Ph N b 5-7011 5-6322 5-7011 5-3321 55-6322 6322 5-7337
Relational Model Entities are realized as two-dimensional tables where the columns are the attributes ib off the h entity i andd the h rows are data d instances i off ((examples l of) f) the h entity. i Relationships between entities are realized as relationship that maps the primary key attribute set of one table to one or more columns of the related entity’s table.
16
C Consider id Transaction T ti O Orientation i t ti vs… Table: Course Section S ti Section
I t t Instructor
12451
100-4
19372
101-7
10029
100-12
12452
100-4
T bl Grade Table: G d Record d
17
Student
Section
Grade
1093456
12451
B
1184567
12452
B
2341100
10029
C
1972344
10029
D
Subject Orientation Table: Grade_by_Instructor
18
St d t Student
G d Grade
S ti Section
I t t Instructor
1093456
B
12451
100-4
1184567 84 67
B
12452 24 2
100-4 00 4
2341100
C
10029
100-12
1972344
D
10029
100-12
1093456
C
10029
100-12
1184567
C
10029
100-12
2341100
A
12451
100-44 100
1972344
B
12452
100-4
Data Warehouse Design “A data warehouse is a subject-oriented, integrated, g time-variant, and nonvolatile collection of data in support of management’s decision making process.”*
*Inmon, W. H. (1996). Building the DataWarehouse. NewYork: JohnWiley and Sons, Inc. 19
OLTP vs. vs Data Warehouse Data Warehouse Subject-oriented Denormalized, Denormalized integrated Stores data to be reported
on, analyzed, tested Data is historical, no longer used in operations Data is static Granularity is a design issue 20
OLTP Process-oriented (or transaction-oriented) Normalized, separated Stores data to be processed, collected managed collected, Data is necessary for day-today operations of the business Data will be updated Granularity to the most detailed level
Sources of DW Data External data Data not specific to the organization Economic indicators, weather
Operational O ti l data dt From the OLTP system
Independent data mart Like a data warehouse only focuses on one subject Belongs g to the organization g – but maybe y to a different
department
21
ETL Extract – Transform – Load A routine whereby data is brought into the data warehouse from
other sources Transform Data cleaning Resolve ggranularityy issues Correct data inconsistencies Time-stamp data records
22
Data in a DW is Static Once data is in the data warehouse it is read-only Not always true
23
Data Warehouse Format Multidimensional arrayy of data – not based on relational model Star schema – based on relational model
24
Star Schema
25
Star Schema
http://www.executionmih.com/data-warehouse/star-snowflake-schema.php 26
Fact Table Defines the dimensions of the multi-dimensional space being
created Each record in a fact table contains two types of data Facts F t Dimension keys
Fact table key is a composite key made of keys for each
dimension table
27
Dimension Table Data specific to a dimension One-to-many relation from dimension table to fact table
28
Multidimensional Database Definition “A multidimensional database is structured around measures dimensions measures, dimensions, hierarchies, and cubes rather than tables, rows, columns, andd relations.” l ” Larson, B. (2008). Delivering
Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.
29
http://gerardnico.com/wiki/database/database_multidi mensional
[Data] Cube Definition
“A cube is a structure that contains a value
for one or more measures for each unique combination of the members of all its d dimensions. Th These are ddetail, l or lleaf-level fl l values. The cube also contains aggregated values formed byy the dimension hierarchies or when one or more of the dimensions is left out of the hierarchy.” Larson, B. (2008). Delivering Business
Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne. 30
A Point Within a Cube is a Value of the M Measure (the (th F Fact) t) The intersection of all
dimensions is a point That point represents a value of the measure for the particular unique combination of dimension values The point is called a detail or leaf-level f value
31
Snowflake Schema Dimension tables are normalized – hence theyy are broken down into two or more tables
32
Constellation Schema More than one fact table
33
Data Warehouse – Part 01 Based on Chapter 06 The Data Warehouse in DataMining: A Tutorial Tutorial-Based Based Primer by Roiger and Geatz
34