Data Warehouse Part 01

Data Warehouse – Part 01 Based on Chapter 06 The Data Warehouse in DataMining: A Tutorial Tutorial-Based Based Primer by Roiger and Geatz 1 What’s ...
14 downloads 2 Views 637KB Size
Data Warehouse – Part 01 Based on Chapter 06 The Data Warehouse in DataMining: A Tutorial Tutorial-Based Based Primer by Roiger and Geatz

1

What’s the Problem with Data?

2

http://www.techrepublic.com/whitepapers/surviving-the-data-explosion-through-datareduction/1125783?tag=content;siu-container

Why No Just Use Operational Dbs? Operational

Decision Support Systems

 Transactional

 For example, systems that

 OLTP

 Transaction-oriented, i.e.,

designed g for  Quick processing of an

individual transaction  e.g. a purchase h

3

support decisions through data mining  Subject-oriented j

Review – Process for Building an O Operational ti l Db  First Step: data modeling – create the entity relationship

diagram (ERD)  The data model documents the structure of the data  There is no consideration for use in the data model

4

ERDs

5

http://www.sqlservercentral.com/articles/ Miscellaneous/designadatabaseusinganentityre lationshipdiagram/1159/

http://www.umsl.edu/~sauterv/analysis/e r/er_intro.html

Entity - Relationship Entity

Relationship

 Concept

 Between two entities

 Represents a class of

 One-to-one One to one

persons, places, things  May have attributes  Some combination of attributes can uniquely identify each instance of an entity  Key 6

 Husband-to-wife in US

culture and societyy (at anyy one time)  One-to-many  Father-to-child

 Many-to-many  Student-to-teacher Student to teacher

Sample of Credit Card Promotion Data (f (from Table T bl 2 2.3) 3)

7

Income Range

Magazine Watch Promo Promo

Life Ins Promo

CC Ins

Sex

Age

40-50K

Yes

No

No

No

Male

45

30-40K

Yes

Yes

Yes

No

Female

40

40 0 40-50K

No

No

No

No

Male l

42

30-40K

Yes

Yes

Yes

Yes

Male

43

50-60K

Yes

No

Yes

No

Female

38

20-30K

No

No

No

No

Female

55

30-40K

Yes

No

Yes

Yes

Male

35

20-30K 20 30K

No

Yes

No

No

Male

27

30-40K

Yes

No

No

No

Male

43

30-40K

Yes

Yes

Yes

No

Female

41

e.g., What is the cardinality of the relationship customer-to-promotion?

Review – Process for Building an O Operational ti l Db  First Step: data modeling – create the entity relationship

diagram (ERD)  The data model documents the structure of the data  There is no consideration for use in the data model

 Second Step: normalization (db normalization not

mathematical normalization))

8

Normalization  Reduces duplication of data within tables  Result is more tables with fewer columns per table

Effective Normalization  Improves data integrity/validity by reducing data redundancy  Faster sorting of data  Queries run efficiently

Can Have Too Much Normalization  Too many relationships  Too many slim, small tables  To retrieve one piece of information requires access to many

tables bl through h h many joins  Compromises performance  Compromised maintenance

Normalization A formal process p

12

First Normal Form (1NF)  Eliminate any repeating groups of information (a row-

column intersection contains only one value, not a list)  No duplicate rows (a primary key can be assigned)

(Problem) (P bl ) T Table: bl Employee E l Employee_ID Last_Name 100 Patel 110 Washington 120 Cortez

Children Babaraj, Salleh, Sara Martha, Ted Sam, Jorge

Second Normal Form (2NF)  1NF plus  Each column within the table must depend on the whole

primary key

(Problem) (P bl ) T Table: bl Course_Details C D t il Prefix Course Credits CIS 3320 3 CIS 4380 3 CHEM 3505 5 MIS 3320 3

College Technology gy Technology NSM Business

Third Normal Form (3NF)  2NF plus  No column is dependant on any other column within the

table that is not defined as a key.  No N data d is derived d d from f other h ddata within h the h table. bl (Problem) Table: Course_Section S i Section I Instructor 12345 100-4 12467 101-6 15083 100-4 16078 210-8 56701 101-66 101 12554 100-12

Office Offi M 355 B 424 M 355 B 434 B 424 M 201 A

Phone Number Ph N b 5-7011 5-6322 5-7011 5-3321 55-6322 6322 5-7337

Relational Model Entities are realized as two-dimensional tables where the columns are the attributes ib off the h entity i andd the h rows are data d instances i off ((examples l of) f) the h entity. i Relationships between entities are realized as relationship that maps the primary key attribute set of one table to one or more columns of the related entity’s table.

16

C Consider id Transaction T ti O Orientation i t ti vs… Table: Course Section S ti Section

I t t Instructor

12451

100-4

19372

101-7

10029

100-12

12452

100-4

T bl Grade Table: G d Record d

17

Student

Section

Grade

1093456

12451

B

1184567

12452

B

2341100

10029

C

1972344

10029

D

Subject Orientation Table: Grade_by_Instructor

18

St d t Student

G d Grade

S ti Section

I t t Instructor

1093456

B

12451

100-4

1184567 84 67

B

12452 24 2

100-4 00 4

2341100

C

10029

100-12

1972344

D

10029

100-12

1093456

C

10029

100-12

1184567

C

10029

100-12

2341100

A

12451

100-44 100

1972344

B

12452

100-4

Data Warehouse Design “A data warehouse is a subject-oriented, integrated, g time-variant, and nonvolatile collection of data in support of management’s decision making process.”*

*Inmon, W. H. (1996). Building the DataWarehouse. NewYork: JohnWiley and Sons, Inc. 19

OLTP vs. vs Data Warehouse Data Warehouse  Subject-oriented  Denormalized, Denormalized integrated  Stores data to be reported

on, analyzed, tested  Data is historical, no longer used in operations  Data is static  Granularity is a design issue 20

OLTP  Process-oriented (or transaction-oriented)  Normalized, separated  Stores data to be processed, collected managed collected,  Data is necessary for day-today operations of the business  Data will be updated  Granularity to the most detailed level

Sources of DW Data  External data  Data not specific to the organization  Economic indicators, weather

 Operational O ti l data dt  From the OLTP system

 Independent data mart  Like a data warehouse only focuses on one subject  Belongs g to the organization g – but maybe y to a different

department

21

ETL  Extract – Transform – Load  A routine whereby data is brought into the data warehouse from

other sources  Transform  Data cleaning  Resolve ggranularityy issues  Correct data inconsistencies  Time-stamp data records

22

Data in a DW is Static  Once data is in the data warehouse it is read-only  Not always true

23

Data Warehouse Format Multidimensional arrayy of data – not based on relational model Star schema – based on relational model

24

Star Schema

25

Star Schema

http://www.executionmih.com/data-warehouse/star-snowflake-schema.php 26

Fact Table  Defines the dimensions of the multi-dimensional space being

created  Each record in a fact table contains two types of data  Facts F t  Dimension keys

 Fact table key is a composite key made of keys for each

dimension table

27

Dimension Table  Data specific to a dimension  One-to-many relation from dimension table to fact table

28

Multidimensional Database  Definition  “A multidimensional database is structured around measures dimensions measures, dimensions, hierarchies, and cubes rather than tables, rows, columns, andd relations.” l ”  Larson, B. (2008). Delivering

Business Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne.

29

http://gerardnico.com/wiki/database/database_multidi mensional

[Data] Cube Definition

 “A cube is a structure that contains a value

for one or more measures for each unique combination of the members of all its d dimensions. Th These are ddetail, l or lleaf-level fl l values. The cube also contains aggregated values formed byy the dimension hierarchies or when one or more of the dimensions is left out of the hierarchy.”  Larson, B. (2008). Delivering Business

Intelligence with Microsoft SQL Server 2008. New York: McGraw-Hill Osborne. 30

A Point Within a Cube is a Value of the M Measure (the (th F Fact) t)  The intersection of all

dimensions is a point  That point represents a value of the measure for the particular unique combination of dimension values  The point is called a detail or leaf-level f value

31

Snowflake Schema Dimension tables are normalized – hence theyy are broken down into two or more tables

32

Constellation Schema More than one fact table

33

Data Warehouse – Part 01 Based on Chapter 06 The Data Warehouse in DataMining: A Tutorial Tutorial-Based Based Primer by Roiger and Geatz

34