2. Data Warehouse and OLAP. 2. Data Warehouse and OLAP. Chunping Li. Definition of Data Warehouse. Requirement of Data Warehouse

2. Data Warehouse and OLAP Data Analysis and Mining 2. Data Warehouse and OLAP Chunping Li Data Analysis and Mining - Data Warehouse and OLAP Feat...
Author: Trevor McDonald
4 downloads 1 Views 245KB Size
2. Data Warehouse and OLAP

Data Analysis and Mining

2. Data Warehouse and OLAP Chunping Li

Data Analysis and Mining - Data Warehouse and OLAP

Features of data warehouse

„

Multi-dimensional data model

„

Efficient Data Cube Computation and OLAP operations

[email protected]

2006-5-11

„

„

1

Summery

„

2006-5-11

A decision support repository that is maintained separately from the organizations operational databases

„

Support information processing by providing a solid platform of summarized, historical data for analysis

Data Analysis and Mining - Data Warehouse and OLAP

2

Definition of Data Warehouse

Requirement of Data Warehouse „

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

A subject-oriented, integrated, timevariant, and nonvolatile collection of data in support of management’s decision-making process. W.H.Inmon

3

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

4

1

Four Features of Data Warehouse Subject Oriented

Data Warehouse—Subject-Oriented

Integrated

„

Organized around major subjects, such as product, sales,etc.

„

Data Warehouse

Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.

„

Non Volatile

Time Variant

Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

5

2006-5-11

Subject-Oriented

Equity Plans

Shares

Insurance Loans 2006-5-11

6

Data Warehouse—Integrated „

Data is categorized and stored by business subject rather than by application. OLTP Applications

Data Analysis and Mining - Data Warehouse and OLAP

Data Warehouse Subject

„

Customer financial information

Constructed by integrating multiple, heterogeneous data sources „ relational databases, flat files, online transaction records Data extraction and integration techniques are applied. „ Ensure consistency in naming conventions, structures, attribute measures, etc. among different data sources

Savings Data Analysis and Mining - Data Warehouse and OLAP

7

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

8

2

Data Warehouse—Time Variant

Integrated „

Savings

The time horizon for the data warehouse is significantly longer than that of operational systems. „ „

Current accounts Loans OLTP Applications

„

2006-5-11

„

Data Warehouse Data Analysis and Mining - Data Warehouse and OLAP

9

„

1997 9 9 1 7

January

02/97

February

„

Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, concurrency control mechanisms „

Requires only two operations in data accessing:

March „

Data Warehouse 2006-5-11

10

A physically separate store of data transformed from the

„

03/97

Data Analysis and Mining - Data Warehouse and OLAP

operational environment.

1997

Data

But the key of operational data may or may not contain “time element”.

Data Warehouse—Non-Volatile

Data is stored as a series of snapshots, each representing a period of time

01/97

Contains an element of time, explicitly or implicitly

2006-5-11

Time Variant

Time

Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)

Every key structure in the data warehouse „

Customer

Operational database: current data.

Data Analysis and Mining - Data Warehouse and OLAP

11

2006-5-11

Loading (refreshing) of data and read-only of data. Data Analysis and Mining - Data Warehouse and OLAP

12

3

Data Warehouse vs. Operational DBMS

Non-Volatile „

Typically data in the data warehouse is not updated or deleted.

OLTP (online transaction processing) „

Operational (DB)

Warehouse

„

Major task of traditional relational DBMS Day-to-day operations: purchasing, banking, registration, accounting, etc.

Load „

Insert Update Delete

2006-5-11

Read

Read

Data Analysis and Mining - Data Warehouse and OLAP

13

„

Major task of data warehouse system

„

Data analysis and decision making

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

Distinct features (OLTP vs. OLAP): „ „

„ „

2006-5-11

User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Logical design: ER + application vs. star + subject Access patterns: update vs. read-only but complex queries

Data Analysis and Mining - Data Warehouse and OLAP

14

Distinct Features between OLTP and OLAP

Data Warehouse vs. Operational DBMS „

OLAP (online analytical processing)

15

2006-5-11

OLTP

OLAP

Characteristic

Operational processing

Informational processing

Function

Day to day operations

Decision support

Design

Application-oriented

Subject-oriented

Data

Current

Historical

View

Detailed, flat relational

Summarized, multidimensional

Usage

Repetitive

Ad-hoc

Focus

Data in

Information out

Unit of work

Short, simple transaction

Complex query

# Records accessed

Tens

Millions

#Users

Thousands

Hundreds

Size

100MB-GB

100GB-TB

Priority

High performance, availability

High flexibility

Data Analysis and Mining - Data Warehouse and OLAP

16

4

A 2-D View of Sales

2. Data Warehouse and OLAP „

Features of data warehouse

„

Multi-dimensional data model

„

Location = “Vancouver”

Time(quarter)

TV

item(type) CD

PC

Q1 Q2 Q3 Q4

605 680 781 824

825 920 1026 1120

234 310 343 392

Efficient Data Cube Computation and OLAP operations

„

Summery

* According to Dimension time and item, where the sales are located in Vancouver; Measure: dollars_sold (in thousands) Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

17

Location=“New York”

According to Dimension time, item and location; Measure

Location=“Vancouver”

item

item

dollars_sold (in thousands)

CD

PC

402 423 487 512

TV

CD

741 883 749 923 791 1192 823 1223

PC

TV

CD

PC

503 543 567 673

605 680 781 824

825 920 1026 1120

234 310 343 392

•According to Dimension time, item and location; Measure dollars_sold (in thousands) 2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

825 234 Vancouver 605 New York 741 883 503 Chicago

Q1

710

820

402

Q2 Q3

731

872

423

785

988

487

Q4

812

1201

512

TV 19

2006-5-11

CD

56 5 7 43 31 39 34 0 3 2

710 820 731 872 785 988 812 1201

Data Cube

67 3

TV

Q1 Q2 Q3 Q4

Time (quarters)

Time

Lo ca tio

n

item

18

A 3-D Data Cube Representation

A 3-D View of Sales Location=“Chicago”

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

PC

Item Data Analysis and Mining - Data Warehouse and OLAP

20

5

A 4-D Data Cube Representation

Cube: A Lattice of Cuboids all

supplier=“SUP1”

supplier=“SUP2”

supplier=“SUP3”

time

Vancouver New York Chicago Q1

0-D(apex) cuboid

item

location

supplier 1-D cuboids

Q2

time,item

time,location

item,location

location,supplier

Q3 Q4

time,supplier TV

CD

PC

TV

CD

PC

TV CD

PC

time,item,location

*According to Dimension time, item, location and supplier;

2-D cuboids item,supplier

time,location,supplier

3-D cuboids

time,item,supplier

item,location,supplier

Measure dollars_sold (in thousands) 2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

4-D(base) cuboid 基本方体 21

2006-5-11

From Tables to Data Cubes „

„

„

„

„

The top 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid.

22

A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions

„

The lattice of cuboids forms a data cube. A data warehouse is based on a multidimensional data model which views data in the form of a data cube Data Analysis and Mining - Data Warehouse and OLAP

Warehouse and OLAP

From Tables to Data Cubes

An n-D cuboid, which holds the lowest-level of summarization, is called a base cuboid.

2006-5-11

time, item, location,Data supplier Analysis and Mining - Data

„

23

2006-5-11

Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables Data Analysis and Mining - Data Warehouse and OLAP

24

6

Conceptual Modeling of Data Warehouses

Example of Star Schema time

Modeling data warehouses: (dimensions & measures)

„

„

set of dimension tables „

item

time_key day day_of_the_week month quarter year

Star schema: A fact table in the middle connected to a Snowflake schema: A refinement of star schema where

Sales Fact Table time_key item_key

some dimensional hierarchy is normalized into a set of

branch_key

smaller dimension tables, forming a shape similar to

location

branch

snowflake „

item_key item_name brand type supplier_type

location_key

branch_key branch_name branch_type

Galaxy schema: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called

units_sold dollars_sold

galaxy schema or fact constellation

location_key street city province_or_street country

Measures Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

25

Example of Snowflake Schema time

time time_key day day_of_the_week month quarter year

item Sales Fact Table time_key item_key

item_key item_name brand type supplier_key

supplier supplier_key supplier_type

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

Example of Fact Constellation

time_key day day_of_the_week month quarter year

location_key

branch_key branch_name branch_type

units_sold dollars_sold

Measures 2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

Sales Fact Table time_key

location_key street city_key

branch

location_key

branch_key branch_name branch_type

city city_key city province_or_street country 27

units_sold dollars_sold

time_key item_key shipper_key

location

to_location

location_key street city province_or_street country

dollars_cost

Measures

2006-5-11

Shipping Fact Table

from_location

branch_key

branch_key

branch

item item_key item_name brand type supplier_type

item_key

location

26

Data Analysis and Mining - Data Warehouse and OLAP

units_shipped shipper shipper_key shipper_name location_key shipper_type 28

7

A Concept Hierarchy: Dimension (location)

Multidimensional Data

all

all

...

Europe

Hierarchical summarization paths

North_America Re

gi on

region

Sales volume as a function of product, month, and region Dimensions: Product, Location, Time

„

country

...

Germany

Spain

...

Canada

Industry Region

Year

Mexico

city

Frankfurt

...

Vancouver ... L. Chan

office

...

Product

Category Country Quarter

Toronto

Product

City

Month

Street

Week

Day

M. Wind Month 29

2. Data Warehouse and OLAP

Multi-dimensional data model

„

Efficient Data Cube Computation and OLAP

m

Features of data warehouse

„

TV CD PC

1Q

2Q

Time 3Q

4Q

Total annual sales of TV in Chicago. sum Chicago

sum

New York Vancouver

operations „

30

A Sample Data Cube

Ite

„

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

Location

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

sum

Summery

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

31

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

32

8

Cuboids Corresponding to the Cube all item

location

item, location

Data cube can be viewed as a lattice of cuboids

„

0-D(apex) cuboid

time

item,time

Efficient Data Cube Computation

1-D cuboids time, location 2-D cuboids

„

The bottom-most cuboid is the base cuboid

„

The top-most cuboid (apex) contains only one cell How many cuboids in an n-dimensional cube with L levels?

„

3-D(base) cuboid item, time, location

T

= i

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

33

„

by climbing up hierarchy (dimension reduction) Canada

Vancouver New York Chicago Jan Feb Mar

1181

Vancouver 605 New York 471 Chicago 710 820 402 Q1

Q4 TV

CD PC

Q2

Q3

Q3

Q4

Q4 CD

34

from higher level summary to lower level summary or detailed data, or introducing new dimensions

Q2

TV

1 )

Roll down (Drill down): reverse of roll-up „

Q3 roll-up on location from cities to counties

+

605

USA Q1 Q2 Vancouver 605 New York 471 Chicago 710 820 402 Q1

( L i

Typical OLAP Operations

Roll up (drill-up): summarize data „

1

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

Typical OLAP Operations „

n ∏ =

PC

102 150 150

Apr May Jun drill-down on time from quarters to months

TV

CD

PC

Jul Aug Sep Oct Nov Dec

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

35

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

TV

CD PC

36

9

Typical OLAP Operations

Typical OLAP Operations Slice and dice:

„

„

Pivot (rotate):

„

project and select

New York Chicago 471 Q1

„

710

reorient the cube, visualization, 3D to series of 2D planes.

TV

Vancouver 605 New York 471 Chicago 710 820 402 Q1

Location(cities)

Q2 CD

Dice for location=(“Chicago” or “New York” ) and (time=“Q1” or “Q2”) and ( item =“TV” or “CD”)

TV

Vancouver Pivot

New York Chicago

Q2

710 820 TV

Q3 Q4

402

CD PC

CD PC Item(types)

710 820 402

Vancouver

Chicago New York

Vancouver TV

CD

PC

New York Slice for time =“Q1” Chicago 710 820 402 TV

CD

PC

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

37

Efficient Data Cube Computation „

„

„

Determine which cuboids should be materialized general method: materialized the cuboids on which other popularly

Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)

referenced cuboids are based

Selection of which cuboids to materialize „

38

Efficient Processing OLAP Queries

Materialization of data cube (Materialized View) „

Data Analysis and Mining - Data Warehouse and OLAP

2006-5-11

„

Determine which operations should be performed on the available cuboids:

Based on size, sharing, access frequency, etc.

„

transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection

„

Determine to which materialized cuboids the relevant operations should be applied

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

39

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

40

10

2. Data Warehouse and OLAP

A Star-Net Query Model Customer Orders

Shipping Method

Customer CONTRACTS AIR-EXPRESS

„

Features of data warehouse

„

Multi-dimensional data model

„

Efficient Data Cube Computation and OLAP

ORDER

TRUCK

PRODUCT LINE Time

Product ANNUALY QTRLY

DAILY

PRODUCT ITEM PRODUCT GROUP

CITY COUNTRY

NAME

SALES PERSON

operations

DISTRICT REGION Location

CATEGORY Itemand Mining - Data Data Analysis Warehouse and OLAP

2006-5-11

„

DIVISION

Summery

Organization 41

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

42

43

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

44

Summary „

Data warehouse „

„

A multi-dimensional model of a data warehouse „ Star schema, snowflake schema, fact constellations „

„

„ „

A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process

A data cube consists of dimensions & measures

OLAP operations: drilling, rolling, slicing, dicing and pivoting OLAP servers: ROLAP, MOLAP, HOLAP Efficient computation of data cubes „ Partial vs. full vs. no materialization

2006-5-11

Data Analysis and Mining - Data Warehouse and OLAP

11