2. Data Warehouse and OLAP
Data Analysis and Mining
2. Data Warehouse and OLAP Chunping Li
Data Analysis and Mining - Data Warehouse and OLAP
Features of data warehouse
Multi-dimensional data model
Efficient Data Cube Computation and OLAP operations
[email protected]
2006-5-11
1
Summery
2006-5-11
A decision support repository that is maintained separately from the organizations operational databases
Support information processing by providing a solid platform of summarized, historical data for analysis
Data Analysis and Mining - Data Warehouse and OLAP
2
Definition of Data Warehouse
Requirement of Data Warehouse
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
A subject-oriented, integrated, timevariant, and nonvolatile collection of data in support of management’s decision-making process. W.H.Inmon
3
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
4
1
Four Features of Data Warehouse Subject Oriented
Data Warehouse—Subject-Oriented
Integrated
Organized around major subjects, such as product, sales,etc.
Data Warehouse
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Non Volatile
Time Variant
Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
5
2006-5-11
Subject-Oriented
Equity Plans
Shares
Insurance Loans 2006-5-11
6
Data Warehouse—Integrated
Data is categorized and stored by business subject rather than by application. OLTP Applications
Data Analysis and Mining - Data Warehouse and OLAP
Data Warehouse Subject
Customer financial information
Constructed by integrating multiple, heterogeneous data sources relational databases, flat files, online transaction records Data extraction and integration techniques are applied. Ensure consistency in naming conventions, structures, attribute measures, etc. among different data sources
Savings Data Analysis and Mining - Data Warehouse and OLAP
7
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
8
2
Data Warehouse—Time Variant
Integrated
Savings
The time horizon for the data warehouse is significantly longer than that of operational systems.
Current accounts Loans OLTP Applications
2006-5-11
Data Warehouse Data Analysis and Mining - Data Warehouse and OLAP
9
1997 9 9 1 7
January
02/97
February
Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, concurrency control mechanisms
Requires only two operations in data accessing:
March
Data Warehouse 2006-5-11
10
A physically separate store of data transformed from the
03/97
Data Analysis and Mining - Data Warehouse and OLAP
operational environment.
1997
Data
But the key of operational data may or may not contain “time element”.
Data Warehouse—Non-Volatile
Data is stored as a series of snapshots, each representing a period of time
01/97
Contains an element of time, explicitly or implicitly
2006-5-11
Time Variant
Time
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Every key structure in the data warehouse
Customer
Operational database: current data.
Data Analysis and Mining - Data Warehouse and OLAP
11
2006-5-11
Loading (refreshing) of data and read-only of data. Data Analysis and Mining - Data Warehouse and OLAP
12
3
Data Warehouse vs. Operational DBMS
Non-Volatile
Typically data in the data warehouse is not updated or deleted.
OLTP (online transaction processing)
Operational (DB)
Warehouse
Major task of traditional relational DBMS Day-to-day operations: purchasing, banking, registration, accounting, etc.
Load
Insert Update Delete
2006-5-11
Read
Read
Data Analysis and Mining - Data Warehouse and OLAP
13
Major task of data warehouse system
Data analysis and decision making
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
Distinct features (OLTP vs. OLAP):
2006-5-11
User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Logical design: ER + application vs. star + subject Access patterns: update vs. read-only but complex queries
Data Analysis and Mining - Data Warehouse and OLAP
14
Distinct Features between OLTP and OLAP
Data Warehouse vs. Operational DBMS
OLAP (online analytical processing)
15
2006-5-11
OLTP
OLAP
Characteristic
Operational processing
Informational processing
Function
Day to day operations
Decision support
Design
Application-oriented
Subject-oriented
Data
Current
Historical
View
Detailed, flat relational
Summarized, multidimensional
Usage
Repetitive
Ad-hoc
Focus
Data in
Information out
Unit of work
Short, simple transaction
Complex query
# Records accessed
Tens
Millions
#Users
Thousands
Hundreds
Size
100MB-GB
100GB-TB
Priority
High performance, availability
High flexibility
Data Analysis and Mining - Data Warehouse and OLAP
16
4
A 2-D View of Sales
2. Data Warehouse and OLAP
Features of data warehouse
Multi-dimensional data model
Location = “Vancouver”
Time(quarter)
TV
item(type) CD
PC
Q1 Q2 Q3 Q4
605 680 781 824
825 920 1026 1120
234 310 343 392
Efficient Data Cube Computation and OLAP operations
Summery
* According to Dimension time and item, where the sales are located in Vancouver; Measure: dollars_sold (in thousands) Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
17
Location=“New York”
According to Dimension time, item and location; Measure
Location=“Vancouver”
item
item
dollars_sold (in thousands)
CD
PC
402 423 487 512
TV
CD
741 883 749 923 791 1192 823 1223
PC
TV
CD
PC
503 543 567 673
605 680 781 824
825 920 1026 1120
234 310 343 392
•According to Dimension time, item and location; Measure dollars_sold (in thousands) 2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
825 234 Vancouver 605 New York 741 883 503 Chicago
Q1
710
820
402
Q2 Q3
731
872
423
785
988
487
Q4
812
1201
512
TV 19
2006-5-11
CD
56 5 7 43 31 39 34 0 3 2
710 820 731 872 785 988 812 1201
Data Cube
67 3
TV
Q1 Q2 Q3 Q4
Time (quarters)
Time
Lo ca tio
n
item
18
A 3-D Data Cube Representation
A 3-D View of Sales Location=“Chicago”
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
PC
Item Data Analysis and Mining - Data Warehouse and OLAP
20
5
A 4-D Data Cube Representation
Cube: A Lattice of Cuboids all
supplier=“SUP1”
supplier=“SUP2”
supplier=“SUP3”
time
Vancouver New York Chicago Q1
0-D(apex) cuboid
item
location
supplier 1-D cuboids
Q2
time,item
time,location
item,location
location,supplier
Q3 Q4
time,supplier TV
CD
PC
TV
CD
PC
TV CD
PC
time,item,location
*According to Dimension time, item, location and supplier;
2-D cuboids item,supplier
time,location,supplier
3-D cuboids
time,item,supplier
item,location,supplier
Measure dollars_sold (in thousands) 2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
4-D(base) cuboid 基本方体 21
2006-5-11
From Tables to Data Cubes
The top 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid.
22
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
The lattice of cuboids forms a data cube. A data warehouse is based on a multidimensional data model which views data in the form of a data cube Data Analysis and Mining - Data Warehouse and OLAP
Warehouse and OLAP
From Tables to Data Cubes
An n-D cuboid, which holds the lowest-level of summarization, is called a base cuboid.
2006-5-11
time, item, location,Data supplier Analysis and Mining - Data
23
2006-5-11
Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year) Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables Data Analysis and Mining - Data Warehouse and OLAP
24
6
Conceptual Modeling of Data Warehouses
Example of Star Schema time
Modeling data warehouses: (dimensions & measures)
set of dimension tables
item
time_key day day_of_the_week month quarter year
Star schema: A fact table in the middle connected to a Snowflake schema: A refinement of star schema where
Sales Fact Table time_key item_key
some dimensional hierarchy is normalized into a set of
branch_key
smaller dimension tables, forming a shape similar to
location
branch
snowflake
item_key item_name brand type supplier_type
location_key
branch_key branch_name branch_type
Galaxy schema: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called
units_sold dollars_sold
galaxy schema or fact constellation
location_key street city province_or_street country
Measures Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
25
Example of Snowflake Schema time
time time_key day day_of_the_week month quarter year
item Sales Fact Table time_key item_key
item_key item_name brand type supplier_key
supplier supplier_key supplier_type
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
Example of Fact Constellation
time_key day day_of_the_week month quarter year
location_key
branch_key branch_name branch_type
units_sold dollars_sold
Measures 2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
Sales Fact Table time_key
location_key street city_key
branch
location_key
branch_key branch_name branch_type
city city_key city province_or_street country 27
units_sold dollars_sold
time_key item_key shipper_key
location
to_location
location_key street city province_or_street country
dollars_cost
Measures
2006-5-11
Shipping Fact Table
from_location
branch_key
branch_key
branch
item item_key item_name brand type supplier_type
item_key
location
26
Data Analysis and Mining - Data Warehouse and OLAP
units_shipped shipper shipper_key shipper_name location_key shipper_type 28
7
A Concept Hierarchy: Dimension (location)
Multidimensional Data
all
all
...
Europe
Hierarchical summarization paths
North_America Re
gi on
region
Sales volume as a function of product, month, and region Dimensions: Product, Location, Time
country
...
Germany
Spain
...
Canada
Industry Region
Year
Mexico
city
Frankfurt
...
Vancouver ... L. Chan
office
...
Product
Category Country Quarter
Toronto
Product
City
Month
Street
Week
Day
M. Wind Month 29
2. Data Warehouse and OLAP
Multi-dimensional data model
Efficient Data Cube Computation and OLAP
m
Features of data warehouse
TV CD PC
1Q
2Q
Time 3Q
4Q
Total annual sales of TV in Chicago. sum Chicago
sum
New York Vancouver
operations
30
A Sample Data Cube
Ite
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
Location
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
sum
Summery
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
31
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
32
8
Cuboids Corresponding to the Cube all item
location
item, location
Data cube can be viewed as a lattice of cuboids
0-D(apex) cuboid
time
item,time
Efficient Data Cube Computation
1-D cuboids time, location 2-D cuboids
The bottom-most cuboid is the base cuboid
The top-most cuboid (apex) contains only one cell How many cuboids in an n-dimensional cube with L levels?
3-D(base) cuboid item, time, location
T
= i
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
33
by climbing up hierarchy (dimension reduction) Canada
Vancouver New York Chicago Jan Feb Mar
1181
Vancouver 605 New York 471 Chicago 710 820 402 Q1
Q4 TV
CD PC
Q2
Q3
Q3
Q4
Q4 CD
34
from higher level summary to lower level summary or detailed data, or introducing new dimensions
Q2
TV
1 )
Roll down (Drill down): reverse of roll-up
Q3 roll-up on location from cities to counties
+
605
USA Q1 Q2 Vancouver 605 New York 471 Chicago 710 820 402 Q1
( L i
Typical OLAP Operations
Roll up (drill-up): summarize data
1
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
Typical OLAP Operations
n ∏ =
PC
102 150 150
Apr May Jun drill-down on time from quarters to months
TV
CD
PC
Jul Aug Sep Oct Nov Dec
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
35
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
TV
CD PC
36
9
Typical OLAP Operations
Typical OLAP Operations Slice and dice:
Pivot (rotate):
project and select
New York Chicago 471 Q1
710
reorient the cube, visualization, 3D to series of 2D planes.
TV
Vancouver 605 New York 471 Chicago 710 820 402 Q1
Location(cities)
Q2 CD
Dice for location=(“Chicago” or “New York” ) and (time=“Q1” or “Q2”) and ( item =“TV” or “CD”)
TV
Vancouver Pivot
New York Chicago
Q2
710 820 TV
Q3 Q4
402
CD PC
CD PC Item(types)
710 820 402
Vancouver
Chicago New York
Vancouver TV
CD
PC
New York Slice for time =“Q1” Chicago 710 820 402 TV
CD
PC
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
37
Efficient Data Cube Computation
Determine which cuboids should be materialized general method: materialized the cuboids on which other popularly
Materialize every (cuboid) (full materialization), none (no materialization), or some (partial materialization)
referenced cuboids are based
Selection of which cuboids to materialize
38
Efficient Processing OLAP Queries
Materialization of data cube (Materialized View)
Data Analysis and Mining - Data Warehouse and OLAP
2006-5-11
Determine which operations should be performed on the available cuboids:
Based on size, sharing, access frequency, etc.
transform drill, roll, etc. into corresponding SQL and/or OLAP operations, e.g, dice = selection + projection
Determine to which materialized cuboids the relevant operations should be applied
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
39
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
40
10
2. Data Warehouse and OLAP
A Star-Net Query Model Customer Orders
Shipping Method
Customer CONTRACTS AIR-EXPRESS
Features of data warehouse
Multi-dimensional data model
Efficient Data Cube Computation and OLAP
ORDER
TRUCK
PRODUCT LINE Time
Product ANNUALY QTRLY
DAILY
PRODUCT ITEM PRODUCT GROUP
CITY COUNTRY
NAME
SALES PERSON
operations
DISTRICT REGION Location
CATEGORY Itemand Mining - Data Data Analysis Warehouse and OLAP
2006-5-11
DIVISION
Summery
Organization 41
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
42
43
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
44
Summary
Data warehouse
A multi-dimensional model of a data warehouse Star schema, snowflake schema, fact constellations
A subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process
A data cube consists of dimensions & measures
OLAP operations: drilling, rolling, slicing, dicing and pivoting OLAP servers: ROLAP, MOLAP, HOLAP Efficient computation of data cubes Partial vs. full vs. no materialization
2006-5-11
Data Analysis and Mining - Data Warehouse and OLAP
11