Outline zWhat is a data warehouse? zWhy a warehouse? zModels & operations zImplementing a warehouse zFuture directions
Week 13: Data Warehousing
3
Warehousing
What is a Warehouse?
zGrowing industry: $8 billion in 1998 zRange from desktop to huge:
zCollection of diverse data subject oriented aimed at executive, decision maker often a copy of operational data with value-added data (e.g., summaries,
Walmart: 900-CPU, 2,700 disk, 23TB Teradata system
zLots of buzzwords, hype
history)
slice & dice, rollup, MOLAP, pivot, ...
integrated time-varying non-volatile 2
more 4
1
What is a Warehouse?
Motivating Examples
zCollection of tools
zForecasting zComparing performance of units zMonitoring, detecting fraud zVisualization
gathering data cleansing, integrating, ... querying, reporting, analysis data mining monitoring, administering warehouse
5
Warehouse Architecture
Why a Warehouse? zTwo Approaches:
Client
Client
Query-Driven (Lazy) Warehouse (Eager)
Query & Analysis
Metadata
7
Warehouse
? Integration
Source
Source
Source
Source 6
Source
8
2
Query-Driven Approach
Advantages of Query-Driven zNo need to copy data
Client
less storage no need to purchase data
Client
zMore up-to-date data zQuery needs can be unknown zOnly query interface needed at sources zMay be less draining on sources
Mediator Wrapper
Source
Wrapper
Wrapper
Source
Source
9
Advantages of Warehousing
11
OLTP vs. OLAP z OLTP: On Line Transaction Processing
zHigh query performance zQueries not visible outside warehouse zLocal processing at sources unaffected zCan operate when sources unavailable zCan query data not stored in a DBMS zExtra information at warehouse
Describes processing at operational sites
z OLAP: On Line Analytical Processing Describes processing at warehouse
Modify, summarize (store aggregates) Add historical information 10
12
3
OLTP vs. OLAP OLTP z z z z z z z
Warehouse Models & Operators zData Models
OLAP
Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical
relations stars & snowflakes cubes
z z z z
Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data z Decision-makers, analysts as users
zOperators slice & dice roll-up, drill down pivoting other 13
15
Data Marts
Star product
zSmaller warehouses zSpans part of organization
prodId p1 p2
name price bolt 10 nut 5
store storeId c1 c2 c3
city nyc sfo la
e.g., marketing (customers, products, sales) sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97
zDo not require enterprise-wide consensus but long term integration problems?
customer
14
custId 53 81 111
custId 53 53 111
name joe fred sally
prodId p1 p2 p1
storeId c1 c1 c3
address 10 main 12 main 80 willow
qty 1 2 5
amt 12 11 50
city sfo sfo la 16
4
Star Schema
Dimension Hierarchies sType
product prodId name price
store
sale orderId date custId prodId storeId qty amt
customer custId name address city
store storeId s5 s7 s9
city
cityId sfo sfo la
tId t1 t2 t1
region
mgr joe fred nancy
sType tId t1 t2
size small large
city
pop 1M 5M
cityId sfo la
Î snowflake schema Î constellations
store storeId city
location downtown suburbs regId north south
region regId nam e north cold region south warm region
17
19
Terms zFact table zDimension tables zMeasures
Cube Fact table view:
product prodId name price
sale orderId date custId prodId storeId qty amt
sale customer custId name address city
prodId p1 p2 p1 p2
storeId c1 c1 c3 c2
Multi-dimensional cube: amt 12 11 50 8
p1 p2
c1 12 11
c2
c3 50
8
dimensions = 2
store storeId city
18
20
5
3-D Cube Fact table view: sale
prodId p1 p2 p1 p2 p1 p1
storeId c1 c1 c3 c2 c1 c2
Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1
Multi-dimensional cube: date 1 1 1 1 2 2
amt 12 11 50 8 44 4
day 2 day 1
c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8
sale
prodId p1 p2 p1 p2 p1 p1
storeId c1 c1 c3 c2 c1 c2
date 1 1 1 1 2 2
amt 12 11 50 8 44 4
81
dimensions = 3
21
23
ROLAP vs. MOLAP
Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date
zROLAP: Relational On-Line Analytical Processing zMOLAP: Multi-Dimensional On-Line Analytical Processing
sale
22
prodId p1 p2 p1 p2 p1 p1
storeId c1 c1 c3 c2 c1 c2
date 1 1 1 1 2 2
amt 12 11 50 8 44 4
ans
date 1 2
sum 81 48
24
6
Another Example
Cube Aggregation
• Add up amounts by day, product • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId sale
prodId p1 p2 p1 p2 p1 p1
storeId c1 c1 c3 c2 c1 c2
date 1 1 1 1 2 2
amt 12 11 50 8 44 4
sale
prodId p1 p2 p1
date 1 1 2
day 2 day 1
p1 p2 c1 p1 12 p2 11
amt 62 19 48 p1 p2
rollup
c1 56 11
c1 44
c2 4
c2
c3
c3 50
8
c2 4 8
c3 50
sum
c1 67
c2 12
c3 50
129
rollup
drill-down
Example: computing sums ...
p1 p2
drill-down
sum 110 19
25
27
Aggregates
Cube Operators
zOperators: sum, count, max, min, median, ave z“Having” clause zUsing dimension hierarchy
day 2 day 1
average by region (within store) maximum by month (within date)
c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8
p1 p2
c1 56 11
c2 4 8
c3 50
sale(c2,p2,*) 26
... sale(c1,*,*) sum
c1 67
p1 p2
sum 110 19
c2 12
c3 50
129 sale(*,*,*) 28
7
Extended Cube
Pivoting Fact table view:
c2 4 8 c312
p1 p2 c1 * 12
p1 p2 c1* 44
c1 56 11 c267 4
c2 44
c3 4 50
11 23
8 8
50
* 62 19 81
*
day 2
day 1
p1 p2 *
c3 50
* 50 48
sale
* 110 19 129
sale(*,p2,*)
48
prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2
Multi-dimensional cube: date 1 1 1 1 2 2
amt 12 11 50 8 44 4
day 2 day 1
c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8
p1 p2
c1 56 11
c2 4 8
c3 50
29
Aggregation Using Hierarchies
day 2 day 1
c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8
Query & Analysis Tools zQuery Building zReport Writers (comparisons, growth, graphs,…) zSpreadsheet Systems zWeb Interfaces zData Mining
customer region country
p1 p2
region A region B 56 54 11 8
31
(customer c1 in Region A; customers c2, c3 in Region B)
30
32
8
Other Operations
Data Cleaning
zTime functions
z Migration (e.g., yen Ö dollars) z Scrubbing: use domain-specific knowledge (e.g., social security numbers) z Fusion (e.g., mail list, customer merging)
e.g., time average
zComputed Attributes e.g., commission = sales * rate
billing DB
zText Queries
customer1(Joe) merged_customer(Joe)
e.g., find documents with words X AND B e.g., rank documents by frequency of words X, Y, Z
service DB
customer2(Joe)
z Auditing: discover rules & relationships (like data mining) 33
Integration zData Cleaning zData Loading zDerived Data
Loading Data zIncremental vs. refresh zOff-line vs. on-line zFrequency of loading
Client
Client Query & Analysis
Metadata
35
At night, 1x a week/month, continuously Warehouse
zParallel/Partitioned load
Integration
Source
Source
Source
34
36
9
Derived Data
Processing
zDerived Warehouse Data
zROLAP servers vs. MOLAP servers zIndex Structures zWhat to Materialize? zAlgorithms
indexes aggregates materialized views (next slide)
Client
Client
Query & Analysis
zWhen to update derived data? zIncremental vs. refresh
Metadata
Warehouse
Integration
Source
Source
Source
37
39
Materialized Views
ROLAP Server
zDefine new warehouse relations using SQL expressions sale
prodId p1 p2 p1 p2 p1 p1
storeId c1 c1 c3 c2 c1 c2
date 1 1 1 1 2 2
joinTb prodId p1 p2 p1 p2 p1 p1
amt 12 11 50 8 44 4
name bolt nut bolt nut bolt bolt
product
id p1 p2
zRelational OLAP Server
nam e price bolt 10 nut 5
storeId c1 c1 c3 c2 c1 c2
date 1 1 1 1 2 2
amt 12 11 50 8 44 4
prodId p1 p2 p1
date 1 1 2
sum 62 19 48
tools
utilities price 10 5 10 5 10 10
sale
does not exist at any source
38
ROLAP server
Special indices, tuning; Schema is “denormalized”
relational DBMS
40
10
MOLAP Server
Inverted Lists
zMulti-Dimensional OLAP Server
18 19
Product
M.D. tools
20 23
milk soda soap
1
utilities
multidimensional server
20 21 22
eggs
2 3 4 Date
23 25 26
could also sit on relational DBMS
r4 r18 r34 r35 r5 r19 r37 r40
inverted lists
age index
rId r4 r18 r19 r34 r35 r36 r5 r41
name joe fred sally nancy tom pat dave jeff
age 20 20 21 20 20 25 21 26
...
Ci ty
Sales
B A
data records
41
Index Structures
43
Using Inverted Lists
zTraditional Access Methods
zQuery:
B-trees, hash tables, R-trees, grids, …
Get people with age = 20 and name = “fred”
zPopular in Warehouses
zList for age = 20: r4, r18, r34, r35 zList for name = “fred”: r18, r52 zAnswer is intersection: r18
inverted lists bit map indexes join indexes text indexes
42
44
11
Managing zMetadata zWarehouse Design zTools
Metadata zBusiness business terms & definition data ownership, charging Client
Client
zOperational
Query & Analysis
Metadata
data lineage data currency (e.g., active, archived, purged) use stats, error reports, audit trails
Warehouse
Integration
Source
Source
Source
45
Metadata
47
Design
zAdministrative
zWhat data is needed? zWhere does it come from? zHow to clean data? zHow to represent in warehouse (schema)? zWhat to summarize? zWhat to materialize? zWhat to index?
definition of sources, tools, ... schemas, dimension hierarchies, … rules for extraction, cleaning, … refresh, purging policies user profiles, access control, ...
46
48
12
Tools
Future Directions
zDevelopment
zBetter performance zLarger warehouses zEasier to use zWhat are companies & research labs working on?
design & edit: schemas, views, scripts, rules, queries, reports
zPlanning & Analysis what-if scenarios (schema changes, refresh rates), capacity planning
zWarehouse Management performance monitoring, usage patterns, exception reporting
zSystem & Network Management measure traffic (sources, warehouse, clients)
zWorkflow Management “reliable scripts” for cleaning & analyzing data 49
Current State of Industry
51
Research (1) zIncremental Maintenance zData Consistency zData Expiration zRecovery zData Quality zError Handling
zExtraction and integration done off-line Usually in large, time-consuming, batches
zEverything copied at warehouse Not selective about what is stored Query benefit vs storage & update cost
zQuery optimization aimed at OLTP High throughput instead of fast response Process whole query before displaying anything 50
52
13
Research (2) zRapid Monitor Construction zTemporal Warehouses zMaterialization & Index Selection zData Fusion zData Mining zIntegration of Text & Relational Data
53
Conclusions zMassive amounts of data and complexity of queries will push limits of current warehouses zNeed better systems: easier to use provide quality information
54
14