Outline. Week 13: Data Warehousing. What is a Warehouse? Warehousing

Outline zWhat is a data warehouse? zWhy a warehouse? zModels & operations zImplementing a warehouse zFuture directions Week 13: Data Warehousing 3 ...
Author: Melissa Bradley
1 downloads 0 Views 81KB Size
Outline zWhat is a data warehouse? zWhy a warehouse? zModels & operations zImplementing a warehouse zFuture directions

Week 13: Data Warehousing

3

Warehousing

What is a Warehouse?

zGrowing industry: $8 billion in 1998 zRange from desktop to huge:

zCollection of diverse data ‹subject oriented ‹aimed at executive, decision maker ‹often a copy of operational data ‹with value-added data (e.g., summaries,

‹Walmart: 900-CPU, 2,700 disk, 23TB Teradata system

zLots of buzzwords, hype

history)

‹slice & dice, rollup, MOLAP, pivot, ...

‹integrated ‹time-varying ‹non-volatile 2

more 4

1

What is a Warehouse?

Motivating Examples

zCollection of tools

zForecasting zComparing performance of units zMonitoring, detecting fraud zVisualization

‹gathering data ‹cleansing, integrating, ... ‹querying, reporting, analysis ‹data mining ‹monitoring, administering warehouse

5

Warehouse Architecture

Why a Warehouse? zTwo Approaches:

Client

Client

‹Query-Driven (Lazy) ‹Warehouse (Eager)

Query & Analysis

Metadata

7

Warehouse

? Integration

Source

Source

Source

Source 6

Source

8

2

Query-Driven Approach

Advantages of Query-Driven zNo need to copy data

Client

‹less storage ‹no need to purchase data

Client

zMore up-to-date data zQuery needs can be unknown zOnly query interface needed at sources zMay be less draining on sources

Mediator Wrapper

Source

Wrapper

Wrapper

Source

Source

9

Advantages of Warehousing

11

OLTP vs. OLAP z OLTP: On Line Transaction Processing

zHigh query performance zQueries not visible outside warehouse zLocal processing at sources unaffected zCan operate when sources unavailable zCan query data not stored in a DBMS zExtra information at warehouse

‹Describes processing at operational sites

z OLAP: On Line Analytical Processing ‹Describes processing at warehouse

‹Modify, summarize (store aggregates) ‹Add historical information 10

12

3

OLTP vs. OLAP OLTP z z z z z z z

Warehouse Models & Operators zData Models

OLAP

Mostly updates Many small transactions Mb-Tb of data Raw data Clerical users Up-to-date data Consistency, recoverability critical

‹relations ‹stars & snowflakes ‹cubes

z z z z

Mostly reads Queries long, complex Gb-Tb of data Summarized, consolidated data z Decision-makers, analysts as users

zOperators ‹slice & dice ‹roll-up, drill down ‹pivoting ‹other 13

15

Data Marts

Star product

zSmaller warehouses zSpans part of organization

prodId p1 p2

name price bolt 10 nut 5

store storeId c1 c2 c3

city nyc sfo la

‹e.g., marketing (customers, products, sales) sale oderId date o100 1/7/97 o102 2/7/97 105 3/8/97

zDo not require enterprise-wide consensus ‹but long term integration problems?

customer

14

custId 53 81 111

custId 53 53 111

name joe fred sally

prodId p1 p2 p1

storeId c1 c1 c3

address 10 main 12 main 80 willow

qty 1 2 5

amt 12 11 50

city sfo sfo la 16

4

Star Schema

Dimension Hierarchies sType

product prodId name price

store

sale orderId date custId prodId storeId qty amt

customer custId name address city

store storeId s5 s7 s9

city

cityId sfo sfo la

tId t1 t2 t1

region

mgr joe fred nancy

sType tId t1 t2

size small large

city

pop 1M 5M

cityId sfo la

Î snowflake schema Î constellations

store storeId city

location downtown suburbs regId north south

region regId nam e north cold region south warm region

17

19

Terms zFact table zDimension tables zMeasures

Cube Fact table view:

product prodId name price

sale orderId date custId prodId storeId qty amt

sale customer custId name address city

prodId p1 p2 p1 p2

storeId c1 c1 c3 c2

Multi-dimensional cube: amt 12 11 50 8

p1 p2

c1 12 11

c2

c3 50

8

dimensions = 2

store storeId city

18

20

5

3-D Cube Fact table view: sale

prodId p1 p2 p1 p2 p1 p1

storeId c1 c1 c3 c2 c1 c2

Aggregates • Add up amounts for day 1 • In SQL: SELECT sum(amt) FROM SALE WHERE date = 1

Multi-dimensional cube: date 1 1 1 1 2 2

amt 12 11 50 8 44 4

day 2 day 1

c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8

sale

prodId p1 p2 p1 p2 p1 p1

storeId c1 c1 c3 c2 c1 c2

date 1 1 1 1 2 2

amt 12 11 50 8 44 4

81

dimensions = 3

21

23

ROLAP vs. MOLAP

Aggregates • Add up amounts by day • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date

zROLAP: Relational On-Line Analytical Processing zMOLAP: Multi-Dimensional On-Line Analytical Processing

sale

22

prodId p1 p2 p1 p2 p1 p1

storeId c1 c1 c3 c2 c1 c2

date 1 1 1 1 2 2

amt 12 11 50 8 44 4

ans

date 1 2

sum 81 48

24

6

Another Example

Cube Aggregation

• Add up amounts by day, product • In SQL: SELECT date, sum(amt) FROM SALE GROUP BY date, prodId sale

prodId p1 p2 p1 p2 p1 p1

storeId c1 c1 c3 c2 c1 c2

date 1 1 1 1 2 2

amt 12 11 50 8 44 4

sale

prodId p1 p2 p1

date 1 1 2

day 2 day 1

p1 p2 c1 p1 12 p2 11

amt 62 19 48 p1 p2

rollup

c1 56 11

c1 44

c2 4

c2

c3

c3 50

8

c2 4 8

c3 50

sum

c1 67

c2 12

c3 50

129

rollup

drill-down

Example: computing sums ...

p1 p2

drill-down

sum 110 19

25

27

Aggregates

Cube Operators

zOperators: sum, count, max, min, median, ave z“Having” clause zUsing dimension hierarchy

day 2 day 1

‹average by region (within store) ‹maximum by month (within date)

c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8

p1 p2

c1 56 11

c2 4 8

c3 50

sale(c2,p2,*) 26

... sale(c1,*,*) sum

c1 67

p1 p2

sum 110 19

c2 12

c3 50

129 sale(*,*,*) 28

7

Extended Cube

Pivoting Fact table view:

c2 4 8 c312

p1 p2 c1 * 12

p1 p2 c1* 44

c1 56 11 c267 4

c2 44

c3 4 50

11 23

8 8

50

* 62 19 81

*

day 2

day 1

p1 p2 *

c3 50

* 50 48

sale

* 110 19 129

sale(*,p2,*)

48

prodId storeId p1 c1 p2 c1 p1 c3 p2 c2 p1 c1 p1 c2

Multi-dimensional cube: date 1 1 1 1 2 2

amt 12 11 50 8 44 4

day 2 day 1

c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8

p1 p2

c1 56 11

c2 4 8

c3 50

29

Aggregation Using Hierarchies

day 2 day 1

c1 c2 c3 p1 44 4 p2 c1 c2 c3 p1 12 50 p2 11 8

Query & Analysis Tools zQuery Building zReport Writers (comparisons, growth, graphs,…) zSpreadsheet Systems zWeb Interfaces zData Mining

customer region country

p1 p2

region A region B 56 54 11 8

31

(customer c1 in Region A; customers c2, c3 in Region B)

30

32

8

Other Operations

Data Cleaning

zTime functions

z Migration (e.g., yen Ö dollars) z Scrubbing: use domain-specific knowledge (e.g., social security numbers) z Fusion (e.g., mail list, customer merging)

‹e.g., time average

zComputed Attributes ‹e.g., commission = sales * rate

billing DB

zText Queries

customer1(Joe) merged_customer(Joe)

‹e.g., find documents with words X AND B ‹e.g., rank documents by frequency of words X, Y, Z

service DB

customer2(Joe)

z Auditing: discover rules & relationships (like data mining) 33

Integration zData Cleaning zData Loading zDerived Data

Loading Data zIncremental vs. refresh zOff-line vs. on-line zFrequency of loading

Client

Client Query & Analysis

Metadata

35

‹At night, 1x a week/month, continuously Warehouse

zParallel/Partitioned load

Integration

Source

Source

Source

34

36

9

Derived Data

Processing

zDerived Warehouse Data

zROLAP servers vs. MOLAP servers zIndex Structures zWhat to Materialize? zAlgorithms

‹indexes ‹aggregates ‹materialized views (next slide)

Client

Client

Query & Analysis

zWhen to update derived data? zIncremental vs. refresh

Metadata

Warehouse

Integration

Source

Source

Source

37

39

Materialized Views

ROLAP Server

zDefine new warehouse relations using SQL expressions sale

prodId p1 p2 p1 p2 p1 p1

storeId c1 c1 c3 c2 c1 c2

date 1 1 1 1 2 2

joinTb prodId p1 p2 p1 p2 p1 p1

amt 12 11 50 8 44 4

name bolt nut bolt nut bolt bolt

product

id p1 p2

zRelational OLAP Server

nam e price bolt 10 nut 5

storeId c1 c1 c3 c2 c1 c2

date 1 1 1 1 2 2

amt 12 11 50 8 44 4

prodId p1 p2 p1

date 1 1 2

sum 62 19 48

tools

utilities price 10 5 10 5 10 10

sale

does not exist at any source

38

ROLAP server

Special indices, tuning; Schema is “denormalized”

relational DBMS

40

10

MOLAP Server

Inverted Lists

zMulti-Dimensional OLAP Server

18 19

Product

M.D. tools

20 23

milk soda soap

1

utilities

multidimensional server

20 21 22

eggs

2 3 4 Date

23 25 26

could also sit on relational DBMS

r4 r18 r34 r35 r5 r19 r37 r40

inverted lists

age index

rId r4 r18 r19 r34 r35 r36 r5 r41

name joe fred sally nancy tom pat dave jeff

age 20 20 21 20 20 25 21 26

...

Ci ty

Sales

B A

data records

41

Index Structures

43

Using Inverted Lists

zTraditional Access Methods

zQuery:

‹B-trees, hash tables, R-trees, grids, …

‹Get people with age = 20 and name = “fred”

zPopular in Warehouses

zList for age = 20: r4, r18, r34, r35 zList for name = “fred”: r18, r52 zAnswer is intersection: r18

‹inverted lists ‹bit map indexes ‹join indexes ‹text indexes

42

44

11

Managing zMetadata zWarehouse Design zTools

Metadata zBusiness ‹business terms & definition ‹data ownership, charging Client

Client

zOperational

Query & Analysis

Metadata

‹data lineage ‹data currency (e.g., active, archived, purged) ‹use stats, error reports, audit trails

Warehouse

Integration

Source

Source

Source

45

Metadata

47

Design

zAdministrative

zWhat data is needed? zWhere does it come from? zHow to clean data? zHow to represent in warehouse (schema)? zWhat to summarize? zWhat to materialize? zWhat to index?

‹definition of sources, tools, ... ‹schemas, dimension hierarchies, … ‹rules for extraction, cleaning, … ‹refresh, purging policies ‹user profiles, access control, ...

46

48

12

Tools

Future Directions

zDevelopment

zBetter performance zLarger warehouses zEasier to use zWhat are companies & research labs working on?

‹ design & edit: schemas, views, scripts, rules, queries, reports

zPlanning & Analysis ‹ what-if scenarios (schema changes, refresh rates), capacity planning

zWarehouse Management ‹ performance monitoring, usage patterns, exception reporting

zSystem & Network Management ‹ measure traffic (sources, warehouse, clients)

zWorkflow Management ‹ “reliable scripts” for cleaning & analyzing data 49

Current State of Industry

51

Research (1) zIncremental Maintenance zData Consistency zData Expiration zRecovery zData Quality zError Handling

zExtraction and integration done off-line ‹Usually in large, time-consuming, batches

zEverything copied at warehouse ‹Not selective about what is stored ‹Query benefit vs storage & update cost

zQuery optimization aimed at OLTP ‹High throughput instead of fast response ‹Process whole query before displaying anything 50

52

13

Research (2) zRapid Monitor Construction zTemporal Warehouses zMaterialization & Index Selection zData Fusion zData Mining zIntegration of Text & Relational Data

53

Conclusions zMassive amounts of data and complexity of queries will push limits of current warehouses zNeed better systems: ‹easier to use ‹provide quality information

54

14