Lecture Data Warehouse Systems

Eva Zangerle, SS 2013

PART A: Architecture, Chapter 2: Reference Architecture

DW Architecture: Reference Model
• Goal of a reference model
  • Definition of the terms and concepts (long-lived, fundamental)
  • Independence from implementation details / product terminology
  • Abstraction (and reduction of complexity)
  • Comparison between DW models, DW systems and DW components (strengths/weaknesses of different models / implementations)
  • Supports the definition of precise specifications and implementations of a DW system
• General requirements of the reference architecture
  • Ideal-typical model: components with operators (functions) and operands (data, databases)
  • Primarily function-oriented, i.e. it describes the data flow required for the functions as well as the control flow required for process control

Requirements on the Architecture
• Independence between data sources and analysis systems with regard to availability, load and continuous updates in the data sources (isolation)
• Persistence: permanent supply of integrated detailed data and derived (aggregated) data from parts of the source data
• Flexibility of use, i.e. arbitrary evaluations should be possible
• Supply of user-specific views (e.g. with regard to history)
• Scalability: easy integration of new data sources at a later time without needing to change the storage structures
• Efficiency through automation of procedures
• No ambiguities regarding data structures, access rights, processes
• Orientation towards the purpose, which is the analysis of data

Views on a DW Architecture
• Static view (system components, reference architecture)
  • Interface components
    • Data Warehouse Manager component
    • Metadata Manager component
  • Databases
    • Working Area, Base Database, Data Warehouse, Repository
  • Functional components
    • Monitor; Extraction, Transformation and Loading components
    • Analysis components
• Dynamic view (the procedure itself)
  • Data flow, split into primary data flow and metadata flow
  • Control flow
• Static and dynamic view cannot be separated completely; they will be intertwined in the following

DW Reference Architecture
• The Monitor supports extraction through selection of relevant data from the Data Sources
• The data required for the DW process is extracted into the (temporary) Working Area
• There it is transformed, i.e. cleaned, integrated, etc.
• Ideally, the data is then loaded into a Base Database, which is modelled independently of applications and is based on data and schema integration of the different data sources
• Finally, the data is loaded into a Data Warehouse specifically modelled for an application and is ready for analysis
• Metadata and the procedure of the DW process are stored in the Repository

Data Sources


Data Sources
• Data warehousing means operating a data store for analysis purposes
• Mostly heterogeneous data sources as primary data
  • Different DB systems, DB schemata, modelling of attribute semantics, ...
  • Frequently internal AND external sources (GIS, market research companies, GfK, ...)
• Essential questions
  • Selection of suitable data sources (relevance)
  • Quality (primary data!); often after conversions: "back to the roots"
• Relevance of data sources depends on
  • Purpose of the Data Warehouse
  • Quality of the source data
  • Availability of the source data (organizational, technical, legal, data protection)
  • Cost of acquisition

Data Sources cont.
• Classification of source data
  • Origin: internal vs. external (purchased)
  • Time: current vs. historical data
  • Utilization: primary data vs. metadata
  • Content type (semantics): unit of measurement, time, diagram, ...
  • Internal representation/data type: numerical (long, integer, ...), date (dt|us|...), alphanumerical, BLOB, ...
  • Language and character set: German, Spanish, Latin 1, Latin 2, ...
  • Technical character set: ASCII, Unicode, ...
  • Writing direction: left to right, right to left, ...
  • Data protection, security, trust level: non-confidential, confidential, secret, top secret

Data Sources - Quality
• Quality requirements for source data ("fitness for use")
  • Correctness (value and metadata correspond to their real-world characteristics)
  • Consistency
  • Reliability (no vague estimations)
  • Precision (numerical)
  • Credibility
  • Timeliness
  • Comprehensibility (exact descriptions, metadata)
  • Uniformity (i.e., same structure within the whole data set)
  • Completeness (in %)
  • Usability (file format)
  • Relevance
  • Granularity (data resolution across different dimensions)
• → Define a minimum quality for data that is to be loaded (see the sketch below)
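As an illustration of such a minimum-quality gate, the following SQL sketch counts staging rows that violate simple completeness and plausibility rules; all table and column names (src_orders, order_id, customer_id, amount, order_date) are hypothetical, and the acceptable threshold is a project decision.

-- Hypothetical staging table: src_orders(order_id, customer_id, amount, order_date)
SELECT COUNT(*) AS violations
FROM   src_orders
WHERE  order_id    IS NULL            -- completeness: key must be present
   OR  customer_id IS NULL            -- completeness: reference must be present
   OR  amount      < 0                -- plausibility: no negative amounts
   OR  order_date  > CURRENT_DATE;    -- plausibility: no future dates
-- Load only if the number of violations stays below the agreed minimum-quality threshold.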

Data Quality cont.
• Low data quality leads to higher costs
  • Higher effort required for integration
  • Garbage in, garbage out
  • Possibly wrong decisions based on low data quality

Components of a DW

Control Components
• Data Warehouse Manager
  • Central control component
  • Initialization, control and monitoring of all functions
  • Controls the data flow from the extraction of the source data to its analysis (monitor, extraction, transformation, loading and analysis components)
  • Time-based control of all actions:
    • Regularly
    • Event-driven
    • On demand
  • For controlling the data acquisition and analysis steps, the DW Manager uses information from the Repository

Control Components cont.
• Metadata Manager
  • Controls the metadata administration
  • Administers the metadata flow
  • Link between Data Warehouse Manager and Repository

Control Components cont.
• Monitor
  • Individual monitor per data source
  • Discovery of updates in the data sources
  • Notification of the Data Warehouse Manager about changes
  • Design decisions
    • Propagate net changes vs. all changes
    • Notification vs. polling
    • Internal vs. external monitoring

Control Components cont.
• Monitoring Techniques
  • Active mechanisms: ECA (triggers)
    • Event, Condition, Action
    • Write changes or changed data (periodically; time-, event- or value-based, ...) to predefined tables (see the sketch below)
    • Monitors only database operations that actually modify data
  • Replication-based mechanisms
    • Changed tuples are additionally stored in special relations, possibly on remote servers
    • Snapshot method: local copy of all data, similar to (persistent) views (updated via re-execution of the statements)
    • Data replication method: data is directly stored in local tables
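A minimal sketch of the trigger-based (ECA) variant in PostgreSQL syntax; the source table customer and the delta table customer_delta are hypothetical, and other DBMS use a different trigger dialect.

-- Delta table that records which customer rows changed, for later extraction.
CREATE TABLE customer_delta (
    customer_id INTEGER,
    change_type CHAR(1),                       -- 'I' insert, 'U' update, 'D' delete
    changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Event: modification on customer; Condition: none (always fire); Action: write the change.
CREATE FUNCTION log_customer_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customer_delta (customer_id, change_type) VALUES (OLD.customer_id, 'D');
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customer_delta (customer_id, change_type) VALUES (NEW.customer_id, 'U');
    ELSE
        INSERT INTO customer_delta (customer_id, change_type) VALUES (NEW.customer_id, 'I');
    END IF;
    RETURN NULL;                               -- return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_monitor
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW EXECUTE FUNCTION log_customer_change();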

Control Components cont.
• Monitoring Techniques cont.
  • Log-based modification detection
    • Makes use of the DB log
    • Problem: detection of the actual data changes (→ mostly: re-execute all logged operations and detect the delta)
  • Application-based monitoring
    • Applications propagate changes directly to the DW (write a delta file)
  • Timestamp-based: source data carry a timestamp; load new data (newer timestamp) into a delta file (see the sketch below)
  • File diff: periodically compare snapshots of the data and detect the delta
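A hedged sketch of the timestamp-based technique, assuming each source row carries a last_modified column and that the time of the last successful extraction is kept in a hypothetical control table etl_run:

-- Copy only rows changed since the last successful run into the delta table.
INSERT INTO order_delta (order_id, customer_id, amount, last_modified)
SELECT o.order_id, o.customer_id, o.amount, o.last_modified
FROM   orders o
WHERE  o.last_modified > (SELECT MAX(finished_at)
                          FROM   etl_run
                          WHERE  status = 'OK');

-- Remember this extraction so the next run knows where to continue.
INSERT INTO etl_run (finished_at, status) VALUES (CURRENT_TIMESTAMP, 'OK');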

Work Area
• Temporary storage area for the required data transformations before spooling into the base database; in particular the basis for integration and cleaning of schemata and data
• Central storage in the data acquisition area, used by the
  • Extraction component
  • Transformation component
  • Loading component

Extraction

Extraction Component
• Controls the periodical transmission of source data to the working area
• Closely related to monitoring
• Update time:
  • Periodical extraction
  • Extraction on demand
  • Event-driven extraction
  • Immediate extraction on updates

Extraction Component cont.
• The technique depends on the data source and the monitoring strategy:
  • Trigger approach: direct readout of the updated data
  • Replication: SQL query on the replication table (delta files)
  • Timestamp: selection by timestamp
  • Log/Snapshot: depends on the implementation, e.g. file import
• The extraction time determines the analysis accuracy and depends on the analysis goals!

Range of the Extraction
• Snapshots: the source always delivers the whole data stock
  • New supplier catalog, new price list, etc.
  • Recognizing alterations
  • Representing the history correctly
• Logs: the source delivers every alteration
  • Transaction logs, application-controlled logging
  • Alterations can be imported efficiently
• Net (netto) logs: the source delivers only net alterations (i.e. at most one aggregated alteration per value)
  • Catalog updates, snapshot deltas
  • No complete history possible
  • Alterations can be imported efficiently

Transformation

Transformation Types
• Two types of transformation
  • Content transformation:
    • Data migration
    • Data cleaning
  • Structural transformation: schema integration

Transformation: Data Migration
• Data migration (= conversion for data unification; see the sketch below)
  • Adaptation of data types
  • Conversion of encodings
  • Standardization of character strings
  • Standardization of time representations
  • Conversion of measurement units
  • Combination or separation of attribute values
  • Key treatment: synonyms, homonyms
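Several of these migration steps can be combined in one statement. The following SQL sketch (all names hypothetical; date/format functions vary per DBMS) adapts a data type, standardizes and combines strings, unifies the date representation and converts a measurement unit while copying data into the working area.

INSERT INTO work_customer (customer_no, full_name, birth_date, weight_kg)
SELECT CAST(cust_no AS INTEGER),                                  -- data type adaptation (string -> integer)
       UPPER(TRIM(first_name)) || ' ' || UPPER(TRIM(last_name)),  -- string standardization + combination
       TO_DATE(birth_date_us, 'MM/DD/YYYY'),                      -- standardized time representation
       weight_lbs * 0.45359237                                    -- unit conversion (lbs -> kg)
FROM   src_customer;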

Transformation: Data Cleaning
• Data cleaning deals with (see the sketch below):
  • Violations of referential integrity
  • Mapping local primary keys to global (surrogate) primary keys (dictionary table)
  • Erroneous or illegal values (consistency checks, plausibility checks, business rules, ...)
  • Redundant values (duplicate elimination)
  • Incomprehensible and inconsistent values
  • Missing values and NULL values
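A hedged sketch of the key treatment via a dictionary table: local primary keys of each source system are replaced by global surrogate keys, and rows violating simple plausibility rules are filtered out. All table and column names are hypothetical.

-- key_map(source_system, local_key, surrogate_key) maps local keys to global surrogates.
INSERT INTO clean_sales (customer_sk, product_sk, quantity, price)
SELECT km_c.surrogate_key,                               -- global customer key
       km_p.surrogate_key,                               -- global product key
       s.quantity,
       s.price
FROM   work_sales s
JOIN   key_map km_c ON km_c.source_system = s.source_system AND km_c.local_key = s.customer_id
JOIN   key_map km_p ON km_p.source_system = s.source_system AND km_p.local_key = s.product_id
WHERE  s.quantity > 0                                    -- plausibility check
  AND  s.price IS NOT NULL;                              -- no missing measure values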

Transformation: Schema Integration
• Goal
  • Overcome structural heterogeneity, i.e. remove all inconsistencies at the schema level
  • Integration of the different local schemata (possibly with different DB models) into a common global data schema
• Schema analysis
  • Analysis and comparison of the schema objects
  • Specification of the relations between objects of the different schemata
• Schema translation
  • Mapping of a source schema to an equivalent target schema
  • Explicit mapping of every single schema
  • Development of mapping rules

Classification of Integration Conflicts
• Four classes of schema conflicts:
  • Semantic conflicts
  • Description conflicts
  • Heterogeneity conflicts
  • Structural conflicts
• The classes of conflicts usually occur in combination
• Additionally, data conflicts may occur

Semantic Conflicts
• Semantically overlapping real-world extracts with corresponding classes
• Problem
  • The classes often do not represent exactly the same set of real-world objects
  • Distinction between semantically equivalent, inclusive, overlapping and disjoint class extensions
• Solution: merging the competing object sets
• Example: overlapping class extensions

Description Conflicts
• Different properties/attributes of the same objects in the local schemata
• Homonyms and synonyms
  • Homonym: same name for different concepts
  • Synonym: different names for the same concept
• Data type conflicts / value range conflicts for a property
• Scaling conflicts
  • Use of different, but convertible, measurement units
• Conflicts through different integrity constraints

Heterogeneity Conflicts
• Different data models of the schemata to be integrated
  • Different modeling constructs and expressiveness
  • Often implicitly also structural conflicts
• Solution: transformation into a common global data model
• Example: object-oriented vs. relational

Structural Conflicts
• Even when the same data model is used, the same data is often modelled differently
  • Especially with semantically rich data models (with many modeling constructs)
• Solution: definition of a global view and specification of a mapping
• Example: structural conflicts in the object-oriented data model

Schema Conflicts in RelDB


Transformation: Schema Integration
• Problem areas of schema integration
  • Expert knowledge required; only a minimal amount of automation is possible
  • Maintenance of the global schema
  • Setup and maintenance of a metadata database

Schema Integration: 4 Phases
• Pre-integration phase: requirements, wishes and goal description; choice of integration strategy; choice of integration policy; collection of local metadata; extraction of schema information
• Comparison phase: analysis of the local schemata
• Unification phase: adaptation of the local schemata; documentation of the adaptation process; unification of the local schemata; definition of mappings and transformations
• Restructuring and Merging phase: analysis of the global schema; restructuring of the global schema; modification of mappings and transformations
• → Loading process

Criteria for Integration Techniques
• Completeness
  • The integrated schema must include all concepts contained in any local schema
  • Information from the local schemata must not get lost
• Minimality
  • No redundancy in the integrated schema
  • Several identical/similar concepts in the local schemata are mapped onto one concept in the integrated schema
• Understandability
• Correctness
  • Equivalence of the information in the integrated schema with that in the local schemata

Schema Integration
• Bottom-up integration (global as view)
  • Starting point: the existing schemata
  • Construction of a derived, integrated schema which, at best, includes the complete information content of the sources (see the sketch below)
  • A new source changes the global schema
• Top-down integration (local as view)
  • Starting point: a pre-defined target schema (e.g. given by standards)
  • Detection of correspondences with the local schemata and definition of the mapping
  • The integrated schema considers only parts of the local schemata
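For the relational case, the bottom-up ("global as view") idea can be sketched as a view that derives the global schema from the existing local schemata; the two store schemata and the column mapping below are purely hypothetical.

-- Global schema defined as a view over the local schemata ("global as view").
CREATE VIEW global_sales (store, product_no, sale_date, amount) AS
    SELECT 'Store1', s1.prod_id, s1.sold_on, s1.total
    FROM   store1_sales s1
  UNION ALL
    SELECT 'Store2', s2.article_no, s2.sale_date, s2.price * s2.qty
    FROM   store2_receipts s2;
-- Note: adding a new source later means changing this global view, as stated above.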

Bottom-Up Schema Integration
• Local schemata (LS1, …, LSn) → Pre-integration → uniform local schemata
• Schema matching → inter-schema correspondences
• Adaptation of the local schemata → transformed schemata
• Combination / restructuring → integrated schema (IS)
• Mapping generation → mappings LS1-IS, …, LSn-IS

Integration Strategies
• Several schemata can be integrated into one schema using different strategies

Top-Down Schema Integration
• Local schemata (LS1, …, LSn) → schema transformation of the LS into a uniform model → consistent local schemata
• Manual creation of the global schema → global schema (GS)
• Schema matching GS-LS1, …, GS-LSn → inter-schema correspondences
• Mapping generation → mappings LS1-GS, …, LSn-GS

Structural Heterogeneity: Reasons
• Different data models
  • Legacy systems: hierarchical or network databases
  • Different software systems with different persistence models
• Different perspectives
  • The real world is interpreted differently
  • Same identifier for different objects and vice versa
• Equivalent modelling alternatives
  • e.g. relationships as foreign keys or as separate relations
• Incompatible design specifications
  • Different data types, different integrity constraints or different relations between entities

Loading


Loading Component
• If no Base Database exists (only the DW), there is only a single loading process:
  • Analysis-specific data is transmitted from the working area directly into the DW
• If a Base Database exists, there are two loading processes:
  • One for transmitting analysis-independent detailed data from the working area to the base database
  • One for transmitting analysis-specific data (e.g. precalculated aggregates) from the base database to the DW (see the sketch below)
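A hedged sketch of the second loading process, assuming a hypothetical detailed table base_sales in the base database and a DW table dw_sales_month that stores precalculated monthly aggregates (DATE_TRUNC is PostgreSQL-flavored; other DBMS offer similar functions):

-- Load analysis-specific, pre-aggregated data from the base database into the DW.
INSERT INTO dw_sales_month (product_sk, store_sk, sales_month, total_amount, total_quantity)
SELECT product_sk,
       store_sk,
       DATE_TRUNC('month', sale_date),          -- condense detailed data to monthly granularity
       SUM(amount),
       SUM(quantity)
FROM   base_sales
GROUP  BY product_sk, store_sk, DATE_TRUNC('month', sale_date);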

Loading Component cont.
• Loading process
  • Online: the DW remains available for queries
  • Offline: exclusive loading (normally at night)
• Loader = loader/spooler of the DBMS (e.g. SQL*Loader by Oracle)
  • For supporting the history: no "update in place"
• Two loading approaches
  • Record-based loading
  • Set-based loading

Record-based Loading
• Use of standard interfaces
  • JDBC, ODBC
• Works in the usual transaction context
• Triggers, indexes and constraints remain active
  • Manual deactivation possible
• No wide-area locks
• Alterations in the data sets become permanent only after COMMIT
• Use of prepared statements (see the sketch below)
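A hedged sketch of record-based loading with a prepared statement in PostgreSQL-style SQL (via JDBC/ODBC the same pattern uses '?' parameter markers); the fact table and its columns are hypothetical.

-- Prepare the insert once, then execute it per record inside an ordinary transaction.
PREPARE load_fact (INTEGER, INTEGER, DATE, NUMERIC) AS
    INSERT INTO dw_sales_fact (product_sk, store_sk, sale_date, amount)
    VALUES ($1, $2, $3, $4);

BEGIN;
EXECUTE load_fact (4711, 12, DATE '2013-05-02', 19.90);
EXECUTE load_fact (4712, 12, DATE '2013-05-02',  5.50);
COMMIT;                    -- the loaded records become permanent only now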

Set-based Loading
• Synonym: BULK LOAD
• DB-specific extension for loading large amounts of data
• Usually executed in a special context (see the sketch below):
  • Complete locking of tables
  • Triggers and constraints are not considered
  • Indexes are not updated until completion
  • No transactional context, no logging
  • Checkpoints for restart
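A hedged, PostgreSQL-flavored sketch of the set-based pattern (Oracle's SQL*Loader would use a control file instead): per-row work is switched off, the file is loaded in one bulk operation, and the supporting structures are rebuilt afterwards. All object names and the file path are hypothetical.

-- 1. Switch off per-row work before the bulk load.
ALTER TABLE dw_sales_fact DISABLE TRIGGER ALL;
DROP INDEX IF EXISTS idx_sales_fact_date;

-- 2. Load the flat file in one set-based operation.
COPY dw_sales_fact (product_sk, store_sk, sale_date, amount)
FROM '/data/etl/sales.csv' WITH (FORMAT csv);

-- 3. Rebuild indexes and re-enable triggers/constraints after completion.
CREATE INDEX idx_sales_fact_date ON dw_sales_fact (sale_date);
ALTER TABLE dw_sales_fact ENABLE TRIGGER ALL;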

Base Database
• Properties and tasks
  • Integrated data basis (schemata and data) based on the cleaned data
  • Highest resolution and lowest granularity
  • Application neutrality: no specific optimizations, no aggregates with regard to typical OLAP queries
  • Buffer: updated according to the actuality of the data sources with regard to the history points
  • The data is passed on to the Data Warehouse at other, arbitrary times
  • Loading data into the DW may already involve aggregated, condensed data

Base Database cont.
• Characteristics
  • Logical, central data storage for all DWs
  • Distribution of the data to the DWs
  • Might even be used for analysis directly (without using a DW)
  • In the case of multiple DWs:
    • Hub-and-spoke architecture (German: Speichen-Nabe-Architektur)
    • Instead of n*m, only n+m interfaces (data sources ↔ base DB ↔ DWs)
  • The base database is often not implemented if only one DW exists

Base Database cont.
• Compared with the quality requirements for data sources, the quality requirements for the base database and the DW are extended by two further requirements:
  • Traceability
    • Documentation of the data flow from the data sources to the base database/DW
    • Documentation of the automatic transformations and of the possibilities for manual intervention
    • Exact definitions of the ETL processes (so that they can be reproduced in the next iteration)
  • Availability
    • Metadata descriptions for all object data (Repository)
    • Technical assurance of accessibility (availability)
    • Efficiency in supply/delivery

Distinction Base Database / ODS
• In the literature, Inmon suggested the Operational Data Store (ODS) as a component of a DW system
• According to Inmon there are 5 classes of ODS:
  • Class 0: whole tables are replicated into the ODS from the operative environment; no data integration
  • Class 1: transactions are transmitted to the ODS immediately after execution in the operative systems; no integration
  • Class 2: data changes are collected, integrated, transformed and loaded into the ODS periodically
  • Class 3: the ODS is supplied by the Data Warehouse
  • Class 4: the ODS is a combination of integrated data from the operative data sources and aggregated data from the Data Warehouse
• The Base Database in the presented architecture can be viewed as a class 2 ODS

ODS Types


Data Warehouse
• Data basis
  • Structuring of the data exclusively for the analysis requirements of the users
  • Main problem: setting up a suitable logical schema (with regard to the analysis requirements, not the source data!)
  • Often a transition to a multidimensional data model
  • Consequently: different schemata in the DW and in the Base Database!
  • All information can be found in the Base Database, the Repository and the DW!
• Requirements for the DBMS of the DW:
  • Bulk loader (switching off multi-user operation and consistency checks)
  • Access interfaces for analysis tools (E-SQL, ODBC, JDBC, OLE, etc.)
  • Optimization and tuning with regard to frequent queries (indexes, materialized views, ...)

Data Warehouse DBMS
• DBMS for the Data Warehouse data basis
  • Based on a relational DBMS (ROLAP)
    • Focus on the evaluation of large volumes of data
    • Ad-hoc queries
    • No (or almost only read-only) transactions
    • Typical query patterns show a large number of join and aggregate operations (see the example below)
    • Further database support for special evaluation operators (special index structures, materialized views, etc.) is a must
  • Based on a multidimensional DBMS (MOLAP)
    • Proprietary database technology (multidimensional index structures)
    • Coupling of the OLAP requirements into the DBMS
  • Based on both systems, i.e. hybrid OLAP (HOLAP)
    • Mix of both approaches
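A hedged example of the typical ROLAP query pattern on a star schema, with a hypothetical fact table sales_fact and dimension tables dim_time and dim_product: several joins followed by grouping and aggregation.

SELECT d_t.year,
       d_p.product_group,
       SUM(f.amount)   AS revenue,
       SUM(f.quantity) AS units_sold
FROM   sales_fact  f
JOIN   dim_time    d_t ON d_t.time_sk    = f.time_sk       -- join with the time dimension
JOIN   dim_product d_p ON d_p.product_sk = f.product_sk    -- join with the product dimension
WHERE  d_t.year IN (2012, 2013)
GROUP  BY d_t.year, d_p.product_group
ORDER  BY d_t.year, revenue DESC;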

Data Marts (Partial Views of the DW)
• [Figure] Data sources (Store1, Store2, Central Purchasing, Central Stock) → Extraction → Transformation/Loading → DW → Data Marts → Analysis

Data Marts
• Splitting of the data basis of a Data Warehouse
  • A centralised DW is often difficult to implement (conceptually, technically, politically)
  • Independence (e.g. mobility)
  • Data protection aspects
  • Organizational aspects (independence of departments)
  • Minimization of the data volume
  • Performance benefits (utilize special aggregations)
  • Load balancing
  • Independence from the update cycles of the DW

Data Marts
• Classification of Data Marts
  • Geographical
  • Organizational
  • Functional
  • By department/sub-company/contender
  • Dependent (redundant)
  • Independent

Two Concepts for Data Marts
• Dependent Data Marts
  • Extracts from the DW (without further normalization or transformation)
  • At most aggregation
  • Rare variant: no DW is created; the Data Marts together form a single virtual DW
• [Figure] DW → Data Marts → Analysis

Two Concepts for Data Marts
• Independent Data Marts
  • Isolated views of the source systems without using a standardized base DB or DW
  • Often a pragmatic approach
  • Many integration and consistency problems!
  • Data Marts = processed data sources
• [Figure] Sources → Loading → Data Marts → Analysis; optionally Transformation/Loading from the Data Marts into a DW

Repository
• Contains metadata
  • Schema data and additional information
  • Descriptive and process-related information
  • I.e., information about the setup, maintenance and administration of a Data Warehouse System
• Examples of metadata
  • Description of the whole Data Warehouse System
  • Names, definitions, structure and content descriptions of relations, attributes, measures, etc.
  • Identification of the data sources
  • Context and meaning of the data

Repository cont.
• Examples of metadata cont.
  • Transformation and integration rules for filling the DW
  • Operational information such as updates, versions, access rights
  • Logical and physical data schemata
  • Tuning parameters of the Data Warehouse, etc.
• Metadata is managed by the Metadata Manager
  • The Metadata Manager controls the whole metadata administration of the DWS

OLAP - Analysis


OLAP and Codd's Rules
• OLAP = Online Analytical Processing
• Used for interactive, explorative data discovery
• Codd's 12 rules for evaluating OLAP components:
  1. Conceptual: multidimensional view
  2. Transparency: transparent access to data from different sources
  3. Flexible access possibilities to all heterogeneous sources (including external data sources)
  4. Constant response time in report creation (with regard to queries on different dimensions)
  5. Client/server architecture (including well-defined interfaces)
  6. Equality of dimensions
  7. Adaptive handling of sparse data cubes

OLAP and Codd's Rules cont.
• Codd's 12 rules for evaluating OLAP components (cont.):
  8. Multi-user operation
  9. Unlimited, cross-dimensional operations
  10. Intuitive data handling (mostly regarding the UI)
  11. Flexible reporting
  12. Unlimited number of dimensions/aggregation levels

Extension of Codd's Rules
1. Data integration (transparent access to all underlying high-resolution data)
2. Support of different analysis models
3. Separation of analysis-oriented and operative data
4. Separation of the storage of DW data and operational data
5. Differentiation between NULL and 'non-existent' values
6. Handling of missing values

Analysis Tools
• Analysis tools utilize the Data Warehouse
  • Often also called Business Intelligence Tools (BIT)
  • Range from simple arithmetic operations (aggregations) to complex statistical evaluations
• Functionality
  • Reporting tools: e.g. with traffic-light functionality (color-coding: red, yellow, green) in order to grasp large volumes of data rapidly (see the sketch below)
  • OLAP: interactive data analysis, often based on the multidimensional model (data cube)
  • Data Mining: search for unknown patterns or relationships in the data (clustering, classification, regression analysis, etc.)
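The traffic-light idea can be illustrated with a simple CASE expression; the thresholds and the table dw_sales_month are hypothetical.

-- Color-code each store's monthly revenue for a traffic-light report.
SELECT store_sk,
       sales_month,
       total_amount,
       CASE
           WHEN total_amount >= 100000 THEN 'green'
           WHEN total_amount >=  50000 THEN 'yellow'
           ELSE                             'red'
       END AS traffic_light
FROM   dw_sales_month;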

Analysis Tools cont.
• Classification of analysis tools by user goals
  • Generating standard or ad-hoc reports
  • Data exploration and data visualization
  • Provision/conversion for further processing in other systems
• Presentation types
  • Tables
  • Graphics
  • Text
  • Multimedia elements (e.g. for interactive data exploration)

Data Analysis
• "[Data analysis] is a dynamic process utilizing an OLAP tool whereby the user navigates through the multidimensional data structure utilizing multidimensional operators"
• Operators (see the SQL illustration below):
  • Pivoting/Rotation:
    • Turns the cube by swapping dimensions
  • Roll-up, Drill-Down, Drill-Across:
    • Roll-up: aggregation of the data along the dimension paths
    • Drill-Down is the opposite of Roll-up
    • Drill-Across corresponds to a change of the cube
  • Slice and Dice: individual views on the multidimensional data
    • A Slice "cuts" single slices from the cube
    • A Dice views partial (sub-)cubes
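A hedged SQL illustration of two of these operators on the hypothetical star schema used earlier (GROUP BY ROLLUP is available in most current DBMS): a roll-up along the time hierarchy and a slice that fixes one dimension value.

-- Roll-up: aggregate along the time hierarchy (month -> year -> grand total).
SELECT d_t.year, d_t.month, SUM(f.amount) AS revenue
FROM   sales_fact f
JOIN   dim_time d_t ON d_t.time_sk = f.time_sk
GROUP  BY ROLLUP (d_t.year, d_t.month);

-- Slice: fix one dimension value (year = 2013) and look at the remaining dimensions.
SELECT d_p.product_group, d_s.region, SUM(f.amount) AS revenue
FROM   sales_fact f
JOIN   dim_time    d_t ON d_t.time_sk    = f.time_sk
JOIN   dim_product d_p ON d_p.product_sk = f.product_sk
JOIN   dim_store   d_s ON d_s.store_sk   = f.store_sk
WHERE  d_t.year = 2013
GROUP  BY d_p.product_group, d_s.region;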

Data Mining & Knowledge Discovery
• Basic concept of knowledge discovery: "knowledge from data"
• Knowledge Discovery (in Databases) is an iterative process
  • Hypotheses produced by the data mining step are verified and/or interpreted
• Data Mining
  • No verification of statistical assumptions
  • Generates hypotheses "on its own"

Structure of a KDD Process
• Iterative process
  • Selection: selection of data from a data basis
  • Preprocessing: find faults such as data errors or incompleteness and remove them (not applicable in Data Warehousing, where the data has already been cleaned)
  • Transformation: minimize the data quantity
    • Remove attributes with no or little occurrence in the data basis
    • Transform the data into a form suitable for analysis
  • Data Mining: the task specification determines the data mining method
  • Interpretation and verification

Outlook
• How to build a cube
• Multidimensional data model
• How to store the multidimensional data model