Lecture Data Warehouse Systems

Eva Zangerle, SS 2013

PART A: Architecture, Chapter 2: Reference Architecture

DW Architecture: Reference Model
• Goal of a reference model
  • Definition of the terms and concepts (long-lived, fundamental)
  • Independence from implementation details / product terminology
  • Abstraction (and reduction of complexity)
  • Comparison between DW models, DW systems and DW components (strengths/weaknesses of different models / implementations)
  • Supports the definition of precise specifications and implementations of a DW system
• General requirements of the reference architecture
  • Ideal-typical model: components with operators (functions) and operands (data, databases)
  • Primarily function-oriented, i.e. it describes the data flow required for the functions as well as the control flow required for process control

Requirements on the Architecture
• Independence between data sources and analysis systems with regard to availability, load and continuous updates in the data sources (isolation)
• Persistence: permanent supply of integrated detailed data and derived (aggregated) data from parts of the source data
• Flexibility of use, i.e. arbitrary evaluations should be possible
• Supply of user-specific views (e.g. with regard to history)
• Scalability: easy integration of new data sources at a later time without needing to change the storage structures
• Efficiency through automation of procedures
• No ambiguities regarding data structures, access rights, processes
• Orientation towards the purpose, which is the analysis of data

Views on a DW Architecture
• Static view (system components, reference architecture)
  • Interface components
    • Data Warehouse Manager component
    • Metadata Manager component
  • Databases
    • Working Area, Base Database, Data Warehouse, Repository
  • Functional components
    • Monitor; Extraction, Transformation and Loading components
    • Analysis components
• Dynamic view (the procedure itself)
  • Data flow, split into primary data flow and metadata flow
  • Control flow
• Static and dynamic view cannot be separated completely; they will be intertwined in the following

DW Reference Architecture
• The Monitor supports extraction through selection of relevant data from the Data Sources
• The data required for the DW process is extracted into the (temporary) Working Area
• There it is transformed, i.e. cleaned, integrated, etc.
• Ideally, the data is then loaded into a Base Database, which is modelled independently of applications and is based on data and schema integration of the different data sources
• Finally, the data is loaded into a Data Warehouse specifically modelled for an application and is ready for analysis
• Metadata and the procedure of the DW process are stored in the Repository

Data Sources


Data Sources
• Data warehousing means operating a data store for analysis purposes
• Mostly heterogeneous data sources as primary data
  • Different DB systems, DB schemata, modelling of attribute semantics, ...
  • Frequently internal AND external sources (GIS, market research companies, GfK, ...)
• Essential questions
  • Selection of suitable data sources (relevance)
  • Quality (primary data!); often after conversions: "back to the roots"
• Relevance of data sources depends on
  • Purpose of the Data Warehouse
  • Quality of the source data
  • Availability of the source data (organizational, technical, legal, data protection)
  • Cost of acquisition

Data Sources cont.
• Classification of source data
  • Origin: internal vs. external (purchased)
  • Time: current vs. historical data
  • Utilization: primary data vs. metadata
  • Content type (semantics): unit of measurement, time, diagram, ...
  • Internal representation/data type: numerical (long, integer, ...), date (dt|us|...), alphanumerical, BLOB, ...
  • Language and character set: German, Spanish, Latin 1, Latin 2, ...
  • Technical character set: ASCII, Unicode, ...
  • Writing direction: left to right, right to left, ...
  • Data protection, security, trust level: non-confidential, confidential, secret, top secret

Data Sources - Quality
• Quality requirements for source data ("fitness for use")
  • Correctness (value and metadata correspond to their real-world characteristics)
  • Consistency
  • Reliability (no vague estimations)
  • Precision (numerical)
  • Credibility
  • Timeliness
  • Comprehensibility (exact descriptions, metadata)
  • Uniformity (i.e., same structure within the whole data set)
  • Completeness (in %)
  • Usability (file format)
  • Relevance
  • Granularity (data resolution across different dimensions)
• → Define a minimum quality for data that is to be loaded (see the sketch below)
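As an illustration of such a minimum-quality gate, the following SQL sketch counts staging rows that violate simple completeness and plausibility rules; all table and column names (src_orders, order_id, customer_id, amount, order_date) are hypothetical, and the acceptable threshold is a project decision.

-- Hypothetical staging table: src_orders(order_id, customer_id, amount, order_date)
SELECT COUNT(*) AS violations
FROM   src_orders
WHERE  order_id    IS NULL            -- completeness: key must be present
   OR  customer_id IS NULL            -- completeness: reference must be present
   OR  amount      < 0                -- plausibility: no negative amounts
   OR  order_date  > CURRENT_DATE;    -- plausibility: no future dates
-- Load only if the number of violations stays below the agreed minimum-quality threshold.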

Data Quality cont.
• Low data quality leads to higher costs
  • Higher effort required for integration
  • Garbage in, garbage out
  • Possibly wrong decisions based on low data quality

Components of a DW

Control Components
• Data Warehouse Manager
  • Central control component
  • Initialization, control and monitoring of all functions
  • Controls the data flow from the extraction of the source data to its analysis (monitor, extraction, transformation, loading and analysis components)
  • Time-based control of all actions:
    • Regularly
    • Event-driven
    • On demand
  • For controlling the data acquisition and analysis steps, the DW Manager uses information from the Repository

Control Components cont.
• Metadata Manager
  • Controls the metadata administration
  • Administers the metadata flow
  • Link between Data Warehouse Manager and Repository

Control Components cont.
• Monitor
  • Individual monitor per data source
  • Discovery of updates in the data sources
  • Notification of the Data Warehouse Manager about changes
  • Design decisions
    • Propagate net changes vs. all changes
    • Notification vs. polling
    • Internal vs. external monitoring

Control Components cont.
• Monitoring Techniques
  • Active mechanisms: ECA (triggers)
    • Event, Condition, Action
    • Write changes or changed data (periodically; time-, event- or value-based, ...) to predefined tables (see the sketch below)
    • Monitors only database operations that actually modify data
  • Replication-based mechanisms
    • Changed tuples are additionally stored in special relations, possibly on remote servers
    • Snapshot method: local copy of all data, similar to (persistent) views (updated via re-execution of the statements)
    • Data replication method: data is directly stored in local tables
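A minimal sketch of the trigger-based (ECA) variant in PostgreSQL syntax; the source table customer and the delta table customer_delta are hypothetical, and other DBMS use a different trigger dialect.

-- Delta table that records which customer rows changed, for later extraction.
CREATE TABLE customer_delta (
    customer_id INTEGER,
    change_type CHAR(1),                       -- 'I' insert, 'U' update, 'D' delete
    changed_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Event: modification on customer; Condition: none (always fire); Action: write the change.
CREATE FUNCTION log_customer_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO customer_delta (customer_id, change_type) VALUES (OLD.customer_id, 'D');
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO customer_delta (customer_id, change_type) VALUES (NEW.customer_id, 'U');
    ELSE
        INSERT INTO customer_delta (customer_id, change_type) VALUES (NEW.customer_id, 'I');
    END IF;
    RETURN NULL;                               -- return value of an AFTER trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_monitor
AFTER INSERT OR UPDATE OR DELETE ON customer
FOR EACH ROW EXECUTE FUNCTION log_customer_change();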

Control Components cont.
• Monitoring Techniques cont.
  • Log-based modification detection
    • Makes use of the DB log
    • Problem: detection of the actual data changes (→ mostly: re-execute all logged operations and detect the delta)
  • Application-based monitoring
    • Applications propagate changes directly to the DW (write a delta file)
  • Timestamp-based: source data carry a timestamp; load new data (newer timestamp) into a delta file (see the sketch below)
  • File diff: periodically compare snapshots of the data and detect the delta
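A hedged sketch of the timestamp-based technique, assuming each source row carries a last_modified column and that the time of the last successful extraction is kept in a hypothetical control table etl_run:

-- Copy only rows changed since the last successful run into the delta table.
INSERT INTO order_delta (order_id, customer_id, amount, last_modified)
SELECT o.order_id, o.customer_id, o.amount, o.last_modified
FROM   orders o
WHERE  o.last_modified > (SELECT MAX(finished_at)
                          FROM   etl_run
                          WHERE  status = 'OK');

-- Remember this extraction so the next run knows where to continue.
INSERT INTO etl_run (finished_at, status) VALUES (CURRENT_TIMESTAMP, 'OK');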

Work Area
• Temporary storage area for the required data transformations before spooling into the base database; in particular the basis for integration and cleaning of schemata and data
• Central storage in the data acquisition area, used by the
  • Extraction component
  • Transformation component
  • Loading component

Extraction

Extraction Component
• Controls the periodical transmission of source data to the working area
• Closely related to monitoring
• Update time:
  • Periodical extraction
  • Extraction on demand
  • Event-driven extraction
  • Immediate extraction on updates

Extraction Component cont.
• The technique depends on the data source and the monitoring strategy:
  • Trigger approach: direct readout of the updated data
  • Replication: SQL query on the replication table (delta files)
  • Timestamp: selection by timestamp
  • Log/Snapshot: depends on the implementation, e.g. file import
• The extraction time determines the analysis accuracy and depends on the analysis goals!

Range of the Extraction
• Snapshots: the source always delivers the whole data stock
  • New supplier catalog, new price list, etc.
  • Recognizing alterations
  • Representing the history correctly
• Logs: the source delivers every alteration
  • Transaction logs, application-controlled logging
  • Alterations can be imported efficiently
• Net (netto) logs: the source delivers only net alterations (i.e. at most one aggregated alteration per value)
  • Catalog updates, snapshot deltas
  • No complete history possible
  • Alterations can be imported efficiently

Transformation

Transformation Types
• Two types of transformation
  • Content transformation:
    • Data migration
    • Data cleaning
  • Structural transformation: schema integration

Transformation: Data Migration
• Data migration (= conversion for data unification; see the sketch below)
  • Adaptation of data types
  • Conversion of encodings
  • Standardization of character strings
  • Standardization of time representations
  • Conversion of measurement units
  • Combination or separation of attribute values
  • Key treatment: synonyms, homonyms
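Several of these migration steps can be combined in one statement. The following SQL sketch (all names hypothetical; date/format functions vary per DBMS) adapts a data type, standardizes and combines strings, unifies the date representation and converts a measurement unit while copying data into the working area.

INSERT INTO work_customer (customer_no, full_name, birth_date, weight_kg)
SELECT CAST(cust_no AS INTEGER),                                  -- data type adaptation (string -> integer)
       UPPER(TRIM(first_name)) || ' ' || UPPER(TRIM(last_name)),  -- string standardization + combination
       TO_DATE(birth_date_us, 'MM/DD/YYYY'),                      -- standardized time representation
       weight_lbs * 0.45359237                                    -- unit conversion (lbs -> kg)
FROM   src_customer;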

Transformation: Data Cleaning
• Data cleaning deals with (see the sketch below):
  • Violations of referential integrity
  • Mapping local primary keys to global (surrogate) primary keys (dictionary table)
  • Erroneous or illegal values (consistency checks, plausibility checks, business rules, ...)
  • Redundant values (duplicate elimination)
  • Incomprehensible and inconsistent values
  • Missing values and NULL values
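A hedged sketch of the key treatment via a dictionary table: local primary keys of each source system are replaced by global surrogate keys, and rows violating simple plausibility rules are filtered out. All table and column names are hypothetical.

-- key_map(source_system, local_key, surrogate_key) maps local keys to global surrogates.
INSERT INTO clean_sales (customer_sk, product_sk, quantity, price)
SELECT km_c.surrogate_key,                               -- global customer key
       km_p.surrogate_key,                               -- global product key
       s.quantity,
       s.price
FROM   work_sales s
JOIN   key_map km_c ON km_c.source_system = s.source_system AND km_c.local_key = s.customer_id
JOIN   key_map km_p ON km_p.source_system = s.source_system AND km_p.local_key = s.product_id
WHERE  s.quantity > 0                                    -- plausibility check
  AND  s.price IS NOT NULL;                              -- no missing measure values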

Transformation: Schema Integration
• Goal
  • Overcome structural heterogeneity, i.e. remove all inconsistencies at the schema level
  • Integration of the different local schemata (possibly with different DB models) into a common global data schema
• Schema analysis
  • Analysis and comparison of the schema objects
  • Specification of the relations between objects of the different schemata
• Schema translation
  • Mapping of a source schema to an equivalent target schema
  • Explicit mapping of every single schema
  • Development of mapping rules

Classification of Integration Conflicts
• Four classes of schema conflicts:
  • Semantic conflicts
  • Description conflicts
  • Heterogeneity conflicts
  • Structural conflicts
• The classes of conflicts usually occur in combination
• Additionally, data conflicts may occur

Semantic Conflicts
• Semantically overlapping real-world extracts with corresponding classes
• Problem
  • The classes often do not represent exactly the same set of real-world objects
  • Distinction between semantically equivalent, inclusive, overlapping and disjoint class extensions
• Solution: merging the competing object sets
• Example: overlapping class extensions

Description Conflicts
• Different properties/attributes of the same objects in the local schemata
• Homonyms and synonyms
  • Homonym: same name for different concepts
  • Synonym: different names for the same concept
• Data type conflicts / value range conflicts for a property
• Scaling conflicts
  • Use of different, but convertible, measurement units
• Conflicts through different integrity constraints

Heterogeneity Conflicts
• Different data models of the schemata to be integrated
  • Different modeling constructs and expressiveness
  • Often implicitly also structural conflicts
• Solution: transformation into a common global data model
• Example: object-oriented vs. relational

Structural Conflicts
• Even when the same data model is used, the same data is often modelled differently
  • Especially with semantically rich data models (with many modeling constructs)
• Solution: definition of a global view and specification of a mapping
• Example: structural conflicts in the object-oriented data model

Schema Conflicts in RelDB


Transformation: Schema Integration
• Problem areas of schema integration
  • Expert knowledge required; only a minimal amount of automation is possible
  • Maintenance of the global schema
  • Setup and maintenance of a metadata database

Schema Integration: 4 Phases
• Pre-integration phase: requirements, wishes and goal description; choice of integration strategy; choice of integration policy; collection of local metadata; extraction of schema information
• Comparison phase: analysis of the local schemata
• Unification phase: adaptation of the local schemata; documentation of the adaptation process; unification of the local schemata; definition of mappings and transformations
• Restructuring and Merging phase: analysis of the global schema; restructuring of the global schema; modification of mappings and transformations
• → Loading process

Criteria for Integration Techniques
• Completeness
  • The integrated schema must include all concepts contained in any local schema
  • Information from the local schemata must not get lost
• Minimality
  • No redundancy in the integrated schema
  • Several identical/similar concepts in the local schemata are mapped onto one concept in the integrated schema
• Understandability
• Correctness
  • Equivalence of the information in the integrated schema with that in the local schemata

Schema Integration
• Bottom-up integration (global as view)
  • Starting point: the existing schemata
  • Construction of a derived, integrated schema which, at best, includes the complete information content of the sources (see the sketch below)
  • A new source changes the global schema
• Top-down integration (local as view)
  • Starting point: a pre-defined target schema (e.g. given by standards)
  • Detection of correspondences with the local schemata and definition of the mapping
  • The integrated schema considers only parts of the local schemata
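For the relational case, the bottom-up ("global as view") idea can be sketched as a view that derives the global schema from the existing local schemata; the two store schemata and the column mapping below are purely hypothetical.

-- Global schema defined as a view over the local schemata ("global as view").
CREATE VIEW global_sales (store, product_no, sale_date, amount) AS
    SELECT 'Store1', s1.prod_id, s1.sold_on, s1.total
    FROM   store1_sales s1
  UNION ALL
    SELECT 'Store2', s2.article_no, s2.sale_date, s2.price * s2.qty
    FROM   store2_receipts s2;
-- Note: adding a new source later means changing this global view, as stated above.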

Bottom-Up Schema Integration
• Local schemata (LS1, …, LSn) → Pre-integration → uniform local schemata
• Schema matching → inter-schema correspondences
• Adaptation of the local schemata → transformed schemata
• Combination / restructuring → integrated schema (IS)
• Mapping generation → mappings LS1-IS, …, LSn-IS

Integration Strategies
• Several schemata can be integrated into one schema using different strategies

Top-Down Schema Integration
• Local schemata (LS1, …, LSn) → schema transformation of the LS into a uniform model → consistent local schemata
• Manual creation of the global schema → global schema (GS)
• Schema matching GS-LS1, …, GS-LSn → inter-schema correspondences
• Mapping generation → mappings LS1-GS, …, LSn-GS

Structural Heterogeneity: Reasons
• Different data models
  • Legacy systems: hierarchical or network databases
  • Different software systems with different persistence models
• Different perspectives
  • The real world is interpreted differently
  • Same identifier for different objects and vice versa
• Equivalent modelling alternatives
  • e.g. relationships as foreign keys or as separate relations
• Incompatible design specifications
  • Different data types, different integrity constraints or different relations between entities

Loading


Loading Component
• If no Base Database exists (only the DW), there is only a single loading process:
  • Analysis-specific data is transmitted from the working area directly into the DW
• If a Base Database exists, there are two loading processes:
  • One for transmitting analysis-independent detailed data from the working area to the base database
  • One for transmitting analysis-specific data (e.g. precalculated aggregates) from the base database to the DW (see the sketch below)
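A hedged sketch of the second loading process, assuming a hypothetical detailed table base_sales in the base database and a DW table dw_sales_month that stores precalculated monthly aggregates (DATE_TRUNC is PostgreSQL-flavored; other DBMS offer similar functions):

-- Load analysis-specific, pre-aggregated data from the base database into the DW.
INSERT INTO dw_sales_month (product_sk, store_sk, sales_month, total_amount, total_quantity)
SELECT product_sk,
       store_sk,
       DATE_TRUNC('month', sale_date),          -- condense detailed data to monthly granularity
       SUM(amount),
       SUM(quantity)
FROM   base_sales
GROUP  BY product_sk, store_sk, DATE_TRUNC('month', sale_date);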

Loading Component cont.
• Loading process
  • Online: the DW remains available for queries
  • Offline: exclusive loading (normally at night)
• Loader = loader/spooler of the DBMS (e.g. SQL*Loader by Oracle)
  • For supporting the history: no "update in place"
• Two loading approaches
  • Record-based loading
  • Set-based loading

Record-based Loading
• Use of standard interfaces
  • JDBC, ODBC
• Works in the usual transaction context
• Triggers, indexes and constraints remain active
  • Manual deactivation possible
• No wide-area locks
• Alterations in the data sets become permanent only after COMMIT
• Use of prepared statements (see the sketch below)
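A hedged sketch of record-based loading with a prepared statement in PostgreSQL-style SQL (via JDBC/ODBC the same pattern uses '?' parameter markers); the fact table and its columns are hypothetical.

-- Prepare the insert once, then execute it per record inside an ordinary transaction.
PREPARE load_fact (INTEGER, INTEGER, DATE, NUMERIC) AS
    INSERT INTO dw_sales_fact (product_sk, store_sk, sale_date, amount)
    VALUES ($1, $2, $3, $4);

BEGIN;
EXECUTE load_fact (4711, 12, DATE '2013-05-02', 19.90);
EXECUTE load_fact (4712, 12, DATE '2013-05-02',  5.50);
COMMIT;                    -- the loaded records become permanent only now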

Set-based Loading
• Synonym: BULK LOAD
• DB-specific extension for loading large amounts of data
• Usually executed in a special context (see the sketch below):
  • Complete locking of tables
  • Triggers and constraints are not considered
  • Indexes are not updated until completion
  • No transactional context, no logging
  • Checkpoints for restart
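A hedged, PostgreSQL-flavored sketch of the set-based pattern (Oracle's SQL*Loader would use a control file instead): per-row work is switched off, the file is loaded in one bulk operation, and the supporting structures are rebuilt afterwards. All object names and the file path are hypothetical.

-- 1. Switch off per-row work before the bulk load.
ALTER TABLE dw_sales_fact DISABLE TRIGGER ALL;
DROP INDEX IF EXISTS idx_sales_fact_date;

-- 2. Load the flat file in one set-based operation.
COPY dw_sales_fact (product_sk, store_sk, sale_date, amount)
FROM '/data/etl/sales.csv' WITH (FORMAT csv);

-- 3. Rebuild indexes and re-enable triggers/constraints after completion.
CREATE INDEX idx_sales_fact_date ON dw_sales_fact (sale_date);
ALTER TABLE dw_sales_fact ENABLE TRIGGER ALL;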

Base Database
• Properties and tasks
  • Integrated data basis (schemata and data) based on the cleaned data
  • Highest resolution and lowest granularity
  • Application neutrality: no specific optimizations, no aggregates with regard to typical OLAP queries
  • Buffer: updated according to the actuality of the data sources with regard to the history points
  • The data is passed on to the Data Warehouse at other, arbitrary times
  • Loading data into the DW may already involve aggregated, condensed data

Base Database cont.
• Characteristics
  • Logical, central data storage for all DWs
  • Distribution of the data to the DWs
  • Might even be used for analysis directly (without using a DW)
  • In the case of multiple DWs:
    • Hub-and-spoke architecture (German: Speichen-Nabe-Architektur)
    • Instead of n*m, only n+m interfaces (data sources ↔ base DB ↔ DWs)
  • The base database is often not implemented if only one DW exists

Base Database cont.
• Compared with the quality requirements for data sources, the quality requirements for the base database and the DW are extended by two further requirements:
  • Traceability
    • Documentation of the data flow from the data sources to the base database/DW
    • Documentation of the automatic transformations and of the possibilities for manual intervention
    • Exact definitions of the ETL processes (so that they can be reproduced in the next iteration)
  • Availability
    • Metadata descriptions for all object data (Repository)
    • Technical assurance of accessibility (availability)
    • Efficiency in supply/delivery

Distinction Base Database / ODS
• In the literature, Inmon suggested the Operational Data Store (ODS) as a component of a DW system
• According to Inmon there are 5 classes of ODS:
  • Class 0: whole tables are replicated into the ODS from the operative environment; no data integration
  • Class 1: transactions are transmitted to the ODS immediately after execution in the operative systems; no integration
  • Class 2: data changes are collected, integrated, transformed and loaded into the ODS periodically
  • Class 3: the ODS is supplied by the Data Warehouse
  • Class 4: the ODS is a combination of integrated data from the operative data sources and aggregated data from the Data Warehouse
• The Base Database in the presented architecture can be viewed as a class 2 ODS

ODS Types


Data Warehouse
• Data basis
  • Structuring of the data exclusively for the analysis requirements of the users
  • Main problem: setting up a suitable logical schema (with regard to the analysis requirements, not the source data!)
  • Often a transition to a multidimensional data model
  • Consequently: different schemata in the DW and in the Base Database!
  • All information can be found in the Base Database, the Repository and the DW!
• Requirements for the DBMS of the DW:
  • Bulk loader (switching off multi-user operation and consistency checks)
  • Access interfaces for analysis tools (E-SQL, ODBC, JDBC, OLE, etc.)
  • Optimization and tuning with regard to frequent queries (indexes, materialized views, ...)

Data Warehouse DBMS
• DBMS for the Data Warehouse data basis
  • Based on a relational DBMS (ROLAP)
    • Focus on the evaluation of large volumes of data
    • Ad-hoc queries
    • No (or almost only read-only) transactions
    • Typical query patterns show a large number of join and aggregate operations (see the example below)
    • Further database support for special evaluation operators (special index structures, materialized views, etc.) is a must
  • Based on a multidimensional DBMS (MOLAP)
    • Proprietary database technology (multidimensional index structures)
    • Coupling of the OLAP requirements into the DBMS
  • Based on both systems, i.e. hybrid OLAP (HOLAP)
    • Mix of both approaches
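A hedged example of the typical ROLAP query pattern on a star schema, with a hypothetical fact table sales_fact and dimension tables dim_time and dim_product: several joins followed by grouping and aggregation.

SELECT d_t.year,
       d_p.product_group,
       SUM(f.amount)   AS revenue,
       SUM(f.quantity) AS units_sold
FROM   sales_fact  f
JOIN   dim_time    d_t ON d_t.time_sk    = f.time_sk       -- join with the time dimension
JOIN   dim_product d_p ON d_p.product_sk = f.product_sk    -- join with the product dimension
WHERE  d_t.year IN (2012, 2013)
GROUP  BY d_t.year, d_p.product_group
ORDER  BY d_t.year, revenue DESC;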

Data Marts (Partial Views of the DW)
• [Figure] Data sources (Store1, Store2, Central Purchasing, Central Stock) → Extraction → Transformation/Loading → DW → Data Marts → Analysis

Data Marts
• Splitting of the data basis of a Data Warehouse
  • A centralised DW is often difficult to implement (conceptually, technically, politically)
  • Independence (e.g. mobility)
  • Data protection aspects
  • Organizational aspects (independence of departments)
  • Minimization of the data volume
  • Performance benefits (utilize special aggregations)
  • Load balancing
  • Independence from the update cycles of the DW

Data Marts
• Classification of Data Marts
  • Geographical
  • Organizational
  • Functional
  • By department/sub-company/contender
  • Dependent (redundant)
  • Independent

Two Concepts for Data Marts
• Dependent Data Marts
  • Extracts from the DW (without further normalization or transformation)
  • At most aggregation
  • Rare variant: no DW is created; the Data Marts together form a single virtual DW
• [Figure] DW → Data Marts → Analysis

Two Concepts for Data Marts
• Independent Data Marts
  • Isolated views of the source systems without using a standardized base DB or DW
  • Often a pragmatic approach
  • Many integration and consistency problems!
  • Data Marts = processed data sources
• [Figure] Sources → Loading → Data Marts → Analysis; optionally Transformation/Loading from the Data Marts into a DW

Repository
• Contains metadata
  • Schema data and additional information
  • Descriptive and process-related information
  • I.e., information about the setup, maintenance and administration of a Data Warehouse System
• Examples of metadata
  • Description of the whole Data Warehouse System
  • Names, definitions, structure and content descriptions of relations, attributes, measures, etc.
  • Identification of the data sources
  • Context and meaning of the data

Repository cont.
• Examples of metadata cont.
  • Transformation and integration rules for filling the DW
  • Operational information such as updates, versions, access rights
  • Logical and physical data schemata
  • Tuning parameters of the Data Warehouse, etc.
• Metadata is managed by the Metadata Manager
  • The Metadata Manager controls the whole metadata administration of the DWS

OLAP - Analysis


OLAP and Codd's Rules
• OLAP = Online Analytical Processing
• Used for interactive, explorative data discovery
• Codd's 12 rules for evaluating OLAP components:
  1. Conceptual: multidimensional view
  2. Transparency: transparent access to data from different sources
  3. Flexible access possibilities to all heterogeneous sources (including external data sources)
  4. Constant response time in report creation (with regard to queries on different dimensions)
  5. Client/server architecture (including well-defined interfaces)
  6. Equality of dimensions
  7. Adaptive handling of sparse data cubes

OLAP and Codd's Rules cont.
• Codd's 12 rules for evaluating OLAP components (cont.):
  8. Multi-user operation
  9. Unlimited, cross-dimensional operations
  10. Intuitive data handling (mostly regarding the UI)
  11. Flexible reporting
  12. Unlimited number of dimensions/aggregation levels

Extension of Codd's Rules
1. Data integration (transparent access to all underlying high-resolution data)
2. Support of different analysis models
3. Separation of analysis-oriented and operative data
4. Separation of the storage of DW data and operational data
5. Differentiation between NULL and 'non-existent' values
6. Handling of missing values

Analysis Tools
• Analysis tools utilize the Data Warehouse
  • Often also called Business Intelligence Tools (BIT)
  • Range from simple arithmetic operations (aggregations) to complex statistical evaluations
• Functionality
  • Reporting tools: e.g. with traffic-light functionality (color-coding: red, yellow, green) in order to grasp large volumes of data rapidly (see the sketch below)
  • OLAP: interactive data analysis, often based on the multidimensional model (data cube)
  • Data Mining: search for unknown patterns or relationships in the data (clustering, classification, regression analysis, etc.)
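The traffic-light idea can be illustrated with a simple CASE expression; the thresholds and the table dw_sales_month are hypothetical.

-- Color-code each store's monthly revenue for a traffic-light report.
SELECT store_sk,
       sales_month,
       total_amount,
       CASE
           WHEN total_amount >= 100000 THEN 'green'
           WHEN total_amount >=  50000 THEN 'yellow'
           ELSE                             'red'
       END AS traffic_light
FROM   dw_sales_month;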

Analysis Tools cont.
• Classification of analysis tools by user goals
  • Generating standard or ad-hoc reports
  • Data exploration and data visualization
  • Provision/conversion for further processing in other systems
• Presentation types
  • Tables
  • Graphics
  • Text
  • Multimedia elements (e.g. for interactive data exploration)

Data Analysis
• "[Data analysis] is a dynamic process utilizing an OLAP tool whereby the user navigates through the multidimensional data structure utilizing multidimensional operators"
• Operators (see the SQL illustration below):
  • Pivoting/Rotation:
    • Turns the cube by swapping dimensions
  • Roll-up, Drill-Down, Drill-Across:
    • Roll-up: aggregation of the data along the dimension paths
    • Drill-Down is the opposite of Roll-up
    • Drill-Across corresponds to a change of the cube
  • Slice and Dice: individual views on the multidimensional data
    • A Slice "cuts" single slices from the cube
    • A Dice views partial (sub-)cubes
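A hedged SQL illustration of two of these operators on the hypothetical star schema used earlier (GROUP BY ROLLUP is available in most current DBMS): a roll-up along the time hierarchy and a slice that fixes one dimension value.

-- Roll-up: aggregate along the time hierarchy (month -> year -> grand total).
SELECT d_t.year, d_t.month, SUM(f.amount) AS revenue
FROM   sales_fact f
JOIN   dim_time d_t ON d_t.time_sk = f.time_sk
GROUP  BY ROLLUP (d_t.year, d_t.month);

-- Slice: fix one dimension value (year = 2013) and look at the remaining dimensions.
SELECT d_p.product_group, d_s.region, SUM(f.amount) AS revenue
FROM   sales_fact f
JOIN   dim_time    d_t ON d_t.time_sk    = f.time_sk
JOIN   dim_product d_p ON d_p.product_sk = f.product_sk
JOIN   dim_store   d_s ON d_s.store_sk   = f.store_sk
WHERE  d_t.year = 2013
GROUP  BY d_p.product_group, d_s.region;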

Data Mining & Knowledge Discovery
• Basic concept of knowledge discovery: "knowledge from data"
• Knowledge Discovery (in Databases) is an iterative process
  • Hypotheses produced by the data mining step are verified and/or interpreted
• Data Mining
  • No verification of statistical assumptions
  • Generates hypotheses "on its own"

Structure of a KDD Process
• Iterative process
  • Selection: selection of data from a data basis
  • Preprocessing: find faults such as data errors or incompleteness and remove them (not applicable in Data Warehousing, where the data has already been cleaned)
  • Transformation: minimize the data quantity
    • Remove attributes with no or little occurrence in the data basis
    • Transform the data into a form suitable for analysis
  • Data Mining: the task specification determines the data mining method
  • Interpretation and verification

Outlook
• How to build a cube
• Multidimensional data model
• How to store the multidimensional data model