METADATA IN THE DATA WAREHOUSE

METADATA IN THE DATA WAREHOUSE BY W. H. Inmon M E T A D A T A I N T H E D A T A W A R E H O U S E The first image most people have of the...
Author: Samuel Carr
0 downloads 1 Views 688KB Size
METADATA IN THE DATA WAREHOUSE BY

W. H. Inmon

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

The first image most people have of the data warehouse is a large collection of historical, integrated data. While that image is correct in many regards, there is another very important element of the data warehouse that is vital - metadata. Metadata is data about data. Metadata has been around as long as there have been programs and data that the programs operate on. Figure 1 shows metadata in a simple form.

While metadata is not new, the role of metadata and its importance in the face of the data warehouse certainly is new. For years the information technology professional has worked in the same environment as metadata, but in many ways has paid little attention to metadata. The information professional has spent a life dedicated to process and functional analysis, user requirements, maintenance, architectures, and the like. The role of metadata has been passive at best in this milieu. But metadata plays a very different role in data warehouse. Relegating metadata to a backwater, passive role in the data warehouse environment is to defeat the purpose of data warehouse. Metadata plays a very active and important part in the data warehouse environment. The reason why metadata plays such an important and active role in the data warehouse environment is apparent when contrasting the operational environment to the data warehouse environment insofar as the user community is concerned. Figure 2 shows the difference in the communities served by data warehouse and operational systems.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 1

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

The information technology professional is the primary community involved in the usage of operational development and maintenance facilities. It is expected that the information technology community is computer literate, and able to find his/her way around systems. The community served by the data warehouse is a very different community. The data warehouse serves the DSS analysis community. It is anticipated that the DSS analysis community is not computer literate. Instead the expectation is that the DSS analysis community is a businessperson community first, and a technology community second. Simply from the standpoint of who needs help the most in terms of finding one's way around data and systems, it is assumed the DSS analysis community requires a much more formal and intensive level of support than the information technology community. For this reason alone, the formal establishment of and ongoing support of metadata becomes important in the data warehouse environment. But there is a secondary, yet important, reason why metadata plays an important role in the data warehouse environment. In the data warehouse environment, the first thing the DSS analyst needs to know in order to do his/her job is what data is available and where it is in the data warehouse. In other words, when the DSS analyst receives an assignment, the first thing the DSS analyst needs to know is what data there is that might be useful in fulfilling the assignment. To this end the metadata for the warehouse is vital to the preparatory work done by the DSS analyst. Contrast the importance of the metadata to the DSS analyst to the importance of metadata to the information technology professional. The information technology professional has been doing his/her job for many years while treating metadata passively.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 2

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

The community served and the importance to that community are the primary reasons why metadata is so important in the data warehouse environment. But there are other powerful reasons as well. MAPPING A basic part of the data warehouse environment is that of mapping from the operational environment into the data warehouse. The mapping includes a wide variety of facets, including, but not limited to: • mapping from one attribute to another, • conversions, • changes in naming conventions, • changes in physical characteristics of data, • filtering of data, etc. Figure 3 shows the storing of the mapping in metadata for the data warehouse.

It may not be obvious why mapping information is so important in the data warehouse environment. Consider the vice president of marketing who has just asked for a new report. The DSS analyst turns to the data warehouse for the data for the report. Upon inspection, the vice president proclaims the report to be fiction. The credibility of the DSS analyst goes down until the DSS analyst can prove the data in the report to be valid. The DSS analyst first looks to the validity of the data in the warehouse. If the data warehouse data has not been reported properly, then the reports are adjusted. However, if the reports have been made properly from the data warehouse, the DSS analyst is in the position of having to go back to the operational source to salvage credibility. At this point, if the mapping data has been carefully stored, then the DSS analyst can quickly and gracefully go to the operational source. However, if the mapping has not been stored or has not been stored properly, then the DSS analyst has a difficult time defending his/her conclusions to management. The metadata store for the data warehouse then is a natural place for the storing of mapping information. ©Copyright 2000 by William H. Inmon, all rights reserved

Page 3

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

MANAGING DATA OVER TIME Another extremely important function of metadata in the data warehouse environment is that of management of data over time. The time horizon of data in the data warehouse is MUCH longer than that normally found in the operational environment. In the data warehouse a time horizon of five to ten years is absolutely normal. A time horizon of a few weeks to ninety days is normal for many operational environments. Nowhere does the difference in managing data over different time horizons show up more poignantly than in the metadata sitting over those environments. Given that data in the data warehouse environment is taken over a broad spectrum of time, it is absolutely normal that the structure of data - the keys, the attributes, the layouts, etc. - should change. Said another way, over a long period of time, the same data structure will have multiple forms. The record of those structures and those changes becomes the purview of the metadata for the data warehouse. Figure 4 shows the storing of multiple data structures in the metadata for the data warehouse.

The storage of multiple data structures for the data warehouse is contrasted with the storage of a single data structure as found in the operational environment. One of the fundamental concepts of data management in the operational environment is that there is one and only one correct definition of data. This assumption is 180 degrees the ©Copyright 2000 by William H. Inmon, all rights reserved

Page 4

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

opposite of the data found in the data warehouse. Managing data over a long spectrum of time then is another reason why metadata in the data warehouse is so important. VERSIONING OF DATA Because data must be managed over a long spectrum of time in the data warehouse environment (and correspondingly, the associated metadata must be likewise managed), metadata must be "versioned". Versioned data is data that allows changes to be continuously tracked over a long period of time. Figure 5 illustrates versioning.

In Figure 5 it is seen that both the current status of a structure can be found AND the history of changes to that structure can likewise be found. This tracking is a necessary feature for ALL types of metadata in the data warehouse store of metadata. One of the characteristics of versioning is that the data trail be continuous and nonoverlapping. In other words it is important that for any moment in the past there be one and only one value or status of metadata. Versioning can be accomplished by placing the effective from data and the effective to date on the metadata component, or by placing the effective from date only on the metadata component. When only the effective from date is placed on the metadata component, the effective to date must be derived by looking at the next occurrence of the metadata component. The reason why versioning is so important is that when the DSS analyst wants to interrogate a calculation or report made in the past that the versioning allows the DSS analyst to understand what data was and where it came from as it entered the warehouse. Without versioning, the only meaningful data a data warehouse environment has is data written for and managed under the most current definition and structure of data.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 5

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

METADATA COMPONENTS - BASIC COMPONENTS The basic components of the data warehouse metadata store include the tables that are contained in the warehouse, the keys of those tables, and the attributes. Figure 6 shows these components of the data warehouse.

The contents of the metadata for the data warehouse do not differ significantly from the metadata contents for the operational environment for these basic components, other than the data warehouse components are versioned, and the same components for the operational environment are not normally versioned. METADATA COMPONENTS - MAPPING As previously discussed, one of the most important components of the data warehouse metadata store is that of mapping, as shown in Figure 7.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 6

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

The typical contents of mapping metadata that are stored in the data warehouse metadata store are: • identification of source field(s), • simple attribute to attribute mapping, • attributes conversions, • physical characteristic conversions, • encoding/reference table conversions, • naming changes, • key changes, • defaults, • logic to choose from multiple sources, • algorithmic changes, and so forth. Like all other data warehouse metadata, these components should be versioned. METADATA COMPONENTS - EXTRACT HISTORY The actual history of extracts and transformations of data coming from the operational environment and heading for the data warehouse environment is another component that belongs in the data warehouse metadata store. Figure 8 shows this part of the data warehouse metadata store.

The extract history simply tells the DSS analyst when data entered the data warehouse. The DSS analyst has many uses for this type of information. One occasion is when the DSS analyst wants to know when the last time data in the warehouse was refreshed. Another occasion is when the DSS analyst wants to do what if processing and the assertions of analysis have changed. The DSS analyst needs to know whether the results obtained for one analysis are different from results obtained by an earlier analysis because of a change in the assertions or a change in the data. There are MANY cases where the DSS analyst needs to use the precise history of when insertions have been done to the data warehouse.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 7

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

METADATA COMPONENTS - MISCELLANEOUS One important component of the metadata environment is that of alias information, as seen in Figure 9.

Alias information is attribute and key information that allows for alternative names. Alternative names often make a data warehouse environment much more "user friendly". In some cases technicians have influenced naming conventions that cause data warehouse names to be incomprehensible. In other cases one departments names for data have been entered into the warehouse, and another department would like to have their names for the data imposed. Alias' are a good way to resolve these issues. Another useful data warehouse metadata component is that of status. In some cases a data warehouse table is undergoing design. In other cases the table is inactive or may contain misleading data. The existence of a status field is a good way to resolve these differences. Volumetrics are measurements about data in the warehouse. Typical volumetric information might include: • the number of rows currently in the table, • the growth rate of the table, • the statistical profile of the table, • the usage characteristics of the table, • the indexing for the table and its structure, • the byte specifications for the table, etc. Volumetric information is useful for the DSS analyst planning an efficient usage of the data warehouse. It is much more effective to consult volumetrics before submitting a query that will use unknown resources than it is to simply submit the query and hope for the best. Aging/purge criteria is also an important component of data warehouse metadata. Looking into the metadata store for a definition of the life cycle of data warehouse data ©Copyright 2000 by William H. Inmon, all rights reserved

Page 8

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

is much more efficient than trying to divine the life cycle by examining the data inside the warehouse. METADATA COMPONENTS - SUMMARIZATION ALGORITHMS The data warehouse is made up of multiple levels of data, usually divided along the lines of detail and summary. Storing the algorithms used to summarize or otherwise calculate data can save the DSS analyst untold time in determining what the appropriate source of data is. Figure 10 shows the summarization algorithm connecting two levels of data in the warehouse.

METADATA COMPONENTS - RELATIONSHIP ARTIFACTS Data relationships are as important in the data warehouse environment as they are in any other data base environment. However, the implementation of data relationships in the data warehouse environment is quite different from the implementation of data relationships in other environments. Storing information about the data relationship artifacts found in the data warehouse is a useful component of the data warehouse metadata store, as seen in Figure 11.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 9

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

(NOTE: see the tech topic on data relationships in the data warehouse environment for a detailed explanation of this topic.) Storing artifact information in the data warehouse metadata makes the DSS analysts job much easier in trying to determine what data is appropriate and most useful for analysis. METADATA COMPONENTS - RELATIONSHIP HISTORY Data artifacts are the actual data that results from a relationship of data occurring. But there is another dimension to that relationship and that is the actual history of the relationship itself. Storing the history of the relationship is also a very important aspect of data warehouse metadata, as shown in Figure 12.

The history of a relationship (as opposed to the actual data resulting from the relationship) can be quite useful in interpreting data warehouse artifacts. Typically the relationship history contains the following information: • the tables participating in the relationship, • the effective dates of the relationship, • the constraints in effect, • cardinality, • a description of the relationship, • the data administrator in charge, etc. METADATA COMPONENTS - OWNERSHIP/STEWARDSHIP Because data is updated in the operational environment there is a strong sense of ownership. The department that can update operational data effectively "owns" the data. There is a corresponding notion of stewardship of data in the data warehouse environment. Data is not updated in the data warehouse. But it is moved into the data warehouse and managed there. There is a responsibility for data in the data warehouse (although the notion of responsibility for data in the operational environment is quite different.) ©Copyright 2000 by William H. Inmon, all rights reserved

Page 10

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

Storing ownership (operational) and stewardship (data warehouse) information in the data warehouse metadata store can be of great service to the DSS analyst trying to rectify or understand data. Figure 13 shows the storage of ownership/stewardship data in the data warehouse metadata store.

METADATA COMPONENTS - ACCESS PATTERNS The pattern of access of data - either predicted or actual - can be quite useful. This information can be stored in the data warehouse, as shown in Figure 14.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 11

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

Access pattern data is useful in determining when and what data to migrate to other levels of storage. If a parallel environment is chosen, access patterns are very useful in deciding how to physically spread the data. Unfortunately, determining access patterns is not an easy task in many technologies. In some cases log records can be used for this purpose. In other environments the dbms masks off the actual pattern of access. METADATA COMPONENTS - REFERENCE TABLES/ENCODED DATA Encoded data or reference data is data that is stored in an external table and contains commonly used translations of values. Usually the data in the reference table is lengthy or stored awkwardly and the data in the primary data table is stored concisely. As data changes over time, it is necessary to store the contents of the reference tables as well. If the reference tables are not stored, the DSS analyst may well arrive at the proposition of having to try to decipher primary data into an unknown format. This can be very disconcerting for the DSS analyst. Inaccuracy and frustration are usually the result. In order to manage data over the spectrum of time, it is necessary to store reference tables/encoded tables as well, with the effective from and effective to dates for the tables in the data warehouse metadata store, as shown in Figure 15.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 12

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

METADATA COMPONENTS - DATA MODEL - DESIGN REFERENCE The heart of the design of the data warehouse is the data model. Building a data warehouse without a data model is a very trying experience. The DSS analyst occasionally needs to understand the relationship between the physical design of the data warehouse and the data model. Figure 16 shows the data model and the relationship between the data model and the physical design of the system residing in the data warehouse metadata store.

©Copyright 2000 by William H. Inmon, all rights reserved

Page 13

M

E T A D A T A

I

N

T

H E

D

A T A

W

A R E H O U S E

RELATIONSHIP TO OTHER METADATA STORES While the metadata for the data warehouse is certainly an important feature of the architected environment, it is hardly the only metadata there is. Metadata exists (in a variety of forms) for the operational environment and the departmental environment. Figure 17 shows that metadata.

The metadata for all the environments needs to be able to communicate when needed for the other environments. There needs to be a two-way interchange of information. The interfaces need to understand what data resides where and the format and use of the data in other environments. SUMMARY Metadata for the data warehouse environment is one of the most important aspects. Metadata helps the DSS analyst find what data is in the warehouse and use that data effectively and efficiently. Some of the components of data warehouse metadata are: • the structure and contents of the warehouse, • the mapping of data into the data warehouse, • the extract/transformation history, • alias information, • status information, • volumetrics, • aging purging criteria, • summary/calculation data between levels of the warehouse, • data relationship artifact information, • relationship history information, • ownership/stewardship information, • access information, • reference/encoding table information, and • data model - design relationship information. ©Copyright 2000 by William H. Inmon, all rights reserved

Page 14