A Data Warehouse Modeling Technique

A Data Warehouse Modeling Technique Page 1 of 10 A Data Warehouse Modeling Technique Author: Prakash Kewalramani Organization: Evaltech, Inc. Evalte...
Author: Julie Wilkins
3 downloads 0 Views 141KB Size
A Data Warehouse Modeling Technique

Page 1 of 10

A Data Warehouse Modeling Technique Author: Prakash Kewalramani Organization: Evaltech, Inc. Evaltech Research Group, Data Warehousing Practice. Date: 10/21/10 Email: [email protected]

Where Research leads to Practice

Abstract: This article provides a summary of criteria we've used in the past to evaluate data migration products. While data migration includes extract, transformation, and load processes, we'll focus here on evaluation criteria for data transformation and leave the discussion of evaluating the other two processes for a future column. Intellectual Property / Copyright Material All text and graphics found in this article are the property of the Evaltech, Inc. and cannot be used or duplicated without the express written permission of the corporation through the Office of Evaltech, Inc.

Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 2 of 10

Organizations have an unquenchable appetite for data, but frequently lack the essential enzymes to digest it. The volume of data is irrelevant if it isn't organized in a manner to add value to the organization. And value to an organization means turning data into actionable information. Multidimensional Data Modeling (MDDM) is one approach to making sense of this pile of data. What Is Multidimensional Modeling? Sometimes introducing a new concept is easier when describing what it is not. Do you remember being told that "A car is not a toy," or "Life isn't easy?" The implication was clear, and a lot of hazy description was avoided. Let's use the same logic in describing Multidimensional Data modeling (MDDM). First of all, MDDM is not tied to a physical representation of data. It is not inseparable from multidimensional databases (like Essbase, Express, Gentium). It does not involve squeezing information into cubes. It isn't terribly difficult to understand or to practice, and it certainly isn't new or the latest buzzword. Despite the excessive amount of attention MDDM is getting these days, the concept has been around as long as end-user computing, starting in the 1960's with products like Express and APL. And even the lowly spreadsheet is a multidimensional model, albeit a two-dimensional one. MDDM is: MDDM is not: A logical model for the business A proprietary database management system information Easy to understand Rocket Science Applicable to relational and A ‘CUBE’ multidimensional databases Extremely useful for analysis A replacement for a relational model A well thought and tested technique The latest buzzword What is MDDM? It is a technique for conceptualizing business models as a set of measures described by ordinary facets of business? It is particularly useful for sifting, summarizing and arranging data to facilitate analysis. In contrast to the techniques for designing On-Line Transaction Systems (OLTP), which rely on entities, relationships, functional decomposition and state transition analysis, MDDM uses facts, metrics, dimensions, hierarchies and sparsity. Multidimensional data models are about numeric data, such as values, counts, weights and occurrences. Though a typical OLTP problem statement might be, "Model the order fulfillment process," a MDDM problem statement is, "What is my profitability by customer over time, by organization?" Part of the allure of MDDM's is their simplicity. A fully normalized OLTP design for an order entry system may have dozens or even hundreds of tables, and making sense of the design for analysis is daunting. Even early Woody Allen movies aren't as funny as watching a well-intentioned data modeler trying to do a "walk-through" of the model with the (euphemistically named) "business people." MDDM's, on the other hand, are quite simple, even elegant in comparison. They not only are understood by the "business people," they are expressed in a way that is quite natural to them. What's more, a good

Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 3 of 10

conceptual MDDM can be implemented in a relational database, a multi-dimensional database or even an object-oriented database. Comparison of the Multidimensional data Model with Relational Model The main purpose of the relational model is its ability to support operational processes. In contrast, the multidimensional data model is designed to support the reporting and analytical needs of knowledge workers, and can best be described by contrasting it with the relational model in four fundamental ways: • • • •

The MDDM views information from the perspective of a "slice of time" instead of atomic transactions The MDDM is globally consistent and conformed for the enterprise; the relational model is internally consistent OLTP systems provide for detailed audit trails, MDDMs are better for the big picture Relationships are modeled explicitly in the relational model, implicitly in the multidimensional model

Transaction View versus Slice of Time

Local Consistency versus Global Consistency

Audit Trail versus Big Picture

Explicit Relationships versus Implied Relationships

Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 4 of 10

OLTP systems record each and every transaction. Examples are journal entries, purchase orders, billing items, hamburgers dropped on the floor or all those collect calls on Fathers' Day. The multidimensional data model is not concerned with actual transactions, only the quantitative result of them at some interval in time, such as days, weeks or months. Transaction View versus Slice of Time

OLTP systems are designed with in their own scope and they are consistent. For example, a General Ledger is locally consistent if it properly records all of the relevant transactions, performs allocations and maintains its account master. But if the data warehouse combines information from more than one General Ledger systems, each with its own chart of accounts, global consistency is not automatic. The multidimensional data model starts from a globally consistent view of the enterprise, meaning a single chart of accounts, in this case. Local Consistency versus Global Consistency

When a customer has a question about their credit card bill, they want to see every transaction. When your overnight package is lost, you want to know who was the last person to see it intact. The systems that manage this information ensure the highest level of confidence and security. Is it any wonder that they have shortcomings when dealing with analytical questions? The multidimensional model is designed to answer questions like, "What was the revenue of the company from all four regions?" or "Who are my top ten customers and why?" or "Why are sales down this year compare to last year?" Audit Trail versus Big Picture

Entity-Relationship modeling is the heart of the relational model. The explicit relationships between customers and sales orders, or between hamburgers and buns are burned into the design of the database. In multidimensional modeling, these relationships are implied by the existence of "facts" at the cross-section of dimensions. For example, if there are sales dollars to Customer 101, of Product 2345, then the relationship between customer and product is implied. Explicit Relationships versus Implied Relationships

Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 5 of 10

Option 1: The Classic Star Schema

In the star schema for database design, each dimension is described by its own table and the facts are arranged in a single large table, indexed by a multi-part key that comprises the individual keys of each dimension. DATA: INFORMATIONWEEK

HOW-TO: SIX BASIC QUESTIONS IN MULTIDIMENSIONAL DATA MODELING These are the steps to follow while building a MDDM. The first step in building a model is to pick a business subject area (weekly sales reporting, monthly financial statements, insurance claims costs) and model it by asking six fundamental questions from the business users. • • • • • •

What business process is being modeled? Or What is the Vision? What are the measures (or "facts")? At what level of detail (granularity) is "Active" analysis conducted? What do the measures have in common (the dimensions)? What are the attributes of the dimensions? Are the attributes stable or variable over time, and is their "cardinality" bounded or unbounded?

First, a few definitions. "Active" analysis means the ability to manipulate information. This can refer to mechanical manipulation, like pivoting, drilling or graphing; agentbased manipulation, like alert reporting, exception reporting, or other agent-based activity; or workflow manipulation, like publishing, distributing or notification. Certain products allow a "reach-through" or "drill-through" process, where analysis can occur at a summary level, but detail at lower levels is only available in listing format; it is not actionable. That is why you must define the boundary for active analysis. It defines the limits of your manipulative analytical reach. "Cardinality" is just a lofty way of saying "how many." Since the frustrated math teachers who invented relational theory based their model on mathematical set theory, the

Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 6 of 10

appellation stuck. Boundedness is another obtuse mathematical term, which simply refers to whether or not there is an upper limit to the cardinality. If you build your multidimensional data model in a relational database, unbounded cardinality is expected. The opposite is true in the multidimensional databases, where changes in cardinality often require complete reorganization of the database, an extremely time-consuming process. In any case, this is why cardinality affects the design of your model. Moving through the questions: Trying to do everything at once should be avoided. Instead, concentrate on a bounded area like customer profitability or sales reporting. One exercise we use in our practice is to have clients formulate a simple statement describing the intent of the model. For example, "Net sales, in dollars and units, of every product, at every store, by week, by company for the last three years, compared to shipments and budgeted sales." This simple statement gives us the business process (retail sell-through data), the facts (Sales $, Sales units, return units, shipment units, price), the dimensions (customer, product, time and view) and the granularity or lowest level of detail (sales, shipments and returns of individual products, aggregated by week and counter). The attributes are worked out by evaluating the rich features of the dimensions. Following the six questions will help you refine your thinking and lead to a solution that is manageable. Before you start, though, you will need a lot more information about facts, dimensions, attributes and, a major problem in MDDM, handling sparsity.

Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 7 of 10

Facts, Dimensions, Attributes and Sparsity The basic components of the multidimensional data model are facts, dimensions, attributes and sparsity. Physical Representation in Relational Databases

The first two components, facts and dimensions, are represented physically in a relational database as tables. In the simplest MDM, the basic star schema, they are the only tables. In the star schema, each dimension is described by its own table, and the facts are arranged in a single large table, indexed by a multi-part key that is made up of the individual keys of each dimension. There are many variations on the simple star schema, but they all share the basic concept of fact tables and dimension tables. In many cases, not all facts share the same dimensionality, and multiple fact tables may be used. An example is selling price, which may not vary between markets or customers, or the Federal Income Tax rate. When dimensions are large (high cardinality), it often makes sense to split the dimension tables at the level of the attributes, also known as the "snowflake schema." Storing aggregations and derivations yields even more exotic combinations, sometimes referred to as "decomposed stars" or "constellation schema." The point to keep in mind is that dimensional modeling for relational databases is designed to create fact tables, usually long and thin, and relatively small, short and wide, dimension tables. While the fact tables contain the actual numeric information, all of the interesting information is in the dimension tables. Queries are designed to exploit this by using the dimension tables for counts, control breaks, aggregation paths and searching for properties of the elements. Many queries can be resolved without even touching the fact table. When "facts" are needed, the method is to gather the key values from the dimension tables and then pull the matching records from the fact table, avoiding costly and time consuming table scans and complex joins. The second two components, attributes and sparsity are not "entities" or represented as tables. Attributes are the extended descriptions and hierarchies of the dimensions, such as brand, color and size in the Product dimension (see diagram at left of an extensive product dimension). Sparsity is handled implicitly. In the simple example shown, it is likely that facts (Dollars, for example) will exist for only a small fraction of the Cartesian product of Product, Market and Period. For example, assuming that the model has a total Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 8 of 10

of 120 markets, 20,000 products and 1,098 days (or 2,625,200,000 possible combinations), there may in fact be actual sales for only 1% or even 0.1% of the possible combinations, sometimes described as 99% or 99.9% sparse, respectively. The star schema model handles sparsity by simply not recording records where those combinations are invalid. In a simple star, that translates to 26,252,000 rows at 99% sparsity or roughly 1-2GB for the model shown. Facts Since dimensional modeling always involves hierarchies in dimensions, aggregation of information is a key element in the usefulness of the model. Since aggregation is an additive process, it is best if facts are limited to additive, numeric values. It is possible to deal with non-additive facts (text, for example) but unless the text is unique for every record in the fact table, it belongs in the dimension tables. Some facts are only semi-additive, like certain counts. Some facts are numeric and still nonadditive, such as prices or rates. Special care must be taken in handling these facts, since they can add the potential of incorrect answers to queries. Some front-ends add functionality to deal with this, but the handling of nonadditive facts is an advanced topic in data warehouse design. Selecting the facts for your model is relatively simple: once a business subject is selected, the list of facts is the answer to the question, "What are we measuring?"

Evaltech, Inc.

Dimensions Dimensions are the classes of descriptors of the facts. If the fact is sales, the dimensions might be time, geography, customer and product. We say that dimensions are "classes" because they can be broken into "attributes." The attributes of a product dimension, for example, may include a hierarchy of item, brand, class, and division. Or the attributes may be descriptive, such as "large metro", "medium size," "down market" or "yes." The concept of dimension is fundamental to the MDDM. Most modelers and vendors by now use the phrase "common business dimensions" as if it were obvious just what they are. In fact, the most-often used dimensions are surprisingly common across applications and industries, but it's in the differences that the real challenge lies. For example, it is rare that a MDDM does not include time as a fundamental dimension. It is so common, in fact, that most tools either have or will shortly implement time intelligence as a feature, which means that calendar conversions and day->week->month->quarter->year rollups are automatic and even some special cases like time ranges (promotion periods, seasons, etc.) and fiscal periods are handled without much effort.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 9 of 10

Attributes provide the depth of the dimensions beyond identifying codes. Through the attributes, we can find the hierarchies of the dimensions, detailed descriptions of each dimensional element and extended properties. It is common for dimension tables to have over 100 columns of attributes; properly designed queries or front-end query generators that understand MDDM exploit these highly denormalized structures to return complex queries quickly. Let's look at how this works. Assume that our model has a product dimension with an attribute for "type" with "lowfat," "ethnic," "premium," "plain-label" and "special diet" as possible values. This is important, and is a key differentiator between relational and multidimensional implementations of the MDDM. Even though there is no defined hierarchy for these attributes, it is possible to generate queries very quickly by using the power of the star schema. Try the following example: How well are "low-fat" products selling this quarter versus the same quarter last year? In SQL, this can be done using the correlated sub query as follows: select product, dollars as sales_95, select dollars from market cm, product cp, period ct, sales cs where ct.month = pt.month and ct.year = pt.year-1 and cp-product = pp.product and cm.city = pm.city as sales_94 from market pm, product pp, period pt, sales ps where year = 1995 and quarter = 4 and product in (select product from product where type = "low-fat")

Notice the last select statement: this is the key to star schema - collect the keys from the dimension table by looking for constraints there, not the fact table. This is how table scans are avoided. Unfortunately the query optimizers of many relational databases are not clever enough to understand this yet, but they are beginning to catch on. Though it is more clumsy, the same result could be achieved without the correlated subquery by using multiple SELECT statements and TEMP tables, with a final join. Unfortunately, this is a weakness in SQL, which simply can't do an outer join and stitch the rows side-by-side in a sort-merge. The correlated subquery is a little exotic, but your front-end should mask this complexity from users. For those relational databases that are thinking clearly about multidimensional modeling, particularly Red Brick, the response time for these queries is exceptional.

Evaltech, Inc.

Copyright © 2011

A Data Warehouse Modeling Technique

Page 10 of 10

We covered the basics in this article, but there is much more to MDDM than basics: advanced topics in schema design for relational databases, techniques for designing multidimensional databases (which we did not cover), the complexities of multiple hierarchies, cross-dimensional relationships, partial dimensionality and handling of partially-additive and non-additive facts. The dimensional models for some processes are far more complex than the examples presented here, especially those that involve complex products, like financial instruments or insurance; any use of "householding" such as credit and or direct marketing; subscription schema involving prepaid packages of services; and itinerary-based models. If your first foray into multi-dimensional models involves any of these aspects, it is best that you enlist the help of skilled practitioners who can assist you. Multidimensional data modeling is more difficult to describe than to learn. All that is really needed is an understanding of the business process being modeled, the discipline to keep the model from expanding unnecessarily and a few basics. A healthy by-product of every modeling effort is a deeper and clearer understanding of the elements of your business. Very few participants in a multidimensional modeling effort come away without having learned something valuable about the drivers and relationships of the process under consideration.

Evaltech, Inc.

Copyright © 2011