Data Modeling - Implementation

Written by Zakir Hossain, Manager – Enterprise Architecture Configuration & Database, US Dept. of Defense; CEO, Data Group; CS Graduate (OSU), OCP, OCA, MCDBA, MCITP-DBA, Security+, Oracle RAC-Admin, Oracle Backup/Recovery-Admin, Oracle Performance/Monitoring-Admin, Oracle App Server-Admin, System Admin (Windows/RedHat), ITIL V3, SCJP (Sun Certified Java Programmer)

Special Note: You are not allowed to share notes or questions/answers with discontinued students or outsiders. Anybody who needs notes, questions, and answers must contact either Zakir or Ms. Navait, Project Coordinator, directly.

..::: NOTE :::.. Altering, printing, sharing with any training institute/individual, or commercial/business use without written permission is strictly prohibited. It is a federal copyright violation of an intelligence product. Violators will be prosecuted to the fullest extent of federal law.


Database Normalization: The whole process of data modeling is often called data normalization or database normalization. Normalization is the process of reducing redundant/duplicate data. It means we may have a data element in a row/record that we do not need in that same table. So, as a rule of thumb, any time we find a duplicated data element in a record, we may need to create another table to hold the duplicates.

Again, normalization is the process of organizing data in the most optimized way for better performance and to avoid unnecessary redundant data in a database. This includes creating tables and establishing relationships between tables to eliminate redundancy and inconsistent data. For example, say you have two tables in your database: Customers and Orders. If your database is not normalized, you may store Customer Name in both the Customers and Orders tables to relate each order to a customer. Apart from storing the same information twice, you also have a real problem where the data can get out of sync: if a customer gets married and changes her last name, you must update her name in both tables and for every single order.

One of the key strengths of relational databases over flat files such as text files is their ability to relate data across tables without having to duplicate it. That is where the term "relational" database comes from, and it is why RDBMS (Relational Database Management System) became such popular terminology. So, normalization is a logical database design method that separates data into multiple related tables. An important characteristic of a normalized database is that it has narrow tables with fewer columns per table.

There are rules for database normalization. Each rule is called a "Normal Form (NF)". If the first rule is implemented, the database is considered to be in "First Normal Form – 1NF".
If the first two rules are implemented, the database is considered to be in "Second Normal Form – 2NF". If the first three rules are implemented, the database is considered to be in "Third Normal Form – 3NF". Although additional levels of normalization are possible, third normal form is considered the highest level of normalization necessary for most applications.

Advantages of Data Normalization:
• Minimizes the amount of space required to store the data by eliminating redundant data
• Minimizes the risk of data inconsistencies within a database
• Minimizes the introduction of possible update and delete anomalies
• Maximizes the stability of the data structure
• Eliminates duplicate/redundant data from tables
• Reduces the chance of data anomalies/inconsistencies occurring in a database
• Provides efficient management of data and storage space
• Leaves less chance of making mistakes when editing data
• Makes effective use of storage
• Makes writing/inserting and updating data faster

Disadvantages of normalization:
• Data reads/retrieval are slower
• Requires writing more complicated queries to retrieve data

Advantages of denormalization:
• Data reads/retrieval are faster
• Requires writing less complicated queries to retrieve data

Disadvantages of denormalization:
• Less effective use of storage
• Writing/inserting and updating data is slower
• Duplicate data occupies more disk space
• Creates problems with data consistency
• Duplicate data creates the risk of logical data corruption by having a particular piece of information in more than one place; this happens mostly during data inserts, updates, and deletes

Based on this, there are three types of anomalies in a database:
• Update Anomalies
• Insertion Anomalies
• Deletion Anomalies
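The Customers/Orders discussion above can be sketched concretely. The following is a minimal illustration using Python's built-in sqlite3 module (SQL Server syntax differs slightly; the table and column names here are invented for the example). Because the customer's name lives in exactly one place, a name change is a single UPDATE and no order rows need touching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Customers (
    Customer_ID INTEGER PRIMARY KEY,
    Last_Name   TEXT NOT NULL)""")
conn.execute("""CREATE TABLE Orders (
    Order_ID    INTEGER PRIMARY KEY,
    Customer_ID INTEGER REFERENCES Customers(Customer_ID),
    Amount      REAL)""")

conn.execute("INSERT INTO Customers VALUES (1, 'Smith')")
conn.execute("INSERT INTO Orders VALUES (101, 1, 25.0)")
conn.execute("INSERT INTO Orders VALUES (102, 1, 40.0)")

# Normalized design: the name is stored once, so one UPDATE fixes it
# everywhere -- the orders still relate to the customer via Customer_ID.
conn.execute("UPDATE Customers SET Last_Name = 'Jones' WHERE Customer_ID = 1")

row = conn.execute("""SELECT c.Last_Name, COUNT(o.Order_ID)
                      FROM Customers c JOIN Orders o
                        ON c.Customer_ID = o.Customer_ID
                      GROUP BY c.Customer_ID""").fetchone()
print(row)  # ('Jones', 2)
```

Had Customer Name also been stored in Orders, the same change would have required updating every order row as well.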

Types of Normalization/Normalization Forms: There are different normalization forms. However, for commercial OLTP and OLAP databases, data modeling uses only the first three normalization forms; the others are used mainly for academic and research purposes. The normalization forms are 1NF, 2NF, 3NF, 4NF, 5NF, 6NF, etc. Among all of these, only 1NF, 2NF, and 3NF are used for data normalization in commercial databases. Edgar F. Codd is the inventor of the relational model, as he first defined the relational model of databases. Codd defined the first, second, and third normal forms (1NF, 2NF, and 3NF respectively).

Later on, Boyce and Codd defined the Boyce-Codd Normal Form (BCNF), a slightly stronger version of 3NF. The higher normal forms (4NF, 5NF, 6NF) were defined later. Generally, most databases adopt only up to third normal form, and in most cases a model that adopts 3NF will also satisfy the higher forms (but not always). When you are designing a database, you should spend enough time on this step so that the database is as normalized as possible. However, normalization can become a cost in terms of performance, since your application may need to retrieve data from more than one table: the more tables involved in retrieving data, the more time it takes to retrieve and process that data, whether selecting, inserting, deleting, or updating record sets. So, you may have to selectively denormalize a few tables. In some systems/designs, such as data warehousing, you keep the design denormalized for performance reasons, primarily because you typically do not update the data in a data warehouse.

A well-normalized database not only reduces data anomalies and redundancy; it also makes future modifications to the database much easier. For example, let's assume you have a table that stores 4 phone numbers, all stored in columns of a single row, i.e. one column for each phone number. What would happen if you now had to record a 5th phone number? You would have to modify the table and add a 5th column, and since not all users will have 5 numbers, you would have null values in several columns for most of your users.

Denormalization: Denormalization is the reverse process of normalization. In denormalization, we accept data redundancy in a table. Denormalization is a logical database design method that puts together data from multiple related tables into one table. An important characteristic of a denormalized database is that it has wide tables with more columns per table.
It helps to improve the performance of a query, because the query needs to read related data from fewer tables. Also, some calculated values can be kept in a column of a table, which reduces the overhead of performing the calculation with every query.


An example of denormalization is a database that stores the quarterly sales figures for each item as a separate piece of data. Denormalization is often used in data warehouses or in on-line analytical processing applications, meaning it is mostly used for OLAP databases.

Why Denormalize a Database: The only reason to denormalize a database is to increase performance. For example, in an OLAP (On-Line Analytical Processing) application, you could denormalize the database for frequently used reports by including pre-summarized numbers such as quarterly profit. Instead of calculating these values each time you need them, you can query the existing records again without recalculating. This type of denormalization works best when the data changes infrequently, as is often the case with historical data.

Entity: An entity is a logical collection of things that is relevant to a related group of data. The physical representation of an entity is a table. Name an entity in plural form, and use underscores between words if more than one word is used to name the entity. For example, an entity that contains data about your company's employees would be named Employees. As another example, an entity that contains data about customers' credit cards would be named Credit_Cards.

Derived Element: In some cases, we may have a data element in a row/record that we do not need to store. A data element/attribute that we do not need to store in an entity is called a derived element. For example, if we have quantity and price, we can easily get the total cost as Price * Quantity = Total-Cost. So, it may not be required to have Total-Cost as a data element/attribute, and this Total-Cost is called a derived element.

Table Constraints


There are three main constraints that need to be considered while creating tables:
1. PRIMARY KEY constraint
2. FOREIGN KEY constraint
3. DEFAULT constraint

Candidate key or keys: A candidate key is a key that uniquely identifies a record. Find a field or fields (known as data elements/attributes) that uniquely identify a record, meaning no other record could have the same value. Consider, say, the first name of a customer: it cannot be a candidate key, because more than one customer could have the same first name. So, we need to find something about a customer that could uniquely identify that customer. For example, we could consider a customer's SSN, driving license (USA, Canada, UK, etc.), Voter ID (e.g., in Bangladesh), or National ID (Bangladesh) as possible candidate keys, because no two persons should have the same value for any of these IDs. One of these candidate keys is later chosen as the PK (Primary Key).

Having more than one candidate key: Suppose Employee ID and SSN are two candidate keys. In this case, we need to find which one is used more frequently to find a record; it could be SSN or Employee ID, whichever is used more often. We then choose that candidate key as the Primary Key.

Primary Key (PK): A PK is a key that uniquely identifies a record/row in a table, so the primary key has a unique value for each record or row. Again, a PK is the value of a column, or of two or more columns, that uniquely identifies a row or record in a table.

A PRIMARY KEY constraint prevents duplicate values in its columns and provides a unique identifier for each record. In SQL Server, this constraint creates a clustered index on the columns by default, though you can choose a non-clustered index for a primary key instead.

Some examples of Primary Key:
• Social Security Number (SSN) (USA/Canada)
• Driving license (USA/Canada)
• Voter ID (Bangladesh)
• National ID (Bangladesh)
• Citizenship Certificate Number (USA/Canada)

Special Note: Nowadays, it is not recommended to use any of the above information to create a primary key, because of security issues/threats.

Key/Index Types: Based on the number of columns used in a key/index, a key/index is of two types:
1. Simple Key: a key composed of only one column
2. Composite Key: a key composed of more than one column

Foreign Key (FK): An FK is used to define a relationship between two tables. So, a foreign key is a referential constraint between two tables. The foreign key identifies a column or a set of columns in one table that refers to a column or set of columns in another table. So, a PK in one table may appear as an FK in another table. When a FOREIGN KEY constraint is added to an existing column or columns in a table, SQL Server by default checks the existing data in the columns to ensure that all values except NULL exist in the column(s) of the referenced PRIMARY KEY constraint. In short, a column value in a table that is used to match record(s) in another table is called an FK.

Examples of FK from real life:
• The Customer_ID column is first used as the primary key in the Customers table and then appears in another table, Orders. So, Customer_ID in the Orders table is considered a foreign key.
• The Product_ID column in the Sales table is considered a FOREIGN KEY because it references Product_ID from the Products table.
• In a payroll datamart/database, the employee ID in the employee table is used as the primary key, and then as a foreign key in another table called the pay detail table. This relationship allows finding detailed information about an employee and his/her salary from different tables.

Considerations for creating an FK/PK, or features of PK/FK:
• The FK and PK columns must have the same data type in both tables
• The FK and PK columns must have the same column length in both tables
• A PK cannot be NULL and must be unique
• An FK value must either be NULL or match an existing PK value in the referenced table
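The PK/FK relationship described above can be sketched with Python's built-in sqlite3 module (table names are invented for illustration; note that sqlite enforces foreign keys only when switched on per connection, while SQL Server enforces them by default). The engine rejects a sale that references a product that does not exist:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # sqlite-specific: enable FK checks
conn.execute("CREATE TABLE Products (Product_ID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""CREATE TABLE Sales (
    Sale_ID    INTEGER PRIMARY KEY,
    Product_ID INTEGER NOT NULL REFERENCES Products(Product_ID))""")

conn.execute("INSERT INTO Products VALUES (1, 'Widget')")
conn.execute("INSERT INTO Sales VALUES (10, 1)")   # OK: Product_ID 1 exists

# An orphan row -- a Product_ID with no matching PK -- is rejected.
rejected = False
try:
    conn.execute("INSERT INTO Sales VALUES (11, 99)")  # no product 99
except sqlite3.IntegrityError:
    rejected = True
print("orphan insert rejected:", rejected)
```

This is exactly the referential constraint the text describes: the FK column in Sales can only hold values that already exist in the referenced PK column of Products.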

UNIQUE Constraint:

A UNIQUE constraint enforces that no duplicate values are entered in a specific column. Although both the UNIQUE constraint and the PRIMARY KEY constraint enforce uniqueness, use a UNIQUE constraint instead of a PRIMARY KEY constraint when you want to enforce the uniqueness of a column that is not the primary key. Unlike a PRIMARY KEY constraint, a UNIQUE constraint allows NULL; in SQL Server it allows only one NULL value per column. As an example, we will ALTER the Products table to make Product_Name unique.

IDENTITY_INSERT: SET IDENTITY_INSERT allows explicit values to be inserted into the identity column of a table. At any time, only one table in a session can have the IDENTITY_INSERT property set to ON. If a table already has this property set to ON and SET IDENTITY_INSERT ON is issued for another table, SQL Server returns an error message stating that SET IDENTITY_INSERT is already ON for a table. SQL Server automatically uses the newly inserted value as the current identity value. As an example, we will create a new table with an identity column and use SET IDENTITY_INSERT to insert values without gaps.

Details of Normalization Forms/Rules: There are three normalization forms popularly used in practice.

First Normal Form (1NF): First Normal Form has two rules.

Rule 1: In First Normal Form (1NF), every entity has a primary key attribute, and each attribute must have only one value, not a set of values. In simple words: eliminate repeating groups. Remove duplicate/similar data from the same row of a table, store the repeating records in another table, and use a key to relate the records between the two tables. For a database to be in 1NF, it must not have any repeating groups; in a repeating group, a single record may hold multiple values for a given attribute.
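Since the UNIQUE and IDENTITY_INSERT examples promised above are not shown in these notes, here is a hedged sketch using Python's built-in sqlite3 module. sqlite does not support ALTER TABLE ... ADD CONSTRAINT, so a unique index stands in for the UNIQUE constraint (in SQL Server you would write ALTER TABLE Products ADD CONSTRAINT ... UNIQUE); likewise, supplying an explicit value for sqlite's INTEGER PRIMARY KEY is only a rough analog of SET IDENTITY_INSERT. Table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Products (
    Product_ID   INTEGER PRIMARY KEY,   -- auto-assigned, like an identity column
    Product_Name TEXT)""")
# Unique index plays the role of "ALTER TABLE ... ADD CONSTRAINT ... UNIQUE"
conn.execute("CREATE UNIQUE INDEX UQ_Product_Name ON Products(Product_Name)")

conn.execute("INSERT INTO Products (Product_Name) VALUES ('Widget')")
duplicate_rejected = False
try:
    conn.execute("INSERT INTO Products (Product_Name) VALUES ('Widget')")
except sqlite3.IntegrityError:
    duplicate_rejected = True          # duplicate name refused

# Rough SET IDENTITY_INSERT analog: supply the key value explicitly.
conn.execute("INSERT INTO Products (Product_ID, Product_Name) VALUES (100, 'Gadget')")
print("duplicate rejected:", duplicate_rejected)
```

In SQL Server the corresponding statements would be ALTER TABLE ... ADD CONSTRAINT ... UNIQUE and SET IDENTITY_INSERT Products ON/OFF around the explicit-value INSERT.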

Steps/How to: Convert the data into first normal form by adding another entity that represents the relationship between the original entity and the repeated values. The result of 1NF will be two tables from one table. First we gather all the requirements for the system to be designed. After gathering the requirements, we analyze them and find all the related attributes (columns). We create a separate table for each set of related attributes. After that, we find all the possible candidate keys for each table, and from those candidate keys we choose a primary key for each table.

Example 1: Recall the earlier example, i.e. you store 4 phone numbers for every user in 4 columns of the same row. Not all users will have all 4 numbers, resulting in lots of null values, and adding a 5th phone number would require modifying the table. So how do we solve this? You can create a table called PhoneNumbers with two columns:
• Name
• PhoneNumber

Now you can have as many phone numbers per user as you like. But 1NF is not done yet.

Rule 2: Use a unique value in each row to identify each record, and use this unique value to relate the records between the tables just created. Such a unique value is called a candidate key, and a candidate key will become the Primary Key. If there is more than one unique value, all are considered candidate keys, but only one becomes the primary key. It is like two or more people being candidates for president of a country: only one person becomes president. The president is the key – the primary key.

You could say that we can make PhoneNumber the primary key. But what if the same phone number is shared by two users? How about making Name and PhoneNumber together (a composite key) the primary key? Close, but what if two users who share the same phone number happen to have the same name? To make this table 1NF, we have to have a truly unique key. How about adding an identity field and making it the primary key? This ensures every record has a unique key and makes our table 1NF.
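The 1NF fix above can be sketched as follows, again using sqlite3 for illustration (the surrogate key column name and the sample data are invented). The repeating phone columns become rows in a separate table whose identity field guarantees uniqueness even when names or numbers are shared:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE PhoneNumbers (
    PhoneNumber_ID INTEGER PRIMARY KEY,   -- identity field: always unique
    Name           TEXT NOT NULL,
    PhoneNumber    TEXT NOT NULL)""")

# A user can have any number of phones, and two users can even share a
# number, without violating the key or adding columns to the table.
conn.executemany(
    "INSERT INTO PhoneNumbers (Name, PhoneNumber) VALUES (?, ?)",
    [("Alice", "555-0101"), ("Alice", "555-0102"),
     ("Bob",   "555-0101"), ("Alice", "555-0103")])

count = conn.execute(
    "SELECT COUNT(*) FROM PhoneNumbers WHERE Name = 'Alice'").fetchone()[0]
print(count)  # 3
```

Adding a 5th (or 50th) phone number is now just another INSERT, with no table modification and no null-filled columns.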

Second Normal Form (2NF): For tables to be in 2NF, they must satisfy 1NF, and all non-key attributes must fully depend on the primary key. Basically, reduce data redundancy by extracting redundant data and moving it to a new table. In simple words: eliminate/remove duplicate/redundant data that appears in the same column on multiple rows, store those records in a new table, and finally relate the data between the two tables using a unique value.

Steps/How to: Convert the table achieved from 1NF to second normal form by splitting the entity into two different entities, which removes the partial dependency in the original entity. The result will be two tables.

Example 1: Let's consider a Customers table that has the following columns:

Customers
• CustomerID
• FirstName
• LastName
• Address
• City
• State_Province
• PostalCode
• Country

Surely you will have more than one customer from the same city, even more from the same state, and definitely more from the same country. You may also have many customers with the same postal (zip) code. As you enter customers into this table, you are duplicating all of this data, i.e. City, State, PostalCode, Country, etc. This table is not in 2NF. To make it 2NF compliant, create a new table to store City, State, PostalCode, and Country. Let's call this table Addresses. It has the following columns:

Addresses
• Address
• City
• State
• PostalCode
• Country

What columns do you think could be redundant here? PostalCode surely won't be, but what about City? A city could have multiple postal codes. Definitely a state can have multiple cities, and a country will have multiple states. A good design for all the tables may be the following:

Customers
• CustomerID
• FirstName
• Middle_Name
• Last_Name
• AddressID

Addresses
• AddressID
• Address
• City
• PostalCode
• StateID
• CountryID

States
• StateID
• State
• State_Name

Countries
• CountryID
• CountryCode
• CountryName

Note that one table resulted in 4 different tables. It is definitely more complex, but more flexible. You also have the option to pre-populate the States and Countries tables in advance.
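The four-table design above can be sketched in sqlite3 (the sample rows are invented for illustration; SQL Server DDL would be essentially the same). Reassembling a full customer address becomes a three-way join along the key columns:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Countries (CountryID INTEGER PRIMARY KEY, CountryCode TEXT, CountryName TEXT);
CREATE TABLE States    (StateID INTEGER PRIMARY KEY, State TEXT, State_Name TEXT);
CREATE TABLE Addresses (AddressID INTEGER PRIMARY KEY, Address TEXT, City TEXT,
                        PostalCode TEXT,
                        StateID   INTEGER REFERENCES States(StateID),
                        CountryID INTEGER REFERENCES Countries(CountryID));
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, FirstName TEXT,
                        Middle_Name TEXT, Last_Name TEXT,
                        AddressID INTEGER REFERENCES Addresses(AddressID));

INSERT INTO Countries VALUES (1, 'US', 'United States');
INSERT INTO States    VALUES (1, 'OH', 'Ohio');
INSERT INTO Addresses VALUES (1, '1 Main St', 'Columbus', '43210', 1, 1);
INSERT INTO Customers VALUES (1, 'Jane', NULL, 'Doe', 1);
""")

# City, state, and country are each stored once; customers just point at them.
row = conn.execute("""SELECT c.Last_Name, a.City, s.State, co.CountryCode
                      FROM Customers c
                      JOIN Addresses a  ON c.AddressID = a.AddressID
                      JOIN States    s  ON a.StateID   = s.StateID
                      JOIN Countries co ON a.CountryID = co.CountryID""").fetchone()
print(row)  # ('Doe', 'Columbus', 'OH', 'US')
```

This shows the 2NF trade-off directly: the query is longer, but each city, state, and country value exists in exactly one row.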


This design also satisfies the second rule of 2NF, i.e. the related tables should be related by foreign keys: AddressID is a foreign key in Customers, and StateID and CountryID are foreign keys in the Addresses table.

Third Normal Form (3NF): For a table to be in 3NF, it must satisfy 1NF and 2NF, and every non-key attribute that relies on another non-key attribute (which in turn relies on the primary key) must be removed. In simple words: eliminate columns/non-key attributes that depend on another non-key attribute. A non-key attribute cannot depend on another non-key attribute; each non-key attribute must depend on the primary key (exactly as in 2NF).

So, a table is in 3NF if none of the non-key attributes in the entity relies on any other non-key attribute. Third Normal Form (3NF) is the form in which most well-designed commercial databases are kept.

Example 1: You have a payroll application that records the total hours an employee worked, the pay rate, and the total pay for the week. The following table schema could be used:

PayRoll
• PayRollID
• EmployeeID
• WorkDate
• HourWork
• PayRateID (Labor Code)
• Total

Let's examine whether this table satisfies 1NF and 2NF. The PayRoll table is associated with the Employee table via EmployeeID and with the PayRate table via PayRateID. It has no redundant or duplicate data in the same row, and each row has a primary key (PayRollID), so it satisfies 1NF. Its non-key attributes depend fully on that key, and it relates to the other tables through foreign keys (EmployeeID and PayRateID), so it satisfies 2NF as well.

Table: PayRoll
PayRollID | WorkDate | EmployeeID | HourWork | PayRateID | Total Salary
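To bring the PayRoll table into 3NF, the derived Total column is dropped and the total is computed at query time from HourWork and the rate. A minimal sqlite3 sketch (the rate value and dates are invented; a PayRates lookup table is assumed to hold the rate for each PayRateID):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PayRates (PayRateID INTEGER PRIMARY KEY, Rate REAL)")
conn.execute("""CREATE TABLE PayRoll (
    PayRollID  INTEGER PRIMARY KEY,
    EmployeeID INTEGER,
    WorkDate   TEXT,
    HourWork   REAL,
    PayRateID  INTEGER REFERENCES PayRates(PayRateID))""")
    # note: no Total column -- it would depend on HourWork and Rate,
    # i.e. on non-key attributes, violating 3NF

conn.execute("INSERT INTO PayRates VALUES (1, 20.0)")
conn.execute("INSERT INTO PayRoll VALUES (1, 7, '2024-01-05', 8.0, 1)")

# Total salary is derived on demand: HourWork * Rate.
total = conn.execute("""SELECT p.HourWork * r.Rate
                        FROM PayRoll p JOIN PayRates r
                          ON p.PayRateID = r.PayRateID
                        WHERE p.PayRollID = 1""").fetchone()[0]
print(total)  # 160.0
```

If the rate for a labor code ever changes, no stored totals can go stale, because none are stored.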

HourWork * PayRate = Total Salary is a calculated or derived value. Because Total Salary depends on other non-key attributes (HourWork and the pay rate), storing it in the table violates 3NF; it should be computed when needed instead.

ER Diagram (E-R Diagram): E-R diagram stands for Entity Relationship Diagram. An Entity Relationship Diagram (ERD) is a way to graphically describe the relationships between the data in tables. A relationship is established by matching the values of columns in two tables. Relationships describe the connection between two entities/tables. There are different types of relationships: for example, a relationship could be one-to-one, one-to-many, or many-to-many. A relationship works by matching data in key columns. A good practice is to use the same name for the connected columns of the tables. For example, the Job_ID field/column in the Employee table should have the same name in the Salary table, meaning the Salary table should also have a column named Job_ID. In most cases, the relationship matches the primary key from one table, which provides a unique identifier for each row, with an entry in the foreign key column of the other table.

Figure: E-R Diagram of DG_Retail_Store Database (Partial)


An E-R diagram helps to organize the data of a system into entities for any project. It also helps to define the relationships between the entities/tables. An E-R diagram helps a data analyst/modeler to produce a good database structure so that data can be stored and retrieved in the most efficient way. Again, an E-R diagram is a specialized graphical representation of the relationships between entities in a database. It is a diagram used in data modeling for relational databases; it shows the structure of each table, including column names, data types, lengths, constraints, and primary keys, as well as the links between entities.
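The Job_ID matching described above, which an E-R diagram draws as a connecting line between two entities, can be sketched in sqlite3 (the tables and sample data are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER PRIMARY KEY, Name TEXT, Job_ID INTEGER)""")
conn.execute("""CREATE TABLE Salary (
    Job_ID INTEGER PRIMARY KEY, Base_Pay REAL)""")
conn.execute("INSERT INTO Salary VALUES (10, 50000.0)")
conn.execute("INSERT INTO Employee VALUES (1, 'Rahim', 10)")

# The relationship is realized by matching the identically named Job_ID
# columns -- the PK in Salary against the FK column in Employee.
row = conn.execute("""SELECT e.Name, s.Base_Pay
                      FROM Employee e
                      JOIN Salary s ON e.Job_ID = s.Job_ID""").fetchone()
print(row)  # ('Rahim', 50000.0)
```

Keeping the connected columns identically named, as the text recommends, makes the join condition (and the E-R diagram) self-explanatory.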
