Data Profiling for ETL Processes

Maunendra Sankar Desarkar
IIT Kanpur

Abstract

The ultimate success of an ETL process rests on the delivery of correct, reliable, relevant and complete information to end users. Hence data cleaning is an important part of any ETL process. In this report, we look at some common errors in data stored in databases and describe the functionality of our tool, which helps discover rules for filtering out such dirty data. We also present some ideas on how to develop custom operators that can perform data cleaning operations on the Europa platform. Such operators will be of immense help in designing ETL processes using Europa.

Keywords: ETL, data cleaning, data profiling, Europa operator

1 Introduction

Any company's profitability depends on its ability to make sound business decisions, based on a complete, accurate view of its customers, suppliers and transactions. However, this information is rarely maintained in a single place; it is scattered throughout the enterprise, across multiple departments, divisions and applications. In addition, company mergers and acquisitions bring diverse operational systems together. Hence these organizations need a Business Intelligence platform that can consolidate data from multiple locations and deliver it as a single, trustworthy source of information. The organization can then do reporting, querying and analysis, manage performance and take sound business decisions. However, the data present in the consolidated repository must be error-free for the decisions to be helpful to the enterprise. As a consequence, any data integration process must include a data cleaning phase. Obtaining a profile, i.e. a set of business rules that are followed by the data, can help in data cleaning. ETL processes aim at data integration; hence, data profiling is an important part of ETL processes. In this report we discuss some of these rules, and we also discuss how to perform data cleaning with their help.

The organization of the report is as follows. In Section 2, we give a brief introduction to ETL. Section 3 contains some examples of dirty data. Section 4 introduces some concepts of data profiling. In Section 5 we describe the tool that we developed for data profiling. We introduce Europa in Section 6 and describe the steps required for developing Europa custom operators in Section 7. In the next section we mention possible future work. Finally, we conclude in Section 9.

2 ETL

ETL stands for extract, transform and load. The entire process is divided into three phases.

• Extract: The first phase of an ETL process is to extract data from the source systems. Most data warehousing projects consolidate data from different source systems, and each system may use a different data organization format. Source data may reside in relational databases, text files (standard and delimited), XML, COBOL copybooks, etc. Additionally, the files may use different data representation formats.

• Transform: After extraction, the data is transformed, or modified, according to the specific business logic involved so that it can be sent to the target repository. There are a variety of ways to perform the transformation, and the work involved varies. Some examples are (a small sketch of such transformations follows at the end of this section):
  – Derive a new calculated value (finalprice = originalprice * (1 - discount))
  – Summarize multiple rows of data (e.g., find the total number of employees in a branch)
  – Join data from multiple sources (e.g., get the names of the employees who work in the location where a particular product is developed)
Before reformatting the data for use by the target schema, most ETL projects perform data cleaning to remove inconsistencies present in the source data. This is done during the transformation phase.

• Load: The load phase transports and loads the data into the data warehouse. Depending on the requirements of the organization, this process varies widely. Some data warehouses simply overwrite old information with new data, while others load data incrementally. More complex systems maintain a history and an audit trail of all changes to the data.

Several products are available in the market for preparing and executing ETL applications; examples include IBM WebSphere DataStage (IBM) and PowerCenter (Informatica).
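To make the transformation examples above concrete, the following is a minimal sketch that applies a derived-value transform and an aggregation using plain SQL over JDBC. The connection URL, the driver, and the table and column names (orders, staging_orders, employees, branch_summary) are hypothetical and only illustrate the idea.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TransformSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://host:50000/SAMPLE", "user", "password");
             Statement st = con.createStatement()) {

            // Derive a new calculated value: finalprice = originalprice * (1 - discount).
            st.executeUpdate(
                "INSERT INTO staging_orders (order_id, final_price) " +
                "SELECT order_id, original_price * (1 - discount) FROM orders");

            // Summarize multiple rows: total number of employees per branch.
            st.executeUpdate(
                "INSERT INTO branch_summary (branch_id, emp_count) " +
                "SELECT branch_id, COUNT(*) FROM employees GROUP BY branch_id");
        }
    }
}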

3 Dirty data - some examples

A taxonomy of dirty data can be found in [KCH+03]. In this section we give some examples of different classes of dirty data (a small sketch of simple checks for some of them follows the list).

1. Wrong data due to non-enforcement of enforceable integrity constraints:
   • Duplicate entries instead of unique entries
   • Null values where null is not expected

2. Unwanted data, for which constraints cannot be specified:
   • Age is not updated, so Current age ≠ Current date − Date of Birth
   • Work experience < 0, which is out of the possible range

3. Spelling errors

4. Entry into wrong fields
   • Address entered in place of name

5. Use of abbreviations
   • T. J. Watson and Thomas J. Watson are both used as entries
   • st used in the address field instead of street

6. Different ordering
   • The same name entered in different orders, e.g., first name followed by last name in one record and last name followed by first name in another
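Checks for the second class of errors are straightforward once the underlying rule is stated explicitly. The following is a minimal sketch of two such record-level checks; the field names (dateOfBirth, storedAge, workExperienceYears) are hypothetical.

import java.time.LocalDate;
import java.time.Period;

public class RecordChecks {

    // Age consistency: the stored age should match the age derived from the date of birth.
    static boolean isAgeConsistent(LocalDate dateOfBirth, int storedAge) {
        int derivedAge = Period.between(dateOfBirth, LocalDate.now()).getYears();
        return derivedAge == storedAge;
    }

    // Range check: work experience cannot be negative.
    static boolean isExperienceInRange(double workExperienceYears) {
        return workExperienceYears >= 0;
    }

    public static void main(String[] args) {
        System.out.println(isAgeConsistent(LocalDate.of(1980, 5, 17), 30)); // stale age -> false
        System.out.println(isExperienceInRange(-2));                        // out of range -> false
    }
}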

4 Data Profiling

When we say that some data present in the source is dirty, we mean that some rule which the data is supposed to follow is violated. In data profiling, we try to find out those “rules”. Profiling thus allows us to analyze the data values to find areas that are incomplete, inaccurate or ambiguous. It can also verify relationships across columns and tables. If the profiling stage can find the integrity constraints followed by a column, then the first error listed in the previous section can be eliminated. Similarly, if it can find the range of data for a column (where applicable), then the second error in the list can be eliminated. Removal of the other kinds of errors listed in the previous section requires more complex rules to be derived from the data.

5 Functionality of our tool

The tool that we developed for the data profiling task can discover the following rules from the database.

1. Does the column allow null values?
2. Is it a unique key in its table?
3. Is it a categorical field? If yes, what is the domain?
4. Find the acceptable range for the column entries.
5. Find primary key - foreign key (PK-FK) relationships.

The application allows the user to select a set of columns to analyze, as well as the rules he wants to check. It then derives the rules by looking at the entries present in those columns, as sketched in the example below.

For checking nullability, it counts the null entries in the column. If there is at least one null entry, the column is taken to allow null values, since we assume the data source to be clean. If there is no null value, the column is reported as non-nullable. For checking uniqueness, the application counts the distinct elements in the column. If the number of distinct elements is the same as the number of entries in the column, all elements are distinct and the column is declared a unique key. For the third rule, it computes the ratio of the number of distinct elements to the total number of elements. If this ratio is very low, the average frequency of the elements is very high and the column may be a categorical field; when the ratio falls below some predefined threshold, the application declares it a categorical field, and the domain of acceptable values is the set of distinct elements in the column. We check this rule only for columns of char or varchar type that are of reasonably small length (length ≤ some predefined threshold). The PK-FK check is performed only if both columns are non-nullable and of type integer or smallint. The application first checks whether the range of the first column is completely contained in the range of the second column; if it is not, no PK-FK relationship is reported. Otherwise it checks for an inclusion dependency of the first column in the second column, and if this check succeeds, the application reports a primary key - foreign key relationship between the two columns.

The rules that the application outputs after analyzing the columns may not all hold in the context in which the database is used. Essentially, the application provides the expert (user) with a narrowed-down set of candidate rules that may exist in the database. For example, the user may decide that the reported range is not exactly right, yet it still gives a good idea of where the range lies. In this way the tool helps the user find the rules that are followed by the entries in the database.
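The following is a minimal sketch of how these checks can be expressed over JDBC with plain SQL. The connection URL and the table and column names (employees.dept_id as the column being profiled, departments.id as the PK-FK candidate) are hypothetical, and the categorical threshold is an arbitrary illustrative value.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ProfilingChecks {

    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://host:50000/SAMPLE", "user", "password");
             Statement st = con.createStatement()) {

            // Nullability: at least one NULL means the column allows nulls (clean-source assumption).
            long nulls = scalar(st, "SELECT COUNT(*) FROM employees WHERE dept_id IS NULL");
            System.out.println("nullable: " + (nulls > 0));

            // Uniqueness: distinct count equal to the row count means every value is distinct.
            long rows = scalar(st, "SELECT COUNT(*) FROM employees");
            long distinct = scalar(st, "SELECT COUNT(DISTINCT dept_id) FROM employees");
            System.out.println("unique key: " + (distinct == rows));

            // Categorical field: a low distinct/total ratio suggests a small domain of repeated values.
            double ratio = (double) distinct / rows;
            System.out.println("categorical: " + (ratio < 0.01)); // 0.01 is an illustrative threshold

            // Acceptable range: minimum and maximum observed values.
            try (ResultSet rs = st.executeQuery("SELECT MIN(dept_id), MAX(dept_id) FROM employees")) {
                rs.next();
                System.out.println("range: [" + rs.getLong(1) + ", " + rs.getLong(2) + "]");
            }

            // Inclusion dependency for the PK-FK check: no dept_id value outside departments.id.
            long violations = scalar(st,
                "SELECT COUNT(*) FROM employees e WHERE e.dept_id IS NOT NULL AND NOT EXISTS " +
                "(SELECT 1 FROM departments d WHERE d.id = e.dept_id)");
            System.out.println("PK-FK candidate: " + (violations == 0));
        }
    }

    private static long scalar(Statement st, String sql) throws Exception {
        try (ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}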

6 Europa - a platform to design ETL applications

Europa is a platform for developing ETL applications [Sria]. It provides the user with a set of operators, which can be used to create ETL jobs. Operators are of three types:

• Source operators: sources of data. Examples: database tables, flat files, JDBC/ODBC sources, etc.

• Transform operators: operators that accept data from sources or other transform operators, process it in some fashion and send the output to other transform operator(s) or target operator(s). Examples: sort, filter, union, intersection, etc.

• Target operators: the targets where data is stored after processing. They generally consume data generated by transform operators. Examples: database tables, flat files, JDBC/ODBC targets, etc.

Operators accept input and provide output through ports. The number of input and output ports that an operator has can vary depending on how the operator is used. Besides input and output ports, operators have a set of properties, whose values are set by the ETL project designer to govern the behavior of the operator.

Any Europa ETL project contains a control flow and one or more data flows. A data flow is a directed graph of operator nodes interconnected by links that indicate the ETL data transformation sequence; there can be multiple data flows in a single project. Control flows, on the other hand, describe the sequence in which the different data flows are executed, and how error recovery, notification and compensation activities are organized. Control flows do not deal with how exactly data is transformed - that is described in the individual data flows. The application developer creates the data flows and control flows, which give a high-level description of the project. Data flows are saved in the Europa data flow XML format, while control flows are expressed in BPEL.


This code is generated automatically by Europa. Once the code for the control flow has been generated, the application can be run on WebSphere Application Server.

7 Developing custom operators for Europa

Europa is built on the Eclipse plug-in architecture, so its functionality can be extended with custom plug-ins. Users can build their own custom operators and place them in an operator library. Once the operator library is registered with the data flow system, the operators contained in that library become available for use. Operators and operator libraries are described in XML format. Below we discuss how operator libraries and custom operators can be created [Sria], [Srib].

7.1 Add an operator library to the Plug-in

Create an operator library and place your operator there.

Step 1: Create a new plug-in project. It will have a plugin.xml file.

Step 2: Add dependencies to the plug-in. The dependencies vary according to the task that the operator is supposed to perform, and each dependency adds one import entry to the plugin.xml file. For example, after adding the com.ibm.datatools.etl.dataflow.core and com.ibm.datatools.etl.dataflow.base.oplib plug-ins as dependencies to your plug-in, the plugin.xml file will contain an import entry for each of these two plug-ins.

Step 3: Create an operator library. To do this, choose the Extensions tab in the plugin.xml editor, add a new extension, and choose the operatorLibraries extension. Add an operator library to the operatorLibraries extension and fill in the libFileName attribute with the name of the operator library file that you are going to create; the path of the file name is relative to the plug-in's home directory. After the file name is added, the plugin.xml file will contain the corresponding operator library entry (suppose the operator library file name is sampleOp.oplib).

Step 4: Edit the oplib file. It must have a unique nsURI attribute.

Step 5: Create a category. Operators and their details are placed in categories; in our example we create a category named sampleCat.

Step 6: Design the operator. Select the category where you want to put it and place the details of the operator inside the opening and closing tags of that category. The name and the codeGeneratorClass of the operator are mandatory in its description. You also have to specify its inputs, outputs, properties and parameters, wherever applicable.

• name is the identity by which the platform and the ETL application developer will know and refer to the operator.

• The className attribute of codeGeneratorClass denotes the name of the Java class that will generate the code for this operator. This class is to be created by you.

In our running example we design a sampleOp custom operator whose code is generated by the sampleOp class; the path of the sampleOp class is org\europa\sampleOp\codegen\sampleOp (the operator name and the code generator class name can also be different).

• The input and output attributes define the input and output ports of the operator. One input port can accept data from one operator only, whereas one output port can be connected to multiple operators. The number of input and output ports that an operator can have in a data flow can be limited by setting the lowerBound and upperBound of the input and output. upperBound = -1 means there is no limit on the upper bound, and you can open as many ports as you want.


• A property of an operator is analogous to a parameter of a function call in a high-level program. The designer can set the value of the property, and that value is used by the operator to perform its functionality. For example, the File Export operator has a property called File Name; this value can be set according to the needs of the user and can be used in the code generation phase. Properties may be of different ‘data types’, for example Integer, List, String, Expression, Database Table, File, etc. You can also define custom property types. A GUI “editor” can be associated with each property type. This editor class and other artifacts are invoked in the Europa Data Flow Design editor: whenever an operator has a property of this type, one can enter the value of the property from that editor. Pre-defined property types already have property editors defined, and similar editors can be defined by custom operator developers. A property is declared in the operator description alongside the operator's inputs and outputs.

• Properties, inputs and outputs can be declared to depend on each other in the operator definition. In some cases, a change in one (the cause) can be used to cause changes in its dependent objects. Such dependencies are specified with dedicated dependency tags in the operator definition. For example, the Filter operator allows the user to select the input table and a ‘filterExpression’. This ‘filterExpression’ is a property of type condition; it has a condition property editor associated with it, where you can enter the condition for filtering. The condition property editor shows a list of the available columns from the input table, so the ‘filterExpression’ property depends on the input table. This dependency is captured in the description of the Filter operator.

The ‘name’ attribute of the param tag gives the name of the input, output or property on which this property depends, and the ‘type’ attribute denotes whether that dependency target is an input, an output or a property.

Step 7: Create the code generator class. Here is an example code generator class.

import java.util.List;
import java.util.Vector;

import com.ibm.datatools.etl.codegen.IGenericCodeGenerator;
import com.ibm.datatools.etl.codeunit.Action;
import com.ibm.datatools.etl.codeunit.CodeType;
import com.ibm.datatools.etl.codeunit.CodeUnit;
import com.ibm.datatools.etl.common.CodeGenOptions;
import com.ibm.datatools.etl.common.Phase;
import com.ibm.datatools.etl.dataflow.Operator;
import com.ibm.datatools.etl.util.CUHelper;

public class codeGenForSampleOp implements IGenericCodeGenerator {

    Operator opInst = null;
    CodeGenOptions options = null;
    CodeUnit cu = null;

    public boolean init(Operator op, CodeGenOptions opts) {
        // Remember the operator instance and the code generation options.
        opInst = op;
        options = opts;
        System.out.println("DEBUG:== operator name = " + op.getItemName()
                + ", label = " + op.getItemLabel());
        return false;
    }

    public List getCodeUnits() {
        Vector cuList = new Vector();
        // Create a SQL-script code unit that is executed at runtime.
        CodeUnit customOpCodeUnit = CUHelper.createCU(
                CodeType.SQLSCRIPT_LITERAL,
                Action.EXECUTION_LITERAL,
                Phase.RUNTIME_LITERAL);
        // Example of reading an operator property set by the ETL designer.
        String fileName = (String) opInst.getPropertyValueRef("fileName");
        cuList.add(customOpCodeUnit);
        return cuList;
    }
}

In our running example, the operator library is named sampleLib and has a single custom operator, sampleOp, defined in it.



7.2 Define design time GUI artifacts

Once the backbone of the operator is defined, the next task is to make the custom operator appear in the operator palette in the Europa runtime workbench.

Step 1: Create another plug-in project. Add the com.ibm.datatools.etl.dataflow.ui and com.ibm.datatools.etl.dataflow.base.oplib.ui plug-ins as dependencies.

Step 2: Add the com.ibm.datatools.etl.properties.ui.OperatorLibraryPresentation extension and enter the OpLibPresentation file name in the plugin.xml file. The file name must have a .prxmi extension. After this, the plugin.xml file will contain the corresponding presentation entry (suppose the OpLibPresentation file is named sampleOp.prxmi).


Step 3: Create and edit the presentation file (the one with the .prxmi extension). Add a header and footer to it, along with an empty palette category.

Step 4: Add a palette entry for the custom operator and place it inside the scope of a category. The icon files have to be placed in the icons subdirectory under the UI plug-in directory. Be careful about the icon names: the type and the name of each icon file should be exactly as specified in the prxmi file.

In the palette entry, the keys "sampleOps_category_name", "sampleOps_category_desc", "sampleOpName" and "sampleOp_desc" are analogous to variables in programming-language terminology. Their exact values are defined in the plugin.properties file. Edit that file (create it first if it does not exist) and enter the values of the above-mentioned keys.

sampleOps_category_name=My custom operators
sampleOps_category_desc=This category contains my own operators
sampleOpName=sample operator
sampleOp_desc=This is a sample operator

Now the operator is ready for use. You can create a runtime workbench and execute it, and you will see the operator in the operator palette.


8 Future Work

The rules that our application can discover can be used to filter some classes of dirty data. However, many more complex rules are common in databases, for example an ordering relationship between two columns or an algebraic relationship among two or more columns. Sometimes the value in one column restricts the range or domain of another column. Once such rules can be discovered from the source data, the power of the data cleaning phase will increase considerably. Focus can also be given to building Europa operators with more complex processing capabilities. Currently the ETL process designer has to draw the data flow and control flow diagrams; this work can be reduced if data flows can be generated automatically.
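As an illustration of the kind of inter-column rule mentioned above, the following minimal sketch counts the rows that violate a candidate algebraic relationship (total = price * quantity). The connection URL, the table order_items and its columns, and the rounding tolerance are all hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AlgebraicRuleCheck {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://host:50000/SAMPLE", "user", "password");
             Statement st = con.createStatement();
             // Count rows that violate the candidate rule total = price * quantity,
             // allowing a small tolerance for rounding.
             ResultSet rs = st.executeQuery(
                 "SELECT COUNT(*) FROM order_items " +
                 "WHERE ABS(total - price * quantity) > 0.01")) {
            rs.next();
            long violations = rs.getLong(1);
            // If few or no rows violate the rule, it becomes a candidate rule for the expert to review.
            System.out.println("rows violating the candidate rule: " + violations);
        }
    }
}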

9 Conclusion

In this report we have discussed why data profiling is needed for ETL processes. We have seen some classes of dirty data, described the functionality of our data profiling tool and explained how it comes up with its rules. These rules (or modified versions of them) can be used to eliminate some types of dirtiness; coming up with more sophisticated rules would ensure that data of even higher quality is sent to the target data repository. We have also described how to create custom operators for the Europa platform and make them available for creating data flows in Europa.

Acknowledgement

I am thankful to my mentor, Natwar Modani, for his help, support and guidance throughout this work. I am indebted to Mr Mukesh Mohania for many fruitful discussions. I have been very lucky to interact with other employees at IBM India Research Lab, and I thank them for their suggestions and constant encouragement. Special thanks to Mr Dinkar Rao for explaining everything in minute detail whenever I sought help from him.

References

[KCH+03] Won Y. Kim, Byoung-Ju Choi, Eui Kyeong Hong, Soo-Kyung Kim, and Doheon Lee. A taxonomy of dirty data. Data Min. Knowl. Discov., 7(1):81-99, 2003.

[Sria] Sriram Srinivasan. Europa developer note.

[Srib] Sriram Srinivasan. IBM DWE ETL (Europa).

