Lab 1: Profile the parts.csv File In this lab you create a new process with profiling and study the results.

Before You Start The labs in this module rely on a specific file being present in the EDQ landingarea. The file that needs to be in the landingarea is called parts.csv. You can find the parts.csv file in the EDQ training data at ~edq_training_assets_12.1.3\Data Files\Case Management. 1. If you haven’t already done so: a) Set up an EDQ 12.1.3 training environment. Instructions for how to do this are online at http://tinyurl.com/edq-training-env. b) If you are using the pre-built virtual machine as your training environment, follow the instructions in section 1.8 of the Virtual Machine Installation Guide to set up a shared folder (the Virtual Machine Installation Guide is available online at http://www.oracle.com/technetwork/middleware/oedq/edq-vm-install-guide2424081.pdf). c) Download the EDQ training data from http://www.oracle.com/webfolder/technetwork/tutorials/tutorial/edq/spt/universalassets/edq_training_assets_12.1.3.zip. d) Unzip the edq_training_assets zip file. If you are using the pre-built virtual machine, unzip this to the C:\share folder on your host machine. 2. If you are using the pre-built virtual machine as your training environment, copy the parts.csv file from the ~edq_training_assets_12.1.3\Data Files\Case Management to the following folder on the virtual machine: /apps/oracle/fmw/user_projects/domains/edq_domain/edq/oedq.local.home/l andingarea To do this, move to the desktop of the virtual machine, and double click the shortcut to EDQ landingarea folder. The shortcut to EDQ landingarea window will be displayed. Next, double click the share folder. Within the share folder, double click the edq_training_assets_12.1.3 folder, then, within that folder, double click the Data Files folder and then double click the Case Management folder. Drag the parts.csv file from

Page 1 of 11

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

the Case Management folder to the shortcut to EDQ landingarea folder.

3. If you are not using the EDQ-12.1.3-Trn2 virtual machine, copy the parts.csv file from the ~edq_training_assets_12.1.3\Data Files\Case Management to your EDQ instance’s landingarea. Note that the landingarea is a sub-directory of the oedq_local_home directory. The location of oedq_local_home can vary, depending on your machine’s operating system and the type of installation you have carried out. On a Windows 7 machine, the default location of oedq_local_home is: C:\ProgramData\Oracle\Enterprise Data Quality\oedq_local_home.

Navigate to the Launchpad 1. If you are using the pre-built virtual machine as your training environment, open a web browser on your host machine and navigate to the following URL: http://localhost:8011/edq. 2. If you are not using the EDQ-12.1.3-Trn2 virtual machine, open a web browser and navigate to the EDQ launchpad. If you have installed EDQ locally on a Windows machine using the Quick Start installer, the URL is likely to be http://localhost:9002/edq.

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

Page 2 of 11

Note that the Launchpad may take some time to display when you have just started the virtual machine (this is because the EDQ server is starting). Be patient, and it should appear.

3. In your web browser, make the Launchpad web page a favorite or bookmark it so you can easily return to it in future.

Launch Director Note that to launch the Director user interface, you will need an up to date Java Runtime Environment (JRE) installed on your host computer. You can download a Java Runtime Environment from http://www.java.com.

1. On the Launchpad, click the Director link to launch the application. Allow any Java Web Start confirmation boxes that may appear so that the client software can be downloaded. If you are asked whether you want to run the application, click Run. If a dialog box warns you that the Windows Firewall has blocked some aspects of the program, click Unblock. If a dialog box warns you that dxi files already have an association, just click OK. Director is one of Oracle Enterprise Data Quality’s rich, Java Web Start user interfaces. Clicking the Director link from the Launchpad within your host machine causes the user interface to download via Java Web Start from the EDQ server. Once it has been downloaded, the user interface will invite you to login. And once you have done this, you will be able to interact with the EDQ server, (which may, for example, be running within the EDQ-12.1.3-Trn2

Page 3 of 11

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

virtual machine), via the Director user interface, (which is running on the host).

2. The Login to localhost dialog will open. Login with the following credentials: a. Username: dnadmin b. Password: dnadmin

dnadmin is a superuser. It has all the security privileges available in EDQ. In a live environment, in which many users may interface with the same EDQ server, most users will have a more restricted set of privileges. However, in the training environment, which is figuratively a sandpit, it will be useful for you to enjoy unrestricted access to all EDQ functionality.

The Director user interface opens:

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

Page 4 of 11

Create a Project and Stage Data 1. In the Project Browser, right-click Projects and select New Project...

2. Give the new project a name of Case Management Training and optionally add a description. Click Next. 3. Ensure that All Groups is selected in Project Permissions. This will enable any user to view and use the project. Click Finish to create the new project.

Add a Data Store 1. Expand the newly created Case Management Training Project by clicking the + symbol beside it, right-click Data Stores and select New Data Store… 2. In the Data is accessed from drop down leave Server selected and, in the Category drop down, select Text Files (also note what other options are available in these two drop downs).

Page 5 of 11

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

Ensure that you still have Server and Text Files selected in the two dropdowns before you move to the next step.

3. In the Type list select Server-based delimited Text files and click Next>:

4. In the Data Store Configuration screen, enter the following values*: a) File in server work area: parts.csv b) Treat first row as header: Select this check box c) Field Separator: Comma d) Quote character: Double quote e) Encoding: ISO-8859-1

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

Page 6 of 11

5. Click Test… to ensure that you can successfully connect to your new Data Store. A dialog box should report success. If it does, just click OK to dismiss it. If it reports failure, check the values that you have entered and try again. Once you have successfully tested the connection, click Next >. 6. Give the data store a name of Connection to PARTS Table CSV File and, optionally, add a description. Click Finish to save the Data Store. You have called the data store Connection to PARTS Table CSV File because the file you are connecting to contains an extract of data from a table called PARTS that we are going to assume exists in a fictional database.

Add a Snapshot 1. In the Project Browser, on the left-side of the Director UI, within your Case Management Training project, right-click Staged Data and select New Snapshot... 2. Select your newly created Connection to PARTS Table CSV File data store. This is where the data for the snapshot will come from. Click Next >. 3. In the Table Selection screen, ensure that the parts.csv is selected, and click Next >. 4. In the Column Selection screen, leave all the columns selected and just click Next >. 5. In Sampling Options screen, leave the default All option selected and just click Next > to continue. 6. In the No Data Handling screen leave the ‘No Data’ Reference Data field blank, and then click Next >. 7. In the Snapshot Name screen, change the name of your new snapshot to PARTS Table and, leaving the Run now check box selected, click Finish to create the snapshot. Notice that after a short delay whilst the parts.csv file is read, the Results Browser is populated with the data from the file. Taking the Snapshot causes Enterprise Data Quality to stage the data from the parts.csv file – meaning that a copy of the data from the file is placed in Enterprise Data Quality repository.

Profile the Data from parts.csv 1. In the Project Browser, right-click Processes and select New Process... 2. Processes start with a Reader and this needs a source for its data. Select your PARTS Table snapshot and click Next >. 3. In Analysis select the Add Profiling checkbox. 4. Ensure all of the profiling checkboxes are selected and then click Next >.

Page 7 of 11

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

5. Give your process a name of Profile the PARTS table and click Finish to create the process.

6. Click at the top of the Canvas to run the process. 7. Answer the following questions:

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

Page 8 of 11

Quickstats Profiler 1. Which field is most likely to be the table’s primary key?

2. Comment on the distinct values you find in the UMCODE field?

Patterns Profiler 7. What do you observe about the format of the PARTID field?

Page 9 of 11

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

It is good practice to begin every data quality project by profiling in order to understand the data you are dealing with. In order to save time, the lab only asked you to answer questions about three of the fields by using two of the profiling processors. In reality, you should interrogate all of the fields using a wider range of the available profiling processors. For the purposes of training, let’s just note these points:  The UNIQUEID attribute contains no duplicate values and is likely to be the table’s primary key.  The UMCODE attribute has the following distinct values:

By drilling down on the underlying data records, we can make educated guesses about the meaning of some of the codes: for example, KT might stand for ‘kit’, and PK might stand for ‘pack’. o Some records are without data in the UMCODE attribute, and there are also several distinct values that appear very infrequently. These have no obvious meaning, and may be typos. For example, perhaps PV and PX may be typos for PC (the letters X and V are either side of C on a QWERTY keyboard, so this would be an easy mistake to make). The PARTID attribute seems to suffer from a number of inconsistent formats. o



Copyright © 2015, Oracle and/or its affiliates. All rights reserved.

Page 10 of 11

We find 14 different patterns. 6 of the patterns have a length of 5 characters and are alpha-numeric. The other patterns are of varying lengths, and most include either whitespace or punctuation. For the purposes of training we want to keep things simple, so we will design an audit process to address the points noted above only. In a real-world setting, our audit process might be more complex than this. o

Page 11 of 11

Copyright © 2015, Oracle and/or its affiliates. All rights reserved.