How to Use PowerCenter with Teradata to Load and Unload Data

How to Use PowerCenter with Teradata to Load and Unload Data © 2009-2011 Informatica Corporation. No part of this document may be reproduced or trans...
Author: Rosalyn Young
51 downloads 0 Views 875KB Size
How to Use PowerCenter with Teradata to Load and Unload Data

© 2009-2011 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica Corporation.

Abstract This article explains how to move data between PowerCenter and Teradata databases. It explains when to use Teradata relational connections, Teradata load and unload utilities, or pushdown optimization to move data. It also lists issues you might encounter when loading data to or unloading data from Teradata and the workarounds.

Table of Contents Overview ........................................................................................................................................................................... 3 Prerequisites ................................................................................................................................................................. 4 Teradata Relational Connections ...................................................................................................................................... 5 Creating a Teradata Relational Connection ................................................................................................................. 6 Standalone Load and Unload Utilities ............................................................................................................................... 6 Teradata FastLoad ....................................................................................................................................................... 7 Teradata MultiLoad....................................................................................................................................................... 7 Teradata TPump ........................................................................................................................................................... 7 Teradata FastExport ..................................................................................................................................................... 8 Teradata Parallel Transporter ........................................................................................................................................... 8 Pushdown Optimization .................................................................................................................................................... 9 Achieving Full Pushdown without Affecting the Source System ................................................................................. 12 Achieving Full Pushdown with Parallel Lookups ........................................................................................................ 13 Achieving Pushdown with Sorted Aggregation ........................................................................................................... 14 Achieving Pushdown for an Aggregator Transformation ............................................................................................ 14 Achieving Pushdown when a Transformation Contains a Variable Port .................................................................... 14 Improving Pushdown Performance in Mappings with Multiple Targets ...................................................................... 14 Removing Temporary Views when a Pushdown Session Fails .................................................................................. 15 Issues Affecting Pushdown Optimization ................................................................................................................... 16 Issues Affecting Loading to and Unloading from Teradata ............................................................................................. 17 Making 32-bit Load and Unload Utilities Work with 64-bit PowerCenter .................................................................... 17 Increasing Lookup Performance................................................................................................................................. 17 Performing Uncached Lookups with Date/Time Ports in the Lookup Condition ......................................................... 18 Restarting a Failed MultiLoad Job Manually ............................................................................................................... 19 Configuring Sessions that Load to the Same Table ................................................................................................... 19 Setting the Checkpoint when Loading to Named Pipes ............................................................................................. 20 Loading from Partitioned Sessions ............................................................................................................................. 20 Loading to Targets with Date/Time Columns ............................................................................................................. 20 Hiding Passwords ....................................................................................................................................................... 21 Using Error Tables to Identify Problems during Loading ............................................................................................ 21

2

Overview Teradata is a global technology leader in enterprise data warehousing, business analytics, and data warehousing services. Teradata provides a powerful suite of software that includes the Teradata Database, data access and management tools, and data mining applications. PowerCenter works with the Teradata Database and Teradata tools to provide a data integration solution that allows you to integrate data from virtually any business system into Teradata as well as leverage Teradata data for use in other business systems. PowerCenter uses the following techniques when extracting data from and loading data to the Teradata database: 

ETL (extract, transform, and load). This technique extracts data from the source systems, transforms the data within PowerCenter, and loads it to target tables. The PowerCenter Integration Service transforms all data. If you use the PowerCenter Partitioning option, the Integration Service also parallelizes the workload.



ELT (extract, load, and then transform). This technique extracts data from the source systems, loads it to userdefined staging tables in the target database, and transforms the data within the target system using generated SQL. The SQL queries include a final insert into the target tables. The database system transforms all data and parallelizes the workload, if necessary.



ETL-T (ETL and ELT hybrid). This technique extracts data from the source systems, transforms the data within PowerCenter, loads the data to user-defined staging tables in the target database, and further transforms the data within the target system using generated SQL. The SQL queries include a final insert into the target tables. The ELT-T technique is optimized within PowerCenter so that the transformations that better perform within the database system can be performed there and the Integration Service performs the other transformations.

To perform ETL operations, configure PowerCenter sessions to use a Teradata relational connection, a Teradata standalone load or unload utility, or Teradata Parallel Transporter. To use ELT or ETL-T techniques, configure PowerCenter sessions to use pushdown optimization. Use a Teradata relational connection to communicate with Teradata when PowerCenter sessions load or extract small amounts of data (1 GB per session). Standalone load and unload utilities can increase session performance by loading or extracting data directly from a file or pipe rather than running the SQL commands to load or extract the same data. All Teradata standalone load and unload utilities are fully parallel to provide optimal and scalable performance for loading data to or extracting data from the Teradata Database. PowerCenter works with the Teradata FastLoad, MultiLoad, and TPump load utilities and the Teradata FastExport unload utility. Use Teradata Parallel Transporter for PowerCenter sessions that must quickly load or extract large amounts of data (>1 GB per session). Teradata Parallel Transporter provides all of the capabilities of the standalone load and unload utilities, plus it provides more granular control over the load or unload process, enhanced monitoring capabilities, and the ability to automatically drop log, error, and work tables when a session starts. Teradata Parallel Transporter is a parallel, multi-function extract and load environment that provides access to PowerCenter using an open API. It can load dozens of files using a single control file. It also allows you to distribute the workload among several CPUs, eliminating bottlenecks in the data loading and extraction processes. Use pushdown optimization to reduce the amount of data passed between Teradata and PowerCenter or when the Teradata database can process transformation logic faster than PowerCenter. Pushdown optimization improves session performance by “pushing” as much transformation logic as possible to the Teradata source or target database. PowerCenter processes any transformation logic that cannot be pushed to the database. For example, pushing Filter transformation logic to the source database can reduce the amount of data passed to PowerCenter, which decreases session run time. When you run a session configured for pushdown optimization, PowerCenter translates the

3

transformation logic into SQL queries and sends the queries to the Teradata database. The Teradata database executes the SQL queries to process the transformation logic.

Prerequisites Before you run sessions that move data between PowerCenter and Teradata, you might want to install Teradata client tools. You also need to locate the Teradata TDPID.

Teradata Client Tools Teradata client tools help you communicate with the Teradata database and debug problems that occur when a session loads data to or extracts data from the Teradata database. You can install the following Teradata client tools: 

BTEQ. A general-purpose, command-line utility (similar to Oracle SQL*Plus) that enables you to communicate with one or more Teradata databases.



Teradata SQL Assistant. A GUI-based tool that allows you to retrieve data from any ODBC-compliant database server and manipulate and store the data in desktop applications. Teradata Queryman is the older version of this tool.

Install BTEQ or Teradata SQL Assistant to help you debug problems that occur when loading to and extracting from Teradata. Both tools are included in the Teradata Utility Pack, which is available from Teradata.

TDPID The Teradata TPDID indicates the name of the Teradata instance and defines the name a client uses to connect to a server. When you use a Teradata Parallel Transporter or a standalone load or unload utility with PowerCenter, you must specify the TDPID in the connection properties. The Teradata TDPID appears in the hosts file on the machines on which the Integration Service and PowerCenter Client run. By default, the hosts file appears in the following location: 

UNIX: /etc/hosts



Windows: %SystemRoot%\system32\drivers\etc\hosts* * The actual location is defined in the Registry key HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\DataBasePath

The hosts file contains client configuration information for Teradata. In a hosts file entry, the TDPID precedes the string “cop1.” For example, the hosts file contains the following entries: 127.0.0.1 localhost 192.168.80.113 td_1 192.168.80.114 td_2 192.168.80.115 td_3 192.168.80.116 td_4

demo1099cop1 custcop1 custcop2 custcop3 custcop4

The first entry has the TDPID “demo1099.” This entry tells the Teradata database that when a client tool references the Teradata instance “demo1099,” it should direct requests to “localhost” (IP address 127.0.0.1). The following entries have the same TDPID, “cust.” Multiple hosts file entries with the same TDPID indicate the Teradata instance is configured for load balancing among nodes. When a client tool attempts to reference Teradata instance “cust,” the Teradata database directs requests to the first node in the entry list, “td_1.” If it takes too long for the node to respond, the database redirects the request to the second node, and so on. This process prevents the first node, “td_1” from becoming overloaded.

4

Teradata Relational Connections Teradata relational connections use ODBC to connect to Teradata. PowerCenter uses the ODBC Driver for Teradata to retrieve metadata and read and write to Teradata. To establish ODBC connectivity between Teradata and PowerCenter, install the ODBC Driver for Teradata on each PowerCenter machine that communicates with Teradata. The ODBC Driver for Teradata is included in the Teradata Tools and Utilities (TTU). You can download the driver from the Teradata web site. Use a Teradata relational connection when extracting or loading small data sets, usually DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All data will be routed to the first partition.

If you do not route the data to a single file, the session fails with the following error: WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not support partitioned sessions. WRITER_1_*_1> Thu Jun 16 11:58:21 2005 WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.

For more information about loading from partitioned sessions, see the PowerCenter Advanced Workflow Guide.

Loading to Targets with Date/Time Columns Applies to: FastLoad, MultiLoad, TPump, Teradata PT The target date format determines the format in which dates can be loaded into the column. PowerCenter only supports a limited set of Teradata date formats. Therefore, you must check the target date format to avoid problems loading date/time data. When you create a date/time column in a Teradata database table, you specify the display format for the date/time values. The format you choose determines the format in which date/time values are displayed by Teradata client tools as well as the format in which date/time values can be loaded into the column. For example a column in a Teradata table has the date format “yyyy/mm/dd.” If you run a PowerCenter session that loads a date with the format “mm/dd/yyyy” into the column, the session fails. Before running a session that loads date/time values to Teradata, you verify that the format of each date/time column in the mapping matches the format of the corresponding date/time column in the Teradata target. If the session loads values into multiple date/time columns, check the format of each date/time column in the target because different tables often use different date/time formats. You can use Teradata BTEQ or SQL Assistant to check the format for a date/time column in a Teradata database. If any column in the Teradata target uses the “yyyyddd” date format (4-digit year followed by the 3-digit day), you must either redefine the date format in the Teradata table or convert the date to a character string in PowerCenter. Redefining the date format in the Teradata table does not change the way Teradata stores the date internally.

20

To convert a Teradata “yyyyddd” date column to a character column in PowerCenter: 1.

Edit the target table definition in PowerCenter and change the date column data type from “date” to “char(7).”

2.

Create an Expression transformation with the following expression to convert the date into a string with the format “yyyyddd”: to_char(date_port,’yyyy’) || to_char(date_port,’ddd’)

Note: The expression to_char(date_port, ‘yyyyddd’) does not work. 3.

Link the output port in the Expression transformation to the “char(7)” column in the target definition.

Hiding Passwords Applies to: FastExport, FastLoad, MultiLoad, TPump, Teradata PT When you create a loader or application (FastExport) connection object, you enter the database user name and password in the connection properties. The Integration Service writes the password in the control file in plain text and the Teradata loader does not encrypt the password. To prevent the password from appearing in the control file, enter “PMNullPasswd” as the password. When you do this, the Integration Service writes an empty string for the password in the control file. If you do not want to use “PMNullPasswd,” perform either of following actions: 

Lock the control file directory.



For load utilities, configure PowerCenter to write the control file to a different directory, and then secure that directory.

By default, the Integration Service writes the loader control file to the target file directory and the FastExport control file to the temp file directory. To write the loader control file to a different directory, set the LoaderControlFileDirectory custom property to the new directory for the Integration Service or session. For more information about setting custom properties for the Integration Service, see the PowerCenter Administrator Guide. For more information about setting custom properties for the session, see the PowerCenter Workflow Basics Guide. Finally, MultiLoad and TPump support the RUN FILE command. This command directs control from the current control file to the control file specified in the login script. Place the login statements in a file in a secure location, and then add the RUN FILE command to the generated control file to call it. Run “chmod -w” on the control file to prevent PowerCenter from overwriting it. For example, create a login script as follows (in the file “login.ctl” in a secure directory path): .LOGON demo1099/infatest,infatest;

Modify the generated control file and replace the login statement with the following command: .RUN FILE /login.ctl;

Using Error Tables to Identify Problems during Loading Applies to: FastLoad, MultiLoad, TPump When problems occur during loading data, the Teradata standalone load utilities generate error tables. (FastExport generates an error log file.) The load utilities generate different errors during the different phases of loading data. FastLoad jobs run in two main phases: loading and end loading. During the loading phase, FastLoad initiates the job, locks the target table, and loads the data. During the end loading phase, the Teradata database distributes the rows of data to the target table and unlocks it. FastLoad requires an exclusive lock on the target table during the loading phase. MultiLoad also loads data during two main phases: acquisition and application. In the acquisition phase, MultiLoad reads the input data and writes it to a temporary work table. In the application phase, MultiLoad writes the data from the work table to the actual target table. MultiLoad requires an exclusive lock on the target table during the application phase.

21

Tpump loads data in a single phase. It converts the SQL in the control file into a database macro and applies the macro to the input data. TPump uses standard SQL and standard table locking. The following table lists the error tables you can check to troubleshoot load or unload utility errors: Utility

Data Loading Phase

Default Error Table Name

Error Types

FastLoad

• Loading

• ET_

• Constraint violations, conversion errors, unavailable AMP conditions

• End loading

• UV_

• Unique primary index violations

• Acquisition

• ET_

• All acquisition phase errors, application phase errors if the Teradata database cannot build a valid primary index

• Application

• UV_

• Uniqueness violations, field overflow on columns other than primary index fields, constraint errors

n/a (single phase)

• ET_

• All TPump errors

MultiLoad

TPump

When a load fails, check the “ET_” error table first for specific information. The ErrorField or ErrorFieldName column indicates the column in the target table that could not be loaded. The ErrorCode field provides details that explain why the column failed. For MultiLoad and TPump, the most common ErrorCodes are: 

2689: Trying to load a null value into a non-null field



2665: Invalid date format

In the MultiLoad “UV_” error table, you can also check the DBCErrorField column and DBCErrorCode field. The DBCErrorField column is not initialized in the case of primary key uniqueness violations. The DBCErrorCode that corresponds to a primary key uniqueness violation is 2794. For more information about Teradata error codes, see the Teradata documentation.

Authors Chai Pydimukkala Senior Product Manager, Informatica Corporation Stan Dorcey Senior Product Specialist, Informatica Corporation Lori Troy Senior Technical Writer, Informatica Corporation

Acknowledgements The authors would like to thank Guy Boo, Ashlee Brinan, Eugene Ding, Anudeep Sharma, Lalitha Sundaramurthy, Raymond To, Rama Krishna Tumrukoti, Sonali Verma, Rajeeva Lochan Yellanki, and Marissa Johnston at Informatica for their assistance with this article. Additionally, the authors would like to thank Edgar Bartolome, Steven Greenberg, John Hennessey, and Michael Klassen at Teradata and Stephen Knilans and Michael Taylor at LoganBritton for their technical assistance.

22