Enterprise Data Aggregation: Pentaho PDI Server Clustered setup on AWS

In this paper, we provide a brief introduction to the Pentaho PDI ETL tool and its advantages over traditional big data tools such as Pig or Informatica, and then focus primarily on how to set up a scalable ETL infrastructure on Amazon Web Services that can handle demanding ETL loads and SLA requirements.

Pentaho Data Integration Server Pentaho Data Integration (PDI) is an extract, transform, and load (ETL) solution that uses an innovative metadata-driven approach (see below for more details). It includes an easy-to-use, graphical design environment for building ETL jobs and transformations, resulting in faster development, lower maintenance costs, interactive debugging, and simplified deployment. PDI’s multi-threaded, scale-out architecture provides performance tuning and scalability options for handling even the most demanding ETL workloads. PDI additionally provides big data capabilities such as MapReduce and limited autoscaling (with AWS CloudWatch or a similar monitoring tool). Furthermore, the PDI server is the open-source edition of the Pentaho integration suite and has a large support community. For organizations that don’t have an industrial-strength ETL tool or are just venturing into big data processing, PDI is a highly cost-effective option.

Multi-threaded Architecture PDI’s streaming engine architecture provides the ability to work with extremely large data volumes, and delivers enterprise-class performance and scalability with a broad range of deployment options, including dedicated, clustered ETL servers that can be hosted in the cloud, on premises, or in a hybrid setup. The architecture allows both vertical and horizontal scaling. The engine executes tasks in parallel across multiple CPUs on a single machine as well as across multiple servers via clustering and partitioning.

Transformation Processing Engine Pentaho Data Integration’s transformation processing engine starts and executes all steps within a transformation in parallel (multi-threaded), allowing maximum usage of available CPU resources. Enabled by default, this allows processing of an unlimited number of rows and columns in a streaming fashion. Furthermore, the engine is 100% metadata-driven (no code generation), resulting in reduced deployment complexity. PDI also provides different processing engines that can be used to influence thread priority or limit execution to a single thread, which is useful for parallel performance tuning of large transformations. Additional tuning options include the ability to configure streaming buffer sizes, reduce internal data type conversions (lazy conversion), leverage high-performance non-blocking I/O (NIO) to read large blocks at a time and read files in parallel, and run multiple step copies to optimize Java Virtual Machine multi-thread usage.

CLUSTERING AND PARTITIONING Pentaho Data Integration provides advanced clustering and partitioning capabilities that allow organizations to scale out their data integration deployments. Pentaho Data Integration clusters are built for increasing the performance and throughput of data transformations; in particular, they are built to perform classic “divide and conquer” processing of data sets in parallel. PDI clusters use a strict master/slave topology: there is one master per cluster, but there can be many slaves. This scheme can be used to distribute the ETL workload in parallel across these multiple systems. Transformations are split according to the master/slave topology and deployed to all servers in the cluster, where each server runs a PDI engine to listen for, receive, execute, and monitor transformations. It is also possible to define dynamic clusters, where the slave servers are only known at run-time. This is very useful in cloud computing scenarios where hosts are added or removed at will.

Pentaho ETL Clustering on Amazon EC2
1. Setting up Amazon EC2 instances
2. Setting up Carte instances
3. Spoon Configuration (Master Server)
   - Connecting to repository
   - Adding master and slave servers
   - Configuring cluster schema
   - Configuring partition schema
4. Transformation design and execution

1. Setting up Amazon EC2 instances Amazon Elastic Compute Cloud (Amazon EC2) is a web service that provides resizable compute capacity in the cloud. Amazon EC2’s simple web service interface allows you to obtain and configure capacity with minimal effort. It provides you with complete control of your computing resources and lets you run on Amazon’s proven computing environment. Amazon EC2 reduces the time required to obtain and boot new server instances to minutes, allowing you to quickly scale capacity, both up and down, as your computing requirements change. Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use.

For this scenario we:
- Provisioned three Windows Server 2012 micro instances
- Tagged them as Master, Slave 1, and Slave 2
- Installed the latest Java and set the PENTAHO_JAVA_HOME environment variable
- Created a shared folder on the master and mapped it on all the servers
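As a reference, the instances can also be provisioned and tagged from the AWS CLI. This is only a sketch; the AMI ID, key pair, security group, and instance IDs below are placeholders and not taken from the original setup:

    # Placeholder AMI, key pair, security group, and instance IDs -- replace with your own
    aws ec2 run-instances --image-id ami-xxxxxxxx --count 3 --instance-type t2.micro --key-name pdi-cluster-key --security-group-ids sg-xxxxxxxx

    # Tag the three instances so they are easy to identify
    aws ec2 create-tags --resources i-11111111 --tags Key=Name,Value=Master
    aws ec2 create-tags --resources i-22222222 --tags Key=Name,Value=Slave1
    aws ec2 create-tags --resources i-33333333 --tags Key=Name,Value=Slave2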

2. Setting up Carte Instances Carte is a major component when building clusters because this program acts as middleware between the master and slave servers. The slaves register themselves with the master, notifying it that they are ready to receive tasks to process. We can also reach the Carte web service to remotely monitor, start, and stop transformations and jobs.
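As an illustration of that web service, a running Carte instance exposes a status page over HTTP. The host name below is a placeholder and the credentials are the Carte defaults:

    # Placeholder host -- replace <master-ip> with the master's address; cluster/cluster are the Carte defaults
    curl -u cluster:cluster http://<master-ip>:8080/kettle/status/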

Carte Master xml configuration file: This file is located in the pwd directory on each server as “carte-config-master-8080.xml”. We made the changes below for the master server.
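The original configuration screenshot is not reproduced here; the following is a minimal sketch of what the master configuration typically looks like, with the server name and host address as placeholders:

    <slave_config>
      <slaveserver>
        <!-- Placeholder name and address for the master Carte instance -->
        <name>Master</name>
        <hostname>10.0.0.10</hostname>
        <port>8080</port>
        <master>Y</master>
      </slaveserver>
    </slave_config>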

Carte Slave 1 xml configuration file:
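Again as a hedged sketch (server names and addresses are placeholders), a slave configuration declares the master it reports to and then describes the slave itself:

    <slave_config>
      <masters>
        <slaveserver>
          <!-- Placeholder address of the master defined above -->
          <name>Master</name>
          <hostname>10.0.0.10</hostname>
          <port>8080</port>
          <username>cluster</username>
          <password>cluster</password>
          <master>Y</master>
        </slaveserver>
      </masters>
      <report_to_masters>Y</report_to_masters>
      <slaveserver>
        <!-- Placeholder name and address for this slave -->
        <name>Slave1</name>
        <hostname>10.0.0.11</hostname>
        <port>8080</port>
        <username>cluster</username>
        <password>cluster</password>
        <master>N</master>
      </slaveserver>
    </slave_config>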

Carte Slave 2 xml configuration file:
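Slave 2 uses the same structure; only the slave's own name and host address change (again placeholders), while the <masters> block stays identical to Slave 1's:

    <slaveserver>
      <!-- Placeholder name and address for the second slave -->
      <name>Slave2</name>
      <hostname>10.0.0.12</hostname>
      <port>8080</port>
      <username>cluster</username>
      <password>cluster</password>
      <master>N</master>
    </slaveserver>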

We can start the Carte instance on all three servers by running Carte.bat from the command line [carte.bat “xml configuration file”].

Carte Output: Master Server
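For example, on the master (on the slaves, substitute their respective configuration files):

    rem Run from the PDI installation directory on each server (the path varies by install)
    Carte.bat carte-config-master-8080.xml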

This output means that the master is running and a listener is active on the master’s IP address on port 8080. Now let’s start Carte on the two slaves.

Carte Output: Slave1 Server

You can now see that Carte has registered Slave1 with the master server.

Carte Output: Slave2 Server

Above is the output of Slave2; you can see that it is registered with the master server. We now have two slave servers ready to receive work from the master. Now we can configure PDI Spoon to run the job.

3. Spoon Configuration (Master Server)

Connecting to repository

First we connect to the repository and add the mapped shared folder to the repository.

Adding Master and Slave Servers

Now let’s create a new transformation and add the Carte servers we just created as slave servers. Click on the View tab on the left-hand side, right-click on Slave server, and choose New. Then add the Carte servers one by one.

Repeat the same steps for the Slave1 and Slave2 servers (with their respective IP addresses and port 8080) to get all the servers added. Note: the default username and password for all Carte instances is cluster/cluster.



Configuring Cluster Schema

Next, right-click on Kettle cluster schemas and choose New. Provide a schema name, such as Cluster, and then click Select slave servers. Mark all of the servers in the pop-up window and click OK.

Next we want to make sure that Kettle can connect to all of the Carte servers. Right-click on the cluster schema you just created and choose Monitor all slave servers:

If all the servers connect successfully, we can proceed to create the partition schema.

Configuring Partition Schema

Partitioning splits a data set into a number of subsets according to a rule that is applied to each row of data. Click on the View tab on the left-hand side, right-click on Partition schemas, and choose New:

4. Transformation Design and Execution For this scenario we are going to use a CSV input file with around half a million rows of transaction data.

Step 1: CSV file input: The CSV file has no headers defined, so we define the columns as Date, Units, Discounts, SalesRep, Region ID, and Products. This file is located on the master server.
Step 2: Sort rows: Sort the input by region.
Step 3: Group by: Group by region and total the units sold. Clustering and partitioning are selected for this step.
Step 4: Text file output: Load the output into a text file on the master server.

The master server reads and sorts the data, and the slave servers perform the group-by and send their results back to the master. The caption Cx2 on the Group by step indicates that the cluster (with two slaves) has been assigned to this step; Dx indicates that a partition schema has been assigned to the cluster.

Output with cluster and partition schema applied:

Region;Total Units
2;8573011
4;8557702
6;8519324
1;8621937
3;8603052
5;8581802
7;8536516

As we have two slave servers running, the data is dynamically partitioned into two sets based on the region id field. So even though the aggregation runs on the slave servers, we derive a clean output set on the master. To be more precise: if we don’t use partitioning in our transformation, each slave server receives data in a round-robin fashion (effectively at random), so each data set could contain records for every region id. Each slave then creates its own aggregate, and when we combine the data on the master we can end up with two aggregates for the same region. This would require an additional sort and aggregation step on the master to arrive at a final clean aggregate. To avoid this scenario, it is best to define data partitioning so that each slave server receives a “unique” slice of the data. Note that this is just one reason to apply partitioning.

Output without the partition schema applied on the cluster:

Region;Total Units
1;4313476
2;4231494
3;4317266
4;4233569
5;4270229
6;4253910
7;4259153
1;4308461
2;4341517
3;4285786
4;4324133
5;4311573
6;4265414
7;4277363

Command Line Job Execution We can execute this job from the command line using the Kitchen application. Kitchen is an application that can execute jobs designed in Spoon and stored either in XML or in a database repository. Usually jobs are scheduled in batch mode to run automatically at regular intervals.

Dynamic Clustering If the slave servers are not all known upfront, they can be added or removed at any time using Kettle’s dynamic cluster schema capabilities. Consider a use case where we have millions or billions of records to process over weekends or at month end. Here we can add several slaves for this job using dynamic clustering. We can use automated scheduling to start and stop slaves for a particular job, which can yield huge cost savings. We can even leverage AWS features such as CloudFormation and CloudWatch to perform autoscaling and load balancing. Thus we can meet the same ETL SLAs at peak load as under normal load.
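As a sketch of the Kitchen syntax (the repository name, job name, and credentials below are placeholders, not taken from the original setup):

    rem Placeholder repository name, job name, and credentials -- replace with your own
    Kitchen.bat /rep:"PDI_Repository" /job:"region_aggregation" /dir:/ /user:admin /pass:password /level:Basic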


Getting started eXert Digital can help provide a data roadmap for companies to get started on their Big Data journeys. Businesses whose existing transactional landscapes are driven by ERP-based business applications have the potential to tap unused data in various forms and marry it with their transactional data sources in order to create insights and drive better decision making and higher revenues.

About eXert Digital eXert Digital is a data services consulting company. Combining unparalleled experience and comprehensive capabilities across all industries and business functions, eXert collaborates with clients to help them become data-driven businesses. www.exertdigital.com

For more information, please contact:
Muthu Kalyan
Partner, eXert Digital
815-546-6150
[email protected]

Copyright © 2015 eXert Digital. All rights reserved. eXert Digital and its logo are trademarks of eXert Digital LLC. This document makes reference to trademarks that may be owned by others. The use of such trademarks herein is not an assertion of ownership of such trademarks by eXert Digital LLC and is not intended to represent or imply the existence of an association between eXert Digital and the lawful owners of such trademarks.