Waterline Data Inventory Installation and Administration Guide
Product Version 1.2.5, Document Version 1.9
© 2014–2015 Waterline Data, Inc. All rights reserved.
Table of Contents
Related Documents ... 5
System requirements ... 5
  Hadoop compatibility ... 5
  Edge node minimum requirements ... 5
  Database configuration ... 6
  Kerberos compatibility ... 6
  Browser compatibility ... 6
  Multi-byte support ... 6
Waterline Data Inventory connections and access ... 7
  Profiling HDFS files ... 7
  Browsing HDFS files ... 8
  Profiling Hive tables ... 9
  Browsing and Creating Hive tables ... 10
Installing Data Inventory: Quick Start ... 11
Installing Data Inventory ... 14
  1. Choose an installation location ... 14
  2. Validate Hadoop configuration ... 14
  3. Configure a dedicated user ... 18
  4. Download and extract Waterline Data Inventory ... 19
  5. Run configuration scripts ... 20
  6. Configure Waterline Data Inventory for your cluster ... 23
Upgrading Waterline Data Inventory ... 25
Integrating with user management systems ... 27
  Waterline Data Inventory user authentication settings ... 27
  SSH configuration ... 27
  User access configuration for public cloud clusters ... 27
  Kerberos configuration ... 29
Improve security among Waterline Data Inventory components ... 32
  Securing internal passwords ... 33
  Encrypting a Derby repository ... 33
  Configuring access using Hadoop security: Ranger or Sentry ... 35
Starting Waterline Data Inventory ... 37
Running Waterline Data Inventory jobs ... 40
  Command summary ... 40
  Full profiling and discovery against HDFS files ... 42
  Profiling only for HDFS files ... 42
  Lineage discovery ... 43
  Collection discovery ... 43
  Origin propagation only ... 43
  Tag propagation only ... 44
  Evaluating tag rules ... 44
  Full profiling and discovery against Hive tables ... 44
  Profiling only for Hive tables ... 45
  Displaying version information ... 45
Monitoring Waterline Data Inventory jobs ... 46
  Monitoring Hadoop jobs ... 46
  Monitoring local jobs ... 47
  Debugging information ... 47
  Profiling results ... 48
Optimizing profiling performance ... 48
  MapReduce job performance controls ... 49
  Repository writing performance controls ... 49
Supporting self-service users ... 50
  Configuring web browsers for use with Kerberos ... 51
Swapping out Derby for MySQL ... 52
Configuring additional Waterline Data Inventory functionality ... 53
  Communication among Hadoop components ... 53
  Setting the location and persistence of temporary files ... 55
  Starting the web server in a Kerberos environment ... 55
  Secure communication between browser and web server (SSL) ... 56
  Browser app functionality ... 56
  Profiling functionality ... 58
  Hive functionality ... 61
  Discovery functionality ... 62
  Obscuring passwords in Waterline Data Inventory configuration files ... 65
Waterline Data Inventory reveals information about the metadata and data quality of files in a Hadoop cluster so that users of the data can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in HDFS (or MapR-FS) and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag HDFS files and Hive tables, taking advantage of the collected metadata and Data Inventory's discovered relationships. This document describes the process of installing Waterline Data Inventory on a Hadoop cluster.
Related Documents
• Waterline Data Inventory Sandbox, available for CDH, HDP, and MapR, and as images for VirtualBox and VMware.
• Waterline Data Inventory User Guide, available from the menu in the browser application and in the /docs directory in the installation.
For the most recent documentation and product tutorials, sign in to the Waterline Data community support site, support.waterlinedata.com.
System requirements
Waterline Data Inventory runs on an edge node in a Hadoop cluster. The following specifications describe Data Inventory's platform compatibility and the minimum requirements for the edge node.
Hadoop compatibility
• Cloudera CDH 5.x
• Hortonworks HDP 2.1, 2.2
• MapR 4.0, 4.1
In addition, reading Hive tables created in Waterline Data Inventory requires Hive 0.13 or later. All of the supported distributions except CDH 5.1 have this support. The edge node on which Waterline Data Inventory is installed needs to have the Hadoop and Hive clients required to access the Hadoop namenode.
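A quick way to confirm that these clients are present on the edge node (these commands work on most distributions):
$ hadoop version
$ hive --version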
Edge node minimum requirements
Optimizing input/output operations per second (IOPS) on the edge node is the most important factor in providing the best performance for Waterline Data Inventory operations. Provisioning a higher-IOPS disk can reduce the overall profiling time significantly. For example, going from 3,000 IOPS to 10,000 IOPS can improve performance by 1.5 times.
• Two to four 500 GB disks (the faster the disks, the better)
• Two quad-core CPUs, running at least 2 to 2.5 GHz
• 32 GB of RAM
• Bonded Gigabit Ethernet or 10 Gigabit Ethernet
• JDK version 1.7.x
Database configuration
The speed of the repository database is an important component of the overall performance of Waterline Data Inventory operations. Waterline Data Inventory works with MySQL and Derby databases and ships with embedded Derby by default. This document provides instructions for configuring Waterline Data Inventory to work with MySQL (page 52). To configure Waterline Data Inventory to work with other relational databases that support JDBC connectivity, contact support@waterlinedata.com.
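For reference, switching the repository to MySQL comes down to swapping the JDBC properties in environment.properties; the host, port, and database name below are placeholders, and the full procedure is on page 52:
javax.persistence.jdbc.url=jdbc:mysql://<repository host>:3306/waterlinedatastore
javax.persistence.jdbc.user=waterlinedata
javax.persistence.jdbc.password=<encrypted password>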
Kerberos compatibility
This release is compatible with Kerberos version 5.
Browser compatibility
Waterline Data Inventory supports the following browsers. If your cluster uses Kerberos, be sure to configure Kerberos support in end-users' browsers:
• Microsoft Internet Explorer 9 or later (not supported on Mac OS)
• Chrome 36 or later
• Firefox 31 or later
Multi-byte support
Waterline Data Inventory handles cluster data transparently: assuming the data is stored in formats that Waterline Data Inventory reads, the application doesn't enforce any limitations beyond what Hadoop and its components enforce. That said, there are places where the configuration of your Hadoop environment needs to align with the data you are managing, such as:
• Operating system locale
• Character set supported by the Hive client and server
• Character set supported by the Waterline Data Inventory repository database (Derby, by default) client and server
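As a quick check of the first item, confirm that the edge node's locale is a UTF-8 variant; this is only a minimal sanity check, since the Hive and repository character sets are configured separately:
$ locale     # look for UTF-8 values, for example LANG=en_US.UTF-8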
The Waterline Data Inventory browser application allows users to enter multi-byte characters to annotate HDFS data. Again, where Waterline Data Inventory interfaces with other applications, such as Hive, it enforces the requirements of the integrated application.
Waterline Data Inventory connections and access
For Waterline Data Inventory to produce an inventory of HDFS, it needs read access to all the files that are included in the inventory. In addition, it needs read access to Hive tables. Waterline Data Inventory uses HDFS to stage the profiling information it collects from HDFS files and Hive tables: for the staging directories, Waterline Data Inventory needs write access to HDFS.
Profiling HDFS files
To profile HDFS files, Waterline Data Inventory needs two connections configured:
1. HDFS Root Node. Waterline Data Inventory's connection to HDFS for profiling includes:
• Read access for all HDFS files
• Write access to staging areas to collect profiling results
2. Repository. The Waterline Data Inventory engine writes profiling and discovery results to a repository on the edge node using the Waterline Data Inventory dedicated user credentials.
Configure the HDFS and repository connections to profile HDFS files
Browsing HDFS files
When data scientists and analysts access HDFS files through Waterline Data Inventory, they see only files and tables that they have permission to view: all file system operations are performed as the signed-in user. The user permissions are established through the operating system permissions or through a Hadoop authentication system such as Kerberos, Ranger, or a combination of Kerberos and Sentry. When running against a Kerberized cluster, Waterline Data Inventory uses impersonation to perform operations with the access available to the current user.
For end-users to browse HDFS files, Waterline Data Inventory needs three connections configured:
1. HDFS Root Node
2. Repository
3. Browser URL pointing to the Waterline Data Inventory web server, combined with user credentials, whether through explicit login or authentication configured for the browser.
Configure the Web Server connection for user access
Profiling Hive tables
To include Hive tables in the inventory, Waterline Data Inventory needs read access to Hive databases as well as read/write access to a staging directory in HDFS where it holds profiling information for Hive tables. This can be the same staging area used for profiling HDFS files.
To profile Hive tables, Waterline Data Inventory needs three connections configured:
1. HDFS Root Node, including write access to an HDFS staging area for profiling results.
2. Repository
3. Hive database access: for Waterline Data Inventory to include Hive tables, it needs read access to each Hive database to be included.
Waterline Data Inventory uses MapReduce to profile Hive tables
Browsing and Creating Hive tables
Users can create new Hive tables from HDFS files they identify in Waterline Data Inventory. For end-users to create and browse Hive tables, Waterline Data Inventory needs three connections configured:
1. Repository
2. Browser URL pointing to the Waterline Data Inventory web server, combined with user credentials, whether through explicit login or authentication configured for the browser.
3. Hive database access. Waterline Data Inventory's connection to Hive for browsing includes read access to all Hive databases. To create new Hive tables from HDFS files, Waterline Data Inventory needs write access to the databases where users would expect new tables to appear.
Users see the Hive tables they have access to and can create new Hive tables from HDFS files
Installing Data Inventory: Quick Start
Here's the minimal version for getting Waterline Data Inventory up and running in a development environment. It assumes that you have root access to the environment, that the cluster is not secured with Kerberos, and that Hadoop and related services are running and healthy. For instructions suitable for an enterprise environment, SKIP THIS SECTION and go to Installing Data Inventory (page 14).
1. Create a dedicated Waterline Data user that you'll use for Waterline Data Inventory installation and job commands.
• From a command window on the installation computer, create a "waterlinedata" user:
$ useradd waterlinedata
$ passwd waterlinedata
• Give the waterlinedata user read access to the files in HDFS or MapR-FS and write access to at least one HDFS location for writing profiling results. The access needed may vary depending on the Hadoop distribution. For CDH and HDP, you can add the user to the hdfs group for both read and write access:
$ usermod -a -G hdfs waterlinedata
For MapR:
$ usermod -a -G mapr waterlinedata
• Grant sudo access for running installation scripts:
$ su root
$ visudo
Add the waterlinedata user in the User privilege section. After installation, sudo access is no longer needed.
2. Go to the directory where you want to install Waterline Data Inventory and verify that the waterlinedata user has read, write, and execute permissions on the directory. For example, you can use /opt or /usr/lib to match typical Hadoop component installs, or /home/waterlinedata for a private installation. These instructions assume /opt/waterlinedata.
$ cd /opt/waterlinedata
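If the directory does not yet exist or is owned by root, a sketch of the setup, assuming the /opt/waterlinedata location used in these instructions:
$ sudo mkdir /opt/waterlinedata
$ sudo chown waterlinedata:waterlinedata /opt/waterlinedata
$ ls -ld /opt/waterlinedata     # verify owner and rwx permissions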
3. Change to the waterlinedata user and extract Waterline Data Inventory from the TAR file.
$ su waterlinedata
$ tar xf <Waterline Data Inventory TAR file>
4. From the newly created Waterline Data Inventory directory, run the installation script, providing the waterlinedata user password for sudo access when prompted.
$ cd waterlinedata
$ bin/postInstall
This script prompts you for the location of Hive in the Hadoop environment; typically, other Waterline Data Inventory scripts will locate Hive for you, so you can skip this prompt. If you receive an error later, rerun this script and include the Hive location. If this script runs successfully, the output shows directories created in /var. If the script reports a problem, address the issue and rerun the script until you get a successful result.
5. Run a script to place Waterline Data Inventory and other third-party JARs where Waterline Data Inventory can use them. If Hive runs on a different node or your cluster is not configured to run Hive at all, skip this step and follow the instructions in the detailed installation steps for the hiveSetup script on page 21. If HiveServer2 is running on the same node as Waterline Data Inventory, run the following script:
$ bin/hiveSetup linkAuxLib
This script identifies the Hive home location and moves JAR files into Hive's auxlib directory (to avoid conflicting with JAR files already in use by Hive). It also creates symbolic links from these files to the Hive lib directory to allow Beeswax and Beeline access to these files. (To skip creating the symbolic links, run "$ bin/hiveSetup" without the linkAuxLib option.) The script may prompt you to allow the auxlib directory to be created and to approve any conflicts should these files already exist in either auxlib or lib.
If the Hive server is not running, the script will fail to identify the Hive location; to remedy this, do one of the following:
• Start the Hive server.
• Rerun postInstall (step 4) and specify the location of the Hive executable.
• Edit <installation directory>/waterlinedata/bin/.hive_home to include the location of the Hive executable.
If the script reports a problem, address the issue and rerun the script until you get a successful result.
If you are not successful running these setup scripts, you can run <installation directory>/waterlinedata/bin/detect-env verbose to get more information on where problems are occurring.
6. Configure host names and port numbers where appropriate. If you are running Waterline Data Inventory from a VM image or on a single-node cluster, these configuration parameters are already set for you. To set or validate the appropriate configuration settings, review the contents of <installation directory>/waterlinedata/lib/resources/environment.properties. In particular, insert the fully qualified domain names of the cluster root and the node on which Waterline Data Inventory is running in the following properties:
• HDFS root: waterlinedata.crawler.fs.uri=<file system URI>
For example:
hdfs://sandbox.hortonworks.com:8020
hdfs://quickstart.cloudera:8020
maprfs:///
• Repository node: javax.persistence.jdbc.url=jdbc:derby://<host name>:4444
For example:
jdbc:derby://sandbox.hortonworks.com:4444
jdbc:derby://quickstart.cloudera:4444
jdbc:derby://maprdemo:4444
For a more detailed list of configuration parameters, see 6. Configure Waterline Data Inventory for your cluster (page 23). The application is now installed and configured. To validate that the installation was successful, see Starting Waterline Data Inventory (page 37).
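As a quick smoke test before turning to that section, you can start the repository and web server and load the browser application. The bin/jettyStart script appears later in this guide; bin/derbyStart is assumed here as the counterpart of the bin/derbyStop command used during upgrades:
$ bin/derbyStart     # assumption: counterpart of bin/derbyStop
$ bin/jettyStart
Then browse to http://<edge node>:8082, the default browser application port.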
Installing Data Inventory
This version of the installation instructions includes more details about each step and the decisions involved in configuring Waterline Data Inventory in a unique enterprise environment. Installing Waterline Data Inventory involves the following decisions and steps, some of which require root access:
• Choose an installation location
• Validate that Hadoop is running and configured properly
• Configure a dedicated Waterline Data Inventory user (requires root access)
• Download and extract Waterline Data Inventory
• Run configuration scripts (requires root access)
• Configure connections to Hadoop and other applications in the Hadoop environment
1. Choose an installation location
Install Waterline Data Inventory in the same way other Hadoop cluster edge node applications are installed. Some clusters use /usr/lib; others use /opt. It can also be installed in other locations, such as the home directory for the dedicated Waterline Data user, /home/waterlinedata. Any location you choose requires root access to complete the configuration.
2. Validate Hadoop configuration
Hadoop is a complex system with many overlapping configurations and controls. You can ensure that Waterline Data Inventory will install smoothly if you first validate that the existing Hadoop components are running and communicating properly among themselves. The following steps prepare for Waterline Data Inventory installation by exercising each of the places where Waterline Data Inventory interacts with Hadoop.
1. Identify the host name for the cluster, referred to in this document as <host name>. Typically, this is the fs.defaultFS parameter in Hadoop's core-site.xml file. For MapR, find the host name for your cluster using:
$ cat /opt/mapr/conf/mapr-clusters.conf
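On HDFS-based distributions, you can also read this value directly, assuming the Hadoop client configuration is in place on the edge node:
$ hdfs getconf -confKey fs.defaultFS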
2. Ensure that Kerberos is configured for the edge node and for end-user access. If your cluster is Kerberized, you'll need a Kerberos administrator's help to install Waterline Data Inventory. Before you bring in your Kerberos admin, you can test these basic operations to make sure the foundation is in place:
• Make sure the computer you identified as the Waterline Data Inventory installation location (see previous section) is configured with Kerberos:
$ kinit
This command should prompt for the current user's password. If it does, type anything and exit the command. If it doesn't, this computer is not yet configured with Kerberos. Work with your Kerberos administrator to install Kerberos, add this computer to the Kerberos database, and generate a keytab for this computer as a Kerberos application server.
• Make sure your browser is configured to use Kerberos to access the cluster. From a browser running on a computer that is not the edge node where you are installing Waterline Data Inventory, sign into a Kerberized cluster component, such as one of the following:
Hue (CDH, MapR): http://<host name>:8888
Hue (HDP): http://<host name>:8000
Ambari (HDP): http://<host name>:8080
Cloudera Manager (CDH): http://<host name>:7180
MapR Control System (MapR): http://<host name>:8443
If you are not able to sign in, check that:
• The current user has a valid ticket (run klist from a terminal on the client computer).
• The browser is configured to use Kerberos when accessing secure sites.
• A Kerberos KDC is accessible from this computer.
• The Hadoop service is running.
• The active user has access to the Hadoop application.
3. Verify that Hadoop components are running. You can use the cluster management tool (Ambari, Cloudera Manager, or MapR Control System). If the cluster is not managed using one of these tools, check individual services by running the command line for the component. For example:
$ hadoop version
$ beeline     (!quit to exit)
Before installing Waterline Data Inventory, make sure that HDFS, MapReduce, and YARN are running; if Hive is configured for your cluster, Hive and its constituent components (Hive Metastore, HiveServer2, MySQL Server, WebHCat Server) must also be running.
4. Check that users have access to HDFS files and Hive tables. Waterline Data Inventory depends on the cluster authorization system to manage user access to HDFS resources. Verify that you have access to some HDFS files and Hive tables so that when you use Waterline Data Inventory to access the same files, you can validate that the proper access is available. You'll need access to these files as an end-user and as the Waterline Data Inventory dedicated user.
To verify that you have access to these files and tables, you can, for example:
• Use Hue to navigate to existing data in HDFS or to load new data. Verify that you can access files you own as well as files for which you have access through group membership. If you can't sign into Hue or can't access HDFS files from inside Hue, ask your Hadoop administrator for appropriate credentials.
• Use Beeswax (accessible through Hue) or Beeline (the Hive command line) to verify that you can access existing databases and tables. If you can't sign into Beeline or can't access Hive tables, ask your Hadoop administrator for appropriate credentials.
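Equivalent command-line checks, with placeholder host and user names, cover the same ground:
$ hdfs dfs -ls /user/<username>     # HDFS read access
$ beeline -u "jdbc:hive2://<hive host>:10000/default" -e "show databases;"     # Hive access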
5. Run a sample MapReduce job. All of the Hadoop distributions provide sample code that you can run directly from the JAR file:
hadoop-mapreduce-examples-<version>.jar
where the version may be specific to the distribution and version of Hadoop. Run an example MapReduce job as follows:
a. Use "locate" or "find" to determine where the examples JAR file is.
b. Run the sample job "pi" with values for the number of map tasks (10) and samples (1000) to run:
$ hadoop jar <path>/hadoop-mapreduce-examples-*.jar pi 10 1000
If the example runs successfully, you'll see output that shows the MapReduce job running:
Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/06/01 04:48:41 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/06/01 04:48:41 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/06/01 04:48:42 INFO input.FileInputFormat: Total input paths to process : 10
15/06/01 04:48:42 INFO mapreduce.JobSubmitter: number of splits:10
15/06/01 04:48:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1432835905062_0003
15/06/01 04:48:43 INFO impl.YarnClientImpl: Submitted application application_1432835905062_0003
15/06/01 04:48:43 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1432835905062_0003/
15/06/01 04:48:43 INFO mapreduce.Job: Running job: job_1432835905062_0003
15/06/01 04:48:59 INFO mapreduce.Job: Job job_1432835905062_0003 running in uber mode : false
15/06/01 04:48:59 INFO mapreduce.Job: map 0% reduce 0%
15/06/01 04:49:57 INFO mapreduce.Job: map 10% reduce 0%
15/06/01 04:49:58 INFO mapreduce.Job: map 70% reduce 0%
15/06/01 04:49:59 INFO mapreduce.Job: map 80% reduce 0%
15/06/01 04:50:30 INFO mapreduce.Job: map 90% reduce 0%
15/06/01 04:50:32 INFO mapreduce.Job: map 100% reduce 0%
15/06/01 04:50:34 INFO mapreduce.Job: map 100% reduce 100%
15/06/01 04:50:35 INFO mapreduce.Job: Job job_1432835905062_0003 completed successfully
Job Finished in 116.303 seconds
...
Estimated value of Pi is 3.14080000000000000000
You'll see a similar output pattern when Waterline Data Inventory MapReduce jobs run.
3. Configure a dedicated user
We recommend that you configure a "waterlinedata" user to own the installation directory and to run Waterline Data Inventory jobs. If you choose not to create a "waterlinedata" user, choose another user that will be dedicated to running Waterline Data Inventory jobs. Because of the extensive access privileges that Waterline Data Inventory needs to produce an inventory of HDFS files, it is critical that the user account that runs Waterline Data Inventory jobs be created to adhere to all enterprise security requirements. The dedicated user needs the following access:
• Appropriate security authentication. The dedicated Waterline Data Inventory user (waterlinedata) needs to be an authorized user in the system used by your enterprise to authenticate cluster users.
• Kerberos credentials. If your cluster is Kerberized, ask your Kerberos administrator to configure a principal name for the dedicated Waterline Data Inventory user and a corresponding keytab file. You'll need this information to configure the Waterline Data Inventory web server and to run Waterline Data Inventory jobs.
• Temporary root access. The waterlinedata user must be configured with enough "sudo" powers to create the runtime directories during the installation. The sudo access can be removed after installation is complete.
• Directory access. The waterlinedata user requires full access to the Waterline Data Inventory installation directory and the following runtime directories:
  - Waterline Data Inventory installation location, typically /opt
  - /var/lib/waterline: location for the Waterline Data Inventory repository and search indexes
  - /var/log/waterline: location for the Waterline Data Inventory logs
  - /var/run/waterline: location for Waterline Data Inventory runtime state information (not used currently)
Other than the installation location, these folders can be created and ownership assigned automatically by the "postInstall" script described in the installation steps below. This script requires root access to run.
• File system access. The waterlinedata user requires read access to any file in HDFS or MapR-FS that will be part of the system inventory. It also requires write access to at least one location where it stages profiling data. If you expect your users to create Hive tables from inside Waterline Data Inventory, the waterlinedata user also needs access to a staging location for Hive table creation.
Waterline Data Inventory reads all files on the file system but exposes data only according to users' authorization. One way to allow Waterline Data Inventory the appropriate access is to add waterlinedata to the file system group (hdfs or mapr); this method assumes that operating system users have the same privileges on HDFS. Your environment may have other methods to achieve the same result (such as Ranger or Sentry). If your environment does not have parallel users in both the operating system and the file system, you need to make sure that the dedicated Waterline Data Inventory user is part of the HDFS (or MapR-FS) superuser group, identified by the dfs.permissions.superusergroup property (see the sketch after this list).
If you choose not to grant waterlinedata write access wherever it also has read access, make sure to give it write access to at least one location where it can stage profiling data. You must identify this location in the Waterline Data Inventory profiler configuration properties, as described in 6. Configure Waterline Data Inventory for your cluster on page 23.
• Hive database access. The waterlinedata user requires read access to each Hive database that will be part of the system inventory. In addition, to allow users to create Hive tables from HDFS files, waterlinedata needs write access to one or more databases where users will store these tables.
• Shared folder access. If the installation is on a VirtualBox image, it is convenient to include the waterlinedata user in the group created for the VM to share folders between the host and the VM (the vboxsf group).
• Hue user. As a convenience, if you plan to use Hue to manage HDFS or MapR-FS files, create a corresponding user account for waterlinedata in Hue.
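A minimal sketch of the superuser-group route described above, assuming the group reported by the first command is "hdfs":
$ hdfs getconf -confKey dfs.permissions.superusergroup     # reports the superuser group
$ sudo usermod -a -G hdfs waterlinedata                    # add the dedicated user to that group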
4. Download and extract Waterline Data Inventory
If you haven't already, download the Waterline Data Inventory distribution from the location provided by Waterline Data. As the dedicated waterlinedata user, navigate to the installation directory you identified previously and expand the Waterline Data Inventory TAR file.
$ cd <installation directory>
$ su waterlinedata
Enter the waterlinedata password.
$ tar xf <Waterline Data Inventory TAR file>
Errors from this command are likely to indicate that the waterlinedata user does not have write access to the install directory.
5. Run configuration scripts
The Waterline Data Inventory distribution includes scripts to automate the process of configuring class paths, placing JAR files in the right locations, and setting permissions. Because the scripts move files into locations where they can be accessible from MapReduce jobs and set permissions to allow the dedicated waterlinedata user to access Hadoop and Hive libraries, you'll need root access to run these scripts.
postInstall script
This script creates directories and moves configuration files into the appropriate locations. To run postInstall:
1. If the dedicated Waterline Data Inventory user is Kerberized, make sure that user (typically "waterlinedata") has a valid Kerberos ticket and that the Kerberos ticket cache is available for the user (run klist). If there is not a valid ticket, run kinit to create one.
2. From inside the new waterlinedata directory, run the script to configure the environment.
$ cd waterlinedata
$ bin/postInstall
This script prompts you to enter the waterlinedata user password for sudo access. If upgrading Waterline Data Inventory, the script prompts to overwrite a Derby properties file: enter "y" for this prompt. This script also prompts you for the location of Hive in the Hadoop environment; typically, other Waterline Data Inventory scripts will locate Hive for you, so you can skip this prompt. If you receive an error later, rerun this script and include the Hive location.
No Hive in your Hadoop? Waterline Data Inventory uses some of the same open source libraries that Hive distributes to read HDFS files. If you don't have Hive installed in your system, you need to provide the location of the Waterline Data Inventory dependencies directory:
<installation directory>/waterlinedata/lib/hive
This script makes the following configuration changes:
• Creates /var directories for the Waterline Data Inventory repository, search indexes, logs, and runtime files.
• Sets the ownership of the new directories to the current user.
• Copies repository properties files from the Waterline Data Inventory installation location into the new directories.
• Writes the provided Hive path to bin/.hive_home.
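To spot-check the result, list the new directories and the recorded Hive path; the paths below are the defaults created by the script:
$ ls -ld /var/lib/waterline /var/log/waterline /var/run/waterline
$ cat bin/.hive_home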
hiveSetup script
Waterline Data Inventory provides functionality for profiling and browsing existing Hive tables and for allowing users to create new Hive tables from HDFS files in the inventory. If you want users to have access to this functionality, configure Waterline Data Inventory to work with Hive in your cluster. There are three possible Hive configurations that require installation steps:
• HiveServer2 installed on the same node as Waterline Data Inventory
• HiveServer2 installed on a different node than Waterline Data Inventory
• HiveServer2 not part of the cluster at all
The installation steps for each of these configurations are described in the following sections.
Hive and Waterline Data Inventory share a node
If HiveServer2 is running on the same node as Waterline Data Inventory, run the following script from inside the waterlinedata directory:
$ bin/hiveSetup linkAuxLib
This script makes the following configuration changes:
• Creates an auxlib directory in the Hive home directory if one does not already exist; for example, /usr/lib/hive/auxlib or /opt/mapr/hive/<hive version>/auxlib.
• Copies JAR files needed for Hive table creation and reading to the auxlib directory.
• Creates symbolic links for the auxlib JAR files into lib to allow Beeswax and Beeline access to these files. To skip this step, omit the "linkAuxLib" option.
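To verify the result, list the copied JARs and the links, substituting your Hive home for the placeholder:
$ ls <Hive home>/auxlib
$ ls -l <Hive home>/lib | grep auxlib     # symbolic links back into auxlib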
If the Hive server is not running, the script will fail to identify the Hive location; to remedy this, do one of the following:
• Start the Hive server.
• Rerun postInstall (page 20) and specify the location of the Hive executable.
• Edit <installation directory>/waterlinedata/bin/.hive_home to include the location of the Hive executable.
If the script reports a problem, address the issue and rerun the script until you get a successful result. If you are not successful running these setup scripts, you can run <installation directory>/waterlinedata/bin/detect-env verbose to get more information on where problems are occurring. Restart HiveServer2 after successfully running hiveSetup.
Hive and Waterline Data Inventory run on separate nodes
If HiveServer2 runs on a different node, you need to provide an alternate local location for Hive JARs for the Waterline Data Inventory installation and copy the Waterline Data Inventory-specific JAR files to the Hive server:
1. Rerun postInstall (page 20) and specify the local 'surrogate' location for Hive:
<installation directory>/lib/hive
For example, /opt/waterlinedata/lib/hive.
2. Run the Hive configuration script from inside the waterlinedata directory:
$ bin/hiveSetup linkAuxLib
3. Locate HiveServer2.
4. Create an auxlib folder in the Hive installation, at the same level as the Hive lib folder, and allow all users to read from this folder.
$ mkdir <Hive home>/auxlib
$ chmod a+r <Hive home>/auxlib
For example, the following commands apply to HDP v2.2.4 instances:
$ mkdir /usr/hdp/2.2.4.2-2/hive/auxlib
$ chmod a+r /usr/hdp/2.2.4.2-2/hive/auxlib
5. Add the following JARs to auxlib. These JARs can be found in the Waterline Data Inventory installation, in the lib/waterlinedata and lib/dependencies folders:
• jackson-annotations-2.2.3.jar
• jackson-databind-2.2.3.jar
• opencsv-2.3.jar
• hive-serdes-*.jar
• hivexmlserde-*.jar
• waterlinedata-formats-1.2.0.jar
Here's one way to move the files between systems:
$ cd <Hive home>/auxlib
$ scp waterlinedata:/opt/waterlinedata/lib/waterlinedata/waterlinedata-formats-1.2.1.jar .
$ scp waterlinedata:/opt/waterlinedata/lib/dependencies/jackson-annotations-2.2.3.jar .
6. Create symbolic links between the files in auxlib and the Hive lib directory. If a JAR already exists in the lib directory, the symbolic link creation will fail; in that case, you don't need to create the symbolic link.
$ cd ../lib
$ for each in ../auxlib/*.jar ; do ln -svi $each ; done
The loop creates a link for each JAR in auxlib; if you link files individually instead, repeat for all of the files in auxlib.
Restart HiveServer2 after successfully copying and linking the JAR files.
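How you restart HiveServer2 depends on your distribution and management tooling; Ambari- or Cloudera Manager-managed clusters restart the service from the management console, while on CDH package installs, for example, a command like the following works:
$ sudo service hive-server2 restart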
Hive is not part of the cluster configuration
If HiveServer2 is not part of the cluster setup, you need to provide an alternate local location for Hive JARs for the Waterline Data Inventory installation:
1. Rerun postInstall (page 20) and specify the local 'surrogate' location for Hive:
<installation directory>/lib/hive
2. Run the Hive configuration script from inside the waterlinedata directory:
$ bin/hiveSetup linkAuxLib
6. Configure Waterline Data Inventory for your cluster
To ensure Data Inventory is correctly installed and to prepare for the initial profiling runs, you need to configure Waterline Data Inventory's connections to the cluster and to Hive. These connections are configured as entries in the property file waterlinedata/lib/resources/environment.properties. If you are running the Waterline Data Inventory VM sandbox, you can skip this step, as the values are already provided.
• waterlinedata.crawler.fs.uri=hdfs://<host name>:8020
The Waterline Data Inventory server-to-Hadoop connection. Set this to the root of HDFS. Typically, this is the fs.defaultFS parameter in Hadoop's core-site.xml file. For MapR, use maprfs:///. You can see the host name for your MapR cluster using:
$ cat /opt/mapr/conf/mapr-clusters.conf
• javax.persistence.jdbc.url=jdbc:derby://<host name>:4444/waterlinedatastore;create=true
Replace <host name> with the host name or IP address of the computer on which you've installed Waterline Data Inventory. If you are running a single-node cluster, this is the same host name as the cluster root location.
• javax.persistence.jdbc.user=waterlinedata
• javax.persistence.jdbc.password=<password>
Credentials used by Waterline Data Inventory processes to access the Waterline Data Inventory repository. If needed, replace the username and password with ones that you choose. Be sure to encrypt the replacement password.
• waterlinedata.metadata.search.index.rootDir=/var/lib/waterline/index
The location where Waterline Data Inventory creates the Lucene indexes it uses. Change this location to spread the storage of Waterline Data Inventory data across more than one drive or computer.
• waterlinedata.hiveurl=jdbc:hive2://<host name>:10000/<database>
• waterlinedata.hivedatabasename=<database>
The Hive connection URL and the default database Waterline Data Inventory uses. The default database is the one used when end-users create Hive tables from HDFS files inside the browser application and the one profiled when Hive table profiling (page 61) is turned on. Note that the default database should be the same in both entries (Hive defaults to the database named "default"). For SPNEGO-Kerberos, the hiveurl needs to include the Hive principal:
jdbc:hive2://<host name>:<port>/<database>;principal=<Hive principal>
For example, with Hive running on the same node where Waterline Data Inventory is installed and using the default Hive port and database (on one line):
jdbc:hive2://localhost:10000/default;principal=HIVE/edgenode1.acmecorp.com
• waterlinedata.temproot=<local directory>
The local file system directory Waterline Data Inventory uses to store temporary files created during discovery processing. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, this value is set to /tmp.
• waterlinedata.profile.processingdirectory=<HDFS directory>
The HDFS or MapR-FS directory Waterline Data Inventory uses to generate temporary files during HDFS file profiling. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. If this property is not set (by default, it is commented out), temporary files are created in the first directory identified in the profiling command.
• waterlinedata.profile.hivedir=<HDFS directory>
• waterlinedata.hive.create_table_in_place=true
The HDFS or MapR-FS directory Waterline Data Inventory uses to generate copies of files used to create Hive tables. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, file copies are created in only a few cases, based on the type of file format. To have Waterline Data Inventory always make copies of the file to the other directory, set create_table_in_place to false.
• waterlinedata.web.kerberos.keytab.location=<keytab file path>
• waterlinedata.web.kerberos.username=<principal>
The principal and keytab file location for the dedicated Waterline Data Inventory user. For more details on running Waterline Data Inventory in a Kerberized environment, see Kerberos configuration (page 29).
Firewall configuration
If you expect administrators or end-users to access Waterline Data Inventory across a firewall, consider allowing access to the following ports at the cluster IP address:

Port       Application Component                                  Need
8082       Waterline Data Inventory browser application           End-users: Access to the Waterline Data Inventory browser application.
8482       Waterline Data Inventory browser application (HTTPS)   End-users: Access to the Waterline Data Inventory browser application with HTTPS.
50070      WebHDFS                                                If you configure Jetty to use WebHDFS rather than the native Java API. See Communication between Jetty and Hadoop (page 53).
10000      Hive                                                   End-users: Access to Hive tables.
19888      Hadoop job history                                     Administrators: Access to troubleshooting information.
4444       Derby                                                  Administrators: Access to troubleshooting information.
8000/8888  Hue                                                    Administrators: Access to HDFS files and to MapReduce job status and logs. The port is 8000 for HDP and 8888 for CDH or MapR.
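The exact commands depend on your firewall software; with iptables, for example, opening the browser application port might look like the following (a sketch, not a complete firewall policy):
$ sudo iptables -A INPUT -p tcp --dport 8082 -j ACCEPT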
Upgrading Waterline Data Inventory
If you have Waterline Data Inventory version 1.2.3 or earlier installed, you can upgrade to Waterline Data Inventory version 1.2.5 with your inventory intact as follows. Note: Waterline Data Inventory version 1.2.4 also includes the updated version of Derby. These instructions assume:
• You have sudo privileges to complete the operations
• You are signed in as the dedicated Waterline Data Inventory user, typically "waterlinedata"
To upgrade from version 1.2.3 (or earlier) to version 1.2.5:
1. Navigate to the directory in which you installed Waterline Data Inventory, for example /opt/waterlinedata.
$ cd /opt/waterlinedata
2. Stop any running processes.
$ bin/jettyStop
$ bin/derbyStop
3. Remove the Waterline Data JAR file from the Hive auxlib directory:
$ cat bin/.hive_home
$ rm <Hive home>/auxlib/waterlinedata-formats-X.X.X.jar
When prompted, confirm that you want to delete the file.
4. Make backups of your repository and logs for the installed version.
$ sudo cp -r /var/lib/waterline /var/lib/waterline_vXXX
$ sudo cp -r /var/log/waterline /var/log/waterline_vXXX
5. Move the existing Waterline Data Inventory files out of the standard installation location.
$ cd ..
$ sudo mv waterlinedata waterlinedata_vXXX
6. Replace the standard installation directory, making sure that the directory is owned by the dedicated Waterline Data Inventory user.
$ sudo mkdir waterlinedata
$ sudo chown waterlinedata:waterlinedata waterlinedata
7. Reboot the edge node where Waterline Data Inventory is installed.
8. Follow the installation instructions for the new version of Waterline Data Inventory, starting with step 4. Download and extract Waterline Data Inventory on page 19.
9. Configure the new version of the Derby repository. If you can tolerate reprofiling the content of your inventory, we recommend that you start with a fresh repository; no additional configuration is required. If you would like to continue using your previous repository (with the understanding that this repository cannot be used in a production environment), you can turn off Derby authentication and continue to use the existing repository. To do so, comment out the following entries in the lib/resources/derby.properties file:
#derby.connection.requireAuthentication=true
#derby.authentication.provider=NATIVE:waterlinedatastore
10. Validate the following operations against your existing repository before removing the previous version of the Waterline Data Inventory files:
• View HDFS files
• View Hive tables
• Create new Hive tables from HDFS files that were already profiled using the previous version
• Profile and run discovery operations
Integrating with user management systems
Waterline Data Inventory integrates with your existing Linux and Hadoop authentication mechanisms, such as SSH-based authentication and single sign-on systems such as Kerberos. By default, it uses SSH authentication, meaning users configured for HDFS are assumed to have a corresponding Linux account; they sign in to Waterline Data Inventory using their network credentials. To configure different authentication, an administrator configures the Waterline Data Inventory web server to accept the other authentication system by updating Jetty's login.conf file. Only one system can be active at a time.
Waterline Data Inventory user authentication settings
Specify the user management system to use in the following web server configuration file:
<installation directory>/waterlinedata/jetty-distribution*/waterlinedata-base/etc/login.conf
This file includes service descriptions, only one of which can be valid at a time. To activate one of the service types, change its entry name to "waterline" and rename the other services as necessary.
SSH configuration
SSH is a reliable security mechanism that has one limitation: it assumes the password authentication mechanism is available to the web server. As such, it will not work on systems that use Amazon AWS or Google Compute clouds. When configured to use SSH for user authentication, the Waterline Data Inventory web server communicates with the host system on the listen address and port defined in /etc/ssh/sshd_config. By default, the port is set to 22. If your organization uses a different convention, update the port (authPort) setting for the sshd service in the following web server configuration file:
<installation directory>/waterlinedata/jetty-distribution*/waterlinedata-base/etc/login.conf
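For example, to confirm the port sshd is listening on before updating authPort:
$ grep -i '^Port' /etc/ssh/sshd_config     # default: Port 22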
User access configuration for public cloud clusters
Amazon Web Services (AWS) and Google Cloud Platform do not support a password authentication mechanism for managing users; instead, they use SSH key-based authentication. Currently, Waterline Data Inventory does not support SSH keys for authentication on cloud deployments. It uses a local file to determine user credentials. Note that this method does not supersede the cloud provider's security, nor does it override the operating system's security concepts. The user list grants access to the Waterline Data Inventory web application only. Waterline Data Inventory respects the access privileges granted by the file system: the user list can include user names configured in the operating system. Listed users that are not mirrored in the operating system see only files that can be read by all users.
To configure Waterline Data Inventory user access on public cloud clusters:
1. Navigate to the directory in which you installed Waterline Data Inventory, for example /opt/waterlinedata.
$ cd /opt/waterlinedata
2. Stop the web server process.
$ bin/jettyStop
3. Create a user access list in a text file named "login.properties" in the web server configuration location.
$ cd jetty-distribution-9.2.1.v20140609/waterlinedata-base/etc
$ vi login.properties
Enter one line for each user in the form:
username=password,groups
where groups can be one or more operating system group names to which this user belongs. Separate group names with commas. For example:
waterlinedata=waterlinedata
sherlock=Se$4sp0,finance
watson=AQ2hc#9GG,finance
These passwords are used only for access to Waterline Data Inventory. They can be obfuscated according to the Jetty web server requirements, described in Jetty's "Secure Password Obfuscation": www.eclipse.org/jetty/documentation/current/configuring-security-secure-passwords.html
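For example, Jetty's password tool generates an obfuscated (OBF:) form that you can paste into login.properties; the jetty-util JAR version in the path below is a placeholder:
$ java -cp lib/jetty-util-<version>.jar org.eclipse.jetty.util.security.Password sherlock 'Se$4sp0'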
4. Add an entry to the Jetty configuration file login.conf to refer to the user access list you created in step 3:
$ vi login.conf
Add the following entry:
waterline {
    org.eclipse.jetty.jaas.spi.PropertyFileLoginModule required
    debug="true"
    file="${jetty.base}/etc/login.properties";
};
5. In login.conf, find the previous entry named "waterline" and change it to "waterline_ssh" or "waterline_kerberos" as appropriate. (Only the entry added in step 4 should be named "waterline".)
6. Restart the web server process.
$ cd /opt/waterlinedata
$ bin/jettyStart
Kerberos configuration
These configuration instructions assume a Kerberos system where application servers, such as Waterline Data Inventory's web server, use keytab credentials while users authenticate with username and password. Kerberos setup requires support from the IT personnel who can advise on and configure Kerberos user access. The following steps are required to configure Waterline Data Inventory to operate with Kerberos authentication:
1. User setup. Make sure that the user account dedicated to installing and running Waterline Data Inventory servers and jobs is configured for Kerberos. The access requirements for this user are described in the installation requirement "3. Configure a dedicated user" on page 18. To complete the Waterline Data Inventory Kerberos configuration, you'll need:
• The principal name for the dedicated Waterline Data Inventory user
• The location of the keytab file corresponding to the principal
You'll use these values in step 3, "Authentication method configuration."
2. Set up web server credentials. See Configure Waterline Data Inventory web server as a trusted Kerberos application server, below.
3. Switch Waterline Data Inventory to Kerberos. See Configure Waterline Data Inventory web server to use Kerberos authentication, on page 30.
4. Configure impersonation. See Configure impersonation for Waterline Data Inventory, on page 31.
5. Configure Hive. See Configure the Hive principal in Waterline Data Inventory, on page 32.
6. Review non-Kerberos connections. Ensure that all internal credentials are secure. Some communication among components of Waterline Data Inventory does not use the dedicated Waterline Data Inventory user account. There are a few changes to consider to ensure all communication paths are secure, described in Improve security for communication between Waterline Data Inventory components on page 32.
Configure Waterline Data Inventory web server as a trusted Kerberos application server
This configuration assumes that the Waterline Data Inventory web server (Jetty) authenticates using the dedicated Waterline Data Inventory user principal and keytab. If you choose to use a separate principal specifically for an application server, ensure that you use the application server principal when configuring the Waterline Data Inventory properties in step 2.
1. Include a Kerberos configuration file on the computer on which Waterline Data Inventory's application server (Jetty) will run. Make sure that the Kerberos configuration file (/etc/krb5.conf) includes a description of the realm in which Waterline Data Inventory resides. For example, for a server in a company called "Acme":
[libdefaults]
  default_realm = ACME.COM
  dns_lookup_realm = false
  dns_lookup_kdc = false
  ticket_lifetime = 24h
  renew_lifetime = 7d
  forwardable = true

[realms]
  ACME.COM = {
    kdc = server1.acme.com:88
    admin_server = server1.acme.com:88
  }

[domain_realm]
  .acme.com = ACME.COM
  acme.com = ACME.COM
2. Indicate the location of Kerberos credentials so Jetty can refresh its own ticket as needed.
Edit the environment.properties file to include the principal and keytab file location for the dedicated Waterline Data Inventory user. The file is found in:
/waterlinedata/lib/resources
The properties to update are:
• waterlinedata.web.kerberos.keytab.location=
• waterlinedata.web.kerberos.username=
For example, "[email protected]".

Configure Waterline Data Inventory web server to use Kerberos authentication
By default, the Waterline Data Inventory web server uses SSH authentication. To switch to Kerberos, edit the web server login configuration file:
1. Edit the Jetty login.conf file. The file is found in:
/waterlinedata/jetty-distribution-.v/waterlinedata-base/etc
where the version values are those of the Jetty distribution provided in the Waterline Data Inventory installation.
• Rename the "waterlineKerberos" entry to simply "waterline".
• Rename the existing "waterline" entry to "waterlineSSH".
• Make sure there is only one "waterline" entry in the file.
Configure impersonation for Waterline Data Inventory
Secure impersonation is required to use Waterline Data Inventory's HDFS delegated authorization capability. This method allows the dedicated Waterline Data Inventory user to submit requests to Hive or HDFS on behalf of another user. For example, when browsing files in HDFS, Waterline Data Inventory uses the signed-in user's credentials to query HDFS for directory listings, ensuring the user sees only data the user has access to.
In a Kerberos-controlled environment, delegated authentication has another value. The Hive metastore is typically accessible only through the dedicated Hive user. Waterline Data Inventory uses delegated authentication to perform operations against the Hive metastore by passing the Hive principal with the request. Because the dedicated Waterline Data Inventory user has delegated authentication privileges, Hive performs the requests.
To use Waterline Data Inventory's HDFS delegated authorization, make the following configuration changes in the core-site.xml file for the cluster.
1. Update core-site.xml with the following properties. The changes to core-site.xml require that you restart the cluster, so arrange to make the change when it is convenient for other cluster management tasks. Make this change using the cluster management tools, such as Ambari, Cloudera Manager, or MapR Control System.
If you need to, change "waterlinedata" to the name you are using for the dedicated Waterline Data Inventory user. Include all hosts or all groups using an asterisk (*) as the property value. Alternatively, you can specify a comma-separated list of fully qualified hostnames or a comma-separated list of groups.
<property>
  <name>hadoop.proxyuser.waterlinedata.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.waterlinedata.groups</name>
  <value>*</value>
</property>
2. Restart the cluster.

Configure the Hive principal in Waterline Data Inventory
From inside Waterline Data Inventory, users can create Hive tables from HDFS files. In a non-Kerberized environment, Waterline Data Inventory requests data from the Hive metastore using the dedicated Waterline Data Inventory user. In a Kerberized environment, it is typical that only the dedicated Hive user can perform operations against the metastore. To allow Waterline Data Inventory to access the Hive metastore, configure Waterline Data Inventory with delegated authentication privileges (previous section) and include the Hive principal in the Waterline Data Inventory configuration.
To configure this change to the Hive connection, edit the environment.properties file:
1. In the Waterline Data Inventory environment.properties file, update the Hive connection URL to include the Hive Kerberos principal:
• Comment out the existing, non-Kerberos instance of the hiveurl property.
• Later in the file, uncomment the Kerberos instance of the hiveurl property.
• Customize the Kerberos hiveurl to include the Hive Kerberos principal.
The connection URL with Hive principal would look like the following example (all on one line):
waterlinedata.hiveurl=jdbc:hive2://com.acme.edge:10000/default;principal=hive/
[email protected];auth=kerberos;kerberosAuthType=fromSubject
For more information on configuring Hive with Kerberos in an enterprise environment, see "Multi-User Scenarios and Programmatic Login to Kerberos KDC":
cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients - HiveServer2Clients-Multi-UserScenariosandProgrammaticLogintoKerberosKDC
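To sanity-check a Kerberized Hive connection string outside Waterline Data Inventory, you can try it with the standard beeline client that ships with Hive; a minimal sketch, where the host and principal (hive/edge.acme.com@ACME.COM) are hypothetical examples, not values from this installation:
$ kinit                      # obtain a ticket as the dedicated user first
$ beeline -u "jdbc:hive2://com.acme.edge:10000/default;principal=hive/edge.acme.com@ACME.COM"
# a successful connection prints the Hive version and a jdbc:hive2 prompt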
Improve security among Waterline Data Inventory components
Kerberos provides a mechanism to ensure secure communications between clients and servers; you can also enhance the security of communication between the Waterline Data Inventory web server and the repository database, whether they reside on a single computer or are distributed across separate computers.
Securing internal passwords
The following steps describe how to secure/obfuscate the clear-text user passwords Waterline Data Inventory uses to pass information among components, such as Derby. The default passwords provided have been obfuscated using this process.
To update the Derby password:
1. Edit the environment.properties file, found here:
/waterlinedata/lib/resources/environment.properties
2. In a separate command window, generate the password for database access by running:
/waterlinedata/bin/obfuscate
3. Enter the Derby password at the prompt and collect the output from the console (or from obfuscate.out).
4. Locate the entry for javax.persistence.jdbc.password and replace the existing default password with the encrypted text obtained in the previous step. Do NOT modify the line javax.persistence.jdbc.user=waterlinedata.
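For example, a typical pass through steps 2 and 3 looks like the following sketch (the comments describe expected behavior; the encrypted value is whatever the utility emits):
$ cd /opt/waterlinedata
$ bin/obfuscate            # prompts for the Derby password to encrypt
$ cat obfuscate.out        # the same encrypted text that was echoed to the console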
Encrypting a Derby repository
Another measure you can take to secure data at rest in your system is to configure Derby to encrypt the Waterline Data Inventory repository. Make the following changes in a new installation of Waterline Data Inventory. If you need to convert an existing repository from non-encrypted to encrypted, refer to the Apache documentation found here:
db.apache.org/derby/docs/10.9/devguide/tdevcsecureunencrypteddb.html
To configure Derby to initialize an encrypted database:
1. Install Waterline Data Inventory as described in Installing Data Inventory, starting on page 14.
2. Complete the configuration settings described in Step 6, "Configure Waterline Data Inventory for your cluster", except for the repository properties.
3. Configure the following repository properties in /waterlinedata/lib/resources/environment.properties:
• Derby connection URL, "javax.persistence.jdbc.url". This property includes the following parameters:
  JDBC connection to the datastore:                           jdbc:derby://:4444/waterlinedatastore
  Option to create the database if it doesn't already exist:  create=true
  Option to enable data encryption:                           dataEncryption=true
  Boot password for the encrypted database:                   bootPassword=
For example, the complete property entry might look like the following:
javax.persistence.jdbc.url=jdbc:derby://mycluster.acme.com:4444/waterlinedatastore;create=true;dataEncryption=true;bootPassword=rO0ABXcIAAABTm+81/tzcgAZamF2YXguY3J5cHRvLlNlYWxlZE9iamVjdD42PabDt1RwAgAEWwANZW5jb2RlZFBhcmFtc3QAAltCWwAQZW5jcnlwdGVkQ29udGVudHEAfgABTAAJcGFyYW1zQWxndAASTGphdmEvbGFuZy9TdHJpbmc7TAAHc2VhbEFsZ3EAfgACeHBwdXIAAltCrPMX+AYIVOACAAB4cAAAACBBs6MgWOBquHlkak/Pjk2DYzvwCcZPSVZ/xYDNdMbPl3B0ABRBRVMvRUNCL1BLQ1M1UGFkZGluZw==
• Derby user and password. Include the dedicated Waterline Data Inventory user (typically "waterlinedata") and an encrypted password for this user. For example:
javax.persistence.jdbc.user=waterlinedata
javax.persistence.jdbc.password=rO0ABXcIAAABTL6+wStzcgAZamF2YXguY3J5cHRvLlNlYWxlZE9iamVjdD42PabDt1RwAgAEWwANZW5jb2RlZFBhcmFtc3QAAltCWwAQZW5jcnlwdGVkQ29udGVudHEAfgABTAAJcGFyYW1zQWxndAASTGphdmEvbGFuZy9TdHJpbmc7TAAHc2VhbEFsZ3EAfgACeHBwdXIAAltCrPMX+AYIVOACAAB4cAAAACDYZOrytwNZDBzYyS8qc530ISSmjDSqdw0fVY6YXCb+mnB0ABRBRVMvRUNCL1BLQ1M1UGFkZGluZw==
To encrypt the passwords, run the obfuscate utility provided in /waterlinedata/bin.
4. Follow the standard process for starting the Waterline Data Inventory Derby and Jetty services:
$ cd /waterlinedata/
$ bin/derbyStart
$ bin/jettyStart
If you have an existing installation of Waterline Data Inventory, you need to drop the existing repository to take advantage of Derby encryption.
To remove an existing Waterline Data Inventory repository:
1. From the edge node where Waterline Data Inventory is installed, shut down any existing Waterline Data Inventory processes. If a profiling or discovery job is running, wait for the job to complete.
$ cd /waterlinedata
$ bin/jettyStop
$ bin/derbyStop
The derbyStop script prompts for the username and password (configured in lib/resources/environment.properties). By default these values are "waterlinedata" and "waterlinedata".
2. Remove the repository and indexes by deleting the /var/lib/waterline/db and /var/lib/waterline/index directories:
$ rm -r /var/lib/waterline/db
$ rm -r /var/lib/waterline/index
3. Follow the instructions provided above to configure Waterline Data Inventory to create a new encrypted database.
Configuring access using Hadoop security: Ranger or Sentry
Waterline Data Inventory supports coarse-grained security based on HDFS file and directory user and group permissions. The following table describes Waterline Data Inventory's operation based on which Hive authorization method your cluster employs.
Security Configuration                      Operation                                      HDFS   Hive
SQL Standards Based Authorization           Browse files and tables                        Yes    Yes
(fine-grained security)                     Search files and tables                        Yes    Yes
                                            Create tables                                  Yes    Yes
                                            Browse authorized subset (columns or rows)     No     No
Storage Based Authorization                 Browse files and tables                        Yes    Yes
                                            Search files and tables                        Yes    Yes
                                            Create tables                                  Yes    Yes
Default Hive Authorization (Legacy Mode)    Browse files and tables                        Yes    Yes
                                            Search files and tables                        Yes    Yes
                                            Create tables                                  Yes    Yes
Secure cluster configuration
If a Hadoop cluster runs in secure mode, Waterline Data Inventory can be configured to enable secure impersonation. Secure impersonation allows a given Hadoop superuser to submit jobs or access files on behalf of another user. Secure impersonation is required to use Waterline Data Inventory's HDFS delegated authorization capability. This allows the dedicated Waterline Data Inventory user to submit tasks on behalf of another user. The Waterline Data Inventory server uses its credentials to authenticate with Hadoop; however, file system accesses and tasks are authorized as the user who is signed in to the Waterline Data Inventory browser application.
To use HDFS delegated authorization, do the following to enable secure impersonation in your Hadoop environment:
1. Add the dedicated Waterline Data Inventory user (typically waterlinedata) to the HDFS superuser group on all Hadoop nodes.
2. Create a /user/ directory in HDFS for each user who will access Waterline Data Inventory.
3. Grant the groups (or users) read access on the appropriate source data files and directories in HDFS and on databases and tables in Hive.
You must also enable the secure impersonation properties for the Waterline Data Inventory superuser in the core-site.xml file on your Hadoop nodes. For example:
<property>
  <name>hadoop.proxyuser.waterlinedata.groups</name>
  <value>*</value>
  <description>Allow the superuser 'waterlinedata' to impersonate any user</description>
</property>
<property>
  <name>hadoop.proxyuser.waterlinedata.hosts</name>
  <value>*</value>
  <description>The superuser 'waterlinedata' can connect from any host to impersonate a user</description>
</property>
Access privileges for HDFS
If your cluster security is ensured using Apache Ranger or Apache Sentry, here's how to set user access to make sure that both the dedicated Waterline Data Inventory user and end-users of the browser application have the access they need.
HDFS User and Area of access                                Read   Write   Execute
Waterline Data Inventory dedicated user "waterlinedata"
  HDFS directories and files included in inventory           X              X
  Staging area for profiling results                         X      X       X
Privileged end-users
  HDFS directories and files this user needs access to       X      X       X
Read-only end-users
  HDFS directories and files this user needs access to       X              X
Access privileges for Hive
Ranger and Sentry both control access to data in Hive tables; access is controlled based on the required SQL operation.
Hive User and Area of access                               Hive Operation
Waterline Data Inventory dedicated user "waterlinedata"
  Profile existing tables                                   SELECT
  Browse existing tables                                    SHOW DATABASE
  Create new tables                                         CREATE, ALTER‡
Privileged end-users
  Hive databases and tables this user needs access to       SELECT, CREATE
Read-only end-users
  Hive databases and tables this user needs access to       SELECT
‡ ALTER privileges are required only for creating Hive tables from collections.
Starting Waterline Data Inventory
These steps pick up where the section "Installing Data Inventory" left off and assume you have access to the Linux computer where Waterline Data Inventory is installed and can sign in as the dedicated Waterline Data Inventory user.
1. From a command prompt or terminal, access the computer where Waterline Data Inventory is installed and sign in as the dedicated Waterline Data Inventory user.
2. Navigate to the Waterline Data Inventory installation directory. For example:
$ cd /home/waterlinedata/waterlinedata
3. Start the embedded metadata repository database, Derby.
$ bin/derbyStart
You'll see a response that ends with "...started and ready to accept connections on port 4444".
4. Press Enter to return to the shell prompt.
5. Profile a directory in HDFS (or MapR-FS). For this first run, select a single directory with a small number of files to validate the installation. Run the following command:
$ bin/waterline profile
For example:
$ bin/waterline profile /user/waterlinedata/Landing/data.gov
The console fills with status messages for each stage of the profiling sequence. When this command completes, you can repeat it with additional directories or move on to viewing the profiled data.
6. Start the embedded web server, Jetty. You may want to open a new console for this command to separate the profiling output from the Jetty output. (The output from the profiling and Jetty processes is captured in separate log files in /var/log/waterline.)
$ bin/jettyStart
The first time you run Waterline Data Inventory after installation, the system creates the repository tables in Derby. Either starting the Jetty process or running a profiling job will create the repository. Avoid starting both processes at the same time: both will attempt to create the repository tables and conflicts will result.
7. After Jetty's messages pause, open a browser and navigate to:
http://:8082
For Kerberized instances, make sure to log in from a browser configured to use Kerberos keytabs for the user (see Configuring web browsers for use with Kerberos on page 51) and use the fully qualified domain name instead of the IP address to make sure that the Kerberos token is passed to the application:
http://:8082
If the Waterline Data Inventory login screen doesn't appear, look in the console output to see if any error occurred. The output is also available at /var/log/waterline/wds-ui.log. Typically, errors at this point are similar to the following:
• Contested port. If another application on the cluster is using port 8082, you may not have access to Waterline Data Inventory. If this is the case, do the following:
a. Stop Jetty.
$ /bin/jettyStop
b. Change the Jetty port number in the file jetty-distribution-9.2.1.v/waterlinedata-base/start.d/http.ini
c. Restart Jetty.
• Port forwarding. If you are accessing the web server remotely, make sure that the connection between hosts allows forwarding of port 8082.
• User permissions. If the dedicated user does not have the correct permissions, you may see errors in the Jetty output. Review the user access requirements and make sure the user has the correct access.
• Kerberos ticket cache disabled. In a Kerberos-controlled environment, if you see the following error in the Jetty console and log, the ticket cache may be disabled for the user starting the Jetty process:
WARN |2015-03-22 13:52:57,827 org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:waterlinedata (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
To resolve this error, make sure the user running the Jetty process has read access to HDFS and has a valid Kerberos ticket (run kinit). Then check that the Kerberos ticket cache is available for the user (run klist).
8. Sign in to Waterline Data Inventory using any of the Linux users configured for your system, including "waterlinedata". For a Kerberized instance, this login page does not appear; to access Waterline Data Inventory, users will need a valid Kerberos keytab and a browser configured to use it. See Configuring web browsers for use with Kerberos on page 51.
9. Verify that there is field-level information for the files in the directory you profiled in step 5. If files show that they were not profiled ("N/A" or "CRAWLED" in Last Profiled), review the console output from step 5 to determine the failure.
If profiling didn't complete successfully, files will show no profile time
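A quick way to run the kinit and klist checks from the Kerberos troubleshooting item above; the keytab path and principal are examples, not values from this installation:
$ kinit -kt /etc/security/keytabs/waterlinedata.keytab waterlinedata@ACME.COM
$ klist
# klist should list a valid krbtgt ticket for the waterlinedata principal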
Running Waterline Data Inventory jobs
Waterline Data Inventory format discovery and profiling jobs are MapReduce jobs run in Hadoop. These jobs populate the Waterline Data Inventory repository with file format and schema information, sample data, and data quality metrics for files in HDFS and Hive. Waterline Data Inventory can process HDFS files formatted as delimited text, JSON, Avro, XML, ORC, RC, and Apache log files. Individual files in these formats compressed as sequence files are also profiled, as are individual files in delimited text, Apache log, or JSON format compressed as gzip (GNU zip).
Tag propagation, lineage discovery, collection discovery, and origin propagation jobs are jobs run on the edge node where Waterline Data Inventory is installed. These jobs use data from the repository to suggest relationships among files, to suggest additional tag associations, and to propagate origin information.
Waterline Data Inventory jobs are run on a command line on the computer on which Waterline Data Inventory is installed. The jobs are started using scripts located in the bin subdirectory in the installation location. If you are running Waterline Data Inventory jobs in a development environment, consider opening two separate command windows: one for the Jetty console output and a second to run Waterline Data Inventory jobs.
Command summary
Run Waterline Data Inventory commands as options to the waterline script found in the bin directory of the installation:
$ bin/waterline
The command options and parameters are described in the following table.
Command option           Summary

profile                  Full profile and discovery of the files in the indicated HDFS directories. Indicate more than one directory with a comma-separated list. MapReduce configuration parameters can be passed through to MapReduce jobs. (Details on page 42.)

profileOnly              Profiling of the files in the indicated HDFS directories. No discovery processes run. Indicate more than one directory with a comma-separated list. (Details on page 42.)

profileHive              Full profile and discovery of the tables in the indicated Hive databases. Indicate more than one database with a comma-separated list. By default, Waterline Data Inventory uses the location configured for waterlinedata.profile.hivedir to stage profiling results; if this location is not configured, specify an empty HDFS directory where waterlinedata has read and write access to use as a staging directory for profiling results. MapReduce configuration parameters can be passed through to MapReduce jobs. (Details on page 44.)
[HDFS staging directory]

profileHiveOnly          Profiling of the tables in the indicated Hive databases. No discovery processes run. Indicate more than one database with a comma-separated list. By default, Waterline Data Inventory uses the location configured for waterlinedata.profile.hivedir to stage profiling results; if this location is not configured, specify an empty HDFS directory where waterlinedata has read and write access to use as a staging directory for profiling results. MapReduce configuration parameters can be passed through to MapReduce jobs. (Details on page 45.)
[HDFS staging directory]

runLineage               Discover lineage relationships among all profiled files and tables and calculate file and table origins. (Details on page 43.)

runCollection            Discover collections among all profiled files. If you are running discovery tasks individually, be sure to discover collections before propagating tag associations. (Details on page 43.)

runOrigin                Calculate file and table origins using all lineage relationships. (Details on page 43.)

tag                      Propagate tag associations across all profiled files and tables. Because this operation uses repository data, if you are experimenting with tag associations based on regular expressions, consider reprofiling data to get a complete picture of how tag associations from regular expressions will perform. (Details on page 44.)

evaluateRegex            Reapply tag associations based on regular expressions using existing repository data. (Details on page 44.)

showVersion              Display Waterline Data Inventory version information. (Details on page 45.)
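For example, a typical sequence after new data lands, using the example directory from this guide and the command ordering recommended in the sections below:
$ bin/waterline profileOnly /user/waterlinedata/Landing   # profile new and changed files
$ bin/waterline runLineage                                # discover lineage and calculate origins
$ bin/waterline runCollection                             # discover collections before tag propagation
$ bin/waterline tag                                       # propagate tag associations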
Full profiling and discovery against HDFS files
$ /bin/waterline profile
This command recursively profiles new and updated files in the indicated directory. When run for the first time, this command profiles all files in the indicated directory. Subsequent runs identify changed, deleted, and new files in the cluster and perform profiling only on those files. Specifically, the profile command triggers the following individual operations:
• Format discovery (one MapReduce job)
• Profiling "crawl" (one or more MapReduce jobs per file format type)
• Collections discovery (one local job)
• Origin propagation (one local job)
• Tag propagation (one local job), including propagating:
  • User-assigned tag associations
  • Tag associations defined by regular expressions
  • Tag associations defined by built-in reference data
When each job completes, the next job starts, regardless of whether the previous job completed successfully. The progress of each job is indicated by messages on the console. To see details for the MapReduce jobs, follow the job link provided in the console messages or use Hue to show the MapReduce jobs for the dedicated Waterline Data Inventory user.
After profiling all the directories in the cluster, run the lineage discovery command, described on page 43.
Example:
$ bin/waterline profile /user/waterlinedata/Landing
To profile more than one directory at a time, specify a parent directory or include multiple directories in the command, separated by commas with no space between paths:
$ bin/waterline profile ",,"
If you specify a valid HDFS file instead of a directory, Waterline Data Inventory will profile just the file. If no staging directory is defined (waterlinedata.profile.processingdirectory in environment.properties), Waterline Data Inventory will create a staging directory in the same parent directory as the file.
Profiling only for HDFS files
$ /bin/waterline profileOnly
This command recursively profiles new and updated files in the indicated directory. When run for the first time, this command profiles all files in the indicated directory. Subsequent runs identify changed, deleted, and new files in the cluster and perform
profiling only on those files. Specifically, the profileOnly command triggers the following individual operations:
• Format discovery (one MapReduce job)
• Profiling "crawl" (one or more MapReduce jobs per file format type)
The progress of each job is indicated by messages on the console. To see details for the MapReduce jobs, follow the job link provided in the console messages.
After profiling all the directories in the cluster, run the lineage discovery, collection discovery, and tag propagation commands, described next.
Example:
$ bin/waterline profileOnly /user/waterlinedata/Landing
Lineage discovery
$ /bin/waterline runLineage
This command runs two local jobs to discover lineage relationships among files and propagate origin information. This command operates on data in the Waterline Data Inventory repository; if new files are added to the cluster, you must run a profile command to collect data into the repository before you will see information for the new files reflected in lineage relationships. This command allows a -r option, which will rediscover lineage for all files in the cluster, not just new files. The progress of each job is indicated by messages on the console.
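For example, to rediscover lineage across all profiled files rather than only new ones:
$ bin/waterline runLineage -r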
Collection discovery
$ /bin/waterline runCollection
This command reviews repository data to determine if any folders contain files that can be considered a collection. In addition to running collection discovery as part of profiling in general, run this command when you've added files to the cluster that are likely to be members of existing collections; profiling alone will not update the collection information with the new files. This command allows an -r option, which will rediscover collections across the cluster, not just for new files. The progress of the job is indicated by messages on the console.
Origin propagation only
$ /bin/waterline runOrigin
This command propagates origins across the files in the cluster that have lineage relationships. You can use this command to propagate landing information across a cluster that has already been profiled and has lineage information discovered. This command allows a -r option, which will propagate all origins, not just new origins. The progress of the job is indicated by messages on the console.
Tag propagation only
$ /bin/waterline tag
This command propagates new tags across the files and fields in the cluster. Use this command when you know that your cluster has been profiled but you have added tags and tag associations that you want Waterline Data Inventory to consider for propagation. This command allows a -r option, which will propagate all tags, not just new tags. The progress of the job is indicated by messages on the console.
Evaluating tag rules
$ /bin/waterline evaluateRegex
This command uses data from the repository to apply tag association rules. Use this command when you have configured tagging rules but are not ready to reprofile all data in the cluster to apply the new rules against fresh profiling data. The tag association results may not be as accurate as those from a full reprofile, but the performance savings will be significant. The progress of the job is indicated by messages on the console.
Full profiling and discovery against Hive tables
$ /bin/waterline profileHive
This command profiles new and updated tables in the indicated Hive database or databases. The Hive databases must be from the Hive instance configured in the Waterline Data Inventory profiler properties as described on page 23. In addition, you must identify an HDFS location where Waterline Data Inventory can create staging files for profiling results. When run for the first time, this command profiles all tables in the indicated database. Subsequent runs identify changed, deleted, and new tables in the cluster and perform profiling only on those tables. Specifically, the profileHive command triggers the following individual operations:
• Profiling "crawl" (one or more MapReduce jobs depending on the size of data in each table)
• Collections discovery (one local job)
• Origin propagation (one local job)
• Tag propagation (one local job), including propagating:
  • User-assigned tag associations
  • Tag associations defined by regular expressions
  • Tag associations defined by built-in reference data
When each job completes, the next job starts, regardless of whether the previous job completed successfully. The progress of each job is indicated by messages on the console. To see details for the MapReduce jobs, follow the job link provided in the
console messages or use Hue to show the MapReduce jobs for the dedicated Waterline Data Inventory user.
Example:
$ bin/waterline profileHive default
To profile more than one database at a time, include multiple databases in the command, separated by commas with no space between names:
$ bin/waterline profileHive ,
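For example, with two hypothetical databases named sales and marketing:
$ bin/waterline profileHive sales,marketing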
Profiling only for Hive tables
$ /bin/waterline profileHiveOnly
This command profiles new and updated tables in the indicated database or databases. When run for the first time, this command profiles all tables in the indicated database. Subsequent runs identify changed, deleted, and new tables in the cluster and perform profiling only on those tables. Specifically, the profileHiveOnly command triggers the following individual operation:
• Profiling "crawl" (one or more MapReduce jobs depending on the size of data in each table)
The progress of each job is indicated by messages on the console. To see details for the MapReduce jobs, follow the job link provided in the console messages.
After profiling all the databases in the cluster, run the lineage discovery and tag propagation commands, described on pages 43 and 44.
Example:
$ bin/waterline profileHiveOnly default,finance
Displaying version information
$ /bin/waterline showVersion
This command displays the Waterline Data Inventory version installed. This information specifies a Hadoop distribution. If the Hadoop distribution listed here is different from the distribution running on the cluster, you may have configuration problems. Consider reinstalling with the matching Waterline Data Inventory package.
Monitoring Waterline Data Inventory jobs
Waterline Data Inventory provides a record of job history in the Dashboard of the browser application.
In addition, you can follow detailed progress of each job on the console where you run the command.
Monitoring Hadoop jobs
When you run the "profile" command, you'll see an initial job for format discovery followed by one or more profiling jobs. There will be at least one profiling job for each file type Data Inventory identified in the format discovery pass. The console output includes a link to the job log for the running job. For example:
2014-09-20 18:17:27,048 INFO [WaterlineData Format Discovery Workflow V2] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1913847052944_0004/
While the job is running, you can follow this link to see the progress of the MapReduce activity.
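If you prefer the command line, the standard YARN CLI can list the same applications; a minimal sketch, assuming a YARN-based distribution with the yarn client on your path:
$ yarn application -list -appStates RUNNING
# look for applications submitted by the dedicated waterlinedata user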
Alternatively, you can monitor the progress of these jobs using Hue in a browser.
For Cloudera and MapR distributions:
http://:8888/jobbrowser
For Hortonworks distributions:
http://:8000/jobbrowser
You'll need to specify the dedicated Waterline Data Inventory user or, if the Waterline Data Inventory user has a corresponding account in Hue, sign in to Hue using that user.
Monitoring local jobs After the Hadoop jobs complete, Waterline Data Inventory runs local jobs to process the data collected in the repository. You can follow the progress of these jobs by watching console output in the command window in which you started the job.
Debugging information
There are multiple sources of debugging information available for Data Inventory. If you encounter a problem, collect the following information for Waterline Data support.
• Job messages
Waterline Data Inventory generates console output for jobs run at the command prompt. If a job encounters problems, review the job output for clues to the problem. These messages appear on the console and are collected in log files at the debug logging level:
MapReduce jobs (format discovery and profiling):
/var/log/waterline/wds-mrjobs.log
Waterline Data Inventory jobs (tag propagation, collection discovery, lineage discovery):
/var/log/waterline/wds-inventory.log
• Web server messages
The embedded web server, Jetty, produces output corresponding to user interactions with the browser application. These messages appear on the console and are collected in a log file:
/var/log/waterline/wds-ui.log
Use tail to see the most recent entries in the log:
$ tail -f /var/log/waterline/wds-ui.log
• Lucene search indexes
In some cases, it may be useful to examine the search indexes produced by the product. These indexes are found in the following directory:
/var/lib/waterline/index
• Waterline Data Inventory repository
In some cases, it may be useful to examine the actual repository files produced by the product. The repository datastore is found in the following directory:
/var/lib/waterline/db/waterlinedatastore
Profiling results
After Waterline Data Inventory jobs run successfully, there may still be individual files that are not profiled or are not profiled completely. There are two places to look to understand the results of a profiling job:
• Dashboard. From inside the Waterline Data Inventory browser application, click Dashboard in the toolbar. This page lists the current and past jobs. If files in a job produced errors and were not processed or were not fully processed, the job status indicates the errors.
• Single File View. The file information for each profiled file includes the profile status for the file. From inside the Waterline Data Inventory browser application, navigate to the file. File status values include:
  • PROFILED. A significant portion of the file profiled successfully, or the appropriate sample of the file was profiled (if sampling is turned on).
  • PROFILE_FAILED. Profiling encountered too many errors in this file to produce profiling output. Look for specific errors in the output of the profiling job.
  • CRAWLED. Profiling was not run or the profiling results were not written to the repository. In this case, Waterline Data Inventory will reprofile the file the next time the directory is included in a profiling job. Note that collections always have a status of "CRAWLED"; the individual files that make up the collection show specific profile status.
Optimizing profiling performance
In terms of performance optimization, profiling breaks into two areas to consider: MapReduce operations that occur on the cluster's data nodes, and writing profiling data to the Waterline Data Inventory repository on the edge node. Performance in these areas depends less on the size of the cluster data than on the number of columns in the cluster data. That is, a 2GB file with 30 columns will profile faster and take up less space in the repository than a 2GB file with 300 columns.
MapReduce job performance controls
The important factors in MapReduce performance are the number of CPUs available across the cluster and the amount of memory available on each node. In both cases, more is better. Tuning Waterline Data Inventory to run on your cluster is like tuning any other MapReduce operation: you want to make sure that the volume of data being processed and the number of processes running at one time fit within the resources available on the cluster.
As part of the cluster configuration (outside Waterline Data Inventory), configure Hadoop parameters based on the data node hardware configuration:
• Memory allocated for map tasks
• Memory allocated for reduce tasks
• Java heap space available
Once these parameters are in place, Waterline Data Inventory gives you the ability to control the number of map and reduce tasks started by Waterline Data Inventory MapReduce jobs. These numbers are bound by the number of CPUs available for processing. Within that limit, choose the number of map and reduce tasks based on the shape of the data you are processing, to keep the size of data each task processes more or less constant. Increase the maximum number of map or reduce tasks when processing many small files (more columns overall); decrease the number of map and reduce tasks when processing fewer, larger files (fewer columns overall).
Assuming Waterline Data Inventory is the only task running on the cluster (an unlikely assumption!), start with the maximum number of map tasks at 75% of the number of CPUs across the cluster and the maximum number of reduce tasks at 50% of the number of CPUs. These numbers can add up to more than 100% because it's unlikely that both mappers and reducers will reach their maximum limits at the same time for a given job.
By default, Waterline Data Inventory triggers MapReduce jobs sequentially. The configured number of map and reduce tasks applies to each job. If you have the resources, you can change Waterline Data Inventory's behavior to run jobs in parallel; you may need to reduce the number of map or reduce tasks to stay within your cluster's resources, as the maximum number of map tasks applies per job.
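For example, on a hypothetical cluster with 10 data nodes of 16 CPUs each (160 CPUs total), these starting points work out to a maximum of 120 map tasks (75% of 160) and 80 reduce tasks (50% of 160); the node and CPU counts here are illustrative, not a recommendation.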
Repository writing performance controls
The most important factor in optimizing the performance of writing profiling results to the Waterline Data Inventory repository is the number of input and output operations per second (IOPS). Profiling results increase in size based on the number of columns profiled, and this produces a lot of data to move from HDFS to the edge node. The second most important factor is the efficiency of the repository database itself. While Waterline Data Inventory ships with embedded Derby configured as the repository, you can significantly improve performance in this area by upgrading to a multithreaded database.
There are two additional, related parameters to consider to ensure you get the best possible performance during the write to the repository. If processes are running out of memory while writing to the repository (post-processing operations after the MapReduce jobs have completed), you can adjust these parameters:
• Heap available for reading from HDFS (client operation). Restricted by the amount of memory available on the edge node. This is set to 4 GB by default in the waterline script, in the HADOOP_HEAPSIZE setting:
/waterlinedata/bin/waterline
• Number of reducers. If you adjust the client operation memory and still run out of memory writing to the repository, you can increase the maximum number of reduce tasks available to Waterline Data Inventory jobs so that the volume of data produced by each reduce task is smaller.
Supporting self-service users
Waterline Data Inventory is designed to enhance the ability of users of Hadoop data to find the right data in Hadoop. It endeavors to open Hadoop to these users while reducing the burden on IT of providing that access, and while maintaining control over secure and sensitive data. To achieve this balance of better data tools for end-users and a secure, controlled data environment, administrators configure end-user access to Waterline Data Inventory in the following ways:
• Secure access. Users of the Waterline Data Inventory browser application need to have accounts that can access the cluster, whether through Linux or through an authentication system running on Linux such as Kerberos. Waterline Data Inventory fully supports Kerberos-based single sign-on; if a user is already authenticated, no additional login is required to access the web application.
• HDFS and MapR-FS navigation. If users have a matching account in HDFS, their browsing home in Waterline Data Inventory will be their HDFS home directory. If the end-users of your organization's cluster data do not have accounts in HDFS, you can configure Waterline Data Inventory to open at a set location in HDFS. See Configuring additional Waterline Data Inventory functionality (page 53).
• Hive table creation. Waterline Data Inventory integrates with Hive in two ways: it reads Hive tables as part of profiling the cluster, and it creates Hive tables from HDFS files upon user request. This second method provides a gateway for data users to act on files they identify using Waterline Data Inventory: users can request a file be
copied into a Hive table, then access the Hive database from visualization, reporting, and analytic tools outside the cluster.
Configuring web browsers for use with Kerberos
After users' computers are configured for Kerberos authentication, browsers may require additional configuration to support using SPNEGO-Kerberos for user authentication. The following resources should help you find the best way to ensure your users can access Waterline Data Inventory's browser application seamlessly:

Firefox
See "Integrated Authentication" (developer.mozilla.org/en-US/docs/Integrated_Authentication). We've found that these instructions work in our test environment:
1. Install the Firefox extension "Integrated Authentication for Firefox" (addons.mozilla.org/en-us/firefox/addon/integrated-auth-for-firefox/).
2. Inside Firefox, open Tools > Integrated Authentication Site.
3. Enter the host name where the Jetty web server is running.
Restarting the browser is not required.

Chrome
See "Activating Kerberos Support" (support.google.com/chrome/a/answer/187202?hl=en). We've found that these instructions work in our test environment:
1. Exit your Chrome browser.
2. Add the host name for the computer where Waterline Data Inventory's Jetty web server is running to the browser's list of accepted sites.
(OS X) Add the host name in ~/Library/Application Support/Google/Chrome/Local State:
"auth": {
   "server_whitelist" : ""
},
The server_whitelist can accept a comma-separated list of host names or patterns such as *example.com.
(Windows) Add the host name to the list of computers in the Local Intranet security zone. From the control panel, open Internet Options > Security and select "Local intranet". Then open Sites > Advanced and add the web server host name to the zone.

Internet Explorer
See "Kerberos authentication and troubleshooting delegation issues" (support.microsoft.com/en-us/kb/907272).
Safari
No configuration needed.
Swapping out Derby for MySQL
Out of the box, Waterline Data Inventory runs an embedded Derby database instance as its repository of profiling and annotation data. The following instructions describe how to replace Derby with MySQL to persist Waterline Data Inventory metadata. The process involves two steps:
• Set up a MySQL database with a user dedicated to Waterline Data Inventory operations
• Configure Waterline Data Inventory properties to point to the MySQL instance and database
These steps assume you have an installed instance of MySQL already running on your cluster, such as the instance used by the Hive metastore.
To swap out Derby for MySQL:
1. Sign in to the running MySQL instance as DBA and create a user dedicated to Waterline Data Inventory operations. For example, create the "waterlinedata" user:
mysql> CREATE USER 'waterlinedata' IDENTIFIED BY 'waterlinedata';
2. Create a MySQL database "waterlinedatastore".
mysql> CREATE DATABASE waterlinedatastore;
3. Switch to the newly created waterlinedatastore database and execute the following grants, replacing the host placeholder with the host name for the node where Waterline Data Inventory is running.
mysql> use waterlinedatastore;
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinedata'@'%' IDENTIFIED BY 'waterlinedata';
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinedata'@'' IDENTIFIED BY 'waterlinedata';
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinedata'@'localhost' IDENTIFIED BY 'waterlinedata';
mysql> flush privileges;
mysql> GRANT all ON waterlinedatastore.* TO 'waterlinedata'@'localhost' IDENTIFIED BY 'waterlinedata';
mysql> GRANT all ON waterlinedatastore.* TO 'waterlinedata'@'' IDENTIFIED BY 'waterlinedata';
mysql> GRANT all ON waterlinedatastore.* TO 'waterlinedata'@'%' IDENTIFIED BY 'waterlinedata';
mysql> flush privileges;
4. Use the parameters from the previous steps to edit /waterlinedata/lib/resources/environment.properties:
javax.persistence.jdbc.driver=com.mysql.jdbc.Driver
javax.persistence.jdbc.url=jdbc:mysql://:3306/waterlinedatastore?createDatabaseIfNotExist=true
javax.persistence.jdbc.user=waterlinedata
javax.persistence.jdbc.password=
Configuring additional Waterline Data Inventory functionality
Waterline Data Inventory provides a number of configuration settings and integration interfaces to enable extended functionality. These settings are managed as properties in properties files in /lib/resources.
Communication among Hadoop components
The following configuration properties identify how Waterline Data Inventory components communicate with Hadoop and other applications in the Hadoop environment.

Communication between Waterline Data Inventory and Hadoop
This property identifies the location of the cluster that the Waterline Data Inventory browser application accesses. If you are installing Waterline Data Inventory on an existing cluster (rather than in a pre-configured VM), you'll need to set this value.
[environment.properties file]
waterlinedata.crawler.fs.uri=maprfs:/// (example)
waterlinedata.crawler.fs.uri=hdfs://sandbox.hortonworks.com:8020 (example)
Communication between Jetty and Hadoop
The Waterline Data Inventory embedded web server, Jetty, communicates directly with HDFS or MapR-FS as well as with the repository. By default, Jetty uses the native Java API to retrieve data from HDFS. Waterline Data Inventory provides a configuration property to enable WebHDFS so you can access HDFS from a remote location.
[webapp.properties file]
waterlinedata.usewebhdfs=false (default)
waterlinedata.webhdfs.uri=
For example:
waterlinedata.usewebhdfs=true
waterlinedata.webhdfs.uri=webhdfs://sandbox.hortonworks.com:50070/ (example)
Communication between Waterline Data Inventory and Hive
Waterline Data Inventory can read and write data to Hive. If you are installing Waterline Data Inventory on an existing cluster, you'll need to set this value to enable the Hive functionality.
The following property describes the Hive connection. It is shown here with example entries for SSH authentication to a server on the same computer where Waterline Data Inventory is installed:
[environment.properties file]
waterlinedata.hiveurl=jdbc:hive2://localhost:10000/default
Communication between Waterline Data Inventory and Derby
Waterline Data Inventory includes embedded Derby as its repository database. Both Waterline Data Inventory jobs and the web server access Derby using the following connection information. You won't need to change this information unless you are replacing Derby with another database, you need to change the default port selection, or you want to change the default password. The values shown here are examples:
[environment.properties file]
javax.persistence.jdbc.driver=org.apache.derby.jdbc.ClientDriver
javax.persistence.jdbc.url=jdbc:derby://sandbox.hortonworks.com:4444/waterlinedatastore;create=true
javax.persistence.jdbc.user=waterlinedata
javax.persistence.jdbc.password=
When security is not a factor, you can insert Derby credentials in plain text; however, Waterline Data Inventory provides a utility to obfuscate stored passwords, as described in Obscuring passwords in Waterline Data Inventory configuration files (page 65).

Changing the default Derby communication port
By default, Waterline Data Inventory's instance of Derby communicates on port 4444. If you need to change that port number to avoid a conflict with another Hadoop process, stop Derby and Jetty, then update the port number in the following locations:
1. Repository configuration (lib/resources/environment.properties), on one line:
javax.persistence.jdbc.url=jdbc:derby://:4444/waterlinedatastore;create=true
2. Derby configuration (lib/resources/derby.properties):
derby.drda.portNumber=4444
3. Environment configuration (bin/detectenv):
DERBY_PORT=4444
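For example, a blunt sketch of moving Derby from 4444 to 4445 across all three files (4445 is an arbitrary free port; review each file afterward rather than trusting a blind replace, in case 4444 appears in unrelated settings):
$ cd /opt/waterlinedata
$ sed -i 's/4444/4445/g' lib/resources/environment.properties lib/resources/derby.properties bin/detectenv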
Setting the location and persistence of temporary files
The following configuration properties allow you to specify the location of staging files that Waterline Data Inventory creates when it collects profiling information from HDFS files and Hive tables. An additional property defines where Waterline Data Inventory stages temporary files during discovery processes running on the local edge node.

Temporary files for HDFS and Hive profiling
Use this configuration property to identify the HDFS (or MapR-FS) directory Waterline Data Inventory uses when it needs to generate temporary files while profiling HDFS files or Hive tables. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, this property is commented out and temporary files are placed in .waterlinedata in the first directory profiled on the cluster. When you set this property, make sure to remove the comment mark at the beginning of the line. When profiling Hive tables, you can override this value by specifying a staging location on the command line.
[environment.properties file]
waterlinedata.profile.processingdirectory=
Backing files for Hive tables
A similar property controls the location of file copies created when users create Hive tables from Waterline Data Inventory. Only some file format types require copies. See Hive table backing file location, page 62.

Staging area for discovery tasks
This property indicates the local file system directory Waterline Data Inventory uses to store temporary files created during discovery processing. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, this value is set to /tmp.
[environment.properties file]
waterlinedata.temproot=
Starting the web server in a Kerberos environment
Because there can be more than one user on the edge node with valid Kerberos credentials, Waterline Data Inventory needs to know the keytab and username for the dedicated Waterline Data Inventory user. Otherwise, the web server attempts to start using the first user information provided as the "current user." To identify the keytab location and Kerberos principal for the dedicated Waterline Data Inventory user, set the following properties:
[environment.properties file]
waterlinedata.web.kerberos.keytab.location=
waterlinedata.web.kerberos.username=
For example, "
[email protected]"
Secure communication between browser and web server (SSL)
You can configure Waterline Data Inventory to use SSL to communicate between the client where the browser is running and the web server. This setup requires:
• A server X.509 certificate for the external web server address. This can be a real RSA, VeriSign, or similar certificate or a self-signed certificate.
• A secure keystore inside Waterline Data Inventory's Jetty web server distribution.
The Jetty documentation provides instructions for generating a self-signed certificate and for creating and loading keystore values:
www.eclipse.org/jetty/documentation/current/configuring-ssl.html#generating-key-pairs-and-certificates
The Waterline Data Inventory Jetty configuration is included in the following directory:
/waterlinedata/jetty-distribution-*/waterlinedata-base
Configuration files include:

Component   Configuration File Location
Keystore    etc/keystore
HTTPS       start.d/https.ini
SSL         start.d/ssl.ini
Browser app functionality
The following sections describe the properties used to control aspects of the Waterline Data Inventory browser application.

Self-service browsing
If your end-users have accounts in HDFS or MapR-FS and corresponding home directories, Waterline Data Inventory uses those directories as the users' home in the browser application: clicking "Browse" in Waterline Data Inventory opens the HDFS directory corresponding to the current user. If your end-users do not have accounts in HDFS, Waterline Data Inventory defaults to the HDFS root directory. To improve end-users' experience, consider setting the home directory each user sees when they open Waterline Data Inventory. Set the HDFS directory path in the following property:
[webapp.properties file]
waterlinedata.defaultdirectory=
For example:
waterlinedata.defaultdirectory=/user/waterlinedata/Landing
Event auditing
For folders, files, tags, lineage relationships, and origins, Waterline Data Inventory collects the events that occur to each object. For example, Waterline Data Inventory records when a tag was created and when and by whom it was associated with a file or a field. Collecting this information has a small performance impact on the browser application and increases the size of the repository. You can keep Waterline Data Inventory from collecting new events by setting the following property to false:
[waterlinedata.properties file]
waterlinedata.auditing.enabled=true (default)
By default, Waterline Data Inventory caches information for pages viewed through the web application. You can control the length of time objects are cached (timeout) and the number of objects cached. The default timeout is set to ensure that the web application does not have to query the server each time the same file is viewed in a user's process of evaluating the file. The number of objects cached refers to items the server supplies to populate the browser interface; "objects" does not correspond to "files" or "tables". We recommend that you keep the default values unless you are working with Waterline Data Technical Support to solve a specific issue.
[webapp.properties file]
waterlinedata.web.cache.enable=true
waterlinedata.web.cache.size=1000
waterlinedata.web.cache.timeout=100
Browser timeout
Waterline Data Inventory automatically signs users out of the browser application after 30 minutes. To change this default, edit /waterlinedata/jetty-distribution-*/waterlinedata-base/etc/webdefault.xml and add or update the following section:

<session-config>
    <session-timeout>30</session-timeout>
</session-config>
To remove any timeout, change this setting to -1.
Profiling functionality
You have control over many aspects of profiling using properties configured in the profiler.properties file:
• Setting persistence of temporary files (page 58)
• Using samples to calculate data metrics (page 58)
• Re-profiling existing files versus profiling only new and changed files (page 58)
• Controlling the number of map tasks used per MapReduce job (page 59)
• Controlling the number of reduce tasks used per MapReduce job (page 59)
• Running MapReduce jobs in parallel (page 60)
• Configuring additional date formats (page 60)
• Identifying field separators (page 60)
• Controlling most frequent data values (page 61)
Setting persistence of temporary files
The following configuration property lets you keep staging files in place between profiling runs for debugging purposes. It is true by default, meaning temporary files are deleted after profiling is complete; set it to false to retain them.
[profiler.properties file]
waterlinedata.deletetempfiles=true (default)
Using samples to calculate data metrics
By default, Waterline Data Inventory uses all data in a file to calculate field-level metrics such as the minimum and maximum values, the cardinality and density of the values, and the most frequent values. You can achieve better profiling performance on very large files by sampling the file data for these operations. When sampling is enabled, Waterline Data Inventory reads the first and last blocks in the file and enough other blocks to reach the sample fraction you specify. For example, with a sample fraction of 10% and a 4096 KB block size, Waterline Data Inventory reads 6 of a 250 MB file's 63 blocks: the first block, the last block, and 4 additional blocks chosen at random.
[profiler.properties file]
waterlinedata.profile.sampled=false (by default)
waterlinedata.profile.sampled.fraction=0.1 (by default)
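For example, to enable sampling at 25% (an illustrative fraction, not a recommended value), you would set:

[profiler.properties file]
waterlinedata.profile.sampled=true
waterlinedata.profile.sampled.fraction=0.25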
Re-profiling existing files versus profiling only new and changed files
By default, Waterline Data Inventory profiles only new files or files that have changed since the last profiling job. Change the following property to false to re-profile all files in the target directory. You might choose to do this if you add date formats (see page 60) or change other parameters that affect the profiling data collected.
[profiler.properties file]
waterlinedata.incremental=true (by default)
The block size in your cluster is configurable. If your block size is large relative to the size of your data files, it may not make sense for you to enable sampling. To determine your cluster's block size, see the following configurations:

Distribution   Configuration Parameter   Location          Default Value
CDH 5.x        dfs.blocksize             hdfs-site.xml     128 MB
HDP 2.x        dfs.blocksize             hdfs-site.xml     128 MB
MapR 4.x       ChunkSize                 .dfs_attributes   256 MB
For more information, see:
• CDH: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
• HDP: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/ds_Hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
• MapR: http://doc.mapr.com/display/MapR/Chunk+Size
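On HDFS-based clusters you can also query the effective block size directly with the standard hdfs CLI; the value is reported in bytes:

hdfs getconf -confKey dfs.blocksize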
Controlling the number of map tasks used per MapReduce job
You can limit the number of mappers Waterline Data Inventory generates per profiling job. Consider setting a mapper limit when you are profiling many small files; by default, the ability to combine multiple files into a single mapper is enabled, with mappers limited to 999 per job. To control the number of mappers per job, set the following properties in waterlinedata/lib/resources/profiler.properties:
waterlinedata.profile.combinedmapper=true
waterlinedata.profile.combined.max_mappers_per_job=
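For example, to cap profiling jobs at 200 mappers (an illustrative value, not a recommendation):

waterlinedata.profile.combinedmapper=true
waterlinedata.profile.combined.max_mappers_per_job=200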
Controlling the number of reduce tasks used per MapReduce job
Waterline Data Inventory lets you configure the maximum number of reducers used by MapReduce profiling jobs. Consider adjusting this value if jobs are running out of memory during the reduce tasks of Waterline Data Inventory MapReduce jobs. We recommend keeping this value relatively small: smaller than the number of files typically processed by a profiling job and smaller than the number of map tasks used. The option is set to 5 reduce tasks by default.
[profiler.properties file]
waterlinedata.profile.reducer.count=5
Running MapReduce jobs in parallel
By default, Waterline Data Inventory runs MapReduce profiling jobs one after the other. If you have the cluster resources to run jobs in parallel, or if you are controlling the resources used through YARN or other resource management tools, consider changing this behavior so that Waterline Data Inventory can trigger more than one MapReduce job at the same time. The option is set to true by default; set it to false to allow parallel jobs.
[profiler.properties file]
waterlinedata.runjobsinseq=true
Configuring additional date formats
When Waterline Data Inventory profiles string data, such as delimited files where no type information is available, it examines the data to infer likely data types. It uses the format conventions described by the International Components for Unicode (ICU) project for dates and numeric values. You can add your own date formats using the conventions described here:
icu-project.org/apiref/icu4j/com/ibm/icu/text/SimpleDateFormat.html
The pre-defined formats are listed in the profiler properties file.
[profiler.properties file]
waterlinedata.profile.datetime.formats=EE MMM dd HH:mm:ss ZZZ yyyy, M/d/yy HH:mm, EEE MMM d h:m:s z yy, yy-MM-dd hh:mm:ss ZZZZZ, yy-MM-dd,yy-MM-dd HH:mm:ss,yy/M/dd,M/d/yy hh:mm:ss a, YYYY-MM-dd'T'HH:mm:ss.SSSSSSSxxx
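For example, to recognize European-style day-first dates, you might append an ICU pattern to the end of the list; the dd.MM.yyyy pattern below is an illustrative addition, not a shipped default:

waterlinedata.profile.datetime.formats=<existing formats>,dd.MM.yyyy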
Identifying field separators
Waterline Data Inventory parses flat files, such as comma-separated or log files, to determine field separators, looking for characters that are repeated within each row of the file. If it finds more than one candidate for a field delimiter, it ranks the choices based on the number of occurrences of each character in the file and uses the highest-ranked candidate. You can tell Waterline Data Inventory to remove some characters from consideration as field delimiters. A number of characters are not considered as delimiters by default; you may find that you need to remove characters from this configuration to correctly parse your data.
[profiler.properties file]
waterlinedata.profile.format.discovery.non_separators="+-.\\/\"`()[]{}'"
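For example, if your files use the plus sign as a field delimiter, you would delete it from the exclusion list so it can be considered as a candidate — a hypothetical sketch:

waterlinedata.profile.format.discovery.non_separators="-.\\/\"`()[]{}'"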
To include special characters such as tabs, follow the Java conventions for escape sequences described here:
docs.oracle.com/javase/tutorial/java/data/characters.html

Controlling most frequent data values
Waterline Data Inventory collects the 2000 most frequent values in each field of each file. You can change the number of values collected, control how many characters are included in each sample, and set how many of these values are used in search indexes and to propagate tags.
[profiler.properties file]
Number of most frequent values collected:
waterlinedata.profile.top_k_capacity=2000 (by default)
Size limit of strings:
waterlinedata.max.top_k_length=128 (by default)

Number of most frequent values used in search indexes and UI lists:
waterlinedata.profile.top_k=50 (by default)

Number of most frequent values used to determine tag association matches:
waterlinedata.profile.top_k_tokens=100 (by default)
Hive functionality
The following properties control interaction with Hive. For Hive connection information, see Communication between Waterline Data Inventory and Hive (page 53).

Hive table profiling
By default, Waterline Data Inventory does not profile Hive tables: from the Hive root in the browser application, users will see Hive tables, but schema-level details for the tables are not available. You can profile Hive tables using the "profileHive" script command (see page 42).

Always profile Hive tables
To include Hive table profiling in all Waterline Data Inventory profiling jobs, set the following option to 'true'. This option is not needed if you use the "profileHive" and "profileHiveOnly" commands: these commands override the value of this property.
[profiler.properties file]
waterlinedata.profilehive=false (default)
Clear deleted Hive tables
By default, when profiling Hive tables, Waterline Data Inventory reviews the tables in the database to ensure that the data they are based on still exists. If the backing files for a given table have been deleted, Waterline Data Inventory clears out the table. You can turn off this check; doing so reduces the overall profiling time by a small amount.
[profiler.properties file]
waterlinedata.cleanorphanedhivetables=true
Hive table backing file location
When users create Hive tables from ORC, RC, and Sequence files, Waterline Data Inventory creates a copy of the data in the HDFS (or MapR-FS) directory specified by this property and creates the Hive table from the copied file or files. The browser application includes links between the backing file and the Hive table. If users create Hive tables from text, JSON, or log files or collections, Waterline Data Inventory does not create a copy of the file before creating the Hive table. By default, this property is commented out and the backing files are placed in the active user's home directory. An additional property indicates whether Waterline Data Inventory should always make copies of the original data or build a Hive table from the original file when it can. Consider disabling creating Hive tables in place when users are unlikely to have write permission to the directory in which the original HDFS file is located.
[environment.properties file]
waterlinedata.profile.hivedir=
waterlinedata.hive.create_table_in_place=true
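For example, to stage all backing files under a shared directory and always make copies rather than building tables in place (the path is illustrative):

[environment.properties file]
waterlinedata.profile.hivedir=/user/waterlinedata/hive_backing
waterlinedata.hive.create_table_in_place=false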
Discovery functionality
The following properties control how Waterline Data Inventory makes suggestions for lineage relationships among files and for tag associations. Note that from the tag glossary you can disable tag propagation for individual tags, including built-in tags.

Data type discovery
When Waterline Data Inventory profiles data that does not have type information, it reads field values to determine data types. Use this property to disable data type discovery (-1), to use all field values to determine data types (0), or to limit data type discovery to the most frequent values (1, the default) as identified by the profiling property waterlinedata.profile.top_k_capacity (page 61).
[discovery.properties file]
waterlinedata.profile.data_format_discovery=1
Balancing profiling performance against data quality calculations
Waterline Data Inventory calculates cardinality and selectivity for each field in each file profiled. In addition, it collects a sample of the most frequent values in each field. Use this parameter to reduce the amount of time Waterline Data Inventory spends during profiling making the sample lists accurate. By default, this optimization is disabled.
[discovery.properties file]
waterlinedata.profile.high_cardinality.optimization=false (default)
Thresholds for what tag suggestions are exposed
Waterline Data Inventory has default values set for all field-level tag propagation as follows. Some of these values can be configured individually for each tag from the Glossary.
[discovery.properties file]
Waterline Data Inventory gives a weight to its suggestions for matching tag associations. You can choose to expose more or fewer of these suggestions by configuring the cutoff weight; tag associations whose calculated weight is below this value are not exposed to users. You can set this value per tag from the Glossary.
waterlinedata.discovery.tolerance.weight=40.0 (by default)

Limit on the number of pre-defined tags that will be suggested for a given field:
waterlinedata.discovery.tags.max_suggested_ref_tables=3

Limit on the number of any tags that will be suggested for a given field:
waterlinedata.discovery.tags.max_suggested=3

Eliminating weak associations: if more than one tag is suggested for a field, the tag with the highest weight is suggested; other tags must be within this value of the top tag's weight to be suggested in addition to the top tag.
waterlinedata.discovery.tags.value_hit_diff=20.0
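For instance, to surface more (possibly noisier) tag suggestions, you might lower the global cutoff; 25.0 here is an illustrative value, not a recommendation:

waterlinedata.discovery.tolerance.weight=25.0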
Tag association for low-cardinality data
When fields have low cardinality (the same values appear many times in the field for the file), tag propagation can be skewed toward making connections that are not representative of the data. Waterline Data Inventory provides some tools to help you avoid false-positive tag associations among fields with low cardinality.

Conventions for indicating missing values
One common case where low-cardinality values cause unexpected tag associations is when the data includes one or more values that indicate the absence of a value. For example, if data uses a convention of "not available" or "NA" to mark places where values are not provided, this value may be mistakenly considered to be related to other data that also uses "not available" or "NA", even though the other values in the data are unrelated. Waterline Data Inventory provides a blacklist of values that are ignored when making low-cardinality matches. You can modify this comma-separated list to meet the requirements of your data, including providing localized versions of these indicators. Note that you should include values in lower case, as all field values are converted to lower case when matches are calculated.
[discovery.properties file]
waterlinedata.discovery.tags.null_values=na,n/a,unspecified,not available,null,empty,blank,missing
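For example, a German-language deployment might append localized missing-value markers (the added values are illustrative; note they are in lower case):

waterlinedata.discovery.tags.null_values=na,n/a,unspecified,not available,null,empty,blank,missing,unbekannt,nicht verfügbar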
Tag propagation among low-cardinality values
For low-cardinality fields (few distinct values among all field values), Waterline Data Inventory requires 100% of the values in the candidate field to match for a tag to be associated with that field. By default, "low cardinality" fields are fields with two or fewer distinct values. To require this full match for candidate fields with more distinct values, change the following option to a larger number.
[discovery.properties file]
waterlinedata.discovery.tags.min_cardinality.partial_match=2
Tag association using tag rules
Some built-in tags, and tags defined by users, can have tagging rules that use regular expressions to identify field data that should be associated with the tag. Use this property to disable evaluating tagging rules (-1), to use all field values to identify matches with tagging rules (0), or to limit tagging rule evaluation to the most frequent values (1, the default) as identified by the profiling property waterlinedata.profile.top_k_capacity (page 61).
[discovery.properties file]
waterlinedata.profile.regex_evaluation=1
Controlling collections discovery
By default, Waterline Data Inventory only considers folders with 3 or more files (in any one folder of a recursive tree) to be candidates for a collection. You can adjust this value to better reflect the organization of your cluster. Note that there are other qualifications that must be met before the files in a folder are marked as a collection.
[discovery.properties file]
waterlinedata.discovery.smallest.collection.size=3 (by default)
Controlling lineage relationship discovery
When reviewing files for lineage relationships, Waterline Data Inventory can tolerate a number of changes to file schemas and data and still find a connection among files. These properties control the parameters used to determine a lineage relationship.

The amount of overlapping data between fields required to consider the files matching:
waterlinedata.discovery.lineage.overlap=0.9 (by default)
If multiple fields from one resource match the fields of another resource, Waterline Data Inventory uses field names to determine whether the fields match. This mechanism is used only if the field names are similar within the percentage indicated by this property, 0.8 (80%) by default.
waterlinedata.discovery.lineage.field_name_match=0.8
Use the HDFS last-access date to limit lineage relationship candidates. The HDFS property dfs.namenode.accesstime.precision in hdfs-site.xml must be enabled. (Note that there is no provision for tracking access time in MapR.)
waterlinedata.discovery.lineage.use_access_time_filter=true
Limit the time between access of a parent file and creation of a child. This criterion is ignored (no time checking) if set to 0.
waterlinedata.discovery.lineage.batch_window_hours=24
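Putting these together, a sketch of a stricter lineage configuration (the values are illustrative, not recommendations):

waterlinedata.discovery.lineage.overlap=0.95
waterlinedata.discovery.lineage.field_name_match=0.9
waterlinedata.discovery.lineage.use_access_time_filter=true
waterlinedata.discovery.lineage.batch_window_hours=12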
Obscuring passwords in Waterline Data Inventory configuration files
To convert passwords to obfuscated values, run the following command, provide the Hive password when prompted, then insert the output in the appropriate resource file:
/waterlinedata/bin/obfuscate
The output is also saved as obfuscate.out.
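A hypothetical session might look like the following; the prompt text and the obfuscated output shown are illustrative, as only the command itself and the obfuscate.out file are documented behavior:

$ /waterlinedata/bin/obfuscate
Enter password: *****
OBF:1v2j1uum1xtv1zej...
$ cat obfuscate.out
OBF:1v2j1uum1xtv1zej...

Paste the obfuscated value in place of the clear-text Hive password in the appropriate resource file.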