Waterline Data Inventory
Installation and Administration Guide
Product Version 1.2.5
Document Version 1.9

© 2014-2015 Waterline Data, Inc. All rights reserved.


 

Table of Contents

Related Documents .... 5
System requirements .... 5
  Hadoop compatibility .... 5
  Edge node minimum requirements .... 5
  Database configuration .... 6
  Kerberos compatibility .... 6
  Browser compatibility .... 6
  Multi-byte support .... 6
Waterline Data Inventory connections and access .... 7
  Profiling HDFS files .... 7
  Browsing HDFS files .... 8
  Profiling Hive tables .... 9
  Browsing and Creating Hive tables .... 10
Installing Data Inventory: Quick Start .... 11
Installing Data Inventory .... 14
  1. Choose an installation location .... 14
  2. Validate Hadoop configuration .... 14
  3. Configure a dedicated user .... 18
  4. Download and extract Waterline Data Inventory .... 19
  5. Run configuration scripts .... 20
  6. Configure Waterline Data Inventory for your cluster .... 23
Upgrading Waterline Data Inventory .... 25
Integrating with user management systems .... 27
  Waterline Data Inventory user authentication settings .... 27
  SSH configuration .... 27
  User access configuration for public cloud clusters .... 27
  Kerberos configuration .... 29
Improve security among Waterline Data Inventory components .... 32
  Securing internal passwords .... 33
  Encrypting a Derby repository .... 33
  Configuring access using Hadoop security: Ranger or Sentry .... 35
Starting Waterline Data Inventory .... 37
Running Waterline Data Inventory jobs .... 40
  Command summary .... 40
  Full profiling and discovery against HDFS files .... 42
  Profiling only for HDFS files .... 42
  Lineage discovery .... 43
  Collection discovery .... 43
  Origin propagation only .... 43
  Tag propagation only .... 44
  Evaluating tag rules .... 44
  Full profiling and discovery against Hive tables .... 44
  Profiling only for Hive tables .... 45
  Displaying version information .... 45
Monitoring Waterline Data Inventory jobs .... 46
  Monitoring Hadoop jobs .... 46
  Monitoring local jobs .... 47
  Debugging information .... 47
  Profiling results .... 48
Optimizing profiling performance .... 48
  MapReduce job performance controls .... 49
  Repository writing performance controls .... 49
Supporting self-service users .... 50
  Configuring web browsers for use with Kerberos .... 51
Swapping out Derby for MySQL .... 52
Configuring additional Waterline Data Inventory functionality .... 53
  Communication among Hadoop components .... 53
  Setting the location and persistence of temporary files .... 55
  Starting the web server in a Kerberos environment .... 55
  Secure communication between browser and web server (SSL) .... 56
  Browser app functionality .... 56
  Profiling functionality .... 58
  Hive functionality .... 61
  Discovery functionality .... 62
  Obscuring passwords in Waterline Data Inventory configuration files .... 65


Waterline Data Inventory reveals information about the metadata and data quality of files in a Hadoop cluster so the users of the data can identify the files they need for analysis and downstream processing. The application installs on an edge node in the cluster and runs MapReduce jobs to collect data and metadata from files in HDFS (or MapR-FS) and Hive. It then discovers relationships and patterns in the profiled data and stores the results in its metadata repository. A browser application lets users search, browse, and tag HDFS files and Hive tables using the benefits of the collected metadata and Data Inventory's discovered relationships.

This document describes the process of installing Waterline Data Inventory on a Hadoop cluster.

Related Documents

• Waterline Data Inventory Sandbox. Available for CDH, HDP, and MapR, and as images for VirtualBox and VMware.

• Waterline Data Inventory User Guide, available from the menu in the browser application and in the /docs directory in the installation.

For the most recent documentation and product tutorials, sign in to the Waterline Data community support site, support.waterlinedata.com.

System requirements

Waterline Data Inventory runs on an edge node in a Hadoop cluster. The following specifications describe Data Inventory's platform compatibility and the minimum requirements for the edge node.

Hadoop compatibility

• Cloudera CDH 5.x
• Hortonworks HDP 2.1, 2.2
• MapR 4.0, 4.1

In addition, reading Hive tables created in Waterline Data Inventory requires Hive 0.13 or later. All of the supported distributions except CDH 5.1 have this support.

The edge node on which Waterline Data Inventory is installed needs to have the Hadoop and Hive clients required to access the Hadoop namenode.
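To confirm that the edge node's clients meet these requirements, you can query their versions directly. A minimal check, assuming the hadoop and hive client commands are on the PATH (the --version option is available in recent Hive clients):

$ hadoop version      # distribution and Hadoop version
$ hive --version      # Hive client version; should report 0.13 or later
$ java -version       # JDK version; should report 1.7.x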

Edge node minimum requirements

Optimizing input/output operations per second (IOPS) on the edge node is the most important factor in providing the best performance for Waterline Data Inventory operations. Provisioning a higher-IOPS disk can reduce the overall profiling time significantly. For example, going from 3,000 IOPS to 10,000 IOPS can improve performance by 1.5 times.

• Two to four 500 GB disks; the faster the disks, the better
• 2 quad-core CPUs, running at least 2-2.5 GHz
• 32 GB of RAM
• Bonded Gigabit Ethernet or 10 Gigabit Ethernet
• JDK version 1.7.x

Database configuration

The speed of the repository database is an important component of the overall performance of Waterline Data Inventory operations.

Waterline Data Inventory works with MySQL and Derby databases. It is shipped with embedded Derby by default. This document provides instructions to configure Waterline Data Inventory to work with MySQL (page 52). To configure Waterline Data Inventory to work with other relational databases that support JDBC connectivity, contact [email protected].

Kerberos compatibility

This release is compatible with Kerberos version 5.

Browser compatibility

Waterline Data Inventory supports the following browsers. If your cluster uses Kerberos, be sure to configure Kerberos support in end-users' browsers:

• Microsoft Internet Explorer 9 and later (not supported on Mac OS)
• Chrome 36 or later
• Firefox 31 or later

Multi-byte support

Waterline Data Inventory handles cluster data transparently: assuming the data is stored in formats that Waterline Data Inventory reads, the application doesn't enforce any additional limitations beyond what Hadoop and its components enforce. That said, there are places where the configuration of your Hadoop environment needs to align with the data you are managing, such as:

• Operating system locale
• Character set supported by the Hive client and server
• Character set supported by the Waterline Data Inventory repository database (Derby, by default) client and server

The Waterline Data Inventory browser application allows users to enter multi-byte characters to annotate HDFS data. Again, where Waterline Data Inventory interfaces with other applications, such as Hive, Waterline Data Inventory enforces the requirements of the integrated application.
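For the first of these alignment points, you can inspect the operating system locale on the edge node. A minimal check, assuming your data requires a UTF-8 locale:

$ locale              # shows the active locale categories
$ echo $LANG          # e.g., en_US.UTF-8 for UTF-8 multi-byte support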

Waterline Data Inventory connections and access

For Waterline Data Inventory to produce an inventory of HDFS, it needs read access to all the files that are included in the inventory. In addition, it needs read access to Hive tables. Waterline Data Inventory uses HDFS to stage the profiling information it collects from HDFS and Hive tables: for the staging directories, Waterline Data Inventory needs write access into HDFS.
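As an illustration of these access requirements, the commands below create a staging directory for the dedicated user and spot-check read access. The /user/waterlinedata/staging path and the use of the hdfs superuser account are assumptions for this sketch, not fixed Waterline Data Inventory paths:

$ sudo -u hdfs hdfs dfs -mkdir -p /user/waterlinedata/staging
$ sudo -u hdfs hdfs dfs -chown waterlinedata:waterlinedata /user/waterlinedata/staging
$ sudo -u waterlinedata hdfs dfs -ls /     # verify read access as the dedicated user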

Profiling HDFS files

To profile HDFS files, Waterline Data Inventory needs two connections configured:

1. HDFS Root Node: Waterline Data Inventory's connection to HDFS for profiling includes:
   • Read access for all HDFS files
   • Write access to staging areas to collect profiling results
2. Repository: The Waterline Data Inventory engine writes profiling and discovery results to a repository on the edge node using the Waterline Data Inventory dedicated user credentials.

[Figure: Configure the HDFS and repository connections to profile HDFS files]


Browsing HDFS files

When data scientists and analysts access HDFS files through Waterline Data Inventory, they see only files and tables that they have permission to view: all file system operations are performed as the signed-in user. The user permissions are established through the operating system permissions or through a Hadoop authentication system such as Kerberos, Ranger, or a combination of Kerberos and Sentry. When running against a Kerberized cluster, Waterline Data Inventory uses impersonation to perform operations with the access available to the current user.

For end-users to browse HDFS files, Waterline Data Inventory needs three connections configured:

1. HDFS Root Node
2. Repository
3. Browser URL pointing to the Waterline Data Inventory web server, combined with user credentials, whether through explicit login or authentication configured for the browser.

[Figure: Configure the Web Server connection for user access]


Profiling Hive tables

To include Hive tables in the inventory, Waterline Data Inventory needs read access to Hive databases as well as read/write access to a staging directory in HDFS where it holds profiling information for Hive tables. This can be the same staging area used for profiling HDFS files.

To profile Hive tables, Waterline Data Inventory needs three connections configured:

1. HDFS Root Node, including write access to an HDFS staging area for profiling results.
2. Repository
3. Hive database access: for Waterline Data Inventory to include Hive tables, it needs read access to each Hive database to be included.
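One way to spot-check the Hive side of these connections is to connect as the dedicated user and list the databases to be inventoried. A hedged example, assuming HiveServer2 on localhost at the default port on a cluster without Kerberos:

$ beeline -u jdbc:hive2://localhost:10000/default -n waterlinedata
0: jdbc:hive2://localhost:10000/default> SHOW DATABASES;
0: jdbc:hive2://localhost:10000/default> !quit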

[Figure: Waterline Data Inventory uses MapReduce to profile Hive tables]


Browsing and Creating Hive tables

Users can create new Hive tables from HDFS files they identify in Waterline Data Inventory.

For end-users to create and browse Hive tables, Waterline Data Inventory needs three connections configured:

1. Repository
2. Browser URL pointing to the Waterline Data Inventory web server, combined with user credentials, whether through explicit login or authentication configured for the browser.
3. Hive database access. Waterline Data Inventory's connection to Hive for browsing includes read access to all Hive databases. To create new Hive tables from HDFS files, Waterline Data Inventory needs write access to the databases where users would expect new tables to appear.

[Figure: Users see the Hive tables they have access to and can create new Hive tables from HDFS files]


Installing Data Inventory: Quick Start

Here's the minimal version for getting Waterline Data Inventory up and running in a development environment. It assumes you have root access to the environment, that the cluster is not secured with Kerberos, and that Hadoop and related services are running and healthy.

For instructions suitable for an enterprise environment, SKIP THIS SECTION and go to Installing Data Inventory (page 14).

1. Create a dedicated Waterline Data user that you'll use for Waterline Data Inventory installation and job commands.
   • From a command window on the installation computer, create a "waterlinedata" user:
     $ useradd waterlinedata
     $ passwd waterlinedata



   • Give the waterlinedata user read access to the files in HDFS or MapR-FS and write access to at least one HDFS location to write profiling results. The access needed may vary depending on the Hadoop distribution. For CDH and HDP, you can add the user to the hdfs group for both read and write access:
     $ usermod -a -G hdfs waterlinedata
     For MapR:
     $ usermod -a -G mapr waterlinedata



   • Grant sudo access for running installation scripts (a sample sudoers entry appears after step 2):
     $ su root
     $ visudo
     Add the waterlinedata user in the User privilege section. After installation, sudo access is no longer needed.

2. Go to the directory where you want to install Waterline Data Inventory and verify that the waterlinedata user has read, write, and execute permissions on the directory. For example, you can use /opt or /usr/lib to mirror typical Hadoop component installs, or /home/waterlinedata for a private installation. These instructions assume /opt/waterlinedata.
   $ cd /opt/waterlinedata
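For the visudo step above, a minimal sudoers entry might look like the following. This exact line is an assumption; your security team may prefer a narrower rule:

waterlinedata ALL=(ALL) ALL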


3. Change to the waterlinedata user and extract Waterline Data Inventory from the TAR file.
   $ su waterlinedata
   $ tar xf <waterlinedata tarball>

4. From the newly created Waterline Data Inventory directory, run an installation script, providing the waterlinedata user password for sudo access when prompted.
   $ cd waterlinedata
   $ bin/postInstall

   This script prompts you for the location of Hive in the Hadoop environment; typically, other Waterline Data Inventory scripts will locate Hive for you, so you can skip this prompt. If you receive an error later, rerun this script and include the Hive location.

   If this script runs successfully, the output shows directories created in /var. If the script reports a problem, address the issue and rerun the script until you get a successful result.

5. Run a script to place Waterline Data Inventory and other third-party JARs where Waterline Data Inventory can use them.

   If Hive runs on a different node or your cluster is not configured to run Hive at all, skip this step and follow the instructions in the detailed installation steps for the hiveSetup script on page 21.

   If HiveServer2 is running on the same node as Waterline Data Inventory, run the following script:
   $ bin/hiveSetup linkAuxLib

   This script identifies the Hive home location and moves JAR files into Hive's auxlib directory (to avoid conflicting with JAR files already in use by Hive). It also creates symbolic links from these files to the Hive lib directory to allow Beeswax and Beeline access to these files. (To skip creating the symbolic links, use "$ bin/hiveSetup".)

   The script may prompt you to allow the auxlib directory to be created and to approve any conflicts should these files already exist in either auxlib or lib.

   If the Hive server is not running, the script will fail to identify the Hive location; to remedy this, do one of the following:
   • Start the Hive server.
   • Rerun postInstall (step 4) and specify the location of the Hive executable.
   • Edit <install dir>/waterlinedata/bin/.hive_home to include the location of the Hive executable.

   If the script reports a problem, address the issue and rerun the script until you get a successful result. If you are not successful running these setup scripts, you can run <install dir>/waterlinedata/bin/detect-env verbose to get more information on where problems are occurring.

6. Configure host name and port numbers where appropriate.

   If you are running Waterline Data Inventory from a VM image or on a single-node cluster, these configuration parameters are already set for you.

   To set or validate the appropriate configuration settings, review the contents of <install dir>/waterlinedata/lib/resources/environment.properties. In particular, insert the fully qualified domain names of the cluster root and the node on which Waterline Data Inventory is running in the following properties:

   • HDFS root: waterlinedata.crawler.fs.uri=<HDFS root URI>
     For example:
     hdfs://sandbox.hortonworks.com:8020
     hdfs://quickstart.cloudera:8020
     maprfs:///

   • Repository node: javax.persistence.jdbc.url=jdbc:derby://<hostname>:4444
     For example:
     jdbc:derby://sandbox.hortonworks.com:4444
     jdbc:derby://quickstart.cloudera:4444
     jdbc:derby://maprdemo:4444

   For a more detailed list of configuration parameters, see 6. Configure Waterline Data Inventory for your cluster (page 23).

The application is now installed and configured. To validate that the installation was successful, see Starting Waterline Data Inventory (page 37).


Installing Data Inventory

This version of the installation instructions includes more details about each step and the decisions involved in configuring Waterline Data Inventory in a unique enterprise environment.

Installing Waterline Data Inventory involves the following decisions and steps, some of which require root access:

• Choose an installation location
• Validate that Hadoop is running and configured properly
• Configure a dedicated Waterline Data Inventory user (requires root access)
• Download and extract Waterline Data Inventory
• Run configuration scripts (requires root access)
• Configure connections to Hadoop and other applications in the Hadoop environment

1. Choose an installation location

Install Waterline Data Inventory in the same way other Hadoop cluster edge node applications are installed. Some clusters use /usr/lib; others use /opt. It can be installed in other locations, such as the home directory for the dedicated Waterline Data user, /home/waterlinedata. Any location you choose requires root access to complete the configuration.

2. Validate Hadoop configuration

Hadoop is a complex system with many overlapping configurations and controls. You can ensure that Waterline Data Inventory will install smoothly if you first validate that the existing Hadoop components are running and communicating properly among themselves. The following steps prepare for Waterline Data Inventory installation by exercising each of the places where Waterline Data Inventory interacts with Hadoop.

1. Identify the host name for the cluster, referred to in this document as <cluster node>.
   Typically, this is the fs.defaultFS parameter in Hadoop's core-site.xml file. For MapR, find the host name for your cluster using:
   cat /opt/mapr/conf/mapr-clusters.conf

2. Ensure that Kerberos is configured for the edge node and for end-user access.
   If your cluster is Kerberized, you'll need a Kerberos administrator's help to install Waterline Data Inventory. Before you bring in your Kerberos admin, you can test these basic operations to make sure the foundation is in place:
   • Make sure the computer you identified as the Waterline Data Inventory installation location (see previous section) is configured with Kerberos:
     $ kinit


   This command prompts for the current user's password. If it does, type anything and exit the command. If it doesn't, this computer is not yet configured with Kerberos. Work with your Kerberos administrator to install Kerberos, add this computer to the Kerberos database, and generate a keytab for this computer as a Kerberos application server.

   • Make sure your browser is configured to use Kerberos to access the cluster. From a browser running on a computer that is not the edge node where you are installing Waterline Data Inventory, sign into a Kerberized cluster component, such as one of the following:

     Hue (CDH, MapR)             http://<cluster node>:8888
     Hue (HDP)                   http://<cluster node>:8000
     Ambari (HDP)                http://<cluster node>:8080
     Cloudera Manager (CDH)      http://<cluster node>:7180
     MapR Control System (MapR)  http://<cluster node>:8443

   If you are not able to sign in, check that:
   • The current user has a valid ticket (run klist from a terminal on the client computer).
   • The browser is configured to use Kerberos when accessing secure sites.
   • A Kerberos KDC is accessible from this computer.
   • The Hadoop service is running.
   • The active user has access to the Hadoop application.
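For the first check in this list, a typical ticket inspection looks like the following; the principal name shown is an example only:

$ klist                      # lists cached tickets, if any
$ kinit jdoe@EXAMPLE.COM     # obtains a new ticket if the cache is empty or expired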

3. Verify that Hadoop components are running.
   You can use the cluster management tool (Ambari, Cloudera Manager, or MapR Control System). If the cluster is not managed using one of these tools, check individual services by running the command line for the component. For example:
   $ hadoop version
   $ beeline (!quit to exit)

   Before installing Waterline Data Inventory, make sure that HDFS, MapReduce, and YARN are running; if Hive is configured for your cluster, Hive and its constituent components (Hive Metastore, HiveServer2, MySQL Server, WebHCat Server) must be running.

Before  installing  Waterline  Data  Inventory,  make  sure  that  HDFS,  MapReduce,   and  YARN  are  running;  if  Hive  is  configured  for  your  cluster,  Hive  and  its   constituent  components  (Hive  Metastore,  HiveServer2,  MySQL  Server,  WebHCat   Server)  must  be  running.   4. Check  that  users  have  access  to  HDFS  files  and  Hive  tables.   Waterline  Data  Inventory  depends  on  the  cluster  authorization  system  to   manage  user  access  to  HDFS  resources.  Verify  that  you  have  access  to  some   HDFS  files  and  Hive  tables  so  that  when  you  use  Waterline  Data  Inventory  to   access  the  same  files,  you  can  validate  that  the  proper  access  is  available.  You'll   need  access  to  these  files  as  an  end-­‐user  and  as  the  Waterline  Data  Inventory   dedicated  user.    


   To verify that you have access to these files and tables, you can, for example:
   • Use Hue to navigate to existing data in HDFS or to load new data. Verify that you can access files you own as well as files for which you have access through group membership. If you can't sign into Hue or can't access HDFS files from inside Hue, ask your Hadoop administrator for appropriate credentials.
   • Use Beeswax (accessible through Hue) or Beeline (the Hive command line) to verify that you can access existing databases and tables. If you can't sign into Beeline or can't access Hive tables, ask your Hadoop administrator for appropriate credentials.

5. Run a sample MapReduce job.
   All of the Hadoop distributions provide sample code that you can run directly from the JAR file:
   hadoop-mapreduce-examples-<version>.jar
   where the version may be specific to the distribution and version of Hadoop. Run an example MapReduce job as follows:
   a. Use "locate" or "find" to determine where the examples JAR file is (see the example below).
   b. Run the sample job "pi" with values for the number of map tasks (10) and samples (1000) to run:
      hadoop jar <path>/hadoop-mapreduce-examples-*.jar pi 10 1000
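For step a, one way to locate the examples JAR (a sketch; the search root and the error redirection are just conveniences):

$ find / -name "hadoop-mapreduce-examples-*.jar" 2>/dev/null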

If the example runs successfully, you'll see output that shows the MapReduce job running:


Number of Maps = 10
Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
15/06/01 04:48:41 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/06/01 04:48:41 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/06/01 04:48:42 INFO input.FileInputFormat: Total input paths to process : 10
15/06/01 04:48:42 INFO mapreduce.JobSubmitter: number of splits:10
15/06/01 04:48:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1432835905062_0003
15/06/01 04:48:43 INFO impl.YarnClientImpl: Submitted application application_1432835905062_0003
15/06/01 04:48:43 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1432835905062_0003/
15/06/01 04:48:43 INFO mapreduce.Job: Running job: job_1432835905062_0003
15/06/01 04:48:59 INFO mapreduce.Job: Job job_1432835905062_0003 running in uber mode : false
15/06/01 04:48:59 INFO mapreduce.Job: map 0% reduce 0%
15/06/01 04:49:57 INFO mapreduce.Job: map 10% reduce 0%
15/06/01 04:49:58 INFO mapreduce.Job: map 70% reduce 0%
15/06/01 04:49:59 INFO mapreduce.Job: map 80% reduce 0%
15/06/01 04:50:30 INFO mapreduce.Job: map 90% reduce 0%
15/06/01 04:50:32 INFO mapreduce.Job: map 100% reduce 0%
15/06/01 04:50:34 INFO mapreduce.Job: map 100% reduce 100%
15/06/01 04:50:35 INFO mapreduce.Job: Job job_1432835905062_0003 completed successfully
Job Finished in 116.303 seconds
...
Estimated value of Pi is 3.14080000000000000000

You'll see a similar output pattern when Waterline Data Inventory MapReduce jobs run.


3. Configure a dedicated user

We recommend that you configure a "waterlinedata" user to own the installation directory and to run Waterline Data Inventory jobs. If you choose not to create a "waterlinedata" user, choose another user that will be dedicated to running Waterline Data Inventory jobs.

Because of the extensive access privileges that Waterline Data Inventory needs to produce an inventory of HDFS files, it is critical that the user account that runs Waterline Data Inventory jobs be created to adhere to all enterprise security requirements.

The dedicated user needs the following access:

• Appropriate security authentication. The dedicated Waterline Data Inventory user (waterlinedata) needs to be an authorized user in the system used by your enterprise to authenticate cluster users.

• Kerberos credentials. If your cluster is Kerberized, ask your Kerberos administrator to configure a principal name for the dedicated Waterline Data Inventory user and a corresponding keytab file. You'll need this information to configure the Waterline Data Inventory web server and to run Waterline Data Inventory jobs.

• Temporary root access. The waterlinedata user must be configured with enough "sudo" powers to create these directories during the installation. The sudo access can be removed after installation is complete.

• Directory access. The waterlinedata user requires full access to the Waterline Data Inventory installation directory and the following runtime directories:
  • Waterline Data Inventory installation location, typically /opt.
  • /var/lib/waterline: location for the Waterline Data Inventory repository and search indexes
  • /var/log/waterline: location for the Waterline Data Inventory logs
  • /var/run/waterline: location for Waterline Data Inventory runtime state information (not currently used)
  Other than the installation location, these folders can be created and ownership assigned automatically by the script "postInstall" described in the installation steps below. This script requires root access to run.
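If you prefer to create the runtime directories by hand instead of relying on postInstall, the following commands are a minimal sketch of what the script does:

$ sudo mkdir -p /var/lib/waterline /var/log/waterline /var/run/waterline
$ sudo chown -R waterlinedata:waterlinedata /var/lib/waterline /var/log/waterline /var/run/waterline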




• File system access. The waterlinedata user requires read access to any file in HDFS or MapR-FS that will be part of the system inventory. It also requires write access to at least one location where it stages profiling data. If you expect your users to create Hive tables from inside Waterline Data Inventory, the waterlinedata user needs access to a staging location for Hive table creation.

  Waterline Data Inventory reads all files on the file system but exposes data only according to users' authorization. One way to allow Waterline Data Inventory to have the appropriate access is to add waterlinedata to the file system group (hdfs or mapr). This method assumes that operating system users have the same privileges on HDFS. Your environment may have other methods to achieve the same result (such as Ranger or Sentry).

  If your environment does not have parallel users in both the operating system and the file system, you need to make sure that the dedicated Waterline Data Inventory user is a part of the HDFS (or MapR-FS) super user group:
  dfs.permissions.superusergroup

  If you choose not to grant waterlinedata write access wherever it also has read access, make sure to give it write access to at least one location where it can stage profiling data. You must identify this location in the Waterline Data Inventory profiler configuration properties, as described in 6. Configure Waterline Data Inventory for your cluster on page 23.
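You can confirm which group is configured as the HDFS superuser group with a standard client query; for example:

$ hdfs getconf -confKey dfs.permissions.superusergroup    # typically prints "supergroup" or "hdfs"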

• Hive database access. The waterlinedata user requires read access to each Hive database that will be part of the system inventory. In addition, to allow users to create Hive tables from HDFS files, waterlinedata needs write access to one or more databases where users will store these tables.

• Shared folder access. If the installation is on a VirtualBox image, it is convenient to include the waterlinedata user as a member of the group created for the VM to share folders between the host and the VM, the vboxsf group.

• Hue user. As a convenience, if you plan to use Hue to manage HDFS or MapR-FS files, create a corresponding user account for waterlinedata on Hue.

4. Download and extract Waterline Data Inventory

If you haven't already, download the Waterline Data Inventory distribution from the location provided by Waterline Data.

As the dedicated waterlinedata user, navigate to the installation directory you identified previously and expand the Waterline Data Inventory TAR file.
$ cd <install dir>
$ su waterlinedata
Enter the waterlinedata password.
$ tar xf <waterlinedata tarball>

Errors from this command are likely to indicate that the waterlinedata user does not have write access to the install directory.


5. Run configuration scripts

The Waterline Data Inventory distribution includes scripts to automate the process of configuring class paths, placing JAR files in the right locations, and setting permissions. Because the scripts move files into locations where they can be accessible from MapReduce jobs and set permissions to allow the dedicated waterlinedata user to access Hadoop and Hive libraries, you'll need root access to run these scripts.

postInstall script

This script creates directories and moves configuration files into the appropriate locations.

To run postInstall:

1. If the cluster is Kerberized, make sure the dedicated Waterline Data user (typically "waterlinedata") has a valid Kerberos ticket and that the Kerberos ticket cache is available for the user (run klist). If there is not a valid ticket, run kinit to create one.

2. From inside the new waterlinedata directory, run the script to configure the environment.
   $ cd waterlinedata
   $ bin/postInstall

This script prompts you to enter the waterlinedata user password for sudo access. If upgrading Waterline Data Inventory, the script prompts to overwrite a Derby properties file: enter "y" for this prompt.

This script also prompts you for the location of Hive in the Hadoop environment; typically, other Waterline Data Inventory scripts will locate Hive for you, so you can skip this prompt. If you receive an error later, rerun this script and include the Hive location.

No Hive in your Hadoop?
Waterline Data Inventory uses some of the same open source libraries that Hive distributes to read HDFS files. If you don't have Hive installed in your system, you need to provide the location of the Waterline Data Inventory dependencies directory:
<install dir>/waterlinedata/lib/hive

This script makes the following configuration changes:
• Creates /var directories for the Waterline Data Inventory repository, search indexes, log, and runtime files.
• Sets the ownership of the new directories to the current user.
• Copies repository properties files from the Waterline Data Inventory installation location into the new directories.
• Writes the provided Hive path to bin/.hive_home.


hiveSetup script

Waterline Data Inventory provides functionality for profiling and browsing existing Hive tables and allowing users to create new Hive tables from HDFS files in the inventory. If you want users to have access to this functionality, configure Waterline Data Inventory to work with Hive in your cluster. There are three possible configurations with Hive that require installation steps:

• HiveServer2 installed on the same node as Waterline Data Inventory
• HiveServer2 installed on a different node than Waterline Data Inventory
• HiveServer2 is not part of the cluster at all

The installation steps for each of these configurations are described in the following sections.

Hive and Waterline Data Inventory share a node

If HiveServer2 is running on the same node as Waterline Data Inventory, run the following script from inside the waterlinedata directory:
$ bin/hiveSetup linkAuxLib

This script makes the following configuration changes:

• Creates an auxlib directory in the Hive home directory if one does not already exist. For example, /usr/lib/hive/auxlib or /opt/mapr/hive/hive-<version>/auxlib.
• Copies JAR files needed for Hive table creation and reading to the auxlib directory.
• Creates symbolic links for the auxlib JAR files into lib to allow Beeswax and Beeline access to these files. To skip this step, omit the "linkAuxLib" option.

If the Hive server is not running, the script will fail to identify the Hive location; to remedy this, do one of the following (see the example after this list for editing .hive_home by hand):

• Start the Hive server.
• Rerun postInstall (page 20) and specify the location of the Hive executable.
• Edit <install dir>/waterlinedata/bin/.hive_home to include the location of the Hive executable.

If the script reports a problem, address the issue and rerun the script until you get a successful result. If you are not successful running these setup scripts, you can run <install dir>/waterlinedata/bin/detect-env verbose to get more information on where problems are occurring.

Restart HiveServer2 after successfully running hiveSetup.
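For the third remedy above, setting .hive_home by hand can be as simple as the following; the Hive path shown is an example and varies by distribution:

$ echo "/usr/lib/hive" > <install dir>/waterlinedata/bin/.hive_home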


Hive and Waterline Data Inventory run on separate nodes

If HiveServer2 runs on a different node, you need to provide an alternate local location for the Hive JARs for the Waterline Data Inventory installation and copy the Waterline Data Inventory-specific JAR files to the Hive server:

1. Rerun postInstall (page 20) and specify the local 'surrogate' location for Hive:
   <install dir>/lib/hive
   For example, /opt/waterlinedata/lib/hive.

2. Run the Hive configuration script from inside the waterlinedata directory:
   $ bin/hiveSetup linkAuxLib

3. Locate HiveServer2.

4. Create an auxlib folder in the Hive installation, at the same level as the Hive lib folder. Allow all users to read from this folder.
   $ mkdir <hive parent dir>/hive/auxlib
   $ chmod a+rx <hive parent dir>/hive/auxlib
   For example, the following commands apply to HDP v2.2.4 instances:
   $ mkdir /usr/hdp/2.2.4.2-2/hive/auxlib
   $ chmod a+rx /usr/hdp/2.2.4.2-2/hive/auxlib

5. Add the following JARs to auxlib. These JARs can be found in the Waterline Data Inventory installation, in the lib/waterlinedata and lib/dependencies folders:
   • jackson-annotations-2.2.3.jar
   • jackson-databind-2.2.3.jar
   • opencsv-2.3.jar
   • hive-serdes-*.jar
   • hivexmlserde-*.jar
   • waterlinedata-formats-*.jar
   Here's one way to move the files between systems:
   $ cd <hive parent dir>/hive/auxlib
   $ scp waterlinedata:/opt/waterlinedata/lib/waterlinedata/waterlinedata-formats-1.2.1.jar .
   $ scp waterlinedata:/opt/waterlinedata/lib/dependencies/jackson-annotations-2.2.3.jar .

6. Create symbolic links between the files in auxlib and the Hive lib directory. If a JAR already exists in the lib directory, the symbolic link creation will fail; in that case, you don't need to create the symbolic link.
   $ cd ../lib
   $ for each in ../auxlib/*.jar ; do ln -svi $each ; done
   Repeat for all of the files in auxlib.

Restart HiveServer2 after successfully copying and linking the JAR files.

Repeat  for  all  of  the  files  in  auxlib.   Restart  HiveServer2  after  successfully  copying  and  linking  the  JAR  files.   22  

©  2014  -­‐  2015  Waterline  Data,  Inc.  All  rights  reserved.  

Installation  and  Administration  Guide  

Installing  Data  Inventory  

Hive is not part of the cluster configuration

If HiveServer2 is not part of the cluster setup, you need to provide an alternate local location for the Hive JARs for the Waterline Data Inventory installation:

1. Rerun postInstall (page 20) and specify the local 'surrogate' location for Hive:
   <install dir>/lib/hive

2. Run the Hive configuration script from inside the waterlinedata directory:
   $ bin/hiveSetup linkAuxLib

6. Configure Waterline Data Inventory for your cluster

To ensure Data Inventory is correctly installed and to prepare for the initial profiling runs, you need to configure Waterline Data Inventory's connections to the cluster and to Hive. These connections are configured as entries in the property file waterlinedata/lib/resources/environment.properties.

If you are running the Waterline Data Inventory VM sandbox, you can skip this step as the values are already provided.

• waterlinedata.crawler.fs.uri=hdfs://<cluster node>:8020
  Waterline Data Inventory server to Hadoop connection. Set this to the root of HDFS. Typically, this is the fs.defaultFS parameter in Hadoop's core-site.xml file.
  For MapR, use maprfs:///. You can see the host name for your MapR cluster using:
  cat /opt/mapr/conf/mapr-clusters.conf

• javax.persistence.jdbc.url=jdbc:derby://<hostname>:4444/waterlinedatastore;create=true
  Replace <hostname> with the IP address for the computer on which you've installed Waterline Data Inventory. If you are running a single-node cluster, this is the same host name as the cluster root location.

• javax.persistence.jdbc.user=waterlinedata
• javax.persistence.jdbc.password=<password>
  Credentials used by Waterline Data Inventory processes to access the Waterline Data Inventory repository. If needed, replace the username and password with ones that you choose. Be sure to encrypt the replacement password.

• waterlinedata.metadata.search.index.rootDir=/var/lib/waterline/index
  Location to create the Lucene indexes used by Waterline Data Inventory. Change this location to spread the storage of Waterline Data Inventory data across more than one drive or computer.


• waterlinedata.hiveurl=jdbc:hive2://<hive node>:10000/<database>
• waterlinedata.hivedatabasename=<database>
  Hive connection URL and the default database Waterline Data Inventory uses. This is the Hive database that Waterline Data Inventory uses when end-users create Hive tables from HDFS files from inside the browser application, and the one that is profiled when Hive table profiling (page 61) is turned on. Note that the default database should be the same in both entries (Hive defaults to the database named "default").
  For SPNEGO-Kerberos, the hiveurl needs to include the following:
  jdbc:hive2://<hive node>:<port>/<database>;principal=<Hive principal>
  For example, with Hive running on the same node where Waterline Data Inventory is installed and using the default Hive port and database (on one line):
  jdbc:hive2://localhost:10000/default;principal=HIVE/edgenode1.acmecorp.com

• waterlinedata.temproot=<local directory>
  The local file system directory Waterline Data Inventory uses to store temporary files created during discovery processing. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, this value is set to /tmp.

• waterlinedata.profile.processingdirectory=<HDFS directory>
  The HDFS or MapR-FS directory Waterline Data Inventory uses to generate temporary files during HDFS file profiling. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. If this property is not set (by default, it is commented out), temporary files are created in the first directory identified in the profiling command.

• waterlinedata.profile.hivedir=<HDFS directory>
• waterlinedata.hive.create_table_in_place=true
  The HDFS or MapR-FS directory Waterline Data Inventory uses to generate copies of files used to create Hive tables. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, file copies are only created in a few cases based on the type of file format. To change the behavior to have Waterline Data Inventory always make copies of the file to the other directory, set create_table_in_place to false.

• waterlinedata.web.kerberos.keytab.location=<keytab file location>
• waterlinedata.web.kerberos.username=<principal>
  The principal and keytab file location for the dedicated Waterline Data Inventory user. For more details on running Waterline Data Inventory in a Kerberized environment, see Kerberos configuration (page 29).
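Putting the pieces together, a filled-in environment.properties for a hypothetical single-edge-node CDH cluster might look like the following excerpt; every host name and path here is illustrative only:

waterlinedata.crawler.fs.uri=hdfs://nameservice1.example.com:8020
javax.persistence.jdbc.url=jdbc:derby://edge1.example.com:4444/waterlinedatastore;create=true
javax.persistence.jdbc.user=waterlinedata
waterlinedata.hiveurl=jdbc:hive2://edge1.example.com:10000/default
waterlinedata.hivedatabasename=default
waterlinedata.temproot=/tmp
waterlinedata.profile.processingdirectory=/user/waterlinedata/staging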


Firewall configuration

If you expect administrators or end-users to access Waterline Data Inventory across a firewall, consider allowing access to the following ports at the cluster IP address:

• Port 8082, Waterline Data Inventory browser application. End-users: access to the Waterline Data Inventory browser application.
• Port 8482, Waterline Data Inventory browser application with HTTPS. End-users: access to the Waterline Data Inventory browser application with HTTPS.
• Port 50070, WebHDFS. If you configure Jetty to use WebHDFS rather than the native Java API. See Communication between Jetty and Hadoop (page 53).
• Port 10000, Hive. End-users: access to Hive tables.
• Port 19888, Hadoop job history. Administrators: access to troubleshooting information.
• Port 4444, Derby. Administrators: access to troubleshooting information.
• Ports 8000 (HDP) and 8888 (CDH, MapR), Hadoop Hue. Administrators: access to HDFS files and to MapReduce job status and logs.
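On an edge node running firewalld (RHEL/CentOS 7), opening the end-user ports might look like the following sketch; older systems would use iptables instead:

$ sudo firewall-cmd --permanent --add-port=8082/tcp    # browser application
$ sudo firewall-cmd --permanent --add-port=10000/tcp   # Hive
$ sudo firewall-cmd --reload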

Upgrading Waterline Data Inventory

If you have Waterline Data Inventory version 1.2.3 or earlier installed, you can upgrade to Waterline Data Inventory version 1.2.5 with your inventory intact as follows.

Note: Waterline Data Inventory version 1.2.4 also includes the updated version of Derby.

These instructions assume:

• You have sudo privileges to complete the operations
• You are signed in as the dedicated Waterline Data Inventory user, typically "waterlinedata"

To upgrade from version 1.2.3 (or earlier) to version 1.2.5:

1. Navigate to the directory in which you installed Waterline Data Inventory, for example /opt/waterlinedata.
   $ cd /opt/waterlinedata

2. Stop any running processes.
   $ bin/jettyStop
   $ bin/derbyStop

3. Remove the Waterline Data JAR file from the Hive auxlib directory:


   $ cat bin/.hive_home
   $ rm <hive home>/auxlib/waterlinedata-formats-X.X.X.jar

   When prompted, confirm that you want to delete the file.

4. Make backups of your repository and logs for the installed version.
   $ sudo cp -r /var/lib/waterline /var/lib/waterline_vXXX
   $ sudo cp -r /var/log/waterline /var/log/waterline_vXXX

5. Move the existing Waterline Data Inventory files out of the standard installation location.
   $ cd ..
   $ sudo mv waterlinedata waterlinedata_vXXX

6. Replace the standard installation directory, making sure that the directory is owned by the dedicated Waterline Data Inventory user.
   $ sudo mkdir waterlinedata
   $ sudo chown waterlinedata:waterlinedata waterlinedata

7. Reboot the edge node where Waterline Data Inventory is installed.

8. Follow the installation instructions for the new version of Waterline Data Inventory, starting with 4. Download and extract Waterline Data Inventory on page 19.

9. Configure the new version of the Derby repository.
   If you can tolerate reprofiling the content of your inventory, we recommend that you start with a fresh repository. There is no additional configuration required.
   If you would like to continue using your previous repository (with the understanding that this repository cannot be used in a production environment), you can turn off Derby authentication and continue to use the existing repository. To do so, comment out the following entries in the lib/resources/derby.properties file:
   #derby.connection.requireAuthentication=true
   #derby.authentication.provider=NATIVE:waterlinedatastore

10. Validate the following operations against your existing repository before removing the previous version of Waterline Data Inventory files:
   • View HDFS files
   • View Hive tables
   • Create new Hive tables from HDFS files that were already profiled using the previous version
   • Profile and run discovery operations


Integrating with user management systems
Waterline Data Inventory integrates with your existing Linux and Hadoop authentication mechanisms, such as SSH-based authentication and single sign-on systems such as Kerberos.
By default, it uses SSH authentication: users configured for HDFS are assumed to have a corresponding Linux account, and they sign in to Waterline Data Inventory using their network credentials. To use a different authentication system, an administrator configures the Waterline Data Inventory web server to accept it by updating Jetty's login.conf file. Only one system can be active at a time.

Waterline  Data  Inventory  user  authentication  settings     Specify  the  user  management  system  to  use  in  the  following  web  server   configuration  file:   /waterlinedata/jetty-distribution*/waterlinedata-base/etc/login.conf

This  file  includes  service  descriptions,  only  one  of  which  can  be  valid  at  a  time.  To   activate  one  of  the  service  types,  change  its  entry  name  to  "waterline"  and  rename   other  services  as  necessary.  

SSH configuration
SSH is a reliable security mechanism with one limitation: it assumes that password authentication is available to the web server. As such, it will not work on systems that use Amazon AWS or Google Compute clouds.
When configured to use SSH for user authentication, the Waterline Data Inventory web server communicates with the host system on the listen address and port defined in /etc/ssh/sshd_config. By default, the port is set to 22. If your organization uses a different convention, update the port (authPort) setting for the sshd service in the following web server configuration file:
/waterlinedata/jetty-distribution*/waterlinedata-base/etc/login.conf
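To confirm the port sshd actually listens on before changing authPort, you can check the daemon configuration directly (a quick sanity check; if the Port line is commented out, the default of 22 applies):
$ grep -i '^Port' /etc/ssh/sshd_config
Port 22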

User access configuration for public cloud clusters
Amazon Web Services (AWS) and Google Cloud Platform do not support a password authentication mechanism for managing users; instead, they use SSH key-based authentication. Currently, Waterline Data Inventory does not support SSH keys for authentication on cloud deployments. It uses a local file to determine user credentials.
Note that this method does not supersede the cloud provider's security, nor does it override the operating system's security concepts. The user list grants access to the Waterline Data Inventory web application only. Waterline Data Inventory respects the access privileges granted by the file system: the user list can include user names configured in the operating system. Listed users that are not mirrored in the operating system see only files that can be read by all users.


To  configure  Waterline  Data  Inventory  user  access  on  public  cloud  clusters:   1. Navigate  to  the  directory  in  which  you  installed  Waterline  Data  Inventory,  for   example  /opt/waterlinedata.   $ cd /opt/waterlinedata

2. Stop  the  web  server  process.   $ bin/jettyStop

3. Create a user access list in a text file named "login.properties" in the web server configuration location.
$ cd jetty-distribution-9.2.1.v20140609/waterlinedata-base/etc
$ vi login.properties

Enter  one  line  for  each  user  in  the  form:   username=password,groups

where groups can be one or more operating system group names to which this user belongs. Separate group names with commas.
For example:
waterlinedata=waterlinedata
sherlock=Se$4sp0,finance
watson=AQ2hc#9GG,finance
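The passwords above are shown in clear text. If you prefer to store obfuscated values (per the Jetty documentation referenced below), Jetty's Password utility can generate them; the jetty-util JAR name shown matches the Jetty 9.2.1 distribution bundled with this release, so verify the exact file name in your installation:
$ cd /opt/waterlinedata/jetty-distribution-9.2.1.v20140609
$ java -cp lib/jetty-util-9.2.1.v20140609.jar org.eclipse.jetty.util.security.Password sherlock 'Se$4sp0'
Copy one of the OBF: or MD5: values it prints into login.properties in place of the clear-text password.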

These passwords are used only for access to Waterline Data Inventory. They can be obfuscated according to the Jetty web server requirements, described in Jetty's "Secure Password Obfuscation":
www.eclipse.org/jetty/documentation/current/configuring-security-secure-passwords.html
4. Add an entry to the Jetty configuration file login.conf to refer to the user access list you created in step 3:
$ vi login.conf

Add the following entry:
waterline {
    org.eclipse.jetty.jaas.spi.PropertyFileLoginModule required
    debug="true"
    file="${jetty.base}/etc/login.properties";
};

5. In login.conf, find the previous entry named "waterline" and change it to "waterlineSSH" or "waterlineKerberos" as appropriate. (Only the entry added in step 4 should be named "waterline".)
6. Restart the web server process.
$ cd /opt/waterlinedata
$ bin/jettyStart


Kerberos configuration
These configuration instructions assume a Kerberos system in which application servers, such as Waterline Data Inventory's web server, use keytab credentials while users authenticate with username and password.
Kerberos setup requires support from the IT personnel who can advise on and configure Kerberos user access.
The following steps are required to configure Waterline Data Inventory to operate with Kerberos authentication:
1. User setup. Make sure that the user account dedicated to installing and running Waterline Data Inventory servers and jobs is configured for Kerberos.
The access requirements for this user are described in the installation requirement "3. Configure a dedicated user" on page 18.
To complete the Waterline Data Inventory Kerberos configuration, you'll need:
• The principal name for the dedicated Waterline Data Inventory user
• The location of the keytab file corresponding to that principal
You'll use these values in step 2, "Set up web server credentials."
2. Set up web server credentials. See Configure Waterline Data Inventory web server as a trusted Kerberos application server, below.
3. Switch Waterline Data Inventory to Kerberos. See Configure Waterline Data Inventory web server to use Kerberos authentication, on page 30.
4. Configure impersonation. See Configure impersonation for Waterline Data Inventory, on page 31.
5. Configure Hive. See Configure the Hive principal in Waterline Data Inventory, on page 32.
6. Review non-Kerberos connections. Ensure that all internal credentials are secure.
Some communication among components of Waterline Data Inventory does not use the dedicated Waterline Data Inventory user account. There are a few changes to consider to ensure all communication paths are secure, described in Improve security for communication between Waterline Data Inventory components on page 32.
Configure Waterline Data Inventory web server as a trusted Kerberos application server
This configuration assumes that the Waterline Data Inventory web server (Jetty) authenticates using the dedicated Waterline Data Inventory user principal and keytab. If you choose to use a separate principal specifically for an application


server,  ensure  that  you  use  the  application  server  principal  when  configuring  the   Waterline  Data  Inventory  properties  in  step  2.   1. Include  a  Kerberos  configuration  file  on  the  computer  on  which  Waterline  Data   Inventory's  application  server  (Jetty)  will  run.   Make  sure  that  the  Kerberos  configuration  file  (/etc/krb5.conf)  includes  a   description  of  the  realm  in  which  Waterline  Data  Inventory  resides.  For  example,   for  a  server  in  a  company  called  "Acme":    

[libdefaults]
  default_realm = ACME.COM
  dns_lookup_realm = false
  dns_lookup_kdc = false
  ticket_lifetime = 24h
  renew_lifetime = 7d
  forwardable = true

[realms]
  ACME.COM = {
    kdc = server1.acme.com:88
    admin_server = server1.acme.com:88
  }

[domain_realm]
  .acme.com = ACME.COM
  acme.com = ACME.COM
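With the realm defined, you can verify the dedicated user's principal and keytab before wiring them into Waterline Data Inventory; the keytab path below is a placeholder for wherever your keytab actually resides:
$ kinit -kt /etc/security/keytabs/waterlinedata.keytab waterlinedata@ACME.COM
$ klist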

2. Indicate  the  location  of  Kerberos  credentials  so  Jetty  can  refresh  its  own  ticket  as   needed.   Edit  the  environment.properties  file  to  include  the  principal  and  keytab  file  location  for   the  dedicated  Waterline  Data  Inventory  user.   The  file  is  found  in:   /waterlinedata/lib/resources

The properties to update are:
• waterlinedata.web.kerberos.keytab.location=
• waterlinedata.web.kerberos.username=
For example, "waterlinedata@ACME.COM".
Configure Waterline Data Inventory web server to use Kerberos authentication
By default, the Waterline Data Inventory web server uses SSH authentication. To switch to Kerberos, edit the web server login configuration file:
1. Edit the Jetty login.conf file.
The file is found in:
/waterlinedata/jetty-distribution-<version>.v<date>/waterlinedata-base/etc



where <version> and <date> are the values for the Jetty distribution provided in the Waterline Data Inventory installation.
• Rename the "waterlineKerberos" entry to simply "waterline".
• Rename the existing "waterline" entry to "waterlineSSH".
• Make sure there is only one "waterline" entry in the file.
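After renaming the entries, restart the web server so that Jetty rereads login.conf (the same stop and start scripts used elsewhere in this guide):
$ cd /opt/waterlinedata
$ bin/jettyStop
$ bin/jettyStart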

     

Configure impersonation for Waterline Data Inventory
Secure impersonation is required to use Waterline Data Inventory's HDFS delegated authorization capability. This method allows the dedicated Waterline Data Inventory user to submit requests to Hive or HDFS on behalf of another user. For example, when browsing files in HDFS, Waterline Data Inventory uses the signed-in user's credentials to query HDFS for directory listings, ensuring the user sees only data the user has access to.
In a Kerberos-controlled environment, delegated authentication has another benefit. The Hive metastore is typically accessible only through the dedicated Hive user. Waterline Data Inventory uses delegated authentication to perform operations against the Hive metastore by passing the Hive principal with the request. Because the dedicated Waterline Data Inventory user has delegated authentication privileges, Hive performs the requests.
To use Waterline Data Inventory's HDFS delegated authorization, make the following configuration changes in the core-site.xml file for the cluster.
1. Update core-site.xml with the following properties.
The changes to core-site.xml require that you restart the cluster, so plan to make the change when it is convenient alongside other cluster management tasks.
Make this change using the cluster management tools, such as Ambari, Cloudera Manager, or MapR Control System.


If you need to, change "waterlinedata" to the name you are using for the dedicated Waterline Data Inventory user. Include all hosts or all groups using an asterisk (*) as the property value. Alternatively, you can specify a comma-separated list of fully qualified hostnames or a comma-separated list of groups.
<property>
  <name>hadoop.proxyuser.waterlinedata.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.waterlinedata.groups</name>
  <value>*</value>
</property>

2. Restart the cluster.
Configure the Hive principal in Waterline Data Inventory
From inside Waterline Data Inventory, users can create Hive tables from HDFS files. In a non-Kerberized environment, Waterline Data Inventory requests data from the Hive metastore using the dedicated Waterline Data Inventory user. In a Kerberized environment, it is typical that only the dedicated Hive user can perform operations against the metastore. To allow Waterline Data Inventory to access the Hive metastore, configure Waterline Data Inventory with delegated authentication privileges (previous section) and include the Hive principal in the Waterline Data Inventory configuration. To configure this change to the Hive connection, edit the environment.properties file.
1. In the Waterline Data Inventory environment.properties file, update the Hive connection URL to include the Hive Kerberos principal:
• Comment out the existing, non-Kerberos instance of the hiveurl property.
• Later in the file, uncomment the Kerberos instance of the hiveurl property.
• Customize the Kerberos hiveurl to include the Hive Kerberos principal.
The connection URL with the Hive principal would look like the following example (all on one line):
waterlinedata.hiveurl=jdbc:hive2://com.acme.edge:10000/default;principal=hive/com.acme.edge@ACME.COM;auth=kerberos;kerberosAuthType=fromSubject

For more information on configuring Hive with Kerberos in an enterprise environment, see "Multi-User Scenarios and Programmatic Login to Kerberos KDC":
cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Multi-UserScenariosandProgrammaticLogintoKerberosKDC
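As a quick connectivity check, you can try the same URL with Beeline from the edge node (assuming the Hive client is installed and you hold a valid Kerberos ticket):
$ beeline -u "jdbc:hive2://com.acme.edge:10000/default;principal=hive/com.acme.edge@ACME.COM"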

Improve security among Waterline Data Inventory components
Kerberos provides a mechanism to ensure secure communications between clients and servers; you can also enhance the security of communication between the Waterline Data Inventory web server and the repository database, whether they reside on a single computer or are distributed across separate computers.

Securing internal passwords
The following steps describe how to secure (obfuscate) the clear-text passwords Waterline Data Inventory uses to pass information among components, such as Derby. The default passwords provided have been obfuscated using this process.
To update the Derby password:
1. Edit the environment.properties file, found here:
/waterlinedata/lib/resources/environment.properties

2. In  a  separate  command  window,  generate  the  password  for  database  access  by   running:   /waterlinedata/bin/obfuscate

3. Enter  the  Derby  password  at  the  prompt  and  collect  the  output  from  the  console   (or  in  obfuscate.out).   4. Locate  the  entry  for  javax.persistence.jdbc.password  and  replace  the  existing   default  password  with  the  encrypted  text  obtained  in  the  previous  step.   Do  NOT  modify  the  line  javax.persistence.jdbc.user=waterlinedata.  

Encrypting a Derby repository
Another measure you can take to secure data at rest in your system is to configure Derby to encrypt the Waterline Data Inventory repository. Make the following changes in a new installation of Waterline Data Inventory.
If you need to convert an existing repository from non-encrypted to encrypted, refer to Apache's documentation found here:
db.apache.org/derby/docs/10.9/devguide/tdevcsecureunencrypteddb.html
To configure Derby to initialize an encrypted database:
1. Install Waterline Data Inventory as described in Installing Data Inventory, starting on page 14.
2. Complete the configuration settings described in Step 6, "Configure Waterline Data Inventory for your cluster", except for the repository properties.
3. Configure the following repository properties in /waterlinedata/lib/resources/environment.properties:

• Derby connection URL "javax.persistence.jdbc.url". This property includes the following parameters:
jdbc:derby://<hostname>:4444/waterlinedatastore;    JDBC connection to the datastore.
create=true;                                        Option to create the database if it doesn't already exist.
dataEncryption=true;                                Option to enable data encryption.
bootPassword=<password>                             Boot password for the encrypted database.

For example, the complete property entry might look like the following (all on one line):
javax.persistence.jdbc.url=jdbc:derby://mycluster.acme.com:4444/waterlinedatastore;create=true;dataEncryption=true;bootPassword=rO0ABXcIAAABTm+81/tzcgAZamF2YXguY3J5cHRvLlNlYWxlZE9iamVjdD42PabDt1RwAgAEWwANZW5jb2RlZFBhcmFtc3QAAltCWwAQZW5jcnlwdGVkQ29udGVudHEAfgABTAAJcGFyYW1zQWxndAASTGphdmEvbGFuZy9TdHJpbmc7TAAHc2VhbEFsZ3EAfgACeHBwdXIAAltCrPMX+AYIVOACAAB4cAAAACBBs6MgWOBquHlkak/Pjk2DYzvwCcZPSVZ/xYDNdMbPl3B0ABRBRVMvRUNCL1BLQ1M1UGFkZGluZw==



• Derby user and password. Include the dedicated Waterline Data Inventory user (typically "waterlinedata") and an encrypted password for this user. For example:
javax.persistence.jdbc.user=waterlinedata
javax.persistence.jdbc.password=rO0ABXcIAAABTL6+wStzcgAZamF2YXguY3J5cHRvLlNlYWxlZE9iamVjdD42PabDt1RwAgAEWwANZW5jb2RlZFBhcmFtc3QAAltCWwAQZW5jcnlwdGVkQ29udGVudHEAfgABTAAJcGFyYW1zQWxndAASTGphdmEvbGFuZy9TdHJpbmc7TAAHc2VhbEFsZ3EAfgACeHBwdXIAAltCrPMX+AYIVOACAAB4cAAAACDYZOrytwNZDBzYyS8qc530ISSmjDSqdw0fVY6YXCb+mnB0ABRBRVMvRUNCL1BLQ1M1UGFkZGluZw==

To encrypt the passwords, run the obfuscate utility provided in /waterlinedata/bin.
4. Follow the standard process for starting the Waterline Data Inventory Derby and Jetty services:
$ cd /waterlinedata/
$ bin/derbyStart
$ bin/jettyStart

If you have an existing installation of Waterline Data Inventory, you need to drop the existing repository to take advantage of Derby encryption.
To remove an existing Waterline Data Inventory repository:
1. From the edge node where Waterline Data Inventory is installed, shut down any existing Waterline Data Inventory processes. If a profiling or discovery job is running, wait for the job to complete.
$ cd /waterlinedata
$ bin/jettyStop
$ bin/derbyStop

The  derbyStop  script  prompts  for  the  username  and  password  (configured  in   lib/resources/environment.properties).  By  default  these  values  are  "waterlinedata"   and  "waterlinedata".  


2. Remove the repository and indexes by deleting the /var/lib/waterline/db and /var/lib/waterline/index directories:
$ rm -r /var/lib/waterline/db
$ rm -r /var/lib/waterline/index

3. Follow  the  instructions  provided  above  to  configure  Waterline  Data  Inventory  to   create  a  new  encrypted  database.    

Configuring access using Hadoop security: Ranger or Sentry
Waterline Data Inventory supports coarse-grained security based on HDFS file and directory user and group permissions. The following table describes Waterline Data Inventory's operation based on which Hive authorization method your cluster employs.

Security Configuration                                        HDFS    Hive

SQL Standards Based Authorization (fine-grained security)
  Browse files and tables                                     Yes     Yes
  Search files and tables                                     Yes     Yes
  Create tables                                               Yes     Yes
  Browse authorized subset (columns or rows)                  No      No

Storage Based Authorization
  Browse files and tables                                     Yes     Yes
  Search files and tables                                     Yes     Yes
  Create tables                                               Yes     Yes

Default Hive Authorization (Legacy Mode)
  Browse files and tables                                     Yes     Yes
  Search files and tables                                     Yes     Yes
  Create tables                                               Yes     Yes

Secure cluster configuration
If a Hadoop cluster runs in secure mode, Waterline Data Inventory can be configured to enable secure impersonation. Secure impersonation allows a given Hadoop superuser to submit jobs or access files on behalf of another user.
Secure impersonation is required to use Waterline Data Inventory's HDFS delegated authorization capability. This allows the dedicated Waterline Data Inventory user to submit tasks on behalf of another user. The Waterline Data Inventory server uses its credentials to authenticate with Hadoop. However, file system accesses and tasks are authorized as the user who is signed in to the Waterline Data Inventory browser application.
To use HDFS delegated authorization, do the following to enable secure impersonation in your Hadoop environment:


1. Add the dedicated Waterline Data Inventory user (typically waterlinedata) to the HDFS superuser group on all Hadoop nodes.
2. Create a /user/<username> directory in HDFS for each user who will access Waterline Data Inventory.
3. Grant read access on the appropriate source data files and directories in HDFS, and databases and tables in Hive, to the groups (or users).
You must enable the secure impersonation properties for the Waterline Data Inventory superuser in the core-site.xml file on your Hadoop nodes. For example:

<property>
  <name>hadoop.proxyuser.waterlinedata.groups</name>
  <value>*</value>
  <description>Allow the superuser 'waterlinedata' to impersonate any user</description>
</property>
<property>
  <name>hadoop.proxyuser.waterlinedata.hosts</name>
  <value>*</value>
  <description>The superuser 'waterlinedata' can connect from any host to impersonate a user</description>
</property>
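On a non-Kerberized cluster, one hedged way to confirm that the proxy-user settings took effect after the restart is a WebHDFS request that asks HDFS to perform a listing as another user (the host, port, and user names are placeholders; Kerberized clusters authenticate WebHDFS with SPNEGO instead of user.name):
$ curl "http://namenode.acme.com:50070/webhdfs/v1/user/sherlock?op=LISTSTATUS&user.name=waterlinedata&doas=sherlock"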

Access privileges for HDFS
If your cluster security is enforced using Apache Ranger or Apache Sentry, here's how to set user access to make sure that both the dedicated Waterline Data Inventory user and end-users of the browser application have the access they need.

HDFS User and Area of access                                  Read    Write    Execute

Waterline Data Inventory dedicated user "waterlinedata"
  HDFS directories and files included in inventory            X                X
  Staging area for profiling results                          X       X        X
Privileged end-users
  HDFS directories and files this user needs access to        X       X        X
Read-only end-users
  HDFS directories and files this user needs access to        X                X


Access  privileges  for  Hive    Ranger  and  Sentry  both  control  access  to  data  in  Hive  tables;  access  is  controlled   based  on  the  required  SQL  operation.    

Hive User and Area of access                                  Hive Operation

Waterline Data Inventory dedicated user "waterlinedata"
  Profile existing tables                                     SELECT
  Browse existing tables                                      SHOW DATABASE
  Create new tables                                           CREATE, ALTER‡
Privileged end-users
  Hive databases and tables this user needs access to         SELECT, CREATE
Read-only end-users
  Hive databases and tables this user needs access to         SELECT

‡    ALTER  privileges  are  required  only  for  creating  Hive  tables  from  collections.  
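With Sentry, for example, the read-only grant in the last row might be expressed roughly as follows, run through Beeline as an administrator (the role and group names are illustrative; Ranger expresses the same policy through its web UI instead):
CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP finance;
GRANT SELECT ON DATABASE default TO ROLE analyst_role;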

Starting Waterline Data Inventory
These steps pick up where the section "Installing Data Inventory" left off and assume you have access to the Linux computer where Waterline Data Inventory is installed and can sign in as the dedicated Waterline Data Inventory user.
1. From a command prompt or terminal, access the computer where Waterline Data Inventory is installed and sign in as the dedicated Waterline Data Inventory user.
2. Navigate to the Waterline Data Inventory installation directory.
For example:
$ cd /home/waterlinedata/waterlinedata

3. Start  the  embedded  metadata  repository  database,  Derby.   $ bin/derbyStart

You'll see a response that ends with "...started and ready to accept connections on port 4444".
4. Press Enter to return to the shell prompt.
5. Profile a directory in HDFS (or MapR-FS).
For this first run, select a single directory with a small number of files to validate the installation.
Run the following command:
$ bin/waterline profile <HDFS directory path>

For  example:   $ bin/waterline profile /user/waterlinedata/Landing/data.gov
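If you need help choosing a small starting directory, you can list HDFS first (this assumes the hdfs client is available on the edge node):
$ hdfs dfs -ls /user/waterlinedata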


The console fills with status messages for each stage of the profiling sequence. When this command completes, you can repeat it with additional directories or move on to viewing the profiled data.
6. Start the embedded web server, Jetty.
You may want to open a new console for this command to separate the profiling output from the Jetty output. (The output from the profiling and Jetty processes is captured in separate log files in /var/log/waterline.)
$ bin/jettyStart

The first time you run Waterline Data Inventory after installation, the system creates the repository tables in Derby. Either starting the Jetty process or running a profiling job will create the repository. Avoid starting both processes at the same time: both will attempt to create the repository tables, and conflicts will result.
7. After Jetty's messages pause, open a browser and navigate to:
http://<hostname>:8082

For Kerberized instances, make sure to log in from a browser configured to use Kerberos keytabs for the user (see Configuring web browsers for use with Kerberos on page 51) and use the fully qualified domain name instead of the IP address to make sure that the Kerberos token is passed to the application:
http://<fully qualified domain name>:8082

If the Waterline Data Inventory login screen doesn't appear, look in the console output to see if any error occurred. The output is also available at /var/log/waterline/wds-ui.log. Typically, errors at this point are similar to the following:
• Contested port. If another application on the cluster is using port 8082, you may not have access to Waterline Data Inventory. If this is the case, do the following:
a. Stop Jetty.
$ bin/jettyStop

b. Change the Jetty port number in the file
jetty-distribution-9.2.1.v20140609/waterlinedata-base/start.d/http.ini
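For example, the port line in http.ini might be changed as follows (the property name shown matches Jetty 9.2 conventions; confirm it against the file shipped with your installation):
# jetty-distribution-9.2.1.v20140609/waterlinedata-base/start.d/http.ini
jetty.port=9082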

c. Restart Jetty.
• Port forwarding. If you are accessing the web server remotely, make sure that the connection between hosts allows forwarding of port 8082.
• User permissions. If the dedicated user does not have the correct permissions, you may see errors in the Jetty output. Review the user access requirements and make sure the user has the correct access.

• Kerberos ticket cache disabled. In a Kerberos-controlled environment, if you see the following error in the Jetty console and log, the ticket cache may be disabled for the user starting the Jetty process:

WARN |2015-03-22 13:52:57,827 org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:waterlinedata (auth:KERBEROS) cause:java.io.IOException: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]

To resolve this error, make sure the user running the Jetty process has read access to HDFS and has a valid Kerberos ticket (run kinit). Then check that the Kerberos ticket cache is available for the user (run klist).
8. Sign in to Waterline Data Inventory using any of the Linux users configured for your system, including "waterlinedata".
For a Kerberized instance, this login page does not appear. To access Waterline Data Inventory, users will need a valid Kerberos keytab and their browser configured to use it. See Configuring web browsers for use with Kerberos on page 51.
9. Verify that there is field-level information for the files in the directory you profiled in step 5.
If files show that they were not profiled ("N/A" or "CRAWLED" in Last Profiled), review the console output from step 5 to determine the failure.

If profiling didn't complete successfully, files show no profile time.


Running Waterline Data Inventory jobs
Waterline Data Inventory format discovery and profiling jobs are MapReduce jobs run in Hadoop. These jobs populate the Waterline Data Inventory repository with file format and schema information, sample data, and data quality metrics for files in HDFS and Hive. Waterline Data Inventory can process HDFS files formatted as delimited text files, JSON, Avro, XML, ORC, RC, and Apache log files. Individual files in these formats compressed as sequence files are also profiled, as are individual files in delimited text, Apache log, or JSON format compressed as gzip (GNU zip).

Tag propagation, lineage discovery, collection discovery, and origin propagation jobs run on the edge node where Waterline Data Inventory is installed. These jobs use data from the repository to suggest relationships among files, to suggest additional tag associations, and to propagate origin information.

  Waterline  Data  Inventory  jobs  are  run  on  a  command  line  on  the  computer  on  which   Waterline  Data  Inventory  is  installed.  The  jobs  are  started  using  scripts  located  in   the  bin  subdirectory  in  the  installation  location.     If  you  are  running  Waterline  Data  Inventory  jobs  in  a  development  environment,   consider  opening  two  separate  command  windows:  one  for  the  Jetty  console  output   and  a  second  to  run  Waterline  Data  Inventory  jobs.  

Command summary
Run Waterline Data Inventory commands as options to the waterline script found in the bin directory of the installation:
$ bin/waterline <command option> [arguments]

The  command  options  and  parameters  are  described  in  the  following  table.    

Command option                  Summary

profile <HDFS directories>      Full profile and discovery of the files in the indicated HDFS directories. Indicate more than one directory with a comma-separated list. MapReduce configuration parameters can be passed through to MapReduce jobs. (Details on page 42.)

profileOnly <HDFS directories>  Profiling of the files in the indicated HDFS directories. No discovery processes run. Indicate more than one directory with a comma-separated list. (Details on page 42.)

profileHive <Hive databases> [HDFS staging directory]
                                Full profile and discovery of the tables in the indicated Hive databases. Indicate more than one database with a comma-separated list. By default, Waterline Data Inventory uses the location configured for waterlinedata.profile.hivedir to stage profiling results; if this location is not configured, specify an empty HDFS directory where waterlinedata has read and write access to use as a staging directory for profiling results. MapReduce configuration parameters can be passed through to MapReduce jobs. (Details on page 44.)

profileHiveOnly <Hive databases> [HDFS staging directory]
                                Profiling of the tables in the indicated Hive databases. No discovery processes run. Indicate more than one database with a comma-separated list. By default, Waterline Data Inventory uses the location configured for waterlinedata.profile.hivedir to stage profiling results; if this location is not configured, specify an empty HDFS directory where waterlinedata has read and write access to use as a staging directory for profiling results. MapReduce configuration parameters can be passed through to MapReduce jobs. (Details on page 45.)

runLineage                      Discover lineage relationships among all profiled files and tables and calculate file and table origins. (Details on page 43.)

runCollection                   Discover collections among all profiled files. If you are running discovery tasks individually, be sure to discover collections before propagating tag associations. (Details on page 43.)

runOrigin                       Calculate file and table origins using all lineage relationships. (Details on page 43.)

tag                             Propagate tag associations across all profiled files and tables. Because this operation uses repository data, if you are experimenting with tag associations based on regular expressions, you should consider reprofiling data to get a complete picture of how tag associations from regular expressions will perform. (Details on page 44.)

evaluateRegex                   Reapply tag associations based on regular expressions using existing repository data. (Details on page 44.)

showVersion                     Display Waterline Data Inventory version information. (Details on page 45.)


Full profiling and discovery against HDFS files
$ bin/waterline profile <HDFS directory path>

This command recursively profiles new and updated files in the directory indicated. When run for the first time, this command profiles all files in the indicated directory. Subsequent runs identify changed, deleted, and new files in the cluster and perform profiling only on those files. Specifically, the profile command triggers the following individual operations:
• Format discovery (one MapReduce job)
• Profiling "crawl" (one or more MapReduce jobs per file format type)
• Collections discovery (one local job)
• Origin propagation (one local job)
• Tag propagation (one local job), including propagating:
  • User-assigned tag associations
  • Tag associations defined by regular expressions
  • Tag associations defined by built-in reference data

When each job completes, the next job starts, whether or not the previous job completed successfully. The progress of each job is indicated by messages on the console. To see details for the MapReduce jobs, follow the job link provided in the console messages or use Hue to show the MapReduce jobs for the dedicated Waterline Data Inventory user.
After profiling all the directories in the cluster, run the lineage discovery command, described on page 43.
Example:
$ bin/waterline profile /user/waterlinedata/Landing

To profile more than one directory at a time, specify a parent directory or include multiple directories in the command, separated by commas with no space between paths:
$ bin/waterline profile <path>,<path>,<path>
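For example (the directory names are illustrative):
$ bin/waterline profile /user/waterlinedata/Landing,/user/waterlinedata/Archive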

If  you  specify  a  valid  HDFS  file  instead  of  a  directory,  Waterline  Data  Inventory  will   profile  just  the  file.  If  no  staging  directory  is  defined   (waterlinedata.profile.processingdirectory  in  environment.properties),  Waterline   Data  Inventory  will  create  a  staging  directory  in  the  same  parent  directory  as  the   file.  

Profiling only for HDFS files
$ bin/waterline profileOnly <HDFS directory path>

This command recursively profiles new and updated files in the directory indicated. When run for the first time, this command profiles all files in the indicated directory. Subsequent runs identify changed, deleted, and new files in the cluster and perform profiling only on those files. Specifically, the profileOnly command triggers the following individual operations:
• Format discovery (one MapReduce job)
• Profiling "crawl" (one or more MapReduce jobs per file format type)

The  progress  of  each  job  is  indicated  by  messages  on  the  console.  To  see  details  for   the  MapReduce  jobs,  follow  the  job  link  provided  in  the  console  messages.   After  profiling  all  the  directories  in  the  cluster,  run  the  lineage  discovery,  collection   discovery,  and  tag  propagation  commands,  described  next.   Example:   $ bin/waterline profileOnly /user/waterlinedata/Landing

Lineage discovery
$ bin/waterline runLineage

This  command  runs  two  local  jobs  to  discover  lineage  relationships  among  files  and   propagate  origin  information.  This  command  operates  on  data  in  the  Waterline  Data   Inventory  repository;  if  new  files  are  added  to  the  cluster,  you  must  run  a  profile   command  to  collect  data  into  the  repository  before  you  will  see  information  for  the   new  files  reflected  in  lineage  relationships.  This  command  allows  a  -r  option,  which   will  rediscover  lineage  for  all  files  in  the  cluster,  not  just  new  files.   The  progress  of  each  job  is  indicated  by  messages  on  the  console.  
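For example, to rediscover lineage across all files rather than just new ones:
$ bin/waterline runLineage -r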

Collection discovery
$ bin/waterline runCollection

This  command  reviews  repository  data  to  determine  if  any  folders  contain  files  that   can  be  considered  a  collection.  In  addition  to  running  collection  discovery  as  part  of   profiling  in  general,  run  this  command  when  you've  added  files  to  the  cluster  that   are  likely  to  be  members  of  existing  collections;  profiling  alone  will  not  update  the   collection  information  with  the  new  files.   This  command  allows  an  -r  option,  which  will  rediscover  collections  across  the   cluster,  not  just  for  new  files.   The  progress  of  the  job  is  indicated  by  messages  on  the  console.  

Origin propagation only
$ bin/waterline runOrigin

This  command  propagates  origins  across  the  files  in  the  cluster  that  have  lineage   relationships.  You  can  use  this  command  to  propagate  landing  information  across  a   cluster  that  has  already  been  profiled  and  has  lineage  information  discovered.  This   command  allows  a  -r  option,  which  will  propagate  all  origins,  not  just  new  origins.   The  progress  of  the  job  is  indicated  by  messages  on  the  console.  


Tag propagation only
$ bin/waterline tag

This  command  propagates  new  tags  across  the  files  and  fields  in  the  cluster.  Use  this   command  when  you  know  that  your  cluster  has  been  profiled  but  you  have  added   tags  and  tag  associations  that  you  want  Waterline  Data  Inventory  to  consider  for   propagation.    This  command  allows  a  -r  option,  which  will  propagate  all  tags,  not   just  new  tags.   The  progress  of  the  job  is  indicated  by  messages  on  the  console.  

Evaluating tag rules
$ bin/waterline evaluateRegex

This command uses data from the repository to apply tag association rules. Use this command when you configure tagging rules but are not ready to reprofile all data in the cluster to apply the new rules. The tag association results may not be as accurate as they would be with freshly profiled data, but the performance savings will be significant.
The progress of the job is indicated by messages on the console.

Full profiling and discovery against Hive tables
$ bin/waterline profileHive <Hive database> [HDFS staging directory]

This command profiles new and updated tables in the Hive database or databases indicated. The Hive databases must be from the Hive instance configured in the Waterline Data Inventory profiler properties as described on page 23. In addition, you must identify an HDFS location where Waterline Data Inventory can create staging files for profiling results.
When run for the first time, this command profiles all tables in the indicated database. Subsequent runs identify changed, deleted, and new tables in the cluster and perform profiling only on those tables. Specifically, the profileHive command triggers the following individual operations:
• Profiling "crawl" (one or more MapReduce jobs depending on the size of data in each table)
• Collections discovery (one local job)
• Origin propagation (one local job)
• Tag propagation (one local job), including propagating:
  • User-assigned tag associations
  • Tag associations defined by regular expressions
  • Tag associations defined by built-in reference data

When each job completes, the next job starts, whether or not the previous job completed successfully. The progress of each job is indicated by messages on the console. To see details for the MapReduce jobs, follow the job link provided in the console messages or use Hue to show the MapReduce jobs for the dedicated Waterline Data Inventory user.
Example:
$ bin/waterline profileHive default

To profile more than one database at a time, include multiple databases in the command, separated by commas with no space between names:
$ bin/waterline profileHive <database>,<database>

Profiling only for Hive tables
$ bin/waterline profileHiveOnly <Hive database> [HDFS staging directory]

This command profiles new and updated tables in the database or databases indicated. When run for the first time, this command profiles all tables in the indicated database. Subsequent runs identify changed, deleted, and new tables in the cluster and perform profiling only on those tables. Specifically, the profileHiveOnly command triggers the following individual operation:
• Profiling "crawl" (one or more MapReduce jobs depending on the size of data in each table)

The progress of each job is indicated by messages on the console. To see details for the MapReduce jobs, follow the job link provided in the console messages.
After profiling all the databases in the cluster, run the lineage discovery and tag propagation commands, described on pages 43 and 44.
Example:
$ bin/waterline profileHiveOnly default,finance

Displaying version information
$ bin/waterline showVersion

This command displays the Waterline Data Inventory version installed, including the Hadoop distribution the package was built for. If the Hadoop distribution listed here is different from the distribution running on the cluster, you may have configuration problems. Consider reinstalling with the matching Waterline Data Inventory package.


Monitoring Waterline Data Inventory jobs
Waterline Data Inventory provides a record of job history in the Dashboard of the browser application. In addition, you can follow detailed progress of each job on the console where you run the command.

Monitoring Hadoop jobs
When you run the "profile" command, you'll see an initial job for format discovery followed by one or more profiling jobs. There will be at least one profiling job for each file type Waterline Data Inventory identified in the format discovery pass.
The console output includes a link to the job log for the running job. For example:
2014-09-20 18:17:27,048 INFO [WaterlineData Format Discovery Workflow V2] mapreduce.Job (Job.java:submit(1289)) - The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1913847052944_0004/

While the job is running, you can follow this link to see the progress of the MapReduce activity.
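After a job finishes, you can also pull its container logs from the command line, using the application ID from the tracking URL (this assumes YARN log aggregation is enabled on the cluster):
$ yarn logs -applicationId application_1913847052944_0004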


Alternatively, you can monitor the progress of these jobs using Hue in a browser:
For Cloudera and MapR distributions:
http://<hostname>:8888/jobbrowser

For Hortonworks distributions:
http://<hostname>:8000/jobbrowser

You'll need to specify the dedicated Waterline Data Inventory user or, if the Waterline Data Inventory user has a corresponding account in Hue, sign in to Hue using that user.

Monitoring  local  jobs     After  the  Hadoop  jobs  complete,  Waterline  Data  Inventory  runs  local  jobs  to  process   the  data  collected  in  the  repository.  You  can  follow  the  progress  of  these  jobs  by   watching  console  output  in  the  command  window  in  which  you  started  the  job.  

Debugging  information   There  are  multiple  sources  of  debugging  information  available  for  Data  Inventory.  If   you  encounter  a  problem,  collect  the  following  information  for  Waterline  Data   support.   •

Job messages
Waterline Data Inventory generates console output for jobs run at the command prompt. If a job encounters problems, review the job output for clues to the problem. These messages appear on the console and are collected in log files with debug logging level:
MapReduce jobs (format discovery and profiling):
/var/log/waterline/wds-mrjobs.log

Waterline  Data  Inventory  Jobs  (Tag  propagation,  collection  discovery,  lineage   discovery)   /var/log/waterline/wds-inventory.log



• Web server messages
The embedded web server, Jetty, produces output corresponding to user interactions with the browser application. These messages appear on the console and are collected in a log file:
/var/log/waterline/wds-ui.log

Use  tail  to  see  the  most  recent  entries  in  the  log:   $ tail -f /var/log/waterline/wds-ui.log

• Lucene search indexes
In some cases, it may be useful to examine the search indexes produced by the product. These indexes are found in the following directory:
/var/lib/waterline/index



• Waterline Data Inventory repository
In some cases it may be useful to examine the actual repository files produced by the product. The repository datastore is found in the following directory:
/var/lib/waterline/db/waterlinedatastore

Profiling  results   After  Waterline  Data  Inventory  jobs  run  successfully,  there  may  still  be  individual   files  that  are  not  profiled  or  are  not  profiled  completely.  There  are  two  places  to   look  to  understand  the  results  of  a  profiling  job:   •

Dashboard.  From  inside  Waterline  Data  Inventory  browser  application,  click   Dashboard  in  the  toolbar.  This  page  lists  the  current  and  past  jobs.    If  files  in  a   job  produced  errors  and  were  not  processed  or  were  not  fully  processed,  the  job   status  indicates  the  errors.  



Single File View. The file information for each profiled file includes the profile status for the file. From inside the Waterline Data Inventory browser application, navigate to the file. File status values include:
• PROFILED. A significant portion of the file was profiled successfully, or the appropriate sample of the file was profiled (if sampling is turned on).
• PROFILE_FAILED. Profiling encountered too many errors in this file to produce profiling output. Look for specific errors in the output of the profiling job.
• CRAWLED. Profiling was not run or the profiling results were not written to the repository. In this case, Waterline Data Inventory will reprofile the file the next time the directory is included in a batch profiling job. Note that collections will always have a status of "CRAWLED"; the individual files that make up the collection will show specific profile status.

Optimizing  profiling  performance   In  terms  of  performance  optimization,  profiling  breaks  into  two  areas  to  consider:   MapReduce  operations  that  occur  on  the  cluster's  data  nodes  and  writing  profiling   data  to  the  Waterline  Data  Inventory  repository  on  the  edge  node.  Performance  in   these  areas  is  dependent  less  on  the  size  of  the  cluster  data  than  on  the  number  of   columns  in  the  cluster  data.  That  is,  a  2GB  file  with  30  columns  will  profile  faster   and  take  up  less  space  in  the  repository  than  a  2GB  file  with  300  columns.    


MapReduce job performance controls
The important factors in MapReduce performance are the number of CPUs available across the cluster and the amount of memory available on each node. In both cases, more is better.
Tuning Waterline Data Inventory to run on your cluster is like tuning any other MapReduce operation: you want to make sure that the volume of data being processed and the number of processes running at one time fit within the resources available on the cluster. As part of the cluster configuration (outside Waterline Data Inventory), configure Hadoop parameters based on the data node hardware configuration:
• Memory allocated for map tasks
• Memory allocated for reduce tasks
• Java heap space available

Once these parameters are in place, Waterline Data Inventory gives you the ability to control the number of map and reduce tasks started by Waterline Data Inventory MapReduce jobs. These numbers are bound by the number of CPUs available for processing. Within that limit, choose the number of map and reduce tasks based on the shape of the data you are processing, to keep the size of data each task processes more or less constant. Increase the maximum number of map or reduce tasks when processing many small files (more columns overall); decrease it when processing fewer, larger files (fewer columns overall).
Assuming Waterline Data Inventory is the only task running on the cluster (an unlikely assumption!), start with the maximum number of map tasks at 75% of the number of CPUs across the cluster and the maximum number of reduce tasks at 50% of the number of CPUs. These numbers can add up to more than 100% because it's unlikely that both mappers and reducers will reach their maximum limits at the same time for a given job.
By default, Waterline Data Inventory triggers MapReduce jobs sequentially; the configured number of map and reduce tasks applies to each job. If you have the resources, you can change Waterline Data Inventory's behavior to run jobs in parallel. You may need to reduce the number of map or reduce tasks to stay within your cluster's resources, as the maximum number of map tasks applies per job.
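The Hadoop-side parameters mentioned above are standard MapReduce settings. Representative values for a data node might look like the following in mapred-site.xml; the numbers are illustrative, not recommendations:
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1638m</value>
</property>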

Repository writing performance controls
The most important factor in optimizing the performance of writing profiling results to the Waterline Data Inventory repository is the number of input and output operations per second (IOPS). Profiling results increase in size based on the number of columns profiled, and this produces a lot of data to move from HDFS to the edge node.
The second most important factor is the efficiency of the repository database itself. While Waterline Data Inventory ships with embedded Derby configured as the repository, you can significantly improve performance in this area by upgrading to a multithreaded database.


There are two additional and related parameters to consider to ensure you get the best possible performance during the write to the repository. If processes are running out of memory while writing to the repository (post-processing operations after the MapReduce jobs have completed), you can adjust these parameters.

• Heap available for reading from HDFS (client operation)
Restricted by the amount of memory available on the edge node. This is set to 4 GB by default in the waterline script, HADOOP_HEAPSIZE setting:
/waterlinedata/bin/waterline



• Number of reducers
If you adjust the client operation memory and still run out of memory writing to the repository, you can increase the maximum number of reduce tasks available to Waterline Data Inventory jobs so that the volume of data produced by each reduce task is smaller.
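For the heap setting, the relevant line in the waterline script might look like the following sketch (HADOOP_HEAPSIZE is expressed in MB; confirm the variable's exact form in your copy of the script):
# in /waterlinedata/bin/waterline: raise the client heap from the 4 GB default
HADOOP_HEAPSIZE=8192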

Supporting self-service users
Waterline Data Inventory is designed to enhance the ability of users of Hadoop data to find the right data in Hadoop. It endeavors to open Hadoop to these users while reducing the burden on IT of providing that access and maintaining control over secure and sensitive data.
To achieve this balance of better data tools for end-users and a secure and controlled data environment, administrators configure end-user access to Waterline Data Inventory in the following ways:

• Secure access. Users of the Waterline Data Inventory browser application need to have accounts that can access the cluster, whether through Linux or through an authentication system running on Linux such as Kerberos. Waterline Data Inventory fully supports Kerberos-based single sign-on; thus if a user is already authenticated, no additional login is required to access the web application.

• HDFS and MapR-FS navigation. If users have a matching account in HDFS, the users' browsing home in Waterline Data Inventory will be their HDFS home directory. If the end-users of your organization's cluster data do not have accounts in HDFS, you can configure Waterline Data Inventory to open at a set location in HDFS. See Configuring additional Waterline Data Inventory functionality (page 53).

• Hive table creation. Waterline Data Inventory integrates with Hive in two ways: it reads Hive tables as part of profiling the cluster and it creates Hive tables from HDFS files upon user request. This second method provides a gateway for data users to act on files they identify using Waterline Data Inventory: users can request that a file be copied into a Hive table, then access the Hive database from visualization, reporting, and analytic tools outside the cluster.

Configuring web browsers for use with Kerberos
After users' computers are configured for Kerberos authentication, browsers may require additional configuration to support using SPNEGO-Kerberos for user authentication. The following resources should help you find the best way to ensure your users can access Waterline Data Inventory's browser application seamlessly:

Firefox
See "Integrated Authentication" (developer.mozilla.org/en-US/docs/Integrated_Authentication). We've found that these instructions work in our test environment.
1. Install the Firefox extension "Integrated Authentication for Firefox" (addons.mozilla.org/en-us/firefox/addon/integrated-auth-for-firefox/).
2. Inside Firefox, open Tools > Integrated Authentication Site.
3. Enter the host name where the Jetty web server is running.
Restarting the browser is not required.

Chrome
See "Activating Kerberos Support" (support.google.com/chrome/a/answer/187202?hl=en). We've found that these instructions work in our test environment:
1. Exit your Chrome browser.
2. Add the host name for the computer where Waterline Data Inventory's Jetty web server is running to the browser's list of accepted sites.
(OS X) Add the host name in ~/Library/Application Support/Google/Chrome/Local State:

"auth": { "server_whitelist" : "" },

The server_whitelist value accepts a comma-separated list of host names or patterns such as *example.com.
(Windows) Add the host name to the list of computers in the Local Intranet security zone. From the control panel, open Internet Options > Security and select "Local Intranet". Then open Sites > Advanced and add the web server host name to the zone.

Internet Explorer
See "Kerberos authentication and troubleshooting delegation issues" (support.microsoft.com/en-us/kb/907272).


Safari
No configuration needed.

Swapping out Derby for MySQL
Out of the box, Waterline Data Inventory runs an embedded Derby database instance as its repository of profiling and annotation data. The following instructions describe how to replace Derby with MySQL to persist Waterline Data Inventory metadata.
The process involves two steps:
• Set up a MySQL database with a user dedicated to Waterline Data Inventory operations
• Configure Waterline Data Inventory properties to point to the MySQL instance and database

These steps assume you have an installed instance of MySQL already running on your cluster, such as the instance used by the Hive metastore.
To swap out Derby for MySQL:
1. Sign in to the running MySQL instance as a DBA and create a user dedicated to Waterline Data Inventory operations. For example, create the "waterlinedata" user:
mysql> CREATE USER 'waterlinedata' IDENTIFIED BY 'waterlinedata';

2. Create  a  MySQL  database  "waterlinedatastore".   mysql> CREATE DATABASE waterlinedatastore;

3. Switch to the newly created waterlinedatastore database and execute the following grants, where <hostname> is replaced with the host name for the node where Waterline Data Inventory is running.
mysql> use waterlinedatastore;
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinedata'@'%' IDENTIFIED BY 'waterlinedata';
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinedata'@'<hostname>' IDENTIFIED BY 'waterlinedata';
mysql> GRANT USAGE ON waterlinedatastore.* TO 'waterlinedata'@'localhost' IDENTIFIED BY 'waterlinedata';
mysql> flush privileges;
mysql> GRANT all ON waterlinedatastore.* TO 'waterlinedata'@'localhost' IDENTIFIED BY 'waterlinedata';
mysql> GRANT all ON waterlinedatastore.* TO 'waterlinedata'@'<hostname>' IDENTIFIED BY 'waterlinedata';
mysql> GRANT all ON waterlinedatastore.* TO 'waterlinedata'@'%' IDENTIFIED BY 'waterlinedata';
mysql> flush privileges;
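Before editing the Waterline Data Inventory configuration, you can confirm the new account works by connecting to the database as that user (a quick sanity check; supply the password chosen above when prompted):

mysql -u waterlinedata -p waterlinedatastore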

4. Use  the  parameters  from  the  previous  steps  to  edit  /waterlinedata/lib/resources/environment.properties:  


javax.persistence.jdbc.driver=com.mysql.jdbc.Driver
javax.persistence.jdbc.url=jdbc:mysql://<hostname>:3306/waterlinedatastore?createDatabaseIfNotExist=true
javax.persistence.jdbc.user=waterlinedata
javax.persistence.jdbc.password=<password>

Configuring  additional  Waterline  Data  Inventory  functionality   Waterline  Data  Inventory  provides  a  number  of  configuration  settings  and   integration  interfaces  to  enable  extended  functionality.  These  settings  are  managed   as  properties  in  properties  files  in  /lib/resources.  

Communication  among  Hadoop  components   The  following  configuration  properties  identify  how  Waterline  Data  Inventory   components  communicate  with  Hadoop  and  other  applications  in  the  Hadoop   environment.   Communication  between  Waterline  Data  Inventory  and  Hadoop   This  property  identifies  the  location  of  the  cluster  that  the  Waterline  Data  Inventory   browser  application  accesses.  If  you  are  installing  Waterline  Data  Inventory  on  an   existing  cluster  (rather  than  in  a  pre-­‐configured  VM)  you'll  need  to  set  this  value.   [environment.properties  file]   waterlinedata.crawler.fs.uri=maprfs:/// (example) waterlinedata.crawler.fs.uri=hdfs://sandbox.hortonworks.com:8020 (example)

Communication between Jetty and Hadoop
The Waterline Data Inventory embedded web server, Jetty, communicates directly with HDFS or MapR-FS as well as with the repository. By default, Jetty uses the native Java API to retrieve data from HDFS. Waterline Data Inventory provides a configuration property to enable WebHDFS so you can access HDFS from a remote location.
[webapp.properties file]
waterlinedata.usewebhdfs=false (default)
waterlinedata.webhdfs.uri=

For  example:   waterlinedata.usewebhdfs=true waterlinedata.webhdfs.uri=webhdfs://sandbox.hortonworks.com:50070/ (example)

Communication  between  Waterline  Data  Inventory  and  Hive   Waterline  Data  Inventory  can  read  and  write  data  to  Hive.  If  you  are  installing   Waterline  Data  Inventory  on  an  existing  cluster,  you'll  need  to  set  this  value  to   enable  the  Hive  functionality.    


The  following  property  describes  the  Hive  connection.  It  is  shown  here  with   example  entries  for  SSH  authentication  to  a  server  on  the  same  computer  where   Waterline  Data  Inventory  is  installed:   [environment.properties  file]   waterlinedata.hiveurl=jdbc:hive2://localhost:10000/default

Communication between Waterline Data Inventory and Derby
Waterline Data Inventory includes embedded Derby as its repository database. Both Waterline Data Inventory jobs and the web server access Derby using the following connection information. You won't need to change this information unless you are replacing Derby with another database, you need to change the default port selection, or you want to change the default password. The values shown here are examples:
[environment.properties file]
javax.persistence.jdbc.driver=org.apache.derby.jdbc.ClientDriver
javax.persistence.jdbc.url=jdbc:derby://sandbox.hortonworks.com:4444/waterlinedatastore;create=true
javax.persistence.jdbc.user=waterlinedata
javax.persistence.jdbc.password=

When security is not a factor, you can insert Derby credentials in plain text; however, Waterline Data Inventory provides a utility to obfuscate stored passwords, as described in Obscuring passwords in Waterline Data Inventory configuration files (page 65).
Changing the default Derby communication port
By default, Waterline Data Inventory's instance of Derby communicates on port 4444. If you need to change that port number to avoid a conflict with another Hadoop process, stop Derby and Jetty, then update the port number in the following locations:
1. Repository configuration (lib/resources/environment.properties)
On one line:
javax.persistence.jdbc.url=jdbc:derby://<hostname>:4444/waterlinedatastore;create=true

2.  Derby  configuration  (lib/resources/derby.properties)   derby.drda.portNumber=4444

3. Environment  configuration  (bin/detectenv)   DERBY_PORT=4444
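For example, to move Derby to port 4445 (an arbitrary port chosen for illustration), make the same change in all three files and then restart Derby and Jetty:

# lib/resources/environment.properties
javax.persistence.jdbc.url=jdbc:derby://<hostname>:4445/waterlinedatastore;create=true
# lib/resources/derby.properties
derby.drda.portNumber=4445
# bin/detectenv
DERBY_PORT=4445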


Setting the location and persistence of temporary files
The following configuration properties allow you to specify the location of staging files that Waterline Data Inventory creates when it collects profiling information from HDFS files (and Hive tables). An additional property defines where Waterline Data Inventory stages temporary files during discovery processes running on the local edge node.
Temporary files for HDFS and Hive profiling
Use this configuration property to identify the HDFS (or MapR-FS) directory Waterline Data Inventory uses when it needs to generate temporary files while profiling HDFS files or Hive tables. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, this property is commented out and temporary files are placed in .waterlinedata in the first directory profiled on the cluster. When you set this property, make sure to remove the comment mark at the beginning of the line. When profiling Hive tables, you can override this value by specifying a staging location on the command line.
[environment.properties file]
waterlinedata.profile.processingdirectory=

Backing files for Hive tables
A similar property controls the location of file copies created when users create Hive tables from Waterline Data Inventory. Only some file format types require copies. See Hive table backing file location, page 62.
Staging area for discovery tasks
This property indicates the local file system directory Waterline Data Inventory uses to store temporary files created during discovery processing. Make sure that the dedicated Waterline Data Inventory user has write access to the configured location. By default, this value is set to /tmp.
[environment.properties file]
waterlinedata.temproot=
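For example, the two staging locations might be set as follows; both paths are illustrative and must be writable by the dedicated Waterline Data Inventory user:

[environment.properties file]
waterlinedata.profile.processingdirectory=/user/waterlinedata/.staging
waterlinedata.temproot=/var/tmp/waterlinedata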

Starting  the  web  server  in  a  Kerberos  environment   Because  there  can  be  more  than  one  user  on  the  edge  node  with  valid  Kerberos   credentials,  Waterline  Data  Inventory  needs  to  know  the  keytab  and  username  for   the  dedicated  Waterline  Data  Inventory  user.  Otherwise,  the  web  server  attempts  to   start  using  the  first  user  information  provided  as  the  "current  user."  To  identify  the   keytab  location  and  Kerberos  principal  for  the  dedicated  Waterline  Data  Inventory   user,  set  the  following  properties:   [environment.properties  file]   waterlinedata.web.kerberos.keytab.location=


waterlinedata.web.kerberos.username=

For  example,  "[email protected]"  
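A filled-in pair of entries might look like this (the keytab path and principal are illustrative):

waterlinedata.web.kerberos.keytab.location=/etc/security/keytabs/waterlinedata.keytab
waterlinedata.web.kerberos.username=waterlinedata@EXAMPLE.COM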

Secure communication between browser and web server (SSL)
You can configure Waterline Data Inventory to use SSL to communicate between the client where the browser is running and the web server. This setup requires:
• A server X.509 certificate for the external web server address. This can be a certificate from a commercial authority such as RSA or VeriSign, or a self-signed certificate.
• A secure keystore inside Waterline Data Inventory's Jetty web server distribution.

The Jetty documentation provides instructions for generating a self-signed certificate and for creating and loading keystore values:
www.eclipse.org/jetty/documentation/current/configuring-ssl.html#generating-key-pairs-and-certificates
The Waterline Data Inventory Jetty configuration is included in the following directory:
/waterlinedata/jetty-distribution-*/waterlinedata-base

Configuration files include:

Component    Configuration File Location
Keystore     etc/keystore
HTTPS        start.d/https.ini
SSL          start.d/ssl.ini
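As a sketch of the self-signed path (the alias and distinguished-name values are illustrative; see the Jetty documentation above for the full procedure), you can generate a key pair directly into the Jetty keystore with the JDK's keytool:

keytool -genkeypair -alias jetty -keyalg RSA -keysize 2048 \
    -keystore /waterlinedata/jetty-distribution-*/waterlinedata-base/etc/keystore \
    -dname "CN=waterline.example.com"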

Browser app functionality
The following sections describe the properties used to control aspects of the Waterline Data Inventory browser application.
Self-service browsing
If your end-users have accounts in HDFS or MapR-FS and corresponding home directories, Waterline Data Inventory uses those directories as the users' home in the browser application: clicking "Browse" in Waterline Data Inventory opens the HDFS directory corresponding to the current user.
If your end-users do not have accounts in HDFS, Waterline Data Inventory defaults to the HDFS root directory. To improve end-users' experience, consider setting the home directory each user sees when they open Waterline Data Inventory. Set the HDFS directory path in the following property:
[webapp.properties file]
waterlinedata.defaultdirectory=


For  example:   waterlinedata.defaultdirectory=/user/waterlinedata/Landing

Event auditing
For folders, files, tags, lineage relationships, and origins, Waterline Data Inventory collects the events that occur to each object. For example, Waterline Data Inventory records when a tag was created and when and by whom it was associated with a file or a field. Collecting this information has a small performance impact on the browser application and increases the size of the repository.
You can keep Waterline Data Inventory from collecting new events by setting the following property to false:
[waterlinedata.properties file]
waterlinedata.auditing.enabled=true (default)

By  default,  Waterline  Data  Inventory  caches  information  for  pages  viewed  through   the  web  application.  You  can  control  the  length  of  time  objects  are  cached  (timeout)   and  the  number  of  objects  cached.  The  default  timeout  is  set  to  ensure  that  the  web   application  does  not  have  to  query  the  server  each  time  the  same  file  is  viewed  in  a   user's  process  of  evaluating  the  file.  The  number  of  objects  cached  refers  to  items   the  server  supplies  to  populate  the  browser  interface;  "objects"  does  not  correspond   to  "files"  or  "tables".  We  recommend  that  you  keep  the  default  values  unless  you  are   working  with  Waterline  Data  Technical  Support  to  solve  a  specific  issue.   [webapp.properties  file]   waterlinedata.web.cache.enable=true waterlinedata.web.cache.size=1000 waterlinedata.web.cache.timeout=100

Browser timeout
Waterline Data Inventory automatically signs users out of the browser application after 30 minutes. To change this default, edit /waterlinedata/jetty-distribution-*/waterlinedata-base/etc/webdefault.xml and add or update the following section (the element names follow the standard servlet webdefault.xml convention):

<session-config>
    <session-timeout>30</session-timeout>
</session-config>

To remove any timeout, change this setting to -1.


Profiling functionality
You have control over many aspects of profiling using properties configured in the profiler.properties file:
• Setting persistence of temporary files (page 58)
• Using samples to calculate data metrics (page 58)
• Re-profiling existing files versus profiling only new and changed files (page 58)
• Controlling the number of map tasks used per MapReduce job (page 59)
• Controlling the number of reduce tasks used per MapReduce job (page 59)
• Running MapReduce jobs in parallel (page 60)
• Configuring additional date formats (page 60)
• Identifying field separators (page 60)
• Controlling most frequent data values (page 61)

Setting persistence of temporary files
The following configuration property allows you to keep the staging files in place between profiling runs for debugging purposes. It is true by default: temporary files are deleted after profiling is complete.
[profiler.properties file]
waterlinedata.deletetempfiles=true

Using samples to calculate data metrics
By default, Waterline Data Inventory uses all data in files to calculate field-level metrics such as the minimum and maximum values, the cardinality and density of the values, and the most frequent values. You can achieve better profiling performance on very large files by sampling the file data for these operations. When sampling is enabled, Waterline Data Inventory reads the first and last blocks in the file and enough other blocks to reach the sample fraction you specify. For example, with a sample fraction of 10%, Waterline Data Inventory will read 6 blocks of a 250 MB file, including the first block, the last block, and 4 additional blocks chosen at random (assuming a 4096 KB block size).
[profiler.properties file]
waterlinedata.profile.sampled=false (by default)
waterlinedata.profile.sampled.fraction=0.1 (by default)
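To enable sampling at the 10% fraction used in the example above, set:

waterlinedata.profile.sampled=true
waterlinedata.profile.sampled.fraction=0.1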

Re-profiling existing files versus profiling only new and changed files
By default, Waterline Data Inventory only profiles new files or files that have changed since the last profiling job. Change the following property to false to reprofile all files in the target directory. You might choose to do this if you add date formats (see page 60) or change other parameters that affect the profiling data collected.
[profiler.properties file]
waterlinedata.incremental=true (by default)

The block size in your cluster is configurable. If your block size is large relative to the size of your data files, it may not make sense for you to enable sampling. To determine your cluster's block size, see the following configurations:

Distribution    Configuration Parameter    Location           Default Value
CDH 5.x         dfs.blocksize              hdfs-site.xml      128 MB
HDP 2.x         dfs.blocksize              hdfs-site.xml      128 MB
MapR 4.x        ChunkSize                  .dfs_attributes    256 MB

For more information, see:
• CDH: http://archive.cloudera.com/cdh5/cdh/5/hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
• HDP: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.0.6.0/ds_Hadoop/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
• MapR: http://doc.mapr.com/display/MapR/Chunk+Size

Controlling  the  number  of  map  tasks  used  per  MapReduce  job   You  can  limit  the  number  of  mappers  Waterline  Data  Inventory  generates  per   profiling  job.  You  might  consider  setting  a  mapper  limit  when  you  are  profiling   many  small  files;  by  default,  the  ability  to  combine  multiple  files  into  a  single   mapper  is  enabled  and  set  to  limit  mappers  to  999.     To  control  the  number  of  mappers  per  job,  set  the  following  properties  in   waterlinedata/lib/resources/profiler.properties:   waterlinedata.profile.combinedmapper=true waterlinedata.profile.combined.max_mappers_per_job=

Controlling the number of reduce tasks used per MapReduce job
Waterline Data Inventory allows you to configure the maximum number of reducers used by MapReduce profiling jobs. Consider adjusting this value if jobs are running out of memory during the reduce tasks of Waterline Data Inventory MapReduce jobs.
We recommend setting this control to a relatively small number: smaller than the number of files typically processed by the profiling job and smaller than the number of map tasks used.
The option is set to 5 reduce tasks by default.


[profiler.properties  file]   waterlinedata.profile.reducer.count=5

Running  MapReduce  jobs  in  parallel   By  default,  Waterline  Data  Inventory  runs  MapReduce  profiling  jobs  one  after  the   other.  If  you  have  the  cluster  resources  to  run  jobs  in  parallel  or  if  you  are   controlling  the  resources  used  through  YARN  or  other  resource  management  tools,   consider  changing  this  behavior  to  allow  Waterline  Data  Inventory  to  trigger  more   than  one  MapReduce  job  at  the  same  time.   The  option  is  set  to  true  by  default.     [profiler.properties  file]   waterlinedata.waterlinedata.runjobsinseq=true
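Note that the property expresses the default, sequential behavior, so running jobs in parallel means setting the property listed above to false:

[profiler.properties file]
waterlinedata.waterlinedata.runjobsinseq=false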

Configuring additional date formats
When Waterline Data Inventory profiles string data, such as in delimited files where no type information is available, it examines the data to reveal likely data types. It uses the format conventions described by the International Components for Unicode (ICU) for dates and numeric values. You can add your own date formats using the conventions described here:
icu-project.org/apiref/icu4j/com/ibm/icu/text/SimpleDateFormat.html
The pre-defined formats are listed in the profiler properties file.
[profiler.properties file]
waterlinedata.profile.datetime.formats=EE MMM dd HH:mm:ss ZZZ yyyy, M/d/yy HH:mm, EEE MMM d h:m:s z yy, yy-MM-dd hh:mm:ss ZZZZZ, yy-MM-dd,yy-MM-dd HH:mm:ss,yy/M/dd,M/d/yy hh:mm:ss a, YYYY-MM-dd'T'HH:mm:ss.SSSSSSSxxx
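For example, to have the profiler also recognize European-style dates such as 31.12.2014, you could append the ICU pattern dd.MM.yyyy to the comma-separated list (abbreviated here; keep the existing entries in place):

waterlinedata.profile.datetime.formats=EE MMM dd HH:mm:ss ZZZ yyyy, ..., YYYY-MM-dd'T'HH:mm:ss.SSSSSSSxxx, dd.MM.yyyy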

Identifying  field  separators   Waterline  Data  Inventory  parses  flat  files  such  as  comma-­‐separated  or  log  files  to   determine  field  separators,  looking  for  characters  that  are  repeated  within  each  row   of  the  file.  If  it  finds  more  than  one  candidate  for  a  field  delimiter,  it  ranks  the   choices  based  on  the  number  of  occurrences  of  the  character  in  the  file  and  uses  the   highest  ranked  candidate.     You  can  tell  Waterline  Data  Inventory  to  remove  some  characters  from   consideration  as  field  delimiters.  There  are  a  number  of  characters  not  considered   as  delimiters  by  default;  you  may  find  that  you  need  to  remove  characters  from  this   configuration  to  correctly  parse  your  data.   [profiler.properties file] waterlinedata.profile.format.discovery.non_separators="+-.\\/\"`()[]{}'"


To include special characters such as tabs, follow the Java conventions for escape sequences described here:
docs.oracle.com/javase/tutorial/java/data/characters.html
Controlling most frequent data values
Waterline Data Inventory collects 2000 of the most frequent values in each field in each file. You can change the number of values collected, control how many characters are included in each sample, and specify how many of these values are used in search indexes and to propagate tags.
[profiler.properties file]
Number of most frequent values collected
waterlinedata.profile.top_k_capacity=2000 (by default)

Size  limit  of  strings   waterlinedata.max.top_k_length=128 (by default)

Number  of  most  frequent  values  used  in  search  indexes  and  UI  lists   waterlinedata.profile.top_k=50 (by default)

Number  of  most  frequent  values  used  to  determine  tag  association  matches   waterlinedata.profile.top_k_tokens=100 (by default)

Hive functionality
The following properties control interaction with Hive. For Hive connection information, see Communication between Waterline Data Inventory and Hive (page 53).
Hive table profiling
By default, Waterline Data Inventory does not profile Hive tables: from the Hive root in the browser application, users will see Hive tables, but schema-level details for the tables are not available. You can profile Hive tables using the "profileHive" script command (see page 42).
Always profile Hive tables
To include Hive table profiling in all Waterline Data Inventory profiling jobs, set the following option to 'true'. This option is not needed if you use the "profileHive" and "profileHiveOnly" commands: these commands override the value of this property.
[profiler.properties file]
waterlinedata.profilehive=false (default)

Clear deleted Hive tables
By default, when profiling Hive tables, Waterline Data Inventory reviews the tables in the database to ensure that the data they are based on still exists. If the backing files have been deleted for a given table, Waterline Data Inventory clears out the table. You can turn off this check; turning it off reduces the overall profiling time by a small amount.
[profiler.properties file]
waterlinedata.cleanorphanedhivetables=true

Hive  table  backing  file  location   When  users  create  Hive  tables  from  ORC,  RC,  and  Sequence  files,  Waterline  Data   Inventory  creates  a  copy  of  the  data  in  the  HDFS  (or  MapR-­‐FS)  directory  specified  by   this  property  and  creates  the  Hive  table  from  the  copied  file  or  files.  The  browser   application  includes  links  between  the  backing  file  and  the  Hive  table.  If  users  create   Hive  tables  from  text,  JSON,  or  log  files  or  collections,  Waterline  Data  Inventory  does   not  create  a  copy  of  the  file  before  creating  the  Hive  table.  By  default,  this  property   is  commented  out  and  the  backing  files  are  placed  in  the  active  user's  home   directory.   An  additional  property  indicates  whether  Waterline  Data  Inventory  should  always   make  copies  of  the  original  data  or  build  a  Hive  table  from  the  original  file  when  it   can.  Consider  disabling  creating  Hive  tables  in  place  when  users  are  unlikely  to  have   write  permission  to  the  directory  in  which  the  original  HDFS  file  is  located.   [environment.properties  file]   waterlinedata.profile.hivedir= waterlinedata.hive.create_table_in_place=true
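For example, to stage all backing copies in a shared location and always copy rather than build tables in place (both values illustrative):

[environment.properties file]
waterlinedata.profile.hivedir=/user/waterlinedata/hive-backing
waterlinedata.hive.create_table_in_place=false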

Discovery functionality
The following properties control how Waterline Data Inventory makes suggestions for lineage relationships among files and for tag associations. Note that from the tag glossary you can disable tag propagation for individual tags, including built-in tags.
Data type discovery
When Waterline Data Inventory profiles data that does not have type information, it reads field values to determine data types. Use this property to disable data type discovery (-1), to use all field values to determine data types (0), or to limit data type discovery to the most frequent values (1, the default) as identified by the profiling property waterlinedata.profile.top_k_capacity (page 61).
[discovery.properties file]
waterlinedata.profile.data_format_discovery=1

Balancing profiling performance against data quality calculations
Waterline Data Inventory calculates cardinality and selectivity for each field in each file profiled. In addition, it collects a sample of the most frequent values in the field. Use this parameter to reduce the amount of time Waterline Data Inventory spends during profiling on making the sample lists accurate. By default, this optimization is disabled.
[discovery.properties file]
waterlinedata.profile.high_cardinality.optimization=false (default)

Thresholds  for  what  tag  suggestions  are  exposed   Waterline  Data  Inventory  has  default  values  set  for  all  field-­‐level  tag  propagation  as   follows.  Some  of  these  values  can  be  configured  individually  for  each  tag  from  the   Glossary.   [discovery.properties  file]   Waterline  Data  Inventory  gives  a  weight  to  its  suggestions  for  matching  tag   associations.  You  can  choose  to  expose  more  or  fewer  of  these  suggestions  by   configuring  the  cutoff  weight.  Tag  associations  whose  calculated  weight  is  below   this  value  are  not  exposed  to  users.  You  can  set  this  value  per  tag  from  the  Glossary.   waterlinedata.discovery.tolerance.weight=40.0 (by default)

Limit  to  the  number  of  pre-­‐defined  tags  that  will  be  suggested  for  a  given  field.   waterlinedata.discovery.tags.max_suggested_ref_tables=3

Limit  to  the  number  of  any  tags  that  will  be  suggested  for  a  given  field.   waterlinedata.discovery.tags.max_suggested=3

Eliminating  weak  associations.  If  more  than  one  tag  is  suggested  for  a  field,  the  tag   with  the  highest  weight  will  be  suggested;  other  tags  must  be  within  this  value  of  the   top  tag  for  those  tags  to  be  suggested  in  addition  to  the  top  tag.   waterlinedata.discovery.tags.value_hit_diff=20.0

Tag association for low-cardinality data
When fields have low cardinality (the same values appear many times in the field for the file), tag propagation can be skewed toward making connections that are not representative of the data. Waterline Data Inventory provides some tools to help you avoid false positive tag associations among fields with low cardinality.
Conventions for indicating missing values
One common case where low-cardinality values cause unexpected tag associations is when the data includes one or more values to indicate that there isn't a value. For example, if data uses a convention of "not available" or "NA" in the file to identify places where values are not provided, this value may be mistakenly considered to be related to other data that also uses "not available" or "NA" even though other values in the data are unrelated.
Waterline Data Inventory provides a blacklist of values that should be ignored when making low-cardinality matches. You can modify this comma-separated list to meet the requirements of your data, including providing localized versions of these indicators. Note that you should include values in lower case, as all field values are changed to lower case when matches are calculated.
[discovery.properties file]
waterlinedata.discovery.tags.null_values=na,n/a,unspecified,not available,null,empty,blank,missing

Tag  propagation  among  low  cardinality  values   For  low  cardinality  values  (few  distinct  values  among  all  field  values),  Waterline   Data  Inventory  requires  100%  of  the  values  in  the  candidate  field  to  match  for  a  tag   to  be  associated  with  the  candidate  field.  By  default,  "low  cardinality"  fields  are   fields  with  two  or  fewer  distinct  values.  To  require  more  values  from  a  candidate   field  to  match  before  a  tag  is  suggested  for  an  association,  change  the  following   option  to  a  larger  number.    [discovery.properties  file]   waterlinedata.discovery.tags.min_cardinality.partial_match=2

Tag association using tag rules
Some built-in tags and the tags defined by users can have tagging rules that use regular expressions to identify field data that should be associated with the tag. Use this property to disable evaluating tagging rules (-1), to use all field values to identify matches with tagging rules (0), or to limit tagging rule evaluation to the most frequent values (1, the default) as identified by the profiling property waterlinedata.profile.top_k_capacity (page 61).
[discovery.properties file]
waterlinedata.profile.regex_evaluation=1

Controlling collections discovery
By default, Waterline Data Inventory only considers folders with 3 or more files (in any one folder of a recursive tree) to be candidates for a collection. You can adjust this value to better reflect the organization of your cluster. Note that there are other qualifications that must be met before the files in the folder are marked as a collection.
[discovery.properties file]
waterlinedata.discovery.smallest.collection.size=3 (by default)

Controlling lineage relationship discovery
When reviewing files for lineage relationships, Waterline Data Inventory is able to tolerate a number of changes to file schemas and data and still find a connection among files. These properties control the parameters used to determine a lineage relationship.
The amount of overlapping data between fields required to consider the files matching:


waterlinedata.discovery.lineage.overlap=0.9 (by default)

If  multiple  fields  from  the  same  resource  match  the  fields  from  another  resource,   Waterline  Data  Inventory  uses  field  names  to  determine  if  the  fields  match.  This   mechanism  is  used  only  if  field  names  are  similar  within  the  percentage  indicated  by   this  property,  0.8  (80%)  by  default.   waterlinedata.discovery.lineage.field_name_match=0.8

Use  HDFS  last  access  date  to  limit  lineage  relationship  candidates.  The  HDFS   property  dfs.namenode.accesstime.precision  in  hdfs-site.xml  must  be  enabled.   (Note  that  there  is  no  provision  for  tracking  access  time  in  MapR.)   waterlinedata.discovery.lineage.use_access_time_filter=true

Limit the time between access of a parent file and creation of a child. This criterion is ignored (no time checking) if set to 0.
waterlinedata.discovery.lineage.batch_window_hours=24

Obscuring passwords in Waterline Data Inventory configuration files
To convert passwords to obfuscated values, run the following command, provide the password to obscure, and then insert the output in the appropriate resource file.
/waterlinedata/bin/obfuscate

The  output  is  also  saved  as  obfuscate.out.  
