Advanced Document Data Extraction

Author: Brook Byrd
Advanced Document Data Extraction with Esker Teach
DAN STRONG

CONTENTS
In this session we will discuss:
§ How to teach multiple layouts
§ Tips and tricks: line item extraction
§ Regular expressions
§ Q&A and teaching your documents

#EAUC2016

GETTING STARTED WITH TEACHING

REVIEW: TEACHING OR NOT …

WHAT CAN BE FIXED AND WHAT CANNOT?
Not all issues can be fixed with teaching!

Problems that cannot be fixed with teaching:
§ Expected value is not in the document (if not constant)
§ Expected value is handwritten
§ The document layout/quality does not permit correct data recognition

Problems that can be fixed with teaching:
§ Incorrect data is extracted due to incorrect zone being targeted
§ Data is well located but partially extracted or not extracted
§ The business partner is not, or is incorrectly, recognized

TEACHING MULTIPLE LAYOUTS

WHAT IF A BUSINESS PARTNER USES SEVERAL LAYOUTS?
§ You can teach several layouts for a single business partner
§ Best to manually fix extraction errors for layouts that are rarely sent

DIFFERENT LAYOUT
To check if the business partner sent a document with a different layout:
1. Teach with current file
2. Check the “Ref.” box on top of the document display

The document display area then shows the original document overlaying the current document.

CAPTURING FIELD DATA

DEFINING WHERE THE VALUE IS SEARCHED
§ You can frame the exact area and specify that the area position is not fixed in the document (floating). This is the default and recommended option.
§ You can frame a large fixed area, then specify which data to retain

FLOATING AREA
§ Surrounding text is used as a reference to locate the extraction area

[Figure: ORIGINAL and INCOMING documents side by side. Words highlighted in blue are used to reposition the extraction area.]
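
The repositioning idea behind a floating area can be pictured with a small sketch. This is only a conceptual illustration, not Esker's actual algorithm: the `Box` type, the `reposition` function and all coordinates are hypothetical, and a single reference word is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical rectangle on the page: top-left corner plus size."""
    x: float
    y: float
    width: float
    height: float


def reposition(area: Box, anchor_original: Box, anchor_incoming: Box) -> Box:
    """Shift the extraction area by the same offset as the reference word.

    Assumption: one reference word, found in both documents, is enough
    to describe how the layout moved.
    """
    dx = anchor_incoming.x - anchor_original.x
    dy = anchor_incoming.y - anchor_original.y
    return Box(area.x + dx, area.y + dy, area.width, area.height)


# Invented coordinates: the label "Invoice No" sits 40 units lower
# in the incoming document than in the original one.
taught_area = Box(300, 100, 120, 20)
label_original = Box(200, 100, 80, 20)
label_incoming = Box(200, 140, 80, 20)

moved_area = reposition(taught_area, label_original, label_incoming)
print(moved_area)
```

The taught area keeps its size; only its position follows the reference word.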

FIXED AREA
§ You would usually frame a large area and specify what to look for in this area

DEFINING WHAT SHOULD BE EXTRACTED
You can specify what type of data should be extracted:
§ Date
– Several possible formats
§ Number
– Several possible formats
§ Regular expression
– [A-Z]{3}-[0-9]{4} would extract ZBT-2455
§ Pattern
– aaa-nnnn would extract ZBT-2455
– Several possible formats
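
The regular expression above can be tried outside the product, for instance in Python. This is a sketch only: the sample sentence is invented, and the translation of the pattern aaa-nnnn (a = one letter, n = one digit) into a regular expression is an assumption about how the pattern format behaves, not documented Esker behavior.

```python
import re

# Regular expression from the slide: 3 uppercase letters, a dash, 4 digits.
PO_REGEX = re.compile(r"[A-Z]{3}-[0-9]{4}")


def pattern_to_regex(pattern: str) -> str:
    """Translate a simple pattern such as 'aaa-nnnn' into a regex.

    Assumption: 'a' stands for one letter and 'n' for one digit;
    every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "a":
            parts.append("[A-Za-z]")
        elif ch == "n":
            parts.append("[0-9]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


text = "Order ref: ZBT-2455 dated 01/02/2016"  # invented sample text
regex_hit = PO_REGEX.search(text).group(0)
pattern_hit = re.search(pattern_to_regex("aaa-nnnn"), text).group(0)
print(regex_hit, pattern_hit)
```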

REFERENCE COLUMN(S) – LINE ITEMS
§ A reference column is a column that introduces a new row in the table
§ It should be the column that is most representative of the row you want to extract
§ Several columns can be used together to define a new row in a table

General rules:
• Do not use columns with optional items
• Do not use columns with a variable number of lines per row
• Use columns where the format is known (number, date or a regular expression)
• Too many references can lead to missing rows
• On the other hand, not enough references can lead to incorrect rows
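
How a reference column introduces rows can be sketched in a few lines. This is an illustration under assumptions, not the product's implementation: the table lines are invented, and the reference column is assumed to be an item number of 4 to 6 digits at the start of a line.

```python
import re

# Hypothetical reference column: an item number of 4 to 6 digits at line start.
REFERENCE = re.compile(r"^[0-9]{4,6}\b")


def group_rows(lines):
    """Start a new row at each line matching the reference column;
    any other line is attached to the row in progress."""
    rows = []
    for line in lines:
        if REFERENCE.match(line):
            rows.append([line])
        elif rows:
            rows[-1].append(line)
    return rows


table_lines = [  # invented sample data
    "10234  Widget, blue      2   10.00",
    "       extra description line",
    "10235  Gasket            5    1.50",
]
rows = group_rows(table_lines)
print(len(rows))
```

A column with a known format works well here because every row has exactly one match; an optional column would skip rows, which is the "missing rows" risk named above.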

LINE ITEM DATA EXTRACTION APPLIED TO A BUSINESS DOCUMENT

[Figure: business document with the callout “Number here”. This field is called the reference column; it introduces a new row in the table.]

TABLE SEARCH AREA
§ When line items are always in the same area across all pages, refining the search scope lets you:
– Speed up the extraction process
– Avoid extracting irrelevant information

Navigate through your document pages to make sure line items are located in the selected area.

DEFINING LINE ITEM FIELDS (COLUMNS)
§ You redefine all required fields by capturing the data on the first row

The area you frame should be wider than the current value to handle other possible values.

HANDLE TABLES SPLIT INTO TWO PARTS
§ A table may be split into two parts as a result of a page break
→ Select the option ‘Merge an item on page break’ to ensure the two parts are grouped

This option is available when editing a column and only when a table search area has been defined.

HANDLE ROWS WITH VARIABLE NUMBER OF LINES
§ Rows may have a variable number of lines
→ Select the option ‘Capture full row height’ to capture all lines of the row when needed

This option is available when editing a column.

REPLACE A STRING
§ You can replace a string captured from the document with another string
§ You can use this replacement system:
– When the characters recognized by the OCR are not what you expect (e.g., replace T0 with TO)
– To remove a description or a comment from a column
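
The effect of a replacement can be pictured with plain string substitution. This is a minimal sketch, assuming simple literal replacement; the function name and the sample values are invented, and the product's own replacement options may behave differently.

```python
def apply_replacements(value, replacements):
    """Apply each (captured string -> replacement) pair in order."""
    for old, new in replacements.items():
        value = value.replace(old, new)
    return value


# OCR read the letter O as the digit 0: replace "T0" with "TO".
fixed_label = apply_replacements("SHIP T0: ACME Corp", {"T0": "TO"})

# Remove a comment captured together with the amount.
fixed_amount = apply_replacements("1200.00 (incl. freight)", {" (incl. freight)": ""})

print(fixed_label)
print(fixed_amount)
```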

REGULAR EXPRESSIONS
§ Start with the basics and refer to the online documentation for commonly used characters
§ Use online tools like regextester.com to test your regular expressions
§ Build a cheat sheet

Regular expressions can be defined as a data format or as part of the search parameters.

REGULAR EXPRESSIONS
Regular expression common characters:
[A-Z] : Uppercase character
[a-z] : Lowercase character
[A-Za-z] : Uppercase or lowercase character (the shorter [A-z] also matches the ASCII characters [ \ ] ^ _ `)
[0-9] : Any digit between 0 and 9
\ : Escape character

Regular expression wildcards:
.* : Searches for all characters
. : Searches for any single character
[-] : Searches for any character in the range
[^-] : Searches for any character that is not in the range
[ ] : Searches for any single character in the list
* : Searches for 0 to n occurrences of the character or regular expression situated immediately to the left
+ : Searches for at least one occurrence of the character or regular expression situated immediately to the left
? : Searches for 0 to 1 occurrence of the character or regular expression situated immediately to the left
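
The characters and wildcards listed above can be checked quickly in Python. This sketch assumes Python's `re` syntax, which agrees with the list on these basics; the sample strings are invented. It also shows why [A-z] is risky as a shorthand for "any letter".

```python
import re

# pattern -> a sample string the whole pattern should match
samples = {
    r"[A-Z]": "Q",              # one uppercase letter
    r"[a-z]": "q",              # one lowercase letter
    r"[0-9]": "7",              # one digit
    r"\.": ".",                 # escaped dot: a literal dot only
    r".*": "anything 123",      # any run of characters
    r"[^0-9]": "x",             # one character that is not a digit
    r"ab*": "a",                # 'b' zero or more times
    r"ab+": "abbb",             # 'b' at least once
    r"ab?": "ab",               # 'b' zero or one time
}
all_match = all(re.fullmatch(p, s) for p, s in samples.items())

# Pitfall: [A-z] also covers the ASCII characters between 'Z' and 'a'.
underscore_in_A_z = re.fullmatch(r"[A-z]", "_") is not None
underscore_in_letters = re.fullmatch(r"[A-Za-z]", "_") is not None

print(all_match, underscore_in_A_z, underscore_in_letters)
```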

REGULAR EXPRESSIONS
[Figure: a regular expression annotated with these callouts:]
§ Upper or lowercase letter
§ One or more occurrences of the characters within the brackets
§ Number between 0 and 9

What does this mean?
This is an example used to extract alphanumeric PO numbers (e.g., 123ABC or 1A2B3C).
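
The expression shown on this slide is not available in the text. A pattern consistent with the annotations (letters and digits inside one bracket, repeated one or more times) would be [A-Za-z0-9]+; this is an assumption, not the slide's confirmed expression. A short Python check against the two sample PO numbers:

```python
import re

# Assumed pattern: one or more letters or digits (not confirmed by the slide).
PO_NUMBER = re.compile(r"[A-Za-z0-9]+")

po_checks = {value: bool(PO_NUMBER.fullmatch(value))
             for value in ["123ABC", "1A2B3C", "12-34"]}
print(po_checks)
```

Such a pattern is very permissive (any word matches), so it only works well when the framed area is tight around the PO number.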

REGULAR EXPRESSIONS
[Figure: a regular expression annotated with these callouts:]
§ One letter, upper or lowercase
§ One number between 0 and 9
§ One letter, upper or lowercase
§ Optional space character
§ One number between 0 and 9
§ One letter, upper or lowercase
§ One number between 0 and 9

What does this mean?
This is an example used to extract Canadian postal codes (e.g., K1A 0A1 or K1A0A1).
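
The slide's expression itself is not available in the text. A pattern that follows the annotations and the two sample values (letter, digit, letter, optional space, digit, letter, digit) is sketched below in Python; treat the exact spelling as an assumption.

```python
import re

# Assumed pattern built from the callouts: letter digit letter,
# an optional space, then digit letter digit.
POSTAL_CODE = re.compile(r"[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]")

postal_checks = {value: bool(POSTAL_CODE.fullmatch(value))
                 for value in ["K1A 0A1", "K1A0A1", "K1A  0A1", "1KA 0A1"]}
print(postal_checks)
```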

REGULAR EXPRESSIONS
[Figure: a regular expression annotated with these callouts:]
§ Optional open or close parenthesis
§ Optional space, dash, or open or close parenthesis
§ One or more numbers between 0 and 9
§ Optional space or dash
§ One or more numbers between 0 and 9
§ One or more numbers between 0 and 9

What does this mean?
This is an example used to extract phone numbers, e.g., (608) 828-6000, 6088286000, or 608-828-6000.
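
The slide's expression is not available in the text, so the pattern below is an assumption assembled from the callouts. One deviation is deliberate: the separator after the first group of digits is allowed to repeat, so that a closing parenthesis followed by a space (as in "(608) 828-6000") is accepted.

```python
import re

# Assumed pattern: optional parenthesis, digits, optional separators
# (space, dash, parenthesis), digits, optional space or dash, digits.
PHONE = re.compile(r"[()]?[0-9]+[-() ]*[0-9]+[- ]?[0-9]+")

phone_checks = {value: bool(PHONE.fullmatch(value))
                for value in ["(608) 828-6000", "6088286000", "608-828-6000", "60", "N/A"]}
print(phone_checks)
```

Because every separator is optional, any run of three or more digits also matches; a tight extraction area is needed to avoid picking up other numbers.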

TIPS & TRICKS

TIPS ON AREAS DEFINITION
There are 2 options to select an area [Figure: the two options, labeled 1 and 2].
§ For document recognition identifiers, narrow the area to the words you want to use
§ For other fields, make sure the area is:
– Wide enough to always extract wanted data
– Tight enough to avoid capturing unwanted data (especially for the reference column[s])

TIPS ON OCR DATA EXTRACTION
§ OCR extraction uses the 60% rule:
– By default, when drawing an area, if the area covers at least 60% of a “word” extracted by the OCR, then the whole word is extracted

[Figure: three areas drawn over the word 1234567890. One extracts nothing (« »); the other two extract 1234567890.]

The OCR View option lets you check:
• What has been extracted by the OCR engine
• How data has been cut into “words”
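
The 60% rule can be pictured as a coverage test between the drawn area and the box of an OCR word. The sketch below is an illustration only: the `Rect` type and the coordinates are invented, and measuring coverage by surface (rather than, say, by width) is an assumption, since the slide does not say how the 60% is computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical axis-aligned rectangle in page coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)


def covered_fraction(word: Rect, drawn: Rect) -> float:
    """Fraction of the word's box that lies inside the drawn area."""
    overlap = Rect(max(word.left, drawn.left), max(word.top, drawn.top),
                   min(word.right, drawn.right), min(word.bottom, drawn.bottom))
    return overlap.area() / word.area() if word.area() else 0.0


def is_extracted(word: Rect, drawn: Rect, threshold: float = 0.60) -> bool:
    """The whole word is kept when coverage reaches the threshold."""
    return covered_fraction(word, drawn) >= threshold


word_box = Rect(0, 0, 100, 10)                     # box of "1234567890"
half = is_extracted(word_box, Rect(0, 0, 50, 10))  # covers 50% of the word
most = is_extracted(word_box, Rect(0, 0, 70, 10))  # covers 70% of the word
print(half, most)
```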

TIPS ON REGULAR EXPRESSIONS
§ Using a regular expression lets you narrow down the information to retain (and get rid of unwanted data)

TIPS ON REGULAR EXPRESSIONS: SAMPLES

Regular expression: [A-Z]{2}[0-9]{5}
Meaning:
– [A-Z]{2} means “2 uppercase letters”
– [0-9]{5} means “5 digits”
Matching with: AR12345, GJ56326

Regular expression: [0-9]+[^0-9A-Za-z]+[0-9]+
Meaning:
– [0-9]+ means “one or more digits”
– [^0-9A-Za-z]+ means “one or more characters that are not a digit, an uppercase letter or a lowercase letter”
– [0-9]+ means “one or more digits”
Matching with: 12345-6789, 12-3456, 12345_6789, 123:456, 1--456

Regular expression: ([0-9]{3,5}\-){1,2}[0-9]+
Meaning:
– [0-9]{3,5}\- means “3 to 5 digits followed by a -”
– ([0-9]{3,5}\-){1,2} means “1 or 2 occurrences of the previous pattern within ()”
– [0-9]+ means “one or more digits”
Matching with: 1234-3, 12345-555-474, 3443-432-567890
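
The three sample expressions can be checked against their listed matches in Python. One assumption is made: the first expression is written with [0-9]{5}, because the listed matches (AR12345, GJ56326) contain digits other than 0 and 1, which a [0-1] range would reject.

```python
import re

# expression -> values listed as matching on the slide
SAMPLES = {
    r"[A-Z]{2}[0-9]{5}": ["AR12345", "GJ56326"],
    r"[0-9]+[^0-9A-Za-z]+[0-9]+": ["12345-6789", "12-3456", "12345_6789",
                                    "123:456", "1--456"],
    r"([0-9]{3,5}\-){1,2}[0-9]+": ["1234-3", "12345-555-474", "3443-432-567890"],
}

# For each expression, does every listed value match it completely?
results = {pattern: all(re.fullmatch(pattern, value) for value in values)
           for pattern, values in SAMPLES.items()}

# With a [0-1] range instead of [0-9], the first sample no longer matches.
narrow_range_hit = re.fullmatch(r"[A-Z]{2}[0-1]{5}", "AR12345")

print(results)
print(narrow_range_hit)
```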

GENERAL TIPS: TEACHING PRACTICES
§ Before teaching, always ask yourself “Should I really teach this document layout?”
§ Then, when teaching, the most important step is the recognition of the document layout
§ Teaching is an incremental process:
– If a field is correctly extracted, there is no reason to teach it
– Concentrate on what is not correctly extracted
§ There is no real risk with teaching: if it fails, you will just have to fix things manually
§ When teaching, remember to save your work regularly (to avoid losing your data because you lose the ownership)

www.esker.com