Advanced Document Data Extraction with Esker Teach DAN STRONG
CONTENTS
In this session we will discuss:
§ How to teach multiple layouts
§ Tips and tricks: line item extraction
§ Regular expressions
§ Q&A and teaching your documents
#EAUC2016
GETTING STARTED WITH TEACHING
REVIEW: TEACHING OR NOT …
WHAT CAN BE FIXED AND WHAT CANNOT?
Not all issues can be fixed with teaching!
Problems that cannot be fixed with teaching:
§ Expected value is not in the document (if not constant)
§ Expected value is handwritten
§ The document layout/quality does not permit correct data recognition
Problems that can be fixed with teaching:
§ Incorrect data is extracted due to incorrect zone being targeted
§ Data is well located but partially extracted or not extracted
§ The business partner is not, or is incorrectly, recognized
TEACHING MULTIPLE LAYOUTS
WHAT IF A BUSINESS PARTNER USES SEVERAL LAYOUTS?
§ You can teach several layouts for a single business partner
§ Best to manually fix extraction errors for layouts that are rarely sent
DIFFERENT LAYOUT
To check if the business partner sent a document with a different layout:
1. Teach with current file
2. Check the “Ref” box on top of the document display
The document display area shows the original document overlaying the current document.
CAPTURING FIELD DATA
DEFINING WHERE THE VALUE IS SEARCHED
§ You can frame the exact area and specify that the area position is not fixed in the document (floating). This is the default and recommended option.
§ You can frame a large fixed area then specify which data to retain
FLOATING AREA
§ Surrounding text is used as a reference to locate the extraction area
FLOATING AREA
[Figure: ORIGINAL and INCOMING documents]
Words highlighted in blue are used to reposition the extraction area
FIXED AREA
§ You would usually frame a large area and specify what to look for in this area
DEFINING WHAT SHOULD BE EXTRACTED
You can specify what type of data should be extracted:
§ Date
  – Several possible formats
§ Number
  – Several possible formats
§ Regular expression
  – [A-Z]{3}-[0-9]{4} would extract ZBT-2455
§ Pattern
  – aaa-nnnn would extract ZBT-2455
  – Several possible formats
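As a rough illustration of how the regular expression and pattern formats relate, the sketch below uses Python's re module rather than Esker Teach itself. The pattern_to_regex helper is hypothetical, and it assumes that "a" stands for one uppercase letter and "n" for one digit, which the example aaa-nnnn suggests but the slide does not state. The sample text is invented.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate a simple pattern into a regular expression.

    Assumption: 'a' means one uppercase letter and 'n' means one digit;
    every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "a":
            parts.append("[A-Z]")
        elif ch == "n":
            parts.append("[0-9]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

# Invented sample text standing in for the content of a framed area.
text = "PO reference: ZBT-2455 dated 03/15/2016"

by_regex = re.search(r"[A-Z]{3}-[0-9]{4}", text).group(0)
by_pattern = re.search(pattern_to_regex("aaa-nnnn"), text).group(0)
print(by_regex, by_pattern)
```

Both searches retain only the ZBT-2455 part of the text, which is the point of declaring a format: the framed area can contain more than the value you want.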
REFERENCE COLUMN(S) – LINE ITEMS
§ A reference column is a column that introduces a new row in the table
§ It should be the column that is most representative of the row you want to extract
§ Several columns can be used together to define a new row in a table
General rules:
• Do not use columns with optional items
• Do not use columns with a variable number of lines per row
• Use columns where the format is known (number, date or a regular expression)
• Too many references can lead to missing rows
• On the other hand, not enough references can lead to incorrect rows
LINE ITEM DATA EXTRACTION APPLIED TO A BUSINESS DOCUMENT
[Figure: business document with the callout “Number here” on one column]
This field is called the reference column; it introduces a new row in the table
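The idea that a reference column introduces a new row can be sketched in a few lines of Python. This is only an illustration of the principle, not how Esker Teach is implemented: the sample lines are invented, and the assumption that the reference column is a 5-digit item number at the start of a line is hypothetical.

```python
import re

# Hypothetical OCR output for a table area, one string per text line.
lines = [
    "10001  Widget, blue        2   10.00",
    "       extra-long description",
    "10002  Gasket              5    1.50",
    "10003  Bracket             1    7.25",
    "       stainless steel",
    "       ships separately",
]

# Assumption: the reference column is a 5-digit item number at the start of the line.
REFERENCE = re.compile(r"^[0-9]{5}\b")

def group_rows(lines):
    """Start a new row whenever the reference column has a value."""
    rows = []
    for line in lines:
        if REFERENCE.match(line):
            rows.append([line.strip()])      # reference value found: new row
        elif rows:
            rows[-1].append(line.strip())    # continuation of the current row
    return rows

rows = group_rows(lines)
for row in rows:
    print(row)
```

A column with a known format works well here because a continuation line cannot be mistaken for a new row. A description column would not, since it spans a variable number of lines.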
TABLE SEARCH AREA
§ When line items are always in the same area across all pages, refining the search scope allows you to:
  – Speed up the extraction process
  – Avoid extracting irrelevant information
Navigate your document pages to make sure line items are located in the selected area
DEFINING LINE ITEM FIELDS (COLUMNS)
§ You redefine all required fields by capturing the data on the first row
The area you frame should be wider than the current value to handle other possible values
HANDLE TABLES SPLIT INTO TWO PARTS
§ A table may be split into two parts as a result of a page break
→ Select the option ‘Merge an item on page break’ to ensure the two parts are grouped
This option is available when editing a column and only when a table search area has been defined.
HANDLE ROWS WITH VARIABLE NUMBER OF LINES
§ Rows may have a variable number of lines
→ Select the option ‘Capture full row height’ to capture all lines of the row when needed
This option is available when editing a column
REPLACE A STRING
§ You can replace a string captured from the document by another string
§ You can use this replacement system:
  – When the characters recognized by the OCR are not what you expect (e.g., replace T0 by TO)
  – To remove a description or a comment from a column
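Both uses of string replacement can be illustrated with a short Python sketch. The rule list, the clean function and the sample value are hypothetical, and the choice to treat a trailing comment in parentheses as unwanted is an assumption made for the example.

```python
import re

# Hypothetical replacement rules: (string to find, replacement).
REPLACEMENTS = [
    ("T0", "TO"),   # the OCR read the letter O as a zero
]

def clean(value: str) -> str:
    """Apply the replacement rules, then drop a trailing comment."""
    for old, new in REPLACEMENTS:
        value = value.replace(old, new)
    # Assumption: a comment in parentheses at the end of the value is unwanted.
    value = re.sub(r"\s*\([^)]*\)\s*$", "", value)
    return value

print(clean("SHIP T0 WAREHOUSE (see note 3)"))
```

A blanket rule like this one also rewrites values where T0 is legitimate (a part number such as AT01, for instance), so it is worth keeping replacement strings as specific as possible.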
REGULAR EXPRESSIONS
§ Start with the basics and refer to the online documentation for commonly used characters
§ Use online tools like regextester.com to test your regular expressions
§ Build a cheat sheet
Regular expressions can be defined as a data format or part of the search parameters.
REGULAR EXPRESSIONS
Regular expression common characters:
[A-Z] : Uppercase character
[a-z] : Lowercase character
[A-Za-z] : Uppercase or lowercase character (note that [A-z] also matches the characters [ \ ] ^ _ ` that sit between Z and a)
[0-9] : Any digit between 0 and 9
\ : Escape character
Regular expression wildcards:
.* : Searches for all characters
. : Searches for any single character
[-] : Searches for any character in the range
[^-] : Searches for any character that is not in the range
[ ] : Searches for any character contained in the list
* : Searches for 0 to n occurrences of the character or regular expression situated immediately to the left
+ : Searches for at least one occurrence of the character or regular expression situated immediately to the left
? : Searches for 0 to 1 occurrence of the character or regular expression situated immediately to the left
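The behavior of these characters is easy to check outside the product. The sketch below uses Python's re module, whose syntax agrees with the list above for these basic constructs; whether every detail matches the regular expression engine used by Esker Teach is an assumption.

```python
import re

# [A-z] spans the ASCII range from 'A' to 'z', which also contains [ \ ] ^ _ `
print(re.fullmatch(r"[A-z]", "_") is not None)       # True
print(re.fullmatch(r"[A-Za-z]", "_") is not None)    # False

# * : zero or more, + : one or more, ? : zero or one
print(re.fullmatch(r"AB*", "A") is not None)         # True
print(re.fullmatch(r"AB+", "A") is not None)         # False
print(re.fullmatch(r"AB?", "ABB") is not None)       # False

# [^0-9] matches any single character that is not a digit
print(re.findall(r"[^0-9]", "12-34/56"))             # ['-', '/']

# \ escapes a special character so that it is matched literally
print(re.findall(r"\.", "1.250.00"))                 # ['.', '.']
```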
REGULAR EXPRESSIONS
[Figure: annotated regular expression with the following callouts]
§ Upper or lowercase letter
§ One or more occurrences of the characters within the brackets
§ Number between 0 and 9
What does this mean? This is an example used to extract alphanumeric PO numbers (e.g., 123ABC or 1A2B3C).
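The expression shown on this slide is not reproduced in the text. An expression consistent with the three callouts would be [A-Za-z0-9]+, but that is an assumed reconstruction, checked here with Python's re module:

```python
import re

# Assumed reconstruction from the callouts; not necessarily the slide's expression.
PO_NUMBER = re.compile(r"[A-Za-z0-9]+")

po_results = {
    candidate: PO_NUMBER.fullmatch(candidate) is not None
    for candidate in ["123ABC", "1A2B3C", "12-34"]
}
print(po_results)
```

Note that such an expression accepts any run of letters and digits, so it is best combined with a tightly framed area.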
REGULAR EXPRESSIONS
[Figure: annotated regular expression with the following callouts]
§ Optional space character
§ One letter upper or lowercase
§ One letter upper or lowercase
§ One number between 0 and 9
§ One letter upper or lowercase
§ One number between 0 and 9
§ One number between 0 and 9
What does this mean? This is an example used to extract Canadian postal codes (e.g., K1A 0A1 or K1A0A1).
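The expression itself is not reproduced in the text, but the callouts (three letters, three digits, one optional space) and the two examples point to a letter-digit-letter, optional space, digit-letter-digit sequence. The sketch below is an assumed reconstruction along those lines, checked with Python's re module:

```python
import re

# Assumed reconstruction: letter, digit, letter, optional space, digit, letter, digit.
POSTAL_CODE = re.compile(r"[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]")

postal_results = {
    candidate: POSTAL_CODE.fullmatch(candidate) is not None
    for candidate in ["K1A 0A1", "K1A0A1", "K1A-0A1", "53562"]
}
print(postal_results)
```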
REGULAR EXPRESSIONS
[Figure: annotated regular expression with the following callouts]
§ Optional open or close parenthesis
§ Optional space, dash, or open or close parenthesis
§ One or more numbers between 0 and 9
§ Optional space or dash
§ One or more numbers between 0 and 9
§ One or more numbers between 0 and 9
What does this mean? This is an example used to extract phone numbers (e.g., (608) 828-6000 or 6088286000 or 608-828-6000).
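The expression on this slide is not reproduced in the text, so the sketch below does not try to recover it. It is a separate, assumed expression written to accept the three example formats, with the first one written with parentheses around the area code, and it is checked with Python's re module. It allows several separator characters after the first group of digits so that a closing parenthesis followed by a space is accepted.

```python
import re

# Assumed expression, not the slide's: optional "(", digits, any mix of ")", space
# or dash, digits, optional space or dash, digits.
PHONE = re.compile(r"[(]?[0-9]+[) -]*[0-9]+[ -]?[0-9]+")

phone_results = {
    candidate: PHONE.fullmatch(candidate) is not None
    for candidate in ["(608) 828-6000", "6088286000", "608-828-6000", "608.828.6000"]
}
print(phone_results)
```

The dotted format is rejected because the dot is not in either separator list; adding it would be a one-character change to each bracket.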
TIPS & TRICKS
TIPS ON AREAS DEFINITION
There are 2 options to select an area:
§ For document recognition identifiers, narrow the area to the words you want to use
§ For other fields, make sure the area is:
  – Wide enough to always extract wanted data
  – Tight enough to avoid capturing unwanted data (especially for the reference column[s])
TIPS ON OCR DATA EXTRACTION
§ OCR extraction uses the 60% rule:
  – By default, when drawing an area, if the area covers at least 60% of a “word” extracted by the OCR then the whole word is going to be extracted
[Figure: the word 1234567890 with drawn areas]
The OCR View option will allow you to check:
• What has been extracted by the OCR Engine
• How data have been cut into “words”
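The 60% rule can be pictured as an overlap test between the drawn area and the box around each OCR word. The sketch below is one possible reading of the rule, not the product's implementation: it assumes coverage is measured as the share of the word's bounding-box surface that falls inside the drawn area, and the Box class, function names and coordinates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # Clamp to zero so that non-overlapping boxes yield an empty area.
        return max(0.0, self.right - self.left) * max(0.0, self.bottom - self.top)

def covered_fraction(word: Box, drawn: Box) -> float:
    """Fraction of the word's box that lies inside the drawn area."""
    overlap = Box(
        max(word.left, drawn.left), max(word.top, drawn.top),
        min(word.right, drawn.right), min(word.bottom, drawn.bottom),
    )
    return overlap.area() / word.area() if word.area() else 0.0

def is_extracted(word: Box, drawn: Box, threshold: float = 0.60) -> bool:
    # Assumption: the whole word is kept once coverage reaches the threshold.
    return covered_fraction(word, drawn) >= threshold

word = Box(0, 0, 100, 10)                        # box around the word "1234567890"
print(is_extracted(word, Box(0, 0, 70, 10)))     # area covers 70% of the word
print(is_extracted(word, Box(0, 0, 50, 10)))     # area covers 50% of the word
```

Under this reading, an area drawn over the first seven digits still returns all ten, while an area over the first five returns nothing, which is why checking how the OCR cut the text into words matters.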
TIPS ON REGULAR EXPRESSIONS
§ Using a regular expression will allow you to narrow the information to retain (and get rid of unwanted data)
TIPS ON REGULAR EXPRESSIONS: SAMPLES

Regular expression: [A-Z]{2}[0-9]{5}
Meaning:
• [A-Z]{2} means “2 upper case letters”
• [0-9]{5} means “5 digits”
Matching with: AR12345, GJ56326

Regular expression: [0-9]+[^0-9A-Za-z]+[0-9]+
Meaning:
• [0-9]+ means “one or more digits”
• [^0-9A-Za-z]+ means “one or more characters that are anything but a digit, an upper case letter or a lower case letter”
• [0-9]+ means “one or more digits”
Matching with: 12345-6789, 12-3456, 12345_6789, 123:456, 1--456

Regular expression: ([0-9]{3,5}\-){1,2}[0-9]+
Meaning:
• [0-9]{3,5}\- means “3 to 5 digits followed by a -”
• ([0-9]{3,5}\-){1,2} means “1 or 2 occurrences of the previous pattern within ()”
• [0-9]+ means “one or more digits”
Matching with: 1234-3, 12345-555-474, 3443-432-567890
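The three samples can be verified with Python's re module, which is used here as a stand-in for the product's own engine. The first expression is written as [A-Z]{2}[0-9]{5}, since values such as AR12345 require two letters followed by five digits from 0 to 9.

```python
import re

# Each expression is paired with the values it is expected to match.
samples = {
    r"[A-Z]{2}[0-9]{5}": ["AR12345", "GJ56326"],
    r"[0-9]+[^0-9A-Za-z]+[0-9]+": [
        "12345-6789", "12-3456", "12345_6789", "123:456", "1--456",
    ],
    r"([0-9]{3,5}\-){1,2}[0-9]+": ["1234-3", "12345-555-474", "3443-432-567890"],
}

# fullmatch requires the whole value to match, not just part of it.
results = {
    pattern: [re.fullmatch(pattern, value) is not None for value in values]
    for pattern, values in samples.items()
}
for pattern, flags in results.items():
    print(pattern, flags)
```

Running the same check against a few values that should not match is an easy way to confirm that an expression is tight enough before relying on it.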
GENERAL TIPS: TEACHING PRACTICES
§ Before teaching, always ask yourself “Should I really teach this document layout?”
§ Then when teaching, the most important step is the recognition of the document layout
§ Teaching is an incremental process:
  – If a field is correctly extracted, there is no reason to teach it
  – Concentrate on what is not correctly extracted
§ There is no real risk with teaching: if it fails, then it means you will just have to fix things manually
§ When teaching, remember to regularly save what you are doing (to avoid losing your data because you lose the ownership)
www.esker.com