Advanced Document Data Extraction


Author: Brook Byrd


Advanced Document Data Extraction with Esker Teach DAN STRONG

CONTENTS
In this session we will discuss:
• How to teach multiple layouts
• Tips and tricks: line item extraction
• Regular expressions
• Q&A and teaching your documents

#EAUC2016

GETTING STARTED WITH TEACHING

REVIEW: TEACHING OR NOT …


WHAT CAN BE FIXED AND WHAT CANNOT?
Not all issues can be fixed with teaching!

Problems that cannot be fixed with teaching:
• Expected value is not in the document (if not constant)
• Expected value is handwritten
• The document layout/quality does not permit correct data recognition

Problems that can be fixed with teaching:
• Incorrect data is extracted due to incorrect zone being targeted
• Data is well located but partially extracted or not extracted
• The business partner is not, or is incorrectly, recognized


GETTING STARTED WITH TEACHING

TEACHING MULTIPLE LAYOUTS


WHAT IF A BUSINESS PARTNER USES SEVERAL LAYOUTS?
• You can teach several layouts for a single business partner
• For layouts that are rarely sent, it is best to fix extraction errors manually


DIFFERENT LAYOUT
To check if the business partner sent a document with a different layout:
1. Teach with current file
2. Check the “Ref” box on top of the document display

The document display area then shows the original document overlaying the current document.


GETTING STARTED WITH TEACHING

CAPTURING FIELD DATA


DEFINING WHERE THE VALUE IS SEARCHED
• You can frame the exact area and specify that the area position is not fixed in the document (floating). This is the default and recommended option.
• You can frame a large fixed area, then specify which data to retain.


FLOATING AREA § Surrounding text is used as a reference to locate the extraction area


FLOATING AREA
[Figure: the ORIGINAL and INCOMING documents shown side by side. Words highlighted in blue are used to reposition the extraction area.]
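The repositioning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Box` type, the `reposition` helper and the coordinates are hypothetical, and the product's actual matching logic is not described in the slides.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: int
    top: int
    right: int
    bottom: int


def reposition(area: Box, anchor_original: Box, anchor_incoming: Box) -> Box:
    """Shift the taught area by the same offset the anchor word moved."""
    dx = anchor_incoming.left - anchor_original.left
    dy = anchor_incoming.top - anchor_original.top
    return Box(area.left + dx, area.top + dy, area.right + dx, area.bottom + dy)


# Taught on the original document: the label sits left of the value area.
label_original = Box(50, 100, 150, 115)
value_area = Box(160, 100, 260, 115)

# On the incoming document the same label was found 40 units lower.
label_incoming = Box(50, 140, 150, 155)

moved_area = reposition(value_area, label_original, label_incoming)
print(moved_area)
```

The area follows the label down the page, which is why a floating area tolerates documents whose header block grows or shrinks.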


FIXED AREA § You would usually frame a large area and specify what to look for in this area


DEFINING WHAT SHOULD BE EXTRACTED
You can specify what type of data should be extracted:
• Date: several possible formats
• Number: several possible formats
• Regular expression: [A-Z]{3}-[0-9]{4} would extract ZBT-2455
• Pattern: aaa-nnnn would extract ZBT-2455 (several possible formats)
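To make the difference between the two formats concrete, here is a minimal Python sketch. The regular expression is the one from the slide. The `pattern_to_regex` helper is hypothetical, and the mapping it uses ("a" for a letter, "n" for a digit) is an assumption inferred from the slide's example, not the product's documented behaviour.

```python
import re

# Regular expression from the slide: three uppercase letters, a dash,
# then four digits.
PO_REGEX = re.compile(r"[A-Z]{3}-[0-9]{4}")


def pattern_to_regex(pattern: str) -> str:
    """Translate an 'aaa-nnnn' style pattern into a regular expression.

    Assumption: 'a' is one letter, 'n' is one digit, and any other
    character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "a":
            parts.append("[A-Za-z]")
        elif ch == "n":
            parts.append("[0-9]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


text = "Order ref: ZBT-2455 dated 03/14/2016"
regex_hit = PO_REGEX.search(text)
pattern_hit = re.search(pattern_to_regex("aaa-nnnn"), text)
print(regex_hit.group(0), pattern_hit.group(0))
```

Both approaches keep only the reference and discard the surrounding label and date. The regular expression is stricter here because it insists on uppercase letters.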


REFERENCE COLUMN(S) – LINE ITEMS
• A reference column is a column that introduces a new row in the table
• It should be the column that is most representative of the row you want to extract
• Several columns can be used together to define a new row in a table

General rules:
• Do not use columns with optional items
• Do not use columns with a variable number of lines per row
• Use columns where the format is known (number, date or a regular expression)
• Too many references can lead to missing rows
• On the other hand, not enough references can lead to incorrect rows
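The role of a reference column can be illustrated with a short Python sketch. It assumes hypothetical OCR output (one string per text line) and an item-number column with a known format. It shows the general idea only and is not the product's algorithm.

```python
import re

# Hypothetical OCR output: one string per text line of a line-item table.
ocr_lines = [
    "10  WIDGET-A   Steel widget      4   12.50",
    "    (delivered in two crates)",
    "20  WIDGET-B   Brass widget      1   99.00",
    "30  WIDGET-C   Copper widget     7    3.25",
    "    (backordered)",
]

# Assumed reference column: an item number made of digits at line start.
REFERENCE = re.compile(r"^[0-9]+\s")


def group_rows(lines):
    """Start a new row whenever the reference column matches."""
    rows = []
    for line in lines:
        if REFERENCE.match(line):
            rows.append([line])
        elif rows:
            rows[-1].append(line)  # continuation line of the current row
    return rows


rows = group_rows(ocr_lines)
print(len(rows))
```

If the optional comment column were used as the reference instead, rows without a comment would be merged into their neighbours, which is the "missing rows" risk listed in the general rules.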

LINE ITEM DATA EXTRACTION APPLIED TO A BUSINESS DOCUMENT
[Figure: sample business document with an arrow labelled “Number here” pointing at a column.]
This field is called the reference column; it introduces a new row in the table.


TABLE SEARCH AREA
• When line items are always in the same area across all pages, refining the search scope lets you:
  - Speed up the extraction process
  - Avoid extracting irrelevant information

Navigate through your document pages to make sure line items are located in the selected area.

DEFINING LINE ITEM FIELDS (COLUMNS)
• You redefine all required fields by capturing the data on the first row

The area you frame should be wider than the current value so that it can handle other possible values.

HANDLE TABLES SPLIT INTO TWO PARTS
• A table may be split into two parts as a result of a page break
→ Select the option ‘Merge an item on page break’ to ensure the two parts are grouped

This option is available when editing a column and only when a table search area has been defined.

HANDLE ROWS WITH VARIABLE NUMBER OF LINES
• Rows may have a variable number of lines
→ Select the option ‘Capture full row height’ to capture all lines of the row when needed

This option is available when editing a column.

REPLACE A STRING
• You can replace a string captured from the document with another string
• You can use this replacement system:
  - When the characters recognized by the OCR are not what you expect (e.g., replace T0 by TO)
  - To remove a description or a comment from a column
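As a rough illustration of both uses, the following Python sketch applies an ordered list of replacements to a captured value. The `REPLACEMENTS` table and the `apply_replacements` helper are hypothetical and only mimic the behaviour described above.

```python
# Hypothetical replacement table: (captured string, replacement string).
REPLACEMENTS = [
    ("T0", "TO"),          # OCR read the letter O as the digit zero
    (" (see note)", ""),   # strip a trailing comment from the column
]


def apply_replacements(value: str, replacements=REPLACEMENTS) -> str:
    """Apply each replacement in order to the captured value."""
    for old, new in replacements:
        value = value.replace(old, new)
    return value


cleaned = apply_replacements("SHIP T0 WAREHOUSE (see note)")
print(cleaned)
```

A blind replacement of T0 would also alter a legitimate value such as T0123, so a rule like this is safest on columns where that sequence cannot occur naturally.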

REGULAR EXPRESSIONS § Start with the basics and refer to the online documentation for commonly used characters § Use online tools like regextester.com to test your regular expressions § Build a cheat sheet

1

Regular expressions can be defined as a data format or part of the search parameters. #EAUC2016

REGULAR EXPRESSIONS
Regular expression common characters:
[A-Z] : Uppercase character
[a-z] : Lowercase character
[A-Za-z] : Uppercase or lowercase character
[0-9] : Any digit between 0 and 9
\ : Escape character

Regular expression wildcards:
.* : Searches for any sequence of characters
. : Searches for any single character
[-] : Searches for any character in the range
[^-] : Searches for any character that is not in the range
[ ] : Searches for any one of the characters in the list
* : Searches for 0 to n occurrences of the character or regular expression situated immediately to the left
+ : Searches for at least one occurrence of the character or regular expression situated immediately to the left
? : Searches for 0 to 1 occurrence of the character or regular expression situated immediately to the left
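The quantifiers and classes above can be tried out quickly in Python. This sketch uses Python's `re` module as a stand-in, on the assumption that the product's engine behaves the same way for these basic constructs. The `INV` strings are made-up examples.

```python
import re


def matches(regex: str, text: str) -> bool:
    """True when the whole text matches the regular expression."""
    return re.fullmatch(regex, text) is not None


results = {
    "star_zero": matches(r"INV[0-9]*", "INV"),        # * accepts no digits
    "star_many": matches(r"INV[0-9]*", "INV2016"),
    "plus_zero": matches(r"INV[0-9]+", "INV"),        # + needs at least one
    "plus_many": matches(r"INV[0-9]+", "INV2016"),
    "optional_present": matches(r"INV-?[0-9]+", "INV-42"),
    "optional_absent": matches(r"INV-?[0-9]+", "INV42"),
    "dot": matches(r"A.C", "A-C"),                    # . is any one character
    "negated": matches(r"[^0-9]", "7"),               # 7 is in the excluded range
    "a_to_z_underscore": matches(r"[A-z]", "_"),      # range pitfall, see below
}
for name, value in results.items():
    print(name, value)
```

The last entry shows why a range written as [A-z] is best avoided: in ASCII order it also covers the punctuation characters between Z and a, such as the underscore, so [A-Za-z] is the safer way to say "any letter".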

REGULAR EXPRESSIONS
[Figure: an annotated regular expression; the expression itself is not reproduced here. Its annotations read:]
• Upper or lowercase letter
• Number between 0 and 9
• One or more occurrences of the characters within the brackets

What does this mean? This is an example used to extract alphanumeric PO numbers (e.g., 123ABC or 1A2B3C).
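The expression shown on the slide is not reproduced in this text, so the Python sketch below uses one expression that is consistent with the annotations and the examples. Treat it as an assumed reconstruction, not the presenter's original.

```python
import re

# Assumed reconstruction: one or more characters, each a letter or a digit.
PO_NUMBER = re.compile(r"[A-Za-z0-9]+")

samples = ["123ABC", "1A2B3C", "12-3"]
po_results = {s: PO_NUMBER.fullmatch(s) is not None for s in samples}
print(po_results)
```

The value 12-3 is rejected because the dash is neither a letter nor a digit.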


REGULAR EXPRESSIONS
[Figure: an annotated regular expression; the expression itself is not reproduced here. Its annotations, in the order of the example, read:]
• One letter upper or lowercase
• One number between 0 and 9
• One letter upper or lowercase
• Optional space character
• One number between 0 and 9
• One letter upper or lowercase
• One number between 0 and 9

What does this mean? This is an example used to extract Canadian postal codes (e.g., K1A 0A1 or K1A0A1).
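Since the slide's expression did not survive as text, here is a Python sketch of one expression that fits the annotations: letter, digit, letter, optional space, digit, letter, digit. It is an assumed reconstruction and checks the format only, not whether the code exists.

```python
import re

# Assumed reconstruction built from the annotations on the slide.
POSTAL_CODE = re.compile(r"[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]")

samples = ["K1A 0A1", "K1A0A1", "K1A  0A1", "1KA 0A1"]
postal_results = {s: POSTAL_CODE.fullmatch(s) is not None for s in samples}
print(postal_results)
```

The `?` after the space makes it optional, so both spaced and unspaced forms are accepted, while two spaces or a swapped letter and digit are not.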

REGULAR EXPRESSIONS
[Figure: an annotated regular expression; the expression itself is not reproduced here. Its annotations read:]
• Optional open or close parenthesis
• Optional space, dash, or open or close parenthesis
• Optional space or dash
• One or more numbers between 0 and 9 (three times)

What does this mean? This is an example used to extract phone numbers (e.g., (608) 828-6000 or 6088286000 or 608-828-6000).
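The annotations alone do not determine the original expression, so the Python sketch below uses an expression written for this illustration. It accepts the three phone formats mentioned but is an assumption, not the one from the slide.

```python
import re

# Illustrative expression (not the slide's original):
# optional parenthesis, digits, optional parenthesis, optional space or
# dash, digits, optional space or dash, digits.
PHONE = re.compile(r"[()]?[0-9]+[()]?[ -]?[0-9]+[ -]?[0-9]+")

samples = ["(608) 828-6000", "6088286000", "608-828-6000", "608/828/6000"]
phone_results = {s: PHONE.fullmatch(s) is not None for s in samples}
print(phone_results)
```

Because every separator is optional and each digit group is "one or more", the expression is deliberately loose. That suits pulling a number out of a framed area, but it would be too permissive as a validation rule.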

GETTING STARTED WITH TEACHING

TIPS & TRICKS


TIPS ON AREAS DEFINITION
There are 2 options to select an area:
• For document recognition identifiers, narrow the area to the words you want to use
• For other fields, make sure the area is:
  - Wide enough to always extract wanted data
  - Tight enough to avoid capturing unwanted data (especially for the reference column[s])


TIPS ON OCR DATA EXTRACTION
• OCR extraction uses the 60% rule:
  - By default, when drawing an area, if the area covers at least 60% of a “word” extracted by the OCR, then the whole word is extracted

[Figure: three areas drawn over the word 1234567890; one extracts nothing and two extract the whole word.]

The OCR View option will allow you to check:
• What has been extracted by the OCR engine
• How data have been cut into “words”
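The 60% rule can be sketched as a simple overlap test. The slides do not say how coverage is measured, so this Python example assumes it is the share of the word's bounding box that falls inside the drawn area. The `Box` type, the helper names and the coordinates are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical units)."""
    left: int
    top: int
    right: int
    bottom: int


def coverage(word: Box, area: Box) -> float:
    """Fraction of the word's bounding box that lies inside the area."""
    width = max(0, min(word.right, area.right) - max(word.left, area.left))
    height = max(0, min(word.bottom, area.bottom) - max(word.top, area.top))
    word_size = (word.right - word.left) * (word.bottom - word.top)
    return (width * height) / word_size if word_size else 0.0


def is_extracted(word: Box, area: Box, threshold: float = 0.6) -> bool:
    """Assumed rule: the whole word is kept once coverage reaches 60%."""
    return coverage(word, area) >= threshold


word = Box(0, 0, 100, 10)            # OCR "word" such as 1234567890
wide_area = Box(0, 0, 70, 10)        # covers 70% of the word
narrow_area = Box(0, 0, 50, 10)      # covers 50% of the word

print(is_extracted(word, wide_area), is_extracted(word, narrow_area))
```

The practical consequence is that an area drawn slightly too tight still returns the full word, while an area that clips a word roughly in half returns nothing for it.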

TIPS ON REGULAR EXPRESSIONS § Using a regular expression will allow you to narrow the information to retain (and get rid of unwanted data)


TIPS ON REGULAR EXPRESSIONS: SAMPLES

Regular expression: [A-Z]{2}[0-9]{5}
Meaning:
• [A-Z]{2} means “2 upper case letters”
• [0-9]{5} means “5 digits”
Matching with: AR12345, GJ56326

Regular expression: [0-9]+[^0-9A-Za-z]+[0-9]+
Meaning:
• [0-9]+ means “one or more digits”
• [^0-9A-Za-z]+ means “one or more characters that are not a digit, an upper case letter or a lower case letter”
• [0-9]+ means “one or more digits”
Matching with: 12345-6789, 12-3456, 12345_6789, 123:456, 1--456

Regular expression: ([0-9]{3,5}\-){1,2}[0-9]+
Meaning:
• [0-9]{3,5}\- means “3 to 5 digits followed by a -”
• ([0-9]{3,5}\-){1,2} means “1 or 2 occurrences of the previous pattern within ()”
• [0-9]+ means “one or more digits”
Matching with: 1234-3, 12345-555-474, 3443-432-567890
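The samples can be checked mechanically. The Python sketch below runs each expression against its listed values, using Python's `re` module as a stand-in for the product's engine. It assumes the first expression is meant to be [A-Z]{2}[0-9]{5}, since the sample values have two letters and contain digits other than 0 and 1.

```python
import re

# Sample expressions paired with the values they are expected to match.
SAMPLES = {
    r"[A-Z]{2}[0-9]{5}": ["AR12345", "GJ56326"],
    r"[0-9]+[^0-9A-Za-z]+[0-9]+": [
        "12345-6789", "12-3456", "12345_6789", "123:456", "1--456",
    ],
    r"([0-9]{3,5}\-){1,2}[0-9]+": [
        "1234-3", "12345-555-474", "3443-432-567890",
    ],
}

check = {
    regex: all(re.fullmatch(regex, value) for value in values)
    for regex, values in SAMPLES.items()
}

# A digit range of 0-1 would reject the first sample value.
narrow_range_matches = re.fullmatch(r"[A-Z]{2}[0-1]{5}", "AR12345") is not None

print(check, narrow_range_matches)
```

Keeping a small table like this next to your cheat sheet makes it easy to re-test an expression whenever a business partner changes a reference format.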


GENERAL TIPS: TEACHING PRACTICES
• Before teaching, always ask yourself “Should I really teach this document layout?”
• Then, when teaching, the most important step is the recognition of the document layout
• Teaching is an incremental process:
  - If a field is correctly extracted, there is no reason to teach it
  - Concentrate on what is not correctly extracted
• There is no real risk with teaching: if it fails, you will just have to fix things manually
• When teaching, remember to save your work regularly (to avoid losing your data if you lose ownership of the document)


www.esker.com