Information Extraction, Data Mining and Joint Inference


Information Extraction, Data Mining and Joint Inference
Andrew McCallum, Computer Science Department, University of Massachusetts Amherst

Joint work with Charles Sutton, Aron Culotta, Wei Li, Xuerui Wang, Andres Corrada, Ben Wellner, Chris Pal, Michael Hay, Natasha Mohanty, David Mimno, Gideon Mann.

Goal: Mine actionable knowledge from unstructured text.


Extracting Job Openings from the Web foodscience.com-Job2 JobTitle: Ice Cream Guru Employer: foodscience.com JobCategory: Travel/Hospitality JobFunction: Food Services JobLocation: Upper Midwest Contact Phone: 800-488-2611 DateExtracted: January 8, 2001 Source: www.foodscience.com/jobs_midwest.html OtherCompanyJobs: foodscience.com-Job1

A Portal for Job Openings

[Portal screenshot: Job Openings search, Category = High Tech, Keyword = Java, Location = U.S.]

Data Mining the Extracted Job Information


IE from Research Papers [McCallum et al ‘99]

Mining Research Papers [Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004] [Giles et al]

IE from Chinese Documents regarding Weather (Department of Terrestrial System, Chinese Academy of Sciences)

200k+ documents, several millennia old:
- Qing Dynasty Archives
- memos
- newspaper articles
- diaries

What is “Information Extraction”? As a family of techniques:

Information Extraction = segmentation + classification + association + clustering

October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation. Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…

Extracted, associated and clustered:

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft Corporation
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Software Foundation

From Text to Actionable Knowledge

Document collection → Spider → Filter → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining / Knowledge Discovery (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support)

Problem: combined in serial juxtaposition, IE and DM are unaware of each other's weaknesses and opportunities.
1) DM begins from a populated DB, unaware of where the data came from, or its inherent errors and uncertainties.
2) IE is unaware of emerging patterns and regularities in the DB.
⇒ The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

Solution: Uncertainty Info

The same pipeline, but IE passes uncertainty info forward into the database and data mining, and emerging patterns discovered by data mining flow back to inform IE.

Solution: Unified Model

A single probabilistic model spans the whole pipeline, from document collection to actionable knowledge: discriminatively-trained undirected graphical models, namely Conditional Random Fields [Lafferty, McCallum, Pereira] and Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]. This demands complex inference and learning: just what we researchers like to sink our teeth into!

A Natural Language Processing Pipeline: POS tagging → chunking → parsing → entity recognition → semantic role labeling → anaphora resolution → pragmatics. Errors cascade & accumulate.

Unified Natural Language Processing: the same stages, connected by unified, joint inference.

Scientific Questions
• What model structures will capture salient dependencies?
• Will joint inference actually improve accuracy?
• How to do inference in these large graphical models?
• How to do parameter estimation efficiently in these models, which are built from multiple large components?
• How to do structure discovery in these models?

Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Sparse BP)
  – Probability + First-order Logic, Co-ref on Entities (MCMC)
• Semi-supervised Learning
• Demo: Rexa, a Web portal for researchers

Hidden Markov Models

HMMs are the standard sequence modeling tool in genomics, music, speech, NLP, …

Finite state model / graphical model: hidden states $S_{t-1}, S_t, S_{t+1}, \ldots$ emit observations $O_{t-1}, O_t, O_{t+1}, \ldots$. The model generates a state sequence (via transitions) and an observation sequence $o_1, o_2, \ldots, o_8, \ldots$ with probability

$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$

IE with Hidden Markov Models

Given a sequence of observations:
Yesterday Yoav Freund spoke this example sentence.

and a trained HMM with states {person name, location name, background}, find the most likely state sequence (Viterbi):
Yesterday [Yoav Freund] spoke this example sentence.

Any words generated by the designated "person name" state are extracted as a person name:
Person name: Yoav Freund
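To make the Viterbi step concrete, here is a minimal sketch (not from the talk) with a two-state toy HMM: the state set, start, transition, and emission probabilities are all invented for illustration, and a real extractor would have more states (e.g. location name) and trained parameters.

```python
import numpy as np

# Toy 2-state HMM for person-name extraction; every number here is invented.
states = ["background", "person name"]
start = np.log([0.999, 0.001])                 # names rarely open a sentence here
trans = np.log([[0.9, 0.1],                    # background -> background / person
                [0.4, 0.6]])                   # person -> background / person

def emit(j, word):
    # Hypothetical emissions: the person state strongly prefers capitalized words.
    if j == 1:
        return np.log(0.6 if word[0].isupper() else 0.01)
    return np.log(0.05 if word[0].isupper() else 0.2)

def viterbi(words):
    n, k = len(words), len(states)
    delta = np.zeros((n, k))
    back = np.zeros((n, k), dtype=int)
    delta[0] = start + np.array([emit(j, words[0]) for j in range(k)])
    for t in range(1, n):
        for j in range(k):
            scores = delta[t - 1] + trans[:, j]
            back[t, j] = int(np.argmax(scores))
            delta[t, j] = scores[back[t, j]] + emit(j, words[t])
    path = [int(np.argmax(delta[-1]))]
    for t in range(n - 1, 0, -1):              # trace back the best path
        path.append(back[t, path[-1]])
    return [states[j] for j in reversed(path)]

words = "Yesterday Yoav Freund spoke this example sentence .".split()
tags = viterbi(words)
print([w for w, s in zip(words, tags) if s == "person name"])   # ['Yoav', 'Freund']
```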

We want More than an Atomic View of Words

Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:
- identity of word
- ends in "-ski"
- is capitalized
- is part of a noun phrase
- is "Wisniewski"
- is in a list of city names
- is under node X in WordNet
- is in bold font
- is indented
- is in hyperlink anchor
- last person name was female
- next two words are "and Associates"

Problems with Richer Representation and a Joint Model

These arbitrary features are not independent:
– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future

Two choices:
1. Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!
2. Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models
• We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(s|o) instead of P(s,o):
– Can examine features, but not responsible for generating them.
– Don't have to explicitly model their dependencies.
– Don't "waste modeling effort" trying to generate what we are given at test time anyway.

From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

Let $\vec{s} = s_1, s_2, \ldots, s_n$ be the state sequence and $\vec{o} = o_1, o_2, \ldots, o_n$ the observations.

Joint (HMM):
$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$

Conditional:
$P(\vec{s} \mid \vec{o}) = \frac{1}{P(\vec{o})} \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t) = \frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \Phi_s(s_t, s_{t-1})\, \Phi_o(o_t, s_t)$

where $\Phi_o(t) = \exp\big( \sum_k \lambda_k f_k(s_t, o_t) \big)$

(A super-special case of Conditional Random Fields.)

Set parameters by maximum likelihood, using an optimization method on the likelihood gradient ∇L.

(Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]

Undirected graphical model, trained to maximize the conditional probability of the output (sequence) given the input (sequence).

Finite state model / graphical model: output sequence of FSM states $y_{t-1}{=}$OTHER, $y_t{=}$PERSON, $y_{t+1}{=}$OTHER, $y_{t+2}{=}$ORG, $y_{t+3}{=}$TITLE, … over the input sequence $x_{t-1}{=}$"said", $x_t{=}$"Jones", $x_{t+1}{=}$"a", $x_{t+2}{=}$"Microsoft", $x_{t+3}{=}$"VP", …

$p(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \prod_t \Phi(y_t, y_{t-1}, \vec{x}, t)$ where $\Phi(y_t, y_{t-1}, \vec{x}, t) = \exp\big( \sum_k \lambda_k f_k(y_t, y_{t-1}, \vec{x}, t) \big)$

Wide-spread interest, positive experimental results in many applications:
- Noun phrase, Named entity [HLT'03], [CoNLL'03]
- Protein structure prediction [ICML'04]
- IE from Bioinformatics text [Bioinformatics '04], …
- Asian word segmentation [COLING'04], [ACL'04]
- IE from Research papers [HLT'04]
- Object classification in images [CVPR '04]
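A minimal sketch of how such a linear-chain CRF scores a label sequence, assuming toy hand-set weights: the feature names and numbers are invented, and $Z_{\vec{x}}$ is computed by brute-force enumeration rather than the forward algorithm, which is only sensible at toy sizes.

```python
import itertools, math

# Toy linear-chain CRF scorer; feature names and weights are invented.
LABELS = ["OTHER", "PERSON", "ORG"]

def features(y_t, y_prev, x, t):
    # f_k(y_t, y_{t-1}, x, t): a few overlapping, non-independent features.
    return {
        ("capitalized", y_t): float(x[t][0].isupper()),
        ("word=" + x[t].lower(), y_t): 1.0,
        ("transition", y_prev, y_t): 1.0,
    }

weights = {("capitalized", "PERSON"): 2.0,
           ("capitalized", "ORG"): 1.5,
           ("word=said", "OTHER"): 1.0,
           ("transition", "OTHER", "PERSON"): 0.5}

def log_phi(y_t, y_prev, x, t):
    # log Phi(y_t, y_{t-1}, x, t) = sum_k lambda_k * f_k(y_t, y_{t-1}, x, t)
    return sum(weights.get(k, 0.0) * v for k, v in features(y_t, y_prev, x, t).items())

def score(y, x):
    return sum(log_phi(y[t], y[t - 1] if t else "START", x, t) for t in range(len(x)))

def prob(y, x):
    # p(y|x) = exp(score(y,x)) / Z_x; Z_x by enumeration (toy-sized input only).
    logZ = math.log(sum(math.exp(score(y2, x))
                        for y2 in itertools.product(LABELS, repeat=len(x))))
    return math.exp(score(y, x) - logZ)

x = ["said", "Jones", "a", "Microsoft", "VP"]
print(prob(("OTHER", "PERSON", "OTHER", "ORG", "OTHER"), x))
```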

Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight, $0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds, 1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers. An estimated 1.56 billion pounds of milk were used on farms where produced, 8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

        Milk Cows and Production of Milk and Milkfat: United States, 1993-95
--------------------------------------------------------------------------------
     :             :        Production of Milk and Milkfat 2/
     :   Number    :-----------------------------------------------------------
Year :     of      :    Per Milk Cow     : Percentage    :       Total
     : Milk Cows 1/:---------------------: of Fat in All :--------------------
     :             :   Milk   :  Milkfat : Milk Produced :   Milk   : Milkfat
--------------------------------------------------------------------------------
     :  1,000 Head    --- Pounds ---         Percent        Million Pounds
1993 :     9,589      15,704       575        3.66         150,582    5,514.4
1994 :     9,500      16,175       592        3.66         153,664    5,623.7
1995 :     9,461      16,451       602        3.66         155,644    5,694.3
--------------------------------------------------------------------------------
1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.

Table Extraction from Government Reports [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

100+ documents from www.fedstats.gov. A CRF assigns one label per line of the document (the milk report above), using layout features.

Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
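A sketch of what such line-level feature functions might look like in code; the function and feature names are mine, not the paper's, and the conjunction features at the listed time offsets are omitted for brevity.

```python
import re

# Sketch of line-level layout features for table extraction (names are mine).
def line_features(line, prev_line=""):
    n = max(len(line), 1)
    # Positions of multi-space gaps, to test column alignment with the previous line.
    gaps = {m.start() for m in re.finditer(r"\s{2,}", line)}
    prev_gaps = {m.start() for m in re.finditer(r"\s{2,}", prev_line)}
    return {
        "pct_digit_chars": sum(c.isdigit() for c in line) / n,
        "pct_alpha_chars": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith(" "),
        "has_5plus_spaces": "     " in line,
        "whitespace_aligns_with_prev": bool(gaps & prev_gaps),
    }

print(line_features("1993 :    9,589    15,704    575",
                    "Year :    of       Per Milk Cow"))
```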

Table Extraction Experimental Results [Pinto, McCallum, Wei, Croft, 2003 SIGIR]

                   Line labels (percent correct)   Table segments (F1)
HMM                65 %                             -
Stateless MaxEnt   85 %                             64 %
CRF                95 %                             92 %

IE from Research Papers [McCallum et al ‘99]

Field-level F1:
Hidden Markov Models (HMMs)        75.6   [Seymore, McCallum, Rosenfeld, 1999]
Support Vector Machines (SVMs)     89.7   [Han, Giles, et al, 2003]
Conditional Random Fields (CRFs)   93.9   [Peng, McCallum, 2004]   (∆ error 40%)

Named Entity Recognition

CRICKET: MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels: PER, ORG, LOC, MISC

Examples:
PER: Yayuk Basuki, Innocent Butare
ORG: 3M, KDP, Cleveland
LOC: Cleveland, Nirmal Hriday, The Oval
MISC: Java, Basque, 1,000 Lakes Rally

Automatically Induced Features [McCallum & Li, 2003, CoNLL]

Index   Feature
0       inside-noun-phrase (o_{t-1})
5       stopword (o_t)
20      capitalized (o_{t+1})
75      word=the (o_t)
100     in-person-lexicon (o_{t-1})
200     word=in (o_{t+2})
500     word=Republic (o_{t+1})
711     word=RBI (o_t) & header=BASEBALL
1027    header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298    company-suffix-word (firstmention_{t+2})
4040    location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945    moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474    word=the (o_{t-2}) & word=of (o_t)

Named Entity Extraction Results [McCallum & Li, 2003, CoNLL]

Method                                                    F1
HMMs: BBN's Identifinder                                  73%
CRFs w/out Feature Induction                              83%
CRFs with Feature Induction based on LikelihoodGain       90%

Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Sparse BP)
  – Probability + First-order Logic, Co-ref on Entities (MCMC)
• Semi-supervised Learning
• Demo: Rexa, a Web portal for researchers

1. Jointly labeling cascaded sequences: Factorial CRFs [Sutton, Khashayar, McCallum, ICML 2004]

Stacked layers over the same input: English words → part-of-speech → noun-phrase boundaries → named-entity tag.

Run as a cascade, errors cascade: each stage must be perfect to do well. Predicted jointly instead, part-of-speech and noun-phrase labeling in newswire matched cascaded accuracy with only 50% of the training data.

Inference: Loopy Belief Propagation

Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Sparse BP)
  – Probability + First-order Logic, Co-ref on Entities (MCMC)
• Semi-supervised Learning
• Demo: Rexa, a Web portal for researchers


2. Jointly labeling distant mentions: Skip-chain CRFs [Sutton, McCallum, SRL 2004]

"Senator Joe Green said today …. Green ran for …"

In a linear chain, the dependency among similar, distant mentions ("Green" … "Green") is ignored; a skip-chain CRF adds a direct dependency between them.

14% reduction in error on the most-repeated field in email seminar announcements.

Inference: Tree reparameterized BP [Wainwright et al, 2002]

See also [Finkel, et al, 2005]
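A sketch of one plausible way the skip edges are chosen, connecting pairs of identical capitalized words so their labels are decided together; this is toy code, and the actual feature templates in the paper are richer.

```python
# Plausible skip-edge construction: link repeated capitalized tokens so that
# belief propagation couples their labels (toy code, not the paper's templates).
def skip_edges(tokens):
    seen, edges = {}, []
    for i, tok in enumerate(tokens):
        if tok[0].isupper():
            edges += [(j, i) for j in seen.get(tok, [])]   # factor over (y_j, y_i)
            seen.setdefault(tok, []).append(i)
    return edges

tokens = "Senator Joe Green said today that Green ran for office".split()
print(skip_edges(tokens))   # [(2, 6)]: couples the two 'Green' mentions
```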


Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Sparse BP)
  – Probability + First-order Logic, Co-ref on Entities (MCMC)
• Semi-supervised Learning
• Demo: Rexa, a Web portal for researchers

3. Joint co-reference among all pairs: Affinity Matrix CRF ("Entity resolution", "Object correspondence") [McCallum, Wellner, IJCAI WS 2003, NIPS 2004]

Mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." connected by pairwise Y/N decisions with affinities 45, −99, 11.

~25% reduction in error on co-reference of proper nouns in newswire.

Inference: correlational clustering / graph partitioning [Bansal, Blum, Chawla, 2002]


Coreference Resolution
AKA "record linkage", "database record deduplication", "citation matching", "object correspondence", "identity uncertainty"

Input: a news article, with named-entity "mentions" tagged:
Today Secretary of State Colin Powell met with . . . he . . . Condoleezza Rice . . . Mr Powell . . . she . . . Powell . . . President Bush . . . Rice . . . Bush . . .

Output: number of entities, N = 3
#1: Secretary of State Colin Powell; he; Mr. Powell; Powell
#2: Condoleezza Rice; she; Rice
#3: President Bush; Bush

Inside the Traditional Solution: Pair-wise Affinity Metric

Mention (3): ". . . Mr Powell . . ."    Mention (4): ". . . Powell . . ."    Y/N?

N   Two words in common                                29
Y   One word in common                                 13
Y   "Normalized" mentions are string identical         39
Y   Capitalized word in common                         17
Y   > 50% character tri-gram overlap                   19
N   < 25% character tri-gram overlap                  -34
Y   In same sentence                                    9
Y   Within two sentences                                8
N   Further than 3 sentences apart                     -1
Y   "Hobbs Distance" < 3                               11
N   Number of entities in between two mentions = 0     12
N   Number of entities in between two mentions > 4     -3
Y   Font matches                                        1
Y   Default                                           -19

    OVERALL SCORE =                                    98   > threshold = 0
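The metric above is just a weighted sum of fired features compared against a threshold. A minimal sketch: the weights are copied from the slide, the feature names are invented, the feature tests themselves are left abstract, and the fired set below reproduces the slide's Y rows.

```python
# Weighted-feature affinity sketch. Weights mirror the slide; the feature tests
# (string overlap, sentence distance, ...) are stand-ins left abstract here.
AFFINITY_WEIGHTS = {
    "two_words_in_common": 29,            "one_word_in_common": 13,
    "normalized_string_identical": 39,    "capitalized_word_in_common": 17,
    "trigram_overlap_gt_50pct": 19,       "trigram_overlap_lt_25pct": -34,
    "same_sentence": 9,                   "within_two_sentences": 8,
    "further_than_3_sentences": -1,       "hobbs_distance_lt_3": 11,
    "entities_between_eq_0": 12,          "entities_between_gt_4": -3,
    "font_matches": 1,                    "default": -19,
}

def affinity(fired):
    # 'fired' = names of features that are true for this pair; 'default' always fires.
    return sum(AFFINITY_WEIGHTS[f] for f in fired | {"default"})

# Feature values for (". . . Mr Powell . . .", ". . . Powell . . .") as on the slide:
fired = {"one_word_in_common", "normalized_string_identical",
         "capitalized_word_in_common", "trigram_overlap_gt_50pct",
         "same_sentence", "within_two_sentences",
         "hobbs_distance_lt_3", "font_matches"}
score = affinity(fired)
print(score, "merge" if score > 0 else "keep apart")   # 98 > threshold 0 -> merge
```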


The Problem

Pair-wise merging decisions are being made independently from each other:
• (Mr Powell, Powell): affinity = 98 → Y
• (Mr Powell, she): affinity = −104 → N
• (Powell, she): affinity = 11 → Y

Taken independently, these three decisions are mutually inconsistent. They should be made in relational dependence with each other: affinity measures are noisy and imperfect.

A Generative Model Solution [Russell 2001], [Pasula et al 2002]
(Applied to citation matching, and object correspondence in vision.)

For each of N entities, latent attributes (id, surname, gender, age, . . .) generate each mention's observables (context words, distance, fonts, . . .).

Issues:
1) A generative model makes it difficult to use complex features.
2) The number of entities is hard-coded into the model structure, but we are supposed to predict the number of entities! Thus we must modify the model structure during inference: MCMC.


A Markov Random Field for Co-reference (MRF) [McCallum & Wellner, 2003, ICML]

Mentions ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . ." with a Y/N variable on each pair and pairwise affinities 45 (Mr Powell, Powell), −30, and 11 on the other two pairs.

Make pair-wise merging decisions in dependent relation to each other by:
- calculating a joint probability,
- including all edge weights,
- adding dependence on consistent triangles.

$P(\vec{y} \mid \vec{x}) = \frac{1}{Z_{\vec{x}}} \exp\Big( \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) + \sum_{i,j,k} \lambda' f'(y_{ij}, y_{jk}, y_{ik}) \Big)$

Walking through the possible Y/N assignments [McCallum & Wellner, 2003], with the triangle factor f′ assigning −∞ to inconsistent assignments (two Y edges and one N in the same triangle):

• one assignment scores −(45) − (−30) + (11) = −4;
• an intransitive assignment (merging Mr Powell with Powell and Powell with she, but not Mr Powell with she) is vetoed: −∞;
• the best assignment scores +(45) − (−30) − (11) = 64, merging "Mr Powell" with "Powell" and keeping ". . . she . . ." separate.
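A tiny brute-force sketch of this joint decision for the three mentions above: edge weights come from the slide, while which negative weight belongs to which pair is my guess (it does not change the best score here).

```python
import itertools

# Brute-force MAP sketch for the 3-mention coreference MRF above.
# Edges: (0,1)=(Mr Powell, Powell), (0,2)=(Mr Powell, she), (1,2)=(Powell, she).
w = {(0, 1): 45, (0, 2): -30, (1, 2): 11}

def consistent(y):
    # Reject intransitive triangles: exactly two "yes" edges among three mentions.
    for i, j, k in itertools.combinations(range(3), 3):
        if [y[(i, j)], y[(i, k)], y[(j, k)]].count(+1) == 2:
            return False
    return True

def score(y):
    # Sum of signed edge weights: +w_ij if merged (y_ij = +1), -w_ij if not.
    return sum(w[e] * y[e] for e in w)

assignments = [dict(zip(w, vals)) for vals in itertools.product([+1, -1], repeat=3)]
best = max((y for y in assignments if consistent(y)), key=score)
print(score(best), best)  # 64, with only (Mr Powell, Powell) merged, as on the slide
```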


Inference in these MRFs = Graph Partitioning
[Boykov, Vekler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]

Four mentions (". . . Mr Powell . . .", ". . . Powell . . .", ". . . Condoleezza Rice . . .", ". . . she . . .") with edge weights 45, −106, −30, −134, 11, 10.

$\log P(\vec{y} \mid \vec{x}) \propto \sum_{i,j} \sum_l \lambda_l f_l(x_i, x_j, y_{ij}) = \sum_{i,j \text{ w/in partitions}} w_{ij} \; - \sum_{i,j \text{ across partitions}} w_{ij}$

For example, the partition {Powell, Condoleezza Rice, she} | {Mr Powell} scores (−134 + 11 + 10) − (45 − 106 − 30) = −22, while the partition {Mr Powell, Powell} | {Condoleezza Rice, she} scores (45 + 10) − (−106 − 30 − 134 + 11) = 314.
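The same objective as code: given a candidate partition, sum the edge weights within partitions and subtract the sum across partitions. The weights are from the slide; the pairing of weights to specific edges is inferred so that the slide's scores (−22 and 314) are reproduced.

```python
# Partition objective sketch: log P(y|x) is proportional to
# (sum of w_ij within partitions) - (sum of w_ij across partitions).
w = {("Mr Powell", "Powell"): 45, ("Mr Powell", "Condoleezza Rice"): -106,
     ("Mr Powell", "she"): -30, ("Powell", "Condoleezza Rice"): -134,
     ("Powell", "she"): 11, ("Condoleezza Rice", "she"): 10}

def partition_score(partition):
    cluster_of = {m: i for i, cluster in enumerate(partition) for m in cluster}
    within = sum(v for (a, b), v in w.items() if cluster_of[a] == cluster_of[b])
    return within - (sum(w.values()) - within)   # within minus across

print(partition_score([{"Powell", "Condoleezza Rice", "she"}, {"Mr Powell"}]))   # -22
print(partition_score([{"Mr Powell", "Powell"}, {"Condoleezza Rice", "she"}]))   # 314
```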


Co-reference Experimental Results [McCallum & Wellner, 2003]
Proper noun co-reference

DARPA ACE broadcast news transcripts, 117 stories:
                           Partition F1       Pair F1
Single-link threshold      16 %               18 %
Best prev match [Morton]   83 %               89 %
MRFs                       88 %               92 %
                           (∆error = 30%)     (∆error = 28%)

DARPA MUC-6 newswire article corpus, 30 stories:
                           Partition F1       Pair F1
Single-link threshold      11 %               7 %
Best prev match [Morton]   70 %               76 %
MRFs                       74 %               80 %
                           (∆error = 13%)     (∆error = 17%)

Joint Co-reference for Multiple Entity Types [Culotta & McCallum 2005]

People: "Stuart Russell", "Stuart Russell", "S. Russel", with pairwise Y/N coreference variables.
Organizations: "University of California at Berkeley", "Berkeley", "Berkeley", with pairwise Y/N coreference variables.
Additional Y/N dependencies connect the people decisions to the organization decisions.

Reduces error by 22%.


Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Sparse BP)
  – Probability + First-order Logic, Co-ref on Entities (MCMC)
• Semi-supervised Learning
• Demo: Rexa, a Web portal for researchers

4. Joint segmentation and co-reference [Wellner, McCallum, Peng, Hay, UAI 2004]

Extraction from and matching of research paper citations:

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990.

One model connects the segmentation (s) of each observed citation (o), citation attributes (c), co-reference decisions (y), database field values (p), and world knowledge.

35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.

Inference: Sparse Generalized Belief Propagation [Pal, Sutton, McCallum, 2005]; see also [Marthi, Milch, Russell, 2003]

4. Joint segmentation and co-reference: Joint IE and Coreference from Research Paper Citations

Textual citation mentions (noisy, with duplicates) → paper database with fields, clean, duplicates collapsed:

AUTHORS              TITLE       VENUE
Cowell, Dawid…       Probab…     Springer
Montemerlo, Thrun…   FastSLAM…   AAAI…
Kjaerulff            Approxi…    Technic…

Citation Segmentation and Coreference

Laurel, B. Interface Agents: Metaphors with Character, in The Art of Human-Computer Interface Design, T. Smith (ed), Addison-Wesley, 1990.
Brenda Laurel. Interface Agents: Metaphors with Character, in Smith, The Art of Human-Computr Interface Design, 355-366, 1990.

1) Segment citation fields.
2) Resolve coreferent citations (Y? N).
3) Form canonical database record, resolving conflicts:

AUTHOR    = Brenda Laurel
TITLE     = Interface Agents: Metaphors with Character
PAGES     = 355-366
BOOKTITLE = The Art of Human-Computer Interface Design
EDITOR    = T. Smith
PUBLISHER = Addison-Wesley
YEAR      = 1990

Perform all three jointly.


IE + Coreference Model

• A linear-chain CRF segments each observed citation x, e.g. "J Besag 1986 On the…" labeled AUT AUT YR TITL TITL …, yielding segmentation s.
• Citation mention attributes c summarize each segmentation, e.g. AUTHOR = "J Besag", YEAR = "1986", TITLE = "On the…".
• This structure is repeated for each citation mention: "Smyth, P Data mining…", "Smyth . 2001 Data Mining…", "J Besag 1986 On the…".
• Binary coreference variables connect each pair of mentions: y between the two Smyth citations, n between each of them and the Besag citation.
• Research paper entity attribute nodes (AUTHOR = "P Smyth", YEAR = "2001", TITLE = "Data Mining…", . . .) tie together all mentions of the same paper.


Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]

Exact inference on the linear-chain regions; from each chain, pass an N-best list into coreference.

Approximate inference by graph partitioning, integrating out the uncertainty in samples of the extraction; scale to 1M citations with canopies [McCallum, Nigam, Ungar 2000].
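A sketch of the canopies idea, in my simplification: a cheap similarity with loose and tight thresholds builds overlapping blocks, and the expensive coreference model only compares pairs that share a block. The thresholds and similarity choice below are illustrative, not the paper's exact setup.

```python
# Canopies sketch: cheap word-overlap similarity with loose/tight thresholds.
def jaccard(a, b):
    A, B = set(a.lower().split()), set(b.lower().split())
    return len(A & B) / len(A | B)

def canopies(items, cheap_sim=jaccard, loose=0.3, tight=0.7):
    remaining = set(range(len(items)))
    result = []
    while remaining:
        center = remaining.pop()           # arbitrary remaining point as center
        canopy = {center}
        for i in list(remaining):
            s = cheap_sim(items[center], items[i])
            if s >= loose:
                canopy.add(i)              # loosely similar: joins this canopy
            if s >= tight:
                remaining.discard(i)       # tightly similar: never becomes a center
        result.append(canopy)
    return result

cites = ["Besag 1986 On the statistical analysis of dirty pictures",
         "J Besag 1986 statistical analysis of dirty pictures",
         "Smyth 2001 Data mining at the interface"]
print(canopies(cites))   # the two Besag variants share a canopy; Smyth is alone
```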


Inference: Sample = N-best List from CRF Segmentation

N-best segmentations of one citation:
1. Name: "Laurel, B. Interface" | Title: "Agents: Metaphors with Character" | Book Title: "The Art of Human Computer Interface Design" | Year: 1990
2. Name: "Laurel, B." | Title: "Interface Agents: Metaphors with Character The Art" | Book Title: "of Human Computer Interface Design" | Year: 1990
3. Name: "Laurel, B. Interface" | Title: "Agents: Metaphors with Character" | Book Title: "The Art of Human Computer Interface Design"

When calculating similarity with another citation, we have more opportunity to find correct, matching fields:
1. Name: "Laurel, B" | Title: "Interface Agents: Metaphors with Character The"
2. Name: "Laurel, B." | Title: "Interface Agents: Metaphors with Character" | Year: 1990
3. Name: "Laurel, B. Interface Agents" | Title: "Metaphors with Character"

Coreference decision: y? n

Inference by Sparse "Generalized BP" [Pal, Sutton, McCallum 2005]

Exact (exhaustive) inference over the entity attributes.

Then revisit exact inference on the IE linear chains, now conditioned on the entity attributes.

Parameter Estimation: Piecewise Training [Sutton & McCallum 2005]

Divide-and-conquer parameter estimation:
• IE linear-chain: exact MAP
• Coref graph edge weights: MAP on individual edges
• Entity attribute potentials: MAP, pseudo-likelihood

In all cases: climb the MAP gradient with a quasi-Newton method.


Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Sparse BP)
  – Probability + First-order Logic, Co-ref on Entities (MCMC)
• Semi-supervised Learning
• Demo: Rexa, a Web portal for researchers


Semi-Supervised Learning

How to train with limited labeled data? Augment with lots of unlabeled data: "Expectation Regularization" [Mann, McCallum, ICML 2007]


Supervised Learning

A decision boundary is learned from a small amount of labeled data. Creation of labeled instances requires extensive human effort. What if labeled data is limited?


Semi-Supervised Learning: Labeled & Unlabeled Data

Augment a small amount of labeled data with a large amount of unlabeled data.

More Semi-Supervised Algorithms than Applications

[Chart: number of papers per year, 1998-2006, algorithms vs. applications; compiled from [Zhu, 2007]]


Weakness of Many Semi-Supervised Algorithms

• Difficult to implement: significantly more complicated than their supervised counterparts.
• Fragile: meta-parameters hard to tune.
• Lacking in scalability: O(n²) or O(n³) in the unlabeled data.

"EM will generally degrade [tagging] accuracy, except when only a limited amount of hand-tagged text is available." [Merialdo, 1994]

"When the percentage of labeled data increases from 50% to 75%, the performance of [Label Propagation with Jensen-Shannon divergence] and SVM become almost same, while [Label propagation with cosine distance] performs significantly worse than SVM." [Niu, Ji, Tan, 2005]


Families of Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions

Family 1: Expectation Maximization [Dempster, Laird, Rubin, 1977]. Fragile: often worse than supervised.


Family 2: Graph-Based Methods [Szummer, Jaakkola, 2002], [Zhu, Ghahramani, 2002]. Lacking in scalability, and sensitive to the choice of metric.

Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005]. Complicated to find appropriate auxiliary tasks.


Family 4: Decision Boundary in Sparse Region. Transductive SVMs [Joachims, 1999]: sparsity measured by margin. Entropy Regularization [Grandvalet and Bengio, 2005]: ...by label entropy.


Minimal Entropy Solution!

How do we know the minimal entropy solution is wrong? We suspect at least some of the data is in the second class! In fact, we often have prior knowledge of the relative class proportions, e.g.:
• 0.8 : Student, 0.2 : Professor
• 0.1 : Gene Mention, 0.9 : Background
• 0.6 : Person, 0.4 : Organization


Families of Semi-Supervised Learning
1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
5. Generalized Expectation

Family 5: Generalized Expectation. Favor decision boundaries in low-density regions that also match the prior class proportions.


Generalized Expectation
• Simple: easy to implement
• Robust: meta-parameters need little or no tuning
• Scalable: linear in the number of unlabeled examples

Generalized Expectation Special Cases
• Label Regularization: p(y)
• Expectation Regularization: p(y | feature)
• Generalized Expectation (general case): E[ f(x,y) ]


Label Regularization (LR)

Objective = log-likelihood + LR term, where the LR term is the KL-divergence between a prior distribution (provided from supervised training or estimated on the labeled data) and the model's expected label distribution on the unlabeled data.
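A minimal sketch of just the regularization term for a logistic-regression model; the data, prior, and parameters are invented, and in [Mann, McCallum, ICML 2007] this term is combined with the supervised log-likelihood during training.

```python
import numpy as np

# Label-regularization term sketch: KL( prior || model's expected label
# distribution on unlabeled data ), for a toy multiclass logistic regression.
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def label_regularizer(theta, X_unlabeled, prior):
    p = softmax(X_unlabeled @ theta)                   # per-instance p(y|x)
    q = p.mean(axis=0)                                 # expected label distribution
    return float(np.sum(prior * np.log(prior / q)))    # KL(prior || q)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                          # 100 unlabeled instances
theta = rng.normal(size=(5, 2))                        # 2 classes
print(label_regularizer(theta, X, np.array([0.8, 0.2])))
```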


After Training, Model Matches Prior Distribution

[Figure: predicted label distributions, supervised only vs. supervised + LR]

Gradient for Logistic Regression: the LR term has a simple gradient; when the model's expected label distribution matches the prior, the gradient is 0.


LR Results for Classification: Secondary Structure Prediction Accuracy

                                      # Labeled Examples
                                      2         100       1000
SVM (supervised)                      -         55.41%    66.29%
Cluster Kernel SVM                    -         57.05%    65.97%
QC Smartsub                           -         57.68%    59.16%
Naïve Bayes (supervised)              52.42%    57.12%    64.47%
Naïve Bayes EM                        50.79%    57.34%    57.60%
Logistic Regression (supervised)      52.42%    56.74%    65.43%
Logistic Regression + Entropy Reg.    48.56%    54.45%    58.28%
Logistic Regression + GE              57.08%    58.51%    65.44%

XR Results for Classification: Sliding Window Models

[Results charts: CoNLL03 Named Entity Recognition Shared Task; BioCreativeII 2007 Gene/Gene Product Extraction; Wall Street Journal Part-of-Speech Tagging]


XR Results for Classification: SRAA

[Results chart: Simulated/Real Auto/Aviation Text Classification]

Noise in Prior Knowledge: what happens when users' estimates of the class proportions are in error?


Noisy Prior Distribution: CoNLL03 Named Entity Recognition Shared Task

[Chart: performance under up to a 20% change in the probability of the majority class]

Generalized Expectation
• Simple: easy to implement
• Robust: meta-parameters need little or no tuning
• Scalable: linear in the number of unlabeled examples


Generalized Expectation Special Cases
• Label Regularization: p(y)
• Expectation Regularization: p(y | feature), e.g. p(BASEBALL | "homerun") = 0.95
• Generalized Expectation (general case): E[ f(x,y) ]

An Alternative Style of Supervision: Classifying Baseball versus Hockey

Traditional: human labeling effort, then (semi-)supervised training via maximum likelihood.
Generalized Expectation: brainstorm a few keywords (baseball: ball, field, bat; hockey: puck, ice, stick), then semi-supervised training via Generalized Expectation.


Labeling Features (~1000 unlabeled examples)

As more features are labeled (hockey, baseball, HR, Mets, goal, Buffalo, Leafs, puck, Lemieux, ball, Oilers, Sox, Pens, runs, batting, base, NHL, Bruins, Penguins, . . .), test accuracy climbs: 85% → 92% → 94.5% → 96%.

[Chart: Accuracy per Human Effort; test accuracy vs. labeling time in seconds]



Generalized Expectation (GE) criteria [McCallum, Mann, Druck 2007]

• Definition: a parameter estimation objective function that expresses preferences on expectations of the model:

Objective = Score( E[ f(x,y) ] )

• Sometimes in the same equivalence class as:
– Moment matching: but not just moments, and not necessarily matching a single target value.
– Maximum likelihood: but not necessarily p(data); preferences on a subset of model factors.
– Maximum entropy: based on constraints and expectations, but the parameterization is not required to match the constraints.


Generalized Expectation criteria: Non-traditional opportunities

1. Model factors ≠ expectation factors: define the objective separately from the model. See also [Ganchev, Graca, Taskar NIPS 2007].
2. Condition expectations on different circumstances: E[ NER | text ], E[ NER | text + HTML ], E[ NER | newswire ], E[ NER | email ].
3. Supervised training signal from domain knowledge: beyond L(θ) = p(D|θ) p(θ) with (1) labeled data and (2) an informative prior, use the expectations of an expert, e.g. E[Adj before Noun] = 0.9, E[Det Det] = 0.01, E["bank" = Verb] = 0.2.

Generalized Expectation criteria: Easy communication with domain experts

• Inject domain knowledge into parameter estimation.
• Like an "informative prior"...
• ...but rather than the "language of parameters" (difficult for humans to understand)...
• ...use the "language of expectations" (natural for humans).


Use of Domain Knowledge #3: Parameter Estimation
• "Expectations" are a natural language in which to express expertise.
• GE translates expectations into a parameter estimation objective.
• The expert has knowledge; we must provide ML tools to integrate it safely.

Outline
• Examples of IE and Data Mining.
• Motivate Joint Inference
• Brief introduction to Conditional Random Fields
• Joint inference: Information Extraction Examples
  – Joint Labeling of Cascaded Sequences (Belief Propagation)
  – Joint Labeling of Distant Entities (BP by Tree Reparameterization)
  – Joint Co-reference Resolution (Graph Partitioning)
  – Joint Segmentation and Co-ref (Sparse BP)
  – Probability + First-order Logic, Co-ref on Entities (MCMC)
• Semi-supervised Learning
• Demo: Rexa, a Web portal for researchers


Data Mining Research Literature
• Better understand the structure of our own research area.
• Structure helps us learn a new field.
• Aid collaboration.
• Map how ideas travel through social networks of researchers.
• Aids for hiring and finding reviewers!
• Measure impact of papers or people.

Traditional Bibliometrics
• Analyzes a small amount of data (e.g. 19 articles from a single issue of a journal).
• Uses "journal" as a proxy for "research topic" (but there is no journal for information extraction).
• Uses impact measures almost exclusively based on simple citation counts.

How can we use topic models to create new, interesting impact measures? Can we create a social network of scientific sub-fields?


Our Data
• Over 1.6 million research papers, gathered as part of the Rexa.info portal.
• Cross-linked references / citations.

Previous Systems


Previous Systems model a single entity type, Research Paper, with a single relation, Cites.


More Entities and Relations

Beyond Research Paper and Cites: Person, Grant, Venue, University, Groups, and relations such as Expertise.

Topical Transfer

Citation counts from one topic to another: map "producers and consumers".

Topical Bibliometric Impact Measures [Mann, Mimno, McCallum, 2006]
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer

Topical Transfer: transfer from Digital Libraries to other topics

Other topic       Cit's   Paper Title
Web Pages         31      Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan, ... 1999.
Computer Vision   14      On being 'Undigital' with digital cameras: extending the dynamic...
Video             12      Lessons learned from the creation and deployment of a terabyte digital video libr..
Graphs            12      Trawling the Web for Emerging Cyber-Communities
Web Pages         11      WebBase: a repository of Web pages

Topical Diversity

Entropy of the topic distribution among papers that cite this paper (this topic); papers that had the most influence across many other fields score high.

[Table: papers with high diversity vs. low diversity]

Summary
• Joint inference is needed to avoid cascading errors in information extraction and data mining.
• It can be performed in CRFs:
  – Cascaded sequences (Factorial CRFs)
  – Distant correlations (Skip-chain CRFs)
  – Co-reference (Affinity-matrix CRFs)
  – Segmentation + Coref
  – Logic + Probability
• Rexa: a new research paper search engine, mining the interactions in our community.


Outline
• Model / Feature Engineering
  – Brief review of IE w/ Conditional Random Fields
  – Flexibility to use non-independent features
• Inference
  – Entity Resolution with Probability + First-order Logic
  – Resolution + Canonicalization + Schema Mapping
  – Inference by Metropolis-Hastings
• Parameter Estimation
  – Semi-supervised Learning with Label Regularization
  – ...with Feature Labeling
  – Generalized Expectation criteria