Machine Learning for NLP: Perceptron Learning

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research


Introduction

Linear Classifiers

- Classifiers covered so far:
  - Decision trees
  - Nearest neighbor
- Next three lectures: linear classifiers
- Statistics from Google Scholar (October 2009):
  - "Maximum Entropy" & "NLP": 2660 hits, 141 before 2000
  - "SVM" & "NLP": 2210 hits, 16 before 2000
  - "Perceptron" & "NLP": 947 hits, 118 before 2000
- All are linear classifiers and basic tools in NLP
- They are also the bridge to deep learning techniques


Outline

- Today:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2
- Next time:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)
- Final session:
  - Naive Bayes classifiers
  - Generative and discriminative models


Preliminaries

Inputs and Outputs

- Input: x ∈ X
  - e.g., a document or sentence with some words x = w_1 ... w_n, or a series of previous actions
- Output: y ∈ Y
  - e.g., a parse tree, document class, part-of-speech tags, word sense
- Input/output pair: (x, y) ∈ X × Y
  - e.g., a document x and its label y
  - Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x


Feature Representations

- We assume a mapping from input-output pairs (x, y) to a high-dimensional feature vector
  - f(x, y) : X × Y → R^m
- For some cases, i.e., binary classification with Y = {−1, +1}, we can map only from the input to the feature space
  - f(x) : X → R^m
- However, most problems in NLP require more than two classes, so we focus on the multi-class case
- For any vector v ∈ R^m, let v_j be the j-th value


Features and Classes

- All features must be numerical
  - Numerical features are represented directly as f_i(x, y) ∈ R
  - Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
- Multinomial (categorical) features must be binarized
  - Instead of: f_i(x, y) ∈ {v_0, ..., v_p}
  - We have: f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
  - Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
- We need distinct features for distinct output classes
  - Instead of: f_i(x) (1 ≤ i ≤ m)
  - We have: f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  - Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j
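To make the two constructions above concrete, here is a minimal Python sketch (not from the slides; the helper names binarize and joint_features and the toy values are assumptions made for illustration):

    # Sketch: binarize a categorical feature and copy input features into
    # one block per output class (assumed helper names, illustration only).

    def binarize(value, possible_values):
        """One binary feature per possible value: 1.0 iff the feature equals it."""
        return [1.0 if value == v else 0.0 for v in possible_values]

    def joint_features(input_features, y, num_classes):
        """f(x, y): place the m input features in the block for class y, zeros elsewhere."""
        m = len(input_features)
        f = [0.0] * (m * num_classes)
        f[y * m:(y + 1) * m] = input_features
        return f

    # A categorical feature over {"NN", "VB", "JJ"} observed as "VB",
    # placed in the block for class y = 1 out of 3 classes:
    x_feats = binarize("VB", ["NN", "VB", "JJ"])         # [0.0, 1.0, 0.0]
    print(joint_features(x_feats, y=1, num_classes=3))   # 1.0 in position 4, zeros elsewhere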


Examples

- x is a document and y is a label

  f_j(x, y) = 1 if x contains the word "interest" and y = "financial", 0 otherwise

  f_j(x, y) = % of words in x with punctuation, if y = "scientific" (0 otherwise)

- x is a word and y is a part-of-speech tag

  f_j(x, y) = 1 if x = "bank" and y = Verb, 0 otherwise


Examples

- x is a name, y is a label classifying the name

  f_0(x, y) = 1 if x contains "Washington" and y = "Person", 0 otherwise
  f_1(x, y) = 1 if x contains "George" and y = "Person", 0 otherwise
  f_2(x, y) = 1 if x contains "Bridge" and y = "Person", 0 otherwise
  f_3(x, y) = 1 if x contains "General" and y = "Person", 0 otherwise
  f_4(x, y) = 1 if x contains "Washington" and y = "Object", 0 otherwise
  f_5(x, y) = 1 if x contains "George" and y = "Object", 0 otherwise
  f_6(x, y) = 1 if x contains "Bridge" and y = "Object", 0 otherwise
  f_7(x, y) = 1 if x contains "General" and y = "Object", 0 otherwise

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]


Block Feature Vectors

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]
- One equal-size block of the feature vector for each label
- Input features duplicated in each block
- Non-zero values allowed only in one block
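A small sketch of how such block vectors could be computed (an illustration, not the course code; the function name and keyword list are assumptions that mirror f_0, ..., f_7 from the previous slide):

    # Sketch: block feature vectors for the name example above.
    KEYWORDS = ["Washington", "George", "Bridge", "General"]
    LABELS = ["Person", "Object"]

    def f(x, y):
        """f(x, y): input features go into the block for label y, zeros elsewhere."""
        input_feats = [1 if kw in x.split() else 0 for kw in KEYWORDS]
        vec = []
        for label in LABELS:
            vec += input_feats if label == y else [0] * len(KEYWORDS)
        return vec

    print(f("General George Washington", "Person"))  # [1, 1, 0, 1, 0, 0, 0, 0]
    print(f("George Washington Bridge", "Object"))   # [0, 0, 0, 0, 1, 1, 1, 0]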


Linear Classifiers

Linear Classifiers

- Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
- Let w ∈ R^m be a high-dimensional weight vector
- If we assume that w is known, then we define our classifier as
- Multiclass classification: Y = {0, 1, ..., N}

  y = arg max_y w · f(x, y) = arg max_y Σ_{j=0}^{m} w_j × f_j(x, y)

- Binary classification is just a special case of multiclass
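The decision rule itself is just a dot product and an arg max over labels; a minimal Python sketch (assumed function names, not the assignment's starter code):

    # Sketch: multiclass linear classifier y = arg max_y w · f(x, y).
    # Assumes a joint feature function f(x, y) that returns a list of floats.

    def score(w, feats):
        """Dot product w · f(x, y)."""
        return sum(w_j * f_j for w_j, f_j in zip(w, feats))

    def predict(w, x, labels, f):
        """Return the label with the highest score under weight vector w."""
        return max(labels, key=lambda y: score(w, f(x, y)))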


Linear Classifiers – Bias Terms

- Linear classifiers are often presented as

  y = arg max_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y

  where b_y is a bias or offset term for label y
- But this can be folded into f:

  x = General George Washington, y = Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  x = General George Washington, y = Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]

  f_4(x, y) = 1 if y = "Person", 0 otherwise
  f_9(x, y) = 1 if y = "Object", 0 otherwise

- w_4 and w_9 are now the bias terms for the labels


Binary Linear Classifier

- Divides all points: [figure omitted]


Multiclass Linear Classifier

- Defines regions of space: [figure omitted]
- i.e., + are all points (x, y) where + = arg max_y w · f(x, y)


Separability

- A set of points is separable if there exists a w such that classification is perfect
  [figures omitted: a separable case and a non-separable case]
- This can also be defined mathematically (and we will shortly)


Supervised Learning – how to find w

- Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
- Input: feature representation f
- Output: w that maximizes/minimizes some important function on the training set
  - minimize error (Perceptron, SVMs, Boosting)
  - maximize likelihood of data (Logistic Regression, Naive Bayes)
- Assumption: the training data is separable
  - Not necessary, just makes life easier
  - There is a lot of good work in machine learning to tackle the non-separable case


Perceptron

- Choose a w that minimizes error:

  w = arg min_w Σ_t (1 − 1[y_t = arg max_y w · f(x_t, y)])

  where 1[p] = 1 if p is true, 0 otherwise
- This is a 0–1 loss function
  - Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions
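For concreteness, the 0–1 training error of a fixed w could be computed as below (a sketch that reuses the predict function from the earlier sketch; the argument names are assumptions):

    # Sketch: 0-1 loss of a fixed weight vector w on a training set T,
    # where T is a list of (x, y) pairs.

    def zero_one_error(w, T, labels, f):
        """Number of training examples misclassified by w."""
        return sum(1 for x, y in T if predict(w, x, labels, f) != y)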


Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     let y' = arg max_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)
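A runnable sketch of steps 1-8 in Python, reusing the predict function sketched earlier (an illustration under assumed names, not the assignment's starter code):

    # Sketch: multiclass perceptron training loop (steps 1-8 above).
    # T is a list of (x, y) pairs; f(x, y) returns a feature list of length m.

    def train_perceptron(T, labels, f, m, epochs=10):
        w = [0.0] * m                                  # step 1: w = 0
        for _ in range(epochs):                        # step 2: for n in 1..N
            for x, y_t in T:                           # step 3: for t in 1..|T|
                y_pred = predict(w, x, labels, f)      # step 4: arg max_y w · f(x, y)
                if y_pred != y_t:                      # step 5: a mistake was made
                    f_gold = f(x, y_t)
                    f_pred = f(x, y_pred)
                    for j in range(m):                 # step 6: w += f(x, y_t) - f(x, y')
                        w[j] += f_gold[j] - f_pred[j]
        return w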


Perceptron: Separability and Margin

- Given a training instance (x_t, y_t), define:
  - Ȳ_t = Y − {y_t}
  - i.e., Ȳ_t is the set of incorrect labels for x_t
- A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that

  u · f(x_t, y_t) − u · f(x_t, y') ≥ γ  for all y' ∈ Ȳ_t

  where ||u|| = sqrt(Σ_j u_j²) is the Euclidean (L2) norm
- Assumption: the training set is separable with margin γ


Perceptron: Main Theorem

- Theorem: for any training set separable with a margin of γ, the following holds for the perceptron algorithm:

  mistakes made during training ≤ R² / γ²

  where R ≥ ||f(x_t, y_t) − f(x_t, y')|| for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
- Thus, after a finite number of training iterations, the error on the training set will converge to zero
- For the proof, see Collins (2002)
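As a purely illustrative instance of the bound (the numbers are made up): if every feature difference has norm at most R = 10 and the data is separable with margin γ = 0.5, the perceptron makes at most 10²/0.5² = 400 mistakes in total, regardless of the size of the training set or the number of iterations.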


Practical Considerations

- The perceptron is sensitive to the order of training examples
  - Consider: 500 positive instances followed by 500 negative instances
- Shuffling: randomly permute the training instances between iterations
- Voting and averaging:
  - Let w_1, ..., w_n be all the weight vectors seen in training
  - The voted perceptron predicts the majority vote of w_1, ..., w_n
  - The averaged perceptron predicts using the average vector (1/n) Σ_{i=1}^{n} w_i (see the sketch below)
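A sketch of both practical fixes, shuffling between iterations and averaging all weight vectors seen during training (assumed names; a variant of the training loop sketched above, not the VG reference solution):

    import random

    # Sketch: shuffle between epochs and keep a running sum of the weight
    # vector after every example, so we can return (1/n) * sum_i w_i.

    def train_averaged_perceptron(T, labels, f, m, epochs=10):
        w = [0.0] * m
        w_sum = [0.0] * m          # running sum of all weight vectors seen
        n = 0                      # number of weight vectors summed
        data = list(T)
        for _ in range(epochs):
            random.shuffle(data)                       # shuffle between iterations
            for x, y_t in data:
                y_pred = predict(w, x, labels, f)
                if y_pred != y_t:
                    f_gold, f_pred = f(x, y_t), f(x, y_pred)
                    for j in range(m):
                        w[j] += f_gold[j] - f_pred[j]
                for j in range(m):                     # accumulate after every example
                    w_sum[j] += w[j]
                n += 1
        return [w_j / n for w_j in w_sum]              # the averaged weight vector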


Perceptron Summary

- Learns a linear classifier that minimizes error
  - Guaranteed to find a w in a finite amount of time (for separable data)
  - Improvement 1: shuffle the training data between iterations
  - Improvement 2: average the weight vectors seen during training
- The perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation:

    w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')

- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split


Assignment 2

- Implement the perceptron (starter code in Python)
- Three subtasks:
  - Implement the linear classifier (dot product)
  - Implement the perceptron update
  - Evaluate on the spambase data set
- For VG:
  - Implement the averaged perceptron


Appendix: Proofs and Derivations


Convergence Proof for Perceptron

Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     let y' = arg max_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)

- w^(k−1) are the weights before the k-th mistake
- Suppose the k-th mistake is made at the t-th example (x_t, y_t)
- y' = arg max_y w^(k−1) · f(x_t, y), with y' ≠ y_t
- w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y')

Now: u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y')) ≥ u · w^(k−1) + γ
Now: w^(0) = 0 and u · w^(0) = 0, so by induction on k, u · w^(k) ≥ kγ
Now: since u · w^(k) ≤ ||u|| × ||w^(k)|| and ||u|| = 1, we have ||w^(k)|| ≥ kγ
Now:

  ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y')||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y'))
  ||w^(k)||² ≤ ||w^(k−1)||² + R²

  (since R ≥ ||f(x_t, y_t) − f(x_t, y')|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y') ≤ 0)


- We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R²
- By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0,

  ||w^(k)||² ≤ kR²

- Therefore,

  k²γ² ≤ ||w^(k)||² ≤ kR²

- Solving for k gives

  k ≤ R²/γ²

- Therefore the number of errors is bounded!
