Machine Learning for NLP: Perceptron Learning

Joakim Nivre
Uppsala University, Department of Linguistics and Philology
Slides adapted from Ryan McDonald, Google Research
Introduction
Linear Classifiers

- Classifiers covered so far:
  - Decision trees
  - Nearest neighbor
- Next three lectures: linear classifiers
- Statistics from Google Scholar (October 2009):
  - "Maximum Entropy" & "NLP": 2660 hits, 141 before 2000
  - "SVM" & "NLP": 2210 hits, 16 before 2000
  - "Perceptron" & "NLP": 947 hits, 118 before 2000
- All are linear classifiers and basic tools in NLP
- They are also the bridge to deep learning techniques
Introduction
Outline

- Today:
  - Preliminaries: input/output, features, etc.
  - Perceptron
  - Assignment 2
- Next time:
  - Large-margin classifiers (SVMs, MIRA)
  - Logistic regression (Maximum Entropy)
- Final session:
  - Naive Bayes classifiers
  - Generative and discriminative models
Preliminaries
Inputs and Outputs

- Input: x ∈ X
  - e.g., a document or sentence with words x = w_1 ... w_n, or a series of previous actions
- Output: y ∈ Y
  - e.g., a parse tree, document class, part-of-speech tags, word sense
- Input/output pair: (x, y) ∈ X × Y
  - e.g., a document x and its label y
  - Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x
Preliminaries
Feature Representations

- We assume a mapping from input-output pairs (x, y) to a high-dimensional feature vector:

    f(x, y) : X × Y → R^m

- In some cases, e.g., binary classification with Y = {−1, +1}, we can map only from the input to the feature space:

    f(x) : X → R^m

- However, most problems in NLP require more than two classes, so we focus on the multi-class case
- For any vector v ∈ R^m, let v_j be the j-th value
Preliminaries
Features and Classes

- All features must be numerical
  - Numerical features are represented directly as f_i(x, y) ∈ R
  - Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
- Multinomial (categorical) features must be binarized (a small sketch follows this list)
  - Instead of: f_i(x, y) ∈ {v_0, ..., v_p}
  - We have: f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
  - Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
- We need distinct features for distinct output classes
  - Instead of: f_i(x) (1 ≤ i ≤ m)
  - We have: f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  - Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j
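A minimal Python sketch of the binarization step; the function name and example values are illustrative only, not part of the course code:

# One categorical feature becomes p+1 binary features, exactly one of which is 1.
def binarize(value, possible_values):
    return [1.0 if value == v else 0.0 for v in possible_values]

# e.g., a feature with values {v_0, v_1, v_2} = {"Noun", "Verb", "Adj"}
print(binarize("Verb", ["Noun", "Verb", "Adj"]))  # [0.0, 1.0, 0.0]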
Preliminaries
Examples

- x is a document and y is a label
  - f_j(x, y) = 1 if x contains the word "interest" and y = "financial"; 0 otherwise
  - f_j(x, y) = % of words in x with punctuation, if y = "scientific"
- x is a word and y is a part-of-speech tag
  - f_j(x, y) = 1 if x = "bank" and y = Verb; 0 otherwise
Preliminaries
Examples

- x is a name, y is a label classifying the name

    f_0(x, y) = 1 if x contains "Washington" and y = "Person"; 0 otherwise
    f_1(x, y) = 1 if x contains "George" and y = "Person"; 0 otherwise
    f_2(x, y) = 1 if x contains "Bridge" and y = "Person"; 0 otherwise
    f_3(x, y) = 1 if x contains "General" and y = "Person"; 0 otherwise
    f_4(x, y) = 1 if x contains "Washington" and y = "Object"; 0 otherwise
    f_5(x, y) = 1 if x contains "George" and y = "Object"; 0 otherwise
    f_6(x, y) = 1 if x contains "Bridge" and y = "Object"; 0 otherwise
    f_7(x, y) = 1 if x contains "General" and y = "Object"; 0 otherwise

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]
Preliminaries
Block Feature Vectors

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]
- One equal-size block of the feature vector for each label
- Input features duplicated in each block
- Non-zero values allowed only in one block (see the sketch below)
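As a sketch of how such block vectors could be built (the helper name and the word-splitting choice are assumptions, not course code), the following reproduces the vectors above for the four cue words and two labels:

# Input cues go into the block belonging to label y; other blocks stay zero.
CUES = ["Washington", "George", "Bridge", "General"]   # input features per block
LABELS = ["Person", "Object"]                           # one block per label

def feature_vector(x, y):
    f_x = [1 if cue in x.split() else 0 for cue in CUES]
    vec = []
    for label in LABELS:
        vec.extend(f_x if label == y else [0] * len(CUES))
    return vec

print(feature_vector("General George Washington", "Person"))  # [1, 1, 0, 1, 0, 0, 0, 0]
print(feature_vector("George Washington Bridge", "Object"))   # [0, 0, 0, 0, 1, 1, 1, 0]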
Linear Classifiers
Linear Classifiers

- Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
- Let w ∈ R^m be a high-dimensional weight vector
- If we assume that w is known, then we define our classifier as:
- Multiclass classification: Y = {0, 1, ..., N}

    y = arg max_y w · f(x, y)
      = arg max_y Σ_{j=0}^{m} w_j × f_j(x, y)

- Binary classification is just a special case of multiclass (a sketch of the prediction rule follows below)
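A minimal Python sketch of this prediction rule, assuming a feature_vector(x, y) helper like the one sketched earlier; the names are illustrative, not the assignment's starter code:

def score(w, f_xy):
    """w · f(x, y): linear combination of feature values and their weights."""
    return sum(w_j * f_j for w_j, f_j in zip(w, f_xy))

def predict(w, x, labels, feature_vector):
    """y_hat = arg max_y w · f(x, y), searching over all candidate labels."""
    return max(labels, key=lambda y: score(w, feature_vector(x, y)))

With the block feature vectors above and a weight vector of length 8, predict(w, "George Washington Bridge", ["Person", "Object"], feature_vector) simply returns whichever label's block scores highest.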
Linear Classifiers
Linear Classifiers – Bias Terms

- Often linear classifiers are presented as:

    y = arg max_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y

- Where b_y is a bias or offset term
- But this can be folded into f:

    f_4(x, y) = 1 if y = "Person"; 0 otherwise
    f_9(x, y) = 1 if y = "Object"; 0 otherwise

  - x = General George Washington, y = Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  - x = General George Washington, y = Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]
- w_4 and w_9 are now the bias terms for the labels (a sketch follows below)
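One way to realize this folding trick, sketched under the assumption that a constant feature with value 1 is appended to each class block (helper name invented for illustration):

def feature_vector_with_bias(x, y, cues, labels):
    # Append a constant 1 so each class block gains one extra position;
    # the weight at that position acts as the label's bias term.
    f_x = [1 if cue in x.split() else 0 for cue in cues] + [1]
    vec = []
    for label in labels:
        vec.extend(f_x if label == y else [0] * len(f_x))
    return vec

print(feature_vector_with_bias("General George Washington", "Person",
                               ["Washington", "George", "Bridge", "General"],
                               ["Person", "Object"]))
# [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]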
Linear Classifiers
Binary Linear Classifier

- Divides all points: [figure]
Linear Classifiers
Multiclass Linear Classifier

- Defines regions of space: [figure]
- i.e., + are all points (x, y) where + = arg max_y w · f(x, y)
Linear Classifiers
Separability

- A set of points is separable if there exists a w such that classification is perfect
- [Figures: Separable vs. Not Separable]
- This can also be defined mathematically (and we will shortly)
Linear Classifiers
Supervised Learning – how to find w

- Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
- Input: feature representation f
- Output: w that maximizes/minimizes some important function on the training set
  - minimize error (Perceptron, SVMs, Boosting)
  - maximize likelihood of data (Logistic Regression, Naive Bayes)
- Assumption: the training data is separable
  - Not necessary, just makes life easier
  - There is a lot of good work in machine learning to tackle the non-separable case
Linear Classifiers
Perceptron

- Choose a w that minimizes error:

    w = arg min_w Σ_t (1 − 1[y_t = arg max_y w · f(x_t, y)])

    where 1[p] = 1 if p is true; 0 otherwise

- This is a 0-1 loss function (a small sketch follows below)
- Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions
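A tiny sketch of this loss as a function of w, reusing the predict helper sketched earlier (illustrative only):

def zero_one_loss(w, data, labels, feature_vector):
    """Number of examples (x_t, y_t) where arg max_y w · f(x_t, y) != y_t."""
    return sum(1 for x_t, y_t in data
               if predict(w, x_t, labels, feature_vector) != y_t)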
Linear Classifiers
Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     Let y' = arg max_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)
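A minimal Python sketch of this loop (not the assignment's starter code; feature_vector and labels are assumed as in the earlier sketches):

def train_perceptron(data, labels, feature_vector, dim, num_epochs=10):
    """Multiclass perceptron: on each mistake, add f(x_t, y_t) and subtract f(x_t, y')."""
    w = [0.0] * dim
    for _ in range(num_epochs):                       # N passes over the data
        for x_t, y_t in data:                         # for t : 1..|T|
            y_pred = max(labels, key=lambda y: sum(
                w_j * f_j for w_j, f_j in zip(w, feature_vector(x_t, y))))
            if y_pred != y_t:                         # update only on mistakes
                f_gold = feature_vector(x_t, y_t)
                f_pred = feature_vector(x_t, y_pred)
                w = [w_j + g_j - p_j for w_j, g_j, p_j in zip(w, f_gold, f_pred)]
    return w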
Linear Classifiers
Perceptron: Separability and Margin

- Given a training instance (x_t, y_t), define:
  - Ȳ_t = Y − {y_t}
  - i.e., Ȳ_t is the set of incorrect labels for x_t
- A training set T is separable with margin γ > 0 if there exists a vector u with ||u|| = 1 such that:

    u · f(x_t, y_t) − u · f(x_t, y') ≥ γ

  for all y' ∈ Ȳ_t, where ||u|| = √(Σ_j u_j²) (Euclidean or L2 norm)
- Assumption: the training set is separable with margin γ (a sketch for checking this is given below)
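A small sketch of how this definition could be checked for a candidate u and γ (helper names assumed, not part of the course code):

import math

def separates_with_margin(u, gamma, data, labels, feature_vector):
    """True iff u·f(x_t, y_t) − u·f(x_t, y') >= gamma for every incorrect label y'."""
    assert abs(math.sqrt(sum(u_j ** 2 for u_j in u)) - 1.0) < 1e-9, "u must have unit norm"
    dot = lambda a, b: sum(a_j * b_j for a_j, b_j in zip(a, b))
    for x_t, y_t in data:
        gold = dot(u, feature_vector(x_t, y_t))
        for y in labels:
            if y != y_t and gold - dot(u, feature_vector(x_t, y)) < gamma:
                return False
    return True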
Linear Classifiers
Perceptron: Main Theorem

- Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:

    mistakes made during training ≤ R² / γ²

  where R ≥ ||f(x_t, y_t) − f(x_t, y')|| for all (x_t, y_t) ∈ T and y' ∈ Ȳ_t
- Thus, after a finite number of training iterations, the error on the training set will converge to zero
- For proof, see Collins (2002)
Linear Classifiers
Practical Considerations

- The perceptron is sensitive to the order of training examples
  - Consider: 500 positive instances + 500 negative instances
- Shuffling: randomly permute training instances between iterations
- Voting and averaging (see the sketch below):
  - Let w_1, ..., w_n be all the weight vectors seen in training
  - The voted perceptron predicts the majority vote of w_1, ..., w_n
  - The averaged perceptron predicts using the average vector:

      w = (1/n) Σ_i w_i
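A minimal sketch of the averaged perceptron, extending the training loop sketched earlier (again with assumed helper names): the running sum of weight vectors is accumulated after every example and divided by the number of steps at the end.

def train_averaged_perceptron(data, labels, feature_vector, dim, num_epochs=10):
    w = [0.0] * dim
    w_sum = [0.0] * dim
    n = 0
    for _ in range(num_epochs):
        for x_t, y_t in data:
            y_pred = max(labels, key=lambda y: sum(
                w_j * f_j for w_j, f_j in zip(w, feature_vector(x_t, y))))
            if y_pred != y_t:
                f_gold = feature_vector(x_t, y_t)
                f_pred = feature_vector(x_t, y_pred)
                w = [w_j + g_j - p_j for w_j, g_j, p_j in zip(w, f_gold, f_pred)]
            w_sum = [s_j + w_j for s_j, w_j in zip(w_sum, w)]  # accumulate w after each example
            n += 1
    return [s_j / n for s_j in w_sum]  # w = (1/n) * sum of all weight vectors seen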
Linear Classifiers
Perceptron Summary

- Learns a linear classifier that minimizes error
  - Guaranteed to find a w in a finite amount of time
  - Improvement 1: shuffle training data between iterations
  - Improvement 2: average weight vectors seen during training
- Perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation:

      w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')

- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split
Linear Classifiers
Assignment 2
- Implement the perceptron (starter code in Python)
- Three subtasks:
  - Implement the linear classifier (dot product)
  - Implement the perceptron update
  - Evaluate on the spambase data set
- For VG:
  - Implement the averaged perceptron
Appendix: Proofs and Derivations
Convergence Proof for Perceptron
Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     Let y' = arg max_y w^(i) · f(x_t, y)
5.     if y' ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y')
7.       i = i + 1
8. return w^(i)

- w^(k−1) are the weights before the k-th mistake
- Suppose the k-th mistake is made at the t-th example, (x_t, y_t)
- y' = arg max_y w^(k−1) · f(x_t, y)
- y' ≠ y_t
- w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y')

Now: u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y')) ≥ u · w^(k−1) + γ

Now: w^(0) = 0 and u · w^(0) = 0, so by induction on k, u · w^(k) ≥ kγ

Now: since u · w^(k) ≤ ||u|| × ||w^(k)|| and ||u|| = 1, then ||w^(k)|| ≥ kγ

Now:

    ||w^(k)||² = ||w^(k−1)||² + ||f(x_t, y_t) − f(x_t, y')||² + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y'))
    ||w^(k)||² ≤ ||w^(k−1)||² + R²

(since R ≥ ||f(x_t, y_t) − f(x_t, y')|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y') ≤ 0)
Convergence Proof for Perceptron
Perceptron Learning Algorithm

- We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||² ≤ ||w^(k−1)||² + R²
- By induction on k, and since w^(0) = 0 and ||w^(0)||² = 0:

    ||w^(k)||² ≤ kR²

- Therefore, k²γ² ≤ ||w^(k)||² ≤ kR²
- Solving for k:

    k ≤ R² / γ²

- Therefore the number of errors is bounded!