NLP Programming Tutorial 10 - Neural Networks


Graham Neubig, Nara Institute of Science and Technology (NAIST)


Prediction Problems

Given x, predict y


Example we will use:

● Given an introductory sentence from Wikipedia
● Predict whether the article is about a person

Given: Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.
Predict: Yes!

Given: Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.
Predict: No!

This is binary classification (of course!)

Linear Classifiers

y = sign(w⋅φ(x)) = sign(∑i=1..I wi⋅φi(x))

● x: the input
● φ(x): vector of feature functions {φ1(x), φ2(x), …, φI(x)}
● w: the weight vector {w1, w2, …, wI}
● y: the prediction, +1 if “yes”, -1 if “no”
  (sign(v) is +1 if v >= 0, -1 otherwise)

Example Feature Functions: Unigram Features

● Equal to “the number of times a particular word appears”

x = A site , located in Maizuru , Kyoto

φunigram “A”(x) = 1        φunigram “site”(x) = 1       φunigram “located”(x) = 1
φunigram “in”(x) = 1       φunigram “Maizuru”(x) = 1    φunigram “Kyoto”(x) = 1
φunigram “,”(x) = 2
φunigram “the”(x) = 0      φunigram “temple”(x) = 0     …

● The rest are all 0
● For convenience, we use feature names (φunigram “A”) instead of feature indexes (φ1)
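
A unigram feature vector like the one above can be built with a few lines of Python. This is a minimal sketch assuming whitespace-tokenized input; the name create_features matches the training pseudo-code later in the tutorial, while the "unigram " prefix on the keys is just one possible naming choice.

    from collections import defaultdict

    def create_features(x):
        # count how many times each word appears in the sentence
        phi = defaultdict(int)
        for word in x.split():
            phi["unigram " + word] += 1
        return phi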

Calculating the Weighted Sum

x = A site , located in Maizuru , Kyoto

φunigram “A”(x)       = 1    wunigram “A”       = 0
φunigram “site”(x)    = 1    wunigram “site”    = -3
φunigram “located”(x) = 1    wunigram “located” = 0
φunigram “Maizuru”(x) = 1    wunigram “Maizuru” = 0
φunigram “,”(x)       = 2    wunigram “,”       = 0
φunigram “in”(x)      = 1    wunigram “in”      = 0
φunigram “Kyoto”(x)   = 1    wunigram “Kyoto”   = 0
φunigram “priest”(x)  = 0    wunigram “priest”  = 2
φunigram “black”(x)   = 0    wunigram “black”   = 0

w⋅φ(x) = 0 + -3 + 0 + 0 + 0 + 0 + 0 + 0 + 0 + … = -3 → No!

The Perceptron

● Think of it as a “machine” that calculates a weighted sum and takes its sign:

y = sign(∑i=1..I wi⋅φi(x))

[Figure: the unigram features from the previous slide feed into the weighted sum with the weights shown there, giving an output of -1.]

Problem: Linear Constraint

● A perceptron cannot achieve high accuracy on non-linear functions

[Figure: four points arranged X O / O X in the plane; no single line separates the X points from the O points.]

Neural Networks

● Neural networks connect multiple perceptrons together

[Figure: the input features feed several perceptrons, whose outputs in turn feed a final perceptron that produces the prediction (-1 in this example).]

● Motivation: they can express non-linear functions

Example:

● Build two classifiers:

φ(x1) = {-1, 1}    φ(x2) = {1, 1}
φ(x3) = {-1, -1}   φ(x4) = {1, -1}

[Figure: in the (φ1, φ2) plane, x1 and x4 are X and x2 and x3 are O, so the classes are not linearly separable.]

Classifier w1: weights φ1: 1, φ2: 1, bias: -1 → output y1
Classifier w2: weights φ1: -1, φ2: -1, bias: -1 → output y2

Example:

● These classifiers map the points to a new space:

φ(x1) = {-1, 1}  → y(x1) = {-1, -1}
φ(x2) = {1, 1}   → y(x2) = {1, -1}
φ(x3) = {-1, -1} → y(x3) = {-1, 1}
φ(x4) = {1, -1}  → y(x4) = {-1, -1}

[Figure: in the (y1, y2) space, the two X points x1 and x4 land on the same point {-1, -1}, while the O points move to {1, -1} and {-1, 1}.]

Example:

● In the new space, the examples are linearly classifiable!

Classifier y3: weights y1: 1, y2: 1, bias: 1

[Figure: a single line in the (y1, y2) space separates the O points {1, -1} and {-1, 1} from the X point {-1, -1}.]

Example:

● The final neural network:

[Figure: the inputs φ1 and φ2 feed two hidden perceptrons, w1 (φ1: 1, φ2: 1, bias: -1) and w2 (φ1: -1, φ2: -1, bias: -1), producing y1 and y2; these feed the output perceptron w3 (y1: 1, y2: 1, bias: 1), which produces the final prediction y4.]
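
The network above can be checked by hand or with a short self-contained sketch (sign is defined as on the earlier slide; the weights are the ones read off the figure):

    def sign(v):
        return 1 if v >= 0 else -1

    # the four example points and their classes from the figure (X -> -1, O -> +1)
    points = {(-1, 1): -1, (1, 1): 1, (-1, -1): 1, (1, -1): -1}
    for (p1, p2), gold in points.items():
        y1 = sign( p1 + p2 - 1)    # hidden perceptron w1: weights 1, 1, bias -1
        y2 = sign(-p1 - p2 - 1)    # hidden perceptron w2: weights -1, -1, bias -1
        y4 = sign(y1 + y2 + 1)     # output perceptron w3: weights 1, 1, bias 1
        print((p1, p2), y4, y4 == gold)   # all four points are classified correctly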

Representing a Neural Network

● Assume the network is fully connected and organized in layers
● Each perceptron has:
  ● a layer ID
  ● a weight vector

network = [ (1, w0), (1, w1), (1, w2), (2, w3) ]

[Figure: Layer 1 contains perceptrons 0, 1, and 2, which read the input features; Layer 2 contains perceptron 3, which reads the Layer 1 outputs.]
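
As a concrete illustration, each weight vector can be a plain dictionary from feature name to weight. The numbers and feature names below are made-up placeholders, not trained values; note that the Layer 2 perceptron's weights are keyed by the indices of the Layer 1 nodes, matching how the prediction code below stores their answers.

    # three layer-1 perceptrons reading the input features
    w0 = {"unigram priest": 0.8, "unigram site": -0.5}
    w1 = {"unigram Kyoto": 0.3, "unigram temple": 0.6}
    w2 = {"unigram site": -0.9, "unigram ,": 0.1}
    # one layer-2 perceptron reading the layer-1 answers (keyed by node index)
    w3 = {0: 1.0, 1: -0.7, 2: 0.4}
    network = [(1, w0), (1, w1), (1, w2), (2, w3)]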

Neural Network Prediction Process

● Predict one perceptron at a time, using the output of the previous layer

[Figure, built up over several slides: the input features feed the three Layer 1 perceptrons (nodes 0, 1, and 2), whose answers then feed the Layer 2 perceptron (node 3), which outputs the final prediction of -1.]

Review: Pseudo-code for Perceptron Prediction

    def predict_one(w, phi):
        # score = w⋅φ(x)
        score = 0
        for name, value in phi.items():
            if name in w:
                score += value * w[name]
        if score >= 0:
            return 1
        else:
            return -1
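
Running this on the weighted-sum example from earlier reproduces the score of -3 (the string keys here are one possible way of naming the unigram features; only the nonzero weights need to be stored):

    phi = {"unigram A": 1, "unigram site": 1, "unigram located": 1, "unigram Maizuru": 1,
           "unigram ,": 2, "unigram in": 1, "unigram Kyoto": 1}
    w = {"unigram site": -3, "unigram priest": 2}
    print(predict_one(w, phi))   # score = -3, so the prediction is -1 (“No!”)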

Pseudo-Code for NN Prediction

    def predict_nn(network, phi):
        # activations for each layer; y[0] is the input feature vector phi
        num_layers = max(layer for layer, weight in network)
        y = [phi] + [{} for _ in range(num_layers)]
        for i, (layer, weight) in enumerate(network):
            # predict this node's answer from the previous layer's output
            answer = predict_one(weight, y[layer - 1])
            # save the answer as a feature for the next layer
            y[layer][i] = answer
        # the answer of the last perceptron is the network's prediction
        return answer
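
As a usage sketch, the hand-built network from the example slides can be run through predict_nn. The hidden-node biases are supplied as a constant input feature "bias"; the output node's bias is dropped here because the hidden layer's feature dictionary only contains the node answers, and with sign(0) = +1 the four points still come out right. These representation choices are assumptions for illustration.

    network = [
        (1, {"phi1":  1, "phi2":  1, "bias": -1}),   # hidden node 0
        (1, {"phi1": -1, "phi2": -1, "bias": -1}),   # hidden node 1
        (2, {0: 1, 1: 1}),                           # output node 2 reads the hidden answers
    ]
    points = {"x1": (-1, 1), "x2": (1, 1), "x3": (-1, -1), "x4": (1, -1)}
    for name, (p1, p2) in points.items():
        phi = {"phi1": p1, "phi2": p2, "bias": 1}
        print(name, predict_nn(network, phi))   # x1, x4 → -1 (X); x2, x3 → +1 (O)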

Neural Network Activation Functions

● The NN described so far uses a step function:

y = sign(w⋅φ(x))

● The step function is not differentiable → use tanh instead:

y = tanh(w⋅φ(x))

Python:
    from math import tanh
    tanh(x)

[Plots: sign(w⋅φ(x)) jumps from -1 to +1 at 0, while tanh(w⋅φ(x)) is a smooth curve between -1 and +1.]
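
A quick illustration of the difference on a few arbitrary scores (nothing here is specific to the tutorial's data):

    from math import tanh

    def sign(v):
        return 1 if v >= 0 else -1

    for score in [-2.0, -0.5, 0.0, 0.5, 2.0]:
        # sign jumps between -1 and +1; tanh moves smoothly between them
        print(score, sign(score), round(tanh(score), 3))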

Learning a Perceptron w/ tanh

● First, calculate the error δ:

δ = y' - y        (y' = correct tag, y = system output)

● Then update each weight with:

w ← w + λ⋅δ⋅φ(x)

● where λ is the learning rate
● (for the step-function perceptron, δ = -2 or +2 and λ = 1/2)
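
A minimal sketch of this update for a single tanh perceptron and one training example (the function name, the default learning rate, and the use of dictionaries for w and φ are illustrative choices):

    from math import tanh

    def update_one(w, phi, label, lam=0.1):
        # forward pass: y = tanh(w⋅φ(x))
        y = tanh(sum(value * w.get(name, 0.0) for name, value in phi.items()))
        delta = label - y                  # error: correct tag minus system output
        for name, value in phi.items():
            # w ← w + λ⋅δ⋅φ(x)
            w[name] = w.get(name, 0.0) + lam * delta * value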

Problem: Don't Know the Correct Answer!

● For NNs, we only know the correct tag for the last layer

[Figure: in the layered network, the hidden nodes have unknown correct outputs (y' = ?), while the last node has a known correct tag (here y' = 1, although the network outputs y = -1).]

Answer: Back-Propagation

● Pass the error backwards along the network:

∑i δi wj,i

[Figure: node j feeds several downstream nodes; their errors δi are passed back to j, each weighted by the connecting weight wj,i (weights such as 0.1, 1, -0.3 and errors such as 0.2, 0.4, -0.9 in the figure).]

● Also consider the gradient of tanh: d tanh(z)/dz = 1 - tanh(z)², so with y = tanh(w⋅φ(x)) the gradient factor is 1 - y²

● Combine:

δj = (1 - yj²) ∑i δi wj,i

Back-Propagation Code

    update_nn(network, phi, y')
        create array δ
        calculate y using predict_nn
        for each node j in reverse order:
            if j is the last node:
                δj = y' - yj
            else:
                δj = (1 - yj²) ∑i δi wj,i
        for each node j:
            layer, w = network[j]
            for each name, val in y[layer-1]:
                w[name] += λ * δj * val
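
The pseudo-code above glosses over two details: the forward pass used for training should use tanh (the step function in predict_one has a gradient of zero everywhere), and the weight wj,i lives in node i's weight vector under node j's index. A runnable sketch that makes those choices explicit (one possible interpretation, not the reference solution; λ is written as lam):

    from math import tanh

    def update_nn(network, phi, label, lam=0.1):
        # forward pass with tanh; y[layer][j] is the output of node j, y[0] is phi
        # (assumes network lists layer-1 nodes before layer-2 nodes, as on the representation slide)
        num_layers = max(layer for layer, w in network)
        y = [phi] + [{} for _ in range(num_layers)]
        out = [0.0] * len(network)
        for j, (layer, w) in enumerate(network):
            score = sum(val * w.get(name, 0.0) for name, val in y[layer - 1].items())
            out[j] = tanh(score)
            y[layer][j] = out[j]
        # backward pass: calculate each node's error δ in reverse order
        delta = [0.0] * len(network)
        for j in reversed(range(len(network))):
            if j == len(network) - 1:
                delta[j] = label - out[j]            # error at the last (output) node
            else:
                # error passed back from the nodes in the next layer that read out[j]
                err = sum(delta[i] * w_i.get(j, 0.0)
                          for i, (l_i, w_i) in enumerate(network)
                          if l_i == network[j][0] + 1)
                delta[j] = (1 - out[j] ** 2) * err   # times the gradient of tanh
        # update every weight with the learning rate
        for j, (layer, w) in enumerate(network):
            for name, val in y[layer - 1].items():
                w[name] = w.get(name, 0.0) + lam * delta[j] * val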

Training process

    create network
    randomize the network weights
    for I iterations:
        for each labeled pair x, y in the data:
            phi = create_features(x)
            update_nn(network, phi, y)

● For the perceptron earlier, we initialized the weights to zero
● In a NN, initialize the weights randomly (so that the perceptrons do not all stay identical)
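
A minimal sketch of the "create network / randomize the network weights" step, for the exercise setup of one hidden layer with two hidden nodes. The function name, the initialization range, and the idea of collecting the feature names from the training data first are all assumptions, not part of the reference solution:

    import random

    def init_network(feature_names, num_hidden=2):
        # one hidden layer: each hidden node reads every input feature
        network = []
        for _ in range(num_hidden):
            w = {name: random.uniform(-0.1, 0.1) for name in feature_names}
            network.append((1, w))
        # one output node reading the hidden answers (stored under node indices 0..num_hidden-1)
        w_out = {i: random.uniform(-0.1, 0.1) for i in range(num_hidden)}
        network.append((2, w_out))
        return network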

Exercise


Exercise (1)

● Write two programs:
  ● train-nn: creates a neural network model
  ● test-nn: reads a neural network model

● Test train-nn:
  ● Input: test/03-train-input.txt
  ● Use one iteration, one hidden layer, and two hidden nodes
  ● Calculate the updates by hand and make sure they are correct

Exercise (2)

● Train a model on data/titles-en-train.labeled
● Predict the labels of data/titles-en-test.word
● Grade your answers:
  script/grade-prediction.py data-en/titles-en-test.labeled your_answer
● Compare:
  ● with single perceptron/SVM classifiers
  ● with different neural network structures

Thank You!
