VS 298: Neural Computation

Due: 2 October 2008

Problem Set 3 Solutions
Professor: Bruno Olshausen

GSI: Amir Khosrowshahi

This solution set was provided by students Hania Köver and Jack Culpepper from a prior offering of this class. Please see the accompanying .zip for all code and figures.

1 Hebbian learning and PCA

In Hebbian learning, we increase weights that produce positive correlations between the inputs and outputs of the network. The neural substrate's analog of this is to strengthen synapses that cause the inputs and outputs of the system to spike together – i.e., "cells that fire together wire together." Figure 1(a) shows the weights produced when pure Hebbian learning was used with a single neuron trained on the 2d Gaussian data in D1. In this case, the learning rule is extremely simple:

\dot{w} \propto \langle y x \rangle

In Hebbian learning, the weights are never decreased. In the neural substrate, this would require synapses to have boundless "conductance," which is unrealistic. However, the same effect can be achieved if the weights do not increase without bound but simply maintain the same relative strengths. One way to effect this is via Oja's rule, which is essentially just Hebbian learning augmented with a weight decay term that decreases the weights in proportion to the square of the output produced:

\dot{w} \propto \langle y x \rangle - \langle y^2 \rangle w

Figure 1(b) shows the weights learned when a single neuron was trained using Oja's rule. If we work backwards and extract the objective function that this learning rule optimizes, we find that it is:

E = \langle |x - y w|^2 \rangle

Note that for the case of a single neuron, the weight vector points in the direction that accounts for most of the variance in the data.

We also trained two neurons using Sanger's rule, which is basically just a multi-unit version of Oja's rule. In this case, the weights of neuron i experience Hebbian learning with subtractive weight decay on the residual signal, after neurons 1 to i - 1 have accounted for their portions:

\Delta w_{ij} = \eta \, y_i \left( x_j - \sum_{k=1}^{i} y_k w_{kj} \right)

Since the learning rule has a sequential component (the summation), it does not map well onto a neural substrate, but it is a cool way of learning the principal components of your dataset. Figure 1(c) shows the weights learned when two neurons were trained using Sanger's rule.

Lastly, we trained a single neuron using Oja's rule and two neurons using Sanger's rule on the mixture-of-Gaussians data in D2. Figure 2 shows that the weight vectors do not point in meaningful directions in this case! This points to the fact that doing PCA on data that does not follow a Gaussian distribution will produce principal components that have no meaningful interpretation.
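For reference, the three update rules can be written in a few vectorized lines. This is only a minimal sketch, assuming a data matrix X (one sample per column), a weight matrix W (one weight vector per column), and a learning rate eta; the scripts we actually ran are reproduced in Appendix A.

% Sketch of the Hebbian, Oja, and Sanger updates (see Appendix A for the full scripts).
Y = W'*X;                                  % unit outputs, one row per unit
dW_hebb = X*Y';                            % pure Hebbian: <y x>
dW_oja  = X*Y' - W*diag(sum(Y.^2,2));      % Oja: <y x> - <y^2> w for each unit
dW_sanger = zeros(size(W));                % Sanger: Hebbian learning on the residual
R = X;
for i = 1:size(W,2)
    R = R - W(:,i)*Y(i,:);                 % subtract unit i's reconstruction
    dW_sanger(:,i) = R*Y(i,:)';            % Hebbian update against the residual
end
W = W + eta*dW_sanger;                     % e.g., take one Sanger step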


Figure 1: (a) Unconstrained Hebbian learning. (b) Hebbian learning, constrained via Oja’s rule. (c) Two neurons trained using Sanger’s rule.


Figure 2: (a) Hebbian learning, constrained via Oja’s rule. (b) Two neurons trained using Sanger’s rule.


2 Eigenfaces

We experimented with learning 'Eigenfaces' using Sanger's rule. Our face data contained the faces of 16 different people, each in 3 poses. We preprocessed the data by subtracting the mean of the 48 images (the mean face), shown in Figure 3(a). Next, we used Sanger's rule to learn the first two principal components; these weights are shown in Figure 3(b). We then projected the face data onto these components and plotted the magnitudes of these projections using one color for each person (thus, there are three dots in each color). We expected the patterns of relative spatial positions for the three same-colored dots to be the same across people, and this seems to be mildly true. Figure 4 shows the magnitudes of the first two principal components of the faces.

The faces can be reconstructed from the magnitudes of activation of each principal component, plus the mean face. For datasets with low intrinsic dimensionality, it should be possible to reconstruct every individual element using only a small number of principal components, and we found this to be true of the faces. Figure 5 shows that we are able to reconstruct a face (face 5 in this case) recognizably well from only 16 principal components. Considering that each face is a 3840-dimensional vector, this is quite amazing.
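The reconstruction works by projecting a mean-subtracted face onto the learned components and then summing the components back up, weighted by those projections. Here is a minimal sketch, assuming the mean face mean_face, the centered faces faces2_cen, and the learned weight matrix w (one component per column) from the Appendix A scripts are in the workspace:

% Reconstruct face 5 from its first L principal-component coefficients.
x  = faces2_cen(:,5);          % face 5, with the mean face already subtracted
L  = 16;                       % number of components to use
z  = w(:,1:L)'*x;              % projections onto the first L components
xr = mean_face + w(:,1:L)*z;   % reconstruction back in pixel space
imagesc(reshape(xr,60,64)), colormap(gray), axis image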


Figure 3: (a) The mean face, which was subtracted from the data in preprocessing. (b) Two neurons trained using Sanger’s rule to learn the first two ‘Eigenfaces’.


Figure 4: The faces, projected onto the first two principal components of the dataset. Each subject appears in three different poses; all three poses are plotted using the same color.


Figure 5: The 5th face, reconstructed using k principal components, for 1 ≤ k ≤ 16.

3 Competitive learning

Winner-take-all learning (a.k.a. K-means) optimizes the following objective function:

E(w_i) = \frac{1}{2} \sum_{i,\mu} M_i^\mu \, |x^{(\mu)} - w_i|^2

where

M_i^\mu = \begin{cases} 1 & \text{if } i = i^*(\mu); \\ 0 & \text{otherwise,} \end{cases}

and i^*(\mu) is the winning unit for pattern \mu. Our learning rule is thus:

\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i} = \eta \sum_\mu M_i^\mu \left( x^{(\mu)} - w_i \right)

We trained a 4-unit winner-take-all network on the data in D2, and experimented with techniques for reducing the effect of 'dead neurons,' i.e., neurons that never win. Specifically, we tried:

• Leaky learning, in which the losing neurons have their weights adjusted in the same direction as the winner, albeit more slowly (a sketch of this update is given below).
• Adding noise to the weights of losing neurons.
• Adding noise to the weights of both winning and losing neurons on a schedule that decreases the noise as learning proceeds.

We also experimented with:

• Batch-mode learning, in which the weight update is accumulated over the entire batch and then normalized by the number of data points in the batch.
• Stochastic gradient learning, in which the weights are updated each time a data point is considered.

We were a bit frustrated because no combination of these techniques resulted in a system that, when initialized with random, normally distributed weights, would always learn to point one vector along each lobe of the data. Since there are exactly four lobes and exactly four neurons, we believed this was a reasonable request.
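As a sketch, a single leaky-learning update for one data point x looks like the following (variable names follow the Appendix A script; eta_win and eta_los are the winner and loser learning rates, with eta_los much smaller):

% One stochastic winner-take-all update with leaky learning.
% w is N x P (one weight vector per unit); x is the current data point.
y = w'*x;                          % responses of all P units
[win_val, win_pos] = max(y);       % i*(mu): the winning unit
for i = 1:size(w,2)
    if i == win_pos
        w(:,i) = w(:,i) + eta_win*(x - w(:,i));   % full step toward x
    else
        w(:,i) = w(:,i) + eta_los*(x - w(:,i));   % small "leaky" step
    end
end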

4 Vector quantization of faces

We ran the same winner-take-all learning algorithm that we developed for the 2d mixture-of-Gaussians data on the faces. Figure 7 shows the resulting weights, which are a bit disappointing. We used the eigenmovie.m function to look at the effect of turning each individual component learned this way up and down on top of the mean face, and compared these components to those learned by Sanger's rule (principal components). In neither case do the components detectably correspond to anything "meaningful" (we were hoping for nose, mouth, fat face, skinny face, etc.).



Figure 6: Winner-take-all learning with leaky learning on the 2d data.



Figure 7: Vector quantized faces. Note that some of the neurons have learned the same thing! This is a pathological solution that we were unable to purge from even the 2-dimensional mixture-of-Gaussians data.


Appendix A: Matlab Code

% hebb.m - script to do hebbian learning
load data2d
% data array
X=D1;
[N K]=size(X);
% plot data
plot(X(1,:),X(2,:),'.')
axis xy, axis image
hold on
% initialize weights
w=randn(N,1);
% plot weight vector
h=plot([0 w(1)],[0 w(2)],'r','LineWidth',2);
num_trials=100;
eta=0.1/K;
for t=1:num_trials
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  % compute dw: Hebbian learning rule
  dw=X*Y';
  % update weight vector by dw
  w=w + eta*dw;
  % replot weight vector
  set(h,'XData',[0 w(1)],'YData',[0 w(2)]);
  drawnow
  pause(0.25)
end
hold off
title('Unconstrained Hebbian learning');

% hebb.m - script to do hebbian learning (Oja's rule on D1)
%
% you must first load data2d.mat
% data array
X=D1;
[N K]=size(X);
% plot data
plot(X(1,:),X(2,:),'.')
axis xy, axis image
hold on
% initialize weights
w=randn(N,1);
% plot weight vector
h=plot([0 w(1)],[0 w(2)],'r','LineWidth',2);
num_trials=100;
eta=0.1/K;
for t=1:num_trials
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  % compute dw: Hebbian learning rule with Oja's decay term
  dw=X*Y'-w*sum(Y.^2,2);
  % update weight vector by dw
  w=w + eta*dw;
  % replot weight vector
  set(h,'XData',[0 w(1)],'YData',[0 w(2)]);
  drawnow
  pause(0.25)
end
hold off
title('Constrained Hebbian learning (via Oja''s rule)');


% hebb.m - script to do hebbian learning
%
% sanger's rule with two neurons
load data2d;
% data array
X=D1;
[N K]=size(X);
% plot data
plot(X(1,:),X(2,:),'.')
title('Sanger''s rule with two neurons');
axis xy, axis image
hold on
% initialize weights
w=randn(N,2);
% plot weight vectors
h0=plot([0 w(1)],[0 w(2)],'r','LineWidth',2);
h1=plot([0 w(3)],[0 w(4)],'r','LineWidth',2);
num_trials=10000;
eta=0.1/K;
for t=1:num_trials
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  % compute dw: Hebbian learning rule
  dw=zeros(size(w));
  r=X;
  for i=1:2;
    r=r-(Y(i,:)'*w(:,i)')';
    dw(:,i)=Y(i,:)*r';
  end
  % update weight vector by dw
  w=w + eta*dw;
  % replot weight vectors
  set(h0,'XData',[0 w(1)],'YData',[0 w(2)]);
  set(h1,'XData',[0 w(3)],'YData',[0 w(4)]);
  drawnow
  pause(0.25)
end
hold off


% hebb.m - script to do hebbian learning (Oja's rule on D2)
%
% you must first load data2d.mat
% data array
X=D2;
[N K]=size(X);
% plot data
plot(X(1,:),X(2,:),'.')
axis xy, axis image
hold on
% initialize weights
w=randn(N,1);
% plot weight vector
h=plot([0 w(1)],[0 w(2)],'r','LineWidth',2);
num_trials=100;
eta=0.1/K;
for t=1:num_trials
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  % compute dw: Hebbian learning rule with Oja's decay term
  dw=X*Y'-w*sum(Y.^2,2);
  % update weight vector by dw
  w=w + eta*dw;
  % replot weight vector
  set(h,'XData',[0 w(1)],'YData',[0 w(2)]);
  drawnow
  pause(0.25)
end
hold off
title('Constrained Hebbian learning (via Oja''s rule)');


% hebb.m - script to do hebbian learning
%
% sanger's rule with two neurons
load data2d;
% data array
X=D2;
[N K]=size(X);
% plot data
plot(X(1,:),X(2,:),'.')
title('Sanger''s rule with two neurons');
axis xy, axis image
hold on
% initialize weights
w=randn(N,2);
% plot weight vectors
h0=plot([0 w(1)],[0 w(2)],'r','LineWidth',2);
h1=plot([0 w(3)],[0 w(4)],'r','LineWidth',2);
num_trials=100;
eta=0.1/K;
for t=1:num_trials
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  % compute dw: Hebbian learning rule
  dw=zeros(size(w));
  r=X;
  for i=1:2;
    r=r-(Y(i,:)'*w(:,i)')';
    dw(:,i)=Y(i,:)*r';
  end
  % update weight vector by dw
  w=w + eta*dw;
  % replot weight vectors
  set(h0,'XData',[0 w(1)],'YData',[0 w(2)]);
  set(h1,'XData',[0 w(3)],'YData',[0 w(4)]);
  drawnow
  pause(0.25)
end
hold off


% compute the mean face, subtract it from each face, and display the results
mean_face = mean(faces2,2);
for i=1:48
  faces2_cen(:,i) = faces2(:,i) - mean_face;
end
figure(1)
colormap(gray)
for i=1:48
  imagesc(reshape(faces2_cen(:,i),60,64)), axis image
  drawnow
end
imagesc(reshape(mean(faces2,2),60,64));


% hebb.m - script to do hebbian learning
%
% sanger's rule with 16 units on the mean-subtracted faces (faces2_cen)
% data array
X=faces2_cen;
[N K]=size(X);
% 16 components
L = 16;
% plot data
%plot(X(1,:),X(2,:),'.')
%axis xy, axis image
%hold on
% initialize weights
w=randn(N,L);
display_every = 100;
% plot weight vector
%h0=plot([0 w(1)],[0 w(2)],'r','LineWidth',2);
%h1=plot([0 w(3)],[0 w(4)],'r','LineWidth',2);
num_trials=10000;
eta=0.00000001/K;
for t=1:num_trials
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  % compute dw: Hebbian learning rule
  dw=zeros(size(w));
  r=X;
  for i=1:L;
    r=r-(Y(i,:)'*w(:,i)')';
    dw(:,i)=Y(i,:)*r';
  end
  % update weight vector by dw
  w=w + eta*dw;
  if (mod(t,display_every) == 0),
    figure(1);
    for j=1:L
      subplot(sqrt(L),sqrt(L),j), imagesc(reshape(w(:,j),60,64)), colormap(gray), axis image;
    end
  end
  drawnow
  fprintf('l2 of w %.4f, trial %d\n',sum(w(:).^2),t);
  %pause
end
%hold off


% reconstruct each face incrementally from its principal-component coefficients
for i=1:K
  Z=w'*X(:,i);
  r = mean_face;
  figure(1);
  for j=1:L
    r = r + Z(j)*w(:,j);
    subplot(sqrt(L),sqrt(L),j), imagesc(reshape(r,60,64)), colormap(gray), title(sprintf('face %d using %d pcs',i,j)), axis image;
  end
  pause
end


% hebb.m - script to do hebbian learning
clf;
% you must first load data2d.mat
load data2d;
% data array
X=D2;
[N K]=size(X);
% plot data;
plot(X(1,:),X(2,:),'.');
axis xy, axis image;
hold on;
%number of units in network
P=4;
% initialize weights
w=randn(N,P);
% plot weight vector
h0=plot([0 w(1)],[0 w(2)],'r','LineWidth',2);
h1=plot([0 w(3)],[0 w(4)],'r','LineWidth',2);
h2=plot([0 w(5)],[0 w(6)],'r','LineWidth',2);
h3=plot([0 w(7)],[0 w(8)],'r','LineWidth',2);
hold on;
num_trials=100;
eta_win=0.1;
eta_los=0.01;
dw = zeros(size(w));
for t=1:num_trials;
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  % compute dw: Hebbian learning rule
  for input=1:1000;
    [win_val,win_pos] = max(Y(:,input));
    dw = zeros(size(w));
    for i=1:4;
      if (i == win_pos),
        dw(:,i)= dw(:,i) + eta_win*(X(:,input)-w(:,i));
      else
        dw(:,i)= dw(:,i) + eta_los*(X(:,input)-w(:,i));
      end
    end
    w = w + dw;
    % normalize weights to have unit norm
    norm = sqrt(sum(w.^2));
    for i=1:4
      w(:,i) = w(:,i)/norm(i);
    end
  end
  % replot weight vector
  set(h0,'XData',[0 w(1)],'YData',[0 w(2)]);
  set(h1,'XData',[0 w(3)],'YData',[0 w(4)]);
  set(h2,'XData',[0 w(5)],'YData',[0 w(6)]);
  set(h3,'XData',[0 w(7)],'YData',[0 w(8)]);
  drawnow;
end
hold off;
title('Winner take all with leaking learning');

% hebb.m - script to do winner-take-all (vector quantization) learning on the faces
clf;
load faces2;
[N K]=size(faces2);
mean_face = mean(faces2,2);
for i=1:K
  faces2_cen(:,i) = faces2(:,i) - mean_face;
end
% data array
X=faces2_cen;
% plot data;
plot(X(1,:),X(2,:),'.');
axis xy, axis image;
hold on;
%number of units in network
P=16;
% initialize weights (left commented out in the original; assumes w already exists in the workspace)
%w=randn(N,P);
display_every=1;
num_trials=10000;
eta_win=0.1;
eta_los=0.01;
dw = zeros(size(w));
for t=1:num_trials;
  % compute neuron output for all data (can be done as one line)
  Y=w'*X;
  for input=1:K
    [win_val,win_pos] = max(Y(:,input));
    dw = zeros(size(w));
    for i=1:P;
      if (i == win_pos),
        dw(:,i)= dw(:,i) + eta_win*(X(:,input)-w(:,i));   % (original referenced undefined 'eta')
      else
        dw(:,i)= dw(:,i) + eta_los*(X(:,input)-w(:,i));   % (original referenced undefined 'eta2')
      end
    end
    w = w + dw;
  end
  if (mod(t,display_every) == 0),
    figure(1);
    for j=1:P
      subplot(sqrt(P),sqrt(P),j), imagesc(reshape(w(:,j),60,64)), colormap(gray), axis image off;
    end
  end
  drawnow;
  fprintf('update %d\n',t);
end
hold off;
