Matrix Calculus - Notes on the Derivative of a Trace

Johannes Traa

This write-up elucidates the rules of matrix calculus for expressions involving the trace of a function of a matrix X:

f = \mathrm{tr}\left[ g(X) \right] .  (1)

We would like to take the derivative of f with respect to X:

\frac{\partial f}{\partial X} = \, ?  (2)

One strategy is to write the trace expression as a scalar using index notation, take the derivative, and re-write in matrix form. An easier way is to reduce the problem to one or more smaller problems where the results for simpler derivatives can be applied. It's brute-force vs. bottom-up.

MATRIX-VALUED DERIVATIVE

The derivative of a scalar f with respect to a matrix X ∈ R^{M×N} can be written as:

\frac{\partial f}{\partial X} =
\begin{bmatrix}
\frac{\partial f}{\partial X_{11}} & \frac{\partial f}{\partial X_{12}} & \cdots & \frac{\partial f}{\partial X_{1N}} \\
\frac{\partial f}{\partial X_{21}} & \frac{\partial f}{\partial X_{22}} & \cdots & \frac{\partial f}{\partial X_{2N}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial X_{M1}} & \frac{\partial f}{\partial X_{M2}} & \cdots & \frac{\partial f}{\partial X_{MN}}
\end{bmatrix} .  (3)

So, the result is the same size as X.
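To make (3) concrete, here is a small numerical sketch (my addition, not part of the note) that builds the matrix-valued derivative entry by entry with central differences; the helper name `numgrad` and the test function f(X) = Σ X_{ij}^2 are illustrative choices.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    """Central-difference approximation of d f / d X, laid out as in (3)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

X = np.random.default_rng(0).standard_normal((4, 3))
G = numgrad(lambda Y: np.sum(Y ** 2), X)   # f(X) = sum of squares, so d f / d X = 2 X
print(G.shape, np.allclose(G, 2 * X))      # (4, 3) True
```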

MATRIX AND INDEX NOTATION

It is useful to be able to convert between matrix notation and index notation. For example, the product AB has elements:

\left[ AB \right]_{ik} = \sum_j A_{ij} B_{jk} ,  (4)

and the matrix product ABC^T has elements:

\left[ ABC^T \right]_{il} = \sum_j A_{ij} \left[ BC^T \right]_{jl} = \sum_j A_{ij} \sum_k B_{jk} C_{lk} = \sum_j \sum_k A_{ij} B_{jk} C_{lk} .  (5)
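As a quick check of the index bookkeeping in (5) (my addition, not from the note), np.einsum can spell out the same sums explicitly:

```python
import numpy as np

# [A B C^T]_{il} = sum_{j,k} A_{ij} B_{jk} C_{lk}, as in (5)
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 4))
lhs = A @ B @ C.T
rhs = np.einsum('ij,jk,lk->il', A, B, C)   # explicit sums over j and k
print(np.allclose(lhs, rhs))               # True
```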

FIRST-ORDER DERIVATIVES

EXAMPLE 1

Consider this example:

f = \mathrm{tr}\left[ AXB \right] .  (6)

We can write this using index notation as:

f = \sum_i \left[ AXB \right]_{ii} = \sum_i \sum_j A_{ij} \left[ XB \right]_{ji} = \sum_i \sum_j A_{ij} \sum_k X_{jk} B_{ki} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{ki} .  (7)

Taking the derivative with respect to X_{jk}, we get:

\frac{\partial f}{\partial X_{jk}} = \sum_i A_{ij} B_{ki} = \left[ BA \right]_{kj} .  (8)

The result has to be the same size as X, so we know that the indices of the rows and columns must be j and k, respectively. This means we have to transpose the result above to write the derivative in matrix form as:

\frac{\partial \, \mathrm{tr}\left[ AXB \right]}{\partial X} = A^T B^T .  (9)
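A finite-difference spot check of (9), added by me as a sketch; the sizes of A, X, and B are arbitrary as long as AXB is square.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    # central differences, entry by entry (illustrative helper, not from the note)
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 4))
G = numgrad(lambda Y: np.trace(A @ Y @ B), X)
print(np.allclose(G, A.T @ B.T))  # True, matching (9)
```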


EXAMPLE 2

Similarly, we have:

f = \mathrm{tr}\left[ AX^T B \right] = \sum_i \sum_j \sum_k A_{ij} X_{kj} B_{ki} ,  (10)

so that the derivative is:

\frac{\partial f}{\partial X_{kj}} = \sum_i A_{ij} B_{ki} = \left[ BA \right]_{kj} .  (11)

The X term appears in (10) with indices kj, so we need to write the derivative in matrix form such that k is the row index and j is the column index. Thus, we have:

\frac{\partial \, \mathrm{tr}\left[ AX^T B \right]}{\partial X} = BA .  (12)
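The same kind of numerical spot check (my sketch, not the note's) for (12):

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 4))
G = numgrad(lambda Y: np.trace(A @ Y.T @ B), X)
print(np.allclose(G, B @ A))  # True, matching (12)
```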

MULTIPLE-ORDER

Now consider a more complicated example:

f = \mathrm{tr}\left[ AXBXC^T \right]  (13)

  = \sum_i \sum_j \sum_k \sum_l \sum_m A_{ij} X_{jk} B_{kl} X_{lm} C_{im} .  (14)

The derivative has contributions from both appearances of X.

TAKE 1

In index notation:

\frac{\partial f}{\partial X_{jk}} = \sum_i \sum_l \sum_m A_{ij} B_{kl} X_{lm} C_{im} = \left[ BXC^T A \right]_{kj} ,  (15)

\frac{\partial f}{\partial X_{lm}} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{kl} C_{im} = \left[ C^T AXB \right]_{ml} .  (16)

Transposing appropriately and summing the terms together, we have:

\frac{\partial \, \mathrm{tr}\left[ AXBXC^T \right]}{\partial X} = A^T C X^T B^T + B^T X^T A^T C .  (17)

TAKE 2

We can skip this tedious process by applying (9) for each appearance of X:

\frac{\partial \, \mathrm{tr}\left[ AXBXC^T \right]}{\partial X} = \frac{\partial \, \mathrm{tr}\left[ AXD \right]}{\partial X} + \frac{\partial \, \mathrm{tr}\left[ EXC^T \right]}{\partial X} = A^T D^T + E^T C ,  (18)

where D = BXC^T and E = AXB. So we just evaluate the matrix derivative for each appearance of X assuming that everything else is a constant (including other X's). To see why this rule is useful, consider the following beast:

f = \mathrm{tr}\left[ AXX^T BCX^T XC \right] .  (19)

We can immediately write down the derivative using (9) and (12):

\frac{\partial \, \mathrm{tr}\left[ AXX^T BCX^T XC \right]}{\partial X} = A^T \left( X^T BCX^T XC \right)^T + \left( BCX^T XC \right)\left( AX \right) + \left( XC \right)\left( AXX^T BC \right) + \left( AXX^T BCX^T \right)^T C^T  (20)

= A^T C^T X^T X C^T B^T X + BCX^T XCAX + XCAXX^T BC + XC^T B^T XX^T A^T C^T .  (21)
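Here is a numerical sketch (mine, not the note's) that checks both the index-notation result (17) and the per-appearance shortcut behind (21); the matrix shapes are arbitrary choices that make the traces well defined.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(4)

# (17): d tr[A X B X C^T] / d X = A^T C X^T B^T + B^T X^T A^T C
A, X, B, C = (rng.standard_normal(s) for s in [(2, 3), (3, 4), (4, 3), (2, 4)])
G = numgrad(lambda Y: np.trace(A @ Y @ B @ Y @ C.T), X)
print(np.allclose(G, A.T @ C @ X.T @ B.T + B.T @ X.T @ A.T @ C))  # True

# (21): one contribution per appearance of X in tr[A X X^T B C X^T X C]
A, X, B, C = (rng.standard_normal(s) for s in [(4, 3), (3, 4), (3, 4), (4, 4)])
G = numgrad(lambda Y: np.trace(A @ Y @ Y.T @ B @ C @ Y.T @ Y @ C), X)
analytic = (A.T @ C.T @ X.T @ X @ C.T @ B.T @ X
            + B @ C @ X.T @ X @ C @ A @ X
            + X @ C @ A @ X @ X.T @ B @ C
            + X @ C.T @ B.T @ X @ X.T @ A.T @ C.T)
print(np.allclose(G, analytic))  # True
```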

FROBENIUS NORM

The Frobenius norm shows up when we have an optimization problem involving a matrix factorization and we want to minimize a sum-of-squares error criterion:

f = \sum_i \sum_k \left( X_{ik} - \sum_j W_{ij} H_{jk} \right)^2 = \left\| X - WH \right\|_F^2 = \mathrm{tr}\left[ \left( X - WH \right) \left( X - WH \right)^T \right] .  (22)

We can work with the expression in index notation, but it's easier to work directly with matrices and apply the results derived earlier. Suppose we want to find the derivative with respect to W. Expanding the matrix outer product, we have:

f = \mathrm{tr}\left[ XX^T \right] - \mathrm{tr}\left[ XH^T W^T \right] - \mathrm{tr}\left[ WHX^T \right] + \mathrm{tr}\left[ WHH^T W^T \right] .  (23)

Applying (9) and (12), we easily deduce that:

\frac{\partial \, \mathrm{tr}\left[ \left( X - WH \right) \left( X - WH \right)^T \right]}{\partial W} = -2XH^T + 2WHH^T .  (24)
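A quick check of (24) (my own sketch): the finite-difference gradient of the Frobenius criterion with respect to W matches −2XH^T + 2WHH^T.

```python
import numpy as np

def numgrad(f, W, eps=1e-6):
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps; Wm[idx] -= eps
        G[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return G

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 5))
W = rng.standard_normal((6, 2))
H = rng.standard_normal((2, 5))
G = numgrad(lambda U: np.sum((X - U @ H) ** 2), W)   # ||X - W H||_F^2
print(np.allclose(G, -2 * X @ H.T + 2 * W @ H @ H.T))  # True
```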

LOG

Consider this trace expression:

f = \mathrm{tr}\left[ V^T \log \left( AXB \right) \right] = \sum_i \sum_j V_{ij} \log \left( \sum_m \sum_n A_{im} X_{mn} B_{nj} \right) ,  (25)

where the log is applied element-wise. Taking the derivative with respect to X_{mn}, we get:

\frac{\partial f}{\partial X_{mn}} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m \sum_n A_{im} X_{mn} B_{nj}} \right) A_{im} B_{nj} = \sum_i \sum_j \left[ \frac{V}{AXB} \right]_{ij} A_{im} B_{nj} ,  (26)

where the division is also element-wise. Thus:

\frac{\partial \, \mathrm{tr}\left[ V^T \log \left( AXB \right) \right]}{\partial X} = A^T \left( \frac{V}{AXB} \right) B^T .  (27)

Similarly:

\frac{\partial \, \mathrm{tr}\left[ V^T \log \left( AX^T B \right) \right]}{\partial X} = B \left( \frac{V}{AX^T B} \right)^T A .  (28)

These bear a spooky resemblance to (9) and (12).
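A numerical spot check of (27), added by me; positive entries are used so the element-wise log stays in its domain.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(6)
A = rng.uniform(0.5, 1.5, (4, 3))   # positive entries keep A X B inside log's domain
X = rng.uniform(0.5, 1.5, (3, 5))
B = rng.uniform(0.5, 1.5, (5, 2))
V = rng.standard_normal((4, 2))
# tr[V^T log(A X B)] = sum of V (element-wise) times log(A X B)
G = numgrad(lambda Y: np.sum(V * np.log(A @ Y @ B)), X)
print(np.allclose(G, A.T @ (V / (A @ X @ B)) @ B.T))  # True, matching (27)
```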

DIAG

EXAMPLE 1

Consider the tricky case of a diag(·) operator:

f = \mathrm{tr}\left[ A \, \mathrm{diag}(x) \, B \right] = \sum_i \sum_j A_{ij} x_j B_{ji} .  (29)

Taking the derivative, we have:

\frac{\partial f}{\partial x_j} = \sum_i A_{ij} B_{ji} = \left[ \left( A^T \odot B \right) \mathbf{1} \right]_j ,  (30)

where \odot denotes the element-wise (Hadamard) product and \mathbf{1} is the vector of ones. So we can write:

\frac{\partial \, \mathrm{tr}\left[ A \, \mathrm{diag}(x) \, B \right]}{\partial x} = \left( A^T \odot B \right) \mathbf{1} .  (31)
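A small check of (31) (my addition); `numgrad_vec` is an illustrative helper for vector arguments.

```python
import numpy as np

def numgrad_vec(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps; xm[j] -= eps
        g[j] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
x = rng.standard_normal(3)
g = numgrad_vec(lambda y: np.trace(A @ np.diag(y) @ B), x)
print(np.allclose(g, (A.T * B) @ np.ones(4)))  # True: (A^T ⊙ B) 1
```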

EXAMPLE 2

Consider the following take on the last example:

f = \mathrm{tr}\left[ J A \, \mathrm{diag}(x) \, B \right] = \sum_i \sum_j \sum_k A_{ij} x_j B_{jk} ,  (32)

where J is the matrix of ones.

TAKE 1

Taking the derivative, we have:

\frac{\partial f}{\partial x_j} = \sum_i \sum_k A_{ij} B_{jk} = \left( \sum_i A_{ij} \right) \left( \sum_k B_{jk} \right) = \left[ A^T \mathbf{1} \odot B \mathbf{1} \right]_j .  (33)

So we can write:

\frac{\partial \, \mathrm{tr}\left[ J A \, \mathrm{diag}(x) \, B \right]}{\partial x} = A^T \mathbf{1} \odot B \mathbf{1} .  (34)

TAKE 2

We could have derived this result from the previous example using the rotation property of the trace operator:

f = \mathrm{tr}\left[ J A \, \mathrm{diag}(x) \, B \right] = \mathrm{tr}\left[ \mathbf{1} \mathbf{1}^T A \, \mathrm{diag}(x) \, B \right] = \mathrm{tr}\left[ \mathbf{1}^T A \, \mathrm{diag}(x) \, B \mathbf{1} \right] = \mathrm{tr}\left[ a^T \mathrm{diag}(x) \, b \right] ,  (35)

where we have defined a = A^T \mathbf{1} and b = B \mathbf{1}. Applying (31), we have:

\frac{\partial \, \mathrm{tr}\left[ a^T \mathrm{diag}(x) \, b \right]}{\partial x} = \left( a \odot b \right) \mathbf{1} = A^T \mathbf{1} \odot B \mathbf{1} .  (36)
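A sketch (mine) checking the result in (34)/(36); the shape chosen for the ones matrix J is my assumption, picked only so that J A diag(x) B is square.

```python
import numpy as np

def numgrad_vec(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps; xm[j] -= eps
        g[j] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
x = rng.standard_normal(3)
J = np.ones((5, 4))                   # sized so J A diag(x) B is square (my choice)
g = numgrad_vec(lambda y: np.trace(J @ A @ np.diag(y) @ B), x)
print(np.allclose(g, (A.T @ np.ones(4)) * (B @ np.ones(5))))  # True: A^T 1 ⊙ B 1
```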

EXAMPLE 3

Consider a more complicated example:

f = \mathrm{tr}\left[ V^T \log \left( A \, \mathrm{diag}(x) \, B \right) \right] = \sum_i \sum_j V_{ij} \log \left( \sum_m A_{im} x_m B_{mj} \right) .  (37)

Taking the derivative with respect to x_m, we have:

\frac{\partial f}{\partial x_m} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} x_m B_{mj}} \right) A_{im} B_{mj}  (38)

= \sum_i \sum_j \left[ \frac{V}{A \, \mathrm{diag}(x) \, B} \right]_{ij} A_{im} B_{mj}  (39)

= \left[ \left( A^T \left( \frac{V}{A \, \mathrm{diag}(x) \, B} \right) \odot B \right) \mathbf{1} \right]_m .  (40)

The final result is:

\frac{\partial \, \mathrm{tr}\left[ V^T \log \left( A \, \mathrm{diag}(x) \, B \right) \right]}{\partial x} = \left( A^T \left( \frac{V}{A \, \mathrm{diag}(x) \, B} \right) \odot B \right) \mathbf{1} .  (41)
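A numerical check of (41) (my sketch), again with element-wise log and division and positive entries so the log argument stays positive.

```python
import numpy as np

def numgrad_vec(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for m in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[m] += eps; xm[m] -= eps
        g[m] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(9)
A = rng.uniform(0.5, 1.5, (4, 3))
B = rng.uniform(0.5, 1.5, (3, 5))
x = rng.uniform(0.5, 1.5, 3)           # positive so the log argument stays positive
V = rng.standard_normal((4, 5))
g = numgrad_vec(lambda y: np.sum(V * np.log(A @ np.diag(y) @ B)), x)
W = V / (A @ np.diag(x) @ B)           # element-wise division, as in (41)
print(np.allclose(g, ((A.T @ W) * B) @ np.ones(5)))  # True
```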

EXAMPLE 4

How about when we have a trace composed of a sum of expressions, each of which depends on what row of a matrix B is chosen:

f = \mathrm{tr}\left[ \sum_k V^T \log \left( A \, \mathrm{diag}\left( B_{k:} X \right) C \right) \right] = \sum_k \sum_i \sum_j V_{ij} \log \left( \sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj} \right) .  (42)

Taking the derivative, we get:

\frac{\partial f}{\partial X_{nm}} = \sum_k \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj}} \right) A_{im} B_{kn} C_{mj}  (43)

= \sum_k B_{kn} \sum_i \sum_j \left[ \frac{V}{A \, \mathrm{diag}\left( B_{k:} X \right) C} \right]_{ij} A_{im} C_{mj}  (44)

= \sum_k B_{kn} \left[ \mathbf{1}^T \left( \left( \frac{V}{A \, \mathrm{diag}\left( B_{k:} X \right) C} \right)^T A \odot C^T \right) \right]_m ,  (45)

so we can write:

\frac{\partial \, \mathrm{tr}\left[ \sum_k V^T \log \left( A \, \mathrm{diag}\left( B_{k:} X \right) C \right) \right]}{\partial X} = B^T \begin{bmatrix} \mathbf{1}^T \left( \left( \frac{V}{A \, \mathrm{diag}\left( B_{1:} X \right) C} \right)^T A \odot C^T \right) \\ \vdots \\ \mathbf{1}^T \left( \left( \frac{V}{A \, \mathrm{diag}\left( B_{K:} X \right) C} \right)^T A \odot C^T \right) \end{bmatrix} .  (46)
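Finally, a numerical sketch (my addition) of (46); the shapes N, M, K, P, Q are my own assumptions, chosen only so that all the products are well defined.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(10)
N, M, K, P, Q = 3, 4, 5, 2, 6
X = rng.uniform(0.5, 1.5, (N, M))
B = rng.uniform(0.5, 1.5, (K, N))     # positive so diag(B[k] @ X) stays positive
A = rng.uniform(0.5, 1.5, (P, M))
C = rng.uniform(0.5, 1.5, (M, Q))
V = rng.standard_normal((P, Q))

def f(Y):
    # sum over k of tr[V^T log(A diag(B_{k:} Y) C)], element-wise log
    return sum(np.sum(V * np.log(A @ np.diag(B[k] @ Y) @ C)) for k in range(K))

rows = np.stack([
    np.ones(Q) @ (((V / (A @ np.diag(B[k] @ X) @ C)).T @ A) * C.T)
    for k in range(K)
])                                    # K-by-M: one row per k, as stacked in (46)
print(np.allclose(numgrad(f, X), B.T @ rows))  # True
```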

CONCLUSION

We've learned two things. First, if we don't know how to find the derivative of an expression using matrix calculus directly, we can always fall back on index notation and convert back to matrices at the end. This reduces a potentially unintuitive matrix-valued problem into one involving scalars, which we are used to. Second, it's less painful to massage an expression into a familiar form and apply previously derived identities.
