Matrix Calculus - Notes on the Derivative of a Trace

Johannes Traa

This write-up elucidates the rules of matrix calculus for expressions involving the trace of a function of a matrix X:

f = \mathrm{tr}\left[ g(X) \right] .  (1)

We would like to take the derivative of f with respect to X:

\frac{\partial f}{\partial X} = \, ?  (2)

One strategy is to write the trace expression as a scalar using index notation, take the derivative, and re-write in matrix form. An easier way is to reduce the problem to one or more smaller problems where the results for simpler derivatives can be applied. It's brute-force vs. bottom-up.

MATRIX-VALUED DERIVATIVE

The derivative of a scalar f with respect to a matrix X ∈ R^{M×N} can be written as:

\frac{\partial f}{\partial X} =
\begin{bmatrix}
\frac{\partial f}{\partial X_{11}} & \frac{\partial f}{\partial X_{12}} & \cdots & \frac{\partial f}{\partial X_{1N}} \\
\frac{\partial f}{\partial X_{21}} & \frac{\partial f}{\partial X_{22}} & \cdots & \frac{\partial f}{\partial X_{2N}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f}{\partial X_{M1}} & \frac{\partial f}{\partial X_{M2}} & \cdots & \frac{\partial f}{\partial X_{MN}}
\end{bmatrix} .  (3)

So, the result is the same size as X.
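To make (3) concrete, here is a small numerical sketch (my addition, not part of the note) that builds the matrix-valued derivative entry by entry with central differences; the helper name `numgrad` and the test function f(X) = Σ X_{ij}^2 are illustrative choices.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    """Central-difference approximation of d f / d X, laid out as in (3)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps
        Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

X = np.random.default_rng(0).standard_normal((4, 3))
G = numgrad(lambda Y: np.sum(Y ** 2), X)   # f(X) = sum of squares, so d f / d X = 2 X
print(G.shape, np.allclose(G, 2 * X))      # (4, 3) True
```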

MATRIX AND INDEX NOTATION

It is useful to be able to convert between matrix notation and index notation. For example, the product AB has elements:

\left[ AB \right]_{ik} = \sum_j A_{ij} B_{jk} ,  (4)

and the matrix product ABC^T has elements:

\left[ ABC^T \right]_{il} = \sum_j A_{ij} \left[ BC^T \right]_{jl} = \sum_j A_{ij} \sum_k B_{jk} C_{lk} = \sum_j \sum_k A_{ij} B_{jk} C_{lk} .  (5)
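As a quick check of the index bookkeeping in (5) (my addition, not from the note), np.einsum can spell out the same sums explicitly:

```python
import numpy as np

# [A B C^T]_{il} = sum_{j,k} A_{ij} B_{jk} C_{lk}, as in (5)
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 4))
lhs = A @ B @ C.T
rhs = np.einsum('ij,jk,lk->il', A, B, C)   # explicit sums over j and k
print(np.allclose(lhs, rhs))               # True
```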

FIRST-ORDER DERIVATIVES

EXAMPLE 1

Consider this example:

f = \mathrm{tr}\left[ AXB \right] .  (6)

We can write this using index notation as:

f = \sum_i \left[ AXB \right]_{ii} = \sum_i \sum_j A_{ij} \left[ XB \right]_{ji} = \sum_i \sum_j A_{ij} \sum_k X_{jk} B_{ki} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{ki} .  (7)

Taking the derivative with respect to X_{jk}, we get:

\frac{\partial f}{\partial X_{jk}} = \sum_i A_{ij} B_{ki} = \left[ BA \right]_{kj} .  (8)

The result has to be the same size as X, so we know that the indices of the rows and columns must be j and k, respectively. This means we have to transpose the result above to write the derivative in matrix form as:

\frac{\partial \, \mathrm{tr}\left[ AXB \right]}{\partial X} = A^T B^T .  (9)
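A finite-difference spot check of (9), added by me as a sketch; the sizes of A, X, and B are arbitrary as long as AXB is square.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    # central differences, entry by entry (illustrative helper, not from the note)
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 4))
G = numgrad(lambda Y: np.trace(A @ Y @ B), X)
print(np.allclose(G, A.T @ B.T))  # True, matching (9)
```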


EXAMPLE 2

Similarly, we have:

f = \mathrm{tr}\left[ AX^T B \right] = \sum_i \sum_j \sum_k A_{ij} X_{kj} B_{ki} ,  (10)

so that the derivative is:

\frac{\partial f}{\partial X_{kj}} = \sum_i A_{ij} B_{ki} = \left[ BA \right]_{kj} .  (11)

The X term appears in (10) with indices kj, so we need to write the derivative in matrix form such that k is the row index and j is the column index. Thus, we have:

\frac{\partial \, \mathrm{tr}\left[ AX^T B \right]}{\partial X} = BA .  (12)
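The same kind of numerical spot check (my sketch, not the note's) for (12):

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
X = rng.standard_normal((5, 3))
B = rng.standard_normal((5, 4))
G = numgrad(lambda Y: np.trace(A @ Y.T @ B), X)
print(np.allclose(G, B @ A))  # True, matching (12)
```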

MULTIPLE-ORDER

Now consider a more complicated example:

f = \mathrm{tr}\left[ AXBXC^T \right]  (13)

  = \sum_i \sum_j \sum_k \sum_l \sum_m A_{ij} X_{jk} B_{kl} X_{lm} C_{im} .  (14)

The derivative has contributions from both appearances of X.

TAKE 1

In index notation:

\frac{\partial f}{\partial X_{jk}} = \sum_i \sum_l \sum_m A_{ij} B_{kl} X_{lm} C_{im} = \left[ BXC^T A \right]_{kj} ,  (15)

\frac{\partial f}{\partial X_{lm}} = \sum_i \sum_j \sum_k A_{ij} X_{jk} B_{kl} C_{im} = \left[ C^T AXB \right]_{ml} .  (16)

Transposing appropriately and summing the terms together, we have:

\frac{\partial \, \mathrm{tr}\left[ AXBXC^T \right]}{\partial X} = A^T C X^T B^T + B^T X^T A^T C .  (17)

TAKE 2

We can skip this tedious process by applying (9) for each appearance of X:

\frac{\partial \, \mathrm{tr}\left[ AXBXC^T \right]}{\partial X} = \frac{\partial \, \mathrm{tr}\left[ AXD \right]}{\partial X} + \frac{\partial \, \mathrm{tr}\left[ EXC^T \right]}{\partial X} = A^T D^T + E^T C ,  (18)

where D = BXC^T and E = AXB. So we just evaluate the matrix derivative for each appearance of X assuming that everything else is a constant (including other X's). To see why this rule is useful, consider the following beast:

f = \mathrm{tr}\left[ AXX^T BCX^T XC \right] .  (19)

We can immediately write down the derivative using (9) and (12):

\frac{\partial \, \mathrm{tr}\left[ AXX^T BCX^T XC \right]}{\partial X} = A^T \left( X^T BCX^T XC \right)^T + \left( BCX^T XC \right)\left( AX \right) + \left( XC \right)\left( AXX^T BC \right) + \left( AXX^T BCX^T \right)^T C^T  (20)

= A^T C^T X^T X C^T B^T X + BCX^T XCAX + XCAXX^T BC + XC^T B^T XX^T A^T C^T .  (21)
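Here is a numerical sketch (mine, not the note's) that checks both the index-notation result (17) and the per-appearance shortcut behind (21); the matrix shapes are arbitrary choices that make the traces well defined.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(4)

# (17): d tr[A X B X C^T] / d X = A^T C X^T B^T + B^T X^T A^T C
A, X, B, C = (rng.standard_normal(s) for s in [(2, 3), (3, 4), (4, 3), (2, 4)])
G = numgrad(lambda Y: np.trace(A @ Y @ B @ Y @ C.T), X)
print(np.allclose(G, A.T @ C @ X.T @ B.T + B.T @ X.T @ A.T @ C))  # True

# (21): one contribution per appearance of X in tr[A X X^T B C X^T X C]
A, X, B, C = (rng.standard_normal(s) for s in [(4, 3), (3, 4), (3, 4), (4, 4)])
G = numgrad(lambda Y: np.trace(A @ Y @ Y.T @ B @ C @ Y.T @ Y @ C), X)
analytic = (A.T @ C.T @ X.T @ X @ C.T @ B.T @ X
            + B @ C @ X.T @ X @ C @ A @ X
            + X @ C @ A @ X @ X.T @ B @ C
            + X @ C.T @ B.T @ X @ X.T @ A.T @ C.T)
print(np.allclose(G, analytic))  # True
```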

FROBENIUS NORM

The Frobenius norm shows up when we have an optimization problem involving a matrix factorization and we want to minimize a sum-of-squares error criterion:

f = \sum_i \sum_k \left( X_{ik} - \sum_j W_{ij} H_{jk} \right)^2 = \left\| X - WH \right\|_F^2 = \mathrm{tr}\left[ \left( X - WH \right) \left( X - WH \right)^T \right] .  (22)

We can work with the expression in index notation, but it's easier to work directly with matrices and apply the results derived earlier. Suppose we want to find the derivative with respect to W. Expanding the matrix outer product, we have:

f = \mathrm{tr}\left[ XX^T \right] - \mathrm{tr}\left[ XH^T W^T \right] - \mathrm{tr}\left[ WHX^T \right] + \mathrm{tr}\left[ WHH^T W^T \right] .  (23)

Applying (9) and (12), we easily deduce that:

\frac{\partial \, \mathrm{tr}\left[ \left( X - WH \right) \left( X - WH \right)^T \right]}{\partial W} = -2XH^T + 2WHH^T .  (24)
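A quick check of (24) (my own sketch): the finite-difference gradient of the Frobenius criterion with respect to W matches −2XH^T + 2WHH^T.

```python
import numpy as np

def numgrad(f, W, eps=1e-6):
    G = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps; Wm[idx] -= eps
        G[idx] = (f(Wp) - f(Wm)) / (2 * eps)
    return G

rng = np.random.default_rng(5)
X = rng.standard_normal((6, 5))
W = rng.standard_normal((6, 2))
H = rng.standard_normal((2, 5))
G = numgrad(lambda U: np.sum((X - U @ H) ** 2), W)   # ||X - W H||_F^2
print(np.allclose(G, -2 * X @ H.T + 2 * W @ H @ H.T))  # True
```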

LOG

Consider this trace expression:

f = \mathrm{tr}\left[ V^T \log \left( AXB \right) \right] = \sum_i \sum_j V_{ij} \log \left( \sum_m \sum_n A_{im} X_{mn} B_{nj} \right) ,  (25)

where the log is applied element-wise. Taking the derivative with respect to X_{mn}, we get:

\frac{\partial f}{\partial X_{mn}} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m \sum_n A_{im} X_{mn} B_{nj}} \right) A_{im} B_{nj} = \sum_i \sum_j \left[ \frac{V}{AXB} \right]_{ij} A_{im} B_{nj} ,  (26)

where the division is also element-wise. Thus:

\frac{\partial \, \mathrm{tr}\left[ V^T \log \left( AXB \right) \right]}{\partial X} = A^T \left( \frac{V}{AXB} \right) B^T .  (27)

Similarly:

\frac{\partial \, \mathrm{tr}\left[ V^T \log \left( AX^T B \right) \right]}{\partial X} = B \left( \frac{V}{AX^T B} \right)^T A .  (28)

These bear a spooky resemblance to (9) and (12).
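A numerical spot check of (27), added by me; positive entries are used so the element-wise log stays in its domain.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(6)
A = rng.uniform(0.5, 1.5, (4, 3))   # positive entries keep A X B inside log's domain
X = rng.uniform(0.5, 1.5, (3, 5))
B = rng.uniform(0.5, 1.5, (5, 2))
V = rng.standard_normal((4, 2))
# tr[V^T log(A X B)] = sum of V (element-wise) times log(A X B)
G = numgrad(lambda Y: np.sum(V * np.log(A @ Y @ B)), X)
print(np.allclose(G, A.T @ (V / (A @ X @ B)) @ B.T))  # True, matching (27)
```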

DIAG

EXAMPLE 1

Consider the tricky case of a diag(·) operator:

f = \mathrm{tr}\left[ A \, \mathrm{diag}(x) \, B \right] = \sum_i \sum_j A_{ij} x_j B_{ji} .  (29)

Taking the derivative, we have:

\frac{\partial f}{\partial x_j} = \sum_i A_{ij} B_{ji} = \left[ \left( A^T \odot B \right) \mathbf{1} \right]_j ,  (30)

where \odot denotes the element-wise (Hadamard) product and \mathbf{1} is the vector of ones. So we can write:

\frac{\partial \, \mathrm{tr}\left[ A \, \mathrm{diag}(x) \, B \right]}{\partial x} = \left( A^T \odot B \right) \mathbf{1} .  (31)
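A small check of (31) (my addition); `numgrad_vec` is an illustrative helper for vector arguments.

```python
import numpy as np

def numgrad_vec(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps; xm[j] -= eps
        g[j] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 4))
x = rng.standard_normal(3)
g = numgrad_vec(lambda y: np.trace(A @ np.diag(y) @ B), x)
print(np.allclose(g, (A.T * B) @ np.ones(4)))  # True: (A^T ⊙ B) 1
```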

EXAMPLE 2

Consider the following take on the last example:

f = \mathrm{tr}\left[ J A \, \mathrm{diag}(x) \, B \right] = \sum_i \sum_j \sum_k A_{ij} x_j B_{jk} ,  (32)

where J is the matrix of ones.

TAKE 1

Taking the derivative, we have:

\frac{\partial f}{\partial x_j} = \sum_i \sum_k A_{ij} B_{jk} = \left( \sum_i A_{ij} \right) \left( \sum_k B_{jk} \right) = \left[ A^T \mathbf{1} \odot B \mathbf{1} \right]_j .  (33)

So we can write:

\frac{\partial \, \mathrm{tr}\left[ J A \, \mathrm{diag}(x) \, B \right]}{\partial x} = A^T \mathbf{1} \odot B \mathbf{1} .  (34)

TAKE 2

We could have derived this result from the previous example using the rotation property of the trace operator:

f = \mathrm{tr}\left[ J A \, \mathrm{diag}(x) \, B \right] = \mathrm{tr}\left[ \mathbf{1} \mathbf{1}^T A \, \mathrm{diag}(x) \, B \right] = \mathrm{tr}\left[ \mathbf{1}^T A \, \mathrm{diag}(x) \, B \mathbf{1} \right] = \mathrm{tr}\left[ a^T \mathrm{diag}(x) \, b \right] ,  (35)

where we have defined a = A^T \mathbf{1} and b = B \mathbf{1}. Applying (31), we have:

\frac{\partial \, \mathrm{tr}\left[ a^T \mathrm{diag}(x) \, b \right]}{\partial x} = \left( a \odot b \right) \mathbf{1} = A^T \mathbf{1} \odot B \mathbf{1} .  (36)
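A sketch (mine) checking the result in (34)/(36); the shape chosen for the ones matrix J is my assumption, picked only so that J A diag(x) B is square.

```python
import numpy as np

def numgrad_vec(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for j in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps; xm[j] -= eps
        g[j] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(8)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
x = rng.standard_normal(3)
J = np.ones((5, 4))                   # sized so J A diag(x) B is square (my choice)
g = numgrad_vec(lambda y: np.trace(J @ A @ np.diag(y) @ B), x)
print(np.allclose(g, (A.T @ np.ones(4)) * (B @ np.ones(5))))  # True: A^T 1 ⊙ B 1
```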

EXAMPLE 3

Consider a more complicated example:

f = \mathrm{tr}\left[ V^T \log \left( A \, \mathrm{diag}(x) \, B \right) \right] = \sum_i \sum_j V_{ij} \log \left( \sum_m A_{im} x_m B_{mj} \right) .  (37)

Taking the derivative with respect to x_m, we have:

\frac{\partial f}{\partial x_m} = \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} x_m B_{mj}} \right) A_{im} B_{mj}  (38)

= \sum_i \sum_j \left[ \frac{V}{A \, \mathrm{diag}(x) \, B} \right]_{ij} A_{im} B_{mj}  (39)

= \left[ \left( A^T \left( \frac{V}{A \, \mathrm{diag}(x) \, B} \right) \odot B \right) \mathbf{1} \right]_m .  (40)

The final result is:

\frac{\partial \, \mathrm{tr}\left[ V^T \log \left( A \, \mathrm{diag}(x) \, B \right) \right]}{\partial x} = \left( A^T \left( \frac{V}{A \, \mathrm{diag}(x) \, B} \right) \odot B \right) \mathbf{1} .  (41)
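A numerical check of (41) (my sketch), again with element-wise log and division and positive entries so the log argument stays positive.

```python
import numpy as np

def numgrad_vec(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for m in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp[m] += eps; xm[m] -= eps
        g[m] = (f(xp) - f(xm)) / (2 * eps)
    return g

rng = np.random.default_rng(9)
A = rng.uniform(0.5, 1.5, (4, 3))
B = rng.uniform(0.5, 1.5, (3, 5))
x = rng.uniform(0.5, 1.5, 3)           # positive so the log argument stays positive
V = rng.standard_normal((4, 5))
g = numgrad_vec(lambda y: np.sum(V * np.log(A @ np.diag(y) @ B)), x)
W = V / (A @ np.diag(x) @ B)           # element-wise division, as in (41)
print(np.allclose(g, ((A.T @ W) * B) @ np.ones(5)))  # True
```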

EXAMPLE 4

How about when we have a trace composed of a sum of expressions, each of which depends on what row of a matrix B is chosen:

f = \mathrm{tr}\left[ \sum_k V^T \log \left( A \, \mathrm{diag}\left( B_{k:} X \right) C \right) \right] = \sum_k \sum_i \sum_j V_{ij} \log \left( \sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj} \right) .  (42)

Taking the derivative, we get:

\frac{\partial f}{\partial X_{nm}} = \sum_k \sum_i \sum_j \left( \frac{V_{ij}}{\sum_m A_{im} \left( \sum_n B_{kn} X_{nm} \right) C_{mj}} \right) A_{im} B_{kn} C_{mj}  (43)

= \sum_k B_{kn} \sum_i \sum_j \left[ \frac{V}{A \, \mathrm{diag}\left( B_{k:} X \right) C} \right]_{ij} A_{im} C_{mj}  (44)

= \sum_k B_{kn} \left[ \mathbf{1}^T \left( \left( \frac{V}{A \, \mathrm{diag}\left( B_{k:} X \right) C} \right)^T A \odot C^T \right) \right]_m ,  (45)

so we can write:

\frac{\partial \, \mathrm{tr}\left[ \sum_k V^T \log \left( A \, \mathrm{diag}\left( B_{k:} X \right) C \right) \right]}{\partial X} = B^T \begin{bmatrix} \mathbf{1}^T \left( \left( \frac{V}{A \, \mathrm{diag}\left( B_{1:} X \right) C} \right)^T A \odot C^T \right) \\ \vdots \\ \mathbf{1}^T \left( \left( \frac{V}{A \, \mathrm{diag}\left( B_{K:} X \right) C} \right)^T A \odot C^T \right) \end{bmatrix} .  (46)
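Finally, a numerical sketch (my addition) of (46); the shapes N, M, K, P, Q are my own assumptions, chosen only so that all the products are well defined.

```python
import numpy as np

def numgrad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        Xp, Xm = X.copy(), X.copy()
        Xp[idx] += eps; Xm[idx] -= eps
        G[idx] = (f(Xp) - f(Xm)) / (2 * eps)
    return G

rng = np.random.default_rng(10)
N, M, K, P, Q = 3, 4, 5, 2, 6
X = rng.uniform(0.5, 1.5, (N, M))
B = rng.uniform(0.5, 1.5, (K, N))     # positive so diag(B[k] @ X) stays positive
A = rng.uniform(0.5, 1.5, (P, M))
C = rng.uniform(0.5, 1.5, (M, Q))
V = rng.standard_normal((P, Q))

def f(Y):
    # sum over k of tr[V^T log(A diag(B_{k:} Y) C)], element-wise log
    return sum(np.sum(V * np.log(A @ np.diag(B[k] @ Y) @ C)) for k in range(K))

rows = np.stack([
    np.ones(Q) @ (((V / (A @ np.diag(B[k] @ X) @ C)).T @ A) * C.T)
    for k in range(K)
])                                    # K-by-M: one row per k, as stacked in (46)
print(np.allclose(numgrad(f, X), B.T @ rows))  # True
```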

CONCLUSION

We've learned two things. First, if we don't know how to find the derivative of an expression using matrix calculus directly, we can always fall back on index notation and convert back to matrices at the end. This reduces a potentially unintuitive matrix-valued problem into one involving scalars, which we are used to. Second, it's less painful to massage an expression into a familiar form and apply previously derived identities.
