Lecture 2 3B1B Optimization
Michaelmas 2016
A. Zisserman

• Newton's method
• Line search
• Quasi-Newton methods
• Least-Squares and Gauss-Newton methods
• Downhill simplex (amoeba) algorithm

Optimization for General Functions

f(x, y) = e^x (4x^2 + 2y^2 + 4xy + 2x + 1)

[Figure: surface plot of f(x, y)]

Apply methods developed using quadratic Taylor series expansion

Rosenbrock's function

f(x, y) = 100(y - x^2)^2 + (1 - x)^2

Minimum is at [1, 1]

[Figure: contour plot of the Rosenbrock function]

Steepest descent

• The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation)

[Figures: steepest descent path on the Rosenbrock function, full view and zoomed view]

• The zig-zag behaviour is clear in the zoomed view (100 iterations)
• The algorithm crawls along the valley
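For concreteness, here is a minimal Matlab sketch of this procedure on the Rosenbrock function. Crude backtracking halving stands in for the cubic-interpolation line minimization, so iteration counts will differ from the plots above; treat it as an illustration, not the routine used for the figures.

% A minimal sketch of steepest descent on the Rosenbrock function.
% Backtracking halving stands in for the cubic-interpolation line minimization.
f    = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
grad = @(x) [-400*x(1)*(x(2)-x(1)^2) - 2*(1-x(1)); 200*(x(2)-x(1)^2)];

x = [-1.9; 2];
for n = 1:10000
    g = grad(x);
    if norm(g) < 1e-3, break; end
    d = -g;                                     % steepest-descent direction
    a = 1;                                      % step length by backtracking
    while f(x + a*d) > f(x) && a > 1e-12, a = a/2; end
    x = x + a*d;
end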

Performance issues for optimization algorithms

1. Number of iterations required

2. Cost per iteration

3. Memory footprint

4. Region of convergence

Recall from lecture 1: Newton's method in 1D

Fit a quadratic approximation to f(x) using both gradient and curvature information at x.

• Expand f(x) locally using a Taylor series:

  f(x + \delta x) = f(x) + \delta x \, f'(x) + \frac{\delta x^2}{2} f''(x) + \text{h.o.t.}

• Find the \delta x which minimizes this local quadratic approximation:

  f'(x + \delta x) = f'(x) + \delta x \, f''(x) = 0

• Rearranging gives

  \delta x = -\frac{f'(x)}{f''(x)}

• Update for x:

  x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}
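As a concrete illustration of the 1D update, here is a minimal Matlab sketch. The test function f(x) = x^4 - 3x^2 + x and the starting point are chosen purely for illustration and are not from the slides.

% Newton's method in 1D on an illustrative function (not from the slides):
% f(x) = x^4 - 3x^2 + x, so f'(x) = 4x^3 - 6x + 1 and f''(x) = 12x^2 - 6.
fp  = @(x) 4*x^3 - 6*x + 1;
fpp = @(x) 12*x^2 - 6;

x = 2;                              % start near a local minimum
for n = 1:20
    dx = -fp(x)/fpp(x);             % delta_x = -f'(x)/f''(x)
    x = x + dx;
    if abs(dx) < 1e-10, break; end
end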

Recall from lecture 1: Taylor expansion in 2D

A function may be approximated locally by its Taylor series expansion about a point x_0:

f(x_0 + \delta x) \approx f(x_0)
  + \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) \begin{pmatrix} \delta x \\ \delta y \end{pmatrix}
  + \frac{1}{2} (\delta x, \delta y) \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} \begin{pmatrix} \delta x \\ \delta y \end{pmatrix}
  + \text{h.o.t.}

The expansion to second order is a quadratic function:

f(x_0 + \delta x) = a + g^\top \delta x + \frac{1}{2} \delta x^\top H \, \delta x

Newton's method in ND

Expand f(x) by its Taylor series about the point x_n:

f(x_n + \delta x) \approx f(x_n) + g_n^\top \delta x + \frac{1}{2} \delta x^\top H_n \, \delta x

where the gradient is the vector

g_n = \nabla f(x_n) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_N} \right)^\top

and the Hessian is the symmetric matrix

H_n = H(x_n) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_1 \partial x_N} & \cdots & \frac{\partial^2 f}{\partial x_N^2} \end{bmatrix}

For a minimum we require that \nabla f(x) = 0, and so

\nabla f(x) = g_n + H_n \delta x = 0

with solution \delta x = -H_n^{-1} g_n. This gives the iterative update

x_{n+1} = x_n - H_n^{-1} g_n

Assume that H is positive definite (all eigenvalues greater than zero). Then

x_{n+1} = x_n + \delta x = x_n - H_n^{-1} g_n

• If f(x) is quadratic, then the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The solution \delta x = -H_n^{-1} g_n is guaranteed to be a downhill direction (provided that H is positive definite).
• For numerical reasons the inverse is not actually computed; instead \delta x is computed as the solution of H \delta x = -g_n.
• Rather than jump straight to x_n - H_n^{-1} g_n, it is better to perform a line search, which ensures global convergence:

x_{n+1} = x_n - \alpha_n H_n^{-1} g_n

• If H = I then this reduces to steepest descent.
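To make the update concrete, here is a minimal Matlab sketch of the damped Newton iteration on the Rosenbrock function. Backtracking halving stands in for the cubic-interpolation line search, so treat it as an illustration rather than the exact routine behind the plots below.

% A minimal sketch of the damped Newton iteration on the Rosenbrock function.
f    = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
grad = @(x) [-400*x(1)*(x(2)-x(1)^2) - 2*(1-x(1)); 200*(x(2)-x(1)^2)];
hess = @(x) [1200*x(1)^2 - 400*x(2) + 2, -400*x(1); -400*x(1), 200];

x = [-1.9; 2];
for n = 1:100
    g = grad(x);
    if norm(g) < 1e-3, break; end
    dx = -(hess(x)\g);              % solve H*dx = -g rather than inverting H
    a = 1;                          % line search on the step length alpha
    while f(x + a*dx) > f(x) && a > 1e-12, a = a/2; end
    x = x + a*dx;
end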

Newton's method - example

[Figures: Newton method with line search on the Rosenbrock function, full view and zoomed view; gradient < 1e-3 after 15 iterations; ellipses show successive quadratic approximations]

• The algorithm converges in only 15 iterations compared to hundreds for steepest descent.
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.

Quasi-Newton methods

• If the problem size is large and the Hessian matrix is dense then it may be infeasible/inconvenient to compute it directly.

• Quasi-Newton methods avoid this problem by keeping a “rolling estimate” of H (x), updated at each iteration using new gradient information.

• Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and also Davidon, Fletcher and Powell (DFP).

e.g. in 1D, derivatives can be estimated from differences of function values.

First derivatives:

f'(x_0 + h/2) = \frac{f_1 - f_0}{h}    and    f'(x_0 - h/2) = \frac{f_0 - f_{-1}}{h}

Second derivative:

f''(x_0) = \frac{ \frac{f_1 - f_0}{h} - \frac{f_0 - f_{-1}}{h} }{h} = \frac{f_1 - 2 f_0 + f_{-1}}{h^2}

[Figure: f(x) sampled at x_{-1}, x_0, x_{+1}, spaced h apart]

Similarly, for H_{n+1}, build an approximation from H_n, g_n, g_{n+1}, x_n, x_{n+1}.
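A quick numerical check of the second-difference formula; the test function exp(x) and the step size are chosen here purely for illustration.

% Numerical check of the second-difference formula on f(x) = exp(x).
f  = @(x) exp(x);
x0 = 0.7;  h = 1e-3;
d2 = (f(x0+h) - 2*f(x0) + f(x0-h)) / h^2;   % (f1 - 2*f0 + f-1)/h^2
disp([d2, exp(x0)])                         % compare with the exact f''(x0)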

Quasi-Newton: BFGS

• Set H_0 = I.
• Update according to

H_{n+1} = H_n + \frac{q_n q_n^\top}{q_n^\top s_n} - \frac{(H_n s_n)(H_n s_n)^\top}{s_n^\top H_n s_n}

where

s_n = x_{n+1} - x_n,    q_n = g_{n+1} - g_n

• The matrix itself is not stored, but rather represented compactly by a few stored vectors.
• The estimate H_{n+1} is used to form a local quadratic approximation as before.
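Here is a minimal Matlab sketch of this update inside a damped quasi-Newton loop on the Rosenbrock function. It stores H explicitly and uses crude backtracking, whereas a practical BFGS (such as the one inside fminunc) keeps a compact or inverse representation and a proper Wolfe line search, so this is an illustration only.

% A minimal sketch of the BFGS Hessian update on the Rosenbrock function.
f    = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
grad = @(x) [-400*x(1)*(x(2)-x(1)^2) - 2*(1-x(1)); 200*(x(2)-x(1)^2)];

x = [-1.9; 2];  H = eye(2);  g = grad(x);
for n = 1:200
    if norm(g) < 1e-3, break; end
    dx = -(H\g);                                 % quasi-Newton step: H*dx = -g
    a = 1;
    while f(x + a*dx) > f(x) && a > 1e-12, a = a/2; end
    s = a*dx;                                    % s_n = x_{n+1} - x_n
    xnew = x + s;  gnew = grad(xnew);
    q = gnew - g;                                % q_n = g_{n+1} - g_n
    H = H + (q*q')/(q'*s) - (H*s)*(H*s)'/(s'*H*s);   % BFGS update of H
    x = xnew;  g = gnew;
end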

Example

[Figure: BFGS quasi-Newton path on the Rosenbrock function]

• The method converges in 25 iterations, compared to 15 for the full-Newton method.
• In Matlab the optimization function ‘fminunc’ uses a BFGS quasi-Newton method for medium scale optimization problems.

Matlab – fminunc

>> f = '100*(x(2)-x(1)^2)^2+(1-x(1))^2';
>> GRAD = '[100*(4*x(1)^3-4*x(1)*x(2))+2*x(1)-2; 100*(2*x(2)-2*x(1)^2)]';

Choose options for BFGS quasi-Newton:
>> OPTIONS = optimset('LargeScale','off', 'HessUpdate','bfgs');
>> OPTIONS = optimset(OPTIONS,'gradobj','on');

Start point:
>> x = [-1.9; 2];
>> [x,fval] = fminunc({f,GRAD},x,OPTIONS);

This produces
x = 0.9998, 0.9996
fval = 3.4306e-008
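The string-based objective and the 'gradobj' option above follow older Matlab releases. On recent versions the equivalent call would look roughly like the sketch below, saved as a function file; the wrapper name demo_fminunc and helper rosen are mine, and the option names follow the current documentation as I recall it, so check against your installation.

function demo_fminunc
% Possible modern fminunc usage with a function handle supplying the
% gradient (assumes the Optimization Toolbox).
opts = optimoptions('fminunc', 'Algorithm','quasi-newton', ...
                    'SpecifyObjectiveGradient', true);
[x, fval] = fminunc(@rosen, [-1.9; 2], opts)
end

function [f, g] = rosen(x)
% Rosenbrock value and gradient
f = 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
g = [-400*x(1)*(x(2)-x(1)^2) - 2*(1-x(1)); 200*(x(2)-x(1)^2)];
end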

Non-linear least squares

• It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals:

f(x) = \sum_{i=1}^{M} r_i^2

• If each residual depends non-linearly on the parameters x then the minimization of f (x) is a non-linear least squares problem.

• Examples arise in non-linear regression (fitting) of data

Linear least squares reminder

The goal is to fit a smooth curve to measured data points {s_i, t_i} by minimizing the cost

f(x) = \sum_i r_i^2 = \sum_i \left( y(s_i, x) - t_i \right)^2

[Figure: data points (s, t) with a fitted curve; t is the target value]

For example, the regression function y(s_i, x) might be polynomial:

y(s, x) = x_0 + x_1 s + x_2 s^2 + \ldots

In this case the function is linear in the parameters x and there is a closed form solution. In general there will not be a closed form solution for non-linear functions y(s, x).
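For instance, a quadratic fit in Matlab reduces to one linear solve; the data below are synthetic, made up purely to illustrate the closed-form solution.

% Fitting y(s,x) = x0 + x1*s + x2*s^2 by linear least squares.
s = linspace(0, 2, 40)';                      % sample positions s_i
t = 1 + 2*s - 0.5*s.^2 + 0.05*randn(40,1);    % noisy target values t_i
A = [ones(size(s)), s, s.^2];                 % row i is (1, s_i, s_i^2)
x = A \ t;                                    % closed-form least-squares fit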

Non-linear least squares example: aligning a 3D model to an image

Cost function

Transformation parameters:
• 3D rotation matrix R
• translation 3-vector T = (T_x, T_y, T_z)^\top

Image generation:
• rotate and translate the 3D model by R and T
• project to generate the image \hat{I}_{R,T}(x, y)

Non-linear least squares

f(x) = \sum_{i=1}^{M} r_i^2 = \| r \|^2

The M × N Jacobian of the vector of residuals r is defined as

J(x) = \begin{bmatrix} \frac{\partial r_1}{\partial x_1} & \cdots & \frac{\partial r_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial r_M}{\partial x_1} & \cdots & \frac{\partial r_M}{\partial x_N} \end{bmatrix}    (assume M > N)

Consider

\frac{\partial}{\partial x_k} \sum_i r_i^2 = 2 \sum_i r_i \frac{\partial r_i}{\partial x_k}

Hence

\nabla f(x) = 2 J^\top r

For the Hessian we require

\frac{\partial^2}{\partial x_l \partial x_k} \sum_i r_i^2 = 2 \sum_i \frac{\partial r_i}{\partial x_k} \frac{\partial r_i}{\partial x_l} + 2 \sum_i r_i \frac{\partial^2 r_i}{\partial x_k \partial x_l}

Hence

H(x) = 2 J^\top J + 2 \sum_{i=1}^{M} r_i R_i

where R_i is the matrix of second derivatives (the Hessian) of the residual r_i.

• Note that the second-order term in the Hessian H(x) is multiplied by the residuals r_i.
• In most problems, the residuals will typically be small.
• Also, at the minimum, the residuals will typically be distributed with mean = 0.
• For these reasons, the second-order term is often ignored, giving the Gauss-Newton approximation to the Hessian:

H(x) = 2 J^\top J

• Hence, explicit computation of the full Hessian can again be avoided.

Example – Gauss-Newton

The minimization of the Rosenbrock function

f(x, y) = 100(y - x^2)^2 + (1 - x)^2

can be written as a least-squares problem with residual vector

r = \begin{bmatrix} 10(y - x^2) \\ (1 - x) \end{bmatrix}

J(x) = \begin{bmatrix} \frac{\partial r_1}{\partial x} & \frac{\partial r_1}{\partial y} \\ \frac{\partial r_2}{\partial x} & \frac{\partial r_2}{\partial y} \end{bmatrix} = \begin{bmatrix} -20x & 10 \\ -1 & 0 \end{bmatrix}

The true Hessian is

H(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 1200x^2 - 400y + 2 & -400x \\ -400x & 200 \end{bmatrix}

The Gauss-Newton approximation of the Hessian is

2 J^\top J = 2 \begin{bmatrix} -20x & -1 \\ 10 & 0 \end{bmatrix} \begin{bmatrix} -20x & 10 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} 800x^2 + 2 & -400x \\ -400x & 200 \end{bmatrix}
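A quick numerical check of the two expressions at an arbitrary sample point (the point itself is not from the slides):

% Compare the true Hessian with the Gauss-Newton approximation at (x, y).
x = 0.5;  y = 1.2;                            % arbitrary sample point
H_true = [1200*x^2 - 400*y + 2, -400*x; -400*x, 200];
J = [-20*x, 10; -1, 0];
H_gn = 2*(J'*J);
disp(H_true - H_gn)     % the difference is the neglected 2*r1*R1 = [400*(x^2-y) 0; 0 0]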

Summary: Gauss-Newton optimization

For a cost function f(x) that is a sum of squared residuals,

f(x) = \sum_i r_i^2

the Hessian can be approximated as

H(x) = 2 J^\top J

and the gradient is given by

\nabla f(x) = 2 J^\top r

So the Newton update step

x_{n+1} = x_n + \delta x = x_n - H_n^{-1} g_n,    computed as H \delta x = -g_n,

becomes

J^\top J \, \delta x = -J^\top r

These are called the normal equations.
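A minimal Matlab sketch of this update on the Rosenbrock residuals; backtracking halving is used in place of the cubic-interpolation line search, so treat it as an illustration rather than the exact routine behind the plots that follow.

% Gauss-Newton sketch on the Rosenbrock residuals r = [10*(y - x^2); 1 - x].
r = @(x) [10*(x(2) - x(1)^2); 1 - x(1)];
J = @(x) [-20*x(1), 10; -1, 0];
f = @(x) sum(r(x).^2);

x = [-1.9; 2];
for n = 1:100
    g = 2*J(x)'*r(x);                             % gradient = 2*J'*r
    if norm(g) < 1e-3, break; end
    dx = -((J(x)'*J(x)) \ (J(x)'*r(x)));          % normal equations: J'J dx = -J'r
    a = 1;
    while f(x + a*dx) > f(x) && a > 1e-12, a = a/2; end
    x = x + a*dx;
end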

x_{n+1} = x_n - \alpha_n H_n^{-1} g_n    with    H_n(x) = 2 J_n^\top J_n

[Figures: Gauss-Newton method with line search on the Rosenbrock function, full view and zoomed view; gradient < 1e-3 after 14 iterations]

• Minimization with the Gauss-Newton approximation with line search takes only 14 iterations.

Comparison

[Figures: Newton method with line search (gradient < 1e-3 after 15 iterations) alongside the Gauss-Newton method with line search (gradient < 1e-3 after 14 iterations)]

Newton:
• requires computing the Hessian (i.e. n^2 second derivatives)
• exact solution if quadratic

Gauss-Newton:
• approximates the Hessian by the Jacobian product
• requires only n first derivatives

The downhill simplex (amoeba) algorithm

• Due to Nelder and Mead (1965)
• A direct method: only uses function evaluations (no derivatives)
• A simplex is the polytope in N dimensions with N+1 vertices, e.g.
  • 2D: triangle
  • 3D: tetrahedron
• Basic idea: move by reflections, expansions or contractions

[Figure: a simplex step – start, reflect, expand, contract]

One iteration of the simplex algorithm

• Reorder the points so that f(x_{n+1}) > \cdots > f(x_2) > f(x_1) (i.e. x_{n+1} is the worst point).
• Generate a trial point x_r by reflection:

x_r = \bar{x} + \alpha (\bar{x} - x_{n+1})

where \bar{x} = \left( \sum_i x_i \right) / (N + 1) is the centroid and \alpha > 0. Compute f(x_r); there are then 3 possibilities:

1. f(x_1) < f(x_r) < f(x_n) (i.e. x_r is neither the new best nor the worst point): replace x_{n+1} by x_r.

2. f(x_r) < f(x_1) (i.e. x_r is the new best point): assume the direction of reflection is good and generate a new point by expansion,

   x_e = x_r + \beta (x_r - \bar{x})

   where \beta > 0. If f(x_e) < f(x_r) then replace x_{n+1} by x_e; otherwise the expansion has failed, so replace x_{n+1} by x_r.

3. f(x_r) > f(x_n): assume the polytope is too large and generate a new point by contraction,

   x_c = \bar{x} + \gamma (x_{n+1} - \bar{x})

   where \gamma (0 < \gamma < 1) is the contraction coefficient. If f(x_c) < f(x_{n+1}) then the contraction has succeeded, so replace x_{n+1} by x_c; otherwise contract again.

Standard values are \alpha = 1, \beta = 1, \gamma = 0.5.
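A minimal Matlab sketch of these rules in 2D (three vertices) on the Rosenbrock function. The starting simplex is arbitrary, and a failed contraction is handled crudely by a capped repeated contraction; the full Nelder-Mead method shrinks the whole simplex instead, so treat this purely as an illustration.

% Sketch of the reflect/expand/contract iteration in 2D on Rosenbrock.
f = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
X = [-1.9 2.0; -1.5 2.5; -2.0 2.5];          % simplex vertices, one per row
alpha = 1;  beta = 1;  gamma = 0.5;          % standard coefficients

for it = 1:200
    fv = [f(X(1,:)); f(X(2,:)); f(X(3,:))];
    [fv, idx] = sort(fv);  X = X(idx,:);     % X(1,:) best, X(3,:) worst
    xbar = mean(X, 1);                       % centroid of all N+1 vertices
    xr = xbar + alpha*(xbar - X(3,:));       % reflect the worst vertex
    if f(xr) < fv(1)                         % new best point: try expansion
        xe = xr + beta*(xr - xbar);
        if f(xe) < f(xr), X(3,:) = xe; else, X(3,:) = xr; end
    elseif f(xr) < fv(2)                     % neither best nor worst: accept
        X(3,:) = xr;
    else                                     % reflection failed: contract
        xc = xbar + gamma*(X(3,:) - xbar);
        for k = 1:10                         % "otherwise contract again" (capped)
            if f(xc) < fv(3), break; end
            xc = xbar + gamma*(xc - xbar);
        end
        X(3,:) = xc;
    end
end
disp(X(1,:))                                 % best vertex found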

Example

[Figures: path of the best vertex of the downhill simplex on the Rosenbrock function, full view and detail; Matlab fminsearch with 200 iterations]

[Figure panels: successive simplex configurations, showing reflections and contractions]

Example 2: contraction about a minimum

[Figure panels: the simplex contracting about the minimum]

Summary

• no derivatives required
• deals well with noise in the cost function
• is able to crawl out of some local minima (though, of course, can still get stuck)

Matlab – fminsearch

Nelder-Mead simplex direct search:

>> banana = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;

Pass the function handle to fminsearch:

>> [x,fval] = fminsearch(banana, [-1.9, 2])

This produces
x = 1.0000  1.0000
fval = 4.0686e-010

Google to find out more on using this function

What is next?

• Move from general and quadratic optimization problems to linear programming
• Constrained optimization