Lecture 2: B1 Optimization
Michaelmas 2016
• Newton’s method
• Line search
• Quasi-Newton methods
• Least-squares and Gauss-Newton methods
• Downhill simplex (amoeba) algorithm
A. Zisserman
Optimization for General Functions

    f(x, y) = exp(x) (4x^2 + 2y^2 + 4xy + 2x + 1)

[Figure: surface plot of f(x, y)]
Apply methods developed using quadratic Taylor series expansion
Rosenbrock’s function

    f(x, y) = 100(y - x^2)^2 + (1 - x)^2

[Figure: contour plot of the Rosenbrock function]

Minimum is at [1, 1]
Steepest descent

• The 1D line minimization must be performed using one of the earlier methods (usually cubic polynomial interpolation); a code sketch follows below.

[Figure: steepest-descent path on the Rosenbrock function, with a zoomed view near the valley floor]

• The zig-zag behaviour is clear in the zoomed view (100 iterations).
• The algorithm crawls along the valley.
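A minimal steepest-descent sketch on the Rosenbrock function (not from the slides); for brevity the 1D line minimization uses Matlab's fminbnd on a bracketed step length rather than cubic polynomial interpolation:

    % Steepest descent with a simple 1D line search (illustrative sketch)
    f    = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;          % Rosenbrock
    grad = @(x) [-400*x(1)*(x(2)-x(1)^2) - 2*(1-x(1));     % df/dx
                  200*(x(2)-x(1)^2)];                      % df/dy

    x = [-1.9; 2];                                  % start point
    for n = 1:1000
        g = grad(x);
        if norm(g) < 1e-3, break; end
        d = -g/norm(g);                             % steepest-descent direction
        alpha = fminbnd(@(a) f(x + a*d), 0, 1);     % 1D line minimization
        x = x + alpha*d;
    end
    fprintf('x = (%.4f, %.4f) after %d iterations\n', x(1), x(2), n);

Its slow, zig-zagging progress along the valley is exactly the behaviour shown in the figure.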
Performance issues for optimization algorithms

1. Number of iterations required
2. Cost per iteration
3. Memory footprint
4. Region of convergence
Recall from lecture 1: Newton’s method in 1D

Fit a quadratic approximation to f(x) using both gradient and curvature information at x.

• Expand f(x) locally using a Taylor series:

    f(x + δx) = f(x) + δx f'(x) + (δx^2/2) f''(x) + h.o.t.

• Find the δx which minimizes this local quadratic approximation:

    f'(x + δx) = f'(x) + δx f''(x) = 0

• Rearranging:

    δx = -f'(x) / f''(x)

• Update for x:

    x_{n+1} = x_n - f'(x_n) / f''(x_n)
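A short sketch of the 1D update in code (the test function below is an arbitrary choice, not from the slides):

    % Newton's method in 1D (sketch)
    f   = @(x) x.^4 - 3*x.^2 + x;      % example function (assumption)
    df  = @(x) 4*x.^3 - 6*x + 1;       % f'(x)
    ddf = @(x) 12*x.^2 - 6;            % f''(x)

    x = 2;                             % start point
    for n = 1:20
        dx = -df(x)/ddf(x);            % minimizer of the local quadratic model
        x  = x + dx;
        if abs(dx) < 1e-10, break; end
    end
    fprintf('stationary point x = %.6f, f''(x) = %.2e\n', x, df(x));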
Recall from lecture 1: Taylor expansion in 2D

A function may be approximated locally by its Taylor series expansion about a point x_0:

    f(x_0 + δx) ≈ f(x_0) + (∂f/∂x, ∂f/∂y) (δx, δy)^T
                  + (1/2) (δx, δy) [ ∂²f/∂x²    ∂²f/∂x∂y
                                     ∂²f/∂x∂y   ∂²f/∂y²  ] (δx, δy)^T + h.o.t.

The expansion to second order is a quadratic function:

    f(x_0 + δx) = a + g^T δx + (1/2) δx^T H δx
Newton’s method in ND

Expand f(x) by its Taylor series about the point x_n:

    f(x_n + δx) ≈ f(x_n) + g_n^T δx + (1/2) δx^T H_n δx

where the gradient is the vector

    g_n = ∇f(x_n) = [∂f/∂x_1, ..., ∂f/∂x_N]^T

and the Hessian is the symmetric matrix

    H_n = H(x_n) = [ ∂²f/∂x_1²       ...   ∂²f/∂x_1∂x_N
                     ...                   ...
                     ∂²f/∂x_1∂x_N    ...   ∂²f/∂x_N²     ]

For a minimum we require that ∇f(x) = 0, and so

    ∇f(x) = g_n + H_n δx = 0

with solution δx = -H_n^{-1} g_n. This gives the iterative update

    x_{n+1} = x_n - H_n^{-1} g_n

Assume that H is positive definite (all eigenvalues greater than zero). Then for the update

    x_{n+1} = x_n + δx = x_n - H_n^{-1} g_n

• If f(x) is quadratic, the solution is found in one step.
• The method has quadratic convergence (as in the 1D case).
• The step δx = -H_n^{-1} g_n is guaranteed to be a downhill direction (provided that H is positive definite).
• For numerical reasons the inverse is not actually computed; instead δx is obtained as the solution of H δx = -g_n.
• Rather than jumping straight to x_n - H_n^{-1} g_n, it is better to perform a line search, which ensures global convergence:

    x_{n+1} = x_n - α_n H_n^{-1} g_n

• If H = I this reduces to steepest descent.
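A minimal sketch of the damped Newton iteration on the Rosenbrock function (the gradient and Hessian expressions are the ones derived later in this lecture); fminbnd stands in for a proper line search, and there is no safeguard if the Hessian becomes indefinite along the path:

    % Newton's method with a simple line search (illustrative sketch)
    f    = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
    grad = @(x) [-400*x(1)*(x(2)-x(1)^2) - 2*(1-x(1)); 200*(x(2)-x(1)^2)];
    hess = @(x) [1200*x(1)^2 - 400*x(2) + 2, -400*x(1); -400*x(1), 200];

    x = [-1.9; 2];
    for n = 1:100
        g = grad(x);
        if norm(g) < 1e-3, break; end
        dx = -(hess(x)\g);                          % solve H*dx = -g, no explicit inverse
        alpha = fminbnd(@(a) f(x + a*dx), 0, 2);    % line search along the Newton direction
        x = x + alpha*dx;
    end
    fprintf('converged to (%.4f, %.4f) in %d iterations\n', x(1), x(2), n);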
Newton’s method - example

[Figure: Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 15 iterations. The zoomed view shows ellipses marking successive quadratic approximations.]

• The algorithm converges in only 15 iterations, compared to hundreds for steepest descent.
• However, the method requires computing the Hessian matrix at each iteration – this is not always feasible.
Quasi-Newton methods • If the problem size is large and the Hessian matrix is dense then it may be infeasible/inconvenient to compute it directly.
• Quasi-Newton methods avoid this problem by keeping a “rolling estimate” of H (x), updated at each iteration using new gradient information.
• Common schemes are due to Broyden, Fletcher, Goldfarb and Shanno (BFGS), and to Davidon, Fletcher and Powell (DFP).
e.g. in 1D

First derivatives:

    f'(x_0 + h/2) = (f_1 - f_0) / h    and    f'(x_0 - h/2) = (f_0 - f_{-1}) / h

Second derivative:

    f''(x_0) = [ (f_1 - f_0)/h - (f_0 - f_{-1})/h ] / h = (f_1 - 2 f_0 + f_{-1}) / h^2

[Figure: f(x) sampled at x_{-1}, x_0, x_{+1}, spaced h apart]
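A quick numerical check of the second-derivative formula (the test function and step size below are arbitrary choices, not from the slides):

    % Finite-difference estimate of f''(x0) from three function values (sketch)
    f  = @(x) sin(x);                               % test function (assumption)
    x0 = 0.7;  h = 1e-3;
    fd2 = (f(x0+h) - 2*f(x0) + f(x0-h)) / h^2;      % (f1 - 2*f0 + f_-1)/h^2
    fprintf('FD estimate %.6f vs exact %.6f\n', fd2, -sin(x0));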
For H_{n+1}, build an approximation from H_n, g_n, g_{n+1}, x_n, x_{n+1}.

Quasi-Newton: BFGS

• Set H_0 = I.
• Update according to

    H_{n+1} = H_n + (q_n q_n^T) / (q_n^T s_n) - (H_n s_n)(H_n s_n)^T / (s_n^T H_n s_n)

  where

    s_n = x_{n+1} - x_n
    q_n = g_{n+1} - g_n

• The matrix itself is not stored, but rather represented compactly by a few stored vectors.
• The estimate H_{n+1} is used to form a local quadratic approximation as before.
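A compact quasi-Newton loop using exactly the update above, on the Rosenbrock function. This is a sketch: it stores H explicitly and omits the safeguards (e.g. on q_n^T s_n) that a production BFGS implementation would need.

    % Quasi-Newton (BFGS-style) iteration using the Hessian update above (sketch)
    f    = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
    grad = @(x) [-400*x(1)*(x(2)-x(1)^2) - 2*(1-x(1)); 200*(x(2)-x(1)^2)];

    x = [-1.9; 2];  H = eye(2);  g = grad(x);       % H0 = I
    for n = 1:200
        if norm(g) < 1e-3, break; end
        dx    = -(H\g);                             % quasi-Newton direction
        alpha = fminbnd(@(a) f(x + a*dx), 0, 2);    % line search
        xnew  = x + alpha*dx;  gnew = grad(xnew);
        s = xnew - x;  q = gnew - g;                % s_n and q_n
        Hs = H*s;
        H  = H + (q*q')/(q'*s) - (Hs*Hs')/(s'*Hs);  % BFGS update of H
        x  = xnew;  g = gnew;
    end
    fprintf('x = (%.4f, %.4f) after %d iterations\n', x(1), x(2), n);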
Example

[Figure: BFGS quasi-Newton path on the Rosenbrock function]

• The method converges in 25 iterations, compared to 15 for the full Newton method.
• In Matlab the optimization function ‘fminunc’ uses a BFGS quasi-Newton method for medium-scale optimization problems.
Matlab – fminunc

>> f = '100*(x(2)-x(1)^2)^2 + (1-x(1))^2';
>> GRAD = '[100*(4*x(1)^3-4*x(1)*x(2))+2*x(1)-2; 100*(2*x(2)-2*x(1)^2)]';

Choose options for BFGS quasi-Newton:
>> OPTIONS = optimset('LargeScale','off', 'HessUpdate','bfgs');
>> OPTIONS = optimset(OPTIONS,'gradobj','on');

Start point:
>> x = [-1.9; 2];
>> [x,fval] = fminunc({f,GRAD},x,OPTIONS);

This produces
x = 0.9998, 0.9996
fval = 3.4306e-008
Non-linear least squares

• It is very common in applications for a cost function f(x) to be the sum of a large number of squared residuals:

    f(x) = Σ_{i=1}^{M} r_i^2
• If each residual depends non-linearly on the parameters x then the minimization of f (x) is a non-linear least squares problem.
• Examples arise in non-linear regression (fitting) of data
Linear least squares reminder

The goal is to fit a smooth curve to measured data points {s_i, t_i} by minimizing the cost

    f(x) = Σ_i r_i^2 = Σ_i (y(s_i, x) - t_i)^2

[Figure: data points (s_i, t_i) with a fitted curve y(s, x); t is the target value]

For example, the regression function y(s, x) might be polynomial:

    y(s, x) = x_0 + x_1 s + x_2 s^2 + ...

In this case the function is linear in the parameters x and there is a closed-form solution. In general there will not be a closed-form solution for non-linear functions y(s, x).
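For the linear case above, the closed-form solution can be computed directly. A minimal sketch with synthetic data (the coefficients and noise level are arbitrary assumptions):

    % Linear least-squares polynomial regression (sketch)
    s = linspace(0, 1, 50)';                        % sample positions (synthetic)
    t = 1 + 2*s - 3*s.^2 + 0.05*randn(size(s));     % noisy targets (assumed model)

    A = [ones(size(s)), s, s.^2];                   % design matrix for y = x0 + x1*s + x2*s^2
    x = A \ t;                                      % least-squares solution
                                                    % (equivalent to solving A'*A*x = A'*t)
    disp(x');                                       % coefficients, approximately [1 2 -3]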
Non-linear least squares example: aligning a 3D model to an image

Cost function: [figure]

Transformation parameters:
• 3D rotation matrix R
• translation 3-vector T = (T_x, T_y, T_z)^T

Image generation:
• rotate and translate the 3D model by R and T
• project to generate the image Î_{R,T}(x, y)
Non-linear least squares

    f(x) = Σ_{i=1}^{M} r_i^2 = ||r||^2

The M × N Jacobian of the vector of residuals r is defined as

    J(x) = [ ∂r_1/∂x_1   ...   ∂r_1/∂x_N
             ...                ...
             ∂r_M/∂x_1   ...   ∂r_M/∂x_N ]

(assume M > N)

Consider

    ∂/∂x_k Σ_i r_i^2 = 2 Σ_i r_i ∂r_i/∂x_k

Hence

    ∇f(x) = 2 J^T r

For the Hessian we require

    ∂²/∂x_l ∂x_k Σ_i r_i^2 = 2 Σ_i (∂r_i/∂x_l)(∂r_i/∂x_k) + 2 Σ_i r_i ∂²r_i/∂x_k ∂x_l

Hence

    H(x) = 2 J^T J + 2 Σ_{i=1}^{M} r_i R_i

where R_i is the matrix of second derivatives of the residual r_i.

• Note that the second-order term in the Hessian H(x) is multiplied by the residuals r_i.
• In most problems the residuals will typically be small.
• Also, at the minimum, the residuals will typically be distributed with mean 0.
• For these reasons the second-order term is often ignored, giving the Gauss-Newton approximation to the Hessian:

    H(x) ≈ 2 J^T J

• Hence, explicit computation of the full Hessian can again be avoided.
Example – Gauss-Newton

The minimization of the Rosenbrock function

    f(x, y) = 100(y - x^2)^2 + (1 - x)^2

can be written as a least-squares problem with residual vector

    r = [ 10(y - x^2) ;  (1 - x) ]

The Jacobian is

    J(x) = [ ∂r_1/∂x  ∂r_1/∂y ;  ∂r_2/∂x  ∂r_2/∂y ] = [ -20x  10 ;  -1  0 ]

The true Hessian is

    H(x) = [ ∂²f/∂x²  ∂²f/∂x∂y ;  ∂²f/∂x∂y  ∂²f/∂y² ]
         = [ 1200x^2 - 400y + 2   -400x ;  -400x   200 ]

The Gauss-Newton approximation of the Hessian is

    2 J^T J = 2 [ -20x  -1 ;  10  0 ] [ -20x  10 ;  -1  0 ]
            = [ 800x^2 + 2   -400x ;  -400x   200 ]
Summary: Gauss-Newton optimization

For a cost function f(x) that is a sum of squared residuals

    f(x) = Σ_{i=1}^{M} r_i^2

the Hessian can be approximated as

    H(x) ≈ 2 J^T J

and the gradient is given by

    ∇f(x) = 2 J^T r

So the Newton update step

    x_{n+1} = x_n + δx = x_n - H_n^{-1} g_n,

computed by solving H δx = -g_n, becomes

    J^T J δx = -J^T r

These are called the normal equations.
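A minimal Gauss-Newton loop for the Rosenbrock residuals, solving the normal equations at each step (for brevity it takes the full step δx rather than the damped step used in the line-search version shown next):

    % Gauss-Newton iteration via the normal equations (sketch)
    r = @(x) [10*(x(2) - x(1)^2); 1 - x(1)];        % residual vector
    J = @(x) [-20*x(1), 10; -1, 0];                 % Jacobian of r

    x = [-1.9; 2];
    for n = 1:100
        Jn = J(x);  rn = r(x);
        g  = 2*Jn'*rn;                              % gradient of f = ||r||^2
        if norm(g) < 1e-3, break; end
        dx = -(Jn'*Jn) \ (Jn'*rn);                  % normal equations: J'J dx = -J'r
        x  = x + dx;                                % full step (no line search here)
    end
    fprintf('x = (%.4f, %.4f) after %d iterations\n', x(1), x(2), n);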
Gauss-Newton with line search:

    x_{n+1} = x_n - α_n H_n^{-1} g_n    with    H(x_n) = 2 J_n^T J_n

[Figure: Gauss-Newton method with line search on the Rosenbrock function; gradient < 1e-3 after 14 iterations, with a zoomed view]

• Minimization with the Gauss-Newton approximation and line search takes only 14 iterations.
Comparison

[Figure: Newton method with line search (gradient < 1e-3 after 15 iterations) and Gauss-Newton method with line search (gradient < 1e-3 after 14 iterations) on the Rosenbrock function]

Newton
• requires computing the Hessian (i.e. n^2 second derivatives)
• exact solution in one step if f is quadratic

Gauss-Newton
• approximates the Hessian by the Jacobian product 2 J^T J
• requires only first derivatives
The downhill simplex (amoeba) algorithm
• Due to Nelder and Mead (1965)
• A direct method: only uses function evaluations (no derivatives)
• A simplex is the polytope in N dimensions with N+1 vertices, e.g.
  • 2D: triangle
  • 3D: tetrahedron
• Basic idea: move by reflections, expansions or contractions

[Figure: simplex moves – start, reflect, expand, contract]
One iteration of the simplex algorithm

• Reorder the points so that f(x_{N+1}) > ... > f(x_2) > f(x_1) (i.e. x_{N+1} is the worst point).
• Generate a trial point x_r by reflection:

    x_r = x̄ + α(x̄ - x_{N+1})

  where x̄ = (Σ_i x_i)/(N + 1) is the centroid and α > 0.

Compute f(x_r); there are then 3 possibilities:

1. f(x_1) < f(x_r) < f(x_N) (i.e. x_r is neither the new best nor the new worst point): replace x_{N+1} by x_r.

2. f(x_r) < f(x_1) (i.e. x_r is the new best point): assume the direction of reflection is good and generate a new point by expansion,

    x_e = x_r + β(x_r - x̄)

   where β > 0. If f(x_e) < f(x_r) then replace x_{N+1} by x_e; otherwise the expansion has failed, so replace x_{N+1} by x_r.

3. f(x_r) > f(x_N): assume the polytope is too large and generate a new point by contraction,

    x_c = x̄ + γ(x_{N+1} - x̄)

   where γ (0 < γ < 1) is the contraction coefficient. If f(x_c) < f(x_{N+1}) then the contraction has succeeded: replace x_{N+1} by x_c; otherwise contract again.

Standard values are α = 1, β = 1, γ = 0.5.
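A sketch of one iteration exactly as described above (note this simplified scheme uses the centroid of all N+1 points and has no shrink step, unlike the full Nelder-Mead algorithm); the usage lines at the end are an assumed example, not from the slides:

    % One iteration of the downhill simplex scheme described above (sketch)
    % X: (N+1) x N matrix of vertices, f: function handle, alpha/beta/gamma as above
    function X = simplex_step(f, X, alpha, beta, gamma)
        fv = arrayfun(@(i) f(X(i,:)), 1:size(X,1));
        [fv, order] = sort(fv);                     % best vertex first, worst last
        X = X(order, :);
        worst = X(end, :);
        xbar  = mean(X, 1);                         % centroid of all N+1 vertices
        xr = xbar + alpha*(xbar - worst);           % reflection
        fr = f(xr);
        if fr < fv(1)                               % new best point: try expansion
            xe = xr + beta*(xr - xbar);
            if f(xe) < fr, X(end,:) = xe; else, X(end,:) = xr; end
        elseif fr < fv(end-1)                       % neither best nor worst: accept
            X(end,:) = xr;
        else                                        % simplex too large: contract
            xc = xbar + gamma*(worst - xbar);
            k = 0;
            while f(xc) >= fv(end) && k < 10        % "contract again" (bounded here)
                xc = xbar + gamma*(xc - xbar);  k = k + 1;
            end
            X(end,:) = xc;
        end
    end

    % Example usage (assumed starting triangle), e.g. on the Rosenbrock function:
    %   banana = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;
    %   X = [-1.9 2; -1.7 2; -1.9 2.2];
    %   for k = 1:200, X = simplex_step(banana, X, 1, 1, 0.5); end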
Example

[Figure: path of the best vertex of the downhill simplex on the Rosenbrock function, with a zoomed detail view; Matlab fminsearch with 200 iterations]
[Figure: sequences of simplex configurations illustrating reflections and contractions]
Example 2: contraction about a minimum
[Figure: sequences of simplex configurations contracting about the minimum]
Summary

• No derivatives required
• Deals well with noise in the cost function
• Is able to crawl out of some local minima (though, of course, can still get stuck)
Matlab – fminsearch

Nelder-Mead simplex direct search:

>> banana = @(x) 100*(x(2)-x(1)^2)^2 + (1-x(1))^2;

Pass the function handle to fminsearch:

>> [x,fval] = fminsearch(banana, [-1.9, 2])

This produces
x = 1.0000  1.0000
fval = 4.0686e-010
Google to find out more on using this function
What is next?

• Move from general and quadratic optimization problems to linear programming
• Constrained optimization