ELE604/ELE704 Optimization - Hacettepe University, Department of Electrical and Electronic Engineering
Transcript
Contents

Unconstrained Optimization
- Unconstrained Minimization

Descent Methods
- Motivation
- General Descent Method
- Line Search: Exact Line Search, Bisection Algorithm, Backtracking Line Search
- Convergence

Gradient Descent (GD) Method
- Gradient Descent Method
- Convergence Analysis
- Convergence of GD with Exact Line Search
- Convergence of GD with Backtracking Line Search
- Examples

Steepest Descent (SD) Method
- Preliminary Definitions
- Steepest Descent Method
- Steepest Descent for different norms: Euclidean Norm, Quadratic Norm, L1-norm, Choice of norm
- Convergence Analysis
- Examples

Conjugate Gradient (CG) Method
- Introduction
- Conjugate Directions
- Descent Properties of the Conjugate Gradient Method
- The Conjugate Gradient Method
- Extension to Nonquadratic Problems

Newton's Method (NM)
- The Newton Step
- Interpretation of the Newton Step
- The Newton Decrement
- Newton's Method
- Convergence Analysis
- Examples
- Approximation of the Hessian

Unconstrained Optimization
http://www.ee.hacettepe.edu.tr/∼usezen/ele604/
Dr. Umut Sezen & Dr. Cenk Toker
Department of Electrical and Electronic Engineering, Hacettepe University
19-Dec-2014

Unconstrained Minimization

  min_{x ∈ R^N} f(x)

where f(x): R^N → R is twice differentiable. Necessary conditions:
- The problem is solvable, i.e. a finite optimal point x* exists.
- The optimal value (finite) is given by p* = inf_x f(x) = f(x*) (> −∞).

Example 1: Quadratic program. The aim is

  min f(x) = (1/2) x^T Q x − b^T x + c

where Q ∈ R^{N×N} is symmetric, b ∈ R^N and c ∈ R. The optimality conditions are

  ∇f(x*) = Q x* − b = 0,  H(x*) = Q ⪰ 0 (PSD).

- Q ≻ 0 ⇒ x* = Q^{−1} b is the unique global minimum.
- Q ⪰ 0 but singular ⇒ either no solution or an infinite number of solutions.
- Q ≺ 0 ⇒ f(x) has no local minimum.
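These cases can be checked numerically. Below is a minimal pure-Python sketch (not part of the lecture; the 2×2 definiteness test and the example values Q = diag(2, 1), b = (1, 0)^T are illustrative assumptions):

```python
# Classify and solve the quadratic program f(x) = 0.5 x^T Q x - b^T x + c
# for a symmetric 2x2 Q, using the optimality condition Q x* = b.

def solve_qp_2x2(Q, b):
    """Return ('unique', x*) when Q is PD, otherwise a diagnostic label."""
    a, off = Q[0][0], Q[0][1]
    d = Q[1][1]
    det = a * d - off * off          # symmetric: Q[1][0] == Q[0][1]
    # A symmetric 2x2 matrix is PD iff det > 0 and trace > 0.
    if det > 0 and a + d > 0:
        x1 = (d * b[0] - off * b[1]) / det     # x* = Q^{-1} b
        x2 = (-off * b[0] + a * b[1]) / det
        return ('unique', [x1, x2])
    if det < 0:
        return ('indefinite: no minimum', None)
    return ('singular or negative: none or infinitely many solutions', None)

# Example values (Q = diag(2, 1), b = (1, 0)^T): Q x* = b gives x* = (0.5, 0).
status, xqp = solve_qp_2x2([[2.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

For a symmetric 2×2 matrix, positive definiteness is equivalent to a positive determinant together with a positive trace, which is what the sketch tests before inverting Q.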
Example 2: Consider

  min_{x1, x2 ∈ R} f(x1, x2) = (1/2)(α x1² + β x2²) − x1.

Let us first express the above in quadratic-program form, with

  Q = [ α  −γ ;  γ  β ],   b = [ 1 ; 0 ],

where γ ∈ R; for simplicity we can take γ = 0. So,
- If α > 0 and β > 0 (Q ≻ 0), x* = (1/α, 0) is the unique global minimum.
- If α > 0 and β = 0 (Q ⪰ 0), there are an infinite number of solutions, (1/α, y), y ∈ R.
- If α = 0 and β > 0 (Q ⪰ 0), there is no solution.
- If α < 0 and β > 0 (or α > 0 and β < 0), Q is indefinite: no solution.

[Surface plots of f(x1, x2) for the four cases α > 0, β > 0; α > 0, β = 0; α = 0, β > 0; and α > 0, β < 0.]

Descent Methods

Motivation

Two possibilities:
- {f(x) : x ∈ X} is unbounded below ⇒ no optimal solution.
- {f(x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided ‖x‖ ≠ ∞.

Then, unconstrained minimization methods
- produce a sequence of points x^(k) ∈ dom f(x), k = 0, 1, . . ., with f(x^(k)) → p*;
- can be interpreted as iterative methods for solving the optimality condition ∇f(x*) = 0.

If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that

  f(x − α∇f(x)) < f(x)  ∀α ∈ (0, δ).

More generally, if d makes an angle with ∇f(x) that is greater than 90°, i.e.

  ∇^T f(x) d < 0,

there exists an interval (0, δ) of stepsizes such that f(x + αd) < f(x) for all α ∈ (0, δ).

General Descent Method

Definition: The descent direction d is selected such that ∇^T f(x) d < 0.

Proposition: For a descent method, f(x^(k+1)) < f(x^(k)), except when x^(k) = x*.

Definition: A minimizing sequence is defined as

  x^(k+1) = x^(k) + α^(k) d^(k)

where the scalar α^(k) ∈ (0, δ) is the stepsize (or step length) at iteration k, and d^(k) ∈ R^N is the step or search direction.
- How to find the optimum α^(k)? A line search algorithm.
- How to find the optimum d^(k)? Depends on the descent algorithm, e.g. d^(k) = −∇f(x^(k)).

Algorithm: General Descent Method
Given a starting point x^(0) ∈ dom f(x),
repeat
  1. Determine a descent direction d^(k).
  2. Line search: choose a stepsize α^(k) > 0.
  3. Update: x^(k+1) = x^(k) + α^(k) d^(k).
until the stopping criterion is satisfied.

Example 3: Simplest method, Gradient Descent:

  x^(k+1) = x^(k) − α^(k) ∇f(x^(k)),  k = 0, 1, . . .

Note that here the descent direction is d^(k) = −∇f(x^(k)).

Example 4: Most sophisticated method, Newton's Method:

  x^(k+1) = x^(k) − α^(k) H^{−1}(x^(k)) ∇f(x^(k)),  k = 0, 1, . . .
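The general descent template can be sketched in code. This is an illustrative sketch only (not from the slides): the direction is the gradient descent choice d^(k) = −∇f(x^(k)) of Example 3, a constant stepsize stands in for the line search step, and the toy objective f(x) = x1² + 2x2² is an assumption:

```python
# General descent method template: direction, stepsize, update, repeated
# until the stopping criterion ||grad f(x)|| (here max-norm) is small.

def descent(grad, x0, alpha=0.1, eps=1e-8, max_iter=10000):
    x = list(x0)
    for _ in range(max_iter):
        d = [-g for g in grad(x)]                        # 1. descent direction
        if max(abs(di) for di in d) < eps:               # stopping criterion
            break
        x = [xi + alpha * di for xi, di in zip(x, d)]    # 3. update step
    return x

# Illustrative objective f(x) = x1^2 + 2 x2^2, with grad f = (2 x1, 4 x2);
# the minimizer is the origin.
xmin = descent(lambda x: [2 * x[0], 4 * x[1]], [1.0, 1.0])
```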
Note that in Newton's method (Example 4) the descent direction is d^(k) = −H^{−1}(x^(k)) ∇f(x^(k)).

Line Search

Suppose f(x) is a continuously differentiable convex function and we want to find

  α^(k) = argmin_α f(x^(k) + α d^(k))

for a given descent direction d^(k). Let

  h(α) = f(x^(k) + α d^(k)),

where h(α): R → R is a convex function of the scalar variable α; then the problem becomes α^(k) = argmin_α h(α). Since h(α) is convex, it has a minimum where

  h′(α^(k)) = ∂h(α^(k))/∂α = 0,

where h′(α) is given by (using the chain rule)

  h′(α) = ∂h(α)/∂α = ∇^T f(x^(k) + α d^(k)) d^(k).

Therefore, since d^(k) is a descent direction (i.e. ∇^T f(x^(k)) d^(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotonically increasing function of α because h(α) is convex. Hence, we search for h′(α^(k)) = 0.

Choice of stepsize:
- Constant stepsize: α^(k) = c (a constant).
- Diminishing stepsize: α^(k) → 0 while satisfying Σ_{k=0}^∞ α^(k) = ∞.
- Exact line search (analytic): α^(k) = argmin_α f(x^(k) + α d^(k)).

Exact line search (for quadratic programs): If f(x) is a quadratic function, then h(α) is also a quadratic function, i.e.

  h(α) = f(x^(k) + α d^(k)) = f(x^(k)) + α ∇^T f(x^(k)) d^(k) + (α²/2) d^(k)T H(x^(k)) d^(k).

The exact line search solution α₀ which minimizes this quadratic, i.e. ∂h(α₀)/∂α = 0, is given by

  α₀ = α^(k) = − ∇^T f(x^(k)) d^(k) / (d^(k)T H(x^(k)) d^(k)).

- If f(x) is a higher-order function, the second-order Taylor series approximation can be used in the exact line search algorithm (which then gives an approximate solution).

Bisection Algorithm: Assume h(α) is convex, so that h′(α) is a monotonically increasing function, and suppose we know a value α̂ such that h′(α̂) > 0.
- Since h′(0) < 0, α̃ = (0 + α̂)/2 is the next test point.
- If h′(α̃) = 0, then α^(k) = α̃ is found (very difficult to achieve exactly).
- If h′(α̃) > 0, narrow the search interval down to (0, α̃).
- If h′(α̃) < 0, narrow the search interval down to (α̃, α̂).

Algorithm:
1. Set k = 0, αℓ = 0 and αu = α̂.
2. Set α̃ = (αℓ + αu)/2 and calculate h′(α̃).
3. If h′(α̃) > 0 ⇒ αu = α̃, k = k + 1, go to step 2.
4. If h′(α̃) < 0 ⇒ αℓ = α̃, k = k + 1, go to step 2.
5. If h′(α̃) = 0 ⇒ stop.

Proposition: After every iteration, the current interval [αℓ, αu] contains α*, where h′(α*) = 0.
Proposition: At the k-th iteration, the length of the current interval is L = (1/2)^k α̂.
Proposition: A value of α such that |α − α*| < ε can be found in at most ⌈log₂(α̂/ε)⌉ steps.

How to find α̂ such that h′(α̂) > 0?
1. Make an initial guess of α̂.
2. If h′(α̂) < 0, set α̂ = 2α̂ and repeat step 2.
3. Otherwise, stop.

Stopping criteria for the Bisection Algorithm (h′(α̃) → 0 only as k → ∞, which may be slow):
1. Stop after k = K iterations (K: user defined).
2. Stop when |αu − αℓ| ≤ ε (ε: user defined).
3. Stop when |h′(α̃)| ≤ ϵ (ϵ: user defined).
In general, the 3rd criterion is the best.

Backtracking line search

For small enough α,

  f(x₀ + αd) ≈ f(x₀) + α ∇^T f(x₀) d < f(x₀) + γα ∇^T f(x₀) d,

where 0 < γ < 0.5, since ∇^T f(x₀) d < 0.

Algorithm: Backtracking line search
Given a descent direction d for f(x) at x₀ ∈ dom f(x):
  α = 1
  while f(x₀ + αd) > f(x₀) + γα ∇^T f(x₀) d
    α = βα
  end
where 0 < γ < 0.5 and 0 < β < 1.

- At each iteration the stepsize α is reduced by the factor β (β ≈ 0.1: coarse search; β ≈ 0.8: fine search).
- The backtracking exit inequality

  f(x₀ + αd) ≤ f(x₀) + γα ∇^T f(x₀) d

holds for α ∈ [0, α₀]. Then the line search stops with a step length
  i. α = 1, if α₀ ≥ 1;
  ii. α ∈ [βα₀, α₀], otherwise.
In other words, the step length obtained by backtracking line search satisfies α ≥ min{1, βα₀}.
- γ can be interpreted as the accepted fraction of the decrease in f(x) predicted by linear extrapolation (typically γ = 0.01 to 0.3, meaning that we accept a decrease in f(x) between 1% and 30%).
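The two line searches above can be sketched as follows (illustrative, not from the slides: h′ is passed in as a callable, the doubling loop finds α̂ with h′(α̂) > 0, and the test objective h(α) = (1 − 2α)² is an assumed example):

```python
# bisect_step: bisection on h'(alpha) after doubling to find alpha_hat with
# h'(alpha_hat) > 0.  backtrack: shrink alpha by beta until
# f(x + alpha d) <= f(x) + gamma * alpha * grad^T f(x) d.

def bisect_step(hprime, eps=1e-10):
    hi = 1.0
    while hprime(hi) < 0:              # find alpha_hat with h'(alpha_hat) > 0
        hi *= 2.0
    lo = 0.0
    while hi - lo > eps:               # interval halves each iteration
        mid = 0.5 * (lo + hi)
        if hprime(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def backtrack(f, fx, slope, x, d, gamma=0.3, beta=0.5):
    # slope = grad^T f(x) d, which is < 0 for a descent direction
    alpha = 1.0
    while f([xi + alpha * di for xi, di in zip(x, d)]) > fx + gamma * alpha * slope:
        alpha *= beta
    return alpha

# h(alpha) = f(x + alpha d) for f(x) = x^2 at x = 1, d = -2 (neg. gradient),
# i.e. h(alpha) = (1 - 2 alpha)^2 with minimizer alpha = 0.5.
a_star = bisect_step(lambda a: 2 * (1 - 2 * a) * (-2))
a_bt = backtrack(lambda z: z[0] ** 2, 1.0, -4.0, [1.0], [-2.0])
```

As stated above, backtracking returns a stepsize α ≥ min{1, βα₀}; it does not try to locate the exact minimizer, only a point of sufficient decrease.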
Convergence

Definition: Let ‖·‖ be a norm on R^N and let {x^(k)}_{k=0}^∞ be a sequence of vectors in R^N. The sequence {x^(k)} is said to converge to a limit x* if

  ∀ε > 0, ∃N_ε ∈ Z₊ : (k ∈ Z₊ and k ≥ N_ε) ⇒ ‖x^(k) − x*‖ < ε.

If the sequence converges to x*, we write lim_{k→∞} x^(k) = x* and call x* the limit of the sequence.
- N_ε may depend on ε.
- For a distance ε, after N_ε iterations all subsequent iterates are within the distance ε of x*.
- This definition does not characterize how fast the convergence is (i.e. the rate of convergence).

Rate of Convergence

Definition: Let ‖·‖ be a norm on R^N. A sequence {x^(k)}_{k=0}^∞ that converges to x* ∈ R^N is said to converge at rate R ∈ R₊₊ with rate constant δ ∈ R₊₊ if

  lim_{k→∞} ‖x^(k+1) − x*‖ / ‖x^(k) − x*‖^R = δ.

- If R = 1 and 0 < δ < 1, the rate is linear.
- If 1 < R < 2 and 0 < δ < ∞, the rate is called super-linear.
- If R = 2 and 0 < δ < ∞, the rate is called quadratic.

The rate of convergence R is sometimes called the asymptotic convergence rate: it may not apply to the early iterates, but applies asymptotically as k → ∞.

Example 5: The sequence {a^k}_{k=0}^∞, 0 < a < 1, converges to 0, with

  lim_{k→∞} |a^(k+1) − 0| / |a^k − 0|^1 = a  ⇒  R = 1, δ = a.

Example 6: The sequence {a^(2^k)}_{k=0}^∞, 0 < a < 1, converges to 0, with

  lim_{k→∞} |a^(2^(k+1)) − 0| / |a^(2^k) − 0|² = 1  ⇒  R = 2, δ = 1.

Gradient Descent Method

The first-order Taylor series expansion at x₀ gives us

  f(x₀ + αd) ≈ f(x₀) + α ∇^T f(x₀) d.

This approximation is valid for α‖d‖ → 0. We want to choose d so that ∇^T f(x₀) d is as small (as negative) as possible for maximum descent. If we normalize d, i.e. take ‖d‖ = 1, then the normalized direction

  d̃ = − ∇f(x₀) / ‖∇f(x₀)‖

makes the smallest inner product with ∇f(x₀). The corresponding unnormalized direction

  d = −∇f(x₀)

is called the direction of gradient descent (GD) at the point x₀.
- d is a descent direction as long as ∇f(x₀) ≠ 0.

Algorithm: Gradient Descent Algorithm
Given a starting point x^(0) ∈ dom f(x),
repeat
  1. d^(k) = −∇f(x^(k)).
  2. Line search: choose a stepsize α^(k) via a line search algorithm.
  3. Update: x^(k+1) = x^(k) + α^(k) d^(k).
until the stopping criterion is satisfied.
- A typical stopping criterion is ‖∇f(x)‖ < ε, with ε small.

Convergence Analysis

Assume the Hessian matrix H(x) is bounded for all x ∈ dom f(x) as follows:
1. mI ⪯ H(x), i.e. (H(x) − mI) ⪰ 0, so y^T H(x) y ≥ m‖y‖² for all y ∈ R^N.
2. H(x) ⪯ MI, i.e. (MI − H(x)) ⪰ 0, so y^T H(x) y ≤ M‖y‖² for all y ∈ R^N.

Note that the condition number of a matrix is given by the ratio of its largest and smallest eigenvalues, e.g.

  κ(H(x)) = max_i λ_i / min_i λ_i = M/m.

Lower Bound (mI ⪯ H(x)): For x, y ∈ dom f(x),

  f(y) = f(x) + ∇^T f(x)(y − x) + (1/2)(y − x)^T H(z)(y − x)

for some z on the line segment [x, y], where H(z) ⪰ mI. Thus,

  f(y) ≥ f(x) + ∇^T f(x)(y − x) + (m/2)‖y − x‖².

- If m = 0, the inequality characterizes convexity.
- If the condition number is close to one, the matrix is well-conditioned, which means its inverse can be computed with good accuracy; if the condition number is large, the matrix is said to be ill-conditioned.
- If m > 0, we obtain a better lower bound for f(y): the right-hand side above is convex in y and is minimized at y₀ = x − (1/m)∇f(x), so

  f(y) ≥ f(x) + ∇^T f(x)(y₀ − x) + (m/2)‖y₀ − x‖² = f(x) − (1/(2m))‖∇f(x)‖²,  ∀y ∈ dom f.

When y = x*,

  f(x*) = p* ≥ f(x) − (1/(2m))‖∇f(x)‖².

- This yields the stopping-criterion bound f(x) − p* ≤ (1/(2m))‖∇f(x)‖².

Upper Bound (H(x) ⪯ MI): For any x, y ∈ dom f(x), using similar derivations as for the lower bound, we arrive at

  f(y) ≤ f(x) + ∇^T f(x)(y − x) + (M/2)‖y − x‖².

Minimizing the right-hand side over y (at y₀ = x − (1/M)∇f(x)) and then taking y = x* gives

  f(x*) = p* ≤ f(x) − (1/(2M))‖∇f(x)‖².

Convergence of GD using exact line search

For the exact line search, let us use the second-order approximation of f(x^(k+1)):

  f(x^(k+1)) = f(x^(k) − α∇f(x^(k)))
             ≈ f(x^(k)) − α ∇^T f(x^(k)) ∇f(x^(k)) + (α²/2) ∇^T f(x^(k)) H(x^(k)) ∇f(x^(k)).

This expression is quadratic in α; normally the exact line search solution α′ which minimizes it is

  α′ = ∇^T f(x^(k)) ∇f(x^(k)) / (∇^T f(x^(k)) H(x^(k)) ∇f(x^(k))).

However, for the convergence analysis let us use the upper bound of the second-order approximation,

  f(x^(k) − α∇f(x^(k))) ≤ f(x^(k)) − α‖∇f(x^(k))‖² + (Mα²/2)‖∇f(x^(k))‖²,

and find the α″ that minimizes this upper bound over α. The right-hand side is quadratic in α, hence minimized for α″ = 1/M, with the minimum value f(x^(k)) − (1/(2M))‖∇f(x^(k))‖². Then, for α″,

  f(x^(k+1)) ≤ f(x^(k)) − (1/(2M))‖∇f(x^(k))‖².

Subtracting p* from both sides,

  f(x^(k+1)) − p* ≤ f(x^(k)) − p* − (1/(2M))‖∇f(x^(k))‖².

We know that f(x^(k)) − p* ≤ (1/(2m))‖∇f(x^(k))‖², i.e. ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p*). Substituting this into the inequality above,

  f(x^(k+1)) − p* ≤ (f(x^(k)) − p*) − (m/M)(f(x^(k)) − p*),

or

  (f(x^(k+1)) − p*) / (f(x^(k)) − p*) ≤ 1 − m/M = c ≤ 1  (since m/M ≤ 1).

- The rate of convergence is unity (R = 1): linear convergence.
- The upper limit of the rate constant is c = 1 − m/M.
- If m = M, then c = 0 and convergence occurs in one iteration.
- If m ≪ M, then c → 1 and convergence is slow.

Applying the inequality recursively,

  f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*),

i.e. f(x^(k)) → p* as k → ∞, since 0 ≤ c < 1. Thus, convergence is guaranteed.

Number of steps: (f(x^(k)) − p*) ≤ ε is achieved after at most

  K = log([f(x^(0)) − p*]/ε) / log(1/c)

iterations.
- The numerator is small when the initial point is close to x* (K gets smaller).
- The numerator increases as the required accuracy increases (i.e. as ε decreases; K gets larger).
- The denominator decreases linearly with m/M (the reciprocal of the condition number), since c = 1 − m/M gives log(1/c) = −log(1 − m/M) ≈ m/M (first-order Taylor expansion of the logarithm around 1).
- Well-conditioned Hessian: m/M → 1 ⇒ the denominator is large (K gets smaller).
- Ill-conditioned Hessian: m/M → 0 ⇒ the denominator is small (K gets larger).

Convergence of GD using backtracking line search

The backtracking exit condition

  f(x^(k) − α∇f(x^(k))) ≤ f(x^(k)) − γα‖∇f(x^(k))‖²

is satisfied whenever 0 ≤ α ≤ 1/M, so backtracking line search terminates either with α = 1 or with a value α ≥ β/M. This gives a lower bound on the decrease:
1. f(x^(k+1)) ≤ f(x^(k)) − γ‖∇f(x^(k))‖², if α = 1;
2. f(x^(k+1)) ≤ f(x^(k)) − (βγ/M)‖∇f(x^(k))‖², if α ≥ β/M.

Putting inequalities 1 and 2 together,

  f(x^(k+1)) ≤ f(x^(k)) − min{γ, βγ/M}‖∇f(x^(k))‖².

Similar to the exact line search analysis, subtract p* from both sides:

  f(x^(k+1)) − p* ≤ f(x^(k)) − p* − γ min{1, β/M}‖∇f(x^(k))‖².

Using ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p*),

  (f(x^(k+1)) − p*) / (f(x^(k)) − p*) ≤ 1 − 2mγ min{1, β/M} = c < 1.

Finally,

  f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*).

Thus, k → ∞ ⇒ c^k → 0, so convergence is guaranteed.
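The iteration-count bound K = log([f(x^(0)) − p*]/ε)/log(1/c) with c = 1 − m/M can be evaluated numerically. The initial gap, tolerance, and the two condition numbers below are illustrative values, not from the slides:

```python
import math

# Iteration-count estimate K = ceil( log((f(x0) - p*) / eps) / log(1/c) )
# with rate constant c = 1 - m/M.

def iters_needed(f0_gap, eps, m, M):
    c = 1.0 - m / M
    return math.ceil(math.log(f0_gap / eps) / math.log(1.0 / c))

k_well = iters_needed(f0_gap=100.0, eps=1e-6, m=1.0, M=2.0)      # kappa = 2
k_ill = iters_needed(f0_gap=100.0, eps=1e-6, m=1.0, M=1000.0)    # kappa = 1000
```

With κ = M/m = 2 the bound is a few dozen iterations, while κ = 1000 pushes it past ten thousand, matching the 1/log(1/c) ≈ M/m scaling discussed above.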
- The rate of convergence is unity (R = 1): linear convergence, with rate constant c = 1 − 2mγ min{1, β/M} < 1.

Examples

Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe), Ch. 9.

Example 7: (quadratic problem in R²) [Figures from the reference; replace its γ, α and t with σ, γ and α.]

Example 8: (nonquadratic problem in R²) [Figures from the reference; replace its α and t with γ and α.]

Example 9: (problem in R¹⁰⁰) [Figures from the reference; replace its α and t with γ and α.]

Example 10: (condition number) [Figures from the reference; replace its γ, α and t with σ, γ and α.]
Observations:
- The gradient descent algorithm is simple.
- The gradient descent method often exhibits approximately linear convergence.
- The choice of the backtracking parameters γ and β has a noticeable but not dramatic effect on the convergence. Exact line search sometimes improves the convergence of the gradient method, but the effect is not large (and probably not worth the trouble of implementing the exact line search).
- The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well-conditioned (say, with condition number in the 100s). When the condition number is larger (say, 1000 or more) the gradient method is so slow that it is useless in practice.
- The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or sublevel sets.

Steepest Descent (SD) Method

Preliminary Definitions

Dual Norm: Let ‖·‖ denote any norm on R^N. The dual norm, denoted ‖·‖∗, is the function from R^N to R with values

  ‖x‖∗ = max{y^T x : ‖y‖ ≤ 1}.

This definition also gives a norm: it is convex, as it is the pointwise maximum of convex (in fact, linear) functions y ↦ x^T y, and it is homogeneous of degree 1, that is, ‖αx‖∗ = α‖x‖∗ for every x ∈ R^N and α ≥ 0. By definition of the dual norm,

  x^T y ≤ ‖x‖ · ‖y‖∗.

This can be seen as a generalized version of the Cauchy–Schwarz inequality, which corresponds to the Euclidean norm.
- The norm dual to the Euclidean norm is itself, ‖x‖₂∗ = ‖x‖₂; this comes directly from the Cauchy–Schwarz inequality.
- The norm dual to the L∞-norm is the L1-norm, and vice versa: ‖x‖∞∗ = ‖x‖₁ and ‖x‖₁∗ = ‖x‖∞.
- More generally, the dual of the Lp-norm is the Lq-norm, ‖x‖p∗ = ‖x‖q, where q = p/(p − 1) (equivalently, 1/p + 1/q = 1).
- The dual of the dual norm is the original norm.

Quadratic norm: A generalized quadratic norm of x is defined by

  ‖x‖_P = (x^T P x)^{1/2} = ‖P^{1/2} x‖₂ = ‖Mx‖₂,

where P = M^T M is an N × N symmetric positive definite (SPD) matrix.
- When P = I, the quadratic norm equals the Euclidean norm.
- The dual of the quadratic norm is ‖x‖_P∗ = ‖x‖_Q = (x^T P^{−1} x)^{1/2}, where Q = P^{−1}.

Steepest Descent Method

The first-order Taylor series approximation of f(x^(k) + αd) around x^(k) is

  f(x^(k) + αd) ≈ f(x^(k)) + α ∇^T f(x^(k)) d,

valid for α‖d‖₂ → 0. We want to choose d so that ∇^T f(x^(k)) d is as small (as negative) as possible for maximum descent.
- First normalize d to obtain the normalized steepest descent direction (nsd)

  d_nsd = argmin{∇^T f(x^(k)) d : ‖d‖ = 1},

where ‖·‖ is any norm on R^N. The choice of norm is very important.
- It is also convenient to consider the unnormalized steepest descent direction (sd)

  d_sd = ‖∇f(x)‖∗ d_nsd,

where ‖·‖∗ is the dual norm of ‖·‖.
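The dual-norm facts can be sanity-checked numerically. The sketch below (illustrative, not from the slides) verifies that max{y^T x : ‖y‖∞ ≤ 1} equals ‖x‖₁ for a sample vector, by brute-forcing the corners of the unit ∞-ball (a linear function attains its maximum over a box at a corner):

```python
# Check the dual-norm identity ||x||_{inf,*} = ||x||_1:
# maximize y^T x over ||y||_inf <= 1 and compare with sum |x_i|.

def linf_dual(x):
    # The maximizer is y_i = sign(x_i), feasible since ||y||_inf <= 1.
    return sum(abs(xi) for xi in x)

x = [3.0, -4.0, 1.0]
norm1 = sum(abs(xi) for xi in x)

# Brute force over the 8 corners of the unit inf-ball in R^3.
corners = [(s0, s1, s2) for s0 in (-1, 1) for s1 in (-1, 1) for s2 in (-1, 1)]
best = max(sum(yi * xi for yi, xi in zip(y, x)) for y in corners)
```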
Then, for the steepest descent step, we have

  ∇^T f(x) d_sd = ‖∇f(x)‖∗ ∇^T f(x) d_nsd = −‖∇f(x)‖∗².

Algorithm: Steepest Descent Algorithm
Given a starting point x^(0) ∈ dom f(x),
repeat
  1. Compute the steepest descent direction d_sd^(k).
  2. Line search: choose a stepsize α^(k) via a line search algorithm.
  3. Update: x^(k+1) = x^(k) + α^(k) d_sd^(k).
until the stopping criterion is satisfied.

Steepest Descent for different norms:

- Euclidean norm: As ‖·‖₂∗ = ‖·‖₂ and taking x₀ = x^(k), the steepest descent direction is the negative gradient, i.e.

  d_sd = −∇f(x₀).

For the Euclidean norm, the steepest descent algorithm is the same as the gradient descent algorithm.

- Quadratic norm: For a quadratic norm ‖·‖_P and x₀ = x^(k), the normalized descent direction is given by

  d_nsd = −P^{−1}∇f(x₀) / (∇^T f(x₀) P^{−1} ∇f(x₀))^{1/2},

and, as ‖∇f(x)‖_P∗ = ‖P^{−1/2}∇f(x)‖₂, the unnormalized direction is

  d_sd = −P^{−1}∇f(x₀).

Change of coordinates: Let y = P^{1/2} x, so that ‖x‖_P = ‖y‖₂. Using this change of coordinates, we can solve the original problem of minimizing f(x) by solving the equivalent problem of minimizing the function f̃(y): R^N → R, given by

  f̃(y) = f(P^{−1/2} y) = f(x).

Apply the gradient descent method to f̃(y).
The descent direction at y₀ (with x₀ = P^{−1/2} y₀ for the original problem) is

  d_y = −∇f̃(y₀) = −P^{−1/2} ∇f(P^{−1/2} y₀) = −P^{−1/2} ∇f(x₀).

Then the descent direction for the original problem becomes

  d_x = P^{−1/2} d_y = −P^{−1} ∇f(x₀),

and x* = P^{−1/2} y*. Thus, the steepest descent method in the quadratic norm ‖·‖_P is equivalent to the gradient descent method applied to the problem after the coordinate transformation y = P^{1/2} x.

- L1-norm: For the L1-norm ‖·‖₁ and x₀ = x^(k), the normalized descent direction is given by

  d_nsd = argmin{∇^T f(x₀) d : ‖d‖₁ = 1}.

Let i be any index for which ‖∇f(x₀)‖∞ = |(∇f(x₀))_i| = max_j |(∇f(x₀))_j|. Then a normalized steepest descent direction for the L1-norm is

  d_nsd = −sign(∂f(x₀)/∂x_i) e_i,

where e_i is the i-th standard basis vector (i.e. the coordinate axis direction with the steepest gradient; in the accompanying figure, d_nsd = e₁). The unnormalized steepest descent direction is then

  d_sd = d_nsd ‖∇f(x₀)‖∞ = −(∂f(x₀)/∂x_i) e_i.

The steepest descent algorithm in the L1-norm has a very natural interpretation:
- At each iteration we select the component of ∇f(x₀) with maximum absolute value, and then decrease or increase the corresponding component of x₀, according to the sign of (∇f(x₀))_i.
- The algorithm is sometimes called a coordinate-descent algorithm, since only one component of the variable x^(k) is updated at each iteration.
- This can greatly simplify, or even trivialize, the line search.
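The three steepest descent directions just derived can be written out directly (a sketch under assumptions: 2×2 pure-Python linear algebra, with an illustrative gradient g = (1, −4) and P = diag(2, 1)):

```python
# Unnormalized steepest descent directions for the three norms above:
#   Euclidean:           d = -grad f
#   quadratic ||.||_P:   d = -P^{-1} grad f
#   L1:                  d = -(df/dx_i) e_i, i = argmax |(grad f)_i|

def sd_euclidean(g):
    return [-gi for gi in g]

def sd_quadratic(P, g):
    # 2x2 inverse applied to g, then negated: d = -P^{-1} g
    a, b = P[0]
    c, d = P[1]
    det = a * d - b * c
    return [-(d * g[0] - b * g[1]) / det, -(-c * g[0] + a * g[1]) / det]

def sd_l1(g):
    i = max(range(len(g)), key=lambda j: abs(g[j]))
    d = [0.0] * len(g)
    d[i] = -g[i]                       # move along one coordinate only
    return d

g = [1.0, -4.0]                        # illustrative gradient
d_e = sd_euclidean(g)                  # [-1.0, 4.0]
d_q = sd_quadratic([[2.0, 0.0], [0.0, 1.0]], g)   # [-0.5, 4.0]
d_1 = sd_l1(g)                         # [0.0, 4.0]: coordinate descent step
```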
Choice of norm:
- The choice of norm can dramatically affect the convergence; the condition number of the Hessian should be close to unity for fast convergence.
- Consider the quadratic norm with respect to an SPD matrix P. Performing the change of coordinates y = P^{1/2} x can change the condition number.
- If an approximation of the Hessian at the optimal point, H(x*), is known, then setting P ≈ H(x*) yields

  P^{−1/2} H(x*) P^{−1/2} ≈ I,

resulting in a very low condition number.
- If P is chosen correctly, the ellipsoid E = {x : x^T P x ≤ 1} approximates the cost surface at the point x.
- A correct P will greatly improve the convergence, whereas a wrong choice of P will result in very poor convergence.

Convergence Analysis

(Using backtracking line search.) Any norm can be bounded in terms of the Euclidean norm with a constant η ∈ (0, 1]:

  ‖x‖∗ ≥ η‖x‖₂.

Assuming strongly convex f(x) and using H(x) ⪯ MI,

  f(x^(k) + α d_sd) ≤ f(x^(k)) + α ∇^T f(x^(k)) d_sd + (Mα²/2)‖d_sd‖₂²
                   ≤ f(x^(k)) + α ∇^T f(x^(k)) d_sd + (Mα²/(2η²))‖d_sd‖∗²
                   ≤ f(x^(k)) − α‖∇f(x^(k))‖∗² + (Mα²/(2η²))‖∇f(x^(k))‖∗².

The right-hand side of the inequality is a quadratic function of α with minimum at α = η²/M. Then,

  f(x^(k) + α d_sd) ≤ f(x^(k)) − (η²/(2M))‖∇f(x^(k))‖∗² = f(x^(k)) + (η²/(2M)) ∇^T f(x^(k)) d_sd.

Example 11: A steepest descent example with the L1-norm. [Figure omitted.]
Since γ < 0.5 and −‖∇f(x)‖∗² = ∇^T f(x) d_sd, backtracking line search will return a step α ≥ min{1, βη²/M}; then

  f(x^(k) + α d_sd) ≤ f(x^(k)) − γ min{1, βη²/M}‖∇f(x^(k))‖∗²
                    ≤ f(x^(k)) − γη² min{1, βη²/M}‖∇f(x^(k))‖₂².

Subtracting p* from both sides and using ‖∇f(x^(k))‖₂² ≥ 2m(f(x^(k)) − p*), we have

  (f(x^(k+1)) − p*) / (f(x^(k)) − p*) ≤ 1 − 2mγη² min{1, βη²/M} = c < 1,

i.e. linear convergence. Hence

  f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*),

and as k → ∞, c^k → 0, so convergence is guaranteed.

Example 12: Consider the nonquadratic problem in R² given in Example 8 (replace the reference's α and t with γ and α). [Figures omitted; the case P = I corresponds to gradient descent.]

Conjugate Gradient Method

- Can overcome the slow convergence of the gradient descent algorithm.
- Computational complexity is lower than that of Newton's Method.
- Can be very effective in dealing with general objective functions.

We will first investigate the quadratic problem

  min (1/2) x^T Q x − b^T x,

where Q is SPD, and then extend the solution to the general case by approximation.
Conjugate Directions

- Definition: Given a symmetric matrix Q, two vectors d₁ and d₂ are said to be Q-orthogonal, or conjugate with respect to Q, if
      d₁ᵀ Q d₂ = 0
  - Although it is not required, we will assume that Q is SPD.
  - If Q = I, the above definition becomes the ordinary definition of orthogonality.
- A finite set of non-zero vectors d₀, d₁, ..., d_k is said to be a Q-orthogonal set if
      dᵢᵀ Q dⱼ = 0, ∀i, j : i ≠ j
- Proposition: If Q is SPD and the set of non-zero vectors d₀, d₁, ..., d_k is Q-orthogonal, then these vectors are linearly independent.
  Proof: Assume linear dependence, i.e., suppose there exist αᵢ, i = 0, 1, ..., k, not all zero, such that
      α₀d₀ + α₁d₁ + ··· + α_k d_k = 0
  Multiplying with dᵢᵀQ yields
      α₀ dᵢᵀQd₀ + α₁ dᵢᵀQd₁ + ··· + αᵢ dᵢᵀQdᵢ + ··· + α_k dᵢᵀQd_k = 0
  where every term except αᵢ dᵢᵀQdᵢ vanishes by Q-orthogonality. But dᵢᵀQdᵢ > 0 (Q is PD), so αᵢ must be 0. Repeating for all i gives the contradiction.
- Quadratic Problem:
      min ½ xᵀQx − bᵀx
  If Q is an N × N PD matrix, then we have the unique solution Qx* = b. Let d₀, d₁, ..., d_{N−1} be non-zero Q-orthogonal vectors corresponding to the N × N SPD matrix Q; they are linearly independent.
  - αᵢ can be found from the known vector b and matrix Q once the dᵢ are found.
  - The expansion of x* is the result of an iterative process of N steps where at the i-th step αᵢdᵢ is added.
- Conjugate Direction Theorem: Let {dᵢ}, i = 0, 1, ..., N−1, be a set of non-zero Q-orthogonal vectors.
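Such a set of directions can be built and the proposition checked numerically. A minimal sketch (the random SPD matrix and the Gram–Schmidt sweep in the ⟨u, v⟩_Q = uᵀQv inner product are illustrative assumptions, not from the slides):

```python
import numpy as np

# Sketch: build a Q-orthogonal set by Gram-Schmidt in the Q-inner product
# <u, v>_Q = u^T Q v, then verify the proposition that non-zero Q-orthogonal
# vectors are linearly independent.  Q is a random SPD test matrix.
rng = np.random.default_rng(0)
N = 4
A = rng.standard_normal((N, N))
Q = A @ A.T + N * np.eye(N)                 # random SPD matrix

d = []
for v in np.eye(N):                         # start from the standard basis
    for dj in d:                            # remove Q-components along earlier d's
        v = v - (dj @ Q @ v) / (dj @ Q @ dj) * dj
    d.append(v)
D = np.array(d)

off = max(abs(D[i] @ Q @ D[j]) for i in range(N) for j in range(N) if i != j)
print(off)                                  # ~0: the set is Q-orthogonal
print(np.linalg.matrix_rank(D))             # N: linearly independent
```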
For any x(0) ∈ dom f(x), the sequence {x(k)} generated according to
    x(k+1) = x(k) + α(k) d_k,   α(k) = − (d_kᵀ g(k)) / (d_kᵀ Q d_k)
where g(k) = ∇f(x(k)) = Qx(k) − b is the gradient at x(k), converges to the unique solution x* of Qx* = b after N steps, i.e., x(N) = x*.

- The optimum solution can be expanded as
      x* = α₀d₀ + α₁d₁ + ··· + α_{N−1}d_{N−1}
  We can find the value of the coefficients αᵢ by multiplying the above equation with dᵢᵀQ and using Qx* = b:
      dᵢᵀQx* = αᵢ dᵢᵀQdᵢ  ⇒  αᵢ = dᵢᵀb / (dᵢᵀQdᵢ)
  Finally, the optimum solution is given by
      x* = Σ_{i=0}^{N−1} [dᵢᵀb / (dᵢᵀQdᵢ)] dᵢ
- Proof (of the Conjugate Direction Theorem): Since the d_k are linearly independent, we can write
      x* − x(0) = α(0)d₀ + α(1)d₁ + ··· + α(N−1)d_{N−1}
  for some α(k). We can find α(k) by multiplying with d_kᵀQ:
      α(k) = d_kᵀQ(x* − x(0)) / (d_kᵀQd_k)      (1)
  Now, following the iterative steps from x(0) to x(k),
      x(k) − x(0) = α(0)d₀ + ··· + α(k−1)d_{k−1}
  and, due to Q-orthogonality,
      d_kᵀQ(x(k) − x(0)) = 0      (2)
  Using (1) and (2) we arrive at
      α(k) = d_kᵀQ(x* − x(k)) / (d_kᵀQd_k) = − d_kᵀg(k) / (d_kᵀQd_k)

Descent Properties of the Conjugate Gradient Method

- We define B(k) as the subspace of Rᴺ spanned by {d₀, d₁, ..., d_{k−1}}, i.e.,
      B(k) = span{d₀, d₁, ..., d_{k−1}} ⊆ Rᴺ
  We will show that at each step x(k) minimizes the objective over the k-dimensional linear variety x(0) + B(k).
- Theorem (Expanding Subspace Theorem): Let {dᵢ}, i = 0, ..., N−1, be non-zero Q-orthogonal vectors in Rᴺ. For any x(0) ∈ Rᴺ, the sequence
      x(k+1) = x(k) + α(k)d_k,   α(k) = − d_kᵀg(k) / (d_kᵀQd_k)
  minimizes f(x) = ½xᵀQx − bᵀx on the line x = x(k−1) + αd_{k−1}, −∞ < α < ∞, and on x(0) + B(k).
Proof: Since x(k) ∈ x(0) + B(k), i.e., B(k) contains the line x = x(k−1) + αd_{k−1}, it is enough to show that x(k) minimizes f(x) over x(0) + B(k). Since we assume that f(x) is strictly convex, this condition holds when g(k) is orthogonal to B(k), i.e., when the gradient of f(x) at x(k) is orthogonal to B(k).

- The proof of g(k) ⊥ B(k) is by induction. For k = 0, B(0) = {} (the empty set), so g(0) ⊥ B(0) is trivially true. Now assume that g(k) ⊥ B(k); we show that g(k+1) ⊥ B(k+1). From the definition of g(k) (g(k) = Qx(k) − b), it can be shown that
      g(k+1) = g(k) + α(k) Q d_k
  Hence, by the definition of α(k),
      d_kᵀ g(k+1) = d_kᵀ g(k) + α(k) d_kᵀ Q d_k = 0
  Also, for i < k,
      dᵢᵀ g(k+1) = dᵢᵀ g(k) + α(k) dᵢᵀ Q d_k = 0
  where the first term vanishes by the induction hypothesis and the second by Q-orthogonality.
- Corollary: The gradients g(k), k = 0, 1, ..., N, satisfy dᵢᵀ g(k) = 0 for i < k.
- Expanding subspace: at every iteration d_k increases the dimensionality of B. Since x(k) minimizes f(x) over x(0) + B(k), x(N) is the overall minimum of f(x).

The Conjugate Gradient Method

In the conjugate gradient method, select the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses.

- Conjugate Gradient Algorithm: Start at any x(0) ∈ Rᴺ and define d(0) = −g(0) = b − Qx(0)
  repeat
    1. α(k) = − (d(k)ᵀ g(k)) / (d(k)ᵀ Q d(k))
    2. x(k+1) = x(k) + α(k) d(k)
    3. g(k+1) = Q x(k+1) − b
    4. β(k) = (g(k+1)ᵀ Q d(k)) / (d(k)ᵀ Q d(k))
    5. d(k+1) = −g(k+1) + β(k) d(k)
  until k = N.
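The five steps above translate directly into code. A minimal sketch (the random SPD test matrix, starting point and tolerance are illustrative assumptions, not from the slides):

```python
import numpy as np

# Direct transcription of the slide's Conjugate Gradient Algorithm for
# f(x) = 0.5 x^T Q x - b^T x.  For an SPD Q in R^{NxN} it terminates in at
# most N steps with Q x = b (up to round-off).
def conjugate_gradient(Q, b, x0):
    x = x0.astype(float).copy()
    g = Q @ x - b                              # g(0) = gradient at x(0)
    d = -g                                     # d(0) = -g(0) = b - Q x(0)
    for _ in range(len(b)):                    # at most N iterations
        if np.linalg.norm(g) < 1e-12:          # solution reached early
            break
        alpha = -(d @ g) / (d @ Q @ d)         # step 1
        x = x + alpha * d                      # step 2
        g = Q @ x - b                          # step 3
        beta = (g @ (Q @ d)) / (d @ Q @ d)     # step 4
        d = -g + beta * d                      # step 5
    return x

rng = np.random.default_rng(1)
N = 6
A = rng.standard_normal((N, N))
Q = A @ A.T + N * np.eye(N)                    # SPD test matrix
b = rng.standard_normal(N)
x = conjugate_gradient(Q, b, np.zeros(N))
print(np.linalg.norm(Q @ x - b))               # ~0 after at most N steps
```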
- The algorithm terminates in at most N steps with the exact solution (for the quadratic case).
- The gradient is always linearly independent of all previous direction vectors, i.e., g(k) ⊥ B(k), where B(k) = span{d₀, d₁, ..., d_{k−1}}.
- If the solution is reached before N steps, the gradient is zero.
- Very simple formulas; the computational complexity is only slightly higher than that of the gradient descent algorithm.
- The process makes uniform progress toward the solution at every step. This is important for the nonquadratic case.

Examples

Example 13: Consider the quadratic problem
    min ½ xᵀQx − bᵀx
where
    Q = [3 2; 2 6]  and  b = [2; −8]
The solution is given by x* = Q⁻¹b = [2; −2].

CG Summary

- In theory (with exact arithmetic), CG converges to the solution in N steps.
  - The bad news: due to numerical round-off errors, it can take more than N steps (or fail to converge).
  - The good news: with luck (i.e., a good spectrum of Q), it can reach a good approximate solution in far fewer than N steps.
- Compared to direct (factor–solve) methods, CG is less reliable and data dependent; it often requires a good (problem-dependent) preconditioner.
- But, when it works, it can solve extremely large systems.

Extension to Nonquadratic Problems

- The idea is simple.
We have two loops:
  - The outer loop approximates the problem with a quadratic one.
  - The inner loop runs the conjugate gradient method (CGM) on the approximation, i.e., in the neighbourhood of a point x₀,
      f(x) ≈ f(x₀) + ∇ᵀf(x₀)(x − x₀) + ½ (x − x₀)ᵀ H(x₀) (x − x₀) + residual
    where the residual → 0 near x₀ and the remaining terms form a quadratic function.
- Expanding,
      f(x) ≈ ½ xᵀH(x₀)x + (∇ᵀf(x₀) − x₀ᵀH(x₀)) x + f(x₀) + ½ x₀ᵀH(x₀)x₀ − ∇ᵀf(x₀)x₀
  where the last three terms are independent of x, i.e., constant. Thus,
      min f(x) ≡ min ½ xᵀH(x₀)x + (∇ᵀf(x₀) − x₀ᵀH(x₀)) x ≡ min ½ xᵀQx − bᵀx
- Here,
      Q = H(x₀),   bᵀ = −∇ᵀf(x₀) + x₀ᵀH(x₀)
  The gradient g(k) is
      g(k) = Qx(k) − b = H(x₀)x(k) + ∇f(x₀) − H(x₀)x₀ = ∇f(x(k))  for  x₀ = x(k)
- Nonquadratic Conjugate Gradient Algorithm: Starting at any x(0) ∈ Rᴺ, compute g(0) = ∇f(x(0)) and set d(0) = −g(0)
  repeat
    repeat
      1. α(k) = − (d(k)ᵀ g(k)) / (d(k)ᵀ H(x(k)) d(k))
      2. x(k+1) = x(k) + α(k) d(k)
      3. g(k+1) = ∇f(x(k+1))
      4. β(k) = (g(k+1)ᵀ H(x(k)) d(k)) / (d(k)ᵀ H(x(k)) d(k))
      5. d(k+1) = −g(k+1) + β(k) d(k)
    until k = N; the new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0)
  until the stopping criterion is satisfied.
- No line search is required.
- However, H(x(k)) must be evaluated at each point, which can be impractical.
- The algorithm may not be globally convergent.
- The involvement of H(x(k)) can be avoided by employing a line search algorithm for α(k) and slightly modifying β(k).

- Nonquadratic Conjugate Gradient Algorithm with Line Search: Starting at any x(0) ∈ Rᴺ, compute g(0) = ∇f(x(0)) and set d(0) = −g(0)
  repeat
    repeat
      1. Line search: α(k) = argmin_α f(x(k) + αd(k))
      2. Update: x(k+1) = x(k) + α(k) d(k)
      3. Gradient: g(k+1) = ∇f(x(k+1))
      4. Use the
         Fletcher–Reeves method: β(k) = (g(k+1)ᵀ g(k+1)) / (g(k)ᵀ g(k)), or the
         Polak–Ribière method: β(k) = ((g(k+1) − g(k))ᵀ g(k+1)) / (g(k)ᵀ g(k))
      5. d(k+1) = −g(k+1) + β(k) d(k)
    until k = N; the new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0)
  until the stopping criterion is satisfied.

Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) a complicated function with many local minima and maxima; (b) convergence path of Fletcher–Reeves CG (unlike linear CG, convergence does not occur in two steps); (c) cross-section of the surface corresponding to the first line search; (d) convergence path of Polak–Ribière CG.

- The Polak–Ribière method can be superior to the Fletcher–Reeves method.
- Global convergence of the line search methods is established by noting that a gradient descent step is taken every N steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully decrease it, global convergence is guaranteed.
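A compact sketch of the line-search variant with both β formulas. Everything concrete here is an assumption for illustration: the separable convex test function, the backtracking line search with constants 0.3 and 0.5, the restart length, and the steepest-descent safeguard (a common practical addition, not on the slides):

```python
import numpy as np

def f(x):    return np.sum(x**4) / 4 + np.sum(x**2) / 2   # convex test function
def grad(x): return x**3 + x                              # its gradient

def nonlinear_cg(x, method="PR", tol=1e-8, max_outer=200):
    N = len(x)
    for _ in range(max_outer):            # outer loop: restart every N steps
        g = grad(x)
        d = -g                            # d(0) = -g(0)
        for _ in range(N):                # inner loop
            if np.linalg.norm(g) < tol:
                return x
            if g @ d >= 0:                # safeguard: keep a descent direction
                d = -g
            alpha = 1.0                   # backtracking line search
            while f(x + alpha * d) > f(x) + 0.3 * alpha * (g @ d):
                alpha *= 0.5
            x = x + alpha * d
            g_new = grad(x)
            if method == "FR":            # Fletcher-Reeves
                beta_k = (g_new @ g_new) / (g @ g)
            else:                         # Polak-Ribiere
                beta_k = ((g_new - g) @ g_new) / (g @ g)
            d = -g_new + beta_k * d
            g = g_new
    return x

x_star = nonlinear_cg(np.array([1.0, -2.0, 0.5, 3.0, -1.0]))
print(x_star)        # approaches the minimizer at the origin
```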
Newton's Method (NA) — The Newton Step

- In Newton's Method, local quadratic approximations of f(x) are utilized. Starting with the second-order Taylor approximation around x(k),
      f(x(k+1)) ≈ f̂(x(k+1)) = f(x(k)) + ∇ᵀf(x(k)) ∆x + ½ ∆xᵀ H(x(k)) ∆x
  where ∆x = x(k+1) − x(k), find ∆x = ∆x_nt such that f̂(x(k+1)) is minimized.
- The optimum step of the quadratic approximation (found by solving ∂f̂(x(k+1))/∂∆x = 0) is
      ∆x_nt = −H⁻¹(x(k)) ∇f(x(k))
  which is called the Newton step. It is a descent direction, i.e.,
      ∇ᵀf(x(k)) ∆x_nt = −∇ᵀf(x(k)) H⁻¹(x(k)) ∇f(x(k)) < 0
- Then x(k+1) = x(k) + ∆x_nt.

Interpretation of the Newton Step

1. Minimizer of the second-order approximation
- As given above, ∆x_nt minimizes f̂(x(k+1)), i.e., the quadratic approximation of f(x) in the neighbourhood of x(k).
- If f(x) is quadratic, then x(0) + ∆x_nt is the exact minimizer of f(x) and the algorithm terminates in a single step with the exact answer.
- If f(x) is nearly quadratic, then x + ∆x_nt is a very good estimate of the minimizer x*.

2. Steepest descent direction in the Hessian norm
- The Newton step is the steepest descent direction at x(k) in the quadratic norm defined by the Hessian,
      ‖v‖_H(x(k)) = (vᵀ H(x(k)) v)^(1/2)
- In the steepest descent method, the quadratic norm ‖·‖_P can significantly increase the speed of convergence by decreasing the condition number. In the neighbourhood of x*, P = H(x*) is a very good choice.
- In Newton's method, when x is near x*, we have H(x) ≈ H(x*).
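Interpretation 1 can be checked numerically: for an exactly quadratic f, a single Newton step lands on the minimizer. The SPD matrix, vector and starting point below are arbitrary illustrative data, not from the slides.

```python
import numpy as np

# For the quadratic f(x) = 0.5 x^T Q x - b^T x we have grad f = Qx - b and
# H = Q, so one Newton step from any x reaches the minimizer x* = Q^{-1} b.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])                    # SPD Hessian
b = np.array([1.0, 2.0])

x = np.array([10.0, -7.0])                    # arbitrary starting point
dx_nt = -np.linalg.solve(Q, Q @ x - b)        # Newton step -H^{-1} grad f
x_new = x + dx_nt
print(np.linalg.norm(x_new - np.linalg.solve(Q, b)))   # 0 up to round-off
```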
- For twice differentiable f(x), the quadratic approximation is very accurate in the neighbourhood of x*, i.e., when x is very close to x*, the point x + ∆x is a very good estimate of x*.

3. Solution of the linearized optimality condition
- The first-order optimality condition is ∇f(x*) = 0. Near x*, using the first-order Taylor approximation for ∇f(x + ∆x),
      ∇f(x + ∆x) ≈ ∇f(x) + H(x) ∆x = 0
  with the solution ∆x = ∆x_nt = −H⁻¹(x) ∇f(x).

The Newton Decrement

- The norm of the Newton step in the quadratic norm defined by H(x) is called the Newton decrement,
      λ(x) = ‖∆x_nt‖_H(x) = (∆x_ntᵀ H(x) ∆x_nt)^(1/2)
- It can be used as a stopping criterion since it gives an estimate of f(x) − p*, i.e.,
      f(x) − inf_y f̂(y) = f(x) − f̂(x + ∆x_nt) = ½ λ²(x)
  where
      f̂(x + ∆x_nt) = f(x) + ∇ᵀf(x) ∆x_nt + ½ ∆x_ntᵀ H(x) ∆x_nt
  is the second-order quadratic approximation of f(x) at x.
- Substituting ∆x_nt = −H⁻¹(x) ∇f(x) into f̂(x + ∆x_nt) gives
      f(x) − inf_y f̂(y) = ½ ∇ᵀf(x) H⁻¹(x) ∇f(x) = ½ λ²(x)
- So, if λ²(x)/2 < ε, the algorithm can be terminated for some small ε.
- With the substitution ∆x_nt = −H⁻¹(x) ∇f(x), the Newton decrement can also be written as
      λ(x(k)) = (∇ᵀf(x(k)) H⁻¹(x(k)) ∇f(x(k)))^(1/2)

Newton's Method

- Given a starting point x(0) ∈ dom f(x) and some small tolerance ε > 0
  repeat
    1. Compute the Newton step and Newton decrement:
        ∆x(k) = −H⁻¹(x(k)) ∇f(x(k))
        λ²(x(k)) = ∇ᵀf(x(k)) H⁻¹(x(k)) ∇f(x(k))
    2. Stopping criterion: quit if λ²(x(k))/2 ≤ ε.
    3.
Line search: Choose a stepsize α(k) > 0, e.g., by backtracking line search.
    4. Update: x(k+1) = x(k) + α(k) ∆x(k).

- The stepsize α(k) (i.e., line search) is required for the non-quadratic initial parts of the algorithm. Otherwise, the algorithm may not converge due to large higher-order residuals.
- As x(k) gets closer to x*, f(x) can be better approximated by the second-order expansion. Hence, the stepsize α(k) is no longer required; the line search algorithm will automatically set α(k) = 1.
- If we start with α(k) = 1 and keep it the same, then the algorithm is called the pure Newton's method.
- For an arbitrary f(x), there are two regions of convergence:
  - the damped Newton phase, when x is far from x*;
  - the quadratically convergent phase, when x gets closer to x*.
- If we let H(x) = I, the algorithm reduces to gradient descent (GD),
      x(k+1) = x(k) − α(k) ∇f(x(k))
- If H(x) is not positive definite, Newton's method will not converge. So, use (aI + H(x))⁻¹ instead of H⁻¹(x), also known as (a.k.a.) the Marquardt method. There always exists an a which will make the matrix (aI + H(x)) positive definite.
a is a trade-off between GD and NA:
  - a → ∞ ⇒ Gradient Descent (GD)
  - a → 0 ⇒ Newton's Method (NA)

- The Newton step and decrement are independent of affine transformations (i.e., linear coordinate transformations): for a non-singular T ∈ R^(N×N), let
      x = Ty  and  f̃(y) = f(Ty)
  then
      ∇f̃(y) = Tᵀ ∇f(x),   H̃(y) = Tᵀ H(x) T
- So, the Newton step will be
      ∆y_nt = −H̃⁻¹(y) ∇f̃(y) = −(Tᵀ H(x) T)⁻¹ Tᵀ ∇f(x) = −T⁻¹ H⁻¹(x) ∇f(x) = T⁻¹ ∆x_nt
  i.e., x + ∆x_nt = T(y + ∆y_nt), ∀x.
- Similarly, the Newton decrement will be
      λ(y) = (∇ᵀf̃(y) H̃⁻¹(y) ∇f̃(y))^(1/2) = (∇ᵀf(x) T (Tᵀ H(x) T)⁻¹ Tᵀ ∇f(x))^(1/2) = (∇ᵀf(x) H⁻¹(x) ∇f(x))^(1/2) = λ(x)
- Thus, Newton's Method is independent of affine transformations (i.e., linear coordinate transformations).

Convergence Analysis

Read Boyd Sect. 9.5.3.
- Assume a strongly convex f(x) with mI ⪯ H(x) for a constant m > 0, ∀x ∈ dom f(x), and H(x) Lipschitz continuous on dom f(x), i.e.,
      ‖H(x) − H(y)‖₂ ≤ L ‖x − y‖₂
  for a constant L > 0.
- If L is small, f(x) is closer to a quadratic function.
- Convergence: there exist constants η ∈ (0, m²/L] and σ > 0 such that the iterations split into two phases:
  - Damped Newton phase: ‖∇f(x)‖₂ ≥ η. This phase ends after at most (f(x(0)) − p*)/σ iterations.
  - Quadratically convergent phase: ‖∇f(x)‖₂ < η. All iterations use α = 1 (i.e., the quadratic approximation fits very well).
If L is large, f(x) is far from a quadratic function; if L = 0, then f(x) is quadratic. Thus, L measures how well f(x) can be approximated by a quadratic function, and the Lipschitz inequality imposes a bound on the third derivative of f(x). Newton's Method will perform well for small L.
- In the damped Newton phase:
  - α < 1 gives better solutions, so most iterations will require a line search, e.g., backtracking.
  - As k increases, the function value decreases by at least σ per iteration, but not necessarily quadratically.
- In the quadratically convergent phase:
      ‖∇f(x(k+1))‖₂ ≤ (L/(2m²)) ‖∇f(x(k))‖₂²
  i.e., quadratic convergence.
- For small ε > 0, f(x) − p* < ε is achieved after at most log₂ log₂ (ε₀/ε) iterations of this phase, where ε₀ = 2m³/L². This is typically 5–6 iterations.
- The total number of iterations is bounded above by
      (f(x(0)) − p*)/σ + log₂ log₂ (ε₀/ε)
- σ and ε₀ are dependent on m, L and x(0).

NA Summary

- Convergence of Newton's method is rapid in general, and quadratic near x*. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.
- Newton's method is affine invariant. It is insensitive to the choice of coordinates, or to the condition number of the sublevel sets of the objective.
- Newton's method scales well with problem size. Ignoring the computation of the Hessian, its performance on problems in R^10000 is similar to its performance on problems in R^10, with only a modest increase in the number of steps required.
- The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.
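The two-phase behaviour can be sketched with a minimal damped Newton implementation using backtracking line search. The smooth convex test function f(x) = Σᵢ (e^{xᵢ} + e^{−xᵢ}) (minimum at the origin), the tolerance and the backtracking constants below are illustrative assumptions, not from the slides:

```python
import numpy as np

# Damped Newton sketch: Newton step, Newton-decrement stopping criterion,
# backtracking line search.  Test function f(x) = sum(exp(x_i) + exp(-x_i)).
def f(x):    return np.sum(np.exp(x) + np.exp(-x))
def grad(x): return np.exp(x) - np.exp(-x)
def hess(x): return np.diag(np.exp(x) + np.exp(-x))

def newton(x, eps=1e-10, gamma=0.3, beta=0.5):
    for k in range(100):
        g = grad(x)
        dx = -np.linalg.solve(hess(x), g)      # Newton step -H^{-1} grad f
        lam2 = -(g @ dx)                       # Newton decrement squared
        if lam2 / 2 <= eps:                    # stopping criterion
            return x, k
        alpha = 1.0                            # backtracking from alpha = 1
        while f(x + alpha * dx) > f(x) + gamma * alpha * (g @ dx):
            alpha *= beta
        x = x + alpha * dx
    return x, 100

x_star, iters = newton(np.array([2.0, -3.0, 1.0]))
print(iters)    # a handful of iterations (damped + quadratic phase)
```

Once the quadratically convergent phase is entered, the remaining iterations roughly square the accuracy each step, so the total count stays small.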
- The main disadvantage of Newton's method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.
- Other alternatives (called quasi-Newton methods) are also provided by a family of algorithms for unconstrained optimization. These methods require less computational effort to form the search direction, but they share some of the strong advantages of Newton methods, such as rapid convergence near x*.

Examples

Example 15: Consider the nonquadratic problem in R² given in Example 8 and Example 12 (replace α and t with γ and α).

Example 16: Consider the nonquadratic problem in R^100 given in Example 9 (replace α and t with γ and α).

Example 17: A problem in R^10000 (replace α and t with γ and α).

Approximation of the Hessian

For relatively large scale problems, i.e., when N is large, calculating the inverse of the Hessian at each iteration can be costly.
So, we may use some approximation of the Hessian,
    S(x) = Ĥ⁻¹(x) → H⁻¹(x)
    x(k+1) = x(k) − α(k) S(x(k)) ∇f(x(k))

1. Hybrid GD + NA
We know that the first phase of Newton's Algorithm (NA) is not very fast. Therefore, we can first run GD, which has considerably lower complexity, and after some conditions are satisfied, switch to NA. Newton's Algorithm may not converge for highly non-quadratic functions unless x is close to x*; the hybrid method below also guarantees global convergence.

- Hybrid Algorithm:
  - Start at x(0) ∈ dom f(x)
    repeat
      run GD (i.e., S(x(k)) = I)
    until the stopping criterion is satisfied
  - Start at the final point of GD
    repeat
      run NA with the exact H(x) (i.e., S(x(k)) = H⁻¹(x(k)))
    until the stopping criterion is satisfied

2. The Chord Method
If f(x) is close to a quadratic function, we may use S(x(k)) = H⁻¹(x(0)) throughout the iterations, i.e.,
    ∆x(k) = −H⁻¹(x(0)) ∇f(x(k))
    x(k+1) = x(k) + ∆x(k)
This is also the same as the SD algorithm with P = H(x(0)) and α(k) = 1.

3. The Shamanski Method
Updating the Hessian every N iterations may give better performance, i.e.,
    S(x(k)) = H⁻¹(x(⌊k/N⌋N))
    ∆x(k) = −H⁻¹(x(⌊k/N⌋N)) ∇f(x(k))
    x(k+1) = x(k) + ∆x(k)
This is a trade-off between the Chord method (N ← ∞) and the full NA (N ← 1).
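The Chord method amounts to forming and inverting the Hessian once, at x(0), and reusing it. A sketch on an illustrative convex test function (the function, starting point and tolerance are assumptions, not from the slides); convergence is linear rather than quadratic, but each iteration is cheap:

```python
import numpy as np

# Chord-method sketch: S(x(k)) = H^{-1}(x(0)) for all k, alpha = 1.
# Test function f(x) = sum(exp(x_i) + exp(-x_i)), minimum at the origin.
def grad(x): return np.exp(x) - np.exp(-x)
def hess(x): return np.diag(np.exp(x) + np.exp(-x))

x = np.array([1.0, -0.5])
H0_inv = np.linalg.inv(hess(x))        # invert once, at x(0)
for k in range(200):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    x = x - H0_inv @ g                 # x(k+1) = x(k) - H^{-1}(x(0)) grad f
print(k, x)                            # linear convergence to the origin
```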
4. Approximating Particular Terms
Inversion of sparse matrices can be easier, i.e., when many entries of H(x) are zero.
- If some entries of H(x) are small, i.e., below a small threshold, set them to zero, obtaining Ĥ(x). Thus, Ĥ(x) becomes sparse.
- In the extreme case, when the Hessian is strongly diagonally dominant, set the off-diagonal terms to zero, obtaining Ĥ(x). Thus, Ĥ(x) becomes diagonal, which is very easy to invert.

There are also other, more advanced quasi-Newton (modified Newton) algorithms developed to approximate the inverse of the Hessian, e.g., the Broyden and Davidon–Fletcher–Powell (DFP) methods.
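The diagonal-approximation case can be sketched as follows. The strongly diagonally dominant quadratic test problem and the tolerances are illustrative assumptions; with Ĥ = diag(H), applying Ĥ⁻¹ is just an element-wise division:

```python
import numpy as np

# Diagonal Hessian approximation on f(x) = 0.5 x^T Q x - b^T x (H = Q):
# keep only diag(Q), which is trivially inverted, and iterate
# x(k+1) = x(k) - Hhat^{-1} grad f(x(k)).  Converges for diagonally
# dominant Q (this is a Jacobi-like iteration).
rng = np.random.default_rng(2)
N = 8
Q = 0.1 * rng.standard_normal((N, N))
Q = Q + Q.T + np.diag(10.0 + np.arange(N))   # strongly diagonally dominant SPD
b = rng.standard_normal(N)

x = np.zeros(N)
d_inv = 1.0 / np.diag(Q)                     # inverse of the diagonal H-hat
for k in range(500):
    g = Q @ x - b                            # gradient
    if np.linalg.norm(g) < 1e-10:
        break
    x = x - d_inv * g
print(k, np.linalg.norm(Q @ x - b))          # small residual in few iterations
```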