ELE604/ELE704 Optimization - Hacettepe University, Department of Electrical and Electronic Engineering
Transcript
Contents

Unconstrained Optimization
- Unconstrained Minimization

Descent Methods
- Motivation
- General Descent Method
- Line Search: Exact Line Search, Bisection Algorithm, Backtracking Line Search
- Convergence

Gradient Descent (GD) Method
- Gradient Descent Method
- Convergence Analysis
- Convergence of GD with Exact Line Search
- Convergence of GD with Backtracking Line Search
- Examples

Steepest Descent (SD) Method
- Preliminary Definitions
- Steepest Descent Method
- Steepest Descent for different norms: Euclidean Norm, Quadratic Norm, L1-norm, Choice of norm
- Convergence Analysis
- Examples

Conjugate Gradient (CG) Method
- Introduction
- Conjugate Directions
- Descent Properties of the Conjugate Gradient Method
- The Conjugate Gradient Method
- Extension to Nonquadratic Problems

Newton's Method (NM)
- The Newton Step
- Interpretation of the Newton Step
- The Newton Decrement
- Newton's Method
- Convergence Analysis
- Examples
- Approximation of the Hessian

Unconstrained Optimization
http://www.ee.hacettepe.edu.tr/∼usezen/ele604/
Dr. Umut Sezen & Dr. Cenk Toker
Department of Electrical and Electronic Engineering, Hacettepe University
19-Dec-2014

Unconstrained Minimization

  min_{x ∈ R^N} f(x)

where f(x): R^N → R is twice differentiable. Necessary conditions:
- The problem is solvable, i.e. a finite optimal point x* exists.
- The optimal value (finite) is given by p* = inf_x f(x) = f(x*) (> −∞).

Example 1: Quadratic program. The aim is

  min f(x) = (1/2) x^T Q x − b^T x + c

where Q ∈ R^{N×N} is symmetric, b ∈ R^N and c ∈ R. The optimality conditions are

  ∇f(x*) = Q x* − b = 0,  H(x*) = Q ⪰ 0 (PSD).

- Q ≻ 0 ⇒ x* = Q^{−1} b is the unique global minimum.
- Q ⪰ 0 but singular ⇒ either no solution or an infinite number of solutions.
- Q ≺ 0 ⇒ f(x) has no local minimum.
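These cases can be checked numerically. Below is a minimal pure-Python sketch (not part of the lecture; the 2×2 definiteness test and the example values Q = diag(2, 1), b = (1, 0)^T are illustrative assumptions):

```python
# Classify and solve the quadratic program f(x) = 0.5 x^T Q x - b^T x + c
# for a symmetric 2x2 Q, using the optimality condition Q x* = b.

def solve_qp_2x2(Q, b):
    """Return ('unique', x*) when Q is PD, otherwise a diagnostic label."""
    a, off = Q[0][0], Q[0][1]
    d = Q[1][1]
    det = a * d - off * off          # symmetric: Q[1][0] == Q[0][1]
    # A symmetric 2x2 matrix is PD iff det > 0 and trace > 0.
    if det > 0 and a + d > 0:
        x1 = (d * b[0] - off * b[1]) / det     # x* = Q^{-1} b
        x2 = (-off * b[0] + a * b[1]) / det
        return ('unique', [x1, x2])
    if det < 0:
        return ('indefinite: no minimum', None)
    return ('singular or negative: none or infinitely many solutions', None)

# Example values (Q = diag(2, 1), b = (1, 0)^T): Q x* = b gives x* = (0.5, 0).
status, xqp = solve_qp_2x2([[2.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

For a symmetric 2×2 matrix, positive definiteness is equivalent to a positive determinant together with a positive trace, which is what the sketch tests before inverting Q.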
Example 2: Consider

  min_{x1, x2 ∈ R} f(x1, x2) = (1/2)(α x1² + β x2²) − x1.

Let us first express the above in quadratic-program form, with

  Q = [ α  −γ ;  γ  β ],   b = [ 1 ; 0 ],

where γ ∈ R; for simplicity we can take γ = 0. So,
- If α > 0 and β > 0 (Q ≻ 0), x* = (1/α, 0) is the unique global minimum.
- If α > 0 and β = 0 (Q ⪰ 0), there are an infinite number of solutions, (1/α, y), y ∈ R.
- If α = 0 and β > 0 (Q ⪰ 0), there is no solution.
- If α < 0 and β > 0 (or α > 0 and β < 0), Q is indefinite: no solution.

[Surface plots of f(x1, x2) for the four cases α > 0, β > 0; α > 0, β = 0; α = 0, β > 0; and α > 0, β < 0.]

Descent Methods

Motivation

Two possibilities:
- {f(x) : x ∈ X} is unbounded below ⇒ no optimal solution.
- {f(x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided ‖x‖ ≠ ∞.

Then, unconstrained minimization methods
- produce a sequence of points x^(k) ∈ dom f(x), k = 0, 1, . . ., with f(x^(k)) → p*;
- can be interpreted as iterative methods for solving the optimality condition ∇f(x*) = 0.

If ∇f(x) ≠ 0, there is an interval (0, δ) of stepsizes such that

  f(x − α∇f(x)) < f(x)  ∀α ∈ (0, δ).

More generally, if d makes an angle with ∇f(x) that is greater than 90°, i.e.

  ∇^T f(x) d < 0,

there exists an interval (0, δ) of stepsizes such that f(x + αd) < f(x) for all α ∈ (0, δ).

General Descent Method

Definition: The descent direction d is selected such that ∇^T f(x) d < 0.

Proposition: For a descent method, f(x^(k+1)) < f(x^(k)), except when x^(k) = x*.

Definition: A minimizing sequence is defined as

  x^(k+1) = x^(k) + α^(k) d^(k)

where the scalar α^(k) ∈ (0, δ) is the stepsize (or step length) at iteration k, and d^(k) ∈ R^N is the step or search direction.
- How to find the optimum α^(k)? A line search algorithm.
- How to find the optimum d^(k)? Depends on the descent algorithm, e.g. d^(k) = −∇f(x^(k)).

Algorithm: General Descent Method
Given a starting point x^(0) ∈ dom f(x),
repeat
  1. Determine a descent direction d^(k).
  2. Line search: choose a stepsize α^(k) > 0.
  3. Update: x^(k+1) = x^(k) + α^(k) d^(k).
until the stopping criterion is satisfied.

Example 3: Simplest method, Gradient Descent:

  x^(k+1) = x^(k) − α^(k) ∇f(x^(k)),  k = 0, 1, . . .

Note that here the descent direction is d^(k) = −∇f(x^(k)).

Example 4: Most sophisticated method, Newton's Method:

  x^(k+1) = x^(k) − α^(k) H^{−1}(x^(k)) ∇f(x^(k)),  k = 0, 1, . . .
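The general descent template can be sketched in code. This is an illustrative sketch only (not from the slides): the direction is the gradient descent choice d^(k) = −∇f(x^(k)) of Example 3, a constant stepsize stands in for the line search step, and the toy objective f(x) = x1² + 2x2² is an assumption:

```python
# General descent method template: direction, stepsize, update, repeated
# until the stopping criterion ||grad f(x)|| (here max-norm) is small.

def descent(grad, x0, alpha=0.1, eps=1e-8, max_iter=10000):
    x = list(x0)
    for _ in range(max_iter):
        d = [-g for g in grad(x)]                        # 1. descent direction
        if max(abs(di) for di in d) < eps:               # stopping criterion
            break
        x = [xi + alpha * di for xi, di in zip(x, d)]    # 3. update step
    return x

# Illustrative objective f(x) = x1^2 + 2 x2^2, with grad f = (2 x1, 4 x2);
# the minimizer is the origin.
xmin = descent(lambda x: [2 * x[0], 4 * x[1]], [1.0, 1.0])
```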
Note that in Newton's method (Example 4) the descent direction is d^(k) = −H^{−1}(x^(k)) ∇f(x^(k)).

Line Search

Suppose f(x) is a continuously differentiable convex function and we want to find

  α^(k) = argmin_α f(x^(k) + α d^(k))

for a given descent direction d^(k). Let

  h(α) = f(x^(k) + α d^(k)),

where h(α): R → R is a convex function of the scalar variable α; then the problem becomes α^(k) = argmin_α h(α). Since h(α) is convex, it has a minimum where

  h′(α^(k)) = ∂h(α^(k))/∂α = 0,

where h′(α) is given by (using the chain rule)

  h′(α) = ∂h(α)/∂α = ∇^T f(x^(k) + α d^(k)) d^(k).

Therefore, since d^(k) is a descent direction (i.e. ∇^T f(x^(k)) d^(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotonically increasing function of α because h(α) is convex. Hence, we search for h′(α^(k)) = 0.

Choice of stepsize:
- Constant stepsize: α^(k) = c (a constant).
- Diminishing stepsize: α^(k) → 0 while satisfying Σ_{k=0}^∞ α^(k) = ∞.
- Exact line search (analytic): α^(k) = argmin_α f(x^(k) + α d^(k)).

Exact line search (for quadratic programs): If f(x) is a quadratic function, then h(α) is also a quadratic function, i.e.

  h(α) = f(x^(k) + α d^(k)) = f(x^(k)) + α ∇^T f(x^(k)) d^(k) + (α²/2) d^(k)T H(x^(k)) d^(k).

The exact line search solution α₀ which minimizes this quadratic, i.e. ∂h(α₀)/∂α = 0, is given by

  α₀ = α^(k) = − ∇^T f(x^(k)) d^(k) / (d^(k)T H(x^(k)) d^(k)).

- If f(x) is a higher-order function, the second-order Taylor series approximation can be used in the exact line search algorithm (which then gives an approximate solution).

Bisection Algorithm: Assume h(α) is convex, so that h′(α) is a monotonically increasing function, and suppose we know a value α̂ such that h′(α̂) > 0.
- Since h′(0) < 0, α̃ = (0 + α̂)/2 is the next test point.
- If h′(α̃) = 0, then α^(k) = α̃ is found (very difficult to achieve exactly).
- If h′(α̃) > 0, narrow the search interval down to (0, α̃).
- If h′(α̃) < 0, narrow the search interval down to (α̃, α̂).

Algorithm:
1. Set k = 0, αℓ = 0 and αu = α̂.
2. Set α̃ = (αℓ + αu)/2 and calculate h′(α̃).
3. If h′(α̃) > 0 ⇒ αu = α̃, k = k + 1, go to step 2.
4. If h′(α̃) < 0 ⇒ αℓ = α̃, k = k + 1, go to step 2.
5. If h′(α̃) = 0 ⇒ stop.

Proposition: After every iteration, the current interval [αℓ, αu] contains α*, where h′(α*) = 0.
Proposition: At the k-th iteration, the length of the current interval is L = (1/2)^k α̂.
Proposition: A value of α such that |α − α*| < ε can be found in at most ⌈log₂(α̂/ε)⌉ steps.

How to find α̂ such that h′(α̂) > 0?
1. Make an initial guess of α̂.
2. If h′(α̂) < 0, set α̂ = 2α̂ and repeat step 2.
3. Otherwise, stop.

Stopping criteria for the Bisection Algorithm (h′(α̃) → 0 only as k → ∞, which may be slow):
1. Stop after k = K iterations (K: user defined).
2. Stop when |αu − αℓ| ≤ ε (ε: user defined).
3. Stop when |h′(α̃)| ≤ ϵ (ϵ: user defined).
In general, the 3rd criterion is the best.

Backtracking line search

For small enough α,

  f(x₀ + αd) ≈ f(x₀) + α ∇^T f(x₀) d < f(x₀) + γα ∇^T f(x₀) d,

where 0 < γ < 0.5, since ∇^T f(x₀) d < 0.

Algorithm: Backtracking line search
Given a descent direction d for f(x) at x₀ ∈ dom f(x):
  α = 1
  while f(x₀ + αd) > f(x₀) + γα ∇^T f(x₀) d
    α = βα
  end
where 0 < γ < 0.5 and 0 < β < 1.

- At each iteration the stepsize α is reduced by the factor β (β ≈ 0.1: coarse search; β ≈ 0.8: fine search).
- The backtracking exit inequality

  f(x₀ + αd) ≤ f(x₀) + γα ∇^T f(x₀) d

holds for α ∈ [0, α₀]. Then the line search stops with a step length
  i. α = 1, if α₀ ≥ 1;
  ii. α ∈ [βα₀, α₀], otherwise.
In other words, the step length obtained by backtracking line search satisfies α ≥ min{1, βα₀}.
- γ can be interpreted as the accepted fraction of the decrease in f(x) predicted by linear extrapolation (typically γ = 0.01 to 0.3, meaning that we accept a decrease in f(x) between 1% and 30%).
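The two line searches above can be sketched as follows (illustrative, not from the slides: h′ is passed in as a callable, the doubling loop finds α̂ with h′(α̂) > 0, and the test objective h(α) = (1 − 2α)² is an assumed example):

```python
# bisect_step: bisection on h'(alpha) after doubling to find alpha_hat with
# h'(alpha_hat) > 0.  backtrack: shrink alpha by beta until
# f(x + alpha d) <= f(x) + gamma * alpha * grad^T f(x) d.

def bisect_step(hprime, eps=1e-10):
    hi = 1.0
    while hprime(hi) < 0:              # find alpha_hat with h'(alpha_hat) > 0
        hi *= 2.0
    lo = 0.0
    while hi - lo > eps:               # interval halves each iteration
        mid = 0.5 * (lo + hi)
        if hprime(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def backtrack(f, fx, slope, x, d, gamma=0.3, beta=0.5):
    # slope = grad^T f(x) d, which is < 0 for a descent direction
    alpha = 1.0
    while f([xi + alpha * di for xi, di in zip(x, d)]) > fx + gamma * alpha * slope:
        alpha *= beta
    return alpha

# h(alpha) = f(x + alpha d) for f(x) = x^2 at x = 1, d = -2 (neg. gradient),
# i.e. h(alpha) = (1 - 2 alpha)^2 with minimizer alpha = 0.5.
a_star = bisect_step(lambda a: 2 * (1 - 2 * a) * (-2))
a_bt = backtrack(lambda z: z[0] ** 2, 1.0, -4.0, [1.0], [-2.0])
```

As stated above, backtracking returns a stepsize α ≥ min{1, βα₀}; it does not try to locate the exact minimizer, only a point of sufficient decrease.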
Convergence

Definition: Let ‖·‖ be a norm on R^N and let {x^(k)}_{k=0}^∞ be a sequence of vectors in R^N. The sequence {x^(k)} is said to converge to a limit x* if

  ∀ε > 0, ∃N_ε ∈ Z₊ : (k ∈ Z₊ and k ≥ N_ε) ⇒ ‖x^(k) − x*‖ < ε.

If the sequence converges to x*, we write lim_{k→∞} x^(k) = x* and call x* the limit of the sequence.
- N_ε may depend on ε.
- For a distance ε, after N_ε iterations all subsequent iterates are within the distance ε of x*.
- This definition does not characterize how fast the convergence is (i.e. the rate of convergence).

Rate of Convergence

Definition: Let ‖·‖ be a norm on R^N. A sequence {x^(k)}_{k=0}^∞ that converges to x* ∈ R^N is said to converge at rate R ∈ R₊₊ with rate constant δ ∈ R₊₊ if

  lim_{k→∞} ‖x^(k+1) − x*‖ / ‖x^(k) − x*‖^R = δ.

- If R = 1 and 0 < δ < 1, the rate is linear.
- If 1 < R < 2 and 0 < δ < ∞, the rate is called super-linear.
- If R = 2 and 0 < δ < ∞, the rate is called quadratic.

The rate of convergence R is sometimes called the asymptotic convergence rate: it may not apply to the early iterates, but applies asymptotically as k → ∞.

Example 5: The sequence {a^k}_{k=0}^∞, 0 < a < 1, converges to 0, with

  lim_{k→∞} |a^(k+1) − 0| / |a^k − 0|^1 = a  ⇒  R = 1, δ = a.

Example 6: The sequence {a^(2^k)}_{k=0}^∞, 0 < a < 1, converges to 0, with

  lim_{k→∞} |a^(2^(k+1)) − 0| / |a^(2^k) − 0|² = 1  ⇒  R = 2, δ = 1.

Gradient Descent Method

The first-order Taylor series expansion at x₀ gives us

  f(x₀ + αd) ≈ f(x₀) + α ∇^T f(x₀) d.

This approximation is valid for α‖d‖ → 0. We want to choose d so that ∇^T f(x₀) d is as small (as negative) as possible for maximum descent. If we normalize d, i.e. take ‖d‖ = 1, then the normalized direction

  d̃ = − ∇f(x₀) / ‖∇f(x₀)‖

makes the smallest inner product with ∇f(x₀). The corresponding unnormalized direction

  d = −∇f(x₀)

is called the direction of gradient descent (GD) at the point x₀.
- d is a descent direction as long as ∇f(x₀) ≠ 0.

Algorithm: Gradient Descent Algorithm
Given a starting point x^(0) ∈ dom f(x),
repeat
  1. d^(k) = −∇f(x^(k)).
  2. Line search: choose a stepsize α^(k) via a line search algorithm.
  3. Update: x^(k+1) = x^(k) + α^(k) d^(k).
until the stopping criterion is satisfied.
- A typical stopping criterion is ‖∇f(x)‖ < ε, with ε small.

Convergence Analysis

Assume the Hessian matrix H(x) is bounded for all x ∈ dom f(x) as follows:
1. mI ⪯ H(x), i.e. (H(x) − mI) ⪰ 0, so y^T H(x) y ≥ m‖y‖² for all y ∈ R^N.
2. H(x) ⪯ MI, i.e. (MI − H(x)) ⪰ 0, so y^T H(x) y ≤ M‖y‖² for all y ∈ R^N.

Note that the condition number of a matrix is given by the ratio of its largest and smallest eigenvalues, e.g.

  κ(H(x)) = max_i λ_i / min_i λ_i = M/m.

Lower Bound (mI ⪯ H(x)): For x, y ∈ dom f(x),

  f(y) = f(x) + ∇^T f(x)(y − x) + (1/2)(y − x)^T H(z)(y − x)

for some z on the line segment [x, y], where H(z) ⪰ mI. Thus,

  f(y) ≥ f(x) + ∇^T f(x)(y − x) + (m/2)‖y − x‖².

- If m = 0, the inequality characterizes convexity.
- If the condition number is close to one, the matrix is well-conditioned, which means its inverse can be computed with good accuracy; if the condition number is large, the matrix is said to be ill-conditioned.
- If m > 0, we obtain a better lower bound for f(y): the right-hand side above is convex in y and is minimized at y₀ = x − (1/m)∇f(x), so

  f(y) ≥ f(x) + ∇^T f(x)(y₀ − x) + (m/2)‖y₀ − x‖² = f(x) − (1/(2m))‖∇f(x)‖²,  ∀y ∈ dom f.

When y = x*,

  f(x*) = p* ≥ f(x) − (1/(2m))‖∇f(x)‖².

- This yields the stopping-criterion bound f(x) − p* ≤ (1/(2m))‖∇f(x)‖².

Upper Bound (H(x) ⪯ MI): For any x, y ∈ dom f(x), using similar derivations as for the lower bound, we arrive at

  f(y) ≤ f(x) + ∇^T f(x)(y − x) + (M/2)‖y − x‖².

Minimizing the right-hand side over y (at y₀ = x − (1/M)∇f(x)) and then taking y = x* gives

  f(x*) = p* ≤ f(x) − (1/(2M))‖∇f(x)‖².

Convergence of GD using exact line search

For the exact line search, let us use the second-order approximation of f(x^(k+1)):

  f(x^(k+1)) = f(x^(k) − α∇f(x^(k)))
             ≈ f(x^(k)) − α ∇^T f(x^(k)) ∇f(x^(k)) + (α²/2) ∇^T f(x^(k)) H(x^(k)) ∇f(x^(k)).

This expression is quadratic in α; normally the exact line search solution α′ which minimizes it is

  α′ = ∇^T f(x^(k)) ∇f(x^(k)) / (∇^T f(x^(k)) H(x^(k)) ∇f(x^(k))).

However, for the convergence analysis let us use the upper bound of the second-order approximation,

  f(x^(k) − α∇f(x^(k))) ≤ f(x^(k)) − α‖∇f(x^(k))‖² + (Mα²/2)‖∇f(x^(k))‖²,

and find the α″ that minimizes this upper bound over α. The right-hand side is quadratic in α, hence minimized for α″ = 1/M, with the minimum value f(x^(k)) − (1/(2M))‖∇f(x^(k))‖². Then, for α″,

  f(x^(k+1)) ≤ f(x^(k)) − (1/(2M))‖∇f(x^(k))‖².

Subtracting p* from both sides,

  f(x^(k+1)) − p* ≤ f(x^(k)) − p* − (1/(2M))‖∇f(x^(k))‖².

We know that f(x^(k)) − p* ≤ (1/(2m))‖∇f(x^(k))‖², i.e. ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p*). Substituting this into the inequality above,

  f(x^(k+1)) − p* ≤ (f(x^(k)) − p*) − (m/M)(f(x^(k)) − p*),

or

  (f(x^(k+1)) − p*) / (f(x^(k)) − p*) ≤ 1 − m/M = c ≤ 1  (since m/M ≤ 1).

- The rate of convergence is unity (R = 1): linear convergence.
- The upper limit of the rate constant is c = 1 − m/M.
- If m = M, then c = 0 and convergence occurs in one iteration.
- If m ≪ M, then c → 1 and convergence is slow.

Applying the inequality recursively,

  f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*),

i.e. f(x^(k)) → p* as k → ∞, since 0 ≤ c < 1. Thus, convergence is guaranteed.

Number of steps: (f(x^(k)) − p*) ≤ ε is achieved after at most

  K = log([f(x^(0)) − p*]/ε) / log(1/c)

iterations.
- The numerator is small when the initial point is close to x* (K gets smaller).
- The numerator increases as the required accuracy increases (i.e. as ε decreases; K gets larger).
- The denominator decreases linearly with m/M (the reciprocal of the condition number), since c = 1 − m/M gives log(1/c) = −log(1 − m/M) ≈ m/M (first-order Taylor expansion of the logarithm around 1).
- Well-conditioned Hessian: m/M → 1 ⇒ the denominator is large (K gets smaller).
- Ill-conditioned Hessian: m/M → 0 ⇒ the denominator is small (K gets larger).

Convergence of GD using backtracking line search

The backtracking exit condition

  f(x^(k) − α∇f(x^(k))) ≤ f(x^(k)) − γα‖∇f(x^(k))‖²

is satisfied whenever 0 ≤ α ≤ 1/M, so backtracking line search terminates either with α = 1 or with a value α ≥ β/M. This gives a lower bound on the decrease:
1. f(x^(k+1)) ≤ f(x^(k)) − γ‖∇f(x^(k))‖², if α = 1;
2. f(x^(k+1)) ≤ f(x^(k)) − (βγ/M)‖∇f(x^(k))‖², if α ≥ β/M.

Putting inequalities 1 and 2 together,

  f(x^(k+1)) ≤ f(x^(k)) − min{γ, βγ/M}‖∇f(x^(k))‖².

Similar to the exact line search analysis, subtract p* from both sides:

  f(x^(k+1)) − p* ≤ f(x^(k)) − p* − γ min{1, β/M}‖∇f(x^(k))‖².

Using ‖∇f(x^(k))‖² ≥ 2m(f(x^(k)) − p*),

  (f(x^(k+1)) − p*) / (f(x^(k)) − p*) ≤ 1 − 2mγ min{1, β/M} = c < 1.

Finally,

  f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*).

Thus, k → ∞ ⇒ c^k → 0, so convergence is guaranteed.
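The iteration-count bound K = log([f(x^(0)) − p*]/ε)/log(1/c) with c = 1 − m/M can be evaluated numerically. The initial gap, tolerance, and the two condition numbers below are illustrative values, not from the slides:

```python
import math

# Iteration-count estimate K = ceil( log((f(x0) - p*) / eps) / log(1/c) )
# with rate constant c = 1 - m/M.

def iters_needed(f0_gap, eps, m, M):
    c = 1.0 - m / M
    return math.ceil(math.log(f0_gap / eps) / math.log(1.0 / c))

k_well = iters_needed(f0_gap=100.0, eps=1e-6, m=1.0, M=2.0)      # kappa = 2
k_ill = iters_needed(f0_gap=100.0, eps=1e-6, m=1.0, M=1000.0)    # kappa = 1000
```

With κ = M/m = 2 the bound is a few dozen iterations, while κ = 1000 pushes it past ten thousand, matching the 1/log(1/c) ≈ M/m scaling discussed above.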
- The rate of convergence is unity (R = 1): linear convergence, with rate constant c = 1 − 2mγ min{1, β/M} < 1.

Examples

Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe), Ch. 9.

Example 7: (quadratic problem in R²) [Figures from the reference; replace its γ, α and t with σ, γ and α.]

Example 8: (nonquadratic problem in R²) [Figures from the reference; replace its α and t with γ and α.]

Example 9: (problem in R¹⁰⁰) [Figures from the reference; replace its α and t with γ and α.]

Example 10: (condition number) [Figures from the reference; replace its γ, α and t with σ, γ and α.]
Observations:
- The gradient descent algorithm is simple.
- The gradient descent method often exhibits approximately linear convergence.
- The choice of the backtracking parameters γ and β has a noticeable but not dramatic effect on the convergence. Exact line search sometimes improves the convergence of the gradient method, but the effect is not large (and probably not worth the trouble of implementing the exact line search).
- The convergence rate depends greatly on the condition number of the Hessian, or of the sublevel sets. Convergence can be very slow, even for problems that are moderately well-conditioned (say, with condition number in the 100s). When the condition number is larger (say, 1000 or more) the gradient method is so slow that it is useless in practice.
- The main advantage of the gradient method is its simplicity. Its main disadvantage is that its convergence rate depends so critically on the condition number of the Hessian or sublevel sets.

Steepest Descent (SD) Method

Preliminary Definitions

Dual Norm: Let ‖·‖ denote any norm on R^N. The dual norm, denoted ‖·‖∗, is the function from R^N to R with values

  ‖x‖∗ = max{y^T x : ‖y‖ ≤ 1}.

This definition also gives a norm: it is convex, as it is the pointwise maximum of convex (in fact, linear) functions y ↦ x^T y, and it is homogeneous of degree 1, that is, ‖αx‖∗ = α‖x‖∗ for every x ∈ R^N and α ≥ 0. By definition of the dual norm,

  x^T y ≤ ‖x‖ · ‖y‖∗.

This can be seen as a generalized version of the Cauchy–Schwarz inequality, which corresponds to the Euclidean norm.
- The norm dual to the Euclidean norm is itself, ‖x‖₂∗ = ‖x‖₂; this comes directly from the Cauchy–Schwarz inequality.
- The norm dual to the L∞-norm is the L1-norm, and vice versa: ‖x‖∞∗ = ‖x‖₁ and ‖x‖₁∗ = ‖x‖∞.
- More generally, the dual of the Lp-norm is the Lq-norm, ‖x‖p∗ = ‖x‖q, where q = p/(p − 1) (equivalently, 1/p + 1/q = 1).
- The dual of the dual norm is the original norm.

Quadratic norm: A generalized quadratic norm of x is defined by

  ‖x‖_P = (x^T P x)^{1/2} = ‖P^{1/2} x‖₂ = ‖Mx‖₂,

where P = M^T M is an N × N symmetric positive definite (SPD) matrix.
- When P = I, the quadratic norm equals the Euclidean norm.
- The dual of the quadratic norm is ‖x‖_P∗ = ‖x‖_Q = (x^T P^{−1} x)^{1/2}, where Q = P^{−1}.

Steepest Descent Method

The first-order Taylor series approximation of f(x^(k) + αd) around x^(k) is

  f(x^(k) + αd) ≈ f(x^(k)) + α ∇^T f(x^(k)) d,

valid for α‖d‖₂ → 0. We want to choose d so that ∇^T f(x^(k)) d is as small (as negative) as possible for maximum descent.
- First normalize d to obtain the normalized steepest descent direction (nsd)

  d_nsd = argmin{∇^T f(x^(k)) d : ‖d‖ = 1},

where ‖·‖ is any norm on R^N. The choice of norm is very important.
- It is also convenient to consider the unnormalized steepest descent direction (sd)

  d_sd = ‖∇f(x)‖∗ d_nsd,

where ‖·‖∗ is the dual norm of ‖·‖.
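The dual-norm facts can be sanity-checked numerically. The sketch below (illustrative, not from the slides) verifies that max{y^T x : ‖y‖∞ ≤ 1} equals ‖x‖₁ for a sample vector, by brute-forcing the corners of the unit ∞-ball (a linear function attains its maximum over a box at a corner):

```python
# Check the dual-norm identity ||x||_{inf,*} = ||x||_1:
# maximize y^T x over ||y||_inf <= 1 and compare with sum |x_i|.

def linf_dual(x):
    # The maximizer is y_i = sign(x_i), feasible since ||y||_inf <= 1.
    return sum(abs(xi) for xi in x)

x = [3.0, -4.0, 1.0]
norm1 = sum(abs(xi) for xi in x)

# Brute force over the 8 corners of the unit inf-ball in R^3.
corners = [(s0, s1, s2) for s0 in (-1, 1) for s1 in (-1, 1) for s2 in (-1, 1)]
best = max(sum(yi * xi for yi, xi in zip(y, x)) for y in corners)
```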
Then, for the steepest descent step, we have

  ∇^T f(x) d_sd = ‖∇f(x)‖∗ ∇^T f(x) d_nsd = −‖∇f(x)‖∗².

Algorithm: Steepest Descent Algorithm
Given a starting point x^(0) ∈ dom f(x),
repeat
  1. Compute the steepest descent direction d_sd^(k).
  2. Line search: choose a stepsize α^(k) via a line search algorithm.
  3. Update: x^(k+1) = x^(k) + α^(k) d_sd^(k).
until the stopping criterion is satisfied.

Steepest Descent for different norms:

- Euclidean norm: As ‖·‖₂∗ = ‖·‖₂ and taking x₀ = x^(k), the steepest descent direction is the negative gradient, i.e.

  d_sd = −∇f(x₀).

For the Euclidean norm, the steepest descent algorithm is the same as the gradient descent algorithm.

- Quadratic norm: For a quadratic norm ‖·‖_P and x₀ = x^(k), the normalized descent direction is given by

  d_nsd = −P^{−1}∇f(x₀) / (∇^T f(x₀) P^{−1} ∇f(x₀))^{1/2},

and, as ‖∇f(x)‖_P∗ = ‖P^{−1/2}∇f(x)‖₂, the unnormalized direction is

  d_sd = −P^{−1}∇f(x₀).

Change of coordinates: Let y = P^{1/2} x, so that ‖x‖_P = ‖y‖₂. Using this change of coordinates, we can solve the original problem of minimizing f(x) by solving the equivalent problem of minimizing the function f̃(y): R^N → R, given by

  f̃(y) = f(P^{−1/2} y) = f(x).

Apply the gradient descent method to f̃(y).
The descent direction at y₀ (with x₀ = P^{−1/2} y₀ for the original problem) is

  d_y = −∇f̃(y₀) = −P^{−1/2} ∇f(P^{−1/2} y₀) = −P^{−1/2} ∇f(x₀).

Then the descent direction for the original problem becomes

  d_x = P^{−1/2} d_y = −P^{−1} ∇f(x₀),

and x* = P^{−1/2} y*. Thus, the steepest descent method in the quadratic norm ‖·‖_P is equivalent to the gradient descent method applied to the problem after the coordinate transformation y = P^{1/2} x.

- L1-norm: For the L1-norm ‖·‖₁ and x₀ = x^(k), the normalized descent direction is given by

  d_nsd = argmin{∇^T f(x₀) d : ‖d‖₁ = 1}.

Let i be any index for which ‖∇f(x₀)‖∞ = |(∇f(x₀))_i| = max_j |(∇f(x₀))_j|. Then a normalized steepest descent direction for the L1-norm is

  d_nsd = −sign(∂f(x₀)/∂x_i) e_i,

where e_i is the i-th standard basis vector (i.e. the coordinate axis direction with the steepest gradient; in the accompanying figure, d_nsd = e₁). The unnormalized steepest descent direction is then

  d_sd = d_nsd ‖∇f(x₀)‖∞ = −(∂f(x₀)/∂x_i) e_i.

The steepest descent algorithm in the L1-norm has a very natural interpretation:
- At each iteration we select the component of ∇f(x₀) with maximum absolute value, and then decrease or increase the corresponding component of x₀, according to the sign of (∇f(x₀))_i.
- The algorithm is sometimes called a coordinate-descent algorithm, since only one component of the variable x^(k) is updated at each iteration.
- This can greatly simplify, or even trivialize, the line search.
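The three steepest descent directions just derived can be written out directly (a sketch under assumptions: 2×2 pure-Python linear algebra, with an illustrative gradient g = (1, −4) and P = diag(2, 1)):

```python
# Unnormalized steepest descent directions for the three norms above:
#   Euclidean:           d = -grad f
#   quadratic ||.||_P:   d = -P^{-1} grad f
#   L1:                  d = -(df/dx_i) e_i, i = argmax |(grad f)_i|

def sd_euclidean(g):
    return [-gi for gi in g]

def sd_quadratic(P, g):
    # 2x2 inverse applied to g, then negated: d = -P^{-1} g
    a, b = P[0]
    c, d = P[1]
    det = a * d - b * c
    return [-(d * g[0] - b * g[1]) / det, -(-c * g[0] + a * g[1]) / det]

def sd_l1(g):
    i = max(range(len(g)), key=lambda j: abs(g[j]))
    d = [0.0] * len(g)
    d[i] = -g[i]                       # move along one coordinate only
    return d

g = [1.0, -4.0]                        # illustrative gradient
d_e = sd_euclidean(g)                  # [-1.0, 4.0]
d_q = sd_quadratic([[2.0, 0.0], [0.0, 1.0]], g)   # [-0.5, 4.0]
d_1 = sd_l1(g)                         # [0.0, 4.0]: coordinate descent step
```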
Choice of norm:
- The choice of norm can dramatically affect the convergence; the condition number of the Hessian should be close to unity for fast convergence.
- Consider the quadratic norm with respect to an SPD matrix P. Performing the change of coordinates y = P^{1/2} x can change the condition number.
- If an approximation of the Hessian at the optimal point, H(x*), is known, then setting P ≈ H(x*) yields

  P^{−1/2} H(x*) P^{−1/2} ≈ I,

resulting in a very low condition number.
- If P is chosen correctly, the ellipsoid E = {x : x^T P x ≤ 1} approximates the cost surface at the point x.
- A correct P will greatly improve the convergence, whereas a wrong choice of P will result in very poor convergence.

Convergence Analysis

(Using backtracking line search.) Any norm can be bounded in terms of the Euclidean norm with a constant η ∈ (0, 1]:

  ‖x‖∗ ≥ η‖x‖₂.

Assuming strongly convex f(x) and using H(x) ⪯ MI,

  f(x^(k) + α d_sd) ≤ f(x^(k)) + α ∇^T f(x^(k)) d_sd + (Mα²/2)‖d_sd‖₂²
                   ≤ f(x^(k)) + α ∇^T f(x^(k)) d_sd + (Mα²/(2η²))‖d_sd‖∗²
                   ≤ f(x^(k)) − α‖∇f(x^(k))‖∗² + (Mα²/(2η²))‖∇f(x^(k))‖∗².

The right-hand side of the inequality is a quadratic function of α with minimum at α = η²/M. Then,

  f(x^(k) + α d_sd) ≤ f(x^(k)) − (η²/(2M))‖∇f(x^(k))‖∗² = f(x^(k)) + (η²/(2M)) ∇^T f(x^(k)) d_sd.

Example 11: A steepest descent example with the L1-norm. [Figure omitted.]
Since γ < 0.5 and −‖∇f(x)‖∗² = ∇^T f(x) d_sd, backtracking line search will return a step α ≥ min{1, βη²/M}; then

  f(x^(k) + α d_sd) ≤ f(x^(k)) − γ min{1, βη²/M}‖∇f(x^(k))‖∗²
                    ≤ f(x^(k)) − γη² min{1, βη²/M}‖∇f(x^(k))‖₂².

Subtracting p* from both sides and using ‖∇f(x^(k))‖₂² ≥ 2m(f(x^(k)) − p*), we have

  (f(x^(k+1)) − p*) / (f(x^(k)) − p*) ≤ 1 − 2mγη² min{1, βη²/M} = c < 1,

i.e. linear convergence. Hence

  f(x^(k)) − p* ≤ c^k (f(x^(0)) − p*),

and as k → ∞, c^k → 0, so convergence is guaranteed.

Example 12: Consider the nonquadratic problem in R² given in Example 8 (replace the reference's α and t with γ and α). [Figures omitted; the case P = I corresponds to gradient descent.]

Conjugate Gradient Method

- Can overcome the slow convergence of the gradient descent algorithm.
- Computational complexity is lower than that of Newton's Method.
- Can be very effective in dealing with general objective functions.

We will first investigate the quadratic problem

  min (1/2) x^T Q x − b^T x,

where Q is SPD, and then extend the solution to the general case by approximation.
Conjugate Directions

- Definition: Given a symmetric matrix Q, two vectors d₁ and d₂ are said to be Q-orthogonal, or conjugate with respect to Q, if
      d₁ᵀ Q d₂ = 0
  - Although it is not required, we will assume that Q is SPD.
  - If Q = I, the above definition becomes the ordinary definition of orthogonality.
- A finite set of non-zero vectors d₀, d₁, ..., d_k is said to be a Q-orthogonal set if
      dᵢᵀ Q dⱼ = 0, ∀i, j : i ≠ j
- Proposition: If Q is SPD and the set of non-zero vectors d₀, d₁, ..., d_k is Q-orthogonal, then these vectors are linearly independent.
  Proof: Assume linear dependence, i.e., suppose there exist αᵢ, i = 0, 1, ..., k, not all zero, such that
      α₀d₀ + α₁d₁ + ··· + α_k d_k = 0
  Multiplying with dᵢᵀQ yields
      α₀ dᵢᵀQd₀ + α₁ dᵢᵀQd₁ + ··· + αᵢ dᵢᵀQdᵢ + ··· + α_k dᵢᵀQd_k = 0
  where every term except αᵢ dᵢᵀQdᵢ vanishes by Q-orthogonality. But dᵢᵀQdᵢ > 0 (Q is PD), so αᵢ must be 0. Repeating for all i gives the contradiction.
- Quadratic Problem:
      min ½ xᵀQx − bᵀx
  If Q is an N × N PD matrix, then we have the unique solution Qx* = b. Let d₀, d₁, ..., d_{N−1} be non-zero Q-orthogonal vectors corresponding to the N × N SPD matrix Q; they are linearly independent.
  - αᵢ can be found from the known vector b and matrix Q once the dᵢ are found.
  - The expansion of x* is the result of an iterative process of N steps where at the i-th step αᵢdᵢ is added.
- Conjugate Direction Theorem: Let {dᵢ}, i = 0, 1, ..., N−1, be a set of non-zero Q-orthogonal vectors.
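Such a set of directions can be built and the proposition checked numerically. A minimal sketch (the random SPD matrix and the Gram–Schmidt sweep in the ⟨u, v⟩_Q = uᵀQv inner product are illustrative assumptions, not from the slides):

```python
import numpy as np

# Sketch: build a Q-orthogonal set by Gram-Schmidt in the Q-inner product
# <u, v>_Q = u^T Q v, then verify the proposition that non-zero Q-orthogonal
# vectors are linearly independent.  Q is a random SPD test matrix.
rng = np.random.default_rng(0)
N = 4
A = rng.standard_normal((N, N))
Q = A @ A.T + N * np.eye(N)                 # random SPD matrix

d = []
for v in np.eye(N):                         # start from the standard basis
    for dj in d:                            # remove Q-components along earlier d's
        v = v - (dj @ Q @ v) / (dj @ Q @ dj) * dj
    d.append(v)
D = np.array(d)

off = max(abs(D[i] @ Q @ D[j]) for i in range(N) for j in range(N) if i != j)
print(off)                                  # ~0: the set is Q-orthogonal
print(np.linalg.matrix_rank(D))             # N: linearly independent
```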
For any x(0) ∈ dom f(x), the sequence {x(k)} generated according to
    x(k+1) = x(k) + α(k) d_k,   α(k) = − (d_kᵀ g(k)) / (d_kᵀ Q d_k)
where g(k) = ∇f(x(k)) = Qx(k) − b is the gradient at x(k), converges to the unique solution x* of Qx* = b after N steps, i.e., x(N) = x*.

- The optimum solution can be expanded as
      x* = α₀d₀ + α₁d₁ + ··· + α_{N−1}d_{N−1}
  We can find the value of the coefficients αᵢ by multiplying the above equation with dᵢᵀQ and using Qx* = b:
      dᵢᵀQx* = αᵢ dᵢᵀQdᵢ  ⇒  αᵢ = dᵢᵀb / (dᵢᵀQdᵢ)
  Finally, the optimum solution is given by
      x* = Σ_{i=0}^{N−1} [dᵢᵀb / (dᵢᵀQdᵢ)] dᵢ
- Proof (of the Conjugate Direction Theorem): Since the d_k are linearly independent, we can write
      x* − x(0) = α(0)d₀ + α(1)d₁ + ··· + α(N−1)d_{N−1}
  for some α(k). We can find α(k) by multiplying with d_kᵀQ:
      α(k) = d_kᵀQ(x* − x(0)) / (d_kᵀQd_k)      (1)
  Now, following the iterative steps from x(0) to x(k),
      x(k) − x(0) = α(0)d₀ + ··· + α(k−1)d_{k−1}
  and, due to Q-orthogonality,
      d_kᵀQ(x(k) − x(0)) = 0      (2)
  Using (1) and (2) we arrive at
      α(k) = d_kᵀQ(x* − x(k)) / (d_kᵀQd_k) = − d_kᵀg(k) / (d_kᵀQd_k)

Descent Properties of the Conjugate Gradient Method

- We define B(k) as the subspace of Rᴺ spanned by {d₀, d₁, ..., d_{k−1}}, i.e.,
      B(k) = span{d₀, d₁, ..., d_{k−1}} ⊆ Rᴺ
  We will show that at each step x(k) minimizes the objective over the k-dimensional linear variety x(0) + B(k).
- Theorem (Expanding Subspace Theorem): Let {dᵢ}, i = 0, ..., N−1, be non-zero Q-orthogonal vectors in Rᴺ. For any x(0) ∈ Rᴺ, the sequence
      x(k+1) = x(k) + α(k)d_k,   α(k) = − d_kᵀg(k) / (d_kᵀQd_k)
  minimizes f(x) = ½xᵀQx − bᵀx on the line x = x(k−1) + αd_{k−1}, −∞ < α < ∞, and on x(0) + B(k).
Proof: Since x(k) ∈ x(0) + B(k), i.e., B(k) contains the line x = x(k−1) + αd_{k−1}, it is enough to show that x(k) minimizes f(x) over x(0) + B(k). Since we assume that f(x) is strictly convex, this condition holds when g(k) is orthogonal to B(k), i.e., when the gradient of f(x) at x(k) is orthogonal to B(k).

- The proof of g(k) ⊥ B(k) is by induction. For k = 0, B(0) = {} (the empty set), so g(0) ⊥ B(0) is trivially true. Now assume that g(k) ⊥ B(k); we show that g(k+1) ⊥ B(k+1). From the definition of g(k) (g(k) = Qx(k) − b), it can be shown that
      g(k+1) = g(k) + α(k) Q d_k
  Hence, by the definition of α(k),
      d_kᵀ g(k+1) = d_kᵀ g(k) + α(k) d_kᵀ Q d_k = 0
  Also, for i < k,
      dᵢᵀ g(k+1) = dᵢᵀ g(k) + α(k) dᵢᵀ Q d_k = 0
  where the first term vanishes by the induction hypothesis and the second by Q-orthogonality.
- Corollary: The gradients g(k), k = 0, 1, ..., N, satisfy dᵢᵀ g(k) = 0 for i < k.
- Expanding subspace: at every iteration d_k increases the dimensionality of B. Since x(k) minimizes f(x) over x(0) + B(k), x(N) is the overall minimum of f(x).

The Conjugate Gradient Method

In the conjugate gradient method, select the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses.

- Conjugate Gradient Algorithm: Start at any x(0) ∈ Rᴺ and define d(0) = −g(0) = b − Qx(0)
  repeat
    1. α(k) = − (d(k)ᵀ g(k)) / (d(k)ᵀ Q d(k))
    2. x(k+1) = x(k) + α(k) d(k)
    3. g(k+1) = Q x(k+1) − b
    4. β(k) = (g(k+1)ᵀ Q d(k)) / (d(k)ᵀ Q d(k))
    5. d(k+1) = −g(k+1) + β(k) d(k)
  until k = N.
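The five steps above translate directly into code. A minimal sketch (the random SPD test matrix, starting point and tolerance are illustrative assumptions, not from the slides):

```python
import numpy as np

# Direct transcription of the slide's Conjugate Gradient Algorithm for
# f(x) = 0.5 x^T Q x - b^T x.  For an SPD Q in R^{NxN} it terminates in at
# most N steps with Q x = b (up to round-off).
def conjugate_gradient(Q, b, x0):
    x = x0.astype(float).copy()
    g = Q @ x - b                              # g(0) = gradient at x(0)
    d = -g                                     # d(0) = -g(0) = b - Q x(0)
    for _ in range(len(b)):                    # at most N iterations
        if np.linalg.norm(g) < 1e-12:          # solution reached early
            break
        alpha = -(d @ g) / (d @ Q @ d)         # step 1
        x = x + alpha * d                      # step 2
        g = Q @ x - b                          # step 3
        beta = (g @ (Q @ d)) / (d @ Q @ d)     # step 4
        d = -g + beta * d                      # step 5
    return x

rng = np.random.default_rng(1)
N = 6
A = rng.standard_normal((N, N))
Q = A @ A.T + N * np.eye(N)                    # SPD test matrix
b = rng.standard_normal(N)
x = conjugate_gradient(Q, b, np.zeros(N))
print(np.linalg.norm(Q @ x - b))               # ~0 after at most N steps
```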
- The algorithm terminates in at most N steps with the exact solution (for the quadratic case).
- The gradient is always linearly independent of all previous direction vectors, i.e., g(k) ⊥ B(k), where B(k) = span{d₀, d₁, ..., d_{k−1}}.
- If the solution is reached before N steps, the gradient is zero.
- Very simple formulas; the computational complexity is only slightly higher than that of the gradient descent algorithm.
- The process makes uniform progress toward the solution at every step. This is important for the nonquadratic case.

Examples

Example 13: Consider the quadratic problem
    min ½ xᵀQx − bᵀx
where
    Q = [3 2; 2 6]  and  b = [2; −8]
The solution is given by x* = Q⁻¹b = [2; −2].

CG Summary

- In theory (with exact arithmetic), CG converges to the solution in N steps.
  - The bad news: due to numerical round-off errors, it can take more than N steps (or fail to converge).
  - The good news: with luck (i.e., a good spectrum of Q), it can reach a good approximate solution in far fewer than N steps.
- Compared to direct (factor–solve) methods, CG is less reliable and data dependent; it often requires a good (problem-dependent) preconditioner.
- But, when it works, it can solve extremely large systems.

Extension to Nonquadratic Problems

- The idea is simple.
We have two loops:
  - The outer loop approximates the problem with a quadratic one.
  - The inner loop runs the conjugate gradient method (CGM) on the approximation, i.e., in the neighbourhood of a point x₀,
      f(x) ≈ f(x₀) + ∇ᵀf(x₀)(x − x₀) + ½ (x − x₀)ᵀ H(x₀) (x − x₀) + residual
    where the residual → 0 near x₀ and the remaining terms form a quadratic function.
- Expanding,
      f(x) ≈ ½ xᵀH(x₀)x + (∇ᵀf(x₀) − x₀ᵀH(x₀)) x + f(x₀) + ½ x₀ᵀH(x₀)x₀ − ∇ᵀf(x₀)x₀
  where the last three terms are independent of x, i.e., constant. Thus,
      min f(x) ≡ min ½ xᵀH(x₀)x + (∇ᵀf(x₀) − x₀ᵀH(x₀)) x ≡ min ½ xᵀQx − bᵀx
- Here,
      Q = H(x₀),   bᵀ = −∇ᵀf(x₀) + x₀ᵀH(x₀)
  The gradient g(k) is
      g(k) = Qx(k) − b = H(x₀)x(k) + ∇f(x₀) − H(x₀)x₀ = ∇f(x(k))  for  x₀ = x(k)
- Nonquadratic Conjugate Gradient Algorithm: Starting at any x(0) ∈ Rᴺ, compute g(0) = ∇f(x(0)) and set d(0) = −g(0)
  repeat
    repeat
      1. α(k) = − (d(k)ᵀ g(k)) / (d(k)ᵀ H(x(k)) d(k))
      2. x(k+1) = x(k) + α(k) d(k)
      3. g(k+1) = ∇f(x(k+1))
      4. β(k) = (g(k+1)ᵀ H(x(k)) d(k)) / (d(k)ᵀ H(x(k)) d(k))
      5. d(k+1) = −g(k+1) + β(k) d(k)
    until k = N; the new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0)
  until the stopping criterion is satisfied.
- No line search is required.
- However, H(x(k)) must be evaluated at each point, which can be impractical.
- The algorithm may not be globally convergent.
- The involvement of H(x(k)) can be avoided by employing a line search algorithm for α(k) and slightly modifying β(k).

- Nonquadratic Conjugate Gradient Algorithm with Line Search: Starting at any x(0) ∈ Rᴺ, compute g(0) = ∇f(x(0)) and set d(0) = −g(0)
  repeat
    repeat
      1. Line search: α(k) = argmin_α f(x(k) + αd(k))
      2. Update: x(k+1) = x(k) + α(k) d(k)
      3. Gradient: g(k+1) = ∇f(x(k+1))
      4. Use the
         Fletcher–Reeves method: β(k) = (g(k+1)ᵀ g(k+1)) / (g(k)ᵀ g(k)), or the
         Polak–Ribière method: β(k) = ((g(k+1) − g(k))ᵀ g(k+1)) / (g(k)ᵀ g(k))
      5. d(k+1) = −g(k+1) + β(k) d(k)
    until k = N; the new starting point is x(0) = x(N), g(0) = ∇f(x(N)) and d(0) = −g(0)
  until the stopping criterion is satisfied.

Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) a complicated function with many local minima and maxima; (b) convergence path of Fletcher–Reeves CG (unlike linear CG, convergence does not occur in two steps); (c) cross-section of the surface corresponding to the first line search; (d) convergence path of Polak–Ribière CG.

- The Polak–Ribière method can be superior to the Fletcher–Reeves method.
- Global convergence of the line search methods is established by noting that a gradient descent step is taken every N steps and serves as a spacer step. Since the other steps do not increase the objective, and in fact hopefully decrease it, global convergence is guaranteed.
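A compact sketch of the line-search variant with both β formulas. Everything concrete here is an assumption for illustration: the separable convex test function, the backtracking line search with constants 0.3 and 0.5, the restart length, and the steepest-descent safeguard (a common practical addition, not on the slides):

```python
import numpy as np

def f(x):    return np.sum(x**4) / 4 + np.sum(x**2) / 2   # convex test function
def grad(x): return x**3 + x                              # its gradient

def nonlinear_cg(x, method="PR", tol=1e-8, max_outer=200):
    N = len(x)
    for _ in range(max_outer):            # outer loop: restart every N steps
        g = grad(x)
        d = -g                            # d(0) = -g(0)
        for _ in range(N):                # inner loop
            if np.linalg.norm(g) < tol:
                return x
            if g @ d >= 0:                # safeguard: keep a descent direction
                d = -g
            alpha = 1.0                   # backtracking line search
            while f(x + alpha * d) > f(x) + 0.3 * alpha * (g @ d):
                alpha *= 0.5
            x = x + alpha * d
            g_new = grad(x)
            if method == "FR":            # Fletcher-Reeves
                beta_k = (g_new @ g_new) / (g @ g)
            else:                         # Polak-Ribiere
                beta_k = ((g_new - g) @ g_new) / (g @ g)
            d = -g_new + beta_k * d
            g = g_new
    return x

x_star = nonlinear_cg(np.array([1.0, -2.0, 0.5, 3.0, -1.0]))
print(x_star)        # approaches the minimizer at the origin
```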
Newton's Method (NA) — The Newton Step

- In Newton's Method, local quadratic approximations of f(x) are utilized. Starting with the second-order Taylor approximation around x(k),
      f(x(k+1)) ≈ f̂(x(k+1)) = f(x(k)) + ∇ᵀf(x(k)) ∆x + ½ ∆xᵀ H(x(k)) ∆x
  where ∆x = x(k+1) − x(k), find ∆x = ∆x_nt such that f̂(x(k+1)) is minimized.
- The optimum step of the quadratic approximation (found by solving ∂f̂(x(k+1))/∂∆x = 0) is
      ∆x_nt = −H⁻¹(x(k)) ∇f(x(k))
  which is called the Newton step. It is a descent direction, i.e.,
      ∇ᵀf(x(k)) ∆x_nt = −∇ᵀf(x(k)) H⁻¹(x(k)) ∇f(x(k)) < 0
- Then x(k+1) = x(k) + ∆x_nt.

Interpretation of the Newton Step

1. Minimizer of the second-order approximation
- As given above, ∆x_nt minimizes f̂(x(k+1)), i.e., the quadratic approximation of f(x) in the neighbourhood of x(k).
- If f(x) is quadratic, then x(0) + ∆x_nt is the exact minimizer of f(x) and the algorithm terminates in a single step with the exact answer.
- If f(x) is nearly quadratic, then x + ∆x_nt is a very good estimate of the minimizer x*.

2. Steepest descent direction in the Hessian norm
- The Newton step is the steepest descent direction at x(k) in the quadratic norm defined by the Hessian,
      ‖v‖_H(x(k)) = (vᵀ H(x(k)) v)^(1/2)
- In the steepest descent method, the quadratic norm ‖·‖_P can significantly increase the speed of convergence by decreasing the condition number. In the neighbourhood of x*, P = H(x*) is a very good choice.
- In Newton's method, when x is near x*, we have H(x) ≈ H(x*).
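Interpretation 1 can be checked numerically: for an exactly quadratic f, a single Newton step lands on the minimizer. The SPD matrix, vector and starting point below are arbitrary illustrative data, not from the slides.

```python
import numpy as np

# For the quadratic f(x) = 0.5 x^T Q x - b^T x we have grad f = Qx - b and
# H = Q, so one Newton step from any x reaches the minimizer x* = Q^{-1} b.
Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])                    # SPD Hessian
b = np.array([1.0, 2.0])

x = np.array([10.0, -7.0])                    # arbitrary starting point
dx_nt = -np.linalg.solve(Q, Q @ x - b)        # Newton step -H^{-1} grad f
x_new = x + dx_nt
print(np.linalg.norm(x_new - np.linalg.solve(Q, b)))   # 0 up to round-off
```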
- For twice differentiable f(x), the quadratic approximation is very accurate in the neighbourhood of x*, i.e., when x is very close to x*, the point x + ∆x is a very good estimate of x*.

3. Solution of the linearized optimality condition
- The first-order optimality condition is ∇f(x*) = 0. Near x*, using the first-order Taylor approximation for ∇f(x + ∆x),
      ∇f(x + ∆x) ≈ ∇f(x) + H(x) ∆x = 0
  with the solution ∆x = ∆x_nt = −H⁻¹(x) ∇f(x).

The Newton Decrement

- The norm of the Newton step in the quadratic norm defined by H(x) is called the Newton decrement,
      λ(x) = ‖∆x_nt‖_H(x) = (∆x_ntᵀ H(x) ∆x_nt)^(1/2)
- It can be used as a stopping criterion since it gives an estimate of f(x) − p*, i.e.,
      f(x) − inf_y f̂(y) = f(x) − f̂(x + ∆x_nt) = ½ λ²(x)
  where
      f̂(x + ∆x_nt) = f(x) + ∇ᵀf(x) ∆x_nt + ½ ∆x_ntᵀ H(x) ∆x_nt
  is the second-order quadratic approximation of f(x) at x.
- Substituting ∆x_nt = −H⁻¹(x) ∇f(x) into f̂(x + ∆x_nt) gives
      f(x) − inf_y f̂(y) = ½ ∇ᵀf(x) H⁻¹(x) ∇f(x) = ½ λ²(x)
- So, if λ²(x)/2 < ε, the algorithm can be terminated for some small ε.
- With the substitution ∆x_nt = −H⁻¹(x) ∇f(x), the Newton decrement can also be written as
      λ(x(k)) = (∇ᵀf(x(k)) H⁻¹(x(k)) ∇f(x(k)))^(1/2)

Newton's Method

- Given a starting point x(0) ∈ dom f(x) and some small tolerance ε > 0
  repeat
    1. Compute the Newton step and Newton decrement:
        ∆x(k) = −H⁻¹(x(k)) ∇f(x(k))
        λ²(x(k)) = ∇ᵀf(x(k)) H⁻¹(x(k)) ∇f(x(k))
    2. Stopping criterion: quit if λ²(x(k))/2 ≤ ε.
    3.
Line search: Choose a stepsize α(k) > 0, e.g., by backtracking line search.
    4. Update: x(k+1) = x(k) + α(k) ∆x(k).

- The stepsize α(k) (i.e., line search) is required for the non-quadratic initial parts of the algorithm. Otherwise, the algorithm may not converge due to large higher-order residuals.
- As x(k) gets closer to x*, f(x) can be better approximated by the second-order expansion. Hence, the stepsize α(k) is no longer required; the line search algorithm will automatically set α(k) = 1.
- If we start with α(k) = 1 and keep it the same, then the algorithm is called the pure Newton's method.
- For an arbitrary f(x), there are two regions of convergence:
  - the damped Newton phase, when x is far from x*;
  - the quadratically convergent phase, when x gets closer to x*.
- If we let H(x) = I, the algorithm reduces to gradient descent (GD),
      x(k+1) = x(k) − α(k) ∇f(x(k))
- If H(x) is not positive definite, Newton's method will not converge. So, use (aI + H(x))⁻¹ instead of H⁻¹(x), also known as (a.k.a.) the Marquardt method. There always exists an a which will make the matrix (aI + H(x)) positive definite.
a is a trade-off between GD and NA:
  - a → ∞ ⇒ Gradient Descent (GD)
  - a → 0 ⇒ Newton's Method (NA)

- The Newton step and decrement are independent of affine transformations (i.e., linear coordinate transformations): for a non-singular T ∈ R^(N×N), let
      x = Ty  and  f̃(y) = f(Ty)
  then
      ∇f̃(y) = Tᵀ ∇f(x),   H̃(y) = Tᵀ H(x) T
- So, the Newton step will be
      ∆y_nt = −H̃⁻¹(y) ∇f̃(y) = −(Tᵀ H(x) T)⁻¹ Tᵀ ∇f(x) = −T⁻¹ H⁻¹(x) ∇f(x) = T⁻¹ ∆x_nt
  i.e., x + ∆x_nt = T(y + ∆y_nt), ∀x.
- Similarly, the Newton decrement will be
      λ(y) = (∇ᵀf̃(y) H̃⁻¹(y) ∇f̃(y))^(1/2) = (∇ᵀf(x) T (Tᵀ H(x) T)⁻¹ Tᵀ ∇f(x))^(1/2) = (∇ᵀf(x) H⁻¹(x) ∇f(x))^(1/2) = λ(x)
- Thus, Newton's Method is independent of affine transformations (i.e., linear coordinate transformations).

Convergence Analysis

Read Boyd Sect. 9.5.3.
- Assume a strongly convex f(x) with mI ⪯ H(x) for a constant m > 0, ∀x ∈ dom f(x), and H(x) Lipschitz continuous on dom f(x), i.e.,
      ‖H(x) − H(y)‖₂ ≤ L ‖x − y‖₂
  for a constant L > 0.
- If L is small, f(x) is closer to a quadratic function.
- Convergence: there exist constants η ∈ (0, m²/L] and σ > 0 such that the iterations split into two phases:
  - Damped Newton phase: ‖∇f(x)‖₂ ≥ η. This phase ends after at most (f(x(0)) − p*)/σ iterations.
  - Quadratically convergent phase: ‖∇f(x)‖₂ < η. All iterations use α = 1 (i.e., the quadratic approximation fits very well).
If L is large, f(x) is far from a quadratic function; if L = 0, then f(x) is quadratic. Thus, L measures how well f(x) can be approximated by a quadratic function, and the Lipschitz inequality imposes a bound on the third derivative of f(x). Newton's Method will perform well for small L.
- In the damped Newton phase:
  - α < 1 gives better solutions, so most iterations will require a line search, e.g., backtracking.
  - As k increases, the function value decreases by at least σ per iteration, but not necessarily quadratically.
- In the quadratically convergent phase:
      ‖∇f(x(k+1))‖₂ ≤ (L/(2m²)) ‖∇f(x(k))‖₂²
  i.e., quadratic convergence.
- For small ε > 0, f(x) − p* < ε is achieved after at most log₂ log₂ (ε₀/ε) iterations of this phase, where ε₀ = 2m³/L². This is typically 5–6 iterations.
- The total number of iterations is bounded above by
      (f(x(0)) − p*)/σ + log₂ log₂ (ε₀/ε)
- σ and ε₀ are dependent on m, L and x(0).

NA Summary

- Convergence of Newton's method is rapid in general, and quadratic near x*. Once the quadratic convergence phase is reached, at most six or so iterations are required to produce a solution of very high accuracy.
- Newton's method is affine invariant. It is insensitive to the choice of coordinates, or to the condition number of the sublevel sets of the objective.
- Newton's method scales well with problem size. Ignoring the computation of the Hessian, its performance on problems in R^10000 is similar to its performance on problems in R^10, with only a modest increase in the number of steps required.
- The good performance of Newton's method is not dependent on the choice of algorithm parameters. In contrast, the choice of norm for steepest descent plays a critical role in its performance.
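The two-phase behaviour can be sketched with a minimal damped Newton implementation using backtracking line search. The smooth convex test function f(x) = Σᵢ (e^{xᵢ} + e^{−xᵢ}) (minimum at the origin), the tolerance and the backtracking constants below are illustrative assumptions, not from the slides:

```python
import numpy as np

# Damped Newton sketch: Newton step, Newton-decrement stopping criterion,
# backtracking line search.  Test function f(x) = sum(exp(x_i) + exp(-x_i)).
def f(x):    return np.sum(np.exp(x) + np.exp(-x))
def grad(x): return np.exp(x) - np.exp(-x)
def hess(x): return np.diag(np.exp(x) + np.exp(-x))

def newton(x, eps=1e-10, gamma=0.3, beta=0.5):
    for k in range(100):
        g = grad(x)
        dx = -np.linalg.solve(hess(x), g)      # Newton step -H^{-1} grad f
        lam2 = -(g @ dx)                       # Newton decrement squared
        if lam2 / 2 <= eps:                    # stopping criterion
            return x, k
        alpha = 1.0                            # backtracking from alpha = 1
        while f(x + alpha * dx) > f(x) + gamma * alpha * (g @ dx):
            alpha *= beta
        x = x + alpha * dx
    return x, 100

x_star, iters = newton(np.array([2.0, -3.0, 1.0]))
print(iters)    # a handful of iterations (damped + quadratic phase)
```

Once the quadratically convergent phase is entered, the remaining iterations roughly square the accuracy each step, so the total count stays small.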
- The main disadvantage of Newton's method is the cost of forming and storing the Hessian, and the cost of computing the Newton step, which requires solving a set of linear equations.
- Other alternatives (called quasi-Newton methods) are also provided by a family of algorithms for unconstrained optimization. These methods require less computational effort to form the search direction, but they share some of the strong advantages of Newton methods, such as rapid convergence near x*.

Examples

Example 15: Consider the nonquadratic problem in R² given in Example 8 and Example 12 (replace α and t with γ and α).

Example 16: Consider the nonquadratic problem in R^100 given in Example 9 (replace α and t with γ and α).

Example 17: A problem in R^10000 (replace α and t with γ and α).

Approximation of the Hessian

For relatively large scale problems, i.e., when N is large, calculating the inverse of the Hessian at each iteration can be costly.
So, we may use some approximation of the Hessian,
    S(x) = Ĥ⁻¹(x) → H⁻¹(x)
    x(k+1) = x(k) − α(k) S(x(k)) ∇f(x(k))

1. Hybrid GD + NA
We know that the first phase of Newton's Algorithm (NA) is not very fast. Therefore, we can first run GD, which has considerably lower complexity, and after some conditions are satisfied, switch to NA. Newton's Algorithm may not converge for highly non-quadratic functions unless x is close to x*; the hybrid method below also guarantees global convergence.

- Hybrid Algorithm:
  - Start at x(0) ∈ dom f(x)
    repeat
      run GD (i.e., S(x(k)) = I)
    until the stopping criterion is satisfied
  - Start at the final point of GD
    repeat
      run NA with the exact H(x) (i.e., S(x(k)) = H⁻¹(x(k)))
    until the stopping criterion is satisfied

2. The Chord Method
If f(x) is close to a quadratic function, we may use S(x(k)) = H⁻¹(x(0)) throughout the iterations, i.e.,
    ∆x(k) = −H⁻¹(x(0)) ∇f(x(k))
    x(k+1) = x(k) + ∆x(k)
This is also the same as the SD algorithm with P = H(x(0)) and α(k) = 1.

3. The Shamanski Method
Updating the Hessian every N iterations may give better performance, i.e.,
    S(x(k)) = H⁻¹(x(⌊k/N⌋N))
    ∆x(k) = −H⁻¹(x(⌊k/N⌋N)) ∇f(x(k))
    x(k+1) = x(k) + ∆x(k)
This is a trade-off between the Chord method (N ← ∞) and the full NA (N ← 1).
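The Chord method amounts to forming and inverting the Hessian once, at x(0), and reusing it. A sketch on an illustrative convex test function (the function, starting point and tolerance are assumptions, not from the slides); convergence is linear rather than quadratic, but each iteration is cheap:

```python
import numpy as np

# Chord-method sketch: S(x(k)) = H^{-1}(x(0)) for all k, alpha = 1.
# Test function f(x) = sum(exp(x_i) + exp(-x_i)), minimum at the origin.
def grad(x): return np.exp(x) - np.exp(-x)
def hess(x): return np.diag(np.exp(x) + np.exp(-x))

x = np.array([1.0, -0.5])
H0_inv = np.linalg.inv(hess(x))        # invert once, at x(0)
for k in range(200):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    x = x - H0_inv @ g                 # x(k+1) = x(k) - H^{-1}(x(0)) grad f
print(k, x)                            # linear convergence to the origin
```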
4. Approximating Particular Terms
Inversion of sparse matrices can be easier, i.e., when many entries of H(x) are zero.
- If some entries of H(x) are small, i.e., below a small threshold, set them to zero, obtaining Ĥ(x). Thus, Ĥ(x) becomes sparse.
- In the extreme case, when the Hessian is strongly diagonally dominant, set the off-diagonal terms to zero, obtaining Ĥ(x). Thus, Ĥ(x) becomes diagonal, which is very easy to invert.

There are also other, more advanced quasi-Newton (modified Newton) algorithms developed to approximate the inverse of the Hessian, e.g., the Broyden and Davidon–Fletcher–Powell (DFP) methods.
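The diagonal-approximation case can be sketched as follows. The strongly diagonally dominant quadratic test problem and the tolerances are illustrative assumptions; with Ĥ = diag(H), applying Ĥ⁻¹ is just an element-wise division:

```python
import numpy as np

# Diagonal Hessian approximation on f(x) = 0.5 x^T Q x - b^T x (H = Q):
# keep only diag(Q), which is trivially inverted, and iterate
# x(k+1) = x(k) - Hhat^{-1} grad f(x(k)).  Converges for diagonally
# dominant Q (this is a Jacobi-like iteration).
rng = np.random.default_rng(2)
N = 8
Q = 0.1 * rng.standard_normal((N, N))
Q = Q + Q.T + np.diag(10.0 + np.arange(N))   # strongly diagonally dominant SPD
b = rng.standard_normal(N)

x = np.zeros(N)
d_inv = 1.0 / np.diag(Q)                     # inverse of the diagonal H-hat
for k in range(500):
    g = Q @ x - b                            # gradient
    if np.linalg.norm(g) < 1e-10:
        break
    x = x - d_inv * g
print(k, np.linalg.norm(Q @ x - b))          # small residual in few iterations
```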