Least-squares-based three-term conjugate gradient methods

In this paper, we first propose a new three-term conjugate gradient (CG) method, named LSTT, which determines its CG parameter by a least-squares technique. We then present two improved variants of the LSTT method, aiming to obtain global convergence for general nonlinear functions. The least-squares technique used here combines the advantages of two existing efficient CG methods. The search directions produced by all three methods are sufficient descent directions, independent of any line search procedure. Moreover, with the Wolfe–Powell line search, LSTT is proved to be globally convergent for uniformly convex functions, and the two improved variants are globally convergent for general nonlinear functions. Preliminary numerical results are reported to illustrate that our methods are efficient and have advantages over two well-known three-term CG methods.


Introduction
Consider the following unconstrained optimization problem:

min_{x ∈ R^n} f(x),

where f : R^n → R is a continuously differentiable function whose gradient function is denoted by g(x).
Conjugate gradient (CG) methods are known to be among the most efficient methods for unconstrained optimization due to their advantages of simple structure, low storage, and nice numerical behavior. CG methods have been widely used to solve practical problems, especially large-scale problems such as image recovery [1], condensed matter physics [2], environmental science [3], and unit commitment problems [4][5][6].
For the current iteration point x_k, the CG methods yield the new iterate x_{k+1} by the formula x_{k+1} = x_k + α_k d_k, where α_k is the stepsize determined by a certain line search and d_k is the so-called search direction, given by

d_k = -g_k for k = 0, and d_k = -g_k + β_k d_{k-1} for k ≥ 1,

in which β_k is a parameter. Different choices of β_k correspond to different CG methods. Some classical and famous formulas for the parameter β_k are

β_k^{HS} = g_k^T y_{k-1}/(d_{k-1}^T y_{k-1}), Hestenes and Stiefel (HS) [7];
β_k^{FR} = ||g_k||²/||g_{k-1}||², Fletcher and Reeves (FR) [8];
β_k^{PRP} = g_k^T y_{k-1}/||g_{k-1}||², Polak, Ribière, and Polyak (PRP) [9, 10];
β_k^{DY} = ||g_k||²/(d_{k-1}^T y_{k-1}), Dai and Yuan (DY) [11],

where g_k = g(x_k), y_{k-1} = g_k - g_{k-1}, and ||·|| denotes the Euclidean norm.
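For concreteness, these classical updates can be sketched in a few lines (a minimal illustration; the function names and the NumPy setting are ours, not from the paper):

```python
import numpy as np

def beta_classical(g, g_prev, d_prev, rule):
    """Classical CG parameters; g and g_prev are the current and previous
    gradients, d_prev the previous direction, and y = g - g_prev."""
    y = g - g_prev
    if rule == "HS":   # Hestenes-Stiefel
        return (g @ y) / (d_prev @ y)
    if rule == "FR":   # Fletcher-Reeves
        return (g @ g) / (g_prev @ g_prev)
    if rule == "PRP":  # Polak-Ribiere-Polyak
        return (g @ y) / (g_prev @ g_prev)
    if rule == "DY":   # Dai-Yuan
        return (g @ g) / (d_prev @ y)
    raise ValueError(f"unknown rule {rule!r}")

def cg_direction(g, d_prev=None, beta=0.0):
    """d_0 = -g_0 and d_k = -g_k + beta_k d_{k-1} for k >= 1."""
    return -g if d_prev is None else -g + beta * d_prev
```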
Here are two commonly used line searches for choosing the stepsize α k .
- The Wolfe–Powell line search: the stepsize α_k satisfies the following two relations:

f(x_k + α_k d_k) - f(x_k) ≤ δ α_k g_k^T d_k (1)

and

g(x_k + α_k d_k)^T d_k ≥ σ g_k^T d_k, (2)

where 0 < δ < σ < 1.
- The strong Wolfe–Powell line search: the stepsize α_k satisfies both (1) and the following relation:

|g(x_k + α_k d_k)^T d_k| ≤ -σ g_k^T d_k.

In recent years, based on the above classical formulas and line searches, many variations of CG methods have been proposed, including spectral CG methods [12, 13], hybrid CG methods [14, 15], and three-term CG methods [16, 17]. Among them, the three-term CG methods seem to attract the most attention, and a great deal of effort has been devoted to developing this kind of method, see, e.g., [18][19][20][21][22][23]. In particular, by combining the PRP method [9, 10] with the BFGS quasi-Newton method [24], Zhang et al. [22] presented a three-term PRP CG method (TTPRP). Their motivation is that the PRP method has good numerical performance but is generally not a descent method when an Armijo-type line search is executed. The direction of TTPRP is given by

d_k = -g_k + β_k^{PRP} d_{k-1} - θ_k^{(1)} y_{k-1}, with θ_k^{(1)} = g_k^T d_{k-1}/||g_{k-1}||², (3)

which is always a descent direction (independent of line searches) for the objective function.
In the same way, Zhang et al. [25] presented a three-term FR CG method (TTFR), whose direction takes an analogous three-term form with θ_k^{(1)} given by (3). Later, Zhang et al. [23] proposed a three-term HS CG method (TTHS) whose direction is defined by

d_k^{TTHS} = -g_k + β_k^{HS} d_{k-1} - θ_k^{(2)} y_{k-1}, with θ_k^{(2)} = g_k^T d_{k-1}/(d_{k-1}^T y_{k-1}). (4)

The above approaches [22, 23, 25] share a common advantage: the relation d_k^T g_k = -||g_k||² holds. This means that they always generate descent directions without the help of line searches. Moreover, they can all achieve global convergence under suitable line searches.

Before putting forward the idea of our new three-term CG methods, we first briefly review a hybrid CG method (HCG) proposed by Babaie-Kafaki and Ghanbari [26], whose search direction d_k^{HCG} takes the classical two-term form with a parameter given by a convex combination of the FR and PRP formulas, with combination coefficient θ_k. It is obvious that the choice of θ_k is very critical for the practical performance of the HCG method. Taking into account that the TTHS method has good theoretical properties and numerical performance, Babaie-Kafaki and Ghanbari [26] proposed to select θ_k such that the direction d_k^{HCG} is as close as possible to d_k^{TTHS}, in the sense that their distance is minimized; i.e., the optimal choice θ_k^* is obtained by solving the least-squares problem

min_{θ_k} ||d_k^{HCG} - d_k^{TTHS}||². (5)

Similarly, Babaie-Kafaki and Ghanbari [27] proposed another hybrid CG method by combining HS with DY, in which the combination coefficient is also determined by the least-squares technique (5). The numerical results in [26, 27] show that this least-squares-based approach is very efficient.

Summarizing the above discussion, we have two observations: (1) the three-term CG methods perform well both theoretically and numerically; (2) the least-squares technique can greatly improve the efficiency of CG methods. Putting these together, the main goal of this paper is to develop new three-term CG methods based on the least-squares technique.
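The descent mechanism of these three-term directions is easy to verify numerically. The sketch below assumes the standard form of the TTHS direction from the literature, d_k = -g_k + β_k^{HS} d_{k-1} - θ_k^{(2)} y_{k-1} with θ_k^{(2)} = g_k^T d_{k-1}/(d_{k-1}^T y_{k-1}); the cross terms cancel, so g_k^T d_k = -||g_k||² for any stepsize:

```python
import numpy as np

def tths_direction(g, g_prev, d_prev):
    """Three-term HS direction (standard form assumed here):
        d_k = -g_k + beta_HS * d_{k-1} - theta * y_{k-1},
        theta = g_k^T d_{k-1} / (d_{k-1}^T y_{k-1}).
    The beta and theta cross terms cancel in g_k^T d_k, giving
    g_k^T d_k = -||g_k||^2 independently of the line search."""
    y = g - g_prev
    dty = d_prev @ y
    beta = (g @ y) / dty
    theta = (g @ d_prev) / dty
    return -g + beta * d_prev - theta * y
```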
More precisely, we first propose a basic three-term CG method, namely LSTT, in which the least-squares technique combines the advantages of two existing efficient CG methods. With the Wolfe–Powell line search, LSTT is proved to be globally convergent for uniformly convex functions. In order to obtain global convergence for general nonlinear functions, we further present two improved variants of the LSTT CG method. All three methods generate sufficient descent directions independent of any line search procedure, and global convergence is analyzed for each of them. Finally, some preliminary numerical results are reported to illustrate that our methods are efficient and have advantages over two famous three-term CG methods.
The paper is organized as follows. In Sect. 2, we present the basic LSTT CG method. Global convergence of LSTT is proved in Sect. 3. Two improved variants of LSTT and their convergence analysis are given in Sect. 4. Numerical results are reported in Sect. 5. Some concluding remarks are made in Sect. 6.

Least-squares-based three-term (LSTT) CG method
In this section, we first derive a new three-term CG formula and then present the corresponding CG algorithm. Our formula is based on the following modified HS (MHS) formula proposed by Hager and Zhang [28, 29]:

β_k^{MHS}(τ_k) = g_k^T y_{k-1}/(d_{k-1}^T y_{k-1}) - τ_k ||y_{k-1}||² g_k^T d_{k-1}/(d_{k-1}^T y_{k-1})², (6)

where τ_k (≥ 0) is a parameter. The corresponding direction is then given by

d_k^{MHS} = -g_k + β_k^{MHS}(τ_k) d_{k-1}. (7)

Different choices of τ_k will lead to different types of CG formulas. In particular, β_k^{MHS}(0) = β_k^{HS}, and β_k^{MHS}(2) is just the formula proposed in [28]. In this paper, we present a more sophisticated choice of τ_k by making use of the least-squares technique. More precisely, the optimal choice τ_k^* is determined such that the direction d_k^{MHS} is as close as possible to d_k^{TTHS}, i.e., it is generated by solving the least-squares problem

min_{τ_k} ||d_k^{MHS}(τ_k) - d_k^{TTHS}||². (8)

Substituting (4) and (7) in (8), we have

τ_k^* = (d_{k-1}^T y_{k-1})²/(||y_{k-1}||² ||d_{k-1}||²). (9)

Thus, from (6), we obtain

β_k^{MHS}(τ_k^*) = g_k^T y_{k-1}/(d_{k-1}^T y_{k-1}) - g_k^T d_{k-1}/||d_{k-1}||². (10)

So far, it seems that the two-term direction d_k^{MHS}(τ_k^*) obtained from (9) and (10) is a "good enough" direction; however, it may not always be a descent direction of the objective function. In order to overcome this difficulty, we propose a least-squares-based three-term (LSTT) direction by augmenting a term to d_k^{MHS}(τ_k^*) as follows:

d_k^{LSTT} = -g_k for k = 0, and d_k^{LSTT} = -g_k + β_k^{MHS}(τ_k^*) d_{k-1} - θ_k y_{k-1} for k ≥ 1, (11)

where

θ_k = g_k^T d_{k-1}/(d_{k-1}^T y_{k-1}). (12)

The following lemma shows that the direction d_k^{LSTT} in (11) is a sufficient descent direction, independent of the line search used.
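The least-squares subproblems above are one-dimensional, so they admit a closed form. As a minimal sketch (with generic vectors a and b standing in for the quantities in (8); the helper name is ours), minimizing ||a - τ b||² gives τ* = a^T b/||b||²:

```python
import numpy as np

def ls_coefficient(a, b):
    """Closed form of the one-dimensional least-squares problem
    min_tau ||a - tau*b||^2: setting the derivative to zero gives
    tau* = a^T b / ||b||^2 (assuming b != 0)."""
    return (a @ b) / (b @ b)
```

At the optimum the residual a - τ*b is orthogonal to b; applying this orthogonality to (8) is how explicit expressions for τ_k^* and β_k^{MHS}(τ_k^*) are obtained.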

Lemma 1
Let the search direction d_k := d_k^{LSTT} be generated by (11). Then it satisfies the following sufficient descent condition:

g_k^T d_k ≤ -||g_k||². (13)

Proof For k = 0, we have d_0 = -g_0, so it follows that g_0^T d_0 = -||g_0||². For k ≥ 1, we have

g_k^T d_k = -||g_k||² + β_k^{MHS}(τ_k^*) g_k^T d_{k-1} - θ_k g_k^T y_{k-1},

which along with (10) and (12) shows that

g_k^T d_k = -||g_k||² - (g_k^T d_{k-1})²/||d_{k-1}||² ≤ -||g_k||².

So the proof is completed.
Now, we formally present the least-squares-based three-term CG algorithm (Algorithm 1), which uses d_k^{LSTT} (11) as the search direction. Note that it reduces to the classical HS method if an exact line search is executed in Step 3.

Algorithm 1: Least-squares-based three-term CG algorithm (LSTT)
Step 0: Choose an initial point x_0 ∈ R^n and a tolerance ε > 0. Let k := 0.
Step 1: If ||g_k|| ≤ ε, then stop.
Step 2: Compute the search direction d_k = d_k^{LSTT} by (11).
Step 3: Find α_k by some line search.
Step 4: Set x_{k+1} := x_k + α_k d_k.
Step 5: Let k := k + 1 and go to Step 1.

Convergence analysis for uniformly convex functions
In this section, we establish the global convergence of Algorithm 1 for uniformly convex functions, where the stepsize α_k at Step 3 is generated by the Wolfe–Powell line search (1) and (2). For this purpose, we first make two standard assumptions on the objective function, which are assumed to hold throughout the rest of the paper.

Assumption 1
The level set Ω := {x ∈ R^n : f(x) ≤ f(x_0)} is bounded, where x_0 is the initial point.

Assumption 2
There is an open set O containing Ω in which f(x) is continuously differentiable and its gradient function g(x) is Lipschitz continuous; i.e., there exists a constant L > 0 such that

||g(x) - g(y)|| ≤ L ||x - y|| for all x, y ∈ O. (14)

From Assumptions 1 and 2, it is not difficult to verify that there is a constant γ > 0 such that

||g(x)|| ≤ γ for all x ∈ Ω. (15)

The following lemma, known as the Zoutendijk condition [30], is commonly used in proving the convergence of CG methods.

Lemma 2
Suppose that the sequence {x_k} of iterates is generated by Algorithm 1. If the search direction d_k satisfies g_k^T d_k < 0 and the stepsize α_k is calculated by the Wolfe–Powell line search (1) and (2), then we have

Σ_{k=0}^∞ (g_k^T d_k)²/||d_k||² < +∞. (16)

From Lemma 1, we know that if Algorithm 1 does not stop, then g_k^T d_k < 0. Thus, under Assumptions 1 and 2, relation (16) holds immediately for Algorithm 1. Now, we present the global convergence of Algorithm 1 (with ε = 0) for uniformly convex functions.
Theorem 1 Suppose that the sequence {x_k} of iterates is generated by Algorithm 1, and that the stepsize α_k is calculated by the Wolfe–Powell line search (1) and (2). If f is uniformly convex on the level set Ω, i.e., there exists a constant μ > 0 such that

(g(x) - g(y))^T (x - y) ≥ μ ||x - y||² for all x, y ∈ Ω, (17)

then either g_k = 0 for some k, or lim_{k→∞} ||g_k|| = 0.

Proof If g_k = 0 for some k, then the algorithm stops. So, in what follows, we assume that an infinite sequence {x_k} is generated. According to the Lipschitz condition (14), the following relation holds:

||y_{k-1}|| ≤ L ||s_{k-1}||, (18)

where s_{k-1} := x_k - x_{k-1}. In addition, from (17) it follows that

y_{k-1}^T s_{k-1} ≥ μ ||s_{k-1}||². (19)

By combining the definition of d_k (cf. (10), (11), and (12)) with relations (18) and (19), we have

||d_k|| ≤ ω ||g_k||, with ω = 2 + 2L/μ.

This together with Lemma 1 and (16) shows that Σ_{k=0}^∞ ||g_k||² < +∞, which implies that lim_{k→∞} ||g_k|| = 0.

Two improved variants of the LSTT CG method
Note that the global convergence of Algorithm 1 is established only for uniformly convex functions. In this section, we present two improved variants of Algorithm 1, both of which are globally convergent for general nonlinear functions.
Algorithm 2: An improved version of LSTT algorithm (LSTT+)
Step 0: Choose an initial point x_0 ∈ R^n and a tolerance ε > 0. Let k := 0.
Step 1: If ||g_k|| ≤ ε, then stop.
Step 2: Compute the search direction d_k = d_k^{LSTT+} by (20).
Step 3: Find α_k by some line search.
Step 4: Set x_{k+1} := x_k + α_k d_k.
Step 5: Let k := k + 1 and go to Step 1.

An improved version of LSTT (LSTT+)
In fact, the main difficulty impairing convergence for general functions is that β_k^{MHS}(τ_k^*) (cf. (10)) may be negative. So, similar to the strategy used in [31], we present the first modification of the direction d_k^{LSTT} (11) as follows:

d_k^{LSTT+} = -g_k + β_k^{MHS}(τ_k^*) d_{k-1} - θ_k y_{k-1} if β_k^{MHS}(τ_k^*) > 0, and d_k^{LSTT+} = -g_k otherwise, (20)

where β_k^{MHS}(τ_k^*) and θ_k are given by (10) and (12), respectively. The corresponding algorithm is given in Algorithm 2.
Obviously, the search direction d k generated by Algorithm 2 satisfies the sufficient descent condition (13). Therefore, if the stepsize α k is calculated by the Wolfe-Powell line search (1) and (2), then the Zoutendijk condition (16) also holds for Algorithm 2.
The following lemma shows some other important properties about the search direction d k .

Lemma 3
Suppose that the sequence {d_k} of directions is generated by Algorithm 2, and that the stepsize α_k is calculated by the Wolfe–Powell line search (1) and (2). If there is a constant c > 0 such that ||g_k|| ≥ c for any k, then d_k ≠ 0 for each k, and

Σ_{k=1}^∞ ||u_k - u_{k-1}||² < +∞, where u_k := d_k/||d_k||.

Proof First, from Lemma 1 and the fact that ||g_k|| ≥ c, we have g_k^T d_k ≤ -||g_k||² ≤ -c² < 0, which implies that d_k ≠ 0 for each k. Second, from (16) and (21), we have (22). Now we rewrite the direction d_k in (20) in the form (23), with quantities given by (24). According to (23) and (24), it follows that (25). From the fact that ||u_k|| = 1, we obtain (26). On the other hand, the Wolfe–Powell line search condition (2) and (21) give a bound on g_k^T d_{k-1}; since g_{k-1}^T d_{k-1} < 0, this together with (26) shows that (27). Again from (2), it follows that (28). By combining (27) and (28), we have (29). In addition, relation (30) comes directly from (15). Finally, from (15), (29), and (30), we bound the numerator of a_k by the constant M = γ + 2γ max{σ/(1-σ), 1}. This together with (25) shows that the terms ||u_k - u_{k-1}||² are bounded by a summable sequence. Summing over k and using (22), the proof is completed.
We are now ready to prove the global convergence of Algorithm 2.

Theorem 2 Suppose that the sequence {x_k} of iterates is generated by Algorithm 2, and that the stepsize α_k is calculated by the Wolfe–Powell line search (1) and (2). Then either g_k = 0 for some k or lim inf_{k→∞} ||g_k|| = 0.
Proof Suppose by contradiction that there is a constant c > 0 such that ||g_k|| ≥ c for any k. So the conditions of Lemma 3 hold. We first show that there is a bound on the steps s_k; the proof is a modified version of [28, Thm. 3.2]. From Assumption 1, there is a constant B > 0 such that (31) holds. For any l ≥ k, it is clear that (32) holds, which together with the triangle inequality and (31) yields (33). Denote the constants as in (34), where σ, L, and γ are given in (2), (14), and (15), respectively, and let Δ be a positive integer chosen large enough that (34) is satisfied. Moreover, from Lemma 3, we can choose an index k_0 large enough that (35) holds. Thus, if j > k ≥ k_0 and j - k ≤ Δ, we can derive the following relations by (34) and the Cauchy–Schwarz inequality. Combining (32) and (35), we obtain (36), where l > k ≥ k_0 and l - k ≤ Δ. Next, we prove that there is a bound on the directions d_k.
If d_k = -g_k in (20), then from (15) we have a bound immediately. In what follows, we consider the case where the three-term branch of (20) is used. Thus, from (15), (18), and (26), we have (37). Then, by defining S_j = 2ξ²||s_j||², for l > k_0, we have (38). From (36), following the corresponding lines in [28, Thm. 3.2], we can conclude that the right-hand side of (38) is bounded, and the bound is independent of l. This together with (37) contradicts (22). Therefore, lim inf_{k→∞} ||g_k|| = 0.

A modified version of LSTT+ (MLSTT+)
In order to further improve the efficiency of Algorithm 2, we propose a modified version of d_k^{LSTT+} (20), denoted d_k^{MLSTT+} (39), where θ_k is given by (12) and

z_{k-1} := g_k - (||g_k||/||g_{k-1}||) g_{k-1}. (41)

The difference between (20) and (39) is that y_{k-1} is replaced by z_{k-1}. This idea, which aims at improving the famous PRP method, originated from [32]. Such a substitution seems useful here in that it can increase the possibility of the CG parameter being positive, so that the three-term direction is used more often. In fact, as the iterations go along, g_k approaches zero asymptotically, and therefore ||g_k||/||g_{k-1}|| < 1 may frequently happen. If in addition g_k^T g_{k-1} > 0, then we have

g_k^T z_{k-1} = ||g_k||² - (||g_k||/||g_{k-1}||) g_k^T g_{k-1} > ||g_k||² - g_k^T g_{k-1} = g_k^T y_{k-1}.
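Reading (41) as z_{k-1} = g_k - (||g_k||/||g_{k-1}||) g_{k-1} (our reading, consistent with the identity just derived), the effect of the substitution can be checked numerically:

```python
import numpy as np

def z_vector(g, g_prev):
    """z_{k-1} = g_k - (||g_k||/||g_{k-1}||) * g_{k-1}, cf. (41)."""
    return g - (np.linalg.norm(g) / np.linalg.norm(g_prev)) * g_prev

# When ||g_k|| < ||g_{k-1}|| and g_k^T g_{k-1} > 0, the inner product
# g_k^T z_{k-1} exceeds g_k^T y_{k-1}, so a CG parameter built from
# z_{k-1} is positive more often.
g_prev = np.array([1.0, 0.1])
g = np.array([0.5, 0.0])   # shrinking gradient with positive overlap
assert g @ z_vector(g, g_prev) > g @ (g - g_prev)
```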
The following lemma shows that the search direction (39) also has the sufficient descent property.

Algorithm 3: A modified version of LSTT+ algorithm (MLSTT+)
Step 0: Choose an initial point x_0 ∈ R^n and a tolerance ε > 0. Let k := 0.
Step 1: If ||g_k|| ≤ ε, then stop.
Step 2: Compute the search direction d_k = d_k^{MLSTT+} by (39).
Step 3: Find α_k by some line search.
Step 4: Set x_{k+1} := x_k + α_k d_k.
Step 5: Let k := k + 1 and go to Step 1.

Lemma 4 Let the search direction d_k be generated by (39). Then it satisfies the sufficient descent condition (13), independent of the line search used.

Proof The proof is similar to that of Lemma 1.
From Lemma 4, we know that the Zoutendijk condition (16) also holds for Algorithm 3. In what follows, we show that Algorithm 3 is globally convergent for general functions. The following lemma illustrates that the direction d_k generated by Algorithm 3 inherits some useful properties of d_k^{LSTT+} (20); its proof is a modification of that of Lemma 3.

Lemma 5 Suppose that the sequence {d_k} of directions is generated by Algorithm 3, and that the stepsize α_k is calculated by the Wolfe–Powell line search (1) and (2). If there is a constant c > 0 such that ||g_k|| ≥ c for any k, then d_k ≠ 0 for each k, and

Σ_{k=1}^∞ ||û_k - û_{k-1}||² < +∞, where û_k := d_k/||d_k||.

Proof From the related analysis in the proof of Lemma 3, we have (42) and (43). Now we rewrite the direction d_k in (39) in the form (44), with quantities given by (45), and define â_k and û_k as in (46). According to (44) and (46), it follows that (47). Thus, following the lines of the proof of Lemma 3, we get (48). Moreover, relation (49) holds by the definition of z_{k-1} (41). By combining (15), (48), and (49), we bound the numerator of â_k. This together with (47) shows that the terms ||û_k - û_{k-1}||² are bounded by a summable sequence. Summing over k and utilizing (43), we complete the proof.
We finally present the global convergence of Algorithm 3.

Theorem 3
Suppose that the sequence {x_k} of iterates is generated by Algorithm 3, and that the stepsize α_k is calculated by the Wolfe–Powell line search (1) and (2). Then either g_k = 0 for some k or lim inf_{k→∞} ||g_k|| = 0.

Proof Suppose by contradiction that there is a constant c > 0 such that ||g_k|| ≥ c for any k. Then the conclusions of Lemma 5 hold, and the proof can be completed by following the same lines as in the proof of Theorem 2.

Numerical results
In this section, we aim to test the practical effectiveness of Algorithm 2 (LSTT+) and Algorithm 3 (MLSTT+) which are both convergent for general functions under the Wolfe-Powell line search. The numerical results are compared with the TTPRP [22] method and the TTHS [23] method by solving 104 test problems from the CUTE library [33][34][35], whose dimensions range from 2 to 5,000,000.
All codes were written in Matlab R2014a and run on a PC with 4 GB of RAM and the Windows 7 operating system. The stepsizes α_k were generated by the Wolfe–Powell line search with σ = 0.1 and δ = 0.01. In Tables 1, 2, and 3, "Name" and "n" denote the abbreviated name of the test problem and its dimension, and "Itr/NF/NG" stand for the numbers of iterations, function evaluations, and gradient evaluations, respectively. "Tcpu" and "||g*||" denote the CPU time and the final norm of the gradient, respectively. The stopping criterion is ||g_k|| ≤ 10^{-6} or Itr > 2000.
To clearly show the differences in numerical behavior among the four CG methods mentioned above, we present the performance profiles introduced by Dolan and Moré [36] in Figs. 1, 2, 3, 4 (with respect to Itr, NF, NG, and Tcpu, respectively). The profiles are constructed as follows.
Denote the whole set of n_p test problems by P and the set of solvers by S. Let t_{p,s} be the Tcpu (or Itr, etc.) required to solve problem p ∈ P by solver s ∈ S, and define the performance ratio

r_{p,s} = t_{p,s}/min{t_{p,s} : s ∈ S}.

For the "NaN" entries in Tables 1, 2, and 3, we set r_{p,s} = 2 max{r_{p,s} : s ∈ S}. The performance profile of each solver is then defined by

ρ_s(τ) = (1/n_p) size{p ∈ P : log_2 r_{p,s} ≤ τ},

where size(A) stands for the number of elements in the set A. Hence ρ_s(τ) is the probability for solver s ∈ S that the performance ratio r_{p,s} is within a factor 2^τ of the best. The function ρ_s is the (cumulative) distribution function of the performance ratio, so the solver whose curve lies on top wins over the rest; refer to [36] for more details. The left side of each figure shows the percentage of test problems on which a method is the fastest, and the right side shows the percentage of test problems that each method eventually solves.
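The profile construction described above can be sketched as follows (a minimal implementation of the Dolan–Moré procedure; the array layout and the NaN convention for failures are our assumptions):

```python
import numpy as np

def performance_profiles(T):
    """Dolan-More performance profiles. T is an (n_p, n_s) array of costs
    t_{p,s}, with np.nan marking a failure; returns rho(s, tau), the
    fraction of problems for which log2(r_{p,s}) <= tau."""
    T = np.asarray(T, dtype=float)
    best = np.nanmin(T, axis=1, keepdims=True)  # per-problem best cost
    R = T / best                                # performance ratios r_{p,s}
    R[np.isnan(R)] = 2.0 * np.nanmax(R)         # failures: r = 2 max r
    n_p = T.shape[0]
    def rho(s, tau):
        return np.count_nonzero(np.log2(R[:, s]) <= tau) / n_p
    return rho
```

The value rho(s, 0) is the fraction of problems on which solver s is the fastest, and rho(s, τ) for large τ approaches the fraction of problems the solver eventually solves.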
In Figs. 1, 2, 3, 4, we compare the performance of the LSTT+ and MLSTT+ methods with that of the TTPRP and TTHS methods. We observe from Fig. 1 that MLSTT+ is the fastest in terms of the number of iterations on about 51% of the test problems, and it ultimately solves about 98% of them. LSTT+ has the second best performance, solving 88% of the test problems successfully, while TTPRP and TTHS solve about 80% and 78%, respectively. Figure 2 shows that MLSTT+ also exhibits the best performance in the number of function evaluations, requiring the fewest on about 49% of the test problems; LSTT+ is second with about 40%. From Fig. 3, MLSTT+ and LSTT+ perform better than the other two methods in the number of gradient evaluations: MLSTT+ requires the fewest on about 56% of the test problems, and LSTT+ on about 41%. In Fig. 4, MLSTT+ displays the best performance in CPU time, solving about 53% of the test problems in the least time, with LSTT+ second at 42%. Since all methods were implemented with the same line search, we conclude that the LSTT+ and MLSTT+ methods are the more efficient ones.
Combining Tables 1, 2, 3 and Figs. 1, 2, 3, 4, we are led to the conclusion that LSTT+ and MLSTT+ perform better than TTPRP and TTHS, with MLSTT+ the best of the four. This shows that the methods proposed in this paper possess good numerical performance.

Conclusion
In this paper, we have presented three new three-term CG methods that are based on the least-squares technique to determine the CG parameters. All can generate sufficient descent directions without the help of a line search procedure. The basic one is globally convergent for uniformly convex functions, while the other two improved variants possess global convergence for general nonlinear functions. Preliminary numerical results show that our methods are very promising.