An efficient modification of the Hestenes-Stiefel nonlinear conjugate gradient method with restart property
Journal of Inequalities and Applications volume 2016, Article number: 110 (2016)
Abstract
The conjugate gradient (CG) method is one of the most popular methods for solving nonlinear unconstrained optimization problems. The Hestenes-Stiefel (HS) CG formula is considered one of the most efficient CG formulas. In addition, the HS coefficient is related to the conjugacy condition regardless of the line search method used. However, the HS parameter may not satisfy the global convergence properties of the CG method with the Wolfe-Powell line search if the descent condition is not satisfied. In this paper, we use the original HS CG formula with a mild condition to construct a CG method that restarts with the negative gradient. The convergence and descent properties with the strong Wolfe-Powell (SWP) and weak Wolfe-Powell (WWP) line searches are established. Using this condition, we guarantee that the HS formula is nonnegative, its value is restricted, and the number of restarts is not too high. Numerical computations with the SWP line search and some standard optimization problems demonstrate the robustness and efficiency of the new version of the CG parameter in comparison with the latest and classical CG formulas. An example describes the benefit of using different initial points to obtain different solutions for multimodal optimization functions.
Introduction
Consider the following form for the unconstrained optimization problem:
where \(f: \mathbb{R}^{n} \rightarrow \mathbb{R}\) is a smooth nonlinear function. To solve (1) using the conjugate gradient (CG) method, we normally use the following iterative method:
where the starting point \(x_{1} \in \mathbb{R}^{n}\) is arbitrary and \(\alpha_{k} > 0\) is the step length, which is computed via a line search. The search direction \(d_{k}\) is defined by
where \(g_{k} = \nabla f(x_{k})\), \(\beta_{k}\) is a scalar, and the steepest descent method is used as an initial search direction, i.e.,
The most well-known CG formulas are the following:
Hestenes-Stiefel (HS) [1],
Fletcher-Reeves (FR) [2],
and Polak-Ribière-Polyak (PRP) [3],
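As an illustration (our own addition, not from the paper), the three classical coefficients can be computed directly from the current and previous gradients and the previous search direction; the helper names below are ours.

```python
# Classical CG coefficients beta_k computed from g_{k-1}, g_k, d_{k-1}.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def beta_hs(g_cur, g_prev, d_prev):
    # Hestenes-Stiefel: g_k^T y_{k-1} / (d_{k-1}^T y_{k-1}), y_{k-1} = g_k - g_{k-1}.
    # Assumes d_{k-1}^T y_{k-1} != 0.
    y = [a - b for a, b in zip(g_cur, g_prev)]
    return dot(g_cur, y) / dot(d_prev, y)

def beta_fr(g_cur, g_prev):
    # Fletcher-Reeves: ||g_k||^2 / ||g_{k-1}||^2.
    return dot(g_cur, g_cur) / dot(g_prev, g_prev)

def beta_prp(g_cur, g_prev):
    # Polak-Ribiere-Polyak: g_k^T y_{k-1} / ||g_{k-1}||^2.
    y = [a - b for a, b in zip(g_cur, g_prev)]
    return dot(g_cur, y) / dot(g_prev, g_prev)
```

When \(g_{k}^{T}d_{k-1} = 0\) (as under an exact line search), the three values coincide, which is easy to check on orthogonal consecutive gradients.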
Theoretically, equations (5), (6), and (7) are equivalent if we use an exact line search, i.e.,
and quadratic functions, i.e.,
where Q is a positive definite matrix and b is a vector. However, in numerical computations and convergence analyses, the three main CG formulas are different if we use nonquadratic functions.
To obtain the step length, there are two types of line searches: the exact line search given by (8), which is expensive in terms of function and gradient evaluations, and the inexact line search, which approximates the step length by reducing the function value and the directional derivative. The inexact line search is inexpensive yet retains the main advantages of the exact line search. The most popular inexact line search is the Wolfe-Powell line search [4, 5], which approximates a suitable step length using the following conditions:
and
where \(0 < \delta < \sigma < 1\).
The strong version of the weak Wolfe-Powell (WWP) line search is the strong Wolfe-Powell (SWP) line search, which is given by (9) and
The difference between the WWP and SWP line searches is that the former stops searching for the step length when the current iterate in (2) is still far from the stationary point.
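The two line searches can be summarized as acceptance tests on \(\phi(\alpha) = f(x_{k} + \alpha d_{k})\). The following sketch is our own illustration: the function names are ours, and the default values \(\delta = 0.01\), \(\sigma = 0.1\) are taken from the numerical section of this paper.

```python
# Acceptance tests for a trial step length alpha.
# phi0  = f(x),            dphi0  = g(x)^T d        (directional derivative at 0)
# phi_a = f(x + alpha*d),  dphi_a = g(x + alpha*d)^T d

def satisfies_wwp(phi0, dphi0, phi_a, dphi_a, alpha, delta=0.01, sigma=0.1):
    armijo = phi_a <= phi0 + delta * alpha * dphi0   # condition (9)
    curvature = dphi_a >= sigma * dphi0              # condition (10)
    return armijo and curvature

def satisfies_swp(phi0, dphi0, phi_a, dphi_a, alpha, delta=0.01, sigma=0.1):
    armijo = phi_a <= phi0 + delta * alpha * dphi0   # condition (9)
    strong_curv = abs(dphi_a) <= -sigma * dphi0      # condition (11); dphi0 < 0
    return armijo and strong_curv
```

For example, with \(f(x) = x^{2}\) at \(x = -1\) and \(d = 1\), the full step \(\alpha = 1\) lands exactly on the minimizer and satisfies both tests, while an overshooting step fails the Armijo condition.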
An important rule in the CG method is the descent property, which is given by
If the CG formula inherits (12), then the iterative formula in (2) reduces the function value in every iteration. It is clear from (9) that \(f(x_{k} + \alpha_{k}d_{k}) - f(x_{k}) \le \delta \alpha_{k}g_{k}^{T}d_{k}\). If the directional derivative (i.e., \(g_{k}^{T}d_{k}\)) is negative, we obtain
Thus, (12) must be satisfied before using the Wolfe-Powell line search. If we extend (12) to the form
then (13) is called the sufficient descent condition.
The HS CG formula is related to the conjugacy condition regardless of the objective function and line search, i.e.,
A CG formula that inherits (14) tends to be more efficient than CG parameters that do not inherit this property. Dai and Liao [6] proposed the following novel conjugacy condition for an inexact line search:
Under an exact line search, \(g_{k}^{T}d_{k - 1} = 0\), and (15) reduces to the original conjugacy condition (14).
Because the PRP and HS formulas may fail to satisfy the descent property when the SWP or WWP line search is used, Gilbert and Nocedal [7] use Powell's suggestion [20] to fix the convergence problem of the PRP method as follows:
However, \(\beta_{k}^{\mathrm{PRP}+}\) may not satisfy the descent property with the SWP line search; thus, [7] resorts to the Moré-Thuente algorithm [9] when the descent property fails.
Furthermore, [7] makes the following suggestion:
Touati-Ahmed and Storey [10] suggest the following hybrid method:
Several CG parameters that pertain to the PRP and HS formulas have been presented [11–13]. Here, we denote the following CG formulas as the WYL family:
The CG formulas in the WYL family are clearly positive and satisfy the global convergence with descent properties. However, this family does not inherit the restart property. Thus, the convergence rate is linear. For more about the convergence rate, we refer the reader to [14].
Recently, Alhawarat et al. [15] constructed the following CG formula with the restart property:
In addition, to learn about many versions of CG parameters related to classical CG methods and their convergence properties, we refer the reader to [16, 17].
Motivation and the new modification
The Hestenes-Stiefel (HS) CG formula is considered one of the most efficient CG formulas. In addition, the HS coefficient is related to the conjugacy condition regardless of the line search used. However, the HS parameter may not satisfy the global convergence properties of the CG method with the Wolfe-Powell line search if the descent condition is not satisfied. Thus, in this section, we present the following CG parameter:
The nonzero term of (17) can clearly be written as follows:
and
If \(\beta_{k}^{\mathrm{ZA}} = 0\), the search direction becomes the steepest-descent direction. In addition, if \(g_{k} = 0\), the stationary point has been found. Thus, in the following analysis, we always suppose that \(\beta_{k}^{\mathrm{ZA}} > 0\) and \(g_{k} \ne 0\) for all \(k \ge 1\).
To show that (17) does not restart too often when different nonstandard initial points are used, we compare \(\beta_{k}^{\mathrm{ZA}}\) with the following modified PRP CG parameter:
In the numerical results section, (20) restarted too many times when nonstandard initial points were used. In other words, the CG parameter in (20) falls back on the steepest-descent direction many times before reaching the optimal solution. With the standard initial points, however, PRP^{∗} becomes more efficient. Thus, (20) is not as efficient as (17), because the latter does not restart as often.
The following algorithm describes the steps of the CG method with (17) and the SWP line search for solving the optimization problems.
Algorithm 1
 Step 0.:
Initialization. Given \(x_{1}\), set \(k = 1\).
 Step 1.:
If \(\Vert g_{k} \Vert \le \varepsilon\), then stop, where \(0 < \varepsilon \ll 1\).
 Step 2.:
Compute \(\beta_{k}\) based on (17).
 Step 3.:
Compute the search direction \(d_{k}\) based on (3).
 Step 4.:
Compute the step length \(\alpha_{k}\) using the SWP line search (9) and (11).
 Step 5.:
Update the new point based on (2).
 Step 6.:
Convergence test and stopping criterion: if \(\Vert g_{k} \Vert \le \varepsilon\), then stop; otherwise, set \(k = k + 1\) and go to Step 2.
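Algorithm 1 can be sketched as follows. Formula (17) appears only as a displayed image in this version of the text, so the sketch substitutes the nonnegative HS value \(\max(\beta_{k}^{\mathrm{HS}}, 0)\) and restarts with the negative gradient when it vanishes, matching the restart behavior described above; it also uses a simple backtracking Armijo search in place of the full SWP search (9), (11). This is our own illustration, not the authors' MATLAB code.

```python
# CG iteration with HS coefficient and negative-gradient restart (a sketch).

def cg_restart(f, grad, x, eps=1e-6, max_iter=2000):
    g = grad(x)
    d = [-gi for gi in g]                          # initial steepest-descent direction (4)
    for _ in range(max_iter):
        if sum(gi * gi for gi in g) ** 0.5 <= eps:  # Step 1 / Step 6 stopping test
            break
        gTd = sum(gi * di for gi, di in zip(g, d))
        if gTd >= 0:                               # safeguard: restart if d is not a descent direction
            d = [-gi for gi in g]
            gTd = -sum(gi * gi for gi in g)
        # Backtracking Armijo line search (stand-in for the SWP search).
        alpha, fx = 1.0, f(x)
        while alpha > 1e-12 and \
                f([xi + alpha * di for xi, di in zip(x, d)]) > fx + 1e-4 * alpha * gTd:
            alpha *= 0.5
        x = [xi + alpha * di for xi, di in zip(x, d)]   # Step 5: update via (2)
        g_new = grad(x)
        y = [a - b for a, b in zip(g_new, g)]
        denom = sum(di * yi for di, yi in zip(d, y))
        beta = max(sum(a * b for a, b in zip(g_new, y)) / denom, 0.0) if denom != 0 else 0.0
        d = [-gi + beta * di for gi, di in zip(g_new, d)]  # beta = 0 restarts with -g
        g = g_new
    return x
```

On the ill-conditioned quadratic \(f(x) = x_{1}^{2} + 10x_{2}^{2}\), the iteration drives the gradient norm below the tolerance.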
Global convergence properties for the \(\beta_{k}^{\mathrm{ZA}}\) method
Because we are interested in finding stationary points of nonlinear optimization functions that are bounded below and whose gradients are Lipschitz continuous, the following standard assumption is needed.
Assumption 1

I.
The level set \(\Psi = \{ x \mid f(x) \le f(x_{1})\}\) is bounded, i.e., there is a positive constant M such that
$$\Vert x \Vert \le M, \quad \forall x \in \Psi. $$ 
II.
In some neighborhood N of Ψ, f is continuously differentiable, and its gradient is Lipschitz continuous, i.e., for all \(x,y \in N\), there is a constant \(L > 0\) such that
$$\bigl\Vert g(x) - g(y) \bigr\Vert \le L\Vert x - y \Vert . $$
This assumption implies that there is a positive constant B such that
The following lemma is known as the Zoutendijk condition [8]; it is normally used to prove the global convergence of CG methods. Global convergence indicates that a stationary point is obtained.
Lemma 3.1
Suppose that Assumption 1 holds. Consider the CG methods of forms (2) and (3), where the search direction satisfies the sufficient descent condition and the step length is computed using the standard WWP line search. Then
The following two theorems demonstrate that (17) satisfies the descent condition with SWP and WWP line searches.
Theorem 3.1
Let the sequences \(\{ g_{k} \}\) and \(\{ d_{k} \}\) be generated by methods (2), (3) and (17) with step length \(\alpha_{k}\), which is computed using the SWP line searches (9) and (11) with \(\sigma < \frac{1}{3}\); then the sufficient descent condition (13) holds for some \(c \in (0, 1)\).
Proof
Multiplying (3) by \(g_{k}^{T}\), we have
Then we have the following two cases:
Case 1. If \(g_{k}^{T}d_{k - 1} \le 0\), then using (17), we obtain
Case 2. If \(g_{k}^{T}d_{k - 1} > 0\), then, dividing both sides of (21) by \(\Vert g_{k} \Vert ^{2}\) and using (17), we obtain
Using (11) with \(\sigma < 1/3\), we obtain
Letting \(c = 1 - \frac{2\sigma}{1 - \sigma}\), we obtain
The proof is complete. □
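As a quick sanity check (our own addition), the choice of \(c\) in the proof explains the restriction \(\sigma < 1/3\):

```latex
% c lies in (0,1) exactly when 0 < sigma < 1/3:
c = 1 - \frac{2\sigma}{1-\sigma} > 0
  \;\Longleftrightarrow\; 1 - \sigma > 2\sigma
  \;\Longleftrightarrow\; \sigma < \tfrac{1}{3},
\qquad
c < 1 \;\Longleftrightarrow\; \frac{2\sigma}{1-\sigma} > 0,
```

and the last inequality holds for every \(\sigma \in (0, 1/3)\); hence \(c \in (0,1)\), as required by the sufficient descent condition (13).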
Theorem 3.2
Assume the sequences \(\{ g_{k} \}\) and \(\{ d_{k} \}\) are generated using the methods (2), (3) and (17) with step length \(\alpha_{k}\), which is computed via the WWP line search given by (9) and (10); then the descent condition (13) holds.
Proof
If \(g_{k}^{T}d_{k - 1} \le 0\), then the proof is similar to Case 1 in Theorem 3.1. If \(g_{k}^{T}d_{k - 1} > 0\), then from (3) we have
Dividing both sides by \(\Vert g_{k} \Vert ^{2}\), we obtain
Letting \(c = - 1 + \frac{2}{1 - \sigma}\), we obtain
The proof is complete. □
Gilbert and Nocedal [7] presented a useful property to prove the global convergence properties for the methods that pertain to the PRP (HS) formula. The property is as follows.
Property ∗
Consider a method of the form given by (2) and (3) and suppose that
We say that the method has Property ∗ if there are constants \(b > 1\) and \(\lambda > 0\) such that \(\vert \beta_{k} \vert \le b\) for all \(k \ge 1\), and if \(\Vert x_{k} - x_{k - 1} \Vert \le \lambda\), then
The following lemma is similar to that presented in [7].
Lemma 3.2
Consider the CG method as defined in (2), (3), and (17) and the step length computed using the WWP line search. If the equation in (22) and Assumption 1 hold, then \(\beta_{k}^{\mathrm{ZA}}\) satisfies Property ∗.
Proof
Assume that \(b = \frac{2\bar{\gamma}^{2}}{(1  \sigma )c\gamma^{2}}\) and \(\lambda = \frac{(1  \sigma )c\gamma^{2}}{2L\bar{\gamma} b}\). Then \(b > 1\) and \(\lambda > 0\). Using (10) and Theorem 3.2, we obtain
Using (17) and Assumption 1, we obtain
If \(\Vert x_{k} - x_{k - 1} \Vert \le \lambda\), then
The proof is complete. □
The proofs of the following lemmas and of Theorem 3.3 can be found in [7]; we present them here for readability. The following lemma shows that if the CG formula satisfies Property ∗, then a fraction of the steps cannot be too small.
Lemma 3.3
Assume that Assumption 1 holds and that the sequences \(\{ g_{k} \}\) and \(\{ d_{k} \}\) are generated by Algorithm 1, where \(\alpha_{k}\) is computed using the WWP line search, the sufficient descent condition (13) holds, and the method has Property ∗. Suppose also that \(\Vert g_{k} \Vert \ge \gamma\) for some \(\gamma > 0\). Then there exists \(\lambda > 0\) such that for any \(\Delta \in \mathbb{N}\) and any index \(k_{0}\), there is an index \(k > k_{0}\) that satisfies
where \(\kappa_{k,\Delta}^{\lambda} = \{ i \in \mathbb{N}: k \le i \le k + \Delta - 1, \Vert s_{i} \Vert > \lambda\}\), \(\mathbb{N}\) denotes the set of positive integers, and \(\vert \kappa_{k,\Delta}^{\lambda} \vert \) denotes the number of elements of \(\kappa_{k,\Delta}^{\lambda}\).
Lemma 3.4
Suppose that Assumption 1 holds. Assume that the sequences \(\{ g_{k} \}\) and \(\{ d_{k} \}\) are generated by Algorithm 1, where \(\alpha_{k}\) is computed using the WWP line search, and that the sufficient descent condition (13) holds. If \(\beta_{k} \ge 0\) and (22) holds, then \(d_{k} \ne 0\) and
Theorem 3.3
Suppose that Assumption 1 holds. Assume that the sequences \(\{ g_{k} \}\) and \(\{ d_{k} \}\) are generated by Algorithm 1, where \(\alpha_{k}\) is computed using the WWP line search, and that the sufficient descent condition (13) holds. In addition, suppose that Property ∗ holds. Then we have \(\lim_{k \to \infty} \inf \Vert g_{k} \Vert = 0\).
Proof
Based on Lemma 3.4, we prove the theorem by contradiction. Define \(u_{i}: = \frac{d_{i}}{\Vert d_{i} \Vert }\). For any two indices l, k with \(l \ge k\), we have
where \(s_{i - 1} = x_{i} - x_{i - 1}\).
Taking the norms,
Using Assumption 1, we know that sequence \(\{ x_{k} \}\) is bounded, and there is a positive constant η such that \(\Vert x_{k} \Vert \le \eta\) for all \(k \ge 1\). Thus,
which implies that
Assume that \(\lambda > 0\) is given by Lemma 3.3. Following the notation of this lemma, we define
From Lemma 3.3, we can find an index \(k_{0}\) such that
With this Δ and \(k_{0}\), Lemma 3.4 gives an index \(k \ge k_{0}\) such that
Next, by the Cauchy-Schwarz inequality and (24), we see that for any index \(i \in [k, k + \Delta - 1]\),
By this relation, (23), and (25) with \(l = k + \Delta - 1\), we have
Thus, \(\Delta < 8\eta /\lambda\), which contradicts the definition of Δ. The proof is complete. □
Using Lemmas 3.2, 3.3, and 3.4 and Theorem 3.3, the global convergence of Algorithm 1 with the Wolfe-Powell line search is established in the same way as Theorem 4.3 in [7]. Therefore, we present the following theorems without proof.
Theorem 3.4
Suppose that Assumption 1 holds. Consider the CG method of forms (2), (3), and (17), where \(\alpha_{k}\) is computed using the WWP line search; then \(\lim_{k \to \infty} \inf \Vert g_{k} \Vert = 0\).
Theorem 3.5
Suppose that Assumption 1 holds. Consider the CG method of forms (2), (3), and (17), where \(\alpha_{k}\) is computed using the SWP line search with \(0 < \sigma < 1/3\); then \(\lim_{k \to \infty} \inf \Vert g_{k} \Vert = 0\).
Numerical results and discussion
To test the efficiency and robustness of the new method (17), some standard test functions were selected from CUTE [18] and Andrei [19], as summarized in Table 1. We compared the new method, denoted ZA, with other CG methods, including the WYL family and the PRP^{∗}, HPRP, PRP, HS+, and FR parameters. The stopping criterion was \(\Vert g_{k} \Vert \le 10^{ - 6}\) for all algorithms.
The initial point \(x_{1} \in \mathbb{R}^{n}\) is arbitrary. As shown in Table 1, we used different initial points based on the original standard points. We observed from the numerical results that different initial points often led to different stationary points for the multimodal functions. In addition, the efficiency of the algorithm depended on the initial point for every function. For example, the efficiency of the FR algorithm on the extended Rosenbrock function with the initial point \((-1.2, 1, -1.2, 1, \ldots, -1.2, 1)\) differs from that with \((5, 5, \ldots, 5)\) or \((10, 10, \ldots, 10)\) as the initial point. Moreover, the initial point determines the value of the CG formula based on Powell [20]; for example, the PRP or HS parameter fails to obtain the solution if its value is negative, whereas with another initial point the value of the PRP parameter is nonnegative and satisfies the descent property. This result motivated us to study the initial points further. Moreover, different dimensions were used for every function, with the dimension ranging from 2 to 10,000.
We present Himmelblau's function (Figure 1), a multimodal function used to test the efficiency of optimization algorithms. The function is defined as
$$f(x, y) = \bigl(x^{2} + y - 11\bigr)^{2} + \bigl(x + y^{2} - 7\bigr)^{2}.$$
As indicated in Table 2, we ran Algorithm 1 on Himmelblau's function from different initial points, and every initial point gave a different solution point. We used a MATLAB 7.9 subroutine on an Intel(R) Core(TM) i5 CPU with 4 GB DDR2 RAM, and the SWP line search with cubic interpolation. We used the Sigma Plot 10 program to graph the data based on multiple horizontal steps, and the graphs are shown in Figures 2 and 3. The selected values of δ and σ are 0.01 and 0.1, respectively.
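To reproduce the qualitative behavior reported in Table 2, the following sketch (our own Python illustration, not the paper's MATLAB subroutine) minimizes Himmelblau's function, whose standard definition is \(f(x, y) = (x^{2} + y - 11)^{2} + (x + y^{2} - 7)^{2}\), by backtracking gradient descent from two initial points, reaching two of its four global minima.

```python
# Himmelblau's function has four global minima, all with f = 0.

def f(p):
    x, y = p
    return (x * x + y - 11.0) ** 2 + (x + y * y - 7.0) ** 2

def grad(p):
    x, y = p
    return (4.0 * x * (x * x + y - 11.0) + 2.0 * (x + y * y - 7.0),
            2.0 * (x * x + y - 11.0) + 4.0 * y * (x + y * y - 7.0))

def descend(p, iters=2000):
    # Plain gradient descent with a backtracking Armijo step.
    for _ in range(iters):
        gx, gy = grad(p)
        alpha = 1.0
        while f((p[0] - alpha * gx, p[1] - alpha * gy)) > \
                f(p) - 1e-4 * alpha * (gx * gx + gy * gy):
            alpha *= 0.5
            if alpha < 1e-15:
                break
        p = (p[0] - alpha * gx, p[1] - alpha * gy)
    return p

m1 = descend((1.0, 1.0))     # reaches the minimum near (3, 2)
m2 = descend((-4.0, -4.0))   # reaches a different minimum, in the third quadrant
```

Different starting points land in different basins of attraction, which is exactly the multimodal behavior the table illustrates.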
The performance results are shown in Figures 2 and 3 with a performance profile introduced by Dolan and Moré [21].
This performance measure was introduced to compare a set of solvers S on a set of problems F. Assume that there are \(n_{s}\) solvers in S and \(n_{f}\) problems in F. Then the measure \(t_{f,s}\) is defined as the number of iterations or the CPU time required to solve problem f with solver s. To create a baseline for comparison, the performance of solver s on problem f is scaled by the best performance of any solver in S on that problem using the ratio
Suppose that a parameter \(r_{M}\) with \(r_{M} \ge r_{f,s}\) for all f, s is selected. Then \(r_{f,s} = r_{M}\) if and only if solver s fails to solve problem f.
Because we would like to obtain an overall assessment of the performance of a solver, we defined the measure
Thus, \(P_{s}(t)\) is the probability for solver \(s \in S\) that the performance ratio \(r_{f,s}\) is within a factor \(t \in \mathbb{R}\) of the best possible ratio. The function \(P_{s}:\mathbb{R} \to [0,1]\) is the cumulative distribution function of the performance ratio; it is nondecreasing and piecewise constant, continuous from the right. The value of \(P_{s}(1)\) is the probability that the solver performs best among all solvers. In general, a solver whose curve \(P_{s}(t)\) lies highest, toward the upper right corner of the figure, is preferable.
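The profile value \(P_{s}(t)\) can be computed as follows. This is our own sketch of the Dolan-Moré measure, with an assumed data layout: one dict of solver costs per problem, with None marking a failure (i.e., \(r_{f,s} = r_{M}\)).

```python
# Dolan-More performance profile: fraction of problems a solver handles
# within a factor tau of the best solver on each problem.

def performance_profile(costs, solvers, tau):
    n_f = len(costs)
    count = {s: 0 for s in solvers}
    for row in costs:                       # one row (dict) per problem
        finite = [row[s] for s in solvers if row[s] is not None]
        best = min(finite)                  # best cost t_{f,s} on this problem
        for s in solvers:
            if row[s] is not None and row[s] / best <= tau:
                count[s] += 1
    return {s: count[s] / n_f for s in solvers}
```

For instance, with three problems where solver A wins once, loses once by a factor of 2, and fails once, \(P_{A}(1) = 1/3\) while \(P_{A}(2) = 2/3\).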
Based on the left side of Figures 2 and 3, the ZA curve clearly lies above the other curves. As previously mentioned, PRP^{∗} performs better than HPRP because the latter restarted too many times with the negative gradient. Furthermore, WYL performs better than NPRP and DPRP. Although the PRP and HS methods are efficient, both have theoretical problems; thus, the percentage of functions solved by the PRP formula does not exceed 75%. The HS+ formula also has a theoretical problem when the directional derivative is positive; hence, it may not satisfy the descent property with the SWP line search, and the percentage of functions it solves is approximately 90%. The FR formula satisfies the descent and convergence properties, but we terminated the program several times because it cycled without reaching the solution. For all algorithms, the time limit to obtain the solution was 500 seconds.
Conclusion
In this paper, we used the HS CG formula with a restart property. The global convergence and descent properties were established with the WWP and SWP line searches. The numerical results demonstrate that the new modification outperforms the other CG parameters considered.
References
 1.
Hestenes, MR, Stiefel, E: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49(6), 409-436 (1952)
 2.
Fletcher, R, Reeves, CM: Function minimization by conjugate gradients. Comput. J. 7(2), 149-154 (1964)
 3.
Polak, E, Ribière, G: Note sur la convergence de méthodes de directions conjuguées. ESAIM: Math. Model. Numer. Anal. 3(R1), 35-43 (1969)
 4.
Wolfe, P: Convergence conditions for ascent methods. SIAM Rev. 11, 226-235 (1969)
 5.
Wolfe, P: Convergence conditions for ascent methods. II: Some corrections. SIAM Rev. 13, 185-188 (1971)
 6.
Dai, YH, Liao, LZ: New conjugacy conditions and related nonlinear conjugate gradient methods. Appl. Math. Optim. 43(1), 87-101 (2001)
 7.
Gilbert, JC, Nocedal, J: Global convergence properties of conjugate gradient methods for optimization. SIAM J. Optim. 2(1), 21-42 (1992)
 8.
Zoutendijk, G: Nonlinear programming, computational methods. Integer Nonlinear Program. 143(1), 37-86 (1970)
 9.
Moré, JJ, Thuente, DJ: On line search algorithms with guaranteed sufficient decrease. Mathematics and Computer Science Division Preprint MCS-P153-0590, Argonne National Laboratory, Argonne, IL (1990)
 10.
Touati-Ahmed, D, Storey, C: Efficient hybrid conjugate gradient techniques. J. Optim. Theory Appl. 64, 379-397 (1990)
 11.
Wei, Z, Yao, S, Liu, L: The convergence properties of some new conjugate gradient methods. Appl. Math. Comput. 183(2), 1341-1350 (2006)
 12.
Zhang, L: An improved Wei-Yao-Liu nonlinear conjugate gradient method for optimization computation. Appl. Math. Comput. 215(6), 2269-2274 (2009)
 13.
Dai, Z, Wen, F: Another improved Wei-Yao-Liu nonlinear conjugate gradient method with sufficient descent property. Appl. Math. Comput. 218(14), 7421-7430 (2012)
 14.
Sun, W, Yuan, YX: Optimization Theory and Methods: Nonlinear Programming, vol. 1. Springer, Berlin (2006)
 15.
Alhawarat, A, Mamat, M, Rivaie, M, Salleh, Z: An efficient hybrid conjugate gradient method with the strong Wolfe-Powell line search. Math. Probl. Eng. 2015, Article ID 103517 (2015)
 16.
Yuan, G, Meng, Z, Li, Y: A modified Hestenes and Stiefel conjugate gradient algorithm for large-scale nonsmooth minimizations and nonlinear equations. J. Optim. Theory Appl. 168(1), 129-152 (2016)
 17.
Alhawarat, A, Mamat, M, Rivaie, M, Mohd, I: A new modification of nonlinear conjugate gradient coefficients with global convergence properties. Int. J. Math. Comput. Stat. Nat. Phys. Eng. 8(1), 54-60 (2014)
 18.
Bongartz, I, Conn, AR, Gould, N, Toint, PL: CUTE: constrained and unconstrained testing environment. ACM Trans. Math. Softw. 21(1), 123-160 (1995)
 19.
Andrei, N: An unconstrained optimization test functions collection. Adv. Model. Optim. 10(1), 147-161 (2008)
 20.
Powell, MJD: Nonconvex Minimization Calculations and the Conjugate Gradient Method. Springer, Berlin (1984)
 21.
Dolan, ED, Moré, JJ: Benchmarking optimization software with performance profiles. Math. Program. 91(2), 201-213 (2002)
Acknowledgements
The authors are grateful to the editor and the anonymous reviewers for their valuable comments and suggestions, which have substantially improved this paper. In addition, we acknowledge the Ministry of Higher Education Malaysia and Universiti Malaysia Terengganu; this study was partially supported under the Fundamental Research Grant Scheme (FRGS) Vote no. 59347.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors contributed equally to the writing of this paper. All authors read and approved the final version of this paper.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Cite this article
Salleh, Z., Alhawarat, A. An efficient modification of the Hestenes-Stiefel nonlinear conjugate gradient method with restart property. J. Inequal. Appl. 2016, 110 (2016). https://doi.org/10.1186/s13660-016-1049-5
Keywords
 conjugate gradient method
 Wolfe-Powell line search
 Hestenes-Stiefel formula
 restart condition
 performance profile