Two efficient modifications of AZPRP conjugate gradient method with sufficient descent property

The conjugate gradient (CG) method can be applied in many fields, such as neural networks, image restoration, machine learning, deep learning, and many others. The Polak–Ribière–Polyak (PRP) and Hestenes–Stiefel (HS) conjugate gradient methods are considered among the most efficient methods for solving nonlinear optimization problems. However, neither method is guaranteed to satisfy the descent property or the global convergence property for general nonlinear functions. In this paper, we present two new modifications of the PRP method with restart conditions. The proposed conjugate gradient methods satisfy the global convergence property and the descent property for general nonlinear functions. The numerical results show that the new modifications are more efficient than recent CG methods in terms of the number of iterations, the number of function evaluations, the number of gradient evaluations, and CPU time.


Introduction
We consider the unconstrained optimization problem
$$\min_{x \in \mathbb{R}^n} f(x), \qquad (1.1)$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function whose gradient is denoted by $g(x) = \nabla f(x)$. To solve (1.1) with the CG method, we use the following iteration, starting from an initial point $x_0 \in \mathbb{R}^n$:
$$x_{k+1} = x_k + \alpha_k d_k, \quad k = 0, 1, 2, \ldots, \qquad (1.2)$$
where $\alpha_k > 0$ is the step size obtained by some line search. The search direction $d_k$ is defined by
$$d_k = \begin{cases} -g_k, & k = 0, \\ -g_k + \beta_k d_{k-1}, & k \ge 1, \end{cases} \qquad (1.3)$$
where $g_k = g(x_k)$ and $\beta_k$ is the CG parameter. To obtain the step length $\alpha_k$, we have two types of line search: the exact line search
$$f(x_k + \alpha_k d_k) = \min_{\alpha \ge 0} f(x_k + \alpha d_k), \qquad (1.4)$$
and the inexact line search discussed in the next subsection. However, (1.4) is computationally expensive if the function has many local minima.
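For concreteness, the following is a minimal Python sketch of the iteration (1.2)–(1.3); the `line_search` and `beta_rule` arguments are placeholders for any step-size routine and CG parameter, not the specific methods proposed in this paper.

```python
import numpy as np

def nonlinear_cg(f, grad, x0, beta_rule, line_search, tol=1e-6, max_iter=1000):
    """Generic CG iteration: x_{k+1} = x_k + alpha_k d_k with d_k from (1.3)."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                        # d_0 = -g_0
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        alpha = line_search(f, grad, x, d)        # step size alpha_k > 0
        x_new = x + alpha * d                     # iteration (1.2)
        g_new = grad(x_new)
        beta = beta_rule(g_new, g, d, x_new - x)  # CG parameter beta_k
        d = -g_new + beta * d                     # direction update (1.3)
        x, g = x_new, g_new
    return x
```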

Inexact line search
To overcome the cost of the exact line search and obtain steps that are neither too long nor too short, we usually use an inexact line search, in particular the weak Wolfe–Powell (WWP) line search [1, 2], given as follows:
$$f(x_k + \alpha_k d_k) \le f(x_k) + \delta \alpha_k g_k^T d_k, \qquad (1.5)$$
$$g(x_k + \alpha_k d_k)^T d_k \ge \sigma g_k^T d_k. \qquad (1.6)$$
The strong version of the Wolfe–Powell (SWP) line search is given by (1.5) and
$$\left| g(x_k + \alpha_k d_k)^T d_k \right| \le -\sigma g_k^T d_k, \qquad (1.7)$$
where $0 < \delta < \sigma < 1$. The descent condition (downhill condition) plays an important role in the CG method and is given as follows:
$$g_k^T d_k < 0. \qquad (1.8)$$
Al-Baali [3] extended (1.8) to the form
$$g_k^T d_k \le -c \|g_k\|^2, \quad k \ge 0 \text{ and } c > 0, \qquad (1.9)$$
called the sufficient descent condition.

The steepest descent method is the simplest of the gradient methods for optimizing functions of $n$ variables. From a current trial point $x_1$, one expects to move closer to a minimum of $f(x)$ by stepping away from $x_1$ along a direction in which $f(x)$ decreases rapidly, i.e., $f(x_1) > f(x_2) > f(x_3) > \cdots$. This direction of steepest descent is given by the negative gradient, $-g_k$. For a function of two variables, the minimum point can be visualized using contour lines; for example, Fig. 1 shows the contour lines of the Booth function.
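As an illustration, a small check of the SWP conditions (1.5) and (1.7) for a trial step might look as follows; the parameter values `delta` and `sigma` are only example choices satisfying $0 < \delta < \sigma < 1$.

```python
import numpy as np

def satisfies_swp(f, grad, x, d, alpha, delta=1e-4, sigma=0.1):
    """Return True if the trial step alpha satisfies the SWP conditions (1.5) and (1.7)."""
    g_d = np.dot(grad(x), d)                                               # g_k^T d_k (should be < 0)
    sufficient_decrease = f(x + alpha * d) <= f(x) + delta * alpha * g_d   # (1.5)
    curvature = abs(np.dot(grad(x + alpha * d), d)) <= -sigma * g_d        # (1.7)
    return sufficient_decrease and curvature
```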
As we see in Fig. 2, the gradient $\nabla f(x)$ is orthogonal to the contour lines, and at every $x$ it points in the direction of steepest increase of $f(x)$. Figure 2 plots the Booth function together with its contour lines and gradient vectors, which clearly portrays the location of the minimum in either the surface or the contour plot. Despite its robustness, the steepest descent method is not efficient in terms of CPU time for high-dimensional functions. Using the CG method avoids the orthogonality between $\nabla f$ and the search direction; Fig. 3 shows the angle between $\nabla f$ and $d_k$ when the CG method is used. The most famous classical formulas for the CG parameter are those of Hestenes–Stiefel (HS) [3], Polak–Ribière–Polyak (PRP) [4], Liu–Storey (LS) [5], Fletcher–Reeves (FR) [6], the conjugate descent method of Fletcher (CD) [7], and Dai–Yuan (DY) [8], given as follows:
$$\beta_k^{HS} = \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}}, \quad \beta_k^{PRP} = \frac{g_k^T y_{k-1}}{\|g_{k-1}\|^2}, \quad \beta_k^{LS} = -\frac{g_k^T y_{k-1}}{d_{k-1}^T g_{k-1}},$$
$$\beta_k^{FR} = \frac{\|g_k\|^2}{\|g_{k-1}\|^2}, \quad \beta_k^{CD} = -\frac{\|g_k\|^2}{d_{k-1}^T g_{k-1}}, \quad \beta_k^{DY} = \frac{\|g_k\|^2}{d_{k-1}^T y_{k-1}},$$
where $y_{k-1} = g_k - g_{k-1}$. These methods behave similarly if we use an exact line search and the function satisfies the quadratic line search condition, since then $g_k^T d_{k-1} = 0$, which together with (1.3) implies $g_k^T d_k = -\|g_k\|^2$. In addition, if the function is quadratic, then $g_k^T g_{k-1} = 0$. The global convergence properties were studied by Zoutendijk [9] and Al-Baali [10]. The global convergence of the PRP method for a convex objective function under exact line search was proved by Polak and Ribière in [4]. Later, Powell [11] gave a counterexample with a nonconvex function on which the PRP and HS methods can cycle infinitely without approaching a solution. Powell emphasized that, to achieve global convergence of the PRP and HS methods, the CG parameter should not be negative. Moreover, Gilbert and Nocedal [12] proved that the nonnegative PRP method, i.e., the method with $\beta_k = \max\{\beta_k^{PRP}, 0\}$, is globally convergent under a suitable, though complicated, line search.
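The nonnegative PRP rule of Gilbert and Nocedal mentioned above can be sketched as follows; this is a direct transcription of $\beta_k = \max\{\beta_k^{PRP}, 0\}$, not one of the new methods proposed in this paper.

```python
import numpy as np

def beta_prp_plus(g_new, g_old):
    """Nonnegative PRP parameter: beta_k = max{beta_k^PRP, 0}."""
    y = g_new - g_old                                   # y_{k-1} = g_k - g_{k-1}
    beta_prp = np.dot(g_new, y) / np.dot(g_old, g_old)  # classical PRP formula
    return max(beta_prp, 0.0)
```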
If the function is quadratic and the step size is obtained by the exact line search (1.4), the CG method satisfies the conjugacy condition $d_i^T H d_j = 0$ for all $i \neq j$. Using the mean value theorem and the exact line search together with equation (1.3), we can obtain $\beta_k^{HS}$. Motivated by the quasi-Newton BFGS method, the limited memory (LBFGS) method, and equation (1.3), Dai and Liao [13] proposed the following conjugacy condition:
$$d_k^T y_{k-1} = -t g_k^T s_{k-1}, \qquad (1.10)$$
where $s_{k-1} = x_k - x_{k-1}$ and $t \ge 0$. In the case $t = 0$, equation (1.10) reduces to the classical conjugacy condition. By using (1.3) and (1.10), [13] proposed the following CG formula:
$$\beta_k^{DL} = \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}} - t \frac{g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}}. \qquad (1.11)$$
However, $\beta_k^{DL}$ faces the same problem as $\beta_k^{PRP}$ and $\beta_k^{HS}$, i.e., $\beta_k^{DL}$ is not nonnegative in general. Thus, [13] replaced equation (1.11) by
$$\beta_k^{DL+} = \max\left\{ \frac{g_k^T y_{k-1}}{d_{k-1}^T y_{k-1}}, 0 \right\} - t \frac{g_k^T s_{k-1}}{d_{k-1}^T y_{k-1}}. \qquad (1.12)$$
Moreover, Hager and Zhang [14, 15] presented a modified CG parameter that satisfies the descent property $g_k^T d_k \le -\frac{7}{8}\|g_k\|^2$ for any inexact line search. This version of the CG method is globally convergent whenever the line search satisfies the Wolfe–Powell (WP) requirement. The formula is given as follows:
$$\beta_k^{HZ} = \frac{1}{d_{k-1}^T y_{k-1}} \left( y_{k-1} - 2 d_{k-1} \frac{\|y_{k-1}\|^2}{d_{k-1}^T y_{k-1}} \right)^T g_k.$$
In 2006, Wei et al. [16] gave a new positive CG method, quite similar to the original PRP method, which is globally convergent under exact and inexact line searches:
$$\beta_k^{WYL} = \frac{\|g_k\|^2 - \frac{\|g_k\|}{\|g_{k-1}\|} g_k^T g_{k-1}}{\|g_{k-1}\|^2}.$$
From the WYL method, many modifications appeared, such as the one proposed in [17]. Alhawarat et al. [18] constructed the CG method $\beta_k^{AZPRP}$ with a new restart criterion, where $s_k = x_k - x_{k-1}$, $y_k = g_k - g_{k-1}$, $\mu_k$ is a restart parameter defined in terms of $\|s_k\|$ and $\|y_k\|$, and $\|\cdot\|$ denotes the Euclidean norm. In addition, Kaelo et al. [19] proposed related CG formulas, denoted $\beta_k^{DPRP}$ and $\beta_k^{DHS}$ below.
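For reference, the Dai–Liao parameter (1.11) and its nonnegative modification (1.12), written here from their standard forms in the literature, can be computed as follows; `t` is the nonnegative parameter of condition (1.10).

```python
import numpy as np

def beta_dai_liao(g_new, g_old, d_old, s_old, t=0.1):
    """Dai-Liao CG parameter (1.11) and its nonnegative variant (1.12)."""
    y = g_new - g_old                      # y_{k-1} = g_k - g_{k-1}
    denom = np.dot(d_old, y)               # d_{k-1}^T y_{k-1}
    beta_dl = (np.dot(g_new, y) - t * np.dot(g_new, s_old)) / denom                       # (1.11)
    beta_dl_plus = max(np.dot(g_new, y) / denom, 0.0) - t * np.dot(g_new, s_old) / denom  # (1.12)
    return beta_dl, beta_dl_plus
```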

Motivation and the new restarted formula
To improve the efficiency of $\beta_k^{AZPRP}$ in terms of function evaluations, gradient evaluations, number of iterations, and CPU time, we construct two new CG methods, denoted $\beta_k^{A1}$ and $\beta_k^{A2}$, based on $\beta_k^{AZPRP}$, $\beta_k^{DPRP}$, and $\beta_k^{DHS}$; both retain the restart condition $\|g_k\|^2 > \mu_k |g_k^T g_{k-1}|$.

The convergence analysis relies on the following assumption.

Assumption 1
I. The level set $\Omega = \{x \in \mathbb{R}^n : f(x) \le f(x_0)\}$ is bounded.
II. In some neighborhood $N$ of $\Omega$, $f$ is continuously differentiable, and its gradient is Lipschitz continuous; that is, for any $x, y \in N$, there exists a constant $L > 0$ such that
$$\|g(x) - g(y)\| \le L \|x - y\|.$$

The following is considered one of the most important lemmas used to prove the global convergence properties. For more details, the reader can refer to [9].
Suppose Assumption 1 holds and consider any iteration of the form (1.2)–(1.3) in which $d_k$ is a descent direction and $\alpha_k$ is obtained by the WWP line search (1.5)–(1.6). Then
$$\sum_{k \ge 0} \frac{(g_k^T d_k)^2}{\|d_k\|^2} < \infty, \qquad (3.1)$$
where (3.1) is known as the Zoutendijk condition. Inequality (3.1) also holds for the exact line search, the Armijo–Goldstein line search, and the SWP line search.
Gilbert and Nocedal [11] presented an important theorem for establishing the global convergence of the nonnegative PRP method and related nonnegative methods, summarized by Theorem 3.3. Furthermore, they introduced a useful property, called Property*, as follows.

Property* Consider a method of the form (1.2) and (1.3), and suppose that $0 < \gamma \le \|g_k\| \le \bar{\gamma}$ for all $k$. We say that the method possesses Property* if there exist constants $b > 1$ and $\lambda > 0$ such that $|\beta_k| \le b$ for all $k \ge 1$, and if $\|x_k - x_{k-1}\| \le \lambda$, then $|\beta_k| \le \frac{1}{2b}$.

The following theorem plays a crucial role in the analysis of the CG method and is given in [11].
To show that $|\beta_k^{A1}| \le \frac{1}{2b}$, we consider two cases, according to whether $\mu_k \ge 1$ or $\mu_k < 1$; in the second case, Property* for $\beta_k^{A1}$ follows from the inequality obtained using (3.3). Thus, in all cases $|\beta_k^{A1}| \le \frac{1}{2b}$, and the proof is completed.

The global convergence of $\beta_k^{A1}$ can now be stated: if Assumption 1 holds and the sequence $\{x_k\}$ is generated by (1.2) and (1.3) with $\beta_k = \beta_k^{A1}$, where the step size $\alpha_k$ is computed by the line search (1.5) and (1.6), then $\lim_{k\to\infty} \|g_k\| = 0$.

Proof We apply Theorem 3.1. Note that the following properties hold for $\beta_k^{A1}$: it satisfies Property* by Theorem 3.3, it satisfies the descent property by Theorem 3.2, and Assumption 1 holds. Thus, all conditions of Theorem 3.1 are satisfied, which leads to $\lim_{k\to\infty} \|g_k\| = 0$.

The global convergence properties of $\beta_k^{A2}$
To show that $|\beta_k^{A2}| \le \frac{1}{2b}$, we again consider the two cases $\mu_k \ge 1$ and $\mu_k < 1$; thus, in all cases $|\beta_k^{A2}| \le \frac{1}{2b}$. The global convergence of $\beta_k^{A2}$ can now be stated: if Assumption 1 holds and the sequence $\{x_k\}$ is generated by (1.2) and (1.3) with $\beta_k = \beta_k^{A2}$, where the step size $\alpha_k$ is computed by the line search (1.5) and (1.6), then $\lim_{k\to\infty} \|g_k\| = 0$.
Proof We apply Theorem 3.1. Note that the following properties hold for $\beta_k^{A2}$: it satisfies Property* by Theorem 3.6, it satisfies the descent property by Theorem 3.5, and Assumption 1 holds. Thus, all conditions of Theorem 3.1 are satisfied, which leads to $\lim_{k\to\infty} \|g_k\| = 0$.

If the condition $\|g_k\|^2 > \mu_k |g_k^T g_{k-1}|$ does not hold for $\beta_k^{A1}$ or $\beta_k^{A2}$, then the CG method is restarted using the parameter $\beta_k^{D-H}$ (a sketch of this switching logic is given at the end of this section). The following two theorems show that the CG method with $\beta_k^{D-H}$ has the descent and convergence properties.
Letting $c = 1$, we then obtain the required inequality, which completes the proof.

Proof We prove the theorem by contradiction. Suppose Theorem 3.4 is not true. Then there exists a constant $\varepsilon > 0$ such that $\|g_k\| \ge \varepsilon$ for all $k$. Squaring both sides of (1.2) and letting $\|g_k\|^q = \min\{\|g_k\|^2, \|g_k\|^3, \|g_k\|^4\}$, $q \in \mathbb{N}$, the resulting estimates lead to a contradiction with the Zoutendijk condition (3.1), which completes the proof.
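The restart logic used by the new methods can be sketched as follows; `beta_A1` and `beta_DH` are hypothetical callables standing in for the formulas $\beta_k^{A1}$ and $\beta_k^{D-H}$ given earlier, and `mu` is the computed value of $\mu_k$.

```python
import numpy as np

def restarted_direction(g_new, g_old, d_old, mu, beta_A1, beta_DH):
    """Use beta_A1 while ||g_k||^2 > mu_k |g_k^T g_{k-1}| holds; otherwise restart with beta_DH."""
    if np.dot(g_new, g_new) > mu * abs(np.dot(g_new, g_old)):
        beta = beta_A1(g_new, g_old, d_old)   # regular step
    else:
        beta = beta_DH(g_new, g_old, d_old)   # restart step
    return -g_new + beta * d_old              # direction update (1.3)
```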

Numerical results and discussions
To analyze the efficiency of the new CG methods, several test functions are selected from CUTE [20], as shown in the Appendix. These functions can be obtained from the following website: http://ccpforge.cse.rl.ac.uk/gf/project/cutest/wiki/. The notation used for the test results is defined in the Appendix. Based on the left-hand sides of Figs. 4, 5, 6, and 7, the curve of CG method A1 lies above the other curves; therefore, it is the most efficient method among the related AZPRP-type methods. CG method A2 is not as efficient as A1, but it is still more efficient than AZPRP with respect to CPU time, the number of function evaluations, the number of gradient evaluations, and the number of iterations. In addition, for applications of the CG method to image restoration, the reader can refer to [22–24].
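Comparisons of this kind are commonly reported as Dolan–Moré performance profiles; assuming that is what Figs. 4–7 show, a minimal sketch for computing one profile is given below (the array layout and the metric choice are assumptions for illustration). A solver whose curve lies above the others, as for method A1, solves the largest fraction of problems within any factor τ of the best solver.

```python
import numpy as np

def performance_profile(costs, taus):
    """Dolan-More profile rho_s(tau): costs[s, p] is the cost (CPU time, iterations,
    function or gradient evaluations) of solver s on problem p; np.inf marks a failure."""
    best = costs.min(axis=0)             # best cost achieved on each problem
    ratios = costs / best                # performance ratios r_{s,p}
    # fraction of problems each solver handles within a factor tau of the best
    return np.array([[np.mean(ratios[s] <= tau) for tau in taus]
                     for s in range(costs.shape[0])])
```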

Conclusion
In this paper, we proposed two efficient conjugate gradient methods related to the AZPRP method. The two methods satisfy the global convergence property and the descent property when the SWP line search is employed. Furthermore, our numerical results showed that the new methods are more efficient than the AZPRP method with respect to the number of iterations, gradient evaluations, function evaluations, and CPU time.