# General inertial proximal stochastic variance reduction gradient for nonconvex nonsmooth optimization

## Abstract

Motivated by the competitive performance of the proximal stochastic variance reduction gradient (Prox-SVRG) method, we propose a novel general inertial Prox-SVRG (GIProx-SVRG) algorithm for solving a class of nonconvex finite sum problems. More precisely, an extrapolation step based on Nesterov’s momentum trick is incorporated into the framework of the Prox-SVRG method. The GIProx-SVRG algorithm admits a more general accelerated update and can thus potentially achieve faster convergence. Moreover, based on the supermartingale convergence theory and the error bound condition, we establish a linear convergence rate for the iterate sequence generated by the GIProx-SVRG algorithm. To the best of our knowledge, no existing theory covers the incorporation of the general extrapolation technique into the Prox-SVRG method; we establish such a theory in this paper. Experimental results demonstrate the superiority of our method over state-of-the-art methods.

## 1 Introduction

In the field of machine learning, one can often encounter a class of finite sum optimization problems that has the following general form:

$$\min_{x\in \mathbb{R}^{d}} F(x)\triangleq \bigl\{ f(x)+R(x) \bigr\} ,$$
(1)

where f is a loss function given by the average of a large number of smooth (not necessarily convex) component functions, i.e., $$f(x)=\frac{1}{n}\sum^{n}_{i=1}f_{i}(x)$$ with each $$f_{i}:\mathbb{R}^{d}\rightarrow \mathbb{R}$$, and $$R:\mathbb{R}^{d}\rightarrow \mathbb{R}\cup \{+\infty \}$$ is a convex (possibly nondifferentiable) regularizer used to prevent overfitting or to induce sparsity (e.g., the $$l_{1}$$ norm). We assume that the optimal value of problem (1) is finite and attainable. This type of problem covers many applications in machine learning, including but not limited to matrix completion, image processing, and neural networks.

One key challenge for problem (1) is that traditional deterministic first-order methods become prohibitively expensive when the number of component functions is extremely large. In this case, a popular alternative in machine learning is stochastic gradient descent (SGD), which dates back to the pioneering work of Robbins and Monro. Each SGD iteration is cheaper than a full gradient iteration since the gradient is computed at only one sample or a mini-batch of samples. However, SGD requires a decreasing step size to ensure convergence because of the variance induced by random sampling, which results in a slower sublinear convergence rate (cf. [19, Chap. 2.1]).

Acceleration techniques also play an important role in both the theoretical analysis and the practical application of finite sum optimization. Representative methods include Katyusha momentum, ASVRG, SpiderBoost with momentum, and other accelerated methods (see, e.g., [6, 34]). Although the specific acceleration steps differ across these methods, they share a common theoretical conclusion: SVRG coupled with an acceleration trick can attain the best known oracle complexities for convex objectives or near-optimal oracle complexity for nonconvex problems. Building on the research on acceleration in unconstrained optimization, acceleration techniques have also been introduced into constrained composite optimization algorithms (see, e.g., [4, 33]). Recently, the general inertial acceleration technique has emerged in deterministic optimization and has been shown empirically to outperform the usual acceleration methods (e.g., [30, 31]). In contrast to the usual acceleration step, the inertial acceleration step has a more general form and thus potentially better performance in practice.

These facts suggest that acceleration techniques, especially the general inertial acceleration technique, can speed up convergence for both convex and nonconvex problems. Consequently, in this paper, we incorporate the inertial acceleration technique into the framework of Prox-SVRG and propose a general inertial proximal stochastic variance reduction gradient (GIProx-SVRG) algorithm for a class of nonsmooth nonconvex finite sum optimization problems. Specifically, we use two Nesterov-type extrapolation steps, each involving the latest two iterates, as a general inertial step to adjust the current iterate. The GIProx-SVRG algorithm has a more general accelerated form than the Prox-SVRG method and can thus potentially achieve an accelerated convergence rate.

The contributions of this paper are summarized as follows:

1. (i)

A novel stochastic first-order method, namely GIProx-SVRG, is proposed by combining the general inertial acceleration technique with Prox-SVRG. GIProx-SVRG has a general acceleration step, which can potentially improve performance over Prox-SVRG.

2. (ii)

Based on the supermartingale convergence theory and the well-known error bound condition, we establish a linear convergence property for the iteration sequences generated by the GIProx-SVRG algorithm.

3. (iii)

Extensive experimental results on standard datasets demonstrate that the proposed algorithm outperforms the competing approaches considered.

The rest of this paper is organized as follows. Some related notations and theories are reviewed in Sect. 2. In Sect. 3, we introduce and describe the GIProx-SVRG algorithm in detail. In Sect. 4, we establish the main convergence properties of the proposed algorithm. Section 5 reports numerical simulations. Section 6 concludes the paper.

## 2 Notations and preliminaries

Throughout this paper, $$\mathbb{R}^{d}$$ is the d-dimensional Euclidean space, and its standard inner product is denoted by $$\langle \cdot ,\cdot \rangle$$. We use $$\|\cdot \|$$ and $$\|\cdot \|_{1}$$ to denote the $$l_{2}$$ and $$l_{1}$$ norms, respectively. For a closed convex set $$X\subseteq \mathbb{R}^{d}$$, we let $$\operatorname{dist}(x,X)=\inf_{y\in X}\|x-y\|$$ be the distance of the point $$x\in \mathbb{R}^{d}$$ to X. The domain of an extended real-valued function $$g:\mathbb{R}^{d}\rightarrow [-\infty ,\infty ]$$ is defined as $$\operatorname{dom} g=\{x\in \mathbb{R}^{d}\mid g(x)<+\infty \}$$. The function g is proper if and only if $$\operatorname{dom} g\neq \emptyset$$ and $$g(x)>-\infty$$ for any $$x\in \operatorname{dom} g$$. We say the function g is closed if it is lower semicontinuous. A proper closed function g is said to be level bounded if its lower level sets are bounded, i.e., $$\{x\in \mathbb{R}^{d}\mid g(x)\leq c\}$$ is bounded for every $$c\in \mathbb{R}$$. For a stochastic algorithm, we use $$\mathbb{E}[\cdot ]$$ to denote the total expectation in terms of its whole iteration process. For a certain random variable i, we denote its expectation by $$\mathbb{E}_{i}[\cdot ]$$.

The proximal operator of a proper closed convex function g at the point $$y\in \mathbb{R}^{d}$$ with parameter $$\alpha >0$$ is defined as

$$\operatorname{prox}_{\alpha g}(y):=\operatorname*{arg\,min}_{x\in \mathbb{R}^{d}} \biggl\{ g(x)+ \frac{1}{2\alpha} \Vert x-y \Vert ^{2}\biggr\} .$$

We refer the readers to  for more detailed properties of the proximal operator.
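As a concrete illustration of the definition above, the proximal operator of a scaled $$l_{1}$$ norm has a closed form (componentwise soft-thresholding). The following is a minimal Python sketch; the function name `prox_l1` and the weight `mu` are illustrative choices, not notation from this paper.

```python
import numpy as np

def prox_l1(y, alpha, mu=1.0):
    """Proximal operator of g(x) = mu * ||x||_1 with parameter alpha.

    Solves argmin_x { mu * ||x||_1 + (1 / (2 * alpha)) * ||x - y||^2 },
    whose solution is componentwise soft-thresholding at level alpha * mu.
    """
    y = np.asarray(y, dtype=float)
    return np.sign(y) * np.maximum(np.abs(y) - alpha * mu, 0.0)

# Entries with |y_i| <= alpha * mu are set to zero; the others shrink toward 0.
y = np.array([3.0, -0.2, 0.5, -2.0])
x = prox_l1(y, alpha=0.5)
print(x)
```

The thresholding sets small entries exactly to zero, which is why the $$l_{1}$$ regularizer induces sparsity.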

### Definition 1

For a continuously differentiable function $$f:\mathbb{R}^{d} \rightarrow \mathbb{R}$$, if

$$\bigl\Vert \nabla f(x)-\nabla f(y) \bigr\Vert \leq L \Vert x-y \Vert ,\quad \forall x,y\in \mathbb{R}^{d},$$

then we say f has a Lipschitz continuous gradient with modulus $$L>0$$.

### Lemma 1

([20, Lemma 1.2.3])

Let $$f:\mathbb{R}^{d}\rightarrow \mathbb{R}$$ be a continuously differentiable function whose gradient is Lipschitz continuous with modulus L. Then, for all $$x,y\in \mathbb{R}^{d}$$,

$$\bigl\vert f(x)-f(y)-\bigl\langle \nabla f(y),x-y\bigr\rangle \bigr\vert \leq \frac{L}{2} \Vert x-y \Vert ^{2}.$$

In the following, we give the definition of the subdifferential.

### Definition 2

(Subdifferential [31, Definition 1])

Let $$f:\mathbb{R}^{d}\rightarrow (-\infty ,+\infty ]$$ be a proper and lower semicontinuous function.

1. (i)

For given $$x\in \operatorname{dom} f$$, the Fréchet subdifferential of f at x, written $$\hat{\partial} f(x)$$, is the set of all vectors $$u\in \mathbb{R}^{d}$$ satisfying

$$\liminf_{y\rightarrow x,\, y\neq x} \frac{f(y)-f(x)-\langle u,y-x\rangle}{ \Vert y-x \Vert }\geq 0,$$

and we set $$\hat{\partial} f(x)=\emptyset$$ when $$x\notin \operatorname{dom} f$$.

2. (ii)

The limiting-subdifferential (or simply the subdifferential) of the function f at $$x\in \operatorname{dom} f$$ is defined as

$$\partial f(x)\triangleq \bigl\{ \xi \in \mathbb{R}^{d}\mid \exists x_{k} \rightarrow x, \text{s.t. }f(x_{k}) \rightarrow f(x) \text{ and } \hat{\partial} f(x_{k}) \ni \xi _{k} \rightarrow \xi \bigr\} .$$

3. (iii)

A point $$x^{*}$$ is called a (limiting) stationary point of f if it satisfies $$0\in \partial f(x^{*})$$.

In this paper, we denote the set of stationary points of problem (1) by $$\mathcal{X}$$.

In the following, we give the property of strong convexity.

### Definition 3

A function $$f:\mathbb{R}^{d}\rightarrow \mathbb{R}$$ is μ-strongly convex if

$$f(x)\geq f(y)+\langle \xi ,x-y\rangle +\frac{\mu}{2} \Vert x-y \Vert ^{2},\quad \forall x,y\in \mathbb{R}^{d},$$

for any subgradient $$\xi \in \partial f(y)$$.

Finally, we introduce two notions of linear convergence.

### Definition 4

For a given sequence $$\{x_{k}\}$$ with limit $$x^{\ast}$$ (e.g., a solution of problem (1)), suppose that there exists a constant $$0<\theta <1$$ such that the following inequality holds for all sufficiently large k:

$$\bigl\Vert x_{k+1}-x^{*} \bigr\Vert \leq \theta \bigl\Vert x_{k}-x^{*} \bigr\Vert .$$

Then we say that the sequence $$\{x_{k}\}$$ is Q-linearly convergent. If the following holds

$$\limsup_{k\rightarrow +\infty} \bigl\Vert x_{k}-x^{*} \bigr\Vert ^{ \frac{1}{k}}< 1,$$

then we say that the sequence $$\{x_{k}\}$$ is R-linearly convergent.

In what follows, we provide a close relationship between Q-linear convergence and R-linear convergence.

### Lemma 2

Given two sequences $$\{x_{k}\}$$ and $$\{y_{k}\}$$, if the following two conditions hold:

1. (1)

$$0\leq y_{k}\leq x_{k}$$;

2. (2)

the sequence $$\{x_{k}\}$$ is Q-linearly convergent,

then the sequence $$\{y_{k}\}$$ is R-linearly convergent.
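To make the distinction between the two notions concrete, the following small Python sketch builds a scalar example (with limit 0) in the spirit of Lemma 2. The particular sequences are hypothetical choices for illustration only: $$x_{k}=\theta ^{k}$$ is Q-linearly convergent, while the dominated sequence $$y_{k}$$ has consecutive ratios that exceed 1 and is therefore only R-linearly convergent.

```python
import numpy as np

theta = 0.5
k = np.arange(1, 41)

# x_k = theta**k is Q-linearly convergent to 0: x_{k+1} = theta * x_k.
x = theta ** k

# y_k is dominated by x_k but shrunk on odd indices (a hypothetical choice),
# so the consecutive ratios y_{k+1} / y_k are not uniformly below 1.
y = np.where(k % 2 == 0, x, 0.1 * x)

q_ratios_x = x[1:] / x[:-1]      # ratios used in the Q-linear definition
q_ratios_y = y[1:] / y[:-1]
r_roots_y = y ** (1.0 / k)       # k-th roots used in the R-linear definition

print("max Q-ratio of x:", q_ratios_x.max())
print("max Q-ratio of y:", q_ratios_y.max())
print("largest k-th root of y_k:", r_roots_y.max())
```

Here the k-th roots of $$y_{k}$$ stay bounded by θ even though the one-step ratios do not, which is exactly the situation covered by Lemma 2.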

In the following, we state a convergence property of supermartingales.

### Lemma 3

(Supermartingale convergence [7, Lemma 2.7])

Let $$\{x_{k}\}^{\infty}_{k=0}$$ and $$\{Y_{k}\}^{\infty}_{k=0}$$ be sequences of bounded nonnegative random variables. If for all k,

$$Y_{k}\leq \mathbb{E}_{k}[x_{k}-x_{k+1}],$$

then $$\sum^{\infty} _{k=0}Y_{k}<\infty$$ a.s. and $$\{x_{k}\}$$ is convergent a.s.

## 3 General inertial proximal SVRG for nonconvex nonsmooth optimization

In this section, we introduce our proposed accelerated stochastic-type algorithm. Wu et al. developed a general inertial proximal gradient algorithm (GIPGM) for nonsmooth nonconvex problems and achieved good performance compared with other accelerated proximal gradient methods. This raises the question of whether the general inertial technique, when combined with the Prox-SVRG algorithm, performs equally well on the nonsmooth nonconvex finite sum problem, a question that remains largely open. This motivated us to incorporate the technique into Prox-SVRG to further improve the convergence performance of the stochastic-type algorithm.

Combining GIPGM and Prox-SVRG, we develop a new accelerated stochastic first-order method, namely, general inertial proximal stochastic variance reduction gradient algorithm (GIProx-SVRG) for a class of nonsmooth nonconvex finite sum problems. The detailed pseudocode of GIProx-SVRG is given in Algorithm 1.

As shown in Algorithm 1, GIProx-SVRG consists of two nested loops, where the outer loop is indexed by $$s=1,\ldots ,S$$ and the inner loop by $$k=0,\ldots ,m-1$$. As in Prox-SVRG, the core operation in the outer loop, i.e., Step 3, is the computation of a full gradient involving all component functions of problem (1). The inner loop, from Step 5 to Step 9, performs multiple stochastic gradient steps. In the inner loop, two Nesterov-type acceleration steps are performed as a general inertial step, which is the main difference from AProx-SVRG and Prox-SVRG. Then, an unbiased gradient estimate, i.e., $$\mathbb{E}_{i^{s}_{k}}[v^{s}_{k}]=\nabla f(z^{s}_{k})$$, is constructed at the inertial accelerated point $$z^{s}_{k}$$ with respect to the sample $$i^{s}_{k}$$ selected randomly from $$\{1,\ldots,n\}$$. Finally, a proximal operator is used to compute the proximal point $$x^{s}_{k+1}$$ from the inertial points $$y^{s}_{k}$$ and $$z^{s}_{k}$$. As the iterations proceed, the variance of the stochastic gradient gradually tends to zero, and thus a larger constant step size can be used in GIProx-SVRG. Note that, by a simple calculation, Step 9 in Algorithm 1 is equivalent to the following:

$$x^{s}_{k+1}=\operatorname*{arg\,min}_{x\in \mathbb{R}^{d}} \biggl\{ R(x)+ \bigl\langle v^{s}_{k},x\bigr\rangle + \frac{1}{2\alpha} \bigl\Vert x-y^{s}_{k} \bigr\Vert ^{2}\biggr\} .$$
(2)
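The loop structure described above can be sketched in Python as follows. This is an illustrative reconstruction under stated assumptions, not the authors' reference implementation: the inertial points are assumed to be $$y^{s}_{k}=x^{s}_{k}+\beta (x^{s}_{k}-x^{s}_{k-1})$$ and $$z^{s}_{k}=x^{s}_{k}+\lambda (x^{s}_{k}-x^{s}_{k-1})$$ (consistent with the identities used in the analysis), the snapshot $$\tilde{x}^{s}$$ is assumed to be the last inner iterate, and the test problem (least-squares components with an $$l_{1}$$ regularizer, which is convex) is chosen only for simplicity. All function names are hypothetical.

```python
import numpy as np

def soft_threshold(u, t):
    # prox of t * ||.||_1 (componentwise soft-thresholding)
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def objective(A, b, mu, x):
    # F(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2 + mu * ||x||_1
    return 0.5 * np.mean((A @ x - b) ** 2) + mu * np.sum(np.abs(x))

def giprox_svrg(A, b, mu, alpha, beta, lam, S, m, seed=0):
    """Sketch of GIProx-SVRG for f_i(x) = 0.5 * (a_i^T x - b_i)^2, R = mu * ||x||_1."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_tilde = np.zeros(d)
    for s in range(S):
        full_grad = A.T @ (A @ x_tilde - b) / n   # full gradient at the snapshot
        x_prev = x_tilde.copy()                   # assumption: x_{-1}^s = x_0^s
        x = x_tilde.copy()
        for k in range(m):
            y = x + beta * (x - x_prev)           # inertial point used as prox center
            z = x + lam * (x - x_prev)            # inertial point used for the gradient
            i = rng.integers(n)                   # uniform random sample index
            g_z = A[i] * (A[i] @ z - b[i])
            g_snap = A[i] * (A[i] @ x_tilde - b[i])
            v = g_z - g_snap + full_grad          # variance-reduced gradient estimate
            x_prev, x = x, soft_threshold(y - alpha * v, alpha * mu)
        x_tilde = x                               # assumption: snapshot = last iterate
    return x_tilde

# Small synthetic usage example.
rng = np.random.default_rng(42)
n, d, mu = 200, 20, 0.01
A = rng.standard_normal((n, d))
x_true = np.zeros(d)
x_true[:3] = [1.0, -2.0, 0.5]
b = A @ x_true + 0.01 * rng.standard_normal(n)
L = np.max(np.sum(A ** 2, axis=1))                # L = max_i ||a_i||^2
x_hat = giprox_svrg(A, b, mu, alpha=0.1 / L, beta=0.2, lam=0.2, S=20, m=n)
print(objective(A, b, mu, np.zeros(d)), objective(A, b, mu, x_hat))
```

Setting `beta = lam = 0` in this sketch removes both extrapolation steps and leaves a plain Prox-SVRG-style inner loop, which makes the two variants easy to compare on the same data.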

### Remark

In view of the structure of Algorithm 1, GIProx-SVRG has a more general accelerated form than AProx-SVRG and Prox-SVRG. In particular, GIProx-SVRG reduces to AProx-SVRG when $$\beta =0$$ and to Prox-SVRG when $$\beta =\lambda =0$$. Note that the broad parameter ranges given in Algorithm 1 are chosen to guarantee convergence. Better choices can be made according to the specific structure of a problem, but this is beyond the scope of this paper.

In the following, we provide an upper bound for stochastic gradient variance.

### Lemma 4

Let $$\{x^{s}_{k}\}$$ be the sequence generated by Algorithm 1. Then, for stochastic gradient $$v^{s}_{k}=\nabla f_{i^{s}_{k}}(z^{s}_{k})-\nabla f_{i^{s}_{k}}( \tilde{x}^{s-1})+\nabla f(\tilde{x}^{s-1})$$ in Algorithm 1, we have for all k in a fixed epoch s that

$$\mathbb{E}_{k} \bigl\Vert v^{s}_{k}- \nabla f\bigl(z^{s}_{k}\bigr) \bigr\Vert ^{2}\leq 2L^{2} \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}+2\lambda ^{2}L^{2} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2},$$

where $$L=\max_{i}L_{i}$$ is a Lipschitz constant of $$\nabla f$$, $$L_{i}$$ is the Lipschitz constant of $$\nabla f_{i}$$, and λ is the extrapolation parameter in Algorithm 1.

See Appendix A.1 for the proof. Lemma 4 shows that the variance of the stochastic gradient is bounded by $$O(\|x^{s}_{k}-\tilde{x}^{s-1}\|^{2}+\|x^{s}_{k}-x^{s}_{k-1}\|^{2})$$. This bound shows that the variance gradually tends to zero as the iterates approach a stationary point.
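The bound in Lemma 4 can be checked numerically on a small instance. The sketch below is only an illustration under assumptions that are not part of the lemma: least-squares components $$f_{i}(x)=\frac{1}{2}(a_{i}^{\top}x-b_{i})^{2}$$ (so that $$L_{i}=\|a_{i}\|^{2}$$), uniform sampling of the index, and an extrapolated point of the form $$z=x+\lambda (x-x_{\mathrm{prev}})$$. The conditional expectation is computed exactly by averaging over all n indices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 50, 5, 0.3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(i, x):
    # gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):
    # gradient of f = (1/n) * sum_i f_i
    return A.T @ (A @ x - b) / n

L = np.max(np.sum(A ** 2, axis=1))   # L = max_i L_i with L_i = ||a_i||^2

# Arbitrary previous iterate, current iterate, and snapshot point.
x_prev, x, x_tilde = (rng.standard_normal(d) for _ in range(3))
z = x + lam * (x - x_prev)           # assumed form of the inertial point

# Exact conditional variance: average over all equally likely indices.
g_z, g_tilde = full_grad(z), full_grad(x_tilde)
variance = np.mean([
    np.sum((grad_i(i, z) - grad_i(i, x_tilde) + g_tilde - g_z) ** 2)
    for i in range(n)
])

bound = (2 * L**2 * np.sum((x - x_tilde) ** 2)
         + 2 * lam**2 * L**2 * np.sum((x - x_prev) ** 2))
print(variance <= bound)
```

Because the bound uses the worst-case constant $$L=\max_{i}L_{i}$$, it is typically loose on random data, so the right-hand side tends to be far larger than the measured variance.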

In the following, we derive an inequality associated with a general proximal stochastic gradient iteration.

### Lemma 5

Let f have a Lipschitz continuous gradient with modulus L, and let R be a proper lower semicontinuous convex function, so that $$F = f + R$$. Let v be a stochastic gradient associated with a random index i such that $$\mathbb{E}_{i}[v]=\nabla f(z)$$, and let the point $$y\in \mathbb{R}^{d}$$ be independent of i. Consider the following proximal iteration:

$$x^{+}=\operatorname{prox}_{\alpha R}(y-\alpha v),$$

where $$\alpha >0$$. Then, for any point $$x\in \mathbb{R}^{d}$$ independent of i, we have the following inequality:

\begin{aligned} \mathbb{E}_{i}\bigl[F \bigl(x^{+}\bigr)\bigr]& \leq F(x)+\alpha \mathbb{E}_{i} \bigl\Vert v- \nabla f(z) \bigr\Vert ^{2}+\frac{L}{2} \mathbb{E}_{i} \bigl\Vert x^{+}-z \bigr\Vert ^{2}- \frac{1}{2\alpha}\mathbb{E}_{i} \bigl\Vert x^{+}-y \bigr\Vert ^{2} \\ &\quad{} - \frac{1}{2\alpha}\mathbb{E}_{i} \bigl\Vert x^{+}-x \bigr\Vert ^{2}+\frac{L}{2} \Vert x-z \Vert ^{2}+ \frac{1}{2\alpha} \Vert x-y \Vert ^{2}. \end{aligned}

### Proof

Since $$\nabla f$$ is Lipschitz continuous, by Lemma 1, we have that for any $$x\in \mathbb{R}^{d}$$,

$$\textstyle\begin{cases} f(x^{+})\leq f(z)+\langle \nabla f(z),x^{+}-z\rangle +\frac{L}{2} \Vert x^{+}-z \Vert ^{2}, \\ f(z)\leq f(x)+\langle \nabla f(z),z-x\rangle +\frac{L}{2} \Vert x-z \Vert ^{2}. \end{cases}$$

On the other hand, since the objective of the proximal subproblem defining $$x^{+}=\operatorname{prox}_{\alpha R}(y-\alpha v)$$ is $$\frac{1}{\alpha}$$-strongly convex, it follows that

$$R\bigl(x^{+}\bigr)\leq R(x)+\bigl\langle v,x-x^{+} \bigr\rangle +\frac{1}{2\alpha} \Vert x-y \Vert ^{2}- \frac{1}{2\alpha} \bigl\Vert x^{+}-y \bigr\Vert ^{2}-\frac{1}{2\alpha} \bigl\Vert x^{+}-x \bigr\Vert ^{2}.$$

Summing these three inequalities together, we see further that

\begin{aligned} F\bigl(x^{+}\bigr)& \leq F(x)+\bigl\langle v-\nabla f(z),x-x^{+} \bigr\rangle +\frac{L}{2} \bigl\Vert x^{+}-z \bigr\Vert ^{2}- \frac{1}{2\alpha} \bigl\Vert x^{+}-y \bigr\Vert ^{2} \\ &\quad{} - \frac{1}{2\alpha} \bigl\Vert x^{+}-x \bigr\Vert ^{2}+\frac{L}{2} \Vert x-z \Vert ^{2}+ \frac{1}{2\alpha} \Vert x-y \Vert ^{2} \\ & = F(x)+\bigl\langle v-\nabla f(z),\bar{x}^{+}-x^{+} \bigr\rangle +\bigl\langle v- \nabla f(z),x-\bar{x}^{+}\bigr\rangle + \frac{L}{2} \bigl\Vert x^{+}-z \bigr\Vert ^{2} \\ &\quad{} - \frac{1}{2\alpha} \bigl\Vert x^{+}-y \bigr\Vert ^{2}-\frac{1}{2\alpha} \bigl\Vert x^{+}-x \bigr\Vert ^{2}+ \frac{L}{2} \Vert x-z \Vert ^{2}+\frac{1}{2\alpha} \Vert x-y \Vert ^{2} \\ & \leq F(x)+\alpha \bigl\Vert v-\nabla f(z) \bigr\Vert ^{2}+ \bigl\langle v-\nabla f(z),x- \bar{x}^{+}\bigr\rangle + \frac{L}{2} \bigl\Vert x^{+}-z \bigr\Vert ^{2} \\ &\quad{} - \frac{1}{2\alpha} \bigl\Vert x^{+}-y \bigr\Vert ^{2}-\frac{1}{2\alpha} \bigl\Vert x^{+}-x \bigr\Vert ^{2}+ \frac{L}{2} \Vert x-z \Vert ^{2}+\frac{1}{2\alpha} \Vert x-y \Vert ^{2}, \end{aligned}

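where $$\bar{x}^{+}=\operatorname{prox}_{\alpha R}(y-\alpha \nabla f(z))$$. The second inequality follows from $$\langle x,y\rangle \leq \|x\|\|y\|$$ and the nonexpansive property of the proximal operator (see [32, Lemma 3.6]), which together give $$\langle v-\nabla f(z),\bar{x}^{+}-x^{+}\rangle \leq \alpha \|v-\nabla f(z)\|^{2}$$. Taking the conditional expectation $$\mathbb{E}_{i}$$ on both sides of the above inequality, and noting that $$\mathbb{E}_{i}\langle v-\nabla f(z),x-\bar{x}^{+}\rangle =0$$ because x and $$\bar{x}^{+}$$ are independent of i and $$\mathbb{E}_{i}[v]=\nabla f(z)$$, yields the conclusion. □

Building on Lemmas 4 and 5, we now derive the following lemma.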

### Lemma 6

Let $$\{x^{s}_{k}\}$$ be the sequence generated by Algorithm 1. Assume that the objective function F in problem (1) satisfies conditions in Lemma 5. Then we have

\begin{aligned} \mathbb{E}_{k}\bigl[F \bigl(x^{s}_{k+1}\bigr)\bigr]& \leq F\bigl(x^{s}_{k} \bigr)+ \frac{L}{2}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-z^{s}_{k} \bigr\Vert ^{2}- \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2}- \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} + \biggl(2\alpha L^{2}\lambda ^{2}+ \frac{L\lambda ^{2}}{2}+ \frac{\beta ^{2}}{2\alpha}\biggr) \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+2\alpha L^{2} \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}. \end{aligned}
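As a brief sketch of where these constants come from, one can apply Lemma 5 with $$x=x^{s}_{k}$$, $$y=y^{s}_{k}$$, $$z=z^{s}_{k}$$, and $$x^{+}=x^{s}_{k+1}$$. The sketch assumes that the inertial points take the form $$y^{s}_{k}=x^{s}_{k}+\beta (x^{s}_{k}-x^{s}_{k-1})$$ and $$z^{s}_{k}=x^{s}_{k}+\lambda (x^{s}_{k}-x^{s}_{k-1})$$, which matches the identities used in the proof of Theorem 2. Under this assumption,

\begin{aligned} \frac{L}{2} \bigl\Vert x^{s}_{k}-z^{s}_{k} \bigr\Vert ^{2}&=\frac{L\lambda ^{2}}{2} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}, \\ \frac{1}{2\alpha} \bigl\Vert x^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2}&=\frac{\beta ^{2}}{2\alpha} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}, \\ \alpha \mathbb{E}_{k} \bigl\Vert v^{s}_{k}-\nabla f\bigl(z^{s}_{k}\bigr) \bigr\Vert ^{2}&\leq 2\alpha L^{2} \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2}+2\alpha L^{2}\lambda ^{2} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}, \end{aligned}

where the last line is Lemma 4 multiplied by α. Adding the three coefficients of $$\|x^{s}_{k}-x^{s}_{k-1}\|^{2}$$ gives $$2\alpha L^{2}\lambda ^{2}+\frac{L\lambda ^{2}}{2}+\frac{\beta ^{2}}{2\alpha}$$, and the remaining terms of Lemma 5 carry over unchanged.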

See Appendix A.2 for the proof. In the next section, we present a detailed convergence analysis of the sequences $$\{\tilde{x}^{s}\}$$ and $$\{F(\tilde{x}^{s})\}$$ generated by Algorithm 1, together with the main results of our paper.

## 4 Convergence analysis

In this section, we provide a detailed theoretical analysis of the proposed algorithm. We show that the proposed algorithm converges locally linearly to a stationary point of problem (1) under some mild assumptions. To this end, we first list the necessary assumptions.

### Assumption 1

For the objective function $$F=f+R$$ in problem (1),

1. (i)

F is level bounded and bounded from below;

2. (ii)

F is coercive, i.e., $$F(x)\rightarrow \infty$$ whenever $$\|x\|\rightarrow \infty$$;

3. (iii)

Each component function $$f_{i}:\mathbb{R}^{d}\rightarrow \mathbb{R}$$ in $$f=\frac{1}{n}\sum^{n}_{i=1}f_{i}$$ for all $$i\in \{1,2,\ldots,n\}$$ is continuously differentiable, nonconvex and has Lipschitz continuous gradient with modulus $$L_{i}>0$$;

4. (iv)

$$R:\mathbb{R}^{d}\rightarrow (-\infty ,\infty ]$$ is a proper lower semicontinuous convex function, and its associated proximal operator is easy to compute.

### Assumption 2

In our Algorithm 1, for any fixed epoch $$s\in \{1,2,\ldots,S\}$$, the sample sequences $$\{i^{s}_{k}\}=(i^{s}_{1},i^{s}_{2},\ldots,i^{s}_{k})$$ in the inner loop are independent and identically distributed (i.i.d.).

Assumption 1 is very common in the convergence analysis of stochastic first-order algorithms for nonconvex composite optimization. Items (i) and (ii) guarantee the boundedness of the sequence $$\{\tilde{x}^{s}\}$$ generated by Algorithm 1. Item (iii) implies that f also has a Lipschitz continuous gradient with modulus $$L=\max_{i}L_{i}$$. Item (iv) ensures that the proximal operator of R is well defined and computable.

In the following, we construct a random auxiliary function

$$H^{s}_{k}=F\bigl(x^{s}_{k} \bigr)+\Theta _{k}\bigl( \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr),$$

where the sequence $$\{\tilde{x}^{s}\}$$ is generated by Algorithm 1 and $$\Theta _{k}=\Theta _{k+1}(1+a)+2\alpha ^{2}L^{2}(1+\lambda ^{2})+L \lambda (\lambda -\frac{1}{2})+\frac{\beta}{2\alpha}$$ is a nonincreasing sequence. We first show that $$\{H^{s}_{k}\}$$ decreases in expectation.
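The backward recursion that defines $$\Theta _{k}$$ can be unrolled numerically. The sketch below is a hypothetical helper, not part of the paper: it lumps the constant term of the recursion into a single nonnegative number `c` and uses the terminal condition $$\Theta _{m}=0$$ as an assumption.

```python
def theta_sequence(m, a, c):
    """Return [Theta_0, ..., Theta_m] with Theta_m = 0 and
    Theta_k = (1 + a) * Theta_{k+1} + c, computed backward from k = m - 1."""
    theta = [0.0] * (m + 1)
    for k in range(m - 1, -1, -1):
        theta[k] = (1.0 + a) * theta[k + 1] + c
    return theta

# Example values of m, a, c are arbitrary illustrative choices.
m, a, c = 5, 0.1, 0.02
thetas = theta_sequence(m, a, c)
print(thetas)
```

With $$c\geq 0$$ and $$a>0$$, unrolling gives the closed form $$\Theta _{k}=c\,\frac{(1+a)^{m-k}-1}{a}$$, which is nonincreasing in k and makes it straightforward to check a candidate upper bound on $$\Theta _{k+1}$$ for given parameters.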

### Lemma 7

Let $$\{x^{s}_{k}\}$$ be the sequence generated by Algorithm 1. Suppose that $$\Theta _{k}$$ in $$H^{s}_{k}$$ satisfies $$\Theta _{k+1}\leq [\frac{1}{\alpha}(1-\frac{\beta}{2})+\frac{L}{2}( \lambda -1)]\frac{a}{2a+1}$$, $$a>0$$, and $$\Theta _{m}=0$$. Let $$\beta , \lambda \in [0,1)$$ and $$\alpha \in (0,\frac{\beta}{L\lambda}]$$. Then the following statements hold:

1. (i)

For all k in a fixed epoch s,

\begin{aligned} \mathbb{E}_{k}\bigl[H^{s}_{k+1}-H^{s}_{k} \bigr]& \leq - \bigl[\Theta _{k+1}(1+a)+2 \alpha L^{2} \bigr] \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ &\quad{} - \biggl[2\alpha L^{2}\lambda ^{2}+L\lambda ^{2}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr] \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2}. \end{aligned}
2. (ii)

The following summability property of the iterate sequences holds almost surely:

$$\sum^{\infty}_{s=1}\sum ^{m-1}_{k=0}\rho _{1}\bigl[ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr]< \infty ,\quad \textit{a.s.}$$

for some constant $$\rho _{1}>0$$, and it follows that

$$\bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert \xrightarrow{\textit{a.s.}}0,\quad \textit{and} \quad \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert \xrightarrow{\textit{a.s.}}0.$$
3. (iii)

$$\{H^{s}_{k}\}$$ is almost surely convergent to a finite, positive random variable.

See Appendix A.3 for the proof. Lemma 7 shows that $$\{H^{s}_{k}\}$$ decreases in expectation and that the squared successive differences of the iterates are summable. Let Ω be the set of accumulation points of the sequence $$\{\tilde{x}^{s}\}$$ generated by Algorithm 1. Then, from Lemma 7(ii), we have $$\emptyset \neq \Omega \subseteq \mathcal{X}$$ if F is level bounded. In the next theorem, we show that any accumulation point of the sequence $$\{\tilde{x}^{s}\}$$ generated by Algorithm 1, if it exists, is a stationary point of the objective function F.

### Theorem 1

Let $$\{x^{s}_{k}\}$$ be the sequence generated by Algorithm 1. Suppose that Assumption 1 and the conditions on the step size and inertial coefficients in Lemma 7 are satisfied. Then we obtain the following two conclusions:

1. (i)

Any accumulation point of the sequence $$\{\tilde{x}^{s}\}$$ almost surely is a stationary point of problem (1).

2. (ii)

$$\lim_{s\rightarrow \infty}F(\tilde{x}^{s}):=\xi$$ exists almost surely and $$F\equiv \xi$$ over Ω almost surely.

See Appendix A.4 for the proof. In what follows, we will show another main result of this paper under the local error bound condition (EB), which is a key tool in the convergence analysis of nonconvex optimization ([29, Sect. 3.2]). Here, we provide the expected form of EB.

### Definition 5

(Error bound condition)

Suppose that $$x\in \mathbb{R}^{d}$$ is generated by a certain stochastic algorithm and v is a stochastic gradient associated with a random index i. Then:

1. (1)

For any $$\xi \geq \inf_{x\in \mathbb{R}^{d}}F(x)$$, there exist $$\varepsilon >0$$ and $$\tau >0$$ such that

$$\mathbb{E}\bigl[\operatorname{dist}(x,\mathcal{X})\bigr]\leq \tau \mathbb{E} \bigl\Vert x-\operatorname{prox}_{R}(x- \alpha v) \bigr\Vert ,$$

whenever $$\mathbb{E}\|x-\operatorname{prox}_{R}(x-\alpha v)\|\leq \varepsilon$$ and $$F(x)\leq \xi$$.

2. (2)

There exists $$\delta >0$$, s.t., $$\|x-y\|\geq \delta$$ whenever $$x,y\in \mathcal{X}$$, $$F(x)\neq F(y)$$.
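The quantity on the right-hand side of the error bound is a proximal-gradient residual. As a rough illustration, the Python sketch below evaluates a deterministic version of it, $$\|x-\operatorname{prox}_{\alpha R}(x-\alpha \nabla f(x))\|$$, which is the form used in the proof of Theorem 2, on a toy convex problem. The problem, the helper names, and the use of the exact gradient in place of the stochastic gradient v are all assumptions made for this example.

```python
import numpy as np

def soft_threshold(u, t):
    # prox of t * ||.||_1
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def prox_grad_residual(x, grad_f, alpha, mu):
    """|| x - prox_{alpha R}(x - alpha * grad_f(x)) || for R = mu * ||.||_1."""
    return np.linalg.norm(x - soft_threshold(x - alpha * grad_f(x), alpha * mu))

# Toy problem: f(x) = 0.5 * ||x - c||^2 and R(x) = ||x||_1.
# Its unique stationary point (the minimizer) is soft_threshold(c, 1).
c = np.array([2.0, -0.5, 3.0])
grad_f = lambda x: x - c
x_star = soft_threshold(c, 1.0)

points = [x_star, x_star + 0.1, np.zeros(3)]
residuals = [prox_grad_residual(x, grad_f, alpha=0.5, mu=1.0) for x in points]
distances = [np.linalg.norm(x - x_star) for x in points]
for r, dist in zip(residuals, distances):
    print(f"residual = {r:.4f}, distance to stationary set = {dist:.4f}")
```

The residual vanishes exactly at the stationary point and is positive elsewhere, so the ratio of distance to residual gives an empirical feel for the constant τ on this particular problem.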

In what follows, we show linear convergence in expectation for $$\{H^{s}_{k}\}$$ under the error bound condition.

### Theorem 2

Let $$\{x^{s}_{k}\}$$ be the sequence generated by Algorithm 1. Suppose that Assumptions 1 and 2 and the conditions on the step size and inertial coefficients in Lemma 7 are satisfied. Then we have:

1. (i)

$$\lim_{s\rightarrow \infty}\mathbb{E}[\operatorname{dist}(x^{s}_{k}, \mathcal{X})]=0$$, $$\forall k\in \{0,1,2,\ldots,m\}$$;

2. (ii)

$$\{H^{s}_{k}\}$$ is Q-linearly convergent in expectation for a fixed epoch s.

### Proof

We first prove (i). According to the x-update in Algorithm 1, we have

\begin{aligned} &\mathbb{E} \bigl\Vert x^{s}_{k}- \operatorname{prox}_{\alpha R}\bigl(x^{s}_{k}- \alpha \nabla f\bigl(x^{s}_{k}\bigr)\bigr) \bigr\Vert ^{2} \\ &\quad = \mathbb{E} \bigl\Vert x^{s}_{k}-y^{s}_{k}+y^{s}_{k}- \operatorname{prox}_{\alpha R}\bigl(y^{s}_{k}- \alpha v^{s}_{k}\bigr)+\operatorname{prox}_{\alpha R} \bigl(y^{s}_{k}-\alpha v^{s}_{k} \bigr) \\ &\quad \quad{} - \operatorname{prox}_{\alpha R}\bigl(y^{s}_{k}- \alpha \nabla f\bigl(z^{s}_{k}\bigr)\bigr)+ \operatorname{prox}_{\alpha R}\bigl(y^{s}_{k}- \alpha \nabla f\bigl(z^{s}_{k}\bigr)\bigr)- \operatorname{prox}_{\alpha R}\bigl(x^{s}_{k}- \alpha \nabla f\bigl(x^{s}_{k}\bigr)\bigr) \bigr\Vert ^{2} \\ &\quad \leq 4\mathbb{E} \bigl\Vert x^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2}+4\mathbb{E} \bigl\Vert y^{s}_{k}- \operatorname{prox}_{ \alpha R}\bigl(y^{s}_{k}- \alpha v^{s}_{k}\bigr) \bigr\Vert ^{2} \\ &\quad \quad{} + 4\mathbb{E} \bigl\Vert \operatorname{prox}_{\alpha R} \bigl(y^{s}_{k}-\alpha v^{s}_{k} \bigr)-\operatorname{prox}_{ \alpha R}\bigl(y^{s}_{k}- \alpha \nabla f\bigl(z^{s}_{k}\bigr)\bigr) \bigr\Vert ^{2} \\ &\quad \quad{} + 4\mathbb{E} \bigl\Vert \operatorname{prox}_{\alpha R} \bigl(y^{s}_{k}-\alpha \nabla f\bigl(z^{s}_{k} \bigr)\bigr)-\operatorname{prox}_{ \alpha R}\bigl(x^{s}_{k}- \alpha \nabla f\bigl(x^{s}_{k}\bigr)\bigr) \bigr\Vert ^{2} \\ &\quad \leq 4\beta ^{2}\mathbb{E} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+4\mathbb{E} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2}+4 \alpha ^{2}\mathbb{E} \bigl\Vert v^{s}_{k}-\nabla f\bigl(z^{s}_{k} \bigr) \bigr\Vert ^{2} \\ &\quad \quad{} + 4\mathbb{E} \bigl\Vert \bigl(y^{s}_{k}-x^{s}_{k} \bigr)-\alpha \bigl(\nabla f\bigl(z^{s}_{k}\bigr)- \nabla f\bigl(x^{s}_{k}\bigr)\bigr) \bigr\Vert ^{2} \\ &\quad \leq 8\mathbb{E} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+\bigl(20\beta ^{2}+8L^{2} \alpha ^{2}\bigl(\beta ^{2}+\lambda ^{2}\bigr)\bigr) \mathbb{E} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+8L^{2} \alpha ^{2} \mathbb{E} \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2} \\ &\quad \leq \bigl(20\beta ^{2}+8L^{2}\alpha ^{2} \bigl(1+\beta ^{2}+\lambda ^{2}\bigr)\bigr) \mathbb{E} \bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr], \end{aligned}

where the first inequality holds by applying $$\|\sum^{m}_{i=1}x_{i}\|^{2}\leq m\sum^{m}_{i=1}\|x_{i}\|^{2}$$. The second inequality follows from the nonexpansiveness of the proximal operator. The third inequality follows from $$\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}$$, Lemma 4, and the Lipschitz continuity of $$\nabla f$$. Combining Lemma 7 and the above inequality, we have for all $$k\in \{0,1,\ldots,m\}$$

$$\lim_{s\rightarrow \infty}\mathbb{E} \bigl\Vert x^{s}_{k}- \operatorname{prox}_{ \alpha R}\bigl(x^{s}_{k}- \alpha \nabla f\bigl(x^{s}_{k}\bigr)\bigr) \bigr\Vert ^{2}=0,$$

which, together with the error bound condition in Definition 5, leads to $$\lim_{s\rightarrow \infty}\mathbb{E}[\operatorname{dist}(x^{s}_{k}, \mathcal{X})]=0$$ for all $$k\in \{0,1,2,\ldots,m\}$$.

We now prove (ii). Let $$\bar{x}^{s}\in \mathcal{X}$$ such that $$\operatorname{dist}(x^{s}_{k},\mathcal{X})=\|\bar{x}^{s}-x^{s}_{k}\|$$ and $$F(\bar{x}^{s})=\xi$$ a.s. It is immediate from (i) and $$\|x^{s}_{k}-\tilde{x}^{s-1}\|\xrightarrow{\text{a.s.}}0$$ by Lemma 7 that $$\|\bar{x}^{s+1}-\bar{x}^{s}\|\xrightarrow{\text{a.s.}}0$$. Consequently, from this together with (2) in Definition 5, we have $$F(\bar{x}^{s})\equiv \xi$$ for all sufficiently large s. On the one hand, for sufficiently large s,

\begin{aligned} &\xi -\mathbb{E}\bigl[H^{s}_{k} \bigr] \\ &\quad = \mathbb{E}\bigl[F\bigl(\bar{x}^{s}\bigr)-F\bigl(x^{s}_{k} \bigr)-\Theta _{k}\bigl( \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\bigr] \\ &\quad = \mathbb{E}\bigl[f\bigl(\bar{x}^{s}\bigr)-f\bigl(x^{s}_{k} \bigr)+R\bigl(\bar{x}^{s}\bigr)-R\bigl(x^{s}_{k} \bigr)- \Theta _{k}\bigl( \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\bigr] \\ &\quad \leq \mathbb{E}\bigl[f\bigl(\bar{x}^{s}\bigr)-f \bigl(x^{s}_{k}\bigr)-\bigl\langle \xi ^{s},x^{s}_{k}- \bar{x}^{s} \bigr\rangle -\Theta _{k}\bigl( \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\bigr] \\ &\quad \leq \frac{L}{2}\mathbb{E} \bigl\Vert x^{s}_{k}- \bar{x}^{s} \bigr\Vert ^{2}-\Theta _{k} \mathbb{E}\bigl[ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr], \end{aligned}

where $$\xi ^{s}\in \partial R(\bar{x}^{s})$$. The first inequality follows from the convexity of R, and the second from the first-order optimality condition for minimizing (2) and the smoothness of f. By the fact that $$\lim_{s\rightarrow \infty}\mathbb{E}\|x^{s}_{k}-\bar{x}^{s}\|= \lim_{s\rightarrow \infty}\mathbb{E}[\operatorname{dist}(x^{s}_{k}, \mathcal{X})]=0$$, together with the fact that $$\{H^{s}_{k}\}$$ is nonincreasing a.s., we have

$$\lim_{s\rightarrow \infty}\mathbb{E}\bigl[H^{s}_{k} \bigr]-\xi \geq 0.$$

On the other hand, we have by Lemma 5 that

\begin{aligned} &\mathbb{E}\bigl[F\bigl(x^{s}_{k+1} \bigr)-\xi \bigr] \\ &\quad \leq \alpha \mathbb{E} \bigl\Vert v^{s}_{k}-\nabla f\bigl(z^{s}_{k}\bigr) \bigr\Vert ^{2}+ \frac{L}{2}\mathbb{E} \bigl\Vert x^{s}_{k+1}-z^{s}_{k} \bigr\Vert ^{2}-\frac{1}{2\alpha} \mathbb{E} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2}-\frac{1}{2\alpha}\mathbb{E} \bigl\Vert x^{s}_{k+1}- \bar{x}^{s}_{k} \bigr\Vert ^{2} \\ &\quad \quad{} + \frac{L}{2}\mathbb{E} \bigl\Vert \bar{x}^{s}_{k}-z^{s}_{k} \bigr\Vert ^{2}+ \frac{1}{2\alpha}\mathbb{E} \bigl\Vert \bar{x}^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2} \\ &\quad = \alpha \mathbb{E} \bigl\Vert v^{s}_{k}-\nabla f \bigl(z^{s}_{k}\bigr) \bigr\Vert ^{2}+ \biggl( \frac{L}{2}-\frac{1}{2\alpha}\biggr)\mathbb{E} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+\biggl( \frac{\beta}{\alpha}-L\lambda \biggr)\mathbb{E} \bigl\langle x^{s}_{k+1}-x^{s}_{k},x^{s}_{k}-x^{s}_{k-1} \bigr\rangle \\ &\quad \quad{} + \biggl(\frac{L\lambda ^{2}}{2}-\frac{\beta ^{2}}{2\alpha}\biggr)\mathbb{E} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \frac{L}{2}\mathbb{E} \bigl\Vert \bar{x}^{s}_{k}-z^{s}_{k} \bigr\Vert ^{2}+ \frac{1}{2\alpha}\mathbb{E} \bigl\Vert \bar{x}^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2} \\ &\quad \leq \biggl(\frac{L}{2}-\frac{1}{2\alpha}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr)\mathbb{E} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \frac{L}{2}\mathbb{E} \bigl\Vert \bar{x}^{s}_{k}-z^{s}_{k} \bigr\Vert ^{2}+ \frac{1}{2\alpha}\mathbb{E} \bigl\Vert \bar{x}^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2} \\ &\quad \quad{} + \biggl(\frac{L\lambda ^{2}}{2}-\frac{\beta ^{2}}{2\alpha}+ \frac{\beta}{2\alpha}-\frac{L\lambda}{2}+2\alpha L^{2}\lambda ^{2}\biggr) \mathbb{E} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+2L^{2}\alpha \mathbb{E} \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}. \end{aligned}

In addition, the two terms $$\frac{L}{2}\|\bar{x}^{s}_{k}-z^{s}_{k}\|^{2}$$ and $$\frac{1}{2\alpha}\|\bar{x}^{s}_{k}-y^{s}_{k}\|^{2}$$ on the right-hand side of the above inequality can be bounded as follows:

\begin{aligned}& \frac{L}{2} \bigl\Vert \bar{x}^{s}_{k}-z^{s}_{k} \bigr\Vert ^{2}\leq L \bigl\Vert \bar{x}^{s}_{k}-x^{s}_{k} \bigr\Vert ^{2}+L \bigl\Vert x^{s}_{k}-z^{s}_{k} \bigr\Vert ^{2}=L \bigl\Vert \bar{x}^{s}_{k}-x^{s}_{k} \bigr\Vert ^{2}+L\lambda ^{2} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2},\\& \frac{1}{2\alpha} \bigl\Vert \bar{x}^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2}\leq \frac{1}{\alpha} \bigl\Vert \bar{x}^{s}_{k}-x^{s}_{k} \bigr\Vert ^{2}+\frac{1}{\alpha} \bigl\Vert x^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2}= \frac{1}{\alpha} \bigl\Vert \bar{x}^{s}_{k}-x^{s}_{k} \bigr\Vert ^{2}+ \frac{\beta ^{2}}{\alpha} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}. \end{aligned}

Combining the above three inequalities, we have

\begin{aligned} &\mathbb{E}\bigl[F\bigl(x^{s}_{k+1} \bigr)-\xi \bigr] \\ &\quad \leq \biggl[\frac{L}{2}-\frac{1}{2\alpha}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr]\mathbb{E} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+2L^{2} \alpha \mathbb{E} \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2} \\ &\quad \quad{} + \biggl[\frac{L\lambda ^{2}}{2}+\frac{\beta ^{2}}{2\alpha}+ \frac{\beta}{2\alpha}-\frac{L\lambda}{2}+2\alpha L^{2}\lambda ^{2}+L \lambda ^{2}\biggr]\mathbb{E} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ &\quad \quad{} + \biggl[L+\frac{1}{\alpha}\biggr]\mathbb{E} \bigl\Vert \bar{x}^{s}_{k}-x^{s}_{k} \bigr\Vert ^{2} \\ &\quad = \biggl[\frac{L}{2}-\frac{1}{2\alpha}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr]\mathbb{E} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+2L^{2} \alpha \mathbb{E} \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2} \\ &\quad \quad{} + \biggl[\frac{L\lambda ^{2}}{2}+\frac{\beta ^{2}}{2\alpha}+ \frac{\beta}{2\alpha}-\frac{L\lambda}{2}+2\alpha L^{2}\lambda ^{2}+L \lambda ^{2}\biggr]\mathbb{E} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ &\quad \quad{} + \biggl[L+\frac{1}{\alpha}\biggr]\mathbb{E}\bigl[ \operatorname{dist}^{2}\bigl(x^{s}_{k}, \mathcal{X}\bigr)\bigr] \\ &\quad \leq \biggl[\frac{L}{2}-\frac{1}{2\alpha}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr]\mathbb{E} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+2L^{2} \alpha \mathbb{E} \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2} \\ &\quad \quad{} + \biggl[\frac{L\lambda ^{2}}{2}+\frac{\beta ^{2}}{2\alpha}+ \frac{\beta}{2\alpha}-\frac{L\lambda}{2}+2\alpha L^{2}\lambda ^{2}+L \lambda ^{2}\biggr]\mathbb{E} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ &\quad \quad{} + \biggl[L+\frac{1}{\alpha}\biggr]\tau ^{2}Q^{2} \mathbb{E}\bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}\bigr] \\ &\quad \leq \rho _{2}\mathbb{E}\bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr] \end{aligned}

for some constant $$\rho _{2}>0$$. Consequently, we have

\begin{aligned} 0& \leq \mathbb{E}\bigl[H^{s}_{k+1}- \xi \bigr] \\ & = \mathbb{E}\bigl[F\bigl(x^{s}_{k+1}\bigr)-F( \bar{x}_{k})+\Theta _{k}\bigl( \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k+1}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\bigr] \\ & \leq \rho _{2}\mathbb{E}\bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2} \\ &\quad{} + \Theta _{k}\bigl( \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k+1}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\bigr] \\ & \leq \rho _{3}\mathbb{E}\bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr] \end{aligned}

for some constant $$\rho _{3}>0$$. Moreover, by Lemma 7 we have

\begin{aligned} \mathbb{E}\bigl[H^{s}_{k+1}-H^{s}_{k} \bigr]&=\mathbb{E}\bigl[\bigl(H^{s}_{k+1}- \xi \bigr)- \bigl(H^{s}_{k}-\xi \bigr)\bigr] \\ &\leq -\rho \mathbb{E}\bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr], \end{aligned}
(3)

where $$\rho =\min \{\Theta _{k+1}(1+a)+2\alpha L^{2},2\alpha L^{2}\lambda ^{2}+L \lambda ^{2}+\frac{\beta}{2\alpha}-\frac{L\lambda}{2}\}$$. Therefore, we have

$$\mathbb{E}\bigl[H^{s}_{k+1}-\xi \bigr]-\mathbb{E} \bigl[H^{s}_{k}-\xi \bigr]\leq - \frac{\rho}{\rho _{3}} \mathbb{E}\bigl[H^{s}_{k+1}-\xi \bigr].$$
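For completeness, the contraction factor implied by the last display can be made explicit. Writing $$e_{k}=\mathbb{E}[H^{s}_{k}-\xi ]\geq 0$$ (a shorthand introduced here only for illustration), the inequality rearranges as

$$\biggl(1+\frac{\rho}{\rho _{3}}\biggr)e_{k+1}\leq e_{k}\quad \Longrightarrow \quad e_{k+1}\leq \frac{\rho _{3}}{\rho _{3}+\rho}e_{k},$$

and the factor $$\frac{\rho _{3}}{\rho _{3}+\rho}$$ lies in $$(0,1)$$ provided that $$\rho >0$$.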

This completes the proof of Q-linear convergence of $$\{H^{s}_{k}\}$$ in expectation. □

Now, we are ready to present the last key result, i.e., the linear convergence of $$\{\tilde{x}^{s}\}$$ and $$\{F(\tilde{x}^{s})\}$$ generated by Algorithm 1.

### Theorem 3

Let $$\{x^{s}_{k}\}$$ be the sequence generated by Algorithm 1. Suppose that Assumptions 1 and 2 and the conditions on the step size and the inertial coefficients in Lemma 7 are satisfied. Then the following two conclusions hold:

(i) the sequence $$\{x^{s}_{k}\}$$ is Q-linearly convergent in expectation;

(ii) the sequence $$\{F(x^{s}_{k})\}$$ is R-linearly convergent in expectation.

### Proof

(i) It follows from Theorem 2 that $$\{H^{s}_{k}\}$$ is Q-linearly convergent in expectation. Then, without loss of generality, let $$\lim_{s\rightarrow \infty}\mathbb{E}[H^{s}_{k}]=\xi$$. On the other hand, from (3), for all $$k\in \{0,1,\ldots,m-1\}$$ in epoch s,

\begin{aligned} \mathbb{E}\bigl[H^{s}_{k+1} \bigr]&\leq \mathbb{E}\bigl[H^{s}_{k}\bigr]-\rho \mathbb{E}\bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr] \\ &\leq \mathbb{E}\bigl[H^{s}_{k}\bigr]-\rho \mathbb{E} \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2}. \end{aligned}

Then we have

\begin{aligned} \mathbb{E} \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2}\leq \frac{1}{\rho}\mathbb{E}\bigl[H^{s}_{k}- \xi \bigr]-\frac{1}{\rho}\mathbb{E}\bigl[H^{s}_{k+1}- \xi \bigr]\leq \frac{1}{\rho}\mathbb{E}\bigl[H^{s}_{k}- \xi \bigr]. \end{aligned}
(4)

According to (4) and the Q-linear convergence, there exist $$M>0$$ and $$0< c<1$$ for all s and for all $$k\in \{0,1,\ldots,m\}$$ such that

$$\mathbb{E} \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert \leq Mc^{s}.$$

Thus, for any $$t_{2}>t_{1}\geq 1$$, we have

$$\mathbb{E} \bigl\Vert x^{t_{2}}_{k}- \tilde{x}^{t_{1}} \bigr\Vert \leq \sum^{t_{2}-1}_{s=t_{1}} \mathbb{E} \bigl\Vert \tilde{x}^{s-1}-x^{s}_{k} \bigr\Vert \leq \frac{Mc^{t_{1}}}{1-c}.$$
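The last bound is the tail of a geometric series. As a worked step, assuming each summand is bounded by $$Mc^{s}$$ as in the preceding estimate,

$$\sum^{t_{2}-1}_{s=t_{1}}Mc^{s}=Mc^{t_{1}}\frac{1-c^{t_{2}-t_{1}}}{1-c}\leq \frac{Mc^{t_{1}}}{1-c},$$

since $$0< c<1$$.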

The above inequality indicates that the sequence $$\{x^{s}_{k}\}$$ is a Cauchy sequence in expectation. Therefore, $$\{x^{s}_{k}\}$$ is convergent in expectation. Without loss of generality, let $$\lim_{s\rightarrow \infty}x^{s}_{k}=\hat{x}$$ a.s., then for all $$k\in \{0,1,\ldots,m\}$$, we have

$$\lim_{t_{2}\rightarrow \infty}\mathbb{E} \bigl\Vert x^{t_{2}}_{k}- \tilde{x}^{t_{1}} \bigr\Vert =\mathbb{E} \bigl\Vert \hat{x}- \tilde{x}^{t_{1}} \bigr\Vert \leq \frac{Mc^{t_{1}}}{1-c},$$

which means that $$\{x^{s}_{k}\}$$ is Q-linearly convergent in expectation.

(ii) In view of the definition of $$\{H^{s}_{k}\}$$, for all $$k\in \{0,1,\ldots,m\}$$ in epoch s, we have

\begin{aligned} \mathbb{E} \bigl\vert F\bigl(x^{s}_{k} \bigr)-\xi \bigr\vert &=\mathbb{E} \bigl\vert H^{s}_{k}- \xi - \Theta _{k}\bigl( \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr) \bigr\vert \\ &\leq \mathbb{E}\bigl[H^{s}_{k}-\xi +\Theta _{k}\bigl( \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\bigr] \\ &\leq \mathbb{E}\biggl[H^{s}_{k}-\xi + \frac{\Theta _{k}}{\rho}\bigl( \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\biggr]. \end{aligned}

The first inequality holds by the triangle inequality and the second one is obtained from (5). This together with the expected Q-linear convergence of the sequence $$\{H^{s}_{k}\}$$ and Lemma 2 indicates that $$\{F(x^{s}_{k})\}$$ is R-linearly convergent in expectation. This completes the proof. □

## 5 Numerical experiments

Having established the theoretical properties of the proposed algorithm, we now report numerical experiments that demonstrate its performance. We use two optimization problems that appear frequently in stochastic optimization: the first is convex and the second is nonconvex. For each regression problem, four randomly generated instances of different sizes are used. We compare the proposed algorithm with Prox-SVRG and mini-batch proximal stochastic gradient (MBProx-SGD). All experiments are performed in MATLAB 2017b on a 64-bit PC with an Intel(R) Core(TM) i7-6700HQ CPU (2.60 GHz) and 16 GB of RAM.

### 5.1 Optimization problem description

We utilize two different optimization problems used in stochastic optimization to demonstrate the performance of our proposed method.

$$l_{1}$$-regularized logistic regression is a classical convex problem in stochastic optimization. It is defined as follows:

$$\min_{x\in \mathbb{R}^{d}}F(x)\triangleq \Biggl\{ \frac{1}{n}\sum _{i=1}^{n}\log\bigl(1+e^{-b_{i}a_{i}^{T}x} \bigr)+ \mu \Vert x \Vert _{1}\Biggr\} ,$$

where $$a_{i}\in \mathbb{R}^{d}$$ represents a d-dimensional sample, $$b_{i}\in \{-1,+1\}$$, $$i=1,2,\ldots,n$$, is the corresponding label, and μ is the regularization parameter.
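As a concrete illustration of this objective, the following minimal Python sketch evaluates the smooth part, its gradient, and the full objective. It is not the authors' MATLAB code; the function names and the layout of the samples as columns of a d-by-n array `A` are assumptions made here.

```python
import numpy as np

def logistic_loss(x, A, b):
    """Smooth part f(x) = (1/n) * sum_i log(1 + exp(-b_i * a_i^T x)).

    Assumed layout: A has shape (d, n) with sample a_i in column i,
    and b has entries in {-1, +1}.
    """
    margins = b * (A.T @ x)
    # logaddexp(0, -m) equals log(1 + exp(-m)) without overflow
    return np.mean(np.logaddexp(0.0, -margins))

def logistic_grad(x, A, b):
    """Gradient of f: -(1/n) * sum_i b_i * a_i / (1 + exp(b_i * a_i^T x))."""
    margins = b * (A.T @ x)
    # 0.5 * (1 - tanh(m / 2)) equals 1 / (1 + exp(m))
    weights = -b * 0.5 * (1.0 - np.tanh(0.5 * margins))
    return A @ weights / A.shape[1]

def logistic_objective(x, A, b, mu):
    """Full objective F(x) = f(x) + mu * ||x||_1."""
    return logistic_loss(x, A, b) + mu * np.sum(np.abs(x))
```

The `logaddexp` and `tanh` forms are used only to keep the evaluation stable for large margins.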

For the nonconvex case, we choose $$l_{1}$$-regularized sigmoid regression to test our method. Sigmoid regression is formulated as follows:

$$\min_{x\in \mathbb{R}^{d}}F(x)\triangleq \Biggl\{ \frac{1}{n}\sum ^{n}_{i=1} \frac{1}{1+e^{b_{i}a_{i}^{T}x}}+\mu \Vert x \Vert _{1}\Biggr\} ,$$

where $$a_{i}$$ and $$b_{i}$$ are the same as in logistic regression.
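To make the nonconvexity of this loss tangible, the sketch below evaluates the smooth part and its gradient, together with the second derivative of the scalar map $$t\mapsto 1/(1+e^{t})$$, which changes sign. The function names and the column layout of `A` (shape d-by-n) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def sigmoid_loss(x, A, b):
    """Smooth part f(x) = (1/n) * sum_i 1 / (1 + exp(b_i * a_i^T x))."""
    margins = b * (A.T @ x)
    return np.mean(sigmoid(-margins))

def sigmoid_grad(x, A, b):
    """Gradient of f: -(1/n) * sum_i s_i * (1 - s_i) * b_i * a_i, s_i = sigmoid(margin_i)."""
    s = sigmoid(b * (A.T @ x))
    return A @ (-b * s * (1.0 - s)) / A.shape[1]

def phi_second_derivative(t):
    """Second derivative of phi(t) = 1 / (1 + exp(t)); negative for t < 0, positive for t > 0."""
    s = sigmoid(t)
    return s * (1.0 - s) * (2.0 * s - 1.0)
```

Because `phi_second_derivative` takes both signs, each component function is neither convex nor concave, which is why this problem serves as the nonconvex test case.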

Both logistic regression and sigmoid regression take the form of empirical risk minimization, which is challenging for deterministic gradient-type methods when the dimension d or the number of samples n is very large. Such problems are therefore well suited to stochastic gradient methods.

### 5.2 Parameter setting and experimental results

From the formulation of $$l_{1}$$-regularized logistic regression, we can give an equivalent form as follows:

$$\min_{x\in \mathbb{R}^{d}}F(x)\triangleq \Biggl\{ \frac{1}{n}\sum _{i=1}^{n}\log\bigl(1+e^{-b_{i}A_{i}^{T}x} \bigr)+ \mu \Vert x \Vert _{1}\Biggr\} ,$$

where $$A\in \mathbb{R}^{d\times n}$$ is the matrix whose columns are the d-dimensional samples, and $$A_{i}$$ denotes the ith column of A. By Definition 1, $$f(x)=\frac{1}{n}\sum_{i=1}^{n}\log(1+e^{-b_{i}A_{i}^{T}x})$$ has a Lipschitz continuous gradient with modulus $$L=0.25\lambda _{\max}(A^{T}A)$$. As a result, we take $$L=0.25\lambda _{\max}(A^{T}A)$$ and $$l=0$$ because f is convex. Similarly, we set $$L=0.25\lambda _{\max}(A^{T}A)$$ and $$l=|\lambda _{\min}(A^{T}A)|$$ for sigmoid regression.
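The two moduli quoted above can be computed directly from the data matrix. The following sketch does so with a symmetric eigenvalue solver; the function name is hypothetical and the code only mirrors the formulas stated in the text.

```python
import numpy as np

def lipschitz_moduli(A):
    """Return (L, l) with L = 0.25 * lambda_max(A^T A) and l = |lambda_min(A^T A)|.

    A is assumed to have shape (d, n) with one sample per column.
    For logistic regression the text uses l = 0 instead of the second value.
    """
    eigs = np.linalg.eigvalsh(A.T @ A)  # eigenvalues in ascending order
    return 0.25 * eigs[-1], abs(eigs[0])
```

For large n, forming the n-by-n matrix `A.T @ A` can be expensive; since $$\lambda _{\max}(A^{T}A)$$ equals the squared spectral norm of A, `np.linalg.norm(A, 2) ** 2` is an equivalent way to obtain the largest eigenvalue.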

Now, we conduct the experiments to study the performance of the proposed algorithm under convex and nonconvex situations. The specific parameter settings are as follows:

Prox-SVRG algorithm: The nonaccelerated proximal SVRG algorithm for problem (1). Two parameters need to be tuned in the Prox-SVRG algorithm, i.e., the step size α and the number of inner iterations m. As in , these two parameters are set to $$\alpha =\frac{1}{6L}$$ and $$m=\sqrt{b}$$, where b is the minibatch size.

Prox-SGD algorithm: There is only one parameter in this algorithm, i.e., the step size. Following the theoretical analysis in [8, Theorem 2], the step size is chosen as $$\alpha _{k}\leq c/L$$, where $$c>0$$ is a constant. We adopt the same setting in our experiments.

GIProx-SVRG algorithm: There are four parameters in our algorithm, i.e., the step size α, the number of inner iterations m, and the extrapolation parameters β and λ. For m, we let $$m=n$$. For the step size α and the two extrapolation parameters β and λ, we set the values as described in Algorithm 1, i.e., $$\alpha \in (0,\frac{\beta}{L\lambda}]$$ and $$\beta \in [\max\{\frac{4}{1-2b},\lambda \},\frac{b(1+\lambda )}{2b+2}]$$, $$\lambda \leq \frac{L}{5L+l}$$, and $$b\in (0,\frac{1}{2})$$. We use these settings in our experiments.
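To show how these four parameters enter the iteration, here is a minimal Python sketch of a GIProx-SVRG-style loop for $$R(x)=\mu \Vert x\Vert _{1}$$. It is assembled from the update formulas that appear in the proofs (the two extrapolated points, the variance-reduced gradient, and the proximal step) and is not the authors' implementation. The function names, the uniform single-index sampling, the initialization of each epoch at the snapshot, and the choice of the last inner iterate as the next snapshot are assumptions.

```python
import numpy as np

def soft_threshold(u, t):
    """Proximal map of t * ||.||_1 (componentwise shrinkage)."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def giprox_svrg(grad_i, full_grad, x0, n, alpha, beta, lam, mu, m, epochs, seed=0):
    """Sketch of an inertial proximal SVRG loop.

    grad_i(x, i) returns the gradient of component f_i at x;
    full_grad(x) returns the gradient of f = (1/n) * sum_i f_i at x.
    """
    rng = np.random.default_rng(seed)
    x_tilde = np.array(x0, dtype=float)
    for _ in range(epochs):
        g_tilde = full_grad(x_tilde)       # full gradient at the snapshot
        x_prev = x_tilde.copy()            # assumption: x_{-1} = x_0 = snapshot
        x = x_tilde.copy()
        for _ in range(m):
            y = x + beta * (x - x_prev)    # extrapolation used in the proximal step
            z = x + lam * (x - x_prev)     # extrapolation used in the gradient
            i = rng.integers(n)            # assumption: uniform sampling
            v = grad_i(z, i) - grad_i(x_tilde, i) + g_tilde
            x_prev, x = x, soft_threshold(y - alpha * v, alpha * mu)
        x_tilde = x                        # assumption: last iterate becomes the snapshot
    return x_tilde
```

Setting `beta = lam = 0` removes both extrapolation steps and leaves a plain proximal SVRG loop, which is a convenient sanity check when experimenting with the parameter ranges listed above.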

We randomly generated four groups of problems of different sizes for each regression problem, as described in Table 1. The regularization parameter μ is chosen differently for each group, as also listed in Table 1, where $$\mu _{1}$$ and $$\mu _{2}$$ denote the regularization parameters for logistic regression and sigmoid regression, respectively. As is common for proximal gradient methods, we use

$$\frac{ \Vert x_{k+1}-x_{k} \Vert }{\max\{ \Vert x_{k} \Vert ,1\}}< 10^{-5},$$

as the termination condition.
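The stopping rule above translates directly into a small helper. This is a sketch with a hypothetical function name; the default tolerance is the value $$10^{-5}$$ from the text.

```python
import numpy as np

def should_stop(x_new, x_old, tol=1e-5):
    """Relative-change test: ||x_new - x_old|| / max(||x_old||, 1) < tol."""
    return np.linalg.norm(x_new - x_old) / max(np.linalg.norm(x_old), 1.0) < tol
```

The `max` with 1 makes the test an absolute one for small iterates and a relative one for large iterates.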

Our experimental results are presented in Table 2, Table 3, Fig. 1, and Fig. 2. In Tables 2 and 3, we report the performance (i.e., CPU time, number of iterations, and the final value of the objective function) of the three algorithms. As one can see from Tables 2 and 3, our algorithm finds a local optimum with the fewest iterations and less CPU time, as indicated by the bold results. In Fig. 1, we use $$\|F(x_{k})-F(x^{*})\|$$ to plot the change of the objective function value, where $$F(x^{*})$$ denotes the approximate local optimal value obtained by the respective algorithm. Figs. 1 and 2 show that our proposed method has the fastest convergence rate and exhibits the R-linear convergence property. In summary, the proposed method outperforms the other two methods in these experiments.

## 6 Conclusion

In this paper, a class of nonconvex nonsmooth finite sum optimization problems is considered. Based on the general inertial proximal gradient algorithm and the stochastic variance reduction gradient, a novel stochastic gradient algorithm, named general inertial proximal stochastic variance reduction gradient, is proposed. With careful selection of the two extrapolation parameters, the proposed algorithm further improves the convergence performance of the proximal stochastic gradient method. Experiments on two representative optimization problems that often arise in machine learning demonstrate the superiority of the proposed method over other related stochastic gradient algorithms.

## References

1. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on the Theory of Computing, Montreal, Canada, pp. 1200–1205 (2017)

2. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

3. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

4. Chen, C., Chen, Y., Ouyang, Y., Pasiliao, E.: Stochastic accelerated alternating direction method of multipliers with importance sampling. J. Optim. Theory Appl. 179(2), 676–695 (2018)

5. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

6. Driggs, D., Ehrhardt, M.J., Schonlieb, C.B.: Accelerating variance-reduced stochastic gradient methods. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01566-2

7. Driggs, D., Tang, J., Liang, J., Davies, M., Schonlieb, C.B.: A stochastic proximal alternating minimization for nonsmooth and nonconvex optimization. SIAM J. Imaging Sci. 14, 1932–1970 (2021)

8. Ghadimi, S., Lan, G., Zhang, H.: Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Math. Program. 155, 267–305 (2016)

9. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)

10. Gower, R.M., Schmidt, M., Bach, F., Richtarik, P.: Variance-reduced methods for machine learning. Proc. IEEE 108(11), 1968–1983 (2020)

11. Heidelberger, P.: Variance reduction techniques for the simulation of Markov process. Acta Inform. 13(1), 21–37 (1980)

12. James, B.A.P.: Variance reduction techniques. J. Oper. Res. Soc. 36(6), 525–530 (1985)

13. Jiang, K., Sun, D., Toh, K.C.: A partial proximal point algorithm for nuclear norm regularized matrix least squares problems. Math. Program. Comput. 6(3), 281–325 (2014)

14. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

15. Koivu, M.: Variance reduction in sample approximations of stochastic programs. Math. Program. 103(3), 463–485 (2005)

16. Le Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems (NIPS) (2013)

17. Li, Z., Li, J.: A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 1–10 (2018)

18. Mouatasim, A.: Control proximal gradient algorithm for image $$\ell_{1}$$ regularization. Signal Image Video Process. 13(6), 1113–1121 (2019)

19. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

20. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic, Boston (2004)

21. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 2613–2621. International Convention Centre, PMLR, Sydney (2017)

22. Nguyen, L.M., Scheinberg, K., Takáč, M.: Inexact SARAH algorithm for stochastic optimization. Optim. Methods Softw. 36(1), 237–258 (2021)

23. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)

24. Pham, N.H., Nguyen, L.M., Phan, D.T., Tran-Dinh, Q.: ProxSARAH: an efficient algorithmic framework for stochastic composite nonconvex optimization. J. Mach. Learn. Res. 21, 1–48 (2020)

25. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In: 30th Conference on Neural Information Processing Systems (2016)

26. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

27. Shang, F., Jiao, L., Zhou, K., Cheng, J., Jin, Y.: ASVRG: accelerated proximal SVRG. In: Proceedings Asian Conference on Machine Learning, vol. 95, pp. 815–830 (2018)

28. Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost and momentum: faster variance reduction algorithms. In: Advances in Neural Information Processing Systems, pp. 2403–2413 (2019)

29. Wen, B., Chen, X., Pong, T.: Linear convergence of proximal gradient algorithm with extrapolation for a class of nonconvex nonsmooth minimization problems. SIAM J. Optim. 27(1), 124–145 (2017)

30. Wu, Z., Li, C., Li, M., Lim, A.: Inertial proximal gradient methods with Bregman regularization for a class of nonconvex optimization problems. J. Glob. Optim. 79(3), 617–644 (2021)

31. Wu, Z., Li, M.: General inertial proximal gradient method for a class of nonconvex nonsmooth optimization problems. Comput. Optim. Appl. 73(1), 129–158 (2019)

32. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

33. Xiao, X.: A unified convergence analysis of stochastic Bregman proximal gradient and extragradient methods. J. Optim. Theory Appl. 188(3), 605–627 (2021)

34. Yang, Z., Chen, Z., Wang, C.: An accelerated stochastic variance-reduced method for machine learning problems. Knowl.-Based Syst. 198, 105941 (2020)

35. Yang, Z., Wang, C., Zhang, Z., Li, J.: Mini-batch algorithms with online step size. Knowl.-Based Syst. 165, 228–240 (2019)

## Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant 61573014, in part by the Fundamental Research Funds for the Central Universities under Grant JB210717, and in part by the Key Natural Science Projects of Chuzhou Polytechnic under Grant ZKZ-2022-07.

## Author information

### Contributions

(1) Shuya Sun wrote the article. (2) Shuya Sun and Lulu He completed all experiments. (3) Lulu He prepared the theorems, equations, and proofs of properties. All authors reviewed the manuscript.

### Corresponding author

Correspondence to Lulu He.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Appendix

### Proof

By applying conditional expectation operator $$\mathbb{E}_{k}$$, we have

\begin{aligned} \mathbb{E}_{k} \bigl\Vert v^{s}_{k}-\nabla f\bigl(z^{s}_{k} \bigr) \bigr\Vert ^{2}& = \mathbb{E}_{k} \bigl\Vert \nabla f_{i^{s}_{k}}\bigl(z^{s}_{k}\bigr)-\nabla f_{i^{s}_{k}}\bigl( \tilde{x}^{s-1}\bigr)+\nabla f\bigl( \tilde{x}^{s-1}\bigr)-\nabla f\bigl(z^{s}_{k} \bigr) \bigr\Vert ^{2} \\ & \leq \mathbb{E}_{k} \bigl\Vert \nabla f_{i^{s}_{k}} \bigl(z^{s}_{k}\bigr)-\nabla f_{i^{s}_{k}}\bigl( \tilde{x}^{s-1}\bigr) \bigr\Vert ^{2} \\ & \leq L^{2} \bigl\Vert z^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2} \\ & \leq 2L^{2} \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}+2L^{2}\lambda ^{2} \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}, \end{aligned}

where the first inequality holds by the fact that $$\mathbb{E}\|X-\mathbb{E}[X]\|^{2}\leq \mathbb{E}\|X\|^{2}$$, where X is a random variable. The second inequality follows from the Lipschitz continuity of $$\nabla f_{i}$$. The last inequality holds by using the fact that $$\|x+y\|^{2}\leq 2\|x\|^{2}+2\|y\|^{2}$$ and the definition of $$z^{s}_{k}$$ in Algorithm 1. □
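The fact used in the first inequality is the usual variance decomposition. As a short worked step, for any random vector X with finite second moment,

$$\mathbb{E}\bigl\Vert X-\mathbb{E}[X] \bigr\Vert ^{2}=\mathbb{E} \Vert X \Vert ^{2}- \bigl\Vert \mathbb{E}[X] \bigr\Vert ^{2}\leq \mathbb{E} \Vert X \Vert ^{2}.$$

Here it is applied with $$X=\nabla f_{i^{s}_{k}}(z^{s}_{k})-\nabla f_{i^{s}_{k}}(\tilde{x}^{s-1})$$, whose conditional mean is $$\nabla f(z^{s}_{k})-\nabla f(\tilde{x}^{s-1})$$, assuming that the index $$i^{s}_{k}$$ is sampled uniformly from $$\{1,\ldots,n\}$$.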

### Proof

Setting $$x^{+}=x^{s}_{k+1}$$, $$x=x^{s}_{k}$$, $$y=y^{s}_{k}$$, $$z=z^{s}_{k}$$, and $$v=v^{s}_{k}$$ in Lemma 5 and taking the conditional expectation operator $$\mathbb{E}_{k}$$, we have

\begin{aligned} \mathbb{E}_{k}\bigl[F \bigl(x^{s}_{k+1}\bigr)\bigr]& \leq F\bigl(x^{s}_{k} \bigr)+\alpha \mathbb{E}_{k} \bigl\Vert v^{s}_{k}- \nabla f\bigl(z^{s}_{k}\bigr) \bigr\Vert ^{2}+\frac{L}{2} \mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-z^{s}_{k} \bigr\Vert ^{2}-\frac{1}{2\alpha} \mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} - \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \frac{L}{2} \bigl\Vert x^{s}_{k}-z^{s}_{k} \bigr\Vert ^{2}+\frac{1}{2\alpha} \bigl\Vert x^{s}_{k}-y^{s}_{k} \bigr\Vert ^{2} \\ & = F\bigl(x^{s}_{k}\bigr)+\alpha \mathbb{E}_{k} \bigl\Vert v^{s}_{k}- \nabla f\bigl(z^{s}_{k}\bigr) \bigr\Vert ^{2}+ \frac{L}{2}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-z^{s}_{k} \bigr\Vert ^{2}- \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} - \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+\biggl( \frac{L\lambda ^{2}}{2}+\frac{\beta ^{2}}{2\alpha} \biggr) \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ & \leq F\bigl(x^{s}_{k}\bigr)+\frac{L}{2} \mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-z^{s}_{k} \bigr\Vert ^{2}- \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2}- \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} + \biggl(2\alpha L^{2}\lambda ^{2}+ \frac{L\lambda ^{2}}{2}+ \frac{\beta ^{2}}{2\alpha}\biggr) \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+2\alpha L^{2} \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}, \end{aligned}

where the first equality follows from the definition of $$z^{s}_{k}$$ and $$y^{s}_{k}$$ in Algorithm 1, and the second inequality follows by using Lemma 4. □

### Proof

We first prove (i). It follows from the definition of $$H^{s}_{k}$$ that

\begin{aligned} \mathbb{E}_{k} \bigl[H^{s}_{k+1}\bigr]& = \mathbb{E}_{k} \bigl[F\bigl(x^{s}_{k+1}\bigr)+ \Theta _{k+1} \bigl( \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k+1}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr)\bigr] \\ & \leq \mathbb{E}_{k}\biggl[F\bigl(x^{s}_{k+1} \bigr)+\Theta _{k+1}\biggl(2+\frac{1}{a}\biggr) \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}\biggr]+ \Theta _{k+1}(1+a) \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2} \\ & \leq F\bigl(x^{s}_{k}\bigr)+\bigl[\Theta _{k+1}(1+a)+2\alpha L^{2}\bigr] \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2} \\ &\quad{} + \biggl[\Theta _{k+1}\biggl(2+\frac{1}{a}\biggr)- \frac{1}{2\alpha}\biggr]\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \frac{L}{2}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-z^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} + \biggl(2\alpha L^{2}\lambda ^{2}+ \frac{L\lambda ^{2}}{2}+ \frac{\beta ^{2}}{2\alpha}\biggr) \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}- \frac{1}{2\alpha}\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2} \\ & = F\bigl(x^{s}_{k}\bigr)+\biggl[\Theta _{k+1}\biggl(2+\frac{1}{a}\biggr)+\frac{L}{2}- \frac{1}{\alpha}\biggr]\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} + \bigl[2\alpha L^{2}\lambda ^{2}+L\lambda ^{2}\bigr] \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+\bigl[ \Theta _{k+1}(1+a)+2\alpha L^{2}\bigr] \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2} \\ &\quad{} + \biggl(\frac{\beta}{\alpha}-L\lambda \biggr)\mathbb{E}_{k} \bigl\langle x^{s}_{k+1}-x^{s}_{k},x^{s}_{k}-x^{s}_{k-1} \bigr\rangle . \end{aligned}
(5)

The first inequality is obtained by using $$\langle x,y\rangle \leq \frac{a}{2}\|x\|^{2}+\frac{1}{2a}\|y\|^{2}$$ for any $$a>0$$, the second inequality follows from Lemma 6, and the last equality is obtained by using the definition of $$z^{s}_{k}$$ and $$y^{s}_{k}$$ in Algorithm 1.
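For reference, the expansions behind the last equality can be written out. Assuming the extrapolation steps $$z^{s}_{k}=x^{s}_{k}+\lambda (x^{s}_{k}-x^{s}_{k-1})$$ and $$y^{s}_{k}=x^{s}_{k}+\beta (x^{s}_{k}-x^{s}_{k-1})$$, and writing $$u=x^{s}_{k+1}-x^{s}_{k}$$ and $$w=x^{s}_{k}-x^{s}_{k-1}$$ (shorthand used only here), we have

\begin{aligned} \frac{L}{2} \bigl\Vert x^{s}_{k+1}-z^{s}_{k} \bigr\Vert ^{2}&=\frac{L}{2} \Vert u \Vert ^{2}-L\lambda \langle u,w\rangle +\frac{L\lambda ^{2}}{2} \Vert w \Vert ^{2}, \\ -\frac{1}{2\alpha} \bigl\Vert x^{s}_{k+1}-y^{s}_{k} \bigr\Vert ^{2}&=-\frac{1}{2\alpha} \Vert u \Vert ^{2}+\frac{\beta}{\alpha}\langle u,w\rangle -\frac{\beta ^{2}}{2\alpha} \Vert w \Vert ^{2}, \end{aligned}

so the inner-product terms combine to $$(\frac{\beta}{\alpha}-L\lambda )\langle u,w\rangle$$, the $$\frac{\beta ^{2}}{2\alpha}$$ terms cancel, and the two $$\frac{L\lambda ^{2}}{2}$$ terms add up to $$L\lambda ^{2}$$.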

For the last term $$\langle x^{s}_{k+1}-x^{s}_{k},x^{s}_{k}-x^{s}_{k-1}\rangle$$ on the right-hand side of (5), by using the Cauchy–Schwarz inequality and the assumption that $$\alpha \leq \frac{\beta}{L\lambda}$$, we have

$$\biggl(\frac{\beta}{\alpha}-L\lambda \biggr)\bigl\langle x^{s}_{k+1}-x^{s}_{k},x^{s}_{k}-x^{s}_{k-1} \bigr\rangle \leq \biggl(\frac{\beta}{2\alpha}-\frac{L\lambda}{2}\biggr)\bigl[ \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}\bigr].$$

Substituting this inequality into (5), we see further that

\begin{aligned} \mathbb{E}_{k}\bigl[H^{s}_{k+1} \bigr]& \leq F\bigl(x^{s}_{k}\bigr)+\biggl[\Theta _{k+1}\biggl(2+ \frac{1}{a}\biggr)+\frac{L}{2}- \frac{1}{\alpha}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr] \mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} + \biggl[2\alpha L^{2}\lambda ^{2}+L\lambda ^{2}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr] \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ &\quad{} + \bigl[\Theta _{k+1}(1+a)+2\alpha L^{2}\bigr] \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2} \\ & = H^{s}_{k}-\biggl[\frac{1}{\alpha}+ \frac{L\lambda}{2}-\Theta _{k+1}\biggl(2+ \frac{1}{a} \biggr)-\frac{L}{2}-\frac{\beta}{2\alpha}\biggr]\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} - \bigl[\Theta _{k+1}(1+a)+2\alpha L^{2}\bigr] \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ &\quad{} - \biggl[2\alpha L^{2}\lambda ^{2}+L\lambda ^{2}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr] \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2}, \end{aligned}

where the last equality follows from the definition of $$H^{s}_{k}$$ and $$\Theta _{k}$$. Rearranging the term, we obtain

\begin{aligned} \mathbb{E}_{k} \bigl[H^{s}_{k+1}-H^{s}_{k} \bigr]& \leq - \biggl[ \frac{1}{\alpha}+\frac{L\lambda}{2}-\Theta _{k+1}\biggl(2+\frac{1}{a}\biggr)- \frac{L}{2}- \frac{\beta}{2\alpha}\biggr]\mathbb{E}_{k} \bigl\Vert x^{s}_{k+1}-x^{s}_{k} \bigr\Vert ^{2} \\ &\quad{} - \bigl[\Theta _{k+1}(1+a)+2\alpha L^{2}\bigr] \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2} \\ &\quad{} - \biggl[2\alpha L^{2}\lambda ^{2}+L\lambda ^{2}+\frac{\beta}{2\alpha}- \frac{L\lambda}{2}\biggr] \bigl\Vert x^{s}_{k}-\tilde{x}^{s-1} \bigr\Vert ^{2}. \end{aligned}

This proves claim (i).

Next we prove (ii). From the conclusion in (i) and the scheme of Algorithm 1 as well as the definition of k, we have by the supermartingale convergence theorem, i.e., Lemma 3, that

$$\sum^{\infty} _{s=1}\sum ^{m-1} _{k=0}\bigl[ \bigl\Vert x^{s}_{k}-x^{s}_{k-1} \bigr\Vert ^{2}+ \bigl\Vert x^{s}_{k}- \tilde{x}^{s-1} \bigr\Vert ^{2}\bigr]< \infty ,\quad \text{a.s.},$$

which results in $$\|x^{s}_{k}-x^{s}_{k-1}\|\xrightarrow{\text{a.s.}}0$$ and $$\|x^{s}_{k}-\tilde{x}^{s-1}\|\xrightarrow{\text{a.s.}}0$$.

Finally, we show (iii). This statement is immediately obtained from (i) and Lemma 3. □

### Proof

We first prove (i). Let $$\bar{x}$$ be an accumulation point of $$\{\tilde{x}^{s}\}$$. Then it follows from the boundedness of $$\{\tilde{x}^{s}\}$$ by Assumption 1 that there exists a convergent subsequence $$\{\tilde{x}^{s_{i}}\}$$ such that $$\lim_{i\rightarrow \infty}\tilde{x}^{s_{i}}=\bar{x}$$. Since $$x^{s_{i}}_{m+1}$$ is a local minimizer of (2), the first-order optimality condition gives

$$-\frac{1}{\alpha}\bigl(x^{s_{i}}_{m+1}-y^{s_{i}}_{m} \bigr)\in v^{s_{i}}_{m}+ \partial R\bigl(x^{s_{i}}_{m+1} \bigr).$$

Substituting the definitions of $$y^{s_{i}}_{m}$$ and $$v^{s_{i}}_{m}$$ into this inclusion, we see further that

$$-\frac{1}{\alpha}\bigl[x^{s_{i}}_{m+1}-x^{s_{i}}_{m}- \beta \bigl(x^{s_{i}}_{m}-x^{s_{i}}_{m-1} \bigr)\bigr] \in \partial R\bigl(x^{s_{i}}_{m+1}\bigr)+\nabla f_{i_{m}}\bigl(z^{s_{i}}_{m}\bigr)- \nabla f_{i_{m}}\bigl(\tilde{x}^{s_{i}-1}\bigr)+\nabla f\bigl( \tilde{x}^{s_{i}-1}\bigr).$$

Passing to the limit in the above inclusion and recalling that $$\|x^{s}_{k}-x^{s}_{k-1}\|\xrightarrow{\text{a.s.}}0$$ for all k, $$z^{s_{i}}_{m}=x^{s_{i}}_{m}+\lambda (x^{s_{i}}_{m}-x^{s_{i}}_{m-1})$$ and $$\tilde{x}^{s_{i}-1}=x^{s_{i}-1}_{m}$$, together with the continuity of $$\nabla f_{i}$$ and the closedness of ∂R, we have

$$0\in \nabla f(\bar{x})+\partial R(\bar{x})=\partial F(\bar{x}),\quad \text{a.s.},$$

which means that $$\bar{x}$$ is a stationary point of F almost surely.

We now prove (ii). Since $$\{H^{s}_{k}\}$$ is convergent a.s., $$\|x^{s}_{k}-x^{s}_{k-1}\|^{2}\xrightarrow{\text{a.s.}}0$$ and $$\|x^{s}_{k}-\tilde{x}^{s-1}\|^{2}\xrightarrow{\text{a.s.}}0$$ by Lemma 7, $$\lim_{s\rightarrow \infty}F(\tilde{x}^{s})$$ exists almost surely. Without loss of generality, let $$\xi =\lim_{s\rightarrow \infty}F(\tilde{x}^{s})$$ almost surely. Next, with the boundedness of $$\{\tilde{x}^{s}\}$$, $$\forall \bar{x}\in \Omega$$, there exists a convergent subsequence $$\{\tilde{x}^{s_{i}}\}$$ such that $$\lim_{i\rightarrow \infty}\tilde{x}^{s_{i}}=\bar{x}$$. From the assumption that F is lower semicontinuous, we have

$$F(\bar{x})\leq \liminf_{i\rightarrow \infty} F\bigl( \tilde{x}^{s_{i}}\bigr)=\xi .$$
(6)

Next, we show that $$F(\bar{x})\geq \limsup_{i\rightarrow \infty} F(\tilde{x}^{s_{i}})= \xi$$ holds almost surely. In view of

$$x^{s_{i}}_{m}=\operatorname*{arg\,min}_{x\in \mathbb{R}^{d}} \biggl\{ R(x)+ \frac{1}{2\alpha} \bigl\Vert x-y^{s_{i}}_{m-1} \bigr\Vert ^{2}+\bigl\langle v^{s_{i}}_{m-1},x \bigr\rangle \biggr\} ,$$

we see that

$$R\bigl(x^{s_{i}}_{m}\bigr)+\frac{1}{2\alpha} \bigl\Vert x^{s_{i}}_{m}-y^{s_{i}}_{m-1} \bigr\Vert ^{2}+ \bigl\langle v^{s_{i}}_{m-1},x^{s_{i}}_{m} \bigr\rangle \leq R(\bar{x})+ \frac{1}{2\alpha} \bigl\Vert \bar{x}-y^{s_{i}}_{m-1} \bigr\Vert ^{2}+\bigl\langle v^{s_{i}}_{m-1}, \bar{x}\bigr\rangle .$$

Rearranging the above inequality, we obtain

\begin{aligned} R(\bar{x})& \geq R \bigl(x^{s_{i}}_{m}\bigr)+\frac{1}{2\alpha} \bigl\Vert x^{s_{i}}_{m}-y^{s_{i}}_{m-1} \bigr\Vert ^{2}+ \bigl\langle v^{s_{i}}_{m-1},x^{s_{i}}_{m}- \bar{x}\bigr\rangle - \frac{1}{2\alpha} \bigl\Vert \bar{x}-y^{s_{i}}_{m-1} \bigr\Vert ^{2}. \end{aligned}
(7)

Observing that

\begin{aligned} \bigl\Vert x^{s_{i}}_{m}-y^{s_{i}}_{m-1} \bigr\Vert ^{2}&= \bigl\Vert x^{s_{i}}_{m}-x^{s_{i}}_{m-1}- \beta \bigl(x^{s_{i}}_{m-1}-x^{s_{i}}_{m-2} \bigr) \bigr\Vert ^{2} \\ &\leq 2 \bigl\Vert x^{s_{i}}_{m}-x^{s_{i}}_{m-1} \bigr\Vert ^{2}+2\beta ^{2} \bigl\Vert x^{s_{i}}_{m-1}-x^{s_{i}}_{m-2} \bigr\Vert ^{2} \end{aligned}

as well as

\begin{aligned} \bigl\Vert \bar{x}-y^{s_{i}}_{m-1} \bigr\Vert ^{2}&= \bigl\Vert \bar{x}-x^{s_{i}}_{m-1}- \beta \bigl(x^{s_{i}}_{m-1}-x^{s_{i}}_{m-2} \bigr) \bigr\Vert ^{2} \\ &\leq 2 \bigl\Vert x^{s_{i}}_{m-1}-\bar{x} \bigr\Vert ^{2}+2\beta ^{2} \bigl\Vert x^{s_{i}}_{m-1}-x^{s_{i}}_{m-2} \bigr\Vert ^{2}, \end{aligned}

it immediately follows from $$\|x^{s_{i}}_{k}-x^{s_{i}}_{k-1}\|^{2}\xrightarrow{\text{a.s.}}0$$ for all k and $$\lim_{i\rightarrow \infty}\tilde{x}^{s_{i}}=\bar{x}$$ that

$$\textstyle\begin{cases} \Vert x^{s_{i}}_{m}-y^{s_{i}}_{m-1} \Vert ^{2}\xrightarrow{\text{a.s.}}0, \\ \Vert \bar{x}-y^{s_{i}}_{m-1} \Vert ^{2}\xrightarrow{\text{a.s.}}0. \end{cases}$$
(8)

By adding $$f(\tilde{x}^{s_{i}})$$ and taking the limit $$i\rightarrow \infty$$ to (7), together with the continuity of f and (8), we have

$$F(\bar{x})\geq \limsup_{i\rightarrow \infty}\bigl\{ f\bigl( \tilde{x}^{s_{i}}\bigr)+R\bigl(\tilde{x}^{s_{i}}\bigr)\bigr\} = \lim_{i \rightarrow \infty}F\bigl(\tilde{x}^{s_{i}}\bigr)=\xi ,$$

which implies that

\begin{aligned} F(\bar{x})\geq \lim _{i\rightarrow \infty} F\bigl( \tilde{x}^{s_{i}}\bigr)=\xi , \quad \text{a.s.} \end{aligned}
(9)

Combining (6) and (9), we have $$F\equiv \xi$$ on Ω almost surely. □

## Rights and permissions 