Convergence rate for the moving least-squares learning with dependent sampling
Journal of Inequalities and Applications volume 2018, Article number: 200 (2018)
Abstract
We consider the moving least-squares (MLS) method within the regression learning framework under the assumption that the sampling process satisfies an α-mixing condition. We carry out a rigorous error analysis by applying probability inequalities for dependent samples in the error estimates. When the dependent samples satisfy an exponential α-mixing condition, we derive a satisfactory learning rate and error bound for the algorithm.
1 Introduction
The least-squares (LS) method is an important global approximation method based on regular or concentrated sample points. However, many practical applications in engineering and machine learning produce irregular or scattered samples [1–4], and these also need to be analyzed to exploit their particular usefulness. The moving least-squares (MLS) method was introduced by McLain in [4] to draw a set of contours from a cluster of scattered data points. It has turned out to be a useful local approximation tool in various fields of mathematics such as approximation theory, data smoothing [5], statistics [6], and numerical analysis [7]. Recently, research effort has been devoted to regression learning algorithms based on the MLS method; see [8–12]. The main advantage of the MLS regression learning algorithm is that the regression function can be learned in a simple function space, usually one generated by polynomials.
We briefly recall the regression learning problem for the MLS method. Functions for learning are defined on a compact metric space X (the input space) and take values in \(Y=\mathbb{R}\) (the output space). The sampling process is governed by an unknown Borel probability measure ρ on \(Z= X\times Y\). We define the regression function as follows:
where \(\rho(\cdot|x)\) is the conditional probability measure induced by ρ on Y given \(x\in X\). The goal of regression learning is to find a good approximation of the regression function \(f_{\rho}\) based on a set of random samples \(\mathbf{z}=\{z_{i}\}_{i=1}^{m}=\{(x_{i}, y_{i})\}_{i=1}^{m} \in Z^{m}\) drawn according to the measure ρ.
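Here \(f_{\rho}\) is the conditional mean of the output variable; in the standard notation of learning theory (see [13, 27]),
\[ f_{\rho}(x)=\int_{Y} y\,d\rho(y|x),\quad x\in X. \]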
We define the approximation \(f_{\mathbf{z}}\) of \(f_{\rho}\) pointwise:
the local moving empirical error is defined by
where the hypothesis space \(\mathcal{H}\subseteq C(X)\) is a d̃-dimensional Lipschitz function space, \(\sigma=\sigma(m)>0\) is a window width, and \(\Phi:\mathbb{R}^{n}\times\mathbb{R}^{n}\to \mathbb{R}^{+}\) is called an MLS weight function, which satisfies the following conditions (see [9, 10]):
where the constants satisfy \(q>n+1\) and \(c_{q}, \tilde{c}_{q}>0\).
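To make the scheme concrete, the sketch below performs, for a query point x, a weighted least-squares fit of a low-degree polynomial to the samples, with weights that decay with \(|x-x_{i}|/\sigma\), and returns the value of the fitted polynomial at x, in the spirit of algorithm (1.1). The Gaussian weight and the one-dimensional degree-1 polynomial space are illustrative stand-ins for the weight function Φ and the hypothesis space \(\mathcal{H}\), not the paper's actual choices.

```python
import numpy as np

def mls_fit(x_query, X, y, sigma, degree=1):
    """Local weighted least-squares estimate at x_query.

    Illustrative sketch only: a Gaussian weight stands in for the MLS
    weight function Phi, and the hypothesis space is spanned by
    monomials of degree <= `degree` in one variable.
    """
    X = np.asarray(X, dtype=float)                # sample inputs x_1, ..., x_m
    y = np.asarray(y, dtype=float)                # sample outputs y_1, ..., y_m
    w = np.exp(-((X - x_query) / sigma) ** 2)     # localizing weights
    V = np.vander(X - x_query, degree + 1)        # polynomial basis centered at x_query
    sw = np.sqrt(w)
    # Weighted least squares: minimize sum_i w_i (f(x_i) - y_i)^2 over f in the local space
    coef, *_ = np.linalg.lstsq(sw[:, None] * V, sw * y, rcond=None)
    return coef[-1]  # fitted polynomial evaluated at x_query (the constant term)

# Toy usage: noisy samples of a smooth target, estimated at one point
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=200)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(200)
print(mls_fit(0.3, X, y, sigma=0.05))  # close to sin(0.6 * pi) ~ 0.95
```

In this spirit, the estimator is evaluated pointwise, repeating the local fit at every query point of interest.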
The task of this paper is to derive an error bound for \(\|f_{\mathbf{z}}-f_{\rho}\|_{\rho_{X}}\), where \(\|f(\cdot)\|_{\rho_{X}}:=(\int_{X}|f(\cdot)|^{2}\, d\rho_{X})^{\frac{1}{2}}\), in order to evaluate the approximation ability of \(f_{\mathbf{z}}\); see [13–22]. The error analysis of algorithm (1.1) for independent and identically distributed (i.i.d.) samples has been carried out in [8–10]. However, in some real data analyses, such as market prediction, system diagnosis, and speech recognition, the samples are not independent, although they are not far from being independent. Mixing conditions quantify how close to independence a sequence of random samples is. In [14, 16, 23–25], the authors carried out the regression estimation of the least-squares algorithm with α-mixing samples. Up to now there has been no result for algorithm (1.1) in the case of dependent samples. Hence we extend the analysis of algorithm (1.1) to the α-mixing sampling setting, which is quite easy to establish; see [26].
Definition 1.1
Let \(\mathcal{M}_{a}^{b}\) denote the σ-algebra of events generated by the random samples \(\{z_{i}=(x_{i}, y_{i})\}_{i=a}^{b}\). The sequence \(\{z_{i}\}_{i\geq1}\) is said to satisfy a strongly mixing condition (or α-mixing condition) if
Specifically, if there exist some positive constants \(\overline{\alpha }>0\), \(\beta>0\), and \(c>0\) such that
then it is said to satisfy an exponential strongly mixing condition.
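Explicitly, with the σ-algebras \(\mathcal{M}_{1}^{j}\) and \(\mathcal{M}_{j+k}^{\infty}\) above, the α-mixing coefficients take the standard form (see [26, 28])
\[ \alpha(k)=\sup_{j\geq1}\sup\bigl\{\bigl|P(A\cap B)-P(A)P(B)\bigr|: A\in\mathcal{M}_{1}^{j},\ B\in\mathcal{M}_{j+k}^{\infty}\bigr\}\longrightarrow0\quad(k\rightarrow\infty), \]
and the exponential condition (1.7) requires, with the constants \(\overline{\alpha}\), \(\beta\), and \(c\) above,
\[ \alpha(k)\leq\overline{\alpha}\exp\bigl\{-ck^{\beta}\bigr\},\quad k\geq1. \]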
Our goal is to obtain the convergence rate as \(m \to\infty\) of algorithm (1.1) under hypothesis (1.7). The rest of the paper is organized as follows. In Sect. 2, we review some concepts and state our main results and the error decomposition. In Sect. 3, we present the estimate of the sample error. In Sect. 4, we provide the proofs of the main results.
2 Main results and error decomposition
Before giving the main results, we first need to introduce some concepts that will be used throughout this paper; see [8–10].
Definition 2.1
The probability measure \(\rho_{X} \) on X is said to satisfy the condition \(L_{\tau}\) with exponent \(\tau>0\) if
where \(r_{0}>0\) and \(c_{\tau}>0\) are constants, and \(B(x,r)=\{u\in X: |u-x|\leq r\}\) for \(r>0\).
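Explicitly, as it is used in the proof of Lemma 3.1 below (where condition (2.1) yields \(\mu^{(j)}\geq c_{\tau}(r/2)^{\tau}\)), the condition \(L_{\tau}\) amounts to the small-ball lower bound
\[ \rho_{X}\bigl(B(x,r)\bigr)\geq c_{\tau}r^{\tau}\quad\text{for all } x\in X \text{ and } 0<r\leq r_{0}. \]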
Definition 2.2
We say that the hypothesis space \(\mathcal{H}\) satisfies the norming condition with exponent \(\zeta>0\) and \(d\in\mathbb{N}\) if we can find points \(\{u_{i}\}_{i=1}^{d}\subset B(x,\sigma)\) for every \(x\in X\) and \(0<\sigma\leq\sigma_{0}\) satisfying \(|u_{i}-u_{j}|\geq 2c_{\mathcal{H}}\sigma\) for \(i\neq j\) and
where \(\sigma_{0}>0\) and \(c_{\mathcal{H}}>0\) are constants, and d is chosen to be at least the dimension d̃ of \(\mathcal{H}\).
Here we assume that \(|y|\leq M\) almost surely, and all constants such as C̃, \(C_{\mathcal{H},\zeta}\), \(A_{\tau,\zeta}\), \(C_{\mathcal{H},\rho_{X}}\), \(C'_{\mathcal{H},\rho_{X}}\), and so on are independent of the key parameters δ, m, and σ throughout this paper. We now state our main results for algorithm (1.1).
Theorem 2.1
Assume that (1.7), (2.1), and (2.2) hold. Suppose \(0< p<2\), \(\sigma= (m^{(\alpha)} )^{-\gamma}\) with \(m^{(\alpha)}= \lfloor m \lceil \{\frac{8m}{c} \} ^{1/(1+\beta)} \rceil^{-1} \rfloor\), \(\gamma>0\), and \(0<\sigma\leq\min \{\sigma_{0},1,(r_{0}/C_{\mathcal{H},\zeta })^{1/\max\{\zeta,1\}} \}\). If m satisfies
then for any \(0<\delta<1\), with confidence \(1-\delta\), we have
Then we can obtain an explicit learning rate for algorithm (1.1) by selecting a suitable parameter \(\sigma=\sigma(m)\).
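The quantity \(m^{(\alpha)}\) in Theorem 2.1 is the effective number of observations in the sense of [28], so the window width shrinks with \(m^{(\alpha)}\) rather than with m itself. The sketch below evaluates \(m^{(\alpha)}= \lfloor m \lceil (8m/c)^{1/(1+\beta)} \rceil^{-1} \rfloor\) and \(\sigma=(m^{(\alpha)})^{-\gamma}\), reading the braces in the theorem as plain grouping; the numerical values of m, c, β, and γ are arbitrary illustrations, not values from the paper.

```python
import math

def effective_sample_size(m, c, beta):
    """Effective number of observations m^(alpha) as in Theorem 2.1,
    for an exponentially alpha-mixing sequence with constants c, beta."""
    block = math.ceil((8 * m / c) ** (1.0 / (1.0 + beta)))
    return m // block  # floor of m divided by the block length

def window_width(m, c, beta, gamma):
    """Window width sigma = (m^(alpha))^(-gamma)."""
    return effective_sample_size(m, c, beta) ** (-gamma)

# Illustrative values only
m, c, beta, gamma = 10_000, 1.0, 1.0, 0.2
print(effective_sample_size(m, c, beta), window_width(m, c, beta, gamma))
```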
Theorem 2.2
Under the assumptions of Theorem 2.1, if we choose \(\sigma= (m^{(\alpha)} )^{-\frac{\varepsilon}{4\varsigma+2\max\{\tau,\tau\varsigma\}}}\), \(0<\varepsilon<1/4\), and
then with confidence \(1-\delta\), we have
where
Remark 2.1
The above theorem shows that the learning rate tends to \(m^{-\frac{1}{2}}\) as \(\sigma\rightarrow1\). For the i.i.d. case, the same rate was obtained in [9, 10].
To estimate the total error \(\|f_{\mathbf{z}}-f_{\rho}\|_{\rho_{X}}\), we use the following proposition from [8].
Proposition 2.1
Assume (2.1) and (2.2) hold. Then we have
where
is called the local moving expected risk and
is called the target function.
Remark 2.2
Here we assume \(f_{\rho}\in\mathcal{H}\). It follows from
that \(f_{\mathcal{H}}=f_{\rho}\). Thus \(\|f_{\mathbf{z}}-f_{\rho}\| _{\rho_{X}}=\|f_{\mathbf{z}}-f_{\mathcal{H}}\|_{\rho_{X}}\).
Next we only need to provide an upper bound for the integral in (2.8). To do this, we decompose it as follows:
What is left is to estimate the sample error \(\mathcal{S}(\mathbf {z},\sigma)\).
3 Estimates for the sample error
In order to obtain the probability estimate of \(\mathcal{S}(\mathbf{z},\sigma)\), we need upper bounds for \(f_{\mathbf{z},\sigma,x}\) and \(f_{\mathcal{H},\sigma,x}\). We first derive a confidence-based estimate of \(f_{\mathbf{z},\sigma,x}\) as follows.
Proposition 3.1
Under the assumptions of Theorem 2.1, if
then with confidence at least \(1-\delta\), we have
where
The proof is analogous to that of Theorem 3 in [8], except that Lemma 2 in [8] is replaced by the following Lemma 3.1, which handles the dependent sampling setting.
Lemma 3.1
Let \(0< r\leq r_{0}\) and \(0<\delta<1\). If (1.7) and (2.1) hold, then with confidence \(1-\delta\), we have
Specifically, if
then with confidence at least \(1-\delta\), we have
where \(\frac{\sharp({\mathbf{x}}\cap B(x,r))}{m}\) is the proportion of those sampling points lying in \(B(x,r)\).
Proof
It is shown in Theorem 5.3 of [27] that one can find \(\{v_{j}\}_{j=1}^{\mathcal{N}}\subseteq X\) satisfying \(X\subseteq B_{R}(\mathbb{R}^{n})\subseteq\bigcup_{j=1}^{\mathcal{N}}B(v_{j},\frac{r}{2})\) and \(\mathcal{N}\leq(\frac{4R}{r}+1)^{n}\). Let \(\xi^{(j)}: X\rightarrow\mathbb{R}\) be the characteristic function of the set \(B(v_{j},\frac{r}{2})\). It has mean \(\mu^{(j)}=\int_{X} \xi^{(j)}(x)\,d\rho_{X}=\rho_{X}(B(v_{j},\frac{r}{2}))\) and satisfies \(|\xi^{(j)}-\mu^{(j)}|\leq1\) and \(\sigma^{2}(\xi^{(j)})\leq1\). Now we apply the Bernstein inequality for dependent samples from [28].
Proposition 3.2
Suppose that (1.7) holds. Let \(m^{(\alpha)}\) be the effective number of observations, and let ξ be a real-valued function on the probability space Z with mean \(\mu=\int_{Z} \xi(z)\,d\rho\) and variance \(\sigma^{2}\); write \(\xi_{i}=\xi(z_{i})\). Assume that \(|\xi_{i}-\mu|\leq D\) almost surely. Then, for every \(\varepsilon>0\),
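In the form commonly quoted in the learning-with-mixing literature (see [28]; the prefactor \(1+4e^{-2}\overline{\alpha}\) is the same one that appears after Proposition 3.4 below), this Bernstein-type bound reads, up to the exact constants in the exponent,
\[ P\biggl\{\biggl|\frac{1}{m}\sum_{i=1}^{m}\xi_{i}-\mu\biggr|\geq\varepsilon\biggr\}\leq\bigl(1+4e^{-2}\overline{\alpha}\bigr)\exp\biggl\{-\frac{m^{(\alpha)}\varepsilon^{2}}{2(\sigma^{2}+D\varepsilon/3)}\biggr\}. \]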
Then it follows from the above proposition that
hence,
For \(0<\delta<1\), let
Then we get
It follows that, with confidence at least \(1-\delta\),
Hence, we have
Condition (2.1) yields \(\mu^{(j)}\geq c_{\tau} (\frac{r}{2} )^{\tau}\). Also, \(\xi^{(j)}(x_{i})=1\) if \(x_{i}\in B(v_{j},\frac{r}{2})\) and 0 otherwise, so that \(\frac{1}{m}\sum_{i=1}^{m}\xi^{(j)}(x_{i})=\sharp({\mathbf{x}}\cap B(v_{j},\frac{r}{2}))/m\). Hence,
Observe from \(X\subseteq\bigcup_{j=1}^{\mathcal{N}}B(v_{j},\frac{r}{2})\) that for each \(x\in X\) there exists some \(j\in\{1,\ldots,\mathcal{N}\}\) such that \(x\in B(v_{j},\frac{r}{2})\), i.e., \(|v_{j}-x|\leq\frac{r}{2}\). Since \(x_{i}\in B(v_{j},\frac{r}{2})\) implies \(|x_{i}-x|\leq|x_{i}-v_{j}|+|v_{j}-x|\leq r\), we see that
This proves Lemma 3.1. □
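As a quick numerical illustration of Lemma 3.1 (not part of the proof), one can draw an exponentially α-mixing input sequence, for instance a stationary Gaussian AR(1) process mapped through its marginal distribution function so that \(\rho_{X}\) is uniform on \([0,1]\), and check that the empirical proportion \(\sharp({\mathbf{x}}\cap B(x,r))/m\) stays close to \(\rho_{X}(B(x,r))\) despite the dependence. The AR(1) construction and the chosen x, r, and coefficient below are illustrative assumptions, not taken from the paper.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def ar1_uniform(m, phi=0.5):
    """Stationary Gaussian AR(1) (an exponentially alpha-mixing sequence)
    mapped through its marginal CDF so that rho_X is uniform on [0, 1]."""
    s = 1.0 / math.sqrt(1.0 - phi**2)   # stationary standard deviation
    z = np.empty(m)
    z[0] = s * rng.standard_normal()
    for i in range(1, m):
        z[i] = phi * z[i - 1] + rng.standard_normal()
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / (s * math.sqrt(2.0))))

x0, r, m = 0.4, 0.05, 5000
x = ar1_uniform(m)
proportion = np.mean(np.abs(x - x0) <= r)   # #(x ∩ B(x0, r)) / m
print(proportion, "vs", 2 * r)              # rho_X(B(x0, r)) = 2r here
```

For m = 5000 the printed proportion should be close to \(2r=0.1\), in line with the high-confidence lower bound of the lemma.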
Now we are in a position to prove Proposition 3.1.
Proof of Proposition 3.1
By (3.1) and setting \(r=C_{\mathcal{H},\zeta}\sigma^{\max\{ \zeta,1\}}\leq r_{0}\), it is easy to see that (3.4) holds. Then (3.5) is valid.
It follows from (3.5) and Definition 2.2 with σ replaced by \(\frac{\sigma}{2}\) that
and
where \(\{x_{i,l}\}_{l=1}^{m_{i}}\) are the points of the set \({\mathbf{x}}\cap B(u_{i},r)\), which implies
where \(x\in X\), \(l=1,\ldots,\tilde{m}\), and \(\tilde {m}=\min_{1\leq i\leq d}\{m_{i}\}\).
Then, by (1.4), we have
Hence
The last inequality was proved in Theorem 3 of [8].
Finally, combining (3.19) with the following inequality
we derive the desired result. □
We also need to invoke Lemma 4 in [8], which provides an upper bound for \(f_{\mathcal{H},\sigma,x}\).
Proposition 3.3
Assume that (2.1) and (2.2) hold. Then, for some constant \(C'_{\mathcal{H},\rho _{X}}\) independent of σ, we have
Next we bound the sample error. The estimate for \(\mathcal{S}(\mathbf{z},\sigma)\) relies on the following ratio probability inequality, which can be found in [27].
Proposition 3.4
Suppose that (1.7) holds. Let \(\mathcal{G}\) be a set of functions on Z and \(c>0\) such that, for each \(g\in\mathcal {G}\), \(\mu(g)=\int_{Z}g(z)\,d\rho\geq0\), \(\mu(g^{2})\leq c\mu(g)\), and \(|g(z)-\mu(g)|\leq D\) almost surely. Then, for every \(\varepsilon >0\) and \(0<\alpha\leq1\), we have
We obtain the upper bound estimate for \(\mathcal{S}(\mathbf{z},\sigma )\) by using Proposition 3.4.
Proposition 3.5
If the assumptions of Proposition 3.1 hold,
and
then with confidence \(1-\delta\), there holds
Proof
Let the function \(g(u,y)\) be defined on the function set
With condition (1.5) and the bound \(c_{\rho}\) of the density function of \(\rho_{X}\), we have
which implies
Hence \(|g(u,y)-\mu(g)|\leq2c_{R}\).
It follows from the Schwarz inequality that
By (3.27),
It has been proved in [9] that
Substituting (3.31) into (3.30),
Using Proposition 3.4 with \(\alpha=\frac{1}{4}\) and \(\mathcal{G}=\mathcal{G}_{R}\), we know that
Since for any \(g_{1}, g_{2}\in\mathcal{G}_{R}\),
then we have
It follows from (3.33) that
We set the term \((1+4e^{-2}\overline{\alpha} )\mathcal{N} (B_{1},\frac{\varepsilon}{16R^{2}D\sigma^{n}} )\exp \{-\frac{3m^{(\alpha)}\varepsilon}{2048R^{2}D\sigma^{n}} \}\) in the above inequality equal to \(\delta/2\). We also need to invoke the following lemma, which is proved by the same method as Proposition 4.3 in [21].
Lemma 3.2
Let \(\eta^{\ast}(m^{(\alpha)},\delta)\) be the smallest positive solution of the following inequality in η:
If \(\log\mathcal{N}(B_{1},\eta)\leq c_{p}\eta^{-p}\) for some \(p\in(0,2)\), \(c_{p}>0\), and all \(\eta>0\), then with confidence at least \(1-\delta\), we have
We now return to the proof of Proposition 3.5.
It follows from Theorem 5.3 in [27] that
When \(0<\varepsilon<16 R^{2}D\sigma^{n}\),
When \(\varepsilon\geq16 R^{2}D\sigma^{n}\), we have
Hence we conclude that
This, together with (3.37), implies that, for
we obtain
Combining (3.43) with (3.36), with confidence \(1-\delta /2\), we have
Finally, setting \(f=f_{\mathbf{z},\sigma,x}\) in the above inequality, we derive the desired result. □
4 Proofs of the main results
In this section, we provide the proofs of Theorem 2.1 and Theorem 2.2. We first prove Theorem 2.1.
Proof
If we take \(\sigma= (m^{(\alpha)} )^{-\gamma}\), \(\gamma>0\), then we have
It is readily seen that (2.3) implies (3.1). Then Proposition 3.5 holds true. We thus obtain, with confidence \(1-\delta\),
Therefore from (2.8) we obtain
This proves Theorem 2.1. □
Next, we prove Theorem 2.2.
Proof
Let \(\gamma=\varepsilon/[4\varsigma+2\max\{\tau,\tau\varsigma\}]>0\) and \(p=\frac{\varepsilon}{1-\varepsilon}\). Then we have
and
It follows from (2.5) that
which implies that condition (2.3) of Theorem 2.1 holds true. We thus obtain, with confidence \(1-\delta\),
This proves Theorem 2.2. □
References
Černý, M., Antoch, J., Hladík, M.: On the possibilistic approach to linear regression models involving uncertain, indeterminate or interval data. Inf. Sci. 244(7), 26–47 (2013)
Fasshauer, G.E.: Toward approximate moving least squares approximation with irregularly spaced centers. Comput. Methods Appl. Mech. Eng. 193(12–14), 1231–1243 (2004)
Komargodski, Z., Levin, D.: Hermite type moving-least-squares approximations. Comput. Math. Appl. 51(8), 1223–1232 (2006)
McLain, D.H.: Drawing contours from arbitrary data points. Comput. J. 17(17), 318–324 (1974)
Savitzky, A., Golay, M.J.E.: Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem. 36(8), 1627–1639 (1964)
Wand, M.P., Jones, M.C.: Kernel Smoothing. Chapman & Hall, London (1995)
Shepard, D.: A two-dimensional interpolation function for irregularly-spaced data. In: ACM National Conference, pp. 517–524 (1968)
Wang, H.Y., Xiang, D.H., Zhou, D.X.: Moving least-square method in learning theory. J. Approx. Theory 162(3), 599–614 (2010)
Wang, H.Y.: Concentration estimates for the moving least-square method in learning theory. J. Approx. Theory 163(9), 1125–1133 (2011)
He, F.C., Chen, H., Li, L.Q.: Statistical analysis of the moving least-squares method with unbounded sampling. Inf. Sci. 268(1), 370–380 (2014)
Tong, H.Z., Wu, Q.: Learning performance of regularized moving least square regression. J. Comput. Appl. Math. 325, 42–55 (2017)
Guo, Q., Ye, P.X.: Error analysis of the moving least-squares method with non-identical sampling. Int. J. Comput. Math., 1–15 (2018). https://doi.org/10.1080/00207160.2018.1469748
Smale, S., Zhou, D.X.: Online learning with Markov sampling. Anal. Appl. 7(1), 87–113 (2009)
Pan, Z.W., Xiao, Q.W.: Least-square regularized regression with non-iid sampling. J. Stat. Plan. Inference 139(10), 3579–3587 (2009)
Guo, Z.C., Shi, L.: Classification with non-i.i.d. sampling. Math. Comput. Model. 54(5), 1347–1364 (2011)
Guo, Q., Ye, P.X.: Coefficient-based regularized regression with dependent and unbounded sampling. Int. J. Wavelets Multiresolut. Inf. Process. 14(5), 1–14 (2016)
Sun, H.W., Guo, Q.: Coefficient regularized regression with non-iid sampling. Int. J. Comput. Math. 88(15), 3113–3124 (2011)
Chu, X.R., Sun, H.W.: Half supervised coefficient regularization for regression learning with unbounded sampling. Int. J. Comput. Math. 90(7), 1321–1333 (2013)
Steinwart, I., Hush, D., Scovel, C.: Learning from dependent observations. J. Multivar. Anal. 100(1), 175–194 (2009)
Billingsley, P.: Convergence of Probability Measures. Wiley, New York (1968)
Wu, Q., Ying, Y.M., Zhou, D.X.: Learning rates of least-square regularized regression. Found. Comput. Math. 6(2), 171–192 (2006)
Lv, S.G., Feng, Y.L.: Semi-supervised learning with the help of Parzen windows. J. Math. Anal. Appl. 386(1), 205–212 (2012)
Sun, H.W., Wu, Q.: Regularized least square regression with dependent samples. Adv. Comput. Math. 32(2), 175–189 (2010)
Chu, X.R., Sun, H.W.: Regularized least square regression with unbounded and dependent sampling. Abstr. Appl. Anal. 2013, Article ID 139318 (2013)
Guo, Q., Ye, P.X., Cai, B.L.: Convergence rate for \(l^{q}\) -coefficient regularized regression with non-i.i.d. sampling. IEEE Access 6, 18804–18813 (2018)
Parthasarathy, K.R.: Convergence of probability measures. Technometrics 12(1), 171–172 (1968)
Cucker, F., Zhou, D.X.: Learning Theory: An Approximation Theory Viewpoint. Cambridge University Press, Cambridge (2007)
Modha, D.S., Masry, E.: Minimum complexity regression estimation with weakly dependent observations. IEEE Trans. Inf. Theory 42(6), 2133–2145 (1996)
Funding
This work was supported by the National Natural Science Foundation of China (grant no. 11671213) and the Fundamental Research Funds for the Central Universities.
Contributions
Both authors contributed equally to the writing of this paper. Both authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Guo, Q., Ye, P. Convergence rate for the moving least-squares learning with dependent sampling. J Inequal Appl 2018, 200 (2018). https://doi.org/10.1186/s13660-018-1794-8