Open Access

Adaptive bridge estimation for high-dimensional regression models

Journal of Inequalities and Applications 2016, 2016:258

https://doi.org/10.1186/s13660-016-1205-y

Received: 16 August 2016

Accepted: 12 October 2016

Published: 20 October 2016

Abstract

In high-dimensional models, the penalized method becomes an effective measure to select variables. We propose an adaptive bridge method and show its oracle property. The effectiveness of the proposed method is demonstrated by numerical results.

Keywords

adaptive bridge; high-dimensionality; variable selection; oracle property; penalized method; tuning parameter

MSC

62F12; 62E15; 62J05

1 Introduction

For the classical linear regression model \(Y=X\beta+\varepsilon\), we are interested in the problem of variable selection and estimation, where \(Y=(y_{1},y_{2},\ldots,y_{n})^{T}\) is the response vector, \(X=(X_{1},X_{2},\ldots,X_{p})=(x_{1},x_{2},\ldots,x_{n})^{T}=(x_{ij})_{n\times p}\) is an \(n\times p\) design matrix, and \(\varepsilon=(\varepsilon_{1},\varepsilon_{2},\ldots,\varepsilon_{n})^{T}\) is a random error vector. The main question is how to estimate the coefficient vector \(\beta\in\mathrm{R}^{p}\) when p increases with the sample size n and many elements of β equal zero. This problem can be recast as the minimization of a penalized least squares objective function
$$\hat{\beta}=\arg\min_{\beta}Q(\beta), \quad Q(\beta)=\|Y-X\beta \| ^{2}+\lambda\sum_{j=1}^{p}| \beta_{j}|^{\zeta}, $$
where \(\|\cdot\|\) is the \(l_{2}\) norm of a vector and λ is a tuning parameter. For \(\zeta>0\), β̂ is called the bridge estimator, proposed by Frank and Friedman [1]. There are two well-known special cases of the bridge estimator: if \(\zeta=2\), it is the ridge estimator of Hoerl and Kennard [2]; if \(\zeta=1\), it is the Lasso estimator of Tibshirani [3], which, as shown by Fan and Li [4], does not possess the oracle property. For \(0<\zeta\leq 1\), Knight and Fu [5] studied the asymptotic distributions of bridge estimators when the number of covariates is fixed and provided a theoretical justification for using bridge estimators to select variables: bridge estimators can distinguish covariates with exactly zero coefficients from covariates with nonzero coefficients. There is a large statistical literature on penalization-based methods; examples include the SCAD penalty of Fan and Li [4], the elastic net of Zou and Hastie [6], the adaptive lasso of Zou [7], the Dantzig selector of Candes and Tao [8], and the minimax concave penalty (MCP) of Zhang [9]. For bridge estimation, Huang et al. [10] extended the results of Knight and Fu [5] to a diverging number of parameters and showed that for \(0<\zeta<1\) the bridge estimator can correctly select the covariates with nonzero coefficients and that, under appropriate conditions, it enjoys the oracle property. Subsequently, Wang et al. [11] studied the consistency of the bridge estimator for generalized linear models.
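As a quick numerical illustration of these special cases (a sketch we add here; `bridge_objective` is our own helper, not from the paper), the bridge objective reduces to the ridge objective at \(\zeta=2\) and to the lasso objective at \(\zeta=1\):

```python
import numpy as np

def bridge_objective(beta, X, Y, lam, zeta):
    """Q(beta) = ||Y - X beta||^2 + lam * sum_j |beta_j|^zeta."""
    resid = Y - X @ beta
    return resid @ resid + lam * np.sum(np.abs(beta) ** zeta)

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
beta = np.array([1.0, 0.0, -2.0])
Y = X @ beta  # noiseless, so the residual sum of squares at beta is zero

rss = np.sum((Y - X @ beta) ** 2)
ridge = bridge_objective(beta, X, Y, lam=1.0, zeta=2.0)  # rss + sum beta_j^2 = 5
lasso = bridge_objective(beta, X, Y, lam=1.0, zeta=1.0)  # rss + sum |beta_j| = 3
```

Here `ridge - rss` equals \(\sum_{j}\beta_{j}^{2}=5\) and `lasso - rss` equals \(\sum_{j}|\beta_{j}|=3\).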
In this paper, we consider the following penalized model:
$$ \hat{\beta}=\arg\min_{\beta}Q(\beta), \quad Q(\beta)=\|Y-X\beta \| ^{2}+\lambda\sum_{j=1}^{p} \tilde{\omega}_{j}|\beta_{j}|^{\zeta}, $$
(1.1)
where \(\tilde{\omega}=(\tilde{\omega}_{1},\tilde{\omega}_{2},\ldots,\tilde {\omega}_{p})^{T}\) is a given vector of weights. Usually, taking the initial estimator \(\tilde{\beta}=(\tilde{\beta}_{1},\tilde{\beta }_{2},\ldots,\tilde{\beta}_{p})^{T}\) to be the non-penalized MLE, one sets \(\tilde {\omega}_{j}=|\tilde{\beta}_{j}|^{-1}\), \(j=1,2,\ldots,p\). The resulting β̂ is called the adaptive bridge estimator, and we refer to the method as abridge for short. We derive theoretical properties of the adaptive bridge estimator in the case where p can increase to infinity with n. Under some conditions, with a suitable choice of the tuning parameter, we show that the adaptive bridge estimator enjoys the oracle property; that is, it correctly selects the covariates with nonzero coefficients with probability converging to one, and the estimator of the nonzero coefficients has the same asymptotic distribution that it would have if the zero coefficients were known in advance.
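The weight construction can be sketched in a few lines (illustrative only; we use an ordinary least squares fit as the non-penalized initial estimator β̃):

```python
import numpy as np

# Simulated data with two nonzero and three zero coefficients.
rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
beta0 = np.array([3.0, 0.0, -2.0, 0.0, 0.0])
Y = X @ beta0 + rng.standard_normal(n)

# Non-penalized least squares initial estimator beta_tilde,
# then adaptive weights w_j = 1 / |beta_tilde_j| as in (1.1).
beta_tilde, *_ = np.linalg.lstsq(X, Y, rcond=None)
w = 1.0 / np.abs(beta_tilde)

# Coordinates whose initial estimate is near zero receive large weights,
# so the penalty lam * sum_j w_j |beta_j|**zeta shrinks them more aggressively.
```

This data-driven reweighting is what distinguishes the adaptive bridge from the plain bridge penalty.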

As far as we know, the properties of the adaptive bridge have not been discussed in the literature, so our results fill this gap. Compared with the results in Huang et al. [10] and Wang et al. [11], condition (A2) (see Section 2) imposed on the true coefficients is much weaker. Moreover, Wang et al. [11] additionally require the true coefficients to satisfy a covering-number condition. Besides, Huang et al. [10] and Wang et al. [11] both use the LQA algorithm to compute the estimator. The shortcoming of the LQA algorithm is that once a variable is deleted at some step of the iteration, it has no chance to re-enter the final model. To overcome this drawback, we employ the MM algorithm, which improves stability.

The rest of the paper is organized as follows. In Section 2 we introduce the notation and assumptions needed for our results and present the main results. Section 3 presents some simulation results. The conclusion and the proofs of the main results are given in Sections 4 and 5, respectively.

2 Main results

For convenience, we first introduce some notation. Let \(\beta_{0}=(\beta_{01},\beta_{02},\ldots,\beta_{0p})^{T}\) be the true parameter, \(J_{1}=\{j:\beta_{0j}\neq0,j=1,2,\ldots,p\}\), \(J_{2}=\{j:\beta _{0j}=0,j=1,2,\ldots,p\}\), let q denote the cardinality of \(J_{1}\), and set \(h_{1}=\min\{|\beta_{0j}|:j\in J_{1}\}\). Without loss of generality, we assume that the first q covariates (denoted by \(X_{(1)}\)) have nonzero coefficients and that \(X_{(2)}\) collects the covariates with zero coefficients; we partition \(\beta_{0}=(\beta_{0(1)}^{T},\beta_{0(2)}^{T})^{T}\) and \(\hat{\beta }=(\hat{\beta}_{(1)}^{T},\hat{\beta}_{(2)}^{T})^{T}\) correspondingly. Strictly speaking, p, q, X, Y, β, and λ all depend on the sample size n; we suppress n for notational convenience. In this paper we only consider the statistical properties of the adaptive bridge for the case \(p< n\); accordingly we set \(p=O(n^{c_{2}})\), \(q=O(n^{c_{1}})\), \(\lambda =O(n^{-\delta})\), where \(0\leq c_{1}< c_{2}<1\), \(\delta>0\). Following the terminology of Zhao and Yu [12], we define \(\hat{\beta}=_{s}\beta _{0}\) if and only if \(\operatorname{sgn}(\hat{\beta})=\operatorname{sgn}(\beta_{0})\), where the sign of a \(p\times1\) vector β is \(\operatorname{sgn}(\beta )=(\operatorname{sgn}(\beta_{1}), \operatorname{sgn}(\beta_{2}),\ldots, \operatorname{sgn}(\beta _{p}))^{T}\). For any symmetric matrix Z, denote by \(\lambda_{\mathrm{min}}(Z)\) and \(\lambda_{\mathrm{max}}(Z)\) the minimum and maximum eigenvalues of Z, respectively. Denote \(\frac{X^{T}X}{n}:=D\) and \(D=\bigl ( {\scriptsize\begin{matrix}{} D_{11} &D_{12} \cr D_{21}& D_{22}\end{matrix}} \bigr )\), where \(D_{11}=\frac{1}{n}X_{(1)}^{T}X_{(1)}\).
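The sign-consistency relation \(\hat{\beta}=_{s}\beta_{0}\) is easy to state in code (a small helper we add for illustration):

```python
import numpy as np

def sign_equal(beta_hat, beta0):
    """beta_hat =_s beta0 in the sense of Zhao and Yu: coordinatewise signs match."""
    return np.array_equal(np.sign(beta_hat), np.sign(beta0))

# Signs match even though magnitudes differ:
same = sign_equal(np.array([0.1, -2.0, 0.0]), np.array([3.0, -1.0, 0.0]))  # True
# A true nonzero estimated as zero breaks sign consistency:
diff = sign_equal(np.array([0.1, 0.0, 0.0]), np.array([3.0, -1.0, 0.0]))   # False
```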

Next, we state some assumptions which will be needed in the following results.
(A1): 

The error terms \(\varepsilon_{i}\) are i.i.d. with \(E(\varepsilon_{i})=0\) and \(E(\varepsilon_{i}^{2k})<+\infty\) for some \(k>0\). In particular, we write \(E(\varepsilon_{i}^{2})=\sigma^{2}\).

(A2): 

There exists a positive constant M such that \(h_{1}\geq Mn^{\alpha}\), where \(\max\{-\frac{1}{2},\frac{c_{2}-1}{2},\frac{-1}{2-\zeta }\}<\alpha<\min\{c_{2}-\delta,\frac{c_{2}-\delta-\zeta}{1+\zeta}\}\) and \(\delta+\alpha+\frac{1}{2}\zeta< c_{2}\).

(A3): 

Suppose \(\tau_{1}\) and \(\tau_{2}\) are the minimum and maximum eigenvalues of the matrix \(D_{11}\). There exist constants \(\tau_{10}\) and \(\tau_{20}\) such that \(0<\tau_{10}\leq\tau_{1}\leq\tau_{2}\leq\tau _{20}\), and the eigenvalues of \(\frac{1}{n}X^{T}\operatorname{var}(Y)X\) are bounded.

(A4): 

Let \(g_{i}\) be the transpose of the ith row vector of \(X_{(1)}\), such that \(\lim_{n\rightarrow\infty}n^{-\frac {1}{2}} \max_{1\leq i\leq n}g_{i}^{T} g_{i}=0\).

It is worth mentioning that condition (A1) is much weaker than the assumptions commonly made in the literature, where the error term is typically required to have Gaussian tails; here we allow ε to be heavy-tailed. The regularity condition (A2) is a common assumption on the nonzero coefficients; it ensures that all important covariates are included in the finally selected model. Condition (A3) means that the matrix \(\frac {1}{n}X_{(1)}^{T}X_{(1)}\) is strictly positive definite. Condition (A4) is used to prove the asymptotic normality of the estimators of the nonzero coefficients; in fact, if the nonzero coefficients have an upper bound, condition (A4) can easily be verified.

2.1 Consistency of the estimation

Theorem 2.1

Consistency of the estimation

If \(0<\zeta<2\), and conditions (A1)-(A3) hold, then there exists a local minimizer β̂ of \(Q(\beta)\), such that \(\|\hat{\beta }-\beta_{0}\|=O_{p}(n^{\frac{\delta+\alpha-c_{2}}{\zeta}})\).

Remark 2.1

By condition (A2), \(c_{2}-\delta-\alpha>0\), and the rate of consistency is governed by the orders of the sample size and the tuning parameter. Theorem 2.1 extends previous results to this setting.

2.2 Oracle property of the estimation

Theorem 2.2

Oracle property

If \(0<\zeta<1\), and conditions (A1)-(A4) hold, then the adaptive bridge estimator satisfies the following properties.
  1. (1)

    (Selection consistency) \(\lim_{n \rightarrow\infty }P\{\hat{\beta}=_{s}\beta_{0}\}=1\);

     
  2. (2)

    (Asymptotic normality) \(\sqrt{n}s^{-1}u^{T}(\hat{\beta }_{(1)}-\beta_{0(1)})\stackrel{\mathrm{d}}{\longrightarrow} N(0,1)\), where \(s^{2}=\sigma^{2}u^{T}D_{11}^{-1}u\) for any \(q\times1\) vector u and \(\|u\|\leq1\).

     

Remark 2.2

By Theorems 2.1 and 2.2, we can easily see that the adaptive bridge is able to consistently identify the true model.

3 Simulation results

In this section we evaluate the performance of the adaptive bridge estimator proposed in (1.1) by simulation studies. Set \(\zeta=1/2\) and generate the data from the model \(Y=X\beta+\varepsilon\), \(\varepsilon\sim N(0,\sigma^{2})\), where \(\sigma=1\) and \(\beta _{0(1)}=(-2.5,-2.5,-2.5,3,3,3,-1,-1,-1)^{T}\) (the remaining coefficients are zero). The design matrix X is generated from a p-dimensional multivariate normal distribution with mean zero and a covariance matrix whose \((i,j)\)th entry is \(\rho ^{|i-j|}\), where we take \(\rho=0.5\mbox{ and }0.9\), respectively. The following examples are considered.

Example 3.1

The sample size is \(n=200\) and the number of covariates is \(p=50\).

Example 3.2

The sample size is \(n=500\) and the number of covariates is \(p=80\).

Example 3.3

The sample size is \(n=800\) and the number of covariates is \(p=100\).
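The data-generating design described above can be reproduced as follows (a sketch under the stated design; the function name `simulate` is ours):

```python
import numpy as np

def simulate(n, p, rho, sigma=1.0, seed=0):
    """Draw (X, Y) from the paper's design: rows of X are N(0, Sigma) with
    Sigma[i, j] = rho**|i - j|, and Y = X beta0 + eps with eps ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[:9] = [-2.5, -2.5, -2.5, 3.0, 3.0, 3.0, -1.0, -1.0, -1.0]
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-type covariance
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    Y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, Y, beta0

X, Y, beta0 = simulate(n=200, p=50, rho=0.5)  # the setting of Example 3.1
```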

We combine the majorization-minimization (MM) algorithm of Hunter and Li [13] with the Newton-Raphson method to compute the adaptive bridge estimator (abridge), where the tuning parameter is selected by 5-fold cross-validation. We compare our results with those of the lasso [14], the adaptive lasso (alasso), and the bridge method. To evaluate the performance of the estimators, we use four measures: \(L_{2}\)-loss, PE, C, and IC. The \(L_{2}\)-loss is the median of \(\|\hat{\beta }-\beta_{0}\|_{2}\) and evaluates estimation accuracy, and PE is the prediction error, defined as the median of \(n^{-1}\|Y-X\hat{\beta}\|^{2}\). The other two measures quantify model-selection consistency: C and IC are the average numbers of correctly and incorrectly selected zero covariates, respectively. The numerical results are listed in Table 1 and Table 2, where υ equals the number of zero coefficients in the true model and the numbers in parentheses are the corresponding standard deviations, obtained from 500 replicates.
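For a single replicate, the four measures can be computed as below (our own illustrative helper, consistent with the definitions above; the medians and averages reported in the tables are then taken over the 500 replicates):

```python
import numpy as np

def measures(beta_hat, beta0, X, Y):
    """Per-replicate summaries: L2-loss, prediction error PE, and the counts
    C / IC of correctly and incorrectly zeroed coefficients."""
    n = X.shape[0]
    l2 = np.linalg.norm(beta_hat - beta0)
    pe = np.sum((Y - X @ beta_hat) ** 2) / n
    C = int(np.sum((beta_hat == 0) & (beta0 == 0)))   # true zeros set to zero
    IC = int(np.sum((beta_hat == 0) & (beta0 != 0)))  # nonzeros wrongly zeroed
    return l2, pe, C, IC
```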
Table 1

Simulation results for \(\pmb{\rho=0.5}\)

| Setting | Method | \(\boldsymbol{L_{2}}\)-loss | PE | C | IC |
| --- | --- | --- | --- | --- | --- |
| n = 200, p = 50, υ = 41 | Lasso | 0.5459 (0.1160) | 0.8830 (0.1126) | 28.4540 (5.3337) | 0 (0) |
| | Alasso | 0.5442 (0.1149) | 0.8790 (0.1099) | 28.5300 (5.2421) | 0 (0) |
| | Bridge | 0.4733 (0.1120) | 0.8755 (0.1068) | 28.2700 (5.0834) | 0 (0) |
| | Abridge | 0.4617 (0.1155) | 0.9005 (0.1038) | 38.8380 (2.9903) | 0 (0) |
| n = 500, p = 80, υ = 71 | Lasso | 0.3459 (0.0829) | 0.9469 (0.0673) | 56.1300 (6.3329) | 0 (0) |
| | Alasso | 0.3476 (0.0830) | 0.9465 (0.0686) | 56.0360 (6.5269) | 0 (0) |
| | Bridge | 0.2950 (0.0732) | 0.9394 (0.0654) | 52.7140 (6.7155) | 0 (0) |
| | Abridge | 0.2745 (0.0728) | 0.9661 (0.0630) | 69.7160 (2.5994) | 0 (0) |
| n = 800, p = 100, υ = 91 | Lasso | 0.2814 (0.0624) | 0.9664 (0.0548) | 74.4820 (7.1596) | 0 (0) |
| | Alasso | 0.2817 (0.0620) | 0.9687 (0.0552) | 74.7180 (7.0409) | 0 (0) |
| | Bridge | 0.2327 (0.0576) | 0.9570 (0.0534) | 69.4380 (8.8332) | 0 (0) |
| | Abridge | 0.2160 (0.0569) | 0.9839 (0.0514) | 89.7680 (3.0359) | 0 (0) |

Table 2

Simulation results for \(\pmb{\rho=0.9}\)

| Setting | Method | \(\boldsymbol{L_{2}}\)-loss | PE | C | IC |
| --- | --- | --- | --- | --- | --- |
| n = 200, p = 50, υ = 41 | Lasso | 1.0102 (0.2452) | 0.8725 (0.1049) | 27.2580 (4.6677) | 0.0040 (0.0632) |
| | Alasso | 1.0123 (0.2475) | 0.8800 (0.1024) | 27.2460 (4.5851) | 0.0040 (0.0632) |
| | Bridge | 0.8961 (0.2624) | 0.8656 (0.1059) | 25.0600 (5.3104) | 0 (0) |
| | Abridge | 0.8468 (0.2843) | 0.8965 (0.1092) | 37.7800 (4.2298) | 0.0260 (0.1593) |
| n = 500, p = 80, υ = 71 | Lasso | 0.6649 (0.1630) | 0.9435 (0.0664) | 52.9840 (5.7272) | 0 (0) |
| | Alasso | 0.6671 (0.1620) | 0.9388 (0.0667) | 52.3420 (6.1499) | 0 (0) |
| | Bridge | 0.5251 (0.1442) | 0.9368 (0.0649) | 50.0820 (7.7934) | 0 (0) |
| | Abridge | 0.4837 (0.1377) | 0.9646 (0.0651) | 68.2320 (4.7732) | 0 (0) |
| n = 800, p = 100, υ = 91 | Lasso | 0.5382 (0.1242) | 0.9623 (0.0545) | 70.4900 (6.5000) | 0 (0) |
| | Alasso | 0.5371 (0.1259) | 0.9614 (0.0544) | 69.9680 (6.9009) | 0 (0) |
| | Bridge | 0.4126 (0.1183) | 0.9572 (0.0541) | 66.6060 (9.6239) | 0 (0) |
| | Abridge | 0.3580 (0.1087) | 0.9818 (0.0520) | 89.2840 (4.3161) | 0 (0) |

Note that in every case the adaptive bridge outperforms the other methods in sparsity: it selects the smallest model. Its prediction error is slightly higher than that of the other methods, but in terms of estimation accuracy the adaptive bridge is still the winner, followed by the bridge. We also observe the interesting fact that, as the sample size n grows, the adaptive bridge identifies the zero covariates more accurately, for both \(\rho=0.5\) and \(\rho=0.9\). Meanwhile, as n increases, the estimation accuracy improves while the prediction error worsens. Additionally, when ρ increases, the estimation accuracy deteriorates while the prediction error changes little.

4 Conclusion

In this paper we have proposed the adaptive bridge estimator and presented some of its theoretical properties. Under some conditions, with a suitable choice of the tuning parameter, we have shown that the adaptive bridge estimator enjoys the oracle property. The effectiveness of the proposed method is demonstrated by numerical results.

5 Proofs

Proof of Theorem 2.1

Following the idea in Fan and Li [4], we only need to prove that, for any \(\epsilon>0\), there exists a large constant C such that
$$ \liminf_{n\rightarrow\infty}P\Bigl\{ \inf_{\|u\|=C}Q( \beta_{0}+\theta u)>Q(\beta _{0})\Bigr\} \geq1-\epsilon, $$
(5.1)
which means that with a probability of at least \(1-\epsilon\) there exists a local minimizer β̂ in the ball \(\{\beta_{0}+\theta u:\|u\|\leq C\}\).
First, let \(\theta=n^{\frac{\delta+\alpha-c_{2}}{\zeta}}\), then
$$\begin{aligned}& Q(\beta_{0}+\theta u)-Q(\beta_{0}) \\& \quad =\theta^{2}nu^{T}\biggl(\frac{X^{T}X}{n}\biggr)u- \theta u^{T}X^{T}(Y-X\beta_{0})+\lambda\sum _{j=1}^{p}\tilde{\omega}_{j} \bigl(|\beta_{0j}+\theta u_{j}|^{\zeta}-|\beta _{0j}|^{\zeta}\bigr) \\& \quad \geq\lambda_{\mathrm{min}}\biggl(\frac{X^{T}X}{n}\biggr) \theta^{2}n\|u\|^{2}-n\theta u^{T}\frac {X^{T}(Y-X\beta_{0})}{n}- \lambda\sum_{j=1}^{p}\tilde{ \omega}_{j}|\theta|^{\zeta}\|u\|^{\zeta} \\& \quad :=T_{1}+T_{2}+T_{3}, \end{aligned}$$
(5.2)
where \(T_{1}=\lambda_{\mathrm{min}}(\frac{X^{T}X}{n})\theta^{2}n\|u\|^{2}\), \(T_{2}=-n\theta u^{T}\frac{X^{T}(Y-X\beta_{0})}{n}\), and \(T_{3}=-\lambda\sum_{j=1}^{p}\tilde {\omega}_{j}|\theta|^{\zeta}\|u\|^{\zeta}\).
For \(T_{2}\), set \(v=n^{\alpha}\); by assumptions (A2) and (A3) and Chebyshev's inequality we have
$$\begin{aligned} P\biggl\{ \biggl\Vert \frac{X^{T}(Y-X\beta_{0})}{n}\biggr\Vert \geq Mv\biggr\} &\leq \frac{1}{M^{2}v^{2}}E\Biggl[\sum_{j=1}^{p} \biggl(\frac{1}{n}X_{j}^{T}(Y-X\beta_{0}) \biggr)^{2}\Biggr] \\ &=\frac{1}{nM^{2}v^{2}}\operatorname{tr}\biggl(\frac{1}{n}X^{T} \operatorname{var}(Y)X\biggr)\rightarrow0\quad (n\rightarrow \infty). \end{aligned}$$
Hence
$$\begin{aligned} |T_{2}| =&\biggl\Vert n\theta u^{T}\frac{X^{T}(Y-X\beta_{0})}{n}\biggr\Vert \leq n|\theta|\biggl\Vert \frac {X^{T}(Y-X\beta_{0})}{n}\biggr\Vert \Vert u \Vert \\ =&n|\theta|O_{P}(v)\Vert u\Vert =o_{P}(1)\Vert u\Vert . \end{aligned}$$
(5.3)
As for \(|T_{3}|\), observing that \(\|\tilde{\beta}-\beta_{0}\|=O_{P}((\frac {p}{n})^{1/2})\) and \(\min_{j\in J_{1}}|\beta_{0j}|\leq\max_{j}|\tilde{\beta}_{j}-\beta _{0j}|+\min_{j\in J_{1}}|\tilde{\beta}_{j}|\), together with assumption (A2) we obtain
$$\begin{aligned} M&\leq n^{-\alpha}\min_{j\in J_{1}}|\beta_{0j}| \leq n^{-\alpha}\max_{j}| \tilde{\beta}_{j}-\beta_{0j}|+n^{-\alpha}\min_{j\in J_{1}}|\tilde { \beta}_{j}| \\ &=n^{-\alpha}O_{P}\biggl(\biggl(\frac{p}{n} \biggr)^{1/2}\biggr)+n^{-\alpha}\min_{j\in J_{1}}|\tilde{\beta}_{j}|. \end{aligned}$$
Since \(\alpha>\frac{c_{2}-1}{2}\) by (A2), the first term is \(o_{P}(1)\), which yields \(P\{\min_{j\in J_{1}}|\tilde{\beta }_{j}|\geq\frac{1}{2}Mn^{\alpha}\}\rightarrow1\) (\(n\rightarrow\infty\)).
For \(v_{1}=\frac{2\lambda p}{Mn^{\alpha}}\), we have \(P\{\lambda\sum_{j=1}^{p}\tilde{\omega}_{j}\leq v_{1}\}\geq P\{\frac {\lambda p}{\min_{j\in J_{1}}|\tilde{\beta}_{j}|}\leq v_{1}\}=P\{\min_{j\in J_{1}}|\tilde{\beta }_{j}|\geq\frac{\lambda p}{v_{1}}\}\rightarrow1\) (\(n\rightarrow\infty\)); that is, \(\lambda\sum_{j=1}^{p}\tilde{\omega}_{j}=O_{P}(\frac {\lambda p}{Mn^{\alpha}})\). Now with assumption (A2) we conclude that
$$ |T_{3}|=O_{P}\biggl(\frac{\lambda p}{Mn^{\alpha}}\biggr)| \theta|^{\zeta}\|u\|^{\zeta}=O_{P}(1)\| u \|^{\zeta}. $$
(5.4)
When \(0<\zeta<2\) and C is large enough, by (5.3) and (5.4) we see that (5.2) is dominated by \(T_{1}\), so (5.1) holds. □

Proof of Theorem 2.2

(1) First of all, by the Karush-Kuhn-Tucker (KKT) conditions, β̂ is the adaptive bridge estimator defined in (1.1) if the following holds:
$$ \left \{\textstyle\begin{array}{l} \frac{\partial\|Y-X\beta\|^{2}}{\partial\beta_{j}}|_{\beta_{j}=\hat{\beta}_{j}} =\lambda\zeta\tilde{\omega}_{j}|\hat{\beta}_{j}|^{\zeta-1}\operatorname{sgn}(\hat {\beta}_{j}), \quad \hat{\beta}_{j}\neq0, \\ \frac{\partial\|Y-X\beta\|^{2}}{\partial\beta_{j}}|_{\beta_{j}=\hat{\beta}_{j}} \leq\lambda\zeta\tilde{\omega}_{j}|\hat{\beta}_{j}|^{\zeta-1},\quad \hat{\beta }_{j}=0. \end{array}\displaystyle \right . $$
(5.5)
Let \(\hat{u}=\hat{\beta}-\beta_{0}\) and define \(V(u)=\sum_{i=1}^{n}(\varepsilon_{i}-x_{i}^{T}u)^{2}+\lambda\sum_{j=1}^{p}\tilde{\omega }_{j}|u_{j}+\beta_{0j}|^{\zeta}\); then \(\hat{u}=\arg\min_{u}V(u)\). Notice that \(\sum_{i=1}^{n}(\varepsilon_{i}-x_{i}^{T}u)^{2}=-2\varepsilon ^{T}Xu+nu^{T}Du+\varepsilon^{T}\varepsilon\), which yields \(\frac{d[\sum_{i=1}^{n}(\varepsilon_{i}-x_{i}^{T}u)^{2}]}{du}|_{u=\hat {u}}=-2X^{T}\varepsilon+2nD\hat{u} :=2\sqrt{n}[D(\sqrt{n}\hat{u})-E]\), where \(E=\frac{X^{T}\varepsilon}{\sqrt{n}}\). Together with (5.5) and the fact that \(\{|\hat{u}_{(1)}|<|\beta_{0(1)}|\} \subset\{\operatorname{sgn}(\hat{\beta}_{(1)})=\operatorname{sgn}(\beta_{0(1)})\}\), if û satisfies
$$D_{11}\sqrt{n}\hat{u}_{(1)}-E_{(1)}= \frac{-\lambda}{2\sqrt{n}}\zeta\bar {W}_{(1)} \quad \text{and}\quad | \hat{u}_{(1)}|< |\beta_{0(1)}|, $$
where \(\bar{W}=(\tilde{\omega}_{1}|\hat{u}_{1}+\beta_{01}|^{\zeta -1}\operatorname{sgn}(\beta_{01}),\tilde{\omega}_{2}|\hat{u}_{2}+\beta _{02}|^{\zeta-1}\operatorname{sgn}(\beta_{02}), \ldots,\tilde{\omega}_{p}|\hat{u}_{p}+\beta_{0p}|^{\zeta-1} \operatorname{sgn}(\beta _{0p}))^{T}\), then we have \(\operatorname{sgn}(\hat{\beta}_{(1)})=\operatorname{sgn}(\beta_{0(1)})\) and \(\hat{\beta}_{(2)}=0\). Let
$$\tilde{W}=\bigl(2\tilde{\omega}_{1}|\beta_{01}|^{\zeta-1} \operatorname{sgn}(\beta _{01}),2\tilde{\omega}_{2}| \beta_{02}|^{\zeta-1}\operatorname{sgn}(\beta_{02}), \ldots,2\tilde{\omega}_{p}|\beta_{0p}|^{\zeta-1} \operatorname{sgn}(\beta_{0p})\bigr)^{T}, $$
then a sufficient condition is \(|D_{11}^{-1}E_{(1)}|+\frac{\lambda\zeta}{2\sqrt{n}}|D_{11}^{-1}\tilde {W}_{(1)}|<\sqrt{n}|\beta_{0(1)}|\), componentwise. Denote \(A=\{|D_{11}^{-1}E_{(1)}|+\frac{\lambda\zeta}{2\sqrt{n}}|D_{11}^{-1}\tilde {W}_{(1)}|<\sqrt{n}|\beta_{0(1)}|\}\); then \(P\{\operatorname{sgn}(\hat{\beta})=\operatorname{sgn}(\beta_{0})\}\geq P\{A\}\), from which it follows that
$$\begin{aligned} P\bigl\{ \operatorname{sgn}(\hat{\beta})\neq\operatorname{sgn}( \beta_{0})\bigr\} \leq& P\bigl\{ A^{c}\bigr\} \\ \leq& P\biggl\{ |\xi_{i}|\geq\frac{1}{2}\sqrt{n}| \beta_{0i}|, \exists i\in J_{1}\biggr\} \\ &{}+P\biggl\{ \frac{\lambda\zeta}{n}|Z_{i}|>|\beta_{0i}|, \exists i\in J_{1} \biggr\} :=I_{1}+I_{2}, \end{aligned}$$
(5.6)
where \(\xi=(\xi_{1},\xi_{2},\ldots,\xi_{q})^{T}=D_{11}^{-1}E_{(1)}\) and \(Z=(Z_{1},Z_{2},\ldots,Z_{q})^{T}=D_{11}^{-1}\tilde{W}_{(1)}\). For \(I_{1}\), assumptions (A1) and (A3) give \(E[\xi_{i}^{2k}]<\infty\) for all \(i\in J_{1}\), so the tail probability satisfies \(P\{|\xi_{i}|>t\}=O(t^{-2k})\) for all \(t>0\), which yields
$$ I_{1}\leq P\biggl\{ |\xi_{i}|\geq\frac{1}{2} \sqrt{n}h_{1}, \exists i\in J_{1}\biggr\} =q O\biggl(\biggl( \frac{1}{2}\sqrt{n}h_{1}\biggr)^{-2k}\biggr)\rightarrow0 \quad (n\rightarrow\infty). $$
(5.7)
For \(I_{2}\), notice that \(1-I_{2}=P\{\frac{\lambda\zeta}{n}|Z_{i}|\leq|\beta _{0i}|, \forall i\in J_{1} \}\) and \(|Z_{i}|\leq\| D_{11}^{-1}\tilde{W}_{(1)}\|\leq\frac{1}{\tau_{1}}\|\tilde{W}_{(1)}\|\leq \frac{2\sqrt{q}h_{1}^{\zeta-1}}{\tau_{1}\min_{j\in J_{1}}|\tilde{\beta}_{j}|}\); then we get
$$1-I_{2}\geq P\biggl\{ \frac{2\sqrt{q}\lambda\zeta h_{1}^{\zeta-1}}{\tau_{1}\min_{j\in J_{1}}|\tilde{\beta}_{j}|}\leq nh_{1}\biggr\} =P\biggl\{ \lambda\zeta\leq\frac{n\tau_{1}h_{1}^{2-\zeta}}{2\sqrt{q}}\min_{j\in J_{1}}|\tilde{ \beta}_{j}|\biggr\} \rightarrow1\quad (n\rightarrow\infty). $$
Hence \(I_{2}\rightarrow0\) (\(n\rightarrow\infty\)). Together with (5.6) and (5.7), this gives \(\lim_{n \rightarrow\infty}P\{\hat {\beta}=_{s}\beta_{0}\}=1\). This completes the proof of the first part of Theorem 2.2.
(2) Let \(W=(\tilde{\omega}_{1}|\hat{\beta}_{1}|^{\zeta-1}\operatorname{sgn}(\hat{\beta}_{1}),\tilde{\omega}_{2}|\hat{\beta}_{2}|^{\zeta-1}\operatorname{sgn}(\hat{\beta}_{2}), \ldots,\tilde{\omega}_{p}|\hat{\beta}_{p}|^{\zeta-1}\operatorname{sgn}(\hat{\beta }_{p}))^{T}\). On the event \(\{\operatorname{sgn}(\hat{\beta})=\operatorname{sgn}(\beta_{0})\}\), the KKT conditions (5.5) give, for \(j\in J_{1}\), \(X_{(1)}^{T}(Y-X_{(1)}\hat{\beta}_{(1)})=X^{T}_{(1)}(X_{(1)}\beta _{0(1)}-X_{(1)}\hat{\beta}_{(1)}+\varepsilon)=\frac{\lambda\zeta}{2}W_{(1)}\), which yields \(D_{11}(\hat{\beta}_{(1)}-\beta_{0(1)})=\frac{X_{(1)}^{T}\varepsilon }{n}-\frac{\lambda\zeta}{2n}W_{(1)}\). By the first part of Theorem 2.2, \(\lim_{n \rightarrow \infty}P\{D_{11}(\hat{\beta}_{(1)}-\beta_{0(1)})=\frac {X_{(1)}^{T}\varepsilon}{n}-\frac{\lambda\zeta}{2n}W_{(1)}\}=1\), so for any \(q\times1\) vector u with \(\|u\|\leq1\),
$$ \sqrt{n}u^{T}(\hat{\beta}_{(1)}-\beta_{0(1)}) =n^{-1/2}u^{T}D_{11}^{-1}X_{(1)}^{T} \varepsilon-\frac{\lambda\zeta}{2\sqrt {n}}u^{T}D_{11}^{-1}W_{(1)}+o_{P}(1). $$
(5.8)
Notice that
$$\begin{aligned} \biggl\vert \frac{\lambda\zeta}{2\sqrt{n}}u^{T}D_{11}^{-1}W_{(1)} \biggr\vert \leq&\frac{\lambda \zeta}{2\sqrt{n}\tau_{1}}\frac{\|\hat{\beta}_{(1)}\|^{\zeta-1}}{\min_{j\in J_{1}}|\tilde{\beta}_{j}|}\leq\frac{\lambda\zeta M_{1}^{\zeta -1}q^{\frac{\zeta-1}{2}}n^{\alpha(\zeta-2)-\frac{1}{2}}}{2^{\zeta -1}M\tau_{1}} \\ =&O \bigl(n^{\frac{1}{2}c_{1}(\zeta-1)+\alpha(\zeta-2)-\frac{1}{2}}\bigr), \end{aligned}$$
where the second inequality holds because \(P\{\min_{j\in J_{1}}|\hat{\beta }_{j}|\geq\frac{1}{2}M_{1}n^{\alpha}\}\rightarrow1\) (\(n\rightarrow\infty\)) for some \(M_{1}>0\). Since \(\frac{1}{2}c_{1}(\zeta-1)+\alpha(\zeta-2)-\frac{1}{2}<0\), we obtain \(|\frac{\lambda\zeta}{2\sqrt{n}}u^{T}D_{11}^{-1}W_{(1)}|=o_{P}(1)\), which together with (5.8) yields
$$ \sqrt{n}u^{T}(\hat{\beta}_{(1)}-\beta_{0(1)}) =n^{-1/2}u^{T}D_{11}^{-1}X_{(1)}^{T} \varepsilon+o_{P}(1). $$
(5.9)
Denote \(s^{2}=\sigma^{2}u^{T}D_{11}^{-1}u\) and \(F_{i}=n^{-\frac {1}{2}}s^{-1}u^{T}D_{11}^{-1}g_{i}\); by assumption (A4) and (5.9) we have \(\sqrt{n}s^{-1}u^{T}(\hat{\beta}_{(1)}-\beta_{0(1)})=\sum_{i=1}^{n}F_{i}\varepsilon_{i}+o_{P}(1)\stackrel{\mathrm{d}}{\longrightarrow} N(0,1)\). This completes the proof of the second part of Theorem 2.2. □

Declarations

Acknowledgements

The research was supported by the NSF of Anhui Province (No. 1508085QA13) and the China Scholarship Council.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1)
School of International Trade and Economics, University of International Business and Economics, Beijing, P.R. China
(2)
Department of Mathematical Science, University of Wisconsin at Milwaukee, Milwaukee, USA

References

  1. Frank, IE, Friedman, JH: A statistical view of some chemometrics regression tools (with discussion). Technometrics 35, 109-148 (1993)
  2. Hoerl, AE, Kennard, RW: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67 (1970)
  3. Tibshirani, R: The lasso method for variable selection in the Cox model. Stat. Med. 16, 385-395 (1997)
  4. Fan, J, Li, R: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348-1360 (2001)
  5. Knight, K, Fu, WJ: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356-1378 (2000)
  6. Zou, H, Hastie, T: Regularization and variable selection via the elastic net. J. R. Stat. Soc., Ser. B 67, 301-320 (2005)
  7. Zou, H: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418-1429 (2006)
  8. Candes, E, Tao, T: The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35, 2313-2351 (2007)
  9. Zhang, C: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894-942 (2010)
  10. Huang, J, Ma, S, Zhang, CH: Adaptive lasso for sparse high-dimensional regression models. Stat. Sin. 18, 1603-1618 (2008)
  11. Wang, M, Song, L, Wang, X: Bridge estimation for generalized linear models with a diverging number of parameters. Stat. Probab. Lett. 80, 1584-1596 (2010)
  12. Zhao, P, Yu, B: On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541-2563 (2006)
  13. Hunter, DR, Li, R: Variable selection using MM algorithms. Ann. Stat. 33, 1617-1642 (2005)
  14. Efron, B, Hastie, T, Johnstone, I, Tibshirani, R: Least angle regression (with discussion). Ann. Stat. 32, 407-499 (2004)

Copyright

© Chen et al. 2016
