Adaptive bridge estimation for high-dimensional regression models
 Zhihong Chen^{1},
 Yanling Zhu^{1} and
 Chao Zhu^{2}
https://doi.org/10.1186/s13660-016-1205-y
© Chen et al. 2016
Received: 16 August 2016
Accepted: 12 October 2016
Published: 20 October 2016
Abstract
In high-dimensional models, penalized methods are an effective means of selecting variables. We propose an adaptive bridge method and establish its oracle property. Numerical results demonstrate the effectiveness of the proposed method.
1 Introduction
To the best of our knowledge, no existing work discusses the properties of the adaptive bridge, and our results fill this gap. Compared with the results in Huang et al. [10] and Wang et al. [11], condition (A_{2}) (see Section 2) imposed on the true coefficients is much weaker. Moreover, Wang et al. [11] require the true coefficients to satisfy an additional condition called the covering number. In addition, Huang et al. [10] and Wang et al. [11] both use the LQA algorithm to obtain the estimator. A shortcoming of the LQA algorithm is that once a variable is deleted at some step of the iteration, it has no chance of appearing in the final model. To overcome this shortcoming and improve stability, we employ the MM algorithm.
The rest of the paper is organized as follows. In Section 2, we introduce the notation and assumptions needed for our results and present the main results. Section 3 presents some simulation results. The conclusion and the proofs of the main results are given in Sections 4 and 5, respectively.
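To make the contrast between the two algorithms concrete, the following is a minimal sketch of an MM-type iteration. It is not the implementation used in this paper. It assumes a criterion of the form \(\|Y-X\beta\|^{2}+\lambda\sum_{j}w_{j}|\beta_{j}|^{\zeta}\) with given weights \(w_{j}\), and it replaces \(|\beta_{j}|^{\zeta}\) by the smoothed term \((\beta_{j}^{2}+\epsilon^{2})^{\zeta/2}\). The function names, the smoothing constant eps, the least squares starting value, and the weights \(w_{j}=1/|\tilde{\beta}_{j}|\) are all hypothetical choices made for illustration.

```python
import numpy as np


def smoothed_objective(X, y, beta, lam, w, zeta, eps):
    # Assumed criterion: ||y - X b||^2 + lam * sum_j w_j * (b_j^2 + eps^2)^(zeta/2).
    resid = y - X @ beta
    return float(resid @ resid + lam * np.sum(w * (beta**2 + eps**2) ** (zeta / 2)))


def mm_adaptive_bridge(X, y, lam, w, zeta=0.5, eps=1e-4, n_iter=200, tol=1e-8):
    # Hypothetical helper: MM iterations for the smoothed criterion (0 < zeta <= 2).
    XtX, Xty = X.T @ X, X.T @ y
    beta = np.linalg.solve(XtX, Xty)  # least squares start; needs p < n and full rank
    history = [smoothed_objective(X, y, beta, lam, w, zeta, eps)]
    for _ in range(n_iter):
        # t -> (t + eps^2)^(zeta/2) is concave in t = b_j^2, so its tangent line at the
        # current iterate majorizes it; the surrogate is a weighted ridge criterion.
        d = lam * w * (zeta / 2) * (beta**2 + eps**2) ** (zeta / 2 - 1)
        new_beta = np.linalg.solve(XtX + np.diag(d), Xty)
        history.append(smoothed_objective(X, y, new_beta, lam, w, zeta, eps))
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, history


# Toy data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
beta0 = np.array([2.0, 0.0, 0.0, -1.5, 0.0])
y = X @ beta0 + rng.standard_normal(100)
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
w = 1.0 / np.abs(beta_ols)  # assumed form of the adaptive weights
beta_hat, history = mm_adaptive_bridge(X, y, lam=5.0, w=w)
print(np.round(beta_hat, 3))
```

Because eps keeps every ridge weight finite, a coefficient that has become very small is only shrunk, never removed, and can still move away from zero in a later iteration. Exact zeros would have to be obtained by thresholding the final iterate, which this sketch does not do.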
2 Main results
For convenience, we first introduce some notation. Let \(\beta_{0}=(\beta_{01},\beta_{02},\ldots,\beta_{0p})^{T}\) be the true parameter, \(J_{1}=\{j:\beta_{0j}\neq0,j=1,2,\ldots,p\}\), and \(J_{2}=\{j:\beta_{0j}=0,j=1,2,\ldots,p\}\). The cardinality of the set \(J_{1}\) is denoted by q, and \(h_{1}=\min\{|\beta_{0j}|:j\in J_{1}\}\). Without loss of generality, we assume that the first q covariates (denoted by \(X_{(1)}\)) have nonzero coefficients, and we let \(X_{(2)}\) be the covariates with zero coefficients; correspondingly, \(\beta_{0}=(\beta_{0(1)}^{T},\beta_{0(2)}^{T})^{T}\) and \(\hat{\beta}=(\hat{\beta}_{(1)}^{T},\hat{\beta}_{(2)}^{T})^{T}\). Although p, q, X, Y, β, and λ depend on the sample size n, we omit n from the notation for convenience. In this paper, we only consider the statistical properties of the adaptive bridge for the case \(p< n\); consequently we put \(p=O(n^{c_{2}})\), \(q=O(n^{c_{1}})\), \(\lambda=O(n^{\delta})\), where \(0\leq c_{1}< c_{2}<1\), \(\delta>0\).
Following the terminology of Zhao and Yu [12], we define \(\hat{\beta}=_{s}\beta_{0}\) if and only if \(\operatorname{sgn}(\hat{\beta})=\operatorname{sgn}(\beta_{0})\), where the sign of a \(p\times1\) vector β is \(\operatorname{sgn}(\beta)=(\operatorname{sgn}(\beta_{1}), \operatorname{sgn}(\beta_{2}),\ldots, \operatorname{sgn}(\beta_{p}))^{T}\). For any symmetric matrix Z, denote by \(\lambda_{\mathrm{min}}(Z)\) and \(\lambda_{\mathrm{max}}(Z)\) the minimum and maximum eigenvalues of Z, respectively. Denote \(D:=\frac{X^{T}X}{n}\) and \(D=\begin{pmatrix} D_{11} & D_{12} \\ D_{21} & D_{22}\end{pmatrix}\), where \(D_{11}=\frac{1}{n}X_{(1)}^{T}X_{(1)}\).
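As a small illustration of this notation, the sketch below computes \(J_{1}\), \(J_{2}\), q, and \(h_{1}\) for a given coefficient vector and checks the sign equality \(\hat{\beta}=_{s}\beta_{0}\). The helper names are hypothetical, indices are 0-based rather than 1-based, and \(h_{1}\) is read here as the smallest absolute value among the nonzero coefficients, which is an assumption.

```python
import numpy as np


def sign_equal(beta_hat, beta0):
    # beta_hat =_s beta0 iff the componentwise signs (-1, 0 or +1) coincide.
    return bool(np.array_equal(np.sign(beta_hat), np.sign(beta0)))


def support_summary(beta0):
    # Returns J1, J2 (0-based index sets), q = |J1| and h1 = min_{j in J1} |beta0_j|.
    beta0 = np.asarray(beta0, dtype=float)
    J1 = np.flatnonzero(beta0 != 0)
    J2 = np.flatnonzero(beta0 == 0)
    h1 = float(np.min(np.abs(beta0[J1])))
    return J1, J2, len(J1), h1


beta0 = [2.5, -1.0, 0.0, 0.0]
J1, J2, q, h1 = support_summary(beta0)
print(J1, J2, q, h1)
print(sign_equal([2.1, -0.7, 0.0, 0.0], beta0))  # same sign pattern
print(sign_equal([2.1, -0.7, 0.3, 0.0], beta0))  # a zero coefficient was selected
```

Note that sign equality is stricter than recovering the support: a nonzero estimate with the wrong sign also breaks it.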
 (A_{1}):

The errors ε are i.i.d. with \(E(\varepsilon)=0\) and \(E(\varepsilon^{2k})<+\infty\), where \(k>0\). In the special case \(k=1\) we write \(E(\varepsilon^{2})=\sigma^{2}\).
 (A_{2}):

There exists a positive constant M such that \(h_{1}\geq Mn^{\alpha}\), where \(\max\{\frac{1}{2},\frac{c_{2}-1}{2},\frac{1}{2\zeta}\}<\alpha<\min\{c_{2}-\delta,\frac{c_{2}\delta\zeta}{1+\zeta}\}\) and \(\delta+\alpha+\frac{1}{2}\zeta< c_{2}\).
 (A_{3}):

Suppose \(\tau_{1}\) and \(\tau_{2}\) are the minimum and maximum eigenvalues of the matrix \(D_{11}\). There exist constants \(\tau_{10}\) and \(\tau_{20}\) such that \(0<\tau_{10}\leq\tau_{1}\leq\tau_{2}\leq\tau _{20}\), and the eigenvalues of \(\frac{1}{n}X^{T}\operatorname{var}(Y)X\) are bounded.
 (A_{4}):

Let \(g_{i}\) be the transpose of the ith row vector of \(X_{(1)}\), and suppose that \(\lim_{n\rightarrow\infty}n^{-\frac{1}{2}} \max_{1\leq i\leq n}g_{i}^{T} g_{i}=0\).
It is worth mentioning that condition (A_{1}) is much weaker than those in the literature, where it is commonly assumed that the error term has a Gaussian tail. In this paper we allow ε to have a heavy tail. The regularity condition (A_{2}) is a common assumption on the nonzero coefficients, which ensures that all important covariates are included in the final selected model. Condition (A_{3}) means that the matrix \(\frac{1}{n}X_{(1)}^{T}X_{(1)}\) is strictly positive definite. Condition (A_{4}) is used to prove the asymptotic normality of the estimators of the nonzero coefficients. In fact, if the nonzero coefficients have an upper bound, then condition (A_{4}) is easy to verify.
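For a given design, the quantities appearing in conditions (A_{3}) and (A_{4}) can be inspected numerically. The sketch below is only a diagnostic aid under stated assumptions: the function name is hypothetical, and the scaling in (A_{4}) is read as \(n^{-1/2}\max_{i}g_{i}^{T}g_{i}\), that is, with a negative exponent. A finite sample can suggest, but never prove, an asymptotic condition.

```python
import numpy as np


def design_diagnostics(X1):
    # X1: n x q matrix X_(1) of the covariates with nonzero coefficients.
    n = X1.shape[0]
    D11 = X1.T @ X1 / n
    eigvals = np.linalg.eigvalsh(D11)  # ascending eigenvalues of the symmetric D11
    tau1, tau2 = float(eigvals[0]), float(eigvals[-1])
    max_row_norm_sq = float(np.max(np.sum(X1**2, axis=1)))  # max_i g_i^T g_i
    return tau1, tau2, max_row_norm_sq / np.sqrt(n)


rng = np.random.default_rng(1)
for n in (200, 800, 3200):
    X1 = rng.standard_normal((n, 9))
    print(n, design_diagnostics(X1))
```

In this toy Gaussian design one would expect the two eigenvalues to stay away from zero and infinity while the third quantity shrinks as n grows.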
2.1 Consistency of the estimation
Theorem 2.1
Consistency of the estimation
If \(0<\zeta<2\) and conditions (A_{1})-(A_{3}) hold, then there exists a local minimizer \(\hat{\beta}\) of \(Q(\beta)\) such that \(\|\hat{\beta}-\beta_{0}\|=O_{p}(n^{\frac{\delta+\alpha-c_{2}}{\zeta}})\).
Remark 2.1
By condition (A_{2}), we have \(c_{2}-\delta-\alpha>0\), so the estimator is consistent, and its rate of convergence is determined by the orders of the sample size and the tuning parameter. Theorem 2.1 extends the previous results.
2.2 Oracle property of the estimation
Theorem 2.2
Oracle property
 (1)
(Selection consistency) \(\lim_{n \rightarrow\infty }P\{\hat{\beta}=_{s}\beta_{0}\}=1\);
 (2)
(Asymptotic normality) \(\sqrt{n}s^{-1}u^{T}(\hat{\beta}_{(1)}-\beta_{0(1)})\stackrel{\mathrm{d}}{\longrightarrow} N(0,1)\), where \(s^{2}=\sigma^{2}u^{T}D_{11}^{-1}u\) for any \(q\times1\) vector u with \(\|u\|\leq1\).
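One practical use of the asymptotic normality statement is an approximate confidence interval for a linear combination \(u^{T}\beta_{0(1)}\). The sketch below assumes \(s^{2}=\sigma^{2}u^{T}D_{11}^{-1}u\) and treats σ as supplied by the user (in practice it would be replaced by an estimate). The function name and its arguments are hypothetical.

```python
import numpy as np
from statistics import NormalDist


def normal_interval(beta1_hat, D11, sigma, u, n, level=0.95):
    # Approximate interval u^T beta1_hat -/+ z * s / sqrt(n), s^2 = sigma^2 u^T D11^{-1} u.
    beta1_hat = np.asarray(beta1_hat, dtype=float)
    u = np.asarray(u, dtype=float)
    s = sigma * np.sqrt(u @ np.linalg.solve(D11, u))
    z = NormalDist().inv_cdf(0.5 + level / 2)  # standard normal quantile
    center = float(u @ beta1_hat)
    half_width = float(z * s / np.sqrt(n))
    return center - half_width, center + half_width


# Interval for the first nonzero coefficient (u = e_1) in a toy setting.
print(normal_interval([2.0, 3.0], np.eye(2), sigma=1.0, u=[1.0, 0.0], n=100))
```

Such an interval is only justified on the event that the correct model has been selected, which part (1) of the theorem says happens with probability tending to one.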
3 Simulation results
In this section we evaluate the performance of the adaptive bridge estimator proposed in (1.1) by simulation studies. We set \(\zeta=1/2\) and simulate data from the model \(Y=X\beta+\varepsilon\), \(\varepsilon\sim N(0,\sigma^{2})\), where \(\sigma=1\) and \(\beta_{0(1)}=(2.5,2.5,2.5,3,3,3,1,1,1)^{T}\). The design matrix X is generated from a p-dimensional multivariate normal distribution with mean zero and a covariance matrix whose \((i,j)\)th component is \(\rho^{|i-j|}\), where we let \(\rho=0.5\) and \(\rho=0.9\), respectively. The following examples are considered.
Example 3.1
The sample size is \(n=200\) and the number of covariates is \(p=50\).
Example 3.2
The sample size is \(n=500\) and the number of covariates is \(p=80\).
Example 3.3
The sample size is \(n=800\) and the number of covariates is \(p=100\).
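The data-generating mechanism of these examples can be sketched as follows. This is an illustrative reconstruction rather than the code used for the reported results: it reads the covariance entry as \(\rho^{|i-j|}\), places the nine nonzero coefficients in the first nine positions (in line with the convention of Section 2), and uses hypothetical function names and an arbitrary seed.

```python
import numpy as np


def ar1_cov(p, rho):
    # Covariance matrix with (i, j)th entry rho^{|i - j|}.
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def simulate(n, p, rho, sigma=1.0, seed=0):
    # Draws (X, Y) from Y = X beta0 + eps with eps ~ N(0, sigma^2).
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[:9] = [2.5, 2.5, 2.5, 3, 3, 3, 1, 1, 1]  # nonzero block beta_{0(1)}
    X = rng.multivariate_normal(np.zeros(p), ar1_cov(p, rho), size=n)
    y = X @ beta0 + sigma * rng.standard_normal(n)
    return X, y, beta0


for n, p in [(200, 50), (500, 80), (800, 100)]:  # Examples 3.1 to 3.3
    X, y, beta0 = simulate(n, p, rho=0.5)
    print(n, p, int(np.sum(beta0 == 0)))  # last value: number of zero coefficients
```

With nine nonzero coefficients, the number of zero coefficients is \(p-9\), which matches the values υ = 41, 71, and 91 listed in the tables.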
Simulation results for \(\rho=0.5\)
| Setting | Method | \(L_{2}\) loss | PE | C | IC |
| --- | --- | --- | --- | --- | --- |
| n = 200, p = 50, υ = 41 | Lasso | 0.5459 (0.1160) | 0.8830 (0.1126) | 28.4540 (5.3337) | 0 (0) |
| | Alasso | 0.5442 (0.1149) | 0.8790 (0.1099) | 28.5300 (5.2421) | 0 (0) |
| | Bridge | 0.4733 (0.1120) | 0.8755 (0.1068) | 28.2700 (5.0834) | 0 (0) |
| | Abridge | 0.4617 (0.1155) | 0.9005 (0.1038) | 38.8380 (2.9903) | 0 (0) |
| n = 500, p = 80, υ = 71 | Lasso | 0.3459 (0.0829) | 0.9469 (0.0673) | 56.1300 (6.3329) | 0 (0) |
| | Alasso | 0.3476 (0.0830) | 0.9465 (0.0686) | 56.0360 (6.5269) | 0 (0) |
| | Bridge | 0.2950 (0.0732) | 0.9394 (0.0654) | 52.7140 (6.7155) | 0 (0) |
| | Abridge | 0.2745 (0.0728) | 0.9661 (0.0630) | 69.7160 (2.5994) | 0 (0) |
| n = 800, p = 100, υ = 91 | Lasso | 0.2814 (0.0624) | 0.9664 (0.0548) | 74.4820 (7.1596) | 0 (0) |
| | Alasso | 0.2817 (0.0620) | 0.9687 (0.0552) | 74.7180 (7.0409) | 0 (0) |
| | Bridge | 0.2327 (0.0576) | 0.9570 (0.0534) | 69.4380 (8.8332) | 0 (0) |
| | Abridge | 0.2160 (0.0569) | 0.9839 (0.0514) | 89.7680 (3.0359) | 0 (0) |
Simulation results for \(\rho=0.9\)
| Setting | Method | \(L_{2}\) loss | PE | C | IC |
| --- | --- | --- | --- | --- | --- |
| n = 200, p = 50, υ = 41 | Lasso | 1.0102 (0.2452) | 0.8725 (0.1049) | 27.2580 (4.6677) | 0.0040 (0.0632) |
| | Alasso | 1.0123 (0.2475) | 0.8800 (0.1024) | 27.2460 (4.5851) | 0.0040 (0.0632) |
| | Bridge | 0.8961 (0.2624) | 0.8656 (0.1059) | 25.0600 (5.3104) | 0 (0) |
| | Abridge | 0.8468 (0.2843) | 0.8965 (0.1092) | 37.7800 (4.2298) | 0.0260 (0.1593) |
| n = 500, p = 80, υ = 71 | Lasso | 0.6649 (0.1630) | 0.9435 (0.0664) | 52.9840 (5.7272) | 0 (0) |
| | Alasso | 0.6671 (0.1620) | 0.9388 (0.0667) | 52.3420 (6.1499) | 0 (0) |
| | Bridge | 0.5251 (0.1442) | 0.9368 (0.0649) | 50.0820 (7.7934) | 0 (0) |
| | Abridge | 0.4837 (0.1377) | 0.9646 (0.0651) | 68.2320 (4.7732) | 0 (0) |
| n = 800, p = 100, υ = 91 | Lasso | 0.5382 (0.1242) | 0.9623 (0.0545) | 70.4900 (6.5000) | 0 (0) |
| | Alasso | 0.5371 (0.1259) | 0.9614 (0.0544) | 69.9680 (6.9009) | 0 (0) |
| | Bridge | 0.4126 (0.1183) | 0.9572 (0.0541) | 66.6060 (9.6239) | 0 (0) |
| | Abridge | 0.3580 (0.1087) | 0.9818 (0.0520) | 89.2840 (4.3161) | 0 (0) |
Note that in every case the adaptive bridge outperforms the other methods in sparsity: it identifies the largest number of zero coefficients correctly and therefore selects the smallest model. For the adaptive bridge the prediction error is slightly higher than for the other methods, but in terms of estimation accuracy the adaptive bridge is still the winner, followed by the bridge. We also find that, for both \(\rho=0.5\) and \(\rho=0.9\), the adaptive bridge identifies the zero coefficients more reliably as the sample size n grows. Meanwhile, as n increases, the estimation accuracy improves, but the prediction error becomes worse. Additionally, when ρ increases from 0.5 to 0.9, the \(L_{2}\) loss increases markedly, that is, the estimation accuracy decreases, whereas the prediction error changes only slightly (it is marginally lower in most settings).
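For readers who wish to reproduce comparisons of this kind, the following sketch computes three of the reported criteria for a single fitted vector. The column definitions are not restated in this section, so the readings used here are assumptions: the \(L_{2}\) loss is taken as \(\|\hat{\beta}-\beta_{0}\|\), C as the number of zero coefficients correctly estimated as zero (hence at most υ), and IC as the number of nonzero coefficients incorrectly estimated as zero. The prediction error PE is omitted because its definition is not given here, and the function name is hypothetical.

```python
import numpy as np


def selection_metrics(beta_hat, beta0):
    # Returns (L2 loss, C, IC) under the assumed definitions stated above.
    beta_hat = np.asarray(beta_hat, dtype=float)
    beta0 = np.asarray(beta0, dtype=float)
    l2_loss = float(np.linalg.norm(beta_hat - beta0))
    c = int(np.sum((beta0 == 0) & (beta_hat == 0)))   # zeros correctly kept at zero
    ic = int(np.sum((beta0 != 0) & (beta_hat == 0)))  # true signals wrongly dropped
    return l2_loss, c, ic


print(selection_metrics([1.5, 0.0, 0.1, 0.0], [2.0, 0.0, 0.0, 1.0]))
```

Averaging these values over repeated simulated data sets would give entries of the same kind as those in the tables, with the standard deviation over replications in parentheses.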
4 Conclusion
In this paper we have proposed the adaptive bridge estimator and presented some of its theoretical properties. Under some conditions and with a suitable choice of the tuning parameter, we have shown that the adaptive bridge estimator enjoys the oracle property. The effectiveness of the proposed method is demonstrated by numerical results.
5 Proofs
Proof of Theorem 2.1
Proof of Theorem 2.2
Declarations
Acknowledgements
The research was supported by the NSF of Anhui Province (No. 1508085QA13) and the China Scholarship Council.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
 [1] Frank, IE, Friedman, JH: A statistical view of some chemometrics regression tools. Technometrics 35, 109-148 (1993) (with discussion)
 [2] Hoerl, AE, Kennard, RW: Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12, 55-67 (1970)
 [3] Tibshirani, R: The lasso method for variable selection in the Cox model. Stat. Med. 16, 385-395 (1997)
 [4] Fan, J, Li, R: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348-1360 (2001)
 [5] Knight, K, Fu, WJ: Asymptotics for lasso-type estimators. Ann. Stat. 28, 1356-1378 (2000)
 [6] Zou, H, Hastie, T: Regularization and variable selection via the elastic net. J. R. Stat. Soc., Ser. B 67, 301-320 (2005)
 [7] Zou, H: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418-1429 (2006)
 [8] Candes, E, Tao, T: The Dantzig selector: statistical estimation when p is much larger than n. Ann. Stat. 35, 2313-2351 (2007)
 [9] Zhang, C: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894-942 (2010)
 [10] Huang, J, Ma, S, Zhang, CH: Adaptive lasso for sparse high-dimensional regression models. Stat. Sin. 18, 1603-1618 (2008)
 [11] Wang, M, Song, L, Wang, X: Bridge estimation for generalized linear models with a diverging number of parameters. Stat. Probab. Lett. 80, 1584-1596 (2010)
 [12] Zhao, P, Yu, B: On model selection consistency of lasso. J. Mach. Learn. Res. 7, 2541-2563 (2006)
 [13] Hunter, DR, Li, R: Variable selection using MM algorithms. Ann. Stat. 33, 1617-1642 (2005)
 [14] Efron, B, Hastie, T, Johnstone, I, Tibshirani, R: Least angle regression. Ann. Stat. 32, 407-499 (2004) (with discussion)