Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models

We study the nonasymptotic properties of a general norm penalized estimator, which includes the Lasso, weighted Lasso, and group Lasso as special cases, for sparse high-dimensional misspecified Cox models with time-dependent covariates. Under suitable conditions on the true regression coefficients and random covariates, we provide oracle inequalities for the prediction and estimation error based on the group sparsity of the true coefficient vector. The nonasymptotic oracle inequalities show that the penalized estimator provides a good sparse approximation of the true model and enables the selection of a few meaningful structural variables among the set of features.


Introduction
In recent years, high-throughput and nonparametric complex data have been frequently collected in gene biology, signal processing, neuroscience, and other scientific fields. With massive data in regression problems, we encounter the situation that both the number of covariates p and the sample size n are increasing, and p is a function of n, i.e., p =: p(n). The curse of dimensionality, together with the computational complexity, forces us to perform variable selection, since the true regression coefficient β* is often sparse with few nonzero components. Thus only a subset of the variables is preferable as important features. Identifying the sparse set of nonzero coordinates of β* also amounts to choosing the best model. A popular approach is to penalize the log-likelihood by adding a penalty function, which intuitively leads to choosing a sparse model. One popular method is the Lasso (least absolute shrinkage and selection operator), introduced by Tibshirani [23] as a modification of the least squares method in linear models. With the development of data science, high-dimensional statistics, including various regularization methods (such as the group Lasso and the weighted Lasso), have sprung up through statisticians' efforts for over two decades.
Ever since the Lasso methodology for linear models was introduced, the study of various penalty functions (from data-independent to data-driven penalties) and loss functions (from smooth to nonsmooth, from Lipschitz to non-Lipschitz) has remained an active topic in high-dimensional statistics, even though Lasso regularization itself has been thoroughly analyzed. In many practical applications, however, predictors have group structures. Yuan and Lin [29] study the problem of selecting grouped variables for accurate prediction in linear regressions, and their proposed group Lasso is an extension of the Lasso aimed at improving estimation accuracy. When considering variable selection in Cox models, massive data sets bring researchers unprecedented computational challenges, see Tibshirani [24]. Fan and Li [9] study the SCAD penalized partial likelihood approach for Cox models, and the proposed estimator enjoys the oracle property if a proper regularization parameter is chosen. Zhang and Lu [34] consider different penalties for different coefficients (the adaptive Lasso), and their idea is that "unimportant variables receive larger penalties than important ones so that important variables tend to be retained in the selection process, whereas unimportant variables are more likely to be dropped". Theoretical properties, including consistency and the rate of convergence of this adaptive Lasso estimator, are also established by Zhang and Lu [34] when the number of covariates is fixed.
A potential characterization, which appears in large-scale gene data associated with survival times, is that we only have a few (maybe several) significant predictors among p (maybe thousands of) covariates, with p ≫ n apparently. For example, the survival of patients with diffuse large-B-cell lymphoma (DLBCL) after chemotherapy is affected by molecular features of the tumors, which are measured by high-dimensional microarray gene expression. Rosenwald et al. [20] adopt Cox models to identify individual genes whose expression correlates with the outcome, and the data contain n = 240 patients and p = 7399 gene expression levels associated with a good or an adverse outcome. The main challenge is that directly applying classical low-dimensional statistical inference and computing methods to these data is prohibitive. Fortunately, the regularized partial likelihood method can perform parameter estimation and variable selection to enhance the prediction accuracy and interpretability of the Cox models.
It is known that the Lasso estimator is not asymptotically normal, and the exact limiting distribution of the Lasso estimate is hard to derive and has no explicit form, see Knight and Fu [15]. To avoid this difficulty, a popular method is to derive nonasymptotic oracle inequalities based on some regularity conditions. As early as 2004, oracle inequalities for the prediction error were derived without sparsity or restricted eigenvalue conditions for Lasso-type estimators (see Greenshtein and Ritov [10], Bartlett et al. [3]).
In the classical consistency analysis, the model size p is fixed, and the sample size n goes to infinity. In high-dimensional statistical consistency analysis, by contrast, we need nonasymptotic error bounds when both the model size p and the sample size n go to infinity.
Let β* be the true regression coefficient underlying the regression data {X_i, Y_i}_{i=1}^n, where X_i is a p-dimensional covariate vector and Y_i ∈ R is the response. A modern problem, which is the focus of this paper, is the behavior of the estimator β̂ when its dimension grows with the number of samples. There are two types of statistical guarantees for a penalized estimator that are of interest in this setting (as mentioned by Bartlett et al. [3]): 1. Prediction error (persistence): β̂ performs well on future samples, i.e., E[X^τ(β̂ − β*)]² (or its empirical version) is small, called persistence; 2. Estimation error: β̂ is close to β*, i.e., ‖β̂ − β*‖ is small.
The two types of statistical guarantees can be obtained through the following error bounds (i.e., oracle inequalities), where λ_n → 0 is a tuning parameter and s := ‖β*‖_0. Deriving oracle inequalities is a powerful mathematical tool that provides deep insight into the nonasymptotic fluctuation of an estimator around the ideal unknown parameter (called an oracle). Under linear models with group sparsity of the covariates, Lounici et al. [18] show oracle inequalities for the estimation error (in terms of a mixed (2,1)-norm) and the prediction error (for fixed design). Blazere et al. [5] study the properties of the group Lasso estimator in sparse high-dimensional generalized linear models (GLMs) with group sparsity of the covariates, and establish oracle inequalities for the prediction and estimation error. Structured sparsity has recently attracted attention in high-dimensional data analysis; Zhou et al. [36] focus on oracle inequalities for GLMs with overlapping group structures. There have been considerable developments in oracle inequalities, not limited to linear models and GLMs. Lemler [17] introduces a data-driven weighted Lasso to estimate Cox models by approximating the intensity (without using the partial likelihood), and oracle inequalities in terms of an appropriate empirical K-L divergence are obtained. By focusing on misspecified Cox models with their partial likelihood, Kong and Nan [16] derive nonasymptotic oracle inequalities for the weighted Lasso penalized negative log partial likelihood function. Similar results have been proposed for Cox models with time-dependent covariates, see Huang et al. [13] for a martingale analysis of the KKT conditions. Honda and Härdle [11] consider group SCAD-type and adaptive group Lasso estimators for variable selection in Cox models with varying coefficients, and the L_2 convergence rate is obtained in the increasing-dimension setting p/n → 0.
Contributions:
• The existing work on weighted group Lasso penalized Cox models has paid little attention to theoretical results. Yan and Huang [28] propose a weighted group Lasso method that selects important time-dependent variables with a group structure. We establish oracle inequalities for the prediction and estimation error under the random design, which is different from Huang et al. [13] and Kong and Nan [16] (they do not consider the random design and prediction error).
• Huang et al. [13] do not give a clear definition of the true coefficient; our true coefficient in the oracle inequalities is defined as the minimizer of the expected loss function. It is applicable to misspecified Cox models.
• We provide unified nonasymptotic results in terms of oracle inequalities for the prediction and estimation error, which gives a theoretical justification for the consistency of the weighted group Lasso estimator in Cox models (time-dependent covariates and random design).
The sections are organized as follows. Section 2 gives a brief review of Cox models. Section 3 presents the weighted group Lasso penalty for misspecified Cox models. Section 4 shows the oracle inequalities for prediction and estimation for the weighted group Lasso penalized partial likelihood for misspecified Cox models, while detailed proofs are included in Sect. 5.

A brief review of Cox models
The celebrated Cox models have provided a tremendously successful tool for exploring the association of covariates with failure times and survival distributions. In order to match the drop-out situation in clinical trials, we consider that the continuous survival time T_i* is subject to random right censoring. For subject i, let T_i := T_i* ∧ C_i be the observed survival time, which is right-censored by C_i, and let the censoring indicator be Δ_i := 1(T_i* ≤ C_i). Here we assume that the censoring is noninformative. The time-dependent covariates may degenerate to time-independent covariates, i.e., z_ik(t) ≡ z_ik for some index k. For example, the CD4 count (related to a longitudinal process) is time-dependent. The time-independent covariates are baseline covariates, which include age, sex, the treatment indicator, and so on.
Suppose that we observe n independent and identically distributed (i.i.d.) data points sampled from the random population (T, Δ, {z(t)}_{0≤t≤τ}). Let S(t|Z) = P(T > t|Z) be the conditional survival function, where Z is the sigma algebra generated by some covariate variables. The conditional distribution function is related to S(t|Z) by F(t|Z) = P(T ≤ t|Z) = 1 − S(t|Z). Denote f(t|Z) = (d/dt)F(t|Z) as the conditional probability density function. Different from the linear model for modeling the conditional mean or quantile regression for modeling conditional quantiles, the Cox models (also called proportional hazards regressions or Cox regressions) aim to model the conditional hazard rate defined by

h(t|Z) := lim_{dt→0} P(t ≤ T < t + dt | T ≥ t, Z)/dt = f(t|Z)/S(t|Z).   (2.2)

The h(t|Z) is the conditional hazard rate at time t conditional on survival until time t or later (i.e., T ≥ t). From (2.2), S(t|Z) can be represented via the exponential of the integrated hazard: S(t|Z) = exp{−H(t|Z)}, where H(t|Z) = ∫_0^t h(s|Z) ds is the cumulative hazard function. The Cox model specifies

h(t|z(t)) = h_0(t) exp{z^τ(t)β*},   (2.3)

where h_0(t) is an unknown baseline hazard function, and β* ∈ R^p is an unknown parameter which needs to be estimated. By profiling out the term h_0(t), Cox [6] suggests that the inference on β* be based on the likelihood function

L(β) = Π_{i=1}^n [exp{z_i^τ(T_i)β} / Σ_{j∈R_i} exp{z_j^τ(T_i)β}]^{Δ_i},

where R_i = {j : T_j ≥ T_i} is the risk set (the set of individuals whose observed survival times are no less than T_i). In a later paper, Cox [7] rigorously derives this so-called partial likelihood function.
Suppose that the observed time is a continuous variable, so that there are no ties among the observation times. The joint likelihood for the i.i.d. data (2.1) can be written as follows:

Π_{i=1}^n [f(T_i|Z_i)]^{Δ_i} [S(T_i|Z_i)]^{1−Δ_i} = Π_{i=1}^n [h_0(T_i) exp{z_i^τ(T_i)β}]^{Δ_i} exp{−∫_0^{T_i} h_0(s) exp{z_i^τ(s)β} ds},

which contains the unknown h_0(·).
Following the counting process framework in Andersen and Gill [2], let N_i(t) = 1(T_i ≤ t, Δ_i = 1) be the counting process, and denote Y_i(t) := 1(T_i ≥ t) the at-risk process for subject i. The σ-filtration is defined by F_t = σ{N_i(s), Y_i(s), z_i(s), s ≤ t, i = 1, . . . , n}, which represents the information that occurs up to time t. Let dN_i(s) := 1{T_i ∈ [s, s + ds], Δ_i = 1}. The negative log partial likelihood (2.4) for data (2.1) is rewritten as follows:

ℓ_n(β; T, z, Δ) = −(1/n) Σ_{i=1}^n ∫_0^τ [z_i^τ(u)β − log R_n(u, β)] dN_i(u),   (2.7)

where R_n(u, β) = (1/n) Σ_{j=1}^n 1(T_j ≥ u) exp{z_j^τ(u)β} is the empirical relative risk function. Since its summands are neither independent nor Lipschitz, the negative log partial likelihood (2.7) can be approximated by the following intermediate empirical loss function:

ℓ̃_n(β; T, z, Δ) = −(1/n) Σ_{i=1}^n ∫_0^τ [z_i^τ(u)β − log R(u, β)] dN_i(u),   (2.8)

where R(u, β) := E[1(T ≥ u) exp{z^τ(u)β}] is the population relative risk function. We define the loss function by

l(β; T, z, Δ) := −Δ[z^τ(T)β − log R(T, β)].

The gradient of ℓ_n(β; T, z, Δ) can be written as

∇ℓ_n(β; T, z, Δ) = −(1/n) Σ_{i=1}^n ∫_0^τ [z_i(u) − z̄_n(u, β)] dN_i(u),

where z̄_n(u, β) = Σ_{j=1}^n 1(T_j ≥ u) z_j(u) exp{z_j^τ(u)β} / Σ_{j=1}^n 1(T_j ≥ u) exp{z_j^τ(u)β} is the random weighted average of the covariates. The process ∇ℓ_n(β; T, z, Δ), evaluated at β*, is called the score process; it is a martingale adapted to the filtration F_t. Furthermore, the Hessian matrix of ℓ_n(β; T, z, Δ) is

∇²ℓ_n(β; T, z, Δ) = (1/n) Σ_{i=1}^n ∫_0^τ V_n(u, β) dN_i(u), with V_n(u, β) := Σ_{j=1}^n 1(T_j ≥ u) [z_j(u) − z̄_n(u, β)]^{⊗2} exp{z_j^τ(u)β} / Σ_{j=1}^n 1(T_j ≥ u) exp{z_j^τ(u)β},

the random weighted sample covariance matrix. Readers can refer to Andersen et al. [1] for the technical details required to make the counting process framework rigorous.
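As a concrete illustration of the negative log partial likelihood (2.7) and its score, the following minimal sketch (not the paper's code; time-independent covariates and all variable names are illustrative assumptions) evaluates both quantities by looping over the risk sets:

```python
import numpy as np

def neg_log_partial_likelihood(beta, T, delta, Z):
    """l_n(beta) = -(1/n) sum_i delta_i [z_i' beta - log R_n(T_i, beta)],
    with R_n(u, beta) = (1/n) sum_j 1(T_j >= u) exp(z_j' beta)."""
    n = len(T)
    eta = Z @ beta                        # linear predictors z_i' beta
    loss = 0.0
    for i in range(n):
        if delta[i] == 1:
            at_risk = T >= T[i]           # risk set R_i = {j : T_j >= T_i}
            Rn = np.mean(at_risk * np.exp(eta))
            loss -= eta[i] - np.log(Rn)
    return loss / n

def score(beta, T, delta, Z):
    """Gradient: -(1/n) sum_i delta_i [z_i - zbar_n(T_i, beta)], where zbar_n
    is the exp(z' beta)-weighted average of covariates over the risk set."""
    n, p = Z.shape
    eta = Z @ beta
    g = np.zeros(p)
    for i in range(n):
        if delta[i] == 1:
            w = (T >= T[i]) * np.exp(eta)
            zbar = (w @ Z) / w.sum()
            g -= Z[i] - zbar
    return g / n
```

A finite-difference check of `score` against `neg_log_partial_likelihood` is a quick way to validate such an implementation.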

Weighted group lasso for misspecified Cox models
In this section, we present the concepts and mathematics notations for the penalized misspecified Cox models with the group structure.
Many high-dimensional variables in microarray data and other scientific applications have a natural group structure. It is often better to divide the p variables into small sets of variables based on biological knowledge, see Kanehisa and Goto [14], Wang et al. [27]. Suppose that the p-dimensional covariate X is divided into G_n groups, the gth of size d_g for g ∈ {1, . . . , G_n}, so that Σ_{g=1}^{G_n} d_g = p. The number of groups is allowed to increase with the sample size n, even with G_n ≫ n. We define the two quantities d_max := max_{1≤g≤G_n} d_g and d_min := min_{1≤g≤G_n} d_g, which are crucial constants in the theoretical analysis.
For β ∈ R^p, let β_g be the sub-vector of β whose indexes correspond to the index set of the gth group of X. Given a proper tuning parameter λ, we are interested in the weighted group Lasso estimator, which achieves group sparsity. It is obtained as the solution of the convex optimization problem

β̂_n := argmin_{β∈R^p} { ℓ_n(β; T, z, Δ) + λ Σ_{g=1}^{G_n} w_g ‖β_g‖_2 },   (3.1)

where ‖·‖_2 refers to the Euclidean norm and w_g is a given weight. If all d_g are of size one and w_g = 1, then Σ_{g=1}^{G_n} w_g‖β_g‖_2 reduces to ‖β‖_1, which is essentially a Lasso problem. If all d_g are of size one and {w_j}_{j=1}^p are data-dependent weights (the weights only depend on the observed data), let W = diag{w_1, . . . , w_p}; then the weighted group Lasso penalty Σ_{g=1}^{G_n} w_g‖β_g‖_2 becomes the weighted Lasso penalty ‖Wβ‖_1. Increasing λ leads to shrinkage of the β_g toward zero, which indicates that some blocks of β diminish to zero simultaneously, and groups of predictors are eliminated from the model. Typically in the literature, one chooses w_g := √d_g to penalize groups of large size more heavily. For the adaptive group Lasso in Cox models, Yan and Huang [28] use w_g = √d_g / ‖β̃_g‖_2, where d_g is the size of group g and β̃_g is some consistent estimator of β_g.
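The blockwise shrinkage induced by the penalty in (3.1) can be made concrete through its proximal operator; the sketch below (an illustration, not the paper's algorithm) implements the group soft-thresholding map used by proximal gradient solvers for weighted group Lasso problems:

```python
import numpy as np

def group_soft_threshold(v, thresh):
    """Blockwise shrinkage: v -> max(0, 1 - thresh/||v||_2) * v."""
    norm = np.linalg.norm(v)
    if norm <= thresh:
        return np.zeros_like(v)
    return (1.0 - thresh / norm) * v

def prox_weighted_group_lasso(beta, groups, weights, lam):
    """Proximal operator of lam * sum_g w_g ||beta_g||_2; a proximal gradient
    step applies this map to beta - step * gradient."""
    out = beta.copy()
    for g, idx in enumerate(groups):
        out[idx] = group_soft_threshold(beta[idx], lam * weights[g])
    return out
```

Groups whose norm falls below the threshold are set exactly to zero, which is precisely how whole blocks of predictors are eliminated from the model as λ increases.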
Taking the subdifferential of the objective function (3.1), we get the first-order condition (3.2). (It is also called the Karush-Kuhn-Tucker (KKT) condition; see Sect. 2.2 of Huang et al. [13] for the un-grouped version.) From the adaptive estimation point of view, the weights in (3.1) can be determined from the observed data such that the KKT conditions (3.2) hold with high probability, for example 1 − p^r with r < 0. Applying concentration inequalities for martingales, the data-driven weights {w_j}_{j=1}^p are obtained from the KKT conditions with high probability, see Huang et al. [12] and the references therein. The aim of this work is to derive nonasymptotic oracle inequalities from a mathematical point of view. The choice of optimal adaptive weights and statistical inference (confidence intervals, testing coefficients, FDR control) are left for future studies.
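For reference, the first-order condition for the grouped penalty takes the following standard subdifferential form (written in the paper's notation; this is the textbook group Lasso KKT system, not a quotation of (3.2)):

```latex
% KKT conditions for (3.1): for each group g = 1, ..., G_n,
\nabla_g \ell_n(\hat{\beta}_n) + \lambda w_g \frac{\hat{\beta}_{n,g}}{\|\hat{\beta}_{n,g}\|_2} = 0
  \quad \text{if } \hat{\beta}_{n,g} \neq 0,
\qquad
\bigl\| \nabla_g \ell_n(\hat{\beta}_n) \bigr\|_2 \le \lambda w_g
  \quad \text{if } \hat{\beta}_{n,g} = 0.
```

The inactive-group inequality is what makes whole blocks vanish: a group survives only if its blockwise score exceeds the threshold λw_g.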
In the high-dimensional setting, we study oracle inequalities for estimation and prediction for the weighted group Lasso even when the number of groups greatly exceeds the sample size, i.e., G_n ≫ n. Define H* = {g : β*_g ≠ 0} as the group index set corresponding to the nonzero sub-vectors of β*.
The expected loss function is defined by

Pl(β; T, z, Δ) := E[l(β; T, z, Δ)].

Corresponding to the form of the estimator, the true parameter of the misspecified Cox models is the minimizer of the expected loss function,

β* := argmin_{β∈R^p} Pl(β; T, z, Δ),   (3.3)

which was pioneeringly studied in Struthers and Kalbfleisch [21] by characterizing the true parameter as the solution of an estimating equation, as neatly mentioned in the proof of Lemma 3.1 in Andersen and Gill [2].
Here, the expectation of the random variables in the model is unknown, and hence so is β*. By solving the optimization problem in (3.3), β* satisfies (3.4). In order to get a unique solution in (3.4), we require that the Hessian matrix of the expected loss function be positive definite. We aim to estimate the sparse β* and to predict the hazard function h(t|z_i(t)) conditionally on a given process z_i(t). To facilitate the technical proofs, additional assumptions are required. • (H.1) and (H.2) are standard assumptions in deriving consistency properties for regularized GLMs, see Blazere et al. [5], Zhang and Wu [33]. (H.2) is also used in Zhao et al. [35] for increasing-dimensional Cox models with interval-censored data. (H.3) has been addressed by Kong and Nan [16]. (H.4) makes sure the objective function for a minimizer of the population expected loss is strongly convex; a similar assumption is used in Andersen and Gill [2], Fan and Li [9].
As mentioned by one reviewer, we often assume that the data are generated from the model with some baseline hazard function and some true parameter β * . In (3.3), the true parameter is defined as the minimizer of true loss function. We present it in detail from Theorem 1 in Struthers and Kalbfleisch [21].

Lemma 3.1 (Consistency) Let the expectation E be taken with respect to randomness of
from the true model. Consider the following notations for r = 0, 1, 2: where, for a column vector a, a^{⊗2} refers to the matrix aa^τ, a^{⊗1} refers to the vector a, and a^{⊗0} refers to the scalar 1. Consider the following conditions.
• When the data are generated from the correctly specified Cox models (2.3), under Conditions 3.1 and 3.2, we have that the maximum partial likelihood estimator β̂ is a consistent estimator of β*, where β* is the solution to the equation • When the model is misspecified, i.e., suppose that the true hazard function is

Oracle inequalities for estimation and prediction
As a powerful mathematical tool, oracle inequalities provide deep insight into the nonasymptotic fluctuation of an estimator around the unknown true parameter. A comprehensive theory of oracle inequalities in high-dimensional regression has been developed for the Lasso and its generalizations, see Chap. 7 of Wainwright [26].

Key of nonasymptotic analysis
In this section, nonasymptotic oracle inequalities for weighted group Lasso estimates of Cox models are sought, along with the required restricted eigenvalue type assumptions (such as the group stabil condition). The proof leans on several steps:
• Step 1: To avoid ill behavior of the Hessian, propose a restricted eigenvalue condition or another analogous condition on the design matrix.
• Step 2: Find the tuning parameter based on a high-probability event (KKT conditions or other KKT-like conditions).
• Step 3: Based on the restricted eigenvalue assumption and the tuning parameter selection, derive the oracle inequalities via the optimality of the weighted group Lasso estimator, the minimizer of the unknown expected risk function, and some basic inequalities. There are three sub-steps:
-(i) Under the KKT-like conditions, show that the error vector β̂ − β* lies in a restricted set with structured sparsity, and moreover check that β̂ − β* lies in a large compact set;
-(ii) Show that the likelihood-based divergence between β̂ and β* can be lower bounded by some quadratic distance between β̂ and β*;
-(iii) By some elementary inequalities and (ii), show that Σ_{g=1}^{G_n} w_g ‖β̂_{n,g} − β*_g‖_2 lies in a smaller compact set with radius of the optimal rate (proportional to λ).
As mentioned by one reviewer, our general framework of the proof is quite standard, but the consecutive steps of defining high-probability events rely on nontrivial new results. For simplicity, we introduce and use the notation of empirical processes, see van der Vaart and Wellner [25].
Let X_1, . . . , X_n be a random sample from a measure P on a measurable space (X, A). We denote the empirical distribution as the discrete uniform measure P_n = n^{-1} Σ_{i=1}^n δ_{X_i}, where δ_x is the probability distribution that degenerates at x.
Given a measurable function f : X → R, we write P_n f for the expectation of f under the empirical measure P_n, and Pf for the expectation under P. Thus

P_n f = (1/n) Σ_{i=1}^n f(X_i),  Pf = ∫ f dP.

The collection {P_n f} is called an empirical process indexed by f. In fact, we treat P_n and P as operators rather than measures. It follows from (2.8) and P_n l(β; T, z, Δ) := ℓ̃_n(β; T, z, Δ) that

ℓ_n(β; T, z, Δ) = P_n l(β; T, z, Δ) − (1/n) Σ_{i=1}^n Δ_i log[R(T_i, β)/R_n(T_i, β)].
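The operator notation is easy to demystify numerically; in this small sketch (illustrative names, not from the paper), P_n f is just a sample mean and Pf the corresponding population expectation:

```python
import numpy as np

# P_n f is the sample mean of f(X_i); Pf is the population expectation.
# Illustration with f(x) = x**2 and X_i ~ Uniform(0, 1), so Pf = 1/3.
rng = np.random.default_rng(1)
X = rng.uniform(size=100_000)
f = lambda x: x**2
Pn_f = np.mean(f(X))   # empirical expectation P_n f
P_f = 1.0 / 3.0        # population expectation Pf
assert abs(Pn_f - P_f) < 0.01
```

The empirical process literature then studies how fast P_n f − Pf vanishes, uniformly over a class of functions f.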

Define some events with high probability
Using the definition of β̂_n in (3.1), we have

ℓ_n(β̂_n; T, z, Δ) + λ Σ_{g=1}^{G_n} w_g ‖β̂_{n,g}‖_2 ≤ ℓ_n(β*; T, z, Δ) + λ Σ_{g=1}^{G_n} w_g ‖β*_g‖_2.

Hence we get

Pl(β̂_n; T, z, Δ) ≤ Pl(β*; T, z, Δ) + λ Σ_{g=1}^{G_n} w_g (‖β*_g‖_2 − ‖β̂_{n,g}‖_2) + [ℓ_n(β*; T, z, Δ) − Pl(β*; T, z, Δ)] − [ℓ_n(β̂_n; T, z, Δ) − Pl(β̂_n; T, z, Δ)].   (4.3)

Then, by (4.1), the difference of the first and second bracketed terms in the right-hand side of (4.3) satisfies

[ℓ_n(β*; T, z, Δ) − Pl(β*; T, z, Δ)] − [ℓ_n(β̂_n; T, z, Δ) − Pl(β̂_n; T, z, Δ)] = (P_n − P)[l(β*; T, z, Δ) − l(β̂_n; T, z, Δ)] − D_n(β̂, β*),   (4.4)

where D_n(β̂, β*) := [ℓ_n(β̂_n; T, z, Δ) − ℓ̃_n(β̂_n; T, z, Δ)] − [ℓ_n(β*; T, z, Δ) − ℓ̃_n(β*; T, z, Δ)]. To obtain oracle inequalities for the weighted group Lasso applied to misspecified Cox models, it is necessary to study the rates of convergence of the empirical process (P_n − P)[l(β*; T, z, Δ) − l(β̂_n; T, z, Δ)] and of D_n(β̂, β*). The centralized empirical loss (P_n − P)[l(β*; T, z, Δ) − l(β̂_n; T, z, Δ)] and the normalized error D_n(β̂, β*) represent the fluctuation between the expected loss and the sample loss. It will be shown that both terms have stochastic Lipschitz properties with respect to Σ_{g=1}^{G_n} w_g ‖β̂_{n,g} − β*_g‖_2. Concentration inequalities are the essential tool to obtain an upper bound of (4.4) proportional to a regularization parameter, which ensures good statistical properties of the regularized estimator with high probability.
Plugging in t = T_i, we have the componentwise Taylor expansion. Considering the first term in (4.4), we have (4.5). To get the stochastic Lipschitz properties, we define the following two events: The random sum in event A_2 is not a sum of independent terms, which renders the problem more challenging. We need to check a uniform-in-β version of the event A_2. Concentration inequalities for suprema of empirical processes are powerful tools for verifying that event A_2 holds with high probability. This will be derived from Talagrand's sharper bounds for suprema of empirical processes, which generalize the Dvoretzky-Kiefer-Wolfowitz inequality, see Talagrand [22]. As with the indexing function class of the empirical distribution function, the boundedness assumption (H.1) on the components of z(t) guarantees the conditions for concentration of suprema of empirical processes.
Next, an upper bound is obtained for the centralized empirical process (P_n − P)[l(β*; T, z, Δ) − l(β̂_n; T, z, Δ)], where λ_a := λ_{a1} + λ_{a2}. This proposition states that the centralized empirical process is bounded from above by the tuning parameter multiplied by the weighted group Lasso norm of the difference between the estimated parameter β̂_n and the true parameter β*.
For the normalized error D_n(β̂, β*), set the following quantities, and observe that they involve a certain random variable t_s taking values in the compact set [0, τ].
By the first-order Taylor expansion of the function g_{t_s}(β) := log((1/n) Σ_{i=1}^n ·), let the corresponding mean value β̄ = (β̄_1, . . . , β̄_p)^τ lie between β*_j and β̂_j for each j = 1, 2, . . . , p. We have (4.9). From the following decomposition and an elementary inequality for |a_n b_n − ab|, where the last inequality follows by using assumptions (H.1)-(H.2): if β̂ ∈ S_M(β*) for some finite M, then the mean value β̄ ∈ S_M(β*) as well. Note that the summation (4.10) contains a common random variable t_s, which renders (4.10) a dependent summation. In order to bound the quotient and the two centralized summations, we introduce three events B_0, B_1, B_2, respectively. To solve the problem, we need concentration inequalities for the suprema of the empirical processes in {B_l}_{l=0}^2, uniformly in t ∈ [0, τ] and β ∈ S_M(β*), see Sect. 2.14 of van der Vaart and Wellner [25].
We aim to show that each event in {B_l}_{l=0}^2 holds with high probability. Writing B := B_0 ∩ B_1 ∩ B_2, it follows that B is also a high-probability event via the basic inequality P(B) ≥ P(B_0) + P(B_1) + P(B_2) − 2.
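Spelled out, this basic inequality is just the union bound applied to the complements:

```latex
\mathbb{P}(B) = \mathbb{P}(B_0 \cap B_1 \cap B_2)
\ge 1 - \sum_{l=0}^{2} \mathbb{P}\bigl(B_l^{c}\bigr)
= 1 - \sum_{l=0}^{2} \bigl(1 - \mathbb{P}(B_l)\bigr)
= \mathbb{P}(B_0) + \mathbb{P}(B_1) + \mathbb{P}(B_2) - 2.
```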
Based on (4.10), we obtain the following local stochastic Lipschitz condition under the event B, where λ_b can be viewed as the local stochastic Lipschitz constant. The following proposition is a similar but significant improvement of Corollary 2 in Kong and Nan [16], extending it from the Lasso to the group Lasso case and from the fixed design to the random design.
If the true model is sparse and log p = o(n), then the two propositions above illustrate that P(A), P(B) → 1 as p, n → ∞.

Sharp oracle inequalities from restricted eigenvalue conditions
In this section, we give sharp bounds for the estimation and prediction errors for Cox models using a weaker condition similar to the restricted eigenvalue condition of Bickel et al. [4]. The key condition for deriving oracle inequalities rests on the correlation between the covariates, i.e., on the behavior of the sample covariance matrix Σ̂_n = (1/n) Σ_{i=1}^n X_i X_i^τ, which is necessarily singular when p > n. Let S be any subset of {1, 2, . . . , p}. The restricted eigenvalue (RE) condition for the p × p matrix Σ̂_n is defined by

RE(η, S, Σ̂_n) := min_{0 ≠ δ ∈ C(η, S)} (δ^τ Σ̂_n δ)^{1/2} / ‖δ_S‖_2 > 0,   (4.13)

where C(η, S) := {δ : ‖δ_{S^c}‖_1 ≤ η‖δ_S‖_1} is the sparse restricted set. It should be noted that if we omit the restriction δ ∈ C(η, S), then (4.13) requires δ^τ Σ̂_n δ ≥ RE²(η, S, Σ̂_n) ‖δ_S‖²_2 for all δ, i.e., that the smallest eigenvalue of the sample covariance matrix Σ̂_n be positive, which is impossible when p > n (Σ̂_n is not of full rank). To avoid the low rank of Σ̂_n, Bickel et al. [4] consider the restricted eigenvalue condition on the sparse restricted set C(η, S) as a reasonable relaxation for sparse high-dimensional estimation. The restricted eigenvalue condition stems from restricted strong convexity, which enforces a type of strong convexity of the negative log-likelihood function of linear models on a certain sparse restricted set.
A shortcoming of (4.13) is that we cannot guarantee RE(η, S, Σ̂_n) > 0 with high probability. Instead, we replace Σ̂_n by its non-random version Σ := EΣ̂_n, together with a constant k > 0 and a relaxation constant ε > 0. Technically, for the group penalty we use a condition which is a modified version of the restricted eigenvalue condition presented in Blazere et al. [5] for generalized linear models. Denote by H* = {g : β*_g ≠ 0} the index set of the active groups and γ* := |H*|.

Definition (Group stabil condition) Let c_0, ε, k > 0 be given constants. The p × p non-random matrix Σ satisfies the group stabil condition GS(c_0, ε, k, H*) if

δ^τ Σ δ ≥ k Σ_{g∈H*} ‖δ_g‖²_2 − ε for all δ ∈ S(c_0, H*),

where the restricted set is defined as S(c_0, H*) := {δ ∈ R^p : Σ_{g∉H*} w_g ‖δ_g‖_2 ≤ c_0 Σ_{g∈H*} w_g ‖δ_g‖_2}. S(c_0, H*) is a restricted cone with group sparsity, similar to the condition used by Lounici et al. [18] to prove oracle inequalities for the group Lasso in linear models. Here ε is an error or relaxation term that can be set to zero, and k can be viewed as the smallest generalized eigenvalue of Σ.
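The group stabil condition can be sanity-checked numerically for a given Σ by sampling vectors from the cone S(c_0, H*). The toy check below is purely an illustration (an artificial positive-definite Σ, made-up group sizes and constants, none of which come from the paper): it verifies the quadratic lower bound on sampled cone members.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n_groups, d = 8, 4, 2                                   # 4 groups of size 2 (illustrative)
groups = [list(range(g * d, (g + 1) * d)) for g in range(n_groups)]
H_star = [0]                                               # active group index set H*
w = np.ones(n_groups)                                      # group weights w_g
Sigma = 0.5 * np.eye(p) + 0.5 * np.ones((p, p)) / p        # artificial PD Sigma (min eig 0.5)
c0, k_const, eps = 1.0, 0.1, 0.0                           # GS(c0, eps, k, H*) constants

def in_cone(delta):
    """Membership in S(c0, H*): inactive weighted norms <= c0 * active ones."""
    inactive = sum(w[g] * np.linalg.norm(delta[groups[g]])
                   for g in range(n_groups) if g not in H_star)
    active = sum(w[g] * np.linalg.norm(delta[groups[g]]) for g in H_star)
    return inactive <= c0 * active

checked = 0
for _ in range(2000):
    delta = rng.normal(size=p)
    delta[2:] *= 0.1                  # bias draws toward the cone
    if in_cone(delta):
        dH = np.concatenate([delta[groups[g]] for g in H_star])
        # group stabil inequality: delta' Sigma delta >= k ||delta_{H*}||^2 - eps
        assert delta @ Sigma @ delta >= k_const * (dH @ dH) - eps
        checked += 1
assert checked > 0
```

For this Σ the check passes trivially because its smallest eigenvalue (0.5) already dominates k; the interest of the condition is that it can hold on the cone even when Σ is singular off the cone.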
If we assume that the group stabil condition is satisfied by the covariance matrix Σ := E[z(t)z^τ(t)] on the restricted cone S(c_0, H*) with δ = β̂_n − β*, then we check that β̂_n − β* ∈ S(1, H*) holds with high probability. With the preparations above, we are now able to present the main result of this paper, which provides sharp and minimax-optimal bounds for the estimation and prediction errors when the true model is sparse and log p is small compared to n, with the tuning constants given by (4.7) and (4.12).
Then, with high probability, we have β̂_n − β* ∈ S(1, H*) and the stated estimation error bound, where c_1 > 0 is a constant given in (H.4). Moreover, if a new covariate z*(t) (the test data) is an independent copy of z(t) (the training data) and E* denotes expectation with respect to z*(t) only, then the square prediction error under Δ = 1 follows. Consider ε_n = 0. The obtained results for the fixed design are analogous to the bounds in Lounici et al. [18], who show the optimal convergence rate of the group Lasso estimator for linear models under fixed design. Note that if γ* = O(1), then the bound on the estimation error is of order O(√(log p / n)) + O(√(log(G_n) / n)), and the weighted group Lasso estimator remains consistent in the ℓ_{2,1}-estimation error and the square prediction error under the group stabil condition if the number of groups increases almost as fast as e^{o(n)}. The terms √(log p) and √(log G_n) are the price to pay for the unknown group sparsity of β*. If the relaxation error ε_n is of larger order than λ, the convergence rate for the estimation error Σ_{g=1}^{G_n} w_g ‖β̂_{n,g} − β*_g‖_2 becomes ε_n. From Theorem 4.1, if all d_g = 1, analogous results for the un-weighted Lasso penalty follow, as shown next.
Then, with probability at least the stated level, we have β̂_n − β* ∈ S(1, H*) and the stated bounds. Corollary 4.1 presents an upper bound on the ℓ_1-estimation error, similar to the existing result in Theorem 3.2 of Huang et al. [13] for classical Lasso penalized Cox models. The advantage of Corollary 4.1 is that the restricted eigenvalue condition is not stochastic, whereas Theorem 3.2 in Huang et al. [13] requires further analysis of the restricted eigenvalue condition to guarantee a high-probability event. Another significant difference is that the oracle inequalities in Huang et al. [13] require the sample size to be larger than a given constant, while our oracle inequalities are valid for any finite n on the given high-probability event.

Proofs of Theorem 4.1
The proof is based on the following three steps.
Step 2: Find a lower bound for P[l(β̂_n; T, z, Δ) − l(β*; T, z, Δ)]. The next proposition provides the desired lower bound.
where c_1 > 0 is a constant given in (H.4).
From Proposition 5.1 and (5.4), it is deduced that the following bound holds. Note that the term Σ_{g∉H*} w_g ‖β̂_{n,g} − β*_g‖_2 = Σ_{g∉H*} w_g ‖β̂_{n,g}‖_2, which we discarded at the first inequality sign in the expression above, is very small on the set {g : β*_g = 0}.
Then using the oracle inequality for Σ_{g=1}^{G_n} w_g ‖β̂_{n,g} − β*_g‖_2 leads to the stated bounds. Finally, we conclude the proof by using Propositions 4.1 and 4.2: the desired oracle inequalities hold with high probability on the event A ∩ B.

Proof of Proposition 4.1
First, we show that the summation in event A_1 concentrates by applying a Hoeffding-type bounded difference inequality, see Wainwright [26].

Lemma 5.2
Suppose that X_1, . . . , X_n are independent random vectors all taking values in a set A, and assume that f : A^n → R is a function satisfying the bounded difference condition

|f(x_1, . . . , x_k, . . . , x_n) − f(x_1, . . . , x̃_k, . . . , x_n)| ≤ L_k for all x_1, . . . , x_n, x̃_k ∈ A and k = 1, . . . , n.

Then, for all t > 0,

P(|f(X_1, . . . , X_n) − E f(X_1, . . . , X_n)| ≥ t) ≤ 2 exp(−2t² / Σ_{k=1}^n L_k²).

If there are no absolute value signs in the above event (the one-sided version), then the upper bound becomes exp(−2t² / Σ_{k=1}^n L_k²). Similar to the treatment of A_1, let Z^g_{ij}(β), j = 1, . . . , d_g; i = 1, . . . , n, be defined as above. Then the map f(z_1, . . . , z_{k−1}, z̃_k, z_{k+1}, . . . , z_n) satisfies, for fixed j, the bounded difference condition (5.15) for all z_1, . . . , z_n, z̃_k.
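A quick simulation makes the content of the bounded difference inequality tangible. Here (purely illustrative, not the paper's setting) f is the sample mean of [0, 1]-valued variables, so each L_k = 1/n and the bound reads 2 exp(−2nt²):

```python
import numpy as np

# f(X) = mean of X_1, ..., X_n with X_i in [0, 1]: changing one coordinate
# moves f by at most L_k = 1/n, so sum L_k^2 = 1/n and the bounded difference
# inequality gives P(|f - Ef| >= t) <= 2 exp(-2 n t^2).
rng = np.random.default_rng(3)
n, reps, t = 50, 20_000, 0.15
devs = np.abs(rng.uniform(size=(reps, n)).mean(axis=1) - 0.5)
empirical = np.mean(devs >= t)          # Monte Carlo tail probability
bound = 2.0 * np.exp(-2.0 * n * t**2)   # McDiarmid-type bound
assert empirical <= bound
```

The empirical tail here is far below the bound, as expected: the inequality is distribution-free and must cover worst-case bounded variables.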
It is sufficient to estimate sharper upper bounds of E[sup_{β∈S_M(β*)} |(1/n) Σ_{i=1}^n Z^g_{ij}(β)|] by the symmetrization theorem and the contraction theorem below, which can be found in van der Vaart and Wellner [25], Wainwright [26].
For any w_i(β), we can find a sequence of random vectors {a_i}_{i=1}^n ⊂ R^p with ‖a_i‖_∞ = 1 and a vector b ∈ R^p with ‖b‖_1 ≤ L such that the representation below holds. Then, by Hölder's inequality, we obtain the corresponding bound. Next, we are going to use the following maximal inequality for bounded variables; see [31] for more discussion.

Lemma 5.4 (Maximal inequality)
Let X_1, . . . , X_n be independent random vectors taking values in a measurable space X, and let f_1, . . . , f_p be real-valued functions on X satisfying, for all j = 1, . . . , p and all i = 1, . . . , n, the stated boundedness conditions. By Lemma 5.4, with E[ε_i a_ij] = 0 and |ε_i a_ij| ≤ max_{1≤i≤n} ‖a_i‖_∞ = 1, we get the desired bound. Therefore, (5.13) can be further bounded by a suitable choice of λ_{a2}. Setting 2d_max G_n exp(−·) equal to the target probability level, we finally obtain the stated bound. Together with (5.12), this gives the claim, and (4.6) is obtained by using (4.5) conditioned on the event A_1 ∩ A_2.

Proof of Proposition 4.2
For the event B_0, we need an exponential concentration inequality for the uniform convergence of the empirical distribution function.

Lemma 5.5 (DKW inequality, Massart [19]) For every ε > 0, the DKW inequality bounds the probability that the empirical distribution function F_n differs from F by more than ε:

P(sup_{x∈R} |F_n(x) − F(x)| > ε) ≤ 2e^{−2nε²}.

Dvoretzky, Kiefer, and Wolfowitz [8] proved the inequality with an unspecified multiplicative constant in front of the exponential tail bound; Massart [19] showed that the multiplicative constant can be taken to be 2. Let p_τ := P(T_1 ≥ τ) = 2Ue^{LB}, so U = p_τ e^{−LB}/2. We then have the corresponding bound. Let (F, ‖·‖) be a subset of a normed space of real functions f : X → R on some set X. Define the L_r(Q)-norm by ‖f‖_{L_r(Q)} = (∫ |f|^r dQ)^{1/r}; for a probability measure Q, the space L_r(Q) is endowed with this norm. Given two functions l(·) and u(·), the bracket [l, u] is the set of all functions f with l ≤ f ≤ u. In what follows, we assume that z(t) is non-random. For {B_{1gj}} in (5.22), we have the function classes

{ f_{t,β} = [z_{1j}(t)e^{z^τ(t)β} + Le^{LB}] w_min / (2Le^{LB} w_g) : t ∈ [0, τ], β ∈ R^p, j = 1, . . . , d_g; g = 1, . . . , G_n },

so 0 ≤ f_{t,β}(x, z) ≤ 1.
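The sup-distance appearing in the DKW inequality is easy to compute exactly for a uniform sample via order statistics; the sketch below (illustrative helper names, not the paper's code) computes it together with Massart's bound 2e^{−2nε²}:

```python
import numpy as np

def ks_uniform(x):
    """For a sample x from U(0,1), return sup_t |F_n(t) - t| exactly.
    Since F_n jumps only at the order statistics x_(i), the supremum is
    attained at one of the points i/n - x_(i) or x_(i) - (i-1)/n."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = len(xs)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - xs), np.max(xs - (i - 1) / n))

def dkw_bound(n, eps):
    """Massart's version of the DKW bound: P(sup |F_n - F| > eps) <= 2 e^{-2 n eps^2}."""
    return 2.0 * np.exp(-2.0 * n * eps**2)

# deterministic check on a tiny sorted sample: the sup deviation occurs
# just left of 0.4, where F_n = 2/3 and F = 0.4
d = ks_uniform(np.array([0.1, 0.4, 0.8]))
assert abs(d - (2/3 - 0.4)) < 1e-12
```

Massart's constant 2 is essentially unimprovable, since the bound matches the leading term of the asymptotic Kolmogorov distribution.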
Note that U = p_τ e^{−LB}/2; thus we put 2Le^{LB}t. Then we have the bound with the tuning parameter λ_{b1} determined by

λ_{b1} = 4tL d_g / (p_τ e^{−2LB} w_min) = 2√2 L A e^{2LB} d_g / (p_τ w_min) · √(log(G_n)/n).
For the event B 2 , we have K = √ 2 and V = 2 in Lemma 5.6. Define e -2nt 2 .

Conclusions and future study
In this paper, we focus on the survival analysis problem via proportional hazards regressions, including situations in which both the number of covariates p and the sample size n are increasing, with p ≫ n. When p > n, the classical partial likelihood estimation is over-parameterized and requires Lasso or weighted group Lasso regularization to obtain a stable and satisfactory fit of the proportional hazards regression. Under the group stabil condition, sharp oracle inequalities for weighted group Lasso regularized misspecified Cox models are derived. The upper bound on the ℓ_{2,1}-estimation error is determined by the tuning parameter, with rate O(√(log p / n)) + O(√(log(G_n) / n)). The obtained nonasymptotic oracle inequalities imply that the penalized estimator is consistent when log p / n → 0 under mild conditions. The rate is optimal in the minimax sense.
Statistical inference for the penalized estimator (confidence intervals, testing for coefficients, FDR control) is left for future study.