Compound Poisson Point Processes, Concentration and Oracle Inequalities

This note presents several new theoretical results for compound Poisson point processes, following the work of Zhang \emph{et al.} [Insurance~Math.~Econom.~59(2014), 325-336]. The first part provides a new characterization of the discrete compound Poisson point process (proposed by {Acz{\'e}l} [Acta~Math.~Hungar.~3(3)(1952), 219-224]), which extends the characterization of the Poisson point process given by Copeland and Regan [Ann.~Math.~(1936): 357-362]. Next, we derive concentration inequalities for discrete compound Poisson point processes (the negative binomial random variable with unknown dispersion is a significant example). These concentration inequalities are potentially useful in count data regression. We give an application to weighted Lasso penalized negative binomial regression, showing that the KKT conditions of the penalized likelihood hold with high probability, and we then derive non-asymptotic oracle inequalities for the weighted Lasso estimator.


Introduction
Powerful and useful concentration inequalities for empirical processes have been derived one after another for several decades. The various types of concentration inequalities are mainly based on moment conditions (such as sub-Gaussian, sub-exponential and sub-Gamma conditions) and bounded difference conditions; see Boucheron et al. (2013). The central task is to evaluate, in probability, the fluctuation of an empirical process around some value (for example, its mean or median). Attention has greatly expanded into various areas of research, such as high-dimensional statistics. For Poisson or negative binomial count data regression (Hilbe (2011)), it should be noted that the Poisson distribution (the limit distribution of the negative binomial is Poisson) is not sub-Gaussian, but is locally sub-Gaussian (Chareka et al. (2006)) or sub-exponential (see Sect. 2.1.3 of Wainwright (2019)). However, few results are available on concentration inequalities for a weighted sum of negative binomial (NB) random variables and their statistical applications. A notable example, with applications to the segmentation of RNA-Seq data, is Cleynen and Lebarbier (2014), who obtained a concentration inequality for a sum of independent centered negative binomial random variables; its derivation hinges on the assumption that the dispersion parameter is known, in which case the negative binomial random variable belongs to the exponential family. But if the dispersion parameter of a negative binomial random variable $X$ is unknown, as in real-world problems, $X$ does not belong to the exponential family with density $f(x;\theta) = h(x)\exp\{\eta(\theta)T(x) - A(\theta)\}$, where $T(x)$, $h(x)$, $\eta(\theta)$ and $A(\theta)$ are given functions. Nevertheless, it is well known that the Poisson and negative binomial distributions belong to the family of infinitely divisible distributions.

Houdré (2002) studies dimension-free concentration inequalities for non-Gaussian infinitely divisible random vectors with finite exponential moments; in particular, the geometric case, a special case of the negative binomial, is treated there. Via the entropy method, Kontoyiannis and Madiman (2006) give a simple derivation of concentration inequalities for a Lipschitz function of a discrete compound Poisson random variable. Nonetheless, when deriving oracle inequalities for Lasso or Elastic-net estimates (see Ivanoff et al. (2016), Zhang and Jia (2017)), it is strenuous to use the results of Kontoyiannis and Madiman (2006) or Houdré (2002) to obtain an explicit expression for the tuning parameter under the KKT conditions of the penalized negative binomial likelihood (Poisson regression is a particular case). Moreover, when the negative binomial responses are treated as sub-exponential distributions (distributions satisfying a Cramér-type condition), the upper bound in the $\ell_1$-estimation error oracle inequality is determined by a tuning parameter with rate $O(\frac{\log p}{n})$, which does not show rate-optimality in the minimax sense, although it is sharper than $O(\sqrt{\frac{\log p}{n}})$. Consider the linear regression $Y = X\beta^* + \varepsilon$ with noise $\operatorname{Var}\varepsilon = \sigma^2 I_n$ and $\beta^*$ the $p$-dimensional true parameter. Under an $\ell_0$-constraint on $\beta^*$, Raskutti et al. (2011) show that the lower bound on the minimax prediction error over all estimators $\hat\beta$ is
$$\inf_{\hat\beta}\ \sup_{\|\beta^*\|_0\le s_0}\ \frac{1}{n}\|X\hat\beta - X\beta^*\|_2^2 \ \ge\ O\Big(\frac{s_0\log(p/s_0)}{n}\Big)$$
(where $X$ is an $n\times p$ design matrix) with probability at least $1/2$. Thus, the optimal minimax rate for $\|\hat\beta - \beta^*\|_1$ should be proportional to the root of the rate $O(\frac{\log p}{n})$.
A similar minimax lower bound for Poisson regression is provided in Jiang et al. (2015); see Chap. 4 of Rigollet and Hütter (2019) for more minimax theory. This paper is motivated by high-dimensional count regression: we derive useful NB concentration inequalities under the compound Poisson framework. The obtained concentration inequalities are beneficial for optimal high-dimensional inference. Our concentration inequalities concern sums of independent centered weighted NB random variables, which is more general than the un-weighted case in Cleynen and Lebarbier (2014), where the over-dispersion parameter of the NB random variables is assumed known; in our case, the over-dispersion parameter can be unknown. As a by-product, we propose a characterization of discrete compound Poisson point processes, following the work in Zhang et al. (2014) and Zhang and Li (2016). Städler et al. (2010) study oracle inequalities for $\ell_1$-penalized finite mixtures of Gaussian regressions, but they do not consider mixtures of count data regressions (such as mixtures of Poisson regressions; see Yang et al. (2019) for example). Here, NB regression can be approximately viewed as a finite mixture of Poisson regressions, since the NB distribution is a continuous mixture of Poisson distributions in which the mixing distribution of the Poisson rate is gamma. The paper ends with some applications to high-dimensional NB regression in terms of oracle inequalities.
The paper is organized as follows. Section 2 is an introduction to discrete compound Poisson point processes (DCPP). We give a theorem characterizing DCPP, in the spirit of the result that an initial condition, stationary and independent increments, and a condition on the jumps together characterize a compound Poisson process; see Wang and Ji (1993) and the references therein. In Sect. 3, we derive concentration inequalities for discrete compound Poisson point processes with infinite-dimensional parameters. As an application, for the optimization problem of weighted Lasso penalized negative binomial regression, we show that the true-parameter version of the KKT conditions holds with high probability, and the optimal data-driven weights in the weighted Lasso penalty are determined. We also establish oracle inequalities for weighted Lasso estimates in NB regression, whose convergence rate attains the minimax rate derived in the references above.

Introduction to discrete compound Poisson point processes
In this section, we present preliminaries on discrete compound Poisson point processes and obtain a new characterization of the point process based on five assumptions.

Definition and preliminaries
To begin with, we need to be familiar with the definition of the discrete compound Poisson distribution and its relation to the weighted sum of Poisson random variables. For more details, we refer the reader to Sect. 9.3 of Johnson et al. (2005), Zhang et al. (2014), Zhang and Li (2016) and the references therein.
Definition 2.1. A random variable $Y$ is said to follow a discrete compound Poisson (DCP) distribution if its probability generating function has the form
$$\mathrm{E}\,z^Y = \exp\Big\{\sum_{i=1}^{\infty}\alpha_i\lambda(z^i - 1)\Big\}, \qquad (1)$$
where $(\alpha_1\lambda, \alpha_2\lambda, \ldots)$ are infinite-dimensional parameters satisfying $\sum_{i=1}^{\infty}\alpha_i = 1$, $\alpha_i \ge 0$, $\lambda > 0$. If $\alpha_i = 0$ for all $i > r$ and $\alpha_r \neq 0$, we say that $Y$ is DCP distributed of order $r$. If $r = +\infty$, the DCP distribution has infinite-dimensional parameters and we say it is DCP distributed of order $+\infty$. When $r = 1$, it is the well-known Poisson distribution for modeling equi-dispersed count data. When $r = 2$, we call it the Hermite distribution, which can be applied to model over-dispersed and multi-modal count data; see Giles (2010).
Equation (2) below is the canonical representation of the characteristic function of a non-negative valued infinitely divisible random variable, i.e. a special case of the Lévy-Khinchine formula (see Chap. 2 of Sato (2013) and Sect. 1.6 of Petrov (1995)):
$$\varphi_Y(t) = \exp\Big\{\mathrm{i}at - \tfrac{1}{2}\sigma^2 t^2 + \int_{\mathbb{R}\setminus\{0\}}\big(e^{\mathrm{i}tx} - 1 - \mathrm{i}tx\, I_{\{|x|\le 1\}}(x)\big)\,\nu(dx)\Big\}, \qquad (2)$$
where $a \in \mathbb{R}$, $\sigma \ge 0$, $I_{\{\cdot\}}(x)$ is the indicator function, and $\nu(\cdot)$ is called the Lévy measure, subject to the restriction $\int_{\mathbb{R}\setminus\{0\}}\min(x^2, 1)\,\nu(dx) < \infty$. Let $\delta_x(t) := 1_{\{x\}}(t)$ and let $\mathbb{N}^+$ be the set of positive integers. If we set $\nu(t) = \sum_{x\in\mathbb{N}^+}\lambda\alpha_x\delta_x(t)$, then $\nu(\cdot)$ is the Lévy measure in the Lévy-Khinchine formula. It is easy to see that $Y$ has a weighted Poisson decomposition $Y = \sum_{i=1}^{\infty} iN_i$, where the $N_i$ are independent and Poisson distributed with mean $\lambda\alpha_i$. This decomposition is also called a Lévy-Itô decomposition; see Chap. 4 of Sato (2013). If $\mathrm{E}e^{\theta Y} < \infty$ for $\theta$ in a neighbourhood of zero, the moment generating function (m.g.f.) of the DCP distribution is $\mathrm{E}e^{\theta Y} = \exp\{\sum_{i=1}^{\infty}\alpha_i\lambda(e^{i\theta} - 1)\}$. In order to define the discrete compound Poisson point process, we shall repeatedly use the concept of a Poisson random measure. Let $A$ be any measurable set in a measurable space $(E, \mathcal{A})$; a good example is $E = \mathbb{R}^d$. In the following sections, we let $E = \mathbb{R}^d$ and denote by $N(A, \omega)$ the number of random points in the set $A$. We introduce Poisson point processes below; they are sometimes called Poisson random measures (Sato (2013), Kingman (1993)).
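To make the decomposition concrete, here is a minimal sketch in R (the language of the simulations later in the paper) that draws a DCP variable of order 3 via $Y = \sum_i iN_i$ and checks the moments $\mathrm{E}Y = \lambda\sum_i i\alpha_i$ and $\operatorname{Var}Y = \lambda\sum_i i^2\alpha_i$; the parameter values are illustrative.

```r
# Simulate a DCP variable via its weighted Poisson decomposition Y = sum_i i*N_i
# and compare empirical moments with E[Y] = lambda*sum(i*alpha_i) and
# Var(Y) = lambda*sum(i^2*alpha_i).
set.seed(1)
lambda <- 5; alpha <- c(0.6, 0.3, 0.1)   # DCP of order r = 3
i <- seq_along(alpha)
Y <- replicate(1e5, sum(i * rpois(length(alpha), lambda * alpha)))
c(mean = mean(Y), theory = lambda * sum(i * alpha))
c(variance = var(Y), theory = lambda * sum(i^2 * alpha))
```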
Definition 2.2. Let $(E, \mathcal{A}, \mu)$ be a measurable space with $\sigma$-finite measure $\mu$, and let $\bar{\mathbb{N}}$ be the non-negative integer set. The Poisson random measure with intensity $\mu$ is a family of random variables $\{N(A, \omega)\}_{A\in\mathcal{A}}$ (defined on some probability space $(\Omega, \mathcal{F}, P)$), viewed as the product map $N : \mathcal{A}\times\Omega \to \bar{\mathbb{N}}$, satisfying:
1. $\forall A\in\mathcal{A}$, $N(A, \cdot)$ is a Poisson random variable on $(\Omega, \mathcal{F}, P)$ with mean $\mu(A)$;
2. $\forall \omega\in\Omega$, $N(\cdot, \omega)$ is a counting measure on $(E, \mathcal{A})$;
3. if the sets $A_1, A_2, \ldots, A_n \in \mathcal{A}$ are disjoint, then $N(A_1, \cdot), N(A_2, \cdot), \ldots, N(A_n, \cdot)$ are mutually independent.
Let $N(A) := \sum_{k=1}^{\infty} kN_k(A)$, where the $\{N_k(A)\}_{A\in\mathcal{A}}$ are independent Poisson random measures with mean measure $\alpha_k\int_A\lambda(x)\,dx$; we call $N(A)$ a discrete compound Poisson point process. We can see that $\lambda(\cdot)$ is the intensity function of the generating Poisson point processes $\{N_k(A)\}_{A\in\mathcal{A}}$ for each $k$. Aczél (1952) derives the p.m.f. and calls it the inhomogeneous composed Poisson distribution for $d = 1$. If the intensity function $\lambda(x)$ is constant (i.e. the mean measure is a multiple of the Lebesgue measure), $N$ is said to be a homogeneous point process.
Define the probability $P_k(A)$ by $P(N(A) = k)$. If $N(A)$ follows a DCP point process with intensity measure $\lambda(A) := \int_A\lambda(x)\,dx$, then the probability of having $k$ ($k \ge 0$) points in the set $A$ is given by
$$P_k(A) = \sum_{s:\,R(s,k)=k}\ \prod_{t=1}^{k}\frac{(\alpha_t\lambda(A))^{s_t}}{s_t!}\,e^{-\alpha_t\lambda(A)}, \qquad (3)$$
where $R(s,k) := \sum_{t=1}^{k} ts_t$ and the sum runs over non-negative integers $s_1, \ldots, s_k$; see Aczél (1952) for the proof.
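Formula (3) can be checked numerically in the distributional case (a single set with $\lambda(A) = \lambda$): since $N(A) = \sum_i iN_i(A)$, the p.m.f. is also a convolution of the p.m.f.s of the $iN_i$. The following R sketch compares that convolution with Monte Carlo frequencies for an illustrative order-3 example.

```r
# p.m.f. of Y = sum_i i*N_i (order r = 3) by convolving the p.m.f.s of the
# independent summands i*N_i, checked against simulation.
set.seed(1)
lambda <- 2; alpha <- c(0.5, 0.3, 0.2); kmax <- 30
pmf <- c(1, rep(0, kmax))                        # point mass at 0
for (i in seq_along(alpha)) {
  p_i <- rep(0, kmax + 1)                        # p.m.f. of i*N_i on {0, i, 2i, ...}
  idx <- seq(0, kmax, by = i)
  p_i[idx + 1] <- dpois(idx / i, alpha[i] * lambda)
  pmf <- sapply(0:kmax, function(k) sum(pmf[1:(k + 1)] * p_i[(k + 1):1]))
}
Y <- replicate(1e5, sum(seq_along(alpha) * rpois(length(alpha), alpha * lambda)))
rbind(exact = pmf[1:6], empirical = sapply(0:5, function(k) mean(Y == k)))
```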

A new characterization of DCP point processes
Based on these preliminaries, we give a new characterization of DCP point processes, which extends Copeland and Regan (1936). A similar characterization of the DCP distribution and process (not of the point process in terms of random measures) was derived by Wang and Ji (1993). For more characterizations of point processes, see the monograph Last and Penrose (2017) and the references therein.
Theorem 2.1 (Characterization of DCP point processes). Consider the following assumptions. If $P_k(A)$ satisfies assumptions 1-5, then there exists a measure $\nu(A)$ (countably additive and absolutely continuous) such that $P_k(A)$ is represented by equation (3) for all $k$ and $A$.
For $A$ a closed interval in $\mathbb{R}$, this setting reduces to the famous characterization of the Poisson process; see p. 447 of Feller (1968) (the postulates for the Poisson process). The proof of Theorem 2.1 is given in the Appendix; it consists of showing the countable additivity and absolute continuity of the intensity measure and solving a matrix differential equation.
As a random measure, the DCPP is a mathematical extension of the discrete compound Poisson distribution and of the discrete compound Poisson process indexed by the real line. Baraud and Birgé (2009) used histogram-type estimators to estimate the intensity of a Poisson random measure. In a recent application, Das (2018) proposes a nonparametric Bayesian approach to RNA-seq count data via DCP processes. For a set of regions $\{A\in\mathcal{A}\}$ of interest on the genome, such as genes, exons, or junctions, the number of reads is counted, and the count in region $A_i$ is viewed as a DCP process for gene expression. The $P_k(A)$ satisfies assumptions 1-5 of Theorem 2.1 in this real-world setting (the process generating the expression level of each gene in the experiment).

Concentration inequalities for discrete compound Poisson point process
This section is devoted to the construction of concentration inequalities for DCPP, with applications to high-dimensional $\ell_1$-regularized negative binomial regression; see Sect. 4.
As a motivating example, if the DCPP has finitely many parameters, existing results such as Lemma 3 of Baraud and Birgé (2009) can be used to obtain a concentration inequality for the sum of DCP random variables.
In the sequel, without employing Cramér-type conditions, we derive a Bernstein-type concentration inequality for DCPP with infinite-dimensional parameters ($r = +\infty$). Theorem 3.1 below is a weighted-sum version that differs from Proposition 3.1.
Theorem 3.1 gives the concentration inequality for a stochastic integral of the DCPP, with the variance-type constants defined there. Following the lines of the proof of Corollary 3.1 below, it is not hard to obtain Theorem 3.1 for a linear combination of random variables, which is the form most relevant to the applications in Sect. 4. The proof of Theorem 3.1 is given in the Appendix; it rests essentially on the fact that infinitely divisible distributions are closed under convolution and scaling. Corollary 3.1 is the version for a sum of $n$ independent random measures.
Correspondingly, for a sequence of measurable functions defined on the sets $S_i$, the analogous inequality holds. The difference between Proposition 3.1 and Corollary 3.1 is that the DCP random variables in Proposition 3.1 have order $r < +\infty$, while Corollary 3.1 concerns DCP point processes (a particular case of DCPP is that of constant intensity $\lambda(A) = \lambda$) of order $r = +\infty$. Moreover, Proposition 3.1 requires specified variance factors $\sigma_i^2$ for the DCP random variables.
Remark 1. Significantly, Corollary 3.1 is a weighted-sum version, and it improves the results of Cleynen and Lebarbier (2014), which deal with the un-weighted sum of NB random variables (a special case of the DCP distribution). Moreover, the result of Cleynen and Lebarbier (2014) relies on Lemma 3.1, which depends on a sub-Gamma condition. It should be noted that we do not impose any boundedness restriction on $y$ in our concentration inequalities (7) and (8), while Theorem 1 of Houdré (2002) and the concentration inequalities of Houdré and Privault (2002) require that the infinitely divisible random vector satisfy Cramér-type conditions and that the $y$ in $P(X - \mathrm{E}X \ge y)$ be bounded by a given constant. The moment conditions in Lemma 3.1 are sometimes difficult to check, making them hard to apply to the Lasso KKT condition in Sect. 4.1 when the count data follow other infinitely divisible discrete distributions. In our Theorem 3.1 and Corollary 3.1, we only need to check that the means of the DCP processes are finite; Theorem 3.1 and Corollary 3.1 need neither a Bernstein-type condition (see Theorem 2.8 in Petrov (1995) and p. 27 of Wainwright (2019)) nor the higher-moment condition proposed by Kontoyiannis and Madiman (2006) for the given discrete random variable $Z$.
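To illustrate the phenomenon quantified by Corollary 3.1, the following R sketch estimates the tail $P(S \ge y)$ of a centered weighted sum of NB variables by Monte Carlo; the weights and parameter values are hypothetical choices, not the constants of the corollary.

```r
# Empirical tail of S = sum_i a_i*(Y_i - E[Y_i]) for weighted NB variables;
# for NB(size = theta, mu), Var(Y) = mu + mu^2/theta.
set.seed(1)
n <- 200; theta <- 2; mu <- 3
a <- runif(n, 0.5, 1)                            # arbitrary weights in [0.5, 1]
S <- replicate(1e4, sum(a * (rnbinom(n, size = theta, mu = mu) - mu)))
y <- seq(0, 150, by = 30)
cbind(y = y, tail = sapply(y, function(t) mean(S >= t)))   # P(S >= y) decays fast
```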
Before giving the proof of Corollary 3.1, we need some notation and a lemma. For a general point process $N(A)$ defined on $\mathbb{R}^d$, let $f$ be any non-negative measurable function on $E$. The Laplace functional is defined by $L_N(f) := \mathrm{E}\exp\{-\sum_{x\in\Pi}f(x)\}$, where $\Pi$ denotes the random set of points of the point process $N(A)$. The Laplace functional of a random measure is a crucial numerical characteristic that enables us to handle stochastic integrals of the point process. Moreover, we rely on the following theorem due to Campbell.

Theorem 3.2 (Campbell's theorem; Kingman (1993)). Let $N(A)$ be a Poisson random measure with intensity $\lambda(t)$ and let $f : \mathbb{R}^d \to \mathbb{R}$ be measurable. The random sum $S = \sum_{x\in\Pi}f(x)$ is absolutely convergent with probability one if and only if $\int_{\mathbb{R}^d}\min(|f(x)|, 1)\lambda(x)\,dx < \infty$. Assume this condition holds; then, for any complex number $\theta$, the equation
$$\mathrm{E}\,e^{\theta S} = \exp\Big\{\int_{\mathbb{R}^d}\big(e^{\theta f(x)} - 1\big)\lambda(x)\,dx\Big\}$$
holds whenever the integral on the right-hand side converges.
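As a numerical sanity check on Campbell's formula, the following R sketch compares a Monte Carlo estimate of $\mathrm{E}e^{\theta S}$ with the closed form for a homogeneous Poisson process on $[0,1]$ and the test function $f(x) = x^2$; all values are illustrative.

```r
# Campbell's theorem check: E exp(theta*S) vs exp(Int (e^{theta*f(x)} - 1)*lam dx)
# for a rate-lam homogeneous Poisson process on [0, 1], f(x) = x^2.
set.seed(1)
lam <- 10; theta <- 0.3; f <- function(x) x^2
S <- replicate(2e4, { k <- rpois(1, lam); sum(f(runif(k))) })
mc <- mean(exp(theta * S))
exact <- exp(integrate(function(x) (exp(theta * f(x)) - 1) * lam, 0, 1)$value)
c(monte_carlo = mc, exact = exact)
```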
Proof. We can compute the DCPP's Laplace functional directly from Campbell's theorem by choosing the appropriate test function. The resulting Laplace functional of the compound random measure resembles the Lévy-Khintchine representation. Define, for $\eta > 0$, the normalized exponential transform $D_r(\eta)$ of the stochastic integral as in (10). Therefore, we obtain $\mathrm{E}D_r(\eta) = 1$.

Application to KKT conditions
For a negative binomial random variable $NB$, the probability mass function (p.m.f.) is
$$P(NB = k) = \binom{k+s-1}{k}p^s q^k, \qquad k = 0, 1, 2, \ldots,$$
where $s$ is a positive real number. When $s$ is a positive integer, the negative binomial distribution reduces to the Pascal distribution, which models the number of failures before the $s$-th success in repeated mutually independent Bernoulli trials (with probability of success $p = 1 - q$).
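The NB distribution is itself DCP: expanding $-\log(1-qz)$ in the p.g.f. gives $\lambda = -s\log(1-q)$ and logarithmic-series weights $\alpha_i = \frac{q^i}{i(-\log(1-q))}$. The following R sketch verifies this by simulating a Poisson($\lambda$) number of logarithmic-series summands; the truncation at 200 and the parameter values are illustrative.

```r
# NB(s, q) as a compound Poisson sum of logarithmic-series variables.
set.seed(1)
s <- 3; q <- 0.4; lambda <- -s * log(1 - q)
rlogseries <- function(n, q) {                   # sampler on a truncated support
  k <- 1:200                                     # tail beyond 200 is negligible here
  sample(k, n, replace = TRUE, prob = q^k / (k * (-log(1 - q))))
}
Y <- replicate(1e5, { N <- rpois(1, lambda); if (N == 0) 0 else sum(rlogseries(N, q)) })
rbind(empirical = sapply(0:4, function(k) mean(Y == k)),
      nb_pmf    = dnbinom(0:4, size = s, prob = 1 - q))
```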
Here the parametrization as a DCP is $\lambda = -s\log(1-q)$, $\alpha_i = \frac{q^i}{i(-\log(1-q))}$, and the covariate vectors $x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p$ are $p$-dimensional fixed (non-random) variables for simplicity; see the so-called NB2 model in Hilbe (2011) for details. Based on the log-linear model in regression analysis, we usually suppose that $\log\mu_i$ is a function of $x_i$. Two types of high-dimensional settings for count data regression are classified as follows.
• Sparse approximation for nonparametric regression.
Sometimes it is too rough and too simple to use a linear function of $x_i$ to approximate $\log\mu_i$; denote the target function by $f_0(x_i)$. The connection between $\log\mu_i$ and $x_i$ is often unknown and unlikely to be linear. The dimension of the parameter would be much larger than the sample size when $f$ is extremely flexible and complex. In order to capture the unknown functional relation $f_0(x_i)$, we prefer to use a given dictionary (of orthogonal basis functions) $D = \{\phi_j(\cdot)\}_{j=1}^p$ such that the linear combination $\hat f(x_i) = \sum_{j=1}^p\hat\beta_j\phi_j(x_{ij})$ is a sparse approximation to $f_0(x_i)$; this involves the increasing-dimensional parameter $\beta^* := (\beta_1^*, \ldots, \beta_p^*)^T$ with $p := p_n \to \infty$ as $n \to \infty$. A crucial point about orthogonal basis functions $D$ is that many $f_0(x_i)$ which have no sparse representation in a non-orthogonal basis can be represented as sparse linear combinations of $D$ in the orthogonal scenario; see Sect. 10 of Hastie et al. (2015) for details. In practice, the advantage of the dictionary approach is that a flexible $f_0(x_i)$ may admit a good sparse approximation in some dictionaries but not in others. For high-dimensional count data, Ivanoff et al. (2016) mention that: "The richer the dictionary, the sparser the decomposition, so that p can be larger than n and the model becomes high-dimensional". Typical candidates for the dictionary are the Fourier basis, cosine basis, Legendre basis, wavelet bases (e.g. the Haar basis), etc.; a concrete construction is sketched after this list.
• High-dimensional features in gene engineering.
In the big data era, one challenging case encountered in massive data sets is that the number of covariates $p$ is larger than the sample size $n$ while the responses of interest are measured as counts. For example, in gene engineering, the covariates (features) are types of genes affecting specific count phenotypes. NB regression is a flexible framework for the analysis of over-dispersed RNA-seq count data; see Li (2016), Mallick and Tiwari (2016) for more details.
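As a concrete instance of the dictionary device described in the first setting above, the following R sketch builds an $n\times p$ design matrix $\Phi(X)$ from a cosine basis evaluated at a scalar covariate; the basis choice and dimensions are illustrative.

```r
# A cosine-basis dictionary design matrix Phi(X) for a scalar covariate on [0, 1].
set.seed(1)
n <- 100; p <- 50
x <- runif(n)
Phi <- sapply(1:p, function(j) sqrt(2) * cos(pi * j * x))   # n x p matrix
dim(Phi)   # 100 x 50; p may exceed n for a richer dictionary
```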
We assume that the expectation of $Y_i$ is related to $\phi^T(x_i)\beta$ after a log-transformation,
$$\log\mathrm{E}Y_i = \phi^T(x_i)\beta, \qquad \phi(x_i) := (\phi_1(x_{i1}), \ldots, \phi_p(x_{ip}))^T,$$
where $\{\phi_j(\cdot)\}_{j=1}^p$ are given transformations of the bounded covariates; a trivial example is $\phi_j(x) = x$. Let $\{Y_i\}_{i=1}^n$ be NB random variables with failure parameters $q_i = \frac{\mu_i}{\theta+\mu_i}$, where $\theta > 0$ is an unknown dispersion parameter which can be estimated (see Hilbe (2011)). Then the p.m.f. of $Y_i$ is
$$P(Y_i = y_i) = \frac{\Gamma(\theta+y_i)}{\Gamma(\theta)\,y_i!}\Big(\frac{\theta}{\theta+\mu_i}\Big)^{\theta}\Big(\frac{\mu_i}{\theta+\mu_i}\Big)^{y_i},$$
where $\mu_i > 0$ is the mean parameter, with $\mu_i = \mathrm{E}Y_i$ and $\operatorname{Var}Y_i = \mu_i + \frac{\mu_i^2}{\theta}$. Let $R(y, \beta, x) = y\phi^T(x)\beta - (\theta+y)\log(\theta + e^{\phi^T(x)\beta})$ and denote the negative average empirical risk function by $\ell(\beta) = -\frac{1}{n}l(\beta) := -P_nR(Y, \beta, x)$. Having obtained high-dimensional covariates, a principal task is to select the important variables. Here we adopt the weighted $\ell_1$-penalized likelihood principle for sparse NB regression,
$$\hat\beta := \mathop{\mathrm{argmin}}_{\beta}\Big\{\ell(\beta) + \sum_{j=1}^p w_j|\beta_j|\Big\}, \qquad (21)$$
where $\{w_j\}_{j=1}^p$ are data-dependent weights to be specified in the following discussion.
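For reference, here is a minimal R implementation of $\ell(\beta)$ and its gradient under the parametrization above, assuming for the sketch that the dispersion $\theta$ is fixed.

```r
# Negative average empirical risk l(beta) = -P_n R(Y, beta, x) for NB regression,
# with R(y, beta, x) = y*eta - (theta + y)*log(theta + exp(eta)), eta = Phi %*% beta.
neg_loglik <- function(beta, Phi, y, theta) {
  eta <- drop(Phi %*% beta)
  -mean(y * eta - (theta + y) * log(theta + exp(eta)))
}
# Gradient: -1/n * sum_i phi(x_i) * (y_i - (theta + y_i)*mu_i/(theta + mu_i)).
neg_loglik_grad <- function(beta, Phi, y, theta) {
  eta <- drop(Phi %*% beta); mu <- exp(eta)
  -colMeans(Phi * (y - (theta + y) * mu / (theta + mu)))
}
```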
By the KKT conditions (i.e. the first-order conditions in convex optimization; see p. 68 of Bühlmann and van de Geer (2011)), $\hat\beta$ is a solution of (21) iff $\hat\beta$ satisfies the first-order conditions
$$\dot\ell_j(\hat\beta) = -w_j\operatorname{sign}(\hat\beta_j)\ \text{ if }\ \hat\beta_j \neq 0, \qquad |\dot\ell_j(\hat\beta)| \le w_j\ \text{ if }\ \hat\beta_j = 0,$$
where $\dot\ell_j(\beta) := \frac{\partial\ell(\beta)}{\partial\beta_j}$ ($j = 1, 2, \ldots, p$). Let $\Phi(X)$ be the $n\times p$ dictionary design matrix given by $\Phi(X) = (\phi_j(x_i))_{1\le i\le n,\,1\le j\le p}$. The KKT conditions imply $|\dot\ell_j(\hat\beta)| \le w_j$ for all $j$, and
$$\dot\ell_j(\beta) = -\frac{1}{n}\sum_{i=1}^n\phi_j(x_i)\Big(y_i - \frac{(\theta+y_i)e^{\phi^T(x_i)\beta}}{\theta+e^{\phi^T(x_i)\beta}}\Big).$$
The Hessian matrix of the negative average log-likelihood is
$$\ddot\ell(\beta) = \frac{1}{n}\sum_{i=1}^n\frac{\theta(\theta+y_i)e^{\phi^T(x_i)\beta}}{(\theta+e^{\phi^T(x_i)\beta})^2}\,\phi(x_i)\phi^T(x_i).$$
Let $d^* = |\{j : \beta_j^* \neq 0\}|$, where the true coefficient vector $\beta^*$ is defined by (28). For the optimization problem of maximizing the $\ell_1$-penalized empirical likelihood, we want the KKT conditions to be satisfied at the estimated parameter; here we use the true-parameter version of the KKT conditions to approximate the estimated-parameter version. When $\theta \to \infty$, this approach to defining data-driven weights $w_j$ reduces to the one proposed in Ivanoff et al. (2016) for Poisson regression, which uses the Poisson point process version of the concentration inequality in Theorem 3.1. Similarly, in NB regression the $w_j$ are determined such that the event (29) holds with high probability for all $j$, by applying Corollary 3.1. Recall the weighted Poisson decomposition of the NB point processes. For the subsequent step, let $\tilde\phi_j(x_i) := \frac{\phi_j(x_i)\theta}{\theta+\mathrm{E}Y_i} \le \phi_j(x_i)$; we want to evaluate the complementary events of the true-parameter version of the KKT conditions, $\{|\dot\ell_j(\beta^*)| > w_j\}$. The aim is to find the values $\{w_j\}_{j=1}^p$ by the concentration inequalities we proposed; it then suffices to bound the probability on the right-hand side of the resulting inequality.
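A small R helper, reusing neg_loglik_grad from the sketch above, that tests whether a candidate $\hat\beta$ satisfies these KKT conditions up to an arbitrary numerical tolerance:

```r
# Check the weighted-Lasso KKT conditions: grad_j = -w_j*sign(beta_j) on the
# active set and |grad_j| <= w_j everywhere.
kkt_ok <- function(beta_hat, Phi, y, theta, w, tol = 1e-6) {
  g <- neg_loglik_grad(beta_hat, Phi, y, theta)
  act <- which(beta_hat != 0)
  all(abs(g) <= w + tol) &&
    all(abs(g[act] + w[act] * sign(beta_hat[act])) <= tol)
}
```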
Let $\mu(A)$ be the Lebesgue measure of $A \subset \mathbb{R}^d$ and let $N_{i,k}(A)$ be Poisson random measures with rate $\alpha_k(i)\int_A\lambda(x)\,dx$. We define the independent NB point processes $NB_i(A)$ as $NB_i(A) := \sum_{k=1}^{\infty}kN_{i,k}(A)$, where the $\{N_{i,k}(A)\}$ are independent over $k$ and $i$. For $[0,1]^d := \cup_{i=1}^n S_i$ a disjoint union, we assume that the intensity function is of histogram type (Baraud and Birgé (2009)), i.e. a piecewise-constant intensity function. We now apply Corollary 3.1; note that, via the definition of $\lambda(t)$ given before, $(NB_1(S_1), \ldots, NB_n(S_n))$ and $(Y_1, \ldots, Y_n)$ have the same law by the weighted Poisson decomposition.
Assume that there exist constants $C_1, C_2$ such that condition (32) holds, i.e. the variance-type quantities $c_{1n}$ and $c_{2n}$ in Corollary 3.1 are both of order $O(n)$. Then, from (31), we can apply the concentration inequality of Corollary 3.1 to the last probability in the inequality below.

Now we can let the weights $w_j$ be given by the resulting bound. From (31), applying the concentration inequality of Corollary 3.1 to the last probability in the above inequality implies (33): the weights are proportional to $\sqrt{\frac{2\gamma\log p}{n}}$, up to variance factors. Here $\gamma$ can be treated as a tuning parameter. In high-dimensional statistics, the dimension typically enters the weights through the factor $\log p$, as in the term $\sqrt{\frac{2\gamma\log p}{n}}$.
Then, for large dimension $p$, the above upper bound on the probability tends to zero. Therefore, with high probability, the event defined by the KKT conditions holds.
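A hypothetical R sketch of data-driven weights of the order $\sqrt{2\gamma\log p/n}$, scaled by a crude empirical second-moment proxy per dictionary column; the exact constants of (33) are not reproduced here.

```r
# Heuristic data-driven weights w_j ~ sqrt(2*gamma*log(p)*v_j/n), with v_j a
# per-column proxy built from the NB variance mu + mu^2/theta (an assumption,
# not the paper's exact formula).
weights_wlasso <- function(Phi, y, theta, gamma = 1.1) {
  n <- nrow(Phi); p <- ncol(Phi)
  v <- colMeans(Phi^2 * (y + y^2 / theta))
  sqrt(2 * gamma * log(p) * v / n)
}
```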
Based on (33) and some regularity assumptions, such as a compatibility factor condition (a generalized restricted eigenvalue condition), we will derive in the next section an $\ell_1$-estimation error bound of the form $\|\hat\beta - \beta^*\|_1 = O\big(d^*\sqrt{\frac{\log p}{n}}\big)$, holding with high probability. Under the restricted eigenvalue condition proposed by Bickel et al. (2009), oracle inequalities for weighted-Lasso-regularized Poisson regression are studied in Ivanoff et al. (2016). Since the negative binomial is neither sub-Gaussian nor (with unknown dispersion) a member of the exponential family, our extension is not straightforward. We employ the compatibility factor assumption, which renders the proof simpler than under the restricted eigenvalue condition.

Oracle inequalities for the weighted Lasso in count data regressions
The ensuing subsections proceed in two steps. The first step constructs two lemmas giving lower and upper bounds for the symmetric Bregman divergence in the weighted Lasso case. Using the symmetric Bregman divergence, the second step derives the oracle inequalities by some inequality-scaling tricks.

Case of NB regression
The symmetric Bregman divergence (sB divergence) is defined by
$$D^s(\hat\beta, \beta^*) := (\hat\beta - \beta^*)^T[\dot\ell(\hat\beta) - \dot\ell(\beta^*)];$$
see Nielsen and Nock (2009).
The following lemma is the crucial inequality for deriving the oracle inequality under the compatibility factor assumption, which is also used in Yu (2010) for the Cox model. Let $\beta^*$ be the true coefficient vector defined by (28). Moreover, on the event $\{\|\dot\ell(\beta^*)\|_\infty \le \frac{\xi-1}{\xi+1}w_{\min}\}$ for any $\xi > 1$, and with $\|\Phi(X)\|_\infty \le K$, the stated bound holds with high probability.

Proof. The left inequality follows from $D^s(\hat\beta, \beta^*) = (\hat\beta - \beta^*)^T[\dot\ell(\hat\beta) - \dot\ell(\beta^*)] \ge 0$, by convexity of $\ell(\beta)$. In the main chain of inequalities, the second-to-last inequality follows from the KKT conditions and the dual norm inequality; thus we obtain (34). On the event $\{\|\dot\ell(\beta^*)\|_\infty \le \frac{\xi-1}{\xi+1}w_{\min}\}$, applying the NB concentration inequality (9), we bound the exception probability, where the second-to-last $\sim$ is due to $c_{1n} = O(n) = c_{2n}$ by assumption (32), and the last $\sim$ follows by letting $w_{\min} = O\big(r\sqrt{\frac{\log p}{n}}\big)$. Here $r > 0$ is a suitable tuning parameter, chosen so that the event $\{\|\dot\ell(\beta^*)\|_\infty \le \frac{\xi-1}{\xi+1}w_{\min}\}$ holds with probability tending to 1 as $p, n \to \infty$.
In order to derive Theorem 4.1, we adopt a crucial inequality obtained by an approach similar to Lemma 5 of Zhang and Jia (2017), namely a Taylor expansion of the convex log-likelihood function.

Case of Poisson regression
In this section, we use the same procedure as in Sect. 4.2.1 to treat Poisson regression with the weighted Lasso penalty. The derivation of this type of oracle inequality is simpler than that in Ivanoff et al. (2016). As with Lemma 5 of Zhang and Jia (2017), the proof is based on the following key lemma.
Proof. Without loss of generality, we assume that $\phi^T(x_i)\delta \neq 0$. From the expressions for $\dot\ell(\beta)$ and $\ddot\ell(\beta)$, one deduces the claimed bound, where the last inequality follows from $\frac{e^x - e^y}{x - y} \ge e^{-|x|\vee|y|}$.

Simulations
This section compares the performance of the un-weighted Lasso and the weighted Lasso for NB regression on simulated data sets. We use the R package mpath with the function glmregNB to fit the un-weighted Lasso estimator of NB regression; the R package lbfgs is employed for the weighted $\ell_1$-penalized optimization. For the weighted Lasso, we first run the un-weighted Lasso estimator and find the optimal tuning parameter $\lambda_{op}$ by cross-validation: we apply the function cv.glmregNB() to carry out 10-fold cross-validation for the optimal penalty parameter $\lambda$, and we use the estimated $\theta$ from glmregNB as the initial value for our weighted Lasso algorithm. The actual weights we use are the standardized weights $\tilde w_j$ derived from (33), and we then solve the corresponding weighted optimization problem.
For the simulations, we generate 100 random data sets. Optimizing with a suitable $\lambda$, we obtain the model with parameter $\hat\beta_{\lambda_{opt}}$. We then evaluate the $\ell_1$-estimation error $\|\beta^* - \hat\beta_{\lambda_{opt}}\|_1$ and the prediction error $\|X_{test}\beta^* - X_{test}\hat\beta_{\lambda_{opt}}\|_2$ on the test data ($X_{test}$ of size $n_{test}$), averaged over the 100 repetitions.
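The following R sketch outlines one replication of this pipeline with illustrative data-generating values; glmregNB and cv.glmregNB are the mpath functions named above, while the coefficient extraction and the path index chosen here are assumptions that may need adjusting to the package version, and the weight standardization of (33) is omitted.

```r
# One replication: generate NB data, fit the un-weighted Lasso via mpath, and
# evaluate l1-estimation and prediction errors of a fitted coefficient vector.
library(mpath)
set.seed(1)
n <- 200; p <- 50; theta <- 2
X <- matrix(rnorm(n * p), n, p)
beta_star <- c(1, -1, 0.5, rep(0, p - 3))
y <- rnbinom(n, size = theta, mu = drop(exp(X %*% beta_star)))
dat <- data.frame(y = y, X)
fit <- glmregNB(y ~ ., data = dat)           # un-weighted Lasso path
# cv.glmregNB(y ~ ., data = dat) gives the 10-fold CV choice of lambda.
beta_hat <- coef(fit)[-1, 10]                # one path point, intercept dropped
X_test <- matrix(rnorm(100 * p), 100, p)
c(l1_error   = sum(abs(beta_star - beta_hat)),
  pred_error = sqrt(sum((X_test %*% (beta_star - beta_hat))^2)))
```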
The simulation results are displayed in Table 1. The table shows that the proposed weighted Lasso estimators are more accurate than the un-weighted Lasso estimators. With the help of the optimal weights, controlling the KKT conditions through a data-dependent tuning parameter improves the accuracy of estimation in terms of both the $\ell_1$-estimation error and the squared prediction error. It can also be seen that increasing $p$ inflates the $\ell_1$-estimation error, reflecting the curse of dimensionality.

Proof of Theorem 2.1
The proof is divided into two parts. In the first part, we show that $\nu(A)$ is countably additive and absolutely continuous; in the second part, we show Eq. (3).
Step 2. For convenience, we first consider the case where $A = I_{a,b} = (a, b]$ is a $d$-dimensional interval in $\mathbb{R}^d$, and then extend to arbitrary measurable sets. Let $T(x) := \nu(I_{0,x})$, so that $T(x)$ is a $d$-dimensional coordinatewise monotone transformation: $T(x) \le T(x')$ whenever $x \le x'$, where the componentwise order symbol (partial order) "$\le$" means that each coordinate of $x$ is less than or equal to the corresponding coordinate of $x'$ (for example, with $d = 2$, we have $(1,2) \le (3,4)$). This follows from the fact that $\nu(\cdot)$ is absolutely continuous and finitely additive, which implies $\nu(I_{0,0}) = \lim_{m(I_{0,x})\to 0}\nu(I_{0,x}) = 0$. The notion of the minimal set of the componentwise ordering relation lays the basis for our study; see Sect. 4.1 of Dinh The Luc (2016). Let $A$ be a nonempty set in $\mathbb{R}_+^d$. A point $a \in A$ is called a Pareto minimal point of the set $A$ if there is no point $a' \in A$ such that $a' \le a$ and $a' \neq a$. The set of Pareto minimal points of $A$ is denoted by $\mathrm{PMin}(A)$. Next, we define the generalized inverse $T^{-1}(t)$ (like some generalizations of the inverse matrix, it is not unique; we just choose one) by $x = (x_1, \ldots, x_d) = T^{-1}(t) := \mathrm{PMin}\{a \in \mathbb{R}_+^d : t \le T(a)\}$.
When $d = 1$, the generalized inverse $T^{-1}(t)$ is simply an analogue of the quantile function. By Theorem 4.1.4 in Dinh The Luc (2016), if $A$ is a nonempty compact set, then it has a Pareto minimal point. Set $x = T^{-1}(t)$, $x + \xi = T^{-1}(t + \tau)$ and define $\varphi_k(\tau, t) = P_k(I_{x,x+\xi})$ for $t = T(x)$, $\tau = T(x + \xi) - T(x)$.
To solve (44), we write it in matrix form. The general solution is a matrix exponential applied to a constant vector $c$; the vector $c$ is specified by the initial condition. Hence $Q$ can be written via the expansion of the power of the multinomial, where $R(s,k) := \sum_{t=1}^k ts_t$, and the last equality is obtained from the facts that $N^lc$ is the standard unit vector with 1 in the $l$-th position and that $N^l = 0$ for $l \ge k+1$.
It remains to verify $\nu(\cup_{i=1}^{\infty}A_i) = \sum_{i=1}^{\infty}\nu(A_i)$ for disjoint intervals $A_1, A_2, \ldots$; here we use the following lemma of Copeland and Regan (1936). To see this, since $e_n = \cup_{i=n+1}^{\infty}A_i$, we have $m(e_n) = \sum_{i=n+1}^{\infty}m(A_i) \to 0$ as $n \to \infty$. Thus, we obtain the countable additivity with respect to (46).

Proof of Corollary 3.1
Let $CP_{i,r_i}(A) := \sum_{k=1}^{r_i}kN_{i,k}(A)$, where the $\{N_{i,k}(A)\}_{i=1}^n$ are independent Poisson point processes with intensity $\alpha_k(i)\int_A\lambda(x)\,dx$ for each fixed $k$. By Campbell's theorem, (48) follows. For $\eta > 0$, define the normalized exponential transform of the stochastic integral by $D_{\{r_i\}}(\eta)$ as in (49). Therefore, we obtain $\mathrm{E}D_{\{r_i\}}(\eta) = 1$ by (48).

Summary and discussion
The paper makes three contributions. First, we provide a characterization of discrete compound Poisson point processes (DCPP) in the same fashion as the 1936 characterization of the Poisson process by Arthur H. Copeland and Francis Regan. Second, we derive concentration inequalities for DCPP, which are applied in the third part to obtain optimal oracle inequalities for the weighted Lasso in high-dimensional NB regression.
Oracle inequalities for discrete distributions are statistically useful in both non-asymptotic and asymptotic analyses of count data regression. The motivation for deriving new concentration inequalities for DCP processes is to show that the KKT conditions of Lasso-penalized NB regression (with tuning parameter $\lambda$) constitute a high-probability event. This hinges on a concentration inequality for the centered weighted sum of NB random variables, and existing concentration results are not readily applicable to this optimal inference task. The optimal inference procedure here is to choose an optimal tuning parameter for Lasso estimates in high-dimensional NB regression, which leads to the minimax-optimal convergence rate for the estimator via the oracle inequalities for the $\ell_1$-estimation error. In the future, it would be of interest to study concentration inequalities for other extended Poisson distributions in regression analysis, such as the Conway-Maxwell-Poisson distribution; see Li et al. (2019) for existing probabilistic properties. In another direction, it would be desirable to extend our concentration inequalities to establish oracle inequalities for the penalized projection estimators studied by Reynaud-Bouret (2003), when considering the intensity of an inhomogeneous compound Poisson process.