
Compound Poisson point processes, concentration and oracle inequalities

Abstract

This note presents several new theoretical results for the compound Poisson point process, following the work of Zhang et al. (Insur. Math. Econ. 59:325–336, 2014). The first part provides a new characterization of the discrete compound Poisson point process (proposed by Aczél (Acta Math. Hung. 3(3):219–224, 1952)); it extends the characterization of the Poisson point process given by Copeland and Regan (Ann. Math. 37:357–362, 1936). Next, we derive some concentration inequalities for the discrete compound Poisson point process (the negative binomial random variable with unknown dispersion is an important example). These concentration inequalities are potentially useful in count data regression. We give an application to weighted Lasso penalized negative binomial regressions, showing that the KKT conditions of the penalized likelihood hold with high probability, and we then derive non-asymptotic oracle inequalities for a weighted Lasso estimator.

1 Introduction

Powerful and useful concentration inequalities for empirical processes have been derived one after another for several decades. The various types of concentration inequalities are mainly based on moment conditions (such as sub-Gaussian, sub-exponential and sub-gamma) and bounded difference conditions; see Boucheron et al. [4]. Their central task is to evaluate, in probability, the fluctuation of empirical processes around some value (for example, the mean or median). Attention has greatly expanded to various areas of research such as high-dimensional statistical models. For Poisson or negative binomial count data regression (Hilbe [13]), it should be noted that the Poisson distribution (the limiting distribution of the negative binomial) is not sub-Gaussian, but it is locally sub-Gaussian (Chareka et al. [6]) or sub-exponential (see Sect. 2.1.3 of Wainwright [34]). However, few results are known about concentration inequalities for a weighted sum of negative binomial (NB) random variables and their statistical applications. With an application to the segmentation of RNA-Seq data, a notable example is Cleynen and Lebarbier [7], who obtained a concentration inequality for the sum of independent centered negative binomial random variables; its derivation hinges on the assumption that the dispersion parameter is known, in which case the negative binomial random variable belongs to the exponential family. But if the dispersion parameter of a negative binomial random variable X is unknown, as in real-world problems, X does not belong to the exponential family with density

$$ f_{X}(x\mid \theta ) = h(x) \exp \bigl(\eta ( \theta )T(x) -A(\theta ) \bigr), $$
(1)

where \(T(x)\), \(h(x)\), \(\eta (\theta )\), and \(A(\theta )\) are some given functions.
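
For concreteness, here is a short worked illustration (ours; it uses the NB p.m.f. written out in Sect. 4.1, with dispersion s and \(q\in (0,1)\)): when s is known and fixed,

$$ f_{X}(x\mid q) = \frac{{\varGamma (x + s)}}{{\varGamma (s)x!}}{(1 - q)^{s}} {q^{x}} = \underbrace{\frac{{\varGamma (x + s)}}{{\varGamma (s)x!}}}_{h(x)} \exp \bigl( \underbrace{\log q}_{\eta (q)}\cdot \underbrace{x}_{T(x)} - \underbrace{ \bigl(- s\log (1 - q) \bigr)}_{A(q)} \bigr), $$

which is exactly of the form (1); if s is unknown, the factor \(\varGamma (x + s)/(\varGamma (s)x!)\) involves the unknown parameter and cannot play the role of \(h(x)\), so representation (1) fails.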

However, it is well known that the Poisson and negative binomial distributions belong to the family of infinitely divisible distributions. Houdré [14] studies dimension-free concentration inequalities for non-Gaussian infinitely divisible random vectors with finite exponential moments. In particular, the geometric case, a special case of the negative binomial, has been obtained in Houdré [14]. Via the entropy method, Kontoyiannis and Madiman [20] give a simple derivation of the concentration inequalities for a Lipschitz function of a discrete compound Poisson random variable. Nonetheless, when deriving oracle inequalities for Lasso or Elastic-net estimates (see Ivanoff et al. [16], Zhang and Jia [38]), it is difficult to use the results of Kontoyiannis and Madiman [20] and Houdré [14] to get an explicit expression for the tuning parameter under the KKT conditions of the penalized negative binomial likelihood (Poisson regression is a particular case). Moreover, when the negative binomial responses are treated as sub-exponential distributions (distributions satisfying a Cramér-type condition), the upper bound of the \(\ell _{1}\)-estimation error oracle inequality is determined by a tuning parameter with rate \(O( {\frac{{\log p}}{n}} )\), which is not rate-optimal in the minimax sense, although it is sharper than \(O(\sqrt{ \frac{{\log p}}{n}} )\). Consider the linear regression \(Y = X\beta ^{*} + \varepsilon \) with noise \(\operatorname{Var} \varepsilon = {\sigma ^{2}}{I_{n}}\) and \(\beta ^{*}\) the p-dimensional true parameter. Under an \(\ell _{0}\)-constraint on \(\beta ^{*}\), Raskutti et al. [28] show that the lower bound on the minimax prediction error over all estimators β̂ is

$$ \inf_{\hat{\beta }} \sup_{ \Vert \beta ^{*} \Vert _{0} \leq s_{0}} \frac{1}{n} \bigl\Vert X \hat{\beta }-X \beta ^{*} \bigr\Vert _{2}^{2} \ge O \biggl(\frac{s_{0}\log (p / s _{0} )}{n} \biggr),\quad (X \text{ is an } n \times p \text{ design matrix}) $$

with probability at least \(1/2\). Thus, the optimal minimax rate for \(\Vert \hat{\beta }-\beta ^{*}\Vert _{1}\) should be proportional to the square root of the rate \(O( {\frac{{\log p}}{n}} )\). A similar minimax lower bound for Poisson regression is provided in Jiang et al. [17]; see Chap. 4 of Rigollet and Hütter [30] for more minimax theory.

This paper is motivated by high-dimensional count regression and derives useful NB concentration inequalities under the compound Poisson framework. The obtained concentration inequalities are beneficial for optimal high-dimensional inference. Our concentration inequalities concern the sum of independent centered weighted NB random variables, which is more general than the un-weighted case in Cleynen and Lebarbier [7], where the over-dispersion parameter of the NB random variables is assumed known; in our case, the over-dispersion parameter can be unknown. As a by-product, we propose a characterization of discrete compound Poisson point processes following the work in Zhang et al. [40] and Zhang and Li [39]. Städler et al. [32] study oracle inequalities for the \(\ell_{1}\)-penalized finite mixture of Gaussian regressions model, but they did not consider mixtures of count data regressions (such as mixture Poisson regression; see Yang et al. [36] for example). Here NB regression can be approximately viewed as a finite mixture of Poisson regressions, since the NB distribution is equivalent to a continuous mixture of Poisson distributions whose Poisson rate is gamma distributed. The paper ends with some applications to high-dimensional NB regression in terms of oracle inequalities.

The paper is organized as follows. Section 2 is an introduction to discrete compound Poisson point processes (DCPP). We give a theorem characterizing the DCPP, which is similar to the result that an initial condition, stationary and independent increments, and a jump condition together characterize a compound Poisson process; see Wang and Ji [35] and the references therein. In Sect. 3, we derive concentration inequalities for the discrete compound Poisson point process with infinite-dimensional parameters. As an application, for the optimization problem of weighted Lasso penalized negative binomial regression, we show that the true-parameter version of the KKT conditions holds with high probability, and the optimal data-driven weights in the weighted Lasso penalty are determined. We also show oracle inequalities for weighted Lasso estimates in NB regression, and the convergence rate attains the minimax rate derived in the references.

2 Introduction to the discrete compound Poisson point process

In this section, we present preliminary knowledge on the discrete compound Poisson point process. A new characterization of the point process based on five assumptions is obtained.

2.1 Definition and preliminaries

To begin with, we need to be familiar with the definition of the discrete compound Poisson distribution and its relation to the weighted sum of Poisson random variables. For more details, we refer the reader to Sect. 9.3 of Johnson et al. [18], Zhang et al. [40], Zhang and Li [39] and the references therein.

Definition 2.1

We say that Y is discrete compound Poisson (DCP) distributed if the characteristic function of Y is

$$ {\varphi _{Y}}(t) = \mathrm{{E}} {e^{\mathrm{{i}}tY}} = \exp \Biggl\{ \sum_{k = 1}^{\infty }{{ \alpha _{k}}\lambda \bigl({e^{\mathrm{{i}}tk }} - 1 \bigr)} \Biggr\} \quad ( t \in \mathbb{R}), $$
(2)

where \(({\alpha _{1}}\lambda ,{\alpha _{2}}\lambda ,\ldots )\) are infinite-dimensional parameters satisfying \(\sum_{k = 1}^{\infty } {\alpha _{k}} = 1\), \({\alpha _{k}} \ge 0\), \(\lambda > 0\). We denote this by \(Y \sim \operatorname{DCP}({\alpha _{1}}\lambda ,{\alpha _{2}}\lambda ,\ldots )\).

If \({\alpha _{i}}=0\) for all \(i > r\) and \({\alpha _{r}} \ne 0\), we say that Y is DCP distributed of order r. If \(r = + \infty \), then the DCP distribution has infinite-dimensional parameters and we say it is DCP distributed of order +∞. When \(r=1\), it is the well-known Poisson distribution for modeling equi-dispersed count data. When \(r=2\), we call it the Hermite distribution, which can be applied to model over-dispersed and multi-modal count data; see Giles [11].

Equation (2) is the canonical representation of the characteristic function of a non-negative integer-valued infinitely divisible random variable, i.e. a special case of the Lévy–Khinchine formula (see Chap. 2 of Sato [31] and Sect. 1.6 of Petrov [27]),

$$ {\mathrm{{E}}} {e^{\mathrm{{i}}t Y}} = \exp \biggl( {a\mathrm{{i}}t - \frac{1}{2}{\sigma ^{2}} {t^{2}} + \int _{\mathbb{R}\setminus \{ 0\} } { \bigl({e^{\mathrm{{i}}tx}} - 1 -\mathrm{{i}}t x{ {\mathrm{I}}_{\vert x \vert < 1}}(x) \bigr) \nu (dx)} } \biggr), $$

where \(a \in \mathbb{R}\), \(\sigma \ge 0\), \(\mathrm{I}_{\{\cdot \}}(x)\) is the indicator function, and the function \(\nu (\cdot )\) is called the Lévy measure, subject to the restriction \(\int _{\mathbb{R}\setminus \{ 0 \} } {\min ({x^{2}},1)\nu(dx)} < \infty \).

Let \({\delta _{x}}(t) := 1_{\{ x\}}(t)\) and let \(\mathbb{N}^{+}\) be the set of positive integers. If we set \(\nu (t) = \sum_{x \in \mathbb{N}^{+}} {\lambda {\delta _{x}}(t){\alpha _{x}}} \), then \(\nu (\cdot )\) is the Lévy measure in the Lévy–Khinchine formula. It is easy to see that Y has a weighted Poisson decomposition: \(Y = \sum_{i = 1}^{\infty }{i{N_{i}}} \), where the \({{N_{i}}}\) are independent Poisson distributed with mean \(\lambda {\alpha _{i}}\). This decomposition is also called a Lévy–Itô decomposition; see Chap. 4 of Sato [31]. If \(\mathrm{{E}}{e^{ - \theta Y}}<\infty \) for θ in a neighbourhood of zero, the moment generating function (m.g.f.) of the DCP distribution is

$$ {M_{Y}}(\theta ):=\mathrm{{E}} {e^{ - \theta Y}} = \exp \Biggl\{ \sum_{k = 1}^{\infty }{{\alpha _{k}}\lambda \bigl({e^{ - k\theta }} - 1 \bigr)} \Biggr\} = \exp \biggl\{ \int _{0}^{\infty }{ \bigl({e^{-\theta x}} - 1 \bigr)} \nu ({ {dx}}) \biggr\} \quad \bigl( \vert \theta \vert < R \bigr) $$

for some \(R>0\).
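
As a purely illustrative numerical sketch (ours, not part of the original argument; the finite order r, the rate λ and the \({\alpha _{k}}\) below are arbitrary choices), the weighted Poisson decomposition \(Y = \sum_{i} i{N_{i}}\) can be used to simulate a DCP random variable and to check the mean \(\lambda \sum_{k} k{\alpha _{k}}\) and variance \(\lambda \sum_{k} k^{2}{\alpha _{k}}\) implied by (2):

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 3.0                              # Poisson rate lambda of the compound process
alpha = np.array([0.5, 0.3, 0.2])      # alpha_1, ..., alpha_r (finite order r = 3)
r = len(alpha)
ks = np.arange(1, r + 1)
n_sim = 200_000

# Weighted Poisson decomposition: Y = sum_k k * N_k with N_k ~ Poisson(lam * alpha_k), independent.
N = rng.poisson(lam * alpha, size=(n_sim, r))
Y = N @ ks

print("empirical mean", Y.mean(), "  theory", lam * np.sum(ks * alpha))
print("empirical var ", Y.var(), "  theory", lam * np.sum(ks**2 * alpha))
```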

In order to define the discrete compound Poisson point process, we shall repeatedly use the concept of a Poisson random measure. Let A be any measurable set in a measurable space \((E, \mathcal{A})\); a typical example is \(E=\mathbb{R}^{d}\). In the following sections, we let \(E=\mathbb{R} ^{d}\) and denote by \(N(A,\omega )\) the number of random points in the set A. We introduce the Poisson point process below; it is sometimes called the Poisson random measure (Sato [31], Kingman [19]).

Definition 2.2

Let \((E, \mathcal{A}, \mu )\) be a measurable space with σ-finite measure μ, and let \(\mathbb{N}\) be the set of non-negative integers. The Poisson random measure with intensity μ is a family of random variables \(\{N(A,\omega )\}_{{A\in {\mathcal{A}}}}\) (defined on some probability space \((\varOmega , \mathcal{F}, P)\)), viewed as a map

$$ N : \mathcal{A} \times \varOmega \rightarrow \mathbb{N} $$

satisfying

  1. \(\forall A\in \mathcal{A}\), \(N(A, \cdot )\) is a Poisson random variable on \((\varOmega , \mathcal{F}, P)\) with mean \(\mu (A)\);

  2. \(\forall \omega \in \varOmega \), \(N( \cdot ,\omega )\) is a counting measure on \((E, \mathcal{A})\);

  3. If sets \(A_{1},A_{2},\ldots,A_{n}\in \mathcal{A}\) are disjoint, then \(N({A_{1}}, \cdot ), N({A_{2}}, \cdot ),\ldots, N({A_{n}}, \cdot )\) are mutually independent.

Let \(N(A):=N(A,\omega )\). Based on the Poisson random measure, we define the discrete compound Poisson point process (DCPP) \(\{CP( A)\}_{{A\in {\mathcal{A}}}}\) by the Lévy–Itô decomposition

$$ CP(A) := \sum_{k = 1}^{\infty }{k{N_{k}}(A)}, $$

where the \({N_{k}}(A)\) are independent Poisson point processes with mean measure \(\mu _{k} (A): = {\alpha _{k}} \int _{A} {\lambda (x)} \,dx\).

The m.g.f. of \(\{CP(A)\}_{{A\in {\mathcal{A}}}}\) is \({M_{CP(A)}}( \theta ) = \exp \{ \sum_{k = 1}^{\infty }{{\alpha _{k}}\int _{A} {\lambda (x)} \,dx({e^{ - k\theta }} - 1)} \}\). We can see that \(\lambda (\cdot )\) is the intensity function of the generating Poisson point processes \(\{{N_{k}}(A)\}_{{A\in {\mathcal{A}}}}\) for each k. Aczél [1] derives the p.m.f. and calls it the inhomogeneous composed Poisson distribution when \(d=1\). If the intensity function \({\lambda (x)}\) is constant (i.e. the mean measure is a multiple of Lebesgue measure), then N is said to be a homogeneous Poisson point process.

Define the probability \(P_{k}(A)\) by \(P(N(A)=k)\). If \(N(A)\) follows a DCP point process with intensity measure \(\lambda (A)\), then the probability of having k (\(k \geq 0\)) points in the set A is given by

$$ P_{k}(A)=\sum_{R(s,k)=k} \frac{\alpha _{1}^{s_{1}}\cdots \alpha _{k}^{s_{k}}}{s_{1}!\cdots s_{k}!} \bigl[\lambda (A) \bigr]^{s_{1}+\cdots +s_{k}}e ^{-\lambda (A)}, $$
(3)

where \(R(s,k):=\sum_{t=1}^{k} ts_{t}\); see Aczél [1] for the proof.
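
A small computational sketch (ours; the \({\alpha _{k}}\) and \(\lambda (A)\) below are arbitrary toy values) that implements formula (3) by enumerating the multiplicity vectors \((s_{1},\ldots,s_{k})\) with \(\sum_{t} t s_{t} = k\), and then checks that the probabilities nearly sum to one over a large range of k:

```python
import math

def multiplicity_tuples(k, max_part=None):
    """Yield dicts {t: s_t} with sum_t t*s_t == k, i.e. the index set R(s, k) = k in (3)."""
    if max_part is None:
        max_part = k
    if k == 0:
        yield {}
        return
    for part in range(min(k, max_part), 0, -1):
        for mult in range(1, k // part + 1):
            for rest in multiplicity_tuples(k - part * mult, part - 1):
                out = dict(rest)
                out[part] = mult
                yield out

def dcp_pk(k, lam_A, alpha):
    """P_k(A) from Eq. (3); alpha[t-1] plays the role of alpha_t, lam_A of lambda(A)."""
    total = 0.0
    for mults in multiplicity_tuples(k):
        term = math.exp(-lam_A)
        for part, s in mults.items():
            a = alpha[part - 1] if part <= len(alpha) else 0.0
            term *= (a * lam_A) ** s / math.factorial(s)
        total += term
    return total

alpha = [0.6, 0.3, 0.1]     # alpha_k = 0 for k > 3 in this toy example
lam_A = 2.0
probs = [dcp_pk(k, lam_A, alpha) for k in range(40)]
print("sum of P_k(A) over k = 0..39:", sum(probs))   # should be very close to 1
```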

2.2 A new characterization of the DCP point process

Based on these preliminaries, we give a new characterization of the DCP point process, which extends Copeland and Regan [8]. A similar characterization for the DCP distribution and process (not for the point process in terms of a random measure) is derived by Wang and Ji [35]. For more characterizations of point processes, see the monograph Last and Penrose [21] and the references therein.

Theorem 2.1

(Characterization for DCP point process)

Consider the following assumptions:

  1. \(P_{k}(A)>0\) if \(0< m(A)<\infty \), where \(m(\cdot )\) is the Lebesgue measure;

  2. \(\sum_{k=0}^{\infty }P_{k}(A)=1\);

  3. \(P_{k}(A_{1}\cup A_{2})=\sum_{i=0}^{k} P_{k-i}(A _{1}) P_{i}(A_{2})\) if \(A_{1}\cap A_{2}=\emptyset \);

  4. Let \(S_{k}(A)=\sum_{i=k}^{\infty }P_{i}(A)\); then \(\lim_{m(A)\to 0} S_{1}(A)=0\);

  5. \(\lim_{m(A)\to 0}\frac{P_{k}(A)}{S_{1}(A)}=\alpha _{k}\), where \(\sum_{k=1}^{\infty }\alpha _{k}=1\).

If \(P_{k}(A)\) satisfies assumptions 1–5, then there exists a countably additive, absolutely continuous measure \(\nu (A)\) such that \(P_{k}(A)\) is represented by equation (3) for all k and A.

For A a closed interval in \(\mathbb{R}\), this setting reduces to the famous characterization of the Poisson process; see p. 447 of Feller [10] (the postulates for the Poisson process). The proof of Theorem 2.1 is given in the Appendix. The proof consists of showing countable additivity and absolute continuity of the intensity measure and dealing with a matrix differential equation.

As a random measure, the DCPP is a mathematical extension of the discrete compound Poisson distribution and of the discrete compound Poisson process indexed by the real line. Baraud and Birgé [2] use a histogram-type estimator to estimate the intensity of a Poisson random measure. In a recent application, Das [9] proposes a nonparametric Bayesian approach to RNA-seq count data via a DCP process. For a set of regions \(\{A \in {\mathcal{A}}\}\) of interest on the genome, such as genes, exons, or junctions, the number of reads is counted, and the count in the region \(A_{i}\) is viewed as a DCP process for gene expression. The \(P_{k}(A)\) satisfy assumptions 1–5 in Theorem 2.1 by the real-world setting (the process generating the expression level for each gene in the experiment).

3 Concentration inequalities for discrete compound Poisson point process

This section is about the construction of concentration inequalities for DCPP. It has applications in high-dimensional \(\ell _{1}\)-regularized negative binomial regression; see Sect. 4.

As an appetizing example, if the DCPP has finitely many parameters, an existing result such as Lemma 3 of Baraud and Birgé [2] can be used to obtain a concentration inequality for the sum of DCP random variables.

Lemma 3.1

(Baraud and Birgé [2])

Let \(Y_{1},\ldots,Y_{n}\) be n independent centered random variables, and let τ, κ, \(\{ {\eta _{i}}\} _{i = 1}^{n}\) be some positive constants.

  (a) If \(\log (\mathrm{{E}}e^{zY_{i}})\leq \kappa \frac{z^{2}\eta _{i}}{2(1-z\tau )} \) for all \(z\in [0,1/\tau [\) and \(1\leq i\leq n\), then

    $$ P \Biggl[\sum_{i=1}^{n} Y_{i} \geq \Biggl(2\kappa x\sum_{i=1}^{n} \eta _{i} \Biggr)^{1/2} +\tau x \Biggr]\leq e^{-x}\quad \textit{for all } x>0. $$
  (b) If, for \(1\leq i\leq n\) and all \(z>0\), \(\log (\mathrm{{E}}e^{-zY _{i}})\leq \kappa z^{2}\eta _{i}/2\), then

    $$ P \Biggl[\sum_{i=1}^{n} Y_{i} \leq - \Biggl(2\kappa x\sum_{i=1}^{n} \eta _{i} \Biggr)^{1/2} \Biggr]\leq e^{-x} \quad \textit{for all } x>0. $$

Actually, the moment conditions in Lemma 3.1 are a direct application of existing results based on Cramér-type conditions (Lemma 2.2 in Petrov [27], or the sub-exponential condition (4) discussed below) for convolutions of Poisson distributions with different rates. A centered random variable \(Y_{i}\) satisfying the moment condition \(\log (\mathrm{{E}}e^{zY_{i}})\leq \kappa \frac{z^{2}\eta _{i}}{2(1-z\tau )} \) in Lemma 3.1(a) is called a sub-gamma random variable (see Sect. 2.4 in Boucheron et al. [4]). A random variable X with zero mean is sub-exponential (see Sect. 2.1.3 in Wainwright [34]) if there are non-negative parameters \((v, \alpha )\) such that

$$ {\mathrm{{E}}}e^{\lambda X} \leq e^{\frac{v^{2}\lambda ^{2}}{2}} \quad \text{for all } \vert \lambda \vert < \frac{1}{\alpha }. $$
(4)

Note that the sub-exponential condition implies the sub-gamma condition by the following inequality: \(\log (\mathrm{{E}}{e^{ - z{Y_{i}}}}) \le \frac{ {\kappa {z^{2}}{\eta _{i}}}}{2} \le \frac{{\kappa {z^{2}}{\eta _{i}}}}{ {2(1 - z\tau )}}\) for all \(z\in [0,1/\tau [\).

Proposition 3.1

Let \({Y_{i}} \sim \operatorname{DCP}({\alpha _{1}}(i)\lambda (i),\ldots,{\alpha _{r}}(i)\lambda (i))\) for \(i = 1,2,\ldots,n\), and let \(\sigma _{i}^{2} := \operatorname{Var}{Y_{i}} = \lambda (i)\sum_{k = 1} ^{r} {{k^{2}}{\alpha _{k}}(i)} \); then for all \(x > 0\)

$$\begin{aligned}& P \Biggl[ {\sum_{i = 1}^{n} {({Y_{i}}} - \mathrm{{E}} {Y_{i}}) \ge \sqrt{2x\sum _{i = 1}^{n} {\sigma _{i}^{2}} } + rx} \Biggr] \le {e^{ - x}}, \\& P \Biggl[ {\sum _{i = 1}^{n} {({Y_{i}}} - \mathrm{ {E}} {Y_{i}}) \le - \sqrt{2x\sum_{i = 1}^{n} {\sigma _{i}^{2}} } } \Biggr] \le {e^{ - x}}. \end{aligned}$$

Moreover, we have

$$ P \Biggl[ { \Biggl\vert {\sum_{i = 1}^{n} {({Y_{i}} - \mathrm{{E}} {Y_{i}})} } \Biggr\vert \ge \sqrt{2x\sum_{i = 1}^{n} {\sigma _{i}^{2}} } + rx} \Biggr] \le 2{e^{ - x}}. $$
(5)

Proof

First, in order to apply the above lemma, we need to evaluate the log-moment-generating function of the centered DCP random variables. Let \({\mu _{i}} := \mathrm{{E}}{Y_{i}} = \lambda (i)\sum_{k = 1} ^{r} {k{\alpha _{k}}(i)}\); then we have

$$\begin{aligned} \log {\mathrm{{E}}} {e^{z({Y_{i}} - {\mu _{i}})}} &= - z{\mu _{i}} + \log \mathrm{E} {e^{z{Y_{i}}}} = - z{\mu _{i}} + \log {e^{\lambda (i)\sum _{k = 1}^{r} {{\alpha _{k}}(i)} ({e^{kz}} - 1)}} \\ &= \lambda (i)\sum_{k = 1}^{r} {{\alpha _{k}}(i) \bigl({e^{kz}} - kz - 1 \bigr)} . \end{aligned}$$

Using the inequality \(e^{z}-z-1 \leq \frac{ z^{2}}{ 2(1-z)}\) for \(0<z<1\), one derives

$$ \log {\mathrm{{E}}} {e^{z({Y_{i}} - {\mu _{i}})}} \le \lambda (i) \sum _{k = 1}^{r} {{\alpha _{k}}(i) \frac{{{k^{2}}{z^{2}}}}{{2(1 - kz)}}} \le \lambda (i)\sum_{k = 1}^{r} {{\alpha _{k}}(i)\frac{ {{k^{2}}{z^{2}}}}{{2(1 - rz)}}} =: \frac{{\sigma _{i}^{2}{z^{2}}}}{{2(1 - rz)}} \quad (0 < z < 1/r), $$

where \(\sigma _{i}^{2} = \operatorname{Var}Y_{i} = \lambda (i)\sum_{k = 1}^{r} {{k^{2}}{\alpha _{k}}(i)} \). By applying the inequality \(e^{z}-z-1 \leq \frac{z^{2}}{2}\) for \(z<0\), we get

$$ \log {\mathrm{{E}}} {e^{z({Y_{i}} - {\mu _{i}})}} \le \lambda (i) \sum _{k = 1}^{r} {{\alpha _{k}}(i)\frac{{{k^{2}}{z^{2}}}}{2}} =: \frac{{\sigma _{i}^{2}{z^{2}}}}{2} \quad (z < 0). $$

(6)

Thus (6) implies \(\log {\mathrm{{E}}}{e^{ - z({Y_{i}} - {\mu _{i}})}} \le \frac{{\sigma _{i}^{2}{z^{2}}}}{2}\) for \(z > 0\).

Therefore, by taking \(\tau =r\), \(\kappa = 1\), \({\eta _{i}} = \sigma _{i}^{2}\) in Lemma 3.1, we obtain (5). □
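
As a quick Monte Carlo sanity check of (5) (ours; the sample size, order and parameter values below are arbitrary), one can simulate the centered sum of independent finite-order DCP variables and compare the empirical tail probability with the stated bound \(2e^{-x}\):

```python
import numpy as np

rng = np.random.default_rng(1)

n, r = 50, 3
lam = rng.uniform(1.0, 4.0, size=n)             # lambda(i)
alpha = rng.dirichlet(np.ones(r), size=n)       # (alpha_1(i), ..., alpha_r(i)) for each i
ks = np.arange(1, r + 1)

mean_i = lam * (alpha @ ks)                     # E Y_i
var_i = lam * (alpha @ ks**2)                   # sigma_i^2 = Var Y_i

n_sim = 100_000
# Y_i = sum_k k * N_{ik}, with N_{ik} ~ Poisson(lam(i) * alpha_k(i)) independent.
N = rng.poisson(lam[:, None] * alpha, size=(n_sim, n, r))
S = (N * ks).sum(axis=2).sum(axis=1) - mean_i.sum()    # centered sum over i

x = 3.0
threshold = np.sqrt(2 * x * var_i.sum()) + r * x
print("empirical P(|S| >= threshold):", np.mean(np.abs(S) >= threshold))
print("bound 2*exp(-x):              ", 2 * np.exp(-x))
```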

In the sequel, without employing Cramer-type conditions, we will derive a Bernstein-type concentration inequality for DCPP with infinite-dimensional parameters \(r = + \infty \). Theorem 3.1 below is a weighted-sum version that is not the same as Proposition 3.1.

Theorem 3.1

Let \(f(x)\) be a measurable function such that \(\int _{E} {{f^{2}}(x)} \lambda (x)\,dx < \infty \), where \(\lambda (x)\) is the intensity function of the DCPP with parameters \(({\alpha _{1}}\int _{E} {\lambda (x)\,dx}, {\alpha _{2}}\int _{E} {\lambda (x)\,dx} ,\ldots )\) under the condition \(\sum_{k = 1}^{\infty }{k{\alpha _{k}}}< \infty \). Then the concentration inequality for the stochastic integral of the DCPP is

$$\begin{aligned} &P \Biggl( { \int _{E} {f(x)} \Biggl[CP(dx) - \Biggl(\sum _{k = 1}^{\infty } {k{\alpha _{k}}} \Biggr) \lambda (x)\,dx \Biggr] \ge \Biggl(\sum_{k = 1}^{\infty } {k{\alpha _{k}}} \Biggr) \biggl[\sqrt{2y{V_{f}}} + \frac{y}{3}{{ \Vert f \Vert }_{\infty }} \biggr]} \Biggr) \\ &\quad \le {e^{ - y}}\quad (y>0), \end{aligned}$$
(7)

where \({V_{f}} = \int _{E} {{f^{2}}(x)\lambda (x)\,dx}\) and \({ \Vert f \Vert }_{\infty }\) is the supremum of \(f(x)\) on E.

Let \(f(x) = {1_{A}}(x)\) in Theorem 3.1, and we have \(P ( {Y - \mathrm{ {E}}Y \ge \frac{{\mathrm{{E}}Y}}{\lambda }[\sqrt{2y\lambda } + \frac{y}{3}]} ) \le {e^{ - y}}\) for \(Y \sim \operatorname{DCP}( {\alpha _{1}}\lambda ,{\alpha _{2}}\lambda ,\ldots )\).

Following the guide of the proof of Corollary 3.1 below, it is not hard to obtain Theorem 3.1 for a linear combination of random variables, which is more important for the applications in Sect. 4. The proof of Theorem 3.1 is given in the Appendix; it essentially relies on the fact that infinitely divisible distributions are closed under convolution and scaling. Corollary 3.1 is the version for the sum of n independent random measures.

Corollary 3.1

For a given collection of sets \(\{S_{i}\}_{i=1}^{n}\) in E, if we have n independent DCPPs \(\{ C{P_{i}}(S_{i})\} _{i = 1}^{n}\) with a series of parameters

$$ \biggl\{ \biggl({\alpha _{1}}(i) \int _{S_{i}} {\lambda (x)\,dx} ,{\alpha _{2}}(i) \int _{S_{i}} {\lambda (x)\,dx} ,\ldots \biggr) \biggr\} _{i = 1}^{n} $$

correspondingly, then, for a series of measurable functions \(\{f_{i}(x)\}_{i=1}^{n}\), we have

$$\begin{aligned} &P \Biggl( {\sum_{i = 1}^{n} { \biggl\{ { \int _{{S_{i}}} {f_{i}(x)} \bigl[C {P_{i}}(dx) - {\mu _{i}}\lambda (x)\,dx \bigr]} \biggr\} } \ge \sum _{i = 1}^{n} {{\mu _{i}}} \biggl( \sqrt{2y{V_{i,f_{i}}}} + \frac{y}{3} {{ \Vert f_{i} \Vert }_{\infty }} \biggr)} \Biggr) \\ &\quad \le {e^{ - ny}}\quad (y>0), \end{aligned}$$
(8)

where \({\mu _{i}} := \sum_{k = 1}^{\infty }{k{\alpha _{k}}} (i)< \infty \) and \({V_{i,f_{i}}} := \int _{{S_{i}}} {{f_{i}^{2}}(x)} \lambda (x)\,dx< \infty \).

Moreover, let \({c_{1n}} := \sum_{i = 1}^{n} {{\mu _{i}}} \sqrt{2 {V_{i,f_{i}}}} \), \({c_{2n}} := \sum_{i = 1}^{n} {\frac{{{\mu _{i}}}}{3}}{ \Vert f_{i} \Vert _{\infty }}\), then

$$\begin{aligned} &P \Biggl( {\sum_{i = 1}^{n} { \biggl\{ { \int _{{S_{i}}} {f_{i}(x)} \bigl[C {P_{i}}(dx) - {\mu _{i}}\lambda (x)\,dx \bigr]} \biggr\} } \ge t} \Biggr) \\ &\quad \le \exp \biggl\{ { - n{{ \biggl( {\sqrt{\frac{t}{{{c_{2n}}}} + \frac{ {c_{1n}^{2}}}{{4c_{2n}^{2}}}} - \frac{{{c_{1n}}}}{{2{c_{2n}}}}} \biggr)} ^{2}}} \biggr\} . \end{aligned}$$
(9)
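
The right-hand side of (9) is easy to evaluate; the following tiny helper (ours, with placeholder inputs) codes it directly, and the same expression reappears when calibrating \(w_{\min }\) in the proof of Lemma 4.1:

```python
import numpy as np

def dcpp_tail_bound(t, c1n, c2n, n):
    """Right-hand side of (9): exp{-n (sqrt(t/c2n + c1n^2/(4 c2n^2)) - c1n/(2 c2n))^2}."""
    root = np.sqrt(t / c2n + c1n**2 / (4 * c2n**2)) - c1n / (2 * c2n)
    return np.exp(-n * root**2)

# e.g. with c_{1n}, c_{2n} of order n, cf. assumption (26) below:
print(dcpp_tail_bound(t=5.0, c1n=50.0, c2n=50.0, n=50))
```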

The difference between Proposition 3.1 and Corollary 3.1 is that the DCP random variables in Proposition 3.1 have order \(r < + \infty \), while Corollary 3.1 concerns the DCP point process (a particular case of DCPP being the constant intensity \(\lambda (A) = \lambda \)) with order \(r = + \infty \). Moreover, Proposition 3.1 requires the variance factors \(\sigma _{i}^{2}\) of the DCP random variables.

Remark 1

Significantly, Corollary 3.1 is a weighted-sum version, and it improves the results in Cleynen and Lebarbier [7], who deal with the un-weighted sum of NB random variables (a special case of the DCP distribution). Moreover, Cleynen and Lebarbier's [7] result relies on Lemma 3.1, which depends on the sub-gamma condition. It should be noted that we do not impose any boundedness restriction on y in our concentration inequalities (7) and (8), while Theorem 1 of Houdré [14] and the concentration inequalities of Houdré and Privault [15] require that the infinitely divisible random vector satisfies Cramér-type conditions and that the y in \(P(X-EX \geq y)\) is bounded by a given constant. The moment conditions in Lemma 3.1 are sometimes difficult to check, and hence hard to apply to the Lasso KKT conditions in Sect. 4.1 when the count data follow other infinitely divisible discrete distributions. In our Theorem 3.1 and Corollary 3.1, we only need to check that the means of the DCP processes are finite, and Theorem 3.1 and Corollary 3.1 do not need the Bernstein-type condition (see Theorem 2.8 in Petrov [27] and p. 27 of Wainwright [34]). In addition, our concentration result does not require the higher-moment conditions proposed by Kontoyiannis and Madiman [20], who assume that \({\mathrm{E}}Z^{L} = \sum_{k = 1}^{\infty}{k^{L}{\alpha _{k}}} <\infty\) (\(L>1\)) for the compounding distribution \(P(Z = k) :=\alpha _{k}\) of a given discrete random variable Z.

Before we give the proof of Corollary 3.1, we need some notation and a lemma. For a general point process \(N(A)\) defined on \(\mathbb{R}^{d}\), let f be any non-negative measurable function on E. The Laplace functional is defined by \({\mathrm{{L}}_{N}}(f) = \mathrm{{E}}[\exp \{ - \int _{E} f (x)N(dx)\} ]\), where the stochastic integral is specified by \(\int _{E} f (x)N(dx) := \sum_{{x_{i}} \in \varPi \cap E} f ({x_{i}})\), with Π the state space of the point process \(N(A)\). The Laplace functional of a random measure is a crucial numerical characteristic that enables us to handle stochastic integrals of the point process. Moreover, we rely on the following theorem due to Campbell.

Lemma 3.2

(See Sect. 3.3 in Kingman [19])

Let \(N(A)\) be a Poisson random measure with intensity \(\lambda (t)\) and let \(f: \mathbb{R}^{d}\rightarrow \mathbb{R}\) be a measurable function. The random sum \(S =\sum_{x\in {\varPi }}f(x)\) is absolutely convergent with probability one if and only if the integral \(\int _{\mathbb{R}^{d}} \min (\vert f(x)\vert ,1)\lambda (x)\,dx < \infty \). Assume this condition holds; then, for any complex number θ, the equation

$$ {\mathrm{{E}}} e^{{\theta S}} = \exp \biggl\{ \int _{{\mathbb{R}^{d}}} { \bigl[ {e^{\theta f(x)}}} - 1 \bigr]\lambda (x)\,dx \biggr\} $$

holds if the integral on the right-hand side is convergent.

Proof

Now, we can easily compute the DCPP’s Laplace functional by Campbell’s theorem. Set \(C{P_{r}}(A) =: \sum_{k = 1}^{r} {k{N_{k}}(A)}\). And let \(\theta =-1\), that is, \({\mathrm{{L}}_{N}}(f) = \exp \{ \int _{E} {[{e^{-f(x)}}} - 1]\lambda (x)\,dx\} \). By independence, it follows that

$$\begin{aligned} {\mathrm{{L}}_{C{P_{r}}}}(f) & =\mathrm{{E}} \exp \biggl\{ - \int _{E} {f(x)} C{P_{r}}(dx) \biggr\} = \mathrm{{E}} \exp \Biggl\{ - \int _{E} {f(x)} \sum_{k = 1}^{r} {k{N_{k}}(dx)} \Biggr\} \\ &= \prod_{k = 1}^{r} {\mathrm{ {E}}} { \exp \biggl\{ - \int _{E} {kf(x)} {N_{k}}(dx) \biggr\} } \\ & = \prod_{k = 1}^{r} \exp \biggl\{ \int _{E} { \bigl[{e^{-kf(x)}}} - 1 \bigr] {\alpha _{k}}\lambda (x)\,dx \biggr\} \\ &= \exp \Biggl\{ \sum _{k = 1}^{r} { \int _{E} { \bigl[{e^{-kf(x)}}} - 1 \bigr]{\alpha _{k}}\lambda (x)\,dx} \Biggr\} . \end{aligned}$$

The above Laplace functional of a compound random measure seems like a Lévy–Khintchine representation. Define for \(\eta >0\)

$$\begin{aligned} D_{r}(\eta ) :=& \exp \Biggl\{ \int _{E} {\eta f(x)} \Biggl[C{P_{r}}(dx) - \Biggl(\sum_{k = 1}^{r} {k{\alpha _{k}}} \Biggr)\lambda (x)\,dx \Biggr] \\ &{}- \int _{E} {\sum_{k = 1}^{r} { \bigl[{e^{k\eta f(x)}} - k\eta f(x) - 1 \bigr]{\alpha _{k}} \lambda (x)} } \,dx \Biggr\} . \end{aligned}$$
(10)

Therefore, we obtain \(\mathrm{{E}} D_{r}(\eta ) = 1\).

It follows from (10) and Markov's inequality that

$$\begin{aligned} & P \Biggl( \int _{E} {\eta f(x)} \Biggl[C{P_{r}}(dx) - \Biggl(\sum_{k = 1} ^{r} {k{\alpha _{k}}} \Biggr)\lambda (x)\,dx \Biggr] \\ &\qquad \ge \int _{E} {\sum_{k = 1} ^{r} { \bigl[{e^{k\eta f(x)}} - k\eta f(x) - 1 \bigr]{\alpha _{k}}\lambda (x)} } \,dx + y \Biggr) \\ &\quad = P \bigl(D_{r}(\eta ) \ge {e^{ y}} \bigr) \le {e^{ - y}}. \end{aligned}$$
(11)

Note that

$$\begin{aligned} {e^{k\eta f(x)}} - k\eta f(x) - 1 &\le \sum_{i = 2}^{\infty } {\frac{{{\eta ^{i}}}}{{i!}}} \Vert {kf} \Vert _{\infty }^{i - 2}{k^{2}} {f^{2}}(x) = {k^{2}} {f^{2}}(x)\sum _{i = 2}^{\infty }{ \frac{{{\eta ^{i}}}}{{i(i - 1) \cdots 2}}} \Vert {kf} \Vert _{\infty }^{i - 2} \\ & \le \frac{{{k^{2}}{\eta ^{2}}{f^{2}}(x)}}{2}\sum_{i = 2}^{ \infty }{{{ \biggl( {\frac{1}{3}k\eta {{ \Vert f \Vert }_{ \infty }}} \biggr)}^{i - 2}}} \\ & \le {{\frac{{{k^{2}}{\eta ^{2}}{f^{2}}(x)}}{2}} \Big/ { \biggl(1 - \frac{1}{3}k\eta {{ \Vert f \Vert }_{\infty }} \biggr)}}. \end{aligned}$$
(12)

Hence, using (12), (11) turns to

$$\begin{aligned} &P \Biggl( \int _{E} {f(x)} \Biggl[C{P_{r}}(dx) - \Biggl( \sum_{k = 1}^{r} {k{\alpha_{k}}} \Biggr)\lambda (x)\,dx \Biggr] \ge \sum_{k = 1}^{r} {{\alpha _{k}} \biggl( {\frac{\eta }{2} \int _{E} \frac{{k^{2} {f^{2}}(x) \lambda (x)}}{{1 - {\frac{1 }{3}}k\eta {{ \Vert f \Vert }_{\infty }}}} \,dx + \frac{y}{ \eta }} \biggr)} \Biggr) \\ &\quad \le {e^{ - y}}. \end{aligned}$$
(13)

Let \({{V_{f}} = \int _{E} {{f^{2}}(x)\lambda (x)\,dx} }\), notice that

$$ \begin{aligned}[b] &\sum _{k = 1}^{r} {{\alpha _{k}} \biggl( { \frac{1}{2}\frac{{\eta {k^{2}}{V_{f}}}}{{1 - { \frac{1 }{3}}k\eta {{ \Vert f \Vert }_{\infty }}}} + \frac{y}{ \eta }} \biggr)} \\ &\quad = \sum_{k = 1}^{r} {{\alpha _{k}} \biggl( {\frac{1}{2}\frac{{\eta {k^{2}}{V_{f}}}}{{1 - { \frac{1 }{3}}k\eta {{ \Vert f \Vert }_{\infty }}}} + \frac{y}{ \eta } \biggl(1 - \frac{1}{3}k\eta {{ \Vert f \Vert }_{\infty }} \biggr) + \frac{ {ky}}{3}{{ \Vert f \Vert }_{\infty }}} \biggr)} \\ &\quad \ge \sum_{k = 1}^{r} {{\alpha _{k}} \biggl( {k \sqrt{2{V_{f}}y} + \frac{{ky}}{3}{{ \Vert f \Vert }_{\infty }}} \biggr)} = \Biggl( \sum_{k = 1}^{r} {k{\alpha _{k}}} \Biggr) \biggl[ \sqrt{2{V_{f}}y} + \frac{y}{3}{ \Vert f \Vert _{\infty }} \biggr]. \end{aligned} $$
(14)

Applying (14) and optimizing over η in (13), we have

$$\begin{aligned} \begin{aligned}[b] &P \Biggl( { \int _{E} {f(x)} \Biggl[C{P_{r}}(dx) - \Biggl( \sum_{k = 1}^{r} {k{\alpha _{k}}} \Biggr)\lambda (x)\,dx \Biggr] \ge \Biggl(\sum _{k = 1}^{r} {k{\alpha _{k}}} \Biggr) \biggl[\sqrt{2y{V_{f}}} + \frac{y}{3}{{ \Vert f \Vert } _{\infty }} \biggr]} \Biggr) \\ &\quad \le {e^{ - y}}. \end{aligned} \end{aligned}$$
(15)

Letting \(r \to \infty \) in (15), \(C{P_{r}}(A) \xrightarrow{d} CP(A)\), and we finally prove inequality (7). □

4 Application

4.1 Application to KKT conditions

For negative binomial random variable NB, the probability mass function (p.m.f.) is

$$ P({NB} = n) = \frac{{\varGamma (n + s)}}{{\varGamma (s)n!}}{(1 - q)^{s}} {q ^{n}}\quad \bigl(q \in (0,1),n \in \mathbb{N} \bigr), $$

where s is a positive real number. When s is a positive integer, the negative binomial distribution reduces to the Pascal distribution, which models the number of failures before the s-th success in repeated mutually independent Bernoulli trials (with probability of success \(p=1-q\)).

The characteristic function of NB is

$$ {\varphi _{{NB}}}(t) = \exp \biggl\{ \log { \biggl( { \frac{{1 - q}}{{1 - q{e ^{\mathrm{{i}}t}}}}} \biggr)^{s}} \biggr\} = \exp \Biggl\{ \sum _{k = 1} ^{\infty }{ - s\log (1 - q)} \cdot \frac{{{q^{k}}}}{{ - k\log (1 - q)}} \bigl({e^{\mathrm{{i}}kt}} - 1 \bigr) \Biggr\} . $$
(16)

Here the parametrization of DCP is \({NB} \sim \operatorname{DCP}({\alpha _{1}}\lambda ,{\alpha _{2}}\lambda ,\ldots )\) with \(\lambda = - s \log (1 - q)\), \({\alpha _{k}} = \frac{{{q^{k}}}}{{ - k\log (1 - q)}}\).
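
This parametrization can be checked numerically; the sketch below (ours; the values of s, q, the truncation level K and the simulation size are arbitrary) simulates \(\sum_{k\le K} k{N_{k}}\) with \(N_{k} \sim \operatorname{Poisson}(\lambda {\alpha _{k}})\) and compares its first two moments with those of the NB distribution, namely \(sq/(1-q)\) and \(sq/(1-q)^{2}\):

```python
import numpy as np

rng = np.random.default_rng(2)

s, q = 2.5, 0.4
lam = -s * np.log(1 - q)
K = 60                                    # truncation of the infinite decomposition
ks = np.arange(1, K + 1)
alpha = q**ks / (-ks * np.log(1 - q))     # alpha_k = q^k / (-k log(1 - q)), sums to ~1

n_sim = 200_000
N = rng.poisson(lam * alpha, size=(n_sim, K))
Y_dcp = N @ ks

# NumPy's negative_binomial(n, p) counts failures before the n-th success,
# i.e. P(Y = y) is proportional to (1 - p)^y p^n, matching the p.m.f. above with p = 1 - q.
Y_nb = rng.negative_binomial(s, 1 - q, size=n_sim)

print("means:     DCP", Y_dcp.mean(), " NB", Y_nb.mean(), " theory", s * q / (1 - q))
print("variances: DCP", Y_dcp.var(), " NB", Y_nb.var(), " theory", s * q / (1 - q) ** 2)
```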

Given the data \(\{Y_{i}, {x}_{i}\}_{i=1}^{n}\), the NB regression model supposes that the responses \(\{Y_{i}\}_{i=1}^{n}\) are NB distributed random variables with means \(\{\mu _{i}\}_{i=1}^{n}\), and the covariate vectors \({x}_{i}=(x_{i1},\ldots,x_{ip})^{T} \in \mathbb{R}^{p}\) are p-dimensional fixed (non-random) variables for simplicity; see the so-called NB2 model in Hilbe [13] for details. Based on the log-linear model in regression analysis, we usually suppose that \(\log \mu _{i} = {f_{0}}({{x}_{i}})\) is a linear function of \({{x}_{i}}\).

Two types of high-dimensional setting for count data regression are classified as follows.

  • Sparse approximation for nonparametric regression.

    Sometimes it is too rough and too simple to approximate \(\log \mu _{i}\) by a linear function of \({{x}_{i}}\); we denote the unknown relation by \({f_{0}}({{x}_{i}})\). The connection between \(\log \mu _{i}\) and \({{x}_{i}}\) is often unknown and hence unlikely to be linear. The dimension of the dictionary would be much larger than the sample size when f is extremely flexible and complex. In order to capture the unknown functional relation \({f_{0}}({{x}_{i}})\), we prefer to use a given dictionary (of orthogonal basis functions) \(\mathbb{D}=\{ {\phi _{j}}( \cdot )\} _{j = 1}^{p}\) such that the linear combination \(\hat{f}( {{x}_{i}}) = \sum_{j = 1}^{p} {{{\hat{\beta }}_{j}}} {\phi _{j}}( {x_{ij}})\) is a sparse approximation to \({f_{0}}({{x}_{i}})\), which contains the increasing-dimensional parameter \({\beta ^{*}}: = {(\beta _{1} ^{*},\ldots,\beta _{p}^{*})^{T}}\) with \(p: = {p_{n}} \to \infty \) as \(n \to \infty \). A crucial point about the orthogonal basis functions \(\mathbb{D}\) is that many \({f_{0}}({{x}_{i}})\) which have no sparse representation in a non-orthogonal basis can be represented as sparse linear combinations of \(\mathbb{D}\) in the orthogonal scenario; see Sect. 10 of Hastie et al. [12] for details. In practice, the advantage of the dictionary approach is that the flexible \({f_{0}}({{x}_{i}})\) may admit a good sparse approximation in some dictionaries while not in others. For high-dimensional count data, Ivanoff et al. [16] mentioned that: “The richer the dictionary, the sparser the decomposition, so that p can be larger than n and the model becomes high-dimensional”. Typical candidates for the dictionary are the Fourier basis, cosine basis, Legendre basis, wavelet basis (Haar basis), etc.

  • High-dimensional features in gene engineering.

    In the big data era, one challenging case we encounter in massive data sets is that the number of given covariates p is larger than the sample size n while the responses of interest are measured as counts. For example, in gene engineering, the covariates (features) are types of genes impacting specific count phenotypes. NB regression is a flexible framework for the analysis of over-dispersed RNA-seq count data; see Li [23], Mallick and Tiwari [25] for more details.

We assume that the expectation of \(Y_{i}\) will be related to \({\phi ^{T}({{x}_{i}}){{\beta }}}\) after a log-transformation,

$$ {\mu _{i}} := \exp \Biggl\{ \sum_{j = 1}^{p} {{\beta _{j}}} {\phi _{j}}( {x_{ij}}) \Biggr\} := {e^{\phi ^{T}({{x}_{i}}){{\beta }}}}=: h({x_{i}}), $$

where \(\phi ({{x}_{i}})=(\phi _{1}(x_{i1}),\ldots,\phi _{p}(x_{ip}))^{T}\) with covariate \(\{\phi _{j}(x_{ij})\}\) being the jth component of \(\phi ({{x}_{i}})\). Here \(\{ {\phi _{j}}( \cdot )\} _{j = 1}^{p}\) are given transformations of the bounded covariates. A trivial example is \({\phi _{j}}(x) = x\).

Let \(\{Y_{i}\}_{i=1}^{n}\) be NB random variables with failure parameters \(\{{q_{i}} = \frac{{{\mu _{i}}}}{{\theta + {\mu _{i}}}}\}\), where \(\theta >0\) is an unknown dispersion parameter which can be estimated (see Hilbe [13]). Then the p.m.f.s of \(\{Y_{i}\}_{i=1}^{n}\) are

$$ \begin{aligned}[b] f({y_{i}};{\mu _{i}}, \theta ) &= \frac{{\varGamma (\theta + {y_{i}})}}{ {\varGamma (\theta ){y_{i}}!}}{ \biggl(\frac{{{\mu _{i}}}}{{\theta + {\mu _{i}}}} \biggr)^{ {y_{i}}}} { \biggl(\frac{\theta }{{\theta + {\mu _{i}}}} \biggr)^{\theta }} \\ &\propto \exp \bigl\{ {y_{i}}\phi ^{T}({{x}_{i}}) {\beta } - (\theta + {y_{i}}) \log \bigl(\theta + e^{\phi ^{T}({{x}_{i}}){\beta }} \bigr) \bigr\} , \end{aligned} $$
(17)

where \(\mu _{i} >0\) is the parameter with \(\mu _{i}=\mathrm{{E}}{Y_{i}}\) and \(\operatorname{{Var}} Y_{i}=\mu _{i}+\frac{\mu _{i}^{2}}{\theta } > \mathrm{ {E}}{Y_{i}}\).

The log-likelihood of the responses \(\{Y_{i}\}_{i=1}^{n}\) is

$$ l({\beta } )=: \log \Biggl[ \prod _{i=1}^{n}f \bigl(Y_{i};f_{0}({{x}_{i}}), \theta \bigr) \Biggr]=\sum_{i=1}^{n} \bigl[Y_{i}\phi ^{T}({{x}_{i}}){ \beta } -( \theta +Y_{i})\log \bigl(\theta +e^{\phi ^{T}({{x}_{i}}){\beta } } \bigr) \bigr]. $$
(18)

Let \(R(y,\beta ,x) = y{\phi ^{T}}(x)\beta - (\theta + y)\log (\theta + {e^{{\phi ^{T}}(x)\beta }})\) and denote the negative average empirical risk function by \(\ell ({\beta })=- \frac{1}{n} l ({\beta })= -\mathbb{P}_{n} R(Y,\beta ,x)\).

Having obtained high-dimensional covariates, a principal task is to select important variables. Here, we prefer the weighted \(\ell _{1}\)-penalized likelihood principle for sparse NB regression,

$$ \hat{{\beta }}=\operatorname*{argmin}_{{\beta } \in \mathbb{R}^{p}} \Biggl\lbrace \ell ({ \beta } )+ \sum_{j=1}^{p} w_{j} \vert {{\beta _{j}}} \vert \Biggr\rbrace , $$
(19)

where \(\{ {w_{j}}\} _{j = 1}^{p}\) are data-dependent weights which are supposed to be specified in the following discussion.

By using KKT conditions (i.e. the first order condition in convex optimization; see p. 68 of Bühlmann and van de Geer [5]), the β̂ is a solution of (19) iff β̂ satisfies first order conditions

$$ \textstyle\begin{cases} - {\dot{\ell }_{j}}(\hat{{{\beta }}}) = {w_{j}} \operatorname{{sign}}( {\hat{\beta }_{j}}) & \mbox{if } {\hat{\beta } _{j}} \ne 0, \\ \vert {\dot{\ell }_{j}}(\hat{{{\beta }}}) \vert \le {w _{j}} &\mbox{if } {\hat{\beta }_{j}} = 0, \end{cases} $$
(20)

where \({\dot{\ell }_{j}}(\hat{{{\beta }}}) := \frac{{\partial \ell (\hat{{{\beta }}})}}{{\partial {{\hat{\beta }}_{j}}}}\) (\(j = 1,2,\ldots,p\)).

Let \(\varPhi ({X})\) be the \(n \times p\) dictionary design matrix given by

$$ \varPhi ({X}) := \bigl(\phi _{j}(x_{ij}) \bigr)_{1\le i\le n,\,1\le j\le p} = \begin{pmatrix} \phi _{1}(x_{11}) & \cdots & \phi _{p}(x_{1p}) \\ \vdots & \ddots & \vdots \\ \phi _{1}(x_{n1}) & \cdots & \phi _{p}(x_{np}) \end{pmatrix} =: \bigl(\phi ^{T}(x_{1}),\ldots,\phi ^{T}(x_{n}) \bigr)^{T}. $$
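
A minimal sketch (ours; the cosine dictionary, the uniform covariates and the sizes are arbitrary illustrations) of how \(\varPhi ({X})\) can be assembled column-wise, with the j-th dictionary function applied to the j-th raw covariate as in the display above:

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 100, 8
X = rng.uniform(0.0, 1.0, size=(n, p))      # raw covariates x_{ij} in [0, 1]

# A toy dictionary: phi_j(u) = cos(pi * j * u); any bounded basis (Fourier, Haar, ...) would do.
def phi(j, u):
    return np.cos(np.pi * j * u)

# Phi(X)_{ij} = phi_j(x_{ij})
Phi = np.column_stack([phi(j, X[:, j - 1]) for j in range(1, p + 1)])
print(Phi.shape)    # (n, p)
```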

The KKT conditions imply \(\vert {\dot{\ell }_{j}}(\hat{\beta })\vert \le w _{j}\) for all j. Moreover, evaluating the score at the true parameter (so that \(\mathrm{{E}}{Y_{i}} = {e^{{\phi ^{T}}({{x}_{i}})\beta }}\)), we have

$$\begin{aligned} \dot{\ell }(\beta ) &:= \frac{{\partial \ell (\beta )}}{{\partial \beta }} = -\frac{1}{n} \sum _{i = 1}^{n} \phi ({{{x}_{i}}}) \biggl[ {Y_{i}} - (\theta + {Y_{i}}) \frac{{{e^{{\phi ^{T}}({{x}_{i}})\beta }}}}{ {\theta + {e^{{\phi ^{T}}({{x}_{i}})\beta }}}} \biggr] \\ & = -\frac{1}{n} \sum_{i = 1}^{n} {\frac{{\phi ({{x}_{i}})( {Y_{i}} - {e^{{\phi ^{T}}({{x}_{i}})\beta }})\theta }}{{\theta + {e ^{{\phi ^{T}}({{x}_{i}})\beta }}}}} = -\frac{1}{n}\sum _{i = 1} ^{n} {\frac{{\phi ({{x}_{i}})({Y_{i}} - \mathrm{{E}}{Y_{i}})\theta }}{ {\theta + \mathrm{{E}}{Y_{i}}}}}. \end{aligned}$$

The Hessian matrix of the log-likelihood function is

$$ \ddot{\ell }(\beta ):= \frac{{{\partial ^{2}}\ell (\beta )}}{{\partial \beta \partial {\beta ^{T}}}} = \frac{1}{n}\sum_{i = 1}^{n} \phi ({{x}_{i}}){\phi ^{T}}({{x}_{i}}) \frac{{\theta (\theta + {Y_{i}}) {e^{{\phi ^{T}}({{x}_{i}})\beta }}}}{{{{(\theta + {e^{{\phi ^{T}}({{x} _{i}})\beta }})}^{2}}}}. $$
(21)
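
The following sketch (ours, not the authors' implementation; the synthetic data, the constant weights and the step size are arbitrary choices) codes \(\ell (\beta )\) from (18), its gradient and Hessian as displayed above, and a plain proximal-gradient (ISTA-type) loop for the weighted Lasso problem (19); the weighted soft-thresholding update makes the solution satisfy the KKT conditions (20) approximately, which is checked at the end:

```python
import numpy as np

rng = np.random.default_rng(4)

# synthetic NB data with a sparse truth (purely illustrative)
n, p, theta = 200, 30, 4.0
Phi = rng.normal(size=(n, p)) / np.sqrt(n)
beta_true = np.zeros(p); beta_true[:3] = [1.0, -0.8, 0.5]
mu = np.exp(Phi @ beta_true)
Y = rng.negative_binomial(theta, theta / (theta + mu))     # q_i = mu_i / (theta + mu_i)

def ell(beta):                  # negative average log-likelihood, Eq. (18) divided by -n
    eta = Phi @ beta
    return -np.mean(Y * eta - (theta + Y) * np.log(theta + np.exp(eta)))

def ell_grad(beta):             # gradient formula displayed above
    eta = Phi @ beta
    resid = Y - (theta + Y) * np.exp(eta) / (theta + np.exp(eta))
    return -Phi.T @ resid / n

def ell_hess(beta):             # Hessian formula (21)
    eta = Phi @ beta
    d = theta * (theta + Y) * np.exp(eta) / (theta + np.exp(eta)) ** 2
    return (Phi * d[:, None]).T @ Phi / n

# weighted Lasso (19) via proximal gradient with weighted soft-thresholding
w = np.full(p, 0.05)            # weights w_j (a data-driven choice is discussed in Sect. 4.1)
# the Hessian weight is at most (theta + Y_i)/4 for every beta, giving a valid Lipschitz bound
lip = np.linalg.eigvalsh((Phi * ((theta + Y) / 4.0)[:, None]).T @ Phi / n).max()
step = 1.0 / lip

beta = np.zeros(p)
for _ in range(2000):
    z = beta - step * ell_grad(beta)
    beta = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)

# KKT conditions (20) at the solution
g = ell_grad(beta)
active = beta != 0
kkt_active = np.abs(g + w * np.sign(beta))[active]
kkt_inactive = np.abs(g)[~active] - w[~active]
print("active set,   max |grad_j + w_j sign(beta_j)| (should be ~ 0):",
      kkt_active.max() if kkt_active.size else 0.0)
print("inactive set, max |grad_j| - w_j (should be <= 0):",
      kkt_inactive.max() if kkt_inactive.size else -np.inf)
```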

Let \({d^{*}} = \vert \{ j:\beta _{j}^{*} \ne 0\} \vert \); the true coefficient vector \({{{\beta }} ^{*}}\) is defined by

$$ {{{\beta }} ^{{{*}}}} = \operatorname*{argmin}_{{\beta } \in {{\mathbb{R}}^{p}}} { \mathrm{{E}}} R(Y,X,{{\beta }}). $$
(22)

For the optimization problem (19) of the \(\ell _{1}\)-penalized empirical likelihood, the KKT conditions are satisfied by the estimated parameter. Here we use the true-parameter version of the KKT conditions

$$ \bigl\vert {\dot{\ell }_{j}} \bigl({ {{ \beta }} ^{{{*}}}} \bigr) \bigr\vert = \Biggl\vert \frac{1}{n}\sum_{i = 1}^{n} { \frac{{{\phi _{j}} ({{x}_{i}})( {Y_{i}} - \mathrm{{E}}{Y_{i}})\theta }}{{\theta + \mathrm{{E}}{Y_{i}}}}} \Biggr\vert \le {w_{j}} $$
(23)

to approximate the estimated-parameter version of the KKT conditions.

When \(\theta \to \infty\) (the Poisson limit), this approach of defining data-driven weights \({w_{j}}\) was proposed by Ivanoff et al. [16] for Poisson regression, using the Poisson point process version of the concentration inequality in Theorem 3.1. Similarly, in NB regression the \({w_{j}}\) are determined such that the event (23) holds with high probability for each j, by applying Corollary 3.1.

Recall the weighted Poisson decomposition of the NB point processes

$$ {NB}(A) := \sum_{k = 1}^{\infty }{k{N_{k}}(A)}, $$

where \({N_{k}}(A)\) is an inhomogeneous Poisson point process with intensity \({{\alpha _{k}}\int _{A} {\lambda (x)} \,dx}\) and \({\alpha _{k}} = \frac{{{q^{k}}}}{{ - k\log (1 - q)}}\) defined in (16).

Let \(\mu (A): = \int _{A} {\lambda (x)} \,dx\), the m.g.f. of \({NB}(A)\) is written as

$$ {M_{{NB}(A)}}(\theta )=\exp \Biggl\{ \sum _{k = 1}^{\infty }{{\alpha _{k}}\mu (A) \bigl({e^{ - k\theta }} - 1 \bigr)} \Biggr\} . $$
(24)

For the subsequent step, let \({{\tilde{\phi }}_{j}}({x_{i}}): = \frac{ {{\phi _{j}}({x_{i}})\theta }}{{\theta + \mathrm{{E}}{Y_{i}}}} \le {\phi _{j}}({x_{i}})\) and we want to evaluate the complementary events of the true parameter version of the KKT conditions:

$$\begin{aligned} &P \Biggl( { \Biggl\vert {\sum_{i = 1}^{n} {\frac{{{\phi }_{j} ({{{x}} _{i}})({Y_{i}} - \mathrm{E}{Y_{i}})\theta }}{{\theta + \mathrm{E} {Y_{i}}}}} } \Biggr\vert \ge {w_{j}}} \Biggr) \\ &\quad = P \Biggl( { \Biggl\vert \sum_{i = 1}^{n} {{\tilde{\phi }}_{j} ({{{x}}_{i}}) ({Y_{i}} - \mathrm{{E}}{Y_{i}})} \Biggr\vert \ge {w_{j}}} \Biggr) \quad \text{for } j = 1,2,\ldots,p. \end{aligned}$$
(25)

The aim is to find the values \(\{ {w_{j}}\} _{j = 1}^{p}\) by the concentration inequalities we proposed. Then it is sufficient to estimate the probability in the right-hand side of the above inequality.

Let \(\mu ({A})\) be the Lebesgue measure for \(A \subseteq \mathbb{R}^{d}\) and let \(N_{i,k}({A})\) be a Poisson random measure with rate \(\frac{{{\alpha _{k}}(i)\mu ({A})}}{{{\mu _{i}}}}\), where \({\mu _{i}} = \sum_{k = 1}^{\infty}{k{\alpha _{k}}(i)}\) with \({\alpha _{k}}(i) := \frac{{q_{i}^{k}}}{{ - k\log (1 - {q_{i}})}}\) and \(\{{q_{i}} := \frac{{{\mu _{i}}}}{{\theta + {\mu _{i}}}}\}_{i=1}^{n}\). We define the independent NB point processes \(NB_{i}(A)\) as:

$$NB_{i}(A) =: \sum_{k = 1}^{\infty} {k{N_{i,k}}(A)},\quad\text{with }{\mathrm{E}}[NB_{i}({A})] = \sum_{k = 1}^{\infty}k \frac{{{\alpha _{k}}(i)\mu ({A})}}{{{\mu _{i}}}} = \mu ({A}) $$

where the \(\{{N_{i,k}}(A)\}\) are independent for each k and i.

Writing \({[ {0,1} ]^{d}} = \bigcup _{i = 1}^{n}{S_{i}}\) as a disjoint union, we assume that the intensity function

$$\lambda (t) = \sum_{i = 1}^{n} {\frac{{h({x_{i}})}}{{\mu ({S_{i}})}}} {{\mathrm{I}}_{{S_{i}}}}(t) $$

is of histogram type (Baraud and Birgé [2]), i.e. a piecewise constant intensity function.

We now apply Corollary 3.1. Note that

$$\mu ({S_{i}}) = \int_{{S_{i}}} {\lambda (t)d} t ={h({x_{i}})}:={\mathrm{E}} Y_{i} $$

by the definition of \(\lambda (t)\) above.

Thus, both \((N{B}_{1}({S_{1}}),\ldots,N{B}_{n}({S_{n}}))\) and \(({Y_{1}},\ldots,{Y_{n}})\) have the same law by the weighted Poisson decomposition.

Choosing \(f(x)={g_{j}}(x) = \sum_{l = 1}^{n} {{{\tilde{\phi}}_{j}}({{{x}}_{l}}){{\mathrm{1}}_{{S_{l}}}}(x)} \), we obtain

$$\sum_{i = 1}^{n} \int_{{S_{i}}} {g_{j}(x)} NB_{i}(dx) = \sum_{i = 1}^{n} {\int_{{S_{i}}} {\sum_{l = 1}^{n} {{{\tilde{\phi}}_{j}}({{{x}}_{l}}){{\mathrm{I}}_{{S_{l}}}}(x)} } NB_{i}(dx) = \sum_{i = 1}^{n} {{{\tilde{\phi}}_{j}}({{{x}}_{i}}){Y_{i}}} } $$

for each j.

Assume that there exist constants \(C_{1}\), \(C_{2}\) such that

$$ {C_{1}} = \max _{1 \le i \le n}\mu_{i} = \max _{1 \le i \le n} \sum_{k = 1}^{\infty}{k{\alpha _{k}}(i)} = O(1)\quad\text{and}\quad O(1) = \max _{1 \le i \le n} h({x_{i}}) \le {C_{2}}. $$
(26)

Then

$$\begin{aligned} {V_{i,{g_{j}}}} = \int_{{S_{i}}} {g_{j}^{2}(x)} \lambda (x)\,dx &= \int_{{S_{i}}} {\sum_{l = 1}^{n} {\tilde{\phi}_{j}^{2}({x_{l}}){{\mathrm{I}}_{{S_{l}}}}(x)} } \sum_{i = 1}^{n} {\frac{{h({x_{i}})}}{{\mu ({S_{i}})}}} {{\mathrm{I}}_{{S_{i}}}}(x)\,dx\\ & = \tilde{\phi}_{j}^{2}({x_{i}})h({x_{i}}) \le {C_{2}}\max_{1 \le i \le n} \tilde{\phi}_{j}^{2}({x_{i}}) \le {C_{2}}\max_{1 \le i \le n} \phi _{j}^{2}({x_{i}}). \end{aligned}$$

Hence

$$\begin{aligned}& P \Biggl( {\frac{1}{n} \Biggl\vert {\sum _{i = 1}^{n} {{{\tilde{\phi }} _{j}}({\mathbf{{x}}_{i}}) ({Y_{i}} - \mathrm{E}{Y_{i}})} } \Biggr\vert \ge {C_{1}}\biggl[\sqrt{2y {C_{2}\max_{1 \le i \le n} \phi _{j}^{2}({x_{i}})}} + \frac{y}{3}{{ \bigl\Vert {{g_{j}}(x)} \bigr\Vert }_{\infty }}\biggr]} \Biggr) \\& \quad \le P \Biggl( { \Biggl\vert {\frac{1}{n}\sum _{i = 1}^{n} { \biggl\{ { \int _{{S_{i}}} {g(x)} \bigl[{NB}_{i}(dx) - {\mu _{i}}\lambda (x)\,dx\bigr]} \biggr\} } } \Biggr\vert \ge \frac{1}{n}\sum_{i = 1}^{n} {{ \mu _{i}}\biggl[\sqrt{2y {V_{i,{g_{j}}}}} + \frac{y}{3}{{ \bigl\Vert {{g_{j}}(x)} \bigr\Vert } _{\infty }}\biggr]} } \Biggr). \end{aligned}$$

Now we let \({w_{j}} = {C_{1}}[\sqrt{2y{C_{2}\max_{1 \le i \le n} \phi _{j}^{2}({x_{i}})}} + \frac{y}{3}{ \Vert {{g_{j}}(x)} \Vert _{\infty }}]\).

From (25), we apply the concentration inequality of Corollary 3.1 to the last probability in the above display. This implies

$$\begin{aligned}& P \Biggl( {\frac{1}{n} \Biggl\vert {\sum _{i = 1}^{n} {\frac{{{\phi _{j}}({\mathbf{{x}}_{i}})({Y_{i}} - \mathrm{E}{Y_{i}})\theta }}{{\theta + \mathrm{E} {Y_{i}}}}} } \Biggr\vert \ge {w_{j}}} \Biggr) \\& \quad \le P \Biggl( { \Biggl\vert { \frac{1}{n}\sum_{i = 1}^{n} { \biggl\{ { \int _{{S_{i}}} {g(x)} \bigl[{NB}_{i}(dx) - {\mu _{i}} \lambda (x)\,dx \bigr]} \biggr\} } } \Biggr\vert \ge {w_{j}}} \Biggr) \le 2{e^{ - ny}}. \end{aligned}$$

Taking \(y = \frac{\gamma }{n}\log p\) with \(\gamma >0\), we have

$$ P \Biggl( {\frac{1}{n} \Biggl\vert {\sum _{i = 1}^{n} {\frac{{{\phi _{j}}({{{x}}_{i}})({Y_{i}} -\mathrm{E}{Y_{i}})\theta }}{{\theta + \mathrm{E}{Y_{i}}}}} } \Biggr\vert \ge {w_{j}}} \Biggr) \le \frac{2}{{{p^{\gamma }}}}. $$
(27)

The weights are proportional to

$$ {w_{j}} \propto \sqrt{\frac{{2\gamma \log p}}{n}} C_{2}\Bigl( \max _{i = 1,\ldots,n} \bigl\vert {{\phi _{j}}({ {{x}}_{i}})} \bigr\vert \Bigr) + \frac{{\gamma \log p}}{{3n}} \max _{i = 1,\ldots,n} \bigl\vert {{\phi _{j}}({ {{x}}_{i}})} \bigr\vert . $$

Here γ can be treated as a tuning parameter. In high-dimensional statistics, we often have the dimension assumption \(\frac{{\log p}}{{n}}=o(1)\) and \(\max _{i = 1,\ldots,n} \vert {{\phi _{j}}({{\mathbf{x}}_{i}})} \vert <\infty\); then the term \(\frac{{\gamma \log p}}{{3n}} \max _{i = 1,\ldots,n} \vert {{\phi _{j}}({{\mathbf{x}}_{i}})} \vert \) is negligible in the weights compared to the term \(\sqrt {\frac{{2\gamma \log p}}{n}}\).
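
A short sketch (ours; γ, \(C_{2}\) and the design matrix are placeholders) of how the data-driven weights could be computed from the proportionality relation above, keeping the second term even though it is asymptotically negligible:

```python
import numpy as np

def lasso_weights(Phi, gamma=1.0, C2=1.0):
    """w_j ~ sqrt(2*gamma*log(p)/n) * C2 * max_i|phi_j(x_i)| + gamma*log(p)/(3n) * max_i|phi_j(x_i)|."""
    n, p = Phi.shape
    col_max = np.abs(Phi).max(axis=0)        # max_i |phi_j(x_i)| for each column j
    return (np.sqrt(2 * gamma * np.log(p) / n) * C2 + gamma * np.log(p) / (3 * n)) * col_max

rng = np.random.default_rng(5)
Phi = rng.uniform(-1.0, 1.0, size=(100, 20))
print(lasso_weights(Phi)[:5])
```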

Then, for large dimension p, the above upper bound of the probability inequality tends to zero. Therefore, the event of the KKT condition holds with high probability.

Based on (27) and some regular assumptions such as compatibility factor (a generalized restricted eigenvalue condition), we will derive the following \(\ell _{1}\)-estimation error with high probability:

$$ \bigl\Vert {\hat{\beta }- {\beta ^{*}}} \bigr\Vert _{1} \le O_{p} \biggl({d^{*}}\sqrt{ \frac{ {\log p}}{n}} \biggr) $$

in the next section.

Under the restricted eigenvalue condition proposed by Bickel et al. [3], oracle inequalities for a weighted Lasso regularized Poisson regression are studied in Ivanoff et al. [16]. Since the negative binomial (with unknown dispersion) is neither sub-Gaussian nor a member of the exponential family, our extension is not straightforward. We employ the compatibility factor assumption, which renders the proof simpler than that under the restricted eigenvalue condition.

4.2 Oracle inequalities for the weighted Lasso in count data regressions

The ensuing subsections proceed in two steps: the first step is to establish two lemmas giving lower and upper bounds for the symmetric Bregman divergence in the weighted Lasso case; adopting the symmetric Bregman divergence, the second step derives the oracle inequalities by some inequality-scaling tricks.

4.2.1 Case of NB regression

The symmetric Bregman divergence (sB divergence) is defined by

$$ {D^{s}}(\hat{\beta },\beta ) = {(\hat{\beta }- \beta )^{T}} \bigl( \dot{\ell }(\hat{\beta }) - \dot{\ell }(\beta ) \bigr); $$

see Nielsen and Nock [26].

The following lemma is a crucial inequality for deriving the oracle inequality under the compatibility factor assumption, which is also used in Yu [37] for the Cox model. Let \({\beta ^{*}}\) be the true coefficient vector defined by (22), and let \(H = \{ j:\beta _{j}^{*} \ne 0\} \), \({H^{c}} = \{ j:\beta _{j}^{*} = 0\}\). Then we have the following.

Lemma 4.1

Let β̂ be the estimate in (19), set \({l_{m}} := {\Vert {\dot{\ell }({\beta ^{{{*}}}})} \Vert _{\infty }}\), and let \(\tilde{\beta }= \hat{\beta }- {\beta ^{*}}\). Then we have

$$ ({w _{\min }} - {l_{m}}){ \Vert {{{\tilde{\beta }}_{{H^{c}}}}} \Vert _{1}} \le {D^{s}} \bigl( \hat{\beta },\beta ^{*} \bigr) + ({w _{\min }} - {l_{m}}){ \Vert {{{\tilde{\beta }}_{{H^{c}}}}} \Vert _{1}} \le ({w _{\max }} + {l_{m}}) { \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}}, $$

where \({w_{\max }} = \max_{1 \le j \le p} {w_{j}}\), \({w_{\min }} = \min_{1 \le j \le p} {w_{j}}\).

Moreover, in the event \(\{{l_{m}} \le \frac{{\xi - 1}}{{\xi + 1}} {w_{\min }}\}\) for any \(\xi > 1\) and with \({\Vert {\varPhi ({{X}})} \Vert _{\infty }} \le K\), with probability at least \(1-{p^{1 - O(1){r ^{2}}}}\), we have

$$ {\Vert {{{\tilde{\beta }}_{{H^{C}}}}} \Vert _{1}} \le \frac{{\xi {w _{\max }}}}{ {{w _{\min }}}}{ \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}}. $$
(28)

Proof

The left inequality is obtained from \({D^{s}}(\hat{\beta },{\beta ^{{ {*}}}}) = {(\hat{\beta }- {\beta ^{{{*}}}})^{T}}[\dot{\ell }( \hat{\beta }) - \dot{\ell }({\beta ^{{{*}}}})] \ge 0\), which holds due to the convexity of the function \(\ell (\beta )\). Note that

$$\begin{aligned} {{\tilde{\beta }}^{T}} \bigl\{ \dot{\ell } \bigl({\beta ^{{{*}}}} + \tilde{\beta } \bigr) - \dot{\ell } \bigl({\beta ^{{{*}}}} \bigr) \bigr\} &= \sum_{j \in {H^{c}}} {{{ \tilde{\beta }}_{j}}} { \dot{\ell }_{j} \bigl( {\beta ^{{{*}}}} + \tilde{\beta } \bigr)} + \sum _{j \in H} {{{\tilde{\beta }}_{j}}}{ \dot{\ell }_{j} \bigl({ \beta ^{{{*}}}} + \tilde{\beta } \bigr)} + {{\tilde{\beta }}^{T}} \bigl( - \dot{\ell } \bigl({\beta ^{{ {*}}}} \bigr) \bigr) \\ & \le \sum_{j \in {H^{c}}} { - {w_{j}} {{\hat{\beta }}_{j}}} \operatorname{{sign}}({{ \hat{\beta }}_{j}}) + \sum_{j \in H} {w _{j}} { \vert {{\tilde{\beta }}_{j}}\vert } + \Vert \tilde{\beta } \Vert _{1}{l_{m}} \\ & = \sum_{j \in {H^{c}}} { - {w _{j}} \vert {{{\hat{\beta }}_{j}}}\vert } + \sum _{j \in H} {{w _{j}} \vert {{{\tilde{\beta }}_{j}}}\vert } + {l_{m}} \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1} + {l_{m}} \Vert {{{\tilde{\beta }}_{{H^{c}}}}} \Vert _{1} \\ & \le \sum_{j \in {H^{c}}} { - {w _{\min }} \vert {{{\hat{\beta }} _{j}}} \vert } + \sum _{j \in H} {{w _{\max }} \vert {{{\tilde{ \beta }} _{j}}} \vert } + {l_{m}} \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1} + {l_{m}} \Vert {{{\tilde{\beta }}_{{H^{c}}}}} \Vert _{1} \\ &= ({l_{m}} - {w _{\min }}) \Vert {{{\tilde{\beta }}_{{H^{c}}}}} \Vert _{1} + ({w _{\max }} + {l_{m}}) \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}. \end{aligned}$$

Here the second-to-last inequality follows from the KKT conditions and the dual norm inequality.

Rearranging this display yields the first chain of inequalities in the lemma. In the event \(\{{l_{m}} \le \frac{{\xi - 1}}{{\xi + 1}}{w_{\min }}\}\), this chain turns into

$$ \frac{{2{w _{\min }}}}{{\xi + 1}} \Vert {{{\tilde{\beta }}_{{H^{c}}}}} \Vert _{1} \le {D^{s}} \bigl( \hat{\beta },\beta ^{*} \bigr) + \frac{{2{w _{\min }}}}{ {\xi + 1}} \Vert {{{ \tilde{\beta }}_{{H^{c}}}}} \Vert _{1} \le \frac{{2 \xi {w _{\max }}}}{{\xi + 1}} \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1} $$
(29)

due to \({w _{\min }} - {l_{m}} \ge \frac{{2{w _{\min }}}}{{\xi + 1}}\) and \({w _{\max }} + {l_{m}} \le \frac{{2\xi }}{{\xi + 1}}{w _{\max }}\); since \({D^{s}}(\hat{\beta },\beta ^{*}) \ge 0\), inequality (28) follows.

For the event \(\{ \Vert \dot{\ell }({\beta ^{{{*}}}})\Vert _{\infty } \le \frac{{\xi - 1}}{{\xi + 1}}{w_{\min }}\}\), applying the NB concentration inequality (9), we have

$$\begin{aligned} &P \biggl( { \bigl\Vert \dot{\ell } \bigl({\beta ^{{{*}}}} \bigr) \bigr\Vert _{\infty } \ge \frac{ {\xi - 1}}{{\xi + 1}}{w_{\min }}} \biggr) \\ &\quad \le pP \Biggl( { \Biggl\vert {\sum_{i = 1}^{n} {\frac{{{\phi _{1}}({x_{i}})\theta }}{{( \theta + E{Y_{i}})}}({Y_{i}} - E{Y_{i}})} } \Biggr\vert \ge \frac{{\xi - 1}}{{\xi + 1}}n{w_{\min }}} \Biggr) \\ &\quad \le p\exp \biggl\{ { - n{{ \biggl( {\sqrt{\frac{{\xi - 1}}{{\xi + 1}} \cdot \frac{{n{w_{\min }}}}{{{c_{2n}}}} + \frac{{c_{1n}^{2}}}{ {4c_{2n}^{2}}}} - \frac{{{c_{1n}}}}{{2{c_{2n}}}}} \biggr)}^{2}}} \biggr\} \\ &\quad = p\exp \biggl\{ { - n{{ \biggl( {{{\frac{{\xi - 1}}{{\xi + 1}} \cdot \frac{{n{w_{\min }}}}{{{c_{2n}}}}} \Big/ {\sqrt{\frac{{\xi - 1}}{{\xi + 1}} \cdot \frac{{n{w_{\min }}}}{{{c_{2n}}}} + \frac{{c_{1n}^{2}}}{{4c_{2n} ^{2}}}} + \frac{{{c_{1n}}}}{{2{c_{2n}}}}}}} \biggr)}^{2}}} \biggr\} \\ &\quad \sim p\exp \bigl\{ { - nw_{\min }^{2}O(1)} \bigr\} \sim {p ^{1 - O(1){r^{2}}}}. \end{aligned}$$

where the second-to-last step is due to \(c_{1n}=O(n)=c_{2n}\) by assumption (26), and the last step follows by letting \({w_{\min }} = O(r\sqrt{\frac{{\log p}}{n}})\).

Here \(r>0\) is a suitable tuning parameter, such that the event \(\{ \Vert \dot{\ell }( {\beta ^{{{*}}}})\Vert _{\infty } \le \frac{{\xi - 1}}{{\xi + 1}} {w_{\min }}\}\) holds with probability tending to 1 as \(p,n \to \infty\). □

In order to derive Theorem 4.1, we adopt a crucial inequality obtained by an approach similar to Lemma 5 of Zhang and Jia [38], namely a Taylor-type expansion for convex log-likelihood functions.

Lemma 4.2

Let the Hessian matrix \(\ddot{\ell }(\beta )\) be given in (21), and, for \(\delta \in {\mathbb{R}}^{p}\), assume the identifiability condition: \({\phi ^{T}}({{x}_{i}})(\beta + \delta )={\phi ^{T}}({{x}_{i}})\beta \) implies \({\phi ^{T}}({{x}_{i}})\delta = 0\). Then we have

$$ {D^{s}}(\beta + \delta ,\beta ) \ge {\delta ^{T}} \ddot{\ell }(\beta ) \delta {e^{ - 2{{\Vert { \varPhi ({{X}})} \Vert }_{\infty }}\Vert \delta \Vert _{1} }}, $$

where \({\Vert {\varPhi ({X})} \Vert _{\infty }} = \max \{ \vert {\phi _{j}({x_{ij}})} \vert ;1 \le i \le n,1 \le j \le p\} \).

By the second part of Lemma 4.1, in the event \(\{ {{{\Vert {\dot{\ell }({\beta ^{{{*}}}})} \Vert }_{\infty }} \le \frac{{\xi - 1}}{{\xi + 1}}{w _{\min }}} \} \), the estimation error \(\tilde{\beta }= \hat{\beta }- {\beta ^{*}}\) belongs to the following cone set:

$$ \mathrm{S} (\eta ,H) = \bigl\{ {b} \in {\mathbb{R} ^{p}}:{ \Vert {{{b}_{{H^{c}}}}} \Vert _{1}} \le \eta { \Vert {{{b}_{H}}} \Vert _{1}} \bigr\} $$
(30)

with \(\eta = \frac{{\xi {w_{\max }}}}{{{w_{\min }}}}\).

Let \({d^{*}(\beta ^{*})} = : \vert {H(\beta ^{*})} \vert \) with \(H(\beta ^{*}) = :\{ j:{\beta ^{*}_{j}} \ne 0, \beta ^{*} = ({\beta ^{*} _{1}},\ldots,{\beta ^{*}_{j}},\ldots,{\beta ^{*}_{p}}) \in {\mathbb{R} ^{p}}\} \). Sometimes we write \(H(\beta ^{*})\) as H and \(d^{*}(\beta ^{*})\) as \(d^{*}\) if there is no ambiguity. Define the compatibility factor (see van de Geer [33]) of a \(p \times p\) nonnegative-definite matrix Σ as

$$ C(\eta ,H,\varSigma ) = \inf_{0 \ne b \in {\mathrm{{S}}}(\eta ,H)} \frac{{{({d^{*}})^{1/2}} {{({b^{T}}\varSigma b)}^{1/2}}}}{{{{ \Vert {{b_{H}}} \Vert }_{1}}}} > 0 \quad (\eta > 0), $$

where \(\mathrm{{S}}(\eta ,H)=\{ b \in {\mathbb{R}^{p}}:{\Vert {{b_{{H ^{c}}}}} \Vert _{1}} \le \eta {\Vert {{b_{H}}} \Vert _{1}}\}\) is the cone condition.

The compatibility factor is strongly linked to the restricted eigenvalue proposed by Bickel et al. [3],

$$ \operatorname{RE}(\eta ,H,\varSigma ) = \inf_{0 \ne {{{b}}} \in {\mathrm{{S}}}(\eta ,H)} \frac{ {{{({{{b}}^{T}}\varSigma {{{b}}})}^{1/2}}}}{{{ \Vert b \Vert _{2}}}} > 0\quad (\eta > 0). $$
(31)

Actually, by the Cauchy–Schwarz inequality, \({\Vert {{b_{H}}} \Vert _{1}} \le {({d^{o}(b)})^{1/2}}{\Vert {{b_{H}}} \Vert _{2}} \le {({d^{o}(b)})^{1/2}}{\Vert b \Vert _{2}}\), so it is evident that \({[C(\eta ,H,\varSigma )]^{2}} \ge [\operatorname{RE}(\eta , H,\varSigma )]^{2}\). In the proof of Theorem 4.1, the upper bounds in the oracle inequalities are sharper if we use the compatibility factor rather than the restricted eigenvalue.
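As a quick illustration (not part of the proof), the pointwise domination behind this inequality can be checked by Monte Carlo. The following Python/NumPy sketch uses hypothetical values of p, H and η, a random nonnegative-definite Σ, and the \(|H|\)-version of \(d^{o}(b)\):

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, eta = 20, 5, 3.0            # hypothetical dimension, |H| and cone parameter
H = np.arange(d)                  # support set H = {0, ..., d-1}
Hc = np.arange(d, p)              # complement H^c

A = rng.standard_normal((50, p))
Sigma = A.T @ A / 50              # a random nonnegative-definite p x p matrix

def cone_sample():
    """Draw b with ||b_{H^c}||_1 <= eta * ||b_H||_1, i.e. b in S(eta, H)."""
    b = rng.standard_normal(p)
    shrink = eta * np.abs(b[H]).sum() / np.abs(b[Hc]).sum()
    b[Hc] *= min(1.0, shrink) * rng.uniform()
    return b

comp_ratios, re_ratios = [], []
for _ in range(5000):
    b = cone_sample()
    quad = float(b @ Sigma @ b) ** 0.5
    comp_ratios.append(np.sqrt(d) * quad / np.abs(b[H]).sum())  # compatibility ratio
    re_ratios.append(quad / np.linalg.norm(b))                  # restricted-eigenvalue ratio

# ||b_H||_1 <= sqrt(d) * ||b||_2, so each compatibility ratio dominates the RE ratio
print(all(c >= r for c, r in zip(comp_ratios, re_ratios)))      # True
print(min(comp_ratios) >= min(re_ratios))                       # hence C^2 >= RE^2 empirically
```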

Next, let \(C(\eta ,H) = C(\eta ,H,\ddot{\ell }({\beta ^{*}}))\); then we have the following theorem, which is analogous to Zhang and Jia [38], who give oracle inequalities for the Elastic-net estimate in NB regression.

Theorem 4.1

Suppose \({\Vert {\varPhi ({{X}})} \Vert _{\infty }} \le K\) for some \(K > 0\), and let \(\tau = \frac{{K(\xi + 1){d^{*}}w_{\max }^{2}}}{{2{w_{\min }} {{ [ {C(\xi {w_{\max }}/{w_{\min }},H)} ]}^{2}}}} \le {\frac{1}{2}e^{ - 1}}\) under assumption (26). Set \({w_{\min }} = O(r\sqrt{\frac{{\log p}}{n}})\), where \(r>0\) is a suitable tuning parameter. Then, in the event \(\{ \Vert \dot{\ell }({\beta ^{{{*}}}}) \Vert _{\infty } \le \frac{{\xi - 1}}{{\xi + 1}}{w_{\min }}\}\), whose probability tends to 1 as \(p,n \to \infty \), we have

$$ \bigl\Vert {\hat{\beta }- {\beta ^{{{*}}}}} \bigr\Vert _{1} \le \frac{{{e^{ {2a_{\tau }}}}(\xi + 1){d^{*}}w_{\max }^{2}}}{{2{w_{\min }}{{ [ {C(\xi {w_{\max }}/{w_{\min }},H)} ]}^{2}}}}={O_{p}} \biggl({d^{*}}\frac{ {w_{\max }^{2}}}{{{w_{\min }}}} \biggr), $$

where \({a_{\tau }} \le \frac{1}{2}\) is the smaller solution of the equation \(a {e^{ - 2a }} = \tau \). Moreover,

$$ {D^{s}} \bigl(\hat{\beta },{\beta ^{*}} \bigr) \le \frac{{4{e^{2{a_{\tau }}}}{\xi ^{2}}{d^{*}}w_{\max }^{2}}}{{{{(\xi + 1)}^{2}}{{ [ {C(\xi {w_{ \max }}/{w_{\min }},H)} ]}^{2}}}}. $$

Proof

Let \(\tilde{\beta }= \hat{\beta }- {\beta ^{*}} \ne 0\) and \(b = \tilde{\beta }/{\Vert {\tilde{\beta }} \Vert _{1}}\), and observe that \(\ell ({\beta ^{*}} + {b}x)\) is a convex function in x since \(\ell (\beta )\) is convex. From the proof of (29) in Lemma 4.1, we have

$$ {{b}^{T}} \bigl[\dot{\ell } \bigl({\beta ^{*}} + {b}x \bigr) - \dot{\ell } \bigl({\beta ^{*}} \bigr) \bigr] + \frac{ {2{w_{\min }}}}{{\xi + 1}}{ \Vert {{{b}_{{H^{c}}}}} \Vert _{1}} \le \frac{ {2\xi {w_{\max }}}}{{\xi + 1}}{ \Vert {{{b}_{H}}} \Vert _{1}} $$
(32)

holding for \(x \in [0,{\Vert {\tilde{\beta }} \Vert _{1}}] \).

Since \({\ell (\beta )}\) is convex in β, \({{b}^{T}}[\dot{\ell }( {\beta ^{*}} + {b}x) - \dot{\ell }({\beta ^{*}})]\) is increasing in x, and the vector b belongs to \({\mathrm{{S}}}(\xi {w_{\max }}/{w_{\min }},H)\). Since \(\Vert {b} \Vert _{1} = 1\) and, by assumption, \({\Vert {\varPhi ({{X}})} \Vert _{ \infty }} \le K\), we have \({\Vert { \varPhi ({X}) } \Vert _{\infty }}\Vert {x{b}} \Vert _{1} \le Kx\). Hence, by Lemma 4.2, we obtain:

$$ x{b^{T}} \bigl[\dot{\ell } \bigl({\beta ^{*}} + {b}x \bigr) - \dot{\ell } \bigl({\beta ^{*}} \bigr) \bigr] \ge {x^{2}} {e^{ - 2{{\Vert { \varPhi ({{X}})} \Vert }_{\infty }}{\Vert {x{b}} \Vert }_{1}}} {{b}^{T}}\ddot{\ell } \bigl({\beta ^{*}} \bigr){b} \ge {x^{2}} {e^{ - 2Kx}} {{b}^{T}} \ddot{ \ell } \bigl({\beta ^{*}} \bigr){b}, $$

this implies

$$ {{b}^{T}} \bigl[\dot{\ell } \bigl({\beta ^{*}} + {b}x \bigr) - \dot{\ell } \bigl({\beta ^{*}} \bigr) \bigr] \ge x{e^{ - 2Kx}} {{b}^{T}}\ddot{\ell } \bigl({\beta ^{*}} \bigr){b}. $$
(33)

For the Hessian matrix evaluated at the true coefficient \({{\beta ^{*}}}\), the compatibility factor is written as \(\mathrm{ {C}}(\eta ,H)= :\mathrm{{C}}(\eta ,H,\ddot{\ell }({\beta ^{*}}))\). From the definition of the compatibility factor and the inequality (33), we have

$$\begin{aligned} Kx{e^{ - 2Kx}} { \bigl[ {C(\xi {w_{\max }}/{w_{\min }},H)} \bigr] ^{2}} \Vert {{{b}_{H}}} \Vert _{1}^{2}/{d^{*}} & \le Kx{e^{ - 2Kx}} {{b}^{T}} \ddot{\ell } \bigl({\beta ^{*}} \bigr){b} \\ (\text{by (33)}) & \le K{{b}^{T}} \bigl[\dot{\ell } \bigl({\beta ^{*}} + {b}x \bigr) - \dot{\ell } \bigl({\beta ^{*}} \bigr) \bigr] \\ (\text{by (32)}) & \le K \biggl(\frac{{2\xi {w_{\max }}}}{{\xi + 1}}{ \Vert {{{b}_{H}}} \Vert _{1}} - \frac{{2{w_{\min }}}}{{\xi + 1}}{ \Vert {{{b}_{{H^{c}}}}} \Vert _{1}} \biggr) \\ & = K \biggl[\frac{{2\xi {w_{\max }}}}{{\xi + 1}}{ \Vert {{{b}_{H}}} \Vert _{1}} - \frac{ {2{w_{\min }}}}{{\xi + 1}} \bigl(1 - { \Vert {{{b}_{H}}} \Vert _{1}} \bigr) \biggr] \\ & \le K\biggl[\frac{2}{{\xi + 1}}(\xi {w_{\max }} + {w_{\max }}){ \Vert {{{b} _{H}}} \Vert _{1}} - \frac{{2{w_{\min }}}}{{\xi + 1}} \biggr] \\ & = K \biggl[2{w_{\max }} { \Vert {{{b}_{H}}} \Vert _{1}} - \frac{{2{w_{\min }}}}{ {\xi + 1}} \biggr] \\ & \le \frac{{K(\xi + 1) \Vert {{{b}_{H}}} \Vert _{1}^{2}w_{\max }^{2}}}{ {2{w_{\min }}}}, \end{aligned}$$

where the last inequality follows from the arithmetic–geometric mean inequality \(\frac{{2{w_{\min }}}}{{\xi + 1}} + \frac{{(\xi + 1){\Vert {{{b}_{H}}} \Vert }_{1}^{2}w_{\max }^{2}}}{ {2{w_{\min }}}} \ge 2{w_{\max }}{\Vert {{{b}_{H}}} \Vert _{1}}\).

Then we have

$$ Kx{e^{ - 2Kx}} \le \frac{{K(\xi + 1){d^{*}}w_{\max }^{2}}}{{2{w_{ \min }}{{ [ {C(\xi {w_{\max }}/{w_{\min }},H)} ]}^{2}}}} = :\tau $$
(34)

for any \(x \in [0,{\Vert {\tilde{\beta }} \Vert _{1}}]\).

So, under the constraint \(x \in [0,{\Vert {\tilde{\beta}} \Vert _{1}}]\), let \({a_{\tau}}\) and \({b_{\tau}}\) (\({a_{\tau}}<0.5<b_{\tau}\)) be the two solutions of the equation \(z {e^{ - 2z}} = \tau \), and let \(S = \{x \in [0,{\Vert {\tilde{\beta}} \Vert _{1}}] : Kx \in ( - \infty ,{a_{\tau}}] \cup [{b_{\tau}}, + \infty )\}\) be the set of x satisfying (34).

Note that \({{b}^{T}}[\dot{\ell }( {\beta ^{*}} + {b}x) - \dot{\ell }({\beta ^{*}})]\) is an increasing function of x; hence, by the restriction condition in (32), S cannot be a disjoint union of two intervals, and thus S is a closed interval \([0,\tilde{x}]\). Since every \(x \in [0,{\Vert {\tilde{\beta }} \Vert _{1}}]\) lies in \([0,\tilde{x}]\), we have

$$ Kx \le K\tilde{x} \le {a_{\tau }}. $$

Using (34) again, the end point satisfies \(K\tilde{x}{e^{ - 2K\tilde{x}}} \le \tau \). Then, for all \(x \in [0,\tilde{x}] \), we have

$$ {\Vert {\tilde{\beta }} \Vert _{1}} = \Vert {{b}x} \Vert _{1} = \Vert {b} \Vert _{1} x = x \le \tilde{x} \le \frac{{{a_{\tau }}}}{K} = \frac{{{e^{2{a_{\tau }}}} \tau }}{K} = \frac{{{e^{2{a_{\tau }}}}(\xi + 1){d^{*}}w_{\max }^{2}}}{ {2{w_{\min }}{{ [ {C(\xi {w_{\max }}/{w_{\min }},H)} ]} ^{2}}}}. $$

It remains to show the oracle inequality of \({D^{s}}(\hat{\beta }, \beta ^{*})\). Let \(\delta ={b}x\) in Lemma 4.2. Using the definition of compatibility factor, and observing that \({a_{\tau }} \ge Kx\) and \(Kx \ge {\Vert { \varPhi ({{X}})} \Vert _{\infty }}\Vert \delta \Vert _{1}\), one derives

$$\begin{aligned} {e^{ - 2{a_{\tau }}}} { \bigl[ {C(\xi {w_{\max }}/{w_{\min }},H)} \bigr] ^{2}} \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}^{2}/{d^{*}} \le& {e^{ - 2{a _{\tau }}}} {{\tilde{\beta }}^{T}}\ddot{\ell } \bigl({\beta ^{*}} \bigr) \tilde{\beta }\le {e^{ - 2Kx}} {{\tilde{\beta }}^{T}} \ddot{\ell } \bigl({\beta ^{*}} \bigr)\tilde{\beta } \\ \le& {e^{ - 2{{ \Vert { \varPhi ({X}) } \Vert }_{\infty }} \Vert \delta \Vert _{1}}} {{\tilde{\beta }}^{T}}\ddot{\ell } \bigl({\beta ^{*}} \bigr) \tilde{\beta }. \end{aligned}$$

Applying Lemma 4.2, one immediately deduces that

$$\begin{aligned} {e^{ - 2{a_{\tau }}}} { \bigl[ {C(\xi {w_{\max }}/{w_{\min }},H)} \bigr] ^{2}} \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}^{2}/{d^{*}} \le {e^{ - 2{{ \Vert { \varPhi ({{X}})} \Vert }_{\infty }} \Vert \delta \Vert _{1}}} {{\tilde{\beta }}^{T}} \ddot{\ell } \bigl({\beta ^{*}} \bigr)\tilde{\beta } \le& {D^{s}} \bigl(\hat{\beta },\beta ^{*} \bigr) \\ \le& \frac{{2\xi {w_{\max }}}}{{\xi + 1}}{ \Vert {{{\tilde{\beta }} _{H}}} \Vert _{1}}, \end{aligned}$$

where the last inequality is from (32).

Canceling one factor of \({\Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}}\) on both sides of the inequality above, we get

$$ { \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}} \le \frac{{2{e^{2{a_{\tau }}}} \xi {d^{*}}{w_{\max }}}}{{(\xi + 1){{ [ {C(\xi {w_{\max }}/{w _{\min }},H)} ]}^{2}}}}. $$

Consequently,

$$ {D^{s}} \bigl(\hat{\beta },{\beta ^{*}} \bigr) \le \frac{{2\xi {w_{\max }}}}{{\xi + 1}}{ \Vert {{{\tilde{\beta }}_{H}}} \Vert _{1}} \le \frac{{4{e^{2{a_{\tau }}}} {\xi ^{2}}{d^{*}}w_{\max }^{2}}}{{{{(\xi + 1)}^{2}}{{ [ {C(\xi {w_{\max }}/{w_{\min }},H)} ]}^{2}}}}. $$

 □
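To see what the constant \(a_{\tau }\) looks like numerically, here is a minimal sketch (Python with SciPy); all numbers for ξ, \(d^{*}\), \(w_{\min }\), \(w_{\max }\), K and the compatibility factor are hypothetical placeholders. It solves \(a e^{-2a} = \tau \) for the smaller root and evaluates the resulting \(\ell _{1}\) bound of Theorem 4.1.

```python
import numpy as np
from scipy.optimize import brentq

# hypothetical placeholder values; in practice they come from the design and the weights
xi, d_star, K = 2.0, 3, 1.0
w_min, w_max = 0.02, 0.025
C_sq = 1.0                          # [C(xi * w_max / w_min, H)]^2

tau = K * (xi + 1) * d_star * w_max**2 / (2 * w_min * C_sq)
assert tau <= 0.5 * np.exp(-1), "Theorem 4.1 requires tau <= e^{-1}/2"

# a * exp(-2a) is increasing on (0, 1/2], so the smaller root a_tau lies there
a_tau = brentq(lambda a: a * np.exp(-2 * a) - tau, 0.0, 0.5)

l1_bound = np.exp(2 * a_tau) * (xi + 1) * d_star * w_max**2 / (2 * w_min * C_sq)
print(f"tau = {tau:.4f}, a_tau = {a_tau:.4f}, l1 bound = {l1_bound:.4f}")
```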

Theorem 4.1 indicates that, if we presume \({d^{*}} = O(1)\), the \(\ell _{1}\)-estimation error bound is of the order \(\sqrt{\frac{ {\log p}}{n}} \), and this convergence rate is rate-optimal in the minimax sense; see Raskutti et al. [28] and Li et al. [22]. Under some regularity conditions, the weighted Lasso estimates are therefore expected to be consistent in the \(\ell _{1}\)-norm even when the dimension p grows at the order \(\exp (o(n))\). Unlike the MLE, whose convergence rate is \(\frac{1}{{\sqrt{n} }}\), in the high-dimensional case the rate carries an extra \(\sqrt{\log p} \) factor, which is the price to pay for high dimensionality.

4.2.2 Case of Poisson regression

In this section, we use the same procedure as in Sect. 4.2.1 to treat the case of Poisson regression with a weighted Lasso penalty. The derivation of this type of oracle inequality is simpler than that in Ivanoff et al. [16]. Similar to Lemma 5 of Zhang and Jia [38], the proof is based on the following key lemma.

Lemma 4.3

The gradient and Hessian matrix of the log-likelihood function in the Poisson regression model are \(\dot{\ell }(\beta ) = - \frac{1}{n} \sum_{i = 1}^{n} {\phi ({{x}_{i}})[} {Y_{i}} - {e^{{\phi ^{T}}( {{x}_{i}})\beta }}]\) and \(\ddot{\ell }(\beta ) = \frac{1}{n}\sum_{i = 1}^{n} {\phi ({{x}_{i}}){\phi ^{T}}({{x}_{i}})} {e^{{\phi ^{T}}({{x} _{i}})\beta }}\). Assume the identifiability condition: \({\phi ^{T}}({{x}_{i}})(\beta + \delta )={\phi ^{T}}({{x}_{i}})\beta \) implies \({\phi ^{T}}({{x}_{i}})\delta = 0\). Then we have

$$ {D^{s}}(\beta + \delta ,\beta ) \ge {\delta ^{T}} \ddot{\ell }(\beta ) \delta {e^{ - {{\Vert {\varPhi ({{X}})} \Vert }_{\infty }}\Vert \delta \Vert _{1}}}, $$

where \({\Vert {\varPhi ({{X}})} \Vert _{\infty }} = \max \{ \vert {\phi _{j}({x_{ij}})}\vert ;1 \le i \le n,1 \le j \le p\} \).

Proof

Without loss of generality, we assume that \({\phi ^{T}}({{x}_{i}})\delta \ne 0\). By the expression of \(\dot{\ell }(\beta )\) and \(\ddot{\ell }( \beta )\), one deduces

$$\begin{aligned} {\delta ^{T}} \bigl[\dot{\ell }(\beta +\delta ) - \dot{\ell }(\beta ) \bigr] &= {\delta ^{T}} \cdot \frac{1}{n}\sum _{i = 1}^{n} {\phi ({{x}_{i}}) \bigl( {e^{{\phi ^{T}}({{x}_{i}})(\beta + \delta )}} - {e^{{\phi ^{T}}({{x} _{i}})\beta }} \bigr)} \\ &= {\delta ^{T}} \cdot \frac{1}{n}\sum _{i = 1}^{n} {\phi ( {{{x}}_{i}}){e^{{\phi ^{T}}({{x}_{i}})\beta }} \cdot \frac{ {{e^{{\phi ^{T}}({{x}_{i}})\delta }} - 1}}{{{\phi ^{T}}({{x}_{i}}) \delta - 0}}} \cdot {\phi ^{T}}({{x}_{i}}) \delta \\ & \ge {\delta ^{T}}\frac{1}{n}\sum _{i = 1}^{n} {\phi ({{x}_{i}}) {\phi ^{T}}({{x}_{i}}){e^{{\phi ^{T}}({{x}_{i}})\beta }}} \delta \cdot {e^{ - {{\Vert { {\varPhi ({{X}})}} \Vert }_{\infty }}\Vert \delta \Vert _{1}}}, \end{aligned}$$

where the last inequality is from \(\frac{{{e^{x}} - {e^{y}}}}{{x - y}} \ge {e^{ - \vert x \vert \vee \vert y\vert }}\). □
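As a throwaway sanity check (not needed for the proof), the elementary inequality used in this last step can be verified numerically on random pairs:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-5, 5, 10_000)
y = rng.uniform(-5, 5, 10_000)

# check (e^x - e^y) / (x - y) >= e^{-(|x| v |y|)}; the quotient equals e^c for some c between x and y
lhs = (np.exp(x) - np.exp(y)) / (x - y)
rhs = np.exp(-np.maximum(np.abs(x), np.abs(y)))
print(bool(np.all(lhs >= rhs)))   # True
```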

Theorem 4.2

For the weighted Lasso estimator β̂ in a Poisson regression, suppose \({\Vert {\varPhi ({{X}})} \Vert _{\infty }} \le K\) for some \(K > 0\), assume (26), and let \(\tau = \frac{{K(\xi + 1){d^{*}}w_{\max }^{2}}}{{2{w_{\min }}{{ [ {C(\xi {w_{\max }}/{w_{\min }},H)} ]}^{2}}}} \le {e^{ - 1}}\). Then, in the event \(\{{{{\Vert {\dot{\ell }({\beta ^{{{*}}}})} \Vert }_{\infty }} \le \frac{{\xi - 1}}{{\xi + 1}}{w_{\min }}}\}\), we have

$$ { \bigl\Vert {\hat{\beta }- {\beta ^{{{*}}}}} \bigr\Vert _{1}} \le \frac{{{e^{ {a_{\tau }}}}(\xi + 1){d^{*}}w_{\max }^{2}}}{{2{w_{\min }}{{ [ {C(\xi {w_{\max }}/{w_{\min }},H)} ]}^{2}}}}, $$

where \({a_{\tau }} \le 1\) is the smaller solution of the equation \(a {e^{ - a }} = \tau \). Also,

$$ {D^{s}} \bigl(\hat{\beta },{\beta ^{*}} \bigr) \le \frac{{4{e^{{a_{\tau }}}}{\xi ^{2}}{d^{*}}w_{\max }^{2}}}{{{{(\xi + 1)}^{2}}{{ [ {C(\xi {w_{ \max }}/{w_{\min }},H)} ]}^{2}}}}. $$

Proof

Note that in the Poisson case \({a_{\tau }}\) is the smaller solution of \(z {e^{ - z}} = \tau \). By Lemma 4.1 and Lemma 4.3, the proof is almost exactly the same as that of Theorem 4.1. □

5 Simulations

This section compares the performance of the un-weighted Lasso and the weighted Lasso for NB regression on simulated data sets. We use the R package mpath with the function glmregNB to fit the un-weighted Lasso estimator of NB regression. The R package lbfgs is employed to carry out the weighted \(\ell _{1}\)-penalized optimization for NB regression. For the weighted Lasso, we first run an un-weighted Lasso estimator and find the optimal tuning parameter \({\lambda _{\mathrm{{op}}}}\) by cross-validation. The actual weights we use are the standardized weights given by

$$ \tilde{w}_{j}: = p \frac{{w_{j}}}{{\sum_{j = 1}^{p} {w_{j}} }}. $$

And then we solve the optimization problem

$$ \hat{\beta }^{(t)}= \operatorname*{argmin}_{\beta \in \mathbb{R}^{p}} \Biggl\lbrace - \ell (\beta )+ {\lambda _{\mathrm{{op}}}} \sum _{j=1}^{p} \tilde{w}_{j} \vert {{\beta _{j}}} \vert \Biggr\rbrace . $$
(35)

Here we first apply the function cv.glmregNB() to perform 10-fold cross-validation and obtain the optimal penalty parameter λ. For comparison, we use the function glmregNB to fit the un-weighted Lasso for NB regression and use its estimated θ as the initial value for our weighted Lasso algorithm.
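The optimization (35) itself is a standard weighted \(\ell _{1}\)-penalized convex program. As an illustration only (the actual experiments use the R functions glmregNB and lbfgs mentioned above), the following Python/NumPy sketch solves the weighted Lasso problem by proximal gradient descent for the simpler Poisson log-likelihood; the data X, y, the weights w, the penalty level and the step size are all hypothetical, and the NB case would only change the gradient of the negative log-likelihood.

```python
import numpy as np

def weighted_lasso_poisson(X, y, lam, w, step=0.2, n_iter=2000):
    """Proximal-gradient (ISTA) sketch of problem (35) with a Poisson log-likelihood."""
    n, p = X.shape
    w_tilde = p * w / w.sum()                       # standardized weights, as in the text
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (np.exp(X @ beta) - y) / n     # gradient of the average negative log-likelihood
        z = beta - step * grad                      # gradient step
        thr = step * lam * w_tilde                  # per-coordinate thresholds
        beta = np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)   # weighted soft-thresholding
    return beta

# toy usage with hypothetical data (fixed step size, small enough for this design)
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(scale=0.3, size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.8, 0.5]
y = rng.poisson(np.exp(X @ beta_true))
beta_hat = weighted_lasso_poisson(X, y, lam=0.05, w=np.ones(p))
print(np.round(beta_hat, 2))
```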

For the simulations, we generate 100 random data sets. By optimization with a suitable λ, we obtain the model with parameter \(\hat{\beta }_{\lambda _{\mathrm{opt}}}\). We then assess the \(\ell _{1}\)-estimation error \(\Vert \beta ^{*} - \hat{\beta }_{\lambda _{\mathrm{opt}}} \Vert _{1}\) and the prediction error \(\Vert X_{\mathrm{test}} \beta ^{*} - X_{\mathrm{test}} \hat{\beta }_{\lambda _{\mathrm{opt}}} \Vert _{2}\) on the test data (\(X _{\mathrm{test}}\) of size \(n_{\mathrm{test}}\)) by averaging the errors over the 100 replications.

For each simulation, \(n_{\mathrm{train}} = 100\) and \(n_{\mathrm{test}} = 200\). We adopt the following simulation setting: the predictor variables X are randomly drawn from \({N}_{p}(0, \varSigma )\), where Σ has elements \(\rho ^{\vert j - l\vert }\) (\(j, l = 1, 2,\ldots, p\)). The correlation among predictor variables is controlled by ρ, with \(\rho = 0.5\mbox{ and }0.8\), respectively. We assign the true vector as

$$ \beta ^{*} =( 1.25, -0.95, 0.9, -1.1, 0.6, \underbrace{0,\ldots, 0} _{p - 5} ). $$
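A minimal data-generating sketch for this design could look as follows (Python/NumPy; the value of p and the NB dispersion θ are hypothetical choices here, since p varies across Table 1 and θ is not fixed in the text).

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate(n, p, rho, theta=2.0):
    """Draw one data set: X ~ N_p(0, Sigma) with Sigma_{jl} = rho^{|j-l|},
    and NB responses via the Gamma-Poisson mixture (theta is hypothetical)."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta_star = np.zeros(p)
    beta_star[:5] = [1.25, -0.95, 0.9, -1.1, 0.6]
    mu = np.exp(X @ beta_star)
    y = rng.poisson(rng.gamma(shape=theta, scale=mu / theta))   # mean mu, variance mu + mu^2/theta
    return X, y, beta_star

X_train, y_train, beta_star = simulate(n=100, p=50, rho=0.5)
X_test, y_test, _ = simulate(n=200, p=50, rho=0.5)
```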

The simulation results are displayed in Table 1. The table shows that the proposed weighted Lasso estimators are more accurate than the un-weighted Lasso estimators. This reflects that, with the help of the optimal weights, controlling the KKT conditions through a data-dependent tuning parameter improves estimation accuracy in terms of both the \(\ell _{1}\)-estimation error and the squared prediction error. It can also be seen that increasing p inflates the \(\ell _{1}\)-estimation error, as expected from the curse of dimensionality.

Table 1 Means of \(\ell _{1}\) error and prediction error

6 Summary and discussion

The paper makes three contributions. In the first part, we provide a characterization of discrete compound Poisson point processes (DCPP) in the same fashion as was done in 1936 by Arthur H. Copeland and Francis Regan for the Poisson process. The second part is focused on deriving concentration inequalities for DCPP, which are applied in the third part to derive optimal oracle inequalities for the weighted Lasso estimator in high-dimensional NB regression.

Oracle inequalities for discrete distributions are statistically useful in both non-asymptotic and asymptotic analyses of count data regression. The motivation for deriving new concentration inequalities for DCP processes is that we want to show that the KKT conditions of the Lasso-penalized NB regression (with a tuning parameter λ) hold with high probability. The KKT conditions rest on a concentration inequality for a centered weighted sum of NB random variables. However, existing concentration results are not readily applicable to this optimal inference procedure. The optimal inference procedure here is to choose an optimal tuning parameter for Lasso estimates in high-dimensional NB regressions, which leads to the minimax-optimal convergence rate of the estimator via the oracle inequalities for the \(\ell _{1}\)-estimation error.

In the future, it is of interest to study concentration inequalities for other extended Poisson distributions in regression analysis, such as the Conway–Maxwell–Poisson distribution; see Li et al. [22] for the existing probabilistic properties. In another direction, it would be desirable to extend our proposed concentration inequalities to establish oracle inequalities for penalized projection estimators studied by Reynaud-Bouret [29], by considering the intensity of some inhomogeneous compound Poisson processes.

References

  1. Aczél, J.: On composed Poisson distributions, III. Acta Math. Hung. 3(3), 219–224 (1952)

  2. Baraud, Y., Birgé, L.: Estimating the intensity of a random measure by histogram type estimators. Probab. Theory Relat. Fields 143(1–2), 239–284 (2009)

  3. Bickel, P.J., Ritov, Y.A., Tsybakov, A.B.: Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 37, 1705–1732 (2009)

  4. Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, London (2013)

  5. Bühlmann, P., van de Geer, S.A.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

  6. Chareka, P., Chareka, O., Kennendy, S.: Locally sub-Gaussian random variables and the strong law of large numbers. Atl. Electron. J. Math. 1(1), 75–81 (2006)

  7. Cleynen, A., Lebarbier, E.: Segmentation of the Poisson and negative binomial rate models: a penalized estimator. ESAIM Probab. Stat. 18, 750–769 (2014)

  8. Copeland, A.H., Regan, F.: A postulational treatment of the Poisson law. Ann. Math. 37, 357–362 (1936)

  9. Das, A.: Design and Analysis of Statistical Learning Algorithms which Control False Discoveries. Doctoral dissertation, Universität zu Köln (2018)

  10. Feller, W.: An Introduction to Probability Theory and Its Applications, vol. I, 3rd edn. Wiley, New York (1968)

  11. Giles, D.E.: Hermite regression analysis of multi-modal count data. Econ. Bull. 30(4), 2936–2945 (2010)

  12. Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

  13. Hilbe, J.M.: Negative Binomial Regression, 2nd edn. Cambridge University Press, Cambridge (2011)

  14. Houdré, C.: Remarks on deviation inequalities for functions of infinitely divisible random vectors. Ann. Probab. 30, 1223–1237 (2002)

  15. Houdré, C., Privault, N.: Concentration and deviation inequalities in infinite dimensions via covariance representations. Bernoulli 8(6), 697–720 (2002)

  16. Ivanoff, S., Picard, F., Rivoirard, V.: Adaptive Lasso and group-Lasso for functional Poisson regression. J. Mach. Learn. Res. 17(55), 1–46 (2016)

  17. Jiang, X., Raskutti, G., Willett, R.: Minimax optimal rates for Poisson inverse problems with physical constraints. IEEE Trans. Inf. Theory 61(8), 4458–4474 (2015)

  18. Johnson, N.L., Kemp, A.W., Kotz, S.: Univariate Discrete Distributions, 3rd edn. Wiley, New York (2005)

  19. Kingman, J.F.C.: Poisson Processes. Oxford University Press, London (1993)

  20. Kontoyiannis, I., Madiman, M.: Measure concentration for compound Poisson distributions. Electron. Commun. Probab. 11, 45–57 (2006)

  21. Last, G., Penrose, M.D.: Lectures on the Poisson Process. Cambridge University Press, Cambridge (2017)

  22. Li, B., Zhang, H., He, J.: Some characterizations and properties of COM-Poisson random variables. Commun. Stat., Theory Methods (2019). https://doi.org/10.1080/03610926.2018.1563164

  23. Li, Q.: Bayesian Models for High-Dimensional Count Data with Feature Selection. Doctoral dissertation, Rice University (2016)

  24. Luc, D.T.: Multiobjective Linear Programming: An Introduction. Springer, Berlin (2016)

  25. Mallick, H., Tiwari, H.K.: EM adaptive LASSO a multilocus modeling strategy for detecting SNPs associated with zero-inflated count phenotypes. Front. Genet. 7, Article 32 (2016)

  26. Nielsen, F., Nock, R.: Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 55(6), 2882–2904 (2009)

  27. Petrov, V.V.: Limit Theorems of Probability Theory: Sequences of Independent Random Variables. Clarendon, Oxford (1995)

  28. Raskutti, G., Wainwright, M.J., Yu, B.: Minimax rates of estimation for high-dimensional linear regression over \(\ell _{q}\)-balls. IEEE Trans. Inf. Theory 57(10), 6976–6994 (2011)

  29. Reynaud-Bouret, P.: Adaptive estimation of the intensity of inhomogeneous Poisson processes via concentration inequalities. Probab. Theory Relat. Fields 126(1), 103–153 (2003)

  30. Rigollet, P., Hütter, J.C.: High dimensional statistics (2019). http://www-math.mit.edu/~rigollet/PDFs/RigNotes17.pdf

  31. Sato, K.: Lévy Processes and Infinitely Divisible Distributions, revised edn. Cambridge University Press, Cambridge (2013)

  32. Städler, N., Bühlmann, P., van de Geer, S.A.: \(\ell_{1}\)-penalization for mixture regression models. Test 19, 209–256 (2010)

  33. van de Geer, S.A.: The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich (2007)

  34. Wainwright, M.J.: High-Dimensional Statistics: A Non-asymptotic Viewpoint, vol. 48. Cambridge University Press, Cambridge (2019)

  35. Wang, Y.H., Ji, S.: Derivations of the compound Poisson distribution and process. Stat. Probab. Lett. 18(1), 1–7 (1993)

  36. Yang, X., Zhang, H., Wei, H., Zhang, S.: Sparse density estimation with measurement errors. arXiv preprint (2019). arXiv:1911.0621

  37. Yu, Y.: High-dimensional variable selection in Cox model with generalized Lasso-type convex penalty (2010). https://people.maths.bris.ac.uk/~yy15165/index_files/Cox_generalized_convex.pdf

  38. Zhang, H., Jia, J.: Elastic-net regularized high-dimensional negative binomial regression: consistency and weak signals detection. arXiv preprint (2017). arXiv:1712.03412

  39. Zhang, H., Li, B.: Characterizations of discrete compound Poisson distributions. Commun. Stat., Theory Methods 45(22), 6789–6802 (2016)

  40. Zhang, H., Liu, Y., Li, B.: Notes on discrete compound Poisson model with applications to risk theory. Insur. Math. Econ. 59, 325–336 (2014)

Acknowledgements

This work is part of the doctoral thesis of the first author, who would like to express sincere gratitude to his advisor Professor Jinzhu Jia for his guidance, and to doctoral candidate Xiangyang Li for his assistance with the simulations. The authors also thank the two anonymous referees for their valuable comments, which greatly improved the quality of our manuscript. When writing the early version of this paper, Professor Patricia Reynaud-Bouret provided several helpful comments and materials about the concentration inequalities, to whom warm thanks are due.

Availability of data and materials

Not applicable.

Funding

No funding was used to support this paper.

Author information

Contributions

The authors completed the paper and approved the final manuscript.

Corresponding author

Correspondence to Huiming Zhang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix:  Proof of Theorem 2.1

1.1 A.1 Proof of Theorem 2.1

The proof is divided into two parts. In the first part, we show that \(\nu (A)\) is countably additive and absolutely continuous. In the second part, we show Eq. (3).

Proof

Step 1.

(1) Since \(P_{k}(A)>0\) and \(\sum_{k=0}^{\infty }P_{k}(A)=1\) by assumption 1, we have \(0< P_{k}(A)<1\) whenever \(0< m(A)<\infty \). From assumption 2, we get \(P_{0}(A)=1\) if \(m(A)=0\). It follows that \(0< P_{0}(A)\leq 1 \text{ if }0< m(A)<\infty \). By assumptions 2 and 4, one has \(\lim_{m(A)\to 0}\nu (A)=\lim_{m(A)\to 0}- \log {[1-S_{1}(A)]}=0\). Therefore, \(\nu (\cdot )\) is absolutely continuous.

(2) Let \(\nu (A):=-\log {P_{0}(A)}\). For \(A_{1}\cap A_{2}=\emptyset \), assumption 3 implies that

$$ \nu (A_{1}\cup A_{2})=-\log {P_{0}(A_{1} \cup A_{2})}=-\log { \bigl[P_{0}(A _{1}) \cdot P_{0}(A_{2}) \bigr]}=\nu (A_{1})+\nu (A_{2}), $$

which shows that \(\nu (\cdot )\) is finitely additive.

Based on the two arguments in Step 1, \(\nu ( \cdot )\) is finitely additive and absolutely continuous, and hence countably additive.

Step 2. For convenience, we first consider the case that \(A=I_{\mathbf{a},\mathbf{b}}=(\mathbf{a}, \mathbf{b}]\) is a d-dimensional interval in \(\mathbb{R}^{d}\), and then extend the result to arbitrary measurable sets. Let \(T({x}) =:\nu ({I_{0,{x}}})\); then \(T({x})\) is a d-dimensional coordinatewise monotone transformation with

$$ T(\mathbf{0})=\mathbf{0},\qquad T({x}) \le T \bigl({x}' \bigr) \quad \text{iff}\quad {x} \le {x}', $$

where the componentwise order symbol (partial-order symbol) “≤” means that each coordinate of x is less than or equal to the corresponding coordinate of \({x}'\) (for example, with \(d=2\), we have \((1,2)\le (3,4)\)). This follows from the fact that \(\nu (\cdot )\) is absolutely continuous and finitely additive, which implies \(\nu ({I_{\mathbf{0}, \mathbf{0}}}) = \lim_{m({I_{\mathbf{0},{x}}}) \to 0} \nu ({I_{\mathbf{0},{x}}}) = 0\).

The notion of the minimal set of the componentwise ordering relation lays the foundation for our study; see Sect. 4.1 of Dinh The Luc [24]. Let A be a nonempty set in \(\mathbb{R}^{+d}\). A point \(\mathbf{a} \in A\) is called a Pareto minimal point of the set A if there is no point \(\mathbf{a}' \in A\) such that \(\mathbf{a}' \le \mathbf{a}\) and \(\mathbf{a}' \ne \mathbf{a}\). The set of Pareto minimal points of A is denoted by \(\operatorname{PMin}(A)\). Next, we define the generalized inverse \({T^{ - 1}}(t)\) (like some generalizations of the inverse matrix, it is not unique; we just choose one) by

$$ {x} = ({x_{1}},\ldots,{x_{d}}) ={T^{ - 1}}(t)=: \operatorname{PMin} \bigl\{ \mathbf{a} \in \mathbb{R}^{+d}:t \le T( \mathbf{a}) \bigr\} . $$

When \(d=1\), the generalized inverse \({T^{ - 1}}(t)\) is just an analogue of the quantile function.

By Theorem 4.1.4 in Dinh The Luc [24], if A is a nonempty compact set, then it has a Pareto minimal point. Set \({x}=T^{-1}(t)\), \({x}+\boldsymbol{\xi }=T^{-1}(t+\tau )\) and define \(\varphi _{k}(\tau ,t)=P _{k}(I_{{x},{x}+\xi })\) for \(t=T({x})\), \(\tau =T({x}+\boldsymbol{\xi })-T( {x})\).

Consequently, by the finite additivity of \(\nu (\cdot )\) established in part one, it follows that

$$ \tau =T({x}+\boldsymbol{\xi })-T({x})=\nu (I_{0,{x}+\boldsymbol{\xi }})- \nu (I_{\mathbf{0},{x}})=\nu (I_{{x},{x}+\boldsymbol{\xi }})=-\log {P_{0}(I _{{x},{x}+\boldsymbol{\xi }})}. $$

Thus, we have \(\varphi _{0}(\tau ,t)=e^{-\tau }\). Let \(\varPhi _{k}(\tau ,t)= \sum_{i=k}^{\infty }\varphi _{i}(\tau ,t)\). Due to assumption 5, it follows that

$$ \alpha _{k}=\lim_{\tau \to 0} \frac{\varphi _{k}(\tau ,t)}{\varPhi _{1}(\tau ,t)}=\lim_{\tau \to 0}\frac{\varphi _{k}(\tau ,t)}{ \tau }\quad \text{for }k\geq 1. $$
(36)

Applying assumption 3, we get \(\varphi _{k}(\tau +h,t)=\sum_{i=0} ^{k}\varphi _{k-i}(\tau ,t)\varphi _{i}(h,t+\tau )\). Subtracting \(\varphi _{k}(\tau ,t)\) and dividing by h (\(h > 0\)) in the above equation, we obtain

$$ \frac{\varphi _{k}(\tau +h,t)-\varphi _{k}(\tau ,t)}{h}=\varphi _{k}( \tau ,t)\cdot \frac{\varphi _{0}(h,t+\tau )-1}{h}+\sum_{i=1} ^{k} \varphi _{k-i}(\tau ,t)\cdot \frac{\varphi _{i}(h,t+\tau )}{h}. $$
(37)

Here the meaning of h in \(\varphi _{i}(h,t+\tau )\) is that

$$ h=T({x}+\boldsymbol{\xi }+\boldsymbol{\eta })-T({x}+\boldsymbol{\xi })=\nu (I_{ \mathbf{0},{x}+\boldsymbol{\xi }+\boldsymbol{\eta }})-\nu (I_{\mathbf{0}, {x}+\xi })=\nu (I_{{x}+\xi , {x}+\boldsymbol{\xi }+\boldsymbol{\eta }}), $$

where \({x}+\boldsymbol{\xi }+\boldsymbol{\eta }=T^{-1}(t+\tau +h)\) and \({x}+\boldsymbol{\xi }=T^{-1}(t+\tau )\).

Letting \(h \to 0\) in (37) and using assumption 5 together with its conclusion (36), we obtain

$$ {{\varphi '}_{k}}(\tau ,t) =: \frac{\partial }{{\partial \tau }}{\varphi _{k}}( \tau ,t) = - {\varphi _{k}}(\tau ,t) + {\alpha _{1}} {\varphi _{k - 1}}( \tau ,t) + {\alpha _{2}} {\varphi _{k - 2}}(\tau ,t) + \cdots + {\alpha _{k}} {\varphi _{0}}(\tau ,t) , $$
(38)

which is a difference–differential equation.

To solve (38), we need to write it in matrix form,

$$ \frac{\partial \mathbf{P}_{k}(\tau ,t)}{\partial \tau } := \frac{\partial }{\partial \tau } \begin{pmatrix} \varphi _{k}(\tau ,t)\\ \vdots \\ \varphi _{1}(\tau ,t)\\ \varphi _{0}(\tau ,t) \end{pmatrix} = \begin{pmatrix} -1 & \alpha _{1} & \cdots & \alpha _{k}\\ & -1 & \ddots & \vdots \\ & & \ddots & \alpha _{1}\\ & & & -1 \end{pmatrix} \begin{pmatrix} \varphi _{k}(\tau ,t)\\ \vdots \\ \varphi _{1}(\tau ,t)\\ \varphi _{0}(\tau ,t) \end{pmatrix} =: \mathbf{Q}\,\mathbf{P}_{k}(\tau ,t). $$

The general solution is

$$ \mathbf{P}_{k}(\tau , t)=e^{\mathbf{Q}\tau }\cdot \mathbf{c}=: \sum _{m=0}^{\infty }\frac{\mathbf{Q}^{m}}{m!}\tau ^{m}\cdot \mathbf{c}, $$

where c is a constant vector. To specify c, we observe that

$$ (0,\ldots,0,1)^{\mathrm{T}}=\lim_{\tau \to 0} \mathbf{P}_{k}( \tau , t)=\lim_{\tau \to 0}e^{\mathbf{Q}\tau } \mathbf{c}= \mathbf{c}. $$

Hence, Q can be written as

$$ \mathbf{Q}=-\mathbf{I}_{k+1}+\alpha _{1} \mathbf{N}+ \alpha _{2} \mathbf{N}^{2}+\cdots +\alpha _{k}\mathbf{N}^{k}, $$

where

$$ \mathbf{N} =: \begin{pmatrix} \mathbf{0}_{k\times 1} & \mathbf{I}_{k}\\ 0 & \mathbf{0}_{1\times k} \end{pmatrix}, $$

so that N is nilpotent: for \(i\geq k+1\), we have \(\mathbf{N}^{i}=\mathbf{0}\).

We use the multinomial expansion of the mth power:

$$ \frac{1}{{m!}}{ \Biggl( - {\mathbf{{I}}_{k + 1}} + \sum _{i = 1}^{k} {{\alpha _{i}}} { \mathbf{{N}}^{i}} \Biggr)^{m}} = \sum _{{s_{0}} + {s_{1}} + \cdots + {s_{k}} = m} {\frac{{{{( - {\mathbf{{I}}_{k + 1}})}^{{s_{0}}}}\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s_{0}}!{s_{1}}! \cdots {s_{k}}!}}} {\mathbf{{N}} ^{{s_{1}} + 2{s_{2}} + \cdots + k{s_{k}}}}. $$

Then

$$\begin{aligned} {{\mathbf{P}_{k}}(\tau ,t)} &= \sum_{m = 0}^{\infty }{ \sum_{{s_{0}} + {s_{1}} + \cdots + {s_{k}} = m} {\frac{{{{( - {\mathbf{{I}}_{k + 1}})}^{{s_{0}}}}\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s_{0}}!{s_{1}}! \cdots {s_{k}}!}}} {\mathbf{{N}} ^{{s_{1}} + 2{s_{2}} + \cdots + k{s_{k}}}} {\tau ^{{s_{0}} + {s_{1}} + \cdots + {s_{k}}}} {\mathbf{{c}}}} \\ &=\sum_{m = 0}^{\infty }{\sum _{{s_{0}} = 0}^{m} {{\tau ^{ {s_{0}}}} \frac{{{{( - 1)}^{{s_{0}}}}}}{{{s_{0}}!}}\sum_{{s_{1}} + \cdots + {s_{k}} = m - {s_{0}}} { \frac{{\alpha _{1} ^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s_{1}}! \cdots {s_{k}}!}}} } {\mathbf{{N}}^{{s_{1}} + 2{s_{2}} + \cdots + k{s_{k}}}} {\tau ^{{s_{1}} + \cdots + {s_{k}}}} {\mathbf{{c}}}} \\ &= \sum_{{s_{0}} = 0}^{\infty }{\sum _{m - {s_{0}} = 0} ^{\infty }{{\tau ^{{s_{0}}}} \frac{{{{( - 1)}^{{s_{0}}}}}}{{{s_{0}}!}} \sum_{{s_{1}} + \cdots + {s_{k}} = m - {s_{0}}} { \frac{{\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s_{1}}! \cdots {s_{k}}!}}} } {\tau ^{{s_{1}} + \cdots + {s_{k}}}} {\mathbf{{N}} ^{{s_{1}} + 2{s_{2}} + \cdots + k{s_{k}}}} {\mathbf{{c}}}} \\ (\text{let } r = m - {s_{0}}) & = \sum _{{s_{0}} = 0}^{ \infty }{{\tau ^{{s_{0}}}} \frac{{{{( - 1)}^{{s_{0}}}}}}{{{s_{0}}!}} \sum_{r = 0}^{\infty }{ \sum_{{s_{1}} + \cdots + {s_{k}} = r} {\frac{{\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s _{1}}! \cdots {s_{k}}!}}} } {\tau ^{{s_{1}} + \cdots + {s_{k}}}} {\mathbf{{N}} ^{{s_{1}} + 2{s_{2}} + \cdots + k{s_{k}}}} {\mathbf{{c}}}} \\ &= {e^{ - \tau }}\sum_{r = 0}^{\infty }{ \sum_{{s_{1}} + \cdots + {s_{k}} = r} {\frac{{\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s_{1}}! \cdots {s_{k}}!}}} } {\tau ^{{s _{1}} + \cdots + {s_{k}}}} {\mathbf{{N}}^{{s_{1}} + 2{s_{2}} + \cdots + k{s_{k}}}} {\mathbf{{c}}} \\ &=\sum_{l = 0}^{\infty }{\sum _{R(s,k) = l} {\frac{ {\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s_{1}}! \cdots {s_{k}}!}}{\tau ^{{s_{1}} + \cdots + {s_{k}}}} {e^{ - \tau }} {\mathbf{{N}} ^{l}} {\mathbf{{c}}}} } \\ &= \sum_{l = 0}^{k} {\sum _{R(s,k) = l} {\frac{{\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s _{1}}! \cdots {s_{k}}!}}{\tau ^{{s_{1}} + \cdots + {s_{k}}}} {e^{ - \tau }} {\mathbf{{N}}^{l}} {\mathbf{{c}}}} }, \end{aligned}$$
(39)

where \(R(s,k)=:\sum_{t=1}^{k} ts_{t}\) and the last equality is obtained by the fact that \({\mathbf{N}^{l}}\mathbf{c} = (0,\ldots,0,1, \underbrace{0,\ldots,0}_{l})^{T}\) and \(\mathbf{N}^{l}= \mathbf{0}\) as \(l\geq k+1\).

Taking the first element in (39), we see that \(\varphi _{k}( \tau , t)=\sum_{R(s,k)=k}\frac{\alpha _{1}^{s_{1}}\cdots \alpha _{k}^{s_{k}}}{s_{1}!\cdots s_{k}!}\tau ^{s_{1}+\cdots +s_{k}} e^{- \tau }\). By the relation between \(\varphi _{k}(\tau , t)\) and \(P_{k}(A)\), we obtain

$$ P_{k}(I_{0,\boldsymbol{\xi }})=\sum _{R(s,k)=k}\frac{\alpha _{1}^{s _{1}}\cdots \alpha _{k}^{s_{k}}}{s_{1}!\cdots s_{k}!} \bigl[\nu (I_{0, \boldsymbol{\xi }}) \bigr]^{s_{1}+\cdots +s_{k}}\cdot e^{-\nu (I_{0, \boldsymbol{\xi }})}. $$
(40)
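Formula (40) can be checked numerically against the usual Poisson-mixture representation of a discrete compound Poisson variable (this is only a sanity check, not part of the proof). The following Python sketch, with hypothetical jump probabilities α and measure value ν, compares the two computations of \(P_{k}\).

```python
import numpy as np
from itertools import product
from math import exp, factorial

alpha = np.array([0.5, 0.3, 0.2])   # hypothetical jump distribution (alpha_1, alpha_2, alpha_3)
nu = 1.7                            # hypothetical value of nu(A)

def p_formula(k):
    """P_k via formula (40): sum over s_1..s_k with sum_t t*s_t = k."""
    if k == 0:
        return exp(-nu)
    a = np.concatenate([alpha, np.zeros(k)])[:k]    # alpha_t = 0 beyond the jump support
    total = 0.0
    for s in product(*(range(k // t + 1) for t in range(1, k + 1))):
        if sum(t * st for t, st in zip(range(1, k + 1), s)) == k:
            term = exp(-nu) * nu ** sum(s)
            for t, st in zip(range(1, k + 1), s):
                term *= a[t - 1] ** st / factorial(st)
            total += term
    return total

def p_mixture(k):
    """P_k via the Poisson(nu) mixture of m-fold convolutions of the jump law."""
    jump = np.concatenate(([0.0], alpha))           # jump sizes 1, 2, 3
    conv, total = np.array([1.0]), 0.0
    for m in range(k + 1):                          # at most k jumps (each >= 1) can sum to k
        if k < len(conv):
            total += exp(-nu) * nu ** m / factorial(m) * conv[k]
        conv = np.convolve(conv, jump)
    return total

print(max(abs(p_formula(k) - p_mixture(k)) for k in range(9)))   # ~1e-16
```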

Step 3. We need to show the countable additivity in terms of (40). Let \(A_{1}\), \(A_{2}\) be two disjoint intervals; we first show that

$$ P_{k}(A_{1}\cup A_{2})= \sum_{R(s,k)=k}\frac{\alpha _{1}^{s_{1}} \cdots \alpha _{k}^{s_{k}}}{s_{1}!\cdots s_{k}!} \bigl[\nu (A_{1}\cup A_{2}) \bigr]^{s _{1}+\cdots +s_{k}} e^{-\nu (A_{1}\cup A_{2})} , $$
(41)

which means that (40) is closed under a finite union of the \(A_{i}\).

By assumption 3, to show (41), it is sufficient to verify \(\sum_{i=0}^{k}P_{k-i}(A_{1})P_{i}(A_{2})\) equals the right-hand side of (41). From the m.g.f. of the DCP distribution with complex θ, we have

$$ {M_{{A_{i}}}}(\theta ) =: \mathrm{{E}} {e^{ - \theta N({A_{i}})}} = \exp \Biggl\{ \lambda ({A_{i}})\sum_{a = 1}^{\infty }{{ \alpha _{a}} \bigl( {e^{ - a\theta }} - 1 \bigr)} \Biggr\} \quad \text{for } i=1,2. $$

And by the definition of m.g.f. and assumption 3,

$$\begin{aligned} {M_{{A_{1}} \cup {A_{2}}}}(\theta ) &= \sum_{k = 0}^{ + \infty} {{P_{k}}({A_{1}} \cup {A_{2}}){e^{ - k\theta }}} = \sum_{k = 0}^{ + \infty } { \Biggl( {\sum _{i = 0}^{k} {{P_{k - i}}} ( {A_{1}}){P_{i}}({A_{2}})} \Biggr)} {e^{ - k\theta }} \\ &= \Biggl( {\sum_{j = 0}^{ + \infty } {{P_{j}}({A_{1}}){e^{ - j\theta }}} } \Biggr) \Biggl( {\sum_{s = 0}^{ + \infty } {{P _{s}}({A_{2}}){e^{ - s\theta }}} } \Biggr) \\ &={M_{{A_{1}}}}(\theta ){M_{{A_{2}}}}(\theta )=\exp \Biggl\{ \bigl[\nu ({A_{1}}) + \nu ({A_{2}}) \bigr]\sum _{a = 1}^{\infty }{{\alpha _{a}} \bigl({e^{ - a \theta }} - 1 \bigr)} \Biggr\} \\ &=\exp \Biggl\{ \bigl[\nu ({A_{1}} \cup {A_{2}}) \bigr] \sum_{a = 1}^{\infty } {{\alpha _{a}} \bigl({e^{ - a\theta }} - 1 \bigr)} \Biggr\} \\ &=\sum_{k = 0}^{\infty }{\sum _{R(s,k) = k} {\frac{ {\alpha _{1}^{{s_{1}}} \cdots \alpha _{k}^{{s_{k}}}}}{{{s_{1}}! \cdots {s _{k}}!}}} {{ \bigl[\nu ({A_{1}} \cup {A_{2}}) \bigr]}^{{s_{1}} + \cdots + {s _{k}}}} {e^{ -\nu ({A_{1}} \cup {A_{2}})}}} {\mathrm{{e}}^{ - k\theta }}. \end{aligned}$$

Thus, we show (41).

Finally, let \(E_{n}=\bigcup_{i=1}^{n} A_{i}\), \(E=\bigcup_{i=1}^{\infty }A_{i}\) for disjoint intervals \(A_{1}, A_{2},\ldots \) . We use the following lemma in Copeland and Regan [8].

Lemma A.1

Under assumptions 1–4, let \({e_{n}} = \{ E\setminus (E \cap {E_{n}})\} \cup \{ {E_{n}}\setminus (E \cap {E_{n}})\} \). Then

$$ \lim_{n\to \infty } m(e_{n})=0 \quad \textit{implies} \quad \lim_{n\to \infty } P_{k}(E_{n})=P_{k}(E). $$

To see this, since \(e_{n}=\bigcup_{i=n+1}^{\infty }A_{i}\), we have \(m(e_{n})=\sum_{i=n+1}^{\infty }m(A_{i}) \to 0\) as \(n \to \infty \). Thus, we obtain the countable additivity with respect to (40). □

1.2 A.2 Proof of Corollary 3.1

Let \(C{P_{i,{r_{i}}}}(A) =: \sum_{k = 1}^{r_{i}} {k{N_{i,k}}(A)}\), where \(\{{N_{i,k}}(A)\}_{i=1}^{n}\) are independent Poisson point processes with intensity \({{\alpha _{k}(i)}\int _{A} {\lambda (x)} \,dx}\) for each fixed k. By Campbell's theorem, it follows that

$$\begin{aligned} \mathrm{{E}}\exp \Biggl\{ - \sum_{i = 1}^{n} { \int _{{S_{i}}} {f_{i}(x)} C{P_{i, {r_{i}}}}(dx)} \Biggr\} & = \mathrm{{E}}\exp \Biggl\{ - \sum_{i = 1}^{n} { \int _{{S_{i}}} {f_{i}(x)} \sum_{k = 1}^{{r_{i}}} {k{N_{i,k}}(dx)} } \Biggr\} \\ & = \prod_{i = 1}^{n} {\prod _{k = 1}^{{r_{i}}} {\mathrm{ {E}}} \exp \biggl\{ - \int _{{S_{i}}} {kf_{i}(x)} {N_{i,k}}(dx) \biggr\} } \\ & = \prod_{i = 1}^{n} {\prod _{k = 1}^{{r_{i}}} {\exp \biggl\{ \int _{{S_{i}}} { \bigl[{e^{ - kf_{i}(x)}}} - 1 \bigr]{\alpha _{k}}(i)\lambda (x)\,dx \biggr\} } } \\ & = \exp \Biggl\{ \sum_{i = 1}^{n} { \sum_{k = 1}^{{r_{i}}} { \int _{{S_{i}}} { \bigl[{e^{ - kf_{i}(x)}}} - 1 \bigr]{\alpha _{k}}(i)\lambda (x)\,dx} } \Biggr\} . \end{aligned}$$
(42)
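As a quick numeric check of (42) (not needed for the proof), take a single region \(S=[0,1]\), a constant intensity \(\lambda (x)=\lambda _{0}\), a constant integrand \(f(x)=c\) and a hypothetical jump law α; the Monte Carlo average of \(\exp \{-\int f \,dCP\}\) then matches the right-hand side of (42).

```python
import numpy as np

rng = np.random.default_rng(7)
lam0, c, alpha = 2.0, 0.4, np.array([0.6, 0.3, 0.1])   # hypothetical lambda_0, f = c, jumps 1, 2, 3

n_mc = 200_000
samples = np.zeros(n_mc)
for k, a_k in enumerate(alpha, start=1):
    # N_k(S) ~ Poisson(a_k * lam0); each point contributes k * c to the stochastic integral
    samples += k * c * rng.poisson(a_k * lam0, size=n_mc)

mc = np.exp(-samples).mean()
exact = np.exp(lam0 * np.sum(alpha * (np.exp(-c * np.arange(1, 4)) - 1)))
print(mc, exact)   # the two values agree up to Monte Carlo error
```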

For \(\eta >0\), define the normalized exponential transform of the stochastic integral by \(D_{\{ r_{i}\} }(\eta )\):

$$\begin{aligned} \log {D_{\{ r_{i}\} }(\eta )} : =& \eta \sum_{i = 1}^{n} { \int _{ {S_{i}}} {f_{i}(x)} \Biggl[C{P_{i,{r_{i}}}}(dx) - \sum _{k = 1}^{{r_{i}}} {k{\alpha _{k}}(i)\lambda (x)} \Biggr]\,dx} \\ &{} - \sum_{i = 1}^{n} \int _{{S_{i}}} {\sum_{k = 1}^{{r_{i}}} { \bigl[{e^{k\eta f_{i}(x)}} - k\eta f_{i}(x) - 1 \bigr]{\alpha _{k}}(i) \lambda (x)} } \,dx. \end{aligned}$$
(43)

Therefore, we obtain \(\mathrm{{E}} D_{\{ r\} }(\eta )= 1\) by (42).

It follows from (43) and Markov’s inequality that

$$\begin{aligned} &P \Biggl( \eta \sum_{i = 1}^{n} { \int _{{S_{i}}} {f_{i}(x)} \Biggl[C {P_{i,{r_{i}}}}(dx) - \sum _{k = 1}^{{r_{i}}} {k{\alpha _{k}}(i) \lambda (x)} \Biggr]\,dx} \\ &\qquad \ge \sum _{i = 1}^{n} { \int _{{S_{i}}} {\sum_{k = 1}^{{r_{i}}} { \bigl[{e^{k\eta f_{i}(x)}} - k\eta f_{i}(x) - 1 \bigr]{\alpha _{k}}(i) \lambda (x)} } \,dx} + ny \Biggr) \\ &\quad = P \bigl(D_{\{ r_{i}\} }(\eta ) \ge {e^{ ny}} \bigr) \le {e^{ - ny}}. \end{aligned}$$
(44)

By inequality (12), Eq. (44) implies

$$\begin{aligned} & P \Biggl( \sum_{i = 1}^{n} { \int _{{S_{i}}} {f_{i}(x)} \Biggl[C{P_{i, {r_{i}}}}(dx) - \sum _{k = 1}^{{r_{i}}} {k{\alpha _{k}}(i)\lambda (x)} \Biggr]\,dx} \\ &\qquad \ge \sum _{i = 1}^{n} { \int _{S_{i}} {\sum_{k = 1}^{{r_{i}}} {{\alpha _{k}}(i) \biggl( {\frac{1}{2} \frac{{{k^{2}}\eta {f_{i}^{2}}(x)\lambda (x)}}{{1 - \frac{1}{3}k\eta {{ \Vert f_{i} \Vert }_{\infty }}}}\,dx} + \frac{{y}}{\eta } \biggr)} } } \Biggr) \\ &\quad \le {e^{ - ny}}. \end{aligned}$$
(45)

Let \({V_{i,f_{i}}} := \int _{S_{i}} {{f_{i}^{2}}(x)\lambda (x)\,dx} \) for \(i=1,2,\ldots,n\). By a basic inequality, we have

$$\begin{aligned} &\sum _{i = 1}^{n} { \int _{S_{i}} {\sum_{k = 1}^{{r_{i}}} {{\alpha _{k}}(i) \biggl( {\frac{1}{2} \frac{{{k^{2}}\eta {f_{i}^{2}}(x)\lambda (x)}}{{1 - \frac{1}{3}k\eta {{ \Vert f_{i} \Vert }_{\infty }}}}\,dx} + \frac{{y}}{\eta } \biggr)} } } =\sum_{i = 1}^{n} {\sum _{k = 1}^{{r_{i}}} {{\alpha _{k}}(i) \biggl( {\frac{1}{2}\frac{{\eta {k^{2}}{V_{i,f_{i}}}}}{{1 - \frac{1}{3}k \eta {{ \Vert f_{i} \Vert }_{\infty }}}} + \frac{y}{\eta }} \biggr)} } \\ &\quad = \sum_{i = 1}^{n} {\sum _{k = 1}^{{r_{i}}} {{\alpha _{k}}(i) \biggl( {\frac{1}{2}\frac{{\eta {k^{2}}{V_{i,f_{i}}}}}{ {1 - \frac{1}{3}k\eta {{ \Vert f_{i} \Vert }_{\infty }}}} + \frac{y}{ \eta } \biggl(1 - \frac{1}{3}k\eta {{ \Vert f_{i} \Vert }_{\infty }} \biggr) + \frac{ {ky}}{3}{{ \Vert f_{i} \Vert }_{\infty }}} \biggr)} } \\ &\quad \ge \sum_{i = 1}^{n} {\sum _{k = 1}^{{r_{i}}} {{\alpha _{k}}(i) \biggl( {k\sqrt{2{V_{i,f_{i}}}y} + \frac{{ky}}{3}{{ \Vert f_{i} \Vert }_{\infty }}} \biggr)} = \sum _{i = 1}^{n} \sum _{k = 1}^{{r_{i}}} {k{\alpha _{k}}} (i) \biggl[\sqrt{2{V_{i,f_{i}}}y} + \frac{y}{3}{{ \Vert f_{i} \Vert }_{\infty }} \biggr]}. \end{aligned}$$
(46)

Optimizing over η in (45) and using Eq. (46), we have

$$\begin{aligned}& P \Biggl( {\sum_{i = 1}^{n} { \int _{{S_{i}}} {f_{i}(x)} \Biggl[C{P_{i, {r_{i}}}}(dx) - \sum _{k = 1}^{{r_{i}}} {k{\alpha _{k}}(i)\lambda (x)} \Biggr]\,dx} \ge \sum _{i = 1}^{n} {\sum_{k = 1}^{{r_{i}}} {k{\alpha _{k}}} (i) \biggl(\sqrt{2y{V_{i,f_{i}}}} } + \frac{y}{3}{{ \Vert f_{i} \Vert }_{\infty }} \biggr)} \Biggr) \\& \quad \le {e^{ - ny}}. \end{aligned}$$
(47)

For \(i=1,2,\ldots \) , let \(r_{i} \to \infty \) in (47); then \(CP_{i,r_{i}}(A) \xrightarrow{d} CP_{i}(A)\), which implies the concentration inequality (8).

Define \({c_{1n}} := \sum_{i = 1}^{n} {{\mu _{i}}} \sqrt{2 {V_{i,f_{i}}}} \) and \({c_{2n}} := \sum_{i = 1}^{n} {\frac{{{\mu _{i}}}}{3}}{ \Vert f_{i} \Vert _{\infty }}\), let \(t={{c_{1n}} \sqrt{y} + {c_{2n}}y}\), and we obtain the expression of y by solving a quadratic equation in \(\sqrt{y}\).
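For completeness, solving \(c_{2n}(\sqrt{y})^{2} + c_{1n}\sqrt{y} - t = 0\) for the positive root gives

$$ \sqrt{y} = \frac{ - c_{1n} + \sqrt{c_{1n}^{2} + 4c_{2n}t}}{2c_{2n}}, \qquad y = { \biggl( {\sqrt{\frac{t}{c_{2n}} + \frac{c_{1n}^{2}}{4c_{2n}^{2}}} - \frac{c_{1n}}{2c_{2n}}} \biggr)}^{2}, $$

which is exactly the form of the exponent appearing in the tail bound used for the KKT conditions in Sect. 4 (with \(t = \frac{\xi - 1}{\xi + 1} n w_{\min }\) there).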

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, H., Wu, X. Compound Poisson point processes, concentration and oracle inequalities. J Inequal Appl 2019, 312 (2019). https://doi.org/10.1186/s13660-019-2263-8
