Optimal distribution-free concentration for the log-likelihood function of Bernoulli variables
Journal of Inequalities and Applications volume 2023, Article number: 81 (2023)
Abstract
This paper establishes distribution-free concentration inequalities for the log-likelihood function of Bernoulli variables; that is, the tail bounds do not depend on the parameters. Moreover, Bernstein’s and Bennett’s inequalities with optimal constants are obtained. Simulation studies show significant improvements over previous results.
1 Introduction
Concentration inequalities have been applied in a variety of scenarios, including statistical inference, information theory, and machine learning; see [1, 6, 8, 9]. Let \(X_{1}, X_{2}, \ldots , X_{n}\) be independent Bernoulli random variables with parameters \(p_{i}\), respectively. For simplicity, denote \(X_{i}\sim \operatorname{Ber} (p_{i} )\). Inspired by theoretical studies of likelihood-based methods for binary data, in particular for community detection in networks (see [3, 5, 12] for details), it is of significance to investigate the concentration behavior of the joint likelihood function of \(X_{1}, X_{2}, \ldots , X_{n}\), namely \(L_{n}=\prod_{i=1}^{n}p_{i}^{X_{i}}(1-p_{i})^{1-X_{i}}\).
Consider the simple case \(p_{1}=p_{2}=\cdots =p_{n}=p\in [0,1]\). The asymptotic equipartition property (AEP), one of the most classical results in information theory [7], asserts that
\[ -\frac{1}{n}\log L_{n}\ \xrightarrow{\ \mathbb{P}\ }\ -p\log p-(1-p)\log (1-p)\quad \text{as } n\to \infty , \]
which can be obtained by the law of large numbers. Indeed, this relation implies that the sample-averaged Shannon entropy (the scaled negative log-likelihood \(-n^{-1}\log L_{n}\)) converges to the population Shannon entropy in probability. To get a clearer perspective of the AEP, this paper aims to derive nonasymptotic concentration inequalities for the tail probability. Zhao [10] demonstrated a novel Bernstein-type inequality for \(\sum_{i=1}^{n}X_{i}\log p_{i}\) (a part of \(\log L_{n}\)), which asserts that, for all \(\epsilon >0\),
\[ \mathbb{P} \Biggl( \Biggl|\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i} \Biggr|\geq \epsilon \Biggr)\leq 2\exp \biggl(-\frac{\epsilon ^{2}}{2(n+\epsilon )} \biggr), \quad (1) \]
by providing an upper bound, independent of the parameter \(p_{i}\), for the moment generating function (MGF) of \((X_{i}-p_{i})\log p_{i}\).
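As a quick numerical illustration of the AEP statement above (a sketch with an arbitrarily chosen parameter \(p=0.3\) and sample size, not taken from the paper), one can check that the scaled negative log-likelihood is close to the Shannon entropy for large n:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 0.3, 10_000                       # hypothetical parameter and sample size
x = rng.binomial(1, p, size=n)           # X_1, ..., X_n ~ Ber(p)
log_Ln = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
print(-log_Ln / n, entropy)              # the two values should nearly coincide
```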
Recall the definition of a sub-gamma random variable (see [1, 8] for more details): a random variable X is sub-gamma with variance factor \(v>0\) and scale parameter \(b>0\) (denoted \(\operatorname{sub}\Gamma (v,b )\)) if its MGF satisfies, for all \(|\lambda |< b^{-1}\),
\[ \log \mathbb{E} \exp \bigl(\lambda (X-\mathbb{E}X) \bigr)\leq \frac{v\lambda ^{2}}{2(1-b|\lambda |)}. \]
Indeed, Theorem 2 of [10] shows that the random variable \(X_{i}\log p_{i}\) is \(\operatorname{sub}\Gamma (1,1 )\). As pointed out in Remark 2 of [10], the scale factor \(b=1\) in the denominator is optimal, while the variance factor \(v=1\) is not sharp. A natural question is whether we can improve the variance factor, and hence the Bernstein-type inequality (1). Moreover, moving back to the joint log-likelihood function, can we derive a sharp Bernstein-type inequality for \(\log L_{n}\)?
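Before improving the constant, it may help to see how much slack the variance factor \(v=1\) leaves. The following sketch (our own illustration, not code from [10]) evaluates the exact log-MGF of \((X_{i}-p_{i})\log p_{i}\) on a grid and compares it with the \(\operatorname{sub}\Gamma (1,1)\) envelope \(\lambda ^{2}/(2(1-|\lambda |))\):

```python
import numpy as np

# Exact MGF of Y = (X - p) * log(p) for X ~ Ber(p):
#   E[exp(lam * Y)] = p * exp(lam*(1-p)*log p) + (1-p) * exp(-lam*p*log p)
def log_mgf_Y(lam, p):
    lp = np.log(p)
    return np.log(p * np.exp(lam * (1 - p) * lp) + (1 - p) * np.exp(-lam * p * lp))

lam = np.linspace(-0.9, 0.9, 181)
lam = lam[np.abs(lam) > 1e-3][:, None]         # drop lambda ~ 0 to avoid 0/0 below
p = np.linspace(1e-6, 1 - 1e-6, 2000)[None, :]
envelope = lam ** 2 / (2 * (1 - np.abs(lam)))  # sub-gamma(1, 1) bound on the log-MGF
ratio = log_mgf_Y(lam, p) / envelope
print(ratio.max())  # well below 1, so the variance factor v = 1 leaves room for improvement
```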
In this paper, we study optimal distribution-free (independent of the parameters \(p_{i}\)) concentration bounds for \(\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i}\) and \(\log L_{n}=\sum_{i=1}^{n}(X_{i}\log p_{i}+(1-X_{i})\log (1-p_{i}))\) with independent \(X_{i}\sim \operatorname{Ber} (p_{i} )\) for \(p_{i}\in [0,1]\). These results, which cannot be derived from the classical Hoeffding inequality (see [4, 10]), are particularly useful when assumptions on the \(p_{i}\) are inconvenient to impose. In addition, the improvements given by the optimal uniform constants are nonnegligible in the nonasymptotic sense, especially for small-sample datasets.
The rest of this paper is organized as follows. Section 2 establishes Bernstein-type inequalities for \(\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i}\) and for the log-likelihood function of Bernoulli variables, both of which enjoy optimal variance and scale factors. Inspired by Bennett’s inequality for bounded random variables, Sect. 3 improves the bounds on the right tail. Some extensions are discussed in Sect. 4. The improvements given by the optimal constants are demonstrated by simulation studies in Sect. 5. Finally, Sect. 6 concludes the article with a discussion.
Notation: Throughout this paper, set \(0 \log 0 = 0\) by convention. All logarithms and exponentials are in the natural base.
2 Bernstein’s inequality
Based on the classical Chernoff method (see [1, 8, 10]), this section establishes optimal Bernstein-type inequalities for \(\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i}\) and \(\log L_{n}\).
Theorem 1
Let \(X_{i}\) be independent \(\operatorname{Ber} (p_{i} )\) for \(i=1, \ldots , n\), where \(p_{i} \in [0,1]\). Then, for all \(\epsilon >0\),
\[ \mathbb{P} \Biggl( \Biggl|\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i} \Biggr|\geq \epsilon \Biggr)\leq 2\exp \biggl(-n\gamma \,h \biggl(\frac{\epsilon }{n\gamma } \biggr) \biggr), \quad (2) \]
where \(h(u)=1+u-\sqrt{1+2u}\) for \(u>0\) and \(\gamma :=\max_{0\leq x\leq 1}x(1-x)(\log x)^{2}\).
Proof
Denote \(Y_{i}:= (X_{i}-p_{i} ) \log p_{i}\), where \(X_{i}\sim \operatorname{Ber} (p_{i} )\). It follows that \(\mathbb{E}Y_{i}=0\) and \(\operatorname{Var}(Y_{i})=p_{i}(1-p_{i})(\log p_{i})^{2}\). Note \(\gamma =\max_{0\leq x\leq 1} x(1-x)(\log x)^{2}\), which is approximately 0.477365 at \(x_{0}\approx 0.104058\). Hence, \(\operatorname{Var}(Y_{i})\leq \gamma \) with equality if and only if \(p_{i}=x_{0}\). Now, we claim that the moments of \(Y_{i}\) satisfy the Bernstein condition (see, e.g., Theorem 1 of [10] or Theorem 2.10 of [1]): for all integers \(k\geq 2\),
\[ \bigl|\mathbb{E}(Y_{i})^{k} \bigr|\leq \frac{k!\,\gamma }{2}. \quad (3) \]
By the power series expansion of the exponential function and Fubini’s theorem, it follows that, for all \(|\lambda |<1\),
\[ \mathbb{E}e^{\lambda Y_{i}} =1+\sum_{k=2}^{\infty }\frac{\lambda ^{k}\mathbb{E}(Y_{i})^{k}}{k!} \leq 1+\frac{\gamma \lambda ^{2}}{2}\sum_{k=2}^{\infty }|\lambda |^{k-2} =1+\frac{\gamma \lambda ^{2}}{2(1-|\lambda |)} \leq \exp \biggl(\frac{\gamma \lambda ^{2}}{2(1-|\lambda |)} \biggr), \]
where γ is defined as above, and the last inequality uses \(1+x\leq \exp (x)\) for \(x\in \mathbb{R}\). Therefore, the random variable \(Y_{i}\) is \(\operatorname{sub}\Gamma (\gamma ,1 )\), with constants independent of \(p_{i}\). Applying the standard Chernoff method (cf. Theorem 2.10 of [1]) gives, for any \(\epsilon > 0\),
\[ \mathbb{P} \Biggl( \Biggl|\sum_{i=1}^{n}Y_{i} \Biggr|\geq \epsilon \Biggr)\leq 2\exp \biggl(-n\gamma \,h \biggl(\frac{\epsilon }{n\gamma } \biggr) \biggr), \]
where \(h(u)=1+u-\sqrt{1+2u}\) for \(u>0\) and γ is defined as above.
It remains to show claim (3). The case of \(k=2\) follows trivially. Similar to Theorem 1 of [10], for \(k\geq 3\),
\[ \bigl|\mathbb{E}(Y_{i})^{k} \bigr| = \bigl|p_{i}(1-p_{i})^{k}(\log p_{i})^{k}+(1-p_{i})(-p_{i}\log p_{i})^{k} \bigr| \leq p_{i}|\log p_{i}|^{k}+p_{i}^{k}|\log p_{i}|^{k} \leq \bigl(k^{k}+1 \bigr)e^{-k}, \]
where the first inequality follows from \(p_{i}\in [0,1]\) and the last inequality is implied by the fact that the function \(f(x):=x^{r}(\log x)^{k}\) for \(x\in (0,1)\) achieves its optimum at \(x=e^{-k/r}\) for any \(r>0\) and integer \(k\geq 1\). Now we prove the claim by induction. The case of \(k=3\) follows from simple computations, that is, \(28\exp (-3)\approx 1.394038<3\gamma \). Suppose for some \(k\geq 3\), \((k^{k}+1)\exp (-k)\leq k!\gamma /2\). For the case of \(k+1\), since \((1+1/k)^{k}\leq e\), it follows that
\[ \bigl((k+1)^{k+1}+1 \bigr)e^{-(k+1)} \leq e(k+1) \bigl(k^{k}+1 \bigr)e^{-(k+1)} =(k+1) \bigl(k^{k}+1 \bigr)e^{-k} \leq \frac{(k+1)!\,\gamma }{2}, \]
which completes the induction and finishes the proof of the theorem. □
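As a numerical sanity check of the constants used above (a sketch, not part of the formal proof), one can verify the value of γ and the bound \((k^{k}+1)e^{-k}\leq k!\gamma /2\) over a range of k:

```python
import math
from scipy.optimize import minimize_scalar

# gamma = max_{0 <= x <= 1} x*(1-x)*(log x)^2, attained in the interior of (0, 1)
res = minimize_scalar(lambda x: -x * (1 - x) * math.log(x) ** 2,
                      bounds=(1e-12, 1 - 1e-12), method="bounded")
x0, gamma = res.x, -res.fun
print(x0, gamma)                     # ~0.104058 and ~0.477365, as stated in the proof

# Moment bound used in the induction: (k^k + 1) * e^{-k} <= k! * gamma / 2
for k in range(3, 31):
    assert (k ** k + 1) * math.exp(-k) <= math.factorial(k) * gamma / 2, k
print("moment bound verified for k = 3, ..., 30")
```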
Remark 1
The constant γ is optimal because \(Y_{i}\), being \(\operatorname{sub}\Gamma (v,b )\), satisfies \(\operatorname{Var}(Y_{i})\leq v\) by the properties of sub-gamma variables. Indeed, one can choose \(p_{i}=x_{0}\), which achieves \(\operatorname{Var}(Y_{i})=\gamma \); thus \(\gamma \leq v\), implying its optimality. As pointed out in Remarks 1 and 2 of [10], the constant \(b=1\) is optimal since \(\mathbb{E}(Y_{i})^{k}\) can achieve \(c(k/e)^{k}\) for some constant \(0< c<1\) at \(p_{i}=\exp (-k)\), whereas the Stirling approximation \(k!\sim \sqrt{2\pi k}\,(k/e)^{k}\) shows that
\[ c \biggl(\frac{k}{e} \biggr)^{k}\sim \frac{c\,k!}{\sqrt{2\pi k}}> \frac{k!\,v\,b^{k-2}}{2} \]
for large enough k and any \(0< b<1\), so the Bernstein condition cannot hold with a scale parameter smaller than 1. Hence, both the variance factor and the scale factor in Theorem 1 are optimal.
Remark 2
To give a nicer form of (2), by the elementary inequality (Exercise 2.8 of [1])
\[ h(u)=1+u-\sqrt{1+2u}\geq \frac{u^{2}}{2(1+u)},\qquad u>0, \]
we have, for any \(\epsilon >0\),
\[ \mathbb{P} \Biggl( \Biggl|\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i} \Biggr|\geq \epsilon \Biggr)\leq 2\exp \biggl(-\frac{\epsilon ^{2}}{2(n\gamma +\epsilon )} \biggr), \quad (4) \]
where γ is defined as in Theorem 1. Compared to equation (3) of [10], we improve the constant \(\sigma ^{2}=1\) to \(\gamma \approx 0.477\), which is optimal. Furthermore, one can generalize Theorem 1 to multinoulli variables and grouped observations, analogously to Corollary 1 and Theorem 2 of [10].
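Assuming the closed forms of (1) and (4) given above, the gain from replacing \(\sigma ^{2}=1\) by γ can be quantified directly; a minimal sketch:

```python
import numpy as np

gamma = 0.477365                     # optimal variance factor from Theorem 1
n = 100
eps = np.linspace(1.0, 50.0, 50)     # deviation levels (arbitrary illustrative grid)

bound_old = 2 * np.exp(-eps ** 2 / (2 * (n * 1.0 + eps)))    # bound (1), sigma^2 = 1
bound_new = 2 * np.exp(-eps ** 2 / (2 * (n * gamma + eps)))  # bound (4), optimal gamma
print(np.max(bound_new / bound_old))  # below 1 for every eps: uniformly tighter
```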
Analogously, Theorem 2 provides the optimal Bernstein-type inequality for \(\log L_{n}\).
Theorem 2
Let \(X_{i}\) be independent \(\operatorname{Ber} (p_{i} )\) for \(i=1, \ldots , n\), where \(p_{i} \in [0,1]\). Then, for all \(\epsilon >0\),
\[ \mathbb{P} \bigl( \bigl|\log L_{n}-\mathbb{E}(\log L_{n}) \bigr|\geq \epsilon \bigr)\leq 2\exp \biggl(-n\gamma _{0}\,h \biggl(\frac{\epsilon }{n\gamma _{0}} \biggr) \biggr), \quad (5) \]
where \(h(u)=1+u-\sqrt{1+2u}\) for \(u>0\) and \(\gamma _{0}:=\max_{0\leq x\leq 1}x(1-x)(\log \frac{x}{1-x})^{2}\).
Proof
Similar to the proof of Theorem 1, denote \(Z_{i}:=X_{i} \log p_{i}+(1-X_{i})\log (1-p_{i})-p_{i}\log p_{i}-(1-p_{i}) \log (1-p_{i})\). It follows that \(\log L_{n}-\mathbb{E}(\log L_{n})=\sum_{i=1}^{n}Z_{i}\). One can easily verify that \(\mathbb{P}(Z_{i}=(1-p_{i})\log (p_{i}/(1-p_{i})))=p_{i}\) and \(\mathbb{P}(Z_{i}=-p_{i}\log (p_{i}/(1-p_{i})))=1-p_{i}\) for \(p_{i}\in (0,1)\). It follows that \(\mathbb{E}(Z_{i})=0\) and
\[ \operatorname{Var}(Z_{i})=p_{i}(1-p_{i}) \biggl(\log \frac{p_{i}}{1-p_{i}} \biggr)^{2}\leq \gamma _{0} \]
by the definition of \(\gamma _{0}=\max_{0\leq x\leq 1} x(1-x)(\log \frac{x}{1-x})^{2}\), which is approximately 0.439229 at \(x_{0}\approx 0.083222\) and \(1-x_{0}\approx 0.916778\). In what follows, we shall show that the moments of \(Z_{i}\) are well bounded, satisfying the Bernstein condition: for all integers \(k\geq 2\),
\[ \bigl|\mathbb{E}(Z_{i})^{k} \bigr|\leq \frac{k!\,\gamma _{0}}{2}, \quad (6) \]
which implies the desired result by the same arguments as in Theorem 1. The case of \(k=2\) follows trivially from \(\mathbb{E}(Z_{i})^{2}=\operatorname{Var}(Z_{i})\leq \gamma _{0}\). For \(k\geq 3\) and all \(p_{i}\in [0,1]\), we have
\[ \mathbb{E}(Z_{i})^{k}=p_{i}(1-p_{i})^{k} \biggl(\log \frac{p_{i}}{1-p_{i}} \biggr)^{k}+(1-p_{i})(-p_{i})^{k} \biggl(\log \frac{p_{i}}{1-p_{i}} \biggr)^{k}, \]
which is symmetric about \(p_{i}=1/2\). It suffices to consider \(p_{i}\in [0,1/2]\). Indeed, it follows that, for \(0\leq p_{i}\leq 1/2\),
\[ \bigl|\mathbb{E}(Z_{i})^{k} \bigr| \leq p_{i}(1-p_{i})^{k} \biggl|\log \frac{p_{i}}{1-p_{i}} \biggr|^{k}+(1-p_{i})p_{i}^{k} \biggl|\log \frac{p_{i}}{1-p_{i}} \biggr|^{k} \leq p_{i} \biggl|\log \frac{p_{i}}{1-p_{i}} \biggr|^{k}+(1-p_{i})p_{i}^{k} \biggl|\log \frac{p_{i}}{1-p_{i}} \biggr|^{k} \leq p_{i}|\log p_{i}|^{k}+ \biggl(\frac{p_{i}}{1-p_{i}} \biggr)^{k} \biggl|\log \frac{p_{i}}{1-p_{i}} \biggr|^{k} \leq \bigl(k^{k}+1 \bigr)e^{-k}, \]
where the first inequality uses \(|a+b|\leq |a|+|b|\), the second inequality follows from \((1-p_{i})^{k}\leq 1\) for \(p_{i}\in [0,1]\), the third inequality uses the facts that \(| (\log p_{i}-\log (1-p_{i}) )^{k}|\leq |\log p_{i}|^{k}\) and \(0\leq p_{i}/(1-p_{i})\leq 1\) for \(0\leq p_{i}\leq 1/2\), and the last inequality is implied by the fact that the function \(f(x):=x^{r}(\log x)^{k}\) for \(x\in (0,1)\) achieves its optimum at \(x=e^{-k/r}\) for any \(r>0\) and integer \(k\geq 1\). Note that the case of \(k=3\) follows from simple computations, that is,
\[ \bigl|\mathbb{E}(Z_{i})^{3} \bigr|=p_{i}(1-p_{i})|1-2p_{i}| \biggl|\log \frac{p_{i}}{1-p_{i}} \biggr|^{3}\leq \frac{6}{5}< \frac{3!\,\gamma _{0}}{2} \]
by \(\gamma _{0}>2/5\). While for \(k\geq 4\), we shall prove (6) by induction with the bound \(|\mathbb{E}(Z_{i})^{k}|\leq (k^{k}+1)\exp (-k)\). The case of \(k=4\) can be verified by
\[ \bigl(4^{4}+1 \bigr)e^{-4}\approx 4.707< \frac{24}{5}\leq \frac{4!\,\gamma _{0}}{2}, \]
where \(\gamma _{0}> 2/5\). Suppose for some \(k\geq 4\), \((k^{k}+1)\exp (-k)\leq k!\gamma _{0}/2\). For the case of \(k+1\), it follows that
\[ \bigl((k+1)^{k+1}+1 \bigr)e^{-(k+1)} \leq e(k+1) \bigl(k^{k}+1 \bigr)e^{-(k+1)} =(k+1) \bigl(k^{k}+1 \bigr)e^{-k} \leq \frac{(k+1)!\,\gamma _{0}}{2} \]
by \((1+1/k)^{k}\leq e\) for all \(k\geq 4\), which completes the induction and finishes the proof of the theorem. □
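Analogously to the check after Theorem 1, the constant \(\gamma _{0}\) and the induction bound \((k^{k}+1)e^{-k}\leq k!\gamma _{0}/2\) for \(k\geq 4\) can be verified numerically (a sketch, not part of the proof):

```python
import math
from scipy.optimize import minimize_scalar

# gamma_0 = max_{0 <= x <= 1} x*(1-x)*(log(x/(1-x)))^2; by symmetry we search on (0, 1/2]
res = minimize_scalar(lambda x: -x * (1 - x) * math.log(x / (1 - x)) ** 2,
                      bounds=(1e-12, 0.5), method="bounded")
x0, gamma0 = res.x, -res.fun
print(x0, gamma0)                    # ~0.083222 and ~0.439229; the other maximizer is 1 - x0

# Induction bound used for k >= 4: (k^k + 1) * e^{-k} <= k! * gamma_0 / 2
for k in range(4, 31):
    assert (k ** k + 1) * math.exp(-k) <= math.factorial(k) * gamma0 / 2, k
print("moment bound verified for k = 4, ..., 30")
```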
Remark 3
Similar to Remark 1, the constants \(\gamma _{0}\) and \(b=1\) are optimal for the Bernstein-type inequality (5). One can also obtain a friendly form of (5):
\[ \mathbb{P} \bigl( \bigl|\log L_{n}-\mathbb{E}(\log L_{n}) \bigr|\geq \epsilon \bigr)\leq 2\exp \biggl(-\frac{\epsilon ^{2}}{2(n\gamma _{0}+\epsilon )} \biggr), \quad (7) \]
in which \(\gamma _{0}\) is defined as in Theorem 2.
3 Bennett’s inequality
The Bernstein-type inequalities in Sect. 2 control both the left and the right tails, while in practice one is often more interested in the right tail. A natural question at this point is whether we can derive a tighter bound on the right tail for the log-likelihood function of binary data. Thanks to the definition of the Bernoulli variables, the components of the log-likelihood function are bounded from above, which inspires us to use Bennett’s inequality to derive a more informative upper bound on the right tail; see [1, 8] for more details on Bennett’s inequality.
Theorem 3
Under the condition of Theorem 1, for all \(\epsilon >0\), we have
\[ \mathbb{P} \Biggl(\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i}\geq \epsilon \Biggr)\leq \exp \biggl(-e^{2}n\gamma \,g \biggl(\frac{\epsilon }{en\gamma } \biggr) \biggr), \quad (8) \]
where \(g(u)=(1+u)\log (1+u)-u\) for \(u>0\) and \(\gamma :=\max_{0\leq x\leq 1}x(1-x)(\log x)^{2}\).
Proof
Consider \(Y_{i}:= (X_{i}-p_{i} ) \log p_{i}\) and \(S=\sum_{i=1}^{n}Y_{i}\). It is not hard to verify that \(Y_{i}\leq -p_{i}\log p_{i}\leq 1/e\) for all \(p_{i}\in [0,1]\). Following the classical derivation of Bennett’s inequality, let \(\phi (u):=e^{u}-u-1\) for all \(u\in \mathbb{R}\). Since \(u^{-2}\phi (u)\) is a nondecreasing function of \(u\in \mathbb{R}\) (extended continuously at 0), for all \(\lambda >0\) we have \(e^{\lambda Y_{i}}-\lambda Y_{i}-1\leq e^{2}Y_{i}^{2} \phi (\lambda /e )\), implying that \(\mathbb{E} (e^{\lambda Y_{i}} )\leq 1+ e^{2}\mathbb{E}(Y_{i}^{2})\phi (\lambda /e )\) by \(\mathbb{E}(Y_{i})=0\). Since the \(Y_{i}\)’s are independent, it follows that
\[ \mathbb{E} \bigl(e^{\lambda S} \bigr)=\prod_{i=1}^{n}\mathbb{E} \bigl(e^{\lambda Y_{i}} \bigr)\leq \prod_{i=1}^{n} \bigl(1+e^{2}\mathbb{E}(Y_{i}^{2})\phi (\lambda /e ) \bigr)\leq \exp \Biggl(e^{2}\phi (\lambda /e )\sum_{i=1}^{n}\mathbb{E}(Y_{i}^{2}) \Biggr) \quad (9) \]
by \(\log (1+u)\leq u\) for all \(u\geq 0\). Note \(\mathbb{E}(Y_{i}^{2})=p_{i}(1-p_{i})(\log p_{i})^{2}\leq \gamma \), where γ is defined as in Theorem 3, which implies \(\log \mathbb{E}(e^{\lambda S})\leq e^{2} n\gamma \phi (\lambda /e )\) for all \(\lambda >0\). Then the Cramér transform of S is bounded by that of a corresponding Poisson random variable (see Chap. 2 of [1]), that is, for all \(\epsilon >0\),
\[ \mathbb{P} (S\geq \epsilon )\leq \exp \biggl(-e^{2}n\gamma \,g \biggl(\frac{\epsilon }{en\gamma } \biggr) \biggr), \]
in which γ is defined as above and \(g(u)=(1+u)\log (1+u)-u\) for \(u>0\), completing the proof of (8). □
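Assuming the reconstructed forms of (2) and (8), one can check numerically that the Bennett bound is tighter than the Bernstein bound on the right tail; a minimal sketch:

```python
import numpy as np

gamma = 0.477365
n = 100
eps = np.linspace(1.0, 80.0, 80)

def h(u):                                  # Bernstein rate function
    return 1 + u - np.sqrt(1 + 2 * u)

def g(u):                                  # Bennett rate function
    return (1 + u) * np.log(1 + u) - u

log_bernstein = -n * gamma * h(eps / (n * gamma))                    # right-tail part of (2)
log_bennett = -np.e ** 2 * n * gamma * g(eps / (np.e * n * gamma))   # bound (8)
print(np.all(log_bennett <= log_bernstein))  # True on this grid: Bennett is tighter
```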
Analogously, we can obtain Bennett’s inequality for the joint log-likelihood function \(\log L_{n}\).
Theorem 4
Under the condition of Theorem 2, for all \(\epsilon >0\), we have
\[ \mathbb{P} \bigl(\log L_{n}-\mathbb{E}(\log L_{n})\geq \epsilon \bigr)\leq \exp \biggl(-\frac{n\gamma _{0}}{\beta ^{2}}\,g \biggl(\frac{\beta \epsilon }{n\gamma _{0}} \biggr) \biggr), \quad (10) \]
where \(g(u)=(1+u)\log (1+u)-u\) for \(u>0\), \(\beta :=\max_{0< x<1} (1-x)\log (x/(1-x))\) and \(\gamma _{0}:=\max_{0\leq x\leq 1}x(1-x)(\log \frac{x}{1-x})^{2}\).
Proof
The proof follows arguments similar to those of Theorem 3. Indeed, set \(Z_{i}:=X_{i} \log p_{i}+(1-X_{i})\log (1-p_{i})-p_{i}\log p_{i}-(1-p_{i}) \log (1-p_{i})\), which satisfies \(\mathbb{E}(Z_{i})=0\) and \(\mathbb{E}(Z_{i}^{2})\leq \gamma _{0}\) by Theorem 2. Moreover, \(Z_{i}\leq \max_{0< t< 1} (1-t)\log \frac{t}{1-t}=\beta \) for all \(1\leq i\leq n\). Hence, by similar arguments as those of Theorem 3, for all \(\epsilon >0\),
\[ \mathbb{P} \Biggl(\sum_{i=1}^{n}Z_{i}\geq \epsilon \Biggr)\leq \exp \biggl(-\frac{n\gamma _{0}}{\beta ^{2}}\,g \biggl(\frac{\beta \epsilon }{n\gamma _{0}} \biggr) \biggr), \]
where β, \(g(u)\) are defined in Theorem 4 and \(\gamma _{0}\) is defined as in Theorem 2. □
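The constant β can be evaluated in the same way as γ and \(\gamma _{0}\); the numerical values below are our own computation, not taken from the paper:

```python
import math
from scipy.optimize import minimize_scalar

# beta = max_{0 < x < 1} (1 - x) * log(x / (1 - x)); the maximizer lies in (1/2, 1)
res = minimize_scalar(lambda x: -(1 - x) * math.log(x / (1 - x)),
                      bounds=(0.5, 1 - 1e-12), method="bounded")
x_star, beta = res.x, -res.fun
print(x_star, beta)    # roughly 0.782 and 0.278
```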
Remark 4
To get a nicer form of (8), one can verify that
\[ g(u)=(1+u)\log (1+u)-u\geq \frac{u^{2}}{2(1+u/3)},\qquad u>0, \]
which delivers that, under the condition of Theorem 3,
\[ \mathbb{P} \Biggl(\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i}\geq \epsilon \Biggr)\leq \exp \biggl(-\frac{\epsilon ^{2}}{2 (n\gamma +\epsilon /(3e) )} \biggr), \]
improving upon the Bernstein-type inequality by a factor of 3e in the scale parameter on the right tail (compared with (4) in Remark 2). One can also obtain that
\[ \mathbb{P} \bigl(\log L_{n}-\mathbb{E}(\log L_{n})\geq \epsilon \bigr)\leq \exp \biggl(-\frac{\epsilon ^{2}}{2 (n\gamma _{0}+\beta \epsilon /3 )} \biggr), \]
improving upon the Bernstein-type inequality by a factor of \(3/\beta \) on the right tail (compared with (7) in Remark 3).
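The elementary inequality used above, \(g(u)\geq u^{2}/(2(1+u/3))\), is the standard Bennett-to-Bernstein relaxation assumed in the reconstruction; it can be checked numerically on a grid:

```python
import numpy as np

u = np.linspace(1e-6, 100.0, 200_000)
g = (1 + u) * np.log(1 + u) - u
lower = u ** 2 / (2 * (1 + u / 3))
print(np.all(g >= lower))   # True on the sampled grid
```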
4 Extensions
The results in Sects. 2 and 3 are distribution-free; it is therefore of interest to investigate some extensions of these concentration inequalities. In this section, we point out a possible direction for generalizing the results above.
As illustrated in Sect. 3, the improved upper tail relies on (9), which uses the elementary inequality \(1+x\leq \exp (x)\) for all \(x\in \mathbb{R}\). In addition to distribution-free concentration, it is also valuable to obtain concentration inequalities that depend on the parameters, which may be tighter for specific problems. Inspired by the refined Bennett inequality provided by [13], we can sharpen Theorems 3 and 4 by applying the arithmetic–geometric (AG) mean inequality. Under the condition of Theorem 3, we have
\[ \mathbb{E} \bigl(e^{\lambda S} \bigr)\leq \prod_{i=1}^{n} \bigl(1+e^{2}\mathbb{E}(Y_{i}^{2})\phi (\lambda /e ) \bigr)\leq \bigl(1+e^{2}\bar{s}\,\phi (\lambda /e ) \bigr)^{n}, \]
where \(\bar{s}:=\sum_{i=1}^{n}\mathbb{E}(Y_{i}^{2})/n\), by \(\prod_{k=1}^{n} x_{k}\leq (\bar{x})^{n}\) for \(x_{k}\geq 0\) with \(\bar{x}=\sum_{k=1}^{n}x_{k}/n\). Then we can deduce a similar Bennett-type inequality by the classical Chernoff method. Moreover, following the arguments in Sect. 3 of [13], we can obtain refined Bennett inequalities by applying a refined AG mean inequality, introduced by [2], which improves the upper bound on the MGF of S by exploiting the differences among the parameters \(p_{i}\).
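To see the effect of the AG-mean step, the sketch below compares the three MGF bounds \(\prod_{i}(1+a_{i})\leq (1+\bar{a})^{n}\leq e^{n\bar{a}}\) for a toy, hypothetical choice of parameters (our own example, not from [13]):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=50)        # hypothetical Bernoulli parameters
var = p * (1 - p) * np.log(p) ** 2          # E(Y_i^2) for Y_i = (X_i - p_i) * log(p_i)
lam = 0.5
phi = np.exp(lam / np.e) - lam / np.e - 1   # phi(lambda / e)
a = np.e ** 2 * var * phi                   # e^2 * E(Y_i^2) * phi(lambda / e)

product_bound = np.prod(1 + a)              # bound before any relaxation
ag_mean_bound = (1 + a.mean()) ** len(a)    # AG-mean bound of this section
exp_bound = np.exp(a.sum())                 # bound used in the proof of Theorem 3
print(product_bound <= ag_mean_bound <= exp_bound)  # True: the AG-mean bound sits in between
```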
5 Simulation study
In this section, we present simulation studies to illustrate the improvements of our results. Because the tail bounds are distribution-free, we need not specify the parameters \(p_{i}\) and simply compare the logarithmic tail bounds for various sample sizes n and error rates ϵ. Both the two-sided and the right-sided bounds are illustrated for different n and ϵ.
First, consider the improvement of the concentration bound for \(\sum_{i=1}^{n}(X_{i}-p_{i})\log p_{i}\). The sample size takes the four values \(n=100, 200, 500\), and 1000. For each n, the error rate varies from 0.1 to 1. For each fixed \((n,\epsilon )\), the two-sided Bernstein-type tail bounds (2) and (4) with optimal constants and the result due to [10] (see (1)) are compared; see Fig. 1. The sharp Bernstein-type inequalities (Theorem 1) improve the previous result (1) significantly across the various cases.
For the joint log-likelihood function, the result provided by [11] asserts that (see Theorem 1 of [11] with \(K=2\)), for small \(\epsilon >0\),
Similar to the first example above, the sample size takes the values \(n=100, 200, 500\), and 1000, and for each n the error rate varies from 0.1 to 1. The tail bounds (5), (7), and (13) are compared. As Fig. 2 shows, our results perform better than (13) over different n and small ϵ, which are the interesting cases in practice. Moreover, the right-tail concentration results (Theorems 3 and 4) are also compared. Here the sample size takes values from \(\{100,300,1000,2000\}\), and the error rate varies from 0.1 to 1. The one-sided tail bounds (1), (2), and (8) can be found in Fig. 3, where the factor 2 in (1) and (2) is removed for the right-tail case. Similarly, the right-tail bounds (5), (10), and (13) can be found in Fig. 4, in which the factor 2 in (5) and (13) is removed for the right-tail case. The sharp Bennett inequalities improve the right-tail bounds significantly as the error rate increases.
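Since the original figure code is not available, the following sketch reproduces the style of the comparison in Fig. 1 using the reconstructed closed forms (1) and (4); it interprets the error rate ϵ as a per-observation deviation, i.e., the bounds are evaluated at \(n\epsilon \) (our reading of this section, stated as an assumption):

```python
import numpy as np
import matplotlib.pyplot as plt

gamma = 0.477365
eps = np.linspace(0.1, 1.0, 10)                  # error rates as in Sect. 5

fig, axes = plt.subplots(1, 4, figsize=(16, 3), sharey=True)
for ax, n in zip(axes, [100, 200, 500, 1000]):
    t = n * eps                                  # total deviation (assumption)
    ax.plot(eps, np.log(2) - t ** 2 / (2 * (n + t)), label="bound (1), sigma^2 = 1")
    ax.plot(eps, np.log(2) - t ** 2 / (2 * (n * gamma + t)), label="bound (4), optimal gamma")
    ax.set_title(f"n = {n}")
    ax.set_xlabel("epsilon")
axes[0].set_ylabel("log tail bound")
axes[-1].legend()
plt.tight_layout()
plt.show()
```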
6 Conclusion
We have studied distribution-free concentration inequalities for the log-likelihood function of Bernoulli variables. We established optimal Bernstein-type inequalities whose variance and scale factors are the best possible in the sense of sub-gamma random variables. Moreover, Bennett-type inequalities with sharp constants were also derived, which improve the scale factors of the Bernstein-type inequalities on the right tail.
There are some limitations to this study. First, it would be interesting to consider concentration for more general discrete distributions; see [11] for more details. Second, we focus on distribution-free concentration, while it is also valuable to obtain concentration inequalities that depend on the parameters, which may be tighter for specific problems. Furthermore, one can consider the concentration of likelihood ratio statistics, which is an interesting direction for further study.
Availability of data and materials
No data were used to support this study.
References
Boucheron, S., Lugosi, G., Massart, P.: Concentration Inequalities: A Non-asymptotic Theory of Independence. Oxford University Press, Oxford (2013)
Cartwright, D.I., Field, M.J.: A refinement of the arithmetic mean-geometric mean inequality. Proc. Am. Math. Soc. 71, 36–38 (1978)
Choi, D.S., Wolfe, P.J., Airoldi, E.M.: Stochastic blockmodels with a growing number of classes. Biometrika 99, 273–284 (2012)
Hoeffding, W.: Probability inequalities for sums of bounded random variables. J. Am. Stat. Assoc. 58, 13–30 (1963)
Paul, S., Chen, Y.: Consistent community detection in multi-relational data through restricted multi-layer stochastic block model. Electron. J. Stat. 10, 3807–3870 (2016)
Raginsky, M., Sason, I.: Concentration of measure inequalities in information theory, communications, and coding. Found. Trends Commun. Inf. Theory 10, 1–246 (2013)
Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423 (1948)
Zhang, H., Chen, S.: Concentration inequalities for statistical inference. Commun. Math. Sci. 37, 1–85 (2021)
Zhang, H., Wei, H.: Sharper sub-Weibull concentrations. Mathematics 10, 2252 (2022)
Zhao, Y.: A note on new Bernstein-type inequalities for the log-likelihood function of Bernoulli variables. Stat. Probab. Lett. 163, 108779 (2020)
Zhao, Y.: An optimal uniform concentration inequalities for discrete entropy in the high-dimensional setting. Bernoulli 28, 1892–1911 (2022)
Zhao, Y., Weko, C.: Network inference from grouped observations using hub models. Stat. Sin. 29, 225–244 (2019)
Zheng, S.: An improved Bennett’s inequality. Commun. Stat., Theory Methods 47, 4152–4159 (2017)
Funding
No funding was received for this research.
Contributions
Zhonggui Ren wrote the main manuscript text, prepared Figs. 1–4, and reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.