- Research Article
- Open Access
Bounds for Tail Probabilities of the Sample Variance
© V. Bentkus and M. Van Zuijlen. 2009
- Received: 11 February 2009
- Accepted: 20 June 2009
- Published: 9 August 2009
We provide bounds for tail probabilities of the sample variance. The bounds are expressed in terms of Hoeffding functions and are the sharpest known. They are designed with applications in mind, such as auditing and the processing of environmental data.
- Convex Function
- Central Limit Theorem
- Sample Variance
- Elementary Calculation
- Point Distribution
for the mean, variance, and the fourth central moment of , and assume that . Some of our results hold only for bounded random variables. In such cases, without loss of generality, we assume that . Note that is a natural condition in audit applications.
The paper is organized as follows. In the introduction we give a description of bounds, some comments, and references. In Section 2 we obtain sharp upper bounds for the fourth moment. In Section 3 we give proofs of all facts and results from the introduction.
If , then the range of interest in (1.5) is , where
The restriction on the range of in (1.4) (resp., in (1.5) in cases where the condition is fulfilled) is natural. Indeed, for , due to the obvious inequality . Furthermore, in the case of we have for since (see Proposition 2.3 for a proof of the latter inequality).
where is a standard normal random variable, and is the standard normal distribution function.
All our bounds are expressed in terms of the function . Using (1.11), it is easy to replace them by bounds expressed in terms of the function , and we omit related formulations.
where is a Poisson random variable with parameter .
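Numerically, a Poisson-type bound of this kind reduces to evaluating a Poisson survival probability. Since the paper's rate parameter is elided in this extraction, the sketch below uses a generic rate `lam`; it illustrates only how such a bound is computed, not the paper's exact expression.

```python
from math import exp, factorial

def poisson_survival(lam: float, k: int) -> float:
    """P(N >= k) for N ~ Poisson(lam), via the complementary cdf."""
    cdf = sum(exp(-lam) * lam**j / factorial(j) for j in range(k))
    return 1.0 - cdf
```

For moderate `k` this direct summation is stable; for large arguments one would sum the pmf terms in logarithmic form instead.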
as follows from (1.19) using the obvious bound .
Let us note that the known bounds (1.19)–(1.21) are the best possible within an approach based on analysis of the variance, the use of exponential functions, and an inequality of Hoeffding (see (3.3)), which allows one to reduce the problem to the estimation of tail probabilities for sums of independent random variables. Our improvement is due to a careful analysis of the fourth moment, which turns out to be quite complicated; see Section 2. Briefly, the results of this paper are the following: we prove a general bound involving , , and the fourth moment ; this general bound implies all other bounds, in particular a new precise bound involving and ; we also provide bounds for the lower tails ; we compare the bounds analytically, mostly for sufficiently large .
From the mathematical point of view the sample variance is one of the simplest nonlinear statistics. Known bounds for tail probabilities are designed with linear statistics in mind, possibly also for dependent observations. See the seminal paper of Hoeffding , published in JASA. For further developments see Talagrand , Pinelis [4, 5], Bentkus [6, 7], Bentkus et al. [8, 9], and so forth. Our intention is to develop tools useful in the setting of nonlinear statistics, using the sample variance as a test statistic.
Theorem 1.1 extends and improves the known bounds (1.19)–(1.21). We can derive (1.19)–(1.21) from this theorem since we can estimate the fourth moment via various combinations of and using the boundedness assumption .
Let and .
Both bounds and are increasing functions of , and .
In order to derive upper confidence bounds we need only estimates of the upper tail (see ). To estimate the upper tail, the condition is sufficient. The lower tail exhibits a different type of behavior, since estimating it genuinely requires the assumption that is a bounded random variable.
For Theorem 1.1 implies the known bounds (1.19)–(1.21) for the upper tail of . It implies as well the bounds (1.26)–(1.29) for the lower tail. The lower tail has a somewhat more complicated structure (cf. (1.26)–(1.29) with their counterparts (1.19)–(1.21) for the upper tail).
The bounds above do not cover the situation where both and are known. To formulate a related result we need additional notation. In the case of we use the notation
Write . Assume that .
with , where , and is defined by (1.34).
of survival functions (cf. definitions (1.13) and (1.14) of the related Hoeffding functions). The bounds expressed in terms of Hoeffding functions have a simple analytical structure and are easily numerically computable.
We provide the values of these constants for all our bounds and give their numerical values in the following two cases.
For defined by (1.41), we give the constants and as .
For defined by (1.42), we give the constants and as .
while calculating the constants in (1.44) and (1.46) we choose . The quantity in (1.43) and (1.45) is defined by (1.34).
Our new bounds substantially improve the known bounds. However, from the asymptotic point of view they still seem rather crude. To improve the bounds further, new methods and approaches are needed. Some preliminary computer simulations show that in applications where is finite and the random variables have small means and variances (as in auditing, where a typical value of is ), the asymptotic behavior has little bearing on the behavior for small . Therefore, bounds specially designed to cover the case of finite have to be developed.
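The kind of preliminary simulation mentioned above can be sketched in a few lines. This is a hedged, minimal illustration only: the paper's distributional assumptions and parameter values are not recoverable from this text, so the sketch draws i.i.d. observations from a hypothetical bounded [0, 1] law with a small mean (a Beta distribution stands in for audit-like data) and estimates an upper-tail probability of the sample variance by Monte Carlo.

```python
import random

def sample_variance(xs):
    # unbiased sample variance of the observations xs
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

def tail_estimate(n, t, reps=2000, seed=0):
    """Monte Carlo estimate of P(sample variance >= t) for n i.i.d.
    observations from a bounded [0, 1] law with a small mean.
    The Beta(0.5, 10) law here is a hypothetical stand-in."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.betavariate(0.5, 10.0) for _ in range(n)]
        if sample_variance(xs) >= t:
            hits += 1
    return hits / reps
```

Comparing such empirical tail estimates against analytic bounds for small, fixed sample sizes is exactly the regime the remark above flags as needing specially designed bounds.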
Recall that we consider bounded random variables such that , and that we write and . In Lemma 2.1 we provide an optimal upper bound for the fourth moment of given a shift , a mean , and a variance . The maximizers of the fourth moment are either Bernoulli or trinomial random variables. It turns out that their distributions, say , are of the following three types (i)–(iii):
notice that (2.4) supplies a three-point probability distribution only in cases where the inequalities and hold;
Note that the point in (2.2)–(2.7) satisfies and that the probability distribution has mean and variance .
and , where and are given in (2.5). Let us mention the following properties of the regions.
(a) If , then since for such obviously for all . The set is a one-point set. The set is empty.
(b) If , then since for such clearly for all . The set is a one-point set. The set is empty.
For all three regions , , are nonempty sets. The sets and have only one common point , that is, .
with a random variable satisfying (2.11) and defined as follows:
(i) if , then is a Bernoulli random variable with distribution (2.2);
(ii) if , then is a trinomial random variable with distribution (2.4);
(iii) if , then is a Bernoulli random variable with distribution (2.7).
with . Henceforth we write , so that can assume only the values , , with probabilities , , defined in (2.2)–(2.7), respectively. The distribution is related to the distribution as for all .
We omit the elementary calculations leading to (2.17). The calculations are related to solving systems of linear equations.
which proves the lemma.
To complete the proof we note that the random variable with defined by (2.2) assumes its values in the set . To find the distribution of we use (2.17). Setting in (2.17) we obtain and , as in (2.2).
By our construction . To find a distribution of supported by the set we use (2.17). It follows that has the distribution defined in (2.4).
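The probabilities of a three-point law such as (2.4) are pinned down by matching total mass, mean, and second moment, which is exactly the kind of linear system mentioned around (2.17). As a hedged sketch (the paper's specific support points and parameters are elided here), the following solves for the probabilities of a law on three given points with prescribed mean and variance:

```python
def three_point_probs(x, m, s2):
    """Probabilities (p1, p2, p3) of a law supported on x = (x1, x2, x3)
    with mean m and variance s2: solve
        p1 + p2 + p3 = 1,
        p1*x1 + p2*x2 + p3*x3 = m,
        p1*x1^2 + p2*x2^2 + p3*x3^2 = m^2 + s2
    by Gaussian elimination with partial pivoting."""
    A = [[1.0, 1.0, 1.0],
         [x[0], x[1], x[2]],
         [x[0] ** 2, x[1] ** 2, x[2] ** 2]]
    b = [1.0, m, m * m + s2]
    for i in range(3):
        piv = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, 3):
            f = A[r][i] / A[i][i]
            for c in range(i, 3):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    p = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        p[i] = (b[i] - sum(A[i][c] * p[c] for c in range(i + 1, 3))) / A[i][i]
    return tuple(p)
```

The solution is a genuine probability distribution only when all three components come out nonnegative, which corresponds to the restrictions on the regions discussed above.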
To conclude the proof we notice that the random variable with given by (2.7) assumes values from the set .
To prove Theorems 1.1 and 1.3 we apply Lemma 2.1 with . We provide the bounds of interest as Corollary 2.2. To prove the corollary it suffices to plug in Lemma 2.1 and, using (2.2)–(2.7), to calculate explicitly. We omit the related calculations, which are elementary but cumbersome. The regions , , and are defined in (1.32).
Let . Then, with probability , the sample variance satisfies with given by (1.6).
satisfies . The function is convex. To see this, it suffices to check that restricted to straight lines is convex. Any straight line can be represented as with some . The convexity of on is equivalent to the convexity of the function of the real variable . It is clear that the second derivative is nonnegative since . Thus both and are convex.
Since both and are convex, the function attains its maximal value on the boundary of . Moreover, the maximal value of is attained on the set of extreme points of . In our case the set of extreme points is just the set of vertices of the cube . In other words, the maximal value of is attained when each of is either or . Since is a symmetric function, we can assume that the maximal value of is attained when and with some . Using (2.28), the corresponding value of is . Maximizing with respect to we get if is even, and if is odd, which we can rewrite as the desired inequality .
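The key principle in the argument above, that a convex function on a cube attains its maximum at a vertex, and that for a symmetric function the vertex value depends only on the number of coordinates equal to 1, is easy to check numerically. The test function below is a hypothetical convex, permutation-symmetric choice (the paper's actual function is elided in this extraction):

```python
import itertools
import math
import random

def f(x):
    # a hypothetical convex, permutation-symmetric test function
    s = sum(x)
    return math.exp(s) + math.exp(-s) + sum(xi * xi for xi in x)

n = 4
# maximum over the 2^n vertices of the cube [0, 1]^n
vertex_max = max(f(v) for v in itertools.product((0.0, 1.0), repeat=n))
# maximum over a sample of random points inside the cube
rng = random.Random(0)
interior_max = max(f([rng.random() for _ in range(n)]) for _ in range(2000))
```

In every run, `interior_max` cannot exceed `vertex_max`, and by symmetry any two vertices with the same number of ones give the same value of `f`.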
which means that allows a representation of type (3.1) with and all identically distributed, due to our symmetry and i.i.d. assumptions. Thus, (3.3) implies (3.6).
with being a sum of i.i.d. random variables specified in (3.10). Depending on the choice of the family of functions given by (3.11), the in (3.14) is taken over or , respectively.
If , then .
which yields the desired bound for .
where is a Bernoulli random variable such that and .
See [2, Lemmas and ].
Proof of Theorem 1.1.
The proof combines Hoeffding's observation (3.6), which uses the representation (3.8) of as a -statistic, with Chebyshev's inequality involving exponential functions, and with Proposition 3.2. Let us provide more details. We have to prove (1.22) and (1.24).
and ; see Proposition 3.1.
where is a sum of independent copies of a Bernoulli random variable, say , such that and with as in (1.23), that is, . Note that in (3.26) we have the equality since .
To see that the third equality in (3.27) holds, it suffices to change the variable by . The fourth equality holds by definition (1.13) of the Hoeffding function since is a Bernoulli random variable with mean zero and such that . The relation (3.27) proves (3.25) and (1.22).
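Numerically, the Hoeffding function arising here is just a binomial survival probability: the tail of a sum of i.i.d. Bernoulli variables, as in (3.26). The sketch below is a generic illustration with placeholder parameters `n`, `p`, `k`, since the paper's exact quantities are elided in this extraction:

```python
from math import comb

def binom_survival(n: int, p: float, k: int) -> float:
    """P(T >= k) where T is a sum of n i.i.d. Bernoulli(p) variables."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))
```

Such sums are exactly computable for the moderate values of n relevant in audit applications, which is what makes bounds expressed through Hoeffding functions easy to evaluate in practice.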
The proof of (1.24) repeats that of (1.22), replacing and everywhere by and , respectively. The inequality in (3.23) has to be replaced by , which holds due to our assumption . Accordingly, the probability is now given by (1.25).
The bound is an obvious corollary of Theorem 1.1 since, by Proposition 3.1, we have , and therefore we can choose . Substituting this value of into (1.22), we obtain (1.19).
To prove (1.26), we set in (1.24). This choice of is justified in the proof of (1.19).
Using the definition of the Hoeffding function we see that the right-hand sides of (3.28) and (3.31) are equal.
Proof of Theorem 1.3.
We use Theorem 1.1. Into the bounds of this theorem we substitute for the right-hand side of (2.27), where a bound of the type is given. We omit the related elementary analytic manipulations.
which completes the proof of (1.7) and (1.8).
Figure 1 was produced by N. Kalosha; the authors thank him for his help. The research was supported by the Lithuanian State Science and Studies Foundation, Grant no. T-15/07.
- Hoeffding W: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 1963, 58: 13–30. doi:10.2307/2282952
- Bentkus V, van Zuijlen M: On conservative confidence intervals. Lithuanian Mathematical Journal 2003, 43(2): 141–160. doi:10.1023/A:1024210921597
- Talagrand M: The missing factor in Hoeffding's inequalities. Annales de l'Institut Henri Poincaré B 1995, 31(4): 689–702.
- Pinelis I: Optimal tail comparison based on comparison of moments. In High Dimensional Probability (Oberwolfach, 1996), Progress in Probability. Volume 43. Birkhäuser, Basel, Switzerland; 1998: 297–314.
- Pinelis I: Fractional sums and integrals of -concave tails and applications to comparison probability inequalities. In Advances in Stochastic Inequalities (Atlanta, Ga, 1997), Contemporary Mathematics. Volume 234. American Mathematical Society, Providence, RI, USA; 1999: 149–168.
- Bentkus V: A remark on the inequalities of Bernstein, Prokhorov, Bennett, Hoeffding, and Talagrand. Lithuanian Mathematical Journal 2002, 42(3): 262–269. doi:10.1023/A:1020221925664
- Bentkus V: On Hoeffding's inequalities. The Annals of Probability 2004, 32(2): 1650–1673. doi:10.1214/009117904000000360
- Bentkus V, Geuze GDC, van Zuijlen M: Trinomial laws dominating conditionally symmetric martingales. Department of Mathematics, Radboud University Nijmegen; 2005.
- Bentkus V, Kalosha N, van Zuijlen M: On domination of tail probabilities of (super)martingales: explicit bounds. Lithuanian Mathematical Journal 2006, 46(1): 3–54.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.