Zipf–Mandelbrot law, f-divergences and the Jensen-type interpolating inequalities

Motivated by the method of interpolating inequalities that makes use of the improved Jensen-type inequalities, in this paper we integrate this approach with the well-known Zipf–Mandelbrot law applied to various types of f-divergences and distances, such as the Kullback–Leibler divergence, Hellinger distance, Bhattacharyya distance (via coefficient), $\chi^{2}$-divergence, total variation distance and triangular discrimination. Addressing these applications, we first deduce general results of this type for the Csiszár divergence functional, from which the listed divergences originate. When presenting the analyzed inequalities for the Zipf–Mandelbrot law, we accentuate its special form, the Zipf law, with its specific role in linguistics. We introduce this aspect through the Zipfian word distribution associated with the English and Russian languages, using the obtained bounds for the Kullback–Leibler divergence.


Introduction
Let us start with the notion of f-divergences, which measure the distance between two probability distributions by taking an average, weighted by a specific function, of the odds ratio given by the two distributions. Among the existing f-divergences introduced in the search for an adequate distance between two probability distributions, let us point out the Csiszár f-divergence [1,2], some special cases of which are the Kullback-Leibler divergence (see [3,4]), the Hellinger distance, the Bhattacharyya distance, the total variation distance and the triangular discrimination (see [5,6]). The notion of 'distance' is somewhat stronger than that of 'divergence', since it suggests the properties of symmetry and the triangle inequality. Considering the great number of fields in which probability theory is involved, it is no wonder that divergences between probability distributions have many specific applications in a variety of those fields.
Jensen's inequality, on the other hand, with its numerous refinements, variants and improvements, is often called 'a king inequality', obviously not without reason. Here we try to integrate one such result concerning Jensen's inequality in order to obtain new estimates for the mentioned divergences (which deal with convex functions for the most part). It is well known that the Jensen inequality

$$ f\left(\frac{1}{P_n}\sum_{i=1}^{n} p_i x_i\right) \le \frac{1}{P_n}\sum_{i=1}^{n} p_i f(x_i) \tag{1} $$

holds for a convex function $f\colon I \to \mathbb{R}$, $I \subseteq \mathbb{R}$, an n-tuple $x = (x_1, \dots, x_n) \in I^n$, $n \ge 2$, and a nonnegative n-tuple $p = (p_1, \dots, p_n)$ such that $P_n = \sum_{i=1}^{n} p_i > 0$. Here we cite a result of Pečarić [7, p. 717], who investigated the method of interpolating inequalities which have reverse inequalities of Aczél type. Using Jensen's inequality and its reverse he proved the main and the following deduced result, which holds for a convex function f defined on an interval $I \subset \mathbb{R}$:

$$ \min_{1\le i\le n}\{p_i\}\left[\sum_{i=1}^{n} f(x_i) - n f\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\right] \le \sum_{i=1}^{n} p_i f(x_i) - P_n f\left(\frac{1}{P_n}\sum_{i=1}^{n} p_i x_i\right) \le \max_{1\le i\le n}\{p_i\}\left[\sum_{i=1}^{n} f(x_i) - n f\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right)\right], \tag{2} $$

where $x = (x_1, \dots, x_n) \in I^n$, $n \ge 2$, and a nonnegative n-tuple $p = (p_1, \dots, p_n)$ is such that $P_n = \sum_{i=1}^{n} p_i > 0$. Recent investigations of relation (2) and its numerous consequences have shown it to be a fruitful source of significant results. We accentuate those which deal with this relation in view of superadditivity and monotonicity of the Jensen-type functionals, in [8,9] or [10], obtained via [11] and suitably summarized in the monograph [12]. In the following part we are going to make use of relation (2) while presenting certain bounds for a selected spectrum of f-divergences that originate from the Csiszár divergence functional.

All of the results thus obtained concerning f-divergences are going to be observed here in the context of the Zipf-Mandelbrot law and then specified for the Zipf law.
George Kingsley Zipf was a linguist after whom one of the most common laws in probability and statistics was named. Today this experimental law for the discrete probability distribution is frequently used in information science, bibliometrics, linguistics, social sciences, economy (where it is known as the Pareto law), as well as in physics, biology, computer science etc. Thus the term 'Zipfian distribution' is used to describe various types of distributions of the probability occurrences which approximately follow the mathematical form of the Zipf law. It was first established with the frequency of words in a text in view, and as such it revealed a hyperbolic relation. As explained e.g. in [13], if the words of a language are sorted in order of decreasing frequency of usage, a word's frequency f is inversely proportional to its rank r (its sequence number in the list), and the product of the two is a constant: r · f = C ('A few occur very often while many others occur rarely.'). Benoit Mandelbrot, a mathematician very well known for his contribution to fractal theory, generalized the Zipf law in 1966 [14,15] according to his field of investigation and gave its improvement for the count of the low-rank words [16]. It is also used in information sciences for the purpose of indexing [17,18], in ecological field studies [19], and it plays a role in art when determining aesthetic criteria in music [20].

The Zipf-Mandelbrot law is a discrete probability distribution defined by the following probability mass function:

$$ f(i; N, s, t) = \frac{1}{(i+t)^{s} H_{N,s,t}}, \quad i = 1, \dots, N, \tag{3} $$

where

$$ H_{N,s,t} = \sum_{k=1}^{N} \frac{1}{(k+t)^{s}} \tag{4} $$

is a generalization of a harmonic number and $N \in \{1, 2, \dots\}$, $s > 0$ and $t \in [0, \infty)$ are parameters. For finite N and for t = 0 the Zipf-Mandelbrot law is simply called the Zipf law. (In particular, if we observe the infinite N and t = 0 we actually have the Zeta distribution.) According to the expressions above, the probability mass function referring to the Zipf law is

$$ f(i; N, s) = \frac{1}{i^{s} H_{N,s}}, \quad i = 1, \dots, N, \tag{5} $$

where $H_{N,s} = H_{N,s,0}$.

The rest of the paper is organized as follows. In Section 2 we define the Csiszár functional and various f-divergences for which we give in Section 3 the results based on relation (2). These are further examined in Section 4 in the light of the Zipf-Mandelbrot law and the Zipf law. For the latter we give in Section 5 a specific application in linguistics, concerning the Kullback-Leibler divergence.
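The Zipf-Mandelbrot probability mass function can be evaluated directly from its definition, $1 / ((i+t)^{s} H_{N,s,t})$ with $H_{N,s,t} = \sum_{k=1}^{N} (k+t)^{-s}$. The following is a minimal sketch; the function names `generalized_harmonic` and `zipf_mandelbrot_pmf` are illustrative choices and do not come from the paper.

```python
import math


def generalized_harmonic(N, s, t=0.0):
    """Generalized harmonic number H_{N,s,t} = sum_{k=1}^{N} 1 / (k + t)^s."""
    return math.fsum(1.0 / (k + t) ** s for k in range(1, N + 1))


def zipf_mandelbrot_pmf(N, s, t=0.0):
    """Probabilities 1 / ((i + t)^s * H_{N,s,t}) for i = 1, ..., N.

    With t = 0 this reduces to the Zipf law.
    """
    H = generalized_harmonic(N, s, t)
    return [1.0 / ((i + t) ** s * H) for i in range(1, N + 1)]


# Illustrative parameters (assumed, not taken from the paper).
zm = zipf_mandelbrot_pmf(N=10, s=1.2, t=2.5)
zipf = zipf_mandelbrot_pmf(N=3, s=1.0)  # Zipf law: H = 1 + 1/2 + 1/3 = 11/6
print(sum(zm), zipf)
```

The probabilities are positive, decrease with the rank i, and sum to one, so the smallest and the largest probability always sit at i = N and i = 1, respectively.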

Preliminaries
The previously mentioned f-divergences were studied independently by several mathematicians. Here we focus on the Csiszár f-divergences. Csiszár [1,2] introduced the f-divergence functional as

$$ D_f(p, q) = \sum_{i=1}^{n} q_i f\left(\frac{p_i}{q_i}\right), \tag{6} $$

where $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ are probability distributions, that is, $p_i, q_i \in [0, 1]$ for $i = 1, \dots, n$ with $\sum_{i=1}^{n} p_i = \sum_{i=1}^{n} q_i = 1$, and f is a convex function. The functional thus acts as a so-called 'distance function' on the set of all probability distributions.

As in [1], we interpret the undefined expressions by

$$ f(0) := \lim_{t \to 0^{+}} f(t), \qquad 0 f\left(\frac{0}{0}\right) := 0, \qquad 0 f\left(\frac{a}{0}\right) := \lim_{t \to 0^{+}} t f\left(\frac{a}{t}\right), \quad a > 0. $$

The definition of the f-divergence functional (6) can be generalized for a function $f\colon I \to \mathbb{R}$, $I \subseteq \mathbb{R}$, where $p_i / q_i \in I$ for every $i = 1, \dots, n$. Since we are going to observe this wider class of functions as well, the corresponding functional (6) will be denoted by $\tilde{D}_f(p, q)$ (see also [21]).

The general form of the Csiszár divergence functional (6) yields a series of well-known entropies, divergences and distances for special choices of the kernel f. In the sequel we present some of the most frequent among them.
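As an illustration of how a single kernel f generates a divergence, the sketch below evaluates the Csiszár functional in the form $\sum_i q_i f(p_i / q_i)$. It assumes strictly positive $q_i$, so the conventions for undefined expressions are not needed; the helper names `csiszar_divergence` and `t_log_t` are hypothetical.

```python
import math


def csiszar_divergence(f, p, q):
    """Csiszar functional sum_i q_i * f(p_i / q_i); assumes every q_i > 0."""
    return math.fsum(qi * f(pi / qi) for pi, qi in zip(p, q))


def t_log_t(t):
    """Kernel t -> t log t (natural logarithm), with the limit value 0 at t = 0."""
    return t * math.log(t) if t > 0 else 0.0


p = [0.5, 0.5]
q = [0.25, 0.75]

# With the kernel t log t the functional equals sum_i p_i log(p_i / q_i).
via_kernel = csiszar_divergence(t_log_t, p, q)
direct = math.fsum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(via_kernel, direct)
```

Passing the kernel as a function keeps the same routine usable for every divergence discussed in this section.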
Entropies quantify the diversity, uncertainty and randomness of a system. The concept of the Rényi entropy was introduced by Rényi [22] and has been of great importance in statistics, ecology, theoretical computer science etc.
The Rényi entropy of order α of p is defined as

$$ H_{\alpha}(p) = \frac{1}{1-\alpha} \log\left(\sum_{i=1}^{n} p_i^{\alpha}\right), $$

where $\alpha \ge 0$, $\alpha \ne 1$ and $p = (p_1, \dots, p_n)$ is a probability distribution. Among special cases of the Rényi entropy (e.g. the Hartley or max-entropy, min-entropy, and the collision entropy), the Rényi entropy tends to the Shannon entropy (see [23]) for the limiting value of α → 1. The Shannon entropy (which is sometimes called information divergence) is thus defined as

$$ H(p) = -\sum_{i=1}^{n} p_i \log p_i. $$

Besides the absolute entropies, one can also observe the relative entropies, as did Rényi when he introduced a special form of f-divergence. The Rényi divergence of order α, $\alpha \ge 0$, $\alpha \ne 1$, for the probability distributions $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is defined as

$$ D_{\alpha}(p, q) = \frac{1}{\alpha-1} \log\left(\sum_{i=1}^{n} p_i^{\alpha} q_i^{1-\alpha}\right). \tag{9} $$

A relation similar to the one between the Rényi entropy and the Shannon entropy holds in the case of the Rényi divergence and the Kullback-Leibler divergence (see [24]) for the probability distributions p and q. As α → 1, the Rényi divergence tends to the Kullback-Leibler divergence. The latter is sometimes called the relative entropy and is defined by

$$ KL(p, q) = \sum_{i=1}^{n} p_i \log\frac{p_i}{q_i}. \tag{10} $$

Remark 1 Although it is common to take the logarithm function with the base 2, it will not be essential in the sequel. Moreover, we are going to analyze the results including the logarithm function for different (positive) bases, namely, for those greater than 1 as well as for those that are less than 1.
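The limiting relation between the Rényi divergence and the Kullback-Leibler divergence can be checked numerically. The sketch below assumes the usual definitions $D_{\alpha}(p,q) = \frac{1}{\alpha-1}\log\sum_i p_i^{\alpha} q_i^{1-\alpha}$ and $KL(p,q) = \sum_i p_i \log(p_i/q_i)$ with natural logarithms and strictly positive distributions; the sample distributions are made up for illustration.

```python
import math


def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha (alpha >= 0, alpha != 1), natural log."""
    if alpha < 0 or alpha == 1:
        raise ValueError("alpha must be nonnegative and different from 1")
    total = math.fsum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(total) / (alpha - 1.0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_i p_i log(p_i / q_i), natural log."""
    return math.fsum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

kl = kl_divergence(p, q)
for alpha in (0.5, 0.9, 0.99, 0.999999):
    # The printed values approach kl as alpha approaches 1.
    print(alpha, renyi_divergence(p, q, alpha), kl)
```

The order 1/2 is the symmetric case: swapping p and q leaves `renyi_divergence(p, q, 0.5)` unchanged.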
Among various divergences and considering the properties of symmetry and the triangle inequality which some of them possess, we can also define certain distances between two probability distributions.

Thus the Hellinger distance between the probability distributions $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is defined by

$$ h(p, q) = \sqrt{\frac{1}{2}\sum_{i=1}^{n}\left(\sqrt{p_i} - \sqrt{q_i}\right)^{2}}. \tag{11} $$

The Hellinger distance is a metric and is often used in its squared form, i.e. as

$$ h^{2}(p, q) := \frac{1}{2}\sum_{i=1}^{n}\left(\sqrt{p_i} - \sqrt{q_i}\right)^{2}. $$

Among the values of the order α for the Rényi divergence, some have wider application than others. The value 1 is already determined by continuity in α since it cannot be calculated directly by (9), and another interesting example is the order 1/2. This order makes the Rényi divergence symmetric in its arguments. In this context, it is interesting to see how the Rényi divergence, although not itself a metric, relates to the Hellinger distance:

$$ D_{1/2}(p, q) = -2 \log\left(1 - h^{2}(p, q)\right). $$

Furthermore, the Bhattacharyya coefficient is an approximate measure of the amount of overlap between two distributions and as such can be used to determine their relative closeness. It is defined as

$$ B(p, q) = \sum_{i=1}^{n} \sqrt{p_i q_i}, \tag{13} $$

whereas the Bhattacharyya distance is defined as $D_B(p, q) := -\log B(p, q)$. The relation between the Bhattacharyya coefficient and the Hellinger distance is

$$ h(p, q) = \sqrt{1 - B(p, q)}. $$

In order to conclude this overview, let us remind the reader that the $\chi^{2}$-divergence is defined as

$$ \chi^{2}(p, q) = \sum_{i=1}^{n} \frac{(p_i - q_i)^{2}}{q_i}, \tag{14} $$

the total variation distance or statistical distance is given by

$$ V(p, q) = \sum_{i=1}^{n} |p_i - q_i|, \tag{15} $$

and the definition of the triangular discrimination reads as follows:

$$ \Delta(p, q) = \sum_{i=1}^{n} \frac{(p_i - q_i)^{2}}{p_i + q_i}. \tag{16} $$

More detailed analyses of the mentioned divergences, as well as of their wider spectrum, can be found e.g. in [5,6,24].
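Each of the distances listed above arises from the Csiszár functional $\sum_i q_i f(p_i/q_i)$ with a suitable kernel, namely the kernels used in the corollaries of the next section: $\frac{1}{2}(\sqrt{t}-1)^2$, $-\sqrt{t}$, $(t-1)^2$, $|t-1|$ and $(t-1)^2/(t+1)$. The following sketch compares the kernel form with the direct sums on a made-up pair of distributions; the dictionary keys and helper names are illustrative only.

```python
import math


def csiszar(f, p, q):
    """Csiszar functional sum_i q_i * f(p_i / q_i); assumes every q_i > 0."""
    return math.fsum(qi * f(pi / qi) for pi, qi in zip(p, q))


KERNELS = {
    "hellinger_squared": lambda t: 0.5 * (math.sqrt(t) - 1.0) ** 2,
    "minus_bhattacharyya": lambda t: -math.sqrt(t),
    "chi_square": lambda t: (t - 1.0) ** 2,
    "total_variation": lambda t: abs(t - 1.0),
    "triangular": lambda t: (t - 1.0) ** 2 / (t + 1.0),
}

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
pairs = list(zip(p, q))

# The same quantities written as direct sums over the two distributions.
direct = {
    "hellinger_squared": 0.5 * math.fsum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in pairs),
    "minus_bhattacharyya": -math.fsum(math.sqrt(a * b) for a, b in pairs),
    "chi_square": math.fsum((a - b) ** 2 / b for a, b in pairs),
    "total_variation": math.fsum(abs(a - b) for a, b in pairs),
    "triangular": math.fsum((a - b) ** 2 / (a + b) for a, b in pairs),
}

via_kernel = {name: csiszar(f, p, q) for name, f in KERNELS.items()}
for name in KERNELS:
    print(name, via_kernel[name], direct[name])
```

For probability distributions the first two entries are linked by $h^2(p,q) = 1 - B(p,q)$, which the values above reproduce.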

Basic relations for f -divergences
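In order to deduce from relation (2) the corresponding relations for the f-divergences described in the introductory part, we start with the general result for bounds obtained for the Csiszár functional (6), observed under more general conditions as $\tilde{D}_f(p, q)$.

Theorem 1 Let $I \subseteq \mathbb{R}$ be an interval. Suppose $p = (p_1, \dots, p_n)$ is an n-tuple of real numbers with $P_n = \sum_{i=1}^{n} p_i$ and $q = (q_1, \dots, q_n)$ is an n-tuple of nonnegative real numbers with $Q_n = \sum_{i=1}^{n} q_i$, such that $p_i / q_i \in I$, $i = 1, \dots, n$. If $f\colon I \to \mathbb{R}$ is a convex function, then

$$ \min_{1\le i\le n}\{q_i\}\left[\sum_{i=1}^{n} f\left(\frac{p_i}{q_i}\right) - n f\left(\frac{1}{n}\sum_{i=1}^{n}\frac{p_i}{q_i}\right)\right] + Q_n f\left(\frac{P_n}{Q_n}\right) \le \tilde{D}_f(p, q) \le \max_{1\le i\le n}\{q_i\}\left[\sum_{i=1}^{n} f\left(\frac{p_i}{q_i}\right) - n f\left(\frac{1}{n}\sum_{i=1}^{n}\frac{p_i}{q_i}\right)\right] + Q_n f\left(\frac{P_n}{Q_n}\right). \tag{17} $$

If f is a concave function, then the inequality signs are reversed. If $t \mapsto t f(t)$ is a convex function, then

$$ \min_{1\le i\le n}\{q_i\}\left[\sum_{i=1}^{n} \frac{p_i}{q_i} f\left(\frac{p_i}{q_i}\right) - \left(\sum_{i=1}^{n}\frac{p_i}{q_i}\right) f\left(\frac{1}{n}\sum_{i=1}^{n}\frac{p_i}{q_i}\right)\right] + P_n f\left(\frac{P_n}{Q_n}\right) \le \tilde{D}_{\mathrm{id}\cdot f}(p, q) \le \max_{1\le i\le n}\{q_i\}\left[\sum_{i=1}^{n} \frac{p_i}{q_i} f\left(\frac{p_i}{q_i}\right) - \left(\sum_{i=1}^{n}\frac{p_i}{q_i}\right) f\left(\frac{1}{n}\sum_{i=1}^{n}\frac{p_i}{q_i}\right)\right] + P_n f\left(\frac{P_n}{Q_n}\right). \tag{18} $$

If $t \mapsto t f(t)$ is a concave function, then the inequality signs are reversed.

Proof If we observe a convex function f and replace $p_i$ by $q_i$ as well as $x_i$ by $p_i / q_i$ in relation (2), we get (17).

If we then observe the function $t \mapsto t f(t)$ as a convex function and replace $p_i$ by $q_i$ and $x_i$ by $p_i / q_i$, we get (18).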
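The substitution described in the proof can be tried out numerically. The sketch below rests on an assumption about the shape of relation (2), whose display is not reproduced here: it takes the classical interpolating form in which the Jensen difference $\sum_i w_i g(x_i) - W g\big(\tfrac{1}{W}\sum_i w_i x_i\big)$, with $W = \sum_i w_i$, lies between $\min_i w_i$ and $\max_i w_i$ times the unweighted difference $\sum_i g(x_i) - n\, g\big(\tfrac{1}{n}\sum_i x_i\big)$. With weights $q_i$ and points $p_i/q_i$ this gives two-sided bounds for the Csiszár functional. The function names and the sample data are hypothetical.

```python
import math


def csiszar(g, p, q):
    """Csiszar functional sum_i q_i * g(p_i / q_i); assumes every q_i > 0."""
    return math.fsum(qi * g(pi / qi) for pi, qi in zip(p, q))


def interpolating_bounds(g, p, q):
    """Lower and upper bound for csiszar(g, p, q) when g is convex.

    Assumed form (see the lead-in): weights q_i, points x_i = p_i / q_i.
    """
    n = len(q)
    ratios = [pi / qi for pi, qi in zip(p, q)]
    P, Q = math.fsum(p), math.fsum(q)
    # Unweighted Jensen difference of g at the points p_i / q_i.
    jensen_gap = math.fsum(g(r) for r in ratios) - n * g(math.fsum(ratios) / n)
    base = Q * g(P / Q)
    return min(q) * jensen_gap + base, max(q) * jensen_gap + base


def t_log_t(t):
    """Convex kernel t -> t log t (natural logarithm)."""
    return t * math.log(t) if t > 0 else 0.0


p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

lower, upper = interpolating_bounds(t_log_t, p, q)
value = csiszar(t_log_t, p, q)  # equals sum_i p_i log(p_i / q_i)
print(lower, value, upper)
```

Choosing g(t) = t log t corresponds to the case of the theorem in which t → tf(t) is convex with f = log; for a concave kernel the two bounds swap roles.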
The following corollary precedes the related result for the Kullback-Leibler divergence (10). Recall that (10) can be interpreted as a special case of the functional (6).
Corollary 1 Let $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ be n-tuples of nonnegative real numbers with $P_n = \sum_{i=1}^{n} p_i$ and $Q_n = \sum_{i=1}^{n} q_i$. Then where the logarithm base is greater than 1.
If the logarithm base is less than 1, then the inequality signs are reversed.
Proof It follows from Theorem 1 as a special case of inequalities (18), for the function t → t log t, which is convex when the logarithm base is greater than 1 (and concave when the base is less than 1).
If we additionally specify the n-tuples p and q as in the sequel, we provide the bounds for the Kullback-Leibler divergence.
Remark 2 If we observe p = (p 1 , . . . , p n ) and q = (q 1 , . . . , q n ) as probability distributions, we may write where the logarithm base is greater than 1.
If the logarithm base is less than 1, then the inequality signs are reversed.
In other words, we obtained the corresponding bounds for the Kullback-Leibler divergence (10).
Remark 3 The Kullback-Leibler divergence is sometimes used in its reversed form KL(q, p). A similar type of bounds can be obtained when observing the reversed Kullback-Leibler divergence making use of the kernel function f (t) = -log t, its convexity and concavity related to the observed logarithm base (greater than 1 or less than 1, respectively), and following the analogous procedure described in Corollary 1 and Remark 2.
It is natural to observe in a similar fashion the other divergences (distances) described in Section 2: the Hellinger distance, the Bhattacharyya coefficient, the chi-square divergence, the total variation distance and the triangular discrimination.
Corollary 2 Let $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ be n-tuples of nonnegative real numbers with $P_n = \sum_{i=1}^{n} p_i$ and $Q_n = \sum_{i=1}^{n} q_i$. Then

Proof It follows from Theorem 1 as a special case of inequalities (17), for the convex function $f(t) = \frac{1}{2}(\sqrt{t} - 1)^{2}$.

Remark 4 If we observe $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ as probability distributions, we may write In other words, we obtained the corresponding bounds for the (squared) Hellinger distance $h^{2}(p, q)$.

Corollary 3 Let $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ be n-tuples of nonnegative real numbers with $P_n = \sum_{i=1}^{n} p_i$ and $Q_n = \sum_{i=1}^{n} q_i$. Then

Proof It follows from Theorem 1 as a special case of inequalities (17), for the convex function $f(t) = -\sqrt{t}$.

Remark 5 If we observe $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ as probability distributions and adopt by the definition (13) that $B(p, q) = \sum_{i=1}^{n} \sqrt{p_i q_i}$, we may write In other words, we obtained the corresponding bounds for the Bhattacharyya coefficient $B(p, q)$.
Corollary 4 Let $p = (p_1, \dots, p_n)$ be an n-tuple of real numbers and $q = (q_1, \dots, q_n)$ an n-tuple of nonnegative real numbers with $P_n = \sum_{i=1}^{n} p_i$ and $Q_n = \sum_{i=1}^{n} q_i$. Then

Proof It follows from Theorem 1 as a special case of inequalities (17), for the convex function $f(t) = (t - 1)^{2}$.

Remark 6 If we observe $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ as probability distributions, we may write In other words, we obtained the corresponding bounds for the chi-square divergence $\chi^{2}(p, q)$.
Corollary 5 Let $p = (p_1, \dots, p_n)$ be an n-tuple of real numbers and $q = (q_1, \dots, q_n)$ an n-tuple of nonnegative real numbers with $P_n = \sum_{i=1}^{n} p_i$ and $Q_n = \sum_{i=1}^{n} q_i$. Then

Proof It follows from Theorem 1 as a special case of inequalities (17), for the convex function $f(t) = |t - 1|$.

Remark 7 If we observe $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ as probability distributions, we may write In other words, we obtained the corresponding bounds for the total variation distance $V(p, q)$.

Corollary 6 Let $p = (p_1, \dots, p_n)$ be an n-tuple of real numbers and $q = (q_1, \dots, q_n)$ an n-tuple of nonnegative real numbers with $P_n = \sum_{i=1}^{n} p_i$ and $Q_n = \sum_{i=1}^{n} q_i$. Then

Proof It follows from Theorem 1 as a special case of inequalities (17), for the convex function $f(t) = \frac{(t-1)^{2}}{t+1}$.

Remark 8 If we observe $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ as probability distributions, we may write In other words, we obtained the corresponding bounds for the triangular discrimination $\Delta(p, q)$.

On f -divergences for the Zipf-Mandelbrot law
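In this section we are going to derive the results from the previous section for the Zipf-Mandelbrot law (3). When q is defined via the Zipf-Mandelbrot law, the Csiszár functional (6) assumes the form

$$ \tilde{D}_f(i, N, s_2, t_2, p) = \sum_{i=1}^{N} \frac{1}{(i+t_2)^{s_2} H_{N,s_2,t_2}}\, f\left(p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\right), $$

where $f\colon I \to \mathbb{R}$, $I \subseteq \mathbb{R}$, and the parameters $N \in \mathbb{N}$, $s_2 > 0$, $t_2 \ge 0$ are such that $p_i (i+t_2)^{s_2} H_{N,s_2,t_2} \in I$, $i = 1, \dots, N$. The Csiszár functional (6) assumes the following form when p and q are both defined as Zipf-Mandelbrot law N-tuples:

$$ \tilde{D}_f(p, q) = \sum_{i=1}^{N} \frac{1}{(i+t_2)^{s_2} H_{N,s_2,t_2}}\, f\left(\frac{(i+t_2)^{s_2} H_{N,s_2,t_2}}{(i+t_1)^{s_1} H_{N,s_1,t_1}}\right), $$

where $f\colon I \to \mathbb{R}$, $I \subseteq \mathbb{R}$, and $N \in \mathbb{N}$, $s_1, s_2 > 0$, $t_1, t_2 \ge 0$ are such that $\frac{(i+t_2)^{s_2} H_{N,s_2,t_2}}{(i+t_1)^{s_1} H_{N,s_1,t_1}} \in I$, $i = 1, \dots, N$.

Our next step is providing the corresponding forms of Theorem 1 which are suitable for further applications. Thus we start with the Csiszár functional $\tilde{D}_f(i, N, s_2, t_2, p)$, which involves a single Zipf-Mandelbrot law $q_i$, for $i = 1, \dots, N$.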

Corollary 7
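Let $p = (p_1, \dots, p_N)$ be an N-tuple of real numbers with $P_N = \sum_{i=1}^{N} p_i$. Suppose $I \subseteq \mathbb{R}$ is an interval, $N \in \mathbb{N}$ and $s_2 > 0$, $t_2 \ge 0$ are such that $p_i (i+t_2)^{s_2} H_{N,s_2,t_2} \in I$, $i = 1, \dots, N$.

If $f\colon I \to \mathbb{R}$ is a convex function, then

$$ \frac{1}{(N+t_2)^{s_2} H_{N,s_2,t_2}}\left[\sum_{i=1}^{N} f\left(p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\right) - N f\left(\frac{H_{N,s_2,t_2}}{N}\sum_{i=1}^{N} p_i (i+t_2)^{s_2}\right)\right] + f(P_N) \le \tilde{D}_f(i, N, s_2, t_2, p) \le \frac{1}{(1+t_2)^{s_2} H_{N,s_2,t_2}}\left[\sum_{i=1}^{N} f\left(p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\right) - N f\left(\frac{H_{N,s_2,t_2}}{N}\sum_{i=1}^{N} p_i (i+t_2)^{s_2}\right)\right] + f(P_N). \tag{35} $$

If f is a concave function, then the inequality signs are reversed. If $t \mapsto t f(t)$ is a convex function, then

$$ \frac{1}{(N+t_2)^{s_2} H_{N,s_2,t_2}}\left[\sum_{i=1}^{N} p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\, f\left(p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\right) - \left(\sum_{i=1}^{N} p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\right) f\left(\frac{H_{N,s_2,t_2}}{N}\sum_{i=1}^{N} p_i (i+t_2)^{s_2}\right)\right] + P_N f(P_N) \le \tilde{D}_{\mathrm{id}\cdot f}(i, N, s_2, t_2, p) \le \frac{1}{(1+t_2)^{s_2} H_{N,s_2,t_2}}\left[\sum_{i=1}^{N} p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\, f\left(p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\right) - \left(\sum_{i=1}^{N} p_i (i+t_2)^{s_2} H_{N,s_2,t_2}\right) f\left(\frac{H_{N,s_2,t_2}}{N}\sum_{i=1}^{N} p_i (i+t_2)^{s_2}\right)\right] + P_N f(P_N). \tag{36} $$

If $t \mapsto t f(t)$ is a concave function, then the inequality signs are reversed.

Proof It leans on the proof of Theorem 1 with its described substitutions, where we insert for $q_i$ the expression $q_i = \frac{1}{(i+t_2)^{s_2} H_{N,s_2,t_2}}$, $i = 1, \dots, N$. Since $q_i$ decreases in i, its smallest and largest values are attained at i = N and i = 1, respectively.

If we have both p and q defined via the Zipf-Mandelbrot law, then the following corollary plays a role.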

Corollary 8 Let I ⊆ R be an interval and suppose N
If f : I → R is a convex function, then s 1 H N,s 1 ,t 1 ) and

If t → tf (t) is a concave function, then the inequality signs are reversed.
Proof Since the corollary is a special case of the previous one, its proof is provided by inserting equation (3), which defines the Zipf-Mandelbrot law, instead of $p_i$, $i = 1, \dots, N$, as was already done for $q_i$. That is, $p_i = \frac{1}{(i+t_1)^{s_1} H_{N,s_1,t_1}}$, $i = 1, \dots, N$, where $P_N = 1$. The rest of the proof follows along the same lines as in Corollary 7, so inequalities (37) and (38) follow for convex functions f and $t \mapsto t f(t)$, respectively. They change their signs in the case of concavity as a consequence of the Jensen inequality implicitly included.
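When both N-tuples are Zipf-Mandelbrot laws, the functional and its two-sided bounds depend only on N and the four parameters. The sketch below evaluates them for the squared Hellinger kernel $\frac{1}{2}(\sqrt{t}-1)^2$. It assumes that the bounds take the form $\min_i q_i \cdot G + f(1)$ and $\max_i q_i \cdot G + f(1)$, where G is the unweighted Jensen difference of f at the points $p_i/q_i$ (the shape obtained from the substitutions in the proof, with $P_N = 1$). The parameter values and function names are made up for illustration.

```python
import math


def zm_pmf(N, s, t=0.0):
    """Zipf-Mandelbrot probabilities 1 / ((i + t)^s * H_{N,s,t}), i = 1..N."""
    H = math.fsum((k + t) ** -s for k in range(1, N + 1))
    return [1.0 / ((i + t) ** s * H) for i in range(1, N + 1)]


def csiszar(f, p, q):
    """Csiszar functional sum_i q_i * f(p_i / q_i)."""
    return math.fsum(qi * f(pi / qi) for pi, qi in zip(p, q))


def zm_bounds(f, p, q):
    """Assumed bounds for convex f when p and q are probability N-tuples."""
    N = len(q)
    ratios = [pi / qi for pi, qi in zip(p, q)]
    gap = math.fsum(f(r) for r in ratios) - N * f(math.fsum(ratios) / N)
    # q is decreasing in i, so q[-1] is the minimum and q[0] the maximum.
    return q[-1] * gap + f(1.0), q[0] * gap + f(1.0)


def hellinger_kernel(t):
    return 0.5 * (math.sqrt(t) - 1.0) ** 2


# Hypothetical parameters (N, s1, t1) for p and (N, s2, t2) for q.
p = zm_pmf(50, 1.2, 0.5)
q = zm_pmf(50, 0.9, 2.0)

lower, upper = zm_bounds(hellinger_kernel, p, q)
h_squared = csiszar(hellinger_kernel, p, q)
print(lower, h_squared, upper)
```

Setting both shift parameters to zero in `zm_pmf` gives the corresponding quantities for two Zipf laws.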
Finally, if both p and q are defined via the Zipf law (5), then the following statements hold. If f : I → R is a convex function, then

If t → tf (t) is a concave function, then the inequality signs are reversed.
Proof Inequalities (39) and (40) are proved analogously to Corollary 8 if we observe the probability mass functions p i and q i as Zipf laws defined by (5).
Let us provide the accompanying results of this type for some special cases of f-divergences, starting with the Kullback-Leibler divergence (10). Again, we first observe the more general case in which only one of the two N-tuples p and q is defined via the Zipf-Mandelbrot law (3).

Corollary 10 Let $p = (p_1, \dots, p_N)$ be an N-tuple of nonnegative real numbers with $P_N = \sum_{i=1}^{N} p_i$, $N \in \mathbb{N}$ and $s_2 > 0$, $t_2 \ge 0$. If the logarithm base is greater than 1, then If the logarithm base is less than 1, then the inequality signs are reversed.
Proof It follows from Corollary 7 as a special case of inequalities (36), for the function t → t log t, which is convex when the logarithm base is greater than 1. It can also be derived from Corollary 1 and Remark 2 in the context of the Zipf-Mandelbrot law.
When both p and q are defined via the Zipf-Mandelbrot law (3) or via the Zipf law (5), the following statements hold.
If the logarithm base is greater than 1, then If the parameters satisfy $t_1 = t_2 = 0$, the corresponding inequalities for the Zipf law follow: If the logarithm base is less than 1, then the signs in inequalities (42) and (43) are reversed.
Proof Inequalities (42) follow from Corollary 8 as a special case of inequalities (38), for the function t → t log t, which is convex when the logarithm base is greater than 1. Similarly, inequalities (43) follow from Corollary 9 as a special case of inequalities (40).
The following corollaries deal with the Hellinger distance (11) considering one or two N -tuples defined via the Zipf-Mandelbrot law or the Zipf law, as its special case.

Corollary 12
Let $p = (p_1, \dots, p_N)$ be an N-tuple of nonnegative real numbers with $P_N = \sum_{i=1}^{N} p_i$, $N \in \mathbb{N}$ and $s_2 > 0$, $t_2 \ge 0$. Then

Proof It follows from Corollary 7 as a special case of inequalities (35), for the convex function $t \mapsto \frac{1}{2}(\sqrt{t} - 1)^{2}$. It can also be derived from Corollary 2 and Remark 4 in the context of the Zipf-Mandelbrot law.
When both p and q are defined via the Zipf-Mandelbrot law (3) or via the Zipf law (5), the following statements hold.
In the sequel we provide the results of this type for the Bhattacharyya coefficient (13), starting with one N -tuple defined via the Zipf-Mandelbrot law and proceeding with both such N -tuples, as well as with the Zipf law, as its special case.

Corollary 14
Let $p = (p_1, \dots, p_N)$ be an N-tuple of nonnegative real numbers with $P_N = \sum_{i=1}^{N} p_i$, $N \in \mathbb{N}$ and $s_2 > 0$, $t_2 \ge 0$. Then

Proof It follows from Corollary 7 as a special case of inequalities (35), for the convex function $t \mapsto -\sqrt{t}$. It can also be derived from Corollary 3 and Remark 5 in the context of the Zipf-Mandelbrot law.

Corollary 15
Let $N \in \mathbb{N}$ and $s_1, s_2 > 0$, $t_1, t_2 \ge 0$. Then If the parameters satisfy $t_1 = t_2 = 0$, the corresponding inequalities for the Zipf law follow:

Proof Inequalities (48) follow from Corollary 8 as a special case of inequalities (37), for the convex function $t \mapsto -\sqrt{t}$. Similarly, inequalities (49) follow from Corollary 9 as a special case of inequalities (39).
In the same manner we proceed with analogous results for the chi-square divergence (14) and the total variation distance (15).

Corollary 16
Let $p = (p_1, \dots, p_N)$ be an N-tuple of real numbers with $P_N = \sum_{i=1}^{N} p_i$, $N \in \mathbb{N}$ and $s_2 > 0$, $t_2 \ge 0$. Then

Proof It follows from Corollary 7 as a special case of inequalities (35), for the convex function $t \mapsto (t - 1)^{2}$. It can also be derived from Corollary 4 and Remark 6 in the context of the Zipf-Mandelbrot law.

Corollary 18
Let $p = (p_1, \dots, p_N)$ be an N-tuple of real numbers with $P_N = \sum_{i=1}^{N} p_i$, $N \in \mathbb{N}$ and $s_2 > 0$, $t_2 \ge 0$. Then

Proof It follows from Corollary 7 as a special case of inequalities (35), for the convex function $t \mapsto |t - 1|$. It can also be derived from Corollary 5 and Remark 7 in the context of the Zipf-Mandelbrot law.

In order to conclude this section, which provides the Jensen-inequality related results for the f-divergences based on the Zipf-Mandelbrot law (or the Zipf law), for the triangular discrimination (16) we give only the latter: the bounds obtained in the case of both N-tuples observed via the Zipf law.

Corollary 20
Let $N \in \mathbb{N}$ and $s_1, s_2 > 0$. Then

Proof The inequalities can easily be deduced from Corollary 9 as a special case of inequalities (39), for the convex function $t \mapsto \frac{(t-1)^{2}}{t+1}$. They can also be derived from Corollary 6 and Remark 8 in the context of the Zipf law.

An application of the Zipf law
In the final section we are going to show how the experimental character of the Zipf law can be interpreted through the bounds (43) obtained for the Kullback-Leibler divergence. Namely, the coefficients $s_1$ and $s_2$ from the Zipf law were analyzed by Gelbukh and Sidorov in [25] as assigned to the Russian and English languages. They calculated the mentioned coefficients and their difference for each of the 39 literary texts in both languages, each containing more than 10,000 running words. In the process they obtained the average of $s_1 = 0.892869$ for the Russian and $s_2 = 0.973863$ for the English language.
In this context, with the described experimental values of s 1 and s 2 involved, the bounds for the Kullback-Leibler divergence in (43) assume the following form which thus depends only on the parameter N .
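With the two reported exponents fixed, the Kullback-Leibler divergence between the two Zipf laws and its bounds can be tabulated for any N. The sketch below is a numerical illustration only: it assumes natural logarithms and bounds of the form $\min_i q_i \cdot G \le KL(p,q) \le \max_i q_i \cdot G$, where G is the unweighted Jensen difference of the kernel t log t at the points $p_i/q_i$ (the shape obtained from the interpolating inequality for probability distributions, where the additive term vanishes). The chosen values of N and all function names are hypothetical.

```python
import math

S_RUSSIAN = 0.892869  # average exponent reported for Russian texts
S_ENGLISH = 0.973863  # average exponent reported for English texts


def zipf_pmf(N, s):
    """Zipf law probabilities 1 / (i^s * H_{N,s}), i = 1..N."""
    H = math.fsum(k ** -s for k in range(1, N + 1))
    return [1.0 / (i ** s * H) for i in range(1, N + 1)]


def kl_with_bounds(p, q):
    """Return (lower, KL(p, q), upper) under the assumed form of the bounds."""
    N = len(q)
    ratios = [pi / qi for pi, qi in zip(p, q)]
    mean_ratio = math.fsum(ratios) / N
    # Unweighted Jensen difference of t -> t log t at the ratios p_i / q_i.
    gap = math.fsum(r * math.log(r) for r in ratios) - N * mean_ratio * math.log(mean_ratio)
    kl = math.fsum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return min(q) * gap, kl, max(q) * gap


results = {}
for N in (10, 100, 1000):
    p = zipf_pmf(N, S_RUSSIAN)  # distribution associated with Russian
    q = zipf_pmf(N, S_ENGLISH)  # distribution associated with English
    results[N] = kl_with_bounds(p, q)
    print(N, results[N])
```

Because the two exponents are close, the divergence is small for every N, and the gap between the lower and upper bound is governed by the ratio of the largest to the smallest probability of the English distribution.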
Example 1 Let p = (p 1 , . . . , p N ) and q = (q 1 , . . . , q N ) be distributions associated to the Russian and English languages, respectively, and let N ∈ N be a parameter. If the logarithm base is greater than 1, then

Conclusions
In this paper we investigated f-divergences that originate from the Csiszár functional and their link to the Jensen inequality through a specific type of Jensen-type interpolating inequalities. By means of these inequalities we derived new bounds for f-divergences in general via the Csiszár functional, and in particular for the Kullback-Leibler divergence, Hellinger distance, Bhattacharyya distance (coefficient), $\chi^{2}$-divergence, total variation distance and triangular discrimination. Consequently, we deduced analogous results in the light of the well-known Zipf-Mandelbrot law, with the adequate probability mass functions and the adjusted form of the Csiszár functional. The Zipf-Mandelbrot law was analyzed as a more general form of the Zipf law, for which we also gave the corresponding results and an application in linguistics in order to accentuate its experimental character. Thus this paper brings together three important and widely investigated topics: the Jensen inequality, the divergences (for probability distributions) and the Zipf-Mandelbrot law with its less general, but no less important form, the Zipf law. In this way, the paper can be of interest to mathematicians who investigate any of these fields with an accent on mathematical inequalities, as well as to interdisciplinary fields (linguistics, for example, was involved in this case).