Open Access

PAC-Bayesian inequalities of some random variables sequences

Journal of Inequalities and Applications 2015, 2015:244

https://doi.org/10.1186/s13660-015-0768-3

Received: 23 March 2015

Accepted: 24 July 2015

Published: 7 August 2015

Abstract

In this paper, we establish some PAC-Bayesian inequalities for a sequence of random variables, which include conditionally symmetric random variables and locally square integrable martingales.

Keywords

PAC-Bayesian inequality; conditionally symmetric random variables; martingale

MSC

60G42; 60E15; 60G50

1 Introduction

In this paper, we establish several PAC-Bayesian inequalities. PAC-Bayesian analysis, named for the Probably Approximately Correct (PAC) learning model, was introduced over a decade ago (Shawe-Taylor and Williamson [1]; Shawe-Taylor et al. [2]) and has contributed significantly to the analysis and development of supervised learning methods, the random multiarmed bandit problem, and related areas. PAC-Bayesian analysis provides high-probability bounds on the deviation of weighted averages of empirical means of sets of independent random variables from their expectations, and it supplies generalization guarantees for many influential machine learning algorithms.

Shawe-Taylor and Williamson [1] established early PAC-Bayesian learning theorems. They showed that if one can find a ball of sufficient volume in a parameterized concept space, then the center of that ball has a low error rate. For non-i.i.d. data, the application of PAC-Bayesian analysis was only partially addressed, and only recently, by Ralaivola et al. [3] and Lever et al. [4]. Exploiting the structure of martingales, Seldin et al. [5] obtained a PAC-Bayes-Bernstein inequality and applied it to multiarmed bandits. Seldin et al. [6] presented a generalization of the PAC-Bayesian analysis which makes it possible to consider model order selection simultaneously with the exploration-exploitation trade-off.

In the present paper, we continue the study of PAC-Bayesian inequalities for certain sequences of random variables. We concentrate on conditionally symmetric random variables and locally square integrable martingales. For the first case, we assume only the conditional symmetry of the sequence, without any other dependence or moment conditions. For the second case, the boundedness condition of Seldin et al. [6] is weakened. The paper is organized as follows: in Section 2 we state the main results and make some remarks; proofs of the main results are provided in Section 3.

2 Main results

In this section, we discuss the PAC-Bayesian inequalities for conditionally symmetric random variables and martingales. In order to present our main theorems, we give a few definitions. Let \(\mathbb{H}\) be an index (or hypothesis) space, possibly uncountably infinite. Let \(\{X_{1}(h), X_{2}(h),\ldots: h\in\mathbb{H}\} \) be a sequence of random variables adapted to an increasing sequence of σ-fields \(\{\mathcal{F}_{n}\}\), where \(\mathcal{F}_{n}\) is generated by \(\{ X_{k}(h): 1\le k\le n \text{ and } h\in\mathbb{H}\}\), the random sequences observed up to time n (the history).

Seldin et al. [6] obtained a PAC-Bayes-Bernstein inequality for martingales with bounded differences.

Theorem 2.1

(Seldin et al. [6])

Let \(\{X_{1}(h), X_{2}(h),\ldots: h\in\mathbb{H}\}\) be a set of martingale difference sequences adapted to an increasing sequence of σ-fields \(\mathcal{F}_{n}=\{X_{k}(h): 1\le k\le n \textit{ and } h\in\mathbb{H}\}\). Furthermore, let \(M_{n}(h)=\sum_{k=1}^{n}X_{k}(h)\) be the martingales corresponding to the martingale difference sequences and let \(V_{n}(h)=\sum_{k=1}^{n}\mathbb{E}[X_{k}(h)^{2}|\mathcal{F}_{k-1}]\) be the cumulative variances of the martingales. For a distribution ρ over \(\mathbb{H}\), define \(M_{n}(\rho) =\mathbb{E}_{\rho(h)}[M_{n}(h)]\) and \(V_{n}(\rho) =\mathbb{E}_{\rho(h)}[V_{n}(h)]\). Let \(\{C_{1}, C_{2}, \ldots\}\) be an increasing sequence set in advance, such that \(|X_{k}(h)|\le C_{k}\) for all h with probability 1. Let \(\{\mu_{1},\mu _{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but can depend on n). Let \(\{\lambda_{1},\lambda_{2},\ldots\}\) be a sequence of positive numbers set in advance that satisfy \(\lambda_{k}\le C_{k}^{-1}\). Then for all possible distributions \(\rho_{n}\) over \(\mathbb {H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),
$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le(e-2) \lambda_{n}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho_{n}\parallel\mu _{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda_{n}}, $$
where \(\operatorname{KL}(\mu\parallel\nu)\) is the KL-divergence (relative entropy) between two distributions μ and ν.
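To make the quantities in Theorem 2.1 concrete, the sketch below evaluates the right-hand side of the bound for a small finite hypothesis space. The prior μ, posterior ρ, variance value, horizon n, λ, and δ are all illustrative choices, not part of the theorem.

```python
import math

def kl_divergence(rho, mu):
    """KL(rho || mu) for discrete distributions on a finite hypothesis set."""
    return sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0)

def pac_bayes_bernstein_rhs(V_rho, rho, mu, n, lam, delta):
    """Right-hand side of the PAC-Bayes-Bernstein bound of Theorem 2.1."""
    kl = kl_divergence(rho, mu)
    return (math.e - 2) * lam * V_rho + (kl + 2 * math.log(n + 1) + math.log(2 / delta)) / lam

# Illustrative values: uniform prior over 4 hypotheses, a concentrated
# posterior, and a lambda satisfying lam <= 1/C_n with C_n = 1.
mu = [0.25, 0.25, 0.25, 0.25]
rho = [0.7, 0.1, 0.1, 0.1]
bound = pac_bayes_bernstein_rhs(V_rho=50.0, rho=rho, mu=mu, n=100, lam=0.2, delta=0.05)
print(round(kl_divergence(rho, mu), 4), round(bound, 2))
```

Note how the KL term penalizes posteriors far from the prior: a posterior equal to the prior contributes zero complexity.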

Seldin et al. [6] (see Theorem 2.1) considered bounded martingale difference sequences, with parameters \((\lambda_{k})\) that depend on the bounds through \(\lambda_{k}\le 1/C_{k}\). Since \((C_{k})\) is an increasing sequence, \((\lambda_{k})\) is a decreasing sequence. Furthermore, Seldin et al. [6] studied the deviation between the martingale and its conditional variance. Building on these works, in the present paper we consider conditionally symmetric random variables and locally square integrable martingales. For conditionally symmetric random variables, we establish a PAC-Bayesian inequality without any dependence assumptions (for example, independence or the martingale property) and without any moment assumptions on the sequences. For locally square integrable martingales, we remove the boundedness restriction.

2.1 Conditionally symmetric random variables

Assume that \(\{X_{k}(h): k\ge1, h\in\mathbb{H}\}\) are conditionally symmetric with respect to \((\mathcal{F}_{n})\) (i.e., \(\mathcal{L}(X_{i}(h)|\mathcal{F}_{i})=\mathcal {L}(-X_{i}(h)|\mathcal{F}_{i})\)). Let
$$M_{n}(h)=\sum_{k=1}^{n}X_{k}(h) \quad \text{and}\quad V_{n}(h)=\sum_{k=1}^{n} \bigl[X_{k}(h)^{2}\bigr]. $$
For a distribution ρ over \(\mathbb{H}\), define \(M_{n}(\rho) =\mathbb{E}_{\rho(h)}[M_{n}(h)]\) and \(V_{n}(\rho) =\mathbb{E}_{\rho (h)}[V_{n}(h)]\). We establish the PAC-Bayesian inequality between the partial sums and the total quadratic variation of the partial sums.

Theorem 2.2

Let \(\{\mu_{1},\mu_{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but can depend on n). Then for all possible distributions \(\rho_{n}\) over \(\mathbb{H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),
$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho_{n}\parallel \mu_{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$
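For a fixed λ, the bound of Theorem 2.2 trades off the variance term \(\frac{\lambda}{2}V_{n}(\rho_{n})\) against the complexity term \(B/\lambda\), where \(B=\operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln(n+1)+\ln\frac{2}{\delta}\). If λ may be fixed in advance for anticipated magnitudes of V and B, elementary calculus gives the minimizer \(\lambda^{*}=\sqrt{2B/V}\) with minimal value \(\sqrt{2BV}\). A minimal numerical sketch (the values of V and B are illustrative, not from the paper):

```python
import math

def rhs(lam, V, B):
    """Right-hand side of Theorem 2.2: (lam/2) * V + B / lam, for lam > 0."""
    return 0.5 * lam * V + B / lam

# Closed-form minimizer: d/dlam = V/2 - B/lam^2 = 0  =>  lam* = sqrt(2B/V),
# with minimal value sqrt(2*B*V).  Illustrative values of V and B:
V, B = 40.0, 5.0
lam_star = math.sqrt(2 * B / V)
best = rhs(lam_star, V, B)
assert abs(best - math.sqrt(2 * B * V)) < 1e-12

# Grid check that lam_star is (numerically) optimal.
grid = [0.01 * k for k in range(1, 500)]
assert all(rhs(l, V, B) >= best - 1e-9 for l in grid)
print(round(lam_star, 4), round(best, 4))
```

This is the usual square-root rate of PAC-Bayesian bounds; in the theorem itself λ is a fixed constant, so the optimized value is only a guide for choosing it.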

The following theorem gives an inequality in the sense of a self-normalized sequence.

Theorem 2.3

Under the assumptions of Theorem 2.2, for all \(y>0\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),
$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})} \biggr) \\& \quad \leq \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln(n+1)+\ln \frac{2}{\delta}. \end{aligned}$$

2.2 Martingales

Let \((M_{n}(h))\) be a locally square integrable real martingale adapted to the filtration \((\mathcal{F}_{n})\) with \(M_{0}=0\). The predictable quadratic variation and the total quadratic variation of \((M_{n}(h))\) are, respectively, given by
$$\langle M \rangle_{n}(h)=\sum_{k=1}^{n} \mathbb{E}\bigl[\bigl(\Delta M_{k}(h)\bigr)^{2}|\mathcal {F}_{k-1}\bigr]\quad \text{and}\quad [M]_{n}(h)=\sum _{k=1}^{n}\bigl(\Delta M_{k}(h) \bigr)^{2}, $$
where \(\Delta M_{n}(h)=M_{n}(h)-M_{n-1}(h)\). For a distribution ρ over \(\mathbb{H}\), define
$$M_{n}(\rho) =\mathbb{E}_{\rho(h)}\bigl[M_{n}(h)\bigr]. $$

Theorem 2.4

Let \((M_{n}(h))\) be a locally square integrable real martingale adapted to a filtration \(\mathbb{F}=(\mathcal{F}_{n})\). Let \(\{\mu_{1},\mu_{2},\ldots\}\) be a sequence of ‘reference’ (‘prior’) distributions over \(\mathbb{H}\), such that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\) (but it can depend on n). Then for all possible distributions \(\rho_{n}\) over \(\mathbb{H}\) given n and for all n simultaneously with probability greater than \(1-\delta\),
$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2} \biggl(\frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr)+ \frac{\operatorname{KL}(\rho_{n}\parallel\mu_{n})+2\ln (n+1)+\ln\frac{2}{\delta}}{\lambda}, \quad \lambda>0. $$

Theorem 2.5

Under the assumptions of Theorem 2.4, for all \(y>0\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),
$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt{ (\frac {[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )} \biggr) \\& \quad \leq \operatorname{KL}(\rho_{n}\parallel \mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$

3 The proofs of main results

Before giving the proofs of our results, we state the following basic lemmas.

Lemma 3.1

(De la Peña [7])

Let \(\{d_{i}\}\) be a sequence of random variables adapted to an increasing sequence of σ-fields \(\{\mathcal{F}_{i}\}\). Assume that the \(d_{i}\) are conditionally symmetric. Then
$$\exp \Biggl(\lambda\sum_{i=1}^{n}d_{i}- \frac{\lambda^{2}}{2}\sum_{i=1}^{n}d_{i}^{2} \Biggr),\quad n\geq1, $$
is a supermartingale with mean ≤1, for all \(\lambda\in\mathbb{R}\).
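Lemma 3.1 can be probed by simulation: a conditionally symmetric sequence is obtained, for instance, as \(d_{i}=\varepsilon_{i}|Z_{i}|\) with independent Rademacher signs \(\varepsilon_{i}\) and standard normal \(Z_{i}\). The sketch below estimates the mean of the exponential supermartingale at a fixed time n; the horizon, λ, and sample size are arbitrary choices, and the Monte Carlo average only approximates the true mean, which the lemma bounds by 1.

```python
import math
import random

random.seed(0)

def sample_statistic(n=20, lam=0.7):
    """One realization of exp(lam * sum d_i - lam^2/2 * sum d_i^2)
    for a conditionally symmetric sequence d_i = sign_i * |Z_i|."""
    s, v = 0.0, 0.0
    for _ in range(n):
        d = random.choice([-1.0, 1.0]) * abs(random.gauss(0.0, 1.0))
        s += d
        v += d * d
    return math.exp(lam * s - 0.5 * lam ** 2 * v)

# Monte Carlo estimate of the supermartingale mean; Lemma 3.1 asserts
# that the true mean is at most 1 for every lam and every n.
m = sum(sample_statistic() for _ in range(20000)) / 20000
print(round(m, 3))
```

No moment or dependence assumption on \(|d_{i}|\) is used here; only the symmetry of the conditional sign matters, which is the point of the lemma.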

Lemma 3.2

Under the assumptions of Lemma  3.1, for any \(y>0\), we have
$$\mathbb{E}\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr)\leq1. $$

Remark 3.1

Hitczenko [8] proved the above inequality for conditionally symmetric martingale difference sequences, and De la Peña [7] obtained the above inequality without the martingale difference assumption, hence without any integrability assumptions. Note that any sequence of real valued random variables \(X_{i}\) can be ‘symmetrized’ to produce an exponential supermartingale by introducing random variables \(X_{i}'\) such that
$$\mathcal{L}\bigl(X_{n}'|X_{1},X_{1}', \ldots,X_{n-1},X_{n-1}',X_{n}\bigr) = \mathcal{L}\bigl(X_{n}|X_{1}, \ldots,X_{n-1}\bigr) $$
and we set \(d_{n}=X_{n}-X_{n}'\).

Proof

By using Fubini’s theorem, we have
$$\begin{aligned}& \mathbb{E}\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr) \\& \quad = \mathbb{E}\biggl[\frac{y}{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}\exp \biggl(\frac{(\sum_{i=1}^{n}d_{i})^{2}}{ 2(\sum_{i=1}^{n}d_{i}^{2}+y^{2})} \biggr) \\& \qquad {}\times\frac{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}{ \sqrt{2\pi}}\int_{-\infty}^{\infty} \exp \biggl\{ -\frac{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}{2} \biggl(\lambda -\frac {\sum_{i=1}^{n}d_{i}}{\sum_{i=1}^{n}d_{i}^{2}+y^{2}} \biggr)^{2} \biggr\} \, d\lambda\biggr] \\& \quad = \int_{-\infty}^{\infty} \mathbb{E} \Biggl[ \frac{y}{\sqrt{2\pi}}\exp \Biggl(\lambda\sum_{i=1}^{n}d_{i}- \frac{\lambda^{2}}{2} \sum_{i=1}^{n}d_{i}^{2} \Biggr)\exp \biggl(-\frac{\lambda ^{2}y^{2}}{2} \biggr) \Biggr]\, d\lambda \\& \quad \leq \int_{-\infty}^{\infty} \biggl[\frac{y}{\sqrt{2\pi}} \exp \biggl(-\frac{\lambda^{2}y^{2}}{2} \biggr) \biggr]\, d\lambda=1, \end{aligned}$$
where we used the fact
$$\frac{\sqrt{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}}{\sqrt{2\pi}}\int_{-\infty}^{\infty} \exp \biggl\{ -\frac{\sum_{i=1}^{n}d_{i}^{2}+y^{2}}{2} \biggl(\lambda -\frac {\sum_{i=1}^{n}d_{i}}{\sum_{i=1}^{n}d_{i}^{2}+y^{2}} \biggr)^{2} \biggr\} \, d\lambda=1. $$
 □

The following lemma is the change of measure inequality [9].

Lemma 3.3

For any measurable function \(\phi(h)\) on \(\mathbb{H}\) and any distributions \(\mu(h)\) and \(\rho(h)\) on \(\mathbb{H}\), we have
$$\mathbb{E}_{\rho(h)}\bigl(\phi(h)\bigr)\le \operatorname{KL}(\rho\parallel \mu)+\ln\mathbb {E}_{\mu(h)}\bigl(e^{\phi(h)}\bigr). $$
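The change of measure inequality can be verified directly for discrete distributions. In the sketch below, the four-point distributions μ, ρ and the function φ are arbitrary examples; equality is attained by the Gibbs posterior \(\rho(h)\propto\mu(h)e^{\phi(h)}\).

```python
import math

def change_of_measure_gap(phi, rho, mu):
    """KL(rho||mu) + ln E_mu[e^phi] - E_rho[phi]; Lemma 3.3 says this is >= 0."""
    kl = sum(r * math.log(r / m) for r, m in zip(rho, mu) if r > 0)
    log_mgf = math.log(sum(m * math.exp(p) for m, p in zip(mu, phi)))
    e_rho = sum(r * p for r, p in zip(rho, phi))
    return kl + log_mgf - e_rho

# Arbitrary discrete example on a 4-point hypothesis space.
mu = [0.4, 0.3, 0.2, 0.1]
rho = [0.1, 0.2, 0.3, 0.4]
phi = [1.0, -2.0, 0.5, 3.0]
print(change_of_measure_gap(phi, rho, mu))

# Equality holds for the Gibbs posterior rho(h) proportional to mu(h)*e^{phi(h)}.
z = sum(m * math.exp(p) for m, p in zip(mu, phi))
gibbs = [m * math.exp(p) / z for m, p in zip(mu, phi)]
assert abs(change_of_measure_gap(phi, gibbs, mu)) < 1e-12
```

In the proofs below, φ is the exponent of the relevant supermartingale, and the \(\ln\mathbb{E}_{\mu}\) term is what gets bounded by 1 in expectation.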

Proof of Theorem 2.2

Taking
$$\phi(h)=\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h), \quad \lambda>0, $$
then from Lemma 3.1 and Lemma 3.3, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),
$$\begin{aligned}& \lambda M_{n}(\rho_{n})-\frac{\lambda^{2}}{2}V_{n}( \rho_{n}) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl(\lambda M_{n}(h)- \frac{\lambda^{2}}{2}V_{n}(h) \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr)+2 \ln(n+1)+\ln\frac {2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2}V_{n}(h)} \bigr)+2\ln(n+1)+\ln\frac {2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}, \end{aligned}$$
where the second equality is due to the fact that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\). By applying the same argument to \(-M_{n}(h)\), we obtain that, with probability greater than \(1-\delta\),
$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2}V_{n}(\rho_{n})+\frac{\operatorname{KL}(\rho _{n}\parallel\mu_{n})+2\ln(n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$
 □

Proof of Theorem 2.3

For all \(y>0\), taking
$$\phi(h)=\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})}, $$
then from Lemma 3.2 and Lemma 3.3, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),
$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2(V_{n}(h)+y^{2})} \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\ln \frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr)+2 \ln(n+1)+\ln\frac{2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac {M_{n}(h)^{2}}{2(V_{n}(h)+y^{2})}} \bigr)+2\ln(n+1)+\ln\frac{2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$
 □

In order to prove Theorem 2.4, we need to introduce the following lemma.

Lemma 3.4

Let \((M_{n})\) be a locally square integrable martingale. Putting
$$\langle M \rangle_{n}=\sum_{k=1}^{n} \mathbb{E}\bigl[(\Delta M_{k})^{2}|\mathcal {F}_{k-1}\bigr]\quad \textit{and} \quad [M]_{n}=\sum _{k=1}^{n}(\Delta M_{k})^{2}, $$
for all \(t\in \mathbb{R}\) and \(n\geq0\), denote
$$G_{n}(t)=\exp \biggl(tM_{n}-\frac{t^{2}}{6}[M]_{n}- \frac{t^{2}}{3} \langle M\rangle_{n} \biggr). $$
Then, for all \(t\in\mathbb{R}\), \((G_{n}(t))\) is a positive supermartingale with \(\mathbb{E}[G_{n}(t)]\leq1\).

Proof

Let X be a square integrable random variable with \(\mathbb{E}X=0\) and \(0<\sigma^{2}:=\mathbb{E}X^{2}<\infty\). Because of the basic inequality
$$\exp \biggl(x- \frac{x^{2}}{6} \biggr) \leq1 + x+ \frac{x^{2}}{3},\quad x \in\mathbb{R}, $$
we know
$$ \mathbb{E} \biggl[\exp \biggl(t X- \frac {t^{2}}{6} X^{2} \biggr) \biggr]\leq1+\frac{t^{2}}{3}\sigma^{2}. $$
(3.1)
Then, for all \(t \in\mathbb{R}\) and \(n \geq1\), we have
$$G_{n}(t)=G_{n-1}(t)\exp \biggl(t \Delta M_{n}- \frac{t^{2}}{6}\Delta[M]_{n}-\frac{t^{2}}{3}\Delta \langle M \rangle_{n} \biggr). $$
Hence, applying (3.1) conditionally on \(\mathcal{F}_{n-1}\) (note that \(\Delta[M]_{n}=(\Delta M_{n})^{2}\)) and using \(1+x\le e^{x}\), we deduce that, for all \(t \in\mathbb{R}\),
$$\begin{aligned} \mathbb{E}\bigl[G_{n}(t)|\mathcal{F}_{n-1}\bigr] \leq&G_{n-1}(t)\exp \biggl(-\frac{t^{2}}{3}\Delta\langle M \rangle_{n} \biggr)\cdot \biggl(1+\frac{t^{2}}{3}\Delta \langle M \rangle_{n} \biggr) \\ \leq& G_{n-1}(t). \end{aligned}$$
As a result, for all \(t\in\mathbb{R}\), \(G_{n}(t)\) is a positive supermartingale, i.e., for all \(n\geq1\), \(\mathbb {E}[G_{n}(t)]\leq \mathbb{E}[G_{n-1}(t)]\), which implies that \(\mathbb{E}[G_{n}(t)]\leq \mathbb{E}[G_{0}(t)]=1\). □
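The elementary inequality \(\exp (x-\frac{x^{2}}{6} )\le 1+x+\frac{x^{2}}{3}\) that drives the proof above can be checked numerically on a grid; this is a sanity check, not a proof.

```python
import math

# Grid check of exp(x - x^2/6) <= 1 + x + x^2/3 over [-10, 10].
xs = [-10 + 0.001 * k for k in range(20001)]
worst = max(math.exp(x - x * x / 6) - (1 + x + x * x / 3) for x in xs)
assert worst <= 1e-12  # the two sides touch only at x = 0
print(worst)
```

Near the origin the gap behaves like \(x^{4}/36\), so the bound is tight exactly at \(x=0\), which is where the supermartingale argument needs it.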

Proof of Theorem 2.4

Taking
$$\phi(h)=\lambda M_{n}(h)-\frac{\lambda^{2}}{2} \biggl(\frac{[M]_{n}(h)}{3}+ \frac {2\langle M \rangle_{n}(h)}{3} \biggr),\quad \lambda>0, $$
then from Lemma 3.3 and Lemma 3.4, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac{\delta}{2}\),
$$\begin{aligned}& \lambda M_{n}(\rho_{n})-\frac{\lambda^{2}}{2} \biggl( \frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl[\lambda M_{n}(h)- \frac{\lambda^{2}}{2} \biggl(\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} \biggr) \biggr] \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl[e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr] \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl[\mathbb{E}_{\mu_{n}(h)}e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr]+2 \ln(n+1)+\ln\frac{2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl[\mathbb {E}_{\mathcal{F}_{n}} e^{\lambda M_{n}(h)-\frac{\lambda^{2}}{2} (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )} \bigr]+2\ln(n+1)+\ln\frac{2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}, \end{aligned}$$
where the second equality is due to the fact that \(\mu_{n}\) is independent of \(\mathcal{F}_{n}\). By applying the same argument to martingales \(-M_{n}(h)\), we obtain the result that, with probability greater than \(1-\delta\),
$$\bigl\vert M_{n}(\rho_{n})\bigr\vert \le \frac{\lambda}{2} \biggl(\frac{[M]_{n}(\rho _{n})}{3}+\frac{2\langle M \rangle_{n}(\rho_{n})}{3} \biggr)+ \frac{\operatorname{KL}(\rho_{n}\|\mu_{n})+2\ln (n+1)+\ln\frac{2}{\delta}}{\lambda},\quad \lambda>0. $$
 □

Proof of Theorem 2.5

For all \(y>0\), taking
$$\phi(h)=\ln\frac{y}{\sqrt{ (\frac{[M]_{n}(h)}{3}+\frac {2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )}, $$
and writing \(V_{n}(h)=\frac{[M]_{n}(h)}{3} +\frac{2\langle M\rangle_{n}(h)}{3}\), from Lemma 3.3 and the argument of Lemma 3.2 (with Lemma 3.4 in place of Lemma 3.1), we get, for all \(\rho_{n}\) and n simultaneously with probability greater than \(1-\frac {\delta}{2}\),
$$\begin{aligned}& \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt{ (\frac {[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 ( (\frac{[M]_{n}(h)}{3}+\frac{2\langle M \rangle_{n}(h)}{3} )+y^{2} )} \biggr) \\& \quad = \mathbb{E}_{\rho_{n}(h)} \biggl(\ln\frac{y}{\sqrt {V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{ 2 (V_{n}(h)+y^{2} )} \biggr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(e^{\ln \frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr) \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mathcal{F}_{n}} \bigl(\mathbb{E}_{\mu_{n}(h)}e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac{M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr)+2 \ln(n+1)+\ln\frac {2}{\delta} \\& \quad = \operatorname{KL}(\rho_{n}\parallel\mu_{n})+\ln \mathbb{E}_{\mu_{n}(h)} \bigl(\mathbb {E}_{\mathcal{F}_{n}} e^{\ln\frac{y}{\sqrt{V_{n}(h)+y^{2}}}+\frac {M_{n}(h)^{2}}{2 (V_{n}(h)+y^{2} )}} \bigr)+2\ln(n+1)+\ln\frac {2}{\delta} \\& \quad \le \operatorname{KL}(\rho_{n}\parallel\mu_{n})+2 \ln(n+1)+\ln\frac{2}{\delta}. \end{aligned}$$
 □

Declarations

Acknowledgements

This work is supported by IRTSTHN (14IRTSTHN023), NSFC (11471104), NCET (NCET-11-0945).

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Authors’ Affiliations

(1)
Henan Engineering Laboratory for Big Data Statistical Analysis and Optimal Control, College of Mathematics and Information Science, Henan Normal University
(2)
Science College, Hunan Agricultural University

References

  1. Shawe-Taylor, J, Williamson, RC: A PAC analysis of a Bayesian estimator. In: Proceedings of the International Conference on COLT, pp. 2-9 (1997)
  2. Shawe-Taylor, J, Bartlett, PL, Williamson, RC, Anthony, M: Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory 44(5), 1926-1940 (1998)
  3. Ralaivola, L, Szafranski, M, Stempfel, G: Chromatic PAC-Bayes bounds for non-IID data: applications to ranking and stationary β-mixing processes. J. Mach. Learn. Res. 11, 1927-1956 (2010)
  4. Lever, G, Laviolette, F, Shawe-Taylor, J: Distribution-dependent PAC-Bayes priors. In: Algorithmic Learning Theory. Lecture Notes in Computer Science, vol. 6331, pp. 119-133 (2010)
  5. Seldin, Y, Laviolette, F, Cesa-Bianchi, N, Shawe-Taylor, J, Auer, P: PAC-Bayesian inequalities for martingales. IEEE Trans. Inf. Theory 58, 7086-7093 (2012)
  6. Seldin, Y, Cesa-Bianchi, N, Auer, P, Laviolette, F, Shawe-Taylor, J: PAC-Bayes-Bernstein inequality for martingales and its application to multiarmed bandits. In: JMLR Workshop and Conference Proceedings, vol. 26, pp. 98-111 (2012)
  7. De la Peña, VH: A general class of exponential inequalities for martingales and ratios. Ann. Probab. 27(1), 537-564 (1999)
  8. Hitczenko, P: Upper bounds for the \(L_{p}\)-norms of martingales. Probab. Theory Relat. Fields 86(2), 225-238 (1990)
  9. Dupuis, P, Ellis, RS: A Weak Convergence Approach to the Theory of Large Deviations. Wiley Series in Probability and Statistics. Wiley, New York (2011)

Copyright

© Wang et al. 2015