3.4 The Borel-Cantelli lemma
In this section we encounter the first deep result that has a genuinely probabilistic flavour. It is the standard tool to prove that some asymptotic property holds almost surely, and it lies at the heart of many important results in probability. A typical application is given in Example 3.20 below.
Let \((A_n)_{n \in \mathbb{N}}\) be a sequence of events in \(\mathcal A.\) We define the new events \[\begin{aligned} \limsup_n A_n &:=\bigcap_{n \geqslant 0} \bigcup_{k \geqslant n} A_k \\ \liminf_n A_n &:=\bigcup_{n \geqslant 0} \bigcap_{k \geqslant n} A_k\,. \end{aligned}\] The following interpretation is crucial, and it is often the better way to understand these events: \[\begin{aligned} \limsup_n A_n &= \{\omega \,\colon\omega \in A_k \text{ infinitely often}\}\,, \\ \liminf_n A_n &= \{\omega \,\colon\omega \in A_k \text{ eventually}\}\,. \end{aligned}\] In other words:
- \(\limsup_n A_n\) is the set of realisations that appear in infinitely many \(A_k.\)
- \(\liminf_n A_n\) is the set of realisations that appear in all \(A_k\) beyond a certain index.
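As a quick worked example, fix two events \(B, C \in \mathcal A\) and set \(A_{2n} := B\) and \(A_{2n+1} := C\) for all \(n.\) Then \[\limsup_n A_n = B \cup C\,, \qquad \liminf_n A_n = B \cap C\,,\] since a realisation lies in infinitely many \(A_k\) exactly when it lies in \(B\) or in \(C,\) but lies in all \(A_k\) beyond some index only when it lies in both.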
It is very important that you play with these different formulations until you are comfortable with them.
We always have \[\tag{3.7} \liminf_n A_n \subset \limsup_n A_n\,.\] If you’re not sure why this is true, pause here until you see why.
The reason behind the terminology \(\limsup\) and \(\liminf\) stems from the fact that \[\mathbf 1_{\limsup_n A_n} = \limsup_n \mathbf 1_{A_n}\,, \qquad \mathbf 1_{\liminf_n A_n} = \liminf_n \mathbf 1_{A_n}\,,\] where the right-hand sides are the usual \(\limsup\) and \(\liminf\) on \(\mathbb{R}\); see the exercises. (This remark also provides another way of seeing (3.7).)
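To verify the first identity, note that for fixed \(\omega,\) \[\limsup_n \mathbf 1_{A_n}(\omega) = \inf_{n \geqslant 0} \sup_{k \geqslant n} \mathbf 1_{A_k}(\omega)\,,\] which equals \(1\) if and only if for every \(n\) there exists \(k \geqslant n\) with \(\omega \in A_k,\) i.e. if and only if \(\omega \in \bigcap_{n \geqslant 0} \bigcup_{k \geqslant n} A_k.\) The second identity is analogous.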
The event \(\limsup_n A_n\) is more useful and more commonly used than \(\liminf_n A_n,\) which appears quite rarely in probability.
Proposition 3.16 (Borel-Cantelli lemma). Let \((A_n)_{n \in \mathbb{N}}\) be events in \(\mathcal A.\)

- (i) If \(\sum_{n \in \mathbb{N}} \mathbb{P}(A_n) < \infty\) then \(\mathbb{P}(\limsup_n A_n) = 0.\) In other words, almost surely \(A_n\) happens only finitely often.
- (ii) If \(\sum_{n \in \mathbb{N}} \mathbb{P}(A_n) = \infty\) and \((A_n)_{n \in \mathbb{N}}\) are independent, then \(\mathbb{P}(\limsup_n A_n) = 1.\) In other words, almost surely \(A_n\) happens infinitely often.
In (ii), the independence assumption is essential. The conclusion clearly fails if we take \(A_n = A\) for all \(n \in \mathbb{N}\) with \(0 < \mathbb{P}(A) < 1\): in that case \(\limsup_n A_n = A,\) whose probability is strictly less than \(1,\) even though \(\sum_n \mathbb{P}(A_n) = \infty.\)
Proof of Proposition 3.16. For (i), we write (using Exercise 1.1, applied to the decreasing sequence of events \(\bigcup_{k \geqslant n} A_k\)) \[\mathbb{P}(\limsup_n A_n) = \lim_n \mathbb{P}\Biggl(\bigcup_{k \geqslant n} A_k\Biggr) \leqslant\lim_n \sum_{k \geqslant n} \mathbb{P}(A_k) = 0\,,\] since the sum \(\sum_k \mathbb{P}(A_k)\) is finite, so that its tails tend to zero.
For (ii), we choose \(N \in \mathbb{N}\) and write for any \(n \geqslant N,\) by independence of the events \(A_k,\) \[\mathbb{P}\Biggl(\bigcap_{k = N}^n A_k^c\Biggr) = \prod_{k = N}^n \mathbb{P}(A_k^c) = \prod_{k = N}^n (1 - \mathbb{P}(A_k)) \leqslant\prod_{k = N}^n \mathrm e^{-\mathbb{P}(A_k)} = \mathrm e^{-\sum_{k = N}^n \mathbb{P}(A_k)}\,,\] which tends to zero as \(n \to \infty\) because of the assumption \(\sum_{n \in \mathbb{N}} \mathbb{P}(A_n) = \infty.\) Here we used the basic inequality \(1 - x \leqslant\mathrm e^{-x}\) (which follows for instance by convexity of the function \(\mathrm e^{-x}\)). We conclude (by Exercise 1.1) that \[\mathbb{P}\Biggl(\bigcap_{k \geqslant N}A_k^c\Biggr) = 0\,,\] and hence also \[\mathbb{P}\Biggl(\bigcup_{N \geqslant 0} \bigcap_{k \geqslant N}A_k^c\Biggr) = 0\,,\] which is equivalent to \[\mathbb{P}\Biggl(\bigcap_{N \geqslant 0} \bigcup_{k \geqslant N}A_k\Biggr) = 1\,. \]
A somewhat different proof of (i) follows from the observation that \[\mathbb{E}\biggl[\sum_n \mathbf 1_{A_n}\biggr] = \sum_n \mathbb{P}(A_n) < \infty\,,\] so that the random variable \(\sum_n \mathbf 1_{A_n}\) is almost surely finite, which implies that \(A_n\) happens only finitely often.
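The dichotomy in Proposition 3.16 is also easy to observe in simulation. The following Python sketch (an illustration, not part of the development; the choices \(\mathbb{P}(A_n) = 1/n^2\) and \(\mathbb{P}(A_n) = 1/n\) are arbitrary summable and divergent examples) generates independent events and counts how many of the first \(N\) occur.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000  # how many events A_1, ..., A_N we simulate

n = np.arange(1, N + 1)
for label, p in [("summable 1/n^2", 1.0 / n**2), ("divergent 1/n", 1.0 / n)]:
    # One sample path: occurred[k] is the indicator of A_{k+1}.
    occurred = rng.random(N) < p
    count = int(occurred.sum())
    last = int(n[occurred][-1]) if count > 0 else None
    print(f"{label}: {count} occurrences, last one at n = {last}")
```

In the summable case the occurrences stop early on each sample path, in line with part (i); in the divergent case new occurrences keep arriving, roughly \(\log N\) of them among the first \(N\) events, in line with part (ii).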
The Borel-Cantelli lemma states that whether \(A_n\) happens infinitely often depends on whether the sum \(\sum_n \mathbb{P}(A_n)\) converges, i.e. on how fast the sequence \(\mathbb{P}(A_n)\) tends to zero. This principle is well illustrated by the following example, which provides a good intuition for Proposition 3.16 (i).
Let \(b_1, b_2, \dots \in [0,1).\) We partition \([0,\infty)\) into intervals \([a_{n-1}, a_{n})\) of length \(b_n,\) by setting \(a_0 = 0\) and \(a_{n} = a_{n - 1} + b_n\) for \(n \geqslant 1.\) Now we “fold” these intervals into the unit interval \([0,1)\) using the function \(f(x) :=x - \lfloor x \rfloor,\) the fractional part of \(x.\) Thus, we define \(A_n :=f([a_{n-1}, a_{n}))\) for \(n \geqslant 1.\) You may find it helpful to draw these sets.
If \(\sum_n b_n = \infty\) then \(\bigcup_n [a_{n-1}, a_{n}) = [0,\infty).\) This means that the folded sets \(A_n\) keep on “passing through” the interval \([0,1),\) and every \(\omega \in [0,1)\) is contained in infinitely many sets \(A_n.\)
If \(\sum_n b_n < \infty\) then \(\bigcup_n [a_{n-1}, a_{n})\) is a bounded interval. This means that the folded sets \(A_n\) do not cover enough ground to keep on passing through \([0,1),\) and every \(\omega \in [0,1)\) is contained in only finitely many sets \(A_n.\)
These observations can be interpreted in light of the Borel-Cantelli lemma on the probability space \(([0,1), \mathcal B([0,1)), \mathbb{P}),\) where \(\mathbb{P}\) is Lebesgue measure. Then \(\mathbb{P}(A_n) = b_n\) (why?), and Proposition 3.16 (i) is applicable. In particular, we see that the condition \(\sum_n \mathbb{P}(A_n) < \infty\) is not only sufficient but in general also necessary. Note that Proposition 3.16 (ii) does not apply to this example, because the events \((A_n)\) are not independent (why?).
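For readers who like to experiment, here is a minimal Python sketch of this example (assuming the indexing above; the sequences \(b_n = 1/(n+1)\) and \(b_n = 1/(n+1)^2\) are arbitrary divergent and summable choices in \([0,1)\)). The membership test uses that \(f([a_{n-1}, a_n))\) is the interval \([u, u + b_n)\) when \(u + b_n \leqslant 1,\) and otherwise wraps around into \([u, 1) \cup [0, u + b_n - 1),\) where \(u := f(a_{n-1}).\)

```python
import math

def occurrences(omega, b, N):
    """Count the n <= N with omega in A_n = f([a_{n-1}, a_n))."""
    a, count = 0.0, 0
    for n in range(1, N + 1):
        u = a - math.floor(a)   # u = f(a_{n-1})
        bn = b(n)
        if u + bn <= 1.0:       # folded interval does not wrap around 1
            inside = u <= omega < u + bn
        else:                   # folded interval wraps: [u, 1) union [0, u + bn - 1)
            inside = omega >= u or omega < u + bn - 1.0
        count += inside
        a += bn                 # a_n = a_{n-1} + b_n
    return count

omega = 0.123456
print(occurrences(omega, lambda n: 1.0 / (n + 1), 100_000))     # divergent: keeps growing
print(occurrences(omega, lambda n: 1.0 / (n + 1)**2, 100_000))  # summable: stabilises
```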
In this example we investigate the statistics of digits of random numbers. For simplicity, we work with binary digits, although the same argument is easily adapted to any base (such as 10). Take the probability space \(([0,1), \mathcal B([0,1)), \mathbb{P}),\) where \(\mathbb{P}\) is Lebesgue measure. Use the notation \(\omega = 0.\omega_1\omega_2\omega_3 \cdots\) for the binary digits \(\omega_n \in \{0,1\}\) of \(\omega \in [0,1),\) i.e. \(\omega = \sum_{n = 1}^\infty \omega_n 2^{-n}\) (with the usual convention that we cannot have \(\omega_n = 1\) for all \(n\) above a certain index; this ensures the uniqueness of the binary representation). This definition is not very direct, and it turns out to be more convenient to define the digits through the events \[B_n :=\bigcup_{k = 1}^{2^{n - 1}} \biggl[\frac{k - 1/2}{2^{n - 1}}, \frac{k}{2^{n - 1}} \biggr)\,, \qquad n \geqslant 1\,.\] (Draw them!) Then we define the random variable \(X_n :=\mathbf 1_{B_n}.\) One can then show by induction that \(X_n(\omega) = \omega_n\); the details are left as an exercise (drawing the events \(B_n\) should make this clear).
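As a sanity check on the identity \(X_n(\omega) = \omega_n\) (a numerical sketch, not a proof): \(\omega \in B_n\) exactly when the fractional part of \(2^{n-1}\omega\) lies in \([1/2, 1),\) while the \(n\)-th digit can be read off as \(\omega_n = \lfloor 2^n \omega \rfloor \bmod 2.\)

```python
import math

def in_B(omega, n):
    # omega is in B_n iff frac(2^{n-1} * omega) lies in [1/2, 1)
    t = 2 ** (n - 1) * omega
    return t - math.floor(t) >= 0.5

def digit(omega, n):
    # n-th binary digit of omega: floor(2^n * omega) mod 2
    return math.floor(2 ** n * omega) % 2

omega = 0.6180339887  # arbitrary point of [0, 1)
assert all(in_B(omega, n) == bool(digit(omega, n)) for n in range(1, 30))
```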
Next, we find that \(\mathbb{P}(X_n = 0) = \mathbb{P}(X_n = 1) = \frac{1}{2}\) for all \(n,\) and we claim that \((X_n)_{n \geqslant 1}\) are independent random variables. Indeed, for any \(p \in \mathbb{N}^*\) and \(x_1, \dots, x_p \in \{0,1\}\) we find \[\begin{gathered} \mathbb{P}(X_1 = x_1, \dots, X_p = x_p) = \mathbb{P}\Biggl(\Biggl[\sum_{k = 1}^p x_k 2^{-k}, \sum_{k = 1}^p x_k 2^{-k} + 2^{-p} \Biggr)\Biggr) \\ = \frac{1}{2^p} = \prod_{k = 1}^p \mathbb{P}(X_k = x_k)\,, \end{gathered}\] as desired.
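For instance, with \(p = 2\) and \((x_1, x_2) = (1, 0)\) this computation reads \[\mathbb{P}(X_1 = 1, X_2 = 0) = \mathbb{P}\bigl(\bigl[\tfrac{1}{2}, \tfrac{1}{2} + \tfrac{1}{4}\bigr)\bigr) = \tfrac{1}{4} = \mathbb{P}(X_1 = 1)\,\mathbb{P}(X_2 = 0)\,.\]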
Moreover, we claim that for any \(p \in \mathbb{N}^*\) and \(x_1, \dots, x_p \in \{0,1\}\) we have, almost surely, \[\tag{3.8} \# \{k \in \mathbb{N}\,\colon(X_{k+1}, \dots, X_{k+p}) = (x_1, \dots, x_p) \} = \infty\,.\] This is a simple consequence of Proposition 3.16 (ii). The only issue is that events \[\bigl\{(X_{k+1}, \dots, X_{k+p}) = (x_1, \dots, x_p)\bigr\} \,, \qquad k \in \mathbb{N}\,,\] are not independent (since the collections of random variables on which they depend overlap). There’s a very easy solution to this, however: we can just pick every \(p\)-th of them, in which case they are independent. Thus, defining \(Y_n :=(X_{np + 1}, \dots, X_{np + p}),\) from Remark 3.15 we conclude that \((Y_n)_{n \in \mathbb{N}}\) are independent random variables. Setting \(A_n :=\{Y_n = (x_1, \dots, x_p)\},\) the family \((A_n)_{n \in \mathbb{N}}\) is independent and satisfies \(\mathbb{P}(A_n) = 2^{-p}.\) Hence Proposition 3.16 (ii) yields (3.8).
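The block trick is easy to mimic numerically (a sketch with arbitrary choices of pattern and sample size): generate i.i.d. fair bits and count the disjoint blocks \(Y_n\) that match.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 3
pattern = np.array([1, 0, 1])  # arbitrary target pattern (x_1, ..., x_p)
bits = rng.integers(0, 2, size=300_000)  # i.i.d. fair bits X_1, X_2, ...

# Disjoint blocks Y_n = (X_{np+1}, ..., X_{np+p}) are independent,
# and each one matches the pattern with probability 2^{-p}.
blocks = bits.reshape(-1, p)
matches = int((blocks == pattern).all(axis=1).sum())
print(matches, "matching blocks; expected about", len(blocks) / 2**p)
```

The count grows linearly in the number of blocks, consistent with (3.8).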
This is an important general observation. Let \(P(\omega, i)\) be a statement depending on the realisation \(\omega\) and an index \(i\) ranging over some index set \(I.\) Then the statement “for all \(i \in I,\) almost surely \(P(\omega, i)\)” is weaker than “almost surely, for all \(i \in I,\) \(P(\omega, i)\)”. However, if the set \(I\) is countable, the two statements are equivalent by a union bound. Applied to (3.8), whose index set of patterns \((x_1, \dots, x_p),\) \(p \in \mathbb{N}^*,\) is countable, this shows that almost surely every finite string of digits occurs infinitely often in the binary expansion of \(\omega.\)
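Spelled out, the union bound in question is just countable subadditivity: if each \(N_i :=\{\omega \,\colon P(\omega, i) \text{ fails}\}\) is a null set and \(I\) is countable, then \[\mathbb{P}\biggl(\bigcup_{i \in I} N_i\biggr) \leqslant\sum_{i \in I} \mathbb{P}(N_i) = 0\,,\] so that almost surely \(P(\omega, i)\) holds for all \(i \in I\) simultaneously.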