6.3 Hypothesis testing
Someone gives you a coin and you would like to determine whether it is fair. To do so, you flip it many times and record the results. From the observed outcomes you must decide which of the following hypotheses is true.
(\(\mathrm H_0\)): The coin is fair (heads and tails occur with equal probability).
(\(\mathrm H_1\)): The coin is biased (heads and tails occur with different probabilities).
The hypothesis (\(\mathrm H_0\)) is called the null hypothesis, while the hypothesis (\(\mathrm H_1\)) is called the alternative hypothesis.
How to choose the null hypothesis is an art and requires insight and experience on the part of the statistician. In general, the null hypothesis describes the default or standard scenario, where the statistical phenomenon or effect one is looking for is absent (in the above example, a bias in the coin). The alternative hypothesis describes the scenario where, on the contrary, the phenomenon or effect is present.
A defendant is accused of a crime. In most modern systems of justice, the null and alternative hypotheses are:
(\(\mathrm H_0\)): The defendant is innocent.
(\(\mathrm H_1\)): The defendant is guilty.
In this example, the effect one is looking for is the guilt of the defendant.
The goal of statistical tests is to determine whether the observed sample provides sufficient evidence to reject the null hypothesis (and hence conclude that the phenomenon or effect one is investigating is present).
More formally, we consider an \(n\)-sample \(X_1, \dots, X_n\) drawn from \(\mathbb{P}_\theta\) with unknown \(\theta \in \Theta.\) We partition the parameter space in two: \[\Theta = \Theta_0 \cup \Theta_1 \,, \qquad \Theta_0 \cap \Theta_1 = \emptyset\,.\] We then define the hypotheses \[\begin{aligned} &(\mathrm H_0): \theta \in \Theta_0 \qquad \text{(null hypothesis)} \\ &(\mathrm H_1): \theta \in \Theta_1 \qquad \text{(alternative hypothesis)}\,, \end{aligned}\] and aim to determine, using the given \(n\)-sample, which of these two hypotheses is correct.
To that end, we use the following procedure for testing \(\mathrm H_0\) versus \(\mathrm H_1\):
Define a rejection region, which is an event \[D = D(X_1, \dots, X_n)\,.\]
Reject \(\mathrm H_0\) if and only if \(D\) holds.
There are two different kinds of errors that one can make:
Error of first kind: \(\mathrm H_0\) is true but we reject it.
Error of second kind: \(\mathrm H_0\) is false but we do not reject it.
In Example 6.26, the error of first kind is to convict an innocent defendant, while the error of second kind is to clear a guilty defendant.
The probabilities of committing these two errors are quantified by the following definition.
Consider a statistical test determined by its rejection region \(D.\)
The confidence of the test is \[1 - \alpha :=\inf_{\theta \in \Theta_0} \mathbb{P}_\theta(D^c)\,.\] We also call \(\alpha\) the risk of the test.
The power of the test is \[1 - \beta :=\inf_{\theta \in \Theta_1} \mathbb{P}_\theta(D)\,.\]
Thus, \(\alpha\) is the worst-case probability of committing an error of first kind, while \(\beta\) is the worst-case probability of committing an error of second kind.
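To make these definitions concrete, here is a minimal Monte Carlo sketch (a hypothetical setup, not from the text): Gaussian observations with known variance \(1,\) \(\Theta_0 = \{0\},\) \(\Theta_1 = \{1\},\) and rejection region \(D = \{\bar X_n \geqslant c\}\) for an arbitrarily chosen threshold \(c.\)

```python
# Minimal Monte Carlo sketch (hypothetical parameters, not from the text):
# Gaussian observations with known variance 1, Theta_0 = {0}, Theta_1 = {1},
# rejection region D = {empirical mean >= c}.
import numpy as np

rng = np.random.default_rng(0)
n, c, trials = 25, 0.33, 100_000

def rejection_rate(theta):
    """Fraction of simulated n-samples drawn under P_theta that fall in D."""
    means = rng.normal(loc=theta, scale=1.0, size=(trials, n)).mean(axis=1)
    return np.mean(means >= c)

alpha = rejection_rate(0.0)       # P_{theta_0}(D): error of first kind, ~0.05 here
beta = 1.0 - rejection_rate(1.0)  # P_{theta_1}(D^c): error of second kind, tiny here
print(f"alpha ~ {alpha:.4f}, beta ~ {beta:.4f}")
```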
Let \(I\) be a confidence interval for \(\theta\) with confidence level \(1 - \alpha.\) Suppose that \(\theta_0 \in \Theta.\) We want to test the null hypothesis \(\theta = \theta_0\) versus the alternative hypothesis \(\theta \neq \theta_0.\) Then the rejection region \(D = \{\theta_0 \notin I\}\) yields a test of \(\mathrm H_0\) versus \(\mathrm H_1\) with confidence at least \(1 - \alpha.\) Indeed, by Definition 6.19, we have \[\mathbb{P}_{\theta_0}(D^c) = \mathbb{P}_{\theta_0}(\theta_0 \in I) \geqslant 1 - \alpha\,.\]
Next, we consider a few concrete Gaussian examples.
Let \(X_1, \dots, X_n\) be an \(n\)-sample drawn from the Gaussian law with mean \(\mu\) and variance \(\sigma^2.\) Let \(\mu_0 \in \mathbb{R}.\) We want to test the null hypothesis \(\mu = \mu_0\) versus the alternative hypothesis \(\mu \neq \mu_0.\) To construct the rejection region, we use the empirical mean \(\bar X_n\) (see Example 6.5).
Suppose first that the variance \(\sigma^2\) is known and only the mean \(\mu\) is unknown. For \(C > 0\) define \[D :=\bigl\{\lvert \bar X_n - \mu_0 \rvert \geqslant C\bigr\}\,.\] We require a test with confidence at least \(95\%,\) which means \[\mathbb{P}_{\mu_0}(\lvert \bar X_n - \mu_0 \rvert \geqslant C) = 0.05\,,\] which gives the condition \(C \approx \frac{1.96 \cdot \sigma}{\sqrt{n}}\) (where we used that \(\bar X_n\) is Gaussian with mean \(\mu\) and variance \(\sigma^2 / n\)).
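As an illustration (with placeholder values of \(n,\) \(\sigma,\) and \(\mu_0\) that are not from the text), this threshold can be computed numerically from the Gaussian quantile function:

```python
# Sketch of the threshold computation above (n, sigma, mu_0 are placeholders):
# under H_0 the empirical mean is Gaussian with mean mu_0 and variance
# sigma^2/n, so C is the 97.5% Gaussian quantile times sigma/sqrt(n).
import numpy as np
from scipy.stats import norm

n, sigma, mu0 = 50, 2.0, 0.0
C = norm.ppf(0.975) * sigma / np.sqrt(n)   # norm.ppf(0.975) ~ 1.96

def reject(sample):
    """Rejection region D = {|empirical mean - mu_0| >= C}."""
    return abs(np.mean(sample) - mu0) >= C

print(C)
```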
Suppose now that we know neither the mean \(\mu\) nor the variance \(\sigma^2.\) We use the empirical mean \(\bar X_n\) and the empirical variance \(S_n^2\) from Example 6.9, and set \[D :=\biggl\{\frac{\lvert \bar X_n - \mu_0 \rvert}{S_n} \geqslant C\biggr\}\,.\] The law of the random variable \[T_{n-1} :=\frac{\sqrt{n}}{S_n} (\bar X_n - \mu)\] is called Student’s t distribution with \(n - 1\) degrees of freedom; its density has an explicit form that can be computed or found in the literature, which we shall not go into here. We require a test with confidence at least \(95\%,\) which means \[\mathbb{P}_{\mu_0}(D) = \mathbb{P}(\lvert T_{n-1} \rvert \geqslant C \sqrt{n}) = 0.05\,,\] so that \(C \sqrt{n}\) is the \(97.5\%\) quantile of Student’s t distribution with \(n - 1\) degrees of freedom. For \(n = 20,\) for instance, this gives \(C \approx \frac{2.093}{\sqrt{n}},\) and the rejection region for a test with confidence \(95\%\) is \[D = \biggl\{\frac{\lvert \bar X_n - \mu_0 \rvert}{S_n} \geqslant\frac{2.093}{\sqrt{n}}\biggr\}\,.\]
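As a sketch of where the constant \(2.093\) comes from (not part of the text): \(C \sqrt{n}\) is the \(97.5\%\) quantile of Student’s t distribution with \(n - 1\) degrees of freedom, which can be computed numerically. The sample size \(n = 20\) below is an assumption chosen to reproduce the quoted value, and \(S_n\) is taken with the \(1/(n-1)\) normalization, consistent with \(T_{n-1}\) having \(n - 1\) degrees of freedom.

```python
# Sketch: C * sqrt(n) is the 97.5% quantile of Student's t distribution with
# n - 1 degrees of freedom.  The value n = 20 is an assumption chosen to
# reproduce the quoted constant 2.093.
import numpy as np
from scipy.stats import t

n = 20
q = t.ppf(0.975, n - 1)   # ~2.093 for 19 degrees of freedom
C = q / np.sqrt(n)

def reject(sample, mu0):
    xbar = np.mean(sample)
    s = np.std(sample, ddof=1)        # empirical standard deviation S_n (1/(n-1) normalization)
    return abs(xbar - mu0) / s >= C   # rejection region D

print(q, C)
```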
Next, we consider testing of so-called simple, or binary, hypotheses, where \(\Theta = \{\theta_0, \theta_1\}\) consists of just two elements. The null hypothesis is \(\theta = \theta_0\) and the alternative hypothesis is \(\theta = \theta_1.\) A powerful (in fact, the most powerful, see Proposition 6.31 below) test in this situation is the Neyman-Pearson test, defined as follows. Recall the likelihood \(L\) from Definition 6.11. Define the likelihood ratio \[R(\theta_0, \theta_1 ; x_1, \dots, x_n) :=\frac{L(\theta_1; x_1, \dots, x_n)}{L(\theta_0; x_1, \dots, x_n)}\,.\] The Neyman-Pearson test is defined by the rejection region \[D = \bigl\{R(\theta_0, \theta_1 ; X_1, \dots, X_n) > C\bigr\}\,,\] where \(C\) is a constant chosen according to the desired confidence level. Intuitively, a larger value of \(R\) indicates that \(\theta_1\) is more likely than \(\theta_0,\) and hence the null hypothesis \(\theta = \theta_0\) should be more readily rejected.
A person has two coins. One is fair. For the other, the probability of obtaining heads is twice that of obtaining tails. She chooses one of the coins, tosses it 100 times and obtains 60 heads and 40 tails. Which coin did she pick?
We model this with an \(n\)-sample \(X_1, \dots, X_n\) drawn from a Bernoulli distribution with parameter \(p.\) The null hypothesis (fair coin) is \(p = 1/2\) while the alternative hypothesis (biased coin) is \(p = 2/3.\) The number of heads is \(K :=X_1 + \cdots + X_n.\) The likelihood for a sample with \(K = k\) is \[L(p, k) = p^k (1 - p)^{n - k} = (1 - p)^n \biggl(\frac{p}{1 - p}\biggr)^{k}\,.\] The likelihood ratio \(R = R(1/2, 2/3)\) is \[R(k) = \biggl(\frac{1 - 2/3}{1 - 1/2}\biggr)^n \biggl(\frac{2/3}{1/3}\biggr)^k = \biggl(\frac{2}{3}\biggr)^n \, 2^k\,.\] We use the Neyman-Pearson test with rejection region \[D = \{R(K) > C\} = \{K > C'\}\,,\] where we used that \(k \mapsto R(k)\) is monotone increasing and we set \(C' = R^{-1}(C).\)
Suppose that we want a test with a confidence of \(90\%.\) This results in the condition \[\mathbb{P}_{1/2}(K > C') = 0.1\,.\] The law of \(K\) for \(n = 100\) can be evaluated numerically or approximated by the Central Limit Theorem. This yields \(C' \approx 56.4.\) Since we observed \(K = 60 > C',\) we are in the rejection region, and hence we reject the null hypothesis. Thus we can say, with \(90\%\) confidence, that the person chose the biased coin.
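The following sketch (not part of the text) reproduces this numerical step, computing \(C'\) both exactly from the Binomial\((100, 1/2)\) law of \(K\) and via the CLT approximation, and then applying the test to the observed value \(K = 60.\)

```python
# Sketch of the numerical step above: compute the threshold C' both exactly
# from the Binomial(100, 1/2) law of K and via the CLT approximation, then
# apply the test to the observed K = 60.
import numpy as np
from scipy.stats import binom, norm

n, k_obs, risk = 100, 60, 0.10

c_exact = int(binom.ppf(1 - risk, n, 0.5))            # smallest c with P(K <= c) >= 0.9, here 56
c_clt = n / 2 + norm.ppf(1 - risk) * np.sqrt(n / 4)   # ~56.4 (K ~ N(n/2, n/4) under H_0)

print(c_exact, binom.sf(c_exact, n, 0.5))             # actual risk P(K > 56) ~ 0.097
print(c_clt, k_obs > c_clt)                           # K = 60 lies in the rejection region
```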
We conclude this chapter with a remarkable theoretical result, known as the Neyman-Pearson lemma, which states that, within the context of simple hypothesis testing, the Neyman-Pearson test is the most powerful test at any given confidence level.
The Neyman-Pearson test is the most powerful test at any given confidence level. More precisely, let \(\Theta = \{\theta_0, \theta_1\}\) and suppose that \(\mathbb{P}_\theta\) has a density for each \(\theta \in \Theta.\) Let \(0 < \alpha < 1.\) Suppose that the constant \(C\) in the Neyman-Pearson rejection region \[D = \{R(\theta_0, \theta_1) > C\}\] is chosen so that the risk \(\mathbb{P}_{\theta_0}(D) = \alpha.\) Then for any rejection region \(B\) with risk \(\mathbb{P}_{\theta_0}(B) = \alpha\) we have \[\mathbb{P}_{\theta_1}(B) \leqslant\mathbb{P}_{\theta_1}(D)\] with a strict inequality if \(\mathbb{P}_{\theta_1}(D \setminus B) > 0.\)
Proof. We use the notation \(x = (x_1, \dots, x_n)\) and \(\mathbb{P}_\theta (\mathrm dx_1) \cdots \mathbb{P}_\theta(\mathrm dx_n) = f_\theta(x) \, \mathrm dx.\) From \(\mathbb{P}_{\theta_0}(D) = \mathbb{P}_{\theta_0}(B) = \alpha\) we get \[\int_{D \setminus B} f_{\theta_0}(x) \, \mathrm dx = \alpha - \int_{D \cap B} f_{\theta_0}(x) \, \mathrm dx = \int_{B \setminus D} f_{\theta_0}(x) \, \mathrm dx\,.\] Since \(D \setminus B \subset D\) and \(B \setminus D \subset D^c,\) the definition of the event \(D\) yields \(f_{\theta_1} > C f_{\theta_0}\) on \(D \setminus B\) and \(f_{\theta_1} \leqslant C f_{\theta_0}\) on \(B \setminus D,\) and therefore \[\int_{D \setminus B} f_{\theta_1}(x) \, \mathrm dx \geqslant C \int_{D \setminus B} f_{\theta_0}(x) \, \mathrm dx = C \int_{B \setminus D} f_{\theta_0}(x) \, \mathrm dx \geqslant\int_{B \setminus D} f_{\theta_1}(x) \, \mathrm dx\,,\] where the first inequality is strict if \(\mathbb{P}_{\theta_1}(D \setminus B) > 0.\) Thus we conclude that \[\mathbb{P}_{\theta_1}(D) = \mathbb{P}_{\theta_1}(D \setminus B) + \mathbb{P}_{\theta_1}(D \cap B) \geqslant\mathbb{P}_{\theta_1}(B \setminus D) + \mathbb{P}_{\theta_1}(D \cap B) = \mathbb{P}_{\theta_1}(B)\,,\] as claimed.
This is the end of this course. I hope you enjoyed it!
Now you know all of the fundamentals of probability. If you liked what you learned (as I hope!), you are fully equipped to go on and learn about more advanced topics such as martingales and Brownian motion.
7 The strong law of large numbers
In this appendix we prove the strong law of large numbers under the optimal condition on the random variables. Recall that we already proved the weak law of large numbers in Proposition 3.24, which had the deficiency of establishing convergence in \(L^2\) instead of almost surely; this led to the issues explained in Remark 3.26. The issue was remedied by the strong law of large numbers in \(L^4\) of Proposition 3.27. But the latter result still required the random variables \(X_n\) to lie in \(L^4\) instead of in the optimal space, \(L^1.\) (This space is optimal since we clearly want \(\mathbb{E}[X_1]\) to be well-defined and finite.)
We shall need the following tool from measure theory, which is a consequence of the monotone class lemma.
For each \(i = 1, \dots, n,\) let \(\mathcal C_i \subset \mathcal A\) be a collection of events that is stable under finite intersections and contains \(\Omega.\) Define \(\mathcal B_i :=\sigma(\mathcal C_i).\) If for all \(C_1 \in \mathcal C_1, \dots, C_n \in \mathcal C_n\) we have \[\mathbb{P}(C_1 \cap \cdots \cap C_n) = \mathbb{P}(C_1) \cdots \mathbb{P}(C_n)\,,\] then \(\mathcal B_1, \dots, \mathcal B_n\) are independent.
Proof. We use the monotone class lemma from Section 3.2, whose notation we also adopt. Fix \(C_2 \in \mathcal C_2, \dots, C_n \in \mathcal C_n,\) and define \[\mathcal M_1 :=\bigl\{B_1 \in \mathcal B_1 \,\colon\mathbb{P}(B_1 \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(B_1)\, \mathbb{P}(C_2) \cdots \mathbb{P}(C_n)\bigr\}\,.\] By assumption, \(\mathcal C_1 \subset \mathcal M_1.\) Moreover, it is easy to verify that \(\mathcal M_1\) is a monotone class. Hence, \[\mathcal M_1 \supset \mathcal M(\mathcal C_1) = \sigma(\mathcal C_1) = \mathcal B_1\,,\] where the second step follows from the monotone class lemma (Proposition 3.8). We conclude: for all \(B_1 \in \mathcal B_1, C_2\in \mathcal C_2, \dots, C_n \in \mathcal C_n,\) we have \[\mathbb{P}(B_1 \cap C_2 \cap \cdots \cap C_n) = \mathbb{P}(B_1)\, \mathbb{P}(C_2) \cdots \mathbb{P}(C_n)\,.\]
We now continue in this fashion, moving on to the second argument. More precisely, fix \(B_1 \in \mathcal B_1, C_3 \in \mathcal C_3, \dots, C_n \in \mathcal C_n\) and define \[\mathcal M_2 :=\bigl\{B_2 \in \mathcal B_2 \,\colon\mathbb{P}(B_1 \cap B_2 \cap C_3 \cap \cdots \cap C_n) = \mathbb{P}(B_1) \, \mathbb{P}(B_2) \, \mathbb{P}(C_3) \cdots \mathbb{P}(C_n)\bigr\}\,.\] As above, it is easy to see that \(\mathcal M_2\) is a monotone class, and by the previous step we know that \(\mathcal C_2 \subset \mathcal M_2.\) By the monotone class lemma, we find that \(\mathcal M_2 \supset \mathcal B_2.\) By repeating this procedure \(n\) times we arrive at the claim.
Our proof of the strong law of large numbers rests on the following fundamental result. To state it, let \((X_n)_{n \geqslant 1}\) be a family of random variables. For \(n \geqslant 1\) we define the \(\sigma\)-algebra \[\mathcal B_n :=\sigma(X_n, X_{n+1}, \dots) = \sigma \biggl(\bigcup_{k \geqslant n} \sigma(X_k)\biggr)\] as well as the tail \(\sigma\)-algebra \[\mathcal B_\infty :=\bigcap_{n \geqslant 1} \mathcal B_n\,.\]
Let \((X_n)_{n \geqslant 1}\) be independent random variables. Then \(\mathcal B_\infty\) satisfies a zero-one law in the sense that any tail event \(B \in \mathcal B_\infty\) satisfies \(\mathbb{P}(B) = 0\) or \(\mathbb{P}(B) = 1.\)
It is important to understand the meaning of the objects in Proposition 7.2. The \(\sigma\)-algebra \(\mathcal B_n\) contains all the information from time \(n\) onwards, i.e. it discards all information up to time \(n - 1.\) The tail events in \(\mathcal B_\infty\) are precisely those whose occurrence can be determined if an arbitrarily large but finite initial segment of the variables \(X_k\) is discarded. For example \(\{\sup_n X_n \leqslant 1\}\) is not in \(\mathcal B_\infty,\) since it clearly depends on all random variables \(X_n.\) But \(\{\limsup_n X_n \leqslant 1\}\) is in \(\mathcal B_\infty,\) since it depends only on the “distant future”, i.e. changing any finite number of variables \(X_n\) does not change its occurrence.
Kolmogorov’s zero-one law is remarkable: it states that any tail event occurs almost surely or its complement occurs almost surely. As we shall see, the tail \(\sigma\)-algebra is rich (i.e. large) enough to make this statement very useful.
Proof of Proposition 7.2. Define \(\mathcal D_n :=\sigma(X_1, \dots, X_n)\) (the \(\sigma\)-algebra containing the information up to time \(n\)). Then we claim that \(\mathcal D_n\) and \(\mathcal B_{n+1}\) are independent. This sounds intuitively obvious, as \(\mathcal D_n\) contains information up to time \(n,\) and \(\mathcal B_{n+1}\) information starting from time \(n+1.\) For a rigorous proof, we proceed in two steps.
For any \(k \geqslant n+1\) we define \(\mathcal B_{n+1,k} :=\sigma(X_{n+1}, \dots, X_k).\) Define the collections \[\begin{aligned} \mathcal C_1 &:=\bigl\{B_1 \cap \cdots \cap B_n \,\colon B_i \in \sigma(X_i) \, \forall i\bigr\}\,, \\ \mathcal C_2 &:=\bigl\{B_{n+1} \cap \cdots \cap B_k \,\colon B_i \in \sigma(X_i) \, \forall i\bigr\}\,. \end{aligned}\] Clearly, these collections are stable under finite intersections and contain \(\Omega,\) and \(\mathcal D_n = \sigma(\mathcal C_1)\) and \(\mathcal B_{n+1,k} = \sigma(\mathcal C_2).\) By Lemma 7.1 and independence of the random variables \((X_n),\) we therefore conclude that \(\mathcal D_n\) and \(\mathcal B_{n+1,k}\) are independent.
Define the collections \(\mathcal C_1 :=\mathcal D_n\) and \(\mathcal C_2 :=\bigcup_{k \geqslant n+1} \mathcal B_{n+1,k},\) which are clearly stable under finite intersections and contain \(\Omega.\) Moreover, \(\sigma(\mathcal C_1) = \mathcal D_n\) and \(\sigma(\mathcal C_2) = \mathcal B_{n+1}.\) By the previous step and Lemma 7.1, we conclude that \(\mathcal D_n\) and \(\mathcal B_{n+1}\) are independent, as desired.
Next, choose \(\mathcal C_1 :=\bigcup_{n \geqslant 1} \mathcal D_n\) and \(\mathcal C_2 :=\mathcal B_\infty.\) Since \(\mathcal D_n\) and \(\mathcal B_{n+1}\) are independent for all \(n,\) we conclude that \(\mathbb{P}(C_1 \cap C_2) = \mathbb{P}(C_1) \, \mathbb{P}(C_2)\) for all \(C_1 \in \mathcal C_1\) and \(C_2 \in \mathcal C_2.\) By Lemma 7.1, we deduce that \(\sigma(\mathcal C_1) = \sigma(X_1, X_2, \dots) = \mathcal B_1\) and \(\mathcal B_\infty\) are independent. Since \(\mathcal B_\infty \subset \mathcal B_1,\) we conclude that \(\mathcal B_\infty\) is independent of itself! This means that any tail event \(B \in \mathcal B_\infty\) satisfies \(\mathbb{P}(B) = \mathbb{P}(B \cap B) = \mathbb{P}(B)^2,\) from which the zero-one law follows.
At first sight, this proof seems quite strange. It is in fact nothing but a careful justification of a simple fact: \(\mathcal B_\infty\) is independent of itself. Since we are working with rather abstract \(\sigma\)-algebras, it is important to proceed slowly and carefully, as we tried to do above. The zero-one law has deep implications in probability. The strong law of large numbers, which we are about to state and prove, is one. The following remark is another one.
Let \((X_n)_{n \geqslant 1}\) be independent random variables. Clearly, \[X_+ :=\limsup_{k \to \infty} \frac{1}{k} (X_1 + \cdots + X_k) = \limsup_{k \to \infty} \frac{1}{k} (X_n + \cdots + X_k)\] for any \(n \in \mathbb{N}^*.\) Hence, \(X_+\) is \(\mathcal B_n\)-measurable for all \(n \in \mathbb{N}^*,\) which implies that \(X_+\) is \(\mathcal B_\infty\)-measurable. The same holds for \(X_-\) where \(\limsup\) is replaced with \(\liminf.\) In particular, the event \[\biggl\{\frac{1}{k}(X_1 + \cdots + X_k) \text{ converges}\biggr\} = \{X_- = X_+\}\] is \(\mathcal B_\infty\)-measurable, and hence has either probability \(1\) or \(0.\) In the former case, the limiting random variable \(X_- = X_+\) is \(\mathcal B_\infty\)-measurable, and it is therefore almost surely constant (exercise). In summary: averages of independent random variables either diverge almost surely or converge almost surely to a constant.
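As an illustration of this dichotomy (a simulation sketch with arbitrary parameters, not from the text), consider standard Cauchy variables, which are not in \(L^1\): the average \(\frac{1}{n}(X_1 + \cdots + X_n)\) is again standard Cauchy for every \(n,\) so the running averages never settle down to a constant.

```python
# Sketch (arbitrary parameters): running averages of standard Cauchy variables.
# The Cauchy law is not in L^1, and the average of n standard Cauchy variables
# is again standard Cauchy, so the averages keep fluctuating instead of
# settling at a constant.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_cauchy(10**6)
running_avg = np.cumsum(x) / np.arange(1, x.size + 1)
for m in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(m, running_avg[m - 1])
```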
We can now state and prove the strong law of large numbers.
Let \((X_n)_{n \geqslant 1}\) be independent random variables in \(L^1\) with the same law. Then \[\frac{1}{n} (X_1 + \cdots + X_n) \overset{\text{a.s.}}{\longrightarrow} \mathbb{E}[X_1]\,.\]
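Before turning to the proof, here is a small numerical illustration (a simulation sketch with arbitrary parameters, not from the text): Pareto variables with tail index \(1.5\) lie in \(L^1\) but have infinite variance, so they are covered by this theorem although not by the \(L^4\) strong law of Proposition 3.27; their running averages nevertheless approach the mean \(3.\)

```python
# Sketch (arbitrary parameters): running averages of Pareto variables with tail
# index 1.5.  These are in L^1 (mean 3) but have infinite variance, so this
# theorem applies while the L^4 strong law of Proposition 3.27 does not.
import numpy as np

rng = np.random.default_rng(2)
x = rng.pareto(1.5, size=10**6) + 1.0   # classical Pareto on [1, infinity), mean 1.5/0.5 = 3
running_avg = np.cumsum(x) / np.arange(1, x.size + 1)
for m in (10**2, 10**3, 10**4, 10**5, 10**6):
    print(m, running_avg[m - 1])        # drifts toward E[X_1] = 3
```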
Proof. Let \(S_n :=X_1 + \cdots + X_n\) and \(S_0 :=0.\) Let \(a > \mathbb{E}[X_1]\) and define \(M :=\sup_{n \in \mathbb{N}} (S_n - na).\) Thus, \(M\) is a random variable with values in \([0,\infty].\) The core of the proof is to show that \[\tag{7.1} M < \infty \quad \text{a.s.}\]
Let us suppose first that (7.1) has been proved and use it to conclude the proof of the strong law of large numbers. By definition of \(M\) we have \(S_n \leqslant na + M\) for all \(n\) and hence (7.1) implies, for all \(a > \mathbb{E}[X_1],\) \[\limsup_{n} \frac{S_n}{n} \leqslant a \quad \text{a.s.}\] This implies that \[\tag{7.2} \limsup_{n} \frac{S_n}{n} \leqslant\mathbb{E}[X_1] \quad \text{a.s.}\,,\] since \[\mathbb{P}\biggl(\limsup_{n} \frac{S_n}{n} \leqslant\mathbb{E}[X_1]\biggr) = \mathbb{P}\biggl(\bigcap_{k \in \mathbb{N}^*} \biggl\{\limsup_{n} \frac{S_n}{n} \leqslant\mathbb{E}[X_1] + \frac{1}{k}\biggr\}\biggr) = 1\,,\] where we used that a countable intersection of events of probability one has probability one.
Replacing \(X_n\) with \(-X_n\) we obtain \[\tag{7.3} \liminf_{n} \frac{S_n}{n} \geqslant\mathbb{E}[X_1] \quad \text{a.s.}\] From (7.2) and (7.3) we conclude the strong law of large numbers.
What remains, therefore, is to prove (7.1). First, we claim that \(\{M < \infty\} \in \mathcal B_\infty.\) Indeed, for all \(k \geqslant 0\) we have \[\{M < \infty\} = \biggl\{\sup_{n \geqslant 0} (S_n - na) < \infty\biggr\} = \biggl\{\sup_{n \geqslant k} \bigl((S_n - S_k) - na\bigr) < \infty\biggr\} \in \sigma(X_{k+1}, X_{k+2}, \dots)\,,\] since \(S_n - S_k = X_{k+1} + X_{k+2} + \cdots + X_n.\) By the zero-one law, Proposition 7.2, to prove (7.1), it therefore suffices to prove that \(\mathbb{P}(M = \infty) < 1.\)
We proceed by contradiction and suppose that \(\mathbb{P}(M = \infty) = 1.\) For all \(k \in \mathbb{N}\) we define \[M_k :=\sup_{0 \leqslant n \leqslant k} (S_n - na)\,, \qquad M_k' :=\sup_{0 \leqslant n \leqslant k} (S_{n+1} - S_1 - na)\,.\] Since \(S_n = X_1 + \cdots + X_n\) and \(S_{n+1} - S_1 = X_2 + \cdots + X_{n+1},\) we conclude that \(M_k \overset{\mathrm d}{=}M_k'\) (recall (2.5)). Moreover, \(M_k\) and \(M'_k\) are increasing sequences that converge from below to their limits \(M\) and \(M',\) where \(M' :=\sup_{n \geqslant 0} (S_{n+1} - S_1 - na).\) By \(M_k \overset{\mathrm d}{=}M_k',\) we conclude that \(M \overset{\mathrm d}{=}M',\) since \[\mathbb{P}(M' \leqslant x) = \lim_{k \to \infty} \mathbb{P}(M'_k \leqslant x) = \lim_{k \to \infty} \mathbb{P}(M_k \leqslant x) = \mathbb{P}(M \leqslant x)\,.\] Moreover, \[\begin{aligned} M_{k+1} &= \sup \biggl\{0, \sup_{1 \leqslant n \leqslant k+1} (S_n - na)\biggr\} \\ &= \sup \biggl\{0, \sup_{0 \leqslant n \leqslant k} (S_{n+1} - (n+1)a)\biggr\} \\ &= \sup\{0, M'_k + X_1 - a\} \\ &= M'_k - \inf\{a - X_1, M'_k\}\,. \end{aligned}\] Hence, \[\tag{7.4} \mathbb{E}[\inf\{a - X_1, M'_k\}] = \mathbb{E}[M'_k] - \mathbb{E}[M_{k+1}] = \mathbb{E}[M_k] - \mathbb{E}[M_{k+1}] \leqslant 0\,,\] since the sequence \((M_k)\) is nondecreasing. Moreover, since \(M'_k \geqslant 0\) we have \[\lvert \inf\{a - X_1, M'_k\} \rvert \leqslant\lvert a - X_1 \rvert\] for all \(k,\) so that we may apply dominated convergence to (7.4) to get \[\mathbb{E}[\inf\{a - X_1, M'\}] \leqslant 0\,.\]
Now if \(\mathbb{P}(M = \infty) = 1\) then also \(\mathbb{P}(M' = \infty) = 1\) and hence \(\inf\{a - X_1, M'\} = a - X_1.\) But \[\mathbb{E}[a - X_1] > 0\] by assumption. This is the desired contradiction.