2.4 Expectation

Let \(X\) be a random variable with values in \(\mathbb{R}.\) We would like to determine a number that represents the typical, or mean, value of \(X.\) For example, consider a random variable \(X\) that is equal to \(a\) with probability \(p\) and to \(b\) with probability \(1 - p.\) In other words, \[\mathbb{P}_X = p \delta_a + (1-p) \delta_b\,.\] In school we learn that the mean value of \(X\) is \[p a + (1 - p) b = \int_{\mathbb{R}} x \, \mathbb{P}_X(\mathrm dx) = \int_\Omega X(\omega) \, \mathbb{P}(\mathrm d\omega)\,,\] where the second identity follows by definition of \(\mathbb{P}_X\) (Definition 2.22). Note that this value depends only on the law \(\mathbb{P}_X,\) and not on the underlying probability space on which \(X\) is defined.

Definition 2.24

Let \(X\) be a random variable with values in \(\mathbb{R}.\) The expectation of \(X\) is \[\mathbb{E}[X] :=\int X(\omega) \, \mathbb{P}(\mathrm d\omega)\,,\] which is well-defined if \(X \geqslant 0\) (in which case \(\mathbb{E}[X] \in [0,\infty]\)) or if \(\mathbb{E}[\lvert X \rvert] < \infty\) (in which case \(X\) is called integrable).

If \(X = (X_1, \dots, X_d)\) is a random variable with values in \(\mathbb{R}^d,\) then we define \[\mathbb{E}[X] :=(\mathbb{E}[X_1], \dots, \mathbb{E}[X_d])\,.\]

The following result states how to compute the expectation of a function of a random variable.

Proposition 2.25

Let \(X\) be a random variable with values in \((E, \mathcal E),\) and let \(f \,\colon E \to \mathbb{R}\cup \{\infty\}\) be measurable. Then \[\mathbb{E}[f(X)] = \int f \, \mathrm d\mathbb{P}_X\] provided that \(f \geqslant 0\) or \(\mathbb{E}[\lvert f(X) \rvert] < \infty.\)

Proof. The proof is an archetypal argument from measure theory, which we recall here. See also the discussion after Definition 1.9. Consider first the case where \(f\) is an indicator function, \(f = \mathbf 1_{B}\) for \(B \in \mathcal E.\) Then \[\mathbb{E}[f(X)] = \mathbb{E}[\mathbf 1_{B}(X)] = \mathbb{P}(X \in B) = \mathbb{P}_X(B) = \int_B \mathrm d\mathbb{P}_X = \int \mathbf 1_{B} \, \mathrm d\mathbb{P}_X = \int f \, \mathrm d\mathbb{P}_X\,,\] as desired.

Moreover, by linearity of both integrals (the integral over \(\mathbb{P}\) defining \(\mathbb{E}[\cdot]\) and the integral over \(\mathbb{P}_X\)), the claim extends to any nonnegative simple function \(f.\)

Next, let \(f\) be an arbitrary nonnegative measurable function. Choose a sequence of nonnegative simple functions \(f_n\) that converge to \(f\) monotonically from below. Then by using the claim for simple functions, as well as the monotone convergence theorem twice (since \((f_n)\) and \((f_n(X))\) are pointwise nondecreasing sequences on \(E\) and \(\Omega,\) respectively), we obtain \[\mathbb{E}[f(X)] = \lim_{n \to \infty} \mathbb{E}[f_n(X)] = \lim_{n \to \infty} \int f_n \, \mathrm d\mathbb{P}_X = \int f \, \mathrm d\mathbb{P}_X\,.\] This concludes the proof for \(f \geqslant 0.\) If \(f\) is a general integrable function, we split \(f = f_+ - f_-\) into its positive and negative parts, apply the result for nonnegative functions to \(f_+\) and \(f_-\) separately, and conclude by linearity.

Remark 2.26

The proof used a trick that is so trivial that it may go unnoticed: the probability of an event \(A\) can be expressed as the expectation of its indicator function, \[\mathbb{P}(A) = \mathbb{E}[\mathbf 1_{A}]\,.\] This trick, as simple as it seems, is extremely useful and consequently ubiquitous in probability. We shall use it repeatedly in this class.

Example 2.27 • Example 2.8 continued

Let us compute the expectation of the sum of two throws of a die: \[\mathbb{E}[X] = \frac{1}{36} \sum_{i,j = 1}^6 (i+j) = \frac{1}{36} \biggl(6 \sum_{i = 1}^6 i + 6 \sum_{j = 1}^6 j\biggr) = 7\,.\]
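This computation is easily checked by brute-force enumeration. The following is a minimal Python sketch (using only the standard library; exact arithmetic via the fractions module) that averages \(i + j\) over all \(36\) equally likely outcomes:

```python
from fractions import Fraction

# Average i + j over all 36 equally likely outcomes of two dice throws,
# each carrying probability 1/36.
expectation = sum(Fraction(i + j, 36) for i in range(1, 7) for j in range(1, 7))
print(expectation)  # 7
```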

Example 2.28 • Example 2.9 continued

Let us compute the expected number of throws until we obtain a \(6\): \[\mathbb{E}[X] = \sum_{n = 1}^\infty n \, \mathbb{P}(X = n) = \frac{1}{6} \sum_{n = 1}^\infty n \, \biggl(\frac{5}{6}\biggr)^{n-1} = \frac{1}{6} \frac{1}{(1/6)^2} = 6\,,\] where we used the identity \(\sum_{n=1}^\infty n x^{n-1} = \frac{1}{(1-x)^2},\) obtained by differentiating the geometric series \(\sum_{n=0}^\infty x^n = \frac{1}{1-x}\) term by term.
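This value is easily corroborated by simulation. A minimal Python sketch using only the standard library (the function name is our choice):

```python
import random

def throws_until_six(rng: random.Random) -> int:
    """Number of fair die throws needed to obtain the first 6."""
    n = 1
    while rng.randint(1, 6) != 6:
        n += 1
    return n

rng = random.Random(0)
samples = [throws_until_six(rng) for _ in range(100_000)]
print(sum(samples) / len(samples))  # close to 6
```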

We conclude this section by introducing the notion of conditional expectation.

Definition 2.29

Let \(B\) be an event satisfying \(\mathbb{P}(B) > 0,\) and let \(X\) be a random variable. The conditional expectation of \(X\) given \(B\) is \[\mathbb{E}[X \mid B] :=\frac{\mathbb{E}[X \mathbf 1_{B}]}{\mathbb{P}(B)}\,,\] i.e. the expectation of \(X\) with respect to the probability measure \(\mathbb{P}(\cdot \mid B).\)

In particular, for any events \(A\) and \(B\) we clearly have \[\mathbb{E}[\mathbf 1_{A} \mid B] = \mathbb{P}(A \mid B)\,.\]

Example 2.30 • Example 2.8 continued

Let us compute the expectation of the sum of two throws of a die, given that each number is even. The event \(B\) on which we condition is \[B = \{2,4,6\}^2\,.\] We find \(\mathbb{P}(B) = \frac{1}{4}\) and \[\mathbb{E}[X \mathbf 1_{B}] = \frac{1}{36} \sum_{i,j \in \{2,4,6\}} (i+j) = \frac{1}{36} \biggl(3 \sum_{i \in \{2,4,6\}} i + 3 \sum_{j \in \{2,4,6\}} j\biggr) = 2\,.\] Hence, \[\mathbb{E}[X \mid B] = \frac{2}{1/4} = 8\,.\] This is larger than the unconditional expectation from Example 2.27, which is hardly surprising: conditioning a number from 1 to 6 on being even increases its average value from \(3.5\) to \(4.\)
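As before, the computation is easily checked by exact enumeration. A minimal Python sketch using the standard library's fractions module:

```python
from fractions import Fraction

# All 36 outcomes of two throws; B is the event that both throws are even.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]

def in_B(i, j):
    return i % 2 == 0 and j % 2 == 0

p = Fraction(1, 36)  # each outcome has probability 1/36
P_B = sum(p for (i, j) in outcomes if in_B(i, j))
E_X_1B = sum(p * (i + j) for (i, j) in outcomes if in_B(i, j))

print(P_B, E_X_1B, E_X_1B / P_B)  # 1/4  2  8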

2.5 The classical laws

In this section we go to the zoo. We encounter the most common and useful laws and learn their names, along with any parameters they depend on. We leave it as a simple exercise to check that each of them is indeed a probability measure (i.e. that the total measure equals one).

Discrete laws

The following laws are defined on \((E, \mathcal E)\) with \(E\) finite or countable and \(\mathcal E = \mathcal P(E).\)

Continuous laws

The following laws are defined on \((\mathbb{R}, \mathcal B(\mathbb{R})).\) They are absolutely continuous with respect to Lebesgue measure, and therefore determined by their densities \(p(x).\)
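As a concrete companion to the zoo, here is a minimal Python sketch that samples from a few classical laws; the particular laws, parameter values, and the use of numpy are our illustrative choices, not a prescribed list:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# A few classical discrete laws:
bernoulli = rng.binomial(1, 0.3, size=n)     # Bernoulli(p) with p = 0.3
binomial = rng.binomial(10, 0.3, size=n)     # binomial(n, p)
geometric = rng.geometric(1 / 6, size=n)     # geometric(p), support {1, 2, ...}
poisson = rng.poisson(2.0, size=n)           # Poisson(lambda)

# A few classical continuous laws:
uniform = rng.uniform(0.0, 1.0, size=n)      # uniform on (0, 1)
exponential = rng.exponential(2.0, size=n)   # exponential with mean 2
gaussian = rng.normal(0.0, 1.0, size=n)      # Gaussian with m = 0, sigma = 1

print(geometric.mean())  # close to 1/p = 6, cf. Example 2.28
```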

2.6 Cumulative distribution function

The law of a real-valued random variable can be fully and equivalently characterised by a function on \(\mathbb{R}.\)

Definition 2.31

Let \(X\) be a random variable with values in \(\mathbb{R}.\) We define its cumulative distribution function \(F_X \,\colon\mathbb{R}\to [0,1]\) by \[F_X(t) :=\mathbb{P}(X \leqslant t)\,.\] For brevity, sometimes we simply speak of the distribution function of \(X.\)

Proposition 2.32

If \(F = F_X\) is the distribution function of a random variable \(X,\) then

  1. \(F\) is nondecreasing;

  2. \(\lim_{t \to -\infty} F(t) = 0\) and \(\lim_{t \to \infty} F(t) = 1\);

  3. \(F\) is right-continuous, i.e. \(\lim_{s \downarrow t} F(s) = F(t)\) for all \(t \in \mathbb{R}.\)

Proof. The proof of (i) is obvious: \(\{X \leqslant s\} \subset \{X \leqslant t\}\) for \(s \leqslant t,\) and \(\mathbb{P}\) is monotone.

Let us prove (ii). It uses some basic facts from measure theory, which are reviewed in Exercise 1.1. Since \(F\) is nondecreasing by (i), it suffices to compute both limits along integer sequences. Since the events \(\{X \leqslant n\}\) are increasing, we find \[\lim_{t \to \infty} F(t) = \lim_{n \to \infty} \mathbb{P}(X \leqslant n) = \mathbb{P}\Biggl(\bigcup_{n \in \mathbb{N}} \{X \leqslant n\}\Biggr) = \mathbb{P}(\Omega) = 1\,,\] where the second step follows by \(\sigma\)-additivity of \(\mathbb{P}\) and the last step from \(\bigcup_{n \in \mathbb{N}} \{X \leqslant n\} = \Omega\) since \(X \in \mathbb{R}.\) Analogously, since the events \(\{X \leqslant-n\}\) are decreasing, we find \[\lim_{t \to -\infty} F(t) = \lim_{n \to \infty} \mathbb{P}(X \leqslant-n) = \mathbb{P}\Biggl(\bigcap_{n \in \mathbb{N}} \{X \leqslant-n\}\Biggr) = \mathbb{P}(\emptyset) = 0\,,\] where the second step follows by \(\sigma\)-additivity of \(\mathbb{P}\) and the last step from \(\bigcap_{n \in \mathbb{N}} \{X \leqslant-n\} = \emptyset\) since \(X \in \mathbb{R}.\)

To prove (iii), let \((a_n)\) be a strictly decreasing sequence tending to \(0.\) Since the events \(\{X \leqslant t + a_n\}\) are decreasing, we find \[\lim_{n \to \infty} F(t + a_n) = \lim_{n \to \infty} \mathbb{P}(X \leqslant t + a_n) = \mathbb{P}\Biggl(\bigcap_{n \in \mathbb{N}} \{X \leqslant t + a_n\}\Biggr) = \mathbb{P}(X \leqslant t) = F(t)\,,\] where the second step follows by \(\sigma\)-additivity of \(\mathbb{P}\) and the last step from \(\bigcap_{n \in \mathbb{N}} \{X \leqslant t + a_n\} = \{X \leqslant t\}.\)

As a helpful check that you understood the proof of (iii), try to see what goes wrong if you attempt to prove that \(F\) is left-continuous, a statement that is false in general.

Note that \(F_X\) is discontinuous at every atom of \(\mathbb{P}_X.\) For instance, if \(X\) is equal to a constant \(a,\) then \(\mathbb{P}_X = \delta_a\) and hence \(F_X(t) = \mathbf 1_{t \geqslant a}.\)

Conversely, it is natural to ask whether any function \(F\) satisfying the three properties (i), (ii), (iii) is the distribution function of some random variable. As the following proposition shows, the answer is yes!

This very important result is not just a theoretical curiosity; it is extremely useful. Indeed, the proof relies on an explicit construction that is of great use in both theoretical probability and applications. It is the standard algorithm for generating random variables with any given distribution. See Remark 2.34 below.

Proposition 2.33

If \(F \,\colon\mathbb{R}\to [0,1]\) satisfies (i), (ii), (iii) from Proposition 2.32, then there exists a random variable \(X\) with values in \(\mathbb{R}\) such that \(F = F_X.\)

Proof. To construct \(X,\) we have to start by constructing the probability space. We take \(\Omega = (0,1),\) \(\mathcal A = \mathcal B((0,1)),\) and \(\mathbb{P}\) to be Lebesgue measure. Then we define \[X(\omega) :=\sup \{s \in \mathbb{R}\,\colon F(s) < \omega\}\,.\] We note that \(X(\omega) \in \mathbb{R}\) for any \(\omega \in (0,1)\) by the assumption (ii). The rest of the proof consists in showing that this explicit choice satisfies \(F = F_X.\)

To that end, we shall show that for all \(t \in \mathbb{R}\) \[\tag{2.8} \{\omega \in \Omega \,\colon X(\omega) \leqslant t\} = \{\omega \in \Omega \,\colon\omega \leqslant F(t)\}\,.\] Supposing that (2.8) is true, we obtain \[F_X(t) = \mathbb{P}(X \leqslant t) = \mathbb{P}\bigl(\{\omega \in \Omega \,\colon\omega \leqslant F(t)\}\bigr) = F(t)\] by definition of Lebesgue measure, as desired.

Hence, what remains is to show (2.8), which follows from the two following observations.

  • Suppose that \(\omega \leqslant F(t).\) Then, by definition of \(X\) and of \(\sup,\) we immediately deduce that \(t \geqslant X(\omega).\)

  • Suppose that \(\omega > F(t).\) Since \(F\) is right-continuous, there exists \(\varepsilon> 0\) such that \(F(t + \varepsilon) < \omega.\) Thus, again by definition of \(X,\) we deduce that \(X(\omega) \geqslant t + \varepsilon.\) Hence, \(X(\omega) > t.\)

As a helpful check to see that you understood the proof, you can try to spot exactly where we used each of the assumptions (i), (ii), (iii) on \(F.\)

Remark 2.34

The function \(X \,\colon(0,1) \to \mathbb{R}\) constructed in the above proof is often called the (generalised) inverse of \(F,\) and denoted by \(F^{-1},\) even if \(F\) is not bijective. (The function \(F\) can be locally constant.) If \(F\) is bijective, then \(F^{-1}\) coincides with the usual inverse.

The construction of the proof is remarkable. It provides an algorithm for generating a random variable \(X\) with distribution function \(F\) starting from a random variable \(Y\) which is uniformly distributed on \((0,1)\): simply set \(X :=F^{-1}(Y).\) Thus, the problem is reduced to the generation of the special and simple random variable \(Y.\)

To be more concrete, suppose that we have a computer that generates a random floating point number \(\omega\) that is uniformly distributed in \((0,1).\) (We shall see later that such a generator can be easily constructed by taking the binary digits of \(\omega\) to be independent Bernoulli random variables with parameter \(p = 1/2.\) Thus, all that we need is a random number generator that generates such Bernoulli random variables.) We wish to generate a standard Gaussian random variable, with parameters \(m = 0\) and \(\sigma = 1.\) To that end, we define the function \[F(t) :=\frac{1}{\sqrt{2 \pi}} \int_{-\infty}^t \mathrm e^{-\frac{x^2}{2}} \, \mathrm dx\,.\] The function \(F\) is famously not an elementary function, but it can be easily computed numerically, and most software packages have an implementation of (a version of) it, often called \(\text{Erf}.\) Then the desired Gaussian random variable is \(X :=F^{-1}(\omega),\) where the right-hand side is again evaluated numerically.
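Here is a minimal Python sketch of this recipe, using only the standard library: \(F\) is evaluated via math.erf, and \(F^{-1}\) is approximated by bisection (the helper names, search interval, and iteration count are our choices, not a prescribed implementation):

```python
import math
import random

def F(t: float) -> float:
    """Standard Gaussian distribution function, expressed via erf."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def F_inverse(omega: float, lo: float = -10.0, hi: float = 10.0) -> float:
    """Approximate F^{-1}(omega) by bisection; F is increasing on [lo, hi]."""
    for _ in range(80):  # interval width 20 / 2^80, far below double precision
        mid = 0.5 * (lo + hi)
        if F(mid) < omega:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = random.Random(0)
samples = [F_inverse(rng.random()) for _ in range(10_000)]
mean = sum(samples) / len(samples)
var = sum(x * x for x in samples) / len(samples) - mean**2
print(mean, var)  # close to 0 and 1, as for a standard Gaussian
```

The same two-step recipe (evaluate \(F,\) then invert it numerically) works for any \(F\) satisfying (i), (ii), (iii), which is precisely the content of Proposition 2.33.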

2.7 The \(\sigma\)-algebra generated by a random variable

Every random variable \(X\) naturally gives rise to a \(\sigma\)-algebra, which is the smallest (i.e. coarsest) \(\sigma\)-algebra on \(\Omega\) with respect to which \(X\) is measurable. To build intuition, consider the case where \(X\) is a random variable with values in \(\{1,2,3\},\) and define the events \(A_i :=X^{-1}(\{i\})\) for \(i = 1,2,3.\) Then \(X\) is measurable with respect to the \(\sigma\)-algebra \[\begin{aligned} \mathcal B &:=\{\emptyset, A_1, A_2, A_3, A_1 \cup A_2, A_1 \cup A_3, A_2 \cup A_3, \Omega\} \\ &= \{X^{-1}(\emptyset), X^{-1}(\{1\}), X^{-1}(\{2\}), X^{-1}(\{3\}), \\ &\qquad X^{-1}(\{1,2\}), X^{-1}(\{1,3\}), X^{-1}(\{2,3\}), X^{-1}(\{1,2,3\})\}\,. \end{aligned}\] You can convince yourself that this is the smallest \(\sigma\)-algebra with respect to which \(X\) is measurable.

In some sense, \(\mathcal B\) captures the resolving power of \(X,\) but it does not contain the full information about \(X.\) For instance, the random variable \(Y = 2 X\) generates the same \(\sigma\)-algebra as \(X,\) but it is clearly different from \(X.\) However, both \(X\) and \(Y\) have the same ability to resolve the probability space \(\Omega.\) This is the basic intuition behind the following definition.

Definition 2.35

Let \(X\) be a random variable with values in a measurable space \((E, \mathcal E).\) Then the \(\sigma\)-algebra generated by \(X\) is \[\sigma(X) :=\{X^{-1}(B) \,\colon B \in \mathcal E\}\,.\]

Note that, as advertised above, this is the smallest \(\sigma\)-algebra with respect to which \(X\) is measurable. Indeed, any such \(\sigma\)-algebra must contain all sets of the form \(X^{-1}(B)\) for \(B \in \mathcal E\); moreover, the collection \(\sigma(X)\) is itself a \(\sigma\)-algebra, since taking preimages commutes with complements and countable unions.
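When \(X\) takes finitely many values, \(\sigma(X)\) can be enumerated explicitly. A minimal Python sketch (the sample space \(\Omega = \{1, \dots, 6\}\) and the particular \(X\) below are our choices for illustration):

```python
from itertools import combinations

# Omega = {1, ..., 6}; X(omega) = ceil(omega / 2) takes values in {1, 2, 3}.
Omega = range(1, 7)

def X(omega):
    return (omega + 1) // 2

# sigma(X) = { X^{-1}(B) : B a subset of {1, 2, 3} }.
values = [1, 2, 3]
sigma_X = set()
for r in range(len(values) + 1):
    for B in combinations(values, r):
        sigma_X.add(frozenset(w for w in Omega if X(w) in B))

print(len(sigma_X))  # 8: here the 2^3 preimages are pairwise distinct
for A in sorted(sigma_X, key=lambda s: (len(s), sorted(s))):
    print(sorted(A))
```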

2.8 Moments and inequalities

Definition 2.36

Let \(X\) be a random variable with values in \(\mathbb{R}\) and \(p \geqslant 1.\) The \(p\)-th moment of \(X\) is \(\mathbb{E}[X^p],\) which is well-defined under either of the following conditions:

  • \(p \in \mathbb{N}^*\) and \(\mathbb{E}[\lvert X \rvert^p] < \infty\);

  • \(X \geqslant 0.\)

In probability, we say that a property \(P(\omega)\) depending on the realisation \(\omega\) holds almost surely (rather than almost everywhere, as in measure theory) if \(\mathbb{P}(\{\omega \,\colon P(\omega)\ \text{holds}\}) = 1.\) We often abbreviate almost surely as a.s.

We use the following definition from measure theory (see the course Calculus II).

Definition 2.37

For \(p \in [1,\infty],\) we denote by \(L^p \equiv L^p(\Omega, \mathcal A, \mathbb{P})\) the usual \(L^p\)-space with norm denoted by \(\lVert X \rVert_p.\)

It might be helpful to do a quick review of measure theory to recall how these spaces are defined. As in measure theory, there is a technical annoyance, which arises from the need to identify random variables that are almost surely equal.

Thus an element of \(L^p\) is an equivalence class of random variables. Throughout the following, and in accordance with the literature, we usually skirt around this issue by abusing notation and identifying an element of \(L^p\) with a representative of its class. This convention is consistent provided that all operations performed on such representatives do not depend on the choice of the representative within its class. It is always good to keep the precise definition in mind, as this subtlety is sometimes important.

The following result was proved in the course Calculus II.

Proposition 2.38

For each \(p \in [1,\infty],\) the space \(L^p(\Omega, \mathcal A, \mathbb{P})\) is a Banach space.

The following inequality is the most important inequality in all of analysis.

Proposition 2.39 • Hölder’s inequality

Let \(p,q \in [1,\infty]\) satisfy \(\frac{1}{p} + \frac{1}{q} = 1\) (with the convention \(\frac{1}{\infty} = 0\)). Then for any random variables \(X,Y\) we have \[\lVert XY \rVert_1 \leqslant\lVert X \rVert_p \lVert Y \rVert_q\,.\]

Note that Propositions 2.38 and 2.39 are true for any measure space, not just for a probability space.

Let us list some obvious but important special cases of Hölder’s inequality:

  1. \(\lVert X \rVert_p \leqslant\lVert X \rVert_q\) if \(1 \leqslant p \leqslant q.\)

  2. \(\mathbb{E}[\lvert XY \rvert] \leqslant\lVert X \rVert_2 \lVert Y \rVert_2\) (Cauchy-Schwarz inequality).

  3. \(\mathbb{E}[\lvert X \rvert]^2 \leqslant\mathbb{E}[X^2].\)

Note that (ii) is true for any measure space, while (i) and (iii) are only true for a probability space.
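As a sketch (assuming numpy), one can sanity-check (i) and (ii) on sampled data; note that the empirical distribution of a sample is itself a probability measure, so these inequalities hold exactly, not just approximately, for the empirical norms:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)
Y = rng.exponential(size=100_000)

def norm(Z, p):
    """Empirical L^p norm, E[|Z|^p]^(1/p) under the empirical measure."""
    return (np.abs(Z) ** p).mean() ** (1 / p)

print(norm(X, 1) <= norm(X, 2) <= norm(X, 4))           # (i): monotone in p
print(np.abs(X * Y).mean() <= norm(X, 2) * norm(Y, 2))  # (ii): Cauchy-Schwarz
```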

Definition 2.40

Let \(X \in L^2.\) The variance of \(X\) is \[\mathop{\mathrm{Var}}(X) :=\mathbb{E}[(X - \mathbb{E}[X])^2]\] and its standard deviation is \(\sigma_X :=\sqrt{\mathop{\mathrm{Var}}(X)}.\)

Just as the expectation measures the typical mean value of \(X,\) the variance measures the typical spread of \(X\) around its mean value. It is important to realise that the variance is not the only quantity that quantifies this spread; it is merely the most convenient and the most popular one. For example, another quantity that measures the spread is \(\mathbb{E}[\lvert X - \mathbb{E}[X] \rvert]\); as we shall see in the exercises, this quantity has advantages and disadvantages compared with the variance, and it is sometimes used in statistics, where it is closely related to the median of \(X\) (see the exercises).

Remark 2.41

The following observations follow immediately from the definition of the variance.

  1. \(\mathop{\mathrm{Var}}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2.\)

  2. For all \(a \in \mathbb{R}\) we have \(\mathbb{E}[(X - a)^2] = \mathop{\mathrm{Var}}(X) + (\mathbb{E}[X] - a)^2,\) and hence \[\mathop{\mathrm{Var}}(X) = \inf_{a \in \mathbb{R}} \mathbb{E}[(X - a)^2]\,.\] This gives another, so-called variational, interpretation of the variance; a numerical check is sketched after this list.

  3. \(\mathop{\mathrm{Var}}(X) = 0\) if and only if \(X\) is almost surely constant.
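Here is the numerical check of (ii) announced above: a minimal numpy sketch (the choice of law and of the grid is ours) confirming that \(a \mapsto \mathbb{E}[(X - a)^2]\) is minimised at \(a = \mathbb{E}[X],\) with minimum \(\mathop{\mathrm{Var}}(X)\):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(scale=2.0, size=100_000)  # E[X] = 2, Var(X) = 4

a_grid = np.linspace(X.mean() - 3.0, X.mean() + 3.0, 601)
msd = np.array([((X - a) ** 2).mean() for a in a_grid])

print(a_grid[msd.argmin()])  # close to E[X] = 2
print(msd.min(), X.var())    # both close to Var(X) = 4
```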

Next we state the most important inequality in probability, which is traditionally associated with at least the names of Bienaymé, Chebyshev, and Markov. We shall call it Chebyshev’s inequality, the name by which it is most commonly known, for historical reasons that we do not go into here.

Proposition 2.42 • Chebyshev

Let \(f \,\colon\mathbb{R}\to [0,\infty)\) be nondecreasing and \(X\) a random variable. Then for all \(a \in \mathbb{R}\) with \(f(a) > 0\) we have \[\mathbb{P}(X \geqslant a) \leqslant\frac{\mathbb{E}[f(X)]}{f(a)}\,.\]

Proof. Since \(f\) is nondecreasing, on the event \(X \geqslant a\) we have \(f(X) \geqslant f(a).\) Thus, \[\mathbb{P}(X \geqslant a) = \mathbb{E}[\mathbf 1_{X \geqslant a}] \leqslant\mathbb{E}\biggl[\mathbf 1_{X \geqslant a} \frac{f(X)}{f(a)}\biggr] \leqslant\mathbb{E}\biggl[\frac{f(X)}{f(a)}\biggr]\,,\] as claimed.

Here are some important and famous special cases of Chebyshev’s inequality:

  1. If \(X \geqslant 0\) and \(a > 0\) then \(\mathbb{P}(X \geqslant a) \leqslant\frac{\mathbb{E}[X]}{a}\) (often called Markov’s inequality).

  2. If \(X \in L^2\) and \(a > 0\) then \[\mathbb{P}(\lvert X - \mathbb{E}[X] \rvert \geqslant a) \leqslant\frac{\mathop{\mathrm{Var}}(X)}{a^2}\] (often simply called Chebyshev’s inequality).

  3. \(\mathbb{P}(X \geqslant a) \leqslant\mathrm e^{-t a} \mathbb{E}[\mathrm e^{tX}]\) for any \(t > 0\) (often called Chernoff’s inequality). Since this inequality holds for any \(t > 0,\) one can even take the infimum over \(t\) to deduce that \(\mathbb{P}(X \geqslant a) \leqslant\mathrm e^{-I(a)},\) where \[I(a) :=\sup_{t > 0} \{t a - \log \mathbb{E}[\mathrm e^{tX}]\}\,.\] This estimate is often very sharp, and it plays a fundamental role in the so-called theory of large deviations and in statistical mechanics, both of which go beyond the scope of this course. A numerical comparison of these bounds is sketched after this list.
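Here is the comparison announced above: a minimal numpy sketch for a standard Gaussian, for which \(\log \mathbb{E}[\mathrm e^{tX}] = t^2/2\) and hence \(I(a) = a^2/2\) for \(a > 0\) (the threshold and sample size are our choices):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=1_000_000)  # standard Gaussian: E[X] = 0, Var(X) = 1
a = 2.0

tail = (X >= a).mean()          # Monte Carlo estimate of P(X >= a)
chebyshev = X.var() / a**2      # bound on P(|X - E[X]| >= a)
chernoff = np.exp(-(a**2) / 2)  # e^{-I(a)} with I(a) = a^2 / 2

print(tail)       # about 0.023
print(chebyshev)  # 0.25: correct but crude
print(chernoff)   # about 0.135: much sharper for large a
```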

Finally, the notion of variance can be generalised to the covariance of several random variables, which roughly measures how strongly they tend to fluctuate jointly.

Definition 2.43

For \(X,Y \in L^2\) define the covariance of \(X\) and \(Y\) as \[\mathop{\mathrm{Cov}}(X,Y) :=\mathbb{E}\bigl[(X - \mathbb{E}[X]) (Y - \mathbb{E}[Y])\bigr] = \mathbb{E}[XY] - \mathbb{E}[X] \mathbb{E}[Y]\,.\] For a random vector \(X = (X_1, \dots, X_d)\) with values in \(\mathbb{R}^d\) with \(X_i \in L^2\) for all \(i = 1, \dots, d,\) we define the \(d \times d\) covariance matrix \[\mathop{\mathrm{Cov}}(X) :=(\mathop{\mathrm{Cov}}(X_i, X_j))_{i,j = 1}^d\,.\]
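A minimal numpy sketch (the two-dimensional construction below is our illustrative choice) that builds correlated coordinates and estimates their covariance matrix empirically:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# A 2-dimensional random vector with correlated coordinates:
X1 = rng.normal(size=n)
X2 = 0.5 * X1 + rng.normal(size=n)  # Cov(X1, X2) = 0.5, Var(X2) = 1.25

X = np.stack([X1, X2])  # shape (2, n): each row is one coordinate
print(np.cov(X))        # close to [[1, 0.5], [0.5, 1.25]]
```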

The covariance matrix of a random vector is one of the most fundamental objects of study in high-dimensional statistics and machine learning. We shall discuss some of its properties in the exercises.
