Preface
These are lecture notes for a one-semester course on probability theory. They are meant to be fully self-contained, assuming a basic knowledge of measure theory (which is reviewed briefly below). For further reading, I recommend the following three standard references, on which these lecture notes are in part based:
Rick Durrett: Probability: Theory and Examples.
Jean Jacod and Philip Protter: Probability Essentials (Springer, 2004).
Jean-François Le Gall: Intégration, probabilités et processus aléatoires (in French).
There are of course many other excellent books on the subject.
These notes are liable (i.e. virtually certain) to contain typos. If you find any, please make sure you tell me!
Before getting into the subject proper, in this short preface we give a very brief overview of the subject’s history and of its relation to the natural sciences, with which it has always had a close interaction. This is meant only for the curious reader, and does not constitute a part of the course itself.
Probability is the study of uncertain events – events whose outcome cannot be predicted with certainty. Examples of such events include
- (i) I obtain heads when I flip a coin;
- (ii) it rains in Brig tomorrow;
- (iii) my kitchen light breaks in the next six months.
The classical view of how uncertainty arises in nature is based on nineteenth-century physics (Newtonian mechanics and Maxwell’s electrodynamics), where the state of a physical system at any time is a deterministic function of its initial state. In principle, therefore, the future state of any system is fully predictable, provided we have precise enough information about its current state. From this point of view, the uncertainty of a future event is simply an expression of a lack of knowledge about the present. In reality, however, this point of view is essentially useless for most systems of interest. This is because the complexity of the system and the sensitive dependence on the initial state mean that the required precision in the knowledge of the initial state is not achievable by any conceivable means. A famous example is the impossibility of predicting the weather more than two weeks into the future. A simpler example is the humble coin flip or toss of a die, whose outcome cannot be predicted in advance no matter how accurately the initial conditions are measured. The quantum revolution of the first half of the twentieth century went further: uncertainty is inherent in the laws of nature, and even simple physical systems behave in an intrinsically random fashion, no matter how accurately one determines the initial data (a famous example is the double-slit experiment in quantum mechanics).
The historical development of probability was initially motivated by the desire for a theoretical understanding of gambling, in the sixteenth and seventeenth centuries. Today, probability theory has become one of the theoretical foundations of our modern society. It underpins statistics, machine learning, artificial intelligence, and computer science. It also constitutes the bedrock of any experimental discipline, and as such lies at the heart of the natural and social sciences.
Aside from its applications, probability theory is an area of pure mathematics, which has flourished in the past fifty years. Having shed its former reputation as an application-driven, low-brow game of counting balls and boxes, it has become one of the most central and active areas of pure mathematics.
The study of probability can be roughly divided into two disciplines, which, while not wholly separate, have rather different goals and ways of thinking.
Probability theory – an area of mathematics, which develops a calculus for determining the probability of an event starting from a set of mathematical axioms. As a mathematical theory, it is purely a logical construct and detached from any interpretation in the real world. Its origins trace back to Blaise Pascal and Pierre de Fermat in the seventeenth century. It was put on a rigorous axiomatic basis by Kolmogorov in 1933, an achievement usually regarded as the beginning of modern probability theory.
As we shall see, Kolmogorov’s axioms build on measure theory. Thus, one could make the case that probability theory is nothing but a special case of analysis and measure theory. This point of view is however simplistic and often even misleading, since probability theory has developed its own very particular way of thinking, characterised by concepts such as independence, conditioning, and infinite product spaces.
Interpretation of probability – an area of epistemology and statistics, which aims to connect mathematical probability theory with random experiments. It strives to give meaning to probabilistic claims about real-world events, or in other words to give an interpretation of probability. There are several competing schools of thought, each with their strengths and weaknesses; which interpretation to adopt in a given situation is sometimes a matter of personal preference.
For instance, returning to the example (i) above, what does the claim “the probability of obtaining heads when flipping a coin is 50 percent” mean? The most natural, and indeed oldest, interpretation is that of frequentist probability: the probability of a random event is the relative frequency of occurrence of the event when the experiment is repeated indefinitely and independently. The frequentist interpretation is independent of the observer, and it is the most prominent instance of an objective interpretation of probability.
What about example (ii) above? The frequentist interpretation fails here, because the event in question – it rains in Brig tomorrow – cannot be repeated independently: the current weather conditions are unique and we have no control over them. Nevertheless, we all believe that claims of the form “the probability that it rains in Brig tomorrow is 20 percent” somehow make sense, and indeed that is how weather forecasts are often formulated. For such events, a subjective interpretation of probability imposes itself, whereby the probability of an event corresponds to a degree of belief by a knowledgeable person, who incorporates expert knowledge (such as meteorology and weather models) and experimental data (such as the current and past weather conditions). The most popular version of subjective probability is Bayesian probability, whereby the expert knowledge is translated into a subjective prior probability distribution (an educated guess), which is then updated based on experimental data to obtain a posterior probability distribution. Different prior probability distributions will give rise to different posterior probability distributions when given the same experimental data. This captures the subjective element of Bayesian probability. In everyday life this is clearly illustrated by the fact that we often use several different weather apps to check the weather forecast, since they typically give different probabilities for the same event. (See https://www.rmets.org/metmatters/what-does-30-chance-rain-mean for an insightful and more detailed explanation of the meaning of probabilities in weather forecasts.)
As for example (iii), a frequentist interpretation is possible if I have a large supply of identical copies of my kitchen light, which I can test individually and measure the proportion of lights that fail in the next six months. On the other hand, if my kitchen light is a unique sample (say an inherited antique piece), a subjective interpretation is required.
In most instances, if one is familiar with probability theory, simple common sense is sufficient to answer probabilistic questions about the real world. Nevertheless, aside from important philosophical questions it raises, the study of the interpretation of probability can be of great practical importance in several applied fields. This is typically discussed in more detail in classes on statistics.
Being a mathematics course, this course is entirely devoted to mathematical probability theory. Henceforth, we shall wrap ourselves in the warm blanket of mathematical rigour and axiomatic deduction, without having to worry about the tricky epistemological questions raised by interpretation.
1 Recap of measure theory
Since probability theory is founded on measure theory, in this preliminary chapter we give a review of the most important ingredients from measure theory. It is meant to be understandable for a reader who has learned some basic measure theory but may have forgotten some details or more technical aspects of it.
For full details and for proofs, we refer to Chapter 3 of the course Calculus II that you took last year.
The term indicator function is used in probability theory, while the same object is usually called characteristic function in analysis. As we shall see, the latter term is reserved for a very different object in probability theory.
Let \(X\) be a set. A \(\sigma\)-algebra (or \(\sigma\)-field) on \(X\) is a collection \(\mathcal A\) of subsets of \(X\) satisfying
- \(X \in \mathcal A\);
- \(A \in \mathcal A \; \Rightarrow \; A^c \in \mathcal A\);
- if \(A_n \in \mathcal A\) for all \(n \in \mathbb{N}\) then \(\bigcup_{n \in \mathbb{N}} A_n \in \mathcal A.\)
If \(\mathcal A\) is a \(\sigma\)-algebra on \(X,\) then we say that any \(A \in \mathcal A\) is a measurable subset of \(X,\) and call \((X, \mathcal A)\) a measurable space.
The following construction plays a particularly prominent role in probability.
Let \(\mathcal C \subset \mathcal P(X).\) Then \[\sigma(\mathcal C) :=\bigcap_{\substack{\mathcal A \text{ is a $\sigma$-algebra}\\ \mathcal C \subset \mathcal A}} \mathcal A\] is the \(\sigma\)-algebra generated by \(\mathcal C\).
The \(\sigma\)-algebra generated by \(\mathcal C\) is indeed a \(\sigma\)-algebra as its name implies, because the intersection of \(\sigma\)-algebras is a \(\sigma\)-algebra.
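For instance, for a single subset \(A \subset X,\) \[\sigma(\{A\}) = \{\emptyset, A, A^c, X\}\,,\] since the right-hand side is a \(\sigma\)-algebra containing \(A,\) and any \(\sigma\)-algebra containing \(A\) must contain these four sets.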
- Let \(X = \mathbb{R}^d\) and \(\mathcal O\) be the collection of open subsets of \(\mathbb{R}^d.\) (More generally, \(X\) can be a topological space whose collection of open sets is \(\mathcal O.\)) Then \(\mathcal B(X) :=\sigma(\mathcal O)\) is the Borel \(\sigma\)-algebra of \(X.\)
- Let \((X_1, \mathcal A_1)\) and \((X_2, \mathcal A_2)\) be measurable spaces. The product \(\sigma\)-algebra on \(X_1 \times X_2\) is \[\mathcal A_1 \otimes \mathcal A_2 :=\sigma( A_1\times A_2 \,\colon A_1 \in \mathcal A_1 , A_2 \in \mathcal A_2).\]
A (positive) measure on a measurable space \((X, \mathcal A)\) is a function \(\mu \colon \mathcal A \to [0,\infty]\) satisfying \(\mu(\emptyset) = 0\) and \(\mu \bigl(\bigcup_{n \in \mathbb{N}} A_n\bigr) = \sum_{n \in \mathbb{N}} \mu(A_n)\) for any countable family \((A_n)_{n \in \mathbb{N}}\) of pairwise disjoint measurable subsets.
- Let \(X\) be finite or countable, \(\mathcal A = \mathcal P(X),\) and \(\mu(A) :=\# A.\) This is the counting measure on \(X.\)
- For \(x \in X\) we define the Dirac delta measure at \(x\) through \[\delta_x(A) := \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A\,. \end{cases}\]
- The Lebesgue measure on \((\mathbb{R}, \mathcal B(\mathbb{R}))\) is defined as the unique measure \(\lambda\) satisfying \(\lambda((a,b)) = b - a\) for all \(a < b.\) (Recall from your course on measure theory that the existence and uniqueness of \(\lambda\) is nontrivial. Later in this class we shall give a proof of uniqueness: see Example 3.10 below.)
A measurable space \((X, \mathcal A)\) endowed with a measure \(\mu\) is called a measure space and denoted by the triple \((X, \mathcal A, \mu).\)
Let \((X, \mathcal A, \mu)\) be a measure space. Then a property \(P(x)\) depending on \(x \in X\) holds almost everywhere if \[\mu(\{x \in X \,\colon P(x) \text{ false}\}) = 0\,.\]
For example, on \((\mathbb{R}, \mathcal B(\mathbb{R}))\) endowed with Lebesgue measure, the indicator function \(\mathbf 1_{\mathbb{Q}}\) equals \(0\) almost everywhere, or \(\mathbf 1_{\mathbb{Q}}(x) = 0\) for almost all \(x.\)
Let \((X, \mathcal A)\) and \((Y, \mathcal B)\) be measurable spaces. A function \(f \colon X \to Y\) is measurable if for all \(B \in \mathcal B\) we have \(f^{-1}(B) \in \mathcal A.\)
Here, \(f^{-1}\) denotes the preimage function on sets, i.e. \(f^{-1}(B) :=\{x \in X \,\colon f(x) \in B\}.\)
Often, the \(\sigma\)-algebras \(\mathcal A\) and \(\mathcal B\) are clear from the context, and we do not even mention them explicitly.
The following definition allows one to transport measures between measurable spaces using measurable functions.
Let \((X, \mathcal A)\) and \((Y, \mathcal B)\) be measurable spaces, \(f \colon X \to Y\) measurable, and \(\mu\) a measure on \((X, \mathcal A).\) Then we define the pushforward or image measure of \(\mu\) under \(f,\) denoted by \(f_*\mu,\) as the measure on \((Y, \mathcal B)\) defined by \[f_*\mu(B) :=\mu(f^{-1}(B)) \quad \text{for all } B \in \mathcal B\,.\]
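For instance, the pushforward of the Dirac measure \(\delta_x\) of Example 1.5 is again a Dirac measure: for any \(B \in \mathcal B\) we have \(f_*\delta_x(B) = \delta_x(f^{-1}(B)) = \mathbf 1_{f(x) \in B},\) i.e. \[f_* \delta_x = \delta_{f(x)}\,.\]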
We now recall the notation for the integral.
Let \(\mu\) be a measure on \((X, \mathcal A).\)
- Let \(f \colon X \to [0,\infty].\) We use the notation \[\int f \, \mathrm d\mu = \int f(x) \, \mu(\mathrm dx) \in [0,\infty]\] for the integral of \(f\) with respect to \(\mu\) (see the class on measure theory for its definition, which is also briefly reviewed below).
- A function \(f \colon X \to \mathbb{R}\) is called integrable if \(\int \lvert f \rvert \, \mathrm d\mu < \infty,\) in which case we define \[\int f \, \mathrm d\mu :=\int f_+ \, \mathrm d\mu - \int f_- \, \mathrm d\mu\,,\] where \(f_+ := f \vee 0\) and \(f_- := (-f) \vee 0\) denote the positive and negative parts of \(f.\)
It is helpful to recall briefly the construction of the integral in Definition 1.9 (i). It proceeds in two main steps.
We integrate a simple function (a finite linear combination of indicator functions) of the form \[\tag{1.1} f = \sum_{i = 1}^n c_i \, \mathbf 1_{A_i}\,,\] where \(c_i \in [0,\infty)\) and \(A_i \in \mathcal A\) for all \(i =1, \dots, n.\) By definition, the integral of this simple function is \[\int f \, \mathrm d\mu :=\sum_{i = 1}^n c_i \, \mu(A_i)\,.\] It is not hard to check that the right-hand side does not depend on the representation of the simple function \(f\) (the representation (1.1) is not unique: the sets \(A_i\) need not be disjoint, nor the \(c_i\) distinct), so that the integral is well defined.
Next, we note that an arbitrary measurable function \(f \,\colon X \to [0,\infty]\) can be approximated monotonically from below by simple functions \(f_n.\) For example, we can choose \(f_n\) to be equal to \(f\) rounded down to the nearest multiple of \(2^{-n},\) truncated at \(n,\) i.e. \[f_n(x) :=\bigl(2^{-n} \lfloor 2^n f(x) \rfloor\bigr) \wedge n\,,\] where \(\lfloor \cdot \rfloor\) denotes the integer part. (Plot this function!) Note that \((f_n(x))\) is a nondecreasing sequence for all \(x \in X.\) Then we define the integral of \(f\) through \[\int f \, \mathrm d\mu :=\lim_{n \to \infty} \int f_n \, \mathrm d\mu\,,\] where the limit exists in \([0,\infty]\) because it is the limit of a nondecreasing sequence. One can check that the right-hand side does not depend on the choice of the sequence \(f_n.\)
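To see this construction in action, here is a minimal numerical sketch in Python (our own illustration, not part of the notes): we take \(f(x) = x^2\) on \([0,1]\) and crudely replace Lebesgue measure by the uniform measure on a fine grid, so that the integrals of the simple approximants \(f_n\) can be computed and seen to increase towards \(\int_0^1 x^2 \, \mathrm dx = 1/3.\)

```python
import numpy as np

# Numerical sketch of the two-step construction: integrate the simple
# approximants f_n = (2^-n * floor(2^n f)) ∧ n of f(x) = x^2, with
# Lebesgue measure on [0,1] replaced by a uniform grid of N points.
N = 10**6
xs = np.linspace(0.0, 1.0, N)
f_values = xs**2

for n in [1, 2, 4, 8, 16]:
    fn = np.minimum(np.floor(2**n * f_values) / 2**n, n)
    # integral of the simple function f_n: each grid point carries mass 1/N
    print(n, fn.mean())  # nondecreasing in n, tends to 1/3
```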
The preceding definition captures a basic idea of measure theory, which we shall consistently and often tacitly use in this class: one can define the integral of any function provided that it is nonnegative, in which case the integral may be infinite. If the function is not nonnegative, then one has to impose that it is integrable for its integral to make sense. (Otherwise one might end up with expressions of the form \(\infty - \infty,\) which are ill-defined.)
The integral satisfies the following three convergence theorems (monotone convergence, Fatou’s lemma, and dominated convergence, respectively), which are stated for some fixed measure space \((X, \mathcal A, \mu).\)
Let \(f_1, f_2, \dots \,\colon X \to [0,\infty]\) be a pointwise nondecreasing sequence of measurable functions. Then \[\lim_{n \to \infty} \int f_n \, \mathrm d\mu = \int \lim_{n \to \infty} f_n \, \mathrm d\mu\,.\]
Let \(f_1, f_2, \dots \,\colon X \to [0,\infty]\) be a sequence of measurable functions. Then \[\liminf_{n \to \infty} \int f_n \, \mathrm d\mu \geqslant\int \liminf_{n \to \infty} f_n \, \mathrm d\mu\,.\]
Let \(g,f,f_1,f_2, \dots\) be measurable functions. Suppose that \(f_n \to f\) almost everywhere, that \(g\) is integrable, and that \(\lvert f_n \rvert \leqslant g\) almost everywhere for all \(n.\) Then \[\lim_{n \to \infty} \int f_n \, \mathrm d\mu = \int f \, \mathrm d\mu\,.\]
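A standard example shows that the inequality in Fatou’s lemma can be strict, and that the domination hypothesis in the dominated convergence theorem cannot be dropped: on \((\mathbb{R}, \mathcal B(\mathbb{R}))\) with Lebesgue measure \(\lambda,\) take \(f_n = \mathbf 1_{[n,n+1]}.\) Then \(f_n \to 0\) pointwise, whereas \(\int f_n \, \mathrm d\lambda = 1\) for all \(n,\) so that \[\int \liminf_{n \to \infty} f_n \, \mathrm d\lambda = 0 < 1 = \liminf_{n \to \infty} \int f_n \, \mathrm d\lambda\,.\] Moreover, any function \(g\) dominating all the \(f_n\) satisfies \(g \geqslant \mathbf 1_{[1,\infty)}\) almost everywhere, and hence is not integrable.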
Next, we recall the notion of product measure. Its uniqueness is guaranteed by the following finiteness property. A measure \(\mu\) on \((X,\mathcal A)\) is \(\sigma\)-finite if there exists a countable decomposition \(X = \bigcup_{n \in \mathbb{N}} X_n\) of \(X\) such that \(\mu(X_n) < \infty\) for all \(n \in \mathbb{N}.\) (For instance, Lebesgue measure on \(\mathbb{R}\) is \(\sigma\)-finite but not finite.)
Let \(\mu_1\) and \(\mu_2\) be \(\sigma\)-finite measures on \((X_1, \mathcal A_1)\) and \((X_2, \mathcal A_2),\) respectively. The product measure \(\mu_1 \otimes \mu_2\) is the unique measure on \((X_1 \times X_2, \mathcal A_1 \otimes \mathcal A_2)\) satisfying \[\mu_1 \otimes \mu_2 (A_1 \times A_2) = \mu_1(A_1) \, \mu_2(A_2) \quad \text{for all $A_1 \in \mathcal A_1$ and $A_2 \in \mathcal A_2$}.\] For the proof of existence and uniqueness, we refer to the class on measure theory.
The following theorem states that product measures can be integrated successively over each component separately, provided the function is nonnegative or integrable.
Let \(\mu_1\) and \(\mu_2\) be \(\sigma\)-finite measures on \((X_1, \mathcal A_1)\) and \((X_2, \mathcal A_2),\) respectively. Let \(f \colon X_1 \times X_2 \to [0,\infty]\) be measurable. Then \[\tag{1.2} \begin{aligned} \int_{X_1 \times X_2} f \, \mathrm d(\mu_1 \otimes \mu_2) &= \int_{X_1} \biggl(\int_{X_2} f(x_1,x_2) \, \mu_2 (\mathrm dx_2)\biggr) \, \mu_1(\mathrm dx_1) \\ &= \int_{X_2} \biggl(\int_{X_1} f(x_1,x_2) \, \mu_1 (\mathrm dx_1)\biggr) \, \mu_2(\mathrm dx_2)\,. \end{aligned}\] The same identity holds if \(f \,\colon X_1 \times X_2 \to \mathbb{R}\) is integrable with respect to \(\mu_1 \otimes \mu_2.\)
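The integrability assumption in the last statement cannot be dropped. A classical counterexample is \(f(x_1, x_2) = \frac{x_1^2 - x_2^2}{(x_1^2 + x_2^2)^2}\) on \((0,1)^2\) with Lebesgue measure: since \(\int_0^1 f(x_1, x_2) \, \mathrm dx_2 = \bigl[x_2/(x_1^2 + x_2^2)\bigr]_{x_2 = 0}^{1} = \frac{1}{1 + x_1^2},\) we find \[\int_0^1 \biggl(\int_0^1 f(x_1,x_2) \, \mathrm dx_2\biggr) \mathrm dx_1 = \frac{\pi}{4}\,, \qquad \int_0^1 \biggl(\int_0^1 f(x_1,x_2) \, \mathrm dx_1\biggr) \mathrm dx_2 = -\frac{\pi}{4}\,,\] where the second identity follows from the antisymmetry \(f(x_2, x_1) = -f(x_1, x_2).\)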
2 Foundations of probability theory
2.1 Probability spaces
In this section we shall give a motivation of Kolmogorov’s axioms of probability. We shall see that a mathematical formulation of probability theory rests on three core ingredients: (i) a set of realisations, (ii) a collection of events, and (iii) a probability measure that expresses probabilities of events.
A random experiment (such as the toss of a die) has a number of possible outcomes or realisations.
We denote by \(\Omega\) the set of realisations. Its elements (realisations of the randomness) are denoted by \(\omega.\)
We consider two basic examples.
Toss of a die: \(\Omega = \{1,2,3,4,5,6\}.\) The realisation \(\omega \in \Omega\) denotes the number shown by the die.
A game of darts. A person throws a dart at a disc-shaped dartboard. \(\Omega\) is the unit disc in the plane, \(\Omega = \{\omega \in \mathbb{R}^2 \,\colon\lvert \omega \rvert \leqslant 1\}.\) The realisation \(\omega \in \Omega\) denotes where the dart hits the dartboard.
These examples show that it makes sense to consider very general sets \(\Omega,\) from finite to uncountable.
A collection \(\mathcal A \subset \mathcal P(\Omega)\) is the collection of events, i.e. subsets of \(\Omega\) whose probability can be determined.
For the toss of a die, \(A = \{2,4,6\}\) is the event that I obtain an even number, and \(A = \{6\}\) is the event that I obtain a \(6.\)
The event \(A = \{\omega \in \mathbb{R}^2 \,\colon\lvert \omega \rvert \leqslant 1/20\}\) is the event that I hit the bull’s eye of the dartboard.
A function \(\mathbb{P}\,\colon\mathcal A \to [0,1]\) determines the probability \(\mathbb{P}(A)\) of an event \(A \in \mathcal A.\)
For a balanced die, we have \(\mathbb{P}(\{2,4,6\}) = 1/2\) and \(\mathbb{P}(\{6\}) = 1/6.\)
If the dart hits any region of the dartboard with uniform probability, then we have \(\mathbb{P}(\{\omega \in \mathbb{R}^2 \,\colon\lvert \omega \rvert \leqslant 1/20\}) = (1/20)^2\) (relative area of bull’s eye).
That \(\mathbb{P}(A) \in [0,1]\) reflects the fact that probabilities must be nonnegative and cannot exceed 1, i.e. 100 percent. Moreover, we require \(\mathbb{P}\) to satisfy the following two obvious properties.
\(\mathbb{P}(\Omega) = 1.\) This just expresses that with probability \(1\) we always see some realisation.
If \(A\) and \(B\) are disjoint events, then \(\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B).\) In other words, the probabilities of mutually exclusive events are additive.
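These two properties already yield useful consequences. For instance, since \(A\) and \(A^c\) are disjoint with union \(\Omega,\) we get \[\mathbb{P}(A^c) = 1 - \mathbb{P}(A)\,,\] and writing \(B = A \cup (B \setminus A)\) for \(A \subset B\) shows that \(\mathbb{P}(A) \leqslant \mathbb{P}(B).\)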
The triple \((\Omega, \mathcal A, \mathbb{P})\) therefore looks rather similar to a measure space. Imposing that the additivity property for mutually exclusive events extends to countable families, we arrive at the following celebrated and fundamental definition.
A probability space is a measure space \((\Omega, \mathcal A, \mathbb{P})\) satisfying \(\mathbb{P}(\Omega) = 1.\)
A measure \(\mathbb{P}\) on \((\Omega, \mathcal A)\) satisfying \(\mathbb{P}(\Omega) = 1\) is called a probability measure.
We give two examples that shall accompany us through much of this chapter.
I throw a balanced die twice: \[\Omega = \{1,2,\dots, 6\}^2\,, \qquad \mathcal A = \mathcal P(\Omega)\,, \qquad \mathbb{P}(A) = \frac{\# A}{36}\,.\]
Here is a more interesting (and more subtle) example. I throw a die repeatedly until I obtain a \(6.\) Since I may have to throw the die an arbitrarily large number of times, I choose \[\Omega = \{1,2,\dots,6\}^{\mathbb{N}^*}\,.\] As a reminder, this is the set of sequences \(\omega \,\colon\mathbb{N}^* \to \{1,2,\dots,6\}.\) We use the notation \(\omega = (\omega_k)_{k \in \mathbb{N}^*}\) for its elements.
The set \(\Omega\) is uncountable, and as we shall see it is ill-advised to take \(\mathcal A\) to be the full power set \(\mathcal P(\Omega).\) To find the correct choice for \(\mathcal A,\) let us begin by noting that we certainly want to ascribe a probability to any event depending on a finite number of throws (such as “the first 10 throws are all smaller than \(4\)”). Generally, such an event is called a cylinder set, and it is of the form \[\tag{2.1} \bigl\{\omega \in \Omega \,\colon\omega_1 = i_1, \dots, \omega_n = i_n\bigr\}\,,\] which is indexed by the parameters \(n \in \mathbb{N}^*\) and \(i_1, \dots, i_n \in \{1,2, \dots, 6\}.\) Hence, we define \(\mathcal A\) to be the \(\sigma\)-algebra generated by the cylinder sets, i.e. \[\tag{2.2} \mathcal A = \sigma \Bigl(\bigl\{\omega \in \Omega \,\colon\omega_1 = i_1, \dots, \omega_n = i_n\bigr\} \,\colon n \in \mathbb{N}^*, i_1, \dots, i_n \in \{1,2, \dots, 6\}\Bigr)\,.\] The \(\sigma\)-algebra \(\mathcal A\) thus constructed is called the cylinder \(\sigma\)-algebra, and it plays a fundamental role in probability. It is the canonical \(\sigma\)-algebra on an infinite product space (such as \(\Omega\)).
Clearly, the probability measure \(\mathbb{P}\) on \(\mathcal A\) should have the following value on any cylinder set: \[\tag{2.3} \mathbb{P}\Bigl(\bigl\{\omega \in \Omega \,\colon\omega_1 = i_1, \dots, \omega_n = i_n\bigr\}\Bigr) = \biggl(\frac{1}{6}\biggr)^n\,.\] In fact, we shall prove later that there exists a unique measure \(\mathbb{P}\) on \((\Omega, \mathcal A)\) satisfying (2.3).
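For instance, the event “the first 10 throws are all smaller than \(4\)” mentioned above is the disjoint union of the \(3^{10}\) cylinder sets with \(n = 10\) and \(i_1, \dots, i_{10} \in \{1,2,3\},\) so that (2.3) and additivity give it the probability \[3^{10} \cdot \biggl(\frac{1}{6}\biggr)^{10} = \biggl(\frac{1}{2}\biggr)^{10}\,.\]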
We conclude with a more difficult example, which is of great interest in mathematics and the sciences. It goes beyond the scope of this course, but we can nevertheless mention its basic mathematical structure.
You will probably have heard of Brownian motion, which was first observed by the botanist Robert Brown in 1827. With a microscope, he observed a particle of pollen immersed in water and noticed that it underwent an erratic random motion. Brownian motion was famously studied by Albert Einstein in one of his groundbreaking papers of 1905, where he gave a theoretical explanation of its origin.
The random realisation is the entire trajectory of the particle, so that we choose \[\Omega = C([0,\infty),\mathbb{R}^3)\] to be the space of continuous paths \(\omega = (\omega(t))_{t \geqslant 0}\) in \(\mathbb{R}^3.\) For the collection of events, as in the previous example, we choose the cylinder \(\sigma\)-algebra, which in this instance takes the form \[\mathcal A = \sigma \Bigl(\{\omega \in \Omega \,\colon\omega(t) \in B\} \,\colon t \in [0,\infty), B \in \mathcal B(\mathbb{R}^3)\Bigr)\,.\] (If you wish, you can think about the analogy between this definition and (2.2). It may help to consider intersections of cylinder sets \(\{\omega \in \Omega \,\colon\omega(t) \in B\}.\)) What about the probability measure \(\mathbb{P}\) on \((\Omega, \mathcal A)\)? Clearly, there are many possible choices, but one of them stands out by being by far the most natural one; it is called Wiener measure, an infinite-dimensional Gaussian measure which underlies the mathematical definition of Brownian motion. We shall not discuss it further in this course.
As a hint, you can consider the family \(B_0 = A_0,\) \(B_1 = A_1 \setminus A_0,\) \(B_2 = A_2 \setminus (A_0 \cup A_1),\) ….
2.2 Random variables
Informally, a random variable is a variable whose value depends on the realisation \(\omega \in \Omega.\)
A random variable is a measurable real-valued function on \(\Omega.\) More generally, for a measurable space \((E, \mathcal E),\) a random variable with values in \(E\) is a measurable function from \(\Omega\) to \(E.\)
For instance we can speak about vector-valued random variables, with values in \(E = \mathbb{R}^d.\)
In the example of two die throws above, the sum of the two values is the random variable \(X \,\colon\Omega \to \mathbb{R}\) defined by \[X((i,j)) :=i+j\,,\] with the notation \(\omega = (i,j) \in \{1,2, \dots,6\}^2.\)
Define the random variable \(X \,\colon\Omega \to \mathbb{N}^* \cup \{\infty\}\) to be the number of throws required to obtain a \(6\) for the first time, i.e. \[X(\omega) :=\inf \{k \,\colon\omega_k = 6\}\] with the convention that \(\inf \emptyset = \infty\) (which happens if I never throw a \(6\)).
To see that \(X\) is indeed a random variable, we have to check that it is measurable. To that end, we have to check that, for any \(n \in \mathbb{N}^*,\) the set \(X^{-1}(\{n\})\) belongs to \(\mathcal A.\) Indeed, \[X^{-1}(\{n\}) = \bigl\{\omega \in \Omega \,\colon\omega_1 \neq 6, \omega_2 \neq 6, \dots, \omega_{n - 1} \neq 6, \omega_n = 6\bigr\}\,,\] which is a finite union of cylinder sets of the form (2.1), as desired. (The remaining preimage \(X^{-1}(\{\infty\})\) is then also measurable, being the complement of the countable union \(\bigcup_{n \in \mathbb{N}^*} X^{-1}(\{n\}).\)) Intuitively, that \(X\) is a random variable is clear since the event “\(X\) equals \(n\)” clearly depends only on the first \(n\) throws, and \(\mathcal A\) is constructed precisely so that such events are measurable.
The law of a random variable with values in \(E\) is the measure \[\mathbb{P}_X :=X_* \mathbb{P}\] on \((E, \mathcal E).\) (Recall Definition 1.8.)
We sometimes use the equivalence relation \(\overset{\mathrm d}{=}\) on random variables, i.e. equality in law, defined by \[\tag{2.5} X \overset{\mathrm d}{=}Y \quad \Longleftrightarrow \quad \mathbb{P}_X = \mathbb{P}_Y\,.\]
Clearly, \(\mathbb{P}_X\) is a probability measure on \((E, \mathcal E).\) Hence, any random variable \(X\) with values in \(E\) gives rise to a new probability space \((E, \mathcal E, \mathbb{P}_X).\) The intuition is that this space is in general smaller than the original space, and it contains only information captured by the random variable \(X.\) If all we care about is the value of \(X,\) we can completely forget the original probability space \((\Omega, \mathcal A, \mathbb{P})\) and only work on the smaller space \((E, \mathcal E, \mathbb{P}_X),\) which is often much simpler.
For instance, in Example 2.8, if we only care about the value of \(X = i+j\) (and not, say, which of the two throws produced the larger value), we can work on the space \(E = X(\Omega) = \{2,3,\dots, 12\}\) instead of on the original larger space \(\Omega = \{1,2,\dots, 6\}^2.\) You can easily check that the probability measure \(\mathbb{P}_X\) on \(E\) is given by \[\mathbb{P}_X(\{k\}) = \frac{(k - 1) \wedge (13 - k)}{36}\,.\]
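As a quick sanity check of this formula (our own addition, not needed for the theory), one can enumerate all 36 equally likely realisations in Python:

```python
from collections import Counter

# Count, for each value k of X = i + j, the number of realisations (i, j).
counts = Counter(i + j for i in range(1, 7) for j in range(1, 7))

for k in range(2, 13):
    assert counts[k] == min(k - 1, 13 - k)   # matches ((k-1) ∧ (13-k))
    print(f"P_X({{{k}}}) = {counts[k]}/36")
```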
In general, for any \(B \in \mathcal E,\) we have \[\mathbb{P}_X(B) = \mathbb{P}(X^{-1}(B)) = \mathbb{P}(\{\omega \in \Omega \,\colon X(\omega) \in B\}) =:\mathbb{P}(X \in B)\,,\] where the notation on the right-hand side is being defined by this equation. This quantity is the probability that \(X\) lies in \(B.\)
Probability theory uses its own shorthand notation for events and probabilities determined by a random variable \(X\): \[\begin{aligned} \{\omega \,\colon X(\omega) \in B\} &\equiv \{X \in B\}\,, \\ \mathbb{P}\bigl(\{\omega \,\colon X(\omega) \in B\}\bigr) &\equiv \mathbb{P}(X \in B)\,. \end{aligned}\] In addition, inside \(\mathbb{P},\) intersection of events is often denoted with a comma instead of the symbol \(\cap.\) For instance, we write \[\tag{2.6} \mathbb{P}(\{X \in A\} \cap \{Y \in B\}) \equiv \mathbb{P}(X \in A, Y \in B)\,.\] We shall always use these shorthand notations.
Before looking at some examples, let us record the following rather banal remark, which is sometimes good to keep in mind. For a given probability measure \(\mu\) on a measurable space \((E, \mathcal E),\) can we construct a random variable \(X\) with law \(\mathbb{P}_X = \mu\)? Obviously yes, just by setting \((\Omega, \mathcal A, \mathbb{P}) = (E, \mathcal E, \mu)\) and \(X(\omega) = \omega.\)
2.2.1 Elementary special cases
Let us now review some special cases of random variables, some of which you may already have seen in school.
Let \(X\) be a random variable with values in \((E, \mathcal E).\)
Discrete random variables. Here \(E\) is finite or countable, and \(\mathcal E = \mathcal P(E).\) In that case, \[\tag{2.7} \mathbb{P}_X = \sum_{x \in E} p_x \, \delta_x\,,\] where \(p_x :=\mathbb{P}(X = x)\) and \(\delta_x\) is the delta measure from Example 1.5. To verify (2.7), we write, for any \(B \in \mathcal E,\) \[\begin{aligned} \mathbb{P}_X(B) &= \mathbb{P}(X \in B) = \mathbb{P}\Biggl(\bigcup_{x \in B} \{X = x\}\Biggr) \\ &= \sum_{x \in B} \mathbb{P}(X = x) = \sum_{x \in E} p_x \delta_x(B)\,, \end{aligned}\] where in the third step we used crucially the \(\sigma\)-additivity of measures, since \(E\) is at most countable by assumption.
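For instance, for a single throw of a balanced die, \(E = \{1, \dots, 6\}\) and \(p_x = 1/6\) for all \(x,\) so that \[\mathbb{P}_X = \frac{1}{6} \sum_{x = 1}^{6} \delta_x\,.\]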
For \(n \in \mathbb{N}^*\) let us compute the probability that we first obtain a \(6\) on the \(n\)th throw, \[\begin{aligned} \mathbb{P}(X = n) &= \mathbb{P}(\omega_1 \neq 6, \dots, \omega_{n-1} \neq 6, \omega_n = 6) \\ &= \mathbb{P}\Biggl(\bigcup_{i_1, \dots, i_{n-1} = 1}^5 \{\omega_1 = i_1, \dots, \omega_{n-1} = i_{n-1}, \omega_n = 6\}\Biggr) \\ &= \sum_{i_1, \dots, i_{n-1} = 1}^5 \mathbb{P}\bigl(\omega_1 = i_1, \dots, \omega_{n-1} = i_{n-1}, \omega_n = 6\bigr) \\ &= 5^{n-1} \biggl(\frac{1}{6}\biggr)^n \\ &= \frac{1}{6} \biggl(\frac{5}{6}\biggr)^{n-1}\,. \end{aligned}\] This computation shows the power of a clear and rigorous formulation in solving very concrete problems. In particular, we find that the probability that we never throw a \(6\) is \[\mathbb{P}(X = \infty) = 1 - \mathbb{P}(X < \infty) = 1 - \sum_{n \in \mathbb{N}^*} \mathbb{P}(X = n) = 1 - 1 = 0\,.\] Nevertheless, the event \(\{X = \infty\} = \{\omega \in \Omega \,\colon\omega_k < 6 \text{ for all } k \in \mathbb{N}^*\}\) is enormous, in particular uncountable.
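To illustrate the result (a simulation sketch of our own, not part of the course), we can compare these probabilities with empirical frequencies from simulated throws:

```python
import random
from collections import Counter

# Simulate the number of throws needed for the first 6, many times over,
# and compare empirical frequencies with P(X = n) = (1/6) * (5/6)**(n-1).
def first_six():
    n = 1
    while random.randint(1, 6) != 6:
        n += 1
    return n

N = 10**6
freq = Counter(first_six() for _ in range(N))
for n in range(1, 7):
    print(n, freq[n] / N, (1/6) * (5/6)**(n - 1))
```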
Continuous random variables. Let \((E, \mathcal E) = (\mathbb{R}^d, \mathcal B(\mathbb{R}^d))\) and suppose that \(\mathbb{P}_X\) is absolutely continuous with respect to Lebesgue measure. This means that there exists a measurable function \(p \colon \mathbb{R}^d \to [0,\infty)\) such that \[\mathbb{P}_X(B) = \int_B p(x) \, \mathrm dx\,,\] where \(\mathrm dx\) denotes Lebesgue measure on \(\mathbb{R}^d.\) The function \(p\) is called the density of the law of \(X,\) sometimes just the density of \(X.\) (As you may know from measure theory, the existence of such a density is usually deduced from the general definition of absolute continuity by the so-called Radon–Nikodym theorem. For our purposes, however, the above definition is sufficient.)
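As a concrete illustration (our own example, with \(d = 1\)), take the exponential density \(p(x) = \mathrm e^{-x}\) for \(x \geqslant 0\) (and \(p(x) = 0\) otherwise). Then \(\mathbb{P}_X([0,1]) = \int_0^1 \mathrm e^{-x} \, \mathrm dx = 1 - \mathrm e^{-1},\) which a short Riemann-sum computation confirms numerically:

```python
import math
import numpy as np

# Approximate P_X([0,1]) = ∫_0^1 exp(-x) dx by a left Riemann sum.
N = 10**6
xs = np.linspace(0.0, 1.0, N, endpoint=False)  # left endpoints of N cells
approx = np.exp(-xs).sum() / N

print(approx, 1 - math.exp(-1))  # both ≈ 0.6321
```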