TODO: * clean up the discussion about conditional probability * put in footnotes * make Fig. 1
In which I lay down foundations
Probability is interesting to me for two reasons:
- It’s useful.
- It’s a generalisation of logic¹.
The problem is, I have never had any formal training in it. Not even in high school. So that leaves me in a bit of a limbo because I see understanding probability as one of the cornerstones of Good Thinking.
This is my attempt at curing that.
What I shall do is start from the axioms, then prove theorems as I tackle classic problems. Which means three things: a) this page will be a work-in-progress indefinitely², b) the vocabulary I build here may not correspond to the standard one, and c) I’ll probably get things wrong a lot of the time, so this may not be the best page to cite in support of your Internet argument.
Before we start, I’d like to introduce a rule and a notion. I won’t start from the very foundations of mathematics because that would just waste everyone’s time. But I really admire the rigour of the Bourbaki school (in great part because I’m still in my “rigorous phase”³). So I shall follow a simple rule: every assertion must have a proof or a reference to one. And to do that more efficiently I’ll borrow a notion from programming and “import” complete mathematical objects in this manner:
- IPT 1: (the algebra of sets)
- see Wikipedia
Eventually, those external links should become internal ones.
Right. Let’s get to work. Here are some definitions:
- DEF: (data point)
- a value (e.g., “red”, “42.4 seconds”)
- DEF: (data set)
- a set of data points
- DEF: (universe, the Universal Data Set)
- $\Omega$, or the data set containing all data points (so every data set is a subset of it)
- DEF: (probability)
- the probability $P(E)$ of a data set $E$ is a real number associated with $E$
These are the basic building blocks of probability theory.
As far as I know, probability theory is then completely axiomatised by the following three axioms (which come from A. Kolmogorov, according to Wikipedia):
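- AXM 1: $P(E) \geq 0$ for every data set $E$ (“All probabilities are non-negative.”)
- AXM 2: $P(\Omega) = 1$ (“The probability of the Universal Data Set is 1.”)
- AXM 3: $P(A \cup B) = P(A) + P(B)$ whenever $A \cap B = \emptyset$ (“The probabilities of disjoint data sets are additive.”)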
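To make the axioms less abstract, here is a minimal sketch that checks them by brute force on a tiny universe. Everything in it is an assumption made for illustration: the universe is the six faces of a fair die, every data point is equally likely, and the names `UNIVERSE`, `prob`, `all_data_sets` and `check_axioms` are hypothetical. AXM 3 is only checked for pairs of disjoint data sets.

```python
from fractions import Fraction
from itertools import combinations

# Assumed universe for illustration: the six faces of a fair die.
UNIVERSE = frozenset(range(1, 7))


def prob(data_set):
    """Probability of a data set, assuming every data point is equally likely."""
    return Fraction(len(data_set & UNIVERSE), len(UNIVERSE))


def all_data_sets(universe):
    """Every subset of the universe, i.e. every data set we can talk about."""
    items = sorted(universe)
    return [
        frozenset(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


def check_axioms():
    """Return a (AXM 1, AXM 2, AXM 3) tuple of booleans for the die universe."""
    sets = all_data_sets(UNIVERSE)
    non_negative = all(prob(e) >= 0 for e in sets)      # AXM 1
    normalised = prob(UNIVERSE) == 1                    # AXM 2
    additive = all(                                     # AXM 3, pairs only
        prob(a | b) == prob(a) + prob(b)
        for a in sets
        for b in sets
        if not (a & b)                                  # a and b are disjoint
    )
    return non_negative, normalised, additive


print(check_axioms())
```

Using `Fraction` rather than floats keeps the equality checks exact, which matters when the point is to test an identity and not to approximate it.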
Using these we can already say a few basic facts about probabilities.
- THM 1: “The probability of the empty set is 0.”
- THM 2: “The probability of the complement of a data set is one (1) minus the probability of the original.”
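Since the rule here is that every assertion needs a proof, this is one way the two theorems can be derived from the axioms. It is a sketch under stated assumptions: $\Omega$ stands for the Universal Data Set, $E^{c}$ for the complement of $E$ in $\Omega$, and AXM 3 is read as $P(A \cup B) = P(A) + P(B)$ for disjoint $A$ and $B$. The notation is an assumption, since the original symbols are not shown.

```latex
% THM 1: the empty set is disjoint from itself, and its union with itself is itself.
\begin{align*}
P(\emptyset) &= P(\emptyset \cup \emptyset) \\
             &= P(\emptyset) + P(\emptyset)
                && \text{(AXM 3, since } \emptyset \cap \emptyset = \emptyset \text{)} \\
\Rightarrow \quad P(\emptyset) &= 0.
\end{align*}

% THM 2: E and its complement are disjoint and together make up the universe.
\begin{align*}
1 &= P(\Omega)              && \text{(AXM 2)} \\
  &= P(E \cup E^{c})        && \text{(algebra of sets, IPT 1)} \\
  &= P(E) + P(E^{c})        && \text{(AXM 3, since } E \cap E^{c} = \emptyset \text{)} \\
\Rightarrow \quad P(E^{c}) &= 1 - P(E).
\end{align*}
```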
(Can I just say how awful and tedious it is to write multi-line in default WordPress?)
What we have though is still too bare. It lacks flavour. So let’s define a few more things:
- DEF: (reduced universe)
- the reduced universe of a data set $E$ is the subset of the universe where $E$ is true
- DEF: (joint probability)
- the probability $P(A \cap B)$, or the probability of both A and B being true
- DEF: (conditional probability)
- the probability $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$ (for $P(B) > 0$), or the probability of A being true given that B is true
- DEF: (independence)
- two data sets $A$ and $B$ are independent if $P(A \mid B) = P(A)$
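The three probability definitions above can be tried out on a small example. This is a sketch under assumptions, not a canonical implementation: the universe is one roll of a fair die, conditional probability is taken to be the usual ratio $P(A \cap B) / P(B)$, and all names (`prob`, `joint`, `conditional`, `independent`, `EVEN`, and so on) are made up for illustration. Independence is tested in the product form $P(A \cap B) = P(A)\,P(B)$, which agrees with $P(A \mid B) = P(A)$ whenever $P(B) > 0$ and avoids dividing by zero.

```python
from fractions import Fraction

# Assumed universe for illustration: one roll of a fair die.
UNIVERSE = frozenset(range(1, 7))


def prob(data_set):
    """Probability of a data set, assuming equally likely data points."""
    return Fraction(len(data_set & UNIVERSE), len(UNIVERSE))


def joint(a, b):
    """P(A and B): probability that both data sets are true."""
    return prob(a & b)


def conditional(a, b):
    """P(A given B), using the ratio P(A and B) / P(B)."""
    if prob(b) == 0:
        raise ValueError("cannot condition on a data set of probability 0")
    return joint(a, b) / prob(b)


def independent(a, b):
    """Product form of independence: P(A and B) == P(A) * P(B)."""
    return joint(a, b) == prob(a) * prob(b)


EVEN = frozenset({2, 4, 6})
AT_MOST_FOUR = frozenset({1, 2, 3, 4})
AT_LEAST_FOUR = frozenset({4, 5, 6})

# Knowing the roll is at most 4 tells us nothing about evenness ...
print(conditional(EVEN, AT_MOST_FOUR), independent(EVEN, AT_MOST_FOUR))
# ... but knowing it is at least 4 makes an even roll more likely.
print(conditional(EVEN, AT_LEAST_FOUR), independent(EVEN, AT_LEAST_FOUR))
```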
What are these definitions for?
- We defined the reduced universe as such because we want to be able to say, “In the universe where data set A is true…”.
- What do we mean by a data set being true in the first place? Say $A = \{\text{“Alice is a mouse”}, \text{“Alice is big”}\}$. Then in a particular universe, $A$ is true if Alice is a mouse and if she is big, and not otherwise. The truthiness of $A$ is the truthiness of all its conditions.
- Why the definition of conditional probability? We want to have a way of saying, “The truth of A depends on the truth of B by this much.” […]
- In this vein, saying that two data sets are independent is saying that whether or not B is true does not affect whether or not A is true: $P(A \mid B) = P(A)$.
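The phrase “in the universe where data set B is true” can be made concrete. The sketch below assumes a fair die with equally likely faces and uses hypothetical names (`reduced_universe`, `prob`). It computes the probability of A inside the reduced universe of B, then compares that with the ratio $P(A \cap B) / P(B)$ taken in the full universe. Under the equal-likelihood assumption the two agree.

```python
from fractions import Fraction

# Assumed universe for illustration: one roll of a fair die.
UNIVERSE = frozenset(range(1, 7))


def prob(data_set, universe):
    """Probability of a data set within a given universe (equally likely points)."""
    return Fraction(len(data_set & universe), len(universe))


def reduced_universe(universe, e):
    """The part of the universe where data set e is true."""
    return universe & e


A = frozenset({2, 4, 6})   # "the roll is even"
B = frozenset({4, 5, 6})   # "the roll is at least 4"

# Probability of A when we only look at the universe where B is true.
in_reduced = prob(A, reduced_universe(UNIVERSE, B))

# The same quantity as a ratio of probabilities in the full universe.
by_ratio = prob(A & B, UNIVERSE) / prob(B, UNIVERSE)

print(in_reduced, by_ratio)
```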
Fig. 1: versus
- Good Thinking
- following what works; see the Twelve Virtues of Rationality, particularly the twelfth virtue
1. E. T. Jaynes. “Probability Theory: The Logic of Science.” 1995. Print.