The Eightfold Path to the Calculus of Constructions

Has a dog Buddha-nature?

This is the most serious question of all.

If you say yes or no,

You lose your own Buddha-nature.

— from Chapter 1 of the Gateless Gate

In the beginning, there was nothing.

Then Alonzo Church came and said, “Let there be:”

variable [x]
a symbol or string of symbols that represents a particular value
λ-abstraction [(λx.E)]
a function on the λ-term E, with the variable x bound to E
application [(E F)]
applying the λ-term E to the argument F

And thus all three and only they attained λ-nature (i.e., are valid λ-terms).

Whence cometh these λ-terms?

Consider the noble function f(x) = x^3 + 1. We note that the name f is illusory and thus we can rewrite the same function as x \mapsto x^3 + 1. A λ-abstraction then is simple another way of writing this expression:

(\lambda x\;.\;x^3 + 1)

This is not a trivial transformation, however, as this also binds the variable x to the λ-term x^3 + 1. Their destinies are forever intertwined within their confining parentheses and hence the expression (\lambda y\;.\;((\lambda x \;.\; x^2)\; y+1)) is equivalent to (\lambda x\;.\;((\lambda x \;.\; x^2)\; x+1)) even though the variables have the same names.

To keep the expressions as elegant as possible, we shall adopt the following rules:

  1. Application is left-associative: (E \; F \; G) is equivalent to ((E \; F) \; G)
  2. The outermost parentheses are omitted: write \lambda x . x over (\lambda x . x)
  3. The term in a λ-abstraction extends up until a same-level closing parenthesis is found: (\lambda x . (\lambda y.E) \; F) is equivalent to \lambda x.((\lambda y.E) \; F), not (\lambda x . (\lambda y.E)) \; F
  4. Successive λ-abstractions can be combined under a single λ: \lambda x.\lambda y.\lambda z.E is equivalent to \lambda xyz.E.

Meditation: Which of the following do not have λ-nature?1

  1. \lambda x.x
  2. \lambda x^2.x
  3. \lambda x . 1
  4. \lambda xy . (\lambda zx . z^2 + x^2) \; y

Any λ-term may be further transformed according to the following reduction rules:

α-conversion
from \lambda x . E to \lambda y.F, where F is the same as E but with all instances of x replaced by y; only bound variables may be replaced in this manner
β-reduction
application to arguments leads to substitution by those arguments, e.g., (\lambda x.x^2) \; 3 is equivalent to the λ-term 3^2
η-conversion
functions are completely determined by their arguments, i.e., \lambda x . (f \; x) is equivalent to f iff all instances of x in f are bound

We are now ready to contemplate the truth2. Consider the following definitions:

  1. \text{TRUE} := \lambda x . \lambda y . x
  2. \text{FALSE} := \lambda x . \lambda y. y
  3. \text{AND} := \lambda p . \lambda q . p \; q \; p
  4. \text{OR} := \lambda p . \lambda q . p \; p \; q
  5. \text{NOT} := \lambda p . p \; \text{FALSE} \; \text{TRUE}

Meditation: Show that OR FALSE TRUE evaluates to TRUE3.


Do not be fooled: the simplicity of our system hides a profound strength. For instance, contemplate the following:

  1. \lambda fx.x
  2. \lambda fx.f \; x
  3. \lambda fx.f \; (f \; x)
  4. (etc.)

What do they do with f and x? What happens if I apply the two-argument λ-abstraction \lambda xyf.x \; (y \; f) to (2) and (3)? To (3) and (3)? To (1) and any of the succeeding λ-abstractions in the list?

Indeed, we have defined an operation that behaves like multiplication on a list that behaves like numbers. We have:

0 := \lambda fx.x
1:= \lambda fx.f \; x
2 := \lambda fx.f \; (f \; x)
(etc.)

And in general these numerals (called Church numerals) repeat the functions to which they are arguments. So for instance, if we rename our multiplication λ-abstraction above to \text{MULT} := \lambda xyf.x \; (y \; f), then:

\begin{aligned}  \text{MULT} \; 2 \; 1 &= (\lambda x y z . x \; (y \; z)) 2 1 \\  &= (\lambda z . 2 \; (1 \; z)) \\  &= (\lambda z . 2 \; (\; (\lambda f x . f \; x) \; z\;)) \\  &= (\lambda z . 2 \; (\lambda x . z \; x)) \\  &= (\lambda z . 2 \; (\lambda \alpha . z \; \alpha)) \\  &= (\lambda z . (\lambda f x . f \; (f \; x)) (\lambda \alpha . z \; \alpha)) \\  &= (\lambda z . (\lambda x . (\lambda \alpha . z \; \alpha) \; ((\lambda \alpha . z \; \alpha) \; x))) \\  &= (\lambda z . (\lambda x . (\lambda \alpha . z \; \alpha) \; (z \; x))) \\  &= (\lambda z . (\lambda x . z \; (z \; x))) \\  &= (\lambda zx . z \; (z \; x)) \\  &= (\lambda fx . f \; (f \; x)) \\  &= 2  \end{aligned}

Note: it may be best if you do this calculation by yourself using only the reduction rules above.

We may now describe the machinations of any sort of calculating apparatus by defining what “computability” is:

Computability
a function f on the natural numbers is computable if, for all its outputs, there is a corresponding λ-term that β-reduces to the output’s Church numeral

Or in the Platonic realm of mathematics, if I can find a λ-expression F \; x' =_\beta y' that is equivalent to f(x)=y where x' and y' are the Church numeral representations of x and y respectively, and by E =_\beta F we mean you can get F from E by a series of β-reductions and vice versa.

Or in plain English, if anything I can do using f in the world of math, I can also do using a corresponding F in the world of λ-calculus.

However, our symbolic excursion hides an uncomfortable truth. Can you guess what it is? (Hint: it involves the operator NOT.)

 

Consider the following derivation:

Let k := \lambda x.(\text{NOT} \; (x \; x)).

\begin{aligned} k \; k &= \lambda x.(\text{NOT} \; (x \; x)) \; k \\ &= \text{NOT} \; (k \; k)\\ \end{aligned}

We have just demonstrated a paradox, an inescapable death knell to our naïve λ-calculus.

Wither our quest to bring forth computation? Is computer science a vacuous discipline after all?

Stay tuned for the next in this series.


  1. Answer: (2), which is only valid if you consider “x^2” a string; note that (4) evaluates to a function and can still be applied 
  2. I’m just gonna follow the exposition on Wikipedia here. 
  3. Just use β-reduction until you arrive at \text{FALSE}
Advertisements

Things I don’t understand: Bayes’ theorem

In which I lay down foundations

Probability is interesting to me for two reasons:

  • It’s useful.
  • It’s a generalisation of logic1.

The problem is, I have never had any formal training in it. Not even in high school. So that leaves me in a bit of a limbo because I see understanding probability as one of the cornerstones of Good Thinking.

This is my attempt at curing that.

What I shall do is start from the axioms then prove theorems as I tackle classic problems. Which means three things: a) this page will be a work-in-progress indefinitely2, b) the vocabulary I will build here may not correspond to the standard one, and c) I’ll probably get things wrong a lot of times so this may not be the best page to cite in support of your Internet argument.

Before we start, I’d like to introduce a rule and a notion. I won’t start from the very foundations of mathematics because that will just waste everyone’s time. But I really admire the rigor of the Bourbaki school (in great part because I’m still in my “rigorous phase”3). So I shall follow a simple rule: every assertion must have a proof or a reference to one. And to do that more efficiently I’ll borrow a notion from programming and “import” complete mathematical objects in this manner:

IPT 1: (the algebra of sets)
see Wikipedia

Eventually, those external links should become internal ones.

Right. Let’s get to work. Here are some definitions:

DEF: (data point)
a value (e.g., “red”, “42.4 seconds”)
DEF: (data set)
a set of data points
DEF: (universe, the Universal Data Set)
\Omega, or the data set containing all data sets
DEF: (probability)
the probability P(E) of a data set E is a real number associated with E

These are the basic building blocks of probability theory.

As far as I know, probability theory is then completely axiomatised by the following three axioms (which come from A. Kolmogorov, according to Wikipedia):

AXM 1: (“All probabilities are non-zero.”)
(\forall S \subseteq \Omega)(P(S) \geq 0)
AXM 2: (“The probability of the Universal Data Set is 1.”)
P(\Omega) = 1
AXM 3: (“The probabilities of disjoint data sets are additive.”)
(\forall E_1, E_2 \subseteq \Omega)( (E_1 \cap E_2 = \emptyset) => (P(E_1 \cup E_2) = P(E_1) + P(E_2)) )

Using these we can already say a few basic facts about probabilities.

THM 1: “The probability of the empty set is 0.”
(P(\emptyset) = 0)
\begin{aligned} &\rightsquigarrow \emptyset \cap \Omega = \emptyset \\ &\implies P(\emptyset \cup \Omega) = P(\Omega) = P(\emptyset) + P(\Omega) \\ &\implies P(\emptyset) = 0\\ &\Box \\ \end{aligned}

 

THM 2: “The probability of the complement of a data set is one (1) minus the probability of the original.”
(P(A^c) = 1 - P(A))
\begin{aligned} &\rightsquigarrow A \cap A^c = \emptyset \\ &\implies P(A \cup A^c) = P(\Omega) = 1 = P(A) + P(A^c)\\ &\implies P(A^c) = 1 - P(A) \\ &\Box \\ \end{aligned}

(Can I just say how awful and tedious it is to write multi-line \LaTeX in default WordPress?)

What we have though is still too bare. It lacks flavour. So let’s define a few more things:

DEF: (reduced universe)
the reduced universe \Omega_E of a data set E is the subset of the universe \Omega where E is true
DEF: (joint probability)
the probability P(A \cap B), or the probability of both A and B being true
DEF: (conditional probability)
the probability P(A|B) = \frac{P(A \cap B)}{P(B)}, or the probability of A being true given that B is true
DEF: (independence)
two data sets are independent if P(A \cap B) = P(A)P(B)

 

What are these definitions for?

  • We defined the reduced universe as such because we want to be able to say, “In the universe where data set A is true…”.
  • What do we mean by a data set being true in the first place? Say A = \{\text{Alice is a big mouse.}\}. Then in a particular universe, A is true if Alice is a mouse and if she is big and not otherwise. The truthiness of A is the truthiness of all its conditions.
  • Why the definition of conditional probability? We want to have a way of saying, “The truth of A depends on the truth of B by this much.”
  • In this vein, saying that two data sets are independent is saying that whether or not B is true does not affect whether or not A is true: P(A \cap B) = P(A)P(B) \implies P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} = P(A).

Definitions

Good Thinking
following what works; see the Twelve Virtues of Rationality, particularly the twelfth virtue
Things I don’t understand series
my own mental models of things, organised as best as I can

1. E. T. Jaynes. “Probability: The Logic of Science.” 1995. Print.

2. See the About section of gwern.net.

3. See Terry Tao’s post.