The mathematics olympiad meets Coq

What a nice tnetennba

The 60th International Olympiad in Mathematics kicks off in about a week. I decided to mark the event for myself by convincing Coq that I could solve an old IMO question. I selected question 3 from 2001:
Twenty-one girls and twenty-one boys took part in a mathematical contest. Prove that there was a problem that was solved by at least three girls and at least three boys.
Results available here indicate this is not completely trivial: just 4% of contestants were awarded full marks. (I suggest we ignore the fact that it took me eighteen years to come up with the proof discussed below, versus the c.90 minutes available to contestants.)

One reason I like this question is that it follows trivially from the following result which, once stated, is easy to prove. Specifically:

Given an $r \times c$ matrix of natural numbers such that each row contains at most $n$ distinct values and each column contains at most $m$ distinct values, there exists a cell whose value appears at least $k$ times, in the row to which it belongs as well as at least $k$ times in the column to which it belongs, where: $$ k = \left\lceil\frac{cr}{cm + rn - \min(cm, rn, c+r)}\right\rceil. $$
Let's call this the two-dimensional pigeonhole principle. My proof of it in Coq is available here.

The original IMO question follows from the above with $c = r = 21$ and $m = n = 6$, by factoring it through the following construction. Given an instance of the original IMO question, first, number all the problems. Next construct a $21\times 21$ matrix with rows corresponding to girls, and columns corresponding to boys. A cell in this matrix thus corresponds to a girl-boy pair, and we fill in a value by choosing any one of the problems that they both solved, and writing its number in this cell.

Proof of two-dimensional pigeonhole principle for humans

The result is trivial if $m = n = 1$ so we may assume we are not in this case. Given this we have: $$ k = \min\left( \left\lceil\frac{c}{n}\right\rceil, \left\lceil\frac{r}{m}\right\rceil, \left\lceil\frac{cr}{c(m-1) + r(n-1)}\right\rceil \right). $$ Consider a cell in the matrix. We will call it column-good if its value appears at least $k$ times in the column to which it belongs, and column-bad otherwise. Similarly we define row-good and row-bad. We want to show there exists a cell that is both column-good and row-good. We do this by counting cells that are either column-bad or row-bad.

Consider a single column; since $k \le \lceil r/m\rceil$, the usual (one-dimensional) pigeonhole principle tells us that this column contains at least one column-good cell, and thus at most $m-1$ of the column's values appear in column-bad cells. Since each of these bad values appears in at most $k-1$ of the column's cells, the column contains at most $(m-1)(k-1)$ column-bad cells. Since this is true for each column, the matrix contains at most $c(m-1)(k-1)$ column-bad cells.

Similarly the matrix contains at most $r(n-1)(k-1)$ row-bad cells, and thus at most: $$ B = c(m-1)(k-1) + r(n-1)(k-1) $$ cells that are either column-bad or row-bad. Since $$ k - 1 < \frac{cr}{c(m-1) + r(n-1)}, $$ we must have $B < cr$ and thus there exists a cell that is both row-good and column-good, as required.

(In fact the above generalises trivially to show that sufficient conditions for there to exist a cell whose value occurs at least $k$ times in its column as well as at least $l$ times in its row are: $$ \max\left(\frac{m}{r}(k-1), \frac{n}{c}(l-1), \frac{m-1}{r}(k-1) + \frac{n-1}{c}(l-1)\right) < 1. $$ Even the Coq proof could be easily extend to the above.)

Proof for Coq

This repository, also linked above, contains a statement and (constructive) proof of the two-dimensional pigeonhole principle. No doubt you have already spent many hours delving into it.

The proof aside, I think there is value in staring at just a statement of the result:

Section Php2dStatement.

  Variables (r : nat) (M : matrix nat r).

  Definition nunique (l : list nat) := length (nodup Nat.eq_dec l).
  Definition min3 (x y z : nat) := min x (min y z).
  Definition lmax := fold_right max 0.
  Definition freq (l : list nat) (x : nat) := count_occ Nat.eq_dec l x.

  Definition c := col_count M.
  Definition m := lmax (map nunique (columns M)).
  Definition n := lmax (map nunique (rows M)).
  Definition k := ceil (r*c) (c*m + r*n - min3 (c*m) (r*n) (c+r)).

  Theorem php_2d : r <> 0 -> c <> 0 -> exists x v w i j,
    nth_error (rows M) i = Some v /\
    nth_error (columns M) j = Some w /\
    nth_error w i = Some x /\
    nth_error v j = Some x /\
    k <= freq v x /\
    k <= freq w x.

End Php2dStatement.
The only terms above that are not part of Coq or its standard library are the ceil function, and those associated with the matrix type, i.e., matrix, columns, rows. These are all easily defined.

For me, the main takeaways from this exercise were:

Higher dimensions

The two-dimensional pigeonhole principle can be regarded as a basic result about edge colourings of complete bipartite graphs, vaguely in the spirit of Ramsey theory. I can't help wondering to what other combinatorial structures it might generalise. I may play with this some rainy weekend afternoon (Dublin generously provides these in abundance).

In the meantime, there is of course a trivial generalisation by dimension. Suppose we have a $d$-dimensional cube of natural numbers of shape $a_1 \times a_2 \times \cdots \times a_d$, and for $1 \le i \le d$, any "column" parallel to the $i$th axis contains at most $n_i$ unique values. Let: $$ k = \min\left( \left\lceil\frac{a_1}{n_1}\right\rceil, \ldots, \left\lceil\frac{a_d}{n_d}\right\rceil, \left\lceil\frac{a_1a_2\cdots a_d}{a_2a_3\cdots a_d(n_1-1) + \cdots + a_1a_2a_{d-1}(n_d-1)}\right\rceil \right), $$ then there exists a cell whose value appears at least $k$ times in each of the $d$ "columns" to which it belongs.

In the generic case, the smallest value in the $\min$ expression for $k$ is the final argument; this may be simplified to: $$ k = \left\lceil\left(\sum_i \frac{n_i - 1}{a_i}\right)^{-1}\right\rceil. $$ I think it might make a good question to pose a carefully stated version of this result where we consider a cube of shape: $$ 100 \times 400 \times 900 \times \cdots \times (10d)^2, $$ and: $$ n_1 = n_2 = \cdots = n_d = 21, $$ and request proof of a value appearing at least 4 times in each dimension. I like this because it meets the genericity conditon and as $d$ grows, $k$ tends to: $$ \left\lceil \frac{30}{\pi^2}\right\rceil $$ and since $30/\pi^2 = 3.03\cdots $, it's fairly tight.

Exercise: the other pigeon hole principle.

Here's an exercise for the reader: if we have a list of length $a$ containing $n$ unique values, then the usual one-dimesional pigeonhole principle says:
  1. There exists a value occuring at least $\lceil a/n\rceil$ times.
  2. There exists a value occuring at most $\lfloor a/n\rfloor$ times.
I claim the above is a multi-dimensional generalisation of 1. What about 2?

Bonus: interactive scratch pad

Design your own pattern below:

(Ensure local storage is enabled to save results.)