For simplicity, we take a binary alphabet $\{0, 1\}$.
WLOG assume $p_1 \ge p_2 \ge \cdots \ge p_m$.
Huffman Code is defined inductively:

  1. If $m = 2$, assign codewords 0 and 1.
  2. If $m > 2$, find a Huffman coding in the case of $m - 1$ symbols with probabilities $p_1, \dots, p_{m-2}, p_{m-1} + p_m$.

Append 0 (respectively 1) to the codeword for $p_{m-1} + p_m$ to give a codeword for $\mu_{m-1}$ (respectively $\mu_m$).

Note that this construction produces a prefix-free code.
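The inductive construction above can be sketched in code. This is our own illustration (the function name, the list-based representation, and the tie-breaking when re-inserting the merged weight are choices, not part of the definition); the input weights are assumed already sorted in decreasing order.

```python
def huffman(p):
    """Binary Huffman codewords for weights p, sorted in decreasing order.

    Follows the inductive definition: if m = 2, assign 0 and 1; otherwise
    merge the two smallest weights, code the smaller source recursively,
    then append 0 and 1 to the merged symbol's codeword.
    """
    if len(p) == 2:
        return ["0", "1"]
    merged = p[-2] + p[-1]
    rest = p[:-2]
    # re-insert the merged weight so the smaller list stays sorted descending
    k = 0
    while k < len(rest) and rest[k] >= merged:
        k += 1
    codes = huffman(rest[:k] + [merged] + rest[k:])
    prefix = codes.pop(k)  # codeword assigned to the merged symbol
    return codes + [prefix + "0", prefix + "1"]
```

For instance, `huffman([4, 3, 2, 1])` returns `["1", "00", "010", "011"]`: a prefix-free code with word lengths $1, 2, 3, 3$, hence expected word length $1.9$ under the probabilities $(0.4, 0.3, 0.2, 0.1)$.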

Theorem

Huffman codings are optimal.

Proof

Lemma

Suppose $\mu_1, \dots, \mu_m$ are emitted with probabilities $p_1 \ge p_2 \ge \cdots \ge p_m$.
Let $c$ be an Optimal Code, prefix free, with word lengths $s_1, \dots, s_m$.

  1. If $p_i > p_j$ then $s_i \le s_j$.
  2. There exist two codewords of maximal length which are equal up to the last digit.
Proof
  1. If not, then $p_i > p_j$ but $s_i > s_j$ for some $i, j$; swap the $i$th and $j$th codewords. This decreases the expected word length, contradicting optimality.
  2. If not, then either there is only one codeword of maximal length, or any two codewords of maximal length differ before the last digit. In either case, delete the last digit of each codeword of maximal length. This maintains the prefix-free condition and strictly decreases the expected word length, contradicting optimality.
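The exchange step in part 1 can be made quantitative: swapping the $i$th and $j$th codewords changes the expected word length by

$$\left(p_i s_j + p_j s_i\right) - \left(p_i s_i + p_j s_j\right) = -(p_i - p_j)(s_i - s_j),$$

which is strictly negative when $p_i > p_j$ and $s_i > s_j$, so the swap is a strict improvement.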

We show, by induction on $m$, that any Huffman code of size $m$ is optimal.
$m = 2$: the codewords are just 0, 1, which is optimal.
$m > 2$: suppose the result holds for sources of size $m - 1$.
Source $X_{m-1}$ emits $m - 1$ symbols ($\mu_1, \dots, \mu_{m-2}$ together with a merged symbol) with probabilities $p_1, \dots, p_{m-2}, p_{m-1} + p_m$.
Source $X_m$ emits $\mu_1, \dots, \mu_m$ with probabilities $p_1, \dots, p_m$.
We construct a Huffman coding $c_{m-1}$ for $X_{m-1}$ and extend it to a Huffman coding $c_m$ for $X_m$. Since the codeword for the merged symbol, of length $s$ say, is replaced by two codewords of length $s + 1$ carrying total probability $p_{m-1} + p_m$, the expected codeword length satisfies

$$\mathbb{E}(S_m) = \mathbb{E}(S_{m-1}) + p_{m-1} + p_m.$$
Let $c_m'$ be an optimal code for $X_m$, where $c_m'$ is prefix free.
By the previous lemma, by shuffling codewords we may assume that the last two codewords (i.e. those associated with the smallest probabilities) of $c_m'$ are of maximal length and differ only in the last digit, say $y0$ and $y1$ for some string $y$.
We define a code $c_{m-1}'$ for $X_{m-1}$ which keeps the codewords of $c_m'$ for $\mu_1, \dots, \mu_{m-2}$ and assigns the codeword $y$ to the merged symbol of probability $p_{m-1} + p_m$.

Then $c_{m-1}'$ is a prefix-free code and the expected word length satisfies

$$\mathbb{E}(S_{m-1}') = \mathbb{E}(S_m') - p_{m-1} - p_m.$$
By the induction hypothesis, $c_{m-1}$ is optimal, so $\mathbb{E}(S_{m-1}) \le \mathbb{E}(S_{m-1}')$.
Therefore $\mathbb{E}(S_m) = \mathbb{E}(S_{m-1}) + p_{m-1} + p_m \le \mathbb{E}(S_{m-1}') + p_{m-1} + p_m = \mathbb{E}(S_m')$, but $c_m'$ was assumed to be optimal, so $c_m$ has to be optimal as well.
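The optimality claim can also be checked by brute force for small alphabets: by Kraft's inequality, a prefix-free binary code with word lengths $s_1, \dots, s_m$ exists if and only if $\sum_i 2^{-s_i} \le 1$, so minimising the expected length over all such length vectors gives the optimum. The function below (its name and the cap on word lengths are our own choices) is a sketch of that search.

```python
from itertools import product


def min_expected_length(p, max_len=6):
    """Minimum expected word length of any prefix-free binary code for p.

    Searches all length vectors with entries in 1..max_len that satisfy
    Kraft's inequality sum(2^-s) <= 1; each such vector is realisable as a
    prefix-free code.
    """
    best = float("inf")
    for lengths in product(range(1, max_len + 1), repeat=len(p)):
        if sum(2.0 ** -s for s in lengths) <= 1.0:
            best = min(best, sum(q * s for q, s in zip(p, lengths)))
    return best
```

For $p = (0.4, 0.3, 0.2, 0.1)$ this gives $1.9$, matching the expected length of the Huffman code with word lengths $1, 2, 3, 3$.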