Fibonacci and Huffman at the Crossroads

I'm currently working through The Data Compression Book written by Mark Nelson and Jean-Loup Gailly. It's a wonderful introduction to data compression and also a brief window into the history of software and computing in general from the early 1980s to late 1990s (including K&R C!).

In their explanation of adaptive Huffman coding, they introduce the overflow problem. There are two ways a Huffman tree can overflow.

  1. The weight of the root node exceeds the maximum value of the integer used to store it.
  2. The depth of the tree exceeds the number of bits in the integer used to store the code while traversing the tree from leaf to root.

I focus on the second here.

The maximum length of a Huffman code is related to the maximum count via a Fibonacci sequence.

We will start with a contrived example to build intuition. Imagine we use a 4-bit integer to store the code while traversing the tree. The tree below constructs codes for an alphabet of six symbols: A, B, C, D, E, and F with corresponding frequencies 1, 1, 1, 2, 3, and 5. This tree overflows because the path from the leaf node that represents symbol A to the root node with weight 13 consists of 5 edges, so the code for A is 00000 if we assign 0 to each left branch and 1 to each right branch, one bit too many for our 4-bit integer.

                     ----
                     |13|
                     ----
                    /    \
                 ---      ---
                 |8|      |F|
                 ---      ---
                /   \      5
             ---     ---
             |5|     |E|
             ---     ---
            /   \     3
         ---     ---
         |3|     |D|
         ---     ---
        /   \     2
     ---     --- 
     |2|     |C|
     ---     ---
    /   \     1
 ---     ---
 |A|     |B|
 ---     ---
  1       1
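
To make the overflow concrete, here is a minimal sketch, not the book's implementation, that builds a Huffman tree with Python's heapq and reports each symbol's code length. The name huffman_code_lengths and the tie-breaking rule, which prefers internal nodes over equally weighted leaves, are my own choices made to reproduce the tree above.

  import heapq
  import itertools

  def huffman_code_lengths(freqs):
      """Build a Huffman tree for {symbol: count} and return {symbol: code length}."""
      order = itertools.count()
      # Heap entries: (weight, is_leaf, insertion order, node). The is_leaf flag
      # breaks weight ties in favor of internal nodes, matching the tree above.
      heap = [(w, 1, next(order), (sym, None, None)) for sym, w in freqs.items()]
      heapq.heapify(heap)
      while len(heap) > 1:
          w1, _, _, n1 = heapq.heappop(heap)
          w2, _, _, n2 = heapq.heappop(heap)
          heapq.heappush(heap, (w1 + w2, 0, next(order), (None, n1, n2)))
      lengths, stack = {}, [(heap[0][3], 0)]
      while stack:  # a leaf's depth is the length of its code
          (sym, left, right), depth = stack.pop()
          if sym is None:
              stack.extend([(left, depth + 1), (right, depth + 1)])
          else:
              lengths[sym] = depth
      return lengths

  lengths = huffman_code_lengths({"A": 1, "B": 1, "C": 1, "D": 2, "E": 3, "F": 5})
  print(sorted(lengths.items()))
  # [('A', 5), ('B', 5), ('C', 4), ('D', 3), ('E', 2), ('F', 1)]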
    

To avoid this problem, we need to rescale the weights of the leaf nodes and reconstruct the internal nodes of the tree, which effectively limits the depth of the tree and so the length of its longest code.
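
The book's rebuilding procedure has more moving parts, but the core idea can be sketched with a simple rule, assumed here: halve each count, rounding up so that no symbol's count drops to zero.

  def rescale(freqs):
      # Halve every count, rounding up so no count falls to zero. One common
      # rescaling rule; real implementations differ in the details.
      return {sym: (w + 1) // 2 for sym, w in freqs.items()}

  print(rescale({"A": 1, "B": 1, "C": 1, "D": 2, "E": 3, "F": 5}))
  # {'A': 1, 'B': 1, 'C': 1, 'D': 1, 'E': 2, 'F': 3}: the Fibonacci pattern
  # breaks, and rebuilding the tree from these counts yields a shallower tree.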

We can build on the previous example. Say we use a 5-bit integer to store the code while traversing the tree and add another symbol G to the input stream with frequency count 8 to continue the Fibonacci sequence. Then:

                          ----
                          |21|
                          ----
                         /    \
                     ----      ---
                     |13|      |G|
                     ----      ---
                    /    \      8
                 ---      ---
                 |8|      |F|
                 ---      ---
                /   \      5
             ---     ---
             |5|     |E|
             ---     ---
            /   \     3
         ---     ---
         |3|     |D|
         ---     ---
        /   \     2
     ---     --- 
     |2|     |C|
     ---     ---
    /   \     1
 ---     ---
 |A|     |B|
 ---     ---
  1       1
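
Reusing the huffman_code_lengths sketch from above, with its assumed tie-breaking, confirms the extra level:

  lengths = huffman_code_lengths({"A": 1, "B": 1, "C": 1, "D": 2, "E": 3, "F": 5, "G": 8})
  print(lengths["A"])  # 6: one bit too long for our 5-bit integer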
    

This tree uses 6 bits to encode A, overflowing again. The pattern generalizes: if the root node has a weight of fib(n), then the maximum code length is n-1. In their explanation, Nelson and Gailly use 16-bit integers to store Huffman codes, so a root node with weight fib(18) = 4181 leads to a 17-bit code [0]. In their final implementation, they use 32-bit integers to store Huffman codes, so a root node with weight fib(34) = 9227465 leads to a 33-bit code and overflow, and they prevent this by reconstructing the tree whenever the root node's weight reaches 0x8000 (32768), a threshold hit long before overflow becomes possible.
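
We can check these thresholds with an iterative version of the book's Fibonacci convention (the recursive original is in footnote [0]); smallest_overflowing_weight is a hypothetical helper name:

  def fib(n):
      # Iterative form of Nelson and Gailly's convention: fib(0) = fib(1) = 1.
      a, b = 1, 1
      for _ in range(n - 1):
          a, b = b, a + b
      return b

  def smallest_overflowing_weight(code_bits):
      # A root of weight fib(n) yields an (n-1)-bit code, so the first root
      # weight producing a code of code_bits + 1 bits is fib(code_bits + 2).
      return fib(code_bits + 2)

  print(smallest_overflowing_weight(16))  # 4181 = fib(18)
  print(smallest_overflowing_weight(32))  # 9227465 = fib(34)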

There are a few different seed or indexing conventions for the sequence. Nelson and Gailly use an extra leading one: 1, 1, 1, 2, 3, 5, 8, and so on. A more typical sequence drops this leading one, and then the internal nodes of the Huffman tree no longer follow the Fibonacci sequence at first glance. But the leaf nodes still do, and either convention leads to a lopsided tree.

                     ----
                     |20|
                     ----
                    /    \
                 ----     ---
                 |12|     |F|
                 ----     ---
                /    \     8
             ---      ---
             |7|      |E|
             ---      ---
            /   \      5
         ---     ---
         |4|     |D|
         ---     ---
        /   \     3
     ---     --- 
     |2|     |C|
     ---     ---
    /   \     2
 ---     ---
 |A|     |B|
 ---     ---
  1       1
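
A few lines of arithmetic show the difference between the two conventions; internal_weights is a hypothetical helper that assumes the lopsided merge order depicted above:

  def internal_weights(leaves):
      # Weights of the internal nodes when each merge joins the current
      # internal node with the next leaf (the lopsided merge order).
      total, internals = leaves[0], []
      for w in leaves[1:]:
          total += w
          internals.append(total)
      return internals

  print(internal_weights([1, 1, 1, 2, 3, 5]))  # [2, 3, 5, 8, 13]: Fibonacci again
  print(internal_weights([1, 1, 2, 3, 5, 8]))  # [2, 4, 7, 12, 20]: one less than 3, 5, 8, 13, 21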
    

The Fibonacci sequence essentially strikes a sweet spot, a Goldilocks zone, for skewing a Huffman tree. It's the minimal sequence that creates a lopsided tree. At each iteration of Huffman's algorithm, we select the two nodes with the lowest weights from the pool of available nodes, create an internal node whose weight equals their sum, remove the two selected nodes from the pool, and add the new node in their place. A Fibonacci distribution of weights or frequency counts plays into this process perfectly: each iteration merges a leaf node with the lone internal node because the cumulative sum remains a smidge below the weight of the next leaf node, which creates a lopsided tree.
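
A short script can check that invariant directly, assuming standard Fibonacci leaf weights:

  weights = [1, 1, 2, 3, 5, 8, 13, 21]  # Fibonacci-distributed leaf weights

  internal = weights[0] + weights[1]  # the first merge joins the two unit leaves
  for i, leaf in enumerate(weights[2:], start=2):
      # The internal node weighs less than every leaf beyond the current one,
      # so Huffman's algorithm must pair the internal node with that leaf.
      assert all(internal < w for w in weights[i + 1:])
      internal += leaf
  print("every merge pairs the lone internal node with the next leaf")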

A. B. Vinokur formalizes this explanation in a paper published in 1986. He begins with a left-sided elongated binary tree (a lopsided tree as depicted above) and asks what minimal sequence of weights creates it, where minimal or "minimizing" refers to the minimization of the weighted path length of this fixed tree [1]. He takes a non-decreasing sequence of positive integers A defined by a_1, a_2, a_3, ..., a_k. Because we want a minimizing sequence, he sets a_1 = a_2 = 1. But a_3 cannot also equal 1: a_2 must be strictly less than a_3 so that no tie exists to break between the weights of leaf nodes and internal nodes at any iteration of the algorithm; in other words, the choice of which two nodes to merge must be unambiguous. The smallest valid value is then a_3 = a_2 + 1 = 2. From here, we can start running Huffman's algorithm.

  0: a_1, a_2, a_3, ..., a_k
  1: a_3, a_1 + a_2, a_4, ..., a_k
  2: a_4, a_1 + a_2 + a_3, a_5, ..., a_k
  3: a_5, a_1 + a_2 + a_3 + a_4, a_6, ..., a_k
  ...
  j: a_{j+2}, sum(a_{j+1}), a_{j+3}, ..., a_k
  ...
k-1: a_1 + ... + a_k
    

We can read off the recurrence relation at step j of the algorithm: a_{j+3} = sum(a_{j+1}) + 1, where sum(a_n) = \sum_{i=1}^n a_i. This holds because at each iteration, sum(a_{j+1}) must be strictly less than a_{j+3} so that the algorithm selects one leaf and one internal node, and a minimizing sequence must increase as slowly as possible, forcing equality with sum(a_{j+1}) + 1. Then, subtracting the same relation at the previous step, a_{j+2} = sum(a_j) + 1, with a clever alignment,

  a_{j+3} = sum(a_{j+1}) + 1
  a_{j+3} - a_{j+2} = (sum(a_{j+1}) + 1) - (sum(a_{j}) + 1)
  a_{j+3} - a_{j+2} = sum(a_{j+1}) - sum(a_{j})
  a_{j+3} - a_{j+2} = a_{j+1}
  a_{j+3} = a_{j+1} + a_{j+2}
    

Let i = j + 3 so that a_i = a_{i-2} + a_{i-1}, and then A must be the Fibonacci sequence.

  0: 1, 1, 2, ..., a_k
  1: 2, 1 + 1, 3, ..., a_k
  2: 3, (1 + 1) + 2, 5, ..., a_k
  3: 5, ((1 + 1) + 2) + 3, 8, ..., a_k
  ...
k-1: a_1 + ... + a_k
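
We can also verify the result numerically: grow a sequence using only the minimality rule and check that the Fibonacci numbers fall out. This is a small sketch in my own notation, not Vinokur's.

  # Grow the sequence from a_1 = a_2 = 1 using only a_{j+3} = sum(a_{j+1}) + 1.
  a = [1, 1]
  for _ in range(10):
      a.append(sum(a[:-1]) + 1)  # sum of all elements but the latest, plus one
  print(a)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144]
  assert all(a[i] == a[i - 2] + a[i - 1] for i in range(2, len(a)))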
    

[0] Nelson and Gailly use def fib(n): return 1 if n <= 1 else fib(n-2) + fib(n-1) as their Fibonacci function.

[1] In the paper, Vinokur defines the weighted path length T by T(D, \pi, Z) = \sum_{i=1}^k z_i l_i, where D represents a binary tree, \pi a function that assigns weights to nodes, Z a sequence of weights, l_i the depth or path length of leaf i, and k the number of leaf nodes in the tree.