We are going to use Binary Tree and Minimum Priority Queue in this chapter. If you are not familiar with them, you can learn about them from the linked chapters.

Huffman code is a data compression algorithm which uses the greedy technique in its implementation. The algorithm is based on the frequency of the characters appearing in a file.

We know that our files are stored as binary code on a computer, and each character of the file is assigned a binary character code. Normally, these character codes have a fixed length for all characters. For example, if we assign 'a' as 000 and 'b' as 001, the codewords of both characters have the same length i.e., both 'a' and 'b' take 3 bits.

Huffman code doesn't use a fixed-length codeword for each character; it assigns codewords according to the frequency of each character in the file. Huffman code assigns a shorter codeword to a character which is used more often (has a higher frequency) and a longer codeword to a character which is used less often (has a lower frequency).

Since characters with high frequency get shorter codewords, they take less space and reduce the space required to store the file. Let's take an example.

In this example, 'a' appears 51 times out of 100 and has the highest frequency, while 'c' appears only 2 times out of 100 and has the least frequency. Thus, we assign 'a' the shortest codeword i.e., 0, and 'c' a longer one i.e., 1100.

Now, if we use fixed-length codewords, we need 100*3 = 300 bits (each character takes 3 bits) to represent the 100 characters of the file. But to represent the same 100 characters with variable-length codewords, we need 51*1 + 20*3 + 2*4 + 3*4 + 9*3 + 15*3 = 203 bits (51*1 because 'a' appears 51 times out of 100 and has a codeword of length 1, and so on). Thus, we save about 32% of the space by using variable-length codewords in this case.

Also, the Huffman code is a lossless compression technique. You can see that we are not losing any information; we are just using a different way to represent each character.

One general doubt which comes up while reading this is: "Why are we using longer codewords for characters which have less frequency? Can't we use shorter codewords for them too? For example, couldn't we use a 3-bit code to represent 'c' instead of a 4-bit one?"

The answer is no, and the reason is prefix codes. We are going to discuss prefix codes in a while, but let's first discuss how storing, encoding and decoding are done.

## Storing, Encoding and Decoding

We basically concatenate the codes of the characters while storing them in a file. For example, to store 'abc' with fixed-length codewords, we would use 000.001.010 i.e., 000001010. For decoding, we know that all of our characters are 3 bits long, so we break the code after every 3 bits and easily get 000, 001 and 010, which can be translated back to 'abc'.

With variable-length codewords, there is no fixed length at which we can split the code to separate the characters. To keep decoding simple and unambiguous, we use prefix codes.

### Prefix Codes

With variable-length codewords, we only use codes in which no codeword is a prefix of any other codeword; such codes are known as **prefix codes**. For example, if we use 0, 1, 01 and 10 to represent 'a', 'b', 'c' and 'd' respectively (0 is a prefix of 01 and 1 is a prefix of 10), then the code 00101 could be translated into 'aabab', 'acc', 'aadb', 'acab' or 'aabc'. To avoid this kind of ambiguity, we use prefix codes.

Using prefix codes makes decoding unambiguous. In the above example, we have used 0, 111 and 1100 for 'a', 'b' and 'c' respectively. No codeword is a prefix of any other, so any combination of these codes decodes into a unique value. For example, 01111100 uniquely decodes into 'abc'. Give it a try and see if you can decode it into anything else.

Now we know what Huffman code is and how it works. Let's focus on how to use it.

## Implementing Huffman Code

Let's first look at the binary tree given below.

We construct this type of binary tree from the frequencies of the characters given to us; we will learn how to do this in a while. Let's first focus on using this binary tree, and then we will learn about its construction.

Let's first focus on the decoding process.

### Decoding

One thing to notice here is that all the characters are at the leaves of the tree. To get the codeword of any character, we start from the root of the tree and proceed towards that character's leaf. Whenever we move right from a node, we interpret that movement as 1, and whenever we move left, as 0. This is also indicated on the branches of the binary tree in the picture given above. So, we move from the root to the leaf containing the character, and combining the 0s and 1s of each movement gives us the codeword of that character. This is described in the picture given below.

In this way, we can get the code of each character.

We can proceed in a similar way to decode any code given to us. For example, let's take the case of 01111100.

We will start from the root and, since the first bit is 0, we will move left. By moving left, we reach the character 'a', so the first character is 'a'.

Now, we will again start from the root. The next bit in the code is 1, so we will move right.

Since we have not yet reached a leaf, we will continue. The next bit is also 1, so we will again move right.

Still, we have not reached a leaf, so we continue; the next bit is also 1. Moving right this time gives us the character 'b'. Thus, 'ab' is the string we have decoded so far.

Similarly, starting again from the root and following the remaining bits brings us to 'c'.

Thus, we have decoded 01111100 into 'abc'.

We now know how to decode a Huffman code. Let's look at the encoding process now.

### Encoding

As stated above, encoding is simple. We just concatenate the codes of the characters to encode them. For example, to encode 'abc', we concatenate 0, 111 and 1100 i.e., 01111100.

Now, our next task is to create this binary tree for the frequencies we have.

## Construction of Binary Tree for Huffman Code

This is the part where we use the greedy strategy. Basically, we have to assign a shorter code to a character with a higher frequency and vice versa. We can do this in different ways, resulting in different trees, but a full binary tree (a tree in which every node except the leaves has 2 children) gives us the optimal code i.e., using that code saves the maximum space in storing the file.

To construct this tree for an optimal prefix code, Huffman invented a greedy algorithm, which we are going to use for the construction of the tree.

We start by sorting the characters according to their frequencies.

Now, we make a new node and greedily pick the first two nodes from the sorted characters, making the first node the left child of the new node and the second node its right child. The value of the new node is the sum of the values of its children.

We again sort the nodes according to the values and repeat the same process with the first two nodes.

In this way, we construct the tree for optimal prefix code.

We can see that the depth of a leaf is also the length of the prefix code of its character. So, the depth $d_i$ of leaf i is the length of the prefix code of the character at leaf i. If we multiply it by the character's frequency i.e., $d_i*i.freq$, we get the number of bits used by this character in the entire file.

We can sum this over all characters to get the total number of bits used to store the file. $$ \text{Total bits} = \sum_{i}d_i*i.freq $$
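Plugging in the example from earlier, with depths 1, 3, 4, 4, 3, 3 and frequencies 51, 20, 2, 3, 9, 15:

$$ \text{Total bits} = 1\cdot51 + 3\cdot20 + 4\cdot2 + 4\cdot3 + 3\cdot9 + 3\cdot15 = 203 $$

which matches the 203 bits we computed before.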

This also suggests why a full binary tree gives us optimal codes: this weighted sum of depths is least for a full binary tree.

Now we know how to construct the tree from the frequencies, how to use that tree to get the prefix codes of the characters, and how to encode and decode. So, let's see the code for the construction of the tree.

## Code for Huffman Code

In the steps of constructing the tree, we can see that we repeatedly need the nodes in sorted order and use the first two of them. A minimum priority queue takes care of this nicely: inserting a new node places it at its correct position, and extracting from a minimum priority queue gives us the node with the least value first.

We need the frequencies of the characters to build the tree, so we start our function by passing to it the array of nodes containing the characters - `GREEDY-HUFFMAN-CODE(C)`, where C is the array of nodes storing the characters.

Our next task is to make a minimum priority queue using these nodes.

`min_queue.build(C)`

The minimum priority queue will automatically place these nodes so that the node with the smallest frequency can be extracted first.

Now, we extract the first element from the queue and set it as the left child of a new node.

`n = min_queue.length`

`z = new node`

`z.left = min_queue.extract()`

Similarly, we need to set the right child of this new node to the element we will extract next.

`z.right = min_queue.extract()`

Now, the value of this new node will be the summation of the values of its children.

`z.freq = z.left.freq + z.right.freq`

After this, our queue and the node z will look like this:

Our next task is to insert this node into the queue. Since the queue is a minimum priority queue, it will place the node at its proper position.

`min_queue.insert(z)`

We repeat this process until the length of min_queue becomes 1 i.e., only the root of the tree remains in the queue.

Also, since we extract two nodes and add one node in each step, the queue shrinks by one node per iteration, so the loop runs n-1 times in total.

`while min_queue.length > 1`

` z.left = min_queue.extract()`

` z.right = min_queue.extract()`

` ...`

At last, we will extract the root of the tree from the queue and return it.

`while min_queue.length > 1`

` ...`

`return min_queue.extract()`

```
GREEDY-HUFFMAN-CODE(C)
  min_queue.build(C)
  while min_queue.length > 1
    z = new node
    z.left = min_queue.extract()
    z.right = min_queue.extract()
    z.freq = z.left.freq + z.right.freq
    min_queue.insert(z)
  return min_queue.extract()
```


```c
#include <stdio.h>
#include <stdlib.h>

/*
  node
            _____
           | data|
           | freq|
           |_____|
  left child /  \ right child address
      _____ /    \ _____
     | data|     | data|
     | freq|     | freq|
     |_____|     |_____|
*/
typedef struct node
{
    int frequency;
    char data;
    struct node *left;
    struct node *right;
} node;

int heap_array_size = 100; // size of the array storing the heap
int heap_size = 0;

// function to swap nodes
void swap(node *a, node *b) {
    node t;
    t = *a;
    *a = *b;
    *b = t;
}

/*
  function to print the tree (inorder traversal)
  https://www.codesdope.com/blog/article/binary-tree-in-c-linked-representation-traversals/
*/
void inorder(struct node *root)
{
    if (root != NULL) // checking if the root is not null
    {
        inorder(root->left);             // visiting left child
        printf(" %d ", root->frequency); // printing data at root
        inorder(root->right);            // visiting right child
    }
}

/*
  function to create a new node
*/
node* new_node(char data, int freq) {
    node *p = malloc(sizeof(struct node));
    p->data = data;
    p->frequency = freq;
    p->left = NULL;
    p->right = NULL;
    return p;
}

// function to get the right child of a node of the heap
int get_right_child(int index) {
    if ((((2 * index) + 1) <= heap_size) && (index >= 1))
        return (2 * index) + 1;
    return -1;
}

// function to get the left child of a node of the heap
int get_left_child(int index) {
    if (((2 * index) <= heap_size) && (index >= 1))
        return 2 * index;
    return -1;
}

// function to get the parent of a node of the heap
int get_parent(int index) {
    if ((index > 1) && (index <= heap_size)) {
        return index / 2;
    }
    return -1;
}

/*
  Functions taken from minimum priority queue
  https://www.codesdope.com/blog/article/priority-queue-using-heap/
  https://www.codesdope.com/blog/article/heap-binary-heap/
*/
void insert(node A[], node *a) {
    heap_size++;
    A[heap_size] = *a;
    int index = heap_size;
    // bubble the new node up while it is smaller than its parent
    while ((index > 1) && (A[get_parent(index)].frequency > a->frequency)) {
        swap(&A[index], &A[get_parent(index)]);
        index = get_parent(index);
    }
}

node* build_queue(node c[], int size) {
    node *a = malloc(sizeof(node) * heap_array_size); // a is the array to store the heap
    int i;
    for (i = 0; i < size; i++) {
        insert(a, &c[i]); // inserting node in array a (min-queue)
    }
    return a;
}

void min_heapify(node A[], int index) {
    int left_child_index = get_left_child(index);
    int right_child_index = get_right_child(index);
    // finding the smallest among index, left child and right child
    int smallest = index;
    if ((left_child_index <= heap_size) && (left_child_index > 0)) {
        if (A[left_child_index].frequency < A[smallest].frequency) {
            smallest = left_child_index;
        }
    }
    if ((right_child_index <= heap_size) && (right_child_index > 0)) {
        if (A[right_child_index].frequency < A[smallest].frequency) {
            smallest = right_child_index;
        }
    }
    // if the smallest is not the node itself, the subtree is not yet a heap
    if (smallest != index) {
        swap(&A[index], &A[smallest]);
        min_heapify(A, smallest);
    }
}

node* extract_min(node A[]) {
    node minm = A[1];
    A[1] = A[heap_size];
    heap_size--;
    min_heapify(A, 1);
    // copying the minimum element
    node *z = malloc(sizeof(struct node));
    z->data = minm.data;
    z->frequency = minm.frequency;
    z->left = minm.left;
    z->right = minm.right;
    return z; // returning the minimum element
}

// Huffman code
node* greedy_huffman_code(node C[], int size) {
    node *min_queue = build_queue(C, size); // making the min-queue
    while (heap_size > 1) {
        node *h = extract_min(min_queue);
        node *i = extract_min(min_queue);
        node *z = malloc(sizeof(node));
        z->data = '\0';
        z->left = h;
        z->right = i;
        z->frequency = z->left->frequency + z->right->frequency;
        insert(min_queue, z);
    }
    return extract_min(min_queue);
}

int main() {
    node *a = new_node('a', 42);
    node *b = new_node('b', 20);
    node *c = new_node('c', 5);
    node *d = new_node('d', 10);
    node *e = new_node('e', 11);
    node *f = new_node('f', 12);
    node C[] = {*a, *b, *c, *d, *e, *f};
    node *z = greedy_huffman_code(C, 6);
    inorder(z); // printing the tree
    printf("\n");
    return 0;
}
```

## Analysis of Huffman Code

By using a min-heap to implement the queue, we can perform each queue operation in $O(\lg{n})$ time. Also, the initialization of the queue takes $O(n)$ time i.e., the time for building a heap.

Now, since each operation takes $O(\lg{n})$ time and, as discussed above, there are a total of n-1 iterations, the total running time of the algorithm is $O(n\lg{n})$.

With this, we are going to finish the section of greedy algorithms. Next, we are going to learn some graph algorithms.