
## Computers and the Internet

### Course: Computers and the Internet > Unit 1

Lesson 7: Data compression

# Lossless bit compression

Computers represent all data in binary, so all types of files, from text to images to videos, are ultimately sequences of bits. Regardless of whether the bits represent a document or a GIF, computers can use a bit compression technique called Huffman coding.

### Huffman coding algorithm

Let's see how it works with a simple textual example. This example language uses only 4 different characters, and yet is incredibly important to us: it's the language used to represent DNA, made up of sequences of the four characters A, C, G, and T.
For example, the 4.6 million characters representing an E. coli DNA sequence happen to start with:
`agcttttcattct`
Since we need to represent four characters, a computer would typically represent each character using 2 bits, such as:
| character | binary code |
| --- | --- |
| a | 00 |
| c | 01 |
| g | 10 |
| t | 11 |

The 13 characters above would be written using 26 bits as follows. Notice that we don't need gaps between the codes for each character:
`00100111111111010011110111`
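As a quick sketch (in Python, which isn't part of the lesson), the fixed-length encoding above is just a table lookup:

```python
# Fixed-length 2-bit codes for the four DNA characters.
codes = {"a": "00", "c": "01", "g": "10", "t": "11"}

text = "agcttttcattct"  # the 13-character sample

# Concatenate the code for each character, with no gaps between codes.
encoded = "".join(codes[ch] for ch in text)

print(encoded)       # 00100111111111010011110111
print(len(encoded))  # 26 bits for 13 characters
```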
But we can do better than this. In the short sample text above the letter "t" is more common than the other letters ("t" occurs 7 times, "c" 3 times, "a" twice, and "g" just once). If we give a shorter code to "t", then we'd be using less space 54% of the time (7 out of 13 characters). For example, we could use the codes:
| character | binary code |
| --- | --- |
| a | 010 |
| c | 00 |
| g | 011 |
| t | 1 |
Then our 13 characters would be coded as:
`0100110011110001011001`
That's just 22 bits, four fewer bits than our original encoding. That may not seem like a lot, but imagine using an optimization like that on all 4.6 million characters of the DNA sequence!
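Here's a small Python sketch (not from the lesson) that compares the two encodings using the code tables shown above:

```python
fixed = {"a": "00", "c": "01", "g": "10", "t": "11"}
optimized = {"a": "010", "c": "00", "g": "011", "t": "1"}

text = "agcttttcattct"

# Encode the same text with each code table.
enc_fixed = "".join(fixed[ch] for ch in text)
enc_opt = "".join(optimized[ch] for ch in text)

print(len(enc_fixed))  # 26 bits
print(len(enc_opt))    # 22 bits: shorter because "t" now costs 1 bit
```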

### Huffman decoding

You might be scratching your head at the new binary codes we're using, with all their different lengths. Is it still possible to decode them reliably? Yes, with the right set of codes.
Try it yourself!
Decode the following bits using the optimized binary codes.
`111001`
| character | binary code |
| --- | --- |
| a | 010 |
| c | 00 |
| g | 011 |
| t | 1 |
Make sure you start at the first bit on the left, and match up the codes from left to right. What DNA string do you come up with?

That's the beauty of Huffman coding: the algorithm gives us a way to come up with a set of binary codes for a given sequence that ensures the data can be reconstructed unambiguously and reliably.
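One way to sketch that decoding process in Python (a hypothetical helper, not part of the lesson): scan the bits left to right and emit a character whenever the bits read so far match a code. This works because no code is a prefix of another.

```python
def decode(bits, codes):
    """Decode a bit string left to right using a prefix-free code table."""
    inverse = {code: ch for ch, code in codes.items()}
    result, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:       # a complete codeword has been read
            result.append(inverse[current])
            current = ""
    return "".join(result)

codes = {"a": "010", "c": "00", "g": "011", "t": "1"}

# Round-trip the 22-bit encoding of the sample text.
print(decode("0100110011110001011001", codes))  # agcttttcattct
```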

### Uses for Huffman coding

Many file formats utilize some kind of Huffman coding to reduce their file size. Fax machines use Huffman coding after using RLE on the black and white runs. PNG images are compressed using LZ77, an algorithm similar to the text compression technique we learned, combined with Huffman coding on the results.

## Want to join the conversation?

• Hi. Why are A and G given one more bit?
• `A` and `G` are assigned one more bit because they are the least common letters in the sequence.

If they were only assigned 2 bits, there would be confusion when decoding messages. Imagine we had the following code:
`a - 01`
`c - 00`
`g - 10`
`t - 1`

If I asked you to decode the message `101`, it would not be clear whether the original message was `ta` or `gt`.
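To see the ambiguity concretely, here's a small Python sketch (the `all_parses` helper is hypothetical, just for illustration) that finds every way to split the bits under that non-prefix-free code:

```python
def all_parses(bits, inverse):
    """Return every possible decoding of a bit string given a
    code-to-character table. More than one result means the code
    is ambiguous."""
    if not bits:
        return [""]
    results = []
    for code, ch in inverse.items():
        if bits.startswith(code):
            results += [ch + rest for rest in all_parses(bits[len(code):], inverse)]
    return results

# The code from the question: "1" is a prefix of "10", so it isn't prefix-free.
ambiguous = {"01": "a", "00": "c", "10": "g", "1": "t"}

print(all_parses("101", ambiguous))  # both 'gt' and 'ta' are valid readings
```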
• With Huffman coding, does it take every 2 bits, so 00, 01, 10, or 11, convert them to a, g, t, or c, and then re-convert them to binary as 1, 00, 010, and 001 based on which appears most often? What if the letters appear the same amount of times so that Huffman coding expands it rather than compressing?
• Hello, are the Huffman optimized binary codes universally standard?
• Hi Fredrick,

I don't think the Huffman codes are universally standard. I did a short search and found a short snippet of the rules used in the coding:

1. The code length of a character depends on how frequently it occurs in the given text.
2. The character which occurs most frequently gets the smallest code.
3. The character which occurs least frequently gets the largest code.

Based on the above, it seems to me that the Huffman encoding is done on the fly, so the same characters would be stored as different codes depending on their frequency of appearance in a file.

The file can then come with metadata that provides info on what codes represent what characters. This is just my guess, but I hope it helps!
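For what it's worth, here's a minimal Python sketch of the standard Huffman construction (repeatedly merge the two least frequent subtrees), applied to the frequencies from the lesson's 13-character sample. The exact 0/1 labels can come out differently from the lesson's table, but the code lengths are the same:

```python
import heapq

def huffman_codes(freq):
    """Build a prefix-free code table from character frequencies by
    repeatedly merging the two least frequent subtrees."""
    # Each heap entry: (frequency, unique tiebreaker, {char: code-so-far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        # Prefix 0 onto one subtree's codes and 1 onto the other's.
        merged = {ch: "0" + c for ch, c in codes1.items()}
        merged.update({ch: "1" + c for ch, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

# Frequencies from the sample: t=7, c=3, a=2, g=1.
print(huffman_codes({"t": 7, "c": 3, "a": 2, "g": 1}))
```

The most frequent character ("t") ends up with the 1-bit code, and the two rarest characters get 3-bit codes, matching the table in the lesson.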
• Could a RLE encoder also be used for DNA where encode

abcdddacddd as

1,1,1,3,1,0,1,3 for example? Could you combine the two encoding methods?

Or am I confusing two different things?
• Yes, you could use RLE encoding to represent DNA sequences. (Personally, I don't think it would be very effective given that there are not likely to be many long repetitions of the same character in the sequence though.)

Yes, you can combine RLE and Huffman coding.

Let's take the example you gave, "abcdddacddd". The RLE encoding would be "1,1,1,3,1,0,1,3", which would be represented as "0101011101000111" in binary. Then we can use Huffman coding with the following codes: 0 = 100, 1 = 0, 2 = 111, 3 = 110. This gives the final binary representation "00011001000110". After using the Huffman codes, we save another 2 bits.
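Here's a small Python sketch (not part of the original answer) that reproduces those numbers:

```python
# RLE run counts for "abcdddacddd", as given in the question.
runs = [1, 1, 1, 3, 1, 0, 1, 3]

# Plain fixed-length 2-bit codes versus the Huffman codes from the answer.
fixed = {0: "00", 1: "01", 2: "10", 3: "11"}
huff = {0: "100", 1: "0", 2: "111", 3: "110"}

enc_fixed = "".join(fixed[r] for r in runs)
enc_huff = "".join(huff[r] for r in runs)

print(enc_fixed, len(enc_fixed))  # 0101011101000111 16
print(enc_huff, len(enc_huff))    # 00011001000110 14
```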
• Why wasn't 1 utilized for one of the remaining letters (a, c, or g) in the DNA section? Wouldn't that make the code even shorter?
• Sometimes the best way to see why something is done the way it's done is to try to do it differently.

So we have a, c, g and t and we want to encode them. Pamela always used two bits in the first example, so let's try something else.

a - 0
c - 1
g - 00
t - 01

So this should give us shorter codes. Let's encode the string aaggct; that gives us

000010100111 with the original code and

000000101 with the new code,

so the new code is a lot shorter.

So now we send the codeword and the encoding table to someone (that someone being ourselves) and decode it. That gives us

aaaaaacac or gggct or ggaact or gaagcac or ... (a lot more possibilities) using the new code
or just
aaggct with the original.

You see, while the new code is shorter, it's also not very good, because decoding it can produce different results.

With the code in the first example, you always know that two bits are one letter, so there is no way to go about decoding it the wrong way.

The Huffman code, though, is "prefix-free" (a very important characteristic): no codeword is a prefix of another codeword. So if you have the codeword 1, you never have the codeword 11 or 111 or anything similar. That way decoding will always give you the same result.
That's why it's such a great encoding scheme: it's very short, but still prevents decoding errors.
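A tiny Python sketch (hypothetical helper, just for illustration) makes the prefix-free check concrete:

```python
def is_prefix_free(codes):
    """Return True if no codeword is a prefix of another codeword."""
    words = list(codes.values())
    return not any(
        a != b and b.startswith(a) for a in words for b in words
    )

# The lesson's optimized code: prefix-free, so decoding is unambiguous.
print(is_prefix_free({"a": "010", "c": "00", "g": "011", "t": "1"}))  # True

# The shorter code tried above: "0" is a prefix of "00" and "01".
print(is_prefix_free({"a": "0", "c": "1", "g": "00", "t": "01"}))     # False
```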
• So a Huffman-coded file will always have a decode algorithm too?
• I'm not sure I understand you correctly.
With a Huffman code, you need a translation table to convert the Huffman code into something readable. In fact, a translation table or an algorithm to decode a specific coded message is always needed, no matter the code.
So if you want to read a code someone needs to either give you a means of decoding the code or you have to break the code. Breaking the code is in some cases really easy and in others almost impossible.
• I did not understand how a DNA code can be used in computing, when it is a biological term?
• Adenine (A), cytosine (C), guanine (G), and thymine (T) are the building blocks of DNA. If you're doing research on DNA, you can represent those biological structures on your computer.
Here those blocks are introduced as an example of the advantages of compression: by using Huffman encoding you can reduce the amount of space the DNA strings require.
• I was a little confused at this part: "Since we need to represent four characters, a computer would typically represent each character using 2 bits." Why wouldn't the computer just use 1 bit per character?