Main content

## Computer science theory

### Course: Computer science theory > Unit 3

Lesson 2: Modern information theory- Symbol rate
- Introduction to channel capacity
- Message space exploration
- Measuring information
- Origin of Markov chains
- Markov chain exploration
- A mathematical theory of communication
- Markov text exploration
- Information entropy
- Compression codes
- Error correction
- The search for extraterrestrial intelligence

© 2023 Khan AcademyTerms of usePrivacy PolicyCookie Notice

# Compression codes

What is the limit of compression? Created by Brit Cruise.

## Want to join the conversation?

- Is audio data pseudo-random from a binary point of view? Is audio compression done by throwing away non-audible frequencies from a fast fourier transform?(17 votes)
- A lot of audio data is reasonably compressible, which suggests that it would be a poor candidate for randomness (we shouldn't be able to compress "random" data). However, it would be possible to find audio signals that are good candidates for randomness.

For lossy compression, for both images and audio signals, you are basically correct in that they use the idea of transforming the signals into the frequency domain and discard frequencies that won't be noticed. However, the transform that is used is typically the Discrete Cosine Transform (DCT), which is similar to the Discrete Fourier Transform. DCT is used for .mp3 and .jpg files.

For more details on DCT see:

http://en.wikipedia.org/wiki/Discrete_cosine_transform

Hope this makes sense(22 votes)

- 2:40"So for this to work, you need to introduce letter spaces, which cancel out savings during transmissions."

How much are letter spaces accounted for in compression codes? They don't seem to be accounted for when compressing the message from 2000 bits to 1750 bits at3:20bit are accounted for when calculating savings. Letter spaces don't seem to be bits, but they give information by showing where one letter ends and another begins, so how would you measure them information-wise?3:28"...And Claude Shannon was the first to claim that the limit of compression would be the entropy of the message source."

This claim is intuitive, but if we compress D further to just 0 (which doesn't seem to create any problems...although that's not a Huffman coding anymore...), wouldn't that compress the message to 1.5 bits per letter which is smaller than the entropy or does that throw away information?

I think that compressing D to just 0 would cause some kind of problem because otherwise contradicts Shannon's claim and is thus counter-intuitive since the smallest average number of bits one letter takes up should be the number of bits of information one letter has and not any smaller. Also, if I was right, the claim most likely wouldn't even be mentioned without stating or at least implying it wasn't right.(12 votes)- Noble,

Your confusion is understandable. You've been caught by a parenthesis at the lecture.

Professor Brit says that the Huffman Coding is optimal. At2:21he says "for examples of the following type. You cannot beat it. Go ahead and try. For an example..." Here start the parenthesis.

After this short introduction, Professor emulated an unsuccessful attempt to beat the Huffman Coding. He reduces the code for "D" (to be only 0). And so, the conclusion of the unsuccessful attempt: it will not work, because generates ambiguity. He continued talking about the "reduced code", and concluded further at2:39: "So for this to work (to transcend the ambiguity), you'd need to introduce letter spaces, which cancel out any savings during transmission."

With the last frase, he concluded the parenthesis. The tentative to beat the Huffman Coding was unsuccessful. The proposed code (that reduces the code for "D" to be only 0) was not only worst than Huffman's, it was ineffective. The letter spaces would cancel out any savings.

The sequence of the lecture is completely disconnected with the unsuccessful attempt to beat Huffman Coding (the proposed code does not even compress the information). The parenthesis and the letter spaces went to oblivion and Professor Brit follows explaning about the compression with Huffman Coding.

I hope it helps :)(6 votes)

- Does radio send information using Binary Digits? Or do the waves cause the receiver to vibrate at a frequency conducive for analog decompression into audio waves?(8 votes)
- Yes! It does now, but traditional radio was born out of the latter. Since FM can involved several frequencies at once, it allowed for developments such as stereo audio and subcarriers to transmit digital information through the waves (ie. text info for digital receivers). Now days through a service called "HD radio" from iBiquity, digital audio data is sent alongside the standard analog signal for those who have digital receivers that can decode the information. The digital signal has a much cleaner sound as the receiver is effectively judging 100% yes or no decisions, throwing out all the low level junk noise/distortion from the analog airwaves.(8 votes)

- Entropy is a measure of disorder or randomness. In information theory, it specifically refers the average amount of information in every message, which increases as the message gets more random.

For example, the machine in he videos that is equally likely to give you four symbols has more entropy than a machine that gives different symbols at different rates. I'd suggest rewatching the Information Entropy video.

Entropy is often used in other areas of science, especially thermodynamics. It's the measurement of disorder or chaos in a system, and relates to the ability of a system to do useful work. Over time, entropy always increases in a closed system as the system becomes more disorganized.

For example, if you drop food colouring into a glass of water, the food colouring starts off fairly ordered (all in the spot you dropped it). However, disorder and entropy will increase over time: the food colouring will slowly spread out until it is equally randomly dispersed throughout the water,(7 votes)

- The probability distribution of the four letters in the video worked out well in that 2*12.5 = 25 and 2*25 = 50. But, what if the probabilities weren't such clean multiples of each other? What if you had probabilities of 0.41, 0.31, 0.17, and 0.11 respectively for A, B, C, and D? How do you group the nodes when their probabilities don't imply obvious divisions?(3 votes)
- Here's how you would generate a Huffman coding for the following nodes:

A=0.41, B=0.31, C=0.17, D=0.11

Find the two least probable nodes: C and D

Group them together and create a new node, whose probability is their sum:

We create: Node (C,D) with probability 0.17+ 0.11 = 0.28

Now we have:

A=0.41, B=0.31, (C,D)=0.28

Find the two least probable nodes: B and (C,D)

Group them together and create a new node, whose probability is their sum:

We create: Node (B,(C,D)) with probability 0.31+ 0.28 = 0.59

Now we have:

A=0.41, (B,(C,D))=0.59

Find the two least probable nodes: A and (B,(C,D))

Group them together and create a new node, whose probability is their sum:

We create: Node (A,(B,(C,D))) with probability 0.41+ 0.59 = 1.00

We have just created this Huffman Coding tree`/\`

A *

/ \

* B

/\

D C

or with probabilities shown:`(A,(B,(C,D)))=1.00`

/ \

A=0.41 (B,(C,D))=0.59

/ \

(C,D)=0.28 B=0.31

/ \

D=0.11 C=0.17

If we use 0 for each left branch and 1 for each right branch the code is

A is 0 (length of 1 bit)

B is 11 (length of 2 bits)

C is 101 (length of 3 bits)

D is 100 (length of 3 bits)

You can try entering this distribution into this program:

https://www.khanacademy.org/computer-programming/huffman-coding/6047325206085632

The average bit size per symbol is:

0.41 * 1 + 0.31 * 2 + 0.17 * 3 + 0.11 * 3 = 1.87

The entropy of the probability distribution is:

-0.41* log_2(0.41) -0.31* log_2(0.31) -0.17* log_2(0.17) -0.11 * log_2(0.11) = 1.836

The difference between the entropy of the probability distribution and our average size

per symbol is caused by the probabilities not being powers of 2.

However, this is the best we can do using a prefix free code (assuming that we stick to binary

branching and only encode one symbol at a time), since the Huffman code is the provably

optimal prefix free code.

If instead we tried the distribution in the video with

A= 1/2 B= 1/8 C=1/8 D= 1/4

The average bit size per symbol is:

1/2 * 1 + 1/4 * 2 + 1/8 * 3 + 1/8 * 3 = 1.75

The entropy of the probability distribution is:

-1/2* log_2(1/2) -1/4* log_2(1/4) -1/8 * log_2(1/8) -1/8 * log_2(1/8) = 1.75

Here, the average bit size per symbol actually matches the entropy of the probability

distribution (because we used nice powers of two for our probabilities).

So if our probabilities are powers of 2, our Huffman coding will be perfect, if not, it won't be, but we won't be able to find a better prefix free code (assuming that we stick to binary branching and only encode one symbol at a time).

Hope this makes sense(5 votes)

- This is really tricky stuff, Shannon was an incredibly deep thinker. I know this is not a question but these videos are the best I have seen, I am going to have to go over these quite a few times. I wonder if this material may be the foundation of explaining the weirdness of Quantum Mechanics in a more "realistic" way ?(4 votes)
- Why can we number the edges in any order we want. Wouldn't this cause confusion on the other end?(3 votes)
- Your observation is of course correct but the idea is that you always share the code/"numbering of the edges" with the other side/reciever. Otherwise it will cause confusion as you said.(1 vote)

- What is the prefix used to say "This is a new character"?(2 votes)
- If you're using Huffman coding, you don't need a prefix to say "This is a new character," because Huffman coding produces a
*prefix-free code*. Therefore, no code for any symbol will be at the beginning of the code for another symbol.

In the video, the code is A=1, B=011, C=010, and D=00. Try decoding the following message from left to right. You'll find there is no ambiguity!

0101011

First digit is 0, doesn't correspond to a code. Second digit is 1, so now we're working with 01, doesn't correspond to a code. Third digit is 0, so now we're working with 010 = C! Next digit is 1, which corresponds to A! and so on, to get CAB. You see there is no need to signal that "This is a new character" with a prefix-free code.(2 votes)

- So this Huffman binary tree seems really good at storing information efficiently, without the use of separators (like spaces). Nevertheless aren't separators really useful for quickly accessing a certain symbol?

Let me**rephrase**this question:

Consider the following code (using the encoding from the video):`100101001110110111`

This code translates to`addacbabba`

if read from left to right. But what if you wanted to start somewhere in the middle, e.g. with the 5th symbol? Is there any way to systematically know where the 5th symbol starts (without decoding from left to right)?(1 vote)- Since the symbols have different lengths you need to look at almost all of the bits to determine where the 5th symbol starts.

This should illustrate why we are forced to look at almost all of the bits to find out where the 5th symbols starts. Imagine, we skipped over just 3 bits. That 3 bits might have been:

1 symbol e.g. 011

1 symbol plus part of another symbol e.g. 000(followed by 0 or 10 or 11)

2 symbols e.g. 001

3 symbols e.g. 111

So even skipping over just 3 bits makes it impossible to tell where the 5th digit starts.(2 votes)

- This is one point of confusion for me. The video gives an equation for H at around3:32that is summed (from 1 - n). Is this the same "n" as in the number of bounces. I don't think so. I think in this case the equation is implying that n = 4 (sum over all for options), but the maximum number of bounces (previously called "n") is 3. Am I correct in observing that "n" has been used for 2 different things?(1 vote)
- I couldn't see where n was used for bounces in the video.

With regards to the equation:

For H = Sum(from i=1 to n) p_i * log_2( 1/p_i)

The p_i is the probability of the i_th outcome occurring

n is the total number of possible outcomes

(The sum of all p_i values should add up to 1)

So in the example the outcomes are A,B,C,D (4 possible outcomes).

So for the example in the video we would have:

H = Sum(from i=1 to 4) p_i * log_2( 1/p_i)

Which outcome we assign to which i doesn't matter.

H =

p(A) * log_2( 1/p(A)) +

p(B) * log_2( 1/p(B)) +

p(C) * log_2( 1/p(C)) +

p(D) * log_2( 1/p(D))

where p(x) is the probability of event x occurring(2 votes)

## Video transcript

Voiceover: When we represent
information, such as an image, such as an image, digitally, it means we must slice
it up into tiny chunks. This allows us to send an image as a sequence of color symbols, and these colors can be represented as unique numbers, using some code. Consider the following challenge. Alice and Bob can transmit
and receive messages binary. (Morse code) They charge their
customers 1 penny per bit to use their system, and
a regular customer arrives who wants to send a message, and their messages are 1,000 symbols long. The meaning of the messages is completely unknown. Normally, this is sent
with a standard 2-bit code, which results in charging for 2,000 bits. However, Alice and Bob
already did some analysis on this customer before,
and determined that the probability of each symbol
in the message is different. Can they use these known probabilities to compress the transmission
and increase their profits? What's the optimal coding strategy? David Huffman famously provided the optimal strategy,
which he published in 1952, and based on building a binary
tree from the bottom up. To begin, we can list
all symbols at the bottom which we can call nodes. Then we find the two least probable nodes, in this case B and C, and merge them into one, and add the probabilities together. Then we repeat with the
next two least likely nodes, and continue merging until you have a single node at the top. Finally, we label the edges in this tree with 0 or 1 in any order. Now the code for each
letter is just the path from the top of the tree
to the given letter. So for A, it's just one edge, or 1. This is now known as a Huffman coding, and for examples of the following type you cannot beat it. Go ahead and try. For example, if you shorten the code for D to just 0, then a message
011 could perhaps mean DAA, or maybe just B. So for this to work, you
would need to introduce letter spaces, which
cancel out any savings during transmission. Now, how far does this
compress the message compared to the original 2,000 bits? Well, we just need to calculate the number of bits per letter on average. So we multiply the length of each code times the probability of occurrence, and add them together, which results in an average length of 1.75 bits per symbol on average. That means with this Huffman coding we can expect to compress the messages from 2,000 bits to 1,750 bits. And Claude Shannon was the first to claim that the limit of
compression will always be the entropy of the message source. As the entropy, or uncertainty, of our source decreases due to
known statistical structure, the ability to compress increases. (Morse code) While, if the entropy increases
due to unpredictability, our ability to compress decreases. (Morse code) if we want to compress beyond entropy, we must necessarily throw away information in our messages.