If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

## Computers and the Internet

### Course: Computers and the Internet>Unit 1

Lesson 7: Data compression

# Lossless text compression

How can a computer compress text? Here's a hint: lots of people compress text every day, when they're writing text messages and they don't want to type a lot.
For example: if I want to say "Great, see you later!", I could type "Gr8, see u l8r!"
I shortened that text by looking for repeated sequences, and replacing those sequences with shorter sequences ("8" and "u").

### Compression algorithm

Computers can compress text in a similar way, by finding repeated sequences and replacing them with shorter representations. They don't need to worry about the end result sounding the same, like people do, so they can compress even further.
Let's try it with this quote from William Shakespeare:
`to be or not to be, that is the question`
The most obvious repeated sequences are "to" and "be", so the computer could represent them with a character that isn't part of the original text, like:
`⊜ ⬗ or not ⊜ ⬗, that is the question`
Any repeated sequence can be replaced, even if it's not a whole word, so the computer can also replace "th":
`⊜ ⬗ or not ⊜ ⬗, ⟡at is ⟡e question`
The computer also needs to store the table of replacements that it made, so that it can reconstruct the original.
replacementoriginal
to
be
th
See if you can find ways to compress this verse from Dr. Seuss:
``````I am Sam,
Sam I am.
That Sam-I-am! That Sam-I-am!
I do not like that Sam-I-am!
Do you like green eggs and ham?
I do not like them, Sam-I-am.
I do not like green eggs and ham.``````
Which sequences could be replaced to compress the text?

### Compression amount

As you can see, some text can be compressed down quite a bit— more repetition means more compression.
Some text can't be compressed at all, like the alphabet:
`ABCDEFGHIJKLMNOPQRSTUVWXYZ`
In fact, a "compressed" version of the alphabet might take more bytes to represent than the original version, depending if the algorithm puts additional metadata in the file.
🤔 Can you think of other examples of text that wouldn't get smaller from compression? What about examples that would compress really well?
We're okay with the fact that compression doesn't guarantee a smaller file, since overall, most files do contain repeated sequences and do benefit from compression.

## Want to join the conversation?

• I think Chinese characters are unlikely to compress well, since each word is effectively one character. That's equivalent to an alphabet in the English language carrying meaning by itself (eg. each character in 我在学习 conveys something). The only way to compress Chinese texts is probably to look at recurring phrases like "学习" and replace them.
• Everything is represented with 0 and 1 in a computer, and Chinese characters are no exception. There are different kind of encoding for Chinese. For the GB code, it uses two bytes to represent a Chinese character, and you can compress those two bytes.
• Metadata is data about data. For example, the name of a file is metadata because it is information about the file (the data).
• "We're okay with the fact that compression doesn't guarantee a smaller file, since overall, most files do contain repeated sequences and do benefit from compression."
I'm not sure I understand : does it mean that the point of compression is to make data more simple for the computer rather than smaller ?
• The point of compression is to represent some data using less bits than what the data normally requires. Unfortunately, no compression algorithm is guaranteed to shorten every set of data. In some cases, applying the algorithm will increase the number of bits needed to represent the data we meant to shorten. However, most of the data we are going to compress will contain patterns and repeated sets of bits which makes the data easier to compress. So, even if a compression algorithm will not make every data set smaller, it will often shorten the data we compress on a regular basis.
• As part of the discussion :
I can't think of a kind of text that wouldn't get smaller from compression, but I'd say a dictionary would compress quite well.
(1 vote)
• Many compression algorithms rely on repeated sequences, so a text file with random characters will not be compressed well. This is relevant in cryptography since an encoded message will often end up appearing to be a sequence of random characters, meaning it is not easily compressed.
• another thing that would be good to compress is the amount of braincells in my brain. They would not fit if not compressed
• how about compressing the alphabet in a text when the entire 'abcdefghijklmnopqrstuvwxyz' is repeated several times? Would it still be impossible to compress?
(1 vote)
• Sure, that is definitely possible. Just represent the alphabet string with a character that doesn't appear in the text (or use a letter character that only appears inside the alphabet strings). For example, "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz" could be compressed to "555" with the translation key being "5 = abcdefghijklmnopqrstuvwxyz"
• i do not understand this
(1 vote)
• good question! Maybe he needed to take an errand to buy some milk, but accidentally went to russia!