If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

Main content

Lossless text compression

AP.CSP:
DAT‑1 (EU)
,
DAT‑1.D (LO)
,
DAT‑1.D.1 (EK)
,
DAT‑1.D.2 (EK)
,
DAT‑1.D.3 (EK)
How can a computer compress text? Here's a hint: lots of people compress text every day, when they're writing text messages and they don't want to type a lot.
For example: if I want to say "Great, see you later!", I could type "Gr8, see u l8r!"
I shortened that text by looking for repeated sequences, and replacing those sequences with shorter sequences ("8" and "u").

Compression algorithm

Computers can compress text in a similar way, by finding repeated sequences and replacing them with shorter representations. They don't need to worry about the end result sounding the same, like people do, so they can compress even further.
Let's try it with this quote from William Shakespeare:
to be or not to be, that is the question
The most obvious repeated sequences are "to" and "be", so the computer could represent them with a character that isn't part of the original text, like:
⊜ ⬗ or not ⊜ ⬗, that is the question
Any repeated sequence can be replaced, even if it's not a whole word, so the computer can also replace "th":
⊜ ⬗ or not ⊜ ⬗, ⟡at is ⟡e question
The computer also needs to store the table of replacements that it made, so that it can reconstruct the original.
replacementoriginal
to
be
th
Check your understanding
See if you can find ways to compress this verse from Dr. Seuss:
I am Sam, 
Sam I am.
That Sam-I-am! That Sam-I-am!
I do not like that Sam-I-am! 
Do you like green eggs and ham?
I do not like them, Sam-I-am.
I do not like green eggs and ham.
Which sequences could be replaced to compress the text?
Choose all answers that apply:
Choose all answers that apply:

Compression amount

As you can see, some text can be compressed down quite a bit— more repetition means more compression.
Some text can't compressed at all, like the alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
In fact, a "compressed" version of the alphabet might take more bytes to represent than the original version, depending if the algorithm puts additional metadata in the file.
🤔 Can you think of other examples of text that wouldn't get smaller from compression? What about examples that would compress really well?
We're okay with the fact that compression doesn't guarantee a smaller file, since overall, most files do contain repeated sequences and do benefit from compression.
🔍 If you'd like to learn more about this type of compression, you can research the Lempel-Ziv-Welch (LZW) algorithm.

🙋🏽🙋🏻‍♀️🙋🏿‍♂️Do you have any questions about this topic? We'd love to answer— just ask in the questions area below!

Want to join the conversation?

  • blobby green style avatar for user Romeo Jeng
    I think Chinese characters are unlikely to compress well, since each word is effectively one character. That's equivalent to an alphabet in the English language carrying meaning by itself (eg. each character in 我在学习 conveys something). The only way to compress Chinese texts is probably to look at recurring phrases like "学习" and replace them.
    (11 votes)
    Default Khan Academy avatar avatar for user
    • piceratops ultimate style avatar for user Yik Yang
      Everything is represented with 0 and 1 in a computer, and Chinese characters are no exception. There are different kind of encoding for Chinese. For the GB code, it uses two bytes to represent a Chinese character, and you can compress those two bytes.
      (9 votes)
  • blobby green style avatar for user Mathilde
    As part of the discussion :
    I can't think of a kind of text that wouldn't get smaller from compression, but I'd say a dictionary would compress quite well.
    (2 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      Many compression algorithms rely on repeated sequences, so a text file with random characters will not be compressed well. This is relevant in cryptography since an encoded message will often end up appearing to be a sequence of random characters, meaning it is not easily compressed.
      (14 votes)
  • blobby green style avatar for user Mathilde
    "We're okay with the fact that compression doesn't guarantee a smaller file, since overall, most files do contain repeated sequences and do benefit from compression."
    I'm not sure I understand : does it mean that the point of compression is to make data more simple for the computer rather than smaller ?
    (2 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      The point of compression is to represent some data using less bits than what the data normally requires. Unfortunately, no compression algorithm is guaranteed to shorten every set of data. In some cases, applying the algorithm will increase the number of bits needed to represent the data we meant to shorten. However, most of the data we are going to compress will contain patterns and repeated sets of bits which makes the data easier to compress. So, even if a compression algorithm will not make every data set smaller, it will often shorten the data we compress on a regular basis.
      (13 votes)
  • blobby green style avatar for user christopher.floyd
    Perhaps I need further clarification than what the article is providing. So, compression removes repeating sequences, which occupy space, and replaces them with 'characters' that also occupy space?
    So, the idea is the 'characters' are supposed to occupy less space? I guess I'm a bit befuddled on how it's saving space if you're having to replace bytes(bits), with bytes everywhere there are repeating bytes. If you're removing even short snippets from words (Th) and replacing it with a character that probably has as many bytes(bits) so it can be reconstructed later, why do that at all?
    How does the receiving computer know how that file is to be translated, unless the replacement table goes with the file, which I'm thinking would be no benefit at all.
    (2 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      Yes, one way to compress information is to replace frequently repeated sequences with shorter sequences. Suppose we want to compress the string "abcabababdabf". We can see that "ab" repeats many times in the string, so we can replace "ab" with something shorter, such as "z". In this case, the compressed string is "zczzzdzf" which is 5 characters shorter than the original string.

      In order for a computer to reconstruct the original message, we have a couple of options. As you stated, we can include a table with the compressed message that tells how to undo the compression. Including a table takes up more memory though and makes the compression less effective. (Even with a replacement table included in the compressed message, it can still be shorter than the original message, depending on the algorithm used to compress it and the message itself.) Another option is to use a compression table that is known beforehand by both the computer doing the compression and the computer doing the expansion. Both of the computers already know the compression table, so it is not necessary to include it with the compressed message. Since the table is predetermined, it is likely not the most efficient way to compress your message and the compressed message will not be as short as it could be.
      (7 votes)
  • blobby green style avatar for user Baraka Mujtaba
    If the whole point of compression is to reduce the file size, how can benefit occur from lossless text compression, if it doesn't reduce or sometimes increases the size of the file? Just curious~
    (2 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user chansovanmonyyoeun03
    After replacement, does the text display the original text, and how many words can be replaced in one text?
    (2 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      After replacement, the text is not the same as the original text. We would need to substitute the original words for the replaced words to get the original text back.

      You can technically replace any number of words in a text if you are willing to substitute them for words of equal or greater length. However, you can only replace at most half of the different words with a shorter word.
      (4 votes)
  • blobby green style avatar for user vsvrahul01
    Can we get a PDF notes of all this basics
    (3 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user kristinelovett8
    Could you compress a lossless fin- file in data production for analog music.. say mp..?
    (2 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user arianit.lk
    Why is not replacing words with characters the same, knowing that characters also take place
    (1 vote)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      You could replace words with characters, however, you would need to ensure that whatever characters you use to replace a word do not appear in the text you are compressing (otherwise it would be ambiguous whether the character replaced a word or not). Given that requirement, you could only compress less than 256 different words.
      (3 votes)
  • aqualine sapling style avatar for user Caleb
    Is there any way to view what a compressed file looks like (like looking at one of our compressed files)?
    (1 vote)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      Generally, viewing a compressed file is not all that useful since it is not stored in a way that is meaningful for humans to understand. However, if you would like to see the raw data of a file, you can use the 'Format-Hex' terminal command (in Windows) or the 'hexdump' terminal command (in Linux).
      (2 votes)