If you're seeing this message, it means we're having trouble loading external resources on our website.

If you're behind a web filter, please make sure that the domains *.kastatic.org and *.kasandbox.org are unblocked.

Main content

Lossless text compression

How can a computer compress text? Here's a hint: lots of people compress text every day, when they're writing text messages and they don't want to type a lot.
For example: if I want to say "Great, see you later!", I could type "Gr8, see u l8r!"
I shortened that text by looking for repeated sequences, and replacing those sequences with shorter sequences ("8" and "u").

Compression algorithm

Computers can compress text in a similar way, by finding repeated sequences and replacing them with shorter representations. They don't need to worry about the end result sounding the same, like people do, so they can compress even further.
Let's try it with this quote from William Shakespeare:
to be or not to be, that is the question
The most obvious repeated sequences are "to" and "be", so the computer could represent them with a character that isn't part of the original text, like:
⊜ ⬗ or not ⊜ ⬗, that is the question
Any repeated sequence can be replaced, even if it's not a whole word, so the computer can also replace "th":
⊜ ⬗ or not ⊜ ⬗, ⟡at is ⟡e question
The computer also needs to store the table of replacements that it made, so that it can reconstruct the original.
replacementoriginal
to
be
th
Check your understanding
See if you can find ways to compress this verse from Dr. Seuss:
I am Sam, 
Sam I am.
That Sam-I-am! That Sam-I-am!
I do not like that Sam-I-am! 
Do you like green eggs and ham?
I do not like them, Sam-I-am.
I do not like green eggs and ham.
Which sequences could be replaced to compress the text?
Choose all answers that apply:

Compression amount

As you can see, some text can be compressed down quite a bit— more repetition means more compression.
Some text can't be compressed at all, like the alphabet:
ABCDEFGHIJKLMNOPQRSTUVWXYZ
In fact, a "compressed" version of the alphabet might take more bytes to represent than the original version, depending if the algorithm puts additional metadata in the file.
🤔 Can you think of other examples of text that wouldn't get smaller from compression? What about examples that would compress really well?
We're okay with the fact that compression doesn't guarantee a smaller file, since overall, most files do contain repeated sequences and do benefit from compression.
🔍 If you'd like to learn more about this type of compression, you can research the Lempel-Ziv-Welch (LZW) algorithm.

🙋🏽🙋🏻‍♀️🙋🏿‍♂️Do you have any questions about this topic? We'd love to answer— just ask in the questions area below!

Want to join the conversation?

  • blobby green style avatar for user Romeo Jeng
    I think Chinese characters are unlikely to compress well, since each word is effectively one character. That's equivalent to an alphabet in the English language carrying meaning by itself (eg. each character in 我在学习 conveys something). The only way to compress Chinese texts is probably to look at recurring phrases like "学习" and replace them.
    (20 votes)
    Default Khan Academy avatar avatar for user
    • piceratops ultimate style avatar for user Yik Yang
      Everything is represented with 0 and 1 in a computer, and Chinese characters are no exception. There are different kind of encoding for Chinese. For the GB code, it uses two bytes to represent a Chinese character, and you can compress those two bytes.
      (24 votes)
  • blobby green style avatar for user saha
    What is metadata?
    (8 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user Mathilde
    "We're okay with the fact that compression doesn't guarantee a smaller file, since overall, most files do contain repeated sequences and do benefit from compression."
    I'm not sure I understand : does it mean that the point of compression is to make data more simple for the computer rather than smaller ?
    (3 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      The point of compression is to represent some data using less bits than what the data normally requires. Unfortunately, no compression algorithm is guaranteed to shorten every set of data. In some cases, applying the algorithm will increase the number of bits needed to represent the data we meant to shorten. However, most of the data we are going to compress will contain patterns and repeated sets of bits which makes the data easier to compress. So, even if a compression algorithm will not make every data set smaller, it will often shorten the data we compress on a regular basis.
      (19 votes)
  • blobby green style avatar for user Mathilde
    As part of the discussion :
    I can't think of a kind of text that wouldn't get smaller from compression, but I'd say a dictionary would compress quite well.
    (1 vote)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      Many compression algorithms rely on repeated sequences, so a text file with random characters will not be compressed well. This is relevant in cryptography since an encoded message will often end up appearing to be a sequence of random characters, meaning it is not easily compressed.
      (19 votes)
  • leaf yellow style avatar for user My name
    how about compressing the alphabet in a text when the entire 'abcdefghijklmnopqrstuvwxyz' is repeated several times? Would it still be impossible to compress?
    (2 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      Sure, that is definitely possible. Just represent the alphabet string with a character that doesn't appear in the text (or use a letter character that only appears inside the alphabet strings). For example, "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz" could be compressed to "555" with the translation key being "5 = abcdefghijklmnopqrstuvwxyz"
      (15 votes)
  • male robot johnny style avatar for user Ricart Jeycol
    I'm biggering. I'm figuring I'm biggering.
    Though biggering is triggering more biggering.
    (8 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user autumn.burgess
    i do not understand this
    (0 votes)
    Default Khan Academy avatar avatar for user
    • blobby green style avatar for user Ender Parker-Szabo
      I know that this was three years ago, but here.
      Basically, when a file has a group of characters that's repeated, computers can replace those repeated parts with a single character, which usually makes the amount of bits needed to store it smaller. There are cases where this can make the file larger or not change the size at all, but that usually isn't the case. Hope this helps!
      (4 votes)
  • blobby green style avatar for user Baraka Mujtaba
    If the whole point of compression is to reduce the file size, how can benefit occur from lossless text compression, if it doesn't reduce or sometimes increases the size of the file? Just curious~
    (3 votes)
    Default Khan Academy avatar avatar for user
  • blobby green style avatar for user christopher.floyd
    Perhaps I need further clarification than what the article is providing. So, compression removes repeating sequences, which occupy space, and replaces them with 'characters' that also occupy space?
    So, the idea is the 'characters' are supposed to occupy less space? I guess I'm a bit befuddled on how it's saving space if you're having to replace bytes(bits), with bytes everywhere there are repeating bytes. If you're removing even short snippets from words (Th) and replacing it with a character that probably has as many bytes(bits) so it can be reconstructed later, why do that at all?
    How does the receiving computer know how that file is to be translated, unless the replacement table goes with the file, which I'm thinking would be no benefit at all.
    (2 votes)
    Default Khan Academy avatar avatar for user
    • starky ultimate style avatar for user KLaudano
      Yes, one way to compress information is to replace frequently repeated sequences with shorter sequences. Suppose we want to compress the string "abcabababdabf". We can see that "ab" repeats many times in the string, so we can replace "ab" with something shorter, such as "z". In this case, the compressed string is "zczzzdzf" which is 5 characters shorter than the original string.

      In order for a computer to reconstruct the original message, we have a couple of options. As you stated, we can include a table with the compressed message that tells how to undo the compression. Including a table takes up more memory though and makes the compression less effective. (Even with a replacement table included in the compressed message, it can still be shorter than the original message, depending on the algorithm used to compress it and the message itself.) Another option is to use a compression table that is known beforehand by both the computer doing the compression and the computer doing the expansion. Both of the computers already know the compression table, so it is not necessary to include it with the compressed message. Since the table is predetermined, it is likely not the most efficient way to compress your message and the compressed message will not be as short as it could be.
      (9 votes)
  • aqualine ultimate style avatar for user mahinur
    It is something like math isn't it? We can write x^2 as t so we can write x^4 as t^2. Is it some stuff which is common in our personal computers or is it something that only useful for engineers and programmers?
    (4 votes)
    Default Khan Academy avatar avatar for user