

# Storing text in binary

Computers store more than just numbers in binary. But how can binary numbers represent non-numbers such as letters and symbols?
As it turns out, all it requires is a bit of human cooperation. We must agree on encodings, mappings from a character to a binary number.

## A simple encoding

For example, what if we wanted to store the following symbols in binary?
☮️❤️😀
We can invent this simple encoding:
| Binary | Symbol |
| --- | --- |
| `01` | ☮️ |
| `10` | ❤️ |
| `11` | 😀 |
Let's call it the HPE encoding. It helps for encodings to have names, so that programmers know they're using the same encoding.
If a computer program needs to store the ❤️ symbol in computer memory, it can store `10` instead. When the program needs to display `10` to the user, it can remember the HPE encoding and display ❤️ instead.
Computer programs and files often need to store multiple characters, which they can do by stringing each character's encoding together.
A program could write a file called "msg.hpe" with this data:
`010111111010`
A program on another computer that understands the HPE encoding can then open "msg.hpe" and display the sequence of symbols.
What sequence would the program display?

The HPE encoding only uses 2 bits, so that limits how many symbols it can represent.
How many symbols can the 2-bit encoding represent?
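The decoding step can be made concrete with a short JavaScript sketch. HPE is the made-up encoding from this article, so the function name and lookup table here are illustrative only:

```javascript
// Toy decoder for the made-up HPE encoding described above.
const HPE = { "01": "☮️", "10": "❤️", "11": "😀" };

function decodeHPE(bits) {
  let message = "";
  // Every HPE code is exactly 2 bits long, so read the data in pairs.
  for (let i = 0; i < bits.length; i += 2) {
    message += HPE[bits.slice(i, i + 2)];
  }
  return message;
}

console.log(decodeHPE("0110")); // ☮️❤️
```

Try running the file from "msg.hpe" through it to check your answer above.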

However, with more bits of information, an encoding can represent enough letters for computers to store messages, documents, and webpages.

## ASCII encoding

ASCII was one of the first standardized encodings. It was invented back in the 1960s when telegraphy was the primary form of long-distance communication, but is still in use today on modern computing systems. ${}^{1}$
Teletypists would type messages on teleprinters such as this one:
The teleprinter would then use the ASCII standard to encode each typed character into binary and then store or transmit the binary data.
This page from a 1972 teleprinter manual shows the 128 ASCII codes:
Each ASCII character is encoded in binary using 7 bits. In the chart above, the column heading indicates the first 3 bits and the row heading indicates the final 4 bits. The very first character is "NUL", encoded as `0000000`.
The first 32 codes represent "control characters," characters which cause some effect besides printing a letter. "BEL" (encoded in binary as `0000111`) caused an audible bell or beep. "ENQ" (encoded as `0000101`) represented an enquiry, a request for the receiving station to identify itself.
The control characters were originally designed for teleprinters and telegraphy, but many have been re-purposed for modern computers and the Internet—especially "CR" and "LF". "CR" (`0001101`) represented a "carriage return" on teleprinters, moving the printing head to the start of the line. "LF" (`0001010`) represented a "line feed", moving the printing head down one line. Modern Internet protocols, such as HTTP, FTP, and SMTP, use a combination of "CR" + "LF" to represent the end of lines.
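You can check those two control codes yourself in JavaScript, where "CR" and "LF" appear as the escape sequences `\r` and `\n`:

```javascript
// "CR" (carriage return) is code 13 and "LF" (line feed) is code 10.
// Printing them as 7-bit binary matches the ASCII chart.
console.log("\r".charCodeAt(0).toString(2).padStart(7, "0")); // 0001101
console.log("\n".charCodeAt(0).toString(2).padStart(7, "0")); // 0001010
```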
The remaining 96 ASCII characters look much more familiar.
Here are the first 8 uppercase letters:
| Binary | Character |
| --- | --- |
| `1000001` | A |
| `1000010` | B |
| `1000011` | C |
| `1000100` | D |
| `1000101` | E |
| `1000110` | F |
| `1000111` | G |
| `1001000` | H |
Following the ASCII standard, we can encode a four-letter message into binary:
`1000011 1001000 1000101 1000110`
What word does that ASCII-encoded binary data represent?
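One way to check your answer is with a short JavaScript sketch that decodes space-separated 7-bit ASCII codes (the function name here is ours, not a standard API):

```javascript
// Decode a string of space-separated 7-bit ASCII codes back into text.
function decodeAscii7(bits) {
  return bits
    .split(" ")
    .map((b) => String.fromCharCode(parseInt(b, 2)))
    .join("");
}

console.log(decodeAscii7("1000001 1000010")); // AB
```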

There are several problems with the ASCII encoding, however.
The first big problem is that ASCII only includes letters from the English alphabet and a limited set of symbols.
A language that uses fewer than 128 characters could define its own version of ASCII to encode text in just that language, but what about a text file with characters from multiple languages? ASCII couldn't encode a string like: "Hello, José, would you care for Glühwein? It costs 10 €".
And what about languages with thousands of logograms? ASCII could not encode enough logograms to cover a Chinese sentence like "你好，想要一盘饺子吗？十块钱。"
The other problem with the ASCII encoding is that it uses 7 bits to represent each character, whereas computers typically store information in bytes—units of 8 bits—and programmers don't like to waste memory.
When computers first started using ASCII to encode characters, different computers came up with various ways to utilize the final bit. For example, HP computers used the eighth bit to represent characters used in European countries (e.g. "£" and "Ü"), TRS-80 computers used the bit for colored graphics, and Atari computers used the bit for inverted white-on-black versions of the first 128 characters. ${}^{2}$
The result? An "ASCII" file created in one application might look like gobbledygook when opened in another "ASCII"-compatible application.
Computers needed a new encoding, an encoding based on 8-bit bytes that could represent all the languages of the world.

### Unicode

But first, how many characters do you even need to represent the world's languages? Which characters are basically the same across languages, even if they have different sounds?
In 1987, a group of computer engineers attempted to answer those questions. They eventually came up with Unicode, a universal character set that assigns a "code point" (a hexadecimal number) and a name to each character. ${}^{3}$
For example, the character "ą" is assigned the code point "U+0105" and named "Latin Small Letter A with Ogonek". A character that looks like "ą" appears in 13 languages, such as Polish and Lithuanian. Thus, according to Unicode, the "ą" in the Polish word "robią" and the "ą" in the Lithuanian word "aslą" are the same character. Unicode saves space by unifying characters across languages.
But there are still quite a few characters to encode. The Unicode character set started with 7,129 named characters in 1991 and has grown to 137,929 named characters in 2019. The majority of those characters describe logograms from Chinese, Japanese, and Korean, such as "U+6728" which refers to "木". It also includes over 1,200 emoji symbols ("U+1F389" = "🎉"). ${}^{4}$
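In JavaScript, `codePointAt` reveals a character's Unicode code point, which you can print in hexadecimal to match the "U+" notation:

```javascript
// Look up Unicode code points directly from JavaScript strings.
console.log("ą".codePointAt(0).toString(16)); // 105   → U+0105
console.log("木".codePointAt(0).toString(16)); // 6728  → U+6728
console.log("🎉".codePointAt(0).toString(16)); // 1f389 → U+1F389
```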
Unicode is a character set, but it is not an encoding. Fortunately, another group of engineers tackled the problem of efficiently encoding Unicode into binary.

### UTF-8

In 1992, computer scientists invented UTF-8, an encoding that is compatible with ASCII encoding but also solves its problems. ${}^{5}$
UTF-8 can describe every character from the Unicode standard using either 1, 2, 3, or 4 bytes.
When a computer program is reading a UTF-8 text file, it knows how many bytes represent the next character based on how many 1 bits it finds at the beginning of the byte.
| Number of bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- |
| 1 | `0xxxxxxx` | | | |
| 2 | `110xxxxx` | `10xxxxxx` | | |
| 3 | `1110xxxx` | `10xxxxxx` | `10xxxxxx` | |
| 4 | `11110xxx` | `10xxxxxx` | `10xxxxxx` | `10xxxxxx` |
If there are no 1 bits in the prefix (if the first bit is a 0), that indicates a character represented by a single byte. The remaining 7 bits of the byte are used to represent the original 128 ASCII characters. That means a sequence of 8-bit ASCII characters is also a valid UTF-8 sequence.
Two bytes beginning with 110 are used to encode the rest of the characters from Latin-script languages (e.g. Spanish, German) plus other languages such as Greek, Hebrew, and Arabic. Three bytes starting with 1110 encode most of the characters for Asian languages (e.g. Chinese, Japanese, Korean). Four bytes starting with 11110 encode everything else, from rarely used historical scripts to the increasingly commonly used emoji symbols.
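Those prefix rules can be written down as a short sketch (the function name is ours, not part of the standard):

```javascript
// Given the first byte of a UTF-8 character, return how many bytes
// that character occupies, based on its leading-bit prefix.
function utf8Length(firstByte) {
  if ((firstByte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  return 0; // 10xxxxxx is a continuation byte, never a first byte
}

console.log(utf8Length(0b01000001)); // 1
console.log(utf8Length(0b11100010)); // 3
```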
According to the UTF-8 standard, how many characters are represented by these 8 bytes?
`01001001 11110000 10011111 10010010 10011001 11100010 10010011 10001010`

Most modern programming languages have built-in support for UTF-8, so most programmers never need to know exactly how to convert from characters to binary.
✏️ Try using JavaScript to encode strings in UTF-8 yourself. Play around with multiple languages and symbols.
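Modern browsers and Node.js provide the built-in `TextEncoder`, which converts a string into its UTF-8 bytes:

```javascript
// TextEncoder always encodes strings as UTF-8.
const encoder = new TextEncoder();

// Byte counts match the table above: ASCII takes 1 byte, "é" takes 2,
// "木" takes 3, and the emoji "🎉" takes 4.
console.log(encoder.encode("A").length);  // 1
console.log(encoder.encode("é").length);  // 2
console.log(encoder.encode("木").length); // 3
console.log(encoder.encode("🎉").length); // 4
```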
The UTF-8 encoding standard is now the dominant encoding of HTML files on the web, accounting for 94.5% of webpages as of December 2019. ${}^{6}$
🔎 If you right click and select "view page source" on this webpage right now, you can search for the string "utf-8" and see that this webpage is encoded as UTF-8.
Generally, a good encoding is one that can represent the maximum amount of information with the fewest bits. UTF-8 is a great example of that, since it can encode common English letters with just 1 byte but is flexible enough to encode thousands of letters with additional bytes.
UTF-8 is only one possible encoding, however. UTF-16 and UTF-32 are alternative encodings that are also capable of representing all Unicode characters. There are also language specific encodings such as Shift-JIS for Japanese. Computer programs can use the encoding that best suits their needs and constraints.

## Want to join the conversation?

• This article was very informative and helpful overall. However, I got confused during the JavaScript part, where I had to type in a number and it would convert that number to UTF-8. For example, I typed in the number 1 and the UTF-8 code translated it to 00110001, which is neither a binary conversion nor a hexadecimal conversion. I am not understanding the correlation between these two values. Can you please explain? Thank you so much!
• When you typed in the number '1', it was being represented as a character rather than a numeric 1. Therefore, it will be stored in binary using its ASCII representation. The character '1' is encoded as the decimal number 49 per ASCII standards, and 49 represented in binary is 00110001. I hope that helps to answer your question!
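You can verify this yourself in JavaScript:

```javascript
// The character "1" has ASCII code 49, which is 00110001 in 8 bits.
const code = "1".charCodeAt(0);
console.log(code); // 49
console.log(code.toString(2).padStart(8, "0")); // 00110001
```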
• How does the computer know if we are trying to represent number or letters, etc. in binary? Do we have to put a certain thing before the binary code in order to tell it what to convert it to?
• Yes exactly, you (or your compiler or your program) always have to explicitly tell the computer what you want it to do with the bit string.
• Why are numbers read backwards in binary ie right to left but words and symbols appear to be left to right? I noticed this when doing the emoji question peace signs, smiley faces and hearts. I originally thought the answer would be Heart Heart Smiley Smiley Peace Peace.
• Whether the binary data is stored "left to right" or "right to left" actually depends on the machine that is storing the data. Some machines use "big-endian" (storing the most significant bits first) and others use "little-endian" (storing the least significant bits first).
• I don't quite understand why, according to UTF-8, those 8 bytes can only represent 3 characters?
• The first byte - "01001001" - begins with a 0.
This means that the first character is represented by one byte.

The second byte - "11110000" - begins with four 1's. This means that the second character is represented by four bytes - "11110000 10011111 10010010 10011001".

The next byte - "11100010" - begins with three 1's.
This means that the third character is represented by three bytes "11100010 10010011 10001010."
• If the sun sets at 5 then what time can I eat pizza
• 00110
• In the JS part, I typed in "Hello world" and all the bytes in the encoded sequence were starting with 0. I expected it to start with 110 as it's a language. Or have I not properly understood something ? Thanks for your answer :-)
• Hello world uses regular English characters that were part of the original ASCII character set that can be represented with only 7 bits. These original ASCII characters are represented in the UTF-8 encoding with one byte, with the first bit always set to zero to indicate that the character will be represented in only one byte.
• What if the byte begins with 10? How many bytes would it take to represent that character?
• Bytes beginning with 10 indicate that they are a continuation of a previous byte sequence. For example, a character encoded in 3 bytes looks like this: "1110xxxx 10xxxxxx 10xxxxxx". A byte that starts with 10 and is not part of such a sequence would be an error.