
## AP®︎/College Computer Science Principles


Lesson 4: Storing text in binary

# Storing text in binary

Computers store more than just numbers in binary. But how can binary numbers represent non-numbers such as letters and symbols?
As it turns out, all it requires is a bit of human cooperation. We must agree on encodings: mappings from characters to binary numbers.

## A simple encoding

For example, what if we wanted to store the following symbols in binary?
☮️❤️😀
We can invent this simple encoding:
| Binary | Symbol |
| --- | --- |
| `01` | ☮️ |
| `10` | ❤️ |
| `11` | 😀 |
Let's call it the HPE encoding. It helps for encodings to have names, so that programmers know they're using the same encoding.
If a computer program needs to store the ❤️ symbol in computer memory, it can store `10` instead. When the program needs to display `10` to the user, it can remember the HPE encoding and display ❤️ instead.
Computer programs and files often need to store multiple characters, which they can do by stringing each character's encoding together.
A program could write a file called "msg.hpe" with this data:
`010111111010`
A program on another computer that understands the HPE encoding can then open "msg.hpe" and display the sequence of symbols.
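As a sketch, the HPE encoding could be implemented as a lookup table in JavaScript. The names `HPE`, `decodeHPE`, and `encodeHPE` are invented for this example:

```javascript
// A sketch of the HPE encoding: a lookup table from bits to symbols.
const HPE = { "01": "☮️", "10": "❤️", "11": "😀" };

// Decode a bit string by reading it two bits at a time.
function decodeHPE(bits) {
  let result = "";
  for (let i = 0; i < bits.length; i += 2) {
    result += HPE[bits.slice(i, i + 2)];
  }
  return result;
}

// Encode an array of symbols back into a bit string.
// (An array avoids accidentally splitting multi-code-point emoji.)
function encodeHPE(symbols) {
  const reverse = {};
  for (const [bits, symbol] of Object.entries(HPE)) {
    reverse[symbol] = bits;
  }
  return symbols.map((s) => reverse[s]).join("");
}

console.log(decodeHPE("1011")); // ❤️😀
```

Note that both programs must share the same table; if the receiving program used a different mapping, the same bits would display as different symbols.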
Check your understanding
What sequence would the program display?

The HPE encoding only uses 2 bits, so that limits how many symbols it can represent.
Check your understanding
How many symbols can the 2-bit encoding represent?

However, with more bits of information, an encoding can represent enough letters for computers to store messages, documents, and webpages.

## ASCII encoding

ASCII was one of the first standardized encodings. It was invented back in the 1960s when telegraphy was the primary form of long-distance communication, but is still in use today on modern computing systems. ${}^{1}$
Teletypists would type messages on teleprinters. The teleprinter would then use the ASCII standard to encode each typed character into binary and then store or transmit the binary data.
A chart from a 1972 teleprinter manual lists the 128 ASCII codes; the column heading indicates the first 3 bits of each code and the row heading indicates the final 4 bits.
Each ASCII character is encoded in binary using 7 bits. The very first character is "NUL", encoded as `0000000`.
The first 32 codes represent "control characters," characters which cause some effect besides printing a letter. "BEL" (encoded in binary as `0000111`) caused an audible bell or beep. "ENQ" (encoded as `0000101`) represented an enquiry, a request for the receiving station to identify itself.
The control characters were originally designed for teleprinters and telegraphy, but many have been re-purposed for modern computers and the Internet, especially "CR" and "LF". "CR" (`0001101`) represented a "carriage return" on teleprinters, moving the printing head to the start of the line. "LF" (`0001010`) represented a "line feed", moving the printing head down one line. Modern Internet protocols, such as HTTP, FTP, and SMTP, use a combination of "CR" + "LF" to represent the end of lines.
The remaining 96 ASCII characters look much more familiar.
Here are the first 8 uppercase letters:
| Binary | Character |
| --- | --- |
| `1000001` | A |
| `1000010` | B |
| `1000011` | C |
| `1000100` | D |
| `1000101` | E |
| `1000110` | F |
| `1000111` | G |
| `1001000` | H |
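Because the first 128 character codes in JavaScript strings match ASCII, the table above can be reproduced with a short sketch. The function names `toAscii7` and `fromAscii7` are invented for this example, and the input is assumed to contain only ASCII characters:

```javascript
// Sketch: convert ASCII text to 7-bit binary codes and back.
function toAscii7(text) {
  return Array.from(text)
    .map((ch) => ch.charCodeAt(0).toString(2).padStart(7, "0"))
    .join(" ");
}

function fromAscii7(bits) {
  return bits
    .split(" ")
    .map((b) => String.fromCharCode(parseInt(b, 2)))
    .join("");
}

console.log(toAscii7("AB")); // 1000001 1000010
```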
Following the ASCII standard, we can encode a four-letter message into binary:
`1000011 1001000 1000101 1000110`
Check your understanding
What word does that ASCII-encoded binary data represent?

There are several problems with the ASCII encoding, however.
The first big problem is that ASCII only includes letters from the English alphabet and a limited set of symbols.
A language that uses fewer than 128 characters could come up with its own version of ASCII to encode text in just that language, but what about a text file with characters from multiple languages? ASCII couldn't encode a string like: "Hello, José, would you care for Glühwein? It costs 10 €".
And what about languages with thousands of logograms? ASCII could not encode enough logograms to cover a Chinese sentence like "你好，想要一盘饺子吗？十块钱。"
The other problem with the ASCII encoding is that it uses 7 bits to represent each character, whereas computers typically store information in bytes—units of 8 bits—and programmers don't like to waste memory.
When the earliest computers first started using ASCII to encode characters, different computers would come up with various ways to utilize the final bit. For example, HP computers used the eighth bit to represent characters used in European countries (e.g. "£" and "Ü"), TRS-80 computers used the bit for colored graphics, and Atari computers used the bit for inverted white-on-black versions of the first 128 characters. ${}^{2}$
The result? An "ASCII" file created in one application might look like gobbledygook when opened in another "ASCII"-compatible application.
Computers needed a new encoding, an encoding based on 8-bit bytes that could represent all the languages of the world.

### Unicode

But first, how many characters do you even need to represent the world's languages? Which characters are basically the same across languages, even if they have different sounds?
In 1987, a group of computer engineers attempted to answer those questions. They eventually came up with Unicode, a universal character set which assigns a "code point" (a number, conventionally written in hexadecimal) and a name to each character. ${}^{3}$
For example, the character "ą" is assigned the code point "U+0105" and named "Latin Small Letter A with Ogonek". A character that looks like "ą" appears in 13 languages, such as Polish and Lithuanian. Thus, according to Unicode, the "ą" in the Polish word "robią" and the "ą" in the Lithuanian word "aslą" are both the same character. Unicode saves space by unifying characters across languages.
But there are still quite a few characters to encode. The Unicode character set started with 7,129 named characters in 1991 and has grown to 137,929 named characters in 2019. The majority of those characters describe logograms from Chinese, Japanese, and Korean, such as "U+6728" which refers to "木". It also includes over 1,200 emoji symbols ("U+1F389" = "🎉"). ${}^{4}$
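Many languages expose these code points directly. As a quick sketch in JavaScript, using the two characters mentioned above:

```javascript
// Each character has a Unicode code point, a number usually written in hex.
// "木" has code point U+6728; "🎉" has code point U+1F389.
console.log("木".codePointAt(0).toString(16)); // "6728"

// Going the other way: build a character from its code point.
console.log(String.fromCodePoint(0x1f389)); // "🎉"
```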
Unicode is a character set, but it is not an encoding. Fortunately, another group of engineers tackled the problem of efficiently encoding Unicode into binary.

### UTF-8

In 1992, computer scientists invented UTF-8, an encoding that is compatible with ASCII encoding but also solves its problems. ${}^{5}$
UTF-8 can describe every character from the Unicode standard using either 1, 2, 3, or 4 bytes.
When a computer program is reading a UTF-8 text file, it knows how many bytes represent the next character based on how many 1 bits it finds at the beginning of the byte.
| Number of bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
| --- | --- | --- | --- | --- |
| 1 | `0xxxxxxx` | | | |
| 2 | `110xxxxx` | `10xxxxxx` | | |
| 3 | `1110xxxx` | `10xxxxxx` | `10xxxxxx` | |
| 4 | `11110xxx` | `10xxxxxx` | `10xxxxxx` | `10xxxxxx` |
If there are no 1 bits in the prefix (that is, if the first bit is a 0), the character is represented by a single byte. The remaining 7 bits of that byte represent the original 128 ASCII characters. That means a sequence of ASCII characters, stored one per byte with a leading 0, is also a valid UTF-8 sequence.
Two bytes beginning with 110 are used to encode the rest of the characters from Latin-script languages (e.g. Spanish, German) plus other languages such as Greek, Hebrew, and Arabic. Three bytes starting with 1110 encode most of the characters for Asian languages (e.g. Chinese, Japanese, Korean). Four bytes starting with 11110 encode everything else, from rarely used historical scripts to the increasingly commonly used emoji symbols.
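The rule in the table above can be sketched in a few lines of JavaScript: count the leading 1 bits of the first byte to learn how many bytes the character occupies. The function name `utf8Length` is invented for this example:

```javascript
// Determine how many bytes a UTF-8 character occupies,
// based on the bit pattern of its first (lead) byte.
function utf8Length(firstByte) {
  if ((firstByte & 0b10000000) === 0b00000000) return 1; // 0xxxxxxx
  if ((firstByte & 0b11100000) === 0b11000000) return 2; // 110xxxxx
  if ((firstByte & 0b11110000) === 0b11100000) return 3; // 1110xxxx
  if ((firstByte & 0b11111000) === 0b11110000) return 4; // 11110xxx
  throw new Error("Not a valid UTF-8 lead byte");
}

console.log(utf8Length(0b01000001)); // 1 (an ASCII byte)
console.log(utf8Length(0b11100110)); // 3 (start of a 3-byte character)
```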
Check your understanding
According to the UTF-8 standard, how many characters are represented by these 8 bytes?
`01001001 11110000 10011111 10010010 10011001 11100010 10010011 10001010`

Most modern programming languages have built-in support for UTF-8, so most programmers never need to know exactly how to convert from characters to binary.
✏️ Try out using JavaScript to encode strings in UTF-8 in the form below. Play around with multiple languages and symbols.
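If the interactive form isn't available, the same experiment works with `TextEncoder`, which is built into modern browsers and Node.js. This sketch encodes a string containing a 1-byte, 2-byte, 3-byte, and 4-byte character:

```javascript
// TextEncoder converts a string into its UTF-8 bytes.
// The string below is "Aé木🎉", written with escapes to be explicit.
const bytes = new TextEncoder().encode("A\u00E9\u6728\u{1F389}");
console.log(bytes.length); // 10: A uses 1 byte, é 2, 木 3, 🎉 4

// Print each byte in binary to see the lead-bit patterns from the table.
console.log(
  Array.from(bytes, (b) => b.toString(2).padStart(8, "0")).join(" ")
);
```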
The UTF-8 encoding standard is now the dominant encoding of HTML files on the web, accounting for 94.5% of webpages as of December 2019. ${}^{6}$
🔎 If you right click and select "view page source" on this webpage right now, you can search for the string "utf-8" and see that this webpage is encoded as UTF-8.
Generally, a good encoding is one that can represent the maximum amount of information with the fewest bits. UTF-8 is a great example of that, since it can encode common English letters with just 1 byte but is flexible enough to encode thousands of letters with additional bytes.
UTF-8 is only one possible encoding, however. UTF-16 and UTF-32 are alternative encodings that are also capable of representing all Unicode characters. There are also language specific encodings such as Shift-JIS for Japanese. Computer programs can use the encoding that best suits their needs and constraints.

🙋🏽🙋🏻‍♀️🙋🏿‍♂️Do you have any questions about this topic? We'd love to answer— just ask in the questions area below!

## Want to join the conversation?

• This article was very informative and helpful overall. However, I got confused during the JavaScript part, where I had to type in a number and it would convert that number to UTF-8. For example, I typed in the number 1 and the UTF-8 code translated it to 00110001, which is neither a binary conversion nor a hexadecimal conversion. I am not understanding the correlation between these two values. Can you please explain? Thank you so much!
(20 votes)
• When you typed in the number '1', it was being represented as a character rather than a numeric 1. Therefore, it will be stored in binary using its ASCII representation. The character '1' is encoded as the decimal number 49 per ASCII standards, and 49 represented in binary is 00110001. I hope that helps to answer your question!
(85 votes)
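That answer can be checked directly in JavaScript:

```javascript
// The typed character "1" is stored by its character code, not its
// numeric value. Its ASCII code is 49, which is 00110001 in binary.
const ch = "1";
console.log(ch.charCodeAt(0)); // 49
console.log(ch.charCodeAt(0).toString(2).padStart(8, "0")); // "00110001"
```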
• How does the computer know if we are trying to represent number or letters, etc. in binary? Do we have to put a certain thing before the binary code in order to tell it what to convert it to?
(9 votes)
• Yes exactly, you (or your compiler or your program) always have to explicitly tell the computer what you want it to do with the bit string.
(15 votes)
• Why are numbers read backwards in binary ie right to left but words and symbols appear to be left to right? I noticed this when doing the emoji question peace signs, smiley faces and hearts. I originally thought the answer would be Heart Heart Smiley Smiley Peace Peace.
(4 votes)
• Whether the binary data is stored "left to right" or "right to left" actually depends on the machine that is storing the data. Some machines use "big-endian" (storing the most significant byte first) and others use "little-endian" (storing the least significant byte first).
(14 votes)
• I don't quite understand why, according to UTF-8, those 8 bytes can only represent 3 characters?
(4 votes)
• The first byte - "01001001" - begins with a 0.
This means that the first character is represented by one byte.

The second byte - "11110000" - begins with four 1's. This means that the second character is represented by four bytes - "11110000 10011111 10010010 10011001".

The next byte - "11100010" - begins with three 1's.
This means that the third character is represented by three bytes "11100010 10010011 10001010."
(10 votes)
• In the JS part, I typed in "Hello world" and all the bytes in the encoded sequence were starting with 0. I expected it to start with 110 as it's a language. Or have I not properly understood something ? Thanks for your answer :-)
(5 votes)
• Hello world uses regular English characters that were part of the original ASCII character set that can be represented with only 7 bits. These original ASCII characters are represented in the UTF-8 encoding with one byte, with the first bit always set to zero to indicate that the character will be represented in only one byte.
(3 votes)
• Does HPE stand for something?
(6 votes)
• Likely after the initials of the names of the emojis, "Heart", "Peace symbol", "Emoji"(?)
(4 votes)
• I understand binary, but I don't understand bytes; it's a bit hard for me. I am trying to understand, though. Can someone explain it clearly? Thanks!
(6 votes)
• The first counting machines were called "abacuses": beads on a stick. If you want to count your cows, you can count one bead for each cow, but this takes long if you want to remember the total later. So people created a bead that represents 10 cows: you don't need 10 beads, just one bead that says ten. But if you write down "10" in English, you have a one and a zero. This does not mean that you have one cow and then zero cows; it means something else. You don't want to write 1+1+1+1+1+1... etc., you just write 10. In binary you have only 1 or 0. What if you have three ones? Then you write [1], then [10], then [11]. (Sometimes people start counting at zero; this is also right and depends on the case, so please do not get confused.) The numbers get longer, just like numerical ten and then numerical hundred. And at the amount of eight bits, people decided that this is a good amount of bits to call it a byte. It worked with how computers were built back then.
(3 votes)
• All the possible UTF-8 Sequences on a standard keyboard
00101101 00111101 00101011 01100000 00110001 00110010 00110011 00110100 00110101 00110110 00110111 00111000 00111001 00110000 01011111 00101001 00101000 00101010 00100110 01011110 00100101 00100100 11000010 10100011 00100010 00100001 11000010 10101100 00101100 00101110 00101111 00111011 00100111 00100011 01011011 01011101 00111100 00111110 00111111 00111010 01000000 01111110 01111011 01111101 01110001 01110111 01100101 01110010 01110100 01111001 01110101 01101001 01101111 01110000 01100001 01110011 01100100 01100110 01100111 01101000 01101010 01101011 01101100 01111010 01111000 01100011 01110110 01100010 01101110 01101101 01011100 01111100 01010001 01010111 01000101 01010010 01010100 01011001 01010101 01001001 01001111 01010000 01000001 01010011 01000100 01000110 01000111 01001000 01001010 01001011 01001100 01011010 01011000 01000011 01010110 01000010 01001110 01001101
or -=+`1234567890_)(*&^%$£"!¬,./;'#[]<>?:@~{}qwertyuiopasdfghjklzxcvbnm\|QWERTYUIOPASDFGHJKLZXCVBNM
(6 votes)