encoding - freeCodeCamp.org

What is base64 Encoding and Why is it Necessary?

freeCodeCamp — Tue, 28 Nov 2023 18:30:41 +0000

By Sergei Bachinin

In this article, we'll thoroughly explore base64 encoding. You'll learn how it came into being and why it's still so prevalent in modern systems.

Here's what we'll cover:

What is base64?
Why use base64?
When to use base64
A Case of SMTP
Are These Limitations Still Relevant Today?
How base64 Helps with These Limitations
Why Only 64 Characters?
Do I Have to Use base64 with HTTP/1.1?
Why is base64 Used for Data URLs?
Conclusion

What is base64?

Base64 is a way of transforming any data into a gibberish of digits and letters. Today it's widely used and taken for granted. Most of the time you use it because there is no choice: in many situations, only such text-like gibberish is considered valid.

At the same time, base64 is a costly instrument. It makes data about 33% larger in terms of memory usage. So base64 is one of these little things that make software slow. That's why you should use it only when it's absolutely necessary.

Again, typically you don't have a choice. But sometimes you do, and that's why it can be useful to understand how it works and what problems it is meant to solve.

This article is aimed at those who've already encountered base64 in their programming practice and may be wondering what on earth it is. You'll likely get more out of this guide if you have at least basic knowledge of data types and some understanding of how computer memory is organized at a binary level (bits, bytes and all that).

The encoding algorithm of base64 is straightforward. It takes as an input one binary sequence and ouputs another binary sequence. It doesn't care what the original bytes stood for – be it an image or a PDF or whatever. It just goes through the original binary stuff, splits it into chunks of 6 bits, and converts every chunk into a safe textual character (or, strictly speaking, a binary sequence that represents such a character). A "safe" character refers to one from a very limited set, which we'll discuss more later.

This image shows approximately how the text "Sun" is encoded to base64

Then at some point you have to decode this pseudo-text back into an original data – because base64 gibberish in itself is useless. Decoding performs similar binary operations but in reverse. The restored data is guaranteed to be exactly the same as before encoding.

I won't discuss the encoding/decoding algorithm in more depth here. Wikipedia explains it in exhaustive detail if you want to read more.

In practice, you (as a developer) will almost never deal with anything in binary itself when encoding your data. Instead, you'll use convenient wrappers that are available in almost any language, such as the btoa() and atob() functions in JavaScript. You can read more about them here.

Why Use base64?

What problems is base64 meant to solve? We have 3 main theories here:

Base64 is used because some systems are restricted to ASCII characters but are actually used for all kinds of data. Base64 can "camouflage" this data as ASCII and thus help this data pass validation.
Because some older systems think that data consists of 7-bit chunks (bytes), whereas modern ones use 8-bit bytes. This may lead to misunderstandings between systems. And base64 presumably can help with this – but it's not so obvious how.
Because some characters may have special meaning and this will differ from system to system. Base64 tackles this by using only the most "purely textual" characters from the ASCII set.

Many experts on base64 advocate one theory and turn down all other theories. It may seem we should be able to (or that we need to) pick the right one from the list – but actually they are all correct. Well, the "7-bit problem" is much less of a problem today but the other two are still relevant and base64 addresses them both.

When to Use base64

When do you need to present data as base64 gibberish? In a great many situations, of course. But let's understand the main principle.

The official spec (RFC4648) says that base64 "is used to store or transfer data in environments that, perhaps for legacy reasons, are restricted to ASCII data".

So you need base64 when:

Incompatible data is transmitted through the network. First of all it's a problem of emails – for example, encoding is necessary when you need to attach an image to a textual message. This was the reason why base64 was first introduced.
Incompatible data is stored in files or elsewhere. Often you need to embed non-textual data in a textual file like JSON, XML or HTML. Or to store something fancy in a brower cookie (and cookies must be only text).

The official spec continues that base64 also:

"makes it possible to manipulate objects with text editors". Apparently it means "rich text editors" such as Microsoft Word where you can combine text with images and other things. Do such programs use base64 for embedded content? That's possible, but let's leave it to more persistent researchers.

A Case of SMTP

SMTP is an age-old email protocol still prevalent today and used by Gmail, Outlook, and others. It defines how email messages must be transmitted between servers. Original SMTP rules were really strict, and by the end of 80s they were already a great nuisance.

First of all, SMTP allowed only plain text in the English language (which consists of only 128 characters known as ASCII). So some workarounds were necessary to send both non-English text and non-textual attachments.

Diagram showing email going between two people via servers using English text only

Base64 solved this problem by converting all data to "safe English text". This conversion is necessary only for the time being, and when the danger is over, data is converted back to normal.

This is basically a very ugly practice that is somewhat similar to contraband. Data is obfuscated (well, really) in order to cheat a system that doesn't allow this kind of data.

And by the way, the restrictions of SMTP were meant for safety in the first place. But it was long ago and today nobody needs this safety anymore. Instead everyone needs an unobtrusive channel to transport anything whatsoever. So the old rules act like a pesky bureaucracy. But that's what legacy systems are and you have to live with them.

Are These Limitations Still Relevant Today?

Yes, despite all the extensions and tricks.

Many restrictions of SMTP are relaxed thanks to various extensions. For instance, the "8BIT MIME" extension allows you to send email messages in 8-bit bytes and in characters other than ASCII. So in theory, today you can send a letter with an attachment without encoding.

But in practice it's still impossible to ignore the old restrictions. Because there are outdated email servers which didn't adopt new extensions. And you have to be able to communicate with them even if your own email server supports all the modern stuff.

Before sending a message to a certain email server, you first ask what kind of SMTP rules it supports. Does it implement an 8BIT MIME extension? If not, you will need to convert your message to the older format.

How base64 Helps with These Limitations

It's self-evident that base64 must be a solution to an "ASCII only" problem because it transforms everything to ASCII characters. But it becomes less obvious when you combine it with the "7 bits" problem. Because no matter what kind of characters you use, they must be somehow transmittable by both 7-bit and 8-bit channels, depending on the situation.

Experts usually say something like:

"Base64 transforms the binary sequence 01001101 01100001 01101110 (whatever it means) into text "TWFu".

Such statements can lead you to think that something binary becomes non-binary. In fact, all ASCII characters produced by base64 are ultimately bits and bytes, just like the original data. (But of course the amount and order of bits change).

Here is a bash command to get a binary representation of a string:

echo -n "TWFu" | xxd -b

It will tell you that "TWFu" is actually "01010100 01010111 01000110 01110101". But if every character is 8 bits long, a 7-bit channel may not recognize this TWFu as text. Apparently some additional binary juggling must take place to make it work for all channels.

Fortunately, with ASCII characters this binary juggling is easy. Because to store an ASCII character in memory you actually need only 7 bits. They can be fattened to 8 bits only if you need a conventional "octet". This is done simply by adding a "0" bit in front of the original seven. For example, "T" can be stored as both 1010100 and 01010100.

So, conversion between 7-bit and 8-bit ASCII characters is a matter of adding/removing the leftmost "0" bit. Apparently email servers have to perform this kind of stuff when talking to each other.

This image shows how a character "T" produced by base64 can be consumed by both 7- and 8-bit channels

So let's keep in mind that base64 in itself doesn't solve the "7 bit" problem. It just produces ASCII characters and this allows for a brisk conversion between 7 and 8 bits. But this conversion is the responsibility of a wider system, such as MIME (extension of SMTP).

Memory cost depends on byte length

By the way, if you use only 7 bits per character, then base64 must be less wasteful in terms of memory usage.

The main theory is that base64 causes a memory overhead of 33% (or 37% or whatever). But it seems to be correct only for an 8-bit scenario.

Because, as I previously explained, base64 converts every 6-bit chunk of original data into a single ASCII character. If such a character is 8 bits long, it means that you are wasting 2 bits per every original chunk which is about 33%, just as promised by the experts. But with 7-bit characters this loss must be about half the size.

Why Only 64 Characters?

Base64 would be overkill if it was made only to bypass the ASCII restriction.

ASCII is 128 characters long, and if you could use all of them, the encoding algorithm would be more memory-efficient. It's cumbersome to explain but, in short, with a 128-long alphabet, a single character could represent 7 (instead of only 6) bits of original data.

But the authors of base64 decided to use only half of the ASCII characters, and surely they had a very good reason for that.

Characters can be wrong (unsafe) for at least two reasons:

They can be invalid (forbidden by a system, like SMTP forbids anything beyond ASCII)
They can be valid but mistakenly recognized as a special character

In ASCII there are 30+ "control characters". They are not printable and are meant to cause some other effects. For example: "line feed", "backspace", "delete", "escape". Many of these commands are a legacy of some prehistoric devices like teleprinters.

Table showing 34 non-printable characters from ASCII set

Apparently you have to exclude all non-printable characters from the encoding alphabet. So you are left with some 90+ printable ones. But printable doesn't mean safe and reliable. They can also have special meaning in different systems. And a bunch of them have special meaning in SMTP, for example "@", "<", and ">". They will surely produce "some other effects" and it won't make you happy.

So it ended up with 64 chars – first of all, because 64 is easier to deal with algorithmically. And it looks like a really safe alphabet that can be used in a wide range of systems, not only SMTP.

Unfortunately it's not completely safe. There are only 62 characters that are guaranteed to have no special meaning in all systems. They are digits, English small letters, and English capital letters. You need 2 more characters for base64. And their choice differs from system to system.

The most popular candidates are "+" and "/" but still in some situations they will break stuff. For example, they have special meaning in URLs. That's why we have a "base64url" variant where the last two characters are "_" and "-".

Base64 does not fix the problem of "ambiguous" special characters

Here I'd like to briefly explain one misconception that often pops up in various discussions about base64.

Some special characters are treated differently by different systems. The most notorious of them are "line feed" and "carriage return". Both are concerned with line breaks. But different systems have different opinions on how to combine these two chars to produce an actual line break.

There is a widespread belief that base64 somehow helps to reconcile these differences.

But base64 has nothing to do with how data is interpreted – for example, how text is displayed on a screen. Because in order to display something you first need to decode it from base64 back to the original form. Decoded data will be exactly the same as it was before encoding. It will be the same sequence of bits with the same ambiguous characters as before.

So, again, base64 can only conceal special chars for the time being. It doesn't magically sanitize data from dangerous chars.

Do I Have to Use base64 with HTTP/1.1?

HTTP/1.1 is a text-based protocol. So you may ask whether non-textual data has to be encoded to transmit stuff over the Internet.

(Also, let's call it just HTTP. Keep in mind that HTTP/2 and /3 are binary protocols, so they are not in question).

HTTP actually allows all kinds of data in the body of a message. The body is not restricted to textual characters. So, for instance, an image can be sent in its original form, without jumbling its bits and bytes.

What is really "text-based" about HTTP is headers. Basically they are restricted to ASCII (this is not exactly true, but it's a good practice to use only ASCII).

Today HTTP headers are used in many different ways – and sometimes they have to carry non-ASCII stuff too.

For example, a basic HTTP authentication scheme suggests that you send username and password as part of the "Authentication" header. Username and password can contain a lot of incompatible stuff, so you have to encode them to safe textual chars. Base64 is recommended to use in such cases. For this reason some developers think that base64 is a kind of encryption (data protection) which is not true.

Why is base64 Used for Data URLs? Is There a More Efficient Encoding for This?

Data URLs are probably the most well-known use of base64 today. It's a way to inline various assets like images (not links to them but their actual code) into HTML or CSS files.

Note that Data URLs have nothing to do with the transmission of data. Base64 is used here not because HTTP protocol forbids any binary sequences. When you send HTML or CSS file over network, HTTP doesn't care what's inside these files. But HTML and CSS files have to be only text to be properly interpreted (displayed) by text editors and browsers.

It makes sense, but still it's a pity – because, again, base64 is expensive. This notorious 33% or 37% of memory overhead is especially annoying with Data URLs. In most cases it defeats their purpose entirely.

The purpose is to improve performance of course. You get an image without an extra HTTP request and thus save some milliseconds. But this performance gain is small and easily nullified by the extra bytes created by base64.

So why base64 is used for data URLs at all? Could we use some less wasteful encoding for that (that is, encoding that uses a wider alphabet and thus outputs shorter strings of characters)?

At present it's very unlikely, because browsers allow only URL-safe characters in Data URLs. And there aren't too many URL-safe characters – just a tiny bit more than 64. Why are we restricted to URL-safe characters? Because we insert the encoded data into places where browsers expect a URL.

In theory, browsers could be smarter and relax this limitation when necessary. So let's ask a broader question next.

Is there a better encoding for non-textual data in HTML?

Technically, Data URLs are not the only way to embed non-textual data in HTML files.

You can build your homemade solutions where non-textual data is inserted elsewhere in HTML markup (attributes, innerText, or whatever). In such cases, you are not restricted to a URL-safe alphabet. So in theory, you can use any characters that are allowed in HTML.

Most HTML files use the UTF-8 character set. It contains all Unicode characters, and it's more than 1 million. So it must be possible to create some baseNNN encoding where at least 256 Unicode characters will be used. Such encoding would have no (or almost no) memory overhead.

In practice, everything is much more complicated.

Diagram showing how few characters from UTF-8 are "really safe" and usable for a hypothetical baseNNN encoding

Why not encode stuff with Chinese characters, for example? Because they are too heavy. They take 3 or even 4 bytes of memory. That's how UTF-8 encoding works – it uses different number of bytes for different characters. And we are interested only in those that take 1 byte.

(You may consider using multi-byte characters for our baseNNN encoding, but let's not go into that. You need only 1-byte ones – let's take it as an axiom.)

How many 1-byte characters are there in UTF-8? Only 128, and it's a good old ASCII range. Can you take all of them? No, because, again, you need only printable characters. You really need to see them in text editors and dev tools and elsewhere.

Then, just like in case of SMTP, you have to exclude a bunch of visible characters because they have special meaning in HTML. For example, double quotes (") won't do because it can prematurely terminate the Data URL string. This won't work:

So a possible alphabet again shrinks to 80-90 characters. This, in theory, allows you to create another encoding that will use a bit less memory than base64.

Such encodings actually exist, for example base85 made in Adobe. It is more memory-efficient because it encodes 4 bytes of original data into 5 characters. But base85 is also much slower to compute, so its overall benefits are tiny, if any. And by the way it's not intended for web development and contains characters that can break things in HTML and CSS. (Though it must be possible to build a similar but web-friendly algorithm by swapping some characters.)

Can we find a better baseNNN for other kinds of text files (JSON, XML, and so on)? It seems unlikely, because these formats are predominantly UTF-8-encoded and this means roughly the same limitations as in case of HTML. Only the amount of special printable characters may differ (it's very small in JSON for example) but it's not a big deal.

Conclusion

Base64 was first introduced as a way to bypass a number of archaic restrictions imposed on email messages by the SMTP protocol.

Base64 allowed us to camouflage any data as text in order to pass validation when transmitted between email servers. It also ensured that this pseudo-text contained only safe characters, that is 1) only printable ones and 2) only those that have no special meaning in SMTP and (hopefully) in most other systems.

The final alphabet happened to be really narrow, and it allowed us to use base64 practically everywhere (but in slightly different variants). It helps in many cases where you have to mix textual and non-textual data.

Modern systems also use base64 despite its significant memory cost. At first glance this practice looks strange, because modern systems don't have these age-old restrictions that SMTP had. But it turns out that in most cases you still have very few cheap and non-special characters, and potential alternatives to base64 offer very small benefits.

How to Encode and Decode HTML Base64 using JavaScript – JS Encoding Example

Joel Olawanle — Wed, 08 Feb 2023 19:20:23 +0000

When building an application or writing a program, you may need to encode or decode with HTML Base64 in JavaScript.

This is possible thanks to two Base64 helper functions that are part of the HTML specification and are supported by all modern browsers.

In this article, you will learn about Base64 and how it works to convert binary data, regular strings, and lots more into ASCII text.

What is Base64?

Base64 is a group of binary-to-text encoding schemes representing binary data in ASCII string format. It is commonly used to encode data that needs to be stored or transmitted in a way that cannot be directly represented as text.

Base64 encoding works by mapping binary data to 64 characters from the ASCII character set. The 64 characters used in Base64 encoding are: A-Z, a-z, 0-9, +, and /.

The encoding process takes 3 bytes of binary data and maps it to 4 characters from the above set, such that a single character represents every 6 bits of binary data. The result is a string of ASCII characters that can be transmitted or stored as text.

Base64 decoding is the reverse process of encoding. It takes a Base64 encoded string and maps each character back to its 6-bit binary representation. The resulting binary data is a reconstruction of the original binary data encoded to Base64.

How to Encode and Decode HTML Base64 using JavaScript

To encode and decode in JavaScript, you will use the btoa() and atob() JavaScript functions that are available and supported by modern web browsers.

These JavaScript helper functions are named after old Unix commands for converting binary to ASCII (btoa) and ASCII to binary (atob).

You can encode a string to base64 in JavaScript using the btoa() function and decode a base64 string using atob() function. For example, if you have a string stored in a variable, as seen below, you can first encode it to Base64:

let myString = "Welcome to freeCodeCamp!";
let encodedValue = btoa(myString);
console.log(encodedValue); // V2VsY29tZSB0byBmcmVlQ29kZUNhbXAh

You can also decode the encodedValue back to its original form using the atob() function. This function takes the encoded value and decodes it from Base64:

let myString = "Welcome to freeCodeCamp!";
let encodedValue = btoa(myString);
let decodedValue = atob(encodedValue);
console.log(decodedValue); // Welcome to freeCodeCamp!

You now know how to encode and decode Base64 in JavaScript.

More JavaScript Encoding Examples

You can also encode binary data to Base64 encoded ASCII text in JavaScript using the btoa() function:

let binaryData = new Uint8Array([72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]);
let stringValue = String.fromCharCode.apply(null, binaryData);
console.log(stringValue); // "Hello World"

let encodedValue = btoa(stringValue);
console.log(encodedValue); // SGVsbG8gV29ybGQ=

In the above, you first converted Unicode values to characters and then encoded the string.

You can also decode the Base64 encoded ASCII text to binary data in JavaScript using the atob() function:

let encodedValue = "SGVsbG8gV29ybGQ=";
let binaryData = new Uint8Array(atob(encodedValue).split("").map(function (c) {
    return c.charCodeAt(0);
}));
console.log(binaryData); // Uint8Array [72, 101, 108, 108, 111, 32, 87, 111, 114, 108, 100]

Wrapping Up!

In this article, you have learned what Base64 means, how it works, and when to encode and decode in JavaScript.

Base64 is not intended to be a secure encryption method, nor is it intended to be a compression method, because encoding a string to Base64 typically results in a 33% longer output.

Base64 encoding is commonly used in JavaScript for situations like:

Storing and transmitting binary data as text.
Encrypting data where the encoded data is sent over an insecure channel and decoded on the other end. However, it should not be considered a secure encryption method, as it can easily be decoded.
Data transfer between systems with different character sets.
Storing binary data in a database.

Thanks for reading and have fun coding!

You can access over 185 of my articles by visiting my website. You can also use the search field to see if I've written a specific article.

Unicode Characters – What Every Developer Should Know About Encoding

Kealan Parr — Mon, 01 Mar 2021 16:01:00 +0000

If you are coding an international app that uses multiple languages, you'll need to know about encoding. Or even if you're just curious how words end up on your screen – yep, that's encoding, too.

I'll explain a brief history of encoding in this article (and I'll discuss how little standardisation there was) and then I'll talk about what we use now. I'll also cover some Computer Science theory you need to understand.

Introduction to Encoding

A computer only can understand binary. Binary is the language of computers, and is made up of 0's and 1's. There is nothing else allowed. One digit is called a bit, and a byte is 8 bits. So 8 0's or 1's make up one byte.

Everything eventually ends up as binary – programming languages, mouse moves, typing, and all the words on the screen.

If all the text you're reading was once binary too, then how do we turn binary into text? Let's look at what we used to do back in the beginning.

A Brief History of Encoding

In the early days of the internet, it was English only. We didn't need to worry about any other characters and the American Standard Code for Information Interchange (ASCII) was the character encoding that fit this purpose.

ASCII is a mapping, from binary to alphanumeric characters. So when the PC receives binary:

01001000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100

With ASCII it can translate that into "Hello world".

One byte (eight bits) was large enough to fit every English character, and some control characters too. Some of these control characters were used for instruments called teleprinters, so at the time they were useful (not so much now!)

But the control characters were things like 7 (111 in binary) that would make a bell sound on your PC, 8 (1000 in binary) that would print over the last character it just printed, or 12 (1100 in binary) that would clear a video terminal from all the text just written.

Computers at this time were using 8 bits for one byte (they didn't always), so there were no issues. We could store all our control characters, all our numbers, all the English characters and have some left! Because one byte can encode 255 characters, and ASCII only needed 127 characters. So we had 128 encodings that were unused.

Let's look at an ASCII table here to see every character. All lowercase and uppercase A-Z and 0-9 were encoded to binary numbers. Remember the first 32 are unprintable control characters.

ASCII Character Table

Can you see how it ends with 127? We have some spare room at the end.

Issues with ASCII

The spare characters were 127 through to 255. People began to think about what would be best to fill those remaining characters. But everyone had different ideas about what those final characters should be.

The American National Standards Institute (ANSI - don't get confused with ASCII) is a standards body for establishing standards across lots of different fields. They decided what everyone was doing with 0-127, which is what ASCII was already doing. But the rest were open.

No one was debating what 0-127 in the ASCII encoding was. The problem was with the spare ones.

Below is what the first IBM computers did with the 128-255 encodings for ASCII.

Some squiggles, some background icons, math operators and some accented characters like é.

But other computers didn't all follow this. And everyone wanted to implement their own encodings for the end of ASCII.

These different endings for ASCII were called code pages.

What are ASCII Code Pages?

Here's a collection of over 465 different codepages! You can see that there were multiple codepages EVEN for the same language. Greek and Chinese both have multiple codepages, for example.

So how on EARTH were we ever going to standardise this? Or make it work between differing languages? Between the same language with different codepages? In a non English language?

Chinese has over 100,000 different characters. We don't even have enough spare characters for Chinese, let alone agreeing that the final characters should be Chinese ones. This isn't looking so good.

This problem even has its own term: Mojibake.

It's garbled text you may sometimes see from decoding text, but using the wrong decoding. It means character transformation in Japanese.

Example of completely garbled text (mojibake).

This sounds a little insane...

Exactly! We will have zero chance of reliably interchanging data.

The internet is just a huge connection of computers around the world. Imagine if all these countries decided what they each thought the standards should be. If the Greek computers only accepted Greek and the English computers only sent English...? You would just be shouting into an empty cave. No one would understand you. And no one would be able to decode the nonsense.

ASCII wasn't fit for real life use. In a global, connected internet, we had to evolve, or else forever deal with hundreds of codepages.

�� Unless you �� fancied trying �� to �� read paragraphs like this. �֎֏0590֐��׀ׁׂ׃ׅׄ׆ׇ

Along Came Unicode

Unicode is sometimes called the Universal Coded Character Set (UCS), or even ISO/IEC 10646. But Unicode is its more common name.

But, this is where Unicode entered the scene to help solve the problems that encoding and code pages were causing.

Unicode is made up of lots of code points (mapping lots of characters from around the world to a key that all computers can reference.) A collection of code points is called a character set - which is what Unicode is.

We can map something abstract to a letter we want to reference. And it does every character! Even Egyptian Hieroglyphs.

Some people did all the hard work of mapping what each character would be (in all the languages) to a key that we could all access. They look like this:

"Hello World"

U+0048 : LATIN CAPITAL LETTER H

U+0065 : LATIN SMALL LETTER E
U+006C : LATIN SMALL LETTER L
U+006C : LATIN SMALL LETTER L
U+006F : LATIN SMALL LETTER O
U+0020 : SPACE [SP]
U+0057 : LATIN CAPITAL LETTER W
U+006F : LATIN SMALL LETTER O
U+0072 : LATIN SMALL LETTER R
U+006C : LATIN SMALL LETTER L
U+0064 : LATIN SMALL LETTER D

The U+ lets us know it's the Unicode standard, and the number is what results when the binary get's transformed to numbers. It uses hexadecimal notation which is just a simpler way of representing binary numbers. You don't have to worry too much about the hexadecimal here, though.

Here's a link where you can type whatever you want into the text box, and see the Unicode character encoding. Or look at all 143,859 Unicode character points here. You can also see where each character is from in the world!

I just want to be clear. At this point we have a big dictionary of code points mapping to characters. A really big character set. Nothing more.

There's one final ingredient we need to add to our mix.

Unicode Transform Protocol (UTF)

UTF is a way we encode Unicode code points. The UTF encodings are defined by the Unicode standard, and are able to encode every single Unicode code point we need.

But there are different types of UTF standards. They differ depending on the amount of bytes used to encode one code point. It also depends on whether you're using UTF-8 (one byte per code point), UTF-16 (two bytes per code point) or UTF-32 (four bytes per code point).

If we have these different encodings, how do we know which encoding a file will use? There's a thing called a Byte Order Mark (BOM) - sometimes referred to as an Encoding Signature. The BOM is a two-byte marker at the beginning of a file that tells what encoding the file is using.

UTF-8 is the most used on the internet, and is also specified in HTML5 as the preferred encoding for new documents, so I'll spend the most time explaining this one.

You can see in the diagram even from 2012, UTF-8 was widely becoming the most used encoding. And for the web it still is.

_W3 diagram to show how well used UTF-8 is used on a variety of websites._

What is UTF-8 and How Does it Work?

UTF-8 encodes all the Unicode code points from 0-127 in 1 byte (the same as ASCII). This means that if you were coding your program using ASCII, and your users used UTF-8, they wouldn't notice anything was wrong. Everything would just work.

Just remember how strong a selling point this is. We needed to remain ASCII backwards compatible while UTF-8 was being implemented and used by everyone. It doesn't break anything currently being used.

Because it's called UTF-8, remember that's the minimum number of bits (8 bits being one byte!) that a code point will be. There are other Unicode characters that are stored in multiple bytes (up to 6 bytes depending on the character). This is what people mean when the encoding is called variable length.

It might be more, depending on the language. English is 1 byte. European (Latin), Hebrew and Arabic are represented with 2 bytes. 3 bytes are used for Chinese, Japanese, Korean and other Asian characters. You get the idea.

When you need a character to span more than one byte, you have a bit combination to identify a continuation sign, saying this character is continued over the next several bytes. So you’ll still only use one byte per character for English, but if you need a document to contain some foreign characters, you can do that too.

And now, wonderfully, we can all agree on what the Sumerian cuneiform characters encoding is (𒀵 𒁷𒂅 𒐤), as well as some emoji's 😉😉 so we can all communicate!

The high level overview is: You first read the BOM so you know your encoding. You decode the file into Unicode code points, and then represent the characters from the Unicode character set into characters drawn onto the screen.

A Final Word About UTF

Remember, encoding is key. If I send the complete wrong encoding you can't read anything. Be aware of it when receiving or sending data. Often it is abstracted away in the tools you use everyday, but as programmers it's important to understand what is happening under the hood.

How do we specify our encodings, then? Because HTML is written in English, and almost all encodings can deal with English fine. We can embed it right at the top in the section.

<html lang="en">
<head>
  <meta charset="utf-8">
head>

It's important to do this at the very start of the , as the parsing of the HTML may have to start again if the encoding it's currently using is wrong.

We also can get the encoding from the Content-Type header from the HTTP request/ response.

If an HTML document doesn't contain the encoding tag, the HTML5 spec has some interesting ways it can guess the encoding called BOM sniffing. This is where it guesses the encoding from the Byte Order Mark (BOM) we discussed earlier.

So is that it?

Unicode isn't finished. Like any standard, we add, remove and make new proposals to the standard. No specification is ever considered "complete".

There are generally 1 or 2 release a year, and you can find them here.

Recently I read about a very interesting bug around Twitter rendering Russian Unicode characters incorrectly.

If you have read this far, congratulations – it's a lot to digest.

I would encourage you to do one last piece of homework.

Look at how broken websites can really be when the encoding is wrong. I used this Google Chrome extension and changed my encoding and tried to read webpages. The message was completely unclear. Try and read this article. Try and navigate Wikipedia. See Mojibake for yourself.

It helps to see how important encoding truly is.

Conclusion

In my time spent researching and trying to simplify this article, I learned about Michael Everson. Since 1993, he has proposed over 200 Unicode changes, and has added thousands of characters to the standard. As of 2003, he was credited as the leading contributor of Unicode proposals. He is one huge reason why Unicode is what it is. Very impressive, and he has done a great deal for the Internet as we know it.

I hope this has explained a good overview of why we need encodings, what problems encoding solves, and what happens when it goes wrong.

I share my writing on Twitter if you enjoyed this article and want to see more.