unicode - freeCodeCamp.org

What is Unicode? The Secret Language Behind Every Text You See

Manish Shivanandhan — Thu, 31 Jul 2025 13:50:08 +0000

Have you ever sent a message with an emoji? Read a blog in another language? Or copied some strange symbol from the internet?

All of these are possible because of something called Unicode.

Unicode is a powerful system that lets computers understand and show text in nearly any language, including fun stuff like emojis. 😃

In this article, we’ll break down what Unicode is, why it matters, and how it powers global communication.

The Problem Before Unicode
What Is Unicode?
How Does Unicode Work?
- What Are Unicode Encodings?
- Code Points, Characters, and Glyphs
Unicode in Programming
Why Unicode Matters
Conclusion

The Problem Before Unicode

Let’s rewind to the early days of computers when each country had its own way of showing text. These systems were called character encodings.

For example, English text used ASCII, while others used ISO-8859, Shift-JIS, and more.

But here’s the problem: the same number could mean different things in different systems.

For example, the number 0x41 meant the letter A in ASCII, but a number above 127 such as 0xE9 meant é in ISO-8859-1 and й in Windows-1251.
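To make the problem concrete, here is a minimal Python sketch that decodes one and the same byte under three legacy encodings. The byte 0xE9 and the three encodings are my own picks for illustration, and the helper name `decode_everywhere` is hypothetical.

```python
# One byte value, three legacy encodings, three different characters.
RAW = bytes([0xE9])

def decode_everywhere(raw: bytes) -> dict[str, str]:
    """Decode the same bytes under several single-byte code pages."""
    encodings = ["latin-1", "cp1251", "iso8859_7"]
    return {name: raw.decode(name) for name in encodings}

results = decode_everywhere(RAW)
for name, text in results.items():
    print(f"{name:>10}: {text}")
```

The byte never changes, only the table used to read it. That is why a document written on one system could look like nonsense on another.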

This caused chaos when sharing documents between systems. Special characters would turn into random symbols, and non-English languages were often unreadable.

It was clear that the world needed one universal system. Something that could handle all languages and symbols in a single, consistent way.

That’s where Unicode comes in.

What Is Unicode?

Unicode is a standard system that assigns a unique number, called a code point, to every character. It includes letters, numbers, emojis, symbols, and even invisible control characters.

Think of it like giving every character in every language its own ID number.

For example:

The capital letter A is given the code U+0041
The Greek letter Ω is U+03A9
The emoji 😀 is U+1F600

This means no matter what device, app, or country you’re in, the same code will always mean the same character.

How Does Unicode Work?

At its core, Unicode assigns a code point to each character.

Code points look like this: U+XXXX, where XXXX is a number written in hexadecimal (a base-16 system computers use).

But computers don’t store code points directly. They store bytes, the 1s and 0s under the hood. So Unicode needs a way to turn those code points into bytes. This is called encoding.

What Are Unicode Encodings?

Unicode gives every character a unique code point, but computers don’t store “U+1F600” directly – they store bytes. To convert these code points into bytes that computers can save or transmit, we need encodings.

There are three main ways to turn Unicode code points into bytes:

1. UTF-8 (Most common)

Uses 1 to 4 bytes.
Great for English and most symbols.
Saves space.
Works on the web and most systems.

2. UTF-16

Uses 2 or 4 bytes.
Used in Windows, Java, and some older systems.

3. UTF-32

Uses 4 bytes for everything.
Easy to work with, but uses more memory.

If you’re storing or sending text, the encoding decides how many bytes are used. Choosing UTF‑8 can save space, especially for English-heavy data. When you see garbled text or � symbols, it’s usually a mismatch between encoding and decoding.

Web servers, databases, and APIs often require you to specify the encoding to ensure multilingual text displays correctly. In short, knowing the difference between UTF‑8, UTF‑16, and UTF‑32 helps you prevent bugs, save storage, and build apps that handle text from any language reliably.

So, UTF-8 is often the best choice. It’s efficient, and it works nearly everywhere.
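If you want to check the byte counts yourself, a small Python sketch like the one below should do it. The sample characters and the helper name `encoded_sizes` are assumptions made for this example.

```python
SAMPLES = ["A", "Ω", "€", "😀"]

def encoded_sizes(ch: str) -> dict[str, int]:
    """Count the bytes one character needs in each encoding."""
    # The "-le" variants fix the byte order explicitly, so Python
    # does not prepend a byte order mark to the output.
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

For these four samples, UTF-8 grows from 1 to 4 bytes, UTF-16 uses 2 bytes until it reaches the emoji, and UTF-32 stays at 4 bytes throughout.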

Code Points, Characters, and Glyphs

Let’s break down the main parts of Unicode:

Code Point:

This is the number assigned to a character. For example:

U+0041 is the code point for A
U+20AC is for the Euro sign €
U+1F600 is for the smiley face 😀

Character:

The actual letter or symbol we see. For example, “A”, “Ω”, or “😎”.

Glyph:

This is the visual design of a character. For example, “A” in Arial looks different from “A” in Times New Roman, but the character is the same.

Unicode in Programming

Modern programming languages have embraced Unicode, making it easier than ever to build applications that support global audiences.

Whether you’re writing a command-line tool or building a web app, Unicode ensures your text renders correctly, no matter the language.

Take Python, for instance. It natively supports Unicode strings:

print("Welcome 😊")  # This works because Python uses Unicode under the hood

You can even mix languages and emojis in the same output without a problem:

print("こんにちは, friend! 🚀")

In JavaScript, Unicode enables developers to use characters from virtually any script:

console.log("नमस्ते");  // Prints “Namaste” in Hindi
console.log("مرحبا بالعالم");  // Arabic: "Hello, world"

Or even create multilingual UIs:

document.getElementById("greeting").textContent = "Bonjour, мир!";

Before Unicode, developers had to juggle different encodings like ASCII, which often led to corrupted text when files moved between systems. Now, thanks to Unicode, most languages, including Java, C#, Ruby, Go, and Rust, handle international text gracefully by default.

This shift means developers can write apps that support global users from day one. Whether you’re building a chat app, an international e-commerce site, or a multilingual blog – with Unicode, your code speaks every language.

Why Unicode Matters

Before Unicode, digital communication across languages was chaotic.

Different systems used different character sets, leading to garbled text, random boxes, or strings of question marks whenever someone typed in a non-Latin-based language. Unicode changed all of that.

With Unicode, you can now mix languages like Chinese and English in the same document without a problem. Whether you’re copying text between applications or transferring data across platforms, it just works.

This consistency has been a game-changer for building multilingual websites and applications. Developers no longer need to worry about separate encodings for different regions. A single, unified standard handles it all.

Unicode isn’t something most users think about, but it’s embedded in almost everything.

It powers the text you see on websites and in your email, your smartphone’s keyboard, and even the way you chat in online games. Social media posts, search queries, and programming languages all rely on Unicode.

Behind the scenes, the Unicode Consortium, made up of industry giants like Google, Apple, and Microsoft, regularly updates the standard. They decide which new characters and emojis make it into our digital vocabulary.

That’s why your favourite facepalm emoji or regional script exists. Someone proposed it, and Unicode made it happen.

Unicode isn’t just a technical convenience. It plays a direct role in how people engage with content.

Pages with broken symbols or unreadable characters had significantly lower engagement rates compared to cleanly rendered ones. It was a clear signal that readability isn’t just about aesthetics – it affects how long people stay and interact with your content.

That’s why even small encoding errors can have a real impact, especially on multilingual platforms or international blogs. Unicode silently keeps everything running smoothly.

Conclusion

Unicode is one of the unsung heroes of our digital world. Without it, the internet would still be a confusing mix of broken characters and language barriers. Because of Unicode, we can type “Hello 😊”, mix multiple languages in a single message, or build global apps that just work.

So the next time you post an emoji, read a message in a different script, or switch languages on your keyboard, take a moment to appreciate the invisible infrastructure behind it all. That’s Unicode, working quietly to make sure we stay connected, no matter what language we speak.


How to Use RegEx to Match Emoji – Discord Emotes Regular Expression Tutorial

Naomi Carrigan — Wed, 13 Jul 2022 23:04:07 +0000

Emoji are special Unicode characters that render pictographs. But these characters can be very tricky to identify with regular expressions (RegEx).

I was recently working on a Discord bot that had to detect the number of emotes in a given message. Today I'll share my process with you, including the newer JavaScript RegEx feature that finally solved the issues I was having.

How Unicode Emoji Work

The Unicode Consortium defines specific character codes for each emoji. They even maintain a helpful emoji chart as a reference. As an example, U+1F600 corresponds to the 😀 emoji.

Some emoji consist of multiple Unicode characters. This is most common with the flag emoji, which consists of the "regional indicators" that make up the country's two-letter country code.

This means the United States flag, 🇺🇸, consists of the two Unicode characters U+1F1FA and U+1F1F8, which correspond to the regional indicators U and S.

As a fun fact, it is up to the operating system to determine how to render an emoji. If you are on Windows, for example, you won't see a flag above. You'll see US.
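Here is a rough Python sketch that takes the flag apart into its code points. The escapes spell out the two regional indicators mentioned above, and `describe` is a hypothetical helper name.

```python
import unicodedata

US_FLAG = "\U0001F1FA\U0001F1F8"  # the two regional indicators for "US"

def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

print(len(US_FLAG))  # counts code points, not visible symbols
for line in describe(US_FLAG):
    print(line)
```

One visible flag reports a length of 2 here, which is a big part of why counting emoji by string length goes wrong.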

What are Discord Emotes?

One of Discord's many features is allowing communities to upload their own custom emotes. These emotes are identified by a name, and are used with the syntax :emote_name:.

However, the way they are identified by the client/API is different. Each emote has a unique ID, and they're sent in the message content as <:emote_name:1234567890>, or as <a:emote_name:1234567890> for animated emotes.

You can see this in Discord by putting a backslash \ before the emote and sending it. It will render as the raw <:emote_name:1234567890> text instead of the image.

How to Match Emoji and Emotes with RegEx

My original approach had two different RegEx phrases.

I was using /()?/g to catch the Discord emotes. This RegEx was successfully picking up Discord emotes, which was great!

I paired it with /:[^:\s]*(?:::[^:\s]*)*:/g to match the Unicode emoji, which only partially worked. The problem here was that I was seeing some emotes being counted twice – because the Discord RegEx was matching them. And others were being missed entirely.

So, with RegEx being what it is, I tried to make it more complex. <:[^:\s]+:\d+>||(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]|\ufe0f)/g was a bit more successful in matching the built-in emoji, but still wasn't perfect. This RegEx was designed to match unicode characters specifically.

I played around with trimming whitespace, using the word boundary \b character, and a few other tweaks, before finally giving up and doing some research.

And then I discovered Unicode Property Escapes. This RegEx feature allows you to add the u flag to your RegEx, unlocking the Unicode Properties denoted with the \p character.

With some additional research, I was able to find the Emoji Character properties – specifically, the Extended_Pictographic property. This enabled me to update the RegEx to a final, functional value:

/<a?:.+?:\d{18}>|\p{Extended_Pictographic}/gu

The \p{Extended_Pictographic} property seems to match Unicode emotes as well as character modifiers (often used for skin tones in emoji).
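For readers who work in Python rather than JavaScript, here is a minimal sketch of the Discord half of that pattern. As far as I know, Python's built-in `re` module has no `\p{...}` property escapes, so this only counts custom emotes. The 18-digit ID length is copied from the RegEx above and is an assumption about the IDs you will see. The function name is hypothetical.

```python
import re

# Custom and animated Discord emotes, mirroring the first half of the
# JavaScript pattern. The \d{18} ID length is an assumption.
EMOTE_RE = re.compile(r"<a?:.+?:\d{18}>")

def count_custom_emotes(message: str) -> int:
    """Count Discord custom emotes in raw message content."""
    return len(EMOTE_RE.findall(message))

sample = "hi <:wave:123456789012345678> <a:party:876543210987654321>"
print(count_custom_emotes(sample))
```

The sample message contains one static and one animated emote, so the count is 2.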

Conclusion

This RegEx is currently running in my production code and hasn't shown any issues yet.

Hopefully this article has helped you. If you are interested in exploring Unicode Property Escapes further, the Unicode Consortium offers a full list of the available values.

Happy Coding!

HTML Arrow – Symbol Unicode for Single and Double Arrows, Left and Right Arrows

freeCodeCamp — Fri, 17 Jun 2022 15:35:12 +0000

By Dillion Megida

What are Unicodes?

Unicode characters are standardized characters that represent many different things: symbols, letters, scripts, and many other forms of character combinations.

Unicode is adopted by many platforms (mobile and web) to make the same characters available everywhere.

Why are Unicodes useful?

Unicodes are useful because they provide a standard for character representations across different systems and languages.

Unicode also represents special characters that are not available in ASCII and helps us create consistent character displays on various platforms.

You can also apply styles (colors, sizes) like you would with other characters.

How to use Unicodes in HTML

You can write most Unicode symbols in two ways: using the numeric reference or using the entity name.

Numeric references are usually hard to read, but entity names are generally descriptive for the Unicode symbol you want to write.

For decimal numbers, you write them in between &# (an ampersand and a number sign) and ; (a semi-colon) like this (hexadecimal numbers use &#x in place of &#):

&#[NUMBER];

For entity names, you write them in between & and ; like this:

&[ENTITY];

This syntax is necessary so that HTML understands that the characters you're writing are not just text but Unicode symbol representations.
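If you want to check what a reference stands for without opening a browser, Python's standard `html.unescape` can decode it. This is only a quick sketch: the left arrow is my sample character (8592 in decimal is 2190 in hexadecimal), and `decode_reference` is a hypothetical helper name.

```python
import html

ARROW_REFERENCES = {
    "decimal": "&#8592;",
    "hexadecimal": "&#x2190;",
    "entity name": "&larr;",
}

def decode_reference(reference: str) -> str:
    """Turn an HTML character reference into the character it stands for."""
    return html.unescape(reference)

for kind, reference in ARROW_REFERENCES.items():
    print(f"{kind:>12}: {reference} -> {decode_reference(reference)}")
```

All three spellings decode to the same ← character.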

Unicode for Single and Double Left and Right Arrows

Now that we've briefly looked at what Unicode is and how to use it in HTML, let's look at some examples.

There are many symbols with Unicode representations you can use in HTML. For this article, I will share four examples of arrow symbols.

There are different arrow symbols and Unicode values for them. The arrows used here are just examples.

Left Arrow

For the single left arrow:

The decimal reference is 8592 and the entity name is larr. In HTML, it would be written like:

&#8592;

&larr;

This code will print this on a page: ←

For the double left arrow:

The decimal reference is 8647 and the entity name is llarr written as:

&#8647;

&llarr;

This will result in: ⇇

Right arrow

For the single right arrow:

The decimal reference is 8594 and the entity name is rarr written like:

&#8594;

&rarr;

The result: →

For the double right arrow:

The decimal reference is 8649 and the entity name is rrarr written as:

&#8649;

&rrarr;

This will result in: ⇉

You can use Unicode representations to print many other symbols in HTML. You can either use the numeric reference or the entity name of the symbol, as I have shown you in this article.

Dot Symbol – Bullet Point in HTML Unicode

Kolade Chris — Thu, 14 Apr 2022 00:45:10 +0000

In your HTML documents, you'll often need to make a list of items. And you can use bullet points for this purpose.

You can show bullet points with the Unicode character (or entity) for bullet points.

In this article, I will show you the Unicode and HTML entities for showing bullet points on a web page.

Towards the end of this article, I will also show you the 5-key combination with which you can type a big dot symbol.

The Unicode and HTML Entities for Bullet Points

The Unicode character for showing the dot symbol or bullet point is U+2022.

But to use this Unicode correctly, remove the U+ and replace it with ampersand (&), pound sign (#), and x. Then type the 2022 number in, and then add a semi-colon. So, it becomes &#x2022;.

It'll look like this:

<h1>Languages of the web</h1>
<h3>&#x2022; HTML</h3>
<h3>&#x2022; CSS</h3>
<h3>&#x2022; JavaScript</h3>
<h3>&#x2022; PHP</h3>

Apart from the &#x2022; Unicode character, you can also use the &bull; and &#8226; HTML entities to show bullets or dot symbols on the web page.

<h1>Languages of the web</h1>
<h3>&bull; HTML</h3>
<h3>&bull; CSS</h3>
<h3>&bull; JavaScript</h3>
<h3>&bull; PHP</h3>

The output remains the same:

The Keyboard Shortcut for Typing a Dot Symbol

To type the dot symbol on your keyboard, turn on the numeric keypad by pressing NumLk, hold Alt and press the 0, 1, 4, and 9 keys in succession.

If you don’t type the numbers with the numeric keypad, the dot symbol will not show.
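If you are curious where 0149 comes from, here is a small Python sketch. As far as I know, Alt codes typed with a leading zero are looked up in the Windows code page, which is Windows-1252 on many Western systems. That is an assumption about your setup, and `alt_code_to_char` is a hypothetical helper.

```python
ALT_CODE = 149  # the digits typed after the leading 0 in Alt+0149

def alt_code_to_char(code: int, code_page: str = "cp1252") -> str:
    """Decode a single byte value using a Windows code page."""
    return bytes([code]).decode(code_page)

dot = alt_code_to_char(ALT_CODE)
print(dot, f"U+{ord(dot):04X}")
```

Under that assumption, byte 149 decodes to the same U+2022 bullet used in the HTML examples.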

Thank you for reading!

Checkmark Symbol – HTML for Checkmark Unicode

Kolade Chris — Tue, 12 Apr 2022 19:17:04 +0000

If you take a look at your keyboard, you'll see that there’s no key for typing a checkmark.

You could decide to copy the checkmark symbol from the internet and paste it directly into your HTML code, but an easier way to do it is to use the appropriate Unicode character or HTML character entity.

If you are wondering what Unicode characters and HTML character entities are, they are both short pieces of text that represent emojis, symbols, and other characters.

In your web projects, you might want to show a checkmark for the purpose of consent or agreement. So, in this article, I will show you how to use the appropriate Unicode and HTML character entity to bring checkmarks into your web projects. I will also show you 4 other variations of the checkmark symbol.

The Unicode and HTML Characters for Checkmarks

The Unicode character for showing a checkmark is U+2713. If you decide to use this Unicode to show a checkmark in HTML and you type it in like that, what you type is shown like this:

 <h1>Languages of the web</h1>
 <h3>U+2713 HTML</h3>
 <h3>U+2713 CSS</h3>
 <h3>U+2713 JavaScript</h3>
 <h3>U+2713 PHP</h3>

So, how do you use the U+2713 Unicode to show the checkmark symbol?

Remove the U+ and replace it with an ampersand (&), pound sign (#), and x. Then type the 2713 in, and then a semi-colon. So, it becomes &#x2713;.

 <h1>Languages of the web</h1>
 <h3>&#x2713; HTML</h3>
 <h3>&#x2713; CSS</h3>
 <h3>&#x2713; JavaScript</h3>
 <h3>&#x2713; PHP</h3>
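The "remove the U+ and add &#x" recipe is mechanical, so it can be sketched in a few lines of Python. This is only an illustration: `code_point_to_references` is a hypothetical helper, and it also returns the decimal form of the same reference.

```python
import html

def code_point_to_references(code_point: str) -> tuple[str, str]:
    """Convert 'U+2713' notation into hex and decimal HTML references."""
    hex_digits = code_point.upper().removeprefix("U+")
    hex_reference = f"&#x{hex_digits};"
    decimal_reference = f"&#{int(hex_digits, 16)};"
    return hex_reference, decimal_reference

hex_ref, dec_ref = code_point_to_references("U+2713")
print(hex_ref, dec_ref)
print(html.unescape(hex_ref), html.unescape(dec_ref))
```

Both references decode to the same ✓ character, so you can use whichever form you find easier to read.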

You can also use the HTML character entity for a checkmark to show the checkmark symbol. This is &check; or &#10003;:

<h1>Languages of the web</h1>
<h3>&check; HTML</h3>
<h3>&check; CSS</h3>
<h3>&check; JavaScript</h3>
<h3>&check; PHP</h3>

Other Variations of the Checkmark Symbol

Apart from the traditional U+2713, &check; or &#10003;, there are other variations such as:

U+2714 – `&#x2714;` for a bolder checkmark

<h1>Languages of the web</h1>
<h3>&#x2714; HTML</h3>
<h3>&#x2714; CSS</h3>
<h3>&#x2714; JavaScript</h3>
<h3>&#x2714; PHP</h3>

U+2705 – `&#x2705;` for a white heavy checkmark

<h1>Languages of the web</h1>
<h3>&#x2705; HTML</h3>
<h3>&#x2705; CSS</h3>
<h3>&#x2705; JavaScript</h3>
<h3>&#x2705; PHP</h3>

U+2611 – `&#x2611;` for a ballot checkmark

<h1>Languages of the web</h1>
<h3>&#x2611; HTML</h3>
<h3>&#x2611; CSS</h3>
<h3>&#x2611; JavaScript</h3>
<h3>&#x2611; PHP</h3>

U+221A – `&#x221A;` for a square root checkmark

<h1>Languages of the web</h1>
<h3>&#x221A; HTML</h3>
<h3>&#x221A; CSS</h3>
<h3>&#x221A; JavaScript</h3>
<h3>&#x221A; PHP</h3>

Conclusion

This article has shown you the Unicode string for a checkmark, how to use it, and other variations of it.

You also learned about the equivalent HTML character entity for the checkmark symbol, in case you don’t want to show it with the Unicode string.

Now, go insert some checkmarks into your code.

Unicode Characters – What Every Developer Should Know About Encoding

Kealan Parr — Mon, 01 Mar 2021 16:01:00 +0000

If you are coding an international app that uses multiple languages, you'll need to know about encoding. Or even if you're just curious how words end up on your screen – yep, that's encoding, too.

I'll explain a brief history of encoding in this article (and I'll discuss how little standardisation there was) and then I'll talk about what we use now. I'll also cover some Computer Science theory you need to understand.

Introduction to Encoding

A computer can only understand binary. Binary is the language of computers, and is made up of 0's and 1's. There is nothing else allowed. One digit is called a bit, and a byte is 8 bits. So 8 0's or 1's make up one byte.

Everything eventually ends up as binary – programming languages, mouse moves, typing, and all the words on the screen.

If all the text you're reading was once binary too, then how do we turn binary into text? Let's look at what we used to do back in the beginning.

A Brief History of Encoding

In the early days of the internet, it was English only. We didn't need to worry about any other characters and the American Standard Code for Information Interchange (ASCII) was the character encoding that fit this purpose.

ASCII is a mapping, from binary to alphanumeric characters. So when the PC receives binary:

01001000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100

With ASCII it can translate that into "Hello world".
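You can try that translation yourself. Below is a minimal Python sketch that reads the same groups of bits and decodes them as ASCII. The helper name `bits_to_text` is made up for this example.

```python
BITS = (
    "01001000 01100101 01101100 01101100 01101111 00100000 "
    "01110111 01101111 01110010 01101100 01100100"
)

def bits_to_text(bits: str) -> str:
    """Convert space-separated 8-bit groups to text using ASCII."""
    byte_values = bytes(int(group, 2) for group in bits.split())
    return byte_values.decode("ascii")

print(bits_to_text(BITS))
```

Each group of eight bits becomes one number, and ASCII maps that number to one character.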

One byte (eight bits) was large enough to fit every English character, and some control characters too. Some of these control characters were used for instruments called teleprinters, so at the time they were useful (not so much now!)

The control characters were things like 7 (111 in binary), which would make a bell sound on your PC; 8 (1000 in binary), which would print over the last character it just printed; or 12 (1100 in binary), which would clear a video terminal of all the text just written.

Computers at this time were using 8 bits for one byte (they didn't always), so there were no issues. We could store all our control characters, all our numbers, all the English characters and have some left! One byte can encode 256 different values (0 to 255), and ASCII only needed 128 of them (0 to 127). So we had 128 encodings that were unused.

Let's look at an ASCII table here to see every character. All lowercase and uppercase A-Z and 0-9 were encoded to binary numbers. Remember the first 32 are unprintable control characters.

ASCII Character Table

Can you see how it ends with 127? We have some spare room at the end.

Issues with ASCII

The spare characters were 128 through to 255. People began to think about what would be best to fill those remaining characters. But everyone had different ideas about what those final characters should be.

The American National Standards Institute (ANSI - don't get confused with ASCII) is a standards body for establishing standards across lots of different fields. They decided what everyone was doing with 0-127, which is what ASCII was already doing. But the rest were open.

No one was debating what 0-127 in the ASCII encoding was. The problem was with the spare ones.

Below is what the first IBM computers did with the 128-255 encodings for ASCII.

Some squiggles, some background icons, math operators and some accented characters like é.

But other computers didn't all follow this. And everyone wanted to implement their own encodings for the end of ASCII.

These different endings for ASCII were called code pages.

What are ASCII Code Pages?

Here's a collection of over 465 different codepages! You can see that there were multiple codepages EVEN for the same language. Greek and Chinese both have multiple codepages, for example.

So how on EARTH were we ever going to standardise this? Or make it work between differing languages? Between the same language with different codepages? In a non English language?

Chinese has over 100,000 different characters. We don't even have enough spare characters for Chinese, let alone agreeing that the final characters should be Chinese ones. This isn't looking so good.

This problem even has its own term: Mojibake.

It's the garbled text you sometimes see when text is decoded using the wrong encoding. The word means character transformation in Japanese.

Example of completely garbled text (mojibake).
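Mojibake is easy to reproduce on purpose. The sketch below encodes text as UTF-8 and then decodes it as Latin-1, which is one common way this goes wrong. The sample text and the `garble` and `repair` helpers are assumptions for illustration only.

```python
ORIGINAL = "naïve café"

def garble(text: str, real: str = "utf-8", wrong: str = "latin-1") -> str:
    """Encode with one encoding, then decode with a different one."""
    return text.encode(real).decode(wrong)

def repair(garbled: str, real: str = "utf-8", wrong: str = "latin-1") -> str:
    """Undo garble() by reversing the mistaken decode step."""
    return garbled.encode(wrong).decode(real)

broken = garble(ORIGINAL)
print(broken)
print(repair(broken))
```

The repair only works here because Latin-1 maps every byte to some character, so nothing is lost along the way. With other wrong encodings the damage can be permanent.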

This sounds a little insane...

Exactly! We will have zero chance of reliably interchanging data.

The internet is just a huge connection of computers around the world. Imagine if all these countries decided what they each thought the standards should be. If the Greek computers only accepted Greek and the English computers only sent English...? You would just be shouting into an empty cave. No one would understand you. And no one would be able to decode the nonsense.

ASCII wasn't fit for real life use. In a global, connected internet, we had to evolve, or else forever deal with hundreds of codepages.

�� Unless you �� fancied trying �� to �� read paragraphs like this. �֎֏0590֐��׀ׁׂ׃ׅׄ׆ׇ

Along Came Unicode

This is where Unicode entered the scene to help solve the problems that encoding and code pages were causing.

Unicode is sometimes called the Universal Coded Character Set (UCS), or even ISO/IEC 10646. But Unicode is its more common name.

Unicode is made up of lots of code points (mapping lots of characters from around the world to a key that all computers can reference.) A collection of code points is called a character set - which is what Unicode is.

We can map something abstract to a letter we want to reference. And it does every character! Even Egyptian Hieroglyphs.

Some people did all the hard work of mapping what each character would be (in all the languages) to a key that we could all access. They look like this:

"Hello World"

U+0048 : LATIN CAPITAL LETTER H
U+0065 : LATIN SMALL LETTER E
U+006C : LATIN SMALL LETTER L
U+006C : LATIN SMALL LETTER L
U+006F : LATIN SMALL LETTER O
U+0020 : SPACE [SP]
U+0057 : LATIN CAPITAL LETTER W
U+006F : LATIN SMALL LETTER O
U+0072 : LATIN SMALL LETTER R
U+006C : LATIN SMALL LETTER L
U+0064 : LATIN SMALL LETTER D

The U+ lets us know it's the Unicode standard, and the number is what results when the binary gets transformed to numbers. It uses hexadecimal notation, which is just a simpler way of representing binary numbers. You don't have to worry too much about the hexadecimal here, though.
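A listing like the one above can be generated with Python's standard `unicodedata` module. This is a small sketch, and `code_point_table` is a hypothetical helper name.

```python
import unicodedata

def code_point_table(text: str) -> list[str]:
    """Build 'U+XXXX : NAME' rows, one per code point in text."""
    rows = []
    for ch in text:
        # Some code points (control characters, for example) have no name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append(f"U+{ord(ch):04X} : {name}")
    return rows

for row in code_point_table("Hello World"):
    print(row)
```

Python leaves out the [SP] hint after SPACE, but otherwise the rows follow the same code point and name layout.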

Here's a link where you can type whatever you want into the text box, and see the Unicode character encoding. Or look at all 143,859 Unicode character points here. You can also see where each character is from in the world!

I just want to be clear. At this point we have a big dictionary of code points mapping to characters. A really big character set. Nothing more.

There's one final ingredient we need to add to our mix.

Unicode Transformation Format (UTF)

UTF is a way we encode Unicode code points. The UTF encodings are defined by the Unicode standard, and are able to encode every single Unicode code point we need.

But there are different types of UTF standards. They differ in the size of the unit they work with and in how many bytes one code point takes: UTF-8 uses 8-bit units (1 to 4 bytes per code point), UTF-16 uses 16-bit units (2 or 4 bytes per code point), and UTF-32 always uses 4 bytes per code point.

If we have these different encodings, how do we know which encoding a file will use? There's a thing called a Byte Order Mark (BOM) - sometimes referred to as an Encoding Signature. The BOM is an optional marker of two to four bytes (depending on the encoding) at the beginning of a file that tells what encoding the file is using.
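To see what those markers look like, here is a rough Python sketch that checks the first bytes of some data against the BOM constants in the standard `codecs` module. The `sniff_bom` helper is hypothetical. A file without a BOM simply returns None, so a real program would need a fallback.

```python
import codecs
from typing import Optional

# Longer marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if absent."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None

print(sniff_bom("hi".encode("utf-8-sig")))  # UTF-8 with a BOM
print(sniff_bom(b"hi"))                     # no BOM at all
```

The order of the checks matters. If UTF-16 were tested first, a UTF-32 little-endian file would be misread, because its mark starts with the same two bytes.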

UTF-8 is the most used on the internet, and is also specified in HTML5 as the preferred encoding for new documents, so I'll spend the most time explaining this one.

You can see in the diagram that even in 2012, UTF-8 was becoming the most used encoding. And for the web it still is.

_W3 diagram showing how widely UTF-8 is used across a variety of websites._

What is UTF-8 and How Does it Work?

UTF-8 encodes all the Unicode code points from 0-127 in 1 byte (the same as ASCII). This means that if you were coding your program using ASCII, and your users used UTF-8, they wouldn't notice anything was wrong. Everything would just work.

Just remember how strong a selling point this is. We needed to remain ASCII backwards compatible while UTF-8 was being implemented and used by everyone. It doesn't break anything currently being used.

Because it's called UTF-8, remember that's the minimum number of bits (8 bits being one byte!) that a code point will be. There are other Unicode characters that are stored in multiple bytes (up to 4 bytes depending on the character). This is what people mean when the encoding is called variable length.

It might be more, depending on the language. English is 1 byte. European (Latin), Hebrew and Arabic are represented with 2 bytes. 3 bytes are used for Chinese, Japanese, Korean and other Asian characters. Emoji and other characters outside the Basic Multilingual Plane take 4 bytes. You get the idea.

When you need a character to span more than one byte, you have a bit combination to identify a continuation sign, saying this character is continued over the next several bytes. So you’ll still only use one byte per character for English, but if you need a document to contain some foreign characters, you can do that too.
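The "bit combination" is visible if you print the bytes in binary. In UTF-8, a single-byte character starts with 0. The first byte of a longer sequence starts with 110, 1110 or 11110, depending on its length, and every continuation byte starts with 10. Here is a short Python sketch with sample characters of my own choosing. `utf8_bits` is a hypothetical helper.

```python
def utf8_bits(ch: str) -> str:
    """Show each UTF-8 byte of a character as eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in ch.encode("utf-8"))

for ch in ["A", "é", "€", "😀"]:
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```

For example, € comes out as 11100010 10000010 10101100: a lead byte announcing three bytes, followed by two continuation bytes.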

And now, wonderfully, we can all agree on what the Sumerian cuneiform characters encoding is (𒀵 𒁷𒂅 𒐤), as well as some emoji 😉😉 so we can all communicate!

The high level overview is: You first read the BOM (if there is one) so you know your encoding. You decode the file into Unicode code points, and then the characters from the Unicode character set are drawn onto the screen.

A Final Word About UTF

Remember, encoding is key. If I send the completely wrong encoding you can't read anything. Be aware of it when receiving or sending data. Often it is abstracted away in the tools you use everyday, but as programmers it's important to understand what is happening under the hood.

How do we specify our encodings, then? HTML is written in English, and almost all encodings can deal with English fine, so we can embed it right at the top in the <head> section.

<html lang="en">
<head>
  <meta charset="utf-8">
</head>

It's important to do this at the very start of the <head>, as the parsing of the HTML may have to start again if the encoding it's currently using is wrong.

We can also get the encoding from the Content-Type header of the HTTP request/response.

If an HTML document doesn't contain the encoding tag, the HTML5 spec has some interesting ways it can guess the encoding called BOM sniffing. This is where it guesses the encoding from the Byte Order Mark (BOM) we discussed earlier.

So is that it?

Unicode isn't finished. Like any standard, we add, remove and make new proposals to the standard. No specification is ever considered "complete".

There are generally 1 or 2 releases a year, and you can find them here.

Recently I read about a very interesting bug around Twitter rendering Russian Unicode characters incorrectly.

If you have read this far, congratulations – it's a lot to digest.

I would encourage you to do one last piece of homework.

Look at how broken websites can really be when the encoding is wrong. I used this Google Chrome extension and changed my encoding and tried to read webpages. The message was completely unclear. Try and read this article. Try and navigate Wikipedia. See Mojibake for yourself.

It helps to see how important encoding truly is.

Conclusion

In my time spent researching and trying to simplify this article, I learned about Michael Everson. Since 1993, he has proposed over 200 Unicode changes, and has added thousands of characters to the standard. As of 2003, he was credited as the leading contributor of Unicode proposals. He is one huge reason why Unicode is what it is. Very impressive, and he has done a great deal for the Internet as we know it.

I hope this has explained a good overview of why we need encodings, what problems encoding solves, and what happens when it goes wrong.


A Beginner-Friendly Guide to Unicode in Python

freeCodeCamp — Wed, 18 Jul 2018 23:51:28 +0000

By Jimmy Zhang

I once spent a couple of frustrating days at work learning how to properly deal with Unicode strings in Python. During those two days, I ate a lot of snacks — roughly one bag of goldfish per one of these errors encountered, which should be all too familiar to those who program with Python:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)

While solving my issue, I did a lot of googling, which pointed me to a few indispensable articles. But as great as they are, they were all written without the help of a crucial aspect of communication in today’s day and age.

That is: they were all written without the help of emoji.

So, in order to take advantage of this situation, I decided to write my own guide to understanding Unicode, with plenty of faces and icons rendered along the way ✌.

Before diving into technical details, let’s begin with a fun question. What is your favorite emoji?

Mine is the “face with open mouth”, which looks like this 😮, with one major caveat. What you see is actually dependent on the platform you are using to read this post!

Viewed on my Mac, the emoji looks like a yellow bowling ball. On my Samsung tablet, the eyes are black and circular, accentuated by a white dot which betrays a greater depth of emotion.

Copy and paste the emoji (😮) into Twitter, and you’ll see something completely different. Copy and paste it into messenger.com, however, and you’ll see why it is my favorite.

Why are they all different?

From left to right: Apple, Samsung, messenger.com ([source](https://emojipedia.org/face-with-open-mouth/)).

Note: As of July 9th, 2018: Messenger seems to have updated their emoji icons, so the icon at the top right no longer applies.

This fun little mystery is our segue into the world of Unicode, as emojis have been part of the Unicode Standard since 2010. Aside from giving us emoji, Unicode is important because it is the Internet’s preferred choice for the consistent “encoding, representation, and handling of text”.

Unicode & Encoding: A Brief Primer

As with many topics, the best way to understand Unicode is to know the context surrounding its creation — and for that, Joel Spolsky’s article is required reading.

Code Points

Since we’ve now entered the world of Unicode, we need to first dissociate emojis from the wonderfully expressive icons they are, and associate them with something much less exciting. So instead of thinking about emojis in terms of the things or the emotions that they represent, we will instead think about each emoji as a plain number. This number is known as a code point.

Code points are the key concept of Unicode, which was “designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages…of the modern world.” It does so by associating virtually every printable character with a unique code point. Together, these characters comprise the Unicode character set.

Code points are typically written in hexadecimal and prefixed with U+ to denote the connection to Unicode, representing characters from:

exotic languages such as Telugu [ఋ | code point: U+0C0B]
chess symbols [♖ | code point: U+2656]
and, of course, emojis [🙌 | code point: U+1F64C]
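To make the idea of "a character is just a number" concrete, here is a minimal Python 3 sketch. It uses only the built-ins ord and chr; the helper name code_point_label is my own and not part of any library.

```python
# Characters taken from the list above.
samples = ["ఋ", "♖", "🙌"]


def code_point_label(ch: str) -> str:
    """Format one character's code point in the usual U+XXXX notation.

    Hypothetical helper for illustration only.
    """
    return "U+%04X" % ord(ch)


labels = [code_point_label(ch) for ch in samples]
print(labels)

# chr() goes the other way: from the number back to the character.
rook = chr(0x2656)
print(rook)
```

ord gives the code point as a plain integer, and chr turns an integer back into a one-character string, so the two are inverses of each other.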

Glyphs Are What You See

The actual on-screen representations of code points are called glyphs (the complete mapping of code points to glyphs is known as a font).

As an example, take this letter A, which is code point U+0041 in Unicode. The “A” you see with your eyes is a glyph: it looks the way it does because it is rendered with Medium’s font. If you were to change the font to Times New Roman, for example, only the glyph of “A” would change; the underlying code point would not.

Fonts map the same code point to different glyphs

Glyphs are the answer to our little rendering mystery. Under the hood, all variations of the face with open mouth emoji point to the same code point, U+1F62E, but the glyph representing it varies by platform.

Code Points are Abstractions

Because they say nothing about how they are rendered visually (requiring a font and a glyph to “bring them to life”), code points are said to be an abstraction.

But just as code points are an abstraction to end users, they are also abstractions to computers. This is because code points require a character encoding to convert them into the one thing which computers can interpret: bytes. Once converted to bytes, code points can be saved to files or sent over the network to another computer.

UTF-8 is currently the world’s most popular character encoding. UTF-8 uses a set of rules to convert a code point into a unique sequence of (1 to 4) bytes, and vice versa. Code points are said to be encoded into a sequence of bytes, and sequences of bytes are decoded into code points. This Stack Overflow post explains how the UTF-8 encoding algorithm works.
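For the curious, the UTF-8 rules are small enough to write out by hand. The sketch below is my own illustration (the function name utf8_encode is hypothetical), and in real code you would simply call str.encode("utf-8"); the point is only to show where the 1 to 4 bytes come from.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single code point as UTF-8 by hand (illustration only)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


by_hand = utf8_encode(0x1F62E)
built_in = chr(0x1F62E).encode("utf-8")
print(by_hand.hex(), built_in.hex())
```

The leading bits of the first byte say how many bytes follow, and every continuation byte starts with the bits 10, which is what lets a decoder find character boundaries.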

However, even though UTF-8 is the predominant character encoding in the world, it is far from the only one. For example, UTF-16 is an alternative character encoding of the Unicode character set, and it encodes our emoji 😮 into a different sequence of bytes than UTF-8 does.
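A quick way to compare the two encodings yourself is to encode the same one-character string both ways in Python 3. This is a small sketch of my own; "utf-16-be" is Python's name for big-endian UTF-16 without a byte order mark.

```python
emoji = "\U0001F62E"  # face with open mouth

utf8_bytes = emoji.encode("utf-8")
utf16_bytes = emoji.encode("utf-16-be")

# Same code point, two different byte sequences.
print("UTF-8 :", utf8_bytes.hex())
print("UTF-16:", utf16_bytes.hex())
```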

Problems arise when one computer encodes code points into bytes with one encoding, and another computer (or another process on the same computer) decodes those bytes with another.

Luckily, UTF-8 is ubiquitous enough that, for the most part, we don’t have to worry about mismatched character encodings. But when they do occur, a familiarity with the concepts mentioned above is required to extricate yourself from the mess.
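The error from the top of this post is exactly this kind of mismatch. As a minimal Python 3 sketch (my own reproduction, not the code from my original bug), here are UTF-8 bytes being decoded with the ASCII codec:

```python
data = "\U0001F62E".encode("utf-8")  # bytes produced with UTF-8

message = ""
try:
    # Decoding with the wrong codec: ASCII only knows bytes 0 to 127.
    data.decode("ascii")
except UnicodeDecodeError as err:
    message = str(err)

print(message)
```

The first byte, 0xf0, is outside the ASCII range, so the decoder gives up at position 0.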

Brief Recap

Unicode is a collection of code points, which are plain numbers typically written in hexadecimal and prefixed with U+. These code points map to virtually every printable character from the written languages around the world.
Glyphs are the physical manifestation of a character. The 😮 you see on screen is a glyph. A font is a mapping of code points to glyphs.
In order to send them across the network or save them in a file, characters and their underlying code points must be encoded into bytes. A character encoding contains the details of how a code point is embedded into a sequence of bytes.
UTF-8 is currently the world’s most popular character encoding. Given a code point, UTF-8 encodes it into a sequence of bytes. Given a sequence of bytes, UTF-8 decodes it into a code point.

A Practical Example

The correct rendering of Unicode characters involves traversing a chain, ranging from bytes to code points to glyphs.

Let’s now use a text editor to see a practical example of this chain — as well as the types of issues that can arise when things go awry. Text editors are perfect, because they involve all three parts of the rendering chain shown above.

Note: The following example was done on macOS using Sublime Text 3. And to give credit where credit is due: the beginning of this example is heavily inspired by this post from Philip Guo, which introduced me to the hexdump command (and a whole lot more).

We’ll start with a text file containing a single character: my favorite “face with open mouth” emoji. For those who want to follow along, I’ve hosted this file in a GitHub gist, which you can download locally with curl.

curl https://gist.githubusercontent.com/jzhang621/d7d9eb167f25084420049cb47510c971/raw/e35f9669785d83db864f9d6b21faf03d9e51608d/emoji.txt > emoji.txt

As we learned, in order for it to be saved to a file, the emoji was encoded into bytes using a character encoding. This particular file was encoded using UTF-8, and we can use the hexdump command to examine the actual byte contents of the file.

$ hexdump emoji.txt
0000000 f0 9f 98 ae
0000004

The output of hexdump tells us the file contains 4 bytes total, each of which is written in hexadecimal. The actual byte sequence f0 9f 98 ae matches the expected UTF-8 encoded byte sequence, as shown below.

Source: http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%F0%9F%98%AE&mode=char

Now, let’s open our file in Sublime Text, where we should see our single 😮 character. Since we see the expected glyph, we can assume Sublime Text used the correct character encoding to decode those bytes into code points. Let’s confirm by opening up the console View -> Show Console, and inspecting the [view](https://www.sublimetext.com/docs/3/api_reference.html#sublime.View) object that Sublime Text exposes as part of its Python API.

>>> view
<sublime.View object at 0x1112d7310>

# returns the encoding currently associated with the file
>>> view.encoding()
'UTF-8'

With a bit of Python knowledge, we can also find the Unicode code point associated with our emoji:

# returns the character at the given position
>>> view.substr(0)
'😮'

# ord returns an integer representing the Unicode code point of the character
>>> ord(view.substr(0))
128558

# convert code point to hexadecimal, and format with U+
>>> print('U+%x' % ord(view.substr(0)))
U+1f62e

Again, just as we expected. This illustrates a full traversal of the Unicode rendering chain, which involved:

reading the file as a sequence of UTF-8 encoded bytes.
decoding the bytes into a Unicode code point.
rendering the glyph associated with the code point.
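If you don't have Sublime Text handy, the same three steps can be sketched in plain Python 3. This is my own stand-in for the editor: it writes the four bytes to a temporary file instead of downloading the gist, and the final "rendering" step is simply whatever your terminal's font does with the printed character.

```python
import tempfile
import unicodedata
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "emoji.txt"
    path.write_bytes(bytes.fromhex("f09f98ae"))  # same bytes as the gist file

    # 1. Read the file as a raw sequence of bytes.
    raw = path.read_bytes()

# 2. Decode the bytes into a code point, using the encoding we know was used.
text = raw.decode("utf-8")
code_point = ord(text)
name = unicodedata.name(text)

# 3. Rendering the glyph is up to the terminal and its font.
print(text, "U+%x" % code_point, name)
```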

The actual glyph that you see is dependent on the platform.

So far, so good.

Different Bytes, Same Emoji

Aside from being my favorite text editor, Sublime Text is a good fit for this example because it allows for easy experimentation with character encodings.

We can now save the file using a different character encoding. To do so, click File -> Save with Encoding -> UTF-16 BE. (Very briefly, UTF-16 is an alternative character encoding of the Unicode character set. Instead of encoding the most common characters using one byte, like UTF-8, UTF-16 encodes every code point from 0 to 65535 (U+0000 to U+FFFF) using two bytes. Code points above 65535, like our emoji, are encoded using surrogate pairs, which take four bytes. The BE stands for Big Endian.)

When we use hexdump to inspect the file again, we see that the byte contents have changed.
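Surrogate pairs sound mysterious, but the arithmetic behind them is short. Here is a minimal sketch of my own (the function name to_surrogate_pair is hypothetical) that computes the two 16-bit units for our emoji and checks them against Python's built-in UTF-16 encoder.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000       # leaves a 20-bit number
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F62E)
print("%04x %04x" % (high, low))

# Big-endian means the most significant byte of each unit comes first.
pair_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(pair_bytes == chr(0x1F62E).encode("utf-16-be"))
```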

# (before: UTF-8)
$ hexdump emoji.txt
0000000 f0 9f 98 ae
0000004

# (after: UTF-16 BE)
$ hexdump emoji.txt
0000000 d8 3d de 2e
0000004

Back in Sublime Text, we still see the same 😮 character staring at us. Saving the file with a different character encoding might have changed the actual contents of the file, but it also updated Sublime Text’s internal representation of how to interpret those bytes. We can confirm by firing up the console again.

>>> view.encoding()
'UTF-16 BE'

From here on up, everything else is the same.

>>> view.substr(0)
'😮'

>>> ord(view.substr(0))
128558

>>> print('U+%x' % ord(view.substr(0)))
U+1f62e

The bytes may have changed, but the code point did not — and the emoji remains the same.
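The same check works outside the editor. In this small Python 3 sketch of my own, the two byte sequences from the hexdumps are decoded with their matching encodings and land on the same code point.

```python
utf8_bytes = bytes.fromhex("f09f98ae")   # file saved as UTF-8
utf16_bytes = bytes.fromhex("d83dde2e")  # file saved as UTF-16 BE

from_utf8 = utf8_bytes.decode("utf-8")
from_utf16 = utf16_bytes.decode("utf-16-be")

# Different bytes on disk, identical string in memory.
print(from_utf8 == from_utf16)
print("U+%x" % ord(from_utf16))
```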

Same Bytes, But What The đŸ˜®

Time for some encoding “fun”. First, let’s re-encode our file using UTF-8, because it makes for a better example.

Let’s now go ahead and use Sublime Text to re-open an existing file using a different character encoding. Under File -> Reopen with Encoding, click Vietnamese (Windows 1258), which turns our emoji character into the following four nonsensical characters: đŸ˜®.

When we click “Reopen with Encoding”, we aren’t changing the actual byte contents of the file, but rather, the way Sublime Text interprets those bytes. Hexdump confirms the bytes are the same:

$ hexdump emoji.txt
0000000 f0 9f 98 ae
0000004

To understand why we see these nonsensical characters, we need to consult the Windows-1258 code page, which is a mapping of bytes to a Vietnamese language character set. (Think of a code page as the table produced by a character encoding). As this code page contains a character set with no more than 256 characters, each character’s code point can be expressed as a decimal number between 0 and 255, which in turn can be encoded using 1 byte.

The Windows-1258 code page, which maps decimal code points to Vietnamese language characters. Taken from Wikipedia, with some custom styling applied to show the 4 code points relevant to this example.

Because our single 😮 emoji requires 4 bytes to encode using UTF-8, we now see 4 characters when we interpret the file with the Windows-1258 encoding.
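You can reproduce the garbling without an editor. The sketch below is my own and assumes Python 3, whose standard library ships a codec named "cp1258" for this code page; it decodes the same four bytes one byte at a time.

```python
raw = bytes.fromhex("f09f98ae")  # the UTF-8 bytes of our emoji

# Windows-1258 is a single-byte encoding, so 4 bytes become 4 characters.
garbled = raw.decode("cp1258")

print(len(garbled))
print(["U+%04x" % ord(ch) for ch in garbled])
```

Nothing about the bytes changed; only the table used to interpret them did.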

A wrong choice of character encoding has a direct impact on what we can see and comprehend by garbling characters into an incomprehensible mess.

Now, onto the “fun” part, which I include to add some color to Unicode and why it exists. Before Unicode, there were many different code pages such as Windows-1258 in existence, each with a different way of mapping 1 byte’s worth of data into at most 256 characters. Unicode was created in order to incorporate all the different characters of all the different code pages into one system. In other words, Unicode is a superset of Windows-1258, and each character in the Windows-1258 code page has a Unicode counterpart.

The Unicode counterpart for each character is listed on the middle row of each cell ([Wikipedia](https://en.wikipedia.org/wiki/Windows-1258))

In fact, these Unicode counterparts are what allow Sublime Text to convert between different character encodings with a click of a button. Internally, Sublime Text still represents each of our “Windows-1258 decoded” characters as a Unicode code point, as we see below when we fire up the console:

>>> view.encoding()
'Vietnamese (Windows 1258)'

# Python 3 strings are "immutable sequences of Unicode code points"
>>> type(view.substr(0))
<class 'str'>

>>> view.substr(0)
'đ'
>>> view.substr(1)
'Ÿ'
>>> view.substr(2)
'˜'
>>> view.substr(3)
'®'

>>> ['U+%04x' % ord(view.substr(x)) for x in range(0, 4)]
['U+0111', 'U+0178', 'U+02dc', 'U+00ae']

This means that we can re-save our 4 nonsensical characters using UTF-8. I’ll leave this one up to you: if you do so, and can correctly predict the resulting hexdump of the file, then you’ve successfully understood the key concepts behind Unicode, code points, and character encodings. (Use this UTF-8 code page. The answer can be found at the very end of this article.)

Wrapping up

Working effectively with Unicode involves always knowing what level of the rendering chain you are operating on. It means always asking yourself: what do I have? Under the hood, glyphs are nothing but code points. If you are working with code points, know that those code points must be encoded into bytes with a character encoding. If you have a sequence of bytes representing text, know that those bytes are meaningless without knowing the character encoding that was used to create those bytes.
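In Python 3, the "what do I have?" question maps onto two types: str holds code points and bytes holds encoded data. The helper below is a hypothetical sketch of my own, just to make that distinction explicit.

```python
def describe(value) -> str:
    """Say which level of the chain a value lives on (illustrative helper)."""
    if isinstance(value, str):
        # Already decoded: a sequence of code points.
        return "str: %d code point(s)" % len(value)
    if isinstance(value, (bytes, bytearray)):
        # Still encoded: meaningless as text until we know the encoding.
        return "bytes: %d byte(s), encoding unknown" % len(value)
    raise TypeError("expected str or bytes")


print(describe("\U0001F62E"))
print(describe("\U0001F62E".encode("utf-8")))
```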

As with any computer science topic, the best way to learn about Unicode is to experiment. Enter characters, play with character encodings, and make predictions that you verify using hexdump. While I hope this article explains everything you need to know about Unicode, I will be more than happy if it merely sets you up to run your own experiments.

Thanks for reading!

Answer:

$ hexdump emoji.txt
0000000 c4 91 c5 b8 cb 9c c2 ae
0000008
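If you want to check the answer in code rather than in an editor, here is a short Python 3 sketch of my own, again assuming the standard "cp1258" codec. It re-saves the four garbled characters as UTF-8, and also shows that the original emoji can be recovered by undoing the wrong decode.

```python
# The four characters we got by reading UTF-8 bytes as Windows-1258.
garbled = bytes.fromhex("f09f98ae").decode("cp1258")

# Re-saving those characters as UTF-8 gives each one its own byte sequence.
resaved = garbled.encode("utf-8")
resaved_hex = " ".join("%02x" % b for b in resaved)
print(resaved_hex)

# Undoing the mistake: back to the original bytes, then decode correctly.
recovered = garbled.encode("cp1258").decode("utf-8")
print("U+%x" % ord(recovered))
```

Each of the four characters sits between U+0080 and U+07FF, so each takes two bytes in UTF-8, which is why the file doubles in size.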
