<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ unicode - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ unicode - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Mon, 11 May 2026 10:29:18 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/unicode/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ What is Unicode —The Secret Language Behind Every Text You See ]]>
                </title>
                <description>
                    <![CDATA[ Have you ever sent a message with an emoji? Read a blog in another language? Or copied some strange symbol from the internet?  All of these are possible because of something called Unicode.  Unicode is a powerful system that lets computers understand... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-unicode-the-secret-language-behind-every-text-you-see/</link>
                <guid isPermaLink="false">688b74903e00617596a6f3ce</guid>
                
                    <category>
                        <![CDATA[ Computer Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ localization ]]>
                    </category>
                
                    <category>
                        <![CDATA[ unicode ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Manish Shivanandhan ]]>
                </dc:creator>
                <pubDate>Thu, 31 Jul 2025 13:50:08 +0000</pubDate>
                <media:content url="https://cdn.hashnode.com/res/hashnode/image/upload/v1753969659647/1f49bf21-9be3-4e60-861f-50c714d7ae87.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Have you ever sent a message with an emoji? Read a blog in another language? Or copied some strange symbol from the internet? </p>
<p>All of these are possible because of something called <a target="_blank" href="https://en.wikipedia.org/wiki/Unicode"><strong>Unicode</strong></a>. </p>
<p>Unicode is a powerful system that lets computers understand and show text in nearly any language, including fun stuff like emojis. 😃</p>
<p>In this article, we’ll break down what Unicode is, why it matters, and how it powers global communication.</p>
<h2 id="heading-table-of-contents">Table of Contents</h2>
<ul>
<li><p><a class="post-section-overview" href="#heading-the-problem-before-unicode">The Problem Before Unicode</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-what-is-unicode">What Is Unicode?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-how-does-unicode-work">How Does Unicode Work?</a></p>
<ul>
<li><p><a class="post-section-overview" href="#heading-what-are-unicode-encodings">What Are Unicode Encodings?</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-code-points-characters-and-glyphs">Code Points, Characters, and Glyphs</a></p>
</li>
</ul>
</li>
<li><p><a class="post-section-overview" href="#heading-unicode-in-programming">Unicode in Programming</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-why-unicode-matters">Why Unicode Matters</a></p>
</li>
<li><p><a class="post-section-overview" href="#heading-conclusion">Conclusion</a></p>
</li>
</ul>
<h2 id="heading-the-problem-before-unicode">The Problem Before Unicode</h2>
<p>Let’s rewind to the early days of computers when each country had its own way of showing text. These systems were called character encodings. </p>
<p>For example, English text used <a target="_blank" href="https://en.wikipedia.org/wiki/ASCII">ASCII</a>, while others used ISO-8859, Shift-JIS, and more.</p>
<p>But here’s the problem: the same number could mean different things in different systems. </p>
<p>For example, the byte <code>0x41</code> meant the letter A in ASCII, but IBM's EBCDIC encodings stored A as <code>0xC1</code>, so <code>0x41</code> did not mean A there.</p>
<p>This caused chaos when sharing documents between systems. Special characters would turn into random symbols, and non-English languages were often unreadable. </p>
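<p>The mismatch is easy to reproduce. As a minimal sketch in Python (the byte value and the two legacy encodings are chosen here purely for illustration), the same single byte decodes to two different characters depending on which encoding you assume:</p>

```python
# One byte, two legacy encodings, two different characters.
raw = bytes([0xE9])

as_western = raw.decode("latin-1")     # ISO-8859-1 (Western European)
as_cyrillic = raw.decode("iso8859_5")  # ISO-8859-5 (Cyrillic)

print(as_western)   # é
print(as_cyrillic)  # щ
```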
<p>It was clear that the world needed one universal system. Something that could handle all languages and symbols in a single, consistent way.</p>
<p>That’s where Unicode comes in.</p>
<h2 id="heading-what-is-unicode">What Is Unicode?</h2>
<p>Unicode is a standard system that assigns a unique number, called a code point, to every character. It includes letters, numbers, emojis, symbols, and even <a target="_blank" href="https://invisible-characters.com/">invisible control characters</a>.</p>
<p>Think of it like giving every character in every language its own ID number.</p>
<p>For example:</p>
<ul>
<li><p>The capital letter <strong>A</strong> is given the code <code>U+0041</code></p>
</li>
<li><p>The Greek letter <strong>Ω</strong> is <code>U+03A9</code></p>
</li>
<li><p>The emoji 😀 is <code>U+1F600</code></p>
</li>
</ul>
<p>This means no matter what device, app, or country you’re in, the same code will always mean the same character.</p>
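<p>You can look up these IDs yourself. Here is a short Python sketch. The helper name <code>code_point_label</code> is made up for this example, while <code>ord()</code> and <code>chr()</code> are Python built-ins:</p>

```python
def code_point_label(ch):
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "Ω", "😀"]:
    print(ch, code_point_label(ch))

# chr() goes the other way, from number to character.
print(chr(0x03A9))  # Ω
```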
<h2 id="heading-how-does-unicode-work">How Does Unicode Work?</h2>
<p>At its core, Unicode assigns a code point to each character. </p>
<p>Code points look like this: <code>U+XXXX</code>, where <code>XXXX</code> is a number of four to six digits written in hexadecimal (a base-16 system computers use).</p>
<p>But computers don’t store code points directly. They store bytes, the 1s and 0s under the hood. So Unicode needs a way to turn those code points into bytes. This is called encoding.</p>
<h3 id="heading-what-are-unicode-encodings">What Are Unicode Encodings?</h3>
<p>An encoding is a set of rules that maps each code point, such as U+1F600, to a specific sequence of bytes and back again, so that text can be saved or transmitted.</p>
<p>There are three main ways to turn Unicode code points into bytes:</p>
<p><strong>1. UTF-8 (Most common)</strong></p>
<ul>
<li><p>Uses 1 to 4 bytes.</p>
</li>
<li><p>Great for English and most symbols.</p>
</li>
<li><p>Saves space.</p>
</li>
<li><p>Works on the web and most systems.</p>
</li>
</ul>
<p><strong>2. UTF-16</strong></p>
<ul>
<li><p>Uses 2 or 4 bytes.</p>
</li>
<li><p>Used in Windows, Java, and some older systems.</p>
</li>
</ul>
<p><strong>3. UTF-32</strong></p>
<ul>
<li><p>Uses 4 bytes for everything.</p>
</li>
<li><p>Easy to work with, but uses more memory.</p>
</li>
</ul>
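<p>To make the size differences concrete, here is a rough Python sketch that counts the bytes each encoding needs. The helper <code>byte_sizes</code> and the sample characters are assumptions made for this example:</p>

```python
# Count the bytes each encoding needs for a few sample characters.
# The "-le" variants are used so no byte order mark is added to the count.
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

def byte_sizes(ch):
    """Return {encoding name: number of bytes} for one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

for ch in ["A", "€", "😀"]:
    print(ch, byte_sizes(ch))
```

<p>The letter A takes 1, 2, and 4 bytes respectively, the Euro sign takes 3, 2, and 4, and the emoji takes 4 bytes in all three.</p>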
<p>If you’re storing or sending text, the encoding decides how many bytes are used. Choosing UTF‑8 can save space, especially for English-heavy data. When you see garbled text or � symbols, it’s usually a mismatch between encoding and decoding.</p>
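<p>Both kinds of damage can be reproduced in a few lines of Python. This is only a sketch with a made-up sample word, but it shows what an encoding mismatch looks like:</p>

```python
original = "café"
data = original.encode("utf-8")

# Decoding with the wrong encoding silently produces the wrong characters.
garbled = data.decode("latin-1")
print(garbled)  # cafÃ©

# Bytes that are not valid UTF-8 become U+FFFD, the replacement character.
latin1_data = original.encode("latin-1")
replaced = latin1_data.decode("utf-8", errors="replace")
print(replaced)  # caf�
```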
<p>Web servers, databases, and APIs often require you to specify the encoding to ensure multilingual text displays correctly. In short, knowing the difference between UTF‑8, UTF‑16, and UTF‑32 helps you prevent bugs, save storage, and build apps that handle text from any language reliably.</p>
<p>So, UTF-8 is often the best choice. It’s efficient, and it works nearly everywhere.</p>
<h3 id="heading-code-points-characters-and-glyphs">Code Points, Characters, and Glyphs</h3>
<p>Let’s break down the main parts of Unicode:</p>
<p><strong>Code Point:</strong></p>
<p>This is the number assigned to a character. For example:</p>
<ul>
<li><p><code>U+0041</code> is the code point for <strong>A</strong></p>
</li>
<li><p><code>U+20AC</code> is for the Euro sign <strong>€</strong></p>
</li>
<li><p><code>U+1F600</code> is for the smiley face 😀</p>
</li>
</ul>
<p><strong>Character:</strong></p>
<p>The actual letter or symbol we see. For example, “A”, “Ω”, or “😎”.</p>
<p><strong>Glyph:</strong></p>
<p>This is the visual design of a character. For example, “A” in Arial looks different from “A” in Times New Roman, but the character is the same.</p>
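<p>The first two ideas can be inspected from Python's standard <code>unicodedata</code> module, as in the sketch below. The glyph is not part of the string at all. It comes from whichever font draws the character on your screen.</p>

```python
import unicodedata

for ch in ["A", "€", "😀"]:
    # ord() gives the code point; unicodedata.name() gives the official character name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch)
```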
<h2 id="heading-unicode-in-programming">Unicode in Programming</h2>
<p>Modern programming languages have embraced Unicode, making it easier than ever to build applications that support global audiences. </p>
<p>Whether you’re writing a command-line tool or building a web app, Unicode ensures your text renders correctly, no matter the language.</p>
<p>Take <a target="_blank" href="https://www.freecodecamp.org/news/an-animated-introduction-to-programming-with-python/">Python</a>, for instance. It natively supports Unicode strings:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"Welcome 😊"</span>)  <span class="hljs-comment"># This works because Python 3 strings are Unicode</span>
</code></pre>
<p>You can even mix languages and emojis in the same output without a problem:</p>
<pre><code class="lang-python">print(<span class="hljs-string">"こんにちは, friend! 🚀"</span>)
</code></pre>
<p>In <a target="_blank" href="https://www.freecodecamp.org/news/what-is-javascript-definition-of-js/">JavaScript</a>, Unicode enables developers to use characters from virtually any script:</p>
<pre><code class="lang-typescript"><span class="hljs-built_in">console</span>.log(<span class="hljs-string">"नमस्ते"</span>);  <span class="hljs-comment">// Prints “Namaste” in Hindi</span>
<span class="hljs-built_in">console</span>.log(<span class="hljs-string">"مرحبا بالعالم"</span>);  <span class="hljs-comment">// Arabic: "Hello, world"</span>
</code></pre>
<p>Or even create multilingual UIs:</p>
<pre><code class="lang-typescript"><span class="hljs-built_in">document</span>.getElementById(<span class="hljs-string">"greeting"</span>).textContent = <span class="hljs-string">"Bonjour, мир!"</span>;
</code></pre>
<p>Before Unicode, developers had to juggle incompatible encodings such as ASCII, ISO-8859, and Shift-JIS, which often led to corrupted text when files moved between systems. Now, thanks to Unicode, most languages, including Java, C#, Ruby, Go, and Rust, handle international text gracefully by default.</p>
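<p>One place where the encoding still matters is file I/O. The sketch below writes mixed-language text to a temporary file and reads it back. The file name is made up for this example. Passing <code>encoding="utf-8"</code> explicitly is a cautious habit, since the default encoding can depend on the platform:</p>

```python
import os
import tempfile

text = "Bonjour, мир! こんにちは 🚀"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.txt")  # hypothetical file name
    # Write and read with the same explicit encoding.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_tripped = f.read()

print(round_tripped == text)  # True
```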
<p>This shift means developers can write apps that support global users from day one. Whether you’re building a chat app, an international e-commerce site, or a multilingual blog – with Unicode, your code speaks every language.</p>
<h2 id="heading-why-unicode-matters">Why Unicode Matters</h2>
<p>Before Unicode, digital communication across languages was chaotic. </p>
<p>Different systems used different character sets, leading to garbled text, random boxes, or strings of question marks whenever someone typed in a non-Latin-based language. Unicode changed all of that.</p>
<p>With Unicode, you can now mix languages like Chinese and English in the same document without a problem. Whether you’re copying text between applications or transferring data across platforms, it just works. </p>
<p>This consistency has been a game-changer for building multilingual websites and applications. Developers no longer need to worry about separate encodings for different regions. A single, unified standard handles it all.</p>
<p>Unicode isn’t something most users think about, but it’s embedded in almost everything. </p>
<p>It powers the text you see on websites and in your email, your smartphone’s keyboard, and even the way you chat in online games. Social media posts, search queries, and programming languages all rely on Unicode.</p>
<p>Behind the scenes, the <a target="_blank" href="https://www.unicode.org/consortium/consort.html">Unicode Consortium</a>, made up of industry giants like Google, Apple, and Microsoft, regularly updates the standard. They decide which new characters and emojis make it into our digital vocabulary. </p>
<p>That’s why your favourite facepalm emoji or regional script exists. Someone proposed it, and Unicode made it happen.</p>
<p>Unicode isn’t just a technical convenience. It plays a direct role in how people engage with content. </p>
<p>Pages with broken symbols or unreadable characters are harder to read, and visitors are more likely to leave them than cleanly rendered ones. Readability isn’t just about aesthetics – it affects how long people stay and interact with your content.</p>
<p>That’s why even small encoding errors can have a real impact, especially on multilingual platforms or international blogs. Unicode silently keeps everything running smoothly.</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>Unicode is one of the unsung heroes of our digital world. Without it, the internet would still be a confusing mix of broken characters and language barriers. Because of Unicode, we can type “Hello 😊”, mix multiple languages in a single message, or build global apps that just work.</p>
<p>So the next time you post an emoji, read a message in a different script, or switch languages on your keyboard, take a moment to appreciate the invisible infrastructure behind it all. That’s Unicode, working quietly to make sure we stay connected, no matter what language we speak.</p>
<p><a target="_blank" href="https://blog.manishshivanandhan.com/">Join my newsletter</a> for a summary of my articles every Friday. You can also <a target="_blank" href="https://linkedin.com/in/manishmshiva">connect with me on LinkedIn</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ How to Use RegEx to Match Emoji – Discord Emotes Regular Expression Tutorial ]]>
                </title>
                <description>
                    <![CDATA[ Emoji are special Unicode characters that render pictographs. But these characters can be very tricky to identify with regular expressions (RegEx).  I was recently working on a Discord bot that had to detect the number of emotes in a given message. T... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/how-to-use-regex-to-match-emoji-including-discord-emotes/</link>
                <guid isPermaLink="false">66ac7f4eed08c5b0125be18f</guid>
                
                    <category>
                        <![CDATA[ discord ]]>
                    </category>
                
                    <category>
                        <![CDATA[ emoji ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Regex ]]>
                    </category>
                
                    <category>
                        <![CDATA[ unicode ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Naomi Carrigan ]]>
                </dc:creator>
                <pubDate>Wed, 13 Jul 2022 23:04:07 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/07/pexels-roman-odintsov-6898861.jpg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>Emoji are special Unicode characters that render pictographs. But these characters can be very tricky to identify with regular expressions (RegEx). </p>
<p>I was recently working on a Discord bot that had to detect the number of emotes in a given message. Today I'll share my process with you, including the newer JavaScript RegEx feature that finally solved the issues I was having.</p>
<h2 id="heading-how-unicode-emoji-work">How Unicode Emoji Work</h2>
<p>The Unicode Consortium defines specific character codes for each emoji. They even maintain a <a target="_blank" href="https://unicode.org/emoji/charts/full-emoji-list.html">helpful emoji chart</a> as a reference. As an example, <code>U+1F600</code> corresponds to the 😀 emoji.</p>
<p>Some emoji consist of multiple Unicode characters. This is most common with flag emoji, which consist of the "regional indicators" that make up the country's two-letter country code.</p>
<p>This means the United States flag, 🇺🇸, consists of the two Unicode characters <code>U+1F1FA</code> and <code>U+1F1F8</code>, which correspond to the regional indicators <code>U</code> and <code>S</code>.</p>
<blockquote>
<p>As a fun fact, it is up to the operating system to determine <strong>how</strong> to render an emoji. If you are on Windows, for example, you won't see a flag above. You'll see <code>US</code>.</p>
</blockquote>
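<p>The regional indicator scheme can be sketched in a few lines of Python. The helper name <code>flag_for</code> is made up for this example. It relies on the regional indicators being laid out in alphabetical order starting at <code>U+1F1E6</code> for the letter A:</p>

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # code point of the regional indicator for "A"

def flag_for(country_code):
    """Build a flag emoji from a two-letter country code such as "US"."""
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )

flag = flag_for("US")
print(flag, len(flag))                      # the string holds two code points
print([f"U+{ord(ch):04X}" for ch in flag])  # ['U+1F1FA', 'U+1F1F8']
```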
<h2 id="heading-what-are-discord-emotes">What are Discord Emotes?</h2>
<p>One of Discord's many features is allowing communities to upload their own custom emotes. These emotes are identified by a name, and are used with the syntax <code>:emote_name:</code>.</p>
<p>However, the way they are identified by the client/API is different. Each emote has a unique ID, and they're sent in the message content as <code>&lt;:emote_name:1234567890&gt;</code>, or <code>&lt;a:emote_name:1234567890&gt;</code> for animated emotes.</p>
<p>You can see this in Discord by putting a backslash <code>\</code> before the emote and sending it. It will render something like this:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/07/image-162.png" alt="A Discord message showing an emote's raw value `<:NaomiGrin:938275644092063784>`" width="600" height="400" loading="lazy"></p>
<h2 id="heading-how-to-match-emoji-and-emotes-with-regex">How to Match Emoji and Emotes with RegEx</h2>
<p>My original approach had two different RegEx phrases.</p>
<p>I was using <code>/(&lt;a?)?:\w+:(\d{18}&gt;)?/g</code> to catch the Discord emotes. This RegEx was successfully picking up Discord emotes, which was great! </p>
<p>I paired it with <code>/:[^:\s]*(?:::[^:\s]*)*:/g</code> to match the Unicode emoji, which only partially worked. The problem here was that I was seeing some emotes being counted twice – because the Discord RegEx was matching them. And others were being missed entirely.</p>
<p>So, with RegEx being what it is, I tried to make it more complex. <code>/&lt;:[^:\s]+:\d+&gt;|&lt;a:[^:\s]+:\d+&gt;|(\u00a9|\u00ae|[\u2000-\u3300]|\ud83c[\ud000-\udfff]|\ud83d[\ud000-\udfff]|\ud83e[\ud000-\udfff]|\ufe0f)/g</code> was a bit more successful in matching the built-in emoji, but still wasn't perfect. This RegEx was designed to match Unicode characters specifically.</p>
<p>I played around with trimming whitespace, using the word boundary <code>\b</code> character, and a few other tweaks, before finally giving up and doing some research. </p>
<p>And then I discovered <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Unicode_Property_Escapes">Unicode Property Escapes</a>. This RegEx feature allows you to add the <code>u</code> flag to your RegEx, unlocking the Unicode Properties denoted with the <code>\p</code> character.</p>
<p>With some additional research, I was able to find the <a target="_blank" href="https://unicode.org/reports/tr51/#Emoji_Properties">Emoji Character properties</a> – specifically, the <code>Extended_Pictographic</code> property. This enabled me to update the RegEx to a final, functional value:</p>
<pre><code class="lang-js">/<span class="xml"><span class="hljs-tag">&lt;<span class="hljs-name">a?:.+?:\d{18}</span>&gt;</span>|\p{Extended_Pictographic}/gu</span>
</code></pre>
<p>The <code>\p{Extended_Pictographic}</code> property seems to match Unicode emotes as well as character modifiers (often used for skin tones in emoji).</p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This RegEx is currently running in my production code and hasn't shown any issues yet. </p>
<p>Hopefully this article has helped you. If you are interested in exploring Unicode Property Escapes further, the Unicode Consortium offers a <a target="_blank" href="https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt">full list</a> of the available values.</p>
<p>Happy Coding!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ HTML Arrow – Symbol Unicode for Single and Double Arrows, Left and Right Arrows ]]>
                </title>
                <description>
                    <![CDATA[ By Dillion Megida What are Unicodes? Unicodes are universal characters that represent different things. These could be symbols, characters, scripts, and many more forms of character combinations. Unicodes are adopted by many platforms (mobile and web... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/html-arrow-symbol-unicode-for-single-and-double-arrows-left-and-right-arrows/</link>
                <guid isPermaLink="false">66d84f2439c4dccc43d4d48a</guid>
                
                    <category>
                        <![CDATA[ HTML ]]>
                    </category>
                
                    <category>
                        <![CDATA[ unicode ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Fri, 17 Jun 2022 15:35:12 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/06/unicode-for-arrows.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Dillion Megida</p>
<h2 id="heading-what-are-unicodes">What is Unicode?</h2>
<p>Unicode is a universal standard for representing characters. These could be symbols, letters from different scripts, and many more forms of character combinations.</p>
<p>Unicode is adopted by many platforms (mobile and web) to make characters available everywhere.</p>
<h2 id="heading-why-are-unicodes-useful">Why is Unicode useful?</h2>
<p>Unicode is useful because it provides a standard for character representations across different systems and languages. </p>
<p>Unicode also represents special characters that are not available in <a target="_blank" href="https://en.wikipedia.org/wiki/ASCII">ASCII</a> and helps us create consistent character displays on various platforms.</p>
<p>You can also apply styles (colors, sizes) like you would with other characters.</p>
<h2 id="heading-how-to-use-unicodes-in-html">How to use Unicode in HTML</h2>
<p>You can write most Unicode symbols in two ways: using a numeric character reference or using the entity name.</p>
<p>Numeric references are usually hard to read, but entity names are generally descriptive of the Unicode symbol you want to write.</p>
<p>For numeric references, you write the symbol's decimal code point number in between <code>&amp;#</code> (an ampersand and a number sign) and <code>;</code> (a semi-colon). A hexadecimal number also works if you start with <code>&amp;#x</code> instead. The decimal form looks like this:</p>
<pre><code class="lang-html">&amp;#[NUMBER];
</code></pre>
<p>For entity names, you write them in between <code>&amp;</code> and <code>;</code> like this:</p>
<pre><code class="lang-html">&amp;[ENTITY];
</code></pre>
<p>This syntax is necessary so that HTML understands that the characters you're writing are not just text but Unicode symbol representations.</p>
<h2 id="heading-unicode-for-single-and-double-left-and-right-arrows">Unicode for Single and Double Left and Right Arrows</h2>
<p>Now that we've briefly looked at what Unicode is and how to use it in HTML, let's look at some examples.</p>
<p>There are many symbols with Unicode representations you can use in HTML. For this article, I will share four examples of arrow symbols.</p>
<p>There are different arrow symbols and Unicode values for them. The arrows used here are just examples.</p>
<h3 id="heading-left-arrow">Left Arrow</h3>
<p>For the single left arrow:</p>
<p>The decimal reference is <strong>8592</strong> and the entity name is <strong>larr</strong>. In HTML, it would be written like:</p>
<pre><code class="lang-html"><span class="hljs-symbol">&amp;#8592;</span>
<span class="hljs-comment">&lt;!-- or --&gt;</span>
<span class="hljs-symbol">&amp;larr;</span>
</code></pre>
<p>This code will print this on a page:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/image-100.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>For the double left arrow:</p>
<p>The decimal reference is <strong>8647</strong> and the entity name is <strong>llarr</strong>, written as:</p>
<pre><code class="lang-html"><span class="hljs-symbol">&amp;#8647;</span>
<span class="hljs-comment">&lt;!-- or --&gt;</span>
<span class="hljs-symbol">&amp;llarr;</span>
</code></pre>
<p>This will result in:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/image-99.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-right-arrow">Right Arrow</h3>
<p>For the single right arrow:</p>
<p>The decimal reference is <strong>8594</strong> and the entity name is <strong>rarr</strong>, written like:</p>
<pre><code class="lang-html"><span class="hljs-symbol">&amp;#8594;</span>
<span class="hljs-comment">&lt;!-- or --&gt;</span>
<span class="hljs-symbol">&amp;rarr;</span>
</code></pre>
<p>The result:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/image-98.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>For the double right arrow:</p>
<p>The decimal reference is <strong>8649</strong> and the entity name is <strong>rrarr</strong>, written as:</p>
<pre><code class="lang-html"><span class="hljs-symbol">&amp;#8649;</span>
<span class="hljs-comment">&lt;!-- or --&gt;</span>
<span class="hljs-symbol">&amp;rrarr;</span>
</code></pre>
<p>This will result in:</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/06/image-101.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can use Unicode representations to print many other symbols in HTML. You can either use the numeric reference or the entity name of the symbol, as I have shown you in this article.</p>
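<p>If you want to double-check that a number and an entity name point at the same symbol, one option outside of HTML is Python's standard <code>html.entities</code> module. The sketch below is only a cross-check, and the <code>arrows</code> dictionary is made up for this example from the four arrows above:</p>

```python
from html.entities import html5  # maps names like "larr;" to characters

# Entity name -> decimal number, as used in this article.
arrows = {"larr": 8592, "llarr": 8647, "rarr": 8594, "rrarr": 8649}

for name, decimal in arrows.items():
    char = chr(decimal)
    same = html5[name + ";"] == char
    print(name, decimal, f"U+{decimal:04X}", char, same)
```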
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Dot Symbol – Bullet Point in HTML Unicode ]]>
                </title>
                <description>
                    <![CDATA[ In your HTML documents, you'll often need to make a list of items. And you can use bullet points for this purpose. You can show bullet points with the Unicode character (or entity) for bullet points. In this article, I will show you the Unicode and H... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/dot-symbol-bullet-point-in-html-unicode/</link>
                <guid isPermaLink="false">66adf0bf1ecaa5001d700511</guid>
                
                    <category>
                        <![CDATA[ HTML ]]>
                    </category>
                
                    <category>
                        <![CDATA[ unicode ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kolade Chris ]]>
                </dc:creator>
                <pubDate>Thu, 14 Apr 2022 00:45:10 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/04/bullet.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>In your HTML documents, you'll often need to make a list of items. And you can use bullet points for this purpose.</p>
<p>You can show bullet points with the Unicode character (or entity) for bullet points.</p>
<p>In this article, I will show you the Unicode and HTML entities for showing bullet points on a web page. </p>
<p>Towards the end of this article, I will also show you the key combination you can use to type a big dot symbol.</p>
<h2 id="heading-the-unicode-and-html-entities-for-bullet-points">The Unicode and HTML Entities for Bullet Points</h2>
<p>The Unicode character for showing the dot symbol or bullet point is <code>U+2022</code>. </p>
<p>But to use this Unicode correctly, remove the <code>U+</code> and replace it with ampersand (<code>&amp;</code>), pound sign (<code>#</code>), and <code>x</code>. Then type the 2022 number in, and then add a semi-colon. So, it becomes <code>&amp;#x2022;</code>. </p>
<p>It'll look like this:</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2022;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2022;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2022;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2022;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/ss1-2.png" alt="ss1-2" width="600" height="400" loading="lazy"></p>
<p>Apart from the <code>&amp;#x2022;</code> hexadecimal reference, you can also use the <code>&amp;bull;</code> entity name or the <code>&amp;#8226;</code> decimal reference to show bullets or dot symbols on the web page.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#8226;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;bull;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#8226;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;bull;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p>The output remains the same:
<img src="https://www.freecodecamp.org/news/content/images/2022/04/ss2-4.png" alt="ss2-4" width="600" height="400" loading="lazy"></p>
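<p>All three spellings refer to the same code point. As a quick cross-check outside of HTML, this Python sketch converts the hexadecimal digits of <code>U+2022</code> to decimal and compares the result with the named entity. The variable names are made up for this example:</p>

```python
from html.entities import html5  # maps names like "bull;" to characters

code_point = int("2022", 16)  # the hex digits from U+2022
bullet = chr(code_point)

print(code_point)                # 8226, the decimal form
print(bullet == html5["bull;"])  # True, the named entity is the same character
```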
<h2 id="heading-the-keyboard-shortcut-for-typing-a-dot-symbol">The Keyboard Shortcut for Typing a Dot Symbol</h2>
<p>To type the dot symbol on a Windows keyboard, turn on the numeric keypad by pressing <code>NumLk</code>, then hold <code>Alt</code> and press the <code>0</code>, <code>1</code>, <code>4</code>, and <code>9</code> keys in succession. </p>
<p>If you don’t type the numbers with the numeric keypad, the dot symbol will not show.</p>
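<p>Why 0149 and not 8226? A likely explanation, assuming a Western-locale Windows setup, is that Alt codes with a leading zero are looked up in the Windows-1252 code page, where byte 149 is the bullet. This Python sketch checks that mapping:</p>

```python
# Decode byte 149 using the Windows-1252 code page (an assumption about the locale).
alt_code = 149
typed = bytes([alt_code]).decode("cp1252")

print(typed)                 # •
print(f"U+{ord(typed):04X}") # U+2022
```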
<p>Thank you for reading!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Checkmark Symbol – HTML for Checkmark Unicode ]]>
                </title>
                <description>
                    <![CDATA[ If you take a look at your keyboard, you'll see that there’s no key for typing a checkmark.  You could decide to copy the checkmark symbol from the internet and paste it directly into your HTML code, but an easier way to do it is to use the appropria... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/checkmark-symbol-html-for-checkmark-unicode/</link>
                <guid isPermaLink="false">66adf07b7550d4f37c20199c</guid>
                
                    <category>
                        <![CDATA[ HTML ]]>
                    </category>
                
                    <category>
                        <![CDATA[ unicode ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kolade Chris ]]>
                </dc:creator>
                <pubDate>Tue, 12 Apr 2022 19:17:04 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2022/04/checkmark.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you take a look at your keyboard, you'll see that there’s no key for typing a checkmark. </p>
<p>You could decide to copy the checkmark symbol from the internet and paste it directly into your HTML code, but an easier way to do it is to use the appropriate Unicode character or HTML character entity.</p>
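<p>If you are wondering what these are: a Unicode code point is the number assigned to a character, and an HTML character entity is a short piece of text that the browser replaces with that character. Both can represent emojis, symbols, and other characters.</p>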
<p>In your web projects, you might want to show a checkmark for the purpose of consent or agreement. So, in this article, I will show you how to use the appropriate Unicode and HTML character entity to bring checkmarks into your web projects. I will also show you 4 other variations of the checkmark symbol.</p>
<h2 id="heading-the-unicode-and-html-characters-for-checkmarks">The Unicode and HTML Characters for Checkmarks</h2>
<p>The Unicode code point for a checkmark is <code>U+2713</code>. If you type it into your HTML exactly like that, as in the markup below, the browser displays the literal text <code>U+2713</code> instead of a checkmark:</p>
<pre><code class="lang-html"> <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>U+2713 HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>U+2713 CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>U+2713 JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span>U+2713 PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><strong>So, how do you use the U+2713 Unicode to show the checkmark symbol?</strong></p>
<p>Remove the <code>U+</code> and replace it with an ampersand (<code>&amp;</code>), a pound sign (<code>#</code>), and an <code>x</code>, which marks the number as hexadecimal. Then type <code>2713</code>, followed by a semicolon. So, it becomes <code>&amp;#x2713;</code>.</p>
<pre><code class="lang-html"> <span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2713;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2713;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2713;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
 <span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2713;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/ss2-3.png" alt="ss2-3" width="600" height="400" loading="lazy"></p>
<p>You can also show the checkmark symbol with the decimal form of the same character reference, <code>&amp;#10003;</code>, or with the named HTML character entity <code>&amp;check;</code>:</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;check;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#10003;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#10003;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;check;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/ss2-3.png" alt="ss2-3" width="600" height="400" loading="lazy"></p>
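<p>A quick way to see why the hexadecimal and decimal references give the same symbol is to convert between the two numbering styles. The short Python sketch below is only an illustration (the variable names are made up for this example): it shows that hexadecimal <code>2713</code> and decimal <code>10003</code> are the same code point, which is why <code>&amp;#x2713;</code> and <code>&amp;#10003;</code> render the same checkmark.</p>
<pre><code class="lang-python"># Hexadecimal 2713 (from U+2713) and decimal 10003 are the same number.
code_point = int("2713", 16)
print(code_point)   # 10003
print(hex(10003))   # 0x2713

# chr() turns a code point into the character itself.
checkmark = chr(code_point)
print(checkmark)    # ✓
</code></pre>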
<h2 id="heading-other-variations-of-the-checkmark-symbol">Other Variations of the Checkmark Symbol</h2>
<p>Apart from the traditional <code>U+2713</code>, <code>&amp;#10003;</code> or <code>&amp;check;</code>, there are other variations such as:</p>
<h3 id="heading-for-a-bolder-checkmark"><code>&amp;#10004;</code> for a bolder checkmark</h3>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#10004;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#10004;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#10004;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#10004;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/ss3-2.png" alt="ss3-2" width="600" height="400" loading="lazy"></p>
<h3 id="heading-u2705-for-a-white-heavy-checkmark">U+2705 – <code>&amp;#x2705;</code> for a white heavy checkmark</h3>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2705;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2705;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2705;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2705;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/ss4-1.png" alt="ss4-1" width="600" height="400" loading="lazy"></p>
<h3 id="heading-u2611-for-a-ballot-checkmark">U+2611 – <code>&amp;#x2611;</code> for a ballot checkmark</h3>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2611;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2611;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2611;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x2611;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/ss5-1.png" alt="ss5-1" width="600" height="400" loading="lazy"></p>
<h3 id="heading-u221a-for-a-square-root-checkmark">U+221A – <code>&amp;#x221A;</code> for a square root checkmark</h3>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">h1</span>&gt;</span>Languages of the web<span class="hljs-tag">&lt;/<span class="hljs-name">h1</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x221A;</span> HTML<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x221A;</span> CSS<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x221A;</span> JavaScript<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">h3</span>&gt;</span><span class="hljs-symbol">&amp;#x221A;</span> PHP<span class="hljs-tag">&lt;/<span class="hljs-name">h3</span>&gt;</span>
</code></pre>
<p><img src="https://www.freecodecamp.org/news/content/images/2022/04/ss6-1.png" alt="ss6-1" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>This article has shown you the Unicode string for a checkmark, how to use it, and other variations of it. </p>
<p>You also learned about the equivalent HTML character entity for the checkmark symbol, in case you don’t want to show it with the Unicode string.</p>
<p>Now, go insert some checkmarks into your code.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Unicode Characters – What Every Developer Should Know About Encoding ]]>
                </title>
                <description>
                    <![CDATA[ If you are coding an international app that uses multiple languages, you'll need to know about encoding. Or even if you're just curious how words end up on your screen – yep, that's encoding, too. I'll explain a brief history of encoding in this arti... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/everything-you-need-to-know-about-encoding/</link>
                <guid isPermaLink="false">66bc55d2cd8a65d579e3aa06</guid>
                
                    <category>
                        <![CDATA[ ascii ]]>
                    </category>
                
                    <category>
                        <![CDATA[ binary ]]>
                    </category>
                
                    <category>
                        <![CDATA[ encoding ]]>
                    </category>
                
                    <category>
                        <![CDATA[ unicode ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ Kealan Parr ]]>
                </dc:creator>
                <pubDate>Mon, 01 Mar 2021 16:01:00 +0000</pubDate>
                <media:content url="https://www.freecodecamp.org/news/content/images/2020/12/Title.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>If you are coding an international app that uses multiple languages, you'll need to know about encoding. Or even if you're just curious how words end up on your screen – yep, that's encoding, too.</p>
<p>I'll explain a brief history of encoding in this article (and I'll discuss how little standardisation there was) and then I'll talk about what we use now. I'll also cover some <strong>Computer Science</strong> theory you need to understand.</p>
<h2 id="heading-introduction-to-encoding">Introduction to Encoding</h2>
<p>A computer can only understand binary. Binary is the language of computers, and is made up of <code>0</code>s and <code>1</code>s. There is nothing else allowed. One digit is called a <code>bit</code>, and a <code>byte</code> is 8 bits. So 8 <code>0</code>s or <code>1</code>s make up one <code>byte</code>.</p>
<p>Everything eventually ends up as binary – programming languages, mouse moves, typing, and all the words on the screen.</p>
<p>If all the text you're reading was once binary too, then how do we turn binary into text? Let's look at what we used to do back in the beginning.</p>
<h2 id="heading-a-brief-history-of-encoding">A Brief History of Encoding</h2>
<p>In the early days of the internet, it was English only. We didn't need to worry about any other characters and the <strong>American Standard Code for Information Interchange</strong> (<strong>ASCII</strong>) was the character encoding that fit this purpose. </p>
<p><strong>ASCII</strong> is a mapping, from binary to alphanumeric characters. So when the PC receives binary:</p>
<pre><code><span class="hljs-number">01001000</span> <span class="hljs-number">01100101</span> <span class="hljs-number">01101100</span> <span class="hljs-number">01101100</span> <span class="hljs-number">01101111</span> <span class="hljs-number">00100000</span> <span class="hljs-number">01110111</span> <span class="hljs-number">01101111</span> <span class="hljs-number">01110010</span> <span class="hljs-number">01101100</span> <span class="hljs-number">01100100</span>
</code></pre><p>With <strong>ASCII</strong> it can translate that into "Hello world".</p>
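<p>The translation above can be sketched in a few lines of Python. This is a minimal illustration rather than what any particular machine does internally, and the variable names are invented for the example: each group of 8 bits is read as a number, and each number is looked up in the ASCII table.</p>
<pre><code class="lang-python">bits = (
    "01001000 01100101 01101100 01101100 01101111 00100000 "
    "01110111 01101111 01110010 01101100 01100100"
)

# Read each group of 8 bits as a base-2 number (one byte each).
byte_values = bytes(int(group, 2) for group in bits.split())

# Look each byte up in the ASCII table.
text = byte_values.decode("ascii")
print(text)  # Hello world
</code></pre>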
<p>One byte (eight bits) was large enough to fit every English character, and some control characters too. Some of these control characters were used for instruments called teleprinters, so at the time they were useful (not so much now!) </p>
<p>The control characters were things like 7 (<code>111</code> in binary), which would make a bell sound on your PC, 8 (<code>1000</code> in binary), which would print over the last character it just printed, or 12 (<code>1100</code> in binary), which would clear a video terminal of all the text just written.</p>
<p>Computers at this time were using 8 bits for one byte (they didn't always), so there were no issues. We could store all our control characters, all our numbers, all the English characters and have some left! One byte can encode 256 different values (0-255), and ASCII only needed 128 of them (0-127). So we had 128 encodings that were unused.</p>
<p>Let's look at an ASCII table here to see every character. All lowercase and uppercase A-Z and 0-9 were encoded to binary numbers. Remember the first 32 are unprintable control characters.</p>
<h2 id="heading-ascii-character-table">ASCII Character Table</h2>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/12/image-172.png" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Can you see how it ends with 127? We have some spare room at the end.</p>
<h2 id="heading-issues-with-ascii">Issues with ASCII</h2>
<p>The spare characters were 128 through to 255. People began to think about what would be best to fill those remaining characters. <strong>But everyone had different ideas about what those final characters should be.</strong></p>
<p>The American National Standards Institute (<strong>ANSI</strong> - don't get confused with <strong>ASCII</strong>) is a standards body for establishing standards across lots of different fields. They agreed on what 0-127 should mean, which is what <strong>ASCII</strong> was already doing. But the rest were left open.</p>
<p>No one was debating what 0-127 in the <strong>ASCII</strong> encoding was. The problem was with the <strong>spare ones</strong>.</p>
<p>Below is what the first IBM computers did with the 128-255 encodings for ASCII.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/12/image-169.png" alt="Image" width="600" height="400" loading="lazy">
<em>Some squiggles, some background icons, math operators and some accented characters like é.</em></p>
<p>But other computers didn't all follow this. And everyone wanted to implement their own encodings for the end of <strong>ASCII</strong>.</p>
<p>These different endings for <strong>ASCII</strong> were called <strong>code pages</strong>.</p>
<h3 id="heading-what-are-ascii-code-pages">What are ASCII Code Pages?</h3>
<p><a target="_blank" href="https://www.aivosto.com/articles/charsets-codepages.html">Here's</a> a collection of over 465 different codepages! You can see that there were multiple codepages <strong>EVEN</strong> <strong>for the same language</strong>. Greek and Chinese both have multiple codepages, for example.</p>
<p>So how on EARTH were we ever going to standardise this? Or make it work between differing languages? Between the same language with different codepages? In a non-English language? </p>
<p>Chinese has over 100,000 different characters. We don't even have enough spare characters for Chinese, let alone agreeing that the final characters should be Chinese ones. This isn't looking so good.</p>
<p>This problem even has its own term: <strong>Mojibake</strong>.</p>
<p>It's the garbled text you sometimes see when text is decoded with the wrong encoding. It means <strong>character transformation</strong> in Japanese.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/12/image-171.png" alt="Image" width="600" height="400" loading="lazy">
<em>Example of completely garbled text (mojibake).</em></p>
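<p>Mojibake is easy to reproduce on purpose. The Python sketch below is a small illustration, with Latin-1 chosen arbitrarily as the "wrong" encoding: it encodes a word as UTF-8 and then decodes the same bytes as Latin-1.</p>
<pre><code class="lang-python">original = "café"

# Encode as UTF-8, then (wrongly) decode the same bytes as Latin-1.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # cafÃ©

# Undoing the mistake recovers the text, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
</code></pre>
<p>The two bytes that UTF-8 uses for "é" are each read as a separate Latin-1 character, which is where the extra symbols come from.</p>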
<h2 id="heading-this-sounds-a-little-insane">This sounds a little insane...</h2>
<p>Exactly! We will have zero chance of reliably interchanging data.</p>
<p>The internet is just a huge connection of computers around the world. Imagine if all these countries decided what they each thought the standards should be. If the Greek computers only accepted Greek and the English computers only sent English...? You would just be shouting into an empty cave. No one would understand you. And no one would be able to decode the nonsense.</p>
<p>ASCII wasn't fit for real life use. In a global, connected internet, we had to evolve, or else forever deal with hundreds of codepages.</p>
<p>��� <strong>Unless you</strong> ������ <strong>fancied trying</strong> ��� <strong>to</strong> ��� <strong>read paragraphs like this.</strong> �֎֏0590֐��׀ׁׂ׃ׅׄ׆ׇ</p>
<h2 id="heading-along-came-unicode">Along Came Unicode</h2>
<p>Unicode is sometimes called the <a target="_blank" href="https://en.wikipedia.org/wiki/Universal_Coded_Character_Set">Universal Coded Character Set</a> (UCS), or even ISO/IEC 10646. But Unicode is its more common name.</p>
<p>Unicode entered the scene to help solve the problems that incompatible <strong>encodings</strong> and <strong>code pages</strong> were causing.</p>
<p>Unicode is made up of lots of <strong>code points</strong> (mapping lots of characters from around the world to a key that all computers can reference.) A collection of <strong>code points</strong> is called a <strong>character set</strong> - which is what Unicode is.</p>
<p>We can map something abstract to a letter we want to reference. And it does every character! Even <a target="_blank" href="https://unicode.org/charts/PDF/U13000.pdf">Egyptian Hieroglyphs</a>.</p>
<p>Some people did all the hard work of mapping what each character would be (in all the languages) to a key that we could all access. They look like this:</p>
<p><strong>"Hello World"</strong> </p>
<p>U+0048 : LATIN CAPITAL LETTER H<br>U+0065 : LATIN SMALL LETTER E<br>U+006C : LATIN SMALL LETTER L<br>U+006C : LATIN SMALL LETTER L<br>U+006F : LATIN SMALL LETTER O<br>U+0020 : SPACE [SP]<br>U+0057 : LATIN CAPITAL LETTER W<br>U+006F : LATIN SMALL LETTER O<br>U+0072 : LATIN SMALL LETTER R<br>U+006C : LATIN SMALL LETTER L<br>U+0064 : LATIN SMALL LETTER D</p>
<p>The U+ lets us know it's a Unicode code point, and the number after it is the code point's value. It is written in <a target="_blank" href="https://www.bbc.co.uk/bitesize/guides/zp73wmn/revision/1#:~:text=Hexadecimal%20(or%20hex)%20is%20a,values%20in%20binary%20and%20denary.">hexadecimal</a> notation, which is just a more compact way of representing binary numbers. You don't have to worry too much about the hexadecimal here, though. </p>
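<p>A listing like the one for "Hello World" can be produced with Python's standard library. This is a minimal sketch (the <code>lines</code> variable is just a name picked for the example) that uses <code>ord()</code> to get each code point and <code>unicodedata.name()</code> to get its official name.</p>
<pre><code class="lang-python">import unicodedata

lines = []
for character in "Hello World":
    # ord() gives the code point; format it as U+ plus 4 hex digits.
    label = f"U+{ord(character):04X}"
    lines.append(f"{label} : {unicodedata.name(character)}")

print("\n".join(lines))
# First line printed: U+0048 : LATIN CAPITAL LETTER H
</code></pre>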
<p><a target="_blank" href="https://www.babelstone.co.uk/Unicode/whatisit.html">Here's</a> a link where you can type whatever you want into the text box, and see the Unicode character encoding. Or look at all 143,859 Unicode character points <a target="_blank" href="https://unicode-table.com/en/">here</a>. You can also see where each character is from in the world!</p>
<p>I just want to be clear. At this point we have a big dictionary of <strong>code points</strong> mapping to characters. A really big <strong>character set</strong>. Nothing more. </p>
<p><strong>There's one final ingredient we need to add to our mix.</strong></p>
<h2 id="heading-unicode-transform-protocol-utf">Unicode Transformation Format (UTF)</h2>
<p>UTF is a way we encode Unicode code points. The UTF encodings are defined by the Unicode standard, and are able to encode every single Unicode <strong>code point</strong> we need.</p>
<p>But there are different UTF encodings. They differ in how many bytes they use to encode one <strong>code point</strong>: <strong>UTF-8</strong> uses one to four bytes per code point, <strong>UTF-16</strong> uses two or four bytes per code point, and <strong>UTF-32</strong> always uses four bytes per code point.</p>
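<p>One way to get a feel for the difference is to count the bytes each encoding produces for a few sample characters. The sketch below is illustrative only: the helper name <code>byte_sizes</code> and the sample characters are arbitrary choices, and the <code>-le</code> (little-endian) codec variants are used so that Python does not add a byte order mark to the count.</p>
<pre><code class="lang-python">def byte_sizes(character):
    """Return how many bytes one character takes in each UTF encoding."""
    return {
        name: len(character.encode(name))
        for name in ("utf-8", "utf-16-le", "utf-32-le")
    }

for character in ["A", "é", "€", "😉"]:
    print(f"U+{ord(character):04X}", byte_sizes(character))
</code></pre>
<p>For these samples, UTF-8 needs 1, 2, 3 and 4 bytes respectively, UTF-16 needs 2, 2, 2 and 4, and UTF-32 needs 4 every time.</p>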
<p>If we have these different encodings, how do we know which encoding a file will use? There's a thing called a <strong>Byte Order Mark</strong> (<strong>BOM</strong>) - sometimes referred to as an <strong>Encoding Signature</strong>. The <strong>BOM</strong> is a short marker (the code point U+FEFF) that may appear at the beginning of a file to signal which encoding the file is using. It takes two bytes in UTF-16, three in UTF-8 and four in UTF-32, and it is optional, so many UTF-8 files don't have one.</p>
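<p>As a rough sketch of how a program might check for a BOM, the Python below compares the first bytes of some data against the BOM constants in the standard <code>codecs</code> module. The function name <code>sniff_bom</code> and the <code>BOMS</code> list are hypothetical names for this example, not part of any library.</p>
<pre><code class="lang-python">import codecs

# Longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + "hi".encode("utf-8")))         # utf-8
print(sniff_bom(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))  # utf-16-be
print(sniff_bom(b"hi"))                                           # None
</code></pre>
<p>A result of <code>None</code> only means no BOM was found. The data may still be valid UTF-8, so real software falls back on other hints.</p>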
<p><strong>UTF-8</strong> is the most used on the internet, and is also specified in HTML5 as the preferred encoding for new documents, so I'll spend the most time explaining this one.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/12/image-163.png" alt="Image" width="600" height="400" loading="lazy">
<em>You can see in the <a target="_blank" href="https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg">diagram </a>even from 2012, UTF-8 was widely becoming the most used encoding. And for the web it still is.</em></p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/12/image-179.png" alt="Image" width="600" height="400" loading="lazy">
<em>W3Techs <a target="_blank" href="https://w3techs.com/technologies/cross/character_encoding/ranking">diagram</a> showing how widely UTF-8 is used across a variety of websites.</em></p>
<h2 id="heading-what-is-utf-8-and-how-does-it-work">What is UTF-8 and How Does it Work?</h2>
<p><strong>UTF-8</strong> encodes all the Unicode code points from 0-127 in 1 byte (the same as <strong>ASCII</strong>). This means that if you were coding your program using <strong>ASCII</strong>, and your users used <strong>UTF-8,</strong> they <em>wouldn't notice anything was wrong</em>. Everything would just work. </p>
<p>Just remember how strong a selling point this is. We needed to remain <strong>ASCII</strong> backwards compatible while <strong>UTF-8</strong> was being implemented and used by everyone. It doesn't break anything currently being used.</p>
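<p>This backwards compatibility can be checked directly. In the small Python sketch below (variable names are illustrative), text made only of ASCII characters produces exactly the same bytes whether it is encoded as ASCII or as UTF-8.</p>
<pre><code class="lang-python">message = "Hello world"

ascii_bytes = message.encode("ascii")
utf8_bytes = message.encode("utf-8")

# For code points 0-127 the two encodings produce identical bytes.
print(ascii_bytes == utf8_bytes)    # True

# So bytes written by an ASCII-only program decode cleanly as UTF-8.
print(ascii_bytes.decode("utf-8"))  # Hello world
</code></pre>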
<p>Because it's called <strong>UTF-8</strong>, remember that 8 bits (one byte!) is the minimum size of an encoded <strong>code point</strong>. Other Unicode characters are stored in multiple bytes (up to 4 bytes depending on the character). This is what people mean when they call the encoding <strong>variable length</strong>.</p>
<p>A character might need more than one byte, depending on the script. English is 1 byte. <a target="_blank" href="https://design215.com/toolbox/ascii-utf8.php">European (Latin), Hebrew and Arabic</a> are represented with 2 bytes. 3 bytes are used for <a target="_blank" href="https://design215.com/toolbox/utf8-3byte-characters.php">Chinese, Japanese, Korean and other Asian characters</a>. You get the idea. </p>
<p>When a character needs to span more than one byte, the leading bits of its first byte say how many bytes the character uses, and each following byte starts with a marker showing that it continues the same character. So you’ll still only use one byte per character for English, but if you need a document to contain some non-English characters, you can do that too.</p>
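<p>Printing the bits of a few UTF-8 encoded characters makes the pattern visible. In this sketch (the helper name <code>utf8_bits</code> is made up for the example), a one-byte character starts with <code>0</code>, the first byte of a longer character starts with <code>110</code>, <code>1110</code> or <code>11110</code> for two, three or four bytes, and every continuation byte starts with <code>10</code>.</p>
<pre><code class="lang-python">def utf8_bits(character):
    """Return the UTF-8 bytes of one character as groups of 8 bits."""
    encoded = character.encode("utf-8")
    return " ".join(format(byte, "08b") for byte in encoded)

for character in ["A", "é", "€", "😉"]:
    print(character, utf8_bits(character))

# A 01000001
# é 11000011 10101001
# € 11100010 10000010 10101100
# 😉 11110000 10011111 10011000 10001001
</code></pre>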
<p>And now, wonderfully, we can all agree on what the <a target="_blank" href="https://en.wikipedia.org/wiki/Cuneiform_(Unicode_block)">Sumerian cuneiform characters</a> encoding is (𒀵 𒁷𒂅 𒐤), as well as some <a target="_blank" href="https://unicode.org/emoji/charts/full-emoji-list.html">emojis</a> 😉😉 so we can all communicate!</p>
<p>The high level overview is: You first work out the encoding (from the <strong>BOM</strong> if there is one, or from other information such as a declared charset). You decode the file into Unicode <strong>code points</strong>, and then draw the characters those code points represent onto the screen.</p>
<h2 id="heading-a-final-word-about-utf">A Final Word About UTF</h2>
<p>Remember, encoding is <strong>key</strong>. If I send the completely wrong encoding you can't read anything. Be aware of it when receiving or sending data. Often it is abstracted away in the tools you use every day, but as programmers it's important to understand what is happening under the hood. </p>
<p>How do we specify our encodings, then? Because HTML markup is written with plain English (ASCII) characters, which almost all encodings handle fine, we can declare the encoding right at the top, in the <code>&lt;head&gt;</code> section.</p>
<pre><code class="lang-html"><span class="hljs-tag">&lt;<span class="hljs-name">html</span> <span class="hljs-attr">lang</span>=<span class="hljs-string">"en"</span>&gt;</span>
<span class="hljs-tag">&lt;<span class="hljs-name">head</span>&gt;</span>
  <span class="hljs-tag">&lt;<span class="hljs-name">meta</span> <span class="hljs-attr">charset</span>=<span class="hljs-string">"utf-8"</span>&gt;</span>
<span class="hljs-tag">&lt;/<span class="hljs-name">head</span>&gt;</span>
</code></pre>
<p>It's important to do this at the very start of the <code>&lt;head&gt;</code>, as the parsing of the <a target="_blank" href="https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding">HTML may have to start again</a> if the encoding it's currently using is wrong.</p>
<p>We can also get the encoding from the <a target="_blank" href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Type">Content-Type</a> header of the HTTP request/response.</p>
<p>If an HTML document doesn't declare its encoding, the <a target="_blank" href="https://html.spec.whatwg.org/multipage/parsing.html#determining-the-character-encoding">HTML5 spec</a> has some interesting ways to determine it, one of which is called <a target="_blank" href="https://encoding.spec.whatwg.org/#bom-sniff"><strong>BOM sniffing</strong></a>. This is where the encoding is read from the <strong>Byte Order Mark</strong> (<strong>BOM</strong>) we discussed earlier. </p>
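<p>To illustrate reading the encoding from a <code>Content-Type</code> value, the Python sketch below parses a sample header with the standard library's <code>email.message.Message</code> class. The header value is an invented example, and real HTTP clients usually do this parsing for you.</p>
<pre><code class="lang-python">from email.message import Message

# A sample header value; a real one would come from an HTTP response.
header = Message()
header["Content-Type"] = "text/html; charset=UTF-8"

# get_content_charset() returns the charset parameter in lower case.
charset = header.get_content_charset()
print(charset)  # utf-8

body = "café".encode("utf-8")
print(body.decode(charset))  # café
</code></pre>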
<h2 id="heading-so-is-that-it">So is that it?</h2>
<p>Unicode isn't finished. Like any standard, we add, remove and make new proposals to the standard. No specification is ever considered "complete".</p>
<p>There are generally 1 or 2 releases a year, and you can find them <a target="_blank" href="https://unicode.org/history/publicationdates.html">here</a>.</p>
<p>Recently I read about a very interesting bug around <a target="_blank" href="https://twitter.com/availablegreen/status/1332774350613835779">Twitter rendering Russian Unicode characters incorrectly</a>. </p>
<p>If you have read this far, congratulations – it's a lot to digest.</p>
<p>I would encourage you to do one last piece of homework. </p>
<p>Look at how broken websites can really be when the encoding is wrong. I used <a target="_blank" href="https://chrome.google.com/webstore/detail/set-character-encoding/bpojelgakakmcfmjfilgdlmhefphglae?hl=en">this</a> Google Chrome extension and changed my encoding and tried to read webpages. The message was completely unclear. Try and read this article. Try and navigate Wikipedia. See <strong>Mojibake</strong> for yourself.</p>
<p>It helps to see how important encoding truly is.</p>
<p><img src="https://www.freecodecamp.org/news/content/images/2020/12/image-164.png" alt="Image" width="600" height="400" loading="lazy"></p>
<h2 id="heading-conclusion">Conclusion</h2>
<p>In my time spent researching and trying to simplify this article, I learned about <a target="_blank" href="https://en.wikipedia.org/wiki/Michael_Everson#">Michael Everson</a>. Since 1993, he has proposed over 200 Unicode changes, and has added thousands of characters to the standard. As of 2003, he was credited as the leading contributor of Unicode proposals. He is one huge reason why Unicode is what it is. Very impressive, and he has done a great deal for the Internet as we know it.</p>
<p>I hope this has given you a good overview of why we need encodings, what problems encoding solves, and what happens when it goes wrong.</p>
<p>I share my writing on <a target="_blank" href="https://twitter.com/kealanparr">Twitter</a> if you enjoyed this article and want to see more.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ A Beginner-Friendly Guide to Unicode in Python ]]>
                </title>
                <description>
                    <![CDATA[ By Jimmy Zhang I once spent a couple of frustrating days at work learning how to properly deal with Unicode strings in Python. During those two days, I ate a lot of snacks — roughly one bag of goldfish per one of these errors encountered, which shoul... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/a-beginner-friendly-guide-to-unicode-d6d45a903515/</link>
                <guid isPermaLink="false">66c341bd93db2451bd4413dd</guid>
                
                    <category>
                        <![CDATA[ emoji ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Python ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                    <category>
                        <![CDATA[ unicode ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 18 Jul 2018 23:51:28 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*2TiN1yOMlCq2fyqQTqgt-w.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Jimmy Zhang</p>
<p>I once spent a couple of frustrating days at work learning how to properly deal with Unicode strings in Python. During those two days, I ate a lot of snacks — roughly one bag of goldfish per one of these errors encountered, which should be all too familiar to those who program with Python:</p>
<pre><code>UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)
</code></pre><p>While solving my issue, I did a lot of googling, which pointed me to <a target="_blank" href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/">a</a> <a target="_blank" href="https://nedbatchelder.com/text/unipain.html">few</a> <a target="_blank" href="https://betterexplained.com/articles/unicode/">indispensable</a> <a target="_blank" href="http://www.pgbovine.net/unicode-python.htm">articles</a>. But as great as they are, they were all written without the help of a crucial aspect of communication in today’s day and age.</p>
<p>That is: they were all written without the help of emoji.</p>
<p>So, in order to take advantage of this situation, I decided to write my own guide to understanding Unicode, with plenty of faces and icons rendered along the way ?✌?.</p>
<p>Before diving into technical details, let’s begin with a fun question. What is your favorite emoji?</p>
<p>Mine is the “<a target="_blank" href="https://emojipedia.org/face-with-open-mouth/">face with open mouth</a>”, which looks like this: 😮. There is one major caveat, though: what you see actually depends on the platform you are using to read this post!</p>
<p>Viewed on my Mac, the emoji looks like a yellow bowling ball. On my Samsung tablet, the eyes are black and circular, accentuated by a white dot which betrays a greater depth of emotion.</p>
<p>Copy and paste the emoji (😮) into Twitter, and you’ll see something completely different. Copy and paste it into messenger.com, however, and you’ll see why it is my favorite.</p>
<p>???? Why are they all different?</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/xcqApm6uLo00aEV6quvQz5hbv0SMFnrxJwPc" alt="Image" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/xBNeNbexzlYavyagf0TqljD-3nGKVgBSqdtD" alt="Image" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/hj38DT5kCCXEAlXuor1E1JLjTtPBQtVKlaun" alt="Image" width="600" height="400" loading="lazy">
<em>From left to right: Apple, Samsung, messenger.com (<a target="_blank" href="https://emojipedia.org/face-with-open-mouth/">source</a>).</em></p>
<p>Note: As of July 9th, 2018: Messenger seems to have updated their emoji icons, so the icon at the top right no longer applies. ?</p>
<p>This fun little mystery is our segue into the world of Unicode, as emojis have been part of the <a target="_blank" href="https://emojipedia.org/unicode-6.0/">Unicode Standard</a> since 2010. Aside from giving us emoji, Unicode is important because it is the Internet’s preferred choice for the consistent “<a target="_blank" href="https://en.wikipedia.org/wiki/Unicode">encoding, representation, and handling of text</a>”.</p>
<h3 id="heading-unicode-amp-encoding-a-brief-primer">Unicode &amp; Encoding: A Brief Primer</h3>
<p>As with many topics, the best way to understand Unicode is to know the context surrounding its creation — and for that, Joel Spolsky’s <a target="_blank" href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/">article</a> is required reading.</p>
<h4 id="heading-code-points">Code Points</h4>
<p>Since we’ve now entered the world of Unicode, we need to first dissociate emojis from the wonderfully expressive icons they are, and associate them with something much less exciting. So instead of thinking about emojis in terms of the things or the emotions that they represent, we will instead think about each emoji as a plain number. This number is known as a <strong>code point</strong>.</p>
<p>Code points are the <a target="_blank" href="https://www.unicode.org/standard/standard.html">key concept of Unicode</a>, which was “designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages…of the modern world.” It does so by associating virtually every printable character with a unique code point. Together, these characters comprise the Unicode <strong>character set</strong>.</p>
<p>Code points are typically written in hexadecimal and prefixed with <code>U+</code> to denote the connection to Unicode, representing characters from:</p>
<ul>
<li>languages such as <a target="_blank" href="https://en.wikipedia.org/wiki/Telugu_(Unicode_block)">Telugu</a> [ఋ | code point: U+0C0B]</li>
<li><a target="_blank" href="https://en.wikipedia.org/wiki/Chess_symbols_in_Unicode">chess symbols</a> [♖ | code point: U+2656]</li>
<li>and, of course, <a target="_blank" href="https://en.wikipedia.org/wiki/Emoticons_(Unicode_block)">emojis</a> [🙌 | code point: U+1F64C]</li>
</ul>
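<p>As a small illustration (a sketch, not part of the original walkthrough), Python's built-in <code>ord</code> and <code>chr</code> convert between a character and its code point. The helper name <code>to_u_notation</code> is made up for this example.</p>

```python
# The three characters from the list above, written as escapes so the
# example does not depend on how this page itself is encoded.
characters = ["\u0c0b", "\u2656", "\U0001F64C"]  # Telugu letter, chess rook, emoji

def to_u_notation(char):
    """Format a character's code point in the conventional U+XXXX form.

    (Hypothetical helper, named here only for illustration.)
    """
    return "U+%04X" % ord(char)

labels = [to_u_notation(c) for c in characters]
print(labels)  # ['U+0C0B', 'U+2656', 'U+1F64C']

# chr() is the inverse of ord(): it turns a code point back into a character.
rook = chr(0x2656)
print(rook == characters[1])  # True
```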
<h4 id="heading-glyphs-are-what-you-see">Glyphs Are What You See</h4>
<p>The actual on-screen representations of code points are called <strong>glyphs</strong> (the complete mapping of code points to glyphs is known as a <strong>font</strong>).</p>
<p>As an example, take this letter A, which is code point <code>U+0041</code> in Unicode. The “A” you see with your eyes is a glyph: it looks the way it does because it is rendered with the font this page uses. If you were to change the font to Times New Roman, for example, only the glyph of “A” would change; the underlying code point would not.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/TUckLVh6eCihRLcdZhfa3l8qIE6IuxmCvLxY" alt="Image" width="600" height="400" loading="lazy">
<em>Fonts map the same code point to different glyphs</em></p>
<p>Glyphs are the answer to our little rendering mystery. Under the hood, all variations of the face with open mouth emoji point to the same code point, <code>U+1F62E</code>, but the <strong>glyph</strong> representing it varies by platform.</p>
<h4 id="heading-code-points-are-abstractions">Code Points are Abstractions</h4>
<p>Because they say nothing about how they are rendered visually (requiring a font and a glyph to “bring them to life”), code points are said to be an abstraction.</p>
<p>But just as code points are an abstraction to end users, they are also abstractions to computers. This is because code points require a <strong>character encoding</strong> to convert them into the one thing which computers can interpret: bytes. Once converted to bytes, code points can be saved to files or sent over the network to another computer ?➡️?.</p>
<p>UTF-8 is currently the world’s <a target="_blank" href="https://en.wikipedia.org/wiki/UTF-8#/media/File:Utf8webgrowth.svg">most popular character encoding</a>. UTF-8 uses a set of rules to convert a code point into a unique sequence of (1 to 4) bytes, and vice versa. Code points are said to be <strong>encoded</strong> into a sequence of bytes, and sequences of bytes are <strong>decoded</strong> into code points. <a target="_blank" href="https://stackoverflow.com/questions/1543613/how-does-utf-8-variable-width-encoding-work">This Stack Overflow post</a> explains how the UTF-8 encoding algorithm works.</p>
<p>However, even though UTF-8 is the predominant character encoding in the world, it is far from the only one. For example, UTF-16 is an alternative character encoding of the Unicode character set. The image below compares the UTF-8 and UTF-16 encodings of our emoji 😮.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/k1TgNZ8m7zeByOT1BLyLSD8F7NBESOp7WLQO" alt="Image" width="600" height="400" loading="lazy"></p>
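<p>The same comparison can be sketched in Python with <code>str.encode</code> and <code>bytes.decode</code>. The helper <code>as_hex</code> is a made-up name for this example, and <code>utf-16-be</code> (big-endian, no byte order mark) is chosen here as an assumption to keep the output short.</p>

```python
emoji = "\U0001F62E"  # face with open mouth, U+1F62E

def as_hex(data):
    """Render bytes as space-separated hex pairs (illustrative helper)."""
    return " ".join("%02x" % b for b in data)

utf8_bytes = emoji.encode("utf-8")
utf16_bytes = emoji.encode("utf-16-be")  # big-endian, no byte order mark

print(as_hex(utf8_bytes))   # f0 9f 98 ae
print(as_hex(utf16_bytes))  # d8 3d de 2e

# Decoding with the matching encoding recovers the original code point.
round_trip_ok = (
    utf8_bytes.decode("utf-8") == emoji
    and utf16_bytes.decode("utf-16-be") == emoji
)
print(round_trip_ok)  # True
```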
<p>Problems arise when one computer encodes code points into bytes with one encoding, and another computer (or another process on the same computer) decodes those bytes with another.</p>
<p>Luckily, UTF-8 is ubiquitous enough that, for the most part, we don’t have to worry about mismatched character encodings. But when they do occur, a familiarity with the concepts mentioned above is required to extricate yourself from the mess.</p>
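<p>Here is a minimal sketch of such a mismatch in Python 3. Decoding UTF-8 bytes as ASCII fails outright, with the same message shown at the top of this article, while decoding them as Latin-1 (picked here only as an example of a single-byte encoding) silently produces the wrong characters.</p>

```python
data = "\U0001F62E".encode("utf-8")  # the bytes f0 9f 98 ae

# ASCII only defines bytes 0-127, so decoding fails on the first byte (0xf0).
try:
    data.decode("ascii")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)

# Latin-1 assigns a character to every possible byte, so decoding "succeeds",
# but it yields four unrelated characters instead of one emoji.
garbled = data.decode("latin-1")
print(len(garbled))  # 4
```

<p>The second case is arguably the more dangerous one, because nothing raises an error to tell you that the text is wrong.</p>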
<h4 id="heading-brief-recap">Brief Recap</h4>
<ul>
<li>Unicode is a collection of <strong>code points</strong>, which are plain numbers typically written in hexadecimal and prefixed with <code>U+</code>. These code points map to virtually every printable character from the written languages around the world.</li>
<li><strong>Glyphs</strong> are the physical manifestation of a character. This guy 😮 is a glyph. A <strong>font</strong> is a mapping of code points to glyphs.</li>
<li>In order to send them across the network or save them in a file, characters and their underlying code points must be encoded into bytes. A <strong>character encoding</strong> contains the details of how a code point is embedded into a sequence of bytes.</li>
<li><strong>UTF-8</strong> is currently the world’s most popular character encoding. Given a code point, UTF-8 <strong>encodes</strong> it into a sequence of bytes. Given a sequence of bytes, UTF-8 <strong>decodes</strong> it into a code point.</li>
</ul>
<h3 id="heading-a-practical-example"><strong>A Practical Example</strong></h3>
<p>The correct rendering of Unicode characters involves traversing a chain, ranging from bytes to code points to glyphs.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/4EWd0DC-ca2Xc-KykyfW7iVAHJhe6SjGG2Vx" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Let’s now use a text editor to see a practical example of this chain — as well as the types of issues that can arise when things go awry. Text editors are perfect, because they involve all three parts of the rendering chain shown above.</p>
<p>Note: The following example was done on my Mac using Sublime Text 3. And to give credit where credit is due: the beginning of this example is heavily inspired by <a target="_blank" href="http://pgbovine.net/unicode-python.htm">this post</a> from Philip Guo, which introduced me to the <code>hexdump</code> command (and a whole lot more).</p>
<p>We’ll start with a text file containing a single character: my favorite “face with open mouth” emoji. For those who want to follow along, I’ve hosted this file in a GitHub <a target="_blank" href="https://gist.githubusercontent.com/jzhang621/d7d9eb167f25084420049cb47510c971/raw/e35f9669785d83db864f9d6b21faf03d9e51608d/emoji.txt">gist</a>, which you can download with <code>curl</code>.</p>
<pre><code>curl https://gist.githubusercontent.com/jzhang621/d7d9eb167f25084420049cb47510c971/raw/e35f9669785d83db864f9d6b21faf03d9e51608d/emoji.txt &gt; emoji.txt
</code></pre><p>As we learned, in order for it to be saved to a file, the emoji was encoded into bytes using a character encoding. This particular file was encoded using UTF-8, and we can use the <code>hexdump</code> command to examine the actual byte contents of the file.</p>
<pre><code>j|encoding: hexdump emoji.txt
0000000 f0 9f 98 ae
0000004
</code></pre><p>The output of <code>hexdump</code> tells us the file contains 4 bytes total, each of which is written in hexadecimal. The actual byte sequence <code>f0 9f 98 ae</code> matches the expected UTF-8 encoded byte sequence, as shown below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/zRTpkcw12y2aFQOJyfTARuWgucf0CobcaKzf" alt="Image" width="600" height="400" loading="lazy">
<em>Source: <a target="_blank" href="http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%F0%9F%98%AE&amp;mode=char">http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%F0%9F%98%AE&amp;mode=char</a></em></p>
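<p>If <code>hexdump</code> is not available, roughly the same check can be done in Python by opening the file in binary mode. This sketch writes its own temporary file as a stand-in for <code>emoji.txt</code>, so the path and file contents below are assumptions rather than the gist itself.</p>

```python
import os
import tempfile

# Write the emoji to a temporary file as UTF-8, standing in for emoji.txt.
path = os.path.join(tempfile.mkdtemp(), "emoji.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\U0001F62E")

# Binary mode ("rb") returns the raw bytes with no decoding applied.
with open(path, "rb") as f:
    raw = f.read()

hex_bytes = " ".join("%02x" % b for b in raw)
print(hex_bytes, "(%d bytes)" % len(raw))  # f0 9f 98 ae (4 bytes)
```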
<p>Now, let’s open our file in Sublime Text, where we should see our single 😮 character. Since we see the expected glyph, we can assume Sublime Text used the correct character encoding to decode those bytes into code points. Let’s confirm by opening up the console (<code>View -&gt; Show Console</code>) and inspecting the <a target="_blank" href="https://www.sublimetext.com/docs/3/api_reference.html#sublime.View"><code>view</code></a> object that Sublime Text exposes as part of its Python API.</p>
<pre><code>&gt;&gt;&gt; view
&lt;sublime.View object at 0x1112d7310&gt;
</code></pre><pre><code># returns the encoding currently associated with the file
&gt;&gt;&gt; view.encoding()
'UTF-8'
</code></pre><p>With a bit of Python knowledge, we can also find the Unicode code point associated with our emoji:</p>
<pre><code># Returns the character at the given position
&gt;&gt;&gt; view.substr(0)
'😮'
</code></pre><pre><code># ord returns an integer representing the Unicode code point of the character
&gt;&gt;&gt; ord(view.substr(0))
128558
</code></pre><pre><code># convert code point to hexadecimal, and format with U+
&gt;&gt;&gt; print('U+%x' % ord(view.substr(0)))
U+1f62e
</code></pre><p>Again, just as we expected. This illustrates a full traversal of the Unicode rendering chain, which involved:</p>
<ul>
<li>reading the file as a sequence of UTF-8 encoded bytes.</li>
<li>decoding the bytes into a Unicode code point.</li>
<li>rendering the glyph associated with the code point.</li>
</ul>
<p><img src="https://cdn-media-1.freecodecamp.org/images/tgfnKyW9kpVCBK4tkSwTiDncDR9-COPmFpw5" alt="Image" width="600" height="400" loading="lazy">
<em>The actual glyph that you see is dependent on the platform.</em></p>
<p>So far, so good ?.</p>
<h4 id="heading-different-bytes-same-emoji">Different Bytes, Same Emoji</h4>
<p>Aside from being my favorite text editor, I chose Sublime Text for this example because it allows for easy experimentation with character encodings.</p>
<p>We can now save the file using a different character encoding. To do so, click <code>File -&gt; Save with Encoding -&gt; UTF-16 BE</code>. (Very briefly, <a target="_blank" href="https://en.wikipedia.org/wiki/UTF-16">UTF-16</a> is an alternative character encoding of the Unicode character set. Instead of encoding the most common characters using one byte, like UTF-8, UTF-16 encodes every code point from 0 to 65535 (U+FFFF) using two bytes. Code points above 65535, like our emoji, are encoded using <a target="_blank" href="https://stackoverflow.com/a/5903080/1586242">surrogate pairs</a>, which take four bytes. The BE stands for Big Endian.)</p>
<p>When we use <code>hexdump</code> to inspect the file again, we see that the byte contents have changed.</p>
<pre><code># (before: UTF-8)
j|encoding: hexdump emoji.txt
0000000 f0 9f 98 ae
0000004
</code></pre><pre><code># (after: UTF-16 BE)
j|encoding: hexdump emoji.txt
0000000 d8 3d de 2e
0000004
</code></pre><p>Back in Sublime Text, we still see the same 😮 character staring at us. Saving the file with a different character encoding might have changed the actual contents of the file, but it also updated Sublime Text’s internal representation of how to interpret those bytes. We can confirm by firing up the console again.</p>
<pre><code>&gt;&gt;&gt; view.encoding()
'UTF-16 BE'
</code></pre><p>From here on, everything else is the same.</p>
<pre><code>&gt;&gt;&gt; view.substr(0)
'😮'
</code></pre><pre><code>&gt;&gt;&gt; ord(view.substr(0))
128558
</code></pre><pre><code>&gt;&gt;&gt; print('U+%x' % ord(view.substr(0)))
U+1f62e
</code></pre><p>The bytes may have changed, but the code point did not — and the emoji remains the same.</p>
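<p>For the curious, the UTF-16 bytes can be derived by hand. The sketch below applies the standard surrogate pair arithmetic to <code>U+1F62E</code> and then checks that both byte sequences from the <code>hexdump</code> output decode to the same code point. The variable names are illustrative only.</p>

```python
code_point = 0x1F62E

# UTF-16 surrogate pair: subtract 0x10000, then split the 20-bit result
# into two 10-bit halves (0x400 == 2 ** 10).
offset = code_point - 0x10000
high = 0xD800 + (offset // 0x400)  # top 10 bits
low = 0xDC00 + (offset % 0x400)    # bottom 10 bits
print("%04x %04x" % (high, low))   # d83d de2e

# Both byte sequences decode to the same single code point.
from_utf8 = bytes.fromhex("f09f98ae").decode("utf-8")
from_utf16 = bytes.fromhex("d83dde2e").decode("utf-16-be")
print(from_utf8 == from_utf16 == chr(code_point))  # True
```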
<h4 id="heading-same-bytes-but-what-the-d">Same Bytes, But What The đŸ˜®</h4>
<p>Time for some encoding “fun”. First, let’s re-encode our file using UTF-8, because it makes for a better example.</p>
<p>Let’s now go ahead and use Sublime Text to re-open the existing file using a different character encoding. Under <code>File -&gt; Reopen with Encoding</code>, click <code>Vietnamese (Windows 1258)</code>, which turns our emoji character into the following four nonsensical characters: đŸ˜®.</p>
<p>When we click “Reopen with Encoding”, we aren’t changing the actual byte contents of the file, but rather, the way Sublime Text interprets those bytes. Hexdump confirms the bytes are the same:</p>
<pre><code>j|encoding: hexdump emoji.txt
0000000 f0 9f 98 ae
0000004
</code></pre><p>To understand why we see these nonsensical characters, we need to consult the <a target="_blank" href="https://en.wikipedia.org/wiki/Windows-1258">Windows-1258</a> code page, which is a mapping of bytes to a Vietnamese language character set. (Think of a code page as the table produced by a character encoding.) As this code page contains a character set with at most 256 characters, each character’s code point can be expressed as a decimal number between 0 and 255, which in turn can be encoded using 1 byte.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/fvczjUIBIrUtsDTPr08NIdXLJqb9hjn6Hdd0" alt="Image" width="600" height="400" loading="lazy">
<em>The Windows-1258 code page, which maps decimal code points to Vietnamese language characters. Taken from Wikipedia, with some custom styling applied to show the 4 code points relevant to this example.</em></p>
<p>Because our single 😮 emoji requires 4 bytes to encode using UTF-8, we now see 4 characters when we interpret the file with the Windows-1258 encoding.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/04v6iTqdJ7XiQfOMQxUtJMwto3JPS8gWcRZk" alt="Image" width="600" height="400" loading="lazy"></p>
<p>A wrong choice of character encoding has a direct impact on what we can see and comprehend by garbling characters into an incomprehensible mess.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/lYN9Y31uDX5NwCb3ihQLoplb7e19gCepIxKf" alt="Image" width="600" height="400" loading="lazy"></p>
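<p>The misreading above can be reproduced outside of Sublime Text. This sketch assumes that Python's <code>cp1258</code> codec corresponds to the same Windows-1258 code page that the editor uses.</p>

```python
data = bytes.fromhex("f09f98ae")  # UTF-8 bytes of U+1F62E

# cp1258 is Python's codec name for Windows-1258 (assumed to match the
# editor's table). Each byte is looked up on its own, so 4 bytes in
# means 4 characters out.
garbled = data.decode("cp1258")
print(garbled, len(garbled))
```

<p>No error is raised, because every one of these four bytes happens to be a valid entry in the code page.</p>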
<p>Now, onto the “fun” part, which I include to add some color to Unicode and why it exists. Before Unicode, there were many different code pages such as Windows-1258 in existence, each with a different way of mapping 1 byte’s worth of data to up to 256 characters. <strong>Unicode was created in order to incorporate all the different characters of all the different code pages into one system</strong>. In other words, Unicode is a superset of Windows-1258, and each character in the Windows-1258 code page has a <a target="_blank" href="https://stackoverflow.com/a/3441690/1586242">Unicode counterpart</a>.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/PzRE5GqbSr6PLTSxNg2I3B5zeeRFfgVFCOBT" alt="Image" width="600" height="400" loading="lazy">
<em>The Unicode counterpart for each character is listed in the middle row of each cell (<a target="_blank" href="https://en.wikipedia.org/wiki/Windows-1258">Wikipedia</a>).</em></p>
<p>In fact, these Unicode counterparts are what allow Sublime Text to convert between different character encodings with a click of a button. Internally, Sublime Text still represents each of our “Windows-1258 decoded” characters as a Unicode code point, as we see below when we fire up the console:</p>
<pre><code>&gt;&gt;&gt; view.encoding()
'Vietnamese (Windows 1258)'
</code></pre><pre><code># Python 3 strings are "immutable sequences of Unicode code points"
&gt;&gt;&gt; type(view.substr(0))
&lt;class 'str'&gt;
</code></pre><pre><code>&gt;&gt;&gt; view.substr(0)
'đ'
&gt;&gt;&gt; view.substr(1)
'Ÿ'
&gt;&gt;&gt; view.substr(2)
'˜'
&gt;&gt;&gt; view.substr(3)
'®'
</code></pre><pre><code>&gt;&gt;&gt; ['U+%04x' % ord(view.substr(x)) for x in range(0, 4)]
['U+0111', 'U+0178', 'U+02dc', 'U+00ae']
</code></pre><p>This means that we can re-save our 4 nonsensical characters using UTF-8. I’ll leave this one up to you: if you do so, and can correctly predict the resulting <code>hexdump</code> of the file, then you’ve successfully understood the key concepts behind Unicode, code points, and character encodings. (<a target="_blank" href="https://www.utf8-chartable.de/unicode-utf8-table.pl?number=512">This UTF-8 code page</a> may help. The answer can be found at the very end of this article.)</p>
<h3 id="heading-wrapping-up">Wrapping up</h3>
<p>Working effectively with Unicode involves always knowing what level of the rendering chain you are operating on. It means always asking yourself: what do I have? Under the hood, glyphs are nothing but code points. If you are working with code points, know that those code points must be encoded into bytes with a character encoding. If you have a sequence of bytes representing text, know that those bytes are meaningless without knowing the character encoding that was used to create those bytes.</p>
<p>As with any computer science topic, the best way to learn about Unicode is to experiment. Enter characters, play with character encodings, and make predictions that you verify using <code>hexdump</code>. While I hope this article explains everything you need to know about Unicode, I will be more than happy if it merely sets you up to run your own experiments.</p>
<p>Thanks for reading! ?</p>
<h4 id="heading-answer">Answer:</h4>
<pre><code>j|encoding: $ hexdump emoji.txt
0000000 c4 91 c5 b8 cb 9c c2 ae
0000008
</code></pre> ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
